# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to tune your model's hyperparameters by varying `ngram_range`, `max_features`, and the data cleaning methods
- Try to get the highest accuracy possible!
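
As a rough illustration of two of the knobs mentioned above, the sketch below shows in plain Python what an n-gram range and a vocabulary-size cap do to a bag-of-words vocabulary. The helper names (`ngrams`, `build_vocab`) and the toy documents are hypothetical; scikit-learn's vectorizers do more (tokenization rules, stop words, their own tie-breaking), so treat this as a conceptual sketch only.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams of the token list for n in the inclusive range."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocab(docs, ngram_range=(1, 1), max_features=None):
    """Count n-grams over the corpus and keep the most frequent terms."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split(), ngram_range))
    # most_common(None) returns every term, most frequent first
    return [term for term, _ in counts.most_common(max_features)]


# Toy documents, invented for illustration
docs = [
    "build machine learning models",
    "build dashboards and reports",
    "machine learning and statistics",
]

unigrams = build_vocab(docs, ngram_range=(1, 1))
uni_bi = build_vocab(docs, ngram_range=(1, 2))
capped = build_vocab(docs, ngram_range=(1, 2), max_features=5)

print(len(unigrams), len(uni_bi), len(capped))
```

Widening the n-gram range grows the vocabulary, while a feature cap trims it back to the most frequent terms; both change what the classifier gets to see.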

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv"

df = pd.read_csv(url, encoding="ISO-8859-1")
print(df.shape)
df.head()

(500, 3)


Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientistÂ,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [2]:
df.job.value_counts()

Data Scientist    250
Data Analyst      250
Name: job, dtype: int64

In [0]:
pd.set_option('display.max_colwidth', 200)
##df = df.drop(['title','job'], axis=1)

#df.head()

In [5]:
from bs4 import BeautifulSoup


def clean_html_with_bs4(string):
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(string, 'html.parser')
    return soup.get_text()

listings = []
for x in df['description']:
    # Strip the b'...' wrapper left over from the stringified bytes literal
    x = str(x)[2:-1]
    # Clean out HTML
    x = clean_html_with_bs4(x)
    # Remove line breaks
    x = x.replace('\\n',' ')
    # Translate unicode characters to ASCII
#     x = unidecode(x)
    listings.append(x)
    
df['description'] = listings

# Create a numerical label column
df['label_num'] = df.job.map({'Data Analyst': 0, 'Data Scientist': 1})
df.head()


Unnamed: 0,description,title,job,label_num
0,"Job Requirements: Conceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN alo...",Data scientistÂ,Data Scientist,1
1,"Job Description As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare journey. You will do so by ...",Data Scientist I,Data Scientist,1
2,"As a Data Scientist you will be working on consulting side of our business. You will be responsible for analyzing large, complex datasets and identify meaningful patterns that lead to actionable r...",Data Scientist - Entry Level,Data Scientist,1
3,"$4,969 - $6,756 a monthContractUnder the general supervision of Professors Dana Mukamel and Kai Zheng, the incumbent will join the CalMHSA Mental Health Tech Suite Innovation (INN) Evaluation Team...",Data Scientist,Data Scientist,1
4,"Location: USA \xe2\x80\x93 multiple locations 2+ years of Analytics experience Understand business requirements and technical requirements Can handle data extraction, preparation and transformatio...",Data Scientist,Data Scientist,1
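
The cleaned text above still contains escaped byte sequences such as `\xe2\x80\x93`, because slicing off the `b'...'` wrapper leaves the escapes as literal characters. One possible alternative, assuming every description is a valid Python bytes literal holding UTF-8 text, is to parse the literal and decode it. The helper name `decode_bytes_literal` is hypothetical, and the fallback branches are a guess at reasonable behavior for rows that do not parse.

```python
import ast


def decode_bytes_literal(text):
    """Turn a string like "b'caf\\xc3\\xa9'" into the text it encodes."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text  # not a Python literal; leave unchanged
    if isinstance(value, bytes):
        # errors="replace" keeps going if a row is not valid UTF-8
        return value.decode("utf-8", errors="replace")
    return text


# Sample modeled on row 4 of the raw data; the trailing markup is invented
raw = r"b'Location: USA \xe2\x80\x93 multiple locations<br/>\n'"
print(decode_bytes_literal(raw))
```

Running this before the HTML-stripping step would hand BeautifulSoup real characters and real newlines instead of their escaped forms.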


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, 
                             ngram_range=(1,1), 
                             stop_words='english')

word_counts = vectorizer.fit_transform(df.description)

vect_count = pd.DataFrame(
    word_counts.toarray(),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=vectorizer.get_feature_names_out())

print(word_counts.shape)
vect_count.head()

(500, 9514)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, 
                             ngram_range=(1,1), 
                             stop_words='english')

word_counts = vectorizer.fit_transform(df.description)

vect_tfidf = pd.DataFrame(
    word_counts.toarray(),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=vectorizer.get_feature_names_out())

print(word_counts.shape)
vect_tfidf.head()

(500, 9514)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.109323,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
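
To make the difference between the two matrices concrete: with its default settings, `TfidfVectorizer` reweights each raw count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each row to unit Euclidean length. The sketch below reproduces that arithmetic in plain Python on a toy count matrix; the matrix and the function name `tfidf_rows` are made up for illustration.

```python
import math


def tfidf_rows(counts):
    """Apply smoothed idf weighting and L2 row normalization to raw counts."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows in which each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Columns: a term found in every document, and a term found in only one
counts = [
    [2, 0],
    [1, 0],
    [1, 1],
]

rows = tfidf_rows(counts)
for row in rows:
    print([round(v, 4) for v in row])
```

In the last row both terms occur once, yet the rare term ends up with the larger weight, which is the effect the TF-IDF matrix adds over plain counts.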


In [0]:
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.pipeline import Pipeline

In [0]:
vectorizers = [
    TfidfVectorizer(stop_words='english',
                    max_features=None),
    CountVectorizer(stop_words='english',
                   max_features=None)
]

classifiers = [
    MultinomialNB(),
    LinearSVC(),
    LogisticRegression(),
    RandomForestClassifier()
]

clf_names = [
         "Naive Bayes",
         "Linear SVM",
         "Logistic Regression",
         "Random Forest"
        ]

vect_names = [
    "TfidfVectorizer",
    "CountVectorizer"
]
clf_params = [
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': (1e-2, 1e-3)},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__C': (np.logspace(-5, 1, 5))},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__C': (np.logspace(-5, 1, 5))},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__max_depth': (1, 2)},
             ]
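
Each dictionary above describes a grid, and `GridSearchCV` tries every combination of the listed values. A small standard-library sketch of that expansion follows; it mirrors the grids above, with `np.logspace(-5, 1, 5)` written out as powers of ten so the example needs nothing beyond `itertools`. The helper `expand_grid` is hypothetical, not a scikit-learn function.

```python
from itertools import product


def expand_grid(param_grid):
    """List every parameter combination in a dict of candidate values."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]


# Five exponents evenly spaced from -5 to 1, as np.logspace(-5, 1, 5) uses
c_values = [10 ** (-5 + 1.5 * i) for i in range(5)]

nb_grid = {"vect__ngram_range": [(1, 1), (1, 2)],
           "clf__alpha": (1e-2, 1e-3)}
svm_grid = {"vect__ngram_range": [(1, 1), (1, 2)],
            "clf__C": c_values}

print(len(expand_grid(nb_grid)), len(expand_grid(svm_grid)))
for candidate in expand_grid(nb_grid):
    print(candidate)
```

With 5-fold cross-validation, a grid of 4 candidates costs 20 fits and a grid of 10 candidates costs 50, so each extra value in a grid multiplies the work.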


In [13]:
for classifier, clf_name, params in zip(classifiers, 
                                        clf_names, 
                                        clf_params):
    for vectorizer, vect_name in zip(vectorizers, 
                                     vect_names):
        pipe = Pipeline([
            ('vect', vectorizer),
            ('clf', classifier),
        ])
        gs = GridSearchCV(pipe, 
                          param_grid=params, 
                          n_jobs=-1,
                          scoring='roc_auc',
                          cv=5,
                          verbose=10)
        
        gs.fit(df.description, df.label_num)
        score = gs.best_score_
        print(f'''
Classifier: {clf_name}
Vectorizer: {vect_name}
Score: {gs.best_score_:.4f}
Params: {gs.best_params_}
------------------------------
            ''')

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    6.4s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    8.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   11.8s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   11.8s finished



Classifier: Naive Bayes
Vectorizer: TfidfVectorizer
Score: 0.9155
Params: {'clf__alpha': 0.01, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    9.3s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    9.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.



Classifier: Naive Bayes
Vectorizer: CountVectorizer
Score: 0.8989
Params: {'clf__alpha': 0.01, 'vect__ngram_range': (1, 1)}
------------------------------
            
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    5.9s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:    9.8s
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:   13.3s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   17.0s
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   22.4s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   26.6s finished



Classifier: Linear SVM
Vectorizer: TfidfVectorizer
Score: 0.9602
Params: {'clf__C': 0.31622776601683794, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:   10.2s
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:   14.5s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   20.2s
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   27.0s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   30.9s finished



Classifier: Linear SVM
Vectorizer: CountVectorizer
Score: 0.9761
Params: {'clf__C': 0.01, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    5.9s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    7.1s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:   14.8s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   18.8s
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   23.6s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   26.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.



Classifier: Logistic Regression
Vectorizer: TfidfVectorizer
Score: 0.9603
Params: {'clf__C': 10.0, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    6.2s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:   10.5s
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:   14.1s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   18.4s
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   23.7s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   26.8s finished



Classifier: Logistic Regression
Vectorizer: CountVectorizer
Score: 0.9748
Params: {'clf__C': 0.31622776601683794, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    6.4s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   10.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   10.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.



Classifier: Random Forest
Vectorizer: TfidfVectorizer
Score: 0.8444
Params: {'clf__max_depth': 2, 'vect__ngram_range': (1, 1)}
------------------------------
            
Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.4s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   10.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   10.1s finished



Classifier: Random Forest
Vectorizer: CountVectorizer
Score: 0.8536
Params: {'clf__max_depth': 2, 'vect__ngram_range': (1, 1)}
------------------------------
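
To compare the runs side by side, the best cross-validated scores printed above can be gathered into one list and ranked. The numbers below are copied from that output; note that they are ROC AUC values (from the `scoring='roc_auc'` setting), not accuracy.

```python
# (classifier, vectorizer, best cross-validated ROC AUC), copied from the output above
results = [
    ("Naive Bayes", "TfidfVectorizer", 0.9155),
    ("Naive Bayes", "CountVectorizer", 0.8989),
    ("Linear SVM", "TfidfVectorizer", 0.9602),
    ("Linear SVM", "CountVectorizer", 0.9761),
    ("Logistic Regression", "TfidfVectorizer", 0.9603),
    ("Logistic Regression", "CountVectorizer", 0.9748),
    ("Random Forest", "TfidfVectorizer", 0.8444),
    ("Random Forest", "CountVectorizer", 0.8536),
]

# Highest score first
ranked = sorted(results, key=lambda r: r[2], reverse=True)

for clf_name, vect_name, score in ranked:
    print(f"{clf_name:<20} {vect_name:<16} {score:.4f}")
```

On these runs the two linear models lead with either vectorizer, and the random forest, restricted here to a maximum depth of 1 or 2, trails.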
            


# Stretch Goals

- Try agglomerative clustering with cosine distance, which tends to work better than Euclidean distance in high-dimensional spaces. An agglomerative method such as Ward linkage would also be worth trying. Try to create a dendrogram of the most important terms in the dataset.

- Resources for the clustering stretch goals:
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
