# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to "Hyperparameter Tune" your model by using different n_gram ranges, max_results, and data cleaning methods
- Try and get the highest accuracy possible!

In [69]:
import sys
!conda update --yes --prefix {sys.prefix} pandas

Collecting package metadata: done
Solving environment: done


  current version: 4.6.7
  latest version: 4.6.8

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /home/mark/anaconda3/envs/myf

  added / updated specs:
    - pandas


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.1.23  |                0         126 KB
    certifi-2019.3.9           |           py36_0         155 KB
    openssl-1.0.2r             |       h7b6447c_0         3.2 MB
    pandas-0.24.2              |   py36he6710b0_0        11.1 MB
    ------------------------------------------------------------
                                           Total:        14.5 MB

The following packages will be UPDATED:

  pandas                              0.24.1-py36he6710b0_0 --> 0.24.2-py36he6710b0_0

The following packages will be SUPERSEDED 

In [2]:
import pandas as pd
import html
from bs4 import BeautifulSoup as Soup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from xgboost import plot_importance
import xgboost as xgb
import numpy as np
import category_encoders as ce
from sklearn.pipeline import Pipeline
from matplotlib.pylab import rcParams
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [3]:
pd.__version__

'0.24.2'

In [4]:
url = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv'
df = pd.read_csv(url, encoding="ISO-8859-1" )

In [5]:
df.dropna(inplace=True)

In [6]:
df['text'] = df.description.apply(lambda d: Soup(d).get_text())
df_ = df.drop('description', inplace=False, axis=1)
df_.head()

Unnamed: 0,title,job,text
0,Data scientistÂ,Data Scientist,"b""Job Requirements:\nConceptual understanding ..."
1,Data Scientist I,Data Scientist,"b'Job Description\n\nAs a Data Scientist 1, yo..."
2,Data Scientist - Entry Level,Data Scientist,b'As a Data Scientist you will be working on c...
3,Data Scientist,Data Scientist,"b'$4,969 - $6,756 a monthContractUnder the gen..."
4,Data Scientist,Data Scientist,b'Location: USA \xe2\x80\x93 multiple location...


In [7]:
X = df_.text
enc = ce.OrdinalEncoder()
y = enc.fit_transform(df_.job.values)[0]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(399,)
(100,)
(399,)
(100,)


In [10]:
print(type(X_train),type(X_test))

<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>


In [11]:
vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [12]:
train_word_counts = vectorizer.transform(X_train)
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

test_word_counts = vectorizer.transform(X_test)
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())
print(X_train_vectorized.shape)
print(X_test_vectorized.shape)

(399, 10003)
(100, 10003)


In [13]:
XGB = XGBClassifier(n_estimators=200,num_class=len(df_.job.unique()), objective='multi:softmax').fit(X_train_vectorized, y_train)
train_predictions = XGB.predict(X_train_vectorized)
test_predictions = XGB.predict(X_test_vectorized)
print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print(f'Train Roc Auc: {roc_auc_score(y_train, train_predictions)}')
print(f'Test Roc Auc: {roc_auc_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.92
Train Roc Auc: 0.9974874371859296
Test Roc Auc: 0.9211684673869548


In [14]:

LR = LogisticRegression(random_state=42, solver="newton-cg").fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)
print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print(f'Train Roc Auc: {roc_auc_score(y_train, train_predictions)}')
print(f'Test Roc Auc: {roc_auc_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.89
Train Roc Auc: 0.9974874371859296
Test Roc Auc: 0.8905562224889956


In [15]:
RFC = RandomForestClassifier(n_estimators=200).fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print(f'Train Roc Auc: {roc_auc_score(y_train, train_predictions)}')
print(f'Test Roc Auc: {roc_auc_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.93
Train Roc Auc: 0.9974874371859296
Test Roc Auc: 0.9305722288915567


In [16]:
vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [17]:
train_word_counts = vectorizer.transform(X_train)
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

test_word_counts = vectorizer.transform(X_test)
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())
print(X_train_vectorized.shape)
print(X_test_vectorized.shape)

(399, 10003)
(100, 10003)


In [18]:

LR = LogisticRegression(random_state=42, solver="newton-cg").fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)
print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print(f'Train Roc Auc: {roc_auc_score(y_train, train_predictions)}')
print(f'Test Roc Auc: {roc_auc_score(y_test, test_predictions)}')

Train Accuracy: 0.974937343358396
Test Accuracy: 0.86
Train Roc Auc: 0.9749246231155778
Test Roc Auc: 0.8615446178471389


In [19]:
RFC = RandomForestClassifier(n_estimators=200).fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print(f'Train Roc Auc: {roc_auc_score(y_train, train_predictions)}')
print(f'Test Roc Auc: {roc_auc_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.9
Train Roc Auc: 0.9975
Test Roc Auc: 0.9011604641856743


In [71]:
XGB = XGBClassifier(n_estimators=200, num_class=len(df_.job.unique()), objective='multi:softmax', \
    boost='dart', max_depth=8, eta=0.2, sub_sample=0.5 ) \
.fit(X_train_vectorized, y_train)
train_predictions = XGB.predict(X_train_vectorized)
test_predictions = XGB.predict(X_test_vectorized)
print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print(f'Train Roc Auc: {roc_auc_score(y_train, train_predictions)}')
print(f'Test Roc Auc: {roc_auc_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.93
Train Roc Auc: 0.9974874371859296
Test Roc Auc: 0.9309723889555822


Train Accuracy: 0.9974937343358395
Test Accuracy: 0.93
Train Roc Auc: 0.9975
Test Roc Auc: 0.9309723889555822

In [69]:
pipe = Pipeline(steps = [
    ('xgb', XGBClassifier(num_class=len(df_.job.unique()), objective='multi:softmax' ))
])

In [58]:
param_grid = {
    #     'pca__n_components': [28],
#     "xgb__booster": ["gbtree","dart"],
#     "xgb__gamma": [0],
#     "xgb__learning_rate": [0.2],
#     "xgb__n_estimators": [200],
#     #     "gb__min_samples_leaf": [3],
#     #     "gb__min_impurity_decrease": [1.2],
#     "xgb__max_depth": [4]
}
# Fit on the train set, with grid search cross-validation
gs = GridSearchCV(pipe, param_grid=param_grid, cv=2, n_jobs=-1,
                  scoring='roc_auc',
                  verbose=1)

In [59]:
gsf = gs.fit(X_train_vectorized,y_train)

Fitting 2 folds for each of 1 candidates, totalling 2 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    8.7s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    8.7s finished


In [60]:
print('Best Parameter (roc_auc score=%0.3f):' % gsf.best_score_)
print(gsf.best_params_)

Best Parameter (roc_auc score=0.927):
{}


In [61]:
py = gsf.predict(X_test_vectorized)
roc_auc_score(y_test,py)

0.9207683073229291

# Stretch Goals

- Try some agglomerative clustering using cosine-similarity-distance! (works better with high dimensional spaces) robust clustering - Agglomerative clustering like Ward would be cool. Try and create an awesome Dendrogram of the most important terms from the dataset.

- Awesome resource for clustering stretch goals: 
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
