<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2 Assignment 3*

# Document Classification

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer to this data and compare the results
- Use at least two different classification models and compare their accuracy
- Try to tune your model's hyperparameters by using different ngram_range values, max_features settings, and data cleaning methods
- Try to get the highest accuracy possible!
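One way to approach the tuning requirement is a scikit-learn Pipeline searched with GridSearchCV. The sketch below is illustrative only: the toy corpus, the grid values, and the choice of logistic regression are assumptions, not part of the assignment, and the real job descriptions would replace `docs` and `labels`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the job descriptions.
docs = [
    "build machine learning models in python",
    "train neural networks and deploy models",
    "deep learning research with python",
    "machine learning pipelines and statistics",
    "create excel reports and sql dashboards",
    "sql queries and tableau dashboards",
    "weekly excel reporting for business teams",
    "business reporting with sql and tableau",
]
labels = ["Data Scientist"] * 4 + ["Data Analyst"] * 4

# The vectorizer is a pipeline step, so it is refit on each training fold only.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Swap the vectorizer itself and vary its settings in the same search.
param_grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [50, None],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the search scores each combination by cross-validation, the held-out test set can be kept for a single final check instead of being used to pick settings.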

In [4]:
import pandas as pd
pd.set_option('display.max_colwidth', 200)


df = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv')

df.head()

Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>\nConceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random For...",Data scientist,Data Scientist
1,"b'<div>Job Description<br/>\n<br/>\n<p>As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare journ...",Data Scientist I,Data Scientist
2,"b'<div><p>As a Data Scientist you will be working on consulting side of our business. You will be responsible for analyzing large, complex datasets and identify meaningful patterns that lead to ac...",Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-u-xs-mb--md""><div class=""jobsearch-JobMetadataHeader-item ""><span class=""icl-u-xs-mr--xs"">$4,969 - $6,756 a month</span></div><div class=""jobsearch-Jo...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li>\n<li>2+ years of Analytics experience</li>\n<li>Understand business requirements and technical requirements</li>\n<li>Can handle data e...,Data Scientist,Data Scientist


In [3]:
df.shape

(500, 3)

In [6]:
df['job'].value_counts()

Data Scientist    250
Data Analyst      250
Name: job, dtype: int64

In [10]:
df['title'].value_counts(normalize=True).sort_values().tail()

Senior Data Analyst      0.016032
Senior Data Scientist    0.016032
Data Analyst Intern      0.026052
Data Analyst             0.160321
Data Scientist           0.168337
Name: title, dtype: float64

In [24]:
df.isna().sum()

description    1
title          1
job            0
dtype: int64

In [0]:
df = df.dropna()
df.shape

(499, 3)
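The descriptions shown by df.head() are stringified bytes literals (they start with b' or b") that still contain HTML tags and escaped UTF-8 bytes such as \xe2\x80\x93. A minimal cleaning sketch, assuming every row has that shape, is shown here. The function name `clean_description` is hypothetical, and the regex tag stripping is a simplification rather than a full HTML parser.

```python
import ast
import html
import re


def clean_description(raw):
    """Decode a stringified bytes literal and strip HTML markup."""
    text = raw
    if raw.startswith(("b'", 'b"')):
        try:
            # Turn the "b'...'" text back into real bytes, then decode as UTF-8.
            text = ast.literal_eval(raw).decode("utf-8", errors="replace")
        except (ValueError, SyntaxError):
            # Fall back to dropping the b'...' wrapper if the literal is malformed.
            text = raw[2:-1]
    text = re.sub(r"<[^>]+>", " ", text)   # replace tags with a space
    text = html.unescape(text)             # convert entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()


# A shortened row in the same shape as the scraped data.
sample = r"b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li>\n<li>2+ years of Analytics experience</li></ul>'"
print(clean_description(sample))
```

On the notebook's data this could be applied with df['description'].apply(clean_description) before the train/test split, which should also keep byte-escape fragments such as "xe2" out of the vocabulary.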

# Count Vectorizer Model


In [195]:
from sklearn.model_selection import train_test_split

X = df['description']
y = df['job']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=812)

X_train.shape, X_test.shape

((324,), (175,))

In [196]:
# Transform train data

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000, ngram_range=(1,2), stop_words='english')

vectorizer.fit(X_train)

train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(324, 10000)


Unnamed: 0,00,000,000 employees,000 year,10,10 million,10 years,100,100 000,100 companies,...,years xe2,yes,york,york city,youth,yrs,zeus,zillow,zone,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [197]:
# Transform test data

test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(175, 10000)


Unnamed: 0,00,000,000 employees,000 year,10,10 million,10 years,100,100 000,100 companies,...,years xe2,yes,york,york city,youth,yrs,zeus,zillow,zone,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,1,2,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Logistic Regression

In [198]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

# One record per model/vectorizer pair, collected for the final comparison table
lr_result = {}
lr_result['model'] = 'Logistic Regression'
lr_result['acc_train'] = accuracy_score(y_train, train_predictions)
lr_result['acc_test'] = accuracy_score(y_test, test_predictions)
lr_result['vect_type'] = 'Count'

results = []
results.append(lr_result)

Train Accuracy: 0.9969135802469136
Test Accuracy: 0.92




### Multinomial Naive Bayes

In [199]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Multinomial Naive Bayes'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'

results.append(result)

Train Accuracy: 0.9753086419753086
Test Accuracy: 0.8742857142857143


### Random Forest

In [200]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Random Forest'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'

results.append(result)

Train Accuracy: 0.9969135802469136
Test Accuracy: 0.8971428571428571




### XGB Model

In [201]:
from xgboost import XGBClassifier

XGB = XGBClassifier(max_depth=8,
                    n_estimators=20).fit(X_train_vectorized, y_train)

train_predictions = XGB.predict(X_train_vectorized)
test_predictions = XGB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'XGB'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'

results.append(result)

Train Accuracy: 0.9907407407407407
Test Accuracy: 0.9257142857142857


In [202]:
results

[{'acc_test': 0.92,
  'acc_train': 0.9969135802469136,
  'model': 'Logistic Regression',
  'vect_type': 'Count'},
 {'acc_test': 0.8742857142857143,
  'acc_train': 0.9753086419753086,
  'model': 'Multinomial Naive Bayes',
  'vect_type': 'Count'},
 {'acc_test': 0.8971428571428571,
  'acc_train': 0.9969135802469136,
  'model': 'Random Forest',
  'vect_type': 'Count'},
 {'acc_test': 0.9257142857142857,
  'acc_train': 0.9907407407407407,
  'model': 'XGB',
  'vect_type': 'Count'}]

# TF-IDF
The same four classifiers as above, now trained on TF-IDF features.

In [203]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1,2), stop_words='english')

vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=10000, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [204]:
# Vectorize train data

train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(324, 10000)


Unnamed: 0,00,000,000 employees,000 year,10,10 million,10 years,100,100 000,100 companies,...,years xe2,yes,york,york city,youth,yrs,zeus,zillow,zone,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [205]:
# Vectorize test data

test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(175, 10000)


Unnamed: 0,00,000,000 employees,000 year,10,10 million,10 years,100,100 000,100 companies,...,years xe2,yes,york,york city,youth,yrs,zeus,zillow,zone,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.019587,0.0,0.028907,0.040831,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Logistic Regression

In [206]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Logistic'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)

Train Accuracy: 0.9845679012345679
Test Accuracy: 0.88




### Multinomial Naive Bayes

In [207]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Naive Bayes'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)

Train Accuracy: 0.9629629629629629
Test Accuracy: 0.8342857142857143


### Random Forest

In [208]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Random Forest'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)



Train Accuracy: 0.9938271604938271
Test Accuracy: 0.8628571428571429


### XGB Model

In [209]:
from xgboost import XGBClassifier

XGB = XGBClassifier(max_depth=8,
                    n_estimators=20).fit(X_train_vectorized, y_train)

train_predictions = XGB.predict(X_train_vectorized)
test_predictions = XGB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'XGB'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)

Train Accuracy: 0.9938271604938271
Test Accuracy: 0.9428571428571428


In [210]:
results = pd.DataFrame.from_records(results)
results.sort_values(by='acc_test', ascending=False)

Unnamed: 0,acc_test,acc_train,model,vect_type
7,0.942857,0.993827,XGB,Tfidf
3,0.925714,0.990741,XGB,Count
0,0.92,0.996914,Logistic Regression,Count
2,0.897143,0.996914,Random Forest,Count
4,0.88,0.984568,Logistic,Tfidf
1,0.874286,0.975309,Multinomial Naive Bayes,Count
6,0.862857,0.993827,Random Forest,Tfidf
5,0.834286,0.962963,Naive Bayes,Tfidf
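The cells above repeat the same fit, predict, and record steps for every model and vectorizer pair. A possible refactor is sketched below. The helper name `evaluate` and the toy train/test data are hypothetical; the notebook's own X_train, X_test, y_train, and y_test would be passed in instead. The sketch also keeps the vectorized data as a sparse matrix, which these classifiers accept directly.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def evaluate(model, model_name, vectorizer, vect_type,
             X_train, y_train, X_test, y_test):
    """Fit one vectorizer/model pair and return a results record."""
    train_matrix = vectorizer.fit_transform(X_train)  # vocabulary from train only
    test_matrix = vectorizer.transform(X_test)
    model.fit(train_matrix, y_train)
    return {
        "model": model_name,
        "vect_type": vect_type,
        "acc_train": accuracy_score(y_train, model.predict(train_matrix)),
        "acc_test": accuracy_score(y_test, model.predict(test_matrix)),
    }


# Hypothetical toy data standing in for the notebook's train/test split.
X_train = [
    "build machine learning models in python",
    "train neural networks and deploy models",
    "deep learning research with python",
    "machine learning pipelines and statistics",
    "create excel reports and sql dashboards",
    "sql queries and tableau dashboards",
    "weekly excel reporting for business teams",
    "business reporting with sql and tableau",
]
y_train = ["Data Scientist"] * 4 + ["Data Analyst"] * 4
X_test = [
    "python machine learning models",
    "neural networks research",
    "excel and sql reporting",
    "tableau dashboards for business",
]
y_test = ["Data Scientist"] * 2 + ["Data Analyst"] * 2

vectorizers = {"Count": CountVectorizer, "Tfidf": TfidfVectorizer}
models = {
    "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB,
}

# Build a fresh vectorizer and model for every combination.
records = [
    evaluate(make_model(), model_name, make_vect(ngram_range=(1, 2)), vect_type,
             X_train, y_train, X_test, y_test)
    for vect_type, make_vect in vectorizers.items()
    for model_name, make_model in models.items()
]
for record in records:
    print(record)
```

Passing `records` to pd.DataFrame.from_records would produce a table with the same columns as the one above, with consistent model names across both vectorizers.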


# Stretch Goals

- Try agglomerative clustering with cosine distance (it works better in high-dimensional spaces) for more robust clustering. An agglomerative method such as Ward would be interesting to try. Then create a dendrogram of the most important terms in the dataset.

- Resources for the clustering stretch goal:
  - Agglomerative Clustering with SciPy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
  - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>

- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset:
  - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
  - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
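As a starting point for the agglomerative clustering stretch goal, here is a minimal sketch using SciPy's hierarchical clustering on TF-IDF vectors. The toy documents and labels are hypothetical. Average linkage is an assumption made here because SciPy's Ward method is only defined for Euclidean distance, so it cannot be combined with the cosine metric directly.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for the job descriptions.
docs = [
    "machine learning models python",
    "deep learning neural networks python",
    "statistics machine learning research",
    "excel reports sql dashboards",
    "sql tableau dashboards reporting",
    "business reporting excel sql",
]
doc_labels = [f"doc_{i}" for i in range(len(docs))]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Agglomerative clustering with average linkage on cosine distance.
# linkage() expects a dense array of observations, one row per document.
Z = linkage(tfidf.toarray(), method="average", metric="cosine")

# Cut the tree into at most two flat clusters.
cluster_ids = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(doc_labels, cluster_ids)))

# no_plot=True returns the dendrogram layout without requiring matplotlib;
# drop it inside a notebook to draw the tree.
tree = dendrogram(Z, labels=doc_labels, no_plot=True)
print(tree["ivl"])  # leaf order along the dendrogram axis
```

To build a dendrogram of terms rather than documents, the same calls could be run on the transposed matrix (tfidf.T.toarray()), limited to a small max_features so that each term is a row and the plot stays readable.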
