
## *Data Science Unit 4 Sprint 2 Assignment 3*

# Document Classification

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to tune your model's hyperparameters by using different `ngram_range` and `max_features` settings and different data cleaning methods
- Try to get the highest accuracy possible!
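
One way to approach the tuning requirement is to wrap the vectorizer and the classifier in a single scikit-learn `Pipeline` and search over both at once. The sketch below is only an illustration: the toy corpus, the parameter grid, and `cv=2` are assumptions chosen to keep it small, not settings taken from this assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical); swap in the cleaned job descriptions.
docs = [
    "build machine learning models in python",
    "train neural networks and deploy models",
    "deep learning research with python",
    "predictive modeling and machine learning",
    "create dashboards and excel reports",
    "sql queries and reporting for business",
    "excel reporting and data visualization",
    "business dashboards with sql and tableau",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier are tuned together, so each CV fold
# fits its vocabulary on that fold's training rows only.
pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(solver="lbfgs")),
])

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

On the real data, `docs` and `labels` would be the cleaned descriptions and the label column, and a larger `cv` value would probably be more appropriate.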

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('job_listings.csv')

In [7]:
pd.set_option('display.max_colwidth', 200)
df.head()

Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>\nConceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random For...",Data scientist,Data Scientist
1,"b'<div>Job Description<br/>\n<br/>\n<p>As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare journ...",Data Scientist I,Data Scientist
2,"b'<div><p>As a Data Scientist you will be working on consulting side of our business. You will be responsible for analyzing large, complex datasets and identify meaningful patterns that lead to ac...",Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-u-xs-mb--md""><div class=""jobsearch-JobMetadataHeader-item ""><span class=""icl-u-xs-mr--xs"">$4,969 - $6,756 a month</span></div><div class=""jobsearch-Jo...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li>\n<li>2+ years of Analytics experience</li>\n<li>Understand business requirements and technical requirements</li>\n<li>Can handle data e...,Data Scientist,Data Scientist


In [4]:
df['job'].value_counts()

Data Scientist    250
Data Analyst      250
Name: job, dtype: int64

In [5]:
df['description'].sample(5)

312    b'<div><p>Safe Banking Systems, a part of Accu...
488    b"<div></div><div><div><div>We are presently s...
466    b'<div class="jobsearch-JobMetadataHeader icl-...
207    b'<div class="jobsearch-JobMetadataHeader icl-...
355    b'<div class="jobsearch-JobMetadataHeader icl-...
Name: description, dtype: object

In [9]:
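# Strip HTML tags, keeping only the text content
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialize the base parser; convert_charrefs=True decodes
        # entities such as &amp; before they reach handle_data
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        # Collect every text node found between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()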

In [21]:
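vals = []

for text in df['description']:
    clean = strip_tags(str(text))
    # The descriptions are stringified bytes, so line breaks show up as the
    # two characters backslash + n. Replace them before dropping the other
    # backslashes; otherwise a stray 'n' is left glued to the next word.
    clean = clean.replace('\\n', ' ')
    clean = clean.replace('\n', ' ')
    clean = clean.replace('\\', '')
    clean = clean.replace('/', '')
    vals.append(clean)

df['clean_desc'] = vals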
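
Because each description is stored as the text of a Python bytes literal (`b'...'`), another option is to decode it before stripping tags, which also turns escapes such as `\xe2\x80\x93` back into real characters. The following is a minimal sketch under that assumption; `TextExtractor` and `clean_description` are hypothetical helper names, and the sample string is made up for illustration.

```python
import ast
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_description(raw):
    """Decode a "b'...'" string, strip tags, and collapse whitespace."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        value = raw  # not a Python literal; use the text unchanged
    if isinstance(value, bytes):
        value = value.decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(str(value))
    parser.close()
    # Join with spaces so words from adjacent tags do not run together.
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample in the same shape as the dataset's description column.
sample = "b'<div><p>Analyze data \\xe2\\x80\\x93 daily</p>\\n<ul><li>SQL</li></ul></div>'"
print(clean_description(sample))
```

Applied to the DataFrame this would look like `df['description'].apply(clean_description)`. Joining text nodes with a space can split a word that has an inline tag in the middle of it, so this is a trade-off rather than a strict improvement.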

In [24]:
df['label'] = df['job'].map({
    'Data Scientist': '0',
    'Data Analyst': '1'
})

In [26]:
df.label.value_counts()

1    250
0    250
Name: label, dtype: int64

## CountVectorizer

In [28]:
from sklearn.model_selection import train_test_split

X = df['clean_desc']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [30]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((400,), (100,), (400,), (100,))
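
The split above is unseeded, so the accuracies reported further down will change from run to run. A possible variation, sketched here on made-up data, fixes `random_state` and passes `stratify` so both classes keep the same proportion in the train and test sets. The seed value 42 is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned descriptions and their labels.
texts = [f"doc {i}" for i in range(20)]
labels = ["0"] * 10 + ["1"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,
    stratify=labels,   # keep the 50/50 class balance in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)

print(Counter(y_tr), Counter(y_te))
```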

In [33]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

# vocabulary_ works across scikit-learn versions; get_feature_names() was removed in 1.2
print(len(vectorizer.vocabulary_))

13965


In [34]:
X_train_vectorized = vectorizer.transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

### Logistic Regression

In [44]:
results = []

In [38]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='lbfgs')

lr.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [39]:
train_predictions = lr.predict(X_train_vectorized)
test_predictions = lr.predict(X_test_vectorized)

In [42]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 1.0
Test Accuracy: 0.88
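
The test accuracy above comes from a single 100-row split, so part of the gap to the training accuracy may be noise. Cross-validation gives a steadier estimate. This is a rough sketch on a toy corpus; the documents, labels, and `cv=3` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); use the cleaned descriptions in practice.
docs = [
    "machine learning models in python",
    "deep learning and neural networks",
    "python research on predictive models",
    "statistical modeling and machine learning",
    "neural networks for predictive research",
    "deploy python models to production",
    "excel reports and dashboards",
    "sql queries for business reporting",
    "tableau dashboards and sql",
    "business reporting in excel",
    "weekly reports with sql and tableau",
    "dashboards for business stakeholders",
]
labels = [0] * 6 + [1] * 6

# The pipeline refits the vectorizer inside every fold, so no
# vocabulary information leaks from the held-out rows.
pipe = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(solver="lbfgs"),
)

scores = cross_val_score(pipe, docs, labels, cv=3, scoring="accuracy")
print(scores.mean(), scores.std())
```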


In [45]:
lr_result = {}
lr_result['model'] = 'Logistic Regression'
lr_result['acc_train'] = accuracy_score(y_train, train_predictions)
lr_result['acc_test'] = accuracy_score(y_test, test_predictions)
lr_result['vect_type'] = 'Count'

results.append(lr_result)

### Random Forest Classifier

In [51]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=100, max_depth=4)

RFC.fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9475
Test Accuracy: 0.87


In [50]:
rf_result = {}
rf_result['model'] = 'Random Forest'
rf_result['acc_train'] = accuracy_score(y_train, train_predictions)
rf_result['acc_test'] = accuracy_score(y_test, test_predictions)
rf_result['vect_type'] = 'Count'

# Note: every re-run of this cell appends another Random Forest row
results.append(rf_result)
results

[{'model': 'Logistic Regression',
  'acc_train': 1.0,
  'acc_test': 0.88,
  'vect_type': 'Count'},
 {'model': 'Random Forest',
  'acc_train': 0.995,
  'acc_test': 0.84,
  'vect_type': 'Count'},
 {'model': 'Random Forest',
  'acc_train': 0.91,
  'acc_test': 0.84,
  'vect_type': 'Count'},
 {'model': 'Random Forest',
  'acc_train': 0.955,
  'acc_test': 0.87,
  'vect_type': 'Count'}]
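
The result dictionaries above are assembled by hand in each cell, which makes it easy to score the wrong predictions after a copy and paste. One possible alternative is a small helper that fits a model and builds the row from that model's own predictions. `evaluate` is a hypothetical name, and the toy data below only stands in for the vectorized job descriptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def evaluate(model, name, vect_type, X_tr, y_tr, X_te, y_te):
    """Fit `model` and return one result row built from its own predictions."""
    model.fit(X_tr, y_tr)
    return {
        "model": name,
        "vect_type": vect_type,
        "acc_train": accuracy_score(y_tr, model.predict(X_tr)),
        "acc_test": accuracy_score(y_te, model.predict(X_te)),
    }


# Toy data (hypothetical) standing in for the real train/test split.
train_docs = ["machine learning models", "python deep learning",
              "excel reports", "sql dashboards"]
train_y = ["0", "0", "1", "1"]
test_docs = ["learning python", "sql reports"]
test_y = ["0", "1"]

# Fit the vocabulary on the training documents only.
vect = CountVectorizer().fit(train_docs)

row = evaluate(
    LogisticRegression(solver="lbfgs"), "Logistic Regression", "Count",
    vect.transform(train_docs), train_y,
    vect.transform(test_docs), test_y,
)
results = [row]
print(results)
```

Each model and vectorizer combination then becomes a single `results.append(evaluate(...))` call, so no prediction variable is shared between cells.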

## TF-IDF Vectorization

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,10), stop_words='english')

vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 10), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [74]:
train_tf = vectorizer.transform(X_train)
test_tf = vectorizer.transform(X_test)
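
To see what changes when switching from counts to TF-IDF, it can help to compare the two on a tiny corpus. In this sketch (made-up documents, default vectorizer settings), the word "data" appears in every document, so TF-IDF gives it less weight than a rarer word with the same raw count.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus: "data" is in every document, the other words are rare.
docs = [
    "data python python",
    "data sql",
    "data excel",
]

count_vect = CountVectorizer()
tfidf_vect = TfidfVectorizer()

counts = count_vect.fit_transform(docs).toarray()
tfidf = tfidf_vect.fit_transform(docs).toarray()

# vocabulary_ maps each token to its column index.
c = count_vect.vocabulary_
t = tfidf_vect.vocabulary_

# Second document ("data sql"): equal raw counts, unequal TF-IDF weights.
print("counts:", counts[1, c["data"]], counts[1, c["sql"]])
print("tfidf: ", tfidf[1, t["data"]], tfidf[1, t["sql"]])
```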

### Logistic Regression

In [75]:
lr = LogisticRegression(solver='lbfgs')

lr.fit(train_tf, y_train)

train_pred = lr.predict(train_tf)
test_pred = lr.predict(test_tf)

print(f'Train Accuracy: {accuracy_score(y_train, train_pred)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_pred)}')

Train Accuracy: 1.0
Test Accuracy: 0.86


In [76]:
result = {}
result['model'] = 'Logistic'
result['acc_train'] = accuracy_score(y_train, train_pred)
result['acc_test'] = accuracy_score(y_test, test_pred)
result['vect_type'] = 'Tfidf'

results.append(result)

### Random Forest Classifier

In [77]:
rfc = RandomForestClassifier(max_depth=4, n_estimators=100)

rfc.fit(train_tf, y_train)

train_pred = rfc.predict(train_tf)
test_pred = rfc.predict(test_tf)

print(f'Train Accuracy: {accuracy_score(y_train, train_pred)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_pred)}')

Train Accuracy: 0.9525
Test Accuracy: 0.82


In [78]:
result = {}
result['model'] = 'Random Forest'
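# Score the TF-IDF predictions (train_pred / test_pred); train_predictions and
# test_predictions still hold the CountVectorizer Random Forest output
result['acc_train'] = accuracy_score(y_train, train_pred)
result['acc_test'] = accuracy_score(y_test, test_pred)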
result['vect_type'] = 'Tfidf'

results.append(result)

In [79]:
pd.DataFrame(results)

|   | acc_test | acc_train | model | vect_type |
|---|---|---|---|---|
| 0 | 0.88 | 1.0 | Logistic Regression | Count |
| 1 | 0.84 | 0.995 | Random Forest | Count |
| 2 | 0.84 | 0.91 | Random Forest | Count |
| 3 | 0.87 | 0.955 | Random Forest | Count |
| 4 | 0.84 | 0.98 | Logistic | Tfidf |
| 5 | 0.87 | 0.9475 | Random Forest | Tfidf |
| 6 | 0.86 | 1.0 | Logistic | Tfidf |
| 7 | 0.87 | 0.9475 | Random Forest | Tfidf |


# Stretch Goals

- Try agglomerative clustering with cosine distance, which tends to work better than Euclidean distance in high-dimensional spaces. Ward linkage (which assumes Euclidean distances) would also be worth trying. Then try to create a dendrogram of the most important terms from the dataset.

- Awesome resources for the clustering stretch goals:
  - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
  - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>

- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset:
  - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
  - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
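
As a starting point for the agglomerative clustering stretch goal, the sketch below clusters a handful of made-up documents with SciPy using cosine distance. It uses average linkage rather than Ward, since SciPy's Ward method is only well defined for Euclidean distances. The corpus and the choice of two clusters are assumptions for illustration.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus (hypothetical); use the cleaned descriptions in practice.
docs = [
    "machine learning models python",
    "deep learning python research",
    "machine learning research",
    "sql dashboards excel reports",
    "excel reports business sql",
    "business dashboards sql",
]

# Dense TF-IDF matrix: one row per document.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Condensed matrix of pairwise cosine distances between documents.
distances = pdist(tfidf, metric="cosine")

# Average linkage accepts arbitrary precomputed distances.
Z = linkage(distances, method="average")

# Cut the tree into at most two flat clusters.
cluster_ids = fcluster(Z, t=2, criterion="maxclust")
print(cluster_ids)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` (with matplotlib installed) would draw the tree. Clustering the transposed matrix instead would group terms rather than documents.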
