<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2 Assignment 3*

# Document Classification

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to tune your model's hyperparameters by using different ngram_range and max_features values, and different data cleaning methods
- Try and get the highest accuracy possible!
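
The requirements above can be combined into one search. The following is a minimal sketch, not the assignment's reference solution: it assumes a scikit-learn `Pipeline` whose vectorizer step is swapped between `CountVectorizer` and `TfidfVectorizer`, and it uses a small made-up corpus (`texts`, `labels`) in place of the job listings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the job descriptions
# (1 = Data Scientist, 0 = Data Analyst).
texts = [
    "build machine learning models in python",
    "deep learning research and python modeling",
    "train predictive models with machine learning",
    "machine learning pipelines and statistics research",
    "create excel reports and sql dashboards",
    "sql queries and excel reporting for business",
    "business dashboards and tableau reports",
    "excel reporting and tableau dashboards in sql",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The "vect" entry replaces the whole vectorizer step; the "vect__..."
# entries are then applied to whichever vectorizer is in place.
param_grid = {
    "vect": [
        CountVectorizer(stop_words="english"),
        TfidfVectorizer(stop_words="english"),
    ],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 20],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the vectorizer lives inside the pipeline, it is refit on each training fold, so the vocabulary never sees the held-out fold.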

# Load DataFrame

In [16]:
import re
import string

# !pip install -U nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize # Sentence Tokenizer
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.probability import FreqDist

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup

[nltk_data] Downloading package punkt to /home/superio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/superio/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
import pandas as pd
df = pd.read_csv(
    'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv')
df.head()

Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


# Clean Dataframe

In [20]:
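# Strip leading digits/commas, the b'...' byte-literal wrapper, and HTML tags
def clean_columns(columns):
    for i in columns:
        df[i] = df[i].str.lstrip('1234567890,').str.strip('b\"\'')
        # Remove HTML tags (regex=True must be explicit on recent pandas versions)
        df[i] = df[i].str.replace('<[^<]+?>', '', regex=True)
        # Remove real newline characters inside each string
        df[i] = df[i].str.replace('\n', '', regex=False)
        # Replace the literal two-character sequence backslash + n with a space
        df[i] = df[i].apply(lambda x: str(x).replace('\\n', ' '))
        # NOTE: this matches the decoded characters, not the escaped text
        # '\\xe2\\x80\\x93', so those escape sequences remain in the output below
        df[i] = df[i].apply(lambda x: str(x).replace('\xe2\x80\x93', ' '))
clean_columns(['description', 'title'])
df.head()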

Unnamed: 0,description,title,job,tokenized_description
0,Job Requirements: Conceptual understanding in ...,Data scientist,Data Scientist,"[job, requirements, conceptual, understanding,..."
1,"Job Description As a Data Scientist 1, you wi...",Data Scientist I,Data Scientist,"[job, description, data, scientist, help, us, ..."
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level,Data Scientist,"[data, scientist, working, consulting, side, b..."
3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist,Data Scientist,"[monthcontractunder, general, supervision, pro..."
4,Location: USA \xe2\x80\x93 multiple locations ...,Data Scientist,Data Scientist,"[location, usa, multiple, locations, years, an..."
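
The escaped sequences such as `\xe2\x80\x93` that survive in the output above suggest the descriptions were stored as the printed form of Python `bytes` objects. The sketch below shows one possible way to decode such a string before stripping tags. `decode_listing` and the sample string are hypothetical, and it assumes the raw value is a valid `b'...'` literal encoded as UTF-8.

```python
import ast
import re

def decode_listing(raw):
    """Decode a string like "b'<p>...</p>'" and strip its HTML tags."""
    try:
        # Safely evaluate the b'...' literal into a real bytes object.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: fall back to the raw text.
        value = raw
    if isinstance(value, bytes):
        value = value.decode("utf-8", errors="replace")
    else:
        value = str(value)
    # Replace tags with spaces, then collapse runs of whitespace.
    text = re.sub(r"<[^<]+?>", " ", value)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample modeled on the fifth row shown above.
sample = "b'<ul><li>Location: USA \\xe2\\x80\\x93 multiple locations</li></ul>'"
print(decode_listing(sample))
```

Applied to the DataFrame, this would look like `df['description'].apply(decode_listing)` on the raw column, in place of the string replacements in `clean_columns`.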


In [21]:
# FROM LECTURE
import string
table = str.maketrans('','', string.punctuation)
stop_words = set(stopwords.words('english'))

def tokenize(texts):
    # Tokenize by word
    tokens = word_tokenize(texts)
    # Make all words lowercase
    lowercase_tokens = [w.lower() for w in tokens]
    # Strip punctuation from within words
    no_punctuation = [x.translate(table) for x in lowercase_tokens]
    # Remove words that aren't alphabetic
    alphabetic = [word for word in no_punctuation if word.isalpha()]
    # Remove stopwords
    words = [w for w in alphabetic if w not in stop_words]
    return words

df['tokenized_description'] = df['description'].apply(tokenize)
df['tokenized_title'] = df['title'].apply(tokenize)
df.head()

Unnamed: 0,description,title,job,tokenized_description,tokenized_title
0,Job Requirements: Conceptual understanding in ...,Data scientist,Data Scientist,"[job, requirements, conceptual, understanding,...","[data, scientist]"
1,"Job Description As a Data Scientist 1, you wi...",Data Scientist I,Data Scientist,"[job, description, data, scientist, help, us, ...","[data, scientist]"
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level,Data Scientist,"[data, scientist, working, consulting, side, b...","[data, scientist, entry, level]"
3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist,Data Scientist,"[monthcontractunder, general, supervision, pro...","[data, scientist]"
4,Location: USA \xe2\x80\x93 multiple locations ...,Data Scientist,Data Scientist,"[location, usa, multiple, locations, years, an...","[data, scientist]"
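
The token lists are not used again in this notebook, but they are handy for a quick look at which words dominate each class. A minimal sketch using `collections.Counter`, with hypothetical `rows` standing in for `zip(df['tokenized_description'], df['job'])`:

```python
from collections import Counter

# Hypothetical (tokens, job) pairs standing in for the DataFrame rows.
rows = [
    (["data", "scientist", "python", "models"], "Data Scientist"),
    (["machine", "learning", "python", "models"], "Data Scientist"),
    (["data", "analyst", "excel", "reports"], "Data Analyst"),
    (["sql", "excel", "reports", "dashboards"], "Data Analyst"),
]

# Accumulate one token counter per job label.
counts_by_job = {}
for tokens, job in rows:
    counts_by_job.setdefault(job, Counter()).update(tokens)

for job, counts in counts_by_job.items():
    print(job, counts.most_common(3))
```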


# Train & Test Split

In [35]:
from sklearn.model_selection import train_test_split, GridSearchCV
X = df.dropna(axis=0).description
y = df.dropna(axis=0).job.map({'Data Scientist' : 1, 'Data Analyst' : 0})

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(400,) (100,) (400,) (100,)
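
The split above is random but not stratified, so the class ratio in the 100 test rows can drift from the ratio in the full dataset. If that matters, passing `stratify` is one option. The sketch below uses made-up labels with a 70/30 balance, which is an assumption and not the balance of the real data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical data: 100 listings, 70 labeled 1 and 30 labeled 0.
texts = [f"listing {i}" for i in range(100)]
labels = [1] * 70 + [0] * 30

# stratify=labels keeps the 70/30 ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_tr), Counter(y_te))
```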


In [37]:
X_train

249    Company Overview  Digital Assets Data is a lea...
433    Position: Data Analyst  Job Purpose: Na Ali\xe...
19     $70,000 - $100,000 a yearTitle: Data Analyst/J...
322    This position will report directly to the Data...
332    TemporaryData Analyst Job #: req2358  Organiza...
56     Temporary, InternshipDescription:  The Data An...
301    ContractJob Description  Job Summary This is a...
229    Expedia Do you wish for the opportunity to tra...
331    Essential Functions: The Data Visualization An...
132    Invitae envisions a world in which genomic seq...
137    We\xe2\x80\x99re Elliott Davis, a rapidly grow...
423    Responsible for analytic data needs of the bus...
335    Trinity Industries is searching for a passiona...
25     As a Data Scientist for Ads Measurement in the...
464    Provides subject matter expertise to departmen...
281    General Summary: You will be responsible for h...
247    Position Description you need to have PhD in c...
237    Microsoft envisions a wo

# Count Vectorize

## Vectorize Data

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.get_feature_names_out()[300:325].tolist())

['abstract', 'abstracting', 'abstraction', 'abstractions', 'abstractly', 'abstracts', 'abundant', 'abuse', 'aca', 'academi', 'academia', 'academic', 'academies', 'accelerate', 'accelerates', 'accelerating', 'acceleration', 'accelerator', 'accept', 'acceptability', 'acceptable', 'acceptance', 'accepted', 'accepting', 'accepts']


In [43]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(400, 8706)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 8706)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [58]:
results = []

## Logistic Regression

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

columns = ['model', 'acc_train', 'acc_test', 'vect_type']

result = {}
result['model'] = 'Logistic Regression'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'

results.append(result)

Train Accuracy: 0.9725
Test Accuracy: 0.88




## Multinomial Naive Bayes

In [60]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Multinomial Naive Bayes'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'

results.append(result)

Train Accuracy: 0.965
Test Accuracy: 0.86


## Random Forest Classifier

In [61]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Random Forest'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'

results.append(result)

Train Accuracy: 0.99
Test Accuracy: 0.82




## Result Summary

In [62]:
pd.DataFrame.from_records(results)

Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.88,0.9725,Logistic Regression,Count
1,0.86,0.965,Multinomial Naive Bayes,Count
2,0.82,0.99,Random Forest,Count
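
Each model cell above repeats the same fit, predict, and record steps. One way to cut the repetition is a small helper. `evaluate` below is a hypothetical function, not part of the notebook, and the toy texts are made up. It also passes the sparse matrix from the vectorizer straight to the models, which avoids building a dense DataFrame.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def evaluate(model, name, vect_type, X_train, y_train, X_test, y_test):
    """Fit the model and return one row for the results table."""
    model.fit(X_train, y_train)
    return {
        "model": name,
        "vect_type": vect_type,
        "acc_train": accuracy_score(y_train, model.predict(X_train)),
        "acc_test": accuracy_score(y_test, model.predict(X_test)),
    }

# Hypothetical toy data (1 = Data Scientist, 0 = Data Analyst).
train_texts = [
    "machine learning models in python",
    "deep learning research in python",
    "statistics and machine learning",
    "excel reports and sql queries",
    "tableau dashboards and excel",
    "sql reporting for business",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["python machine learning", "excel and sql reports"]
test_labels = [1, 0]

vectorizer = CountVectorizer(stop_words="english")
X_tr = vectorizer.fit_transform(train_texts)  # sparse matrix
X_te = vectorizer.transform(test_texts)

results = [
    evaluate(LogisticRegression(random_state=42), "Logistic Regression",
             "Count", X_tr, train_labels, X_te, test_labels),
    evaluate(MultinomialNB(), "Multinomial Naive Bayes",
             "Count", X_tr, train_labels, X_te, test_labels),
]
print(results)
```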


# TF-IDF Vectorization Method

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
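
The settings printed above (`use_idf=True`, `smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) determine how each weight is computed. As a rough illustration in plain Python, under the assumption that the documents are already tokenized, each raw count is multiplied by a smoothed idf of ln((1 + n) / (1 + df)) + 1 and each document vector is then scaled to unit length. The three `docs` are made up.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents.
docs = [
    ["data", "scientist", "python"],
    ["data", "analyst", "excel"],
    ["data", "analyst", "sql", "sql"],
]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed idf (smooth_idf=True): ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in doc_freq.items()}

def tfidf_vector(doc):
    """Raw term counts times idf, scaled to unit L2 length (norm='l2')."""
    counts = Counter(doc)
    weights = {term: count * idf[term] for term, count in counts.items()}
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / length for term, w in weights.items()}

first = tfidf_vector(docs[0])
print(first)
```

A term that appears in every document, like "data" here, gets the minimum idf of 1, so rarer terms such as "scientist" end up with larger weights in the same document.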

## Vectorize Data

In [64]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(400, 8706)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.253112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.015946,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 8706)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [66]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Logistic'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)

Train Accuracy: 0.9725
Test Accuracy: 0.88




## Multinomial Naive Bayes

In [67]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Naive Bayes'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)

Train Accuracy: 0.965
Test Accuracy: 0.86


## Random Forest Classifier

In [68]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Random Forest'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)



Train Accuracy: 0.99
Test Accuracy: 0.79


## XGBoost Classifier

In [70]:
from xgboost.sklearn import XGBClassifier

XGB = XGBClassifier(n_jobs = -1).fit(X_train_vectorized, y_train)

train_predictions = XGB.predict(X_train_vectorized)
test_predictions = XGB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Xgboost'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)

Train Accuracy: 0.9925
Test Accuracy: 0.93


## Result Summary

In [71]:
pd.DataFrame.from_records(results)

Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.88,0.9725,Logistic Regression,Count
1,0.86,0.965,Multinomial Naive Bayes,Count
2,0.82,0.99,Random Forest,Count
3,0.88,0.9725,Logistic,Tfidf
4,0.86,0.965,Naive Bayes,Tfidf
5,0.79,0.99,Random Forest,Tfidf
6,0.93,0.9925,Xgboost,Tfidf
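
With only 100 test rows, gaps of a few points between models may not be meaningful. As a rough check, the sketch below computes a normal-approximation 95% interval for a test accuracy. `accuracy_interval` is a hypothetical helper, and the approximation is an assumption that is only loosely valid at this sample size.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate confidence interval for an accuracy measured on n rows."""
    se = math.sqrt(acc * (1 - acc) / n)  # binomial standard error
    return max(0.0, acc - z * se), min(1.0, acc + z * se)

# Test accuracies taken from the summary table above (100 test rows each).
scores = [
    ("Logistic Regression (Count)", 0.88),
    ("Random Forest (Tfidf)", 0.79),
    ("Xgboost (Tfidf)", 0.93),
]
for name, acc in scores:
    low, high = accuracy_interval(acc, n=100)
    print(f"{name}: {acc:.2f} ({low:.3f} to {high:.3f})")
```

Overlapping intervals suggest that repeating the comparison with cross-validation or a larger test set would be worthwhile before picking a winner.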


# Stretch Goals

- Try agglomerative clustering using cosine distance (which works better in high-dimensional spaces) for more robust clustering. Agglomerative clustering with Ward linkage is another option to try. Then create a dendrogram of the most important terms in the dataset.

- Resources for the clustering stretch goals:
  - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
  - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>

- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset:
  - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
  - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
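
For the agglomerative clustering stretch goal, here is a minimal sketch using SciPy's hierarchical clustering with cosine distance and average linkage. The four `vectors` are made-up stand-ins for TF-IDF rows, and `names` are hypothetical labels. Note that SciPy documents Ward linkage as correctly defined only for Euclidean distance, so this sketch uses average linkage with the cosine metric instead.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical document vectors (rows) standing in for TF-IDF features.
names = ["scientist_a", "scientist_b", "analyst_a", "analyst_b"]
vectors = np.array([
    [1.0, 0.8, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.8, 1.0],
])

# Build the merge tree from pairwise cosine distances.
Z = linkage(vectors, method="average", metric="cosine")

# Cut the tree into at most two flat clusters.
cluster_ids = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the dendrogram layout without drawing it;
# drop that argument (with matplotlib installed) to draw the figure.
tree = dendrogram(Z, labels=names, no_plot=True)

print(dict(zip(names, cluster_ids)))
print(tree["ivl"])
```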
