<a href="https://colab.research.google.com/github/trista-paul/DS-Unit-4-Sprint-2-NLP/blob/master/Copy_of_LS_DS_423_Document_Classification_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to "Hyperparameter Tune" your model by using different n_gram ranges, max_results, and data cleaning methods
- Try and get the highest accuracy possible!

# Data Processing

In [4]:
!pip install -U nltk

Collecting nltk
[?25l  Downloading https://files.pythonhosted.org/packages/73/56/90178929712ce427ebad179f8dc46c8deef4e89d4c853092bee1efd57d05/nltk-3.4.1.zip (3.1MB)
[K     |████████████████████████████████| 3.1MB 5.0MB/s 
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... [?25l[?25hdone
  Stored in directory: /root/.cache/pip/wheels/97/8a/10/d646015f33c525688e91986c4544c68019b19a473cb33d3b55
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.4.1


In [1]:
import pandas as pd
import numpy as np
import random
import string
import re

#models and model validation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from xgboost.sklearn import XGBClassifier


#vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [26]:
df = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv')
df.head()

Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [27]:
df['job'].value_counts(normalize=True)

Data Scientist    0.5
Data Analyst      0.5
Name: job, dtype: float64

In [28]:
#let's predict if a job description is for a data scientist or analyst
#drop title for being redundant
df = df.drop(columns=['title'])
#let's say we only want to apply to scientist jobs.
#scientist = 1 analyst = 0
df['job'] = df['job'].map({'Data Analyst': 0, 'Data Scientist':1})
df['description'] = df['description'].str.strip("b'")
df.head()

Unnamed: 0,description,job
0,"""<div><div>Job Requirements:</div><ul><li><p>\...",1
1,<div>Job Description<br/>\n<br/>\n<p>As a Data...,1
2,<div><p>As a Data Scientist you will be workin...,1
3,"<div class=""jobsearch-JobMetadataHeader icl-u-...",1
4,<ul><li>Location: USA \xe2\x80\x93 multiple lo...,1


In [0]:
#from Derek's notebook, string conversion
text =[]
for i in df['description']:
  s= str(i)
  text.append(s)
  

df['d']=text

In [30]:
df['description'].values.shape

(500,)

In [31]:
dlist = df['d'].tolist()
len(dlist)

500

In [0]:
dlist_nohtml = []

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  cleantext = cleantext.replace(r"\n", " ")
  return cleantext

for desc in dlist:
    desc = cleanhtml(desc)
    dlist_nohtml.append(desc)

In [33]:
cleaned_description = []
from string import punctuation
table = str.maketrans('', '', punctuation)
stop_words = stopwords.words('english')

for description in dlist_nohtml:
    #sep by word
    words = word_tokenize(description)

    #lowercase
    words = [word.lower() for word in words]

    #remove non alphanumeric characters
    words = [word.translate(table) for word in words]
    words = [word for word in words if word.isalpha()]

    #remove stopwords
    words = [word for word in words if not word in stop_words]
    
    #lemmatize (remove prefixes and suffixes)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    
    #make listing
    cleaned_description.append(words)

job_listings = cleaned_description
len(job_listings)

500

In [24]:
job_listings[-1]

['location',
 'el',
 'segundo',
 'california',
 'united',
 'state',
 'job',
 'summary',
 'amp',
 'seeking',
 'data',
 'analyst',
 'partner',
 'entertainment',
 'operation',
 'product',
 'engineering',
 'team',
 'build',
 'better',
 'data',
 'capability',
 'dig',
 'analysis',
 'around',
 'usage',
 'performance',
 'linear',
 'streaming',
 'video',
 'experience',
 'role',
 'objective',
 'drive',
 'understanding',
 'insight',
 'analyzing',
 'current',
 'product',
 'user',
 'experience',
 'work',
 'closely',
 'leadership',
 'team',
 'understand',
 'analytic',
 'need',
 'communicate',
 'insight',
 'broader',
 'organization',
 'responsibility',
 'solve',
 'challenging',
 'problem',
 'regarding',
 'video',
 'platform',
 'transforming',
 'raw',
 'data',
 'action',
 'driven',
 'analysis',
 'report',
 'insight',
 'building',
 'rich',
 'highvisibility',
 'reportsdashboardstools',
 'used',
 'direct',
 'stakeholder',
 'senior',
 'management',
 'diverse',
 'set',
 'team',
 'across',
 'organization',


In [34]:
df['job_listings'] = job_listings
df = df.drop(columns=['description', 'd'])
df.head()

Unnamed: 0,job,job_listings
0,1,"[job, requirement, conceptual, understanding, ..."
1,1,"[job, description, data, scientist, help, u, b..."
2,1,"[data, scientist, working, consulting, side, b..."
3,1,"[monthcontractunder, general, supervision, pro..."
4,1,"[location, usa, multiple, location, year, anal..."


In [36]:
listings = []
for listing in df['job_listings']:
  listing = " ".join(listing)
  listings.append(listing)
  
df['job_listings'] = listings
df.head()

Unnamed: 0,job,job_listings
0,1,job requirement conceptual understanding machi...
1,1,job description data scientist help u build ma...
2,1,data scientist working consulting side busines...
3,1,monthcontractunder general supervision profess...
4,1,location usa multiple location year analytics ...


In [0]:
X = df['job_listings']
y = df['job']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Count Vectorizer Tests

In [62]:
vectorizer = CountVectorizer(max_features=2000, ngram_range=(1,2),
                             min_df = 5, max_df = .80,
                             stop_words = 'english')

vectorizer.fit(X_train)
train_word_counts = vectorizer.transform(X_train)
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())
                             
model = XGBClassifier(max_depth = 2,
                      learning_rate = 0.1,
                      verbosity = 0,
                      n_jobs = -1,
                      random_state = 0)

model.fit(X_train_vectorized, y_train)
xg_scores = cross_val_score(model, X_train_vectorized, y_train, cv=5, scoring='roc_auc')
mean_xgscore = sum(xgscores) / len(xgscores)
mean_xgscore

0.9840060975609756

In [63]:
vectorizer.fit(X_test)
test_word_counts = vectorizer.transform(X_test)
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

model.fit(X_test_vectorized, y_test)
y_score = model.predict_proba(X_test_vectorized)[:, 1]
test_xgscore = roc_auc_score(y_test, y_score)
test_xgscore

1.0

In [64]:
row1 = [['Count', 'xgboost', mean_xgscore, test_xgscore]]
model_comp = pd.DataFrame(row1, columns = ['vectorizer', 'model', 'avg cross_val', 'test roc_auc'])

model_comp

Unnamed: 0,vectorizer,model,avg cross_val,test roc_auc
0,Count,xgboost,0.984006,1.0


In [66]:
bayes = MultinomialNB()

bayes.fit(X_train_vectorized, y_train)
bscores = cross_val_score(bayes, X_train_vectorized, y_train, cv=5, scoring='roc_auc')
mean_bscore = sum(bscores) / len(bscores)

mean_bscore

0.9505374843652282

In [67]:
bayes.fit(X_test_vectorized, y_test)
y_score = bayes.predict_proba(X_test_vectorized)[:, 1]
test_bscore = roc_auc_score(y_test, y_score)

test_bscore

0.9987995198079231

In [69]:
row2 = ['Count', 'Naive Bayes', mean_bscore, test_bscore]
model_comp.loc[1] = row2

model_comp

Unnamed: 0,vectorizer,model,avg cross_val,test roc_auc
0,Count,xgboost,0.984006,1.0
1,Count,Naive Bayes,0.950537,0.9988


# tfidf Vectorizer Tests

In [70]:
vectorizer = TfidfVectorizer(max_features=2000, ngram_range=(1,2),
                             min_df = 5, max_df = .80,
                             stop_words = 'english')
vectorizer.fit(X_train)
train_word_counts = vectorizer.transform(X_train)
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

model.fit(X_train_vectorized, y_train)

xgscores = cross_val_score(model, X_train_vectorized, y_train, cv=5, scoring='roc_auc')
mean_xgscore = sum(xgscores) / len(xgscores)
mean_xgscore

0.9840060975609756

In [71]:
vectorizer.fit(X_test)
test_word_counts = vectorizer.transform(X_test)
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

model.fit(X_test_vectorized, y_test)
y_score = model.predict_proba(X_test_vectorized)[:, 1]
test_xgscore = roc_auc_score(y_test, y_score)
test_xgscore

1.0

In [72]:
row3 = ['Tfidf', 'xgboost', mean_xgscore, 1]
model_comp.loc[2] = row3
model_comp

Unnamed: 0,vectorizer,model,avg cross_val,test roc_auc
0,Count,xgboost,0.984006,1.0
1,Count,Naive Bayes,0.950537,0.9988
2,Tfidf,xgboost,0.984006,1.0


In [73]:
bayes.fit(X_train_vectorized, y_train)
bscores = cross_val_score(bayes, X_train_vectorized, y_train, cv=5, scoring='roc_auc')
mean_bscore = sum(bscores) / len(bscores)
mean_bscore

0.9555270481550968

In [74]:
bayes.fit(X_test_vectorized, y_test)
y_score = bayes.predict_proba(X_test_vectorized)[:, 1]
test_bscore = roc_auc_score(y_test, y_score)

test_bscore

0.9991996798719487

In [0]:
row4 = ['Tfidf', 'Naive Bayes', mean_bscore, test_bscore]
model_comp.loc[3] = row4

In [76]:
model_comp

Unnamed: 0,vectorizer,model,avg cross_val,test roc_auc
0,Count,xgboost,0.984006,1.0
1,Count,Naive Bayes,0.950537,0.9988
2,Tfidf,xgboost,0.984006,1.0
3,Tfidf,Naive Bayes,0.955527,0.9992


# Stretch Goals

- Try some agglomerative clustering using cosine-similarity-distance! (works better with high dimensional spaces) robust clustering - Agglomerative clustering like Ward would be cool. Try and create an awesome Dendrogram of the most important terms from the dataset.

- Awesome resource for clustering stretch goals: 
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
