# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to tune your model's hyperparameters by using different `ngram_range` values, `max_features` settings, and data cleaning methods
- Try to get the highest accuracy possible!
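
One possible way to organize the comparison asked for above is to loop over every vectorizer and classifier pairing and score each with cross-validation. The sketch below is only an illustration: the toy corpus, the labels, and the `results` dictionary are made up for the example, and the real job descriptions would take the place of `texts`.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned job descriptions
texts = [
    "build machine learning models in python",
    "deep learning and statistical modeling research",
    "train predictive models and deploy them",
    "machine learning experiments with neural networks",
    "create dashboards and reports in excel",
    "write sql queries for business reporting",
    "maintain dashboards and track business metrics",
    "excel reporting and sql data pulls",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Data Scientist, 0 = Data Analyst

# Factories, so that every pairing gets fresh, unfitted objects
vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "nb": MultinomialNB,
}

results = {}
for (v_name, make_vec), (c_name, make_clf) in product(
    vectorizers.items(), classifiers.items()
):
    pipe = make_pipeline(make_vec(stop_words="english"), make_clf())
    # Mean accuracy over 2 stratified folds (the toy corpus is tiny)
    scores = cross_val_score(pipe, texts, labels, cv=2, scoring="accuracy")
    results[(v_name, c_name)] = scores.mean()

for pairing, score in results.items():
    print(pairing, round(score, 3))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so no vocabulary from the held-out fold leaks into training.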

In [58]:
import pandas as pd
import numpy as np
import re
import requests
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

from bs4 import BeautifulSoup

from collections import defaultdict
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

In [35]:
url = 'job_listings.csv'
jobs = pd.read_csv(url)
print(jobs.shape)
jobs.head()

(500, 3)


Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [36]:
jobs.isnull().sum()

description    1
title          1
job            0
dtype: int64

In [37]:
# Dropping null values
jobs = jobs.dropna(axis=0)
jobs.isnull().sum()

description    0
title          0
job            0
dtype: int64

In [38]:
def clean_html(text):
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

summary = []
for job in jobs['description']:
    # Strip the leading b' (or b") and the trailing quote left over from the bytes repr
    job = str(job)[2:-1]
    # Clean out HTML
    job = clean_html(job)
    # Remove line breaks
    job = job.replace('\\n',' ')
    # Translate unicode characters to ASCII
    # job = unidecode(job)
    summary.append(job)
    
jobs['description'] = summary

# Create a numerical label column
jobs['assigned_value'] = jobs.job.map({'Data Analyst': 0, 'Data Scientist': 1})
jobs.head()

Unnamed: 0,description,title,job,assigned_value
0,Job Requirements: Conceptual understanding in ...,Data scientist,Data Scientist,1
1,"Job Description As a Data Scientist 1, you wi...",Data Scientist I,Data Scientist,1
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level,Data Scientist,1
3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist,Data Scientist,1
4,Location: USA \xe2\x80\x93 multiple locations ...,Data Scientist,Data Scientist,1
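
The cleaned text above still shows escape sequences such as `\xe2\x80\x93`. Each description appears to be the printed form of a Python bytes object, so slicing off the quotes leaves the escapes as literal characters. A minimal sketch of one alternative, assuming the values really are valid bytes literals, is to parse them with `ast.literal_eval` and decode as UTF-8. The helper name `decode_bytes_literal` and the sample string are hypothetical.

```python
import ast


def decode_bytes_literal(raw):
    """Parse text that looks like a bytes literal and decode it as UTF-8."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # not a valid Python literal, so leave it untouched
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw


# Sample modeled on the fifth row of the table above
sample = "b'Location: USA \\xe2\\x80\\x93 multiple locations'"
print(decode_bytes_literal(sample))
```

If this were applied before `clean_html`, escaped line breaks would already be real newline characters, so the `'\\n'` replacement would need to target `'\n'` instead.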


In [39]:
# Assigning X and y values
X = jobs.description
y = jobs.assigned_value

# Splitting data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Confirming correct shapes of data
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(399,)
(100,)
(399,)
(100,)


In [41]:
# Count Vectorization on training and testing data
vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train)
train_word_counts = vectorizer.transform(X_train)
test_word_counts = vectorizer.transform(X_test)

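# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires 1.0 or later
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())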

# print(X_train_vectorized)
# print(X_test_vectorized)

In [65]:
# Logistic Regression model
logistic_reg = LogisticRegression(random_state=42, solver="lbfgs").fit(X_train_vectorized, y_train)

train_predictions = logistic_reg.predict(X_train_vectorized)
test_predictions = logistic_reg.predict(X_test_vectorized)

print('REPORTS: Logistic Regression')
print('---------------------')
print(f'Train Accuracy: {round(accuracy_score(y_train, train_predictions),4)}')
print(f'Test Accuracy: {round(accuracy_score(y_test, test_predictions),4)}')
print('---------------------')
print(f'Train Roc Auc: {round(roc_auc_score(y_train, train_predictions),4)}')
print(f'Test Roc Auc: {round(roc_auc_score(y_test, test_predictions),4)}')

REPORTS: Logistic Regression
---------------------
Train Accuracy: 0.9975
Test Accuracy: 0.89
---------------------
Train Roc Auc: 0.9975
Test Roc Auc: 0.8906
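
The ROC AUC figures above are computed from hard 0/1 predictions. For a binary problem that reduces the ROC curve to a single operating point, and the resulting value equals balanced accuracy. The sketch below shows the difference when predicted probabilities are scored instead. It uses synthetic data from `make_classification` as a stand-in for the vectorized listings, so the numbers it prints say nothing about the job data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption) standing in for the vectorized job listings
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)               # 0/1 class predictions
pos_scores = model.predict_proba(X_te)[:, 1]    # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, pos_scores)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print("AUC from hard labels:  ", round(auc_from_labels, 4))
print("Balanced accuracy:     ", round(balanced_acc, 4))
print("AUC from probabilities:", round(auc_from_scores, 4))
```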




In [64]:
# Random Forest Classifier
# No random_state is set, so the scores can vary slightly between runs
rand_for = RandomForestClassifier(n_estimators=200).fit(X_train_vectorized, y_train)

train_predictions = rand_for.predict(X_train_vectorized)
test_predictions = rand_for.predict(X_test_vectorized)

print('REPORTS: Random Forest Classifier')
print('---------------------')
print(f'Train Accuracy: {round(accuracy_score(y_train, train_predictions),4)}')
print(f'Test Accuracy: {round(accuracy_score(y_test, test_predictions),4)}')
print('---------------------')
print(f'Train Roc Auc: {round(roc_auc_score(y_train, train_predictions),4)}')
print(f'Test Roc Auc: {round(roc_auc_score(y_test, test_predictions),4)}')

REPORTS: Random Forest Classifier
---------------------
Train Accuracy: 0.9975
Test Accuracy: 0.93
---------------------
Train Roc Auc: 0.9975
Test Roc Auc: 0.9306


In [63]:
# Pipeline Created, Tfidf Vectorizer, Multinomial NB, GridSearch CV
stop = stopwords.words('english')
nb_tfidf = make_pipeline(TfidfVectorizer(stop_words=stop),
                        MultinomialNB())
nb_grid_params = [{'tfidfvectorizer__ngram_range' : [(1,1), (1,2), (1,3)],
                  'tfidfvectorizer__max_features' : [50, 100, None]}]

nb_grid = GridSearchCV(nb_tfidf, nb_grid_params, cv=4)
nb_grid.fit(X_train, y_train)

print('REPORTS: Pipeline')
print('---------------------')
print('Best Parameters:', nb_grid.best_params_)
print('---------------------')
print('CV Score', nb_grid.best_score_)
print('Test Score', nb_grid.score(X_test, y_test))

REPORTS: Pipeline
---------------------
Best Parameters: {'tfidfvectorizer__max_features': 100, 'tfidfvectorizer__ngram_range': (1, 2)}
---------------------
CV Score 0.9122807017543859
Test Score 0.92
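
`best_params_` reports only the winning combination. To see how close the other settings came, the full `cv_results_` table can be loaded into a DataFrame. This is a rough sketch on a hypothetical toy corpus, with a smaller grid than the one above so that it runs quickly. The `param_...` column names follow from the step name that `make_pipeline` generates.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the job descriptions
texts = [
    "build machine learning models in python",
    "deep learning and statistical modeling research",
    "train predictive models and deploy them",
    "machine learning experiments with neural networks",
    "create dashboards and reports in excel",
    "write sql queries for business reporting",
    "maintain dashboards and track business metrics",
    "excel reporting and sql data pulls",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Data Scientist, 0 = Data Analyst

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "tfidfvectorizer__max_features": [5, None],
}
grid = GridSearchCV(pipe, param_grid, cv=2).fit(texts, labels)

# One row per parameter combination, best-ranked first
columns = [
    "param_tfidfvectorizer__ngram_range",
    "param_tfidfvectorizer__max_features",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = pd.DataFrame(grid.cv_results_)[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

Comparing `mean_test_score` against `std_test_score` gives a sense of whether the top-ranked setting is meaningfully better or within fold-to-fold noise.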


# Stretch Goals

- Try agglomerative clustering with cosine distance, which often works better than Euclidean distance in high-dimensional spaces. A more robust variant such as Ward linkage would also be worth trying. Then create a dendrogram of the most important terms in the dataset.

- Helpful resources for the clustering stretch goals:
  - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
  - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>

- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset:
  - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
  - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
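
As a starting point for the agglomerative clustering stretch goal, the sketch below clusters the highest-weighted TF-IDF terms of a hypothetical toy corpus with SciPy, using cosine distance and average linkage. Average linkage is a choice made for this example: SciPy's Ward method is defined for Euclidean distances, so it is not combined with the cosine metric here. The corpus, the `n_terms` cutoff, and the two-cluster cut are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the job descriptions
texts = [
    "build machine learning models in python",
    "deep learning and statistical modeling research",
    "train predictive models and deploy them",
    "machine learning experiments with neural networks",
    "create dashboards and reports in excel",
    "write sql queries for business reporting",
    "maintain dashboards and track business metrics",
    "excel reporting and sql data pulls",
]

vectorizer = TfidfVectorizer(stop_words="english")
weights = vectorizer.fit_transform(texts).toarray()
# Matrix columns follow the fitted vocabulary indices
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Keep the terms with the largest total TF-IDF weight
n_terms = 8
top_idx = np.argsort(weights.sum(axis=0))[::-1][:n_terms]
top_terms = [terms[i] for i in top_idx]
term_vectors = weights[:, top_idx].T  # one row per term, one column per document

# Hierarchical clustering of the term vectors on cosine distance
Z = linkage(term_vectors, method="average", metric="cosine")
cluster_ids = fcluster(Z, t=2, criterion="maxclust")  # cut into at most 2 clusters

# no_plot=True returns the tree layout only; drop it to draw with matplotlib
tree = dendrogram(Z, no_plot=True, labels=top_terms)
print(dict(zip(top_terms, cluster_ids)))
print(tree["ivl"])  # leaf labels in dendrogram order
```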
