# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to tune your model's hyperparameters by using different ngram_range values, max_features settings, and data cleaning methods
- Try to get the highest accuracy possible!

In [15]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## CountVectorizer work

In [2]:
# Load data

url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv"

countvec_df = pd.read_csv(url, encoding="ISO-8859-1")
print(countvec_df.shape)
countvec_df.head()

(500, 3)


Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientistÂ,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [3]:
# Drop rows with missing values before cleaning the HTML in the description column

countvec_df = countvec_df.dropna()  # dropped 1 description, 1 title
countvec_df.isna().sum()

description    0
title          0
job            0
dtype: int64

In [4]:
# Clean HTML out of description values
from bs4 import BeautifulSoup


def clean_html_with_bs4(string):
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(string, "html.parser")
    return soup.get_text()


countvec_df.description = countvec_df.description.apply(clean_html_with_bs4)
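
One thing to watch for: `get_text()` with no arguments joins adjacent text nodes directly, which is likely why run-together strings such as `monthContractUnder` appear in the cleaned descriptions. A minimal sketch, assuming a single space between fragments is acceptable for this data (the helper name `clean_html_with_spaces` and the sample string are hypothetical):

```python
from bs4 import BeautifulSoup


def clean_html_with_spaces(html):
    """Strip tags and join the remaining text fragments with single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " puts a space between text nodes; strip=True trims each one
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample, shaped like the scraped listings
sample_html = "<div><p>$4,969 a month</p><p>Contract</p><ul><li>SQL</li></ul></div>"

print(BeautifulSoup(sample_html, "html.parser").get_text())  # fragments run together
print(clean_html_with_spaces(sample_html))                   # fragments separated
```

Whether this changes the final accuracy is not tested here; it mainly keeps tokens from different HTML elements from merging into one.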

In [5]:
# Clean out literal "\n" sequences (a backslash followed by "n") from description values

for description in countvec_df.description[:3]:
    print('\\n' in description)


def remove_newlines(s):
    # A single replace call handles every occurrence, so no recursion is needed
    return s.replace('\\n', ' ')


countvec_df.description = countvec_df.description.apply(remove_newlines)

True
True
True


In [6]:
# Check countvec_df descriptions

countvec_df.head()

Unnamed: 0,description,title,job
0,"b""Job Requirements: Conceptual understanding i...",Data scientistÂ,Data Scientist
1,"b'Job Description As a Data Scientist 1, you ...",Data Scientist I,Data Scientist
2,b'As a Data Scientist you will be working on c...,Data Scientist - Entry Level,Data Scientist
3,"b'$4,969 - $6,756 a monthContractUnder the gen...",Data Scientist,Data Scientist
4,b'Location: USA \xe2\x80\x93 multiple location...,Data Scientist,Data Scientist


In [7]:
# Widen description column

pd.set_option('display.max_colwidth', 200)
countvec_df.head()

Unnamed: 0,description,title,job
0,"b""Job Requirements: Conceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN a...",Data scientistÂ,Data Scientist
1,"b'Job Description As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare journey. You will do so b...",Data Scientist I,Data Scientist
2,"b'As a Data Scientist you will be working on consulting side of our business. You will be responsible for analyzing large, complex datasets and identify meaningful patterns that lead to actionable...",Data Scientist - Entry Level,Data Scientist
3,"b'$4,969 - $6,756 a monthContractUnder the general supervision of Professors Dana Mukamel and Kai Zheng, the incumbent will join the CalMHSA Mental Health Tech Suite Innovation (INN) Evaluation Te...",Data Scientist,Data Scientist
4,"b'Location: USA \xe2\x80\x93 multiple locations 2+ years of Analytics experience Understand business requirements and technical requirements Can handle data extraction, preparation and transformat...",Data Scientist,Data Scientist
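
The leading `b'` and escapes such as `\xe2\x80\x93` suggest that each description was stored as the text of a Python bytes literal. A minimal sketch, assuming that is the case and that the underlying bytes are UTF-8 (the helper name `decode_bytes_literal` is hypothetical), turns such a string back into readable text:

```python
import ast


def decode_bytes_literal(text):
    """Decode a string that holds the repr of a bytes object."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: leave the string untouched
        return text
    if isinstance(value, bytes):
        # Assumption: the original bytes were UTF-8 encoded
        return value.decode("utf-8", errors="replace")
    return text


# Hypothetical sample in the same shape as the description column
raw_description = "b'Location: USA \\xe2\\x80\\x93 multiple locations\\n2+ years'"
print(decode_bytes_literal(raw_description))
```

If the assumption holds, this could be applied with `countvec_df.description.apply(decode_bytes_literal)` before the HTML cleaning step, which would also turn the literal `\n` sequences into real newlines.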


In [8]:
# Look at job column values

countvec_df.job.value_counts()

Data Scientist    250
Data Analyst      249
Name: job, dtype: int64

In [9]:
# Categorical encoding of job values

countvec_df['label_num'] = countvec_df.job.map({'Data Scientist': 0, 'Data Analyst': 1})
countvec_df.head()

Unnamed: 0,description,title,job,label_num
0,"b""Job Requirements: Conceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN a...",Data scientistÂ,Data Scientist,0
1,"b'Job Description As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare journey. You will do so b...",Data Scientist I,Data Scientist,0
2,"b'As a Data Scientist you will be working on consulting side of our business. You will be responsible for analyzing large, complex datasets and identify meaningful patterns that lead to actionable...",Data Scientist - Entry Level,Data Scientist,0
3,"b'$4,969 - $6,756 a monthContractUnder the general supervision of Professors Dana Mukamel and Kai Zheng, the incumbent will join the CalMHSA Mental Health Tech Suite Innovation (INN) Evaluation Te...",Data Scientist,Data Scientist,0
4,"b'Location: USA \xe2\x80\x93 multiple locations 2+ years of Analytics experience Understand business requirements and technical requirements Can handle data extraction, preparation and transformat...",Data Scientist,Data Scientist,0


In [10]:
# Train-test split model validation

X = countvec_df.description
y = countvec_df.label_num

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# Check shape of 4 pandas Series objects

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((399,), (100,), (399,), (100,))

In [12]:
# Fit vectorizer on unigrams and bigrams, and look at the resulting vocabulary

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,2), stop_words='english')
vectorizer.fit(X_train)
print(vectorizer.vocabulary_)

{'temporary': 66022, 'internshipdescription': 34796, 'data': 16077, 'analytics': 4113, 'innovations': 33693, 'group': 29330, 'docomo': 20405, 'seeking': 58430, 'ph': 47316, 'students': 63227, 'computer': 13170, 'science': 57663, 'engineering': 22740, 'related': 54062, '12': 170, 'week': 71074, 'summer': 63667, 'internship': 34735, 'aws': 6627, 'operation': 44535, 'cost': 14667, 'analysis': 3457, 'required': 54996, 'skills': 59910, 'experiences': 25245, 'education': 21472, 'candidates': 9229, 'position': 48337, 'solid': 60523, 'background': 6754, 'machine': 38825, 'learning': 37176, 'optimization': 44912, 'theory': 66429, 'experience': 24704, 'algorithm': 3004, 'design': 18475, 'developing': 19025, 'supervised': 63709, 'unsupervised': 68828, 'models': 41995, 'depth': 18359, 'concepts': 13248, 'statistics': 62425, 'linux': 38108, 'ssh': 61781, 'github': 28685, 'sql': 61629, 'highly': 30594, 'desirable': 18664, 'candidate': 9167, 'attributes': 6209, 'strongly': 63142, 'self': 58579, 'moti
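
The dictionary above maps each term to its column index in the count matrix. To see what `ngram_range=(1, 2)` and `stop_words='english'` do together, here is a small sketch on a hypothetical two-document corpus. Stop words are removed before bigrams are formed, so words on either side of a removed stop word can end up in the same bigram.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, only for illustration
ngram_docs = ["machine learning models", "data analysis and statistics"]

ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
ngram_vectorizer.fit(ngram_docs)

# Sorted list of the unigrams and bigrams that were kept
ngram_terms = sorted(ngram_vectorizer.vocabulary_)
print(ngram_terms)
```

In this toy case "and" is dropped, and "analysis statistics" appears as a bigram even though the two words were not adjacent in the original text.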

In [13]:
# Complete vectorization by transforming X_train and X_test

train_word_counts = vectorizer.transform(X_train)
type(train_word_counts)

# X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())
# print(X_train_vectorized.shape)
# X_train_vectorized.head()

scipy.sparse.csr.csr_matrix

In [14]:
test_word_counts = vectorizer.transform(X_test)
type(test_word_counts)

scipy.sparse.csr.csr_matrix

In [19]:
# Fit logistic regression model and get accuracy score

log_reg = LogisticRegression(solver='lbfgs', random_state=42).fit(train_word_counts, y_train)

train_predictions = log_reg.predict(train_word_counts)
test_predictions = log_reg.predict(test_word_counts)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.91


In [43]:
# Fit random forest classifier and get accuracy score (no random_state is set, so scores can vary between runs)

RFC = RandomForestClassifier(n_estimators=100).fit(train_word_counts, y_train)

train_predictions = RFC.predict(train_word_counts)
test_predictions = RFC.predict(test_word_counts)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.91
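
A single 80/20 split gives one accuracy number per model, which makes close results hard to rank. One possible alternative is k-fold cross-validation with the vectorizer and classifier bundled in a pipeline. The sketch below uses a hypothetical toy corpus (`cv_X`, `cv_y`) in place of the job descriptions so that it runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the job descriptions
scientist_docs = [f"build machine learning models in python project {i}" for i in range(10)]
analyst_docs = [f"create excel dashboards and sql reports for team {i}" for i in range(10)]
cv_X = scientist_docs + analyst_docs
cv_y = [0] * 10 + [1] * 10

# The pipeline refits the vectorizer inside each fold, so the held-out
# fold never influences the vocabulary
cv_pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(solver="lbfgs", random_state=42),
)

cv_scores = cross_val_score(cv_pipeline, cv_X, cv_y, cv=5, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

On the real data, `cv_X` and `cv_y` would be replaced by `X` and `y`, and the mean and spread across folds would give a steadier basis for comparing models than a single test score.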


## TfidfVectorizer work

In [28]:
# Reuse a copy of countvec_df so the preprocessing above does not have to be repeated

tfidfvec_df = countvec_df.copy()

In [29]:
tfidfvec_df.head()

Unnamed: 0,description,title,job,label_num
0,"b""Job Requirements: Conceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN a...",Data scientistÂ,Data Scientist,0
1,"b'Job Description As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare journey. You will do so b...",Data Scientist I,Data Scientist,0
2,"b'As a Data Scientist you will be working on consulting side of our business. You will be responsible for analyzing large, complex datasets and identify meaningful patterns that lead to actionable...",Data Scientist - Entry Level,Data Scientist,0
3,"b'$4,969 - $6,756 a monthContractUnder the general supervision of Professors Dana Mukamel and Kai Zheng, the incumbent will join the CalMHSA Mental Health Tech Suite Innovation (INN) Evaluation Te...",Data Scientist,Data Scientist,0
4,"b'Location: USA \xe2\x80\x93 multiple locations 2+ years of Analytics experience Understand business requirements and technical requirements Can handle data extraction, preparation and transformat...",Data Scientist,Data Scientist,0


In [30]:
# Train-test split model validation

X1 = tfidfvec_df.description
y1 = tfidfvec_df.label_num

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=42)

In [32]:
# Check shape of 4 pandas Series objects

X1_train.shape, X1_test.shape, y1_train.shape, y1_test.shape

((399,), (100,), (399,), (100,))

In [33]:
# Fit TfidfVectorizer on unigrams and bigrams, and look at the resulting vocabulary

vectorizer1 = TfidfVectorizer(max_features=None, ngram_range=(1, 2), stop_words='english')
vectorizer1.fit(X1_train)
print(vectorizer1.vocabulary_)

{'temporary': 66022, 'internshipdescription': 34796, 'data': 16077, 'analytics': 4113, 'innovations': 33693, 'group': 29330, 'docomo': 20405, 'seeking': 58430, 'ph': 47316, 'students': 63227, 'computer': 13170, 'science': 57663, 'engineering': 22740, 'related': 54062, '12': 170, 'week': 71074, 'summer': 63667, 'internship': 34735, 'aws': 6627, 'operation': 44535, 'cost': 14667, 'analysis': 3457, 'required': 54996, 'skills': 59910, 'experiences': 25245, 'education': 21472, 'candidates': 9229, 'position': 48337, 'solid': 60523, 'background': 6754, 'machine': 38825, 'learning': 37176, 'optimization': 44912, 'theory': 66429, 'experience': 24704, 'algorithm': 3004, 'design': 18475, 'developing': 19025, 'supervised': 63709, 'unsupervised': 68828, 'models': 41995, 'depth': 18359, 'concepts': 13248, 'statistics': 62425, 'linux': 38108, 'ssh': 61781, 'github': 28685, 'sql': 61629, 'highly': 30594, 'desirable': 18664, 'candidate': 9167, 'attributes': 6209, 'strongly': 63142, 'self': 58579, 'moti

In [34]:
# Complete vectorization by transforming X1_train and X1_test (the values are TF-IDF weights, despite the variable names)

train1_word_counts = vectorizer1.transform(X1_train)
test1_word_counts = vectorizer1.transform(X1_test)
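
The vocabulary is the same as for CountVectorizer; what changes is the value stored for each term. A small sketch on a hypothetical three-document corpus shows the difference: a word that appears in every document gets a lower TF-IDF weight than a rarer word, even though both have a raw count of 1.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus: "data" is in every document, the other words are rarer
weight_docs = ["data python", "data excel", "data sql"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
count_row = count_vec.fit_transform(weight_docs).toarray()[0]
tfidf_row = tfidf_vec.fit_transform(weight_docs).toarray()[0]

# Look up the columns for the two words in the first document
data_col = tfidf_vec.vocabulary_["data"]
python_col = tfidf_vec.vocabulary_["python"]

print("counts:", count_row[data_col], count_row[python_col])
print("tf-idf:", tfidf_row[data_col], tfidf_row[python_col])
```

With the default settings each TF-IDF row is also scaled to unit length, so long and short descriptions end up on a comparable scale.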

In [35]:
# Fit logistic regression model and get accuracy score

log_reg1 = LogisticRegression(solver='lbfgs', random_state=42).fit(train1_word_counts, y1_train)

train1_predictions = log_reg1.predict(train1_word_counts)
test1_predictions = log_reg1.predict(test1_word_counts)

print(f'Train Accuracy: {accuracy_score(y1_train, train1_predictions)}')
print(f'Test Accuracy: {accuracy_score(y1_test, test1_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.89


In [47]:
# Fit random forest classifier and get accuracy score (no random_state is set, so scores can vary between runs)

RFC1 = RandomForestClassifier(n_estimators=100).fit(train1_word_counts, y1_train)

train1_predictions = RFC1.predict(train1_word_counts)
test1_predictions = RFC1.predict(test1_word_counts)

print(f'Train Accuracy: {accuracy_score(y1_train, train1_predictions)}')
print(f'Test Accuracy: {accuracy_score(y1_test, test1_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.94
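
The tuning requirement (different n-gram ranges and `max_features` values) was not explored above. One way it might be approached is a grid search over a pipeline. This is a sketch under stated assumptions: the toy corpus (`grid_X`, `grid_y`), the step names `vect` and `clf`, and the grid values are all hypothetical choices, not settings taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the job descriptions
grid_X = (
    [f"train machine learning models with python batch {i}" for i in range(10)]
    + [f"prepare excel dashboards and sql reports week {i}" for i in range(10)]
)
grid_y = [0] * 10 + [1] * 10

search_pipeline = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(solver="lbfgs", random_state=42)),
])

# A step name plus a double underscore addresses that step's parameters
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 10],
}

search = GridSearchCV(search_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(grid_X, grid_y)
print(search.best_params_, search.best_score_)
```

On the real data the grid would be fit on `X1_train` and `y1_train` only, leaving the test split untouched for a final check.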


## Compare individual vectorization methods and intra-method model accuracy

How long the job descriptions are on average, and how similar they are to each other, should determine whether TF-IDF or count vectorizing works better here.

On this dataset, count vectorizing appears less sensitive to the choice of classification model than TF-IDF vectorizing: logistic regression and random forest both reached a test accuracy of 0.91 on the count features. TF-IDF vectorizing delivered both the lowest and the highest of the four test accuracies: 0.89 with logistic regression and 0.94 with random forest. With only 100 test listings, each percentage point corresponds to a single listing, and the random forests were fit without a fixed random_state, so these gaps should be read as tentative.
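
To get a rough sense of how much a test accuracy measured on 100 listings can move by chance, a normal-approximation interval can be computed for each score. This is only a back-of-the-envelope sketch: the helper name `accuracy_interval` is hypothetical, and the approximation treats each test listing as an independent trial.

```python
import math


def accuracy_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n_samples."""
    std_err = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return accuracy - z * std_err, accuracy + z * std_err


# Test accuracies reported above, each measured on 100 held-out listings
reported = [
    ("Count + LogReg", 0.91),
    ("Count + RF", 0.91),
    ("TF-IDF + LogReg", 0.89),
    ("TF-IDF + RF", 0.94),
]
for name, acc in reported:
    low, high = accuracy_interval(acc, 100)
    print(f"{name}: {acc:.2f} (about {low:.3f} to {high:.3f})")
```

The intervals for all four results overlap, which is another reason to treat the ranking of the four combinations as provisional.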

# Stretch Goals

- Try some agglomerative clustering using cosine distance, which works better in high-dimensional spaces. Agglomerative clustering with a linkage such as Ward would also be worth trying. Try to create a dendrogram of the most important terms from the dataset.

- Resources for the clustering stretch goals:
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
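
As a starting point for the clustering stretch goal, here is a minimal sketch of agglomerative clustering on TF-IDF vectors with cosine distance, using SciPy. The four-document corpus is hypothetical, and average linkage is an assumption made for this example: SciPy's Ward linkage is only correctly defined for Euclidean distance, so it is not combined with the cosine metric here.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus: two "scientist"-like and two "analyst"-like documents
cluster_docs = [
    "machine learning models python",
    "machine learning python statistics",
    "excel dashboards reporting sql",
    "excel reporting sql dashboards tableau",
]

# linkage expects a dense array, so convert the sparse TF-IDF matrix
tfidf_matrix = TfidfVectorizer().fit_transform(cluster_docs).toarray()

# Average linkage on pairwise cosine distances
Z = linkage(tfidf_matrix, method="average", metric="cosine")

# Cut the tree into two flat clusters
cluster_labels = fcluster(Z, t=2, criterion="maxclust")
print(cluster_labels)
```

The linkage matrix `Z` can then be passed to `scipy.cluster.hierarchy.dendrogram` (which needs matplotlib for drawing) to plot the tree.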
