# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer to this data and compare the results
- Use at least two different classification models and compare their accuracy
- Tune your model's hyperparameters by trying different ngram_range and max_features values and different data cleaning methods
- Try to get the highest accuracy possible!
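
One way to cover these requirements in a single search is to put the vectorizer and the classifier in a scikit-learn `Pipeline` and let `GridSearchCV` swap the vectorizer and vary its settings. The sketch below is only an illustration: the tiny corpus, its labels, and the grid values are hypothetical stand-ins, not the job listings data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with the cleaned job descriptions.
docs = [
    "build machine learning models in python",
    "train neural networks and deploy models",
    "research statistical learning algorithms",
    "develop predictive models with python",
    "create dashboards and reports in excel",
    "write sql queries for business reports",
    "maintain dashboards for stakeholders",
    "analyze sales data with sql and excel",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Data Scientist, 0 = Data Analyst

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The "vect" entry swaps the whole vectorizer step; the other entries
# tune parameters that both vectorizers share.
param_grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 20],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the vectorizer is fit inside each cross-validation fold, the vocabulary is never learned from held-out text. A second classifier can be compared the same way by adding a `"clf"` entry to the grid.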

In [32]:
##### Your Code Here #####
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv')
df.head()

Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [33]:
df['job'].value_counts(normalize=True)

Data Scientist    0.5
Data Analyst      0.5
Name: job, dtype: float64

In [34]:
df['title'].value_counts(normalize=True)

Data Scientist                                                 0.168337
Data Analyst                                                   0.160321
Data Analyst Intern                                            0.026052
Senior Data Scientist                                          0.016032
Senior Data Analyst                                            0.016032
Data Scientist Intern                                          0.014028
Junior Data Analyst                                            0.014028
Junior Data Scientist                                          0.012024
Associate Data Scientist                                       0.012024
Sr. Data Scientist                                             0.010020
Data Scientist II                                              0.008016
Data Scientist, Product                                        0.008016
Data Scientist Internship                                      0.006012
Master Data Analyst                                            0

In [35]:
df.head()

Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [36]:
# Testing out BeautifulSoup for cleaning HTML tag elements from df['description']

from bs4 import BeautifulSoup

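# Caution: df.loc[:0, 'description'] is a label slice, so it returns a one-row
# Series rather than a string. str() then yields the Series' truncated repr,
# which is what gets printed below. df.loc[0, 'description'] selects the string.
html_str = df.loc[:0, 'description']

soup = BeautifulSoup(str(html_str))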

print(soup.get_text())

0    b"Job Requirements:...
Name: description, dtype: object


In [38]:
"""
from bs4 import BeautifulSoup

html_str = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str)

print(soup.get_text()) 
"""

from bs4 import BeautifulSoup

description_no_html = []

for row in df['description']:
    no_html = BeautifulSoup(str(row))
    description_no_html.append(no_html.get_text())
    
df['new_description'] = description_no_html
df.head()

Unnamed: 0,description,title,job,new_description
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist,"b""Job Requirements:\nConceptual understanding ..."
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist,"b'Job Description\n\nAs a Data Scientist 1, yo..."
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist,b'As a Data Scientist you will be working on c...
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist,"b'$4,969 - $6,756 a monthContractUnder the gen..."
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist,b'Location: USA \xe2\x80\x93 multiple location...


In [39]:
df.loc[0, 'new_description']

'b"Job Requirements:\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them\\nIntermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)\\nExposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R\\nAbility to communicate Model findings to both Technical and Non-Technical stake holders\\nHands on experience in SQL/Hive or similar programming language\\nMust show past work via GitHub, Kaggle or any other published article\\nMaster\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.\\nApply Now"'

In [40]:
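# Getting rid of the 'b' in front of every row in df.new_description
# Caution: as written, the pattern "b" matches every "b" in the text, not only
# the leading prefix (note "Jo Requirements" in the output below). An anchored
# pattern such as r"^b" would remove just the prefix.

df['new_description'] = df['new_description'].str.replace("b", '', regex=True)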
df.head()

Unnamed: 0,description,title,job,new_description
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist,"""Jo Requirements:\nConceptual understanding in..."
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist,"'Jo Description\n\nAs a Data Scientist 1, you ..."
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist,'As a Data Scientist you will e working on con...
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist,"'$4,969 - $6,756 a monthContractUnder the gene..."
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist,'Location: USA \xe2\x80\x93 multiple locations...
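
The descriptions appear to be Python bytes literals saved as text, which is why they start with `b'` and contain escapes such as `\xe2\x80\x93`. A minimal sketch of one way to recover the real text, assuming each value is a valid bytes literal encoded as UTF-8, is to parse it with `ast.literal_eval` and decode it. The helper name `decode_listing` and the sample string are hypothetical.

```python
import ast
import re


def decode_listing(raw: str) -> str:
    """Turn a stringified bytes literal such as "b'...'" into real text."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid literal: fall back to dropping only a leading b prefix
        # that sits directly in front of a quote character.
        return re.sub(r"^b(?=['\"])", "", raw)
    if isinstance(value, bytes):
        # Assumes UTF-8; undecodable bytes become replacement characters.
        return value.decode("utf-8", errors="replace")
    return str(value)


# Hypothetical sample in the same shape as the scraped descriptions.
sample = "b'Location: USA \\xe2\\x80\\x93 multiple locations\\nApply now'"
print(decode_listing(sample))
```

Applied before the HTML is stripped, for example with `df['description'].apply(decode_listing)`, this would also turn the literal `\n` sequences into real newlines, so tokens like `ndata` should not end up in the vocabulary.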


In [45]:
# Label Encoding our Target Variable

df['label_num'] = df.job.map({'Data Scientist': 1, 'Data Analyst': 0})
df.head()

Unnamed: 0,description,title,job,new_description,label_num
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist,"""Jo Requirements:\nConceptual understanding in...",1
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist,"'Jo Description\n\nAs a Data Scientist 1, you ...",1
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist,'As a Data Scientist you will e working on con...,1
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist,"'$4,969 - $6,756 a monthContractUnder the gene...",1
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist,'Location: USA \xe2\x80\x93 multiple locations...,1


In [46]:
# Train Test Split for Machine Learning

X = df.new_description
y = df.label_num

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [47]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(400,)
(100,)
(400,)
(100,)


### Using CountVectorizer to Create Bag-of-Words

In [49]:
vectorizer = CountVectorizer(max_features=None, ngram_range=(1, 1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)

{'company': 1519, 'overview': 7011, 'ndigital': 5559, 'assets': 813, 'data': 1985, 'leader': 4443, 'helping': 3625, 'sophisticated': 8687, 'institutional': 4053, 'investors': 4203, 'understand': 9540, 'cryptoasset': 1914, 'markets': 4746, 'unique': 9560, 'sets': 8477, 'insightful': 4031, 'analysis': 608, 'fast': 3047, 'growing': 3514, 'startup': 8841, '5m': 204, 'seed': 8404, 'funding': 3306, 'digital': 2242, 'rolled': 8202, 'software': 8660, 'crypto': 1912, 'funds': 3308, 'ndata': 5510, 'scientist': 8342, 'role': 8199, 'looking': 4619, 'join': 4284, 'analytics': 617, 'team': 9152, 'ased': 778, 'denver': 2114, 'xe2': 10107, 'x80': 10005, 'x99ll': 10044, 'high': 3637, 'performing': 7151, 'leading': 4446, 'disruption': 2335, 'investment': 4201, 'industry': 3940, 'motivated': 5034, 'passionate': 7091, 'aout': 673, 'designing': 2159, 'novel': 6206, 'trading': 9352, 'indicators': 3928, 'indices': 3929, 'analyses': 606, 'emerging': 2630, 'field': 3086, 'cryptoanalysis': 1913, 'producing': 75

In [51]:
train_word_counts = vectorizer.transform(X_train)

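# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())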

print(X_train_vectorized.shape)

X_train_vectorized.head()

(400, 10158)


Unnamed: 0,00,000,00011236,00805,00am,00pm,01,02115,03,0356,...,zero,zetahu,zeus,zheng,zillow,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

X_test_vectorized.head()

Unnamed: 0,00,000,00011236,00805,00am,00pm,01,02115,03,0356,...,zero,zetahu,zeus,zheng,zillow,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
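
Dense DataFrames like the ones above are handy for inspection, but scikit-learn estimators also accept the sparse matrix a vectorizer returns, which avoids materializing thousands of mostly zero columns. The sketch below uses a hypothetical toy corpus and variable names to show the idea.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the train/test descriptions.
train_docs = [
    "build machine learning models in python",
    "develop predictive models with python",
    "create dashboards and reports in excel",
    "write sql queries for business reports",
]
train_labels = [1, 1, 0, 0]
test_docs = ["deploy python models", "excel reports for sales"]

count_vect = CountVectorizer(stop_words="english")
train_counts = count_vect.fit_transform(train_docs)  # SciPy sparse matrix
test_counts = count_vect.transform(test_docs)        # same vocabulary, same columns

# The classifier is fit on the sparse matrix directly, with no .toarray() call.
clf = LogisticRegression(max_iter=1000).fit(train_counts, train_labels)

print(issparse(train_counts), train_counts.shape, test_counts.shape)
print(clf.predict(test_counts))
```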


### Logistic Regression

In [53]:
lr = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_preds = lr.predict(X_train_vectorized)
test_preds = lr.predict(X_test_vectorized)



In [55]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_preds)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_preds)}')

Train Accuracy: 0.9925
Test Accuracy: 0.91
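
Accuracy alone does not show which class the errors fall in. A confusion matrix breaks the predictions down by true and predicted label. The arrays below are hypothetical; in the notebook they would be `y_test` and `test_preds`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = Data Scientist, 0 = Data Analyst.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

# Rows are true labels, columns are predicted labels, both ordered [0, 1].
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]

print(accuracy_score(y_true, y_pred))
```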


### Random Forest

In [57]:
rf = RandomForestClassifier(n_estimators=100).fit(X_train_vectorized, y_train)

train_predictions = rf.predict(X_train_vectorized)
test_predictions = rf.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9925
Test Accuracy: 0.9


### Using TF-IDF To Create Bag-of-Words

In [58]:
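# from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1, 1), stop_words='english')

train_word_counts = vectorizer.fit_transform(X_train)
test_word_counts = vectorizer.transform(X_test)

# Caution: the lowercase name `x_train_vectorized` does not overwrite
# `X_train_vectorized`, which still holds the CountVectorizer counts from earlier.
# The models in the following cells are therefore fit on counts but scored on
# TF-IDF test features; assign to `X_train_vectorized` to compare like with like.
x_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())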

In [59]:
X_train_vectorized.head()

Unnamed: 0,00,000,00011236,00805,00am,00pm,01,02115,03,0356,...,zero,zetahu,zeus,zheng,zillow,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
X_test_vectorized.head()

Unnamed: 0,00,000,00011236,00805,00am,00pm,01,02115,03,0356,...,zero,zetahu,zeus,zheng,zillow,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [61]:
lr = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_preds = lr.predict(X_train_vectorized)
test_preds = lr.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_preds)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_preds)}')

Train Accuracy: 0.9925
Test Accuracy: 0.56




In [62]:
rf = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_preds = rf.predict(X_train_vectorized)
test_preds = rf.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_preds)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_preds)}')

Train Accuracy: 0.9875
Test Accuracy: 0.46




# Stretch Goals

- Try agglomerative clustering with cosine distance, which tends to work better than Euclidean distance in high-dimensional spaces. For more robust clustering, an agglomerative method such as Ward linkage is also worth trying (note that Ward linkage is defined for Euclidean distance). Then create a dendrogram of the most important terms in the dataset.

- Resources for the clustering stretch goal:
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
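
As a starting point for the clustering stretch goal, the sketch below runs SciPy's agglomerative clustering on TF-IDF vectors with cosine distance. It uses average linkage rather than Ward, because Ward linkage assumes Euclidean distance. The four documents are hypothetical stand-ins for the job descriptions.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with the cleaned job descriptions.
docs = [
    "python machine learning models",
    "deep learning models in python",
    "sql reports and excel dashboards",
    "excel dashboards for sales reports",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Average-linkage agglomerative clustering on pairwise cosine distances.
Z = linkage(X, method="average", metric="cosine")

# Cut the tree into at most two flat clusters.
cluster_ids = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the dendrogram layout without drawing it;
# drop the argument (with matplotlib installed) to render the plot.
tree = dendrogram(Z, no_plot=True, labels=[f"doc {i}" for i in range(len(docs))])

print(cluster_ids)
print(tree["ivl"])
```

Clustering the transposed matrix (`X.T`) would group terms instead of documents, which is closer to a dendrogram of terms. Any terms with all-zero columns would need to be dropped first, since cosine distance is undefined for zero vectors.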
