# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to "Hyperparameter Tune" your model by using different ngram_range values, max_features values, and data cleaning methods
- Try to get the highest accuracy possible!
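
One way to approach the tuning requirement is a grid search over a vectorizer-plus-classifier pipeline. The block below is only a minimal sketch under assumptions: the tiny `docs`/`labels` corpus is made up for illustration (swap in the cleaned job descriptions and a numeric label column), and the grid values are arbitrary starting points rather than recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with the job descriptions and labels.
docs = [
    "build machine learning models in python",
    "train neural networks and deploy models",
    "research statistical learning algorithms",
    "develop predictive models with python",
    "create dashboards and reports in excel",
    "write sql queries for business reports",
    "analyze sales data and build dashboards",
    "prepare weekly reports using sql and excel",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Data Scientist, 0 = Data Analyst

pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys follow the "<step name>__<parameter>" convention of sklearn pipelines.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy corpus is tiny; use more folds on the real data.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so vocabulary from the held-out fold does not leak into the features.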

## Import and clean

In [47]:
import pandas as pd
from bs4 import BeautifulSoup
from unidecode import unidecode

In [69]:
url = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv'
df = pd.read_csv(url).dropna()
print(df.shape)
df.head()

(499, 3)


Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [70]:
test = df.description[0]
# test = "b'Nai\\xc2\\xa8ve'"
type(test)

str

In [79]:
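# Clean the description

def clean_html_with_bs4(string):
    """Strip HTML tags and return only the visible text."""
    # Name the parser explicitly so bs4 does not warn or pick one by availability
    soup = BeautifulSoup(string, 'html.parser')
    return soup.get_text()

listings = []
for x in df['description']:
    # Drop the leading b' (or b") and the trailing quote left by the bytes repr.
    # Not idempotent: every re-run of this cell strips more characters from both ends.
    x = x[2:-1]
    # Clean out HTML
    x = clean_html_with_bs4(x)
    # Replace escaped line breaks (a literal backslash followed by n)
    x = x.replace('\\n', ' ')
    # Translate unicode characters to ASCII (currently disabled)
    # x = unidecode(x)
    listings.append(x)

df['description'] = listings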

# Create a numerical label column
df['label_num'] = df.job.map({'Data Analyst': 0, 'Data Scientist': 1})
df.head()

Unnamed: 0,description,title,job,label_num
0,b Requirements: Conceptual understanding in Ma...,Data scientist,Data Scientist,1
1,"b Description As a Data Scientist 1, you will...",Data Scientist I,Data Scientist,1
2,a Data Scientist you will be working on consul...,Data Scientist - Entry Level,Data Scientist,1
3,",969 - $6,756 a monthContractUnder the general...",Data Scientist,Data Scientist,1
4,cation: USA \xe2\x80\x93 multiple locations 2+...,Data Scientist,Data Scientist,1
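
The escaped byte sequences that survive the cleaning step (for example `\xe2\x80\x93` in row 4) are there because each description is the text of a Python bytes literal. A possible alternative to slicing off the quotes is to parse the literal and decode it as UTF-8. The helper below is a hypothetical sketch (the name `decode_bytes_literal` and the assumption that the bytes are UTF-8 are mine, not from the notebook):

```python
import ast


def decode_bytes_literal(text):
    """Parse the text of a bytes literal and decode it to a real str.

    Anything that is not a bytes literal is returned unchanged, so the
    function can safely be applied more than once.
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text
    if isinstance(value, bytes):
        # Assumes the scraped bytes are UTF-8; undecodable bytes are replaced.
        return value.decode("utf-8", errors="replace")
    return text


# A made-up sample in the same shape as the raw descriptions.
raw = "b'Nai\\xc2\\xa8ve Bayes \\xe2\\x80\\x93 K-Means\\n'"
decoded = decode_bytes_literal(raw)
print(decoded)
```

Applied to the raw column before the HTML is stripped (for instance with `df['description'].apply(decode_bytes_literal)`), this would also turn the escaped `\n` sequences into real newlines.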


In [72]:
listings[0]

"Job Requirements: Conceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role) Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R Ability to communicate Model findings to both Technical and Non-Technical stake holders Hands on experience in SQL/Hive or similar programming language Must show past work via GitHub, Kaggle or any other published article Master's degree in Statistics/Mathematics/Computer Science or any other quant specific field. Apply Now"

## Count Vectorize

In [83]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, 
                             ngram_range=(1,1), 
                             stop_words='english')

word_counts = vectorizer.fit_transform(df.description)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vect_count = pd.DataFrame(
    word_counts.toarray(),
    columns=vectorizer.get_feature_names_out())

print(word_counts.shape)
vect_count.head()

(499, 9653)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## TF-IDF Vectorize

In [84]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, 
                             ngram_range=(1,1), 
                             stop_words='english')

word_counts = vectorizer.fit_transform(df.description)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vect_tfidf = pd.DataFrame(
    word_counts.toarray(),
    columns=vectorizer.get_feature_names_out())

print(word_counts.shape)
vect_tfidf.head()

(499, 9653)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.109329,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
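
With both feature matrices available, the remaining requirement is to compare at least two classifiers on each of them. The following is a rough sketch of that comparison, not the notebook's own solution: the toy `docs`/`labels` corpus is invented for illustration, and Multinomial Naive Bayes and logistic regression are just two convenient choices.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with df.description and df.label_num.
docs = [
    "machine learning models and python pipelines",
    "deep learning research with neural networks",
    "statistical modeling and machine learning",
    "python models for prediction and learning",
    "excel reports and sql dashboards",
    "business reporting with sql queries",
    "dashboards and reports for sales teams",
    "sql and excel for weekly business reports",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "naive_bayes": MultinomialNB,
    "log_reg": lambda: LogisticRegression(max_iter=1000),
}

results = {}
for (v_name, make_vect), (c_name, make_clf) in product(
        vectorizers.items(), classifiers.items()):
    model = make_pipeline(make_vect(stop_words="english"), make_clf())
    # cv=2 only because the toy corpus is tiny.
    scores = cross_val_score(model, docs, labels, cv=2, scoring="accuracy")
    results[(v_name, c_name)] = scores.mean()

# Print the combinations from best to worst mean accuracy.
for combo, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(combo, round(acc, 3))
```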


# Stretch Goals

- Try agglomerative clustering with cosine distance (it works better in high-dimensional spaces). A robust method such as Ward linkage would be cool. Try to create an awesome dendrogram of the most important terms from the dataset.

- Awesome resources for the clustering stretch goal:
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
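
As a starting point for the clustering goal, here is a small sketch using SciPy's hierarchical clustering on TF-IDF vectors. It rests on a few assumptions: the toy `docs` list stands in for the job descriptions, it clusters documents rather than terms (passing `X.T` instead would cluster terms), and it uses average linkage because SciPy's Ward method is defined for Euclidean distance, not cosine distance.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with the cleaned job descriptions.
docs = [
    "machine learning models in python",
    "python machine learning research",
    "deep learning models and python",
    "sql reports and excel dashboards",
    "excel dashboards for business reports",
    "business reports with sql",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs).toarray()  # linkage needs a dense array

# Average linkage over pairwise cosine distances between documents.
Z = linkage(X, method="average", metric="cosine")

# Cut the tree into at most two flat clusters.
cluster_ids = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the dendrogram layout without drawing it;
# drop it (with matplotlib installed) to render the actual figure.
leaf_labels = [f"doc{i}" for i in range(len(docs))]
tree = dendrogram(Z, no_plot=True, labels=leaf_labels)

print(cluster_ids)
print(tree["ivl"])  # leaf order along the dendrogram axis
```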
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
