# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to build your own document classification models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Tune your model's hyperparameters by trying different ngram_range and max_features settings and different data-cleaning methods
- Try to get the highest accuracy possible!
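One way to cover these requirements in a single pass is a scikit-learn `Pipeline` searched with `GridSearchCV`. The following is a minimal sketch, not a reference solution: the `texts` and `labels` lists are a hypothetical stand-in for the cleaned job descriptions, and the grid values are illustrative assumptions rather than recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in corpus; replace with the cleaned descriptions and labels.
texts = [
    "build machine learning models in python",
    "deep learning research and predictive modeling",
    "train neural networks and deploy models",
    "statistical modeling and machine learning pipelines",
    "create dashboards and sql reports",
    "excel reporting and business dashboards",
    "write sql queries for weekly reports",
    "tableau dashboards and excel analysis",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer lives inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Swap whole steps ("vect", "clf") and tune vectorizer settings in one grid.
param_grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 50],
    "clf": [LogisticRegression(max_iter=1000), MultinomialNB()],
}

# cv=2 only because the toy corpus is tiny; use more folds on the real data.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `search.cv_results_` holds the mean accuracy of every combination, which makes the CountVectorizer versus TfidfVectorizer comparison easy to tabulate.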

In [92]:
import re
import string

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
# word_tokenize and stopwords need the NLTK tokenizer and stopwords data (see nltk.download).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

DATURL = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv"



In [98]:
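# Tokens to drop: English stopwords (lowercase only) and single punctuation characters.
DROP_TOKENS = set(stopwords.words('english')).union(string.punctuation)

def clean_html(s: str) -> str:
    '''Strip HTML markup, then drop stopwords and punctuation tokens.'''
    # Remove one leading 'b' or '"' left over from the stringified bytes literal.
    text = BeautifulSoup(re.sub(r'^[b\"]', '', s), 'html.parser').get_text()
    tokens = [w for w in word_tokenize(re.sub(r'^[b]', '', text)) if w not in DROP_TOKENS]
    # Delete literal "\n" sequences (a backslash followed by "n").
    return ' '.join(tokens).replace('\\n', '')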

def clean(dat: pd.DataFrame) -> pd.DataFrame:
    '''Drop incomplete rows and the title column, encode job as 0/1, and clean each description.'''
    dat_ = dat.dropna(axis=0)
    return (dat_.drop('title', axis=1)
            .assign(job=dat_.job.replace({'Data Scientist': 1,
                                          'Data Analyst': 0}))
            .assign(description=dat_.description.map(clean_html))
           )

df = clean(pd.read_csv(DATURL))

print(df.shape)
df.head(14)


(499, 2)


Unnamed: 0,description,job
0,`` Job Requirements Conceptual understanding M...,1
1,'Job DescriptionAs Data Scientist 1 help us bu...,1
2,'As Data Scientist working consulting side bus...,1
3,"4,969 6,756 monthContractUnder general supervi...",1
4,'Location USA \xe2\x80\x93 multiple locations2...,1
5,'Create various Business Intelligence Analytic...,1
6,'As Spotify Premium swells 96M subscribers aro...,1
7,`` Everytown Gun Safety nation 's largest gun ...,1
8,`` MS quantitative discipline Statistics Mathe...,1
9,'Slack hiring experienced data scientists join...,1
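The leftover quotes and the `\xe2\x80\x93` sequence in row 4 suggest that each description is stored as the text of a Python bytes literal. If that assumption holds, decoding the literal before stripping HTML would recover the original characters. The helper below is a hypothetical sketch (the name `decode_bytes_literal` is not part of the notebook) and uses only the standard library.

```python
import ast

def decode_bytes_literal(s: str) -> str:
    """Decode text that looks like a bytes literal, e.g. "b'caf\\xc3\\xa9'".

    Returns the input unchanged when it is not a valid bytes literal.
    """
    try:
        value = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return s
    if isinstance(value, bytes):
        # errors="replace" keeps going if a description is not valid UTF-8.
        return value.decode("utf-8", errors="replace")
    return s

raw = "b'Location USA \\xe2\\x80\\x93 multiple locations'"
print(decode_bytes_literal(raw))
```

Applied before the HTML cleaning step, for example with `dat.description.map(decode_bytes_literal)`, this would also turn escaped `\n` sequences into real newlines.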


In [57]:
df.job.value_counts()

print("A balanced two-class classification problem.")

A balanced two-class classification problem.


In [58]:
df.dropna(axis=0).shape

(499, 2)

In [59]:
df.job.value_counts()

1    250
0    249
Name: job, dtype: int64
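With the classes this close to balanced, plain accuracy is a reasonable metric, and a stratified split keeps the same balance in the train and test sets. The sketch below compares the two vectorizers with two classifiers. The function name `compare_models` and the toy corpus are assumptions made for illustration; on the real data the call would be along the lines of `compare_models(df.description, df.job)`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def compare_models(texts, labels, test_size=0.25, seed=42):
    """Return {(vectorizer name, model name): test-set accuracy}."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
    models = {
        "logreg": lambda: LogisticRegression(max_iter=1000),
        "nb": MultinomialNB,
    }
    scores = {}
    for v_name, make_vect in vectorizers.items():
        for m_name, make_model in models.items():
            # The pipeline fits the vectorizer on the training split only.
            pipe = make_pipeline(make_vect(), make_model())
            pipe.fit(X_train, y_train)
            scores[(v_name, m_name)] = accuracy_score(y_test, pipe.predict(X_test))
    return scores

# Hypothetical toy corpus standing in for the job descriptions.
toy_texts = [
    "build machine learning models in python",
    "deep learning research and predictive modeling",
    "train neural networks and deploy models",
    "statistical modeling and machine learning pipelines",
    "create dashboards and sql reports",
    "excel reporting and business dashboards",
    "write sql queries for weekly reports",
    "tableau dashboards and excel analysis",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

for combo, acc in compare_models(toy_texts, toy_labels).items():
    print(combo, round(acc, 2))
```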


# Stretch Goals

- Try agglomerative clustering with cosine distance, which works better in high-dimensional spaces. For more robust clustering, a linkage method such as Ward would be worth trying. Then try to build a dendrogram of the most important terms in the dataset.

- Useful resources for the clustering stretch goal:
  - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
  - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset:
  - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
  - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
