<a href="https://colab.research.google.com/github/brit228/DS-Unit-4-Sprint-2-NLP/blob/master/module3-Document-Classification/LS_DS_423_Document_Classification_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to "Hyperparameter Tune" your model by using different n_gram ranges, max_results, and data cleaning methods
- Try and get the highest accuracy possible!

In [23]:
!pip install -U nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import sent_tokenize # Sentence Tokenizer
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.probability import FreqDist

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd
from bs4 import BeautifulSoup
import bs4
import numpy as np

def text_with_newlines(elem):
  text = ''
  for e in elem.descendants:
    if isinstance(e, str):
      text += e.strip()
    elif e.name == 'br' or e.name == 'p':
      text += '\n'
  return text.replace("\\n", "\n")

def get_text(v):
  soup = BeautifulSoup(v, 'html.parser')
  if soup.find() == None:
    return v.replace("\\n", "\n")
  return "\n".join([text_with_newlines(c) for c in soup.find_all(recursive=False)])

def split_lemma(v):
  stop_words = stopwords.words('english')
  lemmatizer = WordNetLemmatizer()
  return [lemmatizer.lemmatize(w).lower() for w in word_tokenize(get_text(v)) if w.isalpha() and w not in stop_words and len(w) > 1]

df = pd.read_csv("https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv")
df["description"] = df["description"].apply(lambda v: split_lemma(v) if v is not np.nan else '')

Requirement already up-to-date: nltk in /usr/local/lib/python3.6/dist-packages (3.4)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)

vectorizer = CountVectorizer(stop_words='english')
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1,2))
bag_of_words = tfidf.fit_transform([" ".join(v) for v in train["description"].values])
bag_of_words_test = tfidf.transform([" ".join(v) for v in test["description"].values])

train_vec = pd.DataFrame(bag_of_words.toarray(), columns=tfidf.get_feature_names(), index=train.index)
train_vec["DataRole"] = train["job"]

test_vec = pd.DataFrame(bag_of_words_test.toarray(), columns=tfidf.get_feature_names(), index=test.index)
test_vec["DataRole"] = test["job"]

In [0]:
from sklearn.metrics import roc_auc_score

In [26]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(train_vec[[c for c in train_vec if c is not "DataRole"]], train_vec["DataRole"])
print(LR.score(test_vec[[c for c in test_vec if c is not "DataRole"]], test_vec["DataRole"]))
print(roc_auc_score(test_vec["DataRole"], LR.predict_proba(test_vec[[c for c in test_vec if c is not "DataRole"]])[:,1]))

0.89
0.9483173076923077




In [27]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB().fit(train_vec[[c for c in train_vec if c is not "DataRole"]], train_vec["DataRole"])
print(mnb.score(test_vec[[c for c in test_vec if c is not "DataRole"]], test_vec["DataRole"]))
print(roc_auc_score(test_vec["DataRole"], mnb.predict_proba(test_vec[[c for c in test_vec if c is not "DataRole"]])[:,1]))

0.89
0.939903846153846


In [28]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=200).fit(train_vec[[c for c in train_vec if c is not "DataRole"]], train_vec["DataRole"])
print(RFC.score(test_vec[[c for c in test_vec if c is not "DataRole"]], test_vec["DataRole"]))
print(roc_auc_score(test_vec["DataRole"], RFC.predict_proba(test_vec[[c for c in test_vec if c is not "DataRole"]])[:,1]))

0.91
0.9651442307692308


# Stretch Goals

- Try some agglomerative clustering using cosine-similarity-distance! (works better with high dimensional spaces) robust clustering - Agglomerative clustering like Ward would be cool. Try and create an awesome Dendrogram of the most important terms from the dataset.

- Awesome resource for clustering stretch goals: 
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
