
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy and full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column; you will need to read through the documentation to accomplish this task.

In [2]:
# BeautifulSoup and pandas are already imported in the first cell.
# Read in the job listings and drop the leftover index column.
df = pd.read_csv('https://raw.githubusercontent.com/mtoce/DS-Unit-4-Sprint-1-NLP/master/module2-vector-representations/data/job_listings.csv')
df = df.drop(columns='Unnamed: 0')
df.head()

Unnamed: 0,description,title
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [3]:
# yuck, this description is disgusting. Needs cleaning for sure.
print(df['description'][0])

b"<div><div>Job Requirements:</div><ul><li><p>\nConceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them</p>\n</li><li><p>Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)</p>\n</li><li><p>Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R</p>\n</li><li><p>Ability to communicate Model findings to both Technical and Non-Technical stake holders</p>\n</li><li><p>Hands on experience in SQL/Hive or similar programming language</p>\n</li><li><p>Must show past work via GitHub, Kaggle or any other published article</p>\n</li><li><p>Master's degree in Statistics/Mathematics/Computer Science or any other quant specific field.</p></li></ul><div><div><div><div><div><div>\nApply 

In [4]:
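def clean_with_soup(text):
    '''
    Strip HTML tags and byte-escape debris from a job description.
    '''
    # Name the parser explicitly to avoid BeautifulSoup's "no parser" warning
    soup = BeautifulSoup(text, 'html.parser')

    # Drop the leading b' (or b") and the trailing quote left by the bytes repr
    text = soup.get_text()[2:-1]

    # Replace literal "\n" sequences with spaces
    text = text.replace('\\n', ' ')
    # Remove literal byte escapes such as \xc2 or \xa8
    text = re.sub(r'\\x[a-f0-9]{2}', '', text)
    return text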

print(clean_with_soup(df['description'][0]))

Job Requirements: Conceptual understanding in Machine Learning models like Naive Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role) Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R Ability to communicate Model findings to both Technical and Non-Technical stake holders Hands on experience in SQL/Hive or similar programming language Must show past work via GitHub, Kaggle or any other published article Master's degree in Statistics/Mathematics/Computer Science or any other quant specific field. Apply Now
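
The `description` values are the string form of Python bytes literals (note the leading `b"` and escapes such as `\xc2\xa8`), so deleting the escapes also throws away every non-ASCII character. As an alternative sketch, assuming each row really is a valid bytes literal encoded as UTF-8 (not verified against the full file), `ast.literal_eval` can recover the bytes before decoding. `decode_bytes_literal` is a hypothetical helper, not part of the original notebook.

```python
import ast


def decode_bytes_literal(raw):
    """Turn the repr of a UTF-8 bytes object back into real text.

    Anything that does not parse as a bytes literal is returned unchanged.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(value, bytes):
        # errors="replace" keeps going if a row is not valid UTF-8
        return value.decode("utf-8", errors="replace")
    return raw


# Raw string so the backslashes stay literal, as they do in the CSV
sample = r"b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li></ul>'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

The decoded string still contains HTML, so it could then be passed to `BeautifulSoup(decoded, 'html.parser').get_text()` in place of the slicing and regex steps.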


In [5]:
df['clean_desc'] = df['description'].apply(clean_with_soup)

df.head()

Unnamed: 0,description,title,clean_desc
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Job Requirements: Conceptual understanding in ...
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"Job Description As a Data Scientist 1, you wi..."
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,As a Data Scientist you will be working on con...
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"$4,969 - $6,756 a monthContractUnder the gener..."
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Location: USA multiple locations 2+ years of ...


## 2) Use Spacy to tokenize the listings 

In [6]:
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_lg")

# set up tokenizer
tokenizer = Tokenizer(nlp.vocab)

tokens = []
# tokenize every cleaned description
for doc in tokenizer.pipe(df['clean_desc'], batch_size=500):
    doc_tokens = [token.lemma_ for token in doc
                  if not token.is_stop and not token.is_punct]
    tokens.append(doc_tokens)

df['tokens'] = tokens

df.head()

Unnamed: 0,description,title,clean_desc,tokens
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Job Requirements: Conceptual understanding in ...,"[Job, Requirements:, Conceptual, understand, M..."
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"Job Description As a Data Scientist 1, you wi...","[Job, Description, , Data, Scientist, 1,, hel..."
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,As a Data Scientist you will be working on con...,"[Data, Scientist, work, consult, business., re..."
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"$4,969 - $6,756 a monthContractUnder the gener...","[$4,969, $6,756, monthContractUnder, general, ..."
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Location: USA multiple locations 2+ years of ...,"[Location:, USA, , multiple, location, 2+, ye..."
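
The token lists above still carry attached punctuation (`Requirements:`, `1,`) and empty strings, because a bare `Tokenizer(nlp.vocab)` has no prefix or suffix rules and splits on whitespace only. Below is a minimal plain-Python sketch of one way to normalize such tokens. `simple_tokenize` and the tiny stop word set are illustrative assumptions; this is not how spaCy tokenizes or lemmatizes.

```python
import re
import string

# Illustrative stop words only; spaCy ships a much longer list.
STOP_WORDS = {"a", "an", "and", "as", "in", "of", "the", "to", "you", "will", "be"}


def simple_tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stop words and empties."""
    tokens = []
    for piece in re.split(r"\s+", text.lower()):
        piece = piece.strip(string.punctuation)
        if piece and piece not in STOP_WORDS:
            tokens.append(piece)
    return tokens


example = "Job Requirements: As a Data Scientist 1, you will be working."
print(simple_tokenize(example))
```

Such a function could be applied with `df['clean_desc'].apply(simple_tokenize)`, at the cost of losing lemmatization.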


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(stop_words='english', min_df=.03, ngram_range=(1,2))

docs = df['clean_desc'].copy()

vect.fit(docs)

dtm = vect.transform(docs)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())

dtm

Unnamed: 0,000,10,100,12,20,2019,25,3rd,40,401,...,years hands,years industry,years professional,years relevant,years work,york,york city,youll,youll work,youre
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,0,1,0,1,0,0
422,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
423,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
424,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
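
To make the table above concrete, here is a plain-Python sketch of what a document-term matrix holds: one row per document, one column per vocabulary term, and each cell a raw count. It handles unigrams only and is not scikit-learn's implementation; `count_matrix` and the toy corpus are assumptions for illustration.

```python
from collections import Counter


def count_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in doc i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


toy_docs = ["python sql python", "sql statistics"]
vocabulary, rows = count_matrix(toy_docs)
print(vocabulary)
for row in rows:
    print(row)
```

In the cell above, `min_df=.03` additionally drops terms that appear in fewer than 3% of the documents, and `ngram_range=(1,2)` adds two-word phrases such as `york city` as columns.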


## 4) Visualize the most common word counts

In [None]:
import squarify
import matplotlib.pyplot as plt
%matplotlib inline

wc = []
for col in dtm.columns:
    wc.append((col, sum(dtm[col])))

wc = pd.DataFrame(wc)
wc.columns = ['token', 'count']
wc = wc.sort_values(by='count', ascending=False)

squarify.plot(sizes=wc['count'][:20],
              label=wc['token'][:20].str.replace(' ', '\n'),
              alpha=0.8)
plt.title('20 most common words')
plt.axis('off')
plt.show()

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

docs = df['clean_desc'].copy()

dtm = tfidf.fit_transform(docs)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())

dtm.head()
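
For intuition about the weights in this matrix, the sketch below computes TF-IDF by hand. It follows the formula documented for `TfidfVectorizer`'s defaults (`smooth_idf=True`, `norm='l2'`): `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is a simplified illustration with whitespace tokenization and no stop word removal, so its numbers will not match the cell above; `tfidf_rows` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF rows with smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # Guard against an all-zero row before dividing
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


tfidf_vocab, tfidf_matrix = tfidf_rows(["python sql", "python statistics"])
for term, weight in zip(tfidf_vocab, tfidf_matrix[0]):
    print(term, round(weight, 3))
```

Because `python` occurs in both toy documents, it receives a lower weight than `sql`, which occurs in only one.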

## 6) Create a NearestNeighbors Model. Write the description of your ideal data science job and query your job listings.

In [None]:
from sklearn.neighbors import NearestNeighbors

# fit nearest neighbors to the dtm generated by the tfidf
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)

ideal_job = ['''
Looking for data science job. Uses python, SQL, git, conda, pipenv, etc. Relaxed working hours, no worry about time crunch. Looking for more of a development type job rather than a standard working one.

Hope to develop the new killer app or present interesting and important info for higher-ups.

Want to work somewhere near public transport, preferably around the Seattle slash downtown area of Washington. Want to work in small to medium teams with friendly co-workers as well as have a healthy and active working relationship with bosses or higher-ups.
''']


new = tfidf.transform(ideal_job)

# toarray() returns a plain ndarray; recent scikit-learn versions reject np.matrix
nn.kneighbors(new.toarray())
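
`kneighbors` returns a pair of arrays, distances and row indices, so the indices still have to be mapped back to listings (for example with `df.iloc[...]`). The brute-force sketch below shows the same idea in plain Python, using cosine distance rather than the Euclidean default. `cosine_distance`, `nearest`, and the toy vectors are hypothetical and stand in for the fitted model.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query, rows, k=2):
    """Return (distances, indices) of the k rows closest to `query`."""
    scored = sorted((cosine_distance(query, row), i) for i, row in enumerate(rows))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_query = [2.0, 0.0, 0.0]
distances, indices = nearest(toy_query, toy_rows, k=2)
print(indices)
```

Cosine distance ignores vector length, which is why the query `[2, 0, 0]` matches row 0 exactly even though the magnitudes differ.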

## Stretch Goals

 - Try different visualizations for words and frequencies. What story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset: which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :)