<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [8]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Optional:* Scrape 100 Job Listings that contain the title "Data Scientist" from indeed.com

At a minimum your final dataframe of job listings should contain
- Job Title
- Job Description

If you choose not to scrape the data, there is a CSV with outdated data in the directory. Remember, if you scrape Indeed, you're helping yourself find a job. ;)

In [39]:
# come back to scrape later, time allowing

df = pd.read_csv('./data/job_listings.csv')

df.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [7]:
df['description'][0]
# very ugly -- in particular, all the html

'b"<div><div>Job Requirements:</div><ul><li><p>\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them</p>\\n</li><li><p>Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)</p>\\n</li><li><p>Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R</p>\\n</li><li><p>Ability to communicate Model findings to both Technical and Non-Technical stake holders</p>\\n</li><li><p>Hands on experience in SQL/Hive or similar programming language</p>\\n</li><li><p>Must show past work via GitHub, Kaggle or any other published article</p>\\n</li><li><p>Master\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.</p></li></ul><div><div><div><div><div><d

In [10]:
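# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(df['description'][0], 'html.parser')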

soup.get_text()

# remaining garbage: initial b" and abundant \\n

'b"Job Requirements:\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them\\nIntermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)\\nExposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R\\nAbility to communicate Model findings to both Technical and Non-Technical stake holders\\nHands on experience in SQL/Hive or similar programming language\\nMust show past work via GitHub, Kaggle or any other published article\\nMaster\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.\\nApply Now"'

In [40]:
# strip the HTML from every description, with the parser named explicitly
df['description'] = df['description'].apply(lambda d: BeautifulSoup(d, 'html.parser').get_text())

df.head()

# good starting point

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""Job Requirements:\nConceptual understanding ...",Data scientist
1,1,"b'Job Description\n\nAs a Data Scientist 1, yo...",Data Scientist I
2,2,b'As a Data Scientist you will be working on c...,Data Scientist - Entry Level
3,3,"b'$4,969 - $6,756 a monthContractUnder the gen...",Data Scientist
4,4,b'Location: USA \xe2\x80\x93 multiple location...,Data Scientist


In [42]:
# take a description string and return it with the obvious garbage removed
def cut_the_crap(description):
    d2 = description.replace('\\n', ' ')
    d3 = d2.replace('b"', "")
    d4 = d3.replace("b'", "")
    return d4

df['description'] = df['description'].apply(cut_the_crap)

df.head()
# much better!

Unnamed: 0.1,Unnamed: 0,description,title
0,0,Job Requirements: Conceptual understanding in ...,Data scientist
1,1,"Job Description As a Data Scientist 1, you wi...",Data Scientist I
2,2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level
3,3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist
4,4,Location: USA \xe2\x80\x93 multiple locations ...,Data Scientist
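The leftover `\xe2\x80\x93` sequences suggest each description is the text of a Python bytes literal (`b'...'`) rather than a real string. As a hedged alternative to the string replacements above, the sketch below parses that literal and decodes it as UTF-8. The helper name `decode_description` and the sample string are illustrative assumptions, not part of the assignment, and the approach assumes the raw CSV values are valid bytes literals.

```python
import ast


def decode_description(raw):
    """Turn the text of a bytes literal ("b'...'") back into a real str.

    Hypothetical helper: returns the input unchanged when it is not a
    parseable bytes literal.
    """
    text = raw.strip()
    if text[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(text)  # safely parse the literal
        except (ValueError, SyntaxError):
            return raw  # malformed literal: leave it alone
        if isinstance(value, bytes):
            # undecodable bytes become U+FFFD instead of raising
            return value.decode("utf-8", errors="replace")
    return raw


# made-up sample in the same shape as the CSV values
sample = "b'Location: USA \\xe2\\x80\\x93 multiple locations\\nApply Now'"
print(decode_description(sample))
```

If this were used in the notebook, it would probably need to run on the raw column before `BeautifulSoup`, since unescaped HTML entities could otherwise break the literal.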


## 2) Use spaCy to tokenize / clean the listings

In [15]:
# load it up

nlp = spacy.load("en_core_web_lg")

In [43]:
# new column for lemmas

df['lemmas'] = df['description'].apply(lambda t: [token.lemma_ for token in nlp(t) if not token.is_stop and not token.is_punct])

df.head()

Unnamed: 0.1,Unnamed: 0,description,title,lemmas
0,0,Job Requirements: Conceptual understanding in ...,Data scientist,"[job, requirement, conceptual, understanding, ..."
1,1,"Job Description As a Data Scientist 1, you wi...",Data Scientist I,"[job, description, , Data, Scientist, 1, help..."
2,2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level,"[Data, scientist, work, consult, business, res..."
3,3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist,"[$, 4,969, $, 6,756, monthcontractunder, gener..."
4,4,Location: USA \xe2\x80\x93 multiple locations ...,Data Scientist,"[location, USA, \xe2\x80\x93, multiple, locati..."


In [None]:
df['lemmas'][0]

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [46]:
# create transformer
vect = CountVectorizer()

# count vectorizer wants a list of strings where each string is a separate document,
# so join each row's lemmas into one space-separated document for vect
lem_list = [' '.join(lemmas) for lemmas in df['lemmas']]

In [51]:
# fit to lemmas 
vect.fit(lem_list)

dtm = vect.transform(lem_list)

# print(vect.get_feature_names_out())

# I'm seeing a lot of room for improvement in lemmatization, but it looks reasonable

In [52]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())

In [53]:
print(dtm)

     00  000  02115  03  0305  0356  04  062  06366  08  ...  zero  zeus  zf  \
0     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
1     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
2     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
3     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
4     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
5     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
6     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
7     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
8     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
9     0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
10    0    0      0   0     0     0   0    0      0   0  ...     0     0   0   
11    0    2      0   0     0     0   0 

## 4) Visualize the most common word counts

In [None]:
# put a pin in it
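One way this step could be filled in is sketched below. It counts terms with `collections.Counter` and draws a horizontal bar chart of the most frequent ones. The helper names `top_terms` and `plot_top_terms` and the three toy documents are assumptions for illustration. Whitespace splitting is also a simplification: `CountVectorizer` tokenizes differently, so its counts will not match exactly.

```python
from collections import Counter


def top_terms(documents, n=10):
    """Return the n most frequent whitespace-separated terms as (term, count) pairs."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts.most_common(n)


def plot_top_terms(pairs, title="Most common terms"):
    """Draw a horizontal bar chart for (term, count) pairs."""
    import matplotlib.pyplot as plt  # imported lazily so counting works without it

    terms, counts = zip(*pairs)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(terms, counts)
    ax.invert_yaxis()  # most frequent term at the top
    ax.set_xlabel("count")
    ax.set_title(title)
    fig.tight_layout()
    plt.show()


# toy documents standing in for lem_list
docs = ["python sql python", "sql statistics", "python modeling"]
print(top_terms(docs, n=2))  # [('python', 3), ('sql', 2)]
```

In the notebook itself, `plot_top_terms(top_terms(lem_list, n=20))` would be the corresponding call. Summing the columns of the count matrix from section 3 would give the same kind of ranking with the vectorizer's own tokenization.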

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [55]:
# make vectorizer
tfidf = TfidfVectorizer(stop_words='english', min_df=0.025, max_df=.98)

dtm = tfidf.fit_transform(lem_list)

# Get feature names to use as dataframe column headers
# (get_feature_names() was removed in scikit-learn 1.2)
dtm = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())

# View Feature Matrix as DataFrame
dtm.head()

Unnamed: 0,000,10,100,12,15,20,2019,25,3rd,40,...,x99ve,x9cbig,x9d,xa6,xae,xc2,xe2,year,years,york
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.182325,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.135938,0.0,0.0,0.176321,0.023034,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035105,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.106363,0.111158,0.0,0.0
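To make the numbers in this matrix less opaque, here is a small standard-library sketch of the weighting `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit L2 length. The function name `tfidf_rows` and the toy documents are assumptions. The sketch ignores `stop_words`, `min_df`, `max_df`, and the vectorizer's tokenizer, so it illustrates the formula rather than reproducing the cell above.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Compute L2-normalized TF-IDF rows using the smoothed IDF formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    vocab = sorted({term for doc in tokenized for term in doc})

    # document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard empty docs
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["python sql", "python statistics", "python python sql"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Because "python" appears in every toy document, its IDF is the minimum value of 1, so the rarer "statistics" outweighs it in the second row.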


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [56]:
from sklearn.neighbors import NearestNeighbors

In [57]:
# make and fit model
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)

In [58]:
# check results for first document
nn.kneighbors([dtm.iloc[0].values])

(array([[0.        , 1.248946  , 1.25100532, 1.25330489, 1.25377596]]),
 array([[  0, 294, 276, 393, 366]]))

In [67]:
# check whether the comparison looks good
print(df.iloc[0]['description'])
print(df.iloc[366]['description'])
# seems fine! not wildly wrong, at least

Job Requirements: Conceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role) Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R Ability to communicate Model findings to both Technical and Non-Technical stake holders Hands on experience in SQL/Hive or similar programming language Must show past work via GitHub, Kaggle or any other published article Master's degree in Statistics/Mathematics/Computer Science or any other quant specific field. Apply Now"
Data science encompasses the computational and statistical skills required to use data in support of scientific enquiry and sound business decision-making. We are looking t

In [68]:
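# make up a job description and see what real ones match
# note: the corpus was lemmatized but this query is not, so inflected words
# such as "opportunities" may not match the fitted vocabulary
dream = ["opportunities for mentorship. fluent in Python and experienced in SQL"]

new = tfidf.transform(dream)

# pass a plain ndarray; recent scikit-learn versions reject np.matrix from todense()
nn.kneighbors(new.toarray())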

(array([[1.33833328, 1.34383729, 1.34555003, 1.34555003, 1.34555003]]),
 array([[300, 284, 164, 375, 296]]))

In [72]:
# check results
print(df.iloc[375]['description'])

Position Description We are looking for an experienced Data Scientist to join Vudu\xe2\x80\x99s growing Analytics team in Sunnyvale, CA and lead our efforts at modeling and researching consumer behavior. You will be leading our efforts around content recommendation, personalization, response modeling, churn analysis, A/B testing and much more. Sounds exciting? Here is more:  Who you are:  \xef\x83\x98 End-to-End model development: Build prediction models from the ground up, from data exploration through feature generation and into model construction and optimization. \xef\x83\x98 Mine Vudu (and other) data and deploy statistical modeling to gain robust insights into how consumers make entertainment choices. \xef\x83\x98 Run exploratory analyses into ambiguous problems and define metrics to build a quantitative understanding of our business. \xef\x83\x98 Lead collaboration with teams across the company: Product, marketing, Content and Engineering. \xef\x83\x98 Do whatever it takes to de
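The distances reported above are Euclidean distances between L2-normalized TF-IDF rows. For unit vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so a distance of about 1.34 corresponds to a cosine similarity of roughly 0.10: the "nearest" listings share very little weighted vocabulary with the query. The sketch below checks that identity on toy vectors using only the standard library. All function names and the toy data are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_euclidean(query, rows):
    """Row indices ordered from nearest to farthest."""
    return sorted(range(len(rows)), key=lambda i: euclidean(query, rows[i]))


def rank_by_cosine(query, rows):
    """Row indices ordered from most to least similar."""
    return sorted(range(len(rows)), key=lambda i: -cosine_similarity(query, rows[i]))


# toy unit vectors standing in for TF-IDF rows
rows = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0])]
query = l2_normalize([1.0, 0.2, 0.0])

for row in rows:
    d = euclidean(query, row)
    print(round(d ** 2, 6), round(2 - 2 * cosine_similarity(query, row), 6))

print(rank_by_euclidean(query, rows) == rank_by_cosine(query, rows))
```

Since the two rankings agree on normalized vectors, the neighbor order would be the same under cosine distance. In scikit-learn, `NearestNeighbors(metric='cosine', algorithm='brute')` reports cosine distance directly; the `kd_tree` algorithm used above does not support that metric.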

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify the specific technologies that the job listings require experience with. How are those requirements distributed among the listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 