
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [58]:
import re
import string

from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy and full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

`Tip:` You will need to install the `bs4` library inside your conda environment. 

In [3]:
from bs4 import BeautifulSoup
import requests

In [27]:
df = pd.read_csv("./data/job_listings.csv")
df.head()

| | Unnamed: 0 | description | title |
|---|---|---|---|
| 0 | 0 | `b"<div><div>Job Requirements:</div><ul><li><p>...` | Data scientist |
| 1 | 1 | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | Data Scientist I |
| 2 | 2 | `b'<div><p>As a Data Scientist you will be work...` | Data Scientist - Entry Level |
| 3 | 3 | `b'<div class="jobsearch-JobMetadataHeader icl-...` | Data Scientist |
| 4 | 4 | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | Data Scientist |


In [28]:
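# Each description is the text of a bytes literal (b'...'): strip the HTML first,
# then drop the leading b' (or b") and the trailing quote.
df["description"] = [BeautifulSoup(x, "html.parser").get_text() for x in df["description"]]
df["description"] = df["description"].str[2:-1]
# Replace the literal two-character sequence \n left over from the bytes repr.
df["description"] = df["description"].str.replace("\\n", " ", regex=False)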

In [30]:
df.head()

| | Unnamed: 0 | description | title |
|---|---|---|---|
| 0 | 0 | `Job Requirements: Conceptual understanding in ...` | Data scientist |
| 1 | 1 | `Job Description  As a Data Scientist 1, you wi...` | Data Scientist I |
| 2 | 2 | `As a Data Scientist you will be working on con...` | Data Scientist - Entry Level |
| 3 | 3 | `$4,969 - $6,756 a monthContractUnder the gener...` | Data Scientist |
| 4 | 4 | `Location: USA \xe2\x80\x93 multiple locations ...` | Data Scientist |
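
The cleaned output above still shows escape sequences such as `\xe2\x80\x93`. That suggests each description was saved as the text of a Python bytes literal rather than as decoded text. The following is a minimal sketch of one way to recover the real characters, assuming every value is a valid bytes literal holding UTF-8 data. The helper name `decode_bytes_literal` is hypothetical and not part of the assignment.

```python
import ast


def decode_bytes_literal(text: str) -> str:
    """Turn the text of a bytes literal (b'...') into a decoded str.

    Hypothetical helper: assumes UTF-8 content and returns the input
    unchanged when it is not a bytes literal.
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text
    if isinstance(value, bytes):
        # Escapes such as \xe2\x80\x93 become real characters here,
        # and a literal \n becomes an actual newline.
        return value.decode("utf-8", errors="replace")
    return text


raw = r"b'Location: USA \xe2\x80\x93 multiple locations'"
print(decode_bytes_literal(raw))
```

In the notebook this could be applied before the HTML is stripped, for example with `df["description"].map(decode_bytes_literal)`, which would make the `.str[2:]` slice and the `"\\n"` replacement unnecessary.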


## 2) Use spaCy to tokenize the listings

In [34]:
nlp = spacy.load('en_core_web_lg')

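def tokenize(data):
    """Return the stripped lemmas of all tokens that are neither punctuation nor stop words."""
    doc = nlp(data)
    # Whitespace-only tokens are kept; strip() turns them into empty strings.
    return [token.lemma_.strip() for token in doc if not token.is_punct and not token.is_stop]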


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [51]:
data = df['description'].tolist()

In [52]:
count = CountVectorizer(stop_words='english', tokenizer=tokenize)

dtm = count.fit_transform(data)

# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names().
desc = pd.DataFrame(dtm.toarray(), columns=count.get_feature_names_out())

In [53]:
desc.head()

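| | `''` | `$` | `+` | `+2` | `+3` | `-5` | `-PRON-` | `-data` | `-learn` | `-map` | ... | `zone` | `zoom` | `zuckerberg` | `zurich` | `zurich\xe2\x80\x99s` | `\|` | `\|\|` | `~$70` | `~1` | `~4` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |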
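
To make the document-term matrix above less opaque, here is a hand-rolled sketch of the same bag-of-words idea on three made-up sentences. It uses only the standard library. The `simple_tokenize` function is a toy stand-in for the spaCy tokenizer, and the sentences are invented for illustration, so none of this reproduces the notebook's actual counts.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "data scientist with python experience",
    "python developer building data pipelines",
    "senior data scientist, machine learning",
]


def simple_tokenize(text):
    # Toy stand-in for the spaCy tokenizer: lowercase, split on whitespace,
    # and trim surrounding commas and periods.
    return [w.strip(".,") for w in text.lower().split() if w.strip(".,")]


# One Counter per document: term -> number of occurrences.
counts = [Counter(simple_tokenize(d)) for d in docs]

# Columns are the sorted vocabulary; CountVectorizer also orders features alphabetically.
vocab = sorted(set().union(*counts))

# Rows are documents, columns are terms, cells are raw counts.
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row plays the role of one row of `desc`: one listing, one column per vocabulary term, and mostly zeros once the vocabulary grows.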


## 4) Visualize the most common word counts

In [41]:
count = CountVectorizer(stop_words='english', tokenizer=tokenize, max_features=10)
dtm = count.fit_transform(data)

# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names().
desc = pd.DataFrame(dtm.toarray(), columns=count.get_feature_names_out())

In [42]:
desc.head()

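| | `''` | analytic | business | data | datum | experience | product | science | team | work |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 1 |
| 1 | 5 | 0 | 1 | 4 | 0 | 7 | 2 | 2 | 6 | 6 |
| 2 | 1 | 1 | 3 | 2 | 1 | 1 | 0 | 1 | 0 | 2 |
| 3 | 3 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 3 | 2 |
| 4 | 0 | 1 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 |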
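
The cell above restricts the vocabulary to the ten most frequent terms but stops at a table. As a rough sketch of the missing aggregation and charting step, the snippet below sums each column and prints a plain-text bar chart. The numbers are copied by hand from the five rows shown above (leaving out the unnamed first feature column), so the totals describe only those five listings, not the whole dataset.

```python
# Counts copied from the five displayed rows of `desc`; not the full dataset.
terms = ["analytic", "business", "data", "datum", "experience",
         "product", "science", "team", "work"]
rows = [
    [0, 0, 0, 1, 2, 0, 1, 0, 1],
    [0, 1, 4, 0, 7, 2, 2, 6, 6],
    [1, 3, 2, 1, 1, 0, 1, 0, 2],
    [0, 0, 0, 2, 0, 0, 0, 3, 2],
    [1, 1, 2, 0, 1, 0, 0, 0, 0],
]

# Column sums: how often each term occurs across the listings.
totals = {term: sum(column) for term, column in zip(terms, zip(*rows))}

# Most frequent first; ties keep their original column order.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for term, total in ranked:
    print(f"{term:<12}{'#' * total} {total}")
```

With the real DataFrame, something like `desc.sum().sort_values().plot.barh()` followed by `plt.show()` should give the equivalent matplotlib chart.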


## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [54]:
tfid = TfidfVectorizer(stop_words='english', max_df=.95, min_df=3, tokenizer=tokenize)

dtm = tfid.fit_transform(data)

# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names().
tf = pd.DataFrame(dtm.toarray(), columns=tfid.get_feature_names_out())
tf.head()

Unnamed: 0,Unnamed: 1,$,+,/or,0,1,10,100,"100,000",11,...,yes,york,you\'ll,you\'re,you\xe2\x80\x99ll,you\xe2\x80\x99re,you\xe2\x80\x99ve,yrs,|,||
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.086468,0.0,0.029162,0.0,0.0,0.094018,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.054912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.068855,0.249558,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.146562,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
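
For intuition about where the weights above come from, here is a small sketch that computes TF-IDF by hand. It follows what I understand to be `TfidfVectorizer`'s default settings (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The three token lists are invented, and the `min_df` and `max_df` filtering used in the notebook is left out.

```python
import math
from collections import Counter

# Pre-tokenized toy documents, invented for illustration only.
docs = [
    ["python", "data", "model"],
    ["data", "team", "data"],
    ["python", "vision", "robot"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term appears at least once.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed to match smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Weight raw counts by IDF, then scale the row to unit Euclidean length."""
    tf = Counter(doc)
    weights = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


matrix = [tfidf_row(doc) for doc in docs]

for row in matrix:
    print([round(w, 3) for w in row])
```

Terms that occur in fewer documents ("model", "robot") get a larger IDF than terms shared across documents ("data", "python"), which is why rare words dominate a row after normalization.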


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [59]:
ideal_job = ['Utilize python and predictive modeling techniques to develop computer vision software for robots']

v_job = tfid.transform(ideal_job)

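# Euclidean k-nearest neighbors over the dense TF-IDF rows.
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(tf.values)

# Returns (distances, row indices) of the five closest listings.
nn.kneighbors(v_job.toarray())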

(array([[1.26914779, 1.28991133, 1.28991133, 1.29131959, 1.29259766]]),
 array([[185, 142,  52,  99, 328]], dtype=int64))

In [61]:
data[142]

'The challenge Adobe is looking for a Senior Data Scientist who will be building the next generation of marketing cloud products by leveraging machine learning, predictive modeling and optimization techniques. These products would help businesses understand, manage, and optimize the experience throughout the customer journey. Example applications include real-time online media optimization, media attribution, predictive sales analytics, product recommendation, mobile analytics, predictive customer scoring and segmentation and large-scale experimentation. Ideal candidates will have a strong academic background as well as technical skills including applied statistics, machine learning, data mining, and software development. Familiarity working with large-scale datasets and big data techniques would be a plus. What you\\xe2\\x80\\x99ll do Develop predictive models on large-scale datasets to address various business problems through leveraging advanced statistical modeling, machine learnin
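
A note on reading the distances returned by `kneighbors`: if the query and the listing rows are both unit-length (the default L2 normalization in `TfidfVectorizer`), Euclidean distance `d` and cosine similarity `s` are tied together by `d**2 = 2 - 2*s`. Under that assumption, distances of about 1.27 to 1.29 correspond to cosine similarities of only about 0.19 down to 0.16, so even the best matches share little vocabulary with the one-sentence query. The sketch below ranks toy vectors by cosine similarity directly. The vectors are made up, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, rows, k=2):
    """Return (row index, similarity) for the k rows most similar to the query."""
    scored = [(cosine_similarity(query, row), i) for i, row in enumerate(rows)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, s) for s, i in scored[:k]]


# Made-up unit-length vectors standing in for TF-IDF rows.
rows = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.8, 0.6, 0.0]

top = nearest(query, rows, k=2)
print(top)

# Check the identity d**2 = 2 - 2*s for the best match.
squared_distance = sum((q - r) ** 2 for q, r in zip(query, rows[top[0][0]]))
print(squared_distance, 2 - 2 * top[0][1])
```

In scikit-learn, `NearestNeighbors(metric='cosine', algorithm='brute')` can be fitted on the sparse matrix returned by `fit_transform`, which avoids converting the TF-IDF matrix to a dense array first.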

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings?
 - Try to identify the specific technologies and experience requirements asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
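
The hint about Euclidean distance in high dimensions can be illustrated with a small experiment. The sketch below draws uniformly random points (an assumption made for simplicity; real TF-IDF vectors are sparse and not uniform) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` is hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)


# As the dimension grows, nearest and farthest neighbors become almost equally far away.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast shrinks like this, "nearest" carries less information, which is one reason to consider cosine-based similarity or dimensionality reduction before clustering text vectors.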