<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [3]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [4]:
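##### Your Code Here #####
df = pd.read_csv('data/job_listings.csv')

def get_descriptions(html):
    """Strip HTML tags from one listing and return the plain text."""
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()

# Each description appears to be stored as the text of a bytes literal (b"..."); drop the leading b
df['description'] = df['description'].apply(lambda x: x.lstrip('b'))
df['description'] = df['description'].apply(get_descriptions)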

df.head()


Unnamed: 0.1,Unnamed: 0,description,title
0,0,"""Job Requirements:\nConceptual understanding i...",Data scientist
1,1,"'Job Description\n\nAs a Data Scientist 1, you...",Data Scientist I
2,2,'As a Data Scientist you will be working on co...,Data Scientist - Entry Level
3,3,"'$4,969 - $6,756 a monthContractUnder the gene...",Data Scientist
4,4,'Location: USA \xe2\x80\x93 multiple locations...,Data Scientist
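The cleaned descriptions above still start with a quote character and contain escape sequences such as `\xe2\x80\x93`, which suggests each cell holds the text of a Python bytes literal rather than decoded text. The following is a minimal sketch of one way to decode such a value before stripping tags, assuming that storage format; `decode_listing` and the sample string are hypothetical and not part of the assignment.

```python
import ast

def decode_listing(raw):
    """Decode the text of a bytes literal into a normal string; return other input unchanged."""
    try:
        value = ast.literal_eval(raw)  # safely parses literals only, never runs code
    except (ValueError, SyntaxError):
        return raw  # not a Python literal, so leave it as it is
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw

# Hypothetical cell value shaped like the rows shown above
sample = r"b'Location: USA \xe2\x80\x93 multiple locations'"
print(decode_listing(sample))
```

Applying a function like this before `get_descriptions` would replace the `lstrip('b')` step. Separately, calling `soup.get_text(separator=' ')` may help keep words from adjacent tags apart, which is the likely cause of fused tokens such as `monthContractUnder`.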


## 2) Use spaCy to tokenize the listings 

In [5]:
nlp = spacy.load("en_core_web_lg")

In [6]:
##### Your Code Here #####
stop_words = nlp.Defaults.stop_words.union(['job', 'data', 'scientist', 'location', 'business'])
def tokenize_column(text):
    doc = nlp(text)
    # Keep lemmas of alphabetic tokens that are neither stop words nor punctuation
    tokens = [token.lemma_ for token in doc
              if token.text.lower() not in stop_words
              and not token.is_punct
              and token.text.isalpha()]
    return tokens

df['tokens'] = df['description'].apply(tokenize_column)

df.head()

Unnamed: 0.1,Unnamed: 0,description,title,tokens
0,0,"""Job Requirements:\nConceptual understanding i...",Data scientist,"[understanding, Machine, Learning, model, like..."
1,1,"'Job Description\n\nAs a Data Scientist 1, you...",Data Scientist I,"[help, build, machine, learning, model, pipeli..."
2,2,'As a Data Scientist you will be working on co...,Data Scientist - Entry Level,"[work, consult, responsible, analyze, large, c..."
3,3,"'$4,969 - $6,756 a monthContractUnder the gene...",Data Scientist,"[monthcontractunder, general, supervision, Pro..."
4,4,'Location: USA \xe2\x80\x93 multiple locations...,Data Scientist,"[USA, multiple, year, Analytics, requirement, ..."
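The token lists above contain both `Machine` and `machine`, so the same word would be counted under two spellings. A small sketch of the effect of lowercasing the lemmas before counting; the token lists here are hypothetical stand-ins for the `tokens` column.

```python
from collections import Counter

# Hypothetical token lists shaped like the `tokens` column above
token_lists = [
    ["understanding", "Machine", "Learning", "model"],
    ["help", "build", "machine", "learning", "model"],
]

def normalize(tokens):
    """Lowercase every token so case variants share one count."""
    return [t.lower() for t in tokens]

raw_counts = Counter(t for tokens in token_lists for t in tokens)
norm_counts = Counter(t for tokens in token_lists for t in normalize(tokens))

# Compare the count for "machine" before and after normalization
print(raw_counts["machine"], norm_counts["machine"])
```

In the notebook, the same effect could be had by returning `token.lemma_.lower()` from `tokenize_column`.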


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [7]:
##### Your Code Here #####
vect = CountVectorizer()

vect.fit(df['description'])
dtm = vect.transform(df['description'])
dtm.shape

(426, 10069)
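The shape `(426, 10069)` means 426 listings by 10,069 distinct terms. As a rough illustration of what that matrix holds, here is a hand-rolled sketch on a toy corpus. It mimics `CountVectorizer`'s default lowercasing and its default token pattern (runs of two or more word characters), but it is not the library's implementation, and the corpus is invented.

```python
import re
from collections import Counter

# Toy corpus, for illustration only
docs = ["python and sql", "python for health data", "sql databases"]

def simple_tokens(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Columns: the sorted set of every term seen in the corpus
vocabulary = sorted({tok for d in docs for tok in simple_tokens(d)})

# Rows: one list of counts per document, aligned with `vocabulary`
matrix = []
for d in docs:
    counts = Counter(simple_tokens(d))
    matrix.append([counts[term] for term in vocabulary])

shape = (len(matrix), len(vocabulary))
print(vocabulary)
print(matrix)
print(shape)
```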

## 4) Visualize the most common word counts

In [8]:
##### Your Code Here #####
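# Wrap the sparse counts in a DataFrame with one column per vocabulary term
# (get_feature_names_out requires scikit-learn 1.0 or later)
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names_out())
dtm.sample(10)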

Unnamed: 0,00,000,02115,03,0356,04,062,06366,08,10,...,zenreach,zero,zeus,zf,zheng,zillow,zones,zoom,zuckerberg,zurich
130,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
118,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
44,0,2,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
234,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
140,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
294,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
412,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
54,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
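The sample above shows raw counts but does not yet rank or plot them. One possible approach is to sum each term's count across all listings and display the largest totals. The sketch below does this with plain dictionaries and a text bar chart; the per-listing counts are hypothetical stand-ins for rows of the document-term matrix.

```python
from collections import Counter

# Hypothetical per-listing term counts, standing in for rows of the document-term matrix
rows = [
    {"python": 2, "sql": 1, "experience": 3},
    {"python": 1, "experience": 2, "health": 1},
    {"sql": 2, "experience": 1},
]

totals = Counter()
for row in rows:
    totals.update(row)  # adds counts term by term, like summing each column

top_terms = totals.most_common(3)

# Text bar chart: one '#' per occurrence
for term, count in top_terms:
    print(f"{term:<12}{'#' * count} {count}")
```

On the real `dtm` DataFrame, `dtm.sum().sort_values(ascending=False).head(20)` should give the equivalent totals, and calling `.plot.barh()` on that result would draw them with matplotlib.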


## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [9]:
##### Your Code Here #####
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
dtm = tfidf.fit_transform(df['description'])
# get_feature_names_out requires scikit-learn 1.0 or later
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names_out())
dtm.head()


Unnamed: 0,000,04,10,100,1079302,11,12,125,14,15,...,years,yearthe,yes,yeti,york,young,yrs,zeus,zf,zillow
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.093431,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
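To make the weights in this table less opaque, here is a hand-computed sketch of TF-IDF on a toy corpus. It assumes scikit-learn's documented defaults: raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The corpus and function name are hypothetical, and this is not the library's code.

```python
import math
from collections import Counter

# Toy tokenized corpus, for illustration only
docs = [["python", "sql", "health"], ["python", "statistics"], ["python", "sql"]]
n_docs = len(docs)

# Document frequency: how many documents contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Return {term: weight} using smoothed idf and L2 normalization."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(docs[0])
print(vec)
```

In the first document, `health` appears in only one listing and so gets the largest weight, while `python` appears in every listing and gets the smallest.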


In [13]:
def tokenize(document):
    doc = nlp(document)
    # Keep stripped lemmas of alphanumeric tokens that are neither stop words nor punctuation
    return [token.lemma_.strip() for token in doc
            if token.text.lower() not in stop_words
            and not token.is_punct
            and token.text.isalnum()]

In [14]:
tfidf = TfidfVectorizer(stop_words='english', 
                        ngram_range=(1,2),
                        max_df=.97,
                        min_df=3,
                        tokenizer=tokenize)

dtm = tfidf.fit_transform(df['description'])
# get_feature_names_out requires scikit-learn 1.0 or later
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names_out())
dtm.head()

Unnamed: 0,0,0 2,1,1 year,10,10 time,10 year,100,100 company,100 country,...,year science,year simple,year technical,year work,yearthe,yes,york,york area,york city,yrs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.100871,0.060391,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
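Column names such as `1 year` and `york city` come from `ngram_range=(1,2)`, and the vocabulary is then pruned by `min_df=3` (an absolute document count) and `max_df=.97` (a proportion of documents). A minimal sketch of that filtering on a toy corpus, written by hand rather than taken from scikit-learn; all data and names are hypothetical.

```python
from collections import Counter

def unigrams_and_bigrams(tokens):
    """Mimic ngram_range=(1, 2): single tokens plus adjacent pairs joined by a space."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

# Toy tokenized corpus, for illustration only
docs = [
    ["machine", "learning", "model"],
    ["machine", "learning", "pipeline"],
    ["machine", "learning", "research"],
    ["machine", "vision"],
]

features = [unigrams_and_bigrams(d) for d in docs]

# Document frequency of every unigram and bigram
doc_freq = Counter(f for feats in features for f in set(feats))

min_df, max_df = 3, 0.97  # absolute count, proportion of documents
kept = sorted(
    f for f, n in doc_freq.items()
    if n >= min_df and n / len(docs) <= max_df
)
print(kept)
```

Here `machine` is dropped because it occurs in every document, and the rare terms are dropped because they occur in fewer than three.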


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings. 

In [16]:
##### Your Code Here #####
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)



NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [20]:
job_description = [""" Looking for a data scientist to work on health care related data. We are a company committed to solving
public health problems. Your data analysis skills can help us achieve our goals. We are looking for a data scientist with experience
in Python and SQL to build and analyze databases of health care data, and make predictions and suggestions to our implementation
team based on these databases. Must have a background in the medical field and experience in communicating technical information 
to non-technical peers."""]

In [21]:
new = tfidf.transform(job_description)

In [22]:
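# toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix that todense() returns
nn.kneighbors(new.toarray())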

(array([[1.25337666, 1.2595292 , 1.26677694, 1.26719827, 1.28173794]]),
 array([[213, 425, 388, 201,  47]], dtype=int64))
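The distances returned above are Euclidean. Assuming the TF-IDF rows keep `TfidfVectorizer`'s default L2 normalization, every vector has length 1, and for unit vectors the squared Euclidean distance equals `2 - 2 * cos`. The sketch below uses that identity to turn the reported distances into cosine similarities; the helper name and the two test vectors are hypothetical.

```python
import math

def cosine_from_euclidean(distance):
    """For two unit-length vectors, distance squared equals 2 - 2 * cosine similarity."""
    return 1 - distance ** 2 / 2

# Check the identity on two hypothetical unit vectors 60 degrees apart
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
dist = math.dist(a, b)
print(round(cosine_from_euclidean(dist), 6))  # cos(60 degrees), about 0.5

# Distances reported by kneighbors above
reported = [1.25337666, 1.2595292, 1.26677694, 1.26719827, 1.28173794]
similarities = [round(cosine_from_euclidean(d), 3) for d in reported]
print(similarities)
```

By this conversion the closest listing has a cosine similarity of roughly 0.21 to the query, so even the best match shares only a modest part of its weighted vocabulary with the ideal job description.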

In [23]:
df['description'].iloc[213]

"'Houston Methodist (HM) is looking for passionate data scientists to join the Center for Outcomes Research (COR) to lead and develop informatics initiatives that transform healthcare via data science and informatics. The goal of the HM COR informatics initiative is to benefit patients and society as a whole by utilizing the skills and tools of data science to model patient populations clinically and economically within the context of the patient, the health care and hospital systems so that we can optimize business and clinical operations, improve patient care, reduce costs and position HM to more effectively address population health and other strategic health care needs into the future. The responsibility of the data scientist is to address the best uses of data science and informatics resources for HM clinical and business operations and patient care, including the organization\\xe2\\x80\\x99s needs to assess and understand clinical, financial, operational, population health and ma

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from the Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
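
As a possible starting point for the technology-requirements stretch goal, the sketch below counts how many listings mention each technology as a whole word. The technology list and the listing texts are hypothetical; a real analysis would run over `df['description']` with a list you choose.

```python
import re

# Hypothetical technology names and listing texts, for illustration only
technologies = ["python", "sql", "spark", "r"]
listings = [
    "Experience in Python and SQL required.",
    "Strong R or Python skills; Spark is a plus.",
    "SQL reporting and dashboards.",
]

def mentions(text, term):
    """Return True if `term` appears as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(term)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Number of listings that mention each technology at least once
listing_counts = {
    tech: sum(mentions(text, tech) for text in listings)
    for tech in technologies
}
print(listing_counts)
```

Whole-word matching matters for short names: without the `\b` boundaries, `r` would match inside words like "required".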