
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [2]:
##### Your Code Here #####
# BeautifulSoup and pandas are already imported in the first cell.
df = pd.read_csv('./data/job_listings.csv')
                
print(df.shape)
df.head()

(426, 3)


Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [5]:
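def get_text_from_html(html):
    """Strip HTML tags from a listing and return only the visible text."""
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()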

df['description'].apply(get_text_from_html)[0]

'b"Job Requirements:\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them\\nIntermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)\\nExposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R\\nAbility to communicate Model findings to both Technical and Non-Technical stake holders\\nHands on experience in SQL/Hive or similar programming language\\nMust show past work via GitHub, Kaggle or any other published article\\nMaster\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.\\nApply Now"'

In [6]:
# Overwrite the description column with the cleaned text
df['description'] = df['description'].apply(get_text_from_html)
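
The cleaned description printed above still starts with `b"` and contains escapes such as `\\xc2\\xa8`, which suggests each cell holds the text of a Python bytes literal rather than decoded text. Below is a minimal sketch of one way to decode such a cell, assuming the underlying bytes are UTF-8. `decode_bytes_literal` is a hypothetical helper, not part of the assignment.

```python
import ast


def decode_bytes_literal(raw):
    """Turn the text of a bytes literal (e.g. "b'caf\\xc3\\xa9'") into a str.

    Assumption: the bytes are UTF-8 encoded. Anything that is not a
    bytes literal is returned unchanged.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # not a Python literal, leave it alone
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw


# A shortened sample in the same shape as the rows shown in df.head()
sample = r"b'<ul><li>Location: USA \xe2\x80\x93 multiple</li></ul>'"
print(decode_bytes_literal(sample))
```

If applied to the raw column before `get_text_from_html`, this should also turn the literal `\n` sequences into real newlines, so neighbouring words are less likely to be glued into a single token later on.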

## 2) Use spaCy to tokenize the listings

In [7]:
##### Your Code Here #####
nlp = spacy.load("en_core_web_lg")

def get_tokens(text):
    """Return the lemma of every token that is neither a stop word nor punctuation."""
    doc = nlp(text)
    return [token.lemma_.strip() for token in doc if not token.is_stop and not token.is_punct]

df['tokens'] = df['description'].apply(get_tokens)

In [8]:
df['tokens'].iloc[0]

['b"Job',
 'requirements:\\nconceptual',
 'understanding',
 'Machine',
 'Learning',
 'model',
 'like',
 'nai\\xc2\\xa8ve',
 'Bayes',
 'K',
 'Means',
 'SVM',
 'Apriori',
 'Linear/',
 'Logistic',
 'Regression',
 'neural',
 'Random',
 'Forests',
 'decision',
 'Trees',
 'K',
 'NN',
 'hand',
 'experience',
 '2',
 'them\\nintermediate',
 'expert',
 'level',
 'coding',
 'skill',
 'Python',
 'R.',
 'ability',
 'write',
 'function',
 'clean',
 'efficient',
 'datum',
 'manipulation',
 'mandatory',
 'role)\\nexposure',
 'package',
 'like',
 'NumPy',
 'SciPy',
 'Pandas',
 'Matplotlib',
 'etc',
 'Python',
 'GGPlot2',
 'dplyr',
 'tidyR',
 'R\\nAbility',
 'communicate',
 'Model',
 'finding',
 'Technical',
 'Non',
 'technical',
 'stake',
 'holders\\nhand',
 'experience',
 'SQL',
 'Hive',
 'similar',
 'programming',
 'language\\nmust',
 'past',
 'work',
 'GitHub',
 'Kaggle',
 'publish',
 'article\\nmaster',
 'degree',
 'Statistics',
 'Mathematics',
 'Computer',
 'Science',
 'quant',
 'specific',
 'fiel

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [11]:
##### Your Code Here #####
vect = CountVectorizer(stop_words="english")
dmt = vect.fit_transform(df['description'])
print(dmt)

  (0, 4203)	1
  (0, 7787)	1
  (0, 5338)	1
  (0, 9239)	1
  (0, 4584)	1
  (0, 4389)	1
  (0, 4900)	1
  (0, 4459)	2
  (0, 5078)	1
  (0, 9778)	1
  (0, 9764)	1
  (0, 884)	1
  (0, 4725)	1
  (0, 8790)	1
  (0, 636)	1
  (0, 4470)	1
  (0, 4526)	1
  (0, 7666)	1
  (0, 5583)	1
  (0, 7528)	1
  (0, 3184)	1
  (0, 2078)	1
  (0, 9138)	1
  (0, 5979)	1
  (0, 3544)	1
  :	:
  (425, 964)	2
  (425, 3488)	1
  (425, 6641)	1
  (425, 8067)	1
  (425, 1280)	6
  (425, 7477)	1
  (425, 5292)	1
  (425, 7261)	1
  (425, 7901)	1
  (425, 813)	1
  (425, 5620)	1
  (425, 7812)	1
  (425, 4613)	1
  (425, 6588)	1
  (425, 4164)	1
  (425, 7992)	1
  (425, 1633)	1
  (425, 6624)	1
  (425, 1621)	1
  (425, 3807)	1
  (425, 3141)	1
  (425, 8260)	1
  (425, 6562)	1
  (425, 5265)	2
  (425, 4443)	1
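
Each printed line has the form `(row, column) count`: the listing index, the vocabulary column index, and how often that term occurs in the listing. Below is a small sketch of mapping column indices back to terms through `vect.vocabulary_`, using a two-document toy corpus (an assumption for illustration, not the job listings).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for df['description']
docs = ["machine learning models", "learning python learning"]

vect = CountVectorizer(stop_words="english")
dtm = vect.fit_transform(docs)

# vocabulary_ maps term -> column index; invert it to look terms up by column
index_to_term = {idx: term for term, idx in vect.vocabulary_.items()}

# COO format exposes parallel arrays of row indices, column indices and counts
coo = dtm.tocoo()
entries = [
    (int(row), index_to_term[int(col)], int(count))
    for row, col, count in zip(coo.row, coo.col, coo.data)
]

for row, term, count in sorted(entries):
    print(row, term, count)
```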


In [14]:
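# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call it instead of get_feature_names_out().
df_dmt = pd.DataFrame(dmt.toarray(), columns=vect.get_feature_names_out())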
df_dmt.head()

Unnamed: 0,00,000,02115,03,0356,04,062,06366,08,10,...,zenreach,zero,zeus,zf,zheng,zillow,zones,zoom,zuckerberg,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4) Visualize the most common word counts

In [19]:
##### Your Code Here #####
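# axis=1 sums across columns, so this ranks listings by their total word count;
# df_dmt.sum(axis=0) would rank individual words instead.
df_dmt.sum(axis=1).sort_values(ascending=False)[:5]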

201    1728
143     957
411     873
336     783
410     773
dtype: int64
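
The result above is indexed by listing number, so it shows the five longest listings rather than the most common words. One way to get per-word totals is to count over the token lists from step 2. The sketch below assumes each entry is a list of strings like `df['tokens']`. `top_terms` is a hypothetical helper and the sample data is made up.

```python
from collections import Counter


def top_terms(token_lists, n=10):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for tokens in token_lists:
        # Lowercase so that 'Python' and 'python' are counted together
        counts.update(token.lower() for token in tokens if token)
    return counts.most_common(n)


# Made-up token lists in the same shape as df['tokens']
sample = [["Python", "datum", "model"], ["python", "SQL", "datum"], ["datum"]]
print(top_terms(sample, n=2))
```

Calling `top_terms(df['tokens'])` should give `(term, count)` pairs that can be passed straight to a bar chart, for example with `plt.barh`.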

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [25]:
##### Your Code Here #####
tfidf = TfidfVectorizer(stop_words="english")

dmt = tfidf.fit_transform(df['description'])

In [27]:
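# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call it instead of get_feature_names_out().
df_dmt = pd.DataFrame(dmt.toarray(), columns=tfidf.get_feature_names_out())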
df_dmt.head()

Unnamed: 0,00,000,02115,03,0356,04,062,06366,08,10,...,zenreach,zero,zeus,zf,zheng,zillow,zones,zoom,zuckerberg,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.104421,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
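
With its default settings, `TfidfVectorizer` weights each term count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 length. Below is a plain-Python sketch of that calculation on pre-tokenized toy documents. `tfidf_rows` is a hypothetical helper, and it skips the vectorizer's own tokenization and stop-word removal.

```python
import math
from collections import Counter


def tfidf_rows(token_lists):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in token_lists:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows([["data", "python"], ["data", "sql"]])
print(rows[0])
```

In this toy example "data" appears in both documents, so it receives a lower weight than "python", which appears in only one.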


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [29]:
##### Your Code Here #####
from sklearn.neighbors import NearestNeighbors

# Fit on DTM
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(df_dmt)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [31]:
test = "Machine Learning Engineer with a strong web development background"
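# toarray() returns a plain ndarray; todense() returns np.matrix, which newer scikit-learn versions reject.
nn.kneighbors(tfidf.transform([test]).toarray())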

(array([[1.30661601, 1.31543483, 1.32223717, 1.32243061, 1.32243061]]),
 array([[261, 173, 151,  33, 378]]))
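
The distances above are Euclidean. Assuming the default `norm='l2'` of `TfidfVectorizer`, every row and the query are unit-length vectors, so the squared distance satisfies d² = 2 − 2·cos and can be converted to a cosine similarity. The sketch below applies that identity to the five distances returned above. `cosine_from_euclidean` is a hypothetical helper.

```python
def cosine_from_euclidean(distance):
    """Cosine similarity of two unit-length vectors, given their Euclidean distance."""
    return 1.0 - distance ** 2 / 2.0


# Distances copied from the kneighbors output above
distances = [1.30661601, 1.31543483, 1.32223717, 1.32243061, 1.32243061]
similarities = [round(cosine_from_euclidean(d), 3) for d in distances]
print(similarities)
```

On unit-length vectors the two measures rank neighbours in the same order, so the low similarities here mainly reflect how few terms the one-sentence query shares with any listing.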

In [34]:
df.iloc[261]['description']

"b'The Data Science Engineer, Mintel Futures is a core part of Mintel\\xe2\\x80\\x99s data science team that will have the opportunity to work on a wide array of initiatives across varying aspects of Mintel\\xe2\\x80\\x99s business and data. This individual will help manage the full analytics lifecycle of advanced projects; help to identify new, impactful ways to apply machine learning to Mintel\\xe2\\x80\\x99s data; work alongside data scientists and business stakeholders to implement solutions that provide valuable insights; and aide in the design and development of a modern data analytics environment.\\n\\nWhat You Will Do:\\n\\nPlay an integral role in shaping the underlying technology environment for Mintel\\xe2\\x80\\x99s fast growing team of data scientists and data analysts\\nAssist in the acquisition and management of a variety of data sources for large-scale analysis\\nIdentify opportunities for predictive modeling or other machine learning techniques and experiment with solu

In [35]:
df.iloc[173]['description']

"b'We are hiring a remote Data Scientist with strong Machine Learning background and about 3+ years of experience.\\nREQUIREMENTS\\nDegree Required\\nDeep understanding of, and experience with, machine learning models and data analysis\\nDeep understanding of both supervised and unsupervised learning methods\\nExperience with building end-to-end machine learning systems in production\\nStrong proficiency writing production-quality code\\nExperience handling large-scale data, big data platforms, and distributed systems\\nFOR IMMEDIATE CONSIDERATION, EMAIL RESUME WITH FIRST AND LAST NAME, HOME LOCATION, CELL AND EMAIL ADDRESS TO: ds@executecrecruiters.com. Please, No Calls and Recruiters do not send me your candidates.'"

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify the experience requirements for specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this task. Do a little research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 