<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text data in the description column is still messy - full of html tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task. 

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
                
df = pd.read_csv('./data/job_listings.csv')

In [3]:
def clean_description(desc):
    soup = BeautifulSoup(desc)
    return soup.get_text()
df['clean_desc'] = df['description'].apply(clean_description)

In [4]:
df0 = df.rename(columns={'Unnamed: 0':'index'})
df0.head()

Unnamed: 0,index,description,title,clean_desc
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,"b""Job Requirements:\nConceptual understanding ..."
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"b'Job Description\n\nAs a Data Scientist 1, yo..."
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,b'As a Data Scientist you will be working on c...
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"b'$4,969 - $6,756 a monthContractUnder the gen..."
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,b'Location: USA \xe2\x80\x93 multiple location...


In [5]:
# df0['clean_desc'][0].replace("b",'',1).replace(r"\n", ' ')

In [6]:
trimmings = []

for x in range(len(df0['clean_desc'])):
    trimmed = df0['clean_desc'][x].replace("b",'',1)
    trimmed0 = trimmed.replace(r"\n", ' ')
    trimmed1 = trimmed0.replace("\\", ' ')
    trimmed2 = trimmed1.replace(r"/", ' ')
    trimmed3 = trimmed2.replace(r"'", ' ')
    trimmed4 = trimmed3.replace(r'"', ' ')
    trimmed5 = trimmed4.replace(' x/s/s', '')
    trimmed6 = trimmed5.replace(' xa8', '')
    trimmed7 = trimmed6.replace(' xe2', '')
    trimmed8 = trimmed7.replace(' x80', '')
    trimmed9 = trimmed8.replace(' xa6', '')
    trimmed10 = trimmed9.replace(' x99', '')

    trimmings.append(trimmed10)

df0['trimmed'] = trimmings

In [7]:
df0['trimmed'][4]

' Location: USA  x93 multiple locations 2+ years of Analytics experience Understand business requirements and technical requirements Can handle data extraction, preparation and transformation Create and implement data models '

## 2) Use Spacy to tokenize the listings 

In [8]:
import spacy
from spacy.tokenizer import Tokenizer

In [9]:
nlp = spacy.load('en_core_web_lg')
tokenizer = Tokenizer(nlp.vocab)

In [10]:
stop_jargon = [
    ' ','datum','data','science','scientist'
]

In [11]:
STOP_WORDS = nlp.Defaults.stop_words.union(stop_jargon)

In [12]:
import re

tokens = []

for doc in tokenizer.pipe(df0['trimmed']):
    
    doc_tokens = []
    
    for token in doc:
        if (token.is_stop == False) & (token.is_punct == False):
            lowers = re.sub('[^a-zA-z 0-9]', '', token.lemma_).lower()
        if lowers not in STOP_WORDS:
            doc_tokens.append(lowers)
        
    tokens.append(doc_tokens)

df0['tokens'] = tokens
df0.head()

Unnamed: 0,index,description,title,clean_desc,trimmed,tokens
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,"b""Job Requirements:\nConceptual understanding ...",Job Requirements: Conceptual understanding in...,"[job, requirements, conceptual, understand, un..."
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"b'Job Description\n\nAs a Data Scientist 1, yo...","Job Description As a Data Scientist 1, you w...","[job, description, 1, 1, 1, help, help, build,..."
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,b'As a Data Scientist you will be working on c...,As a Data Scientist you will be working on co...,"[work, work, consult, consult, consult, consul..."
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"b'$4,969 - $6,756 a monthContractUnder the gen...","$4,969 - $6,756 a monthContractUnder the gene...","[4969, 4969, 6756, 6756, monthcontractunder, m..."
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,b'Location: USA \xe2\x80\x93 multiple location...,Location: USA x93 multiple locations 2+ year...,"[location, usa, x93, multiple, location, 2, ye..."


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [13]:
vect= CountVectorizer()

In [14]:
trimmings[:2]

# rawtext = 

[' Job Requirements: Conceptual understanding in Machine Learning models like Nai xc2ve Bayes, K-Means, SVM, Apriori, Linear  Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them Intermediate to expert level coding skills in Python R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role) Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R Ability to communicate Model findings to both Technical and Non-Technical stake holders Hands on experience in SQL Hive or similar programming language Must show past work via GitHub, Kaggle or any other published article Master s degree in Statistics Mathematics Computer Science or any other quant specific field. Apply Now ',
 ' Job Description  As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare

In [15]:
d1tokens = []

for listobj in df0['tokens']:
    for w in range(len(listobj)):
        d1tokens.append(listobj[w])

d1tokens[:50]

['job',
 'requirements',
 'conceptual',
 'understand',
 'understand',
 'machine',
 'learning',
 'model',
 'like',
 'nai',
 'xc2ve',
 'bayes',
 'kmeans',
 'svm',
 'apriori',
 'linear',
 'logistic',
 'regression',
 'neural',
 'random',
 'forests',
 'decision',
 'trees',
 'knn',
 'knn',
 'knn',
 'handson',
 'experience',
 'experience',
 'experience',
 'experience',
 '2',
 '2',
 '2',
 'intermediate',
 'intermediate',
 'expert',
 'level',
 'code',
 'skill',
 'skill',
 'python',
 'r',
 'ability',
 'ability',
 'write',
 'functions',
 'clean',
 'clean',
 'efficient']

In [16]:
vect.fit(d1tokens)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [17]:
dtm = vect.transform(trimmings)

In [18]:
dtm_df = pd.DataFrame(
    dtm.todense(),
    columns=vect.get_feature_names()
)

In [19]:
print(dtm_df.shape)
dtm_df.head()

(426, 8507)


Unnamed: 0,00,02,02115,03,030,030547069,04,06366,08,10,...,zf,zfs,zheng,zillow,zogsports,zone,zoom,zuckerberg,zurich,zurichs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4) Visualize the most common word counts

In [20]:
from collections import Counter

In [21]:
def count(docs):
    word_counts = Counter()
    appears_in = Counter()
    
    total_docs = len(docs)
    
    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))
    
    temp = zip(word_counts.keys(), word_counts.values())
    
    wc = pd.DataFrame(temp, columns = ['word', 'count'])
    
    wc['rank'] = wc['count'].rank(
        method='first', ascending=False
    )
    total = wc['count'].sum()
    
    wc['pct_total'] = wc['count'].apply(
        lambda x: x/total
    )
    
    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()
    
    t2 = zip(appears_in.keys(), appears_in.values())
    ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')
    
    wc['appears_in_pct'] = wc['appears_in'].apply(
       lambda x: x/total_docs 
    )
    
    return wc.sort_values(by='rank')

In [22]:
wc = count(df0['tokens'])
wc.head(20)

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
46,experience,410,3542,1.0,0.018077,0.018077,0.962441
45,work,378,3139,2.0,0.01602,0.034098,0.887324
137,team,362,2427,3.0,0.012387,0.046484,0.849765
263,business,324,1527,4.0,0.007793,0.054278,0.760563
60,model,298,1513,5.0,0.007722,0.062,0.699531
114,analysis,313,1248,6.0,0.006369,0.068369,0.734742
206,product,256,1152,7.0,0.005879,0.074248,0.600939
278,learn,308,1122,8.0,0.005726,0.079975,0.723005
23,ability,248,1085,9.0,0.005537,0.085512,0.58216
322,analytics,249,1043,10.0,0.005323,0.090835,0.584507


## 5) Use Scikit-Learn's tfidfVectorizer to get a TF-IDF feature matrix

In [73]:
tfidf = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1,2),
    max_df=.99,
    min_df=.001
)

dtm0 = tfidf.fit_transform(trimmings)

dtm0_df = pd.DataFrame(
    dtm0.todense(), columns=tfidf.get_feature_names()
)

dtm0_df.head()

Unnamed: 0,00,00 non,00 preferred,000,000 100,000 125,000 350,000 85,000 annually,000 associates,...,zuckerberg 2015,zuckerberg initiative,zurich,zurich american,zurich customers,zurich does,zurich north,zurich place,zurichs,zurichs predictive
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [74]:
with pd.option_context(
    'display.float_format', '{:0.50f}'.format
):
    print(dtm0_df.iloc[4:5, :5])

# THESE VECTORS ARE TINY

                                                  00  \
4 0.000000000000000000000000000000000000000000000...   

                                              00 non  \
4 0.000000000000000000000000000000000000000000000...   

                                        00 preferred  \
4 0.000000000000000000000000000000000000000000000...   

                                                 000  \
4 0.000000000000000000000000000000000000000000000...   

                                             000 100  
4 0.000000000000000000000000000000000000000000000...  


## 6) Create a NearestNeighbor Model. Write the description of your ideal datascience job and query your job listings. 

In [75]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree')
nn.fit(dtm0_df)

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [76]:
nn.kneighbors([dtm0_df.iloc[8]])

(array([[0.        , 1.26290975, 1.31038307, 1.33774465, 1.3452615 ]]),
 array([[  8, 419, 226, 129,  14]], dtype=int64))

In [78]:
trimmings[8][:400]

' MS in a quantitative discipline such as Statistics, Mathematics, Physics, Engineering, Computer Science or Economics5+ years work experienceProficiency in at least one statistical software package such as Python, R or MatlabExpertise using SQL for acquiring and transforming dataOutstanding quantitative modeling and statistical analysis skillsExcellent verbal and written communication skills with '

In [82]:
trimmings[419][:400]

' Bachelors or Masters degree in a quantitative field such as Statistics, Applied Mathematics, Physics, Engineering, Computer Science, or Economics 2+ years of relevant working experience in an analytical role involving data extraction, analysis, and communication 2+ years of experience with data querying languages (e.g. SQL, Hadoop Hive) and statistical mathematical software (e.g. R, Weka, Matlab,'

In [84]:
ideal_job = ["""
    Junior Machine Learning Engineer to work in a Machine
    Learning team solving problems unique to geothermal
    energy production. Must be willing to do site visits at
    facilities throughout the west coast. Remote option
    available.
"""]

In [85]:
new_mtrx = tfidf.transform(ideal_job)

In [88]:
new_vec = new_mtrx.todense()

In [89]:
nn.kneighbors(new_vec)

(array([[1.36018599, 1.36480238, 1.36830694, 1.37023951, 1.37269608]]),
 array([[  2, 261, 173, 297, 283]], dtype=int64))

In [92]:
trimmings[2]

' As a Data Scientist you will be working on consulting side of our business. You will be responsible for analyzing large, complex datasets and identify meaningful patterns that lead to actionable recommendations. You will be performing thorough testing and validation of models, and support various aspects of the business with data analytics. Ability to do statistical modeling, build predictive models and leverage machine learning algorithms. This position will combine the typical Data Scientist math and analytical skills, with research, advanced business, communication, and presentation skills. Primary job location is in Sacramento, but work-from-home option is available.  Qualifications Bachelors, MS or PhD in a relevant field (Computer Science, Engineering, Statistics, Physics, Applied Math) Experience in R and or Python is preferred '

In [94]:
trimmings[261]

' The Data Science Engineer, Mintel Futures is a core part of Mintels data science team that will have the opportunity to work on a wide array of initiatives across varying aspects of Mintels business and data. This individual will help manage the full analytics lifecycle of advanced projects; help to identify new, impactful ways to apply machine learning to Mintels data; work alongside data scientists and business stakeholders to implement solutions that provide valuable insights; and aide in the design and development of a modern data analytics environment.  What You Will Do:  Play an integral role in shaping the underlying technology environment for Mintels fast growing team of data scientists and data analysts Assist in the acquisition and management of a variety of data sources for large-scale analysis Identify opportunities for predictive modeling or other machine learning techniques and experiment with solutions that focus on adding value to our clients and analysts Work alongsi

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings
 - Try and identify requirements for experience specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 