<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [93]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the description column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

`Tip:` You will need to install the `bs4` library inside your conda environment. 

In [59]:
from bs4 import BeautifulSoup
import requests

df = pd.read_csv('data/job_listings.csv', index_col=0)
df.head()

Unnamed: 0,description,title
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [60]:
def clean_descriptions(desc):

    # replace escaped UTF-8 byte sequences (left over from the bytes repr) with ASCII equivalents
    desc = (desc.
            replace('\\xe2\\x80\\x99', "'").
            replace('\\xc3\\xa9', 'e').
            replace('\\xc2\\xa8', '').
            replace('\\xe2\\x80\\x90', '-').
            replace('\\xe2\\x80\\x91', '-').
            replace('\\xe2\\x80\\x92', '-').
            replace('\\xe2\\x80\\x93', '-').
            replace('\\xe2\\x80\\x94', '-').
            replace('\\xe2\\x80\\x98', "'").
            replace('\\xe2\\x80\\x9b', "'").
            replace('\\xe2\\x80\\x9c', '"').
            replace('\\xe2\\x80\\x9d', '"').
            replace('\\xe2\\x80\\x9e', '"').
            replace('\\xe2\\x80\\x9f', '"').
            replace('\\xe2\\x80\\xa6', '...').
            replace('\\xe2\\x80\\xb2', "'").
            replace('\\xe2\\x80\\xb3', "'").
            replace('\\xe2\\x80\\xb4', "'").
            replace('\\xe2\\x80\\xb5', "'").
            replace('\\xe2\\x80\\xb6', "'").
            replace('\\xe2\\x80\\xb7', "'").
            replace('\\xe2\\x81\\xba', "+").
            replace('\\xe2\\x81\\xbb', "-").
            replace('\\xe2\\x81\\xbc', "=").
            replace('\\xe2\\x81\\xbd', "(").
            replace('\\xe2\\x81\\xbe', ")")
           )
    # use BeautifulSoup to strip html tags
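    # name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(desc, 'html.parser')
    # drop <script> and <style> elements together with their contents
    for tag in soup(['script', 'style']):
        tag.decompose()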
    desc = ' '.join(soup.stripped_strings)
    
    # replace newlines (escaped and real) and slashes with spaces; drop '$' and ','
    desc = (desc.
            replace('\\n', ' ').
            replace('\n',  ' ').
            replace('/', ' ').
            replace('$', '').
            replace(',', '')
           )
    
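    # remove the b' (or b") prefix left at the beginning of each string;
    # note: this assumes every input starts with that two-character prefix
    desc = desc[2:]
    
    # remove non-alphabetic characters
    desc = re.sub(r'[^a-zA-Z ]', '', desc)
    
    desc = desc.strip()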
    
    return desc
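
The long `replace` chain above is needed because each description is stored as the printed form of a Python `bytes` object (`b'...'`), so non-ASCII characters show up as escaped UTF-8 sequences such as `\xe2\x80\x93`. As an alternative sketch (not what this notebook ran), the standard library can parse such a literal and decode it in one step. The helper name `decode_bytes_literal` is hypothetical, and the approach assumes each value is a complete, well-formed literal.

```python
import ast

def decode_bytes_literal(raw):
    """Turn text like "b'caf\\xc3\\xa9'" into the string it encodes.

    Values that are not a valid bytes literal are returned unchanged.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(value, bytes):
        # undecodable bytes become U+FFFD instead of raising
        return value.decode('utf-8', errors='replace')
    return raw

raw = "b'Location: USA \\xe2\\x80\\x93 multiple locations'"
print(decode_bytes_literal(raw))  # Location: USA – multiple locations
```

After decoding, the text still contains HTML tags, so the BeautifulSoup step would remain necessary.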

In [61]:
df['description'][0]

'b"<div><div>Job Requirements:</div><ul><li><p>\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them</p>\\n</li><li><p>Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)</p>\\n</li><li><p>Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R</p>\\n</li><li><p>Ability to communicate Model findings to both Technical and Non-Technical stake holders</p>\\n</li><li><p>Hands on experience in SQL/Hive or similar programming language</p>\\n</li><li><p>Must show past work via GitHub, Kaggle or any other published article</p>\\n</li><li><p>Master\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.</p></li></ul><div><div><div><div><div><d

In [62]:
df['description'] = df['description'].apply(clean_descriptions)

df.head()

Unnamed: 0,description,title
0,Job Requirements Conceptual understanding in ...,Data scientist
1,Job Description As a Data Scientist you w...,Data Scientist I
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level
3,a month Contract Under the general supervision...,Data Scientist
4,Location USA multiple locations years of A...,Data Scientist


In [63]:
df['description'][0]

'Job Requirements  Conceptual understanding in Machine Learning models like Naive Bayes KMeans SVM Apriori Linear  Logistic Regression Neural Random Forests Decision Trees KNN along with handson experience in at least  of them   Intermediate to expert level coding skills in Python R Ability to write functions clean and efficient data manipulation are mandatory for this role   Exposure to packages like NumPy SciPy Pandas Matplotlib etc in Python or GGPlot dplyr tidyR in R   Ability to communicate Model findings to both Technical and NonTechnical stake holders   Hands on experience in SQL Hive or similar programming language   Must show past work via GitHub Kaggle or any other published article   Masters degree in Statistics Mathematics Computer Science or any other quant specific field  Apply Now'

## 2) Use spaCy to tokenize the listings

In [68]:
nlp = spacy.load("en_core_web_lg")

In [69]:
# extend spaCy's default stop words with whitespace-only strings;
# this set is passed to CountVectorizer below (tokenize() relies on token.is_stop instead)
STOP_WORDS = nlp.Defaults.stop_words.union([
    ' ',
    '  ',
    '   ',
    '    ',
    '     ',
])

def tokenize(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

In [70]:
df['tokens'] = df['description'].apply(tokenize)

In [71]:
df['tokens'].head()

0    [job, Requirements,  , conceptual, understandi...
1    [job, description,     , Data, Scientist,  , h...
2    [Data, scientist, work, consult, business, res...
3    [month, Contract, general, supervision, Profes...
4    [location, USA,  , multiple, location,    , ye...
Name: tokens, dtype: object
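
The output above still contains whitespace-only tokens such as `' '`. That is expected from the code as written: `tokenize` filters on `token.is_stop` and `token.is_punct`, and the custom `STOP_WORDS` set is never consulted there. A minimal way to drop those tokens, assuming a post-processing pass is acceptable, is sketched below. `drop_blank_tokens` is a hypothetical helper, not part of spaCy.

```python
def drop_blank_tokens(tokens):
    """Keep only tokens that contain at least one non-whitespace character."""
    return [tok for tok in tokens if tok.strip()]

tokens = ['job', 'Requirements', ' ', 'conceptual', '    ', 'Data']
print(drop_blank_tokens(tokens))  # ['job', 'Requirements', 'conceptual', 'Data']
```

It could be applied with `df['tokens'].apply(drop_blank_tokens)`. Checking spaCy's `token.is_space` attribute inside `tokenize` would be another option.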

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [84]:
##### Your Code Here #####
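# recent scikit-learn releases expect stop_words as a list (or 'english'), not a set
vect = CountVectorizer(stop_words=list(STOP_WORDS))

# Learn our vocab
vect.fit(df['description'])

# Get sparse dtm
dtm = vect.transform(df['description'])

# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())
dtm.head()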


Unnamed: 0,aa,aap,aas,ab,abernathy,abilities,ability,able,abounds,abroad,...,zfs,zheng,zillow,zillows,zogsports,zones,zoom,zuckerberg,zurich,zurichs
0,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
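
To make the document-term matrix above concrete, here is a dependency-free sketch of the counting step on two made-up sentences. It mimics only the core idea (one row per document, one column per vocabulary word); `CountVectorizer` additionally applies its own tokenization rules, stop words, and sparse storage. The toy documents and the function name `count_matrix` are illustrative assumptions.

```python
from collections import Counter

def count_matrix(docs):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)  # missing words count as 0
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ["data science needs data", "science needs teams"]
vocab, rows = count_matrix(docs)
print(vocab)  # ['data', 'needs', 'science', 'teams']
print(rows)   # [[2, 1, 1, 0], [0, 1, 1, 1]]
```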


## 4) Visualize the most common word counts

In [89]:
# ten most common words
dtm.sum().sort_values(ascending=False)[:10]

data          4323
experience    1887
business      1203
work          1157
team           966
science        947
learning       917
analytics      731
machine        698
skills         696
dtype: int64
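
The cell above lists the counts but does not draw anything. As one dependency-free way to visualize them, the sketch below prints a horizontal text bar chart from the ten counts shown in the output. The function name `text_bar_chart` and the 40-character scale are arbitrary choices; a graphical version could pass the same labels and values to matplotlib's `plt.barh`.

```python
def text_bar_chart(counts, width=40):
    """Return one line per word, with a bar scaled to the largest count."""
    largest = max(counts.values())
    label_width = max(len(word) for word in counts)
    lines = []
    for word, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = '#' * max(1, round(width * count / largest))
        lines.append(f"{word:<{label_width}} {bar} {count}")
    return lines

# counts copied from the output above
top_words = {
    'data': 4323, 'experience': 1887, 'business': 1203, 'work': 1157,
    'team': 966, 'science': 947, 'learning': 917, 'analytics': 731,
    'machine': 698, 'skills': 696,
}
print('\n'.join(text_bar_chart(top_words)))
```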

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [90]:
##### Your Code Here #####
# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english')

# Learn the vocabulary and IDF weights, then return TF-IDF vectors
# (fit_transform is fit followed by transform)
dtm = tfidf.fit_transform(df['description'])

# Get feature names to use as dataframe column headers
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
dtm = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())

# View Feature Matrix as DataFrame
dtm.head()

Unnamed: 0,aa,aap,aas,ab,abernathy,abilities,ability,able,abounds,abroad,...,zfs,zheng,zillow,zillows,zogsports,zones,zoom,zuckerberg,zurich,zurichs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.094714,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.021094,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.066684,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.108354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [91]:
# ten terms with the largest summed TF-IDF weight
dtm.sum().sort_values(ascending=False)[:10]

data          57.097119
experience    27.080244
business      19.675219
work          17.459894
learning      16.980156
science       15.326844
team          15.285831
analytics     14.426667
machine       13.972056
models        12.977181
dtype: float64
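
For reference, the weights above follow the scheme `TfidfVectorizer` applies with its default settings: raw term counts are multiplied by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and each row is then scaled to unit Euclidean length. The sketch below reproduces that arithmetic in plain Python on a made-up three-document corpus. The corpus and the helper name `tfidf_rows` are illustrative, and whitespace splitting stands in for the vectorizer's real tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (dense, for illustration)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for words in tokenized for w in words})
    n_docs = len(docs)
    # document frequency: number of documents that contain each word
    df = Counter(w for words in tokenized for w in set(words))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for words in tokenized:
        counts = Counter(words)
        weights = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, idf, rows

docs = ["data science team", "data analytics", "data machine learning"]
vocab, idf, rows = tfidf_rows(docs)
print(idf['data'])                     # 1.0, since 'data' appears in every document
print(idf['analytics'] > idf['data'])  # True: rarer words get larger weights
```

This also explains why very frequent words such as "data" still top the summed ranking: the smoothed IDF never drops below 1, so a word that appears in almost every listing keeps its full count weight.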

## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [147]:
##### Your Code Here #####
# Fit on TF-IDF vectors (executed after the matrix is rebuilt with the sample row in In [144])
nn  = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)

In [106]:
sample_desc = """
As part of the advanced analytics team, you will be working closely with clients, information architects, data engineers, project/program managers, and other teams to turn data into critical information and insights that can be used to make sound business decisions. This includes providing data that is congruent and reliable. You’ll need to be creative thinker and propose innovative ways to look at problems. You will work the mining of the data for insights, development and implementation of new and advanced forecasting models using advanced statistics and machine learning methods. You’ll validate findings using an experimental and iterative approach. In this role, you will need to be able to present findings to the business by testing hypotheses and presenting in a way that can be understood by business counterparts.

The Junior Data Scientist will be working on solving business problems by helping to develop models, discovering insights, and identifying opportunities with the use of statistical techniques, visualizations, and succinct narratives to describe findings. In addition to advanced analytical skills, the Junior Data Scientist should be integrating and preparing large, varied datasets, applying advanced analytical techniques, and communicating results. Additionally, you will be responsible for:
• Some predictive modeling, optimization, and simulation to generate business insights.
• Applying pattern recognition techniques to perform descriptive, predictive, and prescriptive insights.
"""

title = "Jr Data Scientist"

In [125]:
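# one-row DataFrame holding the query listing (construction assumed; not shown in the original)
sample = pd.DataFrame({'desciption': [sample_desc], 'title': [title]})
sample.head()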

Unnamed: 0,desciption,title
0,"\nAs part of the advanced analytics team, you ...",Jr Data Scientist


In [131]:
sample.rename(columns={'desciption': 'description'}, inplace=True)

In [136]:
sample['description'] = sample['description'].apply(clean_descriptions)

In [141]:
df = df.dropna()

In [143]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df, sample], ignore_index=True, sort=False)
df.tail()

Unnamed: 0,description,title,tokens
422,Internship At Uber we ignite opportunity by se...,2019 PhD Data Scientist Internship - Forecasti...,"[internship, Uber, ignite, opportunity, set, w..."
423,a year A million people a year die in car coll...,Data Scientist - Insurance,"[year, million, people, year, die, car, collis..."
424,SENIOR DATA SCIENTIST JOB DESCRIPTION ABOU...,Senior Data Scientist,"[senior, DATA, scientist, , job, description,..."
425,Cerner Intelligence is a new innovative organi...,Data Scientist,"[Cerner, Intelligence, new, innovative, organi..."
426,part of the advanced analytics team you will b...,Jr Data Scientist,


In [144]:
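# Re-fit on the listings plus the sample row and return TF-IDF vectors
# (fit_transform is fit followed by transform)
dtm = tfidf.fit_transform(df['description'])

# Get feature names to use as dataframe column headers
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
dtm = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())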

# View Feature Matrix as DataFrame
dtm.head()

Unnamed: 0,aa,aap,aas,ab,abernathy,abilities,ability,able,abounds,abroad,...,zfs,zheng,zillow,zillows,zogsports,zones,zoom,zuckerberg,zurich,zurichs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.094807,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.021118,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.066761,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.108339,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [148]:
# Query using kneighbors; the first hit (distance 0) is the query row itself
nn.kneighbors([dtm.iloc[-1]])

(array([[0.        , 1.24841473, 1.25702896, 1.25702896, 1.26249157]]),
 array([[426, 327, 339, 333, 328]]))
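
`NearestNeighbors` is using Euclidean distance here (`metric='minkowski'`, `p=2` in the repr above). Because `TfidfVectorizer` scales each row to unit length by default, the Euclidean distance `d` between two rows and their cosine similarity are linked by `d**2 == 2 - 2 * cos`, so ranking by distance gives the same order as ranking by cosine similarity. The sketch below checks that identity on two made-up vectors and then converts the closest distance reported above. The helper names are hypothetical, and the conversion assumes the rows really are unit length.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_from_distance(d):
    """Cosine similarity implied by Euclidean distance d between unit vectors."""
    return 1 - d * d / 2

a = normalize([1.0, 2.0, 0.0])
b = normalize([2.0, 1.0, 1.0])
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
cos = sum(x * y for x, y in zip(a, b))
print(math.isclose(dist ** 2, 2 - 2 * cos))        # True

# distance to the closest real listing (index 327) from the output above
print(round(cosine_from_distance(1.24841473), 3))  # 0.221
```

Under that assumption, even the best match shares only a modest amount of weighted vocabulary with the sample description.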

In [156]:
df['description'].iloc[327][:250]

'If youre ready to innovate and help lead the development for Hewlett Packard Enterprises HPE Analytics Platform come join us now You will be part of an organization that is revolutionizing reporting solutions and architecting a data and analytics lan'

In [158]:
df['description'].iloc[-1][:250]

'part of the advanced analytics team you will be working closely with clients information architects data engineers project program managers and other teams to turn data into critical information and insights that can be used to make sound business de'

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
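
The hint about Euclidean distance in high dimensions can be illustrated with a small experiment. The sketch below draws uniformly random points (an assumption made for illustration; it does not use the job-listing data) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` is hypothetical. As the number of dimensions grows, that contrast tends to shrink, which is one reason distance-based clustering becomes less informative on wide TF-IDF matrices.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # math.dist needs Python 3.8+
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 20, 200, 1000):
    print(dim, round(relative_contrast(dim), 3))
```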