<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [4]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy


## 1) *Optional:* Scrape 100 Job Listings that contain the title "Data Scientist" from indeed.com

At a minimum your final dataframe of job listings should contain
- Job Title
- Job Description

If you choose not to scrape the data, there is a CSV with outdated data in the directory. Remember, if you scrape Indeed, you're helping yourself find a job. ;)

In [5]:
from bs4 import BeautifulSoup
import requests

df = pd.read_csv('data/job_listings.csv', index_col=0)
df.head()


Unnamed: 0,description,title
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [6]:
def clean_descriptions(desc):

    # replace escaped UTF-8 byte sequences with ASCII equivalents
    desc = (desc.
            replace('\\xe2\\x80\\x99', "'").
            replace('\\xc3\\xa9', 'e').
            replace('\\xc2\\xa8', '').
            replace('\\xe2\\x80\\x90', '-').
            replace('\\xe2\\x80\\x91', '-').
            replace('\\xe2\\x80\\x92', '-').
            replace('\\xe2\\x80\\x93', '-').
            replace('\\xe2\\x80\\x94', '-').
            replace('\\xe2\\x80\\x98', "'").
            replace('\\xe2\\x80\\x9b', "'").
            replace('\\xe2\\x80\\x9c', '"').
            replace('\\xe2\\x80\\x9d', '"').
            replace('\\xe2\\x80\\x9e', '"').
            replace('\\xe2\\x80\\x9f', '"').
            replace('\\xe2\\x80\\xa6', '...').
            replace('\\xe2\\x80\\xb2', "'").
            replace('\\xe2\\x80\\xb3', "'").
            replace('\\xe2\\x80\\xb4', "'").
            replace('\\xe2\\x80\\xb5', "'").
            replace('\\xe2\\x80\\xb6', "'").
            replace('\\xe2\\x80\\xb7', "'").
            replace('\\xe2\\x81\\xba', "+").
            replace('\\xe2\\x81\\xbb', "-").
            replace('\\xe2\\x81\\xbc', "=").
            replace('\\xe2\\x81\\xbd', "(").
            replace('\\xe2\\x81\\xbe', ")")
           )
    # use BeautifulSoup to strip html tags (and drop script/style contents)
    soup = BeautifulSoup(desc, 'html.parser')
    for st in soup(['script', 'style']):
        st.decompose()
    desc = ' '.join(soup.stripped_strings)
    
    # replace newlines (escaped and literal) and slashes with spaces; drop '$' and ','
    desc = (desc.
            replace('\\n', ' ').
            replace('\n',  ' ').
            replace('/', ' ').
            replace('$', '').
            replace(',', '')
           )
    
    # remove the leading b' left over from the bytes literal
    desc = desc[2:]

    # remove non-alphabetic characters
    desc = re.sub(r'[^a-zA-Z ]', '', desc)

    desc = desc.strip()
    
    return desc
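
The raw descriptions look like the `repr` of Python `bytes` objects (`b'...'`) that were saved as text. As an alternative to replacing escape sequences one at a time, the sketch below parses the literal back into bytes and decodes it as UTF-8. It assumes each value is a well-formed bytes literal; `decode_bytes_repr` is a hypothetical helper name, not part of the original notebook.

```python
import ast


def decode_bytes_repr(text):
    """Turn a string such as "b'caf\\xc3\\xa9'" back into real text.

    Assumption: `text` is the repr of a bytes object holding UTF-8 data.
    Anything that does not parse as a bytes literal is returned unchanged.
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text
    if isinstance(value, bytes):
        # errors="replace" keeps going if a byte sequence is not valid UTF-8
        return value.decode("utf-8", errors="replace")
    return text


raw = "b'<p>A day in the life\\xe2\\x80\\xa6</p>\\n'"
print(decode_bytes_repr(raw))
```

The decoded string still contains HTML, so it would still need the BeautifulSoup step; the difference is that non-ASCII characters are recovered rather than mapped by hand.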


In [9]:
df['description'][1]

'b\'<div>Job Description<br/>\\n<br/>\\n<p>As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare journey. You will do so by empowering and improving the next generation of Accolade Applications and user experiences.</p><p><b>\\nA day in the life\\xe2\\x80\\xa6</b></p><ul><li>\\nWork with a small agile team to design and develop mobile applications in an iterative fashion.</li><li>\\nWork with a tight-knit group of development team members in Seattle.</li><li>\\nContribute to best practices and help guide the future of our applications.</li><li>\\nOperates effectively as a collaborative member of the development team.</li><li>\\nOperates effectively as an individual for quick turnaround of enhancements and fixes.</li><li>\\nResponsible for meeting expectations and deliverables on time with high quality.</li><li>\\nDrive and implement new features within our mobile applications.</li><li>\\nP

In [10]:
df['description'] = df['description'].apply(clean_descriptions)

df.head()

Unnamed: 0,description,title
0,Job Requirements Conceptual understanding in ...,Data scientist
1,Job Description As a Data Scientist you w...,Data Scientist I
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level
3,a month Contract Under the general supervision...,Data Scientist
4,Location USA multiple locations years of A...,Data Scientist


In [11]:
df['description'][1]

'Job Description     As a Data Scientist  you will help us build machine learning models data pipelines and microservices to help our clients navigate their healthcare journey You will do so by empowering and improving the next generation of Accolade Applications and user experiences  A day in the life  Work with a small agile team to design and develop mobile applications in an iterative fashion  Work with a tightknit group of development team members in Seattle  Contribute to best practices and help guide the future of our applications  Operates effectively as a collaborative member of the development team  Operates effectively as an individual for quick turnaround of enhancements and fixes  Responsible for meeting expectations and deliverables on time with high quality  Drive and implement new features within our mobile applications  Perform thorough manual testing and writing test cases that cover all areas  Identify new development tools approaches that will increase code quality 

## 2) Use spaCy to tokenize / clean the listings

In [12]:
##### Your Code Here #####
nlp = spacy.load("en_core_web_lg")

In [13]:
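# Extend spaCy's default stop words with whitespace-only tokens.
# Note: tokenize() below relies on token.is_stop and does not consult this set;
# it is used later as the stop word list for CountVectorizer.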
STOP_WORDS = nlp.Defaults.stop_words.union([
    ' ',
    '  ',
    '   ',
    '    ',
    '     ',
])

def tokenize(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

In [14]:
df['tokens'] = df['description'].apply(tokenize)

In [15]:
df['tokens'].head()

0    [job, Requirements,  , conceptual, understandi...
1    [job, description,     , Data, Scientist,  , h...
2    [Data, scientist, work, consult, business, res...
3    [month, Contract, general, supervision, Profes...
4    [location, USA,  , multiple, location,    , ye...
Name: tokens, dtype: object
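
The token lists above still contain whitespace-only entries, because spaCy treats runs of spaces as tokens that are neither stop words nor punctuation. A minimal sketch of dropping them after the fact with plain Python follows; `drop_blank_tokens` is a hypothetical helper, and the sample list is modeled on the output above rather than copied exactly. Inside `tokenize` itself, checking spaCy's `token.is_space` attribute should have a similar effect.

```python
def drop_blank_tokens(tokens):
    """Return the tokens that contain at least one non-whitespace character."""
    return [tok for tok in tokens if tok.strip()]


# Sample modeled on the first few tokens of listing 1
tokens = ["job", "description", "     ", "Data", "Scientist", "  "]
print(drop_blank_tokens(tokens))
```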

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [16]:
##### Your Code Here #####
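# Recent scikit-learn releases expect a list here, so convert the set
vect = CountVectorizer(stop_words=list(STOP_WORDS))

# Learn the vocabulary
vect.fit(df['description'])

# Get the sparse document-term matrix
dtm = vect.transform(df['description'])

# Densify for inspection; get_feature_names_out() needs scikit-learn >= 1.0
# (older releases call it get_feature_names())
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names_out())
dtm.head()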



Unnamed: 0,aa,aap,aas,ab,abernathy,abilities,ability,able,abounds,abroad,...,zfs,zheng,zillow,zillows,zogsports,zones,zoom,zuckerberg,zurich,zurichs
0,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4) Visualize the most common word counts

In [17]:
##### Your Code Here #####

# 15 most common words
dtm.sum().sort_values(ascending=False)[:15]

data           4323
experience     1887
business       1203
work           1157
team            966
science         947
learning        917
analytics       731
machine         698
skills          696
analysis        685
models          621
product         577
statistical     576
solutions       531
dtype: int64
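
The cell above prints the totals but does not plot them. One possible visualization is a horizontal bar chart, sketched below with the counts from the output above hard-coded so the block runs on its own. In the notebook, the same values would come from `dtm.sum().sort_values(ascending=False)[:15]`; the figure size and labels are arbitrary choices.

```python
import matplotlib.pyplot as plt

# Totals copied from the notebook output above
top_words = {
    "data": 4323, "experience": 1887, "business": 1203, "work": 1157,
    "team": 966, "science": 947, "learning": 917, "analytics": 731,
    "machine": 698, "skills": 696, "analysis": 685, "models": 621,
    "product": 577, "statistical": 576, "solutions": 531,
}

words = list(top_words)[::-1]  # reversed so the largest bar ends up on top
counts = [top_words[w] for w in words]

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(words, counts)
ax.set_xlabel("Total count across listings")
ax.set_title("15 most common words in job descriptions")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` or `fig.savefig(...)`.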

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [18]:
##### Your Code Here #####
# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english')

# Learn the vocabulary and compute TF-IDF weights per document
# (fit_transform is fit followed by transform on the same data)
dtm = tfidf.fit_transform(df['description'])

# Use the feature names as dataframe column headers
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names_out())

# View Feature Matrix as DataFrame
dtm.head()

Unnamed: 0,aa,aap,aas,ab,abernathy,abilities,ability,able,abounds,abroad,...,zfs,zheng,zillow,zillows,zogsports,zones,zoom,zuckerberg,zurich,zurichs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.094714,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.021094,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.066684,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.108354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# 15 words with the highest total TF-IDF weight
dtm.sum().sort_values(ascending=False)[:15]

data           57.097119
experience     27.080244
business       19.675219
work           17.459894
learning       16.980156
science        15.326844
team           15.285831
analytics      14.426667
machine        13.972056
models         12.977181
product        12.933913
analysis       12.348780
skills         12.084145
statistical    11.547950
solutions      11.058450
dtype: float64
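
The weights above are no longer raw counts. As a rough illustration of where they come from, the sketch below recomputes TF-IDF by hand using the formula `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency, and rows scaled to unit length. Tokenization here is a plain whitespace split, which is a simplification, and `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute TF-IDF with raw counts, idf = ln((1 + n) / (1 + df)) + 1,
    and each row scaled to unit L2 norm.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        # Fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["data science data", "data engineering", "science writing"]
rows = tfidf_rows(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that appear in fewer documents get a larger idf, so in the second toy document "engineering" outweighs "data" even though each occurs once.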

## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [33]:
##### Your Code Here #####
# Fit on TF-IDF Vectors
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)
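
A variation worth considering is to skip the dense DataFrame and search the sparse TF-IDF matrix directly with cosine distance. The sketch below does this on a made-up four-listing corpus; the listings, the query text, and the choice of `metric="cosine"` with `algorithm="brute"` are assumptions for illustration, not the notebook's method. Because `TfidfVectorizer` scales rows to unit length by default, cosine and Euclidean distance should rank neighbors in the same order.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy listings standing in for df['description'] (made up for illustration)
listings = [
    "build machine learning models in python",
    "design dashboards and reports for sales teams",
    "deploy deep learning models to production",
    "manage payroll and employee benefits",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(listings)  # stays a scipy sparse matrix

# Brute-force search with cosine distance accepts sparse input directly
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(X)

# Vectorize the query with the vocabulary learned from the listings
query = tfidf.transform(["machine learning models with python"])
distances, indices = nn.kneighbors(query)

for dist, idx in zip(distances[0], indices[0]):
    print(round(float(dist), 3), listings[idx])
```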

In [34]:
sample_desc = """
We are looking for a Data Scientist who will support our product, 
sales, leadership and marketing teams with insights gained from analyzing company data. 
The ideal candidate is adept at using large data sets to find opportunities for product 
and process optimization and using models to test the effectiveness of different courses 
of action. They must have strong experience using a variety of data mining/data analysis 
methods, using a variety of data tools, building and implementing models, using/creating 
algorithms and creating/running simulations. They must have a proven ability to drive business 
results with their data-based insights. They must be comfortable working with a wide range of 
stakeholders and functional teams. The right candidate will have a passion for discovering solutions 
hidden in large data sets and working with stakeholders to improve business outcomes.
"""

title = "Data Scientist"


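In [35]:
# Vectorize the ideal description with the fitted TF-IDF vocabulary
query = pd.DataFrame(tfidf.transform([sample_desc]).toarray(), columns=dtm.columns)

# Find the 5 closest listings; indices are row positions in df
distances, indices = nn.kneighbors(query)

df.iloc[indices[0]][['title', 'description']]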

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 