<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string
import ftfy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from spacy.tokenizer import Tokenizer
from bs4 import BeautifulSoup
import requests

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text data in the description column is still messy - full of html tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read thru the documentation to accomplish this task. 

`Tip:` You will need to install the `bs4` library inside your conda environment. 

In [111]:
df = pd.read_csv('./data/job_listings.csv')

In [112]:
df.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [113]:
def make_soup(html):
    html = html.lstrip('b"\'').rstrip('"\'')
    
    soup = BeautifulSoup(html, 'html.parser')

    return soup.get_text()

In [114]:
df['description'] = df['description'].apply(make_soup)
df['description'] = df['description'].str.replace(r'\\(x|n)[a-z0-9]{0,2}', ' ')
df['description'] = df['description'].str.replace(r'\s\s+/g', ' ')
df.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,Job Requirements: Conceptual understanding in ...,Data scientist
1,1,"Job Description As a Data Scientist 1, you wi...",Data Scientist I
2,2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level
3,3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist
4,4,Location: USA multiple locations + years o...,Data Scientist


In [115]:
# Just trying out ftfy
ftfy.fix_text(df['description'][0])

"Job Requirements: Conceptual understanding in Machine Learning models like Nai  ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role) Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R Ability to communicate Model findings to both Technical and Non-Technical stake holders Hands on experience in SQL/Hive or similar programming language Must show past work via GitHub, Kaggle or any other published article Master's degree in Statistics/Mathematics/Computer Science or any other quant specific field. Apply Now"

## 2) Use Spacy to tokenize the listings 

In [63]:
nlp = spacy.load('en_core_web_lg')
tokenizer = Tokenizer(nlp.vocab)

In [132]:
STOP_WORDS = nlp.Defaults.stop_words.union(['data','science','scientist', ' ', '  ', '   ', '    ', 's'])

In [133]:
# Tokenizer Pipe

tokens = []

for doc in tokenizer.pipe(df['description']):
    
    doc_tokens = []
    
    for token in doc: 
        if token.text.lower() not in STOP_WORDS:
            if ((token.is_stop == False) & (token.is_punct == False)) & (token.pos_!= 'PRON'):
                doc_tokens.append(token.lemma_.lower())
   
    tokens.append(doc_tokens)

In [134]:
df['tokens'] = tokens

print(df.shape)

df.head()

(426, 4)


Unnamed: 0.1,Unnamed: 0,description,title,tokens
0,0,Job Requirements: Conceptual understanding in ...,Data scientist,"[job, requirements:, conceptual, understand, m..."
1,1,"Job Description As a Data Scientist 1, you wi...",Data Scientist I,"[job, description, 1,, help, build, machine, l..."
2,2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level,"[work, consult, business., responsible, analyz..."
3,3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist,"[$4,969, $6,756, monthcontractunder, general, ..."
4,4,Location: USA multiple locations + years o...,Data Scientist,"[location:, usa, multiple, location, +, year, ..."


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [135]:
data = df['description']

vect = CountVectorizer(stop_words='english')

dtm = vect.fit_transform(data)

dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names())

dtm.head()

Unnamed: 0,00,000,02115,03,0356,04,05,062,06366,08,...,zero,zeus,zf,zheng,zillow,zogsports,zones,zoom,zuckerberg,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4) Visualize the most common word counts

In [130]:
def count(docs):

        word_counts = Counter()
        appears_in = Counter()
        
        total_docs = len(docs)

        for doc in docs:
            word_counts.update(doc)
            appears_in.update(set(doc))

        temp = zip(word_counts.keys(), word_counts.values())
        
        wc = pd.DataFrame(temp, columns = ['word', 'count'])

        wc['rank'] = wc['count'].rank(method='first', ascending=False)
        total = wc['count'].sum()

        wc['pct_total'] = wc['count'].apply(lambda x: x / total)
        
        wc = wc.sort_values(by='rank')
        wc['cul_pct_total'] = wc['pct_total'].cumsum()

        t2 = zip(appears_in.keys(), appears_in.values())
        ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
        wc = ac.merge(wc, on='word')

        wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)
        
        return wc.sort_values(by='rank')

In [131]:
# Object from Base Python
from collections import Counter

# The object `Counter` takes an iterable, but you can instaniate an empty one and update it. 
word_counts = Counter()

# Update it based on a split of each of our documents
df['tokens'].apply(lambda x: word_counts.update(x))

# Print out the 10 most common words
word_counts.most_common(10)

[('experience', 1797),
 ('work', 1500),
 ('team', 1112),
 ('business', 1091),
 ('model', 855),
 ('learn', 719),
 ('machine', 677),
 ('product', 677),
 ('build', 662),
 ('s', 563)]

In [137]:
wc = count(df['tokens'])
wc.head(20)

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
9,experience,406,1797,1.0,0.0141,0.0141,0.953052
0,work,372,1500,2.0,0.01177,0.02587,0.873239
236,team,349,1112,3.0,0.008725,0.034596,0.819249
96,business,305,1091,4.0,0.008561,0.043156,0.715962
61,model,297,855,5.0,0.006709,0.049865,0.697183
288,learn,305,719,6.0,0.005642,0.055507,0.715962
56,machine,274,677,7.0,0.005312,0.060819,0.643192
289,product,240,677,8.0,0.005312,0.066131,0.56338
174,build,290,662,9.0,0.005194,0.071325,0.680751
507,analytics,212,559,10.0,0.004386,0.075712,0.497653


## 5) Use Scikit-Learn's tfidfVectorizer to get a TF-IDF feature matrix

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task."

## 6) Create a NearestNeighbor Model. Write the description of your ideal datascience job and query your job listings. 

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task."

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings
 - Try and identify requirements for experience specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 