<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the description column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.
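A minimal sketch of the tag-stripping step on a made-up snippet. The sample HTML and the `strip_tags` helper name are illustrative and do not come from the dataset. Passing `separator=" "` is one way to keep text from adjacent tags from running together:

```python
from bs4 import BeautifulSoup


def strip_tags(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    # Name the parser explicitly; omitting it makes bs4 guess one and warn.
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " puts a space between text from adjacent tags;
    # strip=True trims whitespace around each piece of text.
    return soup.get_text(separator=" ", strip=True)


sample = "<div><p>Job Requirements:</p><ul><li>Python</li><li>SQL</li></ul></div>"
print(strip_tags(sample))
```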

In [3]:
# BeautifulSoup and pandas are already imported in the first cell
df = pd.read_csv('./data/job_listings.csv')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


## 2) Use spaCy to tokenize the listings

In [5]:
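# function to strip HTML tags from a job description
def clean_description(desc):
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(desc, "html.parser")
    return soup.get_text()

df['clean_desc'] = df['description'].apply(clean_description)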

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,description,title,clean_desc
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,"b""Job Requirements:\nConceptual understanding ..."
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"b'Job Description\n\nAs a Data Scientist 1, yo..."
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,b'As a Data Scientist you will be working on c...
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"b'$4,969 - $6,756 a monthContractUnder the gen..."
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,b'Location: USA \xe2\x80\x93 multiple location...
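The cleaned text above still starts with `b'` and contains escape sequences such as `\xe2\x80\x93`, which suggests the CSV stores the printed form of a bytes object rather than decoded text. A minimal sketch of undoing that, assuming the values really are Python bytes literals encoded as UTF-8 (the `decode_bytes_literal` helper is hypothetical):

```python
import ast


def decode_bytes_literal(text):
    """Decode a stringified bytes literal (b'...') into text."""
    if not text.startswith(("b'", 'b"')):
        return text  # not a bytes literal; leave unchanged
    try:
        raw = ast.literal_eval(text)  # safely evaluate the literal into bytes
    except (ValueError, SyntaxError):
        return text  # truncated or malformed literal; leave unchanged
    return raw.decode("utf-8", errors="replace")


sample = r"b'Location: USA \xe2\x80\x93 multiple locations'"
print(decode_bytes_literal(sample))
```

Applied to the `description` column before the HTML is stripped, a step like this would likely keep fragments such as `xe2` and `x80` out of the vocabulary built later.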


In [7]:
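# Use spaCy to tokenize text
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_lg")

# A Tokenizer built from the vocab alone has no prefix/suffix rules,
# so it splits on whitespace only and punctuation stays attached to words
tokenizer = Tokenizer(nlp.vocab)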

In [46]:
tokens = []

for doc in tokenizer.pipe(df["clean_desc"], batch_size = 500):
    
    doc_tokens = []
    
    for token in doc:
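        if not token.is_stop and not token.is_punct:
            # str([...])[1:-1] yields the repr of the lowercased token,
            # so each stored token keeps its surrounding quote characters
            res = str(token.text.lower().split())[1:-1]
            doc_tokens.append(res)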
            
    tokens.append(doc_tokens)

df["tokens"] = tokens

In [47]:
df.tail()

Unnamed: 0.1,Unnamed: 0,description,title,clean_desc,tokens
421,421,"b""<b>About Us:</b><br/>\nWant to be part of a ...",Senior Data Science Engineer,"b""About Us:\nWant to be part of a fantastic an...","['b""about', 'us:\\nwant', 'fantastic', 'fun', ..."
422,422,"b'<div class=""jobsearch-JobMetadataHeader icl-...",2019 PhD Data Scientist Internship - Forecasti...,"b'InternshipAt Uber, we ignite opportunity by ...","[""b'internshipat"", 'uber,', 'ignite', 'opportu..."
423,423,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist - Insurance,"b'$200,000 - $350,000 a yearA million people a...","[""b'$200,000"", '$350,000', 'yeara', 'million',..."
424,424,"b""<p></p><div><p>SENIOR DATA SCIENTIST</p><p>\...",Senior Data Scientist,"b""SENIOR DATA SCIENTIST\nJOB DESCRIPTION\n\nAB...","['b""senior', 'data', 'scientist\\njob', 'descr..."
425,425,b'<div></div><div><div><div><div><p>Cerner Int...,Data Scientist,"b'Cerner Intelligence is a new, innovative org...","[""b'cerner"", 'intelligence', 'new,', 'innovati..."


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [50]:
data = df["clean_desc"]

In [51]:
# Count Vectorizer
# instantiate vectorizer object
vect = CountVectorizer(stop_words='english')

# build vocab
vect.fit(data)

# sparse dtm
dtm = vect.transform(data)

In [52]:
#vect.get_feature_names()

In [53]:
type(dtm.todense())

numpy.matrix

In [54]:
dtm.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 2, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0]], dtype=int64)

In [55]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names())
dtm.head()

Unnamed: 0,00,000,02115,03,0356,04,062,06366,08,10,...,zenreach,zero,zeus,zf,zheng,zillow,zones,zoom,zuckerberg,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
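To make the document-term matrix easier to read, here is a small sketch on a two-document toy corpus (the corpus and the `toy_` names are made up for illustration). Each column is one vocabulary term and each cell is how often that term occurs in the document:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "data science and data engineering",
    "machine learning for data science",
]

toy_vect = CountVectorizer(stop_words="english")
toy_dtm = toy_vect.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(terms)
print(toy_dtm.toarray())
```

The stop words "and" and "for" are dropped, and "data" gets a count of 2 in the first row because it appears twice there.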


## 4) Visualize the most common word counts

In [56]:
sum(dtm["zillow"])

7

In [57]:
# total count of each term across all listings
counts = dtm.sum(axis=0)

words = pd.DataFrame({"words": counts.index, "count": counts.values})

In [58]:
words["rank"] = words["count"].rank(method="first", ascending=False)

In [59]:
words.tail()

Unnamed: 0,words,count,rank
9811,zillow,7,2665.0
9812,zones,1,9815.0
9813,zoom,1,9816.0
9814,zuckerberg,2,6441.0
9815,zurich,2,6442.0


In [60]:
words[words["rank"] <= 10]

Unnamed: 0,words,count,rank
539,analytics,730,10.0
1116,business,1198,5.0
2018,data,4394,1.0
2885,experience,1238,4.0
4389,learning,912,9.0
8063,science,956,8.0
8877,team,972,7.0
9629,work,976,6.0
9663,x80,1404,3.0
9780,xe2,1417,2.0


## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [61]:
data = df["clean_desc"]

In [62]:
# Term Frequency - Inverse Document Frequency (Tf-Idf)
# instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english',
                        ngram_range=(1,2),
                        max_df=.97,
                        min_df=4)

# build the vocabulary and compute the TF-IDF weights
dtm = tfidf.fit_transform(data)
dtm.todense()

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.10237496, 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.03965295, 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [63]:
# dataframe
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())
dtm.tail()

Unnamed: 0,000,000 employees,04,10,10 time,10 years,100,100 000,100 companies,100 countries,...,years nrequirements,years professional,years related,years relevant,years work,years working,years xe2,yes,york,york city
421,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043372,0.0
422,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
423,0.102375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
424,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
425,0.039653,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
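To show where these weights come from, the sketch below recomputes TF-IDF by hand on a toy corpus and compares it with `TfidfVectorizer` under its default settings: smoothed idf, $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, followed by L2 normalization of each row. The corpus and variable names are made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["data science team", "data engineering team", "machine learning"]

# Raw term counts (tf), one row per document.
tf = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)

n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)                  # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf

weighted = tf * idf
# L2-normalize each row so every document vector has unit length.
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, sk))
```

Terms that occur in fewer documents get a larger idf, so they contribute more to a document's vector than terms that appear everywhere.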


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [64]:
# K-NN
from sklearn.neighbors import NearestNeighbors

# fit on dtm
nn = NearestNeighbors(n_neighbors=5, algorithm="ball_tree")
nn.fit(dtm)

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [65]:
dtm.iloc[421]

000              0.000000
000 employees    0.000000
04               0.000000
10               0.000000
10 time          0.000000
                   ...   
years working    0.000000
years xe2        0.000000
yes              0.000000
york             0.043372
york city        0.000000
Name: 421, Length: 8357, dtype: float64

In [66]:
dtm.iloc[421].values

array([0.        , 0.        , 0.        , ..., 0.        , 0.04337169,
       0.        ])

In [70]:
nn.kneighbors([dtm.iloc[421].values])

(array([[0.        , 1.26436697, 1.26545639, 1.27829784, 1.27854697]]),
 array([[421, 369, 351, 145, 401]], dtype=int64))
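The distances returned here are Euclidean. Assuming the TF-IDF rows are L2-normalized (the `TfidfVectorizer` default), Euclidean distance and cosine similarity are tied together by $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$. A small sketch that checks the identity on random unit vectors and converts a distance into a similarity (`distance_to_cosine` is a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.random(5), rng.random(5)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # unit-length vectors

euclid = np.linalg.norm(a - b)
cosine_sim = a @ b
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(euclid ** 2, 2 - 2 * cosine_sim))


def distance_to_cosine(d):
    """Cosine similarity implied by Euclidean distance d between unit vectors."""
    return 1 - d ** 2 / 2


print(distance_to_cosine(1.26436697))
```

Under that assumption, the closest non-identical listing above (distance 1.26436697) has a cosine similarity of roughly 0.20. Non-negative unit vectors are at most $\sqrt{2} \approx 1.41$ apart, so even the nearest listings share only a modest part of their vocabulary.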

In [71]:
data[421][:150]

'b"About Us:\\nWant to be part of a fantastic and fun startup that\\xe2\\x80\\x99s revolutionizing the online travel advertising space? Want to join a data'

In [72]:
data[369][:150]

"b'Job Description\\n\\nAs a Data Scientist at Square, you will lead projects that derive value from our unique, rich, and rapidly growing data. We partn"

In [73]:
data[145][:150]

'b"Fiat Chrysler Automobiles is looking to fill the full-time position of a Data Scientist. This position is responsible for delivering insights to the'

In [74]:
# New Description
new_description = [ """
You will support the Investment Management FinTech Strategy mission of exploring technologies that could dramatically improve investment performance or make capital markets work better, and to embed a culture of innovation in IMG. The IMFS Senior Data Scientist is primarily responsible for identifying potential experiments with financial technology related to alternative data sources, predictive analytics, and machine learning; will be accountable for executing experiments with increased complexity; and is responsible for growing the organization’s data science capabilities and tool sets through mentoring other crew.

Requirements

In This Role You Will:
• Execute the strategic direction of IMFS program to implement new data strategies.
• Acquire structured and unstructured data and prepares it for analysis. Investigates, extracts, cleans, transforms, and manages data using a variety of approaches.
• Build and maintain FinTech Strategy data architecture, software, and model libraries, and process routines.
• Conceptualize, code, and implement new analytic processes for IMFS using Big Data technologies.
• Use statistical and machine learning applications to unearth insights in the investment landscape.
• Develop and optimize process and data architectures, data preparation, and normalization processes, and feature engineering techniques.
• Explore developing financial technologies and data science techniques and machine learning applications.
• Continue to develop deep knowledge of financial markets and market structure changes in order to bring an informed perspective to current financial technology landscape.

Impact

Reimagine The Investment Experience

A World-class client experience is something that can only be defined by our clients. Owning the development and delivery of advanced, actionable analytic products and services you will work to uncover difficulties and challenges a client may face and how they may differ across product lines. Using advanced analytics approaches such as predictive and prescriptive modeling your goal will be to bring tangible and significant business impact on Vanguard’s businesses and clients to drive overall efficiency.

Qualifications

What it Takes:
• Undergraduate degree or equivalent combination of training and experience. Advanced degree in a quantitative discipline preferred. Minimum five years experience in a technical or related financial discipline or combination of education and certifications.
• Understanding of system design architecture, including data modeling techniques, OOP and design patterns.
• Experience working with large data sets, data preparation processes, normalization, standardization, preprocessing techniques, and common data science concepts.
• Understanding of supervised/unsupervised machine learning techniques and their business applications. Strong proficiency and technical skills with some of the following: MATLAB, SQL Server, Oracle, NoSQL, C++, R, Python (Numpy, SciKit, Pandas, ML libraries ), Spark (and MLlib), Scala, Big Data technologies and ecosystems (Impala, Hive, Presto, AWS EMR, Apache, etc)
• Superior knowledge of investments in areas such as equities, fixed income, and alternative investment strategies.
• Exceptional conceptual, analytical, and problem solving skills, with proven ability to build/execute.
    
"""]

In [75]:
new = tfidf.transform(new_description)
# toarray() returns an ndarray; recent scikit-learn versions reject np.matrix input
nn.kneighbors(new.toarray())

(array([[1.25726268, 1.26905021, 1.27352511, 1.27585551, 1.28187813]]),
 array([[293, 215, 199, 201, 411]], dtype=int64))
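An alternative worth trying is to query by cosine distance directly on the sparse TF-IDF matrix, which avoids building a dense array. This is a sketch on a toy corpus, not the notebook's method; the listings, the query and the `toy_` names are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

toy_listings = [
    "python machine learning models",
    "financial markets investment analysis",
    "web design and front end development",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_listings)  # stays sparse

# Brute-force search supports the cosine metric and sparse input.
toy_nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
toy_nn.fit(toy_matrix)

query = toy_tfidf.transform(["machine learning with python"])
distances, indices = toy_nn.kneighbors(query)
print(indices[0][0], distances[0][0])
```

Words in the query that the vectorizer never saw during fitting (here "with") are ignored by `transform`, so only the shared vocabulary influences the match.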

In [76]:
# most relevant result
data[293][:500]

"b'General Information\\nRef #: 25256\\nEmployee Type: Full Time\\nLocation: New York\\nExperienced Required: Please See Below\\nEducation Required: Masters Degree\\nDate published: 15-Mar-2019\\nAbout Us:\\nWe are PIMCO, a leading global asset management firm. We manage investments and develop solutions across the full spectrum of asset classes, strategies and vehicles: fixed income, equities, commodities, asset allocation, ETFs, hedge funds and private equity. PIMCO is one of the largest investment man"

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
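
For the clustering stretch goal, one possible starting point (an assumption, not a prescribed solution) is to reduce the TF-IDF matrix to a few latent dimensions with truncated SVD, rescale each row to unit length, and cluster in that smaller space. The hint's concern about Euclidean distance in high dimensions is what motivates the reduction step. The toy documents below are made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

toy_docs = [
    "python machine learning models",
    "machine learning research python",
    "financial markets investment analysis",
    "investment analysis financial reporting",
]

X = TfidfVectorizer().fit_transform(toy_docs)

# Reduce to a few latent dimensions, then rescale rows to unit length so
# Euclidean k-means behaves like clustering on cosine similarity.
lsa = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsa)
print(labels)
```

On real listings the number of components and clusters would need tuning, and other algorithms may suit the data better.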