<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [2]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup
nlp = spacy.load("en_core_web_lg")

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [58]:
# BeautifulSoup and pandas are already imported in the first cell.
df = pd.read_csv("data/job_listings.csv")
df.head()

| Unnamed: 0.1 | Unnamed: 0 | description | title |
|---|---|---|---|
| 0 | 0 | `b"<div><div>Job Requirements:</div><ul><li><p>...` | Data scientist |
| 1 | 1 | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | Data Scientist I |
| 2 | 2 | `b'<div><p>As a Data Scientist you will be work...` | Data Scientist - Entry Level |
| 3 | 3 | `b'<div class="jobsearch-JobMetadataHeader icl-...` | Data Scientist |
| 4 | 4 | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | Data Scientist |


In [59]:
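def filter_html(html):
    """Strip HTML tags from a listing and return its text.

    Args:
        html (str): HTML markup to parse. In this dataset each value is the
            text of a bytes literal, e.g. "b'<div>...'".

    Returns:
        str: Text content with literal "\\n" sequences replaced by spaces.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Newlines are stored as the two characters backslash + n, so split on that
    # literal sequence. strip("b") drops the leading bytes prefix; the quote
    # characters around the literal are left in place.
    return " ".join(soup.get_text().split("\\n")).strip("b")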

In [60]:
# Write over the original description column with the filtered html text data.
df["description"] = df["description"].apply(filter_html)

In [61]:
df['description']

0      "Job Requirements: Conceptual understanding in...
1      'Job Description  As a Data Scientist 1, you w...
2      'As a Data Scientist you will be working on co...
3      '$4,969 - $6,756 a monthContractUnder the gene...
4      'Location: USA \xe2\x80\x93 multiple locations...
                             ...                        
421    "About Us: Want to be part of a fantastic and ...
422    'InternshipAt Uber, we ignite opportunity by s...
423    '$200,000 - $350,000 a yearA million people a ...
424    "SENIOR DATA SCIENTIST JOB DESCRIPTION  ABOUT ...
425    'Cerner Intelligence is a new, innovative orga...
Name: description, Length: 426, dtype: object
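
The output above still carries the surrounding quote characters and escaped byte sequences such as `\xe2\x80\x93`, because each description is stored as the text of a Python bytes literal. Below is a minimal sketch of one way to undo that before stripping tags. It assumes the values are valid bytes literals holding UTF-8 text; `decode_bytes_literal` is a hypothetical helper, not part of the assignment.

```python
import ast


def decode_bytes_literal(raw):
    """Convert the text of a bytes literal into a regular str.

    Hypothetical helper: assumes the CSV stored repr(bytes) and that the
    bytes are UTF-8. Anything that is not a bytes literal is returned as-is.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw


# A shortened, made-up value in the same shape as the description column.
sample = r"b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li></ul>'"
print(decode_bytes_literal(sample))
```

If the column is decoded this way first (for example with `df["description"].apply(decode_bytes_literal)`), `filter_html` would need adjusting, since its `split("\\n")` and `strip('b')` steps only make sense for the undecoded text.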

## 2) Use spaCy to tokenize the listings

In [62]:
def tokenize(document):
    """Return lowercased lemmas, skipping stop words, punctuation, "$" and " "."""
    doc = nlp(document)
    return [
        token.lemma_.lower().strip()
        for token in doc
        if not token.is_stop and not token.is_punct and token.text not in ("$", " ")
    ]

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [69]:
STOP_WORDS = nlp.Defaults.stop_words.union(["apply", "now", "job", "requirement", "description"])

In [73]:
# Recent scikit-learn releases expect stop_words to be a list, so convert the set.
vect = CountVectorizer(tokenizer=tokenize,
                       stop_words=list(STOP_WORDS))

vect.fit(df["description"])

dtm = vect.transform(df["description"])



## 4) Visualize the most common word counts

In [103]:
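# get_feature_names_out() replaces get_feature_names(), which newer
# scikit-learn releases removed; toarray() returns a plain ndarray.
wc = pd.DataFrame(data=dtm.toarray(), columns=vect.get_feature_names_out())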
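
Section 4 asks for a visualization, which the cell above does not produce. The sketch below shows one possible approach: rank terms by their total count and draw a horizontal bar chart. `top_terms` and `plot_top_terms` are hypothetical helper names, and the toy dictionary stands in for the column totals of `wc`.

```python
def top_terms(totals, n=20):
    """Return the n (term, count) pairs with the largest counts, largest first."""
    return sorted(totals.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


def plot_top_terms(totals, n=20):
    """Draw a horizontal bar chart of the n most frequent terms."""
    # Imported here so that top_terms can be used without matplotlib.
    import matplotlib.pyplot as plt

    terms, counts = zip(*top_terms(totals, n))
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(terms, counts)
    ax.invert_yaxis()  # put the most frequent term at the top
    ax.set_xlabel("Total count across listings")
    ax.set_title(f"Top {n} terms")
    return ax


# Toy totals standing in for wc.sum().to_dict()
toy_totals = {"data": 12, "team": 5, "model": 7, "python": 7}
print(top_terms(toy_totals, n=3))
```

With the notebook's data, a call such as `plot_top_terms(wc.sum().to_dict(), n=20)` would chart the twenty most frequent terms.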

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [96]:
# Recent scikit-learn releases expect stop_words to be a list, so convert the set.
vect = TfidfVectorizer(tokenizer=tokenize,
                       stop_words=list(STOP_WORDS))

vect.fit(df["description"])

dtm = vect.transform(df["description"])
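
To make the weights in this matrix less opaque, here is a pure-Python sketch of the calculation that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts times a smoothed inverse document frequency, followed by L2 normalization of each row. `tfidf_rows` and the toy documents are illustrative only and are not part of the assignment.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Compute L2-normalized TF-IDF weights for pre-tokenized documents.

    Uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed form used by
    TfidfVectorizer when smooth_idf=True (its default).
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for doc in tokenized_docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


docs = [["python", "data", "data"], ["data", "team"], ["python", "sql"]]
for row in tfidf_rows(docs):
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

In the second toy document, "team" receives a higher weight than "data" because it appears in fewer documents, which is the effect TF-IDF is meant to capture.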



## 6) Create a NearestNeighbors model. Write a description of your ideal data science job and query your job listings.

In [98]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(algorithm="kd_tree",
                      n_neighbors=10)

nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                 radius=1.0)

In [100]:
my_desc = "I want to work somewhere where I can use Python and other coding languages to be able to understand cool technologies. Data engineering is something I am very interested in as well. I like to communicate with a team and build useful insights for the company."

my_desc_vect = vect.transform([my_desc])

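# toarray() returns an ndarray; todense() returns np.matrix, which newer
# scikit-learn releases reject.
nn.kneighbors(my_desc_vect.toarray())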

(array([[1.25496227, 1.25496227, 1.30213162, 1.30734456, 1.3091513 ,
         1.32244375, 1.32244375, 1.32274969, 1.32653132, 1.32691683]]),
 array([[ 76, 172, 169, 403, 364, 395, 155, 358, 136, 111]], dtype=int64))
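
The first array holds Euclidean distances. Assuming the vectorizer's default `norm="l2"`, every TF-IDF row has unit length, so each distance d maps to a cosine similarity of 1 - d²/2. The sketch below checks that identity on toy unit vectors; the helper names are hypothetical and nothing here uses the fitted model.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_unit_distance(distance):
    """Recover cosine similarity from the Euclidean distance between unit vectors."""
    return 1 - distance ** 2 / 2


a = normalize([3.0, 0.0, 4.0])
b = normalize([1.0, 2.0, 2.0])
d = euclidean(a, b)
print(round(cosine_from_unit_distance(d), 6), round(cosine_similarity(a, b), 6))
```

Under that assumption, the closest listing's distance of 1.25496227 corresponds to a cosine similarity of roughly 0.21.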

In [102]:
df.iloc[76].description

'\'Title: Data Scientist This role is designed for people who like to dig into data, figure out what matters, and communicate those insights to others. Fast. Background At numo, we create and incubate new "fintech" companies. numo seeks a business-minded, results-oriented data scientist who wants to see the insights they derive be put to work in early stage product concepts. The Venture team, which you will be a key member of, sits at the fuzzy front end of the idea pipeline, partnering with internal stakeholders, 3rd-parties and academics to define new product development opportunities in the fintech space. We\\\'re looking for a hands-on data wizard who wants to experiment with new data sources, analyze data, problem solve about how data can be used, develop predictive models, and deploy and maintain these models as they become part of our product portfolio. Our venture team includes product and business experts, and now we\\\'re seeking to build our data science competency in the sp

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 