
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup
nlp = spacy.load("en_core_web_lg")

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the description column is still messy and full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [2]:
# BeautifulSoup and pandas were imported in the first cell.
# Load the raw listings; the description column still contains HTML markup.
df = pd.read_csv("data/job_listings.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [3]:
def filter_html(html):
    """Parse through given html to remove tags and return text data.
    
    Args:
        html (str): Html structured strings to be parsed.
    
    Returns:
        str: Text data.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Join with a space so text from adjacent tags does not run together.
    return " ".join(soup.stripped_strings)

In [4]:
# Write over the original description column with the filtered html text data.
df["description"] = df["description"].apply(filter_html)
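The `df.head()` output above suggests that each description was saved as the text of a Python bytes literal (note the leading `b'` and escapes such as `\xe2\x80\x93`). If that is the case, stripping tags alone leaves those escapes in the text. The sketch below is one possible way to decode such a value before parsing; `decode_bytes_literal` is a hypothetical helper, not part of the assignment, and the sample string is made up to mirror the shape of the data.

```python
import ast


def decode_bytes_literal(raw):
    """Turn the text of a bytes literal (e.g. "b'...'") into a decoded str.

    Values that are not bytes literals are returned unchanged.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return raw
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw


# Made-up sample shaped like the rows shown by df.head().
sample = "b'<ul><li>Location: USA \\xe2\\x80\\x93 multiple sites</li></ul>'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

Under that assumption, the decoded string could be passed to `filter_html`, for example by chaining two `.apply` calls on the description column.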

## 2) Use spaCy to tokenize the listings

In [17]:
def tokenize(document):
    """Return lowercase lemmas for a document, skipping stop words and punctuation."""
    doc = nlp(document)

    tokens = [
        token.lemma_.lower().strip()
        for token in doc
        if not token.is_stop and not token.is_punct
    ]

    return tokens

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [18]:
vect = CountVectorizer(tokenizer=tokenize)

dtm = vect.fit_transform(df["description"])

In [24]:
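# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call it instead.
wc_df = pd.DataFrame(data=dtm.toarray(), columns=vect.get_feature_names_out())
wc_df.head()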

Unnamed: 0,"""\ncommvault",$,)\nabout,)\npractical,+,+3.\nat,",\npython",-5,-\n\nopportunities,-\nann,...,|design,|develop,|manage,|stand,|take,|work,||,~$70,~1,~4
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4) Visualize the most common word counts
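
This step has no code in the notebook yet. A minimal sketch of one way to approach it is shown here: sum each term's column in the word-count DataFrame and plot the largest totals as a horizontal bar chart. `top_terms` is a hypothetical helper, and `toy_counts` is a made-up stand-in for `wc_df` so the example runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd


def top_terms(word_counts, n=20):
    """Return the n terms with the highest total count across all documents."""
    return word_counts.sum(axis=0).sort_values(ascending=False).head(n)


# Toy stand-in for wc_df (rows = listings, columns = terms); values are made up.
toy_counts = pd.DataFrame(
    {"data": [3, 2, 4], "python": [1, 0, 2], "team": [0, 1, 0], "sql": [1, 1, 2]}
)

top = top_terms(toy_counts, n=3)

fig, ax = plt.subplots(figsize=(6, 3))
# Sort ascending for plotting so the most common term ends up at the top.
top.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Total count across listings")
ax.set_title("Most common terms")
fig.tight_layout()
# In a notebook the figure renders at the end of the cell; call plt.show() in a script.
```

With the real data, `top_terms(wc_df)` would take the place of the toy call.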

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [26]:
vect = TfidfVectorizer(tokenizer=tokenize)

dtm = vect.fit_transform(df["description"])

## 6) Create a NearestNeighbors model. Write a description of your ideal data science job and query your job listings.

In [33]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(algorithm="kd_tree",
                      n_neighbors=10)

nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                 radius=1.0)

In [34]:
reference_doc = 0
# dtm is a SciPy sparse matrix, so index the row directly; .iloc is a pandas accessor.
reference_vector = dtm[reference_doc]

dist, ind = nn.kneighbors(reference_vector)
print(ind)
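
The task also asks for a query built from your own ideal job description. A minimal sketch of that step follows, under a few assumptions: the listings and the query text are made up, the vectorizer uses scikit-learn's default tokenizer with English stop words instead of the spaCy `tokenize` function so the example runs without a language model, and the `toy_`-prefixed names are hypothetical so they do not overwrite the notebook's variables. It uses brute-force search with cosine distance, which accepts sparse input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up listings standing in for df["description"].
toy_listings = [
    "Build machine learning models in Python and deploy them to production.",
    "Create dashboards and reports in SQL and Tableau for business stakeholders.",
    "Research deep learning methods for natural language processing.",
    "Manage the data engineering team and maintain ETL pipelines.",
]
ideal_job = "Natural language processing research with deep learning."

toy_vect = TfidfVectorizer(stop_words="english")
toy_dtm = toy_vect.fit_transform(toy_listings)

toy_nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
toy_nn.fit(toy_dtm)

# Use transform (not fit_transform) so the query shares the fitted vocabulary.
query = toy_vect.transform([ideal_job])
toy_dist, toy_ind = toy_nn.kneighbors(query)

for distance, index in zip(toy_dist[0], toy_ind[0]):
    print(f"{distance:.3f}  {toy_listings[index]}")
```

The same pattern should carry over to the notebook's objects: transform the description with the fitted `vect`, pass the result to `nn.kneighbors`, and look up the returned indices in `df`.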

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from the Data Scientist job listings?
 - Try to identify the requirements for experience with specific technologies in the job listings. How are those requirements distributed among the listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
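
The hint about Euclidean distance in the clustering goal can be illustrated with a small experiment. The sketch below is an illustration on uniformly random points, not on the job listings: it measures how much farther the farthest point is than the nearest one, relative to the nearest distance. `relative_contrast` is a hypothetical helper, and the point counts and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_points, n_dims):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


contrast = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for n_dims, value in contrast.items():
    print(f"{n_dims:>5} dims: relative contrast = {value:.2f}")
```

As the number of dimensions grows, the contrast shrinks, so the nearest and farthest points become hard to tell apart. This is one reason to consider cosine distance or dimensionality reduction before clustering a TF-IDF matrix.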