
## *Data Science Unit 4 Sprint 2 Assignment 2*

# Document Representations: Bag-Of-Words

In [1]:
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# 1) (optional) Scrape 100 Job Listings that contain the title "Data Scientist" from indeed.com

At a minimum your final dataframe of job listings should contain
- Job Title
- Job Description

If you choose not to scrape the data, there is a CSV with outdated data in the directory. Remember, if you scrape Indeed, you're helping yourself find a job. ;)

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/master/module2-vector-representations/job_listings.csv').drop('Unnamed: 0', axis=1)
df.head()

Unnamed: 0,description,title
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


# 2) Use spaCy to tokenize / clean the listings

In [3]:
from bs4 import BeautifulSoup
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

In [4]:
def clean(series):
    cleaned = []
    for row in series:
        soup = BeautifulSoup(row, 'html.parser')  # use BeautifulSoup to strip HTML tags
        # Keep only letters and spaces, then drop the leading 'b' left over from the
        # stringified bytes literal (b'...'); escaped bytes such as \xe2 survive as 'xe'
        row = re.sub(r'[^a-zA-Z ]', '', soup.text.lower().replace('\\n', ' '))[1:]

        doc = nlp(row)
        # Keep every lemma except '-PRON-', the pronoun placeholder used by spaCy v2
        tokens = [token.lemma_ for token in doc if (token.lemma_ != '-PRON-')]
        
        cleaned.append(tokens)
    return cleaned

In [5]:
df['description_tokens'] = clean(df['description'])

In [6]:
df.head()

Unnamed: 0,description,title,description_tokens
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,"[job, requirement, conceptual, understanding, ..."
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"[job, description, , as, a, data, scientist, ..."
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,"[as, a, data, scientist, will, be, work, on, c..."
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"[ , a, monthcontractunder, the, general, sup..."
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,"[location, usa, xexx, multiple, location, , y..."
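
The `xexx` tokens in the output above come from the raw data: each description is stored as the text of a Python bytes literal (`b'...'`), so escapes such as `\xe2\x80\x93` get cleaned as if they were ordinary characters. Below is a minimal sketch, assuming every description is a valid bytes literal, of decoding the text before stripping HTML. `decode_description` and `sample` are hypothetical names used only for illustration.

```python
import ast


def decode_description(raw):
    """Turn a stringified bytes literal such as "b'...'" back into text."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal at all: return the input unchanged
        return raw
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return str(value)


# Hypothetical sample modeled on the fifth row of the table above
sample = r"b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li></ul>'"
print(decode_description(sample))
```

Applied before the HTML stripping step, this would turn the escaped bytes into a real en dash, which the letters-only regex then removes cleanly.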


# 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [7]:
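def c_vectorizer(text):
    # Fit the vocabulary and build the document-term matrix in one pass
    vectorizer = CountVectorizer(stop_words='english')
    dtm = vectorizer.fit_transform(text)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    return pd.DataFrame(dtm.todense(), columns=vectorizer.get_feature_names_out())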

# 4) Visualize the most common word counts

In [8]:
# Flattening makes every token its own one-word document, so the matrix has one
# row per token; summing and sorting it is slow, so only the first 100 listings are used
nested_list = df['description_tokens'][0:100].to_list()
tokens_list = [item for sublist in nested_list for item in sublist]

dtm_df = c_vectorizer(tokens_list)
dtm_df.sum().sort_values(ascending=False).head(10)

datum         658
experience    387
work          348
team          248
business      243
use           179
product       165
analysis      162
model         160
science       159
dtype: int64
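
The heading asks for a visualization, and a horizontal bar chart is one simple way to show these totals. This sketch hard-codes the top five counts printed above rather than recomputing them; the variable name `top_counts` and the labels are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Top five totals copied from the output above
top_counts = pd.Series(
    {"datum": 658, "experience": 387, "work": 348, "team": 248, "business": 243}
)

fig, ax = plt.subplots(figsize=(8, 4))
# Sort ascending so the largest bar ends up at the top of the chart
top_counts.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Occurrences in first 100 listings")
ax.set_title("Most common lemmas")
fig.tight_layout()
```

In the notebook itself, passing `dtm_df.sum().sort_values(ascending=False).head(10)` in place of `top_counts` would plot the full top ten.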

# 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [9]:
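tfidf = TfidfVectorizer(stop_words='english', max_features=5000, min_df=5)

# Note: flattening treats every token as its own one-word document,
# so each row of the matrix is a single token rather than a listing
nested_list = df['description_tokens'].to_list()
tokens_list = [item for sublist in nested_list for item in sublist]

dtm = tfidf.fit_transform(tokens_list)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_df = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names_out())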
tfidf_df.head()

Unnamed: 0,ab,ability,able,abstract,academic,accelerate,accept,access,accessibility,accessible,...,yeti,york,young,youxexxll,youxexxre,youxexxve,yrs,zf,zillow,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [None]:
job = 'Natural language processing and cool shit'
job_tokens = job.lower().split()
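
The cell above stops before any model is built. One possible way to finish the step is sketched here, assuming a TF-IDF matrix with one row per listing and cosine distance as the similarity measure. The `listings` and `ideal_job` strings are hypothetical examples, not data from the CSV.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical listings, one string per document
listings = [
    "build natural language processing models for text classification",
    "design dashboards and sql reports for business stakeholders",
    "computer vision research with deep learning on images",
]

tfidf = TfidfVectorizer(stop_words="english")
dtm = tfidf.fit_transform(listings)

# Brute-force search supports cosine distance on sparse matrices
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(dtm)

# Vectorize the query with the SAME fitted vectorizer, then look up neighbors
ideal_job = "natural language processing and text"
query = tfidf.transform([ideal_job])
distances, indices = nn.kneighbors(query)

for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {listings[idx]}")
```

The vectorizer expects raw strings, so the query is passed as a one-element list of text rather than as a pre-split token list.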

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
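
As one illustration of the hint, the sketch below clusters TF-IDF vectors with DBSCAN using cosine distance instead of Euclidean distance. This is only one of several reasonable options, not a recommended answer. The toy documents are hypothetical, and `eps=0.5` and `min_samples=2` are values chosen to suit this tiny corpus that would need tuning on real listings.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical listings with two obvious themes
docs = [
    "python machine learning model training",
    "machine learning model python deployment",
    "sql dashboard reporting business",
    "business reporting sql excel dashboard",
]

vectors = TfidfVectorizer().fit_transform(docs).toarray()

# Cosine distance compares direction rather than magnitude, which tends to
# behave better than Euclidean distance on sparse, high-dimensional text vectors
clusterer = DBSCAN(eps=0.5, min_samples=2, metric="cosine", algorithm="brute")
labels = clusterer.fit_predict(vectors)

for label, doc in zip(labels, docs):
    print(label, doc)
```

A label of `-1` would mark a document that DBSCAN considers noise, which is useful for spotting listings that fit no common theme.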