<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

## 1) *Optional:* Scrape 100 Job Listings that contain the title "Data Scientist" from indeed.com

At a minimum your final dataframe of job listings should contain
- Job Title
- Job Description

If you choose not to scrape the data, there is a CSV with outdated data in the directory. Remember, if you scrape Indeed, you're helping yourself find a job. ;)

In [2]:
jobs = pd.read_csv('data/job_listings.csv')
jobs = jobs[['description', 'title']]
jobs.description[0]

'b"<div><div>Job Requirements:</div><ul><li><p>\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them</p>\\n</li><li><p>Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)</p>\\n</li><li><p>Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R</p>\\n</li><li><p>Ability to communicate Model findings to both Technical and Non-Technical stake holders</p>\\n</li><li><p>Hands on experience in SQL/Hive or similar programming language</p>\\n</li><li><p>Must show past work via GitHub, Kaggle or any other published article</p>\\n</li><li><p>Master\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.</p></li></ul><div><div><div><div><div><d

In [4]:
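test = jobs.description[0]
html = re.compile(r'(<.*?>)')       # any HTML tag, matched non-greedily
test = html.sub('', test)
test = test.replace('\\', ' ')      # literal backslashes (from \n, \xc2, ...) become spaces
punct = re.compile(r'[^\w\s]')      # anything that is not a word character or whitespace
test = punct.sub(' ', test)
test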

'b Job Requirements  nConceptual understanding in Machine Learning models like Nai xc2 xa8ve Bayes  K Means  SVM  Apriori  Linear  Logistic Regression  Neural  Random Forests  Decision Trees  K NN along with hands on experience in at least 2 of them nIntermediate to expert level coding skills in Python R   Ability to write functions  clean and efficient data manipulation are mandatory for this role  nExposure to packages like NumPy  SciPy  Pandas  Matplotlib etc in Python or GGPlot2  dplyr  tidyR in R nAbility to communicate Model findings to both Technical and Non Technical stake holders nHands on experience in SQL Hive or similar programming language nMust show past work via GitHub  Kaggle or any other published article nMaster s degree in Statistics Mathematics Computer Science or any other quant specific field  nApply Now '

In [5]:
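# Strip the leading 'n' left behind where each escaped newline ('\n') lost its backslash
b = re.findall(r'(?<=\s)n\w*.?', test)
for i in b:
    test = test.replace(i, i[1:])

# Drop the first token, the stray 'b' from the bytes-literal prefix
test = ' '.join(test.split()[1:])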

In [6]:
test

'Job Requirements Conceptual understanding in Machine Learning models like Nai xc2 xa8ve Bayes K Means SVM Apriori Linear Logistic Regression Neural Random Forests Decision Trees K NN along with hands on experience in at least 2 of them Intermediate to expert level coding skills in Python R Ability to write functions clean and efficient data manipulation are mandatory for this role Exposure to packages like NumPy SciPy Pandas Matplotlib etc in Python or GGPlot2 dplyr tidyR in R Ability to communicate Model findings to both Technical and Non Technical stake holders Hands on experience in SQL Hive or similar programming language Must show past work via GitHub Kaggle or any other published article Master s degree in Statistics Mathematics Computer Science or any other quant specific field Apply Now'

In [7]:
def wash(df):
    df['description'] = df['description'].apply(lambda x: punct.sub('', x))
    df['description'] = df['description'].apply(lambda x: html.sub('', x))
    
    df['description'] = df['description'].replace('\\', ' ').replace('bdiv', '')

wash(jobs)

In [8]:
jobs

Unnamed: 0,description,title
0,bdivdivJob RequirementsdivullipnConceptual und...,Data scientist
1,bdivJob DescriptionbrnbrnpAs a Data Scientist ...,Data Scientist I
2,bdivpAs a Data Scientist you will be working o...,Data Scientist - Entry Level
3,bdiv classjobsearchJobMetadataHeader icluxsmbm...,Data Scientist
4,bulliLocation USA xe2x80x93 multiple locations...,Data Scientist
5,bdivCreate various Business Intelligence Analy...,Data Scientist
6,bdivpAs Spotify Premium swells to over 96M sub...,Associate Data Scientist – Premium Analytics
7,bEverytown for Gun Safety the nations largest ...,Data Scientist
8,bulliMS in a quantitative discipline such as S...,Sr. Data Scientist
9,bdivpSlack is hiring experienced data scientis...,"Data Scientist, Lifecyle"
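
In the frame above, tag names such as `div`, `ul` and `li` are still fused into the text (`bdivdivJob Requirements...`), because `wash` strips punctuation, including `<` and `>`, before the HTML pattern runs. A minimal sketch of one ordering that avoids this follows. `clean_listing` and the pattern names are hypothetical helpers, not part of the assignment, and the sketch assumes every description is the string form of a bytes literal, as in the first listing.

```python
import re

BYTES_WRAPPER = re.compile(r"""^b['"]|['"]$""")   # leading b" or b' and the trailing quote
HTML_TAG = re.compile(r'<.*?>')                   # any HTML tag, non-greedy
HEX_ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}')     # literal \xNN byte escapes
NEWLINE_ESCAPE = re.compile(r'\\n')               # literal backslash + n
NON_WORD = re.compile(r'[^\w\s]')                 # remaining punctuation

def clean_listing(raw):
    """Remove the bytes wrapper and tags first, punctuation last."""
    text = BYTES_WRAPPER.sub('', raw)
    text = HTML_TAG.sub(' ', text)
    text = HEX_ESCAPE.sub('', text)
    text = NEWLINE_ESCAPE.sub(' ', text)
    text = NON_WORD.sub(' ', text)
    return ' '.join(text.split())                 # collapse runs of whitespace

sample = 'b"<div><p>\\nHands on experience in SQL/Hive</p>\\n</div>"'
cleaned = clean_listing(sample)
print(cleaned)
```

Applied to the frame, this would look like `jobs['description'] = jobs['description'].apply(clean_listing)`, run on the raw column rather than after `wash`.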


## 2) Use spaCy to tokenize / clean the listings

In [9]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_lg")

# Tokenizer
tokenizer = Tokenizer(nlp.vocab)

In [10]:
# Tokenizer Pipe
tokens = []

for doc in tokenizer.pipe(jobs['description'], batch_size=500):
    doc_tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token.text.lower())
    tokens.append(doc_tokens)

jobs['tokens'] = tokens


In [201]:
def get_lemmas(text):

    lemmas = []
    
    doc = nlp(text)
    
    # Keep the lemma of every token that is neither a stop word nor a pronoun
    for token in doc:  # punctuation already removed
        if not token.is_stop and token.pos_ != 'PRON':
            lemmas.append(token.lemma_)
    
    return lemmas

In [None]:
jobs['lemmas'] = jobs['description'].apply(get_lemmas)
jobs

In [None]:
# word frequency
freq = pd.Series(' '.join(jobs['description']).split()).value_counts()
freq

In [215]:
def pct_change(word_freq):
    """
    Walk down a word-frequency Series sorted in descending order.

    At each step, print the share that the next word contributes to the
    count remaining after the top `step` words are removed. Stop once that
    share is no longer above 0.51% (the 0.0051 threshold below), then print
    one further step.
    """
    change = 1.1
    total = sum(word_freq.values)
    step = 1
    while change > .0051:
        curr = word_freq.iloc[step]  # positional lookup; the index holds the words
        prior = total - sum(word_freq.values[:step])
        change = curr / prior

        print(word_freq[:step].index[-1], end="  ")
        print(f'{change:.5f}', step, end="  |||")
        if step % 4 == 0:
            print('\n')
        step += 1
    curr = word_freq.iloc[step]
    prior = total - sum(word_freq.values[:step])
    change = curr / prior
    print(word_freq[:step].index[-1], end="  ")
    print(f'{change:.5f}', end="  |||")

In [221]:
# The 25 most frequent words; membership tests below check this index of words
common = freq[:25]

In [222]:
jobs['description'] = jobs['description'].apply(
    lambda text: " ".join(word for word in text.split() if word not in common))



In [None]:
jobs

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [13]:
##### Your Code Here #####
raise Exception("\n This task is not complete. \n Replace this line with your code for the task.")

Exception: 
 This task is not complete. 
 Replace this line with your code for the task.

## 4) Visualize the most common word counts

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")

## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
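
A sketch of the query step under stated assumptions: the three `listings`, the `ideal` description, and the choice of cosine distance are all illustrative rather than prescribed by the assignment. The key point is that the query must be transformed with the vectorizer that was fitted on the listings, so both live in the same feature space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-ins for cleaned job descriptions
listings = [
    "python sql machine learning models",
    "sales marketing budget planning",
    "deep learning research in computer vision",
]

tfidf = TfidfVectorizer(stop_words="english")
dtm = tfidf.fit_transform(listings)

# Cosine distance compares direction, not length, of the TF-IDF vectors
nn = NearestNeighbors(n_neighbors=2, metric="cosine")
nn.fit(dtm)

ideal = ["I want to build machine learning models with python and sql"]
query = tfidf.transform(ideal)  # reuse the fitted vocabulary; do not refit

distances, indices = nn.kneighbors(query)
best = indices[0][0]
print(listings[best], distances[0])
```

Words in the query that never appeared in the listings are ignored by `transform`, so only shared vocabulary contributes to the match.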

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings?
 - Try to identify the experience requirements for specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
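
As one illustration of the clustering hint, the sketch below groups TF-IDF vectors with DBSCAN using cosine distance instead of Euclidean distance. DBSCAN is only an example choice, not the answer the hint has in mind, and the `docs`, `eps` and `min_samples` values are assumptions picked for this toy corpus.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for cleaned job descriptions
docs = [
    "python sql machine learning",
    "python sql machine learning models",
    "sales marketing budget",
    "sales marketing budget planning",
]

dtm = TfidfVectorizer().fit_transform(docs)

# Documents within cosine distance 0.5 of each other are treated as neighbours
clusterer = DBSCAN(eps=0.5, min_samples=2, metric="cosine")
labels = clusterer.fit_predict(dtm)
print(labels)
```

A label of `-1` would mark a document as noise; on real listings `eps` needs tuning, since it controls how similar two descriptions must be to share a cluster.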