<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [2]:
data = "data/job_listings.csv"
df = pd.read_csv(data)
df.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [3]:
# from google.colab import files
# uploaded = files.upload()

In [4]:
# import io
# df = pd.read_csv(io.BytesIO(uploaded['job_listings.csv']))

In [3]:
def lower_word(text):
    """
    Removes every character except letters, digits, and spaces, then lowercases the result
    Args:
        text(str): The string to clean and convert to lowercase
    Returns:
        str: The cleaned, lowercased string
    """
    clean_text = re.sub('[^a-zA-Z 0-9]', "", text)
    lower = clean_text.lower()
    return lower

In [4]:
def beautify(text):
    """
    Extracts text from html string
    Args:
        text(str): The html string
    Returns:
        string: Text of the html string
    """
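    # Name the parser explicitly so results do not depend on which parsers are installed
    text = BeautifulSoup(text, "html.parser").text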
    return text

In [5]:
df['description'] = df['description'].apply(beautify).apply(lower_word)

In [6]:
df

Unnamed: 0.1,Unnamed: 0,description,title
0,0,bjob requirementsnconceptual understanding in ...,Data scientist
1,1,bjob descriptionnnas a data scientist 1 you wi...,Data Scientist I
2,2,bas a data scientist you will be working on co...,Data Scientist - Entry Level
3,3,b4969 6756 a monthcontractunder the general s...,Data Scientist
4,4,blocation usa xe2x80x93 multiple locationsn2 y...,Data Scientist
...,...,...,...
421,421,babout usnwant to be part of a fantastic and f...,Senior Data Science Engineer
422,422,binternshipat uber we ignite opportunity by se...,2019 PhD Data Scientist Internship - Forecasti...
423,423,b200000 350000 a yeara million people a year ...,Data Scientist - Insurance
424,424,bsenior data scientistnjob descriptionnnabout ...,Senior Data Scientist
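
The cleaned output above still carries a leading `b` and stray `n` or `xe2x80x93` fragments. This is because each description is stored as the text of a Python bytes literal (`b'...'`), so its escape sequences are plain characters that the regex only partly strips. One possible fix, sketched below, is to decode the literal before removing the HTML. The helper name `clean_description` and the sample string are illustrative assumptions, not part of the assignment.

```python
import ast
import re

from bs4 import BeautifulSoup


def clean_description(raw):
    """Decode a stringified bytes literal, strip HTML, and normalize the text."""
    try:
        # Turn the text "b'...'" back into real bytes, then into a str
        decoded = ast.literal_eval(raw).decode("utf-8")
    except (ValueError, SyntaxError, AttributeError):
        # Fall back to the raw text if it is not a valid bytes literal
        decoded = raw
    # Join text nodes with a space so neighbouring words do not run together
    text = BeautifulSoup(decoded, "html.parser").get_text(" ")
    # Replace anything that is not a letter, digit, or space with a space
    text = re.sub(r"[^a-zA-Z0-9 ]", " ", text.lower())
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample shaped like the rows in the description column
sample = "b'<div>Job Description<br/>\\n<p>Location: USA \\xe2\\x80\\x93 multiple</p></div>'"
cleaned = clean_description(sample)
print(cleaned)
```

Applied to the real data, this would be `df['description'].apply(clean_description)` on the freshly loaded column, in place of the `beautify` and `lower_word` chain.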


In [7]:
df['title'].value_counts()

Data Scientist                                  150
Senior Data Scientist                            14
Junior Data Scientist                            10
Associate Data Scientist                          8
Data Scientist Intern                             7
                                               ... 
Data scientist                                    1
Intern, Data Engineer                             1
Data Scientist- Machine Translation               1
Data Scientist- Enterprise Product Analytics      1
Applied Data Scientist                            1
Name: title, Length: 177, dtype: int64

## 2) Use spaCy to tokenize the listings

In [10]:
# !python -m spacy download en_core_web_lg

In [8]:
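# The model must be installed first (see the download command in the cell above);
# without it, this import fails with the ModuleNotFoundError shown below.
import en_core_web_lg
nlp = en_core_web_lg.load()
# Equivalent: nlp = spacy.load('en_core_web_lg')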

ModuleNotFoundError: No module named 'en_core_web_lg'

In [None]:
# STOP_WORDS = nlp.Defaults.stop_words.union([])

In [None]:
##### Your Code Here #####

tokens = []

for doc in nlp.pipe(df['description']):
    doc_tokens = []
    for token in doc:
        # Keep only tokens that are neither stop words nor punctuation
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token.text.lower())
    tokens.append(doc_tokens)
df['tokens'] = tokens
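
If the large model is not available, the same filtering can be tried with a blank English pipeline, which needs no download. This is a minimal sketch under that assumption: `spacy.blank("en")` only tokenizes (no vectors, tagging, or lemmas), and the two sample sentences are made up for illustration.

```python
import spacy

# Tokenizer-only English pipeline; no model download required
nlp = spacy.blank("en")

# Hypothetical sample texts standing in for df['description']
texts = [
    "We are hiring a Data Scientist!",
    "Experience with Python and SQL is required.",
]

tokens = []
for doc in nlp.pipe(texts):
    # Drop stop words, punctuation, and whitespace tokens; lowercase the rest
    doc_tokens = [
        token.text.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]
    tokens.append(doc_tokens)

print(tokens)
```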

In [None]:
df

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [None]:
data = df['tokens']
text = data[0]
len(text)

In [None]:
data[10]

In [None]:
##### Your Code Here #####
from sklearn.feature_extraction.text import CountVectorizer

# for doc in df['tokens']:
#   vect = CountVectorizer(stop_words='english')
#   dtm = vect.fit_transform(doc)
#   dtm_df = pd.DataFrame(data=dtm.toarray(), columns=vect.get_feature_names())

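# Vectorize the cleaned description text (one document per listing)
data = df['description']

vect = CountVectorizer(stop_words='english')

# Learn the vocabulary
vect.fit(data)

# Build the sparse document-term matrix of word counts
dtm = vect.transform(data)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm_df = pd.DataFrame(data=dtm.toarray(), columns=vect.get_feature_names_out())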

In [None]:
dtm_df

In [None]:
print(dtm)

## 4) Visualize the most common word counts

In [None]:
##### Your Code Here #####
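import seaborn as sns
# Sum each term's count across all listings and plot the 20 most frequent
top_words = dtm_df.sum().sort_values(ascending=False).head(20)
sns.barplot(x=top_words.values, y=top_words.index)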

## 5) Use Scikit-Learn's tfidfVectorizer to get a TF-IDF feature matrix

In [None]:
##### Your Code Here #####
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')
dtm = tfidf_vect.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm_df = pd.DataFrame(data=dtm.toarray(), columns=tfidf_vect.get_feature_names_out())

In [None]:
dtm_df

In [None]:
# Find the 5 listings most similar to a given listing
from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(dtm)
df_sim = pd.DataFrame(sim_matrix)

# which listings are most similar to listing 0 (first listing)
listing_ind = 0

# we don't want the listing itself among the suggestions
# (note: this mask also drops exact duplicates of the listing)
sim_mask = df_sim[listing_ind] < 1

# show the 5 listings most similar to listing 0
most_sim_listings = df_sim[listing_ind][sim_mask].sort_values(ascending=False)[:5]
most_sim_listings

In [None]:
df.iloc[0]['description']

In [None]:
df.iloc[276]['description']

## 6) Create a NearestNeighbors Model. Write the description of your ideal data science job and query your job listings.

In [None]:
print(dtm)

In [None]:
##### Your Code Here #####
from sklearn.neighbors import NearestNeighbors

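# Fit on the TF-IDF document-term matrix
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm_df)

# Use the first doc in the matrix as our query point
doc_index = 0
# Select the row as a one-row DataFrame so the query keeps the fitted feature names
doc = dtm_df.iloc[[doc_index]]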

# Query using kneighbors
neigh_dist, neigh_index = nn.kneighbors(doc)

In [None]:
neigh_dist

In [None]:
neigh_index
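
The query above reuses an existing listing, whereas the task asks for a description of your own ideal job. A minimal sketch of that step follows: transform the new text with the already fitted TF-IDF vectorizer, then pass the result to `kneighbors`. The toy `listings`, the `ideal_job` text, and the choice of a brute-force cosine metric are assumptions made for illustration; with the notebook's objects, the query would go through `tfidf_vect.transform` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-ins for the cleaned job descriptions
listings = [
    "build machine learning models in python and deploy them to production",
    "create dashboards and weekly reports with sql and tableau",
    "research deep learning methods for natural language processing",
    "manage marketing campaigns and social media content",
]

# Fit the vocabulary and IDF weights on the listings only
tfidf = TfidfVectorizer(stop_words="english")
listing_vectors = tfidf.fit_transform(listings)

# Cosine distance works directly on the sparse TF-IDF matrix
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(listing_vectors)

# transform (not fit_transform) so the query shares the listings' vocabulary
ideal_job = "I want to train machine learning models in python"
query_vector = tfidf.transform([ideal_job])

distances, indices = nn.kneighbors(query_vector)
for dist, idx in zip(distances[0], indices[0]):
    print(round(float(dist), 3), listings[idx])
```

Words in the query that never occur in the listings are ignored, so a longer, more specific description tends to give more useful matches.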

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from the Data Scientist job listings?
 - Try to identify the experience requirements for specific technologies in the job listings. How are those distributed among the listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
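
As a starting point for the technology stretch goal, one simple approach is to count how many listings mention each term from a hand-picked list. The sketch below assumes the descriptions are already cleaned and lowercased. The `technologies` list and the sample descriptions are hypothetical, and whole-word matching will miss variants such as "PySpark".

```python
import re

import pandas as pd

# Hypothetical cleaned descriptions standing in for df['description']
descriptions = pd.Series([
    "experience with python and sql required",
    "strong sql skills and tableau dashboards",
    "python spark and aws experience preferred",
])

# Hand-picked terms to look for (an assumption, extend as needed)
technologies = ["python", "sql", "spark", "tableau"]

# One boolean column per technology: does the listing mention it as a whole word?
mentions = pd.DataFrame({
    tech: descriptions.str.contains(rf"\b{re.escape(tech)}\b", regex=True)
    for tech in technologies
})

# Number of listings mentioning each technology, most common first
tech_counts = mentions.sum().sort_values(ascending=False)
print(tech_counts)
```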