<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [127]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

## 1) *Optional:* Scrape 100 Job Listings that contain the title "Data Scientist" from indeed.com

At a minimum your final dataframe of job listings should contain
- Job Title
- Job Description

If you choose not to scrape the data, there is a CSV with outdated data in the directory. Remember, if you scrape Indeed, you're helping yourself find a job. ;)

In [2]:
import requests
import bs4
from bs4 import BeautifulSoup
import time

In [37]:
URL = "https://www.indeed.com/jobs?q=data+science&l=Hialeah%2C+FL&radius=10"
# Request the URL stated above
page = requests.get(URL)
# Parse the response with the HTML parser so Python can navigate the page's components instead of treating it as one long string
soup = BeautifulSoup(page.text, "html.parser")
# Print soup in a structured tree format that is easier to read
# print(soup.prettify())

In [56]:
def extract_job_title_from_result(soup):
    """Return the title attribute of every job-title link inside a result row."""
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs
extract_job_title_from_result(soup)

['Data Scientist',
 'Research Data Engineer',
 'Sr. Data Scientist',
 'Big Data/ETL Automation Test Engineer',
 'Data Scientist',
 'Data Scientist, Data Analytics & AI',
 'Associate Data Scientist',
 'Senior Manager, Data Science & Analytics',
 'AI/Machine Learning/Coding Enthusiast',
 'Sales & Marketing Business Analyst',
 'Senior Data Engineer, Temporary Full Time',
 'Business Data Analyst',
 'Data Engineer',
 'Manager, Data Science & Engineering']
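
The selector logic in `extract_job_title_from_result` can be exercised offline, without sending a request to Indeed. The sketch below is a dependency-free stand-in that uses the standard library's `html.parser` instead of BeautifulSoup. The sample markup is made up to mimic the attributes the function looks for, and for brevity the sketch does not check that each link sits inside a `div` with class `row`.

```python
from html.parser import HTMLParser

# Made-up markup that mimics the attributes the selectors above rely on.
SAMPLE_HTML = """
<div class="row">
  <a data-tn-element="jobTitle" title="Data Scientist" href="#">Data Scientist</a>
</div>
<div class="row">
  <a data-tn-element="jobTitle" title="Data Engineer" href="#">Data Engineer</a>
</div>
<div class="sidebar">
  <a title="Not a job" href="#">Ad</a>
</div>
"""


class JobTitleParser(HTMLParser):
    """Collect the title attribute of every <a data-tn-element="jobTitle">."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("data-tn-element") == "jobTitle":
            self.titles.append(attributes.get("title", ""))


def extract_titles(html):
    """Parse an HTML string and return the job titles found in it."""
    parser = JobTitleParser()
    parser.feed(html)
    return parser.titles


sample_titles = extract_titles(SAMPLE_HTML)
print(sample_titles)
```

A fixed snippet like this makes it easier to notice when the site's markup changes and the selectors silently return an empty list.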

In [57]:
def extract_summary_from_result(soup):
    """Return the stripped text of every summary bullet list inside a result row."""
    summaries = []
    # Inline style that identifies the bullet list holding the summary text
    style = "list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;"
    for row in soup.find_all(name="div", attrs={"class": "row"}):
        for div in row.find_all("div", attrs={"class": "summary"}):
            for ul in div.find_all("ul", attrs={"style": style}):
                summaries.append(ul.text.strip())
    return summaries
extract_summary_from_result(soup)

["Solid knowledge of current technology data architecture, particularly data lakes and cloud environments.SUMMARY: The Data Scientist is part of BankUnited's…",
 'Ensures databases and data extracts reflect specifications; proactively seeks specification clarification as needed to ensure data quality.',
 'Source, manipulate, cleanse and synthesis data at scale from disparate structured and unstructured data sources.Mentor and coach other team members.',
 'Understanding of data analysis, data modeling, database design, data migration and business intelligence solutions.Experience in analyzing & validating data.',
 'The data scientist will also need to integrate data from disparate sources into an efficient and intuitive rational database structure.',
 'This Data Scientist will focus on architecting, deploying and evaluating intelligent solutions as part of a growing Data Science team within Royal Caribbean.',
 '2+ years of experience as a data scientist or highly technical data analyst.

In [119]:
max_results_per_city = 100
city_set = ["San+Francisco", "Washington+DC","Austin","Seattle","Baltimore","New+York",
            "Miami", "Denver", "Portland", "Chicago", "Atlanta", "Los+Angeles"]
columns = ["city","job_title","summary"]
sample_df = pd.DataFrame(columns = columns)

In [120]:
# Scraping code
# Inline style that identifies the bullet list holding the summary text
style = "list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;"
for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('http://www.indeed.com/jobs?q=data+scientist+junior&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)  # wait at least 1 second between page grabs
        # page.text is already decoded, so no from_encoding argument is needed
        soup = BeautifulSoup(page.text, "html.parser")
        for div in soup.find_all(name="div", attrs={"class": "row"}):
            # Row number used as the index of this posting in the dataframe
            num = len(sample_df) + 1
            # Grab the job title
            titles = [a["title"] for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"})]
            # Grab the summary text
            summaries = [ul.text.strip()
                         for b in div.find_all("div", attrs={"class": "summary"})
                         for ul in b.find_all("ul", attrs={"style": style})]
            # Skip postings without a title or summary so every row has exactly three values
            if not titles or not summaries:
                continue
            # Append the posting to the dataframe at index num
            sample_df.loc[num] = [city, titles[0], summaries[0]]
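
The request URLs above are assembled by string concatenation, which only works because the city names are already written with `+` in place of spaces. As a hedged alternative, the sketch below builds the same kind of paginated URL with `urllib.parse.urlencode`. The helper names `build_search_url` and `page_urls` are hypothetical, and the parameter names `q`, `l` and `start` are taken from the URLs used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(query, city, start):
    """Return a search URL with every parameter percent-encoded."""
    params = {"q": query, "l": city, "start": start}
    return BASE_URL + "?" + urlencode(params)


def page_urls(query, city, max_results, page_size=10):
    """Yield one URL per results page, mirroring range(0, max_results, 10)."""
    for start in range(0, max_results, page_size):
        yield build_search_url(query, city, start)


urls = list(page_urls("data scientist junior", "San Francisco", 30))
for url in urls:
    print(url)
```

With this approach the city list can hold plain names such as `"San Francisco"`, and the encoding is applied in one place.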

In [124]:
sample_df.head()

| | city | job_title | summary |
|---|---|---|---|
| 1 | San+Francisco | Principal Data Scientist | Title: Principal Data Scientist - 61756.Work w... |
| 2 | San+Francisco | Scientist | Provide guidance and supervise junior research... |
| 3 | San+Francisco | Staff Deep Learning Scientist | Convey ideas, guide execution and mentor junio... |
| 4 | San+Francisco | Data Scientist (Jr. to Sr. Level) | Responsibilities for the Data Scientist includ... |
| 5 | San+Francisco | Associate Data Scientist | The Jr. Data Scientist will dig into data to u... |


In [126]:
sample_df.summary[1]

'Title: Principal Data Scientist - 61756.Work with large geospatial data, including high resolution satellite imagery and environmental data.'

In [121]:
##### Your Code Here #####
# raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
                
# from bs4 import BeautifulSoup
# import requests

## 2) Use spaCy to tokenize / clean the listings

In [128]:
##### Your Code Here #####
nlp = spacy.load("en_core_web_lg")

In [130]:
def get_lemmas(text):
    """Return the lemmas of text, skipping stop words, punctuation and pronouns."""
    lemmas = []

    doc = nlp(text)

    for token in doc:
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON':
            lemmas.append(token.lemma_)

    return lemmas

In [135]:
df = sample_df.copy()
df.head()

| | city | job_title | summary |
|---|---|---|---|
| 1 | San+Francisco | Principal Data Scientist | Title: Principal Data Scientist - 61756.Work w... |
| 2 | San+Francisco | Scientist | Provide guidance and supervise junior research... |
| 3 | San+Francisco | Staff Deep Learning Scientist | Convey ideas, guide execution and mentor junio... |
| 4 | San+Francisco | Data Scientist (Jr. to Sr. Level) | Responsibilities for the Data Scientist includ... |
| 5 | San+Francisco | Associate Data Scientist | The Jr. Data Scientist will dig into data to u... |


In [184]:
df['lemmas'] = df['summary'].apply(get_lemmas)
df_test = df['summary'].apply(get_lemmas)

In [185]:
df_test.head()

1    [title, Principal, Data, Scientist, 61756.work...
2    [provide, guidance, supervise, junior, researc...
3    [convey, idea, guide, execution, mentor, junio...
4    [responsibility, Data, Scientist, include, qua...
5    [Jr., Data, Scientist, dig, datum, uncover, in...
Name: summary, dtype: object
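
The first row of the output above contains what appears to be the fused token `61756.work`. It comes from the scraped text `61756.Work`, where two sentences were joined without a space. A minimal sketch of a pre-cleaning step, using the `re` module already imported at the top of the notebook, is shown below. The helper name `split_fused_sentences` is hypothetical, and the sample string is a shortened version of `sample_df.summary[1]`. Note that this simple rule would also split abbreviations such as "U.S.A".

```python
import re

# A period directly followed by an uppercase letter marks a missing space
# between two scraped sentences (for example "61756.Work").
MISSING_SPACE = re.compile(r"\.(?=[A-Z])")


def split_fused_sentences(text):
    """Insert a space after any period that is glued to a capitalized word."""
    return MISSING_SPACE.sub(". ", text)


raw = "Title: Principal Data Scientist - 61756.Work with large geospatial data."
cleaned = split_fused_sentences(raw)
print(cleaned)
```

Applying such a step to `df['summary']` before calling `get_lemmas` should let spaCy see the two sentences as separate tokens.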

In [149]:
df.summary[2]

'Provide guidance and supervise junior research associates.All aspects of analytical laboratory operations, such as sample receipt, data generation,…'

In [150]:
df.lemmas[2]

['provide',
 'guidance',
 'supervise',
 'junior',
 'research',
 'associate',
 'aspect',
 'analytical',
 'laboratory',
 'operation',
 'sample',
 'receipt',
 'datum',
 'generation']

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [182]:
##### Your Code Here #####
vect = CountVectorizer(stop_words='english')
# CountVectorizer expects one string per document, so join each lemma list back into a single string
lemma_text = df['lemmas'].apply(' '.join)
vect.fit(lemma_text)
dtm = vect.transform(lemma_text)
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names())
dtm

## 4) Visualize the most common word counts
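
One possible approach, sketched under assumptions: count lemmas across all listings with `collections.Counter` and draw the top entries as a horizontal bar chart. The helper names `top_words` and `plot_top_words` are hypothetical, and the toy lemma lists stand in for `df['lemmas']`. The plotting helper imports matplotlib only when it is called, so the counting part runs on its own.

```python
from collections import Counter


def top_words(token_lists, n=10):
    """Count lowercased tokens across all documents and return the n most frequent."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(token.lower() for token in tokens)
    return counts.most_common(n)


def plot_top_words(pairs):
    """Draw a horizontal bar chart; matplotlib is imported only when called."""
    import matplotlib.pyplot as plt

    words, freqs = zip(*pairs)
    plt.barh(words[::-1], freqs[::-1])  # most frequent word at the top
    plt.xlabel("count")
    plt.title("Most common lemmas")
    plt.show()


# Toy lemma lists standing in for df['lemmas'].
toy_lemmas = [
    ["provide", "guidance", "junior", "research"],
    ["datum", "research", "junior", "datum"],
    ["Data", "Scientist", "datum"],
]
common = top_words(toy_lemmas, n=3)
print(common)
```

In the notebook, `plot_top_words(top_words(df['lemmas'], n=20))` would be the corresponding call.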

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [188]:
##### Your Code Here #####
def tokenize(document):
    """Return stripped lemmas for every token that is not a stop word or punctuation."""
    doc = nlp(document)
    return [token.lemma_.strip() for token in doc if not token.is_stop and not token.is_punct]

In [194]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Instantiate the vectorizer; min_df and max_df are fractions of documents, so terms in fewer than 2.5% or more than 98% of listings are dropped
tfidf = TfidfVectorizer(tokenizer=tokenize, min_df=0.025, max_df=.98, ngram_range=(1,2))
# Learn the vocabulary and compute the TF-IDF weights for each document
dtm = tfidf.fit_transform(df.summary)
# Use the feature names as dataframe column headers
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())
# View the feature matrix as a DataFrame
dtm.head()

Unnamed: 0,+,+ year,ability,advanced,algorithm,analysis,analyst,analytic,analytical,analyze,...,technical support,technology,tool,training,training technical,use,visualization,work,work closely,year
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.469429,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
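
To make the numbers in such a matrix less opaque, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_vectors` and the toy documents are assumptions for illustration, and the sketch ignores n-grams and the `min_df` / `max_df` pruning used above.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2 row normalization."""
    n_docs = len(docs)
    vocabulary = sorted({term for doc in docs for term in doc})
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


toy_docs = [
    ["junior", "data", "scientist"],
    ["senior", "data", "engineer"],
    ["junior", "analyst"],
]
vocab, matrix = tfidf_vectors(toy_docs)
print(vocab)
for row in matrix:
    print([round(w, 3) for w in row])
```

Terms that occur in fewer documents receive a larger weight, which is why a word like "scientist" outweighs "data" in the first toy row.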


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [195]:
##### Your Code Here #####
from sklearn.neighbors import NearestNeighbors

# Fit on TF-IDF Vectors
nn  = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)

In [206]:
nn.kneighbors([dtm.iloc[30]])

(array([[0.        , 0.        , 0.        , 0.82208963, 0.86034953]]),
 array([[ 79,  30, 141, 127, 272]], dtype=int64))
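
`kneighbors` returns two arrays: the distances to the nearest rows and the positions of those rows in the fitted matrix. A distance of zero means the row's TF-IDF vector is identical to the query's, so the three zeros above indicate that rows 79, 30 and 141 share the same vector. This could happen with duplicate listings or after vocabulary pruning, though that is a guess. The brute-force sketch below reproduces the idea with Euclidean distance on toy rows. `k_nearest` is a hypothetical helper, not the kd-tree that scikit-learn uses, and `math.dist` assumes Python 3.8 or later.

```python
import math


def k_nearest(rows, query, k):
    """Return (distances, positions) of the k rows closest to query."""
    scored = sorted((math.dist(row, query), pos) for pos, row in enumerate(rows))
    nearest = scored[:k]
    return [d for d, _ in nearest], [pos for _, pos in nearest]


# Toy TF-IDF-like rows; rows 0 and 2 are identical on purpose.
toy_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
]
distances, positions = k_nearest(toy_rows, toy_rows[0], k=3)
print(distances, positions)
```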

In [205]:
sample_df.summary[30]

'Lead and manage a small team (2-3 junior scientists).It is located in the QB3 incubator at UCSF, San Francisco, CA.This Job Is Ideal for Someone Who Is:'

In [208]:
sample_df.summary[272]

'Must demonstrate the ability to develop structured research including, but not limited to, obtaining, evaluating, organizing, and maintaining information within…'

In [209]:
ideal_job_description = ["junior small team good pay learning opportunity"]

In [210]:
new = tfidf.transform(ideal_job_description)

In [212]:
nn.kneighbors(new.todense())

(array([[1.02968071, 1.02968071, 1.02968071, 1.02968071, 1.02968071]]),
 array([[393, 440, 421, 381, 406]], dtype=int64))

In [216]:
sample_df.summary[406]

'NLign Analytics is a pioneer in the development of software tools that fundamentally change the way aircraft manufacturers and maintenance organizations use…'
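
Two hedged observations on the query above. First, the positions returned by `kneighbors` are 0-based row positions in `dtm`, while `sample_df` was indexed starting at 1, so `sample_df.summary[406]` reads the row labeled 406 rather than the row at position 406. In pandas, `.iloc` resolves positions. Second, all five distances are equal, which suggests the query shares very little vocabulary with the listings. The sketch below ranks toy listings by cosine similarity, a common alternative to Euclidean distance for text vectors. The vocabulary, vectors and helper names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_cosine(rows, query, k):
    """Return the positions of the k rows most similar to query, best first."""
    order = sorted(
        range(len(rows)),
        key=lambda pos: cosine_similarity(rows[pos], query),
        reverse=True,
    )
    return order[:k]


# Hypothetical vocabulary: ["junior", "team", "learning", "senior"]
listing_text = ["senior team", "junior team learning", "junior"]
listing_rows = [
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
# Query "junior small team good pay learning opportunity" with
# out-of-vocabulary words dropped.
query_row = [1, 1, 1, 0]

best = rank_by_cosine(listing_rows, query_row, k=2)
for pos in best:
    print(pos, listing_text[pos])  # positions are 0-based, like kneighbors output
```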

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings?
 - Try to identify experience requirements for specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
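
As a starting point for the technology stretch goal, the sketch below counts how many listings mention each term from a keyword list. The keyword list, the helper name `count_tech_mentions` and the toy summaries are all hypothetical. Very short names such as "R" would need more careful matching than a plain word-boundary pattern.

```python
import re
from collections import Counter

# Hypothetical keyword list; extend it with whatever technologies matter to you.
TECH_TERMS = ["python", "sql", "spark", "tensorflow", "tableau"]
TECH_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, TECH_TERMS)) + r")\b", re.IGNORECASE
)


def count_tech_mentions(summaries):
    """Count how many listings mention each technology at least once."""
    counts = Counter()
    for text in summaries:
        # A set ensures each listing contributes at most one count per term.
        counts.update({match.lower() for match in TECH_PATTERN.findall(text)})
    return counts


toy_summaries = [
    "Experience with Python and SQL required.",
    "Build Spark pipelines; Python a plus.",
    "Strong SQL, Tableau dashboards, and more SQL.",
]
tech_counts = count_tech_mentions(toy_summaries)
print(tech_counts.most_common())
```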