# Data Scientist Job Posts: Web Scraping

The objective of this project is twofold;
- 1) build out a tool that can scape several job sites to pull out key competencies + trends for data science roles
- 2) use this project as a way to build competency in web scraping and data visualization

Even though several people have taken on similar projects, there has not been a strict focus on; 1) remote jobs, 2) current jobs / a tool I can reuse. I've also not seen a lot of EDA + visualization that was super useful for new data scientists to zero in their training on the most marketable skills (at the time).

Either way, this will still be a super useful and fun project to build out skills in web scraping and visualization.

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
import seaborn as sns
import numpy as np

# Web Scraping

The key data we want to collect are:
- job title
- company name
- job summary - note: once we have this extracted, we'll work on EDA and data visulation in sections below

For this project, I'll also pull this information out for jobs with titles that contain:
- data scientist
- analytics
- machine learning

In [70]:
jobs = []
companies = []
summaries = []
types = ['data+scientist', 'analytics', 'machine+learning']

for i in range (0,2000,10):
    #url = "https://www.indeed.com/jobs?q=data+scientist&l=Remote&start=" + str(i)
    url = "https://www.indeed.com/jobs?" + "q=" + str(types) + "&l=Remote" + "&start=" +str(i)
    results = requests.get(url+str(types)+str(i))
    time.sleep(1)
    soup = BeautifulSoup(results.text, "html.parser")
    for div in soup.find_all(name="h2", attrs={"class":"title"}):
        for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
            jobs.append(a["title"])
    for div in soup.find_all(name="div", attrs={"class":"sjcl"}):
        company = div.find_all(name="a", attrs={"data-tn-element":"companyName"})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            sec_try = div.find_all(name="span", attrs={"class":"company"})
            for span in sec_try:
                companies.append(span.text.strip())
    sum = soup.findAll("div", attrs={"class": "summary"})
    for sum in sum:
        summaries.append(sum.text.strip())
    
indeed_combined = pd.DataFrame({'title':jobs,
                      'company':companies,
                      'summary':summaries})

In [71]:
print(indeed_combined["summary"])

0       Design and build scalable production-ready ana...
1       Do you have experience, knowledge or interest ...
2       Support and drive analytic efforts around mach...
3       Proficiency with statistical analysis tools e....
4       2+ years experience with machine learning and ...
                              ...                        
3006    Partner with a cross-functional team of data s...
3007    Proven track record of architecting, developin...
3008    Natural Language Processing (NLP) centric role...
3009    We offer a generous budget for personal develo...
3010    You’ll get to work with an experienced team of...
Name: summary, Length: 3011, dtype: object


In [2]:
indeed_combined.to_csv("/Volumes/GoogleDrive/My Drive/ml_projects/indeed_combined")

NameError: name 'indeed_combined' is not defined

# Natural Language Processing

Now that we have all the relevant data scaped, let's start digging into the summary section to pull out key competencies for data science roles. For this, I'll be using the NLTK throughout. 

### Prepping and Loading Data

In [2]:
import nltk
from string import punctuation
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

indeed_combined = pd.read_csv("/Volumes/GoogleDrive/My Drive/ml_projects/indeed_combined", dtype=str)


In [3]:
indeed_combined['summary'] = indeed_combined['summary'].dropna() 
indeed_combined['summary'] = indeed_combined.summary.astype('string')
summary = indeed_combined.summary.str.lower()

### Preprocessing

Removing numbers from summary

In [6]:
summary_cleaned = ''.join(c for c in summary if not c.isdigit())

Removing tags using regex

In [7]:
import re
summary_cleaned = re.sub('<[^<]+?','',summary_cleaned)

Tokenization and Word Frequency

In [8]:
nltk.download("stopwords")
stop_words = stopwords.words('english')

punctuation = punctuation + '\n'

tokens = nltk.word_tokenize(summary_cleaned)

word_frequencies = {}
for word in tokens:
    if word.lower() not in stop_words:
        if word.lower() not in punctuation:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/patrickbell/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Checking it all out

In [9]:
from collections import Counter
k = Counter(word_frequencies)
high = k.most_common(75)

for i in high:
    print(i[0], ":", i[1], "")

data : 3005 
learning : 2119 
experience : 1837 
machine : 1455 
years : 798 
analytics : 643 
techniques : 631 
statistical : 597 
three : 579 
science : 561 
team : 560 
scientists : 401 
modeling : 398 
least : 398 
five : 398 
industry : 398 
decision : 391 
tree : 391 
projects : 391 
engineers : 382 
solutions : 369 
get : 362 
role : 361 
product : 359 
work : 352 
experienced : 339 
using : 279 
building : 252 
r : 220 
testing : 218 
python : 218 
deep : 211 
knowledge : 210 
scientist : 202 
areas : 201 
build : 200 
interest : 200 
drive : 200 
efforts : 200 
analysis : 200 
tools : 200 
applying : 200 
hands-on : 200 
programming : 200 
developing : 200 
products : 200 
scalable : 199 
production-ready : 199 
wide : 199 
array : 199 
methodologies : 199 
field : 199 
machine…do : 199 
one : 199 
following : 199 
practical : 199 
a/b : 199 
websites : 199 
feature : 199 
engineering : 199 
predictive…support : 199 
analytic : 199 
around : 199 
innovation : 199 
coach : 199 

Calculating Sentence Scores using Word Frequencies

In [10]:
sentence_list = nltk.sent_tokenize(summary_cleaned)
sentence_scores = {}
for sent in sentence_list:
    for word in tokens:
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
            

Conlclussion / Summary
- it's not the mots clean summary (runons, etc.)
- however, for learning purposes, this was great and gives a decent overview or current hiring trends in data science
- when taking a more objective / quantitative view, there isn't a ton of nuance in what is being looked for in data science roles
-- 3-5 years experience
-- ability to do all aspects (data finding, processing, modeling, etc.)
-- machine learning using python and / or R

In [11]:
import heapq
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)


at least three to five years of industry experience in data analytics.cleanse and wrangle student interaction data from our learning platforms. experience on projects applying network science/graph analytics.lead data automation projects to aggregate disparate data sources. d required with concentration on statistics/ machine learning highly preferred.we offer a generous budget for personal development expenses like training courses, conferences, and books. experience with a variety of machine learning techniques (clustering, decision tree learning,…at least three to five years of industry experience in data analytics. enhancing data collection procedures to include information that is…machine learning for supervised learning. querying and analyzing data using a holistic perspective. experience on projects applying network science/graph analytics.at least three to five years of industry experience in data analytics. exposure to python or r libraries for machine learning.lead data autom