Job Board Scraping

Collect data on data science salary trends from a job listings aggregator for your analysis.

Select and parse data from roughly 1,000 or more job postings, potentially from multiple location searches.

Find out what factors most directly impact salaries (title, location, department, etc.). In this case, we do not want to predict mean salary as would be done in a regression. Your boss believes that salary is better represented in categories than as a continuous value.
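
Because salary is treated as a category here, one possible way to build the classification target is a median split. The sketch below assumes a pandas DataFrame with a hypothetical `salary` column; the column names and the four values are made up for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Hypothetical annual salaries in USD; real values would come from the scraped postings.
jobs = pd.DataFrame({
    "title": ["Data Scientist", "Data Analyst", "ML Engineer", "Research Scientist"],
    "salary": [107500.0, 65000.0, 130000.0, 95000.0],
})

# Binary target: 1 if the posting pays above the median salary, otherwise 0.
median_salary = jobs["salary"].median()
jobs["high_salary"] = (jobs["salary"] > median_salary).astype(int)

print(jobs)
```

A median split keeps the two classes balanced, which makes plain accuracy easier to interpret; other cut points are equally valid if the client cares about a specific salary level.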


Test, validate, and describe your models. What factors predict salary category? How do your models perform?
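
One way to describe how a salary-category classifier performs is to report the ROC AUC and to look at how the decision threshold trades one kind of error for the other. The sketch below is only an illustration under stated assumptions: the data is synthetic (`make_classification`), class 1 stands for "high salary", and the 0.7 threshold is an arbitrary example value. Raising the threshold means the model predicts "high salary" less often, so fewer clients are wrongly promised a high salary, at the cost of missing more true high-salary postings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scraped features; 1 = high salary, 0 = low salary.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_high = model.predict_proba(X_test)[:, 1]  # predicted P(high salary)


def false_high_count(threshold):
    """Count low-salary postings wrongly predicted as high salary."""
    predicted = (proba_high >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, predicted, labels=[0, 1]).ravel()
    return int(fp)


fp_default = false_high_count(0.5)  # scikit-learn's default cut-off
fp_strict = false_high_count(0.7)   # example of a stricter cut-off (assumed value)

# Points of the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_test, proba_high)
auc = roc_auc_score(y_test, proba_high)

print("false 'high' predictions at 0.5:", fp_default)
print("false 'high' predictions at 0.7:", fp_strict)
print("ROC AUC: {:.3f}".format(auc))
```

The `fpr` and `tpr` arrays can be passed to a plotting library to draw the ROC curve.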

Prepare a presentation for your Principal detailing your analysis.


BONUS PROBLEMS:

1. Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your logistic regression models to ease her mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.

2. Text variables and regularization:

Part 1: Job descriptions contain more potentially useful information you could leverage. Use the job summary to find words you think would be important and add them as predictors to a model.
Part 2: Gridsearch parameters for Ridge and Lasso for this model and report the best model.
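
For logistic regression, Ridge and Lasso correspond to the L2 and L1 penalties. The sketch below shows one possible shape for the text-variable bonus: word-presence features from the job summary feeding a penalized logistic regression, with a grid search over penalty type and strength. The summaries, labels, and grid values are invented for illustration; they are assumptions, not results from the scraped postings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up job summaries; label 1 = high salary category, 0 = low salary category.
summaries = [
    "senior data scientist machine learning python",
    "lead machine learning engineer deep learning",
    "principal data scientist statistics python",
    "senior research scientist modeling spark",
    "staff data scientist machine learning spark",
    "senior analytics manager python modeling",
    "junior data analyst excel reporting",
    "entry level analyst excel dashboards",
    "data entry clerk reporting spreadsheets",
    "junior reporting analyst dashboards sql",
    "intern data analyst excel sql",
    "associate analyst spreadsheets reporting",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    # binary=True records whether a word appears, not how often.
    ("words", CountVectorizer(binary=True, stop_words="english")),
    # liblinear supports both the L1 (Lasso) and L2 (Ridge) penalties.
    ("clf", LogisticRegression(solver="liblinear")),
])

param_grid = {
    "clf__penalty": ["l1", "l2"],
    "clf__C": [0.1, 1.0, 10.0],  # smaller C means stronger regularization
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(summaries, labels)

print(search.best_params_)
print("best cross-validated accuracy: {:.3f}".format(search.best_score_))
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from each training fold, so the cross-validated score is not inflated by words seen in the held-out fold.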


Goal: Scrape & clean data, run logistic regression, derive insights, present findings.

Requirements

Scrape and prepare your data using BeautifulSoup.
A team Jupyter Notebook with your regression analysis for a peer audience of data scientists.
An individual blog post describing your findings, with two sections: the first for a non-technical audience, and the second for data scientist peers.

In [1]:
# import modules and packages

import pandas as pd
import numpy as np
import requests
from IPython.core.display import HTML
from selenium import webdriver
from bs4 import BeautifulSoup
import timeit
import urllib2

In [19]:
# send get request for target website and check response code

r = requests.get('http://www.careerbuilder.com/jobs-data-scientist?keywords=data+scientist&pay=20')
r

<Response [200]>

In [20]:
# see request headers and print clean

for k, v in r.request.headers.items():
    print (k + ':', v)

('Connection:', 'keep-alive')
('Accept-Encoding:', 'gzip, deflate')
('Accept:', '*/*')
('User-Agent:', 'python-requests/2.10.0')
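
The request headers above show the default `python-requests` user agent. As a hedged sketch, the headers a request would carry can also be inspected before anything is sent, by preparing the request through a `requests.Session`. The user-agent string below is a hypothetical value chosen for illustration.

```python
import requests

SEARCH_URL = 'http://www.careerbuilder.com/jobs-data-scientist?keywords=data+scientist&pay=20'

session = requests.Session()
# Hypothetical identifier for this scraper; not a value used in the original notebook.
session.headers.update({'User-Agent': 'salary-trends-notebook/0.1'})

# prepare_request merges the session headers into the request without sending it.
prepared = session.prepare_request(requests.Request('GET', SEARCH_URL))

for name, value in sorted(prepared.headers.items()):
    print('{}: {}'.format(name, value))
```

Formatting each pair as a single string prints one `name: value` line per header on both Python 2 and Python 3.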


In [21]:
# see response headers and print clean

for k, v in r.headers.items():
    print (k + ':', v)

('Date:', 'Sun, 16 Oct 2016 23:45:34 GMT')
('Content-Type:', 'text/html;charset=UTF-8')
('Transfer-Encoding:', 'chunked')
('Connection:', 'keep-alive')
('Set-Cookie:', '__cfduid=d5cf1dcf5ca91d2f6a84ff17894d8f3b21476661533; expires=Mon, 16-Oct-17 23:45:33 GMT; path=/; domain=.careerbuilder.com; HttpOnly, BID=X1C69E5FD56C8B4769F419FD1DCD18CBC367127A6A7F1D10D6BAD6B0B04A46457FE6FC9584CB983A9DC380DF34AC09C2FC; domain=.careerbuilder.com; path=/, _session_id=6e08fbd4a1bd4125700139869ec9a21c; path=/; expires=Mon, 17 Oct 2016 11:45:34 -0000; HttpOnly')
('ApplicationName:', 'CbMobile')
('Cache-Control:', 'max-age=0, private, must-revalidate')
('cb-request-id:', '2f2f629987cf1876-EWR')
('ETag:', 'W/"eb7230c0562072ef87e753e886297c16"')
('Routing-Host:', 'ip-10-0-13-11')
('Status:', '200 OK')
('X-Content-Type-Options:', 'nosniff')
('X-NewRelic-App-Data:', 'PxQAWVZWDQsTUFhbBgIDUUYdFHANCBcQXw5UB0oXXl1ROkoEUBNQCktfWQUDGxofAEpRTgYfBlVVBQUGWlFSUwJWClQPDhgfAkkbAgMFVwBSBQJQXlZbCgkCVEBq')
('X-Powered-By:',

In [22]:
# review content rendered as HTML in a headless browser

import selenium as sel
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# path on gus
# driver = webdriver.PhantomJS(executable_path='~/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin')

# path at home
driver = webdriver.PhantomJS(executable_path='/Users/EKandTower/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs')

driver.set_window_size(1024, 768)
driver.get('http://www.careerbuilder.com/jobs-data-scientist?keywords=data+scientist&pay=20')

# note: the search URL had to be narrowed several times; the response stalled with broader terms.
# storing the URL in a variable for later reuse was attempted, but soup did not recognize it.
# note: timeit.timeit() with no arguments times an empty `pass` statement one million times,
# so the number it returns is not the page load time.
timeit.timeit()

0.06231880187988281
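
The `timeit.timeit()` calls in this notebook return roughly 0.06 to 0.07 s whatever page is fetched, which is consistent with timing the default empty statement and not the cell itself. A minimal sketch of timing a single call directly is shown below; the `timed` helper is hypothetical, and `time.sleep` stands in for a real call such as `driver.get(url)`.

```python
import time
from timeit import default_timer


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed seconds)."""
    start = default_timer()
    result = func(*args, **kwargs)
    elapsed = default_timer() - start
    return result, elapsed


# time.sleep is a stand-in for a slow call such as driver.get(url).
result, seconds = timed(time.sleep, 0.05)
print('elapsed: {:.3f} s'.format(seconds))
```

In a notebook, the `%%time` cell magic gives a similar wall-clock figure for a whole cell.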

In [23]:
# show website
HTML(driver.page_source)


In [35]:
# pass through Beautiful Soup

# fetch with requests (imported above) instead of the Python 2-only urllib.urlopen
r = requests.get('http://www.careerbuilder.com/jobs-data-scientist?keywords=data+scientist&pay=20').content
soup = BeautifulSoup(r, 'lxml')
print(type(soup))

timeit.timeit()

<class 'bs4.BeautifulSoup'>


0.07046794891357422

In [36]:
# see what's in there

print(soup.text)





window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"42096352b4","applicationID":"4991347","transactionName":"el1YEUVZXlRRSxgIXlttRQBWRFFQG1BZBlRB","queueTime":0,"applicationTime":823,"agent":""}
(window.NREUM||(NREUM={})).loader_config={xpid:"Uw4HVFNbGwcJXVBRAwY="};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{s.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-er

In [37]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7 ]> <html class="ie6" lang="en"> <![endif]-->
<!--[if IE 7 ]> <html class="ie7" lang="en"> <![endif]-->
<!--[if IE 8 ]> <html class="ie8" lang="en"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]> ><! <![endif]-->
<html lang="en">
 <!-- <![endif] -->
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"42096352b4","applicationID":"4991347","transactionName":"el1YEUVZXlRRSxgIXlttRQBWRFFQG1BZBlRB","queueTime":0,"applicationTime":823,"agent":""}
  </script>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).loader_config={xpid:"Uw4HVFNbGwcJXVBRAwY="};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o

In [None]:
# extracting city and state

In [58]:
# get job links from soup

import re

for link in soup.find_all('a', href=re.compile('/job/')):
    print("http://www.careerbuilder.com" + link.get('href'))


http://www.careerbuilder.com/job/J3H3S262FXSL1FTXYR9?ipath=JRG1&searchid=e35edb6e-bb5c-44cd-9dbc-242784ab4ed3&siteid=cbnsv
http://www.careerbuilder.com/job/JHN248739K97BBXQ9PJ?ipath=JRG2&searchid=e35edb6e-bb5c-44cd-9dbc-242784ab4ed3&siteid=cbnsv
http://www.careerbuilder.com/job/J8Q24L6YYR2M90TWDV3?ipath=JRG3&searchid=e35edb6e-bb5c-44cd-9dbc-242784ab4ed3&siteid=cbnsv
http://www.careerbuilder.com/job/J3H4KQ71LVCXFDRG4JQ?ipath=JRG4&searchid=e35edb6e-bb5c-44cd-9dbc-242784ab4ed3&siteid=cbnsv
http://www.careerbuilder.com/job/J3L15W6K4HP16D6QCKZ?ipath=JRG5&searchid=e35edb6e-bb5c-44cd-9dbc-242784ab4ed3&siteid=cbnsv
http://www.careerbuilder.com/job/J3J4QX66750J03KDDK4?ipath=JRG6&searchid=e35edb6e-bb5c-44cd-9dbc-242784ab4ed3&siteid=cbnsv
http://www.careerbuilder.com/job/J3H6LC76TYD64GH0Y6M?ipath=JRG7&searchid=e35edb6e-bb5c-44cd-9dbc-242784ab4ed3&siteid=cbnsv
http://www.careerbuilder.com/job/J3H6QZ6F6YRNX2GG2VR?ipath=JRG8&searchid=e35edb6e-bb5c-44cd-9dbc-242784ab4ed3&siteid=cbnsv
http://www.caree

0.06836891174316406
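
To visit each posting later, the links printed above need to be kept, not just displayed. The sketch below collects them into a de-duplicated list of absolute URLs. It runs on a small inline HTML sample modeled on the `/job/` hrefs above (the shortened query strings and link text are made up), and `extract_job_links` is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin  # Python 3; on Python 2 use `from urlparse import urljoin`

from bs4 import BeautifulSoup

BASE_URL = 'http://www.careerbuilder.com'

# Inline sample standing in for the search results page.
sample_html = '''
<div>
  <a href="/job/J3H3S262FXSL1FTXYR9?ipath=JRG1&amp;siteid=cbnsv">Data Scientist</a>
  <a href="/job/J3H3S262FXSL1FTXYR9?ipath=JRG1&amp;siteid=cbnsv">Data Scientist (repeat)</a>
  <a href="/job/JHN248739K97BBXQ9PJ?ipath=JRG2&amp;siteid=cbnsv">Senior Data Scientist</a>
  <a href="/companies">Browse companies</a>
</div>
'''


def extract_job_links(soup, base_url=BASE_URL):
    """Return absolute job URLs in page order, skipping repeats."""
    links = []
    seen = set()
    for anchor in soup.find_all('a', href=re.compile('/job/')):
        url = urljoin(base_url, anchor['href'])
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


job_links = extract_job_links(BeautifulSoup(sample_html, 'html.parser'))
for url in job_links:
    print(url)
```

The same function can be applied to the `soup` object built from the live search page.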

In [27]:
# send GET request for a single job posting and check response code

job = requests.get('http://www.careerbuilder.com/job/J3H3S262FXSL1FTXYR9?ipath='
                   'JRG1&searchid=41c7d746-a553-4c2f-bcbc-8d23ef079ae8&siteid=ns_us_g')
job

<Response [200]>

In [28]:
# see request headers and print clean

for k, v in job.request.headers.items():
    print (k + ':', v)

('Connection:', 'keep-alive')
('Accept-Encoding:', 'gzip, deflate')
('Accept:', '*/*')
('User-Agent:', 'python-requests/2.10.0')


In [29]:
driver.set_window_size(1024, 768)
driver.get('http://www.careerbuilder.com/job/J3H3S262FXSL1FTXYR9?ipath='
           'JRG1&searchid=41c7d746-a553-4c2f-bcbc-8d23ef079ae8&siteid=ns_us_g')

timeit.timeit()

0.06978487968444824

In [31]:
# pass the job page through Beautiful Soup

# fetch with requests (imported above) instead of the Python 2-only urllib.urlopen
job = requests.get('http://www.careerbuilder.com/job/J3H3S262FXSL1FTXYR9?ipath='
                   'JRG1&searchid=41c7d746-a553-4c2f-bcbc-8d23ef079ae8&siteid=ns_us_g').content
soup_job = BeautifulSoup(job, 'lxml')
print(type(soup_job))

timeit.timeit()

<class 'bs4.BeautifulSoup'>


0.06377506256103516

In [32]:
print(soup_job.prettify())

<!DOCTYPE html>
<!--[if lt IE 7 ]> <html class="ie6" lang="en"> <![endif]-->
<!--[if IE 7 ]> <html class="ie7" lang="en"> <![endif]-->
<!--[if IE 8 ]> <html class="ie8" lang="en"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]> ><! <![endif]-->
<html lang="en">
 <!-- <![endif] -->
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"42096352b4","applicationID":"4991347","transactionName":"el1YEUVZXlRRSxgIXltBGRZfWUU=","queueTime":0,"applicationTime":171,"agent":""}
  </script>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).loader_config={xpid:"Uw4HVFNbGwcJXVBRAwY="};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o

In [33]:
# retrieve all elements with the "tag" class

print(soup_job.find_all(class_='tag'))

[<div class="tag">\nFull-Time\n</div>, <div class="tag">\nExperience - At least 3 year(s)\n</div>, <div class="tag">\nDegree - None\n</div>, <div class="tag">\n$95,000.00 - $120,000.00 /Year\n</div>, <div class="tag" id="job-industry">\nComputer Software, Banking - Financial Services, Biotechnology\n</div>, <div class="tag" id="job-categories">\nInformation Technology, Engineering, Professional Services\n</div>, <div class="tag">\nRelocation - No\n</div>]
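
The salary appears as free text inside one of the `tag` elements, so it has to be parsed before it can be modeled. The sketch below rebuilds a few of the `tag` divs from the output above as an inline sample and pulls the salary range out with a regular expression. The `parse_salary` helper, the pattern, and the use of the range midpoint are assumptions made for illustration; postings with other salary formats would need more cases.

```python
import re

from bs4 import BeautifulSoup

# Inline sample copied from the printed "tag" elements above.
sample_html = '''
<div class="tag">
Full-Time
</div>
<div class="tag">
Experience - At least 3 year(s)
</div>
<div class="tag">
$95,000.00 - $120,000.00 /Year
</div>
<div class="tag" id="job-industry">
Computer Software, Banking - Financial Services, Biotechnology
</div>
'''

# Matches ranges such as "$95,000.00 - $120,000.00 /Year".
SALARY_RANGE = re.compile(
    r'\$([\d,]+(?:\.\d+)?)\s*-\s*\$([\d,]+(?:\.\d+)?)\s*/\s*(\w+)')


def parse_salary(text):
    """Return low, high, midpoint and pay period, or None if no range is found."""
    match = SALARY_RANGE.search(text)
    if match is None:
        return None
    low = float(match.group(1).replace(',', ''))
    high = float(match.group(2).replace(',', ''))
    return {'low': low, 'high': high,
            'midpoint': (low + high) / 2, 'period': match.group(3)}


tag_soup = BeautifulSoup(sample_html, 'html.parser')
tags = [div.get_text(strip=True) for div in tag_soup.find_all(class_='tag')]
salaries = [parsed for parsed in map(parse_salary, tags) if parsed is not None]

print(tags)
print(salaries)
```

Keeping the pay period alongside the numbers makes it possible to filter out or convert hourly figures before assigning salary categories.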


In [None]:
results = soup_job.find_all("td", {"valign": "top"})

In [34]:
# retrieve all elements with the "description" class

print(soup_job.find_all(class_='description'))

[<div class="description">\nThis position is open as of 10/16/2016.<br/><br/>Data Scientist - Improving Healthcare through Big-Data Analytics<br/><br/><div>If you are a Data Scientist with 3+ years of professional experience, please read on!<br/>\n<br/>\nWith offices in Cambridge, MA we are a big data analytics software start-up that has taken a data driven approach to consumer travel. We have compiled billions of data sets around the world to give travelers actionable insights on where and when to travel for the best deals.</div><br/><br/><b>Top Reasons to Work with Us</b><br/><br/><div>1. This is a fun start-up company that is taking a completely new approach to the industry- and we have the VC backing to make it all happen!<br/>\n2. We are conveniently located in Cambridge, MA not far from the Harvard University Campus.<br/>\n3. We offer base compensation, bonus, stock, benefits, a business casual work environment and more.</div><br/><br/><b>What You Will Be Doing</b><br/><br/><div>
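
The description comes back as HTML with `<br/>` tags and nested divs, so it needs to be flattened to plain text before words can be used as predictors. The sketch below works on an inline excerpt of the description printed above and turns a keyword list into 0/1 indicator features. The keyword list and the `keyword_flags` helper are hypothetical; which words matter is something to decide from the data.

```python
from bs4 import BeautifulSoup

# Inline excerpt of the description element printed above.
sample_html = (
    '<div class="description">\nThis position is open as of 10/16/2016.<br/><br/>'
    'Data Scientist - Improving Healthcare through Big-Data Analytics<br/><br/>'
    '<div>If you are a Data Scientist with 3+ years of professional experience, '
    'please read on!</div></div>'
)

# Hypothetical keywords chosen for illustration only.
KEYWORDS = ['big-data', 'healthcare', 'python', 'start-up']


def keyword_flags(description_html, keywords=KEYWORDS):
    """Return {keyword: 1 or 0} based on a case-insensitive substring match."""
    text = BeautifulSoup(description_html, 'html.parser').get_text(
        separator=' ', strip=True).lower()
    return {word: int(word in text) for word in keywords}


flags = keyword_flags(sample_html)
print(flags)
```

Substring matching is deliberately simple and will also match inside longer words; a tokenizer or `CountVectorizer` is a reasonable next step when the keyword list grows.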