Job Board Scraping

Collect data on data science salary trends from a job listings aggregator for your analysis.

Select and parse data from at least 1,000 job postings, potentially from multiple location searches.

Find out which factors most directly affect salaries (title, location, department, etc.). In this case, we do not want to predict mean salary as would be done in a regression. Your boss believes that salary is better represented in categories than as a continuous value, so treat this as a classification problem.
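One way to turn a continuous salary into categories is to split at the median, which gives two balanced classes for a logistic regression. The sketch below assumes a simple high/low split; the helper names and the sample figures are hypothetical and are not taken from the scraped data.

```python
# Sketch: label salaries 'high' or 'low' by splitting at the median.
# The sample figures are made up for illustration only.

def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def salary_category(salary, cutoff):
    """Label a salary 'high' if it is above the cutoff, else 'low'."""
    return 'high' if salary > cutoff else 'low'


sample_salaries = [65000, 82000, 95000, 107500, 120000, 150000]  # hypothetical
cutoff = median(sample_salaries)
labels = [salary_category(s, cutoff) for s in sample_salaries]
print(cutoff)
print(labels)
```

More than two categories (for example quartiles) would follow the same pattern with additional cutoffs.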


Test, validate, and describe your models. What factors predict salary category? How do your models perform?

Prepare a presentation for your Principal detailing your analysis.


BONUS PROBLEMS:

1. Your boss would rather incorrectly tell a client that they would get a lower-salary job than incorrectly tell a client that they would get a high-salary job. Adjust one of your logistic regression models to ease her mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.
2. Text variables and regularization:

Part 1: Job descriptions contain more potentially useful information you could leverage. Use the job summary to find words you think would be important and add them as predictors to a model.
Part 2: Grid search parameters for Ridge and Lasso for this model and report the best model.
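For the first bonus problem, one common adjustment is to raise the probability threshold at which the model predicts "high salary". That trades missed high-salary jobs (false negatives) for fewer wrong high-salary predictions (false positives). The sketch below illustrates the tradeoff in plain Python; `y_true` and `y_prob` are made-up values standing in for true labels and a model's predicted probabilities, and the function names are hypothetical.

```python
# Sketch: raising the decision threshold to cut false "high salary" calls.
# y_true and y_prob are made-up values, not results from this notebook.

def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn), predicting 1 ('high') when prob >= threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_point(y_true, y_prob, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return fp / float(fp + tn), tp / float(tp + fn)


y_true = [1, 1, 1, 1, 0, 0, 0, 0]                     # 1 = high salary
y_prob = [0.9, 0.8, 0.65, 0.4, 0.7, 0.55, 0.3, 0.1]   # hypothetical scores

default = confusion_counts(y_true, y_prob, 0.5)
cautious = confusion_counts(y_true, y_prob, 0.75)
roc = [roc_point(y_true, y_prob, t) for t in (0.25, 0.5, 0.75)]
print(default)
print(cautious)
print(roc)
```

Each threshold contributes one point to the ROC curve, so sweeping the threshold across the range of predicted probabilities traces the full curve.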


Goal: Scrape & clean data, run logistic regression, derive insights, present findings.

Requirements

Scrape and prepare your data using BeautifulSoup.
A team Jupyter Notebook with your regression analysis for a peer audience of data scientists.
An individual blog post describing your findings, with two sections: the first for a non-technical audience, and the second for data scientist peers.

In [12]:
# import modules and packages

import pandas as pd
import numpy as np
import requests
from IPython.core.display import HTML
from selenium import webdriver
from bs4 import BeautifulSoup
import timeit
import urllib2

In [13]:
# send get request for target website and check response code

r = requests.get('http://www.careerbuilder.com/jobs-data-scientist?keywords=data+scientist&pay=20')
r

<Response [200]>

In [14]:
# pass through Beautiful Soup

import urllib
r = urllib.urlopen('http://www.careerbuilder.com/jobs-data-scientist?keywords=data+scientist&pay=20').read()
soup = BeautifulSoup(r, 'lxml')
print type(soup)

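# note: with no arguments, timeit.timeit() times an empty statement
# 1,000,000 times; it does not measure the download or parse in this cell
timeit.timeit()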

<class 'bs4.BeautifulSoup'>


0.06422781944274902
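The figure printed by a bare `timeit.timeit()` call is the cost of running an empty statement a million times, so it says nothing about how long the fetch and parse took. A minimal sketch of timing a specific call is shown here; `timed` is a hypothetical helper, and `sorted` stands in for the real download or parse step.

```python
from timeit import default_timer


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, seconds elapsed)."""
    start = default_timer()
    result = func(*args, **kwargs)
    return result, default_timer() - start


# sorted() is only a stand-in for the call you actually want to time
result, seconds = timed(sorted, [3, 1, 2])
print(result)
print(seconds >= 0)
```

In a notebook, the `%%time` cell magic gives the same kind of measurement for a whole cell.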

In [15]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7 ]> <html class="ie6" lang="en"> <![endif]-->
<!--[if IE 7 ]> <html class="ie7" lang="en"> <![endif]-->
<!--[if IE 8 ]> <html class="ie8" lang="en"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]> ><! <![endif]-->
<html lang="en">
 <!-- <![endif] -->
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"42096352b4","applicationID":"4991347","transactionName":"el1YEUVZXlRRSxgIXlttRQBWRFFQG1BZBlRB","queueTime":0,"applicationTime":917,"agent":""}
  </script>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).loader_config={xpid:"Uw4HVFNbGwcJXVBRAwY="};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o

In [16]:
# collect links to individual job postings from 31 pages of search results

import re

base_url = "http://www.careerbuilder.com/jobs-data-science?page_number={}&pay=20"
job_pages = []
for page_number in range(1, 32):
    rm = requests.get(base_url.format(page_number))
    # parse the results page that was just fetched
    page_soup = BeautifulSoup(rm.text, 'lxml')
    for link in page_soup.find_all('a', href=re.compile('/job/')):
        job_pages.append("http://www.careerbuilder.com" + link.get('href'))
#job_pages

In [17]:
# send get request for target website and check response code

job = requests.get('http://www.careerbuilder.com/job/J3H3S262FXSL1FTXYR9?ipath=\
JRG1&searchid=41c7d746-a553-4c2f-bcbc-8d23ef079ae8&siteid=ns_us_g')
job

<Response [200]>

In [18]:
# pass through Beautiful Soup

import urllib
job = urllib.urlopen('http://www.careerbuilder.com/job/J3H3S262FXSL1FTXYR9?ipath=\
JRG1&searchid=41c7d746-a553-4c2f-bcbc-8d23ef079ae8&siteid=ns_us_g').read()
soup_job = BeautifulSoup(job, 'lxml')
print type(soup_job)

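# note: with no arguments, timeit.timeit() times an empty statement
# 1,000,000 times; it does not measure the download or parse in this cell
timeit.timeit()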

<class 'bs4.BeautifulSoup'>


0.06327509880065918

In [19]:
print(soup_job.prettify())

<!DOCTYPE html>
<!--[if lt IE 7 ]> <html class="ie6" lang="en"> <![endif]-->
<!--[if IE 7 ]> <html class="ie7" lang="en"> <![endif]-->
<!--[if IE 8 ]> <html class="ie8" lang="en"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]> ><! <![endif]-->
<html lang="en">
 <!-- <![endif] -->
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"42096352b4","applicationID":"4991347","transactionName":"el1YEUVZXlRRSxgIXltBGRZfWUU=","queueTime":0,"applicationTime":154,"agent":""}
  </script>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).loader_config={xpid:"Uw4HVFNbGwcJXVBRAwY="};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o

In [20]:
# retrieve all information with "tag" class

print(soup_job.findAll(class_='tag'))

[<div class="tag">\nFull-Time\n</div>, <div class="tag">\nExperience - At least 3 year(s)\n</div>, <div class="tag">\nDegree - None\n</div>, <div class="tag">\n$95,000.00 - $120,000.00 /Year\n</div>, <div class="tag" id="job-industry">\nComputer Software, Banking - Financial Services, Biotechnology\n</div>, <div class="tag" id="job-categories">\nInformation Technology, Engineering, Professional Services\n</div>, <div class="tag">\nRelocation - No\n</div>]
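The salary arrives as free text such as `$95,000.00 - $120,000.00 /Year`, so it has to be converted to numbers before it can be binned into categories. The sketch below is one way to do that with a regular expression. `parse_salary_range` is a hypothetical helper, and it assumes every salary tag uses the dollar-amount format seen in the output above.

```python
import re

# matches dollar amounts such as $95,000.00 or $120,000
SALARY_PATTERN = re.compile(r'\$([\d,]+(?:\.\d+)?)')


def parse_salary_range(tag_text):
    """Return (low, high, period) from text like '$95,000.00 - $120,000.00 /Year'.

    Returns None when the text contains no dollar amount.
    """
    amounts = [float(a.replace(',', '')) for a in SALARY_PATTERN.findall(tag_text)]
    if not amounts:
        return None
    period = tag_text.rsplit('/', 1)[-1].strip() if '/' in tag_text else None
    return min(amounts), max(amounts), period


low, high, period = parse_salary_range('\n$95,000.00 - $120,000.00 /Year\n')
midpoint = (low + high) / 2.0
print(low, high, period)
print(midpoint)
```

Keeping the pay period matters if some postings quote hourly rates, since those would need converting before comparison with annual figures.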


In [21]:
result = soup_job.findAll(class_='small-12 item')

tcl = unicode.join(u'\n',map(unicode,result))

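# caution: strip() removes any of these characters from both ends of the string,
# not the literal prefix, so a title that starts with one of them would be clipped
tclstripped=tcl.strip('<div class="small-12 item">\n<h1>\n')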

tclstripped

tclstr2 = tclstripped.split('\n', 12 )

title = []
company  = []
location = []

title.append(tclstr2[0])
company.append(tclstr2[5])
location.append(tclstr2[7])

print title
print company
print location

[u'Data Scientist']
[u'CyberCoders']
[u'Cambridge, MA']
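The extraction above works for this posting, but `strip()` treats its argument as a set of characters rather than a literal prefix, so it can eat the start of a title. The sketch below shows the pitfall on a made-up title and a safer alternative; `strip_prefix` is a hypothetical helper, not part of the original notebook.

```python
def strip_prefix(text, prefix):
    """Remove prefix from the start of text only if it is literally there."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


markup = '<h1>\nhead chef'           # hypothetical job title
clipped = markup.strip('<h1>\n')     # argument is treated as a set of characters
intact = strip_prefix(markup, '<h1>\n')
print(clipped)   # ead chef
print(intact)    # head chef
```

Reading the text of the parsed tag directly (for example with Beautiful Soup's `get_text()`) avoids string surgery on the markup altogether.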


In [16]:
# retrieve all information with "description" class

print(soup_job.findAll(class_='description'))

[<div class="description">\nThis position is open as of 10/18/2016.<br/><br/>Data Scientist - Improving Healthcare through Big-Data Analytics<br/><br/><div>If you are a Data Scientist with 3+ years of professional experience, please read on!<br/>\n<br/>\nWith offices in Cambridge, MA we are a big data analytics software start-up that has taken a data driven approach to consumer travel. We have compiled billions of data sets around the world to give travelers actionable insights on where and when to travel for the best deals.</div><br/><br/><b>Top Reasons to Work with Us</b><br/><br/><div>1. This is a fun start-up company that is taking a completely new approach to the industry- and we have the VC backing to make it all happen!<br/>\n2. We are conveniently located in Cambridge, MA not far from the Harvard University Campus.<br/>\n3. We offer base compensation, bonus, stock, benefits, a business casual work environment and more.</div><br/><br/><b>What You Will Be Doing</b><br/><br/><div>

In [139]:
hours = []
experience = []
degree = []
salary = []

soup_job.findAll(class_='job-facts item')

# i = 0
# for d in soup_job.findAll(class_='job-facts item'):
#     print(d)
#     print('\n')
#     i += 1

    
#     dt = unicode.join(u'\n',map(unicode,d))    
    
# tags = unicode.join(u'\n',map(unicode,result))

# tclstripped=tcl.strip('<div class="small-12 item">\n<h1>\n')

# tclstripped

## tclstr2 = tclstripped.split('\n', 12 )

[<div class="job-facts item">\n<div class="tag">\nFull-Time\n</div>\n<div class="tag">\nExperience - At least 3 year(s)\n</div>\n<div class="tag">\nDegree - None\n</div>\n<div class="tag">\n$95,000.00 - $120,000.00 /Year\n</div>\n<div class="tag" id="job-industry">\nComputer Software, Banking - Financial Services, Biotechnology\n</div>\n<div class="tag" id="job-categories">\nInformation Technology, Engineering, Professional Services\n</div>\n<div class="tag">\nRelocation - No\n</div>\n</div>]
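The tags in the `job-facts` block can be sorted into the `hours`, `experience`, `degree`, and `salary` fields by looking at how each string starts. The sketch below assumes the tag text has already been pulled out of the markup and that every posting uses the prefixes seen in the output above. `classify_tag` is a hypothetical helper, and the `Part-Time` value is an assumption, since only `Full-Time` appears in this output.

```python
# prefix -> field name, based on the tag text shown in the output above
PREFIXES = (
    ('Experience - ', 'experience'),
    ('Degree - ', 'degree'),
    ('Relocation - ', 'relocation'),
)


def classify_tag(text):
    """Return (field, value) for one job-facts tag string."""
    text = text.strip()
    for prefix, field in PREFIXES:
        if text.startswith(prefix):
            return field, text[len(prefix):]
    if text.startswith('$'):
        return 'salary', text
    if text in ('Full-Time', 'Part-Time'):   # 'Part-Time' is assumed
        return 'hours', text
    return 'other', text


sample_tags = [
    '\nFull-Time\n',
    '\nExperience - At least 3 year(s)\n',
    '\nDegree - None\n',
    '\n$95,000.00 - $120,000.00 /Year\n',
    '\nRelocation - No\n',
]
facts = dict(classify_tag(t) for t in sample_tags)
print(facts)
```

The industry and category tags carry `id` attributes (`job-industry`, `job-categories`), so they are easier to select by id than by text.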

In [20]:
salary = []
company = []
location = []
jobs = []
jobdescription = []
hours = []  # full time / part time
experience = []
degree = []
relocation = []
industry = []
category = []

import urllib2
from bs4 import BeautifulSoup

page_num = 1

# for link in soup.find_all('a', href=re.compile('/job/') ):
#         job_pages.append("http://www.careerbuilder.com" + (link.get('href')))

for j in job_pages:
    # fetch and parse each individual job posting
    open_url = urllib2.urlopen(j).read()
    job_page = BeautifulSoup(open_url, 'lxml')

    # append job titles
    for d in job_page.findAll(class_='job-title'):
        jobs.append(d.text)
        
#     for i in market_page('div', {'class' : 'market_listing_row      market_recent_listing_row market_listing_searchresult'}):
#         item_name = i.find_all('span', {'class' : 'market_listing_item_name'})[0].get_text()
#         price = i.find_all('span')[1].get_text()
#         page_num += 1
#         print  item_name + ' costs ' + price

[<div class="tag">\nFull-Time\n</div>, <div class="tag">\nExperience - At least 3 year(s)\n</div>, <div class="tag">\nDegree - None\n</div>, <div class="tag">\n$95,000.00 - $120,000.00 /Year\n</div>, <div class="tag" id="job-industry">\nComputer Software, Banking - Financial Services, Biotechnology\n</div>, <div class="tag" id="job-categories">\nInformation Technology, Engineering, Professional Services\n</div>, <div class="tag">\nRelocation - No\n</div>]
[<div class="tag">\nFull-Time\n</div>, <div class="tag">\nExperience - At least 3 year(s)\n</div>, <div class="tag">\nDegree - None\n</div>, <div class="tag">\n$95,000.00 - $120,000.00 /Year\n</div>, <div class="tag" id="job-industry">\nComputer Software, Banking - Financial Services, Biotechnology\n</div>, <div class="tag" id="job-categories">\nInformation Technology, Engineering, Professional Services\n</div>, <div class="tag">\nRelocation - No\n</div>]
[<div class="tag">\nFull-Time\n</div>, <div class="tag">\nExperience - At least 

error: [Errno 10053] An established connection was aborted by the software in your host machine
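The run above stopped when the connection was aborted partway through the loop. Requesting many pages in quick succession can trigger this, so it may help to pause between requests and retry a failed one a few times. The sketch below is a generic retry wrapper; `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the wrapper assumes the fetch function raises `IOError` (or a subclass, as `urllib2` and `requests` errors are) on network failure.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on IOError with a growing pause between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except IOError:
            if attempt == attempts:
                raise              # out of retries: let the error surface
            sleep(delay * attempt)


# Stand-in fetch function that fails twice, then succeeds (for illustration).
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError('connection aborted')
    return 'page contents'


pauses = []   # record the pauses instead of actually sleeping
page = fetch_with_retries(flaky_fetch, 'http://example.com', delay=0.5,
                          sleep=pauses.append)
print(page)
print(pauses)
```

In the scraping loop, `fetch` would be whatever function downloads one job page, and the default `sleep` would add a real pause between attempts.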