# Web Scraping - Data Science Jobs in Chicago, IL
### Query: Data Scientist -- Chicago, IL
This notebook illustrates how to scrape some common job boards for **links** to job postings. The first results page for each site comes from a search run in a web browser of the user's choice. For example, searching Glassdoor's job postings for *'data scientist'* in *'Chicago, IL'* produces the URL stored in `gd_p1`. This is a starting point for more complex analyses in which a user would open each job posting link and scrape its content.

This exploratory work resulted in the functions at the end of this notebook, which will be used in further analyses. If these functions ever break, this notebook will serve as a starting point for debugging.

**Next steps**:
* Defining functions to open and scrape data from the job postings
* Iterating through URLs for each site and writing data to disk (sqlite or other)
* Performing analysis on resulting data
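
The "writing data to disk" step could look roughly like the sketch below. It assumes sqlite through Python's standard `sqlite3` module; the table name `job_links`, its columns, the helper `save_links`, and the example URLs are all hypothetical and not part of this notebook.

```python
import sqlite3


def save_links(conn, site, links):
    """Store (url, site) rows and return how many were new.

    `job_links` and its columns are illustrative names (an assumption).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_links ("
        "url TEXT PRIMARY KEY, site TEXT NOT NULL)"
    )

    def count_rows():
        return conn.execute("SELECT COUNT(*) FROM job_links").fetchone()[0]

    before = count_rows()
    # INSERT OR IGNORE skips URLs that are already stored
    conn.executemany(
        "INSERT OR IGNORE INTO job_links (url, site) VALUES (?, ?)",
        [(url, site) for url in links],
    )
    conn.commit()
    return count_rows() - before


conn = sqlite3.connect(":memory:")  # pass a file path instead to persist to disk
new_rows = save_links(conn, "glassdoor", ["http://example.com/job/1",
                                           "http://example.com/job/2"])
```

Using the URL as the primary key means re-running a scrape does not create duplicate rows.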

In [1]:
import requests
from bs4 import BeautifulSoup
import re
from IPython.display import HTML

### Glassdoor
Glassdoor also has a [free public API](http://www.glassdoor.com/developer/index.htm).


In [123]:
gd_uri = 'http://www.glassdoor.com'
gd_p1 = 'http://www.glassdoor.com/Job/chicago-data-scientist-jobs-SRCH_IL.0,7_IC1128808_KO8,22.htm'
gd_p2 = 'http://www.glassdoor.com/Job/chicago-data-scientist-jobs-SRCH_IL.0,7_IC1128808_KO8,22_IP2.htm'

In [124]:
regex = r'ListingId=(.*)'
gd_page = requests.get(gd_p1, headers={'User-Agent': 'Mozilla/5.0'})
gd_data = gd_page.text
gd_soup = BeautifulSoup(gd_data, 'html.parser')         # name the parser explicitly
gd_jobs = []
gd_job_ids = []

for link in gd_soup.find_all('a'):                      # for each link in the page
    href = link.get('href')                             # get the href (None if missing)
    if href and 'partner/job' in href:                  # if it is a link for a job posting
        match = re.search(regex, href)                  # look for the listing id
        if match and match.group(1) not in gd_job_ids:  # skip jobs we already have
            gd_jobs.append(gd_uri + href)               # add the job to the list of jobs
            gd_job_ids.append(match.group(1))           # add the job id to the list of job ids

In [125]:
# the number of jobs on the page
len(gd_jobs)


29

### Indeed

In [6]:
in_p1 = 'http://www.indeed.com/jobs?q=Data+Scientist&l=Chicago%2C+IL'
in_p2 = 'http://www.indeed.com/jobs?q=Data+Scientist&l=Chicago%2C+IL&start=10'
in_p3 = 'http://www.indeed.com/jobs?q=Data+Scientist&l=Chicago%2C+IL&start=20'
in_uri = 'http://www.indeed.com'
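
The three page URLs above differ only in the `start` query parameter, which advances by 10 per page. A minimal sketch of generating them, assuming that pattern continues for later pages; `indeed_page_url` and its `per_page` argument are hypothetical and not part of this notebook.

```python
from urllib.parse import urlencode


def indeed_page_url(query, location, page=1, per_page=10):
    """Build the search URL for a 1-indexed results page (hypothetical helper)."""
    params = {'q': query, 'l': location}
    if page > 1:
        # page 2 -> start=10, page 3 -> start=20 (assumed pattern)
        params['start'] = (page - 1) * per_page
    return 'http://www.indeed.com/jobs?' + urlencode(params)


page_urls = [indeed_page_url('Data Scientist', 'Chicago, IL', page=n)
             for n in range(1, 4)]
```

`urlencode` handles the escaping (spaces as `+`, the comma as `%2C`), so the query and location can be passed as plain strings.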

In [7]:
in_page = requests.get(in_p1, headers={'User-Agent': 'Mozilla/5.0'})
in_data = in_page.text
in_soup = BeautifulSoup(in_data, 'html.parser')

In [13]:
in_jobs = []
for link in in_soup.find_all('a'):
    href = link.get('href')            # None for anchors without an href
    if href and '/rc/' in href:        # there is an 'rc' in the links for the jobs
        in_jobs.append(in_uri + href)  # links to the actual jobs

In [95]:
len(in_jobs)

10

#### Two Sample Postings
The job postings for Indeed are external links, which means the actual job postings do not have a consistent format. This may still be useful if we just want raw text, but it will be skipped for now.

In [28]:
disp_soup = BeautifulSoup(requests.get(in_jobs[3], headers={'User-Agent': 'Mozilla/5.0'}).text, 'html.parser')
HTML(str(disp_soup))

Back to Search Results

* Data Scientist
* Company: VHA-UHC Alliance NewCo, Inc.
* Location: NewCo - Chicago
* Requisition ID: 4167
* # of Openings: 1

Previous Applicants: Email, Password. If you do not remember your password click here.


In [30]:
disp_soup = BeautifulSoup(requests.get(in_jobs[1], headers={'User-Agent': 'Mozilla/5.0'}).text, 'html.parser')
HTML(str(disp_soup))

Careers > Job Search Results JOB SEARCH RESULTS

(Job Number: ) Description Cognizant is an Equal Opportunity Employer Minority/Female/Disability/Veteran. If you require accessibility assistance applying for open positions in the US please send an email with your request to CareersNorthAmerica@cognizant.com Qualifications Job Primary Location Other Locations Employee Status Schedule Job Type Job Level Shift Travel Job Posting Organization

Refer a friend for this job, Refer a friend, Refer a candidate, Submit a candidate's profile


### Kaggle

In [148]:
kg_uri = 'https://www.kaggle.com'
kg_p1 = 'https://www.kaggle.com/forums/f/145/data-science-jobs'
kg_p2 = 'https://www.kaggle.com/forums/f/145/data-science-jobs?page=2'

In [151]:
kg_page = requests.get(kg_p1, headers={'User-Agent': 'Mozilla/5.0'})
kg_data = kg_page.text
kg_soup = BeautifulSoup(kg_data, 'html.parser')

In [152]:
regex = r'/forums/f/145/data-science-jobs/t/(.*?)[/]'
kg_jobs = []
kg_job_ids = []
for link in kg_soup.find_all('a'):                             # for each link in the page
    href = link.get('href')                                    # get the href (None if missing)
    if href and '/forums/f/145/data-science-jobs/t/' in href:  # if it is a link for a job posting
        match = re.search(regex, href)                         # look for the thread id
        if match and match.group(1) not in kg_job_ids:         # skip jobs we already have
            kg_jobs.append(kg_uri + href)                      # add the job to the list of jobs
            kg_job_ids.append(match.group(1))                  # add the job id to the list of job ids


In [153]:
len(kg_jobs)

20

### LinkedIn

In [80]:
li_uri = 'https://www.linkedin.com'
li_p1 = 'https://www.linkedin.com/job/data-scientist-jobs-chicago-il/?sort=relevance&page_num=1&trk=jserp_pagination_1'
li_p2 = 'https://www.linkedin.com/job/data-scientist-jobs-chicago-il/?sort=relevance&page_num=2&trk=jserp_pagination_2'

In [81]:
li_page = requests.get(li_p1, headers={'User-Agent': 'Mozilla/5.0'})
li_data = li_page.text
li_soup = BeautifulSoup(li_data, 'html.parser')

In [92]:
regex = r'/jobs2/view/(.*?)[?]'
li_jobs = []
li_job_ids = []
for link in li_soup.find_all('a'):                      # for each link in the page
    href = link.get('href')                             # get the href (None if missing)
    if href and '/jobs2/view/' in href:                 # if it is a link for a job posting
        match = re.search(regex, href)                  # look for the job id
        if match and match.group(1) not in li_job_ids:  # skip jobs we already have
            li_jobs.append(href)                        # add the job to the list of jobs
            li_job_ids.append(match.group(1))           # add the job id to the list of job ids

In [94]:
len(li_jobs)

25

### CareerBuilder

In [2]:
cb_uri = 'http://www.careerbuilder.com'
cb_p1 = 'http://www.careerbuilder.com/jobseeker/jobs/jobresults.aspx?IPath=QHKVGV&excrit=freeLoc%3dChicago%2c+IL%3bst%3dA%3buse%3dALL%3brawWords%3ddata+scientist%3bTID%3d0%3bCTY%3dChicago%3bSID%3dIL%3bCID%3dUS%3bLOCCID%3dUS%3bENR%3dNO%3bDTP%3dDRNS%3bYDI%3dYES%3bIND%3dALL%3bPDQ%3dAll%3bPDQ%3dAll%3bPAYL%3d0%3bPAYH%3dGT120%3bPOY%3dNO%3bETD%3dALL%3bRE%3dALL%3bMGT%3dDC%3bSUP%3dDC%3bFRE%3d30%3bCHL%3dAL%3bQS%3dSID_UNKNOWN%3bSS%3dNO%3bTITL%3d0%3bOB%3d-relv%3bJQT%3dRAD%3bJDV%3dFalse%3bSITEENT%3dUSJOB%3bMaxLowExp%3d-1%3bRecsPerPage%3d25'
cb_p2 = 'http://www.careerbuilder.com/jobseeker/jobs/jobresults.aspx?excrit=freeLoc%3dChicago%2c+IL%3bst%3dA%3buse%3dALL%3brawWords%3ddata+scientist%3bTID%3d0%3bCTY%3dChicago%3bSID%3dIL%3bCID%3dUS%3bLOCCID%3dUS%3bENR%3dNO%3bDTP%3dDRNS%3bYDI%3dYES%3bIND%3dALL%3bPDQ%3dAll%3bPDQ%3dAll%3bPAYL%3d0%3bPAYH%3dGT120%3bPOY%3dNO%3bETD%3dALL%3bRE%3dALL%3bMGT%3dDC%3bSUP%3dDC%3bFRE%3d30%3bCHL%3dAL%3bQS%3dSID_UNKNOWN%3bSS%3dNO%3bTITL%3d0%3bOB%3d-relv%3bJQT%3dRAD%3bJDV%3dFalse%3bSITEENT%3dUSJOB%3bMaxLowExp%3d-1%3bRecsPerPage%3d25&pg=2&IPath=QHKVGV'

In [3]:
cb_page = requests.get(cb_p1, headers={'User-Agent': 'Mozilla/5.0'})
cb_data = cb_page.text
cb_soup = BeautifulSoup(cb_data, 'html.parser')

In [7]:
cb_jobs = []
for link in cb_soup.find_all('a'):                              # for each link in the page
    href = link.get('href')
    if href:
        if '/jobseeker/jobs/jobdetails' in href:
            cb_jobs.append(href)

In [8]:
len(cb_jobs)

25

### Some functions we will need later

In [24]:
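def get_soup(url):
    """Fetch a page and return it parsed as a BeautifulSoup object."""
    page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    data = page.text
    return BeautifulSoup(data, 'html.parser')  # name the parser explicitly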



def get_gd_links(gd_soup):
    gd_uri = 'http://www.glassdoor.com'
    regex = r'ListingId=(.*)'

    gd_jobs = []
    gd_job_ids = []

    for link in gd_soup.find_all('a'):                      # for each link in the page
        href = link.get('href')                             # get the href (None if missing)
        if href and 'partner/job' in href:                  # if it is a link for a job posting
            match = re.search(regex, href)                  # look for the listing id
            if match and match.group(1) not in gd_job_ids:  # skip jobs we already have
                gd_jobs.append(gd_uri + href)               # add the job to the list of jobs
                gd_job_ids.append(match.group(1))           # add the job id to the list of job ids

    return gd_jobs, gd_job_ids



def get_kg_links(kg_soup):
    kg_uri = 'https://www.kaggle.com'
    regex = r'/forums/f/145/data-science-jobs/t/(.*?)[/]'
    kg_jobs = []
    kg_job_ids = []
    for link in kg_soup.find_all('a'):                             # for each link in the page
        href = link.get('href')                                    # get the href (None if missing)
        if href and '/forums/f/145/data-science-jobs/t/' in href:  # if it is a link for a job posting
            match = re.search(regex, href)                         # look for the thread id
            if match and match.group(1) not in kg_job_ids:         # skip jobs we already have
                kg_jobs.append(kg_uri + href)                      # add the job to the list of jobs
                kg_job_ids.append(match.group(1))                  # add the job id to the list of job ids

    return kg_jobs, kg_job_ids



def get_li_links(li_soup):
    regex = r'/jobs2/view/(.*?)[?]'
    li_jobs = []
    li_job_ids = []
    for link in li_soup.find_all('a'):                      # for each link in the page
        href = link.get('href')                             # get the href (None if missing)
        if href and '/jobs2/view/' in href:                 # if it is a link for a job posting
            match = re.search(regex, href)                  # look for the job id
            if match and match.group(1) not in li_job_ids:  # skip jobs we already have
                li_jobs.append(href)                        # add the job to the list of jobs
                li_job_ids.append(match.group(1))           # add the job id to the list of job ids
    return li_jobs, li_job_ids



def get_cb_links(cb_soup):
    cb_jobs = []
    for link in cb_soup.find_all('a'):                              # for each link in the page
        href = link.get('href')
        if href:
            if '/jobseeker/jobs/jobdetails' in href:
                cb_jobs.append(href)
    return cb_jobs
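
The Glassdoor, Kaggle, and LinkedIn functions above share one pattern: keep hrefs that contain a site-specific marker, pull a job id out with a regex, and skip ids already seen. As a sketch of how that shared logic might be factored out, the hypothetical helper `extract_job_links` below works on a plain list of href strings, so it can be tried without fetching a page. Its name, signature, and the sample hrefs are assumptions, not part of this notebook.

```python
import re


def extract_job_links(hrefs, marker, id_regex, base=''):
    """Return (links, ids) for hrefs containing `marker`, deduplicated by id."""
    pattern = re.compile(id_regex)
    links, ids = [], []
    seen = set()
    for href in hrefs:
        if not href or marker not in href:  # skip missing hrefs and non-job links
            continue
        match = pattern.search(href)
        if match is None:                   # marker present but no id to extract
            continue
        job_id = match.group(1)
        if job_id in seen:                  # same posting linked more than once
            continue
        seen.add(job_id)
        links.append(base + href)
        ids.append(job_id)
    return links, ids


# Made-up hrefs shaped like the LinkedIn links matched above
sample = [
    '/jobs2/view/111?trk=a',
    '/jobs2/view/111?trk=b',  # duplicate id
    '/jobs2/view/222?trk=c',
    '/about',
    None,
]
links, ids = extract_job_links(sample, '/jobs2/view/', r'/jobs2/view/(.*?)[?]')
```

With a parsed page, the href list could be built as `[a.get('href') for a in soup.find_all('a')]` and passed in with each site's marker, regex, and base URL.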

In [25]:
get_cb_links(get_soup(cb_p1))

['http://www.careerbuilder.com/jobseeker/jobs/jobdetails.aspx?APath=2.21.0.0.0&job_did=J3H15K78R1V81R5QV46&IPath=QHKVGV0A',
 'http://www.careerbuilder.com/jobseeker/jobs/jobdetails.aspx?APath=2.21.0.0.0&job_did=JHT78L5YDDBL4JM4ZCJ&IPath=QHKVGV0B',
 'http://www.careerbuilder.com/jobseeker/jobs/jobdetails.aspx?APath=2.21.0.0.0&job_did=J3J3Z36RKF2T74X4XWL&IPath=QHKVGV0C',
 'http://www.careerbuilder.com/jobseeker/jobs/jobdetails.aspx?APath=2.21.0.0.0&job_did=JHN47B6M98TKZPJTNJS&IPath=QHKVGV0D',
 'http://www.careerbuilder.com/jobseeker/jobs/jobdetails.aspx?APath=2.21.0.0.0&job_did=JHL1R570BSDWTCZVBQ4&IPath=QHKVGV0E',
 'http://www.careerbuilder.com/jobseeker/jobs/jobdetails.aspx?APath=2.21.0.0.0&job_did=JYR24P65SPJQBQ1Z72L&IPath=QHKVGV0F',
 'http://www.careerbuilder.com/jobseeker/jobs/jobdetails.aspx?APath=2.21.0.0.0&job_did=J3F7416Z9P5T6LMPVYP&IPath=QHKVGV0G',
 'http://www.careerbuilder.com/jobseeker/jobs/jobdetails.aspx?APath=2.21.0.0.0&job_did=J3H6826PW9Q03MFX74T&IPath=QHKVGV0H',
 'http:/