# Introduction

This project builds a scraper that extracts useful data from the job posts on www.indeed.com for a customizable search keyword and location.

Libraries used: `requests`, `bs4`, and `pandas`


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Tasks

- Scrape the first 100 available search results across multiple pages
- Generalize the code to allow searching for different locations/jobs
- Pick out information about the URL, job title and job location
- Save the results to a file

# Part 1: Inspect

- How do the URLs change when you navigate to the next results page?
- How do the URLs change when you use a different location and/or job title search?
- Which HTML elements contain the link, title and location of each job?
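
One way to approach the first two questions is to split each URL into its query parameters and compare them. The sketch below uses only the standard library and the two result-page URLs used in this walkthrough; `query_params` is a hypothetical helper name, not part of the scraper.

```python
from urllib.parse import urlsplit, parse_qs

def query_params(url):
    """Return the query string of `url` as a {name: value} dict."""
    # parse_qs maps each name to a list of values; keep the first one
    parsed = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in parsed.items()}

first_page = 'https://www.indeed.com/jobs?q=python&l=new+york'
second_page = 'https://www.indeed.com/jobs?q=python&l=new+york&start=10'

print(query_params(first_page))
print(query_params(second_page))
```

The keyword sits in `q`, the location in `l`, and the page offset in `start`.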

# Part 2: Scrape and Parse one Page


## Scrape the Page

Let's start by scraping the first page of results returned for the keyword "python" and the location "new york".

In [2]:
# url of the result page
url = 'https://www.indeed.com/jobs?q=python&l=new+york'

# scrape and parse the page 
# to return a list of job results
def scrapeResultPage(url):
    # scrape the html; re-raise on failure so that `res`
    # is never used while undefined
    try:
        res = requests.get(url, timeout=10)
        res.raise_for_status()
    except requests.RequestException as err:
        print(err)
        raise

    # Make a soup (name the parser explicitly to avoid a bs4 warning)
    soup = BeautifulSoup(res.content, 'html.parser')

    # Parse result column
    result = soup.find(id='resultsCol')
    jobs = result.find_all('a', class_='result')
    return jobs

jobs = scrapeResultPage(url)
len(jobs) #number of jobs displayed on this page

15

In [3]:
# Examine one of the results
print(jobs[0].prettify())

<a class="tapItem fs-unmask result job_2b6bb659e8d137a5 resultWithShelf sponTapItem desktop" data-hide-spinner="true" data-hiring-event="false" data-jk="2b6bb659e8d137a5" data-mobtk="1fqu3c3ljo15k800" href="/rc/clk?jk=2b6bb659e8d137a5&amp;fccid=1684d3322a4331bc&amp;vjs=3" id="job_2b6bb659e8d137a5" rel="nofollow" target="_blank">
 <div class="slider_container">
  <div class="slider_list">
   <div class="slider_item">
    <div class="job_seen_beacon">
     <table cellpadding="0" cellspacing="0" class="jobCard_mainContent big6_visualChanges" role="presentation">
      <tbody>
       <tr>
        <td class="resultContent">
         <div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
          <h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
           <div class="new topLeft holisticNewBlue desktop">
            <span class="label">
             new
            </span>
           </div>
           <span title="Python Engineer">
            Python Engineer
  

## Parse the Data

### Job title

In [4]:
# Examine the jobTitle element inside each job, which is an 'h2' tag
for job in jobs[:5]:
    print(job.find('h2').prettify())

<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
 <div class="new topLeft holisticNewBlue desktop">
  <span class="label">
   new
  </span>
 </div>
 <span title="Python Engineer">
  Python Engineer
 </span>
</h2>
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
 <div class="new topLeft holisticNewBlue desktop">
  <span class="label">
   new
  </span>
 </div>
 <span title="GUCCI Advanced Analytics Specialist - Supply Chain">
  GUCCI Advanced Analytics Specialist - Supply Chain
 </span>
</h2>
<h2 class="jobTitle jobTitle-color-purple">
 <span title="Evolve Rotational Program - Software Engineer">
  Evolve Rotational Program - Software Engineer
 </span>
</h2>
<h2 class="jobTitle jobTitle-color-purple">
 <span title="Python Developer">
  Python Developer
 </span>
</h2>
<h2 class="jobTitle jobTitle-color-purple">
 <span title="Python Developer">
  Python Developer
 </span>
</h2>


Note from the results above that the title element may also contain a "label" (for example "new"), which we don't need.

In [5]:
# We can list all children inside jobTitle, and 
# extract only the last child - the title
for job in jobs[:5]:
    print([child.text for child in job.find('h2').children][-1])

Python Engineer
GUCCI Advanced Analytics Specialist - Supply Chain
Evolve Rotational Program - Software Engineer
Python Developer
Python Developer


In [6]:
# Now, define a function to extract all job titles on the page
def extractTitles(jobs):
    titles = [job.find('h2', class_='jobTitle') for job in jobs]
    return [[child.text for child in title.children][-1] for title in titles]

extractTitles(jobs)

['Python Engineer',
 'GUCCI Advanced Analytics Specialist - Supply Chain',
 'Evolve Rotational Program - Software Engineer',
 'Python Developer',
 'Python Developer',
 'Junior Python Developer',
 'Employee Relations Analyst',
 'Data Scientist, TikTok Creation and Consumption',
 'Python Developer',
 'Python Developer',
 'Python Developer',
 'Mechatronics Engineer',
 'Software Engineer',
 'Python Developer',
 'Vice President of Data Analytics-IT535001']
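
The last-child approach depends on the title being the final child node of the `h2`. As an alternative sketch, the title `span` in the output above carries a `title` attribute while the label `span` does not, so it can be selected directly. The HTML below is a trimmed copy of two elements from that output, and `title_from_h2` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Two title elements copied from the output above, one with a "new" label
html = '''
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
 <div class="new topLeft holisticNewBlue desktop"><span class="label">new</span></div>
 <span title="Python Engineer">Python Engineer</span>
</h2>
<h2 class="jobTitle jobTitle-color-purple">
 <span title="Python Developer">Python Developer</span>
</h2>
'''
soup = BeautifulSoup(html, 'html.parser')

def title_from_h2(h2):
    # title=True matches only a span that has a title attribute,
    # so the "new" label span (which has none) is skipped
    span = h2.find('span', title=True)
    return span['title'] if span is not None else None

titles = [title_from_h2(h2) for h2 in soup.find_all('h2', class_='jobTitle')]
print(titles)
```

This assumes every result keeps the `title` attribute on its title `span`; the function returns `None` where it is missing.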

### Job URLs

In [7]:
# Let's examine the `href` attribute of each job result,
# which should be the link to the details of the job
for job in jobs[:5]:
    print(job['href'])
    print('\n')

/rc/clk?jk=2b6bb659e8d137a5&fccid=1684d3322a4331bc&vjs=3


/pagead/clk?mo=r&ad=-6NYlbfkN0DZr6HLVGkfhjMBGCl7Y_G72AbDddeUEn-9Fjk_W4Gp8IMJNxrhtuY6JmUBEXNxvWXxLg_CaT28dxLk1n5YNbLFTTJWZ5m69LhjsFTo-ksV5j67ZLx4trIMpmgdb4K5DN2cysJMNNAylFle9yDPqIX76iEfwnb0egTzvqoDRuIGM0_H5eA_O9dWa3Z7df5ynTpk8U5yYtFZMyyhcMUDxWmkCFPZlH3EAhGP5JL6IN-rkoXcIWXZihvWxktX7YctpAtELG7euSwwwT--O9wgKbkg9hZnK52DQZV3fOABJV_JakUFrJSkA-owZ5ps00OPVsH6vlMLxTO67Wk-4Xjqt1cbIiYD1P00tjNGOUvl0iD2ziOjjDV0wdig5N7HamUAGvjR-wVX1GwhXL3LAnoO1wCGxR_mWVbbojYVjYiQtY9e2UsvycysEo8VPmrptZKA-neYSKR3aPWQux75RPL6p94NiI-QWgIoCwo7wY5eRCSNgYofR4lV8_fkCdMvN8QZMjIEnAGSOUWaq2YrpuUKa4os0Sdg7Trz2BPQiiuIZuBpB4DXYll3ybAe&p=1&fvj=0&vjs=3


/pagead/clk?mo=r&ad=-6NYlbfkN0BvKrLyj5gPmtZO9T8euul8TCxuuKNOtzRJOomxnwSEodTz2Bc-sPZlFpP0h5lDivomgISaMt6THYJ46OYNy2u_IFPcc0ISRjUkhI5ghGmh1FdwbM0iZBfWCroRXUdmrPHErAfZvKEK0fzcR-4eWdCNEGzTNLIBpAGLkQzpaJgYZPRSKyBgRuxTh58-VJSiig4YquK9YrsVcDEH5mnkx0OsttNnlqflfvE_6UkCOOsFQLuzgtFstY8nFot_lzmkOrI_SRrfeyCRxZ17NThIpcY1PwjVqBCfbAZ7dD7mPm

The links extracted above follow different patterns for different job results, and they are relative paths that do not open when clicked directly. After comparing the URLs of several job detail pages, we found that every job detail page can be reached with the same pattern:  

    "https://www.indeed.com/viewjob?jk="+ "job id" 

- The "job id" appears in one of the classes of the job result, which looks like this: `job_941fca298fa33332`

In [8]:
def extractlink(job):
    import re
    r = re.compile('job_.*')
    job_id = (list(filter(r.match, job['class'])))[0]
    return 'https://www.indeed.com/viewjob?jk='+job_id.replace('job_','')

for job in jobs:
    print(extractlink(job))

https://www.indeed.com/viewjob?jk=2b6bb659e8d137a5
https://www.indeed.com/viewjob?jk=89bee222d7262b40
https://www.indeed.com/viewjob?jk=31c79b82e44fb0dc
https://www.indeed.com/viewjob?jk=d605b68ac7e6d40b
https://www.indeed.com/viewjob?jk=d1d9864a3915b653
https://www.indeed.com/viewjob?jk=71acc67a281f6038
https://www.indeed.com/viewjob?jk=242500ede18e7c9a
https://www.indeed.com/viewjob?jk=65b4fe45973ad5d6
https://www.indeed.com/viewjob?jk=36ee007222043432
https://www.indeed.com/viewjob?jk=c018bff370890e44
https://www.indeed.com/viewjob?jk=12cb14f1afecd0d1
https://www.indeed.com/viewjob?jk=964539f3b3e1247a
https://www.indeed.com/viewjob?jk=4c9d5b27c034ee44
https://www.indeed.com/viewjob?jk=b462e36768457437
https://www.indeed.com/viewjob?jk=bfbdae6ad4809897
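
`extractlink` raises an `IndexError` when a result has no `job_...` class. A slightly more defensive sketch of the same idea is shown below, working on a plain list of class names so it can be tried without a live page. `job_url_from_classes` is a hypothetical name, and the lowercase-hex id format is an assumption based on the ids seen above.

```python
import re

# Assumed id format: 'job_' followed by lowercase hex digits
JOB_CLASS = re.compile(r'job_([0-9a-f]+)')

def job_url_from_classes(classes):
    """Build the viewjob URL from a list of class names, or return None."""
    for name in classes:
        match = JOB_CLASS.fullmatch(name)
        if match:
            return 'https://www.indeed.com/viewjob?jk=' + match.group(1)
    return None

# Class list of the first result, as printed in the sample element earlier
classes = ['tapItem', 'fs-unmask', 'result', 'job_2b6bb659e8d137a5',
           'resultWithShelf', 'sponTapItem', 'desktop']
print(job_url_from_classes(classes))
print(job_url_from_classes(['tapItem', 'result']))
```

In the sample element printed earlier, the `data-jk` attribute holds the same id, so `job['data-jk']` may be a shorter route, assuming every result carries that attribute.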


### Company Name

In [9]:
jobs[0].find('span', class_='companyName').text

'TrailStone Group'

In [10]:
def extractCompanies(jobs):    
    return [job.find('span', class_='companyName').text for job in jobs]

extractCompanies(jobs)

['TrailStone Group',
 'Gucci America Inc',
 'Apple',
 'Advithri Technologies',
 'Capgemini',
 'GTS',
 'Google',
 'TikTok',
 'WorkCog',
 'iConnect',
 'CRISIL LIMITED',
 'The Rockefeller University',
 'ACLU',
 'Emonics LLC',
 'ICL']

### Location

In [11]:
location = jobs[0].find('div', class_='companyLocation')

# Examine location html
print(location.prettify())

<div class="companyLocation">
 New York, NY
</div>


In [12]:
# This div can contain not only the text of the company location,
# but also a `span` of other content (not present in this example);
# We can list all contents of the div
location.contents

['New York, NY']

In [13]:
# and then extract the text we want
location.contents[0]

'New York, NY'

In [14]:
def extractLocations(jobs):
    locations = [job.find('div', class_='companyLocation') for job in jobs]
    return [location.contents[0] for location in locations]
            
extractLocations(jobs)

['New York, NY',
 'New York, NY',
 'New York, NY',
 'Temporarily Remote in New York, NY',
 'New York, NY',
 'New York, NY',
 'Hybrid remote in New York, NY',
 'New York, NY',
 'Remote in New York, NY',
 'New York, NY',
 'Temporarily Remote in New York, NY',
 'New York, NY',
 'Remote in New York, NY',
 'New York, NY',
 'New York, NY']
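
Several locations carry a work-mode prefix such as "Remote in". If the place name alone is needed, the prefix can be split off with a regular expression. This is a sketch only: the prefixes are the ones seen in the output above, other pages may use different wording, and `split_location` is a hypothetical helper.

```python
import re

# Assumed prefixes, taken from the sample output above
REMOTE_PREFIX = re.compile(
    r'^(?P<mode>(?:temporarily |hybrid )?remote) in (?P<place>.+)$',
    re.IGNORECASE)

def split_location(text):
    """Return (work_mode, place); work_mode is None when no prefix is present."""
    text = text.strip()
    match = REMOTE_PREFIX.match(text)
    if match:
        return match.group('mode').lower(), match.group('place')
    return None, text

for raw in ['New York, NY', 'Temporarily Remote in New York, NY',
            'Hybrid remote in New York, NY', 'Remote in New York State']:
    print(split_location(raw))
```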

### Salary

In [15]:
# Let's see if any jobs provide salary information
for job in jobs[6:10]:
    print(job.select('div[class*="salary-"]'))

[<div class="metadata salary-snippet-container"><div class="attribute_snippet"><svg aria-hidden="true" aria-label="Salary" fill="none" role="presentation" viewbox="0 0 16 13" xmlns="http://www.w3.org/2000/svg"><defs></defs><path clip-rule="evenodd" d="M2.45168 6.10292c-.30177-.125-.62509-.18964-.95168-.1903V4.08678c.32693-.00053.6506-.06518.95267-.1903.30331-.12564.57891-.30979.81105-.54193.23215-.23215.4163-.50775.54194-.81106.12524-.30237.18989-.62638.19029-.95365H9.0902c0 .3283.06466.65339.1903.9567.12564.30331.30978.57891.54193.81106.23217.23215.50777.41629.81107.54193.3032.12558.6281.19024.9562.1903v1.83556c-.3242.00155-.6451.06616-.9448.19028-.3033.12563-.5789.30978-.81102.54193-.23215.23214-.4163.50774-.54193.81106-.12332.2977-.18789.61638-.19024.93849H3.99496c-.00071-.32645-.06535-.64961-.19029-.95124-.12564-.30332-.30979-.57891-.54193-.81106-.23215-.23215-.50775-.4163-.81106-.54193zM0 .589843C0 .313701.223858.0898438.5.0898438h12.0897c.2762 0 .5.2238572.5.5000002V9.40715c0 .27

In [16]:
jobs[6].select('div[class*="salary-"]')[0].text

'$122,000 a year'

In [17]:
def extractSalaries(jobs):
    selector = 'div[class*="salary-"]'
    return [job.select(selector)[0].text if len(job.select(selector))!=0 
            else 'n/a' for job in jobs]
     
extractSalaries(jobs)

['Estimated $124K – $157K a year',
 'Estimated $42.1K – $53.3K a year',
 'n/a',
 '$49 - $80 an hour',
 'Estimated $98.8K – $125K a year',
 'Estimated $97.8K – $124K a year',
 '$122,000 a year',
 'n/a',
 '$40 - $60 an hour',
 '$82,770 - $150,000 a year',
 '$60,000 - $110,000 a year',
 'Estimated $95.5K – $121K a year',
 '$115,638 a year',
 '$60,000 - $110,000 a year',
 'n/a']
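
The salary strings come in several shapes (single values, ranges, "K" suffixes, hourly or yearly). If numbers are needed later, a small parser along the following lines could convert them. It is a sketch that covers only the formats seen in the output above; `parse_salary` is a hypothetical helper.

```python
import re

# Matches amounts such as $122,000, $42.1K or $49
AMOUNT = re.compile(r'\$(\d[\d,]*(?:\.\d+)?)(K?)')

def parse_salary(text):
    """Return (low, high, period) in dollars, or None if no amount is found."""
    amounts = []
    for number, suffix in AMOUNT.findall(text):
        value = float(number.replace(',', ''))
        if suffix:
            # "K" means thousands; round to hide float noise
            value = round(value * 1000, 2)
        amounts.append(value)
    if not amounts:
        return None
    if 'hour' in text:
        period = 'hour'
    elif 'year' in text:
        period = 'year'
    else:
        period = None
    return min(amounts), max(amounts), period

for raw in ['Estimated $124K – $157K a year', '$49 - $80 an hour',
            '$122,000 a year', 'n/a']:
    print(parse_salary(raw))
```

A single value is returned as a range whose low and high ends are equal.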

### Extract Results on the Page

In [18]:
# extract all the above-mentioned data from the page
def extractData(jobs):
    return pd.DataFrame({'jobTitle': extractTitles(jobs),
                         'company': extractCompanies(jobs), 
                         'location': extractLocations(jobs),
                         'salary': extractSalaries(jobs),
                          'url': [extractlink(job) for job in jobs]})
extractData(jobs)

| | jobTitle | company | location | salary | url |
|---|---|---|---|---|---|
| 0 | Python Engineer | TrailStone Group | New York, NY | Estimated $124K – $157K a year | https://www.indeed.com/viewjob?jk=2b6bb659e8d1... |
| 1 | GUCCI Advanced Analytics Specialist - Supply C... | Gucci America Inc | New York, NY | Estimated $42.1K – $53.3K a year | https://www.indeed.com/viewjob?jk=89bee222d726... |
| 2 | Evolve Rotational Program - Software Engineer | Apple | New York, NY | | https://www.indeed.com/viewjob?jk=31c79b82e44f... |
| 3 | Python Developer | Advithri Technologies | Temporarily Remote in New York, NY | $49 - $80 an hour | https://www.indeed.com/viewjob?jk=d605b68ac7e6... |
| 4 | Python Developer | Capgemini | New York, NY | Estimated $98.8K – $125K a year | https://www.indeed.com/viewjob?jk=d1d9864a3915... |
| 5 | Junior Python Developer | GTS | New York, NY | Estimated $97.8K – $124K a year | https://www.indeed.com/viewjob?jk=71acc67a281f... |
| 6 | Employee Relations Analyst | Google | Hybrid remote in New York, NY | $122,000 a year | https://www.indeed.com/viewjob?jk=242500ede18e... |
| 7 | Data Scientist, TikTok Creation and Consumption | TikTok | New York, NY | | https://www.indeed.com/viewjob?jk=65b4fe45973a... |
| 8 | Python Developer | WorkCog | Remote in New York, NY | $40 - $60 an hour | https://www.indeed.com/viewjob?jk=36ee00722204... |
| 9 | Python Developer | iConnect | New York, NY | $82,770 - $150,000 a year | https://www.indeed.com/viewjob?jk=c018bff37089... |


# Part 3. Building the Pipeline


### First Page

First, define a function that generates the first-page URL from the search keyword and location.

In [19]:
# Let's make sure the query string works in the url:
# use "+" to replace each run of non-alphanumeric chars
import re
s = 'san francisco, CA  '
s = re.sub('[^0-9a-zA-Z]+', '+', s.strip())
s

'san+francisco+CA'

In [20]:
# Generate the first page url
def firstURL(base_url, location, keyword):
    # make sure the keyword and location strings are acceptable as query values in the url
    import re
    location = re.sub('[^0-9a-zA-Z]+', '+', location.strip())
    keyword = re.sub('[^0-9a-zA-Z]+', '+', keyword.strip())
    
    return base_url + f'jobs?q={keyword}&l={location}'

base_url = 'https://www.indeed.com/'
firstURL(base_url, 'San Francisco bay area, CA', 'sql developer')

'https://www.indeed.com/jobs?q=sql+developer&l=San+Francisco+bay+area+CA'
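
The regex approach drops punctuation: the comma in the location disappears, and a keyword such as `C++` would lose its plus signs. As an alternative sketch, the standard library's `urlencode` percent-encodes such characters instead of removing them. `first_url_encoded` is a hypothetical variant of `firstURL`, and whether the site treats the encoded form identically is an assumption that has not been checked here.

```python
from urllib.parse import urlencode

def first_url_encoded(base_url, location, keyword):
    """Hypothetical variant of firstURL that percent-encodes the query values."""
    # urlencode turns spaces into '+' and escapes other reserved characters
    query = urlencode({'q': keyword.strip(), 'l': location.strip()})
    return base_url + 'jobs?' + query

base = 'https://www.indeed.com/'
print(first_url_encoded(base, 'San Francisco bay area, CA', 'sql developer'))
print(first_url_encoded(base, 'New York', 'C++'))
```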

### Other Pages
Second, generate the URLs of the other pages.

In [21]:
# Try scraping the next page (the url as observed on the website)
url = 'https://www.indeed.com/jobs?q=python&l=new+york&start=10'
jobs = scrapeResultPage(url)
len(jobs)

15

In [22]:
# Generate the urls of other pages until we hit the limit of results
def otherURLs(url, result_num=15, resultlimit=100, increment=10):
    urls = []
    # nothing to paginate if the first page had no results
    # (this also avoids an endless loop when result_num is 0)
    if result_num <= 0:
        return urls
    start = 0
    result_total = result_num
    while result_total < resultlimit:
        start += increment
        urls.append(url+f'&start={start}')
        result_total += result_num
        
    return urls

other_urls = otherURLs(url='https://www.indeed.com/jobs?q=python&l=new+york', resultlimit=50)
other_urls

['https://www.indeed.com/jobs?q=python&l=new+york&start=10',
 'https://www.indeed.com/jobs?q=python&l=new+york&start=20',
 'https://www.indeed.com/jobs?q=python&l=new+york&start=30']
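
The number of extra pages can also be worked out directly: the scraper needs enough pages that pages × results-per-page reaches the limit. The sketch below computes the `start` offsets that way; `page_offsets` is a hypothetical helper that mirrors the parameters of `otherURLs`.

```python
import math

def page_offsets(result_num=15, resultlimit=100, increment=10):
    """Offsets for the pages after the first one, computed without a counter loop."""
    if result_num <= 0:
        return []
    # total pages needed so that pages * result_num >= resultlimit
    pages = math.ceil(resultlimit / result_num)
    # the first page has no offset, so start counting from page 1
    return [increment * page for page in range(1, pages)]

print(page_offsets(resultlimit=50))
print(len(page_offsets(resultlimit=100)))
```

With 15 results per page, a limit of 50 needs four pages in total (three offsets), and a limit of 100 needs seven pages in total.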

### Generalize the Scraper

Now, let's combine the above to build the scraper function, which extracts results from multiple pages and accommodates a general search.

In [23]:
# Now, let's combine results of multiple pages
def job_scraper(base_url, location, keyword, resultlimit, increment):
    
    allResults = []
    
    # Extract results from first page
    first_url = firstURL(base_url, location, keyword)
    jobs = scrapeResultPage(first_url)
    allResults.append(extractData(jobs))
    print(first_url)
    
    
    # Extract results from other pages
    other_urls = otherURLs(first_url, result_num = len(jobs), 
                           resultlimit=resultlimit, increment=increment)
    
    for url in other_urls:
        jobs = scrapeResultPage(url)
        allResults.append(extractData(jobs))
        
    return pd.concat(allResults, ignore_index=True)

In [24]:
data = job_scraper(base_url, 'New York', 'Python', resultlimit=100, increment=10)
data

https://www.indeed.com/jobs?q=Python&l=New+York


| | jobTitle | company | location | salary | url |
|---|---|---|---|---|---|
| 0 | Python Engineer | TrailStone Group | New York, NY | | https://www.indeed.com/viewjob?jk=2b6bb659e8d1... |
| 1 | Python Developer | Advithri Technologies | Temporarily Remote in New York, NY | $49 - $80 an hour | https://www.indeed.com/viewjob?jk=d605b68ac7e6... |
| 2 | Data Scientist, Spotify Advertising | Spotify | Remote in New York, NY | | https://www.indeed.com/viewjob?jk=cf1dfad66649... |
| 3 | Python Developer | Capgemini | New York, NY | | https://www.indeed.com/viewjob?jk=d1d9864a3915... |
| 4 | Junior Python Developer | GTS | New York, NY | | https://www.indeed.com/viewjob?jk=71acc67a281f... |
| ... | ... | ... | ... | ... | ... |
| 100 | Junior Software Engineer | SpinCar | New York, NY | Estimated $46.9K – $59.3K a year | https://www.indeed.com/viewjob?jk=c4dd1929f340... |
| 101 | Adjunct Associate Faculty, Python for Data Ana... | Columbia University | New York, NY | Estimated $45.5K – $57.5K a year | https://www.indeed.com/viewjob?jk=d9fd6ff8ba04... |
| 102 | Marketing Analytics Manager | the NBA | New York, NY | Estimated $119K – $150K a year | https://www.indeed.com/viewjob?jk=0dd27b8c84dd... |
| 103 | Associate Data Consultant | YipitData | Remote in New York State | Estimated $51.3K – $64.9K a year | https://www.indeed.com/viewjob?jk=c5da02b15631... |
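
Each page returned 15 results while `start` advances by 10, so consecutive pages may overlap and the combined table may contain repeated jobs. This has not been verified against the data here. A minimal sketch of removing repeats by URL follows; `dedupe_by_url` is a hypothetical helper, and the third row is a made-up repeat added only for illustration.

```python
def dedupe_by_url(rows):
    """Keep the first row seen for each url, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row['url'] not in seen:
            seen.add(row['url'])
            unique.append(row)
    return unique

rows = [
    {'jobTitle': 'Python Engineer',
     'url': 'https://www.indeed.com/viewjob?jk=2b6bb659e8d137a5'},
    {'jobTitle': 'Python Developer',
     'url': 'https://www.indeed.com/viewjob?jk=d605b68ac7e6d40b'},
    # illustrative repeat of the first row
    {'jobTitle': 'Python Engineer',
     'url': 'https://www.indeed.com/viewjob?jk=2b6bb659e8d137a5'},
]
unique_rows = dedupe_by_url(rows)
print(len(unique_rows))
```

On the DataFrame itself, `data.drop_duplicates(subset='url')` would have the same effect.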


### Output

We can now save the data to a CSV file.

In [27]:
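# format the timestamp: the default string has a space and colons, which Windows file names reject
today = pd.to_datetime("today").strftime('%Y-%m-%d_%H-%M-%S')
# note: the ./scrape_results folder must already exist, to_csv does not create it
data.to_csv(f'./scrape_results/jobs_{today}.csv')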