# Introduction - Build a Pipeline

The project builds a scraper that extracts useful data from the job posts found on www.indeed.com, based on a customizable search keyword and location.

Software used: `requests`, `bs4` and `pandas`


# Tasks

- Scrape the first 100 available search results across multiple pages
- Generalize the code to allow searching for different locations/jobs
- Pick out information about the URL, job title and job location
- Save the results to a file

# Part 1: Inspect

- How do the URLs change when you navigate to the next results page?
- How do the URLs change when you search for a different location and/or job title?
- Which HTML elements contain the link, title and location of each job?
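
One way to answer the first two questions is to compare the query strings of two results URLs. The sketch below uses the standard library's `urllib.parse`; the helper name `query_diff` is illustrative, and the two URLs are the ones used later in this notebook.

```python
from urllib.parse import urlsplit, parse_qs

def query_diff(url_a, url_b):
    """Return the query parameters whose values differ between two URLs."""
    params_a = parse_qs(urlsplit(url_a).query)
    params_b = parse_qs(urlsplit(url_b).query)
    keys = sorted(set(params_a) | set(params_b))
    return {k: (params_a.get(k), params_b.get(k))
            for k in keys if params_a.get(k) != params_b.get(k)}

first_page = 'https://www.indeed.com/jobs?q=python&l=new+york'
second_page = 'https://www.indeed.com/jobs?q=python&l=new+york&start=10'
print(query_diff(first_page, second_page))
```

For these two URLs only the `start` parameter differs, which suggests that paging is controlled by `start` while `q` and `l` carry the keyword and location.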

# Part 2: Scrape and Parse one Page

In [427]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [545]:
url = 'https://www.indeed.com/jobs?q=python&l=new+york'

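# Scrape and parse a results page,
# returning a list of the job results' HTML elements
def scrapeResultPage(url):
    # Fetch the HTML; report the error and re-raise it,
    # because nothing below can run without a response
    try:
        res = requests.get(url, timeout=10)
        res.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(err)
        raise

    # Make a soup, naming the parser explicitly so bs4 does not have to guess
    soup = BeautifulSoup(res.content, 'html.parser')

    # Parse the result column
    result = soup.find(id='resultsCol')
    jobs = result.find_all('a', class_='result')
    return jobs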

jobs = scrapeResultPage(url)
len(jobs)

15

In [546]:
# Examine one of the results
print(jobs[0].prettify())

<a class="tapItem fs-unmask result job_941fca298fa33332 sponsoredJob resultWithShelf sponTapItem desktop" data-ci="383080872" data-empn="9767451333976163" data-hide-spinner="true" data-hiring-event="false" data-jk="941fca298fa33332" data-mobtk="1fqseeb4uo2b6800" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0CiRNM7CVr8YueLFKlzwbFWI0o7IjV438l4sVrvKZ0flpURU_mqoI8E-VxPfg2eTCEAoksXSZRepKnmYgoejgpnsuX8Fi3wzHIIMzd5QVHw4THV_jIwcDmoSBPKm5n1dKEO3E4B--owBXD4w7Ce7eASZa9wksZPPfhC1-ED7kHHKlkVXiFtdBe4R6VgoRMjuJClPEZ-UJfkxBpi294A2ZcTAIgd0l73_DWXlt9k5bT3-N5TzKeVVy5uluSCu7Eh0aGZViqZa1EAnQnqv1EbXUdPHjNl8biuC96wxVJLOeJfyeZwCuyb4x6zaJEKcZu-wjayKssVV8tgsnohEKBF7CXXePjP1iJQVALY5W88Rtp_5wkJW8pAD4RUv_P3LRdzvAk5UmfTIpQVnqGSQbDyA3Nq0Yh4XuQttv23D-yD7AWEPTi3uFs7wBIrtZRRci6TZ0RjjsocGQ4psjJHabGHV_7TSSl0qZmIsnilsJNdWw9l9j1cmAVuSvORuzKOdxrQi0LuSfX-7sm9a0ItMV1ifC7L&amp;p=0&amp;fvj=0&amp;vjs=3" id="sj_941fca298fa33332" rel="nofollow" target="_blank">
 <div class="slider_container">
  <div class="slider_list">
   <div class="s

### Job title

In [559]:
# Examine the jobTitle element inside each job, which is an 'h2' tag
# There can be a "label" inside the title, which we don't need
for job in jobs[:5]:
    print(job.find('h2').prettify())

<h2 class="jobTitle jobTitle-color-purple">
 <span title="Software Engineer II">
  Software Engineer II
 </span>
</h2>
<h2 class="jobTitle jobTitle-color-purple">
 <span title="Python Developer">
  Python Developer
 </span>
</h2>
<h2 class="jobTitle jobTitle-color-purple">
 <span title="Junior Python Developer">
  Junior Python Developer
 </span>
</h2>
<h2 class="jobTitle jobTitle-color-purple">
 <span title="Trainee, Deloitte Cyber Career Accelerator Program">
  Trainee, Deloitte Cyber Career Accelerator Program
 </span>
</h2>
<h2 class="jobTitle jobTitle-color-purple">
 <span title="Associate, YouTube">
  Associate, YouTube
 </span>
</h2>


In [558]:
# We can list all children inside jobTitle, and extract only the last child - the title
for job in jobs[:5]:
    print([child.text for child in job.find('h2').children][-1])

Software Engineer II
Python Developer
Junior Python Developer
Trainee, Deloitte Cyber Career Accelerator Program
Associate, YouTube


In [560]:
# Now, define a function to extract all job titles on the page
def extractTitles(jobs):
    titles = [job.find('h2', class_='jobTitle') for job in jobs]
    return [[child.text for child in title.children][-1] for title in titles]

extractTitles(jobs)

['Software Engineer II',
 'Python Developer',
 'Junior Python Developer',
 'Trainee, Deloitte Cyber Career Accelerator Program',
 'Associate, YouTube',
 'Employee Relations Analyst',
 'Python Full Stack Developer',
 'SR. PYTHON DEVELOPER',
 'Junior Software Engineer',
 'Opportunities Analyst, YouTube',
 'Python Developer',
 'Junior Data Scientist (NYC or Remote)',
 'Automation Tester',
 'Software Developer - C++ and/or Python',
 'Data Scientist, TikTok Creation and Consumption']
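
Taking the last child works for the markup above, but it depends on the order of the children. As an alternative sketch, the `<span>` that carries a `title` attribute can be selected directly. This assumes the title always sits in such a span, as in the output printed earlier. The HTML snippet, its "new" label and the helper name `extract_title` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the printed h2 elements, with an extra label
html = '''
<h2 class="jobTitle jobTitle-color-purple">
 <div class="new"><span class="label">new</span></div>
 <span title="Python Developer">Python Developer</span>
</h2>
'''

def extract_title(h2):
    """Return the text of the span that has a title attribute, or None."""
    span = h2.select_one('span[title]')
    return span.get_text(strip=True) if span else None

h2 = BeautifulSoup(html, 'html.parser').find('h2', class_='jobTitle')
print(extract_title(h2))
```

Returning `None` for a missing span keeps one malformed result from stopping the whole extraction.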

### Job URLs

In [519]:
# Let's examine the `href` attribute of each job result,
# which should be the link to the job's details
for job in jobs[:5]:
    print(job['href'])
    print('\n')

/pagead/clk?mo=r&ad=-6NYlbfkN0CiRNM7CVr8YueLFKlzwbFWI0o7IjV438l4sVrvKZ0flpURU_mqoI8E-VxPfg2eTCEAoksXSZRepKnmYgoejgpnsuX8Fi3wzHIIMzd5QVHw4THV_jIwcDmoSBPKm5n1dKEO3E4B--owBXD4w7Ce7eASZa9wksZPPfhC1-ED7kHHKkDhLuHcw8P35lytUzNrT3LjRZyoS7sxy8Rrjp6rQUBo9-UgAqf0Ns-AD2Ig1B5xi2RmSWbPl-EefHZRQ5OGxwKviE1z4e2NZF4ZnEcJ7raeCzV7Y28CDqsigcOkFHZt5YyTpaT2ZNCLDa2CQp5mzBdRF8Czlfr6Ko0PSooXJRyfeWQIzKggX_Kl8qcgPCqdNz9F9EKF-lmQspNbkAU8IAT0VuGPNCGs0x3DG9sZ17NSyamfHFzphwnQQhNC_6EaJjj3IVjZ71o5-_2_ww-TUNZIrMqSaVFUssuD6YxEyn5UAz--dwp0PpZZuYl3-APHDEV1De-ETmJHCFltFW0x4uEt5Fob80CdM8xL5GqSkHjW&p=0&fvj=0&vjs=3


/rc/clk?jk=d401e602fc961fa1&fccid=2df6a1e69a70a1e7&vjs=3


/company/Global-Trading-Systems/jobs/Junior-Python-Developer-71acc67a281f6038?fccid=225e56cbd1119db8&vjs=3


/rc/clk?jk=a19d5d6e71bffeee&fccid=a9021c35fcef6968&vjs=3


/rc/clk?jk=242500ede18e7c9a&fccid=a5b4499d9e91a5c6&vjs=3




The extracted links follow different patterns for different job results, and they don't work directly. Inspecting the URL of a job detail page shows that it follows this pattern:

    "https://www.indeed.com/viewjob?jk="+ "job id" 

- "job id" can be found in one of the classes of the job result, which looks like this: `job_941fca298fa33332`

In [561]:
def extractlink(job):
    import re
    r = re.compile('job_.*')
    job_id = (list(filter(r.match, job['class'])))[0]
    return 'https://www.indeed.com/viewjob?jk='+job_id.replace('job_','')

for job in jobs:
    print(extractlink(job))

https://www.indeed.com/viewjob?jk=941fca298fa33332
https://www.indeed.com/viewjob?jk=261a6748688d29d6
https://www.indeed.com/viewjob?jk=71acc67a281f6038
https://www.indeed.com/viewjob?jk=11144e0a61fbae87
https://www.indeed.com/viewjob?jk=a19d5d6e71bffeee
https://www.indeed.com/viewjob?jk=242500ede18e7c9a
https://www.indeed.com/viewjob?jk=2d98a45f7e98a6c9
https://www.indeed.com/viewjob?jk=d748d23c8181f77a
https://www.indeed.com/viewjob?jk=c4dd1929f3409f7f
https://www.indeed.com/viewjob?jk=5c009893824fdca2
https://www.indeed.com/viewjob?jk=b462e36768457437
https://www.indeed.com/viewjob?jk=c71b72157f64f825
https://www.indeed.com/viewjob?jk=d401e602fc961fa1
https://www.indeed.com/viewjob?jk=b7f4dd18740d62cc
https://www.indeed.com/viewjob?jk=65b4fe45973ad5d6
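
`extractlink` raises an `IndexError` when a result has no `job_...` class. A defensive variant could return `None` instead. The sketch below works on a plain list of class names, such as the value of `job['class']`; the function name `job_url_from_classes` is hypothetical.

```python
import re

# Matches a class such as "job_941fca298fa33332" and captures the id part
JOB_CLASS = re.compile(r'^job_(\w+)$')

def job_url_from_classes(classes):
    """Build the detail-page URL from a result's class list, or return None."""
    for name in classes:
        match = JOB_CLASS.match(name)
        if match:
            return 'https://www.indeed.com/viewjob?jk=' + match.group(1)
    return None

classes = ['tapItem', 'fs-unmask', 'result', 'job_941fca298fa33332', 'sponsoredJob']
print(job_url_from_classes(classes))
print(job_url_from_classes(['tapItem', 'result']))
```

The first call prints the same URL as `extractlink` does for the first result; the second prints `None` because no class carries a job id.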


### Company Name

In [203]:
jobs[0].find('span', class_='companyName').text

'TrailStone Group'

In [575]:
def extractCompanies(jobs):    
    return [job.find('span', class_='companyName').text for job in jobs]

extractCompanies(jobs)

['Indeed',
 'Systemonex INC',
 'GTS',
 'Deloitte',
 'YouTube',
 'Google',
 'First Soft Solutions [ Direct]',
 'DATEC',
 'SpinCar',
 'YouTube',
 'Emonics LLC',
 'ViacomCBS',
 'Cognizant Technology Solutions',
 'Radkl',
 'TikTok']

### Location

In [562]:
location = jobs[0].find('div', class_='companyLocation')

# Examine location html
print(location.prettify())

<div class="companyLocation">
 Remote in New York, NY
 <span class="more_loc_container">
  <a aria-label="Same Software Engineer II job in 4 other locations" class="more_loc" href="/addlLoc/redirect?tk=1fqseeb4uo2b6800&amp;jk=941fca298fa33332&amp;dest=%2Fjobs%3Fq%3Dpython%26l%3Dnew%2Byork%26grpKey%3D8gcGdG5mdGNsuA%252F2iAOqECUKCW5vcm10aXRsZRoYc2VuaW9yIHNvZnR3YXJlIGVuZ2luZWVy" rel="nofollow">
   +4 locations
  </a>
 </span>
</div>


In [563]:
# Examine location content
location.contents

['Remote in New York, NY',
 <span class="more_loc_container"><a aria-label="Same Software Engineer II job in 4 other locations" class="more_loc" href="/addlLoc/redirect?tk=1fqseeb4uo2b6800&amp;jk=941fca298fa33332&amp;dest=%2Fjobs%3Fq%3Dpython%26l%3Dnew%2Byork%26grpKey%3D8gcGdG5mdGNsuA%252F2iAOqECUKCW5vcm10aXRsZRoYc2VuaW9yIHNvZnR3YXJlIGVuZ2luZWVy" rel="nofollow">+4 locations</a></span>]

In [564]:
location.contents[0]

'Remote in New York, NY'

In [574]:
def extractLocations(jobs):
    locations = [job.find('div', class_='companyLocation') for job in jobs]
    return [location.contents[0] for location in locations]
            
extractLocations(jobs)

['Remote in New York, NY',
 'New York, NY',
 'New York, NY',
 'New York, NY',
 'New York, NY',
 'Hybrid remote in New York, NY',
 'New York, NY',
 'New York, NY',
 'New York, NY',
 'New York, NY',
 'New York, NY',
 'Remote in New York, NY',
 'New York, NY',
 'New York, NY',
 'New York, NY']

### Salary

In [570]:
for job in jobs[:3]:
    print(job.select('div[class*="salary-"]'))

[<div class="metadata salary-snippet-container"><div class="attribute_snippet"><svg aria-hidden="true" aria-label="Salary" fill="none" role="presentation" viewbox="0 0 16 13" xmlns="http://www.w3.org/2000/svg"><defs></defs><path clip-rule="evenodd" d="M2.45168 6.10292c-.30177-.125-.62509-.18964-.95168-.1903V4.08678c.32693-.00053.6506-.06518.95267-.1903.30331-.12564.57891-.30979.81105-.54193.23215-.23215.4163-.50775.54194-.81106.12524-.30237.18989-.62638.19029-.95365H9.0902c0 .3283.06466.65339.1903.9567.12564.30331.30978.57891.54193.81106.23217.23215.50777.41629.81107.54193.3032.12558.6281.19024.9562.1903v1.83556c-.3242.00155-.6451.06616-.9448.19028-.3033.12563-.5789.30978-.81102.54193-.23215.23214-.4163.50774-.54193.81106-.12332.2977-.18789.61638-.19024.93849H3.99496c-.00071-.32645-.06535-.64961-.19029-.95124-.12564-.30332-.30979-.57891-.54193-.81106-.23215-.23215-.50775-.4163-.81106-.54193zM0 .589843C0 .313701.223858.0898438.5.0898438h12.0897c.2762 0 .5.2238572.5.5000002V9.40715c0 .27

In [571]:
jobs[1].select('div[class*="salary-"]')[0].text

'$35 - $45 an hour'

In [573]:
def extractSalaries(jobs):
    selector = 'div[class*="salary-"]'
    return [job.select(selector)[0].text if len(job.select(selector))!=0 
            else 'n/a' for job in jobs]
     
extractSalaries(jobs)

['$109,000 - $159,000 a year',
 '$35 - $45 an hour',
 'Estimated $97.8K – $124K a year',
 '$28 an hour',
 'Estimated $43K – $54.4K a year',
 '$122,000 a year',
 '$31 - $68 an hour',
 'Estimated $110K – $139K a year',
 'Estimated $46.9K – $59.3K a year',
 'Estimated $63.5K – $80.5K a year',
 '$60,000 - $110,000 a year',
 'Estimated $104K – $132K a year',
 'Estimated $74K – $93.6K a year',
 'Estimated $113K – $144K a year',
 'Estimated $105K – $133K a year']

### Extract Results on the Page

In [576]:
# extract all data above from the page
def extractResults(jobs):
    return pd.DataFrame({'jobTitle': extractTitles(jobs),
                         'company': extractCompanies(jobs), 
                         'location': extractLocations(jobs),
                         'salary': extractSalaries(jobs),
                          'url': [extractlink(job) for job in jobs]})
extractResults(jobs)

| | jobTitle | company | location | salary | url |
| --- | --- | --- | --- | --- | --- |
| 0 | Software Engineer II | Indeed | Remote in New York, NY | $109,000 - $159,000 a year | https://www.indeed.com/viewjob?jk=941fca298fa3... |
| 1 | Python Developer | Systemonex INC | New York, NY | $35 - $45 an hour | https://www.indeed.com/viewjob?jk=261a6748688d... |
| 2 | Junior Python Developer | GTS | New York, NY | Estimated $97.8K – $124K a year | https://www.indeed.com/viewjob?jk=71acc67a281f... |
| 3 | Trainee, Deloitte Cyber Career Accelerator Pro... | Deloitte | New York, NY | $28 an hour | https://www.indeed.com/viewjob?jk=11144e0a61fb... |
| 4 | Associate, YouTube | YouTube | New York, NY | Estimated $43K – $54.4K a year | https://www.indeed.com/viewjob?jk=a19d5d6e71bf... |
| 5 | Employee Relations Analyst | Google | Hybrid remote in New York, NY | $122,000 a year | https://www.indeed.com/viewjob?jk=242500ede18e... |
| 6 | Python Full Stack Developer | First Soft Solutions [ Direct] | New York, NY | $31 - $68 an hour | https://www.indeed.com/viewjob?jk=2d98a45f7e98... |
| 7 | SR. PYTHON DEVELOPER | DATEC | New York, NY | Estimated $110K – $139K a year | https://www.indeed.com/viewjob?jk=d748d23c8181... |
| 8 | Junior Software Engineer | SpinCar | New York, NY | Estimated $46.9K – $59.3K a year | https://www.indeed.com/viewjob?jk=c4dd1929f340... |
| 9 | Opportunities Analyst, YouTube | YouTube | New York, NY | Estimated $63.5K – $80.5K a year | https://www.indeed.com/viewjob?jk=5c009893824f... |


# Part 3: Build the Pipeline


### First Page

First, define a function to generate first page url based on the search keyword and location.

In [584]:
# Let's make sure the query string works in the URL:
# replace every run of non-alphanumeric characters with "+"
import re
s = 'san francisco, CA  '
s = re.sub('[^0-9a-zA-Z]+', '+', s.strip())
s

'san+francisco+CA'

In [585]:
# Generate the first page url
def firstURL(base_url, location, keyword):
    # make sure the keyword and location strings are acceptable in the URL query
    import re
    location = re.sub('[^0-9a-zA-Z]+', '+', location.strip())
    keyword = re.sub('[^0-9a-zA-Z]+', '+', keyword.strip())
    
    return base_url + f'jobs?q={keyword}&l={location}'

base_url = 'https://www.indeed.com/'
firstURL(base_url, 'San Francisco bay area, CA', 'sql developer')

'https://www.indeed.com/jobs?q=sql+developer&l=San+Francisco+bay+area+CA'
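
The regular expression drops characters such as commas, and it would turn a keyword like "C++" into "C+". A possible alternative is to let `urllib.parse.urlencode` from the standard library escape the values. Whether Indeed treats the escaped characters the same way as the `+` separator is an assumption that should be checked in the browser. `first_url` is an illustrative name.

```python
from urllib.parse import urlencode

def first_url(base_url, location, keyword):
    """Build the first results-page URL, percent-encoding the query values."""
    query = urlencode({'q': keyword.strip(), 'l': location.strip()})
    return f'{base_url}jobs?{query}'

print(first_url('https://www.indeed.com/', 'San Francisco bay area, CA', 'sql developer'))
print(first_url('https://www.indeed.com/', 'New York', 'C++'))
```

Spaces still become `+`, while the comma and the literal plus signs are percent-encoded (`%2C`, `%2B`) instead of being discarded.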

### Other Pages
Second, generate the urls of other pages.

In [586]:
# Try scraping the next page (using the URL observed on the website)
url = 'https://www.indeed.com/jobs?q=python&l=new+york&start=10'
jobs = scrapeResultPage(url)
len(jobs)

15

In [588]:
# Generate the URLs of the other pages until we hit the result limit
def otherURLs(url, result_num=15, resultlimit=100, increment=10):
    urls = []
    # With no results on the first page the loop below would never end
    if result_num <= 0:
        return urls

    start = 0
    result_total = result_num
    while result_total < resultlimit:
        start += increment
        urls.append(url + f'&start={start}')
        result_total += result_num

    return urls

other_urls = otherURLs(url='https://www.indeed.com/jobs?q=python&l=new+york', resultlimit=100)
other_urls

['https://www.indeed.com/jobs?q=python&l=new+york&start=10',
 'https://www.indeed.com/jobs?q=python&l=new+york&start=20',
 'https://www.indeed.com/jobs?q=python&l=new+york&start=30',
 'https://www.indeed.com/jobs?q=python&l=new+york&start=40',
 'https://www.indeed.com/jobs?q=python&l=new+york&start=50',
 'https://www.indeed.com/jobs?q=python&l=new+york&start=60']
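
The loop is equivalent to a small piece of arithmetic: with 15 results per page and a limit of 100, ceil(100 / 15) = 7 pages are needed, so 6 URLs follow the first page. Below is a sketch of that closed form, where `other_urls_closed_form` is a hypothetical name and the per-page count is assumed to be constant.

```python
import math

def other_urls_closed_form(url, result_num=15, resultlimit=100, increment=10):
    """Return the URLs of the pages after the first, computed without a loop."""
    if result_num <= 0:
        return []  # no results on the first page, so there is nothing to page through
    pages_needed = math.ceil(resultlimit / result_num)
    # Page 1 is the first URL; pages 2..pages_needed get start=10, 20, ...
    return [f'{url}&start={i * increment}' for i in range(1, pages_needed)]

urls = other_urls_closed_form('https://www.indeed.com/jobs?q=python&l=new+york')
print(len(urls), urls[-1])
```

This yields the same six URLs as the list above, ending at `start=60`.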

### Generalize the Scraper

Now, let's combine the above to extract results on multiple pages for general search.

In [589]:
# Now, let's combine results of multiple pages
def job_scraper(base_url, location, keyword, resultlimit, increment):
    
    allResults = []
    
    # Extract results from first page
    first_url = firstURL(base_url, location, keyword)
    jobs = scrapeResultPage(first_url)
    allResults.append(extractResults(jobs))
    print(first_url)
    
    
    # Extract results from other pages
    other_urls = otherURLs(first_url, result_num=len(jobs),
                           resultlimit=resultlimit, increment=increment)
    
    for url in other_urls:
        jobs = scrapeResultPage(url)
        allResults.append(extractResults(jobs))
        
    return pd.concat(allResults, ignore_index=True)

In [590]:
data = job_scraper(base_url, 'New York', 'Python', resultlimit=100, increment=10)
data

https://www.indeed.com/jobs?q=Python&l=New+York


| | jobTitle | company | location | salary | url |
| --- | --- | --- | --- | --- | --- |
| 0 | Software Engineer II | Indeed | Remote in New York, NY | $109,000 - $159,000 a year | https://www.indeed.com/viewjob?jk=941fca298fa3... |
| 1 | Junior Python Developer | GTS | New York, NY | Estimated $97.8K – $124K a year | https://www.indeed.com/viewjob?jk=71acc67a281f... |
| 2 | Trainee, Deloitte Cyber Career Accelerator Pro... | Deloitte | New York, NY | $28 an hour | https://www.indeed.com/viewjob?jk=11144e0a61fb... |
| 3 | Associate, YouTube | YouTube | New York, NY | | https://www.indeed.com/viewjob?jk=a19d5d6e71bf... |
| 4 | Employee Relations Analyst | Google | Hybrid remote in New York, NY | $122,000 a year | https://www.indeed.com/viewjob?jk=242500ede18e... |
| ... | ... | ... | ... | ... | ... |
| 100 | Associate | Morgan Stanley | New York, NY | Estimated $85.7K – $109K a year | https://www.indeed.com/viewjob?jk=76cc643cf9c6... |
| 101 | Backend Engineer | Alma | Remote in New York State | Estimated $101K – $128K a year | https://www.indeed.com/viewjob?jk=4912f9f4277d... |
| 102 | Data Science Rotation Program Associate | New York Life Insurance Co | New York, NY | Estimated $107K – $136K a year | https://www.indeed.com/viewjob?jk=61c19d3402d0... |
| 103 | Healthcare Data Engineer (Remote) | pulseData | Remote in New York, NY | Estimated $126K – $159K a year | https://www.indeed.com/viewjob?jk=84bb503268ed... |
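
Because each page shows 15 results while `start` advances by 10, consecutive pages may overlap, and the same job can then appear more than once in the combined table. Here is a minimal sketch of removing repeats by URL, shown on plain dictionaries with made-up rows; on the DataFrame itself, `data.drop_duplicates(subset='url')` should have the same effect.

```python
def dedupe_by_url(rows):
    """Keep the first row seen for each URL, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row['url'] not in seen:
            seen.add(row['url'])
            unique.append(row)
    return unique

# Made-up rows for illustration only
rows = [
    {'jobTitle': 'Python Developer', 'url': 'https://www.indeed.com/viewjob?jk=aaa'},
    {'jobTitle': 'Data Engineer', 'url': 'https://www.indeed.com/viewjob?jk=bbb'},
    {'jobTitle': 'Python Developer', 'url': 'https://www.indeed.com/viewjob?jk=aaa'},
]
print(len(dedupe_by_url(rows)))
```

Deduplicating on the URL rather than the title keeps distinct postings that happen to share a job title.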


### Output

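We can now save the data to a CSV file.

In [591]:
# Use a timestamp without spaces or colons so the file name is valid on every OS
today = pd.Timestamp.today().strftime('%Y-%m-%d_%H%M%S')
data.to_csv(f'{today}.csv')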