# Job Board Scraping Lab

In this lab you will first see a minimal but fully functional code snippet that scrapes the LinkedIn Job Search webpage. You will then build on the example code to complete several challenges.

### Some Resources 

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

In [4]:
# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

def scrape_linkedin_job_search(keywords):
    """Search job posts on LinkedIn and convert the results into a dataframe."""

    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change.
    
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
    

    # Assemble the full url with parameters
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])
    # Send a single request to get the data from the server
    response = requests.get(scrape_url, headers=headers)
    if response.status_code != 200:
        print('Failed to retrieve the webpage. Status code:', response.status_code)
        return None

    soup = BeautifulSoup(response.text, 'html.parser')
    
    # find_all returns every matching element; find would return only the first one
    job_name_elements = soup.find_all('span', class_='sr-only')
    job_name_list = []
    for job in job_name_elements:
        job_name = job.get_text(strip=True)
        job_name_list.append(job_name)
    job_name_list = job_name_list[2:-1]

    location_elements = soup.find_all('span', class_='job-result-card__location')
    location_list = []
    for location in location_elements:
        # Append to the list, not to the element being iterated over
        location_list.append(location.get_text(strip=True))

    company_element = soup.find_all('a', class_='hidden-nested-link')

    company_names = [i.get_text(strip=True) for i in company_element]

    df = pd.DataFrame({
        'Job Name': job_name_list,
        'Company Name': company_names,
        'Location': location_list        
    })     
    # Return dataframe
    return df

In [11]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

# Set a User-Agent header so LinkedIn is less likely to block the request
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}

def scrape_linkedin_job_search(keywords):
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])
    page = requests.get(scrape_url, headers=HEADERS)
    soup = BeautifulSoup(page.text, 'html.parser')
    
    titles = []
    companies = []
    locations = []
    for card in soup.select("div.base-search-card__info"):
        title = card.select_one("h3.base-search-card__title")
        company = card.select_one("h4.base-search-card__subtitle")
        location = card.select_one("span.job-search-card__location")
        titles.append(title.get_text(strip=True) if title else None)
        companies.append(company.get_text(strip=True) if company else None)
        locations.append(location.get_text(strip=True) if location else None)

    data = pd.DataFrame({'Title': titles, 'Company': companies, 'Location': locations})
    return data

In [12]:
# Example call to the function

results = scrape_linkedin_job_search('data%20analysis')
results

| | Title | Company | Location |
|---|---|---|---|
| 0 | Associate, Strategic Finance - LinkedIn Market... | LinkedIn | New York, NY |
| 1 | Senior Associate, Insights Analytics Engineer | LinkedIn | New York, NY |
| 2 | Senior Associate, Insights Analytics Engineer | LinkedIn | Chicago, IL |
| 3 | Junior Quantitative Analyst - Remote | Novartis Norge | Santa Ana, CA |
| 4 | Junior Quantitative Analyst - Remote | Novartis Norge | Riverside, CA |
| 5 | Junior Quantitative Analyst - Remote | Novartis Norge | Stockton, CA |
| 6 | Data Scientist | Facebook | United States |
| 7 | Business Analyst - Operations Analytics | Rover.com | Seattle, WA |
| 8 | Data Analysis Intern (Yearlong) | AUMOVIO | Auburn Hills, MI |
| 9 | Data Analytics | Microsoft | Redmond, WA |


## Challenge 1

The first challenge for you is to update the `scrape_linkedin_job_search` function by adding a new parameter called `num_pages`. This will allow you to search more than 25 jobs with this function. Suggested steps:

1. Go to https://www.linkedin.com/jobs/search/?keywords=data%20analysis in your browser.
1. Scroll down the left panel and click the page 2 link. Look at how the URL changes and identify the page offset parameter.
1. Add `num_pages` as a new param to the `scrape_linkedin_job_search` function. Update the function code so that it uses a "for" loop to retrieve several pages of search results.
1. Test your new function by scraping 5 pages of the search results.

Hint: Prepare for the case where there are fewer than 5 pages of search results. Your function should be robust enough to **not** trigger errors. If the search has already reached the end, simply skip any further requests and return all results collected so far.
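
The paging logic can be tried out on its own before any request is sent. The block below is a minimal sketch, assuming the offset parameter is named `start` and that each page holds 25 results; `build_page_urls` and `RESULTS_PER_PAGE` are hypothetical names introduced only for illustration, so check the real parameter in your browser as described in step 2.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://www.linkedin.com/jobs/search/?'
RESULTS_PER_PAGE = 25  # assumption: one page of search results holds 25 jobs


def build_page_urls(keywords, num_pages):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page in range(num_pages):
        # 'start' is the assumed name of the page offset parameter
        params = {'keywords': keywords, 'start': page * RESULTS_PER_PAGE}
        # quote_via=quote encodes spaces as %20 instead of '+'
        urls.append(BASE_URL + urlencode(params, quote_via=quote))
    return urls


for url in build_page_urls('data analysis', 3):
    print(url)
```

Building the URLs separately from the scraping loop makes it easy to confirm the offsets (0, 25, 50, ...) without hitting the server.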

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}

def scrape_linkedin_job_search(keywords, num_pages=1):
    
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    all_titles = []
    all_companies = []
    all_locations = []

    for page in range(num_pages):
        start = page * 25
        scrape_url = f"{BASE_URL}keywords={keywords}&start={start}"
        
        response = requests.get(scrape_url, headers=HEADERS)
        if response.status_code != 200:
            break  # Stop if the request fails
        
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.select("div.base-search-card__info")
        if not cards:
            break  # No more job listings, stop pagination
        
        for card in cards:
            title = card.select_one("h3.base-search-card__title")
            company = card.select_one("h4.base-search-card__subtitle")
            location = card.select_one("span.job-search-card__location")
            all_titles.append(title.get_text(strip=True) if title else None)
            all_companies.append(company.get_text(strip=True) if company else None)
            all_locations.append(location.get_text(strip=True) if location else None)
        
        time.sleep(1)  # Polite delay between requests

    data = pd.DataFrame({
        'Title': all_titles,
        'Company': all_companies,
        'Location': all_locations
    })
    return data



In [None]:
# Test with 5 pages
df = scrape_linkedin_job_search("data analysis", num_pages=5)
df.head()

| | Title | Company | Location |
|---|---|---|---|
| 0 | Associate, Strategic Finance - LinkedIn Market... | LinkedIn | New York, NY |
| 1 | Senior Associate, Insights Analytics Engineer | LinkedIn | New York, NY |
| 2 | Senior Associate, Insights Analytics Engineer | LinkedIn | Chicago, IL |
| 3 | Business Analyst - Operations Analytics | Rover.com | Seattle, WA |
| 4 | Junior Quantitative Analyst - Remote | Novartis Norge | Santa Ana, CA |


## Challenge 2

Further improve your function so that it can search for jobs in a specific country. Add a third parameter called `country` to your function. The steps are identical to those in Challenge 1.
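
Country names may contain spaces or non-ASCII characters, so they need percent-encoding before they go into the URL. The following is a small sketch of that step, assuming the filter is passed in a query parameter named `location`; `add_location` is a hypothetical helper, not part of any library.

```python
from urllib.parse import quote


def add_location(url, country=None):
    """Append a percent-encoded location filter when a country is given."""
    if country:
        # 'location' is the assumed parameter name; confirm it in the browser URL
        url += f"&location={quote(country)}"
    return url


base = 'https://www.linkedin.com/jobs/search/?keywords=data%20analysis&start=0'
print(add_location(base))                   # unchanged when no country is given
print(add_location(base, 'United States'))  # space becomes %20
```

Keeping `country=None` as the default means existing calls without a country keep working.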

In [18]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
from urllib.parse import quote

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}

def scrape_linkedin_job_search(keywords, num_pages=1, country=None):
    
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    all_titles = []
    all_companies = []
    all_locations = []

    for page in range(num_pages):
        start = page * 25
        # Base URL with keywords and start parameter
        url = f"{BASE_URL}keywords={quote(keywords)}&start={start}"
        # Append country filter if provided
        if country:
            url += f"&location={quote(country)}"
        
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 200:
            break  # Stop if the request fails
        
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.select("div.base-search-card__info")
        if not cards:
            break  # No more job listings, stop pagination
        
        for card in cards:
            title = card.select_one("h3.base-search-card__title")
            company = card.select_one("h4.base-search-card__subtitle")
            location = card.select_one("span.job-search-card__location")
            all_titles.append(title.get_text(strip=True) if title else None)
            all_companies.append(company.get_text(strip=True) if company else None)
            all_locations.append(location.get_text(strip=True) if location else None)
        
        time.sleep(1)  # Polite delay between requests

    data = pd.DataFrame({
        'Title': all_titles,
        'Company': all_companies,
        'Location': all_locations
    })
    return data

In [None]:
# Test with 5 pages in a specific country
df = scrape_linkedin_job_search("data analysis", num_pages=5, country="Germany")
df.head()

Scraped 279 jobs


| | Title | Company | Location |
|---|---|---|---|
| 0 | Junior Credit Decisions Analyst (m/f/d) | YouLend | Berlin, Berlin, Germany |
| 1 | Data Analysis Internship | Lucentra Group | Berlin, Berlin, Germany |
| 2 | Data Analyst | Cognizant Netcentric | Frankfurt am Main, Hesse, Germany |
| 3 | Data Science Intern (f/m/x) | Enpal | Berlin, Berlin, Germany |
| 4 | Junior Data Analyst (m/f/d) | Westwing | Munich, Bavaria, Germany |


## Challenge 3

Add a fourth parameter called `num_days` to your function so that it can search for jobs posted in the past X days. Note that in the LinkedIn job search, the timespan is specified with the following parameter:

```
f_TPR=r259200
```

The numeric part of the parameter value is a number of seconds: 259,200 seconds equals 3 days. You need to convert `num_days` to seconds and pass that value to the LinkedIn job search.
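
The conversion is plain arithmetic: one day is 24 × 60 × 60 = 86,400 seconds, so 3 days gives 259,200. A minimal sketch of the conversion follows; `time_filter` and `SECONDS_PER_DAY` are hypothetical names used only for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in one day


def time_filter(num_days=None):
    """Return the f_TPR query fragment for the past num_days days.

    Returns an empty string when no timespan is requested.
    """
    if num_days is None:
        return ''
    # int() keeps the value a whole number even if num_days is a float
    seconds = int(num_days * SECONDS_PER_DAY)
    return f"&f_TPR=r{seconds}"


print(time_filter(3))  # &f_TPR=r259200
print(time_filter(7))  # &f_TPR=r604800
```

Checking against `None` rather than truthiness lets a caller pass `0` deliberately without it being treated as "no filter".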

In [19]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
from urllib.parse import quote

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}

def scrape_linkedin_job_search(keywords, num_pages=1, country=None, num_days=None):
    
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    all_titles = []
    all_companies = []
    all_locations = []

    for page in range(num_pages):
        start = page * 25
        # Build base URL with keywords and start
        url = f"{BASE_URL}keywords={quote(keywords)}&start={start}"
        
        # Add country filter if provided
        if country:
            url += f"&location={quote(country)}"
        
        # Add time filter if num_days is provided
        if num_days is not None:
            seconds = num_days * 24 * 60 * 60
            url += f"&f_TPR=r{seconds}"
        
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 200:
            break  # Stop if the request fails
        
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.select("div.base-search-card__info")
        if not cards:
            break  # No more job listings, stop pagination
        
        for card in cards:
            title = card.select_one("h3.base-search-card__title")
            company = card.select_one("h4.base-search-card__subtitle")
            location = card.select_one("span.job-search-card__location")
            all_titles.append(title.get_text(strip=True) if title else None)
            all_companies.append(company.get_text(strip=True) if company else None)
            all_locations.append(location.get_text(strip=True) if location else None)
        
        time.sleep(1)  # Polite delay between requests

    data = pd.DataFrame({
        'Title': all_titles,
        'Company': all_companies,
        'Location': all_locations
    })
    return data

In [None]:
# Test with 5 pages, in Germany, posted in last 7 days
df = scrape_linkedin_job_search("data analysis", num_pages=5, country="Germany", num_days=7)
df.head()

Scraped 285 jobs


| | Title | Company | Location |
|---|---|---|---|
| 0 | Business-Analyst | Instaffo | Frechen, North Rhine-Westphalia, Germany |
| 1 | Junior Credit Decisions Analyst (m/f/d) | YouLend | Berlin, Berlin, Germany |
| 2 | Data Analysis Internship | Lucentra Group | Berlin, Berlin, Germany |
| 3 | Internship - Data & Analytics (m/f/x) | FINN | Munich, Bavaria, Germany |
| 4 | Data Analyst | Cognizant Netcentric | Frankfurt am Main, Hesse, Germany |


## Bonus Challenge

Allow your function to also retrieve the "Seniority Level" of each job searched. Note that the Seniority Level info is not in the initial search results. You need to make a separate search request for each job card based on the `currentJobId` value which you can extract from the job card HTML.

After you obtain the Seniority Level info, update the function and add it to a new column of the returned dataframe.
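
As a possible starting point, the sketch below covers only the first step: pulling job IDs out of the search-result HTML and building one follow-up URL per job. It rests on two assumptions that you should verify with your browser's inspector: that each job card carries an attribute of the form `data-entity-urn="urn:li:jobPosting:<id>"`, and that the ID can be passed back through the `currentJobId` parameter. The helper names and the sample HTML are hypothetical.

```python
import re

BASE_URL = 'https://www.linkedin.com/jobs/search/?'

# Assumption: job cards embed their ID as "urn:li:jobPosting:<digits>"
JOB_URN_PATTERN = re.compile(r'urn:li:jobPosting:(\d+)')


def extract_job_ids(html):
    """Return the unique job IDs found in the HTML, in order of appearance."""
    job_ids = []
    for job_id in JOB_URN_PATTERN.findall(html):
        if job_id not in job_ids:
            job_ids.append(job_id)
    return job_ids


def job_detail_url(job_id):
    """Build the follow-up URL for one job via the currentJobId parameter."""
    return f"{BASE_URL}currentJobId={job_id}"


# Made-up HTML that mimics the assumed card structure
sample_html = '''
<div class="base-card" data-entity-urn="urn:li:jobPosting:111"></div>
<div class="base-card" data-entity-urn="urn:li:jobPosting:222"></div>
'''
for job_id in extract_job_ids(sample_html):
    print(job_detail_url(job_id))
```

Parsing the Seniority Level out of each response is left open here because the markup of that page is not shown in this lab. Remember to keep a delay between the per-job requests, as in the paginated versions above.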

In [None]:
# your code here