# Web Scraping Job Vacancies

## Introduction

In this project, we'll build a web scraper to extract job listings from a popular job search platform. We'll extract job titles, companies, locations, job descriptions, and other relevant information.

Here are the main steps we'll follow in this project:

1. Set up our development environment
2. Understand the basics of web scraping
3. Analyze the website structure of our job search platform
4. Write the Python code to extract job data from our job search platform
5. Save the data to a CSV file
6. Test our web scraper and refine our code as needed

## Prerequisites

Before starting this project, you should have basic knowledge of Python programming and HTML structure. The notebook uses the following packages:

- requests
- BeautifulSoup (installed as `beautifulsoup4`, imported as `bs4`)
- csv (Python standard library)
- datetime (Python standard library)

These packages should already be available in Coursera's Jupyter Notebook environment. If you are working off platform, or need a package that is not included, install it with `!pip install packagename` in a notebook cell, for example:

- `!pip install requests`
- `!pip install beautifulsoup4`
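
The pip package name and the import name are not always the same, which is easy to trip over with BeautifulSoup. As a quick sanity check, a sketch like the following reports which of the modules used in this notebook can be imported in the current environment. The helper name `check_modules` is an assumption made for illustration, not part of the project.

```python
import importlib.util

def check_modules(module_names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}

# Import names, not pip names: beautifulsoup4 is imported as bs4.
status = check_modules(["requests", "bs4", "csv", "datetime"])
for name, available in status.items():
    print(f"{name}: {'available' if available else 'missing'}")
```

Any module reported as missing can then be installed with `!pip install` using its pip name.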

## Step 1: Importing Required Libraries

In [3]:
# your code here

In [1]:
!pip install requests

You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.


In [3]:
!pip install requests beautifulsoup4 pandas

You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.


In [2]:
def generate_url(position, location):
    """
    Generate URL for job search based on position and location parameters.
    
    Args:
        position (str): Job position/title to search for
        location (str): Location of the job
        
    Returns:
        str: Complete URL for web scraping
    """
    # Convert parameters to URL-friendly format
    position = position.replace(' ', '+')
    location = location.replace(' ', '+')
    
    # Base URL template
    base_url = "https://www.indeed.com/jobs"
    
    # Construct complete URL with parameters
    url = f"{base_url}?q={position}&l={location}"  # Indeed uses q= for the search query and l= for the location
    
    return url

# Test the function
url = generate_url("generative AI specialist", "California")
print(url)

https://www.indeed.com/jobs?q=generative+AI+specialist&l=California
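
Replacing spaces by hand works for simple search terms, but characters such as `&`, `+`, or `#` would still break the query string. A more defensive sketch, assuming the same `q` and `l` parameters used in `generate_url`, lets the standard library's `urllib.parse.urlencode` do the escaping. The function name `build_search_url` is hypothetical and chosen so it does not shadow `generate_url`.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"  # same base URL as generate_url

def build_search_url(position, location):
    """Build the search URL, percent-encoding any special characters."""
    # urlencode escapes each value with quote_plus, so spaces become '+'
    query = urlencode({"q": position, "l": location})
    return f"{BASE_URL}?{query}"

print(build_search_url("generative AI specialist", "California"))
print(build_search_url("C++ developer", "New York"))
```

For plain search terms the result matches `generate_url`; the difference only shows up when the input contains reserved characters.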


In [3]:
# Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
def extract_job_data(job_card):
    """
    Extract relevant information from a single job posting card.
    
    Args:
        job_card (BeautifulSoup object): HTML of a single job posting
        
    Returns:
        dict: Dictionary containing job information
    """
    job_data = {}
    
    # Extract job title
    try:
        job_data['title'] = job_card.find('h2', class_='jobTitle').get_text().strip()
    except (AttributeError, IndexError):
        job_data['title'] = ""

    # Extract company name
    try:
        job_data['company'] = job_card.find('span', class_='companyName').get_text().strip()
    except (AttributeError, IndexError):
        job_data['company'] = ""

    # Extract location
    try:
        job_data['location'] = job_card.find('div', class_='companyLocation').get_text().strip()
    except (AttributeError, IndexError):
        job_data['location'] = ""

    # Extract salary if available
    try:
        job_data['salary'] = job_card.find('div', class_='salary-snippet').get_text().strip()
    except (AttributeError, IndexError):
        job_data['salary'] = "Not specified"

    # Extract job description snippet
    try:
        job_data['description'] = job_card.find('div', class_='job-snippet').get_text().strip()
    except (AttributeError, IndexError):
        job_data['description'] = ""

    # Extract posting date
    try:
        job_data['date_posted'] = job_card.find('span', class_='date').get_text().strip()
    except (AttributeError, IndexError):
        job_data['date_posted'] = ""

    # Extract job URL
    try:
        job_link = job_card.find('a', class_='jcs-JobTitle')
        job_data['url'] = 'https://www.indeed.com' + job_link.get('href')
    except (AttributeError, IndexError, TypeError):  # TypeError: link has no href, so get() returned None
        job_data['url'] = ""

    return job_data
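
Each field in `extract_job_data` repeats the same find-then-strip pattern inside its own `try`/`except`. One way to cut the repetition, sketched here under the assumption that every field is a single tag and class lookup, is a small helper that returns a default when the element is missing. `safe_text` and `extract_basic_fields` are hypothetical names; `find` and `get_text` are the same BeautifulSoup calls used in `extract_job_data`.

```python
def safe_text(card, tag, class_name, default=""):
    """Return the stripped text of the first matching element, or a default."""
    element = card.find(tag, class_=class_name)
    if element is None:  # find() returns None when nothing matches
        return default
    return element.get_text().strip()

def extract_basic_fields(job_card):
    """Collect the text fields of one job card using safe_text."""
    return {
        "title": safe_text(job_card, "h2", "jobTitle"),
        "company": safe_text(job_card, "span", "companyName"),
        "location": safe_text(job_card, "div", "companyLocation"),
        "salary": safe_text(job_card, "div", "salary-snippet", default="Not specified"),
        "description": safe_text(job_card, "div", "job-snippet"),
    }
```

The class names are the ones used in this notebook and may no longer match the live page.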

In [5]:
# First, get a webpage with job listings
url = generate_url("generative AI specialist", "California")
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

# Find the first job card
first_job_card = soup.find('div', class_='job_seen_beacon')

# Test the extraction function
if first_job_card:
    job_info = extract_job_data(first_job_card)
    print("\nExtracted Job Information:")
    for key, value in job_info.items():
        print(f"{key}: {value}")
else:
    print("No job cards found. The HTML structure might have changed or the page might not have loaded properly.")

No job cards found. The HTML structure might have changed or the page might not have loaded properly.


In [6]:
# Debug: Print the HTML to see what we're actually getting
url = generate_url("generative AI specialist", "California")
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

# Print the first few job cards we find
job_cards = soup.find_all('div', class_='job_seen_beacon')
print("Number of job cards found:", len(job_cards))

# Try alternative class names that Indeed might be using
print("\nTrying alternative selectors...")
cards = soup.find_all('div', class_='cardOutline')
print("Cards with 'cardOutline' class:", len(cards))

# Print a sample of the HTML to see the structure
print("\nSample of the page HTML:")
print(soup.prettify()[:1000])

Number of job cards found: 0

Trying alternative selectors...
Cards with 'cardOutline' class: 0

Sample of the page HTML:
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Security Check - Indeed.com
  </title>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style>
   :root{color-scheme:light dark;--background-color:#fff;--primary-1000:#0d2d5e;--primary-900:#164081;--primary-800:#2557a7;--primary-700:#3f73d3;--primary-600:#6792f0;--neutral-1000:#2d2d2d;--neutral-900:#424242;--neutral-400:#d4d2d0;--dark-1000:#040606;--link-color:var(--primary-800);--link-color-hover:var(--primary-900);--menu-background-color:#fff;--text-color:var(--neutral-1000);--text-color-hover:var(--neutral-900);--default-transition:cubic-bezier(.645,.045,.355,1);--menu-transition:.28s all .12s ease-out;--font-family:"Noto Sans",system-ui,-apple-system,BlinkMacSystemFont,"Helvetica Neue",Arial,sans-serif;--icon-profile:url("data:image/svg+xml,%3Csvg width='18'
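
The HTML printed here is not a results page: its title is `Security Check - Indeed.com`, which suggests the request was intercepted before any job cards were rendered. A small check like the one below can make that failure explicit instead of silently finding zero cards. It is a sketch under the assumption that a blocked response is recognizable by a non-200 status code or by that title text; `looks_blocked` is a hypothetical helper, and other sites or future versions of the page may signal a block differently.

```python
import re

def looks_blocked(status_code, html):
    """Guess whether a response is an anti-bot page rather than search results."""
    if status_code != 200:
        return True
    # Pull the <title> text; the block page in this notebook reads "Security Check - Indeed.com"
    match = re.search(r"<title[^>]*>(.*?)</title>", html, flags=re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    return "security check" in title.lower()

sample = "<html><head><title>\n Security Check - Indeed.com\n</title></head></html>"
print(looks_blocked(403, sample))
print(looks_blocked(200, "<html><head><title>Jobs</title></head></html>"))
```

With `requests`, this could be called as `looks_blocked(page.status_code, page.text)` before handing the HTML to BeautifulSoup.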

In [1]:
def extract_job_data(job_post):
    """
    Extract data from a single job posting BeautifulSoup object
    
    Args:
        job_post: BeautifulSoup object containing a single job posting
        
    Returns:
        dict: Dictionary containing the extracted job information
    """
    # Initialize a dictionary to store the job data
    job_data = {}
    
    # Extract job title
    try:
        job_title = job_post.find('h2', class_='jobTitle').text.strip()
    except (AttributeError, IndexError):
        job_title = ""
    job_data['title'] = job_title
    
    # Extract company name
    try:
        company = job_post.find('span', class_='companyName').text.strip()
    except (AttributeError, IndexError):
        company = ""
    job_data['company'] = company
    
    # Extract location
    try:
        location = job_post.find('div', class_='companyLocation').text.strip()
    except (AttributeError, IndexError):
        location = ""
    job_data['location'] = location
    
    # Extract salary if available
    try:
        salary = job_post.find('div', class_='salary-snippet').text.strip()
    except (AttributeError, IndexError):
        salary = "Not listed"
    job_data['salary'] = salary
    
    # Extract job description
    try:
        description = job_post.find('div', class_='job-snippet').text.strip()
    except (AttributeError, IndexError):
        description = ""
    job_data['description'] = description
    
    return job_data

# Test the function with a sample job posting
def test_extract_job():
    # First, get a webpage with job listings
    url = generate_url("Data Analyst", "New York")
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Find first job posting
    first_job = soup.find('div', class_='job_seen_beacon')
    
    if first_job:
        # Test the extraction function
        job_info = extract_job_data(first_job)
        print("Extracted Job Information:")
        for key, value in job_info.items():
            print(f"{key}: {value}")
    else:
        print("No job posting found")

# Run the test
test_extract_job()

NameError: name 'generate_url' is not defined

In [7]:
import requests
from bs4 import BeautifulSoup

In [8]:
def generate_url(position, location):
    """
    Generate Indeed job search URL based on position and location parameters.
    """
    # Convert parameters to URL-friendly format
    position = position.replace(' ', '+')
    location = location.replace(' ', '+')
    
    # Indeed's base URL
    base_url = "https://www.indeed.com/jobs"
    
    # Construct URL with Indeed's specific structure
    url = f"{base_url}?q={position}&l={location}"
    
    return url

In [9]:
def extract_job_data(job_post):
    """
    Extract data from a single job posting BeautifulSoup object
    """
    job_data = {}
    
    # Extract job title
    try:
        job_title = job_post.find('h2', class_='jobTitle').text.strip()
    except (AttributeError, IndexError):
        job_title = ""
    job_data['title'] = job_title
    
    # Extract company name
    try:
        company = job_post.find('span', class_='companyName').text.strip()
    except (AttributeError, IndexError):
        company = ""
    job_data['company'] = company
    
    # Extract location
    try:
        location = job_post.find('div', class_='companyLocation').text.strip()
    except (AttributeError, IndexError):
        location = ""
    job_data['location'] = location
    
    # Extract salary if available
    try:
        salary = job_post.find('div', class_='salary-snippet').text.strip()
    except (AttributeError, IndexError):
        salary = "Not listed"
    job_data['salary'] = salary
    
    # Extract job description
    try:
        description = job_post.find('div', class_='job-snippet').text.strip()
    except (AttributeError, IndexError):
        description = ""
    job_data['description'] = description
    
    return job_data

In [10]:
def test_extract_job():
    # Get a webpage with job listings
    url = generate_url("Data Analyst", "New York")
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Find first job posting
    first_job = soup.find('div', class_='job_seen_beacon')
    
    if first_job:
        # Test the extraction function
        job_info = extract_job_data(first_job)
        print("Extracted Job Information:")
        for key, value in job_info.items():
            print(f"{key}: {value}")
    else:
        print("No job posting found")

In [11]:
# Run the test
test_extract_job()

No job posting found


In [12]:
import requests
from bs4 import BeautifulSoup

def generate_url(position, location):
    position = position.replace(' ', '+')
    location = location.replace(' ', '+')
    base_url = "https://www.indeed.com/jobs"
    url = f"{base_url}?q={position}&l={location}"
    return url

def test_page_content():
    # Get a webpage with job listings
    url = generate_url("Data Analyst", "New York")
    
    # Updated headers to look more like a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
    }
    
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Print the URL we're trying to access
    print("URL:", url)
    
    # Print response status code
    print("Response Status Code:", page.status_code)
    
    # Look for job cards with different possible class names
    job_cards = soup.find_all('div', class_=['job_seen_beacon', 'jobsearch-ResultsList', 'tapItem'])
    print("\nNumber of job cards found:", len(job_cards))
    
    # Print all div elements with 'job' in their class name
    print("\nAll div elements with 'job' in class name:")
    job_related_divs = soup.find_all('div', class_=lambda x: x and 'job' in x.lower())
    for div in job_related_divs[:5]:  # Show first 5 only
        print("Class:", div.get('class'))

    # Print the first 1000 characters of the HTML to see what we're getting
    print("\nFirst 1000 characters of HTML:")
    print(soup.prettify()[:1000])

# Run the test
test_page_content()

URL: https://www.indeed.com/jobs?q=Data+Analyst&l=New+York
Response Status Code: 403

Number of job cards found: 0

All div elements with 'job' in class name:

First 1000 characters of HTML:
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Security Check - Indeed.com
  </title>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style>
   :root{color-scheme:light dark;--background-color:#fff;--primary-1000:#0d2d5e;--primary-900:#164081;--primary-800:#2557a7;--primary-700:#3f73d3;--primary-600:#6792f0;--neutral-1000:#2d2d2d;--neutral-900:#424242;--neutral-400:#d4d2d0;--dark-1000:#040606;--link-color:var(--primary-800);--link-color-hover:var(--primary-900);--menu-background-color:#fff;--text-color:var(--neutral-1000);--text-color-hover:var(--neutral-900);--default-transition:cubic-bezier(.645,.045,.355,1);--menu-transition:.28s all .12s ease-out;--font-family:"Noto Sans",system-ui,-apple-system,BlinkMacSystemFont,"Helvetica Neue",Aria

In [14]:
import requests
from bs4 import BeautifulSoup
import time  # used to pause between requests
import random  # used to randomize the length of each pause

def generate_url(position, location):
    position = position.replace(' ', '+')
    location = location.replace(' ', '+')
    base_url = "https://www.indeed.com/jobs"
    url = f"{base_url}?q={position}&l={location}"
    return url

def get_page_content(url):
    """
    Get page content with enhanced headers and error handling
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests only decodes Brotli if a brotli package is installed
        'Referer': 'https://www.indeed.com',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0'
    }
    
    try:
        # Add a random delay between requests
        time.sleep(random.uniform(1, 3))
        response = requests.get(url, headers=headers, timeout=10)  # timeout so a stalled request cannot hang the cell
        
        if response.status_code == 200:
            return response
        else:
            print(f"Failed to retrieve page. Status code: {response.status_code}")
            return None
            
    except requests.RequestException as e:
        print(f"Error during request: {e}")
        return None

def test_page_content():
    url = generate_url("Data Analyst", "New York")
    response = get_page_content(url)
    
    if response:
        soup = BeautifulSoup(response.content, 'html.parser')
        print("\nSuccessfully retrieved page!")
        
        # Check for different possible job card classes
        job_cards = soup.find_all('div', class_=['job_seen_beacon', 'jobsearch-ResultsList', 'tapItem'])
        print(f"\nNumber of job cards found: {len(job_cards)}")
        
        if len(job_cards) == 0:
            print("\nNote: Indeed might be using different HTML classes or blocking our request.")
            print("You might need to:")
            print("1. Use an API instead of web scraping")
            print("2. Consider using Indeed's official API")
            print("3. Use a web scraping service that handles anti-bot measures")
    else:
        print("\nFailed to retrieve the page. Indeed might be blocking our request.")
        print("Consider using Indeed's official API instead of web scraping.")

# Run the test
test_page_content()

Failed to retrieve page. Status code: 403

Failed to retrieve the page. Indeed might be blocking our request.
Consider using Indeed's official API instead of web scraping.


In [15]:
def extract_job_data(job_post):
    """
    Extract data from a job posting
    
    Args:
        job_post: Dictionary containing job posting data
        
    Returns:
        dict: Dictionary containing the extracted job information
    """
    job_data = {}
    
    # Extract data with try/except blocks as required by the project
    try:
        job_data['title'] = job_post.get('title', '')
    except AttributeError:
        job_data['title'] = ""
        
    try:
        job_data['company'] = job_post.get('company', '')
    except AttributeError:
        job_data['company'] = ""
        
    try:
        job_data['location'] = job_post.get('location', '')
    except AttributeError:
        job_data['location'] = ""
        
    try:
        job_data['salary'] = job_post.get('salary', 'Not listed')
    except AttributeError:
        job_data['salary'] = "Not listed"
        
    try:
        job_data['description'] = job_post.get('description', '')
    except AttributeError:
        job_data['description'] = ""
    
    return job_data

# Test the function with sample data
def test_extract_job():
    # Sample job posting data
    sample_job = {
        'title': 'Data Analyst',
        'company': 'Tech Corp',
        'location': 'New York, NY',
        'salary': '$70,000 - $90,000 per year',
        'description': 'Looking for an experienced Data Analyst...'
    }
    
    # Test with complete data
    job_info = extract_job_data(sample_job)
    print("Complete job posting test:")
    for key, value in job_info.items():
        print(f"{key}: {value}")
        
    # Test with missing data
    incomplete_job = {
        'title': 'Software Engineer',
        'company': 'Tech Corp'
        # Location and salary intentionally missing
    }
    
    print("\nIncomplete job posting test:")
    job_info = extract_job_data(incomplete_job)
    for key, value in job_info.items():
        print(f"{key}: {value}")

# Run the test
test_extract_job()

Complete job posting test:
title: Data Analyst
company: Tech Corp
location: New York, NY
salary: $70,000 - $90,000 per year
description: Looking for an experienced Data Analyst...

Incomplete job posting test:
title: Software Engineer
company: Tech Corp
location: 
salary: Not listed
description: 
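
Because `dict.get` already returns a default for a missing key, the five `try`/`except` blocks in this version of `extract_job_data` can be collapsed into a single table of fields and defaults. The sketch below assumes the input is a plain dictionary, as in the sample data; `FIELD_DEFAULTS` and `normalize_job` are hypothetical names.

```python
# Field name -> value to use when the posting does not provide it
FIELD_DEFAULTS = {
    "title": "",
    "company": "",
    "location": "",
    "salary": "Not listed",
    "description": "",
}

def normalize_job(job_post):
    """Return a dict with every expected field, filling gaps with defaults."""
    if not isinstance(job_post, dict):
        job_post = {}  # treat unusable input as an empty posting
    return {field: job_post.get(field, default) for field, default in FIELD_DEFAULTS.items()}

print(normalize_job({"title": "Software Engineer", "company": "Tech Corp"}))
```

Keeping the field list in one place also gives the CSV step a single source for its column names, for example `fieldnames=list(FIELD_DEFAULTS)`.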


In [17]:
def main():
    """
    Main function to coordinate the job scraping process
    """
    try:
        # List of positions and locations to search for
        searches = [
            {"position": "Data Analyst", "location": "New York"},
            {"position": "Software Engineer", "location": "California"}
        ]
        
        # List to store all job results
        all_jobs = []
        
        # Process each search
        for search in searches:
            print(f"\nProcessing search: {search['position']} in {search['location']}")
            
            # For demonstration, we'll use sample data since Indeed blocks scraping
            sample_jobs = [
                {
                    'title': f'Senior {search["position"]}',
                    'company': 'Tech Corp',
                    'location': search['location'],
                    'salary': '$70,000 - $90,000 per year',
                    'description': f'Looking for an experienced {search["position"]}...'
                },
                {
                    'title': f'Junior {search["position"]}',
                    'company': 'Startup Inc',
                    'location': search['location'],
                    'salary': 'Not listed',
                    'description': f'Entry level {search["position"]} position...'
                }
            ]
            
            # Process each job posting
            for job in sample_jobs:
                job_data = extract_job_data(job)
                all_jobs.append(job_data)
                
        # Print results
        print("\nAll extracted jobs:")
        for i, job in enumerate(all_jobs, 1):
            print(f"\nJob {i}:")
            for key, value in job.items():
                print(f"{key}: {value}")
                
        print(f"\nTotal jobs processed: {len(all_jobs)}")
        
    except Exception as e:
        print(f"An error occurred in main: {str(e)}")

# Run the program
if __name__ == "__main__":
    main()


Processing search: Data Analyst in New York

Processing search: Software Engineer in California

All extracted jobs:

Job 1:
title: Senior Data Analyst
company: Tech Corp
location: New York
salary: $70,000 - $90,000 per year
description: Looking for an experienced Data Analyst...

Job 2:
title: Junior Data Analyst
company: Startup Inc
location: New York
salary: Not listed
description: Entry level Data Analyst position...

Job 3:
title: Senior Software Engineer
company: Tech Corp
location: California
salary: $70,000 - $90,000 per year
description: Looking for an experienced Software Engineer...

Job 4:
title: Junior Software Engineer
company: Startup Inc
location: California
salary: Not listed
description: Entry level Software Engineer position...

Total jobs processed: 4


In [18]:
import requests
from bs4 import BeautifulSoup
import csv

def main(position, location):
    """
    Main function to scrape job postings from Indeed
    
    Args:
        position (str): Job position to search for
        location (str): Location to search in
    """
    try:
        # 1. Set headers for HTTP request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Connection': 'keep-alive',
        }
        
        # 2. Construct URL using previous function
        url = generate_url(position, location)
        
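        # 3. Send HTTP request and get HTML
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # a blocked request (e.g. 403) is reported by the except block below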
        
        # 4. Parse HTML with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')
        job_postings = soup.find_all('div', class_='job_seen_beacon')
        
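        # 5. Extract job information for each posting
        # Note: this step needs the BeautifulSoup-based extract_job_data (In [9]). The dict-based
        # version from In [15] would read HTML attributes via Tag.get() and return empty fields.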
        jobs_list = []
        for posting in job_postings:
            job_data = extract_job_data(posting)
            jobs_list.append(job_data)
            
        # 6. Write to CSV file
        filename = f"jobs_{position}_{location}.csv"
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            # Define CSV headers
            fieldnames = ['title', 'company', 'location', 'salary', 'description']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            
            # Write headers and data
            writer.writeheader()
            for job in jobs_list:
                writer.writerow(job)
        
        # 7. Print success message
        print(f"Successfully scraped {len(jobs_list)} job postings for {position} in {location}")
        print(f"Data has been saved to {filename}")
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")

# Since Indeed is blocking our requests, let's create a version that works with sample data
def main_with_sample_data(position, location):
    """
    Main function using sample data for demonstration
    """
    try:
        # Create sample job postings
        sample_jobs = [
            {
                'title': f'Senior {position}',
                'company': 'Tech Corp',
                'location': location,
                'salary': '$70,000 - $90,000 per year',
                'description': f'Looking for an experienced {position}...'
            },
            {
                'title': f'Junior {position}',
                'company': 'Startup Inc',
                'location': location,
                'salary': 'Not listed',
                'description': f'Entry level {position} position...'
            }
        ]
        
        # Extract job information using our existing function
        jobs_list = []
        for posting in sample_jobs:
            job_data = extract_job_data(posting)
            jobs_list.append(job_data)
            
        # Write to CSV file
        filename = f"jobs_{position}_{location}.csv"
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['title', 'company', 'location', 'salary', 'description']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            
            writer.writeheader()
            for job in jobs_list:
                writer.writerow(job)
        
        print(f"Successfully processed {len(jobs_list)} job postings for {position} in {location}")
        print(f"Data has been saved to {filename}")
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")

# Test the function
main_with_sample_data('developer', 'texas')

Successfully processed 2 job postings for developer in texas
Data has been saved to jobs_developer_texas.csv
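
The filename is built directly from the search terms, so a search such as `Data Analyst` in `New York` would produce a name containing spaces, and characters like `/` would break the path. A possible refinement, assuming lowercase names with underscores are acceptable, is to sanitize the terms first. `safe_filename` is a hypothetical helper, not part of the original project.

```python
import re

def safe_filename(position, location, extension="csv"):
    """Build a filesystem-friendly filename from the search terms."""
    def slug(text):
        # Replace every run of characters other than letters and digits with one underscore
        return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_").lower()
    return f"jobs_{slug(position)}_{slug(location)}.{extension}"

print(safe_filename("Data Analyst", "New York"))
print(safe_filename("developer", "texas"))
```

For the `developer` / `texas` run this still yields `jobs_developer_texas.csv`, so code that reads that file would not need to change.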


In [19]:
import pandas as pd

# Read the CSV file we just created
df = pd.read_csv('jobs_developer_texas.csv')

# Display the contents
print("\nContents of jobs_developer_texas.csv:")
print(df)


Contents of jobs_developer_texas.csv:
              title      company location                      salary  \
0  Senior developer    Tech Corp    texas  $70,000 - $90,000 per year   
1  Junior developer  Startup Inc    texas                  Not listed   

                               description  
0  Looking for an experienced developer...  
1        Entry level developer position...  
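
If pandas is not available, the standard `csv` module listed in the prerequisites can read the data back as well. The following round-trip sketch writes one sample row to an in-memory buffer and reads it again; the buffer stands in for the CSV file and is an assumption made so the example leaves nothing on disk.

```python
import csv
import io

FIELDNAMES = ["title", "company", "location", "salary", "description"]

rows = [
    {
        "title": "Senior developer",
        "company": "Tech Corp",
        "location": "texas",
        "salary": "$70,000 - $90,000 per year",
        "description": "Looking for an experienced developer...",
    },
]

# Write to an in-memory buffer instead of a file
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(rows)

# Read the same text back; every value comes back as a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0]["salary"])
```

The salary value contains a comma, so the writer quotes it and the reader restores it as a single field.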
