## Web Scraping Indeed Jobs

### Summary

Here we scrape the HTML data from the <a href="https://www.indeed.com/">Indeed</a> jobs web page.

It is used the Python 'requests' library to read the HTML content of the page, Beautifulsoup module to convert the 
content to a readable format and Pandas for data manipulation.

The job information is extracted using different search criteria and the data is saved to a csv file.

In [1]:
# Import libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

Scraping the Indeed webpage involved mainly the following steps:
- access the HTML content from the webpage
- search and extract the job details
- save the results to a csv file
- repeat the steps above using different search criteria
- merge all data into a final csv file

### 1. Define functions for web scraping

The first function **extract()** accesses the HTML content from the webpage using requests and Beautiful Soup libraries. It uses the page 
URL that includes the search criteria under 'q' (query) and 'l' (location), e.g. 'data scientist', 'Washington State'.

The second function **transform()** is used for searching and extracting all the details for a job posting, saves them to a dictionary, then 
appends all dictionaries to a list.

In [2]:
def extract(page):
    """Function to extract HTML content into Python script"""
    
    # URL of the Indeed page
    # (e.g. search criteria 'data scientist', 'Washington State')
    url = f"https://www.indeed.com/jobs?q=data%20scientist&l=Washington%20State&start={page}"                  
    
    # Read the content of the HTML page
    r = requests.get(url)   
    
    # Convert the HTML content to a readable 
    # format using BeautifulSoup
    soup = BeautifulSoup(r.content, "html.parser")
    
    return soup

In [3]:
def transform(soup):
    """Function to extract all job details for a posting in a
       dictionary, then append all dictionaries to a list."""
    
    # use find_all() function to view the info inside "div" container
    divs = soup.find_all("div", class_="job_seen_beacon")
    
    # Iterate through each div container and get job details
    for item in divs:
        # title
        job_title = item.h2.find_all("span")
        if len(job_title) == 2:
            title = item.h2.find_all("span")[1].text.strip()
        else:
            title = item.h2.find_all("span")[0].text.strip()
        # company
        company = item.find("span", class_="companyName").text.strip()
        # location
        location = item.find("div", class_="companyLocation").text.strip()
        # salary
        try:
            salary = item.find("div", {"class": "metadata salary-snippet-container"}).text.strip()
        except:
            salary = ''
        # summary
        summary = item.find("div", {"class": "job-snippet"}).text.strip().replace('\n', '')

        # Create a dictionary with all job details
        job = {
            'Title': title,
            'Company': company,
            'Location': location,
            'Salary': salary,
            'Summary': summary           
        }
        
        # Append all dictionaries to a list
        joblist.append(job)
    
    return

### 2. Scrape multiple webpages using specific search criteria

#### 2.1 Extract job details from multiple pages

In [4]:
# Search criteria: 'data scientist', 'Washington State'

# Create a list of dictionaries
# Each dictionary contains details about one job
joblist = []

# Extract jobs details for 4 web pages
# e.g. pages 40 to 70
for i in range(40, 80, 10):           
    print(f'Getting page, {i}')
    c = extract(0)
    transform(c)

# print the number of jobs found in these pages
print(len(joblist))

Getting page, 40
Getting page, 50
Getting page, 60
Getting page, 70
60


#### 2.2 Save the scraped data to a csv file

In [None]:
# Create a dataframe from the 'joblist' with 60 jobs found
# The list of jobs correspond to search criteria 'data scientist', 'Washington State'
df = pd.DataFrame(joblist)

# Print shape and first rows
print(df.shape)
df.head()

In [7]:
# Save the dataframe to a csv file
df.to_csv('jobsDS.csv', sep =",", index=False)       

#### 2.3 Repeat steps for scraping data using different search criteria. Save data into a final dataframe.

Different search criteria that were used include 'ML', 'azure', 'Tableau', 'entry level' etc.. The data was saved to
csv files, named according to the criteria used.

At the end, we merge all data into a final csv file.

In [8]:
# Load data from all job files into dataframes
# Merge the dataframes into a final one with all data

jobs1 = pd.read_csv("jobsAI1.csv")
jobs2 = pd.read_csv("jobsAI2.csv")
jobs3 = pd.read_csv("jobsAWS1.csv")
jobs4 = pd.read_csv("jobsAWS2.csv")
jobs5 = pd.read_csv("jobsAzure1.csv")
jobs6 = pd.read_csv("jobsAzure2.csv")
jobs7 = pd.read_csv("jobsDataEng1.csv")
jobs8 = pd.read_csv("jobsDataEng2.csv")
jobs9 = pd.read_csv("jobsDeepLearn1.csv")
jobs10 = pd.read_csv("jobsDeepLearn2.csv")
jobs11 = pd.read_csv("jobsDS1.csv")
jobs12 = pd.read_csv("jobsDS2.csv")
jobs13 = pd.read_csv("jobsDSEntryLevel.csv")
jobs14 = pd.read_csv("jobsDSIntern.csv")
jobs15 = pd.read_csv("jobsDSSenior1.csv")
jobs16 = pd.read_csv("jobsDSSenior2.csv")
jobs17 = pd.read_csv("jobsML1.csv")
jobs18 = pd.read_csv("jobsML2.csv")
jobs19 = pd.read_csv("jobsPowerBI1.csv")
jobs20 = pd.read_csv("jobsPowerBI2.csv")
jobs21 = pd.read_csv("jobsTableau1.csv")
jobs22 = pd.read_csv("jobsTableau2.csv")

all_jobs = pd.concat([jobs1, jobs2, jobs3, jobs4, jobs5, jobs6, jobs7, jobs8, jobs9, jobs10,
               jobs11, jobs12, jobs12a, jobs13, jobs14, jobs15, jobs16, jobs17, jobs18, jobs19, jobs20,
               jobs21, jobs22], axis=0)
print(all_jobs.shape)

(1296, 5)


In [11]:
# Save the dataframe to a csv file 'indeed_jobs.csv'
all_jobs.to_csv("indeed_jobs.csv", sep=",", index=False)

### 3. Check the final dataframe with all scraped data

In [12]:
# Read the csv file with all scraped data
jobs = pd.read_csv("indeed_jobs.csv")

# Print the shape and the first rows
print("Shape of dataframe with all jobs: ", jobs.shape)
print()
jobs.head()

Shape of dataframe with all jobs:  (1296, 5)



Unnamed: 0,Title,Company,Location,Salary,Summary
0,"AI/ML - Chief of Staff, Machine Intelligence",Apple,"Seattle, WA",,The Machine Intelligence team is accelerating ...
1,"Senior Data Scientist, Rich Media Experiences",Zillow,Washington State•Remote,"$127,100 - $203,000 a year",Using Computer Vision techniques and AI-powere...
2,Machine Learning Engineer,Zillow,Washington State•Remote,"$132,400 - $211,600 a year",These algorithms and platforms ingest large vo...
3,Senior Data Scientist,Zillow,Washington State•Remote,"$127,100 - $203,000 a year",This data can be mined for intelligence on com...
4,"Research Program Manager, Artificial Intelligence",Facebook,"New York, NY",,"We drive efficiency, cultivate relationships, ..."


**The final dataframe has 1296 rows and 5 columns**. Each row contains **details** about one job posting: **Title, Company, 
Location, Salary and Summary.**