<h1 style='text-align: center;'> Web scraping using Selenium </h1>

___

This Python project is a web scraping script that extracts job offers from the Indeed website. It uses the **Selenium** and BeautifulSoup libraries: Selenium automates entering the search query and navigating the result pages, and BeautifulSoup parses the page source to extract the data. The script searches for job offers matching a topic and a location.
<br><br>
**Introduction to the ***Selenium*** library** <br> The `selenium` package provides Python bindings for automating web browsers. It is used to interact with web pages and perform tasks such as filling out forms, clicking buttons, and navigating between pages. It is built on the WebDriver API, a cross-platform API for controlling web browsers, and supports several popular browsers, including Firefox, Chrome, and Internet Explorer.

# Import libraries

In [73]:
# Selenium and web scraping libraries
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# Usual libraries
import time
import pandas as pd
import re
import ipywidgets as widgets 
import numpy as np
import warnings
import glob 

# If Chrome does not open, see the explanation below:
# https://stackoverflow.com/questions/29858752/error-message-chromedriver-executable-needs-to-be-available-in-the-path
# !pip install webdriver-manager

In [74]:
# disable warning 
warnings.filterwarnings('ignore')

# Functions

In [75]:
def job_topic(job_topic_str):
    """
    This function sets the search topic for job offers.

    Args:
        job_topic_str: A string representing the search topic.

    Returns:
        None.
    """

    # Select the input box "What"
    what=driver.find_element(By.XPATH,"//input[@id='text-input-what']")

    # Select all the text in the input box
    what.send_keys(Keys.CONTROL, 'a')

    # Delete the previous writing if exists
    what.send_keys(Keys.DELETE)

    # Enter the search topic
    what.send_keys(job_topic_str)

In [76]:
def job_localisation(localisation):
    """
    This function enters the location of the search into the input box "Where" on a webpage.

    Args:
        localisation: A string representing the location of the search.

    Returns:
        None.
    """
    # Select the input box "Where"
    where=driver.find_element(By.XPATH,"//input[@id='text-input-where']")

    # Select all the text in the input box
    where.send_keys(Keys.CONTROL, 'a')

    # Delete the previous writing if exists
    where.send_keys(Keys.DELETE)

    # Enter the location of the search (use the argument, not a hard-coded value)
    where.send_keys(localisation)

In [77]:
# Get all the job divs; on average, each page has 15 job offers
def find_all_job_divs():
    # Get the page source and convert it to a BeautifulSoup object
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    # Find the job divs, identified by the class name "job_seen_beacon"
    divs_mosaic=soup.find_all('div',class_="job_seen_beacon")
    # Return the list of job divs and the soup object
    return  divs_mosaic, soup

In [78]:
def job_title(div):
    """
    This function reads the job title from a job div.

    Args:
        div: A BeautifulSoup object representing a job div.

    Returns:
        The job title as a string.
    """
    # Compile a regular expression pattern to match the 'jobTitle' class
    regex = re.compile('jobTitle *')
    
    # Find all 'h2' tags with the 'jobTitle' class and extract the text of the first tag
    jobTitle = div.find_all('h2', class_=regex)[0].text
    
    # Return the job title as a string
    return jobTitle


In [79]:
def job_attributes(div):
    '''
    This function creates a dictionary of job attributes from a job div.
    
    Args:
        div: A BeautifulSoup object representing a job div.

    Returns:
        A dictionary containing the job attributes. If an attribute is absent, 
        it will not be added to the dictionary.
    '''
    # Create a dictionary of the job attributes
    dic={}
    # Remark
    # This function uses try/except so that an absent attribute does not raise
    # an error and end the main loop
    
    # Read the job title
    try:
        regex = re.compile('jobTitle *')
        jobTitle = div.find_all('h2', class_=regex)[0].text
        dic['jobTitle']=jobTitle
    except:pass
    # Read the job company name
    try:
        companyName=div.find_all('span', class_="companyName")[0].text
        dic['companyName']=companyName
    except:pass
    # Read the job location
    try:
        location=div.find_all('div', class_="jobsearch-PreciseLocation-container")[0].text
        dic['location']=location
    except:
        try: 
            location=div.find_all('div', class_="companyLocation")[0].text
            dic['location']=location
        except:pass
    # Read the job salary
    try:
        regex = re.compile('metadata salary*')
        salary=div.find_all('div', class_=regex)[0].text
        dic['salary']=salary
    except:pass
    # Read other metadata
    try:  
        other_metadata=div.find_all('div', class_='metadata').text
        dic['other_metadata']=other_metadata
    except:pass
    # Read the job date
    try:
        date=div.find_all('span',class_="date")[0].text
        dic['date']=date
    except:pass
    # Return the dictionary of job attributes
    return dic

In [80]:
# Save the data as csv file
def save_data_as_csv(data,page,run_id):
    # Print the saving task
    print('save CSV')
    # Convert the list of dictionaries to Pandas DataFrame
    df=pd.DataFrame(data)
    # Save the dataFrame as csv file
    file_name='data\\raw\\data'+str(page)+'_run_id_'+run_id+'.csv'
    df.to_csv(file_name)
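
`df.to_csv` raises an error when the target directory does not exist, and the hard-coded backslashes make the path Windows-specific. A minimal sketch of a path helper follows, assuming a hypothetical `csv_path` function and `base_dir` parameter that are not part of the original notebook:

```python
from pathlib import Path


def csv_path(page, run_id, base_dir="data/raw"):
    """Return the CSV path for one batch, creating the directory if needed.

    Hypothetical helper: the name and the base_dir parameter are assumptions.
    """
    folder = Path(base_dir)
    # Create the folder (and any missing parents); do nothing if it already exists
    folder.mkdir(parents=True, exist_ok=True)
    # Same naming scheme as save_data_as_csv: data<page>_run_id_<run_id>.csv
    return folder / f"data{page}_run_id_{run_id}.csv"
```

The returned path can be passed directly to `df.to_csv`, which accepts `pathlib.Path` objects.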

In [81]:
def check_if_the_final_page(soup):
    # The final results page is the one without a "Next Page" link
    return len(soup.find_all('a',{'aria-label': ["Next Page"]})) ==0

In [82]:
# Go to the next page
# Note: this function reads the global variables job_topic_str and where_to_find_job
def next_page(page):
    # Build the next page URL; the 'start' offset is (page+1)*10
    next_url='https://fr.indeed.com/jobs?q='+job_topic_str.replace(' ','%20')+'&l='+\
            where_to_find_job.replace(' ','%20')+'&start='+str(page+1)+'0'
    # Go to the next page
    driver.get(next_url)
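
Replacing spaces with `%20` by hand covers only one special character; a topic containing `+` or `&` would produce a malformed query. A minimal sketch of the same URL built with the standard library is given here, assuming a hypothetical `build_search_url` helper that is not part of the original notebook:

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://fr.indeed.com/jobs"


def build_search_url(topic, location, page):
    """Build the URL of the results page that follows `page` (hypothetical helper)."""
    params = {
        "q": topic,
        "l": location,
        # Same offset as next_page: (page + 1) * 10
        "start": (page + 1) * 10,
    }
    # quote_via=quote encodes spaces as %20, matching the manual replace(' ', '%20')
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"
```

With this helper, `next_page` could call `driver.get(build_search_url(job_topic_str, where_to_find_job, page))` instead of concatenating strings.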

# Initialization

In [83]:
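# Open the Chrome browser
# Note: passing the driver path positionally targets older Selenium releases; recent
# Selenium 4 releases expect service=Service(path) from selenium.webdriver.chrome.service
driver = webdriver.Chrome(ChromeDriverManager().install())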
# Enter Indeed website
driver.get("https://fr.indeed.com/")

<img src='images\\img1.PNG'  width='400'>

In [84]:
# Set the search topic to "data scientist"
job_topic_str="data scientist"

# Call the job_topic function to enter the search topic into the input box "What"
job_topic(job_topic_str)

In [85]:
# Set the location of the search to "France"
where_to_find_job="France"

# Call the job_localisation function to enter the location of the search into the input box "Where"
job_localisation(where_to_find_job)

In [86]:
# Select and click the search button
driver.find_element(By.XPATH,"//button[text()='Rechercher']").click()

<img src='images\\img2.PNG'  width='400'>

In [87]:
# Global variables initialisation
data=[]
# The maximum number of job to search
Nb_max_jobs=2000
# The maximum jobs/rows by csv file
Nb_job_by_csv=100
# The page number
page=0
# The job number 
jobs=0
# This number acts as a run ID
# It is a random integer that avoids overwriting the CSV files if the notebook is run twice
# This relies on the random number being different for both runs
# Alternative solution: add the date and time to the CSV file name
run_id=str(np.random.randint(0,10000))
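
The alternative mentioned in the comments, a run ID based on the date and time, could look like the sketch below. The `make_run_id` name and its `now` parameter are hypothetical and not part of the original notebook:

```python
from datetime import datetime


def make_run_id(now=None):
    """Return a run ID such as '20230102_030405' (hypothetical helper)."""
    # Use the current local time unless a datetime is supplied (handy for testing)
    now = now or datetime.now()
    # Year, month, day, then hour, minute, second: the IDs sort chronologically
    return now.strftime("%Y%m%d_%H%M%S")
```

Two runs started at least one second apart get different IDs, whereas two random integers below 10000 can collide.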

# The main loop

The main loop reads the job divs of each results page and extracts their attributes. It runs until the last page has been read or the maximum number of jobs (`Nb_max_jobs`) is reached. <br>On each iteration, the loop first calls `find_all_job_divs` to get the job divs and the soup object of the current page. A `for` loop then passes each job div to `job_attributes`, adds the page number and job number to the returned dictionary, and appends it to the list `data`. Next, the loop checks whether `data` holds at least `Nb_job_by_csv` rows; if so, the rows are saved as a CSV file and the list is reset to empty. Finally, the loop checks whether the final page has been read or the maximum number of jobs has been reached. If so, it prints `END of the main loop` and stops; otherwise it loads the next page and increments the page counter.


In [88]:
# The main loop: continue until the last page is read or the maximum number of jobs is reached
while True:
    # Get all the job divs and the soup object; on average, each page has about 15 job offers
    divs_mosaic, soup=find_all_job_divs()
    # Read each div
    for div in divs_mosaic:
        # Get the job attributes
        dic=job_attributes(div)
        # Add the page number to the dictionary
        dic['page']=page
        # Add the job number to the dictionary
        dic['job']=jobs
        # Append the job dictionary to the list "data"
        data.append(dic)
        # Increment the jobs counter
        jobs+=1
    # Check if the number of collected jobs has reached the maximum number of jobs per CSV file
    if len(data) >= Nb_job_by_csv:
        # Save the data as a CSV file
        save_data_as_csv(data,page,run_id)
        # Reset the list of jobs
        data=[]
    # Check if the final page has been read or the maximum number of jobs is reached
    end_pages=check_if_the_final_page(soup)
    if jobs >= Nb_max_jobs or end_pages:
        # If so, end the main loop and print END
        print('END of the main loop')
        break
    # Go to the next page
    next_page(page)
    # Increment the page counter
    page+=1

save CSV
save CSV
save CSV
save CSV
save CSV
END of the main loop
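
In the loop above, `data` is written only when it reaches `Nb_job_by_csv` rows, so rows collected after the last save appear not to be written when the loop ends. The sketch below isolates that batching logic and adds a final flush. The `collect_in_batches` function is a hypothetical stand-in for the main loop, with plain lists in place of scraped pages:

```python
def collect_in_batches(pages, batch_size):
    """Group rows from successive pages into batches of at least batch_size rows.

    Hypothetical stand-in for the main loop: `pages` is any iterable of lists
    of rows, and each returned batch corresponds to one CSV file.
    """
    batches = []
    buffer = []
    for rows in pages:
        buffer.extend(rows)
        # Same rule as the main loop: save once the buffer reaches the threshold
        if len(buffer) >= batch_size:
            batches.append(buffer)
            buffer = []
    # Final flush: keep the rows collected after the last full batch
    if buffer:
        batches.append(buffer)
    return batches


# Four pages of 15 rows each, with a threshold of 40 rows per batch
pages = [list(range(i * 15, (i + 1) * 15)) for i in range(4)]
batches = collect_in_batches(pages, batch_size=40)
print([len(b) for b in batches])  # prints [45, 15]
```

In the notebook, the equivalent change would be one more call to `save_data_as_csv(data,page,run_id)` after the loop whenever `data` is not empty.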


# Group the csv files in one file

In [89]:
# Find all CSV files in the "data/raw" directory
files = glob.glob("data\\raw\\*.csv")

# Print the first three files
display(files[:3])

# Read all CSV files into a list of DataFrames
dfs = [pd.read_csv(f) for f in files]

# Concatenate all DataFrames into a single DataFrame
df = pd.concat(dfs, axis=0)

['data\\raw\\data6_run_id_4725.csv',
 'data\\raw\\data13_run_id_4725.csv',
 'data\\raw\\data20_run_id_4725.csv']

In [94]:
# display the first 2 rows of the grouped dataFrame
df.head(2)

Unnamed: 0,jobTitle,companyName,location,date,page,job,salary
0,Data Scientist H/F,Aon Corporation,Paris (75)+1 location,PostedOffre publiée il y a plus de 30 jours,0,0,
1,Alternance - Data Scientist F/H,BPCE SA,Paris (75),PostedOffre publiée il y a plus de 30 jours,0,1,


In [None]:
# Remove the "Unnamed: 0" column from the DataFrame
df = df.drop('Unnamed: 0', axis=1)

# Write the DataFrame to a CSV file without the index column
df.to_csv('data\\processed\\all_scraped_files.csv', index=False)