# Scraping the APEC website (www.apec.fr) with Selenium

### Web crawling vs. web scraping
- __Web crawling__, also known as indexing, uses bots called crawlers to index the information on a page. This is essentially what search engines do: a crawler views each page as a whole, follows every link it finds, and works through the entire website looking for any information. Crawlers are mainly used by major search engines such as Google, Bing, and Yahoo, as well as by statistical agencies and large online aggregators.

- __Web scraping__, also known as web data extraction, resembles web crawling in that it identifies and locates target data on web pages. __The key difference is that with web scraping we know the exact identifier of the data set__, e.g. a fixed HTML element structure on the pages from which data needs to be extracted. Web scraping is an automated way of extracting specific data sets using bots known as 'scrapers'. Once collected, the information can be used for comparison, verification, and analysis according to a business's needs and goals.
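
The distinction can be sketched with the standard library alone. The HTML fragment below is made up for illustration and is not taken from www.apec.fr; `LinkCollector` and `TitleScraper` are hypothetical names. The first parser behaves like a crawler (it collects every link it sees), the second like a scraper (it keeps only elements with a known tag and class).

```python
from html.parser import HTMLParser

# Hypothetical page fragment, written for this illustration only
SAMPLE_HTML = """
<html><body>
  <h2 class="card-title fs-16">Data Scientist F/H</h2>
  <a href="/page-1.html">Page 1</a>
  <h2 class="section-title">Other section</h2>
  <a href="/page-2.html">Page 2</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawler-style pass: collect every link so it could be visited later."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TitleScraper(HTMLParser):
    """Scraper-style pass: keep only the text of <h2> elements with a known class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.titles = []
        self._inside_target = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and self.target_class in classes:
            self._inside_target = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside_target = False

    def handle_data(self, data):
        if self._inside_target and data.strip():
            self.titles.append(data.strip())


crawler = LinkCollector()
crawler.feed(SAMPLE_HTML)
scraper = TitleScraper("card-title")
scraper.feed(SAMPLE_HTML)

print(crawler.links)   # every link, whatever it points to
print(scraper.titles)  # only the targeted headings
```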

In [1]:
# Importing required modules:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import time, sleep
import pandas as pd

In [2]:
# Create the Service object pointing to the chromedriver executable
link_path = Service('/Users/elhadji/Desktop/Python_Labs/chromedriver')
# Create the driver
driver = webdriver.Chrome(service=link_path)
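
The notebook keeps this browser session open for all later cells. As a possible alternative for script-style use, here is a minimal sketch of a context manager that always calls `quit()` on the driver, even when scraping raises an error. `managed_driver` is a hypothetical helper, not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver_factory):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = driver_factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions, so the browser is not left open
        driver.quit()
```

With the objects defined above, it could be used as `with managed_driver(lambda: webdriver.Chrome(service=link_path)) as driver: ...`.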

## Create a function to generate the list of page URLs

In [3]:
# Define a function that builds the list of page URLs
def build_ful_apecurl(keyword_list, pages_numb):
    # Join the keywords with %20 (the URL-encoded space)
    str_keyword = '%20'.join(keyword_list)
    # Base URL of the APEC job search page
    root_url = "https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles="
    # Page query parameter
    url_page = '&page='
    # Empty list to store the full URL of each page
    apec_url_list = []
    # Build each full URL by concatenation and append it to the list
    for n in pages_numb:
        apec_url_list.append(root_url+str_keyword+url_page+str(n))
    return apec_url_list
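
Joining with `%20` by hand works for plain ASCII keywords. For keywords containing accents or characters such as `&`, a variant based on `urllib.parse.quote` may be safer. This is only a sketch: `build_encoded_apec_urls` is a hypothetical name, and it assumes the site accepts standard percent-encoding in `motsCles`.

```python
from urllib.parse import quote

ROOT_URL = "https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles="


def build_encoded_apec_urls(keywords, page_numbers):
    """Hypothetical variant of build_ful_apecurl that percent-encodes the keywords."""
    # quote() turns spaces into %20 and encodes non-ASCII or reserved characters
    query = quote(" ".join(keywords))
    return [f"{ROOT_URL}{query}&page={n}" for n in page_numbers]


urls = build_encoded_apec_urls(["data", "scientist"], range(2))
print(urls)
```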

In [4]:
# Search keywords: "data scientist" as a list of two items
keyword_list = ['data', 'scientist']
# List of page numbers
pages_numb = list(range(5))

In [5]:
# Build the list of URLs for the five pages
my_apec_links = build_ful_apecurl(keyword_list, pages_numb)
# Display the result
my_apec_links

['https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles=data%20scientist&page=0',
 'https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles=data%20scientist&page=1',
 'https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles=data%20scientist&page=2',
 'https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles=data%20scientist&page=3',
 'https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles=data%20scientist&page=4']

## Create functions to extract data by tag name, class name, CSS selector, and XPath

In [6]:
# Extract job titles with driver.find_elements(By.TAG_NAME, ...)
def job_title_extractor_tagname(url_list, tag):
    """
    Take a list of URLs and a tag name, and return the text of every matching element on those pages
    """
    all_jobs = []
    # Loop through the URLs and extract the text of each element with the given tag
    for url in url_list:
        driver.get(url)
        all_h2 = driver.find_elements(By.TAG_NAME, tag)
        all_jobs += [element.text for element in all_h2]
    return all_jobs
# Problem: with 'h2', this scrapes every h2 element on the page, job titles and other headings alike

In [7]:
# Job title tag: 'h2'
tag = "h2"
# Get all job titles
job_titles = job_title_extractor_tagname(my_apec_links, tag)
# Check the length of the list
len(job_titles)
# Includes all h2 elements (job titles + other section titles)

115

In [9]:
# Extract elements with driver.find_elements(By.CSS_SELECTOR, ...) using a class selector
def job_title_extractor_class(url_list, class_name):
    """
    Take a list of URLs and a CSS class selector, and return the text of every matching element on those pages
    """
    all_jobs = []
    for url in url_list:
        driver.get(url)
        all_class = driver.find_elements(By.CSS_SELECTOR, class_name)
        all_jobs += [element.text for element in all_class]
    return all_jobs

In [10]:
# CSS class selector for job titles, used with By.CSS_SELECTOR
class_name = ".card-title.fs-16"
job_titles = job_title_extractor_class(my_apec_links, class_name) # more specific than By.TAG_NAME
# Check the length of the list
len(job_titles)

100

In [11]:
# print the fourth element
print(job_titles[3])

DATA SCIENTIST F/H


In [12]:
# This function finds elements by XPath
def job_company_extractor(url_list, class_xpath):
    """
    Take a list of URLs and an XPath, and return the text of every matching element (e.g. company names) on those pages
    """
    all_company = []
    for url in url_list:
        driver.get(url)
        all_comp = driver.find_elements(By.XPATH, class_xpath)
        all_company += [element.text for element in all_comp]
    #driver.quit()
    return all_company

In [13]:
# Define the XPath of the company name
class_xpath = "//*[@class='card-offer__company mb-10']" # company name
# Get all the company names
job_company = job_company_extractor(my_apec_links, class_xpath)
# Check the length of the list
len(job_company)

100
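
Note that an XPath test such as `@class='card-offer__company mb-10'` compares the whole attribute string, so it would stop matching if the site added or reordered a class. A common XPath 1.0 idiom matches a single class token instead. The helper below, `xpath_for_class`, is a hypothetical name and only builds the expression string.

```python
def xpath_for_class(class_token, tag="*"):
    """Build an XPath matching elements whose class list contains class_token."""
    # Padding with spaces makes the test match whole class names only,
    # so 'card' does not match 'card-offer__company'
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {class_token} ')]"
    )


company_xpath = xpath_for_class("card-offer__company")
print(company_xpath)
```

The resulting string can be passed to `job_company_extractor` in place of `class_xpath`.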

In [14]:
# Print the fourth element
print(job_company[3])

STUDIEL PARTICIPATIONS


In [15]:
# Define the XPath of the salary
xpath = '//ul[@class="details-offer"]' # salary per year
# Get all the salaries (job_company_extractor works with any XPath)
job_salary = job_company_extractor(my_apec_links, xpath)
# Check the length of the list
len(job_salary)

100

In [16]:
# print the fourth element
print(job_salary[3])

35 - 43 k€ brut annuel
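
The salary comes back as free text. If numeric bounds are needed later, a small parser along these lines could be used. It is a sketch that only handles the three formats visible in this notebook's output ("35 - 43 k€ brut annuel", "A partir de 40 k€ brut annuel", "A négocier"); `parse_salary_range` is a hypothetical name, and other wordings on the site would need their own handling.

```python
import re


def parse_salary_range(text):
    """Return (min_k, max_k) in thousands of euros, with None where not stated."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) >= 2:
        return numbers[0], numbers[1]
    if len(numbers) == 1:
        # Assumes a single number is a lower bound ("A partir de ...")
        return numbers[0], None
    return None, None


print(parse_salary_range("35 - 43 k€ brut annuel"))
print(parse_salary_range("A partir de 40 k€ brut annuel"))
print(parse_salary_range("A négocier"))
```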


In [20]:
# CSS class selector for the job description
class_name = ".card-offer__description.mb-15" 
# Get the job descriptions
job_descrition = job_title_extractor_class(my_apec_links, class_name) # using CSS_SELECTOR
# Check the length of the list
len(job_descrition)

80

In [21]:
# Print the fourth element
print(job_descrition[3])

Dans le cadre d'une embauche, vous interviendrez en tant que data scientist. Vous analyserez de gros volumes de données de type série temporelle, développerez et déploierez des modèles prédictifs, et assurerez la communication avec le client. Vous contribuer également aux...


In [22]:
# CSS class selector for contract type, job location and publication date
class_name = ".details-offer.important-list"
# Get these details
type_loc_date = job_title_extractor_class(my_apec_links, class_name) # using CSS_SELECTOR
# Check the length of the list
len(type_loc_date)

100

In [23]:
# print the fourth element
print(type_loc_date[3])

CDI
Blagnac - 31
16/11/2021


In [24]:
# Display the raw string of the fourth element to see the newline separators
type_loc_date[3]

'CDI\nBlagnac - 31\n16/11/2021'

### Create a single function that extracts elements from multiple classes by XPath

In [87]:
# This function finds elements by XPath only
def jobs_infos_extractor(url_list, xpath_class_list):
    """
    Take a list of URLs and a list of five XPaths, and return all job infos (title, company, salary, description, details) from each page
    """
    # Initialize the lists of job characteristics
    all_jobs = [] # job titles
    all_company = [] # job companies
    all_salary = [] # job salaries
    all_description = [] # job descriptions
    all_details = [] # job details (type, location, date)

    # Loop through the list of URLs
    for url in url_list:
        driver.get(url)
        # Get titles
        all_class = driver.find_elements(By.XPATH, xpath_class_list[0])
        all_jobs += [element.text for element in all_class]
        # Get companies
        all_comp = driver.find_elements(By.XPATH, xpath_class_list[1])
        all_company += [element.text for element in all_comp]
        # Get salaries
        all_sal = driver.find_elements(By.XPATH, xpath_class_list[2])
        all_salary += [element.text for element in all_sal]
        # Get descriptions
        all_descrip = driver.find_elements(By.XPATH, xpath_class_list[3])
        all_description += [element.text for element in all_descrip]
        # Get job details
        all_det = driver.find_elements(By.XPATH, xpath_class_list[4])
        all_details += [element.text for element in all_det]
    # Return the scraped data
    return (all_jobs, all_company, all_salary, all_description, all_details)

In [88]:
# Define a function that splits a job-details string into a list of its lines
def string_to_list(text):
    return text.split("\n", 3)

In [89]:
# Test the string_to_list function
txt = 'CDI\nBlagnac - 31\n16/11/2021'
string_to_list(txt) # Works as expected

['CDI', 'Blagnac - 31', '16/11/2021']
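
Going one step further, the three lines could be turned into named, typed fields. The sketch below assumes the details always follow the order contract / location / date, that the location looks like "City - department", and that the date is in day/month/year form, as in the example above. `parse_offer_details` is a hypothetical helper.

```python
from datetime import datetime


def parse_offer_details(text):
    """Split a details string into contract, city, department and a date object."""
    lines = text.split("\n")
    # Pad so that an offer with fewer than three lines does not raise an error
    lines += [""] * (3 - len(lines))
    contract, location, date_text = lines[:3]
    # Split on the last " - " so city names containing digits stay intact
    city, separator, department = location.rpartition(" - ")
    if not separator:
        city, department = location, ""
    date = datetime.strptime(date_text, "%d/%m/%Y").date() if date_text else None
    return {"contract": contract, "city": city, "department": department, "date": date}


print(parse_offer_details("CDI\nBlagnac - 31\n16/11/2021"))
```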

In [90]:
# Convert each type_loc_date string into a list of three elements
job_det = [string_to_list(my_elem) for my_elem in type_loc_date]

In [92]:
# Extract each element into its own list: type (index 0), location (index 1), date (index 2)
type_ = [job_type[0] for job_type in job_det] # contract type
loc = [job_type[1] for job_type in job_det] # location
dat = [job_type[2] for job_type in job_det] # date

## Build a function that creates a CSV file from the data scraped from www.apec.fr

In [100]:
def build_apec_csv(keyword_list,xpath_class_list,pages_numb):
    """
    Scrape the APEC website for the given keywords (here, data scientist positions), collect the data
    in a dictionary, and store it in a CSV file.
    """
    # Build the list of page URLs
    my_apec_links = build_ful_apecurl(keyword_list, pages_numb)
    # Extract the five job items
    job_titles, job_company, job_salary, job_descrition, job_details = jobs_infos_extractor(my_apec_links, xpath_class_list)
    # Split each job_details string into contract type, location and date
    type_loc_date = [string_to_list(my_elem) for my_elem in job_details]
    # Contract types
    contract = [job_type[0] for job_type in type_loc_date]
    # Locations
    location = [job_type[1] for job_type in type_loc_date]
    # Publication dates
    date = [job_type[2] for job_type in type_loc_date]

    # Create a dictionary to structure the scraped data
    my_dict = {"Job title":job_titles, "Company":job_company, "Salary":job_salary, "Description":job_descrition,
               "Contract":contract, "Location":location, "date":date}

    # Create the DataFrame from the dictionary
    data = pd.DataFrame(my_dict)

    # Store the scraped data in a CSV file
    data.to_csv('/Users/elhadji/Desktop/Python_Labs/aspec_ds_jobs_offers.csv', sep=";", encoding='utf-8', index=False)
    return None
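
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, and an earlier cell returned 80 descriptions against 100 titles. A quick check before building the DataFrame can make such a mismatch easier to diagnose. `check_equal_lengths` is a hypothetical helper, shown here as a sketch.

```python
def check_equal_lengths(columns):
    """Raise ValueError listing each column's length if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns have different lengths: {lengths}")
    return lengths


# Passes silently when every list has the same number of items
print(check_equal_lengths({"Job title": ["a", "b"], "Company": ["x", "y"]}))
```

Inside `build_apec_csv`, it could be called as `check_equal_lengths(my_dict)` just before `pd.DataFrame(my_dict)`.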

In [94]:
# Initialize the keyword list
keyword_list = ['data', 'scientist']
# List of page numbers
pages_numb = list(range(21))
# List of XPaths: title, company, salary, description, details
xpath_class_list = ['//h2[@class="card-title fs-16"]',
                    "//*[@class='card-offer__company mb-10']",
                    '//ul[@class="details-offer"]',
                    '//p[@class="card-offer__description mb-15"]',
                    '//ul[@class="details-offer important-list"]']

In [95]:
# Write the APEC offers to a CSV file
build_apec_csv(keyword_list,xpath_class_list,pages_numb)

## Importing the scraped data

In [96]:
# Open the saved data with pandas
df = pd.read_csv('/Users/elhadji/Desktop/Python_Labs/aspec_ds_jobs_offers.csv', sep=";")
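
The `date` column is read back as text. If sorting or filtering by date is needed, it could be converted with an explicit day/month/year format. This is a sketch on a small in-memory sample (two rows copied from this notebook's output, reduced to three columns) rather than on the real file; `sample_df` and `newest_first` are illustrative names.

```python
import io

import pandas as pd

# Two rows copied from the notebook's output, reduced to three columns
sample_csv = io.StringIO(
    "Job title;Location;date\n"
    "Data Scientist F/H;Lille - 59;16/11/2021\n"
    "Data Scientist F/H;Toulon - 83;10/11/2021\n"
)
sample_df = pd.read_csv(sample_csv, sep=";")

# An explicit format avoids reading 10/11/2021 as October 11
sample_df["date"] = pd.to_datetime(sample_df["date"], format="%d/%m/%Y")
newest_first = sample_df.sort_values("date", ascending=False)
print(newest_first)
```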

In [99]:
# Print the first 20 rows
df.head(20)

Unnamed: 0,Job title,Company,Salary,Description,Contract,Location,date
0,Data Scientist F/H,FULL DATA MANAGEMENT,35 - 45 k€ brut annuel,"Dans le cadre de notre développement, nous rec...",CDI,Lille - 59,16/11/2021
1,Data Scientist F/H,ROBERT WALTERS FRANCE,45 - 50 k€ brut annuel,"Notre client, groupe spécialisé dans l'assuran...",CDI,Toulon - 83,10/11/2021
2,Data Scientist F/H,TALENTS RH,45 - 50 k€ brut annuel,"TALENTS RH, société de recrutement spécialisée...",CDI,Lille - 59,18/11/2021
3,DATA SCIENTIST F/H,STUDIEL PARTICIPATIONS,35 - 43 k€ brut annuel,"Dans le cadre d'une embauche, vous interviendr...",CDI,Blagnac - 31,16/11/2021
4,Data Scientist F/H,REGIONSJOB,A négocier,"""Recrutez au delà des compétences."" PERSUADERS...",CDI,Lyon 01 - 69,26/11/2021
5,DATA SCIENTIST F/H,PREM CANAGARADJA,A négocier,"D’une manière générale, vous serez en charge d...",CDI,Toulouse - 31,19/11/2021
6,Data Scientist F/H,YSANCE,A négocier,"En tant que Data Scientist, vous contribuerez ...",CDI,Levallois-Perret - 92,21/11/2021
7,Data Scientist F/H,METEOJOB,A négocier,A la recherche de nouvelles affinités professi...,CDI,Écully - 69,10/11/2021
8,Data Scientist F/H,METEOJOB,A négocier,Rattaché(e) à la Direction Générale Exécutive ...,CDI,Clichy - 92,23/11/2021
9,Data Scientist F/H,SAS PROXIEL,A partir de 40 k€ brut annuel,Nous recherchons un Data Scientist (F/H) pour ...,CDI,Nice - 06,09/11/2021
