# 1. Automatically scrape job postings of a competitor

We will now see scraping in action. Imagine you work in HR for a major retailer, and your boss asks you to monitor the strategic hiring decisions of your close competitors. Naturally, you cannot just call them up, but you can look at their job postings to see ($i$) how much they are hiring and ($ii$) what types of positions they are hiring for.

Now, you could log onto their website every day, see what job postings there are, compare that with the job postings from before, and save the relevant data. But why go through so much effort if we can just automate the task?

## 1.1 A simple case - using BeautifulSoup

The first example relies purely on what we have learned about BeautifulSoup and Requests (and a bit of Pandas!).

In [1]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import re


We are searching for positions in the head office of Aldi. On the website, we see that there are different types of head office positions, each with its own sub-site. Let's get the links to those sub-sites.

In [3]:
url = "https://www.aldirecruitment.co.uk/head-office"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1,IE=8" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport"/>
<meta content="yes" name="mobile-web-app-capable"/>
<style>html{visibility: hidden;opacity:0;}</style>
<link as="style" href="/dist/css/screen.css?v=5" rel="preload"/>
<link as="script" href="/dist/js/commons.js?v=5" rel="preload"/>
<link as="script" href="/dist/js/main.js?v=5" rel="preload"/>
<link href="https://cc.cdn.civiccomputing.com" rel="preconnect">
<link href="https://apikeys.civiccomputing.com" rel="preconnect">
<link href="https://www.google-analytics.com" rel="preconnect">
<link href="https://www.googletagmanager.com" rel="preconnect"/>
<link href="/dist/css/screen.css?v=5" rel="stylesheet" type="text/css"/>
<title>Head Office | Aldi Recruitment UK</title>
<meta content="en_GB" property="og:locale">
<meta content="website" property="og:typ

In [4]:
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
print(links)

[None, '/', None, '/', '/about-us/', '/diversity-inclusion/', '/stores/', '/stores/', '/stores/store-manager/', '/stores/assistant-store-manager/', '/stores/deputy-store-manager/', '/stores/store-assistant/', '/stores/stock-assistant/', '/stores/caretaker-cleaner/', '/warehouse-transport/', '/head-office/', '/area-manager-programme/', '/area-manager-programme/', '/area-manager-programme/graduate-area-manager-programme/', '/area-manager-programme/career-changers/', '/placement-programmes/', '/apprenticeships/', '/contact-us/#contact-us', '/about-us/#equality-and-diversity', '/about-us/#right-to-work', 'https://www.aldi.co.uk/', '/privacy-notice-cookies/', 'https://www.facebook.com/AldiCareersUK', 'https://www.youtube.com/user/AldiUK', 'https://twitter.com/AldiCareersUK', 'https://www.linkedin.com/company/aldi-uk', 'https://www.instagram.com/aldiCareersuk', '/job-search/', '/', '/job-search/', '/job-search/', '/job-search/', '/head-office/national-administration/', '/head-office/regional

We only want the links that point to the head office sub-sites (where the actual job postings are listed), so we have to clean the results somewhat:

In [2]:
url = "https://www.aldirecruitment.co.uk/head-office"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
links = []
for link in soup.findAll('a'):
    new_link = link.get('href')
    if new_link is not None and new_link.startswith('/head-office/'):
        links.append(new_link)
print(links)

['/head-office/', '/head-office/national-administration/', '/head-office/regional-administration/', '/head-office/national-buying/', '/head-office/corporate-responsibility/', '/head-office/customer-service/', '/head-office/ecommerce/', '/head-office/global-business-coordination/', '/head-office/hr/', '/head-office/it/', '/head-office/marketing-communications/', '/head-office/merchandising-category/', '/head-office/national-property/', '/head-office/supply-chain-management/', '/head-office/quality-assurance/', '/head-office/quality-control/']


Since our base URL already ends in "/head-office", we do not need to repeat that part. Accordingly, we strip the prefix from each link and drop the link that points back to the overview page itself.

In [3]:
url = "https://www.aldirecruitment.co.uk/head-office"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
links = []
for link in soup.findAll('a'):
    new_link = link.get('href')
    if new_link is not None and new_link.startswith('/head-office/'):
        new_link = new_link.replace('/head-office','')
        if new_link != '/':
            links.append(new_link)
print(links)

['/national-administration/', '/regional-administration/', '/national-buying/', '/corporate-responsibility/', '/customer-service/', '/ecommerce/', '/global-business-coordination/', '/hr/', '/it/', '/marketing-communications/', '/merchandising-category/', '/national-property/', '/supply-chain-management/', '/quality-assurance/', '/quality-control/']
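
The filtering and URL-building steps can also be written as one small, reusable function. The sketch below is only one possible variant, not the code used in the rest of this notebook: the helper name `category_urls` is hypothetical, and it relies on `urllib.parse.urljoin` from the standard library to build the full sub-site URLs directly from the original hrefs. It also drops duplicate links while keeping their order.

```python
from urllib.parse import urljoin

def category_urls(hrefs, base="https://www.aldirecruitment.co.uk/head-office/",
                  prefix="/head-office/"):
    """Return absolute URLs of the sub-sites below the overview page (hypothetical helper)."""
    urls = []
    for href in hrefs:
        # Skip anchors without an href, links outside the section,
        # and the link to the overview page itself
        if href is None or not href.startswith(prefix) or href == prefix:
            continue
        urls.append(urljoin(base, href))
    # dict.fromkeys removes duplicates but preserves the original order
    return list(dict.fromkeys(urls))

sample = [None, '/', '/stores/', '/head-office/',
          '/head-office/hr/', '/head-office/it/', '/head-office/hr/']
print(category_urls(sample))
```

Working with absolute URLs means that each sub-site can be requested directly, without concatenating strings by hand.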


Let's see how many postings there are on one of the sub-sites. For this, we have to find the right tags, using their class attribute. Again, inspecting the site is very important!

In [4]:
category_url = url + links[2]
page = requests.get(category_url)
soup = BeautifulSoup(page.content, "html.parser")
postings = soup.findAll("div", class_="c-career--dropdown")
len(postings)


11

We now extract some information from the actual position: the job title.

In [5]:
title = postings[9].find("div", class_="c-career--dropdown__content").find('h2')
print(title.text)

National Project Manager


Aside from the title and the text description (which we will ignore in this example, but which can hold extremely useful information), each posting lists some key details about the job, such as the salary, the holiday allowance, and the working hours.

In [6]:
details = postings[9].findAll("div", class_="c-job-details__content")
print(details)

[<div class="c-job-details__content">
<img data-replace="false" src="/media/xjpjb1m0/market.svg" width="56">
<div>
<span class="c-job-details__title">Salary</span>
<div class="c-job-details__text"><p>£54,285 rising in increments to £79,040 in year 4.</p></div>
</div>
</img></div>, <div class="c-job-details__content">
<img data-replace="false" src="/media/ygxa0wif/island.svg" width="56"/>
<div>
<span class="c-job-details__title">Holiday allownance</span>
<div class="c-job-details__text"><p>5 weeks’ annual leave plus Bank Holidays.</p></div>
</div>
</div>, <div class="c-job-details__content">
<img data-replace="false" src="/media/e4pnj4sn/health.svg" width="56"/>
<div>
<span class="c-job-details__title">Hours and benefits</span>
<div class="c-job-details__text"><p>You’ll work a 50-hour week, working from Monday – Friday. Plus, you’ll receive a company pension when you start, private employee medical insurance after 6 months, company maternity and adoption leave as well as company paterni

We definitely want to get the salary information. Sometimes the text gives multiple values, so let's make sure to save the lowest and the highest value. (Multiple values may reflect changes over time or different starting requirements; we can adapt our scraper to capture this in more detail later on.) Before we turn to the salary, let's look at the raw text of the other details, starting with the hours and benefits.

In [7]:
detail = details[2]
detail_text = detail.find('div', class_="c-job-details__text").text
print(detail_text)

You’ll work a 50-hour week, working from Monday – Friday. Plus, you’ll receive a company pension when you start, private employee medical insurance after 6 months, company maternity and adoption leave as well as company paternity pay after 2 years.


In [9]:
detail = details[1]
detail_text = detail.find('div', class_="c-job-details__text").text
allowance = detail_text.split()
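# Both keywords must be present in the list of tokens
if 'annual' in allowance and 'leave' in allowance:
    # The allowance is the leading number, e.g. '5' in "5 weeks’ annual leave"
    holiday = int(allowance[0])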
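
A note on checking for several keywords at once: in Python, an expression such as `'annual' and 'leave' in allowance` only tests the last word, because a non-empty string is always truthy. The short sketch below shows the difference and one way to test every word. The helper name `contains_all` is made up for illustration.

```python
def contains_all(tokens, words):
    """True only if every word occurs in the token list (illustrative helper)."""
    return all(word in tokens for word in words)

tokens = "5 weeks’ annual leave plus Bank Holidays.".split()

# Only 'leave' is actually tested here; 'monthly' is skipped because it is truthy
only_last_checked = bool('monthly' and 'leave' in tokens)

print(only_last_checked)                            # True, although 'monthly' is absent
print(contains_all(tokens, ['monthly', 'leave']))   # False
print(contains_all(tokens, ['annual', 'leave']))    # True
```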

In [10]:
detail = details[2]
detail_text = detail.find('div', class_="c-job-details__text").text
if 'when you start' in detail_text:
    Pension_from_start = 'Yes'
else:
    Pension_from_start = 'No'
   
Pension_from_start

'Yes'
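
Postings do not always use the same capitalisation or wording, so the phrase check can be made a little more tolerant. The following is a minimal sketch under stated assumptions: the function name `pension_from_start` is hypothetical, and the phrase list is a guess at typical wordings that would need to be checked against the actual postings.

```python
def pension_from_start(text, phrases=("when you start", "when you join")):
    """Return 'Yes' if any phrase occurs in the text, ignoring case (hypothetical helper)."""
    lowered = text.lower()
    return "Yes" if any(phrase in lowered for phrase in phrases) else "No"

benefits = ("You’ll work a 50-hour week, working from Monday – Friday. "
            "Plus, you’ll receive a company pension when you start.")
print(pension_from_start(benefits))
print(pension_from_start("Company pension after 2 years."))
```

This remains a simple keyword heuristic: it does not verify that the phrase actually refers to the pension.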

In [42]:
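# Strip thousands separators and dashes, then split the text into tokens
temp = detail_text.replace(',','')
temp = temp.replace('-','')
temp = temp.split()
# Re-join a leading number that was split in two by a space
if len(temp) > 1 and temp[1].isdigit():
    temp[0] = temp[0] + temp[1]
    del temp[1]
# Keep every token that starts with a pound sign
salary_numbers = [float(s[1:]) for s in temp if s.startswith('£')]

# Discard values below 6. We build a new list rather than removing items
# while looping over the list, which would skip elements.
salary_numbers = [salary for salary in salary_numbers if salary >= 6]

ub = max(salary_numbers)
lb = min(salary_numbers)
print(ub)
print(lb)
salary_numbers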

13.78
11.77


[11.77, 12.34, 13.09, 13.78]
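
The token-based approach depends on how the text happens to be split by spaces. As an alternative, the amounts can be pulled out with a regular expression. This is only a sketch, not the method used in the rest of this notebook: the names `SALARY_PATTERN` and `extract_salaries` are hypothetical, and the pattern assumes that every amount is written with a leading pound sign.

```python
import re

# A pound sign, an optional space, then digits with optional thousands
# separators and an optional decimal part
SALARY_PATTERN = re.compile(r"£\s?(\d[\d,]*(?:\.\d+)?)")

def extract_salaries(text, minimum=6):
    """Return all pound amounts in the text that are at least `minimum` (hypothetical helper)."""
    amounts = [float(m.replace(",", "")) for m in SALARY_PATTERN.findall(text)]
    return [amount for amount in amounts if amount >= minimum]

text = "£54,285 rising in increments to £79,040 in year 4."
salaries = extract_salaries(text)
print(min(salaries), max(salaries))
```

Because the pattern looks for the pound sign itself, a range written without spaces, such as "£11.77-£13.78", is handled as well.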

Let's also try to capture the weekly working hours:

In [109]:
detail = details[2]
detail_text = detail.find('div', class_="c-job-details__text").text

tokens = detail_text.split()
# Pattern 1: a hyphenated token such as '50-hour'
for s in tokens:
    if '-hour' in s:
        work_time = int(s.replace('-hour',''))
# Pattern 2: a number followed by the word 'hours', e.g. '30 hours'
for i in range(1, len(tokens)):
    if tokens[i] == "hours" and tokens[i-1].isdigit():
        work_time = int(tokens[i-1])

print(work_time)

30
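
Both wordings ("50-hour week" and "30 hours") can also be covered by a single regular expression. The sketch below is an assumption-laden alternative rather than the approach used in this notebook: `HOURS_PATTERN` and `extract_weekly_hours` are hypothetical names, and the function simply returns the first number that is followed by "hour" or "hours".

```python
import re

# A number, optionally followed by spaces or a hyphen, then 'hour' or 'hours'
HOURS_PATTERN = re.compile(r"(\d+)[\s-]*hours?\b", re.IGNORECASE)

def extract_weekly_hours(text):
    """Return the first number followed by 'hour(s)', or None if there is none (hypothetical helper)."""
    match = HOURS_PATTERN.search(text)
    return int(match.group(1)) if match else None

print(extract_weekly_hours("You’ll work a 50-hour week, working from Monday – Friday."))
print(extract_weekly_hours("Contracts of 30 hours per week are available."))
print(extract_weekly_hours("Private medical insurance after 6 months."))
```

Returning `None` when nothing matches makes missing values explicit. Note that this is still a heuristic: a phrase like "within 24 hours" would match too.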


The following code combines our extraction of job details:

In [11]:
details = postings[10].findAll("div", class_="c-job-details__content")
for detail in details:
    detail_title = detail.find('span', class_="c-job-details__title").text
    detail_text = detail.find('div', class_="c-job-details__text").text
    if detail_title == 'Salary':
        # Strip thousands separators and dashes, then split into tokens
        temp = detail_text.replace(',','')
        temp = temp.replace('-','')
        temp = temp.split()
        # Re-join a leading number that was split in two by a space
        if len(temp) > 1 and temp[1].isdigit():
            temp[0] = temp[0] + temp[1]
            del temp[1]
        salary_numbers = [float(s[1:]) for s in temp if s.startswith('£')]
        # Discard values below 6 by building a new list
        # (removing items while looping would skip elements)
        salary_numbers = [salary for salary in salary_numbers if salary >= 6]
        ub = max(salary_numbers)
        lb = min(salary_numbers)
    elif detail_title == 'Hours and benefits':
        tokens = detail_text.split()
        # Pattern 1: a hyphenated token such as '50-hour'
        for s in tokens:
            if '-hour' in s:
                work_time = int(s.replace('-hour',''))
        # Pattern 2: a number followed by the word 'hours'
        for i in range(1, len(tokens)):
            if tokens[i] == "hours" and tokens[i-1].isdigit():
                work_time = int(tokens[i-1])
        if 'when you start' in detail_text:
            Pension_from_start = 'Yes'
        else:
            Pension_from_start = 'No'

    elif detail_title == 'Holiday allowance':
        allowance = detail_text.split()
        # Each keyword has to be tested separately
        if 'annual' in allowance and 'leave' in allowance:
            if 'weeks’' in allowance:
                holiday = int(allowance[0])
            elif 'days’' in allowance:
                # Allowance given in days: convert to weeks
                holiday = int(allowance[0])/7

print(ub)
print(lb)
print(work_time)
print(holiday)
print(Pension_from_start)

46590.0
32200.0
40
5
No


In [14]:

def scrape_aldi_jobs(starting_page = 'head-office'):
    url = "https://www.aldirecruitment.co.uk/" + starting_page
    prefix = '/' + starting_page
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    links = []
    for link in soup.findAll('a'):
        new_link = link.get('href')
        if new_link is not None and new_link.startswith(prefix + '/'):
            new_link = new_link.replace(prefix,'')
            if new_link != '/':
                links.append(new_link)
    Pension = []
    hols = []
    department = []
    titles = []
    ubs = []
    lbs = []
    hours = []
    for link in links:
        category_url = url + link
        page = requests.get(category_url)
        soup = BeautifulSoup(page.content, "html.parser")
        postings = soup.findAll("div", class_="c-career--dropdown")
        ## Also grabbing the department information
        dep_name = link.replace('-',' ').replace('/','')
        for posting in postings:
            department.append(dep_name)
            titles.append(posting.find("div", class_="c-career--dropdown__content").find('h2').text)
            details = posting.findAll("div", class_="c-job-details__content")

            ## Reset all values for each posting, so that a missing detail becomes
            ## None instead of silently reusing the previous posting's value
            holiday = None
            work_time = None
            Pension_from_start = None
            salary_numbers = []
            temp = []

            for detail in details:
                detail_title = detail.find('span', class_="c-job-details__title").text
                detail_text = detail.find('div', class_="c-job-details__text").text
                tokens = detail_text.split()

                ## Some postings misspell the title as 'Holiday allownance'
                if detail_title.lower() in ('holiday allowance', 'holiday', 'holiday allownance'):
                    ## The allowance is the leading number of the text
                    if tokens and tokens[0].isdigit():
                        holiday = int(tokens[0])
                        ## Values above 7 are treated as days and converted to weeks
                        if holiday > 7:
                            holiday = holiday/7

                elif detail_title.lower() in ('hours and benefits', 'benefits'):
                    ## Some postings do not specify a number of hours per week
                    for s in tokens:
                        if '-hour' in s:
                            ## Some postings write, e.g., 40-hour per week, some 40-hours per week
                            work_time = int(s.replace('-hours','').replace('-hour',''))
                    for i in range(1, len(tokens)):
                        if tokens[i] == "hours" and tokens[i-1].isdigit():
                            work_time = int(tokens[i-1])

                    if 'when you start' in detail_text or 'When you start' in detail_text or 'When you join' in detail_text:
                        Pension_from_start = 'Yes'
                    else:
                        Pension_from_start = 'No'

                elif detail_title == 'Salary':
                    temp = detail_text.replace(',','')
                    temp = temp.replace('-','')
                    temp = temp.split()
                    if len(temp) > 1 and temp[1].isdigit():
                        temp[0] = temp[0] + temp[1]
                        del temp[1]
                    salary_numbers = [float(s[1:]) for s in temp if s.startswith('£')]
                    ## Build a new list instead of removing items while looping
                    salary_numbers = [salary for salary in salary_numbers if salary >= 6]

            ## Append exactly one value per posting to every list,
            ## so that all columns keep the same length
            hols.append(holiday)
            hours.append(work_time)
            Pension.append(Pension_from_start)

            ## Salaries are sometimes specified as per week or per hour instead of per year
            if 'per' in temp and 'week' in temp:
                salary_numbers = [salary*52 for salary in salary_numbers]
            elif 'per' in temp and 'hour' in temp:
                ## Hourly pay can only be annualised if the weekly hours are known
                if work_time is not None:
                    salary_numbers = [salary*work_time*52 for salary in salary_numbers]
                else:
                    salary_numbers = []

            ## Salary may not be specified
            if len(salary_numbers) > 0:
                ubs.append(max(salary_numbers))
                lbs.append(min(salary_numbers))
            else:
                ubs.append(None)
                lbs.append(None)

    job_data = pd.DataFrame(
        {'Department': department,
         'Job title': titles,
         'Salary lower': lbs,
         'Salary upper': ubs,
         'Weekly hours': hours,
         'Holiday weeks': hols,
         'Pension from start': Pension
        })
    return job_data


The function above puts it all together: it is simple to call and returns a data frame of job postings. We had to make a few adjustments to avoid errors. These are marked with comments.
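
One of these adjustments is the conversion of weekly and hourly pay into annual figures. The arithmetic can be isolated in a small helper, sketched below. The name `annualise` and its parameters are hypothetical, and the factor of 52 paid weeks per year is the same simplifying assumption that the function makes.

```python
def annualise(amount, period="year", weekly_hours=None, weeks_per_year=52):
    """Convert a pay amount to an annual figure (hypothetical helper)."""
    if period == "year":
        return amount
    if period == "week":
        return amount * weeks_per_year
    if period == "hour":
        # Hourly pay cannot be annualised without the weekly working hours
        if weekly_hours is None:
            return None
        return amount * weekly_hours * weeks_per_year
    raise ValueError("unknown pay period: " + str(period))

print(annualise(450, "week"))                    # 450 * 52
print(annualise(11.77, "hour", weekly_hours=30)) # 11.77 * 30 * 52
print(annualise(11.77, "hour"))                  # None: weekly hours unknown
```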

Let's try it out:

In [15]:
aldi_job_data = scrape_aldi_jobs()
aldi_job_data.head(32)

| | Department | Job title | Salary lower | Salary upper | Weekly hours | Holiday weeks | Pension from start |
|---|---|---|---|---|---|---|---|
| 0 | national administration | National Finance & Administration Manager | 54285.0 | 79040.0 | 50.0 | 5.0 | Yes |
| 1 | national administration | National Tax Manager | 44660.0 | 93655.0 | 50.0 | 5.0 | Yes |
| 2 | national administration | Data Scientist | 57130.0 | 74430.0 | 40.0 | 5.0 | Yes |
| 3 | national administration | National Finance & Administration Team Leader | 47100.0 | 54255.0 | 40.0 | 5.0 | Yes |
| 4 | national administration | Technical Analyst | 47100.0 | 54255.0 | 40.0 | 5.0 | Yes |
| 5 | national administration | Tax Specialist | 41950.0 | 50475.0 | 40.0 | 5.0 | No |
| 6 | national administration | National Administration Analyst | 41200.0 | 45225.0 | NaN | 5.0 | No |
| 7 | national administration | Personal Assistant - National Administration | 32075.0 | 37455.0 | 40.0 | 5.0 | No |
| 8 | national administration | National Administration Assistant | 32075.0 | 37455.0 | 40.0 | 5.0 | No |
| 9 | national administration | Administration Support Assistant | 22310.0 | 27450.0 | 40.0 | 5.0 | No |


We can now explore the data frame, improve our code if we find issues, and then analyze it. For example, let's have a look at a simple histogram of postings per department.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
fig.set_size_inches(30, 10)
sns.histplot(data=aldi_job_data, x="Department",ax=ax)
plt.show()

Finally, save the job postings we found as a CSV:

In [49]:

%cd '/Users/agamchug/Desktop/' 
aldi_job_data.to_csv('Aldi_postings_2021-11-03.csv', index=False)

/Users/agamchug/Desktop
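
Since the idea is to run the scraper regularly and compare the results over time, it helps to generate the dated file name automatically instead of typing it by hand. The following is a minimal sketch using only the standard library. The function name `dated_csv_path` and the folder name `data` are hypothetical.

```python
from datetime import date
from pathlib import Path

def dated_csv_path(folder, prefix="Aldi_postings", day=None):
    """Build a path such as <folder>/Aldi_postings_2021-11-03.csv (hypothetical helper)."""
    # Default to today's date when no date is given
    day = day or date.today()
    return Path(folder) / (prefix + "_" + day.isoformat() + ".csv")

path = dated_csv_path("data", day=date(2021, 11, 3))
print(path.name)
```

The resulting path can be passed directly to `to_csv`, for example `aldi_job_data.to_csv(dated_csv_path("."), index=False)`, which avoids changing the working directory with `%cd`.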


