# Web Scraping Featured Jobs on jobberman.com using Python


![banner-image](https://i.imgur.com/s6UkuCC.png)

[Jobberman](https://www.jobberman.com) is a popular online jobs platform in Nigeria that connects qualified professionals to their dream jobs and employers to the best talent. Over 2 million people use Jobberman each year to find jobs. The website homepage has a clean layout where job seekers can search for jobs by category, and each category features various jobs. For example, the [Sales](https://www.jobberman.com/jobs/sales) category has roles like `Sales Manager`, `Investment Advisor` and `Sales Representative`.


In this project, we will retrieve information from the [Accounting, Auditing & Finance](https://www.jobberman.com/jobs/accounting-auditing-finance) job category using a technique called _web scraping_.


**What is Web Scraping?**

Suppose you want to collect a piece of information from a website. The obvious approach is to copy and paste it. But what if you need a large amount of information, spread across hundreds of pages, as quickly as possible? Copying and pasting by hand would not scale. This is where web scraping comes in! Web scraping uses code, usually with the help of a few simple libraries, to extract data from a website automatically. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

So, when a web scraper needs to scrape a site, first the URL is provided, then it loads all the HTML code for the site. The scraper then obtains the required data from this HTML code and outputs this data in the format specified by the user. Mostly, this is in the form of an Excel spreadsheet or a CSV file, but the data can also be saved in other formats, such as a JSON file.
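
To make that flow concrete, here is a minimal sketch of the same three stages (HTML in, data extracted, CSV out) on a tiny hard-coded page. It deliberately uses only the Python standard library rather than the libraries used later in this project, and the sample HTML, the `LinkCollector` class and the column names are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# A tiny stand-in for a downloaded page; real pages are far larger.
SAMPLE_HTML = """
<ul>
  <li><a href="/listings/accountant">Accountant</a></li>
  <li><a href="/listings/finance-lead">Finance Lead</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        # Remember the link target while we are inside an <a> tag.
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_data(self, data):
        # Only keep non-empty text that appears inside an <a> tag.
        if self._href is not None and data.strip():
            self.rows.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None


# Stage 1 and 2: load the HTML and pull out the data we care about.
parser = LinkCollector()
parser.feed(SAMPLE_HTML)

# Stage 3: write the structured rows out as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Job_Url', 'Job_Title'])
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The rest of this project follows the same shape, with Requests doing the download, Beautiful Soup doing the extraction and pandas writing the CSV.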


Here's an outline of the steps we'll follow.
1. Download the webpage using Requests
2. Parse the HTML source code using Beautiful Soup
3. Extract featured jobs from a job category
4. Compile extracted information into Python dictionaries
5. Extract and combine data from multiple pages
6. Save the extracted information to a CSV file

By the end of the project, we will create a CSV in the following format:

```
Job_Url,Job_Title,Company
https://www.jobberman.com/listings/accountant-rvwnq9,Accountant,Jobberman (Third Party Recruitment)
https://www.jobberman.com/listings/investment-principal-5xn457,Investment Principal,Jobberman (Third Party Recruitment)
...
```

###  How to run the code

You can execute the code by selecting the "Run" button at the top of this page. You can make changes and save your own version to [Jovian](https://jovian.ai/) by executing the following cells.

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
# Execute this to save new versions of the notebook
jovian.commit(project="scraping-job-catgeories-on-jobberman-using-python")

[jovian] Updating notebook "olivelikes2chat/scraping-job-catgeories-on-jobberman-using-python" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/olivelikes2chat/scraping-job-catgeories-on-jobberman-using-python

'https://jovian.ai/olivelikes2chat/scraping-job-catgeories-on-jobberman-using-python'

## Download the webpage using Requests

We can use the `requests` library to download the web page.

In [4]:
!pip install requests --upgrade --quiet

In [5]:
import requests

The library is now installed and imported.

To download a page we can use the `get` function from requests, which returns a response object.

In [6]:
topic_url = 'https://www.jobberman.com/jobs/accounting-auditing-finance'
response = requests.get(topic_url)

The `.status_code` property can be used to check if the response was successful. If the request was successful, `response.status_code` is set to a value between 200 and 299.

In [7]:
response.status_code

200

The response was successful. The contents of the web page can then be accessed using the `.text` property of the response.

In [8]:
page_contents = response.text

Let's check the number of characters on the page.

In [9]:
len(page_contents)

172806

The page contains over 170,000 characters! Let's view the first 500 characters of the web page.

In [10]:
page_contents[:500]

'<!DOCTYPE html>\n<html class="no-js scroll-smooth" lang="en-ng">\n<head>\n<meta charset="utf-8" />\n<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\n<meta name="google-site-verification" content="4hClu3W3PtjepabInbxYOahjMMOlzUNqWBY9OKZJtdA" />\n<meta name="csrf-token" content="1xU1AR4ZpMrKHgfAhvnjvT5ojFRs2tte6Gc5GzYO">\n<link rel="prerender" href="https://www.jobberman.com/listings/financial-accountant-6kjm6p">\n<link rel="prerender" href="https://www.jobberman.com'

What you see above is the source code of the web page. It's written in a language called [HTML](https://www.w3schools.com/html/). It defines the content and structure of the web page.

Let's save the contents to a file with the `.html` extension. The saved page can then be viewed locally within Jupyter using "File > Open".

In [11]:
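# Write as UTF-8 so non-ASCII characters on the page are saved correctly on any platform
with open('job-in-nigeria.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)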

In this section, we successfully downloaded a web page using `requests`.
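
As a recap, the download and save steps could be wrapped into two small helpers. This is only a sketch: the names `fetch_html` and `save_html` and the 10-second timeout are assumptions, not part of the original notebook. It relies on `raise_for_status()`, which raises an exception for 4xx and 5xx responses instead of requiring a manual status check.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # The timeout (an assumed value) stops the call from hanging forever.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text


def save_html(html, path):
    """Write HTML text to a file as UTF-8 and return the path."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return path
```

With these in place, `save_html(fetch_html(topic_url), 'job-in-nigeria.html')` would reproduce the steps of this section in one line.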


## Parse the HTML source code using Beautiful Soup

To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. 

Let's install the library and import the BeautifulSoup class from the bs4 module.

In [13]:
!pip install beautifulsoup4 --upgrade --quiet

In [14]:
from bs4 import BeautifulSoup

With Beautiful Soup installed and imported, let's use it to parse the contents of the page.

In [15]:
doc = BeautifulSoup(response.text, 'html.parser')

This creates a Beautiful Soup object called `doc`, which has several properties and methods for extracting information from the HTML document. To view the source code of any webpage in your browser, right-click anywhere on the page and select the "Inspect" option. It opens the "Developer Tools" pane, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page. Let's look at a few examples below.

Let's check the type of doc using the `type` function to confirm that it is a beautiful soup object.

In [16]:
type(doc)

bs4.BeautifulSoup

Let's illustrate how Beautiful Soup works by retrieving the main heading of the page, which is contained within the `h1` tag.

![](https://i.imgur.com/MLVJbb8.png)

In [17]:
doc.find('h1')

<h1 class="flex-grow-0 flex-shrink-0 mb-7 text-2xl font-medium text-gray-700 capitalize basis-full">Accounting, auditing &amp; finance jobs in Nigeria</h1>

Let's find the first embedded image on the web page, using the `img` tag.

In [18]:
doc.find('img')

<img alt="Jobberman" class="mr-10 lazyload" src="https://www.jobberman.com/static-assets/img/ng/landscape.svg" width=" 180 "/>

To find all the occurrences of the `a` tag (which represents a link), use the `find_all` method. To find only the first occurrence, use the `find` method.

In [19]:
#Uncomment this and run the cell to view the output

#doc.find_all('a')

The attributes of a tag can be accessed using the indexing notation, e.g., `doc.a['href']`.

In [20]:
doc.a['href']

'https://www.jobberman.com'
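
The same calls can be tried on a small, hard-coded snippet, which makes the results easy to predict. The HTML below is made up for illustration. It also shows `tag.get('href')`, which returns `None` when the attribute is missing, whereas indexing with `tag['href']` would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# A made-up snippet; the last <a> tag deliberately has no href.
SAMPLE_HTML = """
<h1>Sample jobs</h1>
<a href="https://example.com/one">One</a>
<a href="https://example.com/two">Two</a>
<a>No link here</a>
"""

doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

first_link = doc.find('a')       # first matching tag only
all_links = doc.find_all('a')    # list of every matching tag
first_href = first_link['href']  # indexing raises KeyError if href is missing
hrefs = [a.get('href') for a in all_links]  # .get() returns None instead

print(first_href)
print(hrefs)
```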

Combining the first two sections, we can now define a helper function to download the web page and return a beautifulsoup doc.

In [21]:
def job_page(url):
    # Download the page and stop early if the request failed
    response = requests.get(url)
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page {}'.format(url))
    # Parse the downloaded HTML with Beautiful Soup
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc
    

We can now use the function `job_page()` to download any web page and parse it using beautiful soup. 

Getting the web page of a different job category is now as simple as calling the function with a different argument. Let's try it out.

In [22]:
doc = job_page(topic_url)

Let's check the heading of the page to ensure we have the right page. We will use the `text` property to access the text within the tag.

In [23]:
doc.find('h1').text

'Accounting, auditing & finance jobs in Nigeria'

We installed and imported the Beautiful Soup library and used it to parse our web page. We also created a reusable function, `job_page()`, that combines the first two sections: it downloads a web page and returns a parsed document we can extract information from.


## Extract featured jobs from a job category

We are now able to parse through web pages so let's go ahead and extract the various job listings under each job category. We will be collecting information about the job title, job url and company.

Now, upon inspecting the box containing the information for a job listing, you will find a `div` tag for each listing, with `class` attribute set to 'mx-5 md:mx-0 flex flex-wrap col-span-1 mb-5 bg-white rounded-lg border border-gray-300 hover:border-gray-400 focus-within:ring-2 focus-within:ring-offset-2 focus-within:ring-gray-500'.


![](https://i.imgur.com/FZ561or.png)


Let's find all the `div` tags matching this `class`.

In [25]:
div_tags = doc.find_all('div', class_= 'mx-5 md:mx-0 flex flex-wrap col-span-1 mb-5 bg-white rounded-lg border border-gray-300 hover:border-gray-400 focus-within:ring-2 focus-within:ring-offset-2 focus-within:ring-gray-500')

Let's find the number of div_tags on this page.

In [26]:
len(div_tags)

19

There are 19 job listings on the page, and our query resulted in 19 `div` tags. It looks like we've found the enclosing tag for each job listing.
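
One caveat about matching on the full `class` string: when several class names are passed as one string, Beautiful Soup compares it against the attribute value exactly, so the query breaks if the site reorders or adds a class. The sketch below illustrates this on made-up HTML with made-up class names, and shows a CSS selector via `select` as a more tolerant alternative.

```python
from bs4 import BeautifulSoup

# Made-up markup: "Job A" and "Job B" carry the same classes in a different order.
SAMPLE_HTML = """
<div class="card flex rounded-lg">Job A</div>
<div class="rounded-lg card flex">Job B</div>
<div class="banner flex">Not a job</div>
"""

doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Multi-class string: matches only when the attribute is exactly this string.
exact = doc.find_all('div', class_='card flex rounded-lg')

# CSS selector: matches any div that has both classes, in any order.
by_selector = doc.select('div.card.rounded-lg')

print([d.text for d in exact])
print([d.text for d in by_selector])
```

Note that class names containing special characters, such as the colons in `md:mx-0`, would need escaping in a CSS selector, so picking one or two plain class names is usually simpler.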

We need to extract the following information from each tag:

1. Job Url
2. Job Title
3. Company

### Job Url 

Let's retrieve the first `div` tag.

In [27]:
div_tag = div_tags[0]

Look at the source of any of the `div` tags. You will notice that the job link is a part of an `a` tag with `class` attribute 'relative mb-3 text-lg font-medium break-words focus:outline-none metrics-apply-now text-brand-linked'.

![](https://i.imgur.com/WN7FSQH.png)

Let's find the first `a` tag inside this `div` and read its `href` attribute.

In [28]:
job_url = div_tag.find('a')['href']

In [29]:
job_url

'https://www.jobberman.com/listings/accountant-48gdmr'

As expected, the first `a` tag has the link "https://www.jobberman.com/listings/accountant-48gdmr".

Now let's find all instances of the `a` tags with corresponding `class`.

We can create a function to capture all job urls on the page.

In [30]:
def get_job_url(doc):
    selector = 'relative mb-3 text-lg font-medium break-words focus:outline-none metrics-apply-now text-brand-linked'
    a_tags = doc.find_all('a', {'class': selector})
    return [tag['href'] for tag in a_tags]
    

Let's call the function `get_job_url` on our page.

In [31]:
urls = get_job_url(doc)

Checking the number of urls contained on the page, we have:

In [32]:
len(urls)

19

As expected, we have 19 job urls for the 19 jobs posted on the page.

Let's print the first five job urls.

In [33]:
urls[:5]

['https://www.jobberman.com/listings/accountant-48gdmr',
 'https://www.jobberman.com/listings/finance-lead-rvx6k2',
 'https://www.jobberman.com/listings/accountant-vejz8v',
 'https://www.jobberman.com/listings/financial-accountant-6kjm6p',
 'https://www.jobberman.com/listings/head-of-finance-administration-99vdmw']

### Job Title

The title of each job is contained within a `p` tag with `class` attribute 'text-lg font-medium break-words text-brand-linked'.

![](https://i.imgur.com/SOQn4yQ.png)


To extract this information for all the jobs on the page, let's create a helper function.


In [34]:
def get_job_title(doc):
    selector = 'text-lg font-medium break-words text-brand-linked'
    p_tags = doc.find_all('p', {'class': selector})
    return [tag.text.strip() for tag in p_tags]
    

Let's call the function `get_job_title` on our page.

In [35]:
titles = get_job_title(doc)

Checking the number of titles contained on the page, we have:

In [36]:
len(titles)

19

As expected, we have 19 job titles for the 19 jobs posted on the page.

Let's print the first five job titles.

In [37]:
titles[:5]

['Accountant',
 'Finance Lead',
 'Accountant',
 'Financial Accountant',
 'Head of Finance & Administration']

### Company

Let's retrieve the name of the hiring company which is enclosed within a `p` tag with `class` 'text-sm text-brand-linked'.

![](https://i.imgur.com/qDK5aSo.png)




Let's create a helper function to retrieve this information.

In [38]:
def get_company(doc):
    selector = 'text-sm text-brand-linked'
    p_tags = doc.find_all('p', {'class': selector})
    return [tag.text.strip() for tag in p_tags]

Let's call the function `get_company` on our page.

In [39]:
companies = get_company(doc)

In [40]:
len(companies)

19

We have 19 hiring companies for the 19 jobs posted on the page.

Let's print the first five companies.

In [41]:
companies[:5]

['Stylish Hair and Beauty Studio Ltd',
 'Jobberman (Third Party Recruitment)',
 'Ayomo bakery',
 'Xcene Research',
 'Afos Foundation']

In this section, we identified all the tags containing the information we need to scrape the website, and we wrote helper functions to retrieve this information.
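
The three helpers each return an independent list, so the lists only line up if every listing has a url, a title and a company. An alternative, sketched here under assumptions, is to parse one listing at a time so that a missing field becomes `None` instead of shifting the other values. The `parse_listing` function and the short class names (`listing`, `job-link`, `title`, `company`) are hypothetical stand-ins for the long class strings used above.

```python
from bs4 import BeautifulSoup

# Made-up markup; the second listing deliberately has no company.
SAMPLE_HTML = """
<div class="listing">
  <a class="job-link" href="https://example.com/listings/accountant"><p class="title">Accountant</p></a>
  <p class="company">Example Bakery</p>
</div>
<div class="listing">
  <a class="job-link" href="https://example.com/listings/finance-lead"><p class="title">Finance Lead</p></a>
</div>
"""


def parse_listing(card):
    """Extract one job as a dictionary, using None for missing fields."""
    link = card.find('a', class_='job-link')
    title = card.find('p', class_='title')
    company = card.find('p', class_='company')
    return {
        'Job_Url': link['href'] if link is not None else None,
        'Job_Title': title.text.strip() if title is not None else None,
        'Company': company.text.strip() if company is not None else None,
    }


doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')
jobs = [parse_listing(card) for card in doc.find_all('div', class_='listing')]
print(jobs)
```

A list of dictionaries like `jobs` can also be passed straight to `pd.DataFrame`.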


## Compile extracted information into Python dictionaries

Now that we have extracted all the information we need, let's put it into a dictionary in which each key is a column name and each value is the list of data for that column.

In [43]:
jobs_info = {'Job_Url': urls, 
            'Job_Title': titles, 
            'Company': companies
            }

We will now install and import the [pandas](https://pandas.pydata.org/docs/) library to convert the compiled dictionary into a dataframe. To do this, we will use the `pd.DataFrame` constructor.

In [44]:
!pip install pandas --upgrade --quiet

In [45]:
import pandas as pd

We have successfully installed and imported the `pandas` library. Now, let's view the information as a dataframe.

In [46]:
df = pd.DataFrame(jobs_info)
df

Unnamed: 0,Job_Url,Job_Title,Company
0,https://www.jobberman.com/listings/accountant-...,Accountant,Stylish Hair and Beauty Studio Ltd
1,https://www.jobberman.com/listings/finance-lea...,Finance Lead,Jobberman (Third Party Recruitment)
2,https://www.jobberman.com/listings/accountant-...,Accountant,Ayomo bakery
3,https://www.jobberman.com/listings/financial-a...,Financial Accountant,Xcene Research
4,https://www.jobberman.com/listings/head-of-fin...,Head of Finance & Administration,Afos Foundation
5,https://www.jobberman.com/listings/accounts-of...,Accounts Officer,Alarm Center Limited
6,https://www.jobberman.com/listings/accountant-...,Accountant,Anonymous Employer
7,https://www.jobberman.com/listings/accountant-...,Accountant,Tender Years Preparatory School
8,https://www.jobberman.com/listings/finance-off...,Finance Officer,SocketWorks Nigeria Limited
9,https://www.jobberman.com/listings/account-man...,Account Manager,Teal Harmony


The dataframe has one row for each of the 19 jobs on the page.
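
Building a dataframe from a dictionary of lists only works when every list has the same length; otherwise pandas raises a `ValueError`. The sketch below, using made-up sample data and a hypothetical `to_dataframe` helper, checks the lengths first so that a mismatch is reported with the column names involved.

```python
import pandas as pd

# Made-up sample data in the same shape as jobs_info.
sample_info = {
    'Job_Url': ['https://example.com/listings/a', 'https://example.com/listings/b'],
    'Job_Title': ['Accountant', 'Finance Lead'],
    'Company': ['Example Bakery', 'Example Research'],
}


def to_dataframe(columns):
    """Convert a dict of equal-length lists into a dataframe."""
    lengths = {name: len(values) for name, values in columns.items()}
    # Report which columns disagree before pandas does.
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return pd.DataFrame(columns)


sample_df = to_dataframe(sample_info)
print(sample_df.shape)
```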

## Extract and combine data from multiple pages

Converting a dictionary to a dataframe works well, but so far we have only done it for one page.

Putting it all together, let's create a function that extracts the same information (job url, job title, company) from any page of the category. The function uses Beautiful Soup to parse the page with the given page number and returns the urls, titles and companies of all the jobs listed on it.

In [47]:
def get_all_pages(page_number):
    url = 'https://www.jobberman.com/jobs/accounting-auditing-finance?page=' + str(page_number)
    doc = job_page(url)
    urls = get_job_url(doc)
    titles = get_job_title(doc)
    companies = get_company(doc)
    return urls, titles, companies


Let's now create empty lists for the urls, titles and companies, then run a `for` loop that goes through each page and adds its information to our lists.

In [49]:
all_urls, all_titles, all_companies = [], [], []

for page_number in range(2, 8):
    urls, titles, companies = get_all_pages(page_number)
    all_urls += urls
    all_titles += titles
    all_companies += companies
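
Note that `range(2, 8)` covers pages 2 through 7, because the stop value is excluded. A slightly more general version of this loop is sketched below under assumptions: the `collect_pages` name, the one-second pause between requests and the rule of stopping at the first empty page are illustrative choices, not part of the original notebook. The page-fetching function is passed in as an argument, so `get_all_pages` could be used directly.

```python
import time


def collect_pages(fetch_page, first_page, last_page, delay_seconds=1.0):
    """Collect urls, titles and companies from a range of pages.

    fetch_page is any function that takes a page number and returns
    (urls, titles, companies), such as get_all_pages.
    """
    all_urls, all_titles, all_companies = [], [], []
    # last_page is inclusive here, unlike the stop value of range().
    for page_number in range(first_page, last_page + 1):
        urls, titles, companies = fetch_page(page_number)
        if not urls:
            # An empty page is assumed to mean we ran past the last page.
            break
        all_urls += urls
        all_titles += titles
        all_companies += companies
        # Pause between requests to avoid hammering the website.
        time.sleep(delay_seconds)
    return all_urls, all_titles, all_companies
```

For example, `collect_pages(get_all_pages, 2, 7)` would fetch the same pages as the loop above.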
    
    

Let's create a dictionary using the lists we have created.

In [50]:
jobs_all_pages = {'Job_Url': all_urls, 
                'Job_Title': all_titles, 
                'Company': all_companies, 
        }

Let's then create a dataframe using `pd.DataFrame`.

In [51]:
dataframe = pd.DataFrame(jobs_all_pages)
dataframe

Unnamed: 0,Job_Url,Job_Title,Company
0,https://www.jobberman.com/listings/accountant-...,Accountant,Stylish Hair and Beauty Studio Ltd
1,https://www.jobberman.com/listings/finance-lea...,Finance Lead,Jobberman (Third Party Recruitment)
2,https://www.jobberman.com/listings/accountant-...,Accountant,Ayomo bakery
3,https://www.jobberman.com/listings/accountant-...,Accountant,JTech Global Resources Ltd
4,https://www.jobberman.com/listings/accountant-...,Accountant/ Cost Accountant,NCIC Oil Service
...,...,...,...
106,https://www.jobberman.com/listings/team-lead-f...,"Team Lead, Fraud Desk",Teamapt
107,https://www.jobberman.com/listings/field-credi...,Field Credit Risk Officer,Teamapt
108,https://www.jobberman.com/listings/bank-sales-...,Bank Sales Manager (Cross River),Teamapt
109,https://www.jobberman.com/listings/moniepoint-...,MONIEPOINT CUSTOMER SUCCESS (KWARA STATE),Teamapt


We have successfully scraped 6 pages (pages 2 through 7) of the `Accounting, Auditing & Finance` category and converted the results into a dataframe which contains 3 columns and 110 rows of data.

## Save the extracted information to a csv file

To save the dataframe to a `csv` file, we simply call the `dataframe.to_csv` method.

In [52]:
dataframe.to_csv('accounting.csv', index=False)

We can now read the file and inspect its contents. The contents of the file can also be inspected using the "File > Open" menu option within Jupyter.
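
Reading the data back with `pd.read_csv` is a quick way to confirm that nothing was lost on the way out. The sketch below does the round trip on made-up sample rows and uses an in-memory buffer instead of a file on disk. One of the titles contains a comma, which pandas wraps in quotes when writing so that it still reads back as a single value.

```python
import io

import pandas as pd

# Made-up sample rows; the first title contains a comma on purpose.
sample = pd.DataFrame({
    'Job_Url': ['https://example.com/listings/a', 'https://example.com/listings/b'],
    'Job_Title': ['Team Lead, Fraud Desk', 'Accountant'],
    'Company': ['Example Company', 'Example Bakery'],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Rewind the buffer and read the CSV back into a new dataframe.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(csv_text)
print(restored.shape)
```

For the file created in this project, the equivalent call would be `pd.read_csv('accounting.csv')`.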

# Summary

In this project, we extracted information about featured jobs by using Python libraries and creating helper functions. We also used these functions to collect information from multiple pages of the job website and compiled it into a Python dictionary. The dictionary was then converted into a pandas dataframe and saved as a CSV file that can be used in various applications.

The CSV file we created had this format:

```
Job_Url,Job_Title,Company
https://www.jobberman.com/listings/accountant-rvwnq9,Accountant,Jobberman (Third Party Recruitment)
https://www.jobberman.com/listings/investment-principal-5xn457,Investment Principal,Jobberman (Third Party Recruitment)
...
```

Here's the full working code for this project:

```
import pandas as pd
import requests
from bs4 import BeautifulSoup

def job_page(url):
    # Download a page and parse it with Beautiful Soup
    response = requests.get(url)
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page {}'.format(url))
    return BeautifulSoup(response.text, 'html.parser')

def get_job_url(doc):
    # Links to the individual job listings
    selector = 'relative mb-3 text-lg font-medium break-words focus:outline-none metrics-apply-now text-brand-linked'
    a_tags = doc.find_all('a', {'class': selector})
    return [tag['href'] for tag in a_tags]

def get_job_title(doc):
    # Job titles
    selector = 'text-lg font-medium break-words text-brand-linked'
    p_tags = doc.find_all('p', {'class': selector})
    return [tag.text.strip() for tag in p_tags]

def get_company(doc):
    # Names of the hiring companies
    selector = 'text-sm text-brand-linked'
    p_tags = doc.find_all('p', {'class': selector})
    return [tag.text.strip() for tag in p_tags]

def get_all_pages(page_number):
    # Extract urls, titles and companies from one page of the category
    url = 'https://www.jobberman.com/jobs/accounting-auditing-finance?page=' + str(page_number)
    doc = job_page(url)
    urls = get_job_url(doc)
    titles = get_job_title(doc)
    companies = get_company(doc)
    return urls, titles, companies

# Collect the listings from pages 2 through 7
all_urls, all_titles, all_companies = [], [], []

for page_number in range(2, 8):
    urls, titles, companies = get_all_pages(page_number)
    all_urls += urls
    all_titles += titles
    all_companies += companies

# Combine the lists into a dataframe and save it as a CSV file
dataframe = pd.DataFrame({'Job_Url': all_urls,
                          'Job_Title': all_titles,
                          'Company': all_companies})
dataframe.to_csv('accounting.csv', index=False)
```
    

# Future Work

1. Exploring all job categories and scraping them at once to make a list of all the jobs available on the Jobberman website.
2. Fetching job data from other job listing companies in Nigeria, such as Philips Consulting, and creating a job aggregator website where job seekers can easily find a comprehensive list of jobs.
3. Using the fetched data to analyze job trends and the labor market.
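
As a starting point for the first item, the page url could be built from a category slug instead of being hard-coded. This is only a sketch: the `category_page_url` helper is hypothetical, the two slugs are the ones that appear in the links earlier in this project, and it assumes every category accepts the same `?page=` parameter as the one scraped here.

```python
BASE_URL = 'https://www.jobberman.com/jobs/'

# Slugs taken from the category links used earlier in this project.
CATEGORY_SLUGS = ['accounting-auditing-finance', 'sales']


def category_page_url(slug, page_number=1):
    """Build the url for one page of a job category."""
    # Assumes every category uses the same ?page= query parameter.
    return '{}{}?page={}'.format(BASE_URL, slug, page_number)


first_pages = [category_page_url(slug) for slug in CATEGORY_SLUGS]
print(first_pages)
```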

# References

Here are some links that helped in the completion of this project:

1. https://www.jobberman.com/
2. https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
3. https://realpython.com/python-requests/
4. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
5. https://www.w3schools.com/html/
6. https://pandas.pydata.org/docs/
7. https://www.jobberman.com/jobs/accounting-auditing-finance


In [None]:
jovian.commit(files=['accounting.csv'])