# Web Scraping and Analysis of Data Analyst Jobs

This project showcases my work on the Coursera project "Scrape and Analyze data analyst job requirements with Python", which is part of the Google Data Analytics Professional Certificate course I completed.
- https://www.coursera.org/professional-certificates/google-data-analytics
- https://www.coursera.org/learn/scrape-job-postings-data-analyst/supplement/uKwPr/the-project-scenario

## Overview

I am acting as a data analyst at a medium-sized recruitment agency, helping it improve how it sources job vacancies.

## Business Problem

The agency relies on multiple job posting sites to identify potential job openings for its clients. It searches through each site manually, which is time-consuming and often leads to missed opportunities.

They want me to analyse **Data Analyst** job adverts using web scraping tools that can **automatically** extract job posting data from a job posting site. The team will use my analysis to source job vacancies more efficiently and better serve its clients. Getting relevant openings to clients more quickly gives those clients a competitive advantage over other applicants.

## Project Objectives

- To create a web scraping tool that can automatically extract Data Analyst job data from a job posting site
- To increase the efficiency and quality of job vacancy sourcing
- To gain a competitive advantage
- To make suggestions based on my findings
  

## Selecting a job posting site

* Most job websites I tried have anti-scraping filters, which can affect my results.

* I chose **www.reed.co.uk** for this project because it is easier to scrape and specialises in advertising local jobs in the UK.


# THE SCRAPING TOOL (Application)

**A single function call that extracts the required information from every page of the search results and saves the data as a CSV file**

In [2]:
#importing libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup


def divs_all_pages(div):
    """Extract the fields of one job card (div) and return them as a tuple."""

    title = div.h2.text

    # Keep only the text before 'by', which is the posting date
    date = div.find('div', class_="job-card_jobResultHeading__postedBy__sK_25").text.partition('by')[0].strip()

    employer = div.find('a', {'data-element': "recruiter"}).text

    location = div.find('li', {'data-qa': 'job-card-location'}).text

    # Salary and working hours are the first and third metadata items on the card
    metadata = div.find_all('li', class_="job-card_jobMetadata__item___QNud list-group-item")
    salary = metadata[0].text
    working_hrs = metadata[2].text

    weblink = "https://www.reed.co.uk" + div.h2.a.get('href')

    description = div.find('p', {'data-qa': "jobDescriptionDetails"}).text

    # Bundle the fields into one tuple, in the same order as the CSV columns
    return (title, date, employer, location, salary, working_hrs, weblink, description)
        


def main():

    # Start with an empty list of rows and the first results page
    data_all = []
    url = "https://www.reed.co.uk/jobs/data-analyst-jobs-in-england"

    # Send a browser-like User-Agent with every request
    headers = {
        "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

    while True:

        response = requests.get(url, headers=headers)
        data = response.text
        # Name the parser explicitly so the result does not depend on which parsers are installed
        soup = BeautifulSoup(data, 'html.parser')

        divs = soup.find_all('div', class_='col-sm-12 col-md-7 col-lg-8 col-xl-9')

        # Extract one row per job card with divs_all_pages(div), defined above
        for div in divs:
            files = divs_all_pages(div)
            data_all.append(files)

        # Follow the 'Next page' link; on the last page find() returns None,
        # .get() raises AttributeError, and we leave the loop
        try:
            url = "https://www.reed.co.uk" + soup.find('a', {'aria-label': "Next page"}).get('href')
        except AttributeError:
            break

    # Build the DataFrame once, after the loop, so the last page's rows are included
    df_first = pd.DataFrame(data_all, columns=['title', 'date', 'employer', 'location', 'salary', 'working_hours',
                                               'job_weblink', 'job_description'])

    # Save the data as a CSV file without the index column
    df_first.to_csv('data_analyst_28_04_24.csv', index=False)


# Run the function 
main()
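
The pagination pattern used by `main()` can be looked at separately from the network and parsing code. What follows is a minimal sketch rather than the project's actual implementation: `crawl_all_pages`, the `fetch_page` callable, and the fake `pages` dictionary are all hypothetical names introduced for illustration. It shows one way to follow "next" links until none is left while keeping the rows from the final page.

```python
from urllib.parse import urljoin


def crawl_all_pages(start_url, fetch_page, max_pages=1000):
    """Collect rows from every page by following 'next' links.

    fetch_page is a hypothetical callable: it takes a URL and returns
    (rows, next_href), where next_href is None on the last page.
    """
    rows_all = []
    url = start_url
    seen = set()

    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guard against a 'next' link that loops back
        rows, next_href = fetch_page(url)
        rows_all.extend(rows)  # keep this page's rows before deciding to stop
        # Resolve a relative href such as '/jobs?pageno=2' against the current URL
        url = urljoin(url, next_href) if next_href else None

    return rows_all


# Fake three-page site used only for illustration (not real data)
pages = {
    "https://example.com/jobs": (["job A", "job B"], "/jobs?pageno=2"),
    "https://example.com/jobs?pageno=2": (["job C"], "/jobs?pageno=3"),
    "https://example.com/jobs?pageno=3": (["job D"], None),
}

all_rows = crawl_all_pages("https://example.com/jobs", pages.__getitem__)
print(all_rows)
```

Because the rows are collected before the loop decides whether to continue, the output file can be written once after the crawl finishes, with nothing lost from the last page.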



## Load the scraped data

In [2]:
#importing libraries
import pandas as pd
pd.options.display.max_rows = None

df_original  = pd.read_csv('data_analyst_28_04_24.csv')
df_original.head(2)


|   | title | date | employer | location | salary | working_hours | job_weblink | job_description |
|---|---|---|---|---|---|---|---|---|
| 0 | Data Analyst | 28 March | SES Water | Redhill | £28,000 - £32,000 per annum | Permanent, full-time | https://www.reed.co.uk/jobs/data-analyst/52392... | If you are a Data Analyst with a commitment to... |
| 1 | Business Data Analyst/Data Analyst | 3 days ago | Deutsche Bank | Birmingham | Competitive salary | Contract, full-time | https://www.reed.co.uk/jobs/business-data-anal... | AMS is the world's leading provider of Talent ... |
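
The `salary` column arrives as free text, such as `£28,000 - £32,000 per annum` or `Competitive salary`, so it needs parsing before any numeric analysis. Below is a minimal sketch of one way to do this, assuming amounts are written with a pound sign. `parse_salary_range` is a hypothetical helper, and the regular expression is my assumption about the format rather than something documented by the site.

```python
import re

# Matches amounts such as "£28,000" or "£32,000.50" (assumed format)
_AMOUNT = re.compile(r"£\s*(\d[\d,]*(?:\.\d+)?)")


def parse_salary_range(text):
    """Return (low, high) as floats, or None when no amount is given."""
    amounts = [float(m.replace(",", "")) for m in _AMOUNT.findall(text)]
    if not amounts:
        return None  # e.g. "Competitive salary"
    return (min(amounts), max(amounts))


print(parse_salary_range("£28,000 - £32,000 per annum"))
print(parse_salary_range("Competitive salary"))
```

A helper like this could be applied to the loaded data with `df_original['salary'].apply(parse_salary_range)`. It does not distinguish annual salaries from daily or hourly rates, so those would need separate handling.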
