# Scrape and Analyze Data Analyst Job Requirements with Python

---

### Overview

- In this project, you will step into the shoes of an entry-level data analyst at a medium-sized recruitment agency and help it improve how it sources job vacancies.

---

### Project Scenario

The team at the recruitment agency wants to improve how it sources job vacancies. The agency relies on multiple job posting sites to identify openings for its clients, but manually searching each site is time-consuming and often leads to missed opportunities.

They want you to build web scraping tools that automatically extract job posting data from multiple job posting sites, and then analyze the collected data. The team will use your analysis to deliver job vacancies to its clients more efficiently. Getting relevant openings to clients more quickly gives them a competitive advantage over other applicants.

---

### Learning Objectives

- Increase the efficiency of job vacancy sourcing
- Improve the quality of job vacancy sourcing
- Gain a competitive advantage

---

## Step 1: Importing Required Libraries

In [82]:
import time
import httpx
import pandas as pd
from typing import Optional
from selectolax.parser import HTMLParser
from dataclasses import dataclass, asdict

## Step 2: Create Schema of Output Data

In [64]:
@dataclass
class JobPost:
    job_title: Optional[str] = None
    location: Optional[str] = None
    job_description: Optional[str] = None
    platform: Optional[str] = None
    source_link: Optional[str] = None
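
As a minimal sketch of how this schema might be used, the snippet below fills a `JobPost` with made-up values and converts it to a plain dictionary with `asdict`, which is the row shape `pd.DataFrame` accepts. The field values are illustrative, not scraped data.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class JobPost:
    job_title: Optional[str] = None
    location: Optional[str] = None
    job_description: Optional[str] = None
    platform: Optional[str] = None
    source_link: Optional[str] = None


# Illustrative values only; fields that are not supplied fall back to None.
post = JobPost(job_title="Data Analyst", location="Remote", platform="flexjobs")

# asdict turns the dataclass into a dictionary keyed by field name.
row = asdict(post)
print(row)
```

Because every field defaults to `None`, a posting with a missing location or description still produces a complete row with the same keys.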

## Step 3: Fetch or Define Job Posting Sources

In [65]:
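# httpx appends the "?" and the query string itself, so the base URL omits it
url = "https://www.flexjobs.com/search"
# Search terms used by the scraping loop in Step 5
job_titles = ["data science", "data analyst", "data engineer"]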
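
To make the query mechanics concrete, here is a small sketch of how each search term becomes part of a request URL. It uses the standard library's `urlencode` instead of httpx so that it runs without network access, and it assumes the site reads a `search` query parameter, as the loop in Step 5 does. `base_url` and `search_urls` are names chosen for this illustration.

```python
from urllib.parse import urlencode

base_url = "https://www.flexjobs.com/search"
job_titles = ["data science", "data analyst", "data engineer"]

# urlencode escapes characters that are not URL-safe; spaces become "+".
search_urls = [f"{base_url}?{urlencode({'search': title})}" for title in job_titles]

for search_url in search_urls:
    print(search_url)
```

httpx builds a comparable URL from its `params` argument, although the exact escaping of spaces may differ.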

## Step 4: Define Functions for the Scraping Process

In [68]:
def get_html(url, querystring=None, **kwargs):
    """Return the parsed HTML of a URL fetched with httpx, or None if the request fails."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
    }
    # Append the page number to the path when one is supplied, e.g. get_html(url, page=2)
    page = kwargs.get("page")
    target = f"{url}/{page}" if page else url

    try:
        response = httpx.get(target, headers=headers, params=querystring, follow_redirects=True)
        response.raise_for_status()
    except httpx.HTTPStatusError as exc:
        # A 4xx/5xx status; when paging, this may mean the last page has been passed
        print(f"Error response {exc.response.status_code} while requesting {exc.request.url!r}. \nPage limit may have been exceeded...")
        return None
    except httpx.RequestError as exc:
        # Network-level failures such as timeouts or refused connections
        print(f"Request to {target!r} failed: {exc}")
        return None

    return HTMLParser(response.text)


def extract_nodes(website_html, selector):
    """Return the nodes matching a CSS selector, or an empty list if the HTML is missing."""
    try:
        return website_html.css(selector)
    except AttributeError:
        # website_html was not a parsed document (for example None), so there is nothing to iterate over
        return []


def get_entry(job_title, location, job_description, platform, source_link):
    """Create one output row that follows the JobPost schema from Step 2."""
    # Building the row through the dataclass keeps its keys in sync with the schema
    post = JobPost(
        job_title=job_title,
        location=location,
        job_description=job_description,
        platform=platform,
        source_link=source_link,
    )
    return asdict(post)
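
`get_html` accepts an optional `page` keyword, but the steps here only request the first results page of each query. The sketch below shows one way further pages could be walked: keep requesting pages until the fetcher returns a falsy value or a page cap is reached. `iter_pages`, `fetch`, and `max_pages` are hypothetical names, and a stand-in fetcher replaces real network calls so the snippet runs offline.

```python
def iter_pages(fetch, max_pages=50):
    """Yield (page_number, result) pairs until fetch returns a falsy value."""
    for page in range(1, max_pages + 1):
        result = fetch(page)
        if not result:
            # A failed or empty page is treated as the end of the results.
            break
        yield page, result


# Stand-in for a real fetcher such as:
#   lambda page: get_html(url, querystring={"search": q}, page=page)
fake_site = {1: "<li class='job'>A</li>", 2: "<li class='job'>B</li>"}


def fake_fetch(page):
    # Returns None for pages that do not exist, mimicking a failed request.
    return fake_site.get(page)


pages = list(iter_pages(fake_fetch))
print([number for number, _ in pages])
```

In real use, a pause between requests (like the `time.sleep(5)` in Step 5) would belong inside the loop that consumes `iter_pages`, and whether the site actually serves pages at `url/<page>` would need to be checked.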

## Step 5: Main Function

In [83]:
platform = "flexjobs"
job_postings = []

# Loop through all job title queries
for q in job_titles:

    # Get raw html
    html = get_html(url, querystring={"search": q})
    
    if not html:
        continue
        
    nodes = extract_nodes(html, "li.job")
    
    for node in nodes:
        
        # Each field falls back to "null" when its element or attribute is missing
        try:
            job_title = node.attrs["data-title"]
        except (AttributeError, KeyError):
            job_title = "null"

        try:
            location = node.css_first("div.job-locations").text().strip()
        except AttributeError:
            location = "null"

        try:
            job_description = node.css_first("div.job-description").text().strip()
        except AttributeError:
            job_description = "null"

        try:
            source_link = "https://www.flexjobs.com" + node.css_first("a.job-title.job-link").attrs["href"]
        except (AttributeError, KeyError, TypeError):
            source_link = "null"
        
        entry = get_entry(job_title, location, job_description, platform, source_link)
        job_postings.append(entry)
    
    time.sleep(5)
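
The four `try`/`except` blocks in the loop follow one pattern: attempt an extraction and fall back to `"null"`. One possible way to factor that out is a small helper that takes a zero-argument callable. `safe_extract` is a hypothetical name, and a plain dictionary and `None` stand in for selectolax objects so the example runs without any HTML. Unlike the loop above, this sketch also replaces empty results with the default.

```python
def safe_extract(getter, default="null"):
    """Return getter(), or default if the lookup fails or yields an empty value."""
    try:
        value = getter()
    except (AttributeError, KeyError, TypeError):
        return default
    return value if value else default


# Stand-ins for node.attrs and for a css_first() call that found nothing.
fake_attrs = {"data-title": "Data Analyst"}
missing_node = None

title = safe_extract(lambda: fake_attrs["data-title"])
location = safe_extract(lambda: missing_node.text().strip())

print(title, location)
```

With a helper like this, each field in the loop would reduce to a single line, at the cost of wrapping every lookup in a `lambda`.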

In [84]:
df = pd.DataFrame(job_postings)

In [85]:
df

| | job_title | location | job_description | platform | source_link |
|---|---|---|---|---|---|
| 0 | Data Science Manager | US National | Manage a team of remote sensing scientists and... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=192... |
| 1 | Data Science Practice Director | Washington, DC | Lead and manage the Data Science Practice area... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=191... |
| 2 | Data Science Leader | United Kingdom | Lead and execute a comprehensive data science ... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=191... |
| 3 | Senior Director, Data Science | US National | Lead management science and statistical modeli... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=190... |
| 4 | Manager, Data Science | US National | Lead and manage a team of data scientists. Dev... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=190... |
| ... | ... | ... | ... | ... | ... |
| 145 | Data Engineering | Chicago, IL | Design, develop, and deliver data products to ... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=192... |
| 146 | Data Engineer | Lisbon, Portugal | Develop and maintain scalable ETL processes. I... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=192... |
| 147 | Data Engineer | US National | Develop, test, and deploy data pipelines, main... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=192... |
| 148 | Data Engineer | Søborg, Denmark | Collaborate with cross-functional teams to des... | flexjobs | https://www.flexjobs.com/HostedJob.aspx?id=193... |
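
Because the three search terms overlap, the same posting can plausibly appear under more than one query. The sketch below shows one way to drop such repeats by `source_link` and then count postings per location. It uses only the standard library and made-up rows with placeholder `example.com` links, not the scraped results.

```python
from collections import Counter

# Made-up rows in the same shape as the scraped entries; the second repeats the first.
rows = [
    {"job_title": "Data Engineer", "location": "US National", "source_link": "https://example.com/job/1"},
    {"job_title": "Data Engineer", "location": "US National", "source_link": "https://example.com/job/1"},
    {"job_title": "Data Analyst", "location": "Chicago, IL", "source_link": "https://example.com/job/2"},
    {"job_title": "Data Scientist", "location": "US National", "source_link": "https://example.com/job/3"},
]

# Keep the first row seen for each source_link.
seen_links = set()
unique_rows = []
for row in rows:
    if row["source_link"] not in seen_links:
        seen_links.add(row["source_link"])
        unique_rows.append(row)

# Count how many unique postings each location has.
location_counts = Counter(row["location"] for row in unique_rows)

print(len(unique_rows))
print(location_counts.most_common())
```

On the DataFrame itself, `df.drop_duplicates(subset="source_link")` and `df["location"].value_counts()` are the pandas counterparts of these two steps.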
