# Web Scraping Job Vacancies

## Introduction

In this project, we'll build a web scraper that extracts job listings from a job search platform. For each listing we'll collect the job title, company name, and location, where available, and save the results to a CSV file.

## Step 1: Importing Required Libraries

In [33]:
import requests
from bs4 import BeautifulSoup
import csv

## Step 2: Generating a URL and Fetching the Page with a Function

In [None]:
requests.get("https://absolventa.de")

In [35]:
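def generate_url(position="", location=""):
    # Build the search URL; spaces in the position become "+" in the query string
    position = position.lower().replace(" ", "+")
    location = location.lower()
    search_url = f"https://www.absolventa.de/jobs?text={position}&location={location}"
    # Fetch the results page; fail fast on timeouts and HTTP error codes
    response = requests.get(search_url, timeout=10)
    response.raise_for_status()

    # Name the parser explicitly so bs4 doesn't pick one and emit a warning
    soup = BeautifulSoup(response.text, "html.parser")

    return soup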
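
Replacing spaces with `+` covers simple searches, but characters such as `&` or umlauts would still reach the query string unescaped. As a minimal sketch, assuming the same `text` and `location` parameters, the standard library's `urlencode` can build the query instead. The helper name `build_search_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.absolventa.de/jobs"  # same endpoint used in generate_url


def build_search_url(position="", location=""):
    """Return the search URL with both parameters percent-encoded."""
    # urlencode escapes reserved characters and turns spaces into "+"
    query = urlencode({"text": position.lower(), "location": location.lower()})
    return f"{BASE_URL}?{query}"


print(build_search_url("Data Analyst", "Berlin"))
# https://www.absolventa.de/jobs?text=data+analyst&location=berlin
```

Keeping URL construction in its own function also makes it possible to check the generated URL without sending a request.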

## Step 3: Extract the Job Data from a Single Job Posting Card

In [36]:
results = []

def job_posting(soup):
    postings = soup.find_all("a", class_="flex flex-col gap-sm bg-white rounded-xl p-sm border border-1 border-outline hover:border-primary aria-current-location:border-primary text-secondary hover:text-primary active:shadow-input aria-current-location:shadow-input block aria-current-location:cursor-default h-full")

    if postings:
        for posting in postings:
            try:
                title = posting.find("h2", class_="text-secondary @lg:hidden break-words hyphens-auto leading-[160%] tracking-tight text-[1rem] font-bold")
                company = posting.find("span", class_="text-secondary break-words hyphens-auto leading-[160%] tracking-tight text-[0.875rem]")
                location = posting.find("ul", class_="flex flex-wrap flex-col items-start text-xs md:flex-row @md:flex-row gap-xs text-tertiary-500 fill-tertiary-500 md:gap-sm @md:gap-sm")

                results.append({
                    "Job position": title.text.strip() if title else "Not available", 
                    "Company name": company.text.strip() if company else "Not available", 
                    "Job location": location.text.strip() if location else "Not available"})

                print("Title:", title.text.strip() if title else "Not available")
                print("Company:", company.text.strip() if company else "Not available")
                print("Location:", location.text.strip() if location else "Not available")

                print("+" * 40)
            
            except AttributeError as e:
                print(f"Error parsing a job posting: {e}")
                results.append({
                    "Job position": "Error",
                    "Company name": "Error",
                    "Job location": "Error"
                })
    else:
        print("There are currently no job openings.")
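
The `x.text.strip() if x else "Not available"` guard appears six times in `job_posting`. One possible way to tidy this up is sketched below: a small helper produces the fallback text, and a second one builds the record for a single card. Both names (`text_or_default`, `card_to_record`) are hypothetical, and `SimpleNamespace` objects stand in for parsed tags here only because the helper needs nothing beyond a `.text` attribute.

```python
from types import SimpleNamespace

FALLBACK = "Not available"


def text_or_default(element, default=FALLBACK):
    """Return the stripped text of a parsed element, or a default if it is missing."""
    return element.text.strip() if element is not None else default


def card_to_record(title, company, location):
    """Build one result row from the three elements found in a job card."""
    return {
        "Job position": text_or_default(title),
        "Company name": text_or_default(company),
        "Job location": text_or_default(location),
    }


# Made-up stand-ins for the elements returned by posting.find(...)
record = card_to_record(
    SimpleNamespace(text="  Data Analyst (m/w/d) "),
    None,  # simulates a card without a company element
    SimpleNamespace(text="Berlin\n"),
)
print(record)
```

Returning the record instead of appending to a shared list would also let the caller decide where the rows are collected.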

## Step 4: Define the Main Function

In [37]:
def main(position, location):
    # Start from an empty list so re-running the cell doesn't duplicate rows
    results.clear()
    soup = generate_url(position, location)
    job_posting(soup)

    with open("job_postings_result.csv", "w", encoding="utf-8", newline="") as file:
        fieldnames = ["Job position", "Company name", "Job location"]
        writer = csv.DictWriter(file, fieldnames=fieldnames)

        writer.writeheader()
        writer.writerows(results)
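
The CSV step can also be tried on its own, without scraping anything. The sketch below assumes the same three column names as `main` and writes to any file-like object; the helper name `write_postings` and the sample row are made up for illustration.

```python
import csv
import io

FIELDNAMES = ["Job position", "Company name", "Job location"]


def write_postings(rows, file):
    """Write a header line followed by one CSV row per job posting."""
    writer = csv.DictWriter(file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row, written to an in-memory buffer instead of a file on disk
rows = [
    {"Job position": "Data Analyst", "Company name": "Example GmbH", "Job location": "Berlin"},
]
buffer = io.StringIO()
write_postings(rows, buffer)
print(buffer.getvalue())
```

Passing the rows in as an argument, rather than reading a module-level list, keeps the writer independent of how the postings were collected.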

In [None]:
main(position="data analyst", location="berlin")

## Step 5: Conclusions

- Finding a job board that tolerates scraping was challenging. When I tried sites such as Indeed, Monster, and Glassdoor, I consistently received a 403 error code. Researching the error turned up several suggestions (custom headers, proxies, and similar techniques), but none of them worked for me. After some more research I settled on the Germany-based job board [absolventa.de](https://absolventa.de), which proved very useful and has straightforward URL parameters.
- I then defined three functions: `generate_url`, `job_posting`, and `main`, keeping each step separate for flexibility. `generate_url` builds the URL from the parameters provided by the user, requests the page, and returns a `soup` object. `job_posting` takes this `soup` as input, parses the HTML, and retrieves all job postings, if any are available. Finally, `main` orchestrates the process by calling the other two functions and writing the results to a CSV file, which keeps the scraper simple to use.