# Automation of Job Search in Guatemala

### Developed by

**Cristofer Darwin Berganza**

Email 1: cdberganza@gmail.com

Email 2: cdberganza@proton.me

GitHub: [GitHub portfolio](https://github.com/cdberganza/portfolio)

LinkedIn: [LinkedIn profile](https://www.linkedin.com/in/darwin-berganza/)

If you have any questions or comments about this project, feel free to reach out to me.


### Table of Contents
1. [Introduction](#introduction)
2. [Project Description](#Project-Description)
3. [Code](#Code)
4. [Capture of the generated file](#Capture-of-the-generated-file)
5. [Conclusions](#Conclusions)

## Introduction

This project automates the search for job offers on job websites in Guatemala; the current version targets Computrabajo. The solution allows the company's employees to quickly obtain CSV files with information about available vacancies, saving the time and effort of searching manually.

## Project Description

The main objective of this project is to make searching for job vacancies easier by automating the collection of information from job websites. The user enters the job title or category to search for, and the program generates a CSV file containing the offers found on the first eight result pages.

Function and variable names are written in English, following common programming practice. However, because the tool will be used by Spanish speakers in Guatemala, the messages shown through `input()` and `print()` and the CSV column headers are in Spanish. This keeps the tool accessible and user-friendly for the target audience.

The process is divided into several key stages:
1. **URL Generation**: A dynamic URL is generated based on the job title or category entered by the user.
2. **Data Extraction**: Using `BeautifulSoup`, relevant details from each job listing are extracted, such as job title, company, location, salary, and the link to the offer.
3. **Saving to CSV**: The extracted data is saved in a CSV file, providing a clear and structured format that users can use to review the offers or share them.

The code is structured into several functions that handle different aspects of the process, from URL generation to CSV file writing, making it easier to maintain and understand.

## Code

This Jupyter Notebook contains a set of functions that work together to extract job offers from a website and save them into a CSV file. As you progress through the code, you will see how each function has a specific purpose, and how they integrate to achieve the overall goal of the project.

We invite you to read the code, experiment with it, and run it in your local environment. Feel free to make modifications to adapt it to your needs or to add new functionalities. Enjoy the learning process!


### Installing Libraries

This cell contains the command that installs the libraries required by the project. The command is commented out: if the libraries are not already installed in your working environment, remove the leading `# ` and run the cell before executing the rest of the code.

In [None]:
# !pip install requests beautifulsoup4 numpy

### Importing Libraries

In this cell, the necessary libraries for the project are imported:

- `requests`: to make HTTP requests and access web pages.
- `BeautifulSoup`: to parse the HTML content of web pages.
- `numpy`: to represent missing values as `NaN` (`np.nan`).
- `csv`: to work with CSV files and save the extracted data.

In [None]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import csv

### URL Generation

In this cell, the function `gen_url()` is defined, allowing the user to enter the job title or category they wish to search for. The function performs the following tasks:

1. Prompts the user to enter the job title or category via input.
2. Replaces spaces in the user's input with hyphens to form part of the URL.
3. Generates a specific URL to search for job offers on the Computrabajo website in Guatemala.
4. Prints the generated URL for the user to see.
5. Returns the generated URL.

This function is essential for accessing job offers in an automated manner.

In [None]:
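def gen_url():

    # Ask for the job title or category (the prompt is shown in Spanish).
    job = input('Introduce el cargo o categoría: ')
    # Trim surrounding spaces, then replace inner spaces with hyphens for the URL.
    job_mod = job.strip().replace(' ', '-')
    # The URL ends in '?p=' so that the page number can be appended later.
    url = f'https://gt.computrabajo.com/trabajo-de-{job_mod}?p='
    print(f'URL generada: {url}')
    return url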
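
The URL-building step can also be separated from the `input()` prompt so that it can be checked without user interaction. The following is a minimal sketch rather than the notebook's own code: the helper name `build_search_url` and the `BASE_URL` constant are hypothetical, and the whitespace handling is an assumption about what users might type.

```python
BASE_URL = 'https://gt.computrabajo.com/trabajo-de-'  # same prefix used in gen_url()

def build_search_url(job: str) -> str:
    """Build the paginated search URL for a job title or category (hypothetical helper)."""
    # split() drops leading/trailing spaces and collapses repeated ones,
    # so '  asesor   de ventas ' becomes 'asesor-de-ventas'.
    slug = '-'.join(job.split())
    # The URL ends in '?p=' so that a page number can be appended later.
    return f'{BASE_URL}{slug}?p='

print(build_search_url('  asesor   de ventas '))
```

With a helper like this, `gen_url()` would only need to read the input and pass it along, which keeps the prompt in Spanish while leaving the URL logic easy to test.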

### Extraction of Job Offers

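In this cell, the function `extract_offers()` is defined, responsible for extracting job offers from the result pages of the URL generated by `gen_url()`. It takes no arguments and reads that URL from the global variable `url`, so `gen_url()` must be run first. The function performs the following tasks: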

1. Sets a `User-Agent` header to simulate a request from a web browser and avoid being blocked by the server.
2. Initializes an empty list `offers` to store the found job offers.
3. Uses a `for` loop to iterate over the first eight result pages (from 1 to 8).
   - For each page, it makes an HTTP request using the generated URL and the established header.
   - Prints the status code of the response for each page.
   - Parses the content of the response using `BeautifulSoup`.
   - Searches for all `article` elements with the class `box_offer`, which contain job offers.
   - Adds the found offers to the `offers` list.
4. Prints the total number of offers found.
5. Returns the list of extracted offers.

This function is crucial for gathering data on available job offers across several pages.

In [None]:
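def extract_offers():
    # Note: this function reads the global variable `url` set from gen_url().

    # Browser-like User-Agent so the request resembles one sent by a web browser.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    offers = []
    for i in range(1, 9):
        # The page number is appended after '?p='; the timeout avoids waiting forever.
        response = requests.get(url + str(i), headers=headers, timeout=30)
        print(f'Status code {i}: {response.status_code}')

        soup = BeautifulSoup(response.content, 'html.parser')
        # Each job offer is an <article> element with the class 'box_offer'.
        offers_soup = soup.find_all('article', class_='box_offer')

        offers.extend(offers_soup)

    print(f'Ofertas encontradas: {len(offers)}')

    return offers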
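
The loop above always requests eight pages, even when a search has fewer results. A possible refinement is to stop at the first page that yields no offers. The sketch below illustrates only that control flow, under the assumption that a page past the last result contains no `box_offer` elements. `collect_offers`, `fake_fetch`, and the `example.test` URLs are hypothetical stand-ins, and the fetcher is passed in as a parameter so the example runs without network access.

```python
from typing import Callable, List

def collect_offers(base_url: str,
                   fetch_page: Callable[[str], List[str]],
                   max_pages: int = 8) -> List[str]:
    """Collect offers page by page, stopping early at the first empty page."""
    offers: List[str] = []
    for page in range(1, max_pages + 1):
        # The page number is appended to the base URL, as in extract_offers().
        page_offers = fetch_page(f'{base_url}{page}')
        if not page_offers:
            break  # assumed to mean there are no more results
        offers.extend(page_offers)
    return offers

# Stand-in fetcher for illustration: pretends only two result pages exist.
fake_site = {
    'https://example.test/trabajo-de-demo?p=1': ['offer A', 'offer B'],
    'https://example.test/trabajo-de-demo?p=2': ['offer C'],
}
requested = []

def fake_fetch(page_url: str) -> List[str]:
    requested.append(page_url)
    return fake_site.get(page_url, [])

offers_found = collect_offers('https://example.test/trabajo-de-demo?p=', fake_fetch)
print(offers_found)
print(f'Pages requested: {len(requested)}')
```

In the real notebook, the fetcher would wrap the `requests.get()` and `find_all()` calls; here it is replaced by a dictionary lookup.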

### Data Extraction

In this cell, the function `extract_data(soup)` is defined, responsible for extracting specific information from a single job offer, passed in as the BeautifulSoup element `soup`. The function performs the following tasks:

1. Attempts to extract the job title:
   - Searches for the link corresponding to the job title using the class `js-o-link fc_base`.
   - If the title is found, it is stripped of whitespace. If not, `NaN` from NumPy is assigned.

2. Attempts to extract the company name:
   - Searches for the link corresponding to the company name using the class `fc_base t_ellipsis`.
   - If the name is found, it is stripped of whitespace. If not, `NaN` is assigned.

3. Attempts to extract the location:
   - Searches for the paragraph corresponding to the location using the class `fs16 fc_base mt5`.
   - If the location is found, it is stripped of whitespace. If not, `NaN` is assigned.

4. Attempts to extract the salary:
   - Searches for the div corresponding to the salary using the class `fs13 mt15`.
   - If the salary is found, it is stripped of whitespace. If not, `NaN` is assigned.

5. Generates the job offer URL by concatenating the link from the title with the base URL of Computrabajo.

6. Returns a list containing the job title, company, location, salary, and offer URL.

This function is essential for structuring and storing the relevant information of each found job offer while managing potential errors in data extraction.

In [None]:
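def extract_data(soup):
    # find() returns None when an element is missing, so reading .text raises
    # AttributeError; in that case the field is stored as NaN.

    job_title_soup = soup.find('a', class_='js-o-link fc_base')
    try:
        job_title = job_title_soup.text.strip()
    except AttributeError:
        job_title = np.nan

    try:
        company_soup = soup.find('a', class_='fc_base t_ellipsis')
        company = company_soup.text.strip()
    except AttributeError:
        company = np.nan

    try:
        location_soup = soup.find('p', class_='fs16 fc_base mt5')
        location = location_soup.text.strip()
    except AttributeError:
        location = np.nan

    try:
        salary_soup = soup.find('div', class_='fs13 mt15')
        salary = salary_soup.text.strip()
    except AttributeError:
        salary = np.nan

    try:
        # Build the offer URL from the title link; it is NaN if the link is missing.
        job_url = 'https://gt.computrabajo.com/' + job_title_soup['href']
    except (TypeError, KeyError):
        job_url = np.nan

    return [job_title, company, location, salary, job_url]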
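
One detail worth knowing about the lookups above: when `class_` is given a string with several class names, BeautifulSoup matches it against the exact value of the `class` attribute, so the classes must appear in that same order in the page. A CSS selector passed to `select_one()` does not depend on the order. The sketch below shows the difference on an invented HTML snippet (not Computrabajo's actual markup). The helper `text_or_nan` is hypothetical, and `math.nan` stands in for `np.nan` to keep the example small.

```python
import math
from bs4 import BeautifulSoup

# Invented markup for illustration only; the real page structure may differ.
html = '''
<article class="box_offer">
  <a class="fc_base js-o-link" href="/ofertas/1">Asesor de ventas</a>
</article>
'''
offer = BeautifulSoup(html, 'html.parser').find('article', class_='box_offer')

def text_or_nan(tag, selector):
    """Return the stripped text of the first match, or NaN if nothing matches."""
    element = tag.select_one(selector)
    return element.get_text(strip=True) if element is not None else math.nan

# Exact-string lookup: the classes are in a different order here, so nothing is found.
exact = offer.find('a', class_='js-o-link fc_base')
# CSS selector lookup: matches regardless of the order of the classes.
title = text_or_nan(offer, 'a.js-o-link.fc_base')
# The snippet has no salary element, so the helper falls back to NaN.
salary = text_or_nan(offer, 'div.fs13.mt15')

print(exact, title, salary)
```

Whether the selector form is preferable depends on how stable the site's markup is; the exact-string form is stricter, while the selector form tolerates reordered classes.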

### Data Generation

In this cell, the function `gen_data(offers)` is defined, responsible for generating a list of structured data from the extracted job offers. The function performs the following tasks:

1. Initializes an empty list called `data`, which will store the information extracted from each job offer.

2. Uses a `for` loop to iterate over the offers received as an argument:
   - For each offer, it calls the `extract_data()` function to extract the relevant information.
   - Appends the result to the `data` list.

3. Returns the `data` list, which contains the structured information of all processed offers.

This function is essential for compiling the extracted data in an organized manner, facilitating its later use in the CSV file.

In [None]:
def gen_data(offers):

    data = []

    # Extract the fields of each offer and collect them as one row per offer.
    for offer in offers:
        data.append(extract_data(offer))

    return data

### File Name Generation

In this cell, the function `gen_file_name()` is defined, allowing the user to specify the name of the CSV file that will be generated. The function performs the following tasks:

1. Prompts the user to enter the file name through input.
2. Appends the `.csv` extension to the name provided by the user.
3. Returns the full name of the CSV file.

This function is useful for customizing the name of the output file, making it easier to identify and manage.

In [None]:
def gen_file_name():
    file_name = input('Ingrese el nombre del archivo CSV: ')
    file_name = file_name+'.csv'
    return file_name  
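
Because the extension is always appended, typing `ofertas.csv` at the prompt would produce `ofertas.csv.csv`. A small guard can avoid that. This is a minimal sketch with a hypothetical helper name, `normalize_csv_name`, and it is not part of the original notebook.

```python
def normalize_csv_name(name: str) -> str:
    """Return the name with a single '.csv' extension (hypothetical helper)."""
    # Remove accidental spaces around the typed name.
    name = name.strip()
    # Keep the name unchanged if the user already typed the extension.
    if name.lower().endswith('.csv'):
        return name
    return name + '.csv'

print(normalize_csv_name('ofertas'))
print(normalize_csv_name('ofertas.csv'))
```

`gen_file_name()` could pass the value read from `input()` through this helper before returning it.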

### Writing CSV File

In this cell, the function `write_csv(file_name, data)` is defined, responsible for writing the extracted data to a CSV file. The function performs the following tasks:

1. Defines a list `columns` containing the names of the columns in the CSV file: `Puesto`, `Empresa`, `Ubicación`, `Salario`, and `Enlace`.

2. Opens a CSV file in write mode (`'w'`) with UTF-8 encoding and `newline=''`, so that the `csv` module controls the line endings:
   - If the file already exists, it will be overwritten.

3. Creates a `writer_csv` object using the `csv` module, allowing writing to the CSV file.

4. Writes the header row to the CSV file using `writer_csv.writerow(columns)`.

5. Iterates over the received data and writes each offer to the CSV file using `writer_csv.writerow(offer)`.

This function is fundamental for saving the extracted information in a structured manner in a CSV file located in the current directory.

In [None]:
def write_csv(file_name, data):

    columns = ['Puesto', 'Empresa', 'Ubicación', 'Salario', 'Enlace']
    
    with open(file_name, mode='w', newline='', encoding='utf-8') as csv_file:
        
        writer_csv = csv.writer(csv_file)
        
        writer_csv.writerow(columns)
        
        for offer in data:
            writer_csv.writerow(offer)
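
Note that `csv.writer` converts non-string values with `str()`, so a missing field stored as `NaN` ends up in the file as the text `nan`. If empty cells are preferred, the values can be replaced just before writing. The sketch below writes to an in-memory buffer so the result can be inspected without creating a file. The helper `rows_to_csv_text` and the sample row are invented for illustration.

```python
import csv
import io
import math

def rows_to_csv_text(columns, rows):
    """Render rows as CSV text, writing NaN fields as empty cells (hypothetical helper)."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(columns)
    for row in rows:
        # np.nan is a regular Python float, so this check also covers it.
        cleaned = ['' if isinstance(value, float) and math.isnan(value) else value
                   for value in row]
        writer.writerow(cleaned)
    return buffer.getvalue()

columns = ['Puesto', 'Empresa', 'Ubicación', 'Salario', 'Enlace']
# Sample row invented for illustration: company and salary are missing.
rows = [['Asesor de ventas', math.nan, 'Guatemala', math.nan,
         'https://example.test/oferta-1']]

csv_text = rows_to_csv_text(columns, rows)
print(csv_text)
```

The same replacement could be applied inside `write_csv()` before each `writerow()` call.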

### Executing Functions to Obtain Data

In this cell, the following operations are performed:

1. The function `gen_url()` is called to generate the search URL for job offers based on the category or job title entered by the user. The generated URL is stored in the variable `url`.

2. The function `extract_offers()` is called to extract job offers from the pages of the generated URL. The extracted offers are stored in the variable `offers`.

3. The function `gen_data(offers)` is called to process the extracted offers and generate a list of structured data. This list is stored in the variable `data`.

4. A message is printed indicating that the list of data has been generated and that the user can proceed to create the CSV file.

This cell is key to executing the main flow of the program, integrating all the previously defined functions to obtain and organize the necessary data.

In [None]:
url = gen_url()
offers = extract_offers()
data = gen_data(offers)
print('Lista de datos generada, ya puedes generar el archivo CSV')

### Generation and Writing of the CSV File

In this cell, the following operations are performed to generate and save the CSV file:

1. The function `gen_file_name()` is called to prompt the user for the name of the CSV file. The generated name is stored in the variable `file_name`.

2. The function `write_csv(file_name, data)` is called to write the list of structured data to the CSV file with the name specified by the user.

3. A confirmation message is printed indicating that the CSV file has been successfully saved.

This cell is essential to finalize the program process, allowing the user to obtain a CSV file with the collected job offers.

In [None]:
file_name = gen_file_name()
write_csv(file_name, data)
print('Archivo CSV guardado con éxito')

## Capture of the generated file

Below is a screenshot of the file generated by a search for "asesor de ventas" (sales advisor) job offers.

![Captura del archivo generado](https://raw.githubusercontent.com/cdberganza/job_scraping_gt/refs/heads/main/img/capture_3_asesor_de_ventas.png)

## Conclusions

This project provides a working solution for automating the search for job offers on job websites in Guatemala, starting with Computrabajo. Through a small set of functions, it extracts the relevant information from each offer and organizes it into a CSV file.

**Key Points:**

- **Automation:** Automating the search for job offers saves time and effort compared to manual searching.
- **Data Structuring:** Collecting data in a CSV file facilitates its analysis and further use.
- **Accessibility:** Communication in Spanish and user-centered design make the tool accessible to the target audience.

**Future Improvements:**

- **Graphical Interface:** Consider developing a graphical interface to further facilitate the use of the program.
- **Error Handling:** Implement more robust error handling to address potential changes in the source websites.
- **Expansion of Sources:** Explore the possibility of including more data sources to offer a wider range of job offers.

This project has not only been a great opportunity to apply programming and web scraping skills but can also serve as a foundation for future developments in the job search domain.
