# Web Scraping Selenium-Pandas Pipeline: Extraction, Transformation, and Analysis of Voting Data for the Supreme Court Election in Guatemala 2024

## Introduction

The election of Supreme Court magistrates in Guatemala is one of the most significant and complex events within the country’s political and judicial system. This project aims to analyze this process through advanced techniques for data extraction, cleaning, and transformation, culminating in a network analysis to understand voting patterns and the connections between congressmen and candidates.

#### Main Objective  
The project seeks to extract data from the official website of the Congress of the Republic of Guatemala, where the voting results for each magistrate candidate are published. Using this data, an **adjacency matrix** will be generated to represent the relationships between congressmen (who voted for whom) and candidates (who received support). This matrix will be used for network analysis with the Gephi tool, providing a clear visualization of the political and social dynamics in the process.

#### Technical Challenges  
The project faced several challenges, including:  
1. **Dynamic Website Content**:  
   - The use of JavaScript to load voting data made direct extraction with traditional tools like Requests or BeautifulSoup infeasible.  
   - The solution was to implement Selenium to interact with the dynamic content and capture the necessary data.  

2. **Fragmented Data Structure**:  
   - Voting results are divided by candidate across individual pages, requiring consolidation of information from multiple sources into a cohesive dataset.  

3. **Data Transformation**:  
   - Votes ("A FAVOR," "CONTRA," null) needed to be standardized into a binary format (`1` for "A FAVOR" and `0` for "CONTRA"/null) to facilitate analysis.  
   - Creating the adjacency matrix required structuring the data in a format suitable for effective network analysis.  

#### Methodology  
The project followed a systematic approach that included:  
1. **Data Extraction**: Using Selenium and BeautifulSoup to collect voting information directly from the Congress website.  
2. **Cleaning and Transformation**: Utilizing pandas to organize, clean, and standardize the data, converting it into a structured format.  
3. **Network Analysis**: Generating an adjacency matrix and performing visual analysis in Gephi to identify patterns of support and political connections.  
4. **Result Exportation**: Producing CSV files containing intermediate and final data, ensuring reusability and transparency.

#### Relevance  
Analyzing the voting process is valuable not only from a technical perspective but also from a social and political one. Understanding how congressmen vote in such significant elections reveals power dynamics, support patterns, and potential external influences on the judicial system. By presenting the results in a visual and structured manner, this project contributes to transparency and fosters informed debate about judicial independence in Guatemala.

This introduction provides a broad context, explaining the technical and social importance of the project, as well as the challenges and solutions implemented to achieve its objectives.

## Social and Political Context of the Supreme Court Magistrates Election in Guatemala

In Guatemala, the election of Supreme Court of Justice (CSJ) magistrates is a critical event that directly impacts the country's judicial system and public trust in its institutions. This process, conducted by the Congress of the Republic, holds significant social and political importance.

#### Social Importance  
The Supreme Court of Justice is the highest judicial authority in Guatemala, responsible for ensuring the application of laws and protecting the fundamental rights of citizens. The magistrates who serve on this court have the power to influence crucial decisions regarding corruption cases, human rights, and the overall administration of justice. As a result, their election garners significant public interest and demands for transparency and fairness.

#### Political Relevance  
The election process in Congress is designed to select magistrates from a list (nómina) prepared by the Nomination Commission, a body composed of representatives from various sectors of society. However, in recent years, this process has come under scrutiny due to allegations of politicization and potential external influences on the decisions of Congress members. This has raised concerns about judicial independence and the system's ability to ensure impartial justice.

#### Implications  
The election of magistrates not only shapes the structure of the judicial system for the coming years but also reflects the dynamics of power and political alliances within Congress. As such, monitoring and analyzing this process is important not only from a legal perspective but also from a civic and academic standpoint, as it deeply influences the development of the rule of law in Guatemala.

This social and political context highlights why a detailed analysis of the voting process, like the one conducted in this project, is essential for evaluating transparency and identifying support patterns in this decisive process.

## Dynamics of the Magistrates Election Process and Its Reflection on the Website Structure

The election of Supreme Court magistrates in Guatemala involves a complex and structured process that is reflected in the design and organization of the official website of the Congress of the Republic. This site serves as the primary source for extracting data about voting patterns, providing key insights into how the process is conducted. Below is an explanation of the election dynamics and how they are represented on the website.

#### Dynamics of the Election Process  

1. **Candidate Selection**:  
   The process begins with the Nomination Commission compiling a list of candidates (usually 26) who meet the qualifications to serve as magistrates. This list is then submitted to the Congress for voting.  

2. **Congressional Voting**:  
   During a plenary session, deputies vote for each candidate individually. The process is designed to ensure that every candidate is evaluated separately, with votes cast as either "A FAVOR" (in favor) or "CONTRA" (against). Deputies may also abstain or be absent, leading to null votes.  

3. **Majority Requirement**:  
   To be elected, a candidate must secure an absolute majority of votes from the total number of deputies in Congress. This often results in prolonged voting sessions, as the required majority can be difficult to achieve, particularly in a politically polarized environment.

4. **Public Transparency**:  
   The results of each voting session are published on the Congress website, providing details such as the voting outcome for each candidate and the individual votes of deputies. This level of transparency is intended to build public trust but also opens the process to scrutiny and analysis.

#### Reflection of the Process on the Website  

1. **Page Structure**:  
   Each voting session for a candidate is documented on a dedicated webpage. The structure of these pages includes:  
   - **Candidate Information**: Details such as the name of the candidate and the position they are being considered for.  
   - **Voting Results**: Tables showing the votes cast by deputies, categorized as "A FAVOR," "CONTRA," or "AUSENCIA" (absent/null).  
   - **Interactivity**: Drop-down menus and pagination are used to navigate and display the data, which can include hundreds of rows depending on the number of deputies.

2. **Dynamic Content**:  
   The voting data on the website is rendered dynamically using JavaScript. This means the tables and other key elements are not present in the raw HTML source, requiring tools like Selenium to interact with the website and load the necessary content.

3. **Challenges for Data Extraction**:  
   - **Pagination**: Voting data is often split across multiple pages, requiring automated navigation to collect all relevant information.  
   - **Menus and Filters**: Drop-down menus allow users to filter votes by type (e.g., "A FAVOR," "CONTRA"), which must be interacted with programmatically to access the full dataset.  
   - **Table Identification**: The tables are not labeled consistently, making it necessary to rely on positional indexing or specific identifiers to extract the correct data.

4. **Data Presentation**:  
   The website provides detailed information for each candidate but does not consolidate the results across all candidates. This fragmentation mirrors the voting process but requires additional processing to create a comprehensive dataset for analysis.

#### Key Takeaways  

The structure of the website mirrors the segmented and candidate-focused nature of the voting process in Congress. Each session is treated as an independent event, with individual results displayed transparently. However, the fragmented presentation and use of dynamic content introduce technical challenges for data extraction. These challenges are addressed in this project through automation and careful processing, allowing the raw data to be transformed into a unified format suitable for network analysis.

#### This is what the votes-in-favor section looks like on the website of the Congress of the Republic of Guatemala.

![Screenshot of the website: votes in favor](https://raw.githubusercontent.com/cdberganza/Scraping_supreme_court_election_gt_2024/refs/heads/main/images/web_vote_in_favour_screenshot.png)

#### This is what the votes-against section looks like on the website of the Congress of the Republic of Guatemala.

![Screenshot of the website: votes against](https://raw.githubusercontent.com/cdberganza/Scraping_supreme_court_election_gt_2024/refs/heads/main/images/web_vote_against_screenshot.png)

#### This is what the null-votes (absent) section looks like on the website of the Congress of the Republic of Guatemala.

![Screenshot of the website: absent or null votes](https://raw.githubusercontent.com/cdberganza/Scraping_supreme_court_election_gt_2024/refs/heads/main/images/web_null_vote_screenshot.png)

## Technical Perspective of the Project

This project focuses on the extraction, transformation, and analysis of voting data from the election process for Supreme Court magistrates in Guatemala. From a technical perspective, it combines web scraping, data processing, and network analysis to generate insights into voting patterns. Below is an overview of the technical methodology and tools used:

#### 1. **Data Extraction**  
   - **Challenge**: The data is dynamically rendered on the official website of the Congress of Guatemala using JavaScript, making traditional scraping tools like Requests and BeautifulSoup insufficient.  
   - **Solution**: Selenium, an automation tool for web browsers, was used to interact with dynamic content and extract the HTML structure. The workflow included:  
     - Navigating to specific URLs for each voting session.  
     - Interacting with dropdown menus and dynamic elements to reveal all relevant data.  
     - Capturing page source and parsing it using BeautifulSoup for structured data extraction.  

#### 2. **Data Transformation**  
   - **Preparation**: Extracted tables were cleaned and standardized to ensure consistency across candidates. Key steps included:  
     - Sorting rows alphabetically by congressmen’s names.  
     - Converting voting data into a binary format (`1` for "A FAVOR" and `0` for "CONTRA" or absence).  
     - Renaming columns and resetting indices for uniform structure.  
   - **Consolidation**: The data from individual voting sessions was transformed into a unified adjacency matrix representing congressmen as rows and candidates as columns.  
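
As a rough illustration of the consolidation step, the sketch below builds a small congressmen-by-candidates matrix from per-candidate tables. The deputy and candidate names are invented placeholders, and the outer-merge approach is an assumption chosen for brevity; it is not necessarily how the project's own code assembles the matrix.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-candidate tables: one row per deputy, 1 = "A FAVOR", 0 = otherwise.
candidate_a = pd.DataFrame({"DIPUTADO": ["DEPUTY X", "DEPUTY Y"], "CANDIDATE A": [1, 0]})
candidate_b = pd.DataFrame({"DIPUTADO": ["DEPUTY Y", "DEPUTY Z"], "CANDIDATE B": [1, 1]})


def build_matrix(frames):
    """Outer-merge the per-candidate tables on the deputy name."""
    merged = reduce(
        lambda left, right: left.merge(right, on="DIPUTADO", how="outer"), frames
    )
    # A deputy missing from a candidate's table is treated as "no support".
    return merged.set_index("DIPUTADO").fillna(0).astype(int).sort_index()


matrix = build_matrix([candidate_a, candidate_b])
print(matrix)
```

A frame like this can be written to disk with `matrix.to_csv("adjacency_matrix.csv")`; the file name is only an example.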

#### 3. **Data Export**  
   - Intermediate results, including cleaned data for each candidate and the final adjacency matrix, were exported as CSV files. This ensured compatibility with external tools for further analysis.  

#### 4. **Network Analysis**  
   - **Tool**: The adjacency matrix was imported into Gephi, a network analysis and visualization software.  
   - **Visualization**: Gephi was used to create a graph representation where:  
     - Nodes represented congressmen and candidates.  
     - Edges indicated "A FAVOR" votes, revealing patterns of support.  
   - **Outcome**: The network graph provided a clear visual representation of voting dynamics and connections.  
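
Gephi's spreadsheet importer also accepts an edge list with `Source` and `Target` columns, which can be easier to inspect than a full matrix. The sketch below converts a small matrix into such a list using only the standard library. The names and the in-memory CSV are illustrative assumptions, not the project's actual export.

```python
import csv
import io

# Hypothetical adjacency data: deputy -> {candidate: 1 if "A FAVOR" else 0}.
votes = {
    "DEPUTY X": {"CANDIDATE A": 1, "CANDIDATE B": 0},
    "DEPUTY Y": {"CANDIDATE A": 1, "CANDIDATE B": 1},
}


def to_edge_rows(matrix):
    """Return one (deputy, candidate) pair per vote in favor."""
    return [
        (deputy, candidate)
        for deputy, row in matrix.items()
        for candidate, value in row.items()
        if value == 1
    ]


edges = to_edge_rows(votes)

# Write the edge list as CSV text; swap io.StringIO for open(...) to save a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Target"])
writer.writerows(edges)
edge_csv = buffer.getvalue()
print(edge_csv)
```

Each row becomes one edge from a congressman to a candidate, so zero entries in the matrix simply produce no edge.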

#### 5. **Technological Stack**  
   - **Python Libraries**: Selenium, BeautifulSoup, pandas, and NumPy were key components of the technical implementation.  
   - **Data Formats**: CSV files were used for portability and compatibility with analytical tools.  
   - **Visualization**: Gephi served as the primary tool for network analysis and visualization.  

This technical framework highlights the project's ability to automate complex data extraction tasks, handle dynamic web content, and transform unstructured data into actionable insights for network analysis. It demonstrates a scalable and efficient approach to processing election data in a transparent and reproducible manner.

## Algorithm, code and results

**Installing Required Libraries**

If this is your first time running this notebook or you don’t have the necessary libraries installed on your workstation, a code cell is included to install the required dependencies.

This cell uses the `pip` command to install the following libraries:

1. **`selenium`**: Enables automated browser interactions to extract dynamic data from the website.
2. **`pandas`**: Used for cleaning, transforming, and analyzing data.
3. **`numpy`**: Facilitates numerical operations, such as creating the adjacency matrix.
4. **`beautifulsoup4`**: A tool for parsing and extracting structured data from HTML content.

Once the dependencies are installed, you can proceed with running the rest of the notebook seamlessly. If you already have these libraries installed, you can skip this cell without affecting the execution of the project.

In [None]:
!pip install selenium pandas numpy beautifulsoup4

### Importing Libraries

In this initial section of the project, we import the essential libraries and modules needed for data extraction, cleaning, and transformation. Below is an overview of each library's purpose:

- **pandas**: A core library for data analysis and manipulation. It will be used to work with structured data in tables (DataFrames) and perform operations such as data cleaning and transformation.
- **numpy**: A complement to pandas for advanced numerical operations, such as handling matrices and performing mathematical calculations.
- **BeautifulSoup**: A tool for extracting information from HTML and XML files. While Selenium is required for handling JavaScript-based content on the website, BeautifulSoup can assist in processing static HTML content.
- **io.StringIO**: An in-memory text buffer that behaves like a file. It is used here to pass the extracted HTML to `pd.read_html` as a file-like object.
- **sys**: Provides access to system-level variables and functions. It is used here to stop the program with `sys.exit(1)` when a page fails to load.
- **selenium**: The main library for browser automation, critical for this project as the website relies on JavaScript to load the voting tables. The imported modules from Selenium include:
  - **webdriver**: Controls the browser for data extraction.
  - **common.keys**: Simulates keyboard interactions, such as pressing specific keys.
  - **common.by**: Enables locating elements on the page using methods like ID, name, or class.
  - **support.select**: Simplifies interaction with dropdown menus.
  - **support.wait** and **expected_conditions**: Implement explicit waits to ensure dynamically loaded JavaScript elements are available before interacting with them.

These tools are fundamental to overcoming the project's main challenges: handling dynamic content from the website and transforming the extracted tables into an adjacency matrix for effective network analysis.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from io import StringIO
import sys
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

### Function: Extracting Page Soups from URLs

This function, `extract_soups(urls)`, automates the process of extracting HTML content (or "soups") from a list of URLs using Selenium. Here's a detailed breakdown of its functionality:

1. **Purpose**: 
   The function navigates through each URL provided in the `urls` list, interacts with web elements dynamically loaded by JavaScript, and collects the page source as BeautifulSoup objects for further data processing.

2. **Step-by-Step Execution**:
   - **Initialization**:
     - A Chrome WebDriver instance is created to automate browser operations.
     - An empty list, `soups`, is initialized to store the extracted HTML content.
   - **URL Iteration**:
     - The function iterates through each URL, loading it in the browser.
     - A `WebDriverWait` object is used to ensure the page elements are fully loaded before interaction.
   - **Error Handling**:
     - If a page fails to load within the timeout, a message is displayed, the browser is closed, and the program exits gracefully using `sys.exit(1)`.
   - **Interacting with Dropdown Menus**:
     - The function interacts with dropdown menus (e.g., "A FAVOR", "CONTRA", and "AUSENCIA") to ensure that all relevant rows are displayed.
     - It uses Selenium's `Select` class to choose the option that displays all rows (value `-1`).
   - **Capturing Page Content**:
     - After all interactions, the function captures the fully loaded page source and parses it with BeautifulSoup, appending the result to the `soups` list.
   - **Completion**:
     - Once all URLs are processed, the browser is closed, and a success message is printed.

3. **Key Points**:
   - **Dynamic Content**: The function ensures the data dynamically loaded by JavaScript (e.g., dropdown menus) is fully visible before parsing the page source.
   - **User Feedback**: Print statements inform the user about the progress and possible errors, improving the script's usability.
   - **Error Handling**: Graceful handling of timeouts avoids unexpected crashes if a page fails to load.

4. **Output**:
   - The function returns a list of BeautifulSoup objects, each representing the parsed HTML content of a URL.

This function is critical to overcome the first challenge of the project: extracting data from JavaScript-rendered pages. It prepares the raw HTML data for further processing and analysis.

In [2]:
def extract_soups(urls):
    """Load each URL in Chrome, expand the three vote tables, and return the parsed pages."""

    print("Please wait while the web browser loads and the data extraction automation starts.")
    print("This may take several minutes...\n")
    driver = webdriver.Chrome()
    soups = []
    for url in urls:

        driver.get(url)
        wait = WebDriverWait(driver, 10)  # wait up to 10 seconds per element

        # Wait until the "A FAVOR" page-length dropdown is present in the DOM.
        try:
            vote_in_favor_length = wait.until(
                EC.presence_of_element_located((By.NAME, "congreso_a_favor_length"))
            )
        except Exception:
            print("The timeout for loading the website has been exceeded. Please close the automated browser window if necessary and try again.\n")
            # quit() ends the whole browser session, not just the current window.
            driver.quit()
            sys.exit(1)

        # Value "-1" is the dropdown option that shows every row.
        Select(vote_in_favor_length).select_by_value("-1")

        # Switch to the "CONTRA" tab and show all of its rows.
        vote_against_tab = wait.until(
            EC.presence_of_element_located((By.LINK_TEXT, "CONTRA"))
        )
        vote_against_tab.click()

        vote_against_length = wait.until(
            EC.presence_of_element_located((By.NAME, "congreso_contra_length"))
        )
        Select(vote_against_length).select_by_value("-1")

        # Switch to the "AUSENCIA" tab and show all of its rows.
        null_vote_tab = wait.until(
            EC.presence_of_element_located((By.LINK_TEXT, "AUSENCIA"))
        )
        null_vote_tab.click()

        null_vote_length = wait.until(
            EC.presence_of_element_located((By.NAME, "congreso_votos_nulos_length"))
        )
        Select(null_vote_length).select_by_value("-1")

        # All three tables are now fully expanded; capture and parse the page.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        soups.append(soup)

    driver.quit()
    print("Soups extracted successfully!\n")

    return soups

### Function: Extracting Metadata from a BeautifulSoup Object

The `extract_metadata(soup)` function is designed to extract key metadata from a BeautifulSoup object representing the HTML content of a page. Specifically, it retrieves the candidate's name and number from the voting data. Here’s a detailed explanation of its functionality:

1. **Purpose**:  
   Extract the candidate’s name and number from the HTML structure, where this information is located within a header element with the ID `"title_list"`. If the data is unavailable or the structure varies, the function handles these cases by returning `NaN` values.

2. **Step-by-Step Execution**:  
   - **Accessing the Header**:  
     - The function attempts to locate an `<h5>` element with the ID `"title_list"`.  
     - If found, it retrieves all paragraphs (`<p>`) within the header. If the operation fails, an empty BeautifulSoup object is used as a fallback.  
   - **Extracting the Candidate's Name**:  
     - The second paragraph (`title_paragraphs[1]`) is expected to contain the candidate's name, prefixed by the phrase `"Pregunta: ELECCIÓN DE"` and followed by `"COMO MAGISTRADO DE LA CORTE SUPREMA DE JUSTICIA"`.  
     - The prefix and suffix are removed, leaving only the name.  
     - If the paragraph contains only the bare label `"Pregunta: "` or is absent, `NaN` is assigned.  
   - **Extracting the Candidate's Number**:  
     - The third paragraph (`title_paragraphs[2]`) is expected to contain the candidate's number, prefixed by the word `"Número:"`.  
     - The prefix is removed to extract only the number.  
     - If the paragraph contains only the bare label `"Número: "` or is missing, `NaN` is assigned.  
   - **Error Handling**:  
     - To ensure the function does not stop due to unexpected errors (e.g., missing HTML elements), exceptions (`try-except`) are used for each operation.

3. **Output**:  
   - The function returns a tuple `(candidate_name, candidate_number)` containing the candidate's name and number. If either value is missing, it returns `NaN`.

4. **Key Points**:  
   - **Robustness**: Exception handling ensures the function can process pages with incomplete or unexpected structures without critical errors.  
   - **Flexibility**: The function includes checks to adapt to potential variations in the textual data within the paragraphs.  
   - **Relevance to the Project**: This function extracts essential metadata that identifies each candidate in the voting data, a crucial step in consolidating the information into the adjacency matrix.
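
The name clean-up described above amounts to stripping a fixed prefix and suffix from the paragraph text. A minimal sketch of that step in isolation follows; the helper `clean_candidate_name` and the sample name are hypothetical and only illustrate the pattern.

```python
import math

PREFIX = "Pregunta: ELECCIÓN DE "
SUFFIX = " COMO MAGISTRADO DE LA CORTE SUPREMA DE JUSTICIA"


def clean_candidate_name(text):
    """Strip the fixed prefix and suffix; return NaN when no name is present."""
    if text.strip() == "Pregunta:":
        return math.nan
    return text.replace(PREFIX, "").replace(SUFFIX, "").strip()


# Placeholder name, not a real candidate.
sample = "Pregunta: ELECCIÓN DE NOMBRE APELLIDO COMO MAGISTRADO DE LA CORTE SUPREMA DE JUSTICIA"
print(clean_candidate_name(sample))        # NOMBRE APELLIDO
print(clean_candidate_name("Pregunta: "))  # nan
```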

In [3]:
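def extract_metadata(soup):
    """Return (candidate_name, candidate_number) from the page header, using NaN for missing values."""

    # Locate the header and its paragraphs; fall back to an empty soup if the header is missing.
    try:
        title = soup.find("h5", id="title_list")
        title_paragraphs = title.find_all("p")
    except Exception:
        title_paragraphs = BeautifulSoup("", "html.parser")

    # Second paragraph: candidate name wrapped in a fixed prefix and suffix.
    try:
        if title_paragraphs[1].text != "Pregunta: ":
            candidate_name = (
                title_paragraphs[1].text
                .replace("Pregunta: ELECCIÓN DE ", "")
                .replace(" COMO MAGISTRADO DE LA CORTE SUPREMA DE JUSTICIA", "")
            )
        else:
            candidate_name = np.nan
    except Exception:
        # A missing paragraph (or the empty fallback) ends up here.
        candidate_name = np.nan

    # Third paragraph: candidate number after the "Número: " label.
    try:
        if title_paragraphs[2].text != "Número: ":
            candidate_number = title_paragraphs[2].text.replace("Número: ", "")
        else:
            candidate_number = np.nan
    except Exception:
        candidate_number = np.nan

    return candidate_name, candidate_number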


### Function: Extracting and Consolidating Tables from BeautifulSoup Objects

The `extract_tables(soups)` function extracts voting tables from a list of BeautifulSoup objects and transforms them into pandas DataFrame objects. Additionally, it consolidates the data for each candidate. Here's a detailed explanation of its functionality:

1. **Purpose**:  
   Extract the voting tables for each candidate, consolidating the data into a single DataFrame per candidate. This DataFrame includes votes "A FAVOR" (in favor), "CONTRA" (against), and "AUSENCIA" (absent/null).

2. **Step-by-Step Execution**:  
   - **Initialization**:  
     - An empty list, `dfs`, is created to store the resulting DataFrames.  
   - **Iteration over BeautifulSoup Objects**:  
     - For each `soup` object, the function uses `extract_metadata` to retrieve the candidate's name and number.  
     - If the candidate number is unavailable (`np.nan`), the current iteration is skipped.  
   - **Table Extraction**:  
     - The function attempts to locate all tables in the HTML using `soup.find_all("table")`. If no tables are found, the iteration is skipped.  
   - **Creating DataFrames by Vote Type**:  
     - The function attempts to load tables corresponding to votes "A FAVOR," "CONTRA," and "AUSENCIA" using `pd.read_html`.  
     - Each table is identified by its position in the list of extracted tables.  
     - If a table is missing, an empty DataFrame with the required columns (`"NOMBRE DIPUTADO"`, `"ESTADO"`, and `"VOTO"`) is created.  
   - **Data Consolidation**:  
     - The DataFrames for the three vote types are concatenated vertically (`axis=0`) into a single DataFrame representing all votes for a candidate.  
     - Attributes (`attrs`) are assigned to the DataFrame to store the candidate's name and number extracted earlier.  
   - **Storage**:  
     - The consolidated DataFrame is added to the `dfs` list.  
   - **Completion**:  
     - Once all `soup` objects are processed, a success message is printed.

3. **Output**:  
   - The function returns a list of pandas DataFrames, where each DataFrame contains voting data for a candidate, including votes "A FAVOR," "CONTRA," and "AUSENCIA."

4. **Key Points**:  
   - **Error Handling**: Each step includes exception handling to ensure that errors in one iteration do not affect the overall process.  
   - **Uniform Structure**: The resulting DataFrames have a consistent structure, facilitating subsequent consolidation and analysis.  
   - **Custom Attributes**: Using attributes in the DataFrames preserves metadata (candidate name and number) associated with the extracted tables.  
   - **Relevance to the Project**: This function is critical for processing raw data and preparing it for the final step of consolidating it into the adjacency matrix.  
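
One detail worth keeping in mind for the skip check: NaN never compares equal to anything, including itself, so a missing candidate number has to be detected with a dedicated test such as `math.isnan` or `pandas.isna` rather than `==`. The snippet below is a standalone illustration using only the standard library; `is_missing` is a hypothetical helper.

```python
import math

missing = float("nan")

# Equality against NaN is always False, even for NaN itself.
print(missing == float("nan"))  # False
print(missing == missing)       # False


def is_missing(value):
    """True when value is a float NaN; strings such as "12" are never missing."""
    return isinstance(value, float) and math.isnan(value)


print(is_missing(missing))  # True
print(is_missing("12"))     # False
```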

In [21]:
def extract_tables(soups):

    dfs = []
    for soup in soups:
        
        df_name, df_number = extract_metadata(soup)
        
        # NaN never compares equal to itself, so `== np.nan` is always False; use pd.isna.
        if pd.isna(df_number):
            continue
        
        tables = soup.find_all("table")
        if not tables:
            # find_all returns an empty list when nothing matches, so skip explicitly.
            continue

        # Parse the HTML once; any parsing failure falls back to empty frames below.
        try:
            parsed_tables = pd.read_html(StringIO(str(tables)))
        except Exception:
            parsed_tables = []

        # Positions 4, 5 and 6 hold the "A FAVOR", "CONTRA" and "AUSENCIA" tables.
        # A missing position is replaced by an empty frame with the expected columns.
        df_vote_in_favor, df_vote_against, df_null_vote = [
            parsed_tables[position]
            if position < len(parsed_tables)
            else pd.DataFrame(columns=["NOMBRE DIPUTADO", "ESTADO", "VOTO"])
            for position in (4, 5, 6)
        ]
            
        df = pd.concat([df_vote_in_favor, df_vote_against, df_null_vote], axis=0, ignore_index=True)
        df.attrs["name"] = df_name
        df.attrs["number"] = df_number
        
        dfs.append(df)

    print("Tables have been successfully extracted from the soups and transformed into pandas dataframe objects\n")
    return dfs

### Function: Data Wrangling and Transformation

The `data_wrangling(df)` function cleans and transforms a DataFrame to structure it properly and prepare it for consolidation into the adjacency matrix. Here’s a detailed explanation of its functionality:

1. **Purpose**:  
   Standardize, clean, and transform the voting data for each candidate to follow a consistent format. This includes sorting, removing unnecessary columns, handling missing values, and renaming columns.

2. **Step-by-Step Execution**:  
   - **Sorting by Deputy Name**:  
     - The DataFrame is sorted alphabetically by the `"NOMBRE DIPUTADO"` column to ensure consistent order for future comparisons and consolidations.  
   - **Removing the `"ESTADO"` Column**:  
     - The `"ESTADO"` column is not relevant for the adjacency matrix and is therefore dropped.  
   - **Transforming Values in the `"VOTO"` Column**:  
     - Values `"CONTRA"` and `"A FAVOR"` are replaced with `"0"` and `"1"`, respectively.  
     - Missing values (`np.nan`) are replaced with `"0"`, assuming that the absence of a vote indicates no support.  
   - **Handling Empty or Invalid Data**:  
     - Rows containing invalid data, such as `"Ningún dato disponible en esta tabla =("`, are removed.  
   - **Renaming Columns**:  
     - The `"VOTO"` column is renamed with the candidate's name (stored in the DataFrame attributes, `df.attrs["name"]`).  
     - The `"NOMBRE DIPUTADO"` column is renamed to `"DIPUTADO"` for a clearer and more uniform format.  
   - **Resetting the Index**:  
     - After cleaning and dropping rows, the index is reset to maintain an orderly DataFrame.  

3. **Output**:  
   - The function returns a cleaned and structured DataFrame with two main columns: `"DIPUTADO"` (the deputy's name) and the candidate's name (with corresponding votes as values).

4. **Key Points**:  
   - **Data Consistency**: Cleaning ensures a uniform format, which is crucial for consolidating the DataFrames into the adjacency matrix.  
   - **Standardization of Values**: Votes are converted into binary values (`1` for "A FAVOR" and `0` for "CONTRA" or absent), simplifying their interpretation in network analysis.  
   - **Preparation for Consolidation**: This step is essential to ensure each candidate's data is ready to integrate with others in a common structure.  
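
To make the transformation concrete, the sketch below applies the same kind of clean-up to a toy table. The deputy names and the candidate label are invented, and the use of `Series.map` with integer values is a simplification for illustration rather than the function's exact logic, which stores the votes as the strings `"0"` and `"1"`.

```python
import pandas as pd

PLACEHOLDER = "Ningún dato disponible en esta tabla =("

# Hypothetical raw table for one candidate, including a placeholder row.
raw = pd.DataFrame({
    "NOMBRE DIPUTADO": ["DEPUTY Y", "DEPUTY X", PLACEHOLDER],
    "VOTO": ["CONTRA", "A FAVOR", PLACEHOLDER],
})

# Remove the placeholder row before encoding the votes.
clean = raw[raw["NOMBRE DIPUTADO"] != PLACEHOLDER].copy()

# Anything that is not an explicit "A FAVOR" counts as no support.
clean["VOTO"] = clean["VOTO"].map({"A FAVOR": 1}).fillna(0).astype(int)

clean = (
    clean.sort_values("NOMBRE DIPUTADO")
    .rename(columns={"NOMBRE DIPUTADO": "DIPUTADO", "VOTO": "CANDIDATE A"})
    .reset_index(drop=True)
)
print(clean)
```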

In [5]:
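def data_wrangling(df):
    """Clean one candidate's vote table and encode the votes as "1" / "0"."""

    # Read the candidate name up front; it becomes the vote column's header.
    candidate_name = df.attrs["name"]

    df = df.sort_values(by="NOMBRE DIPUTADO")
    df = df.drop(columns=["ESTADO"])
    # Encode votes: "A FAVOR" -> "1"; "CONTRA" and missing values -> "0".
    df["VOTO"] = df["VOTO"].replace({"CONTRA": "0", "A FAVOR": "1", np.nan: "0"})
    # Turn the "no data available" placeholder into NaN so dropna removes those rows.
    df = df.replace("Ningún dato disponible en esta tabla =(", np.nan)
    df = df.dropna()
    df = df.rename(columns={"VOTO": candidate_name, "NOMBRE DIPUTADO": "DIPUTADO"})
    df = df.reset_index(drop=True)

    return df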

### Function: Data Wrangling for All DataFrames

The `data_wrangling_all_dfs(dfs)` function applies the data cleaning and transformation process defined in the `data_wrangling(df)` function to a list of DataFrames. This ensures that each DataFrame is standardized and prepared for final consolidation. Here's a detailed explanation of its functionality:

1. **Purpose**:  
   Automate the cleaning and transformation of multiple DataFrames, representing voting data for all candidates, to ensure consistency and a uniform structure across all DataFrames.

2. **Step-by-Step Execution**:  
   - **Initialization**:  
     - An empty list, `dfs_after_wrangling`, is created to store the processed DataFrames.  
   - **Iterating Through the List of DataFrames**:  
     - For each DataFrame in the `dfs` list, the `data_wrangling(df)` function is applied.  
     - The resulting DataFrame is added to the `dfs_after_wrangling` list.  
   - **Returning the Result**:  
     - Once all DataFrames are processed, the `dfs_after_wrangling` list is returned as the output.

3. **Output**:  
   - The function returns a list of cleaned and transformed DataFrames, ready for consolidation into an adjacency matrix.

4. **Key Points**:  
   - **Reuse of Functionality**: Leverages the logic previously defined in `data_wrangling` to keep the code modular and efficient.  
   - **Consistency Across DataFrames**: Ensures that all DataFrames have a uniform format, which is essential for proper integration in subsequent steps.  
   - **Preparation for Consolidation**: By processing all DataFrames uniformly, it facilitates their combination into a single structure that represents the voting data for all candidates.  

In [6]:
def data_wrangling_all_dfs(dfs):
    
    dfs_after_wrangling = []
    for df in dfs:
        df_after_wrangling = data_wrangling(df)
        dfs_after_wrangling.append(df_after_wrangling)
        
    return dfs_after_wrangling

### Function: Generating the Adjacency Matrix

The `adjacency_matrix(dfs)` function builds an adjacency matrix from the DataFrames containing the voting data of congressmen for each candidate. This matrix is essential for representing the relationships between congressmen and candidates in a format suitable for network analysis. Here's a detailed explanation of its functionality:

1. **Purpose**:  
   Create an adjacency matrix where rows represent congressmen and columns represent candidates. The values in the matrix indicate whether a congressman voted "A FAVOR" (in favor) of a candidate (`1`) or not (`0`).

2. **Step-by-Step Execution**:  
   - **Initialization of Lists and Sets**:  
     - `candidates`: A list to store the names of the candidates (columns of the matrix).  
     - `congressmen`: A set to gather unique names of congressmen (rows of the matrix).  
   - **Collecting Names**:  
     - The function iterates through the DataFrames to extract the names of candidates (second column of each DataFrame) and congressmen (`DIPUTADO`).  
     - Both candidate and congressman names are sorted alphabetically to ensure consistency in the matrix.  
   - **Initializing the Adjacency Matrix**:  
     - A DataFrame is created with indices corresponding to congressmen's names and columns to candidates' names. All initial values are set to `0`.  
   - **Populating the Matrix**:  
     - For each DataFrame, the rows are iterated over.  
     - If a congressman's vote for a candidate is `'1'` (voted "A FAVOR"), the corresponding value in the adjacency matrix is updated to `1`.  
   - **Returning the Matrix**:  
     - The fully populated adjacency matrix is returned as a DataFrame.

3. **Output**:  
   - A DataFrame representing the adjacency matrix, where rows are congressmen, columns are candidates, and values are binary (`1` for "A FAVOR" and `0` for otherwise or absence of a vote).

4. **Key Points**:  
   - **Consistent Structure**: Sorting congressmen and candidates alphabetically ensures the matrix has a uniform structure, regardless of the order in the original data.  
   - **Binary Relationship**: The adjacency matrix captures only affirmative connections ("A FAVOR"), ignoring other types of votes, making it ideal for a network analysis focused on positive relationships.  
   - **Relevance to the Project**: This matrix is the final product of the previous transformations and serves as the main input for network analysis with Gephi.  

In [7]:
def adjacency_matrix(dfs):

    candidates = []
    congressmen = set()

    # Collect the column labels (candidates) and row labels (congressmen).
    for df in dfs:
        candidates.append(df.columns.values[1])
        congressmen.update(df['DIPUTADO'].values)

    candidates = sorted(candidates)
    congressmen = sorted(congressmen)

    # Start from an all-zero matrix. The local name differs from the
    # function's name so the function is not shadowed inside its own body.
    matrix = pd.DataFrame(0, index=congressmen, columns=candidates)

    for df in dfs:
        candidate = df.columns[1]  # same for every row of this DataFrame
        for _, row in df.iterrows():
            # Votes were stored as the strings '0'/'1' by data_wrangling.
            if row[candidate] == '1':
                matrix.loc[row['DIPUTADO'], candidate] = 1

    return matrix
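
As an alternative to the row-by-row loop, the same matrix can be assembled by aligning the per-candidate columns on the deputy name. The sketch below is one possible variant, not the notebook's method. The function name `adjacency_matrix_concat` and the toy data are hypothetical, and the approach assumes each deputy appears at most once per DataFrame.

```python
import pandas as pd

def adjacency_matrix_concat(dfs):
    # Turn each two-column DataFrame into a Series indexed by deputy name.
    columns = [df.set_index("DIPUTADO")[df.columns[1]] for df in dfs]
    # Align on deputy names; deputies missing from a table become NaN.
    wide = pd.concat(columns, axis=1)
    # '1' means "A FAVOR"; everything else (including NaN) counts as 0.
    matrix = wide.eq("1").astype(int)
    # Sort rows and columns alphabetically, as in the loop-based version,
    # and drop the index name so the layout matches an unnamed index.
    return matrix.sort_index().sort_index(axis=1).rename_axis(None)

# Invented data: two candidates, three deputies, partially overlapping.
a = pd.DataFrame({"DIPUTADO": ["Diputado A", "Diputado B"], "CANDIDATO X": ["1", "0"]})
b = pd.DataFrame({"DIPUTADO": ["Diputado B", "Diputado C"], "CANDIDATO Y": ["1", "1"]})

matrix = adjacency_matrix_concat([b, a])
print(matrix)
```

On large inputs this avoids one `.loc` assignment per vote, at the cost of being less explicit than the loop.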

### Function: Export DataFrames to CSV Files

The `dfs_to_csvs(dfs)` function exports a list of DataFrames to CSV files, ensuring that each candidate's information is saved in an accessible format for future use. Below is a detailed explanation of its functionality:

1. **Purpose**:  
   Save each DataFrame as an individual CSV file, naming the files consistently using the DataFrame's attributes (`number` and `name`).

2. **Step-by-Step Execution**:  
   - **Iterating Through the DataFrames**:  
     - The function iterates over each DataFrame in the `dfs` list.  
   - **Exporting to CSV**:  
     - The `to_csv` method from pandas is used to save each DataFrame as a CSV file in the `files` folder.  
     - The file name includes the candidate's number (`df.attrs["number"]`) and name (`df.attrs["name"]`) for easy identification.  
   - **Error Handling**:  
     - If the export fails for any reason (e.g., permission issues or missing directories), a message is printed indicating the corresponding file could not be exported.  
   - **Completion**:  
     - After processing all DataFrames, a message is printed indicating the export process has finished.

3. **Output**:  
   - CSV files are saved in the `files` folder, with descriptive names indicating the candidate's number and name.  
   - Messages are printed to the console to inform the user of the success or failure of each file's export.

4. **Key Points**:  
   - **Clear Organization**: CSV files are named intuitively, making them easy to locate and use later.  
   - **Error Management**: The function includes logic to handle export errors, ensuring that a single failure does not halt the entire process.  
   - **Relevance to the Project**: This function ensures that intermediate results are saved in a widely-used format (CSV), providing a reliable backup for future reference or further analysis.  

In [8]:
def dfs_to_csvs(dfs):

    for df in dfs:
        # Build the file name once. Single quotes inside the f-string keep it
        # valid on Python versions before 3.12.
        file_name = f"{df.attrs['number']} {df.attrs['name']}.csv"
        try:
            df.to_csv(f"files/{file_name}", index=False)
            print(f"{file_name} file has been exported")
        except Exception:
            # Exception rather than a bare except, so Ctrl+C still stops the loop.
            print(f"{file_name} file could not be exported")

    print("The process of exporting CSV files has been completed")
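
Because a missing `files` folder is one of the failure cases mentioned above, one option is to create the folder before exporting. The sketch below shows that idea using only the standard library. The helper name `csv_path_for` and the sample number and name are hypothetical.

```python
import tempfile
from pathlib import Path

def csv_path_for(number, name, folder="files"):
    # Create the output folder if it does not exist yet (no error if it does).
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Same naming scheme as dfs_to_csvs: "<number> <name>.csv".
    return out_dir / f"{number} {name}.csv"

# Demonstrate inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = csv_path_for(6, "CANDIDATO EJEMPLO", folder=Path(tmp) / "files")
    folder_created = path.parent.is_dir()

print(path.name)
print(folder_created)
```

Inside the export loop, the returned path could be passed straight to `df.to_csv(..., index=False)`.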

### Generating the List of URLs

This cell generates a list of URLs corresponding to the voting pages for candidates on the official website of the Congress of the Republic of Guatemala. Below is the purpose and execution of the code:

1. **Purpose**:  
   Create a list of consecutive URLs to extract voting data for candidates within a specific range. These URLs will later be used to automate the data extraction process.

2. **Step-by-Step Execution**:  
   - **Defining the Range**:  
     - The variables `a` and `b` define the range of page identifiers. In this case, the range runs from `7900` up to, but not including, `7926`.  
   - **Generating the URL List**:  
     - A list comprehension is used to generate the URLs.  
     - Each URL is formatted using an `f-string`, which places a literal `4` in front of each value from the range, so the identifiers that appear in the URLs are `47900` to `47925`.  
     - The range (`range(a, b)`) iterates over all numbers from `7900` to `7925`, producing a total of 26 URLs.

3. **Output**:  
   - The variable `urls` contains a list of 26 URLs, each pointing to a specific voting detail page on the Congress website.

4. **Key Points**:  
   - **Automation**: Automatically generating URLs eliminates the need to manually list them, reducing errors and saving time.  
   - **Preparation for Data Extraction**: These URLs will serve as inputs for subsequent functions that automate navigation and data extraction from the respective pages.  
   - **Flexibility**: Changing the values of `a` and `b` allows easy adjustment of the range of pages to process, making it adaptable for future extensions of the project.  

In [9]:
a = 7900
b = 7926
urls = [f"https://www.congreso.gob.gt/detalle_de_votacion/4{i}/41228" for i in range(a, b)]
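
The same list can be produced by a small function, which makes the half-open range and the fixed parts of the URL explicit. This is a sketch only: `build_urls`, `prefix`, and `suffix` are hypothetical names, and what the leading `4` and the trailing `41228` mean on the Congress website is not stated in this notebook.

```python
BASE_URL = "https://www.congreso.gob.gt/detalle_de_votacion"

def build_urls(start, stop, prefix="4", suffix="41228"):
    # One URL per identifier in [start, stop); `stop` itself is excluded.
    return [f"{BASE_URL}/{prefix}{i}/{suffix}" for i in range(start, stop)]

example_urls = build_urls(7900, 7926)
print(len(example_urls))   # 26
print(example_urls[0])     # https://www.congreso.gob.gt/detalle_de_votacion/47900/41228
print(example_urls[-1])    # https://www.congreso.gob.gt/detalle_de_votacion/47925/41228
```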

### HTML Content ("Soups") Extraction and Execution Considerations

In this cell, the `extract_soups(urls)` function is used to automate the extraction of HTML content from the voting pages specified in the previously generated URL list. Below is a detailed explanation of its purpose, expected behavior, and important considerations for the user.

1. **Purpose**:  
   Retrieve the complete HTML content of each voting page using Selenium to interact with dynamically loaded JavaScript content. These "soups" will later be processed to extract and transform the required data.

2. **Execution**:  
   - The `extract_soups(urls)` function takes the URL list as input.  
   - The function uses an automated browser (Selenium WebDriver) to:  
     - Launch the browser in a controlled manner.  
     - Load each URL in the list.  
     - Interact with page elements, such as dropdown menus, to ensure all necessary content is visible.  
     - Capture the dynamically rendered HTML content.

3. **Considerations for the User**:  
   - **Process Duration**:  
     - **Patience is key**: This process can take time since the automated browser must launch, navigate each page, and perform interactions like selecting menus and waiting for content to fully load.  
     - It is normal for the browser to take a few seconds to open and remain active while processing the URLs.  
   - **Console Messages**:  
     - **Process Start**: You will see the message:  
       `"Please wait while the web browser loads and the data extraction automation starts. This may take several minutes..."`  
       This indicates that the browser is starting and the process has begun successfully.  
     - **Load Timeout**: If a page fails to load within the expected time, you will see:  
       `"The timeout for loading the website has been exceeded. Please close the automated browser window if necessary and try again."`  
       In this case, you should manually close the browser window and restart the process.  
     - **Successful Completion**: Once the process finishes, you will see:  
       `"Soups extracted successfully!"`  
       This confirms that all HTML content has been extracted and is ready for further processing.  
   - **Browser Interaction**:  
     - During execution, you will observe the browser opening pages, selecting menus, and loading information automatically. This is normal and reflects the script's internal operation.  
     - It is important not to manually interfere with the browser while it performs automated actions, as this could disrupt the process.  

4. **Output**:  
   - The `soups` variable will contain a list of BeautifulSoup objects, each representing the HTML content of a voting page.

5. **Key Points**:  
   - **Automation of the Extraction Process**: Selenium enables interaction with dynamic page elements, overcoming the limitations of simpler tools like Requests.  
   - **Preparation for Cleaning and Analysis**: The extracted HTML content is the primary input for subsequent processing stages, where it will be transformed into structured data.  
   - **Guidance for New Users**: This level of automation may seem unusual for users with no prior experience, but it is a crucial step to handle websites that rely on JavaScript to load critical content.  

In [12]:
soups = extract_soups(urls)

Please wait while the web browser loads and the data extraction automation starts.
This may take several minutes...

Soups extracted successfully!
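
When a page times out, the notebook asks the user to restart by hand. A possible refinement, sketched below with the standard library only, is to retry a failing page load a few times before giving up. The names `fetch_with_retries` and `flaky_fetch` are hypothetical stand-ins, and `TimeoutError` here is Python's built-in exception rather than Selenium's `TimeoutException`. How such a wrapper would plug into `extract_soups` is left open.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=0.0):
    # Call fetch(url) up to `attempts` times (assumed >= 1),
    # pausing `delay` seconds after each failed try.
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except TimeoutError as error:
            last_error = error
            time.sleep(delay)
    # All attempts failed: re-raise the last timeout so the caller can react.
    raise last_error

calls = {"count": 0}

def flaky_fetch(url):
    # Stand-in for a page load: times out twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError(f"timed out loading {url}")
    return f"<html>{url}</html>"

html = fetch_with_retries(flaky_fetch, "https://example.org/page")
print(calls["count"])  # 3
```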



### Extraction of Voting Tables

In this cell, the `extract_tables(soups)` function is used to extract the voting tables from the list of BeautifulSoup objects obtained earlier. These tables represent the raw data of votes for each candidate. Below is a detailed explanation of its purpose and behavior:

1. **Purpose**:  
   Extract the voting data from each HTML content ("soup") to create pandas DataFrames. These DataFrames will organize the voting information in a structured format, separating votes into categories such as "A FAVOR" (in favor), "CONTRA" (against), and "AUSENCIA" (absent/null).

2. **Execution**:  
   - The function `extract_tables(soups)` processes each BeautifulSoup object in the `soups` list.  
   - For each candidate:
     - Metadata (such as the candidate's name and number) is extracted using the `extract_metadata` function.  
     - Voting tables are identified and converted into pandas DataFrames.  
     - DataFrames for "A FAVOR," "CONTRA," and "AUSENCIA" are consolidated into a single DataFrame for each candidate.  
   - Each consolidated DataFrame is stored with metadata attributes (`name` and `number`) for easy identification.

3. **Output**:  
   - The `dfs` variable contains a list of pandas DataFrames.  
   - Each DataFrame corresponds to a candidate and includes detailed voting data, ready for further cleaning and transformation.

4. **Considerations for the User**:  
   - **Automated Processing**: The function processes all HTML content automatically, extracting and structuring the tables without user intervention.  
   - **Potential Gaps in Data**: If a table for a specific voting category is missing (e.g., "AUSENCIA"), the function creates an empty DataFrame for that category to maintain consistency.  
   - **Console Feedback**: Messages printed by the `extract_tables` function inform the user about the status of the extraction process, making it easier to identify potential issues.

5. **Relevance to the Project**:  
   - These DataFrames form the foundation for subsequent steps, such as cleaning, transformation, and creating the adjacency matrix. They ensure that all voting data is structured uniformly for analysis.

In [22]:
dfs = extract_tables(soups)

Tables have been successfully extracted from the soups and transformed into pandas dataframe objects
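
The consolidation step described above can be sketched as follows. This is not the notebook's `extract_tables` implementation, which is defined in an earlier cell. The function `combine_vote_tables` and the toy tables are hypothetical, and the column names are taken from the output shown in this notebook. As a simplification, a missing category is represented by `None` and skipped, instead of being replaced with an empty DataFrame.

```python
import pandas as pd

COLUMNS = ["NOMBRE DIPUTADO", "ESTADO", "VOTO"]

def combine_vote_tables(tables, name, number):
    # `tables` maps a category label to its DataFrame, or to None when the
    # page had no table for that category.
    frames = [t for t in tables.values() if t is not None and not t.empty]
    if frames:
        combined = pd.concat(frames, ignore_index=True)
    else:
        combined = pd.DataFrame(columns=COLUMNS)
    # Attach the metadata after concatenating so it is always present.
    combined.attrs["name"] = name
    combined.attrs["number"] = number
    return combined

# Invented tables for one candidate; the "AUSENCIA" category is missing.
tables = {
    "A FAVOR": pd.DataFrame([["Diputado A", "PRESENTE", "A FAVOR"]], columns=COLUMNS),
    "CONTRA": pd.DataFrame([["Diputado B", "PRESENTE", "CONTRA"]], columns=COLUMNS),
    "AUSENCIA": None,
}

combined = combine_vote_tables(tables, name="CANDIDATO EJEMPLO", number=6)
print(len(combined), combined.attrs)
```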



### Visualizing Extracted Data

This cell allows the user to explore the extracted data and see an example of how the generated DataFrames are structured. It provides basic information about a specific candidate and a partial view of their voting data. Below is an explanation of the purpose and functionality of this cell:

1. **Purpose**:  
   Provide a simple way for the user to inspect the extracted data for a specific candidate, checking both the metadata and the content of the DataFrame.

2. **Execution**:  
   - **Customizable Index**:  
     - The variable `i` is used to select the index of the DataFrame in the `dfs` list. The user can modify this index to navigate between the different generated DataFrames.  
   - **Displaying Metadata**:  
     - The election number (`dfs[i].attrs["number"]`) and the candidate's name (`dfs[i].attrs["name"]`) are printed, making it easier to identify the selected DataFrame.  
   - **Viewing the DataFrame**:  
     - A combined view of the first 20 rows (`head(20)`) and the last 20 rows (`tail(20)`) of the selected DataFrame is displayed. This provides a representative sample of the data without showing the entire DataFrame.

3. **Output**:  
   - The console prints:
     - The election number corresponding to the candidate.  
     - The candidate's name.  
   - In the Jupyter Notebook, a styled view of the DataFrame is displayed, showing the first and last 20 rows.

4. **Considerations for the User**:  
   - **Interactive Exploration**: The user can modify the value of `i` to inspect the DataFrames of different candidates. This is useful for verifying the quality and consistency of the extracted data.  
   - **Data Context**: Displaying the associated metadata (election number and candidate's name) helps to understand which information is being viewed.  
   - **Row Limitation**: Showing only the first and last rows is a strategy to avoid overloading the visualization when DataFrames contain a large number of rows.  

5. **Relevance to the Project**:  
   - This cell helps validate the intermediate results of the data extraction and transformation process, ensuring that the DataFrames are correctly structured before proceeding to the next steps.  

In [14]:
i = 2
print(f"Supreme Court election number: {dfs[i].attrs['number']}")
print(f"Name of the candidate in election: {dfs[i].attrs['name']}")
pd.concat([dfs[i].head(20), dfs[i].tail(20)]).style

Supreme Court election number: 6
Name of the candidate in election: WENDY ANGELICA RAMÍREZ LÓPEZ


| | NOMBRE DIPUTADO | ESTADO | VOTO |
|---|---|---|---|
| 0 | Ajcip Canel Hellen Magaly Alexandrá | PRESENTE | A FAVOR |
| 1 | Arana Roca Sergio David | PRESENTE | A FAVOR |
| 2 | Barragán Morales Gerson Geovanny | PRESENTE | A FAVOR |
| 3 | Blanco Lapola Orlando Joaquín | PRESENTE | A FAVOR |
| 4 | Chic Cardona José Alberto | PRESENTE | A FAVOR |
| 5 | Cifuentes Ovalle Pablo Leonel | PRESENTE | A FAVOR |
| 6 | Coc Figueroa Randy Araely | PRESENTE | A FAVOR |
| 7 | De León Benítez Alberto Eduardo | PRESENTE | A FAVOR |
| 8 | De León Torres Lourdes Teresita | PRESENTE | A FAVOR |
| 9 | Flores Divas Jairo Joaquín | PRESENTE | A FAVOR |
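
One caveat of combining `head(20)` and `tail(20)`: when a DataFrame has fewer than 40 rows, the two slices overlap and some rows appear twice. A small helper can guard against that. The sketch below is an optional variant, and `preview` is a hypothetical name.

```python
import pandas as pd

def preview(df, n=20):
    # Show the whole DataFrame when head and tail would overlap or touch.
    if len(df) <= 2 * n:
        return df
    return pd.concat([df.head(n), df.tail(n)])

small = pd.DataFrame({"x": range(5)})
large = pd.DataFrame({"x": range(100)})

naive_rows = len(pd.concat([small.head(3), small.tail(3)]))  # one row is repeated
small_rows = len(preview(small, n=3))
large_rows = len(preview(large, n=3))
print(naive_rows, small_rows, large_rows)  # 6 5 6
```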


### Transformation of All DataFrames

In this cell, the `data_wrangling_all_dfs(dfs)` function is used to clean and transform all the DataFrames previously generated from the extracted data. This step ensures that the voting data is standardized and ready for final consolidation into the adjacency matrix. Below is a detailed explanation of the purpose and functionality of this cell:

1. **Purpose**:  
   Apply the data cleaning and transformation process defined in the `data_wrangling(df)` function to all DataFrames in the `dfs` list. This unifies the data format and ensures consistency.

2. **Execution**:  
   - The `data_wrangling_all_dfs(dfs)` function takes the `dfs` list, which contains the original DataFrames, as input.  
   - For each DataFrame in the list:  
     - The `data_wrangling(df)` function is applied to:  
       - Sort rows.  
       - Remove unnecessary columns.  
       - Standardize vote values to a binary format (`1` for "A FAVOR" and `0` for "CONTRA" or null votes).  
       - Rename columns to facilitate consolidation.  
   - The transformed DataFrames are stored in a new list called `transformed_dfs`.

3. **Output**:  
   - The `transformed_dfs` variable contains a list of transformed DataFrames.  
   - Each DataFrame follows a uniform format, with columns "DIPUTADO" (Deputy) and the candidate's name, and is ready for consolidation into the adjacency matrix.

4. **Considerations for the User**:  
   - **Automated Process**: This cell automates the cleaning and standardization of data, eliminating the need to manually perform these steps for each DataFrame.  
   - **Data Quality and Consistency**: The function ensures that the data is free from inconsistencies and formatted appropriately for the next stages of the project.  
   - **Project Impact**: This step is crucial to guarantee that the data can be consolidated and analyzed correctly in the adjacency matrix.

5. **Relevance to the Project**:  
   - Transforming the DataFrames prepares the data for consolidation, enabling the creation of a uniform and reliable adjacency matrix, which is essential for network analysis.  

In [15]:
transformed_dfs = data_wrangling_all_dfs(dfs)

### Visualizing Transformed Data

This cell provides the user with a way to inspect how the voting data has been transformed after applying the cleaning and standardization processes. It helps validate that the data is properly prepared for the next steps. Below is a detailed explanation of the purpose and execution of this cell:

1. **Purpose**:  
   Allow the user to verify the results of the data transformation for a specific candidate, observing the changes made to the format and structure of the DataFrames.

2. **Execution**:  
   - **Index Selection**:  
     - The variable `i` defines the index of the DataFrame in the `transformed_dfs` list that the user wishes to inspect.  
     - Changing the value of `i` allows the user to navigate between different processed candidates.  
   - **Displaying Metadata**:  
     - The election number (`transformed_dfs[i].attrs["number"]`) and the candidate’s name (`transformed_dfs[i].attrs["name"]`) are printed to clearly identify the selected DataFrame.  
   - **Viewing the DataFrame**:  
     - A combination of the first 20 rows (`head(20)`) and the last 20 rows (`tail(20)`) of the transformed DataFrame is displayed. This provides a representative sample of the data without overwhelming the display.

3. **Output**:  
   - In the console:
     - The election number and the candidate’s name corresponding to the selected DataFrame are printed.  
   - In the Jupyter Notebook:
     - A styled view of the DataFrame is displayed, showing a sample of the transformed rows.

4. **Considerations for the User**:  
   - **Interactive Exploration**: The user can modify the value of `i` to examine how the data for other candidates has been transformed.  
   - **Change Confirmation**: This cell is useful for verifying that columns have been renamed, vote values are in binary format, and the structure is uniform.  
   - **Row Limitation**: Displaying only the first and last rows avoids overwhelming the screen, especially if the DataFrames contain many rows.

5. **Relevance to the Project**:  
   - Validating the transformation results ensures that the data is properly structured before proceeding to the final consolidation into the adjacency matrix.  

In [16]:
i = 2
print(f"Supreme Court election number: {transformed_dfs[i].attrs['number']}")
print(f"Name of the candidate in election: {transformed_dfs[i].attrs['name']}")
pd.concat([transformed_dfs[i].head(20), transformed_dfs[i].tail(20)]).style

Supreme Court election number: 6
Name of the candidate in election: WENDY ANGELICA RAMÍREZ LÓPEZ


| | DIPUTADO | WENDY ANGELICA RAMÍREZ LÓPEZ |
|---|---|---|
| 0 | Aguirre Estrada Luis Fernando | 0 |
| 1 | Ajcip Canel Hellen Magaly Alexandrá | 1 |
| 2 | Aldana Reyes Héctor Adolfo | 0 |
| 3 | Alejos Lorenzana Felipe | 0 |
| 4 | Alvarado Vásquez Manuel Geovany | 0 |
| 5 | Alvarez y Alvarez Cristian Rodolfo | 0 |
| 6 | Amézquita Del Valle César Augusto | 0 |
| 7 | Arana Roca Sergio David | 1 |
| 8 | Archila Cordón Manuel de Jesús | 0 |
| 9 | Arzú Escobar Alvaro Enrique | 0 |
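
The checks listed under "Change Confirmation" can also be automated instead of done by eye. The sketch below encodes only the properties described in this section: two columns, the first named `DIPUTADO`, and vote values stored as `'0'` or `'1'`. The function `check_transformed` and the sample DataFrames are hypothetical.

```python
import pandas as pd

def check_transformed(df):
    # Return a list of human-readable problems; an empty list means all checks passed.
    problems = []
    if len(df.columns) != 2 or df.columns[0] != "DIPUTADO":
        problems.append("expected columns: DIPUTADO and the candidate's name")
        return problems
    votes = df[df.columns[1]]
    if not votes.isin(["0", "1"]).all():
        problems.append("vote column contains values other than '0' and '1'")
    if df["DIPUTADO"].duplicated().any():
        problems.append("a deputy appears more than once")
    return problems

good = pd.DataFrame({"DIPUTADO": ["Diputado A", "Diputado B"],
                     "CANDIDATO EJEMPLO": ["1", "0"]})
bad = pd.DataFrame({"DIPUTADO": ["Diputado A", "Diputado B"],
                    "CANDIDATO EJEMPLO": ["A FAVOR", "0"]})

good_problems = check_transformed(good)
bad_problems = check_transformed(bad)
print(good_problems)
print(bad_problems)
```

Applied to the notebook's data, the function could be called once per entry of `transformed_dfs`.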


### Generating the Adjacency Matrix

In this cell, the `adjacency_matrix(transformed_dfs)` function is used to consolidate the voting data into an adjacency matrix. This matrix is a key component of the project, as it represents the relationships between congressmen and candidates in a binary format suitable for network analysis. Below is a detailed explanation of its purpose and functionality:

1. **Purpose**:  
   Create an adjacency matrix where rows represent congressmen, columns represent candidates, and the values indicate whether a congressman voted "A FAVOR" (in favor) for a candidate (`1`) or not (`0`).

2. **Execution**:  
   - The `adjacency_matrix(transformed_dfs)` function takes the list `transformed_dfs`, which contains the transformed DataFrames, as input.  
   - From these DataFrames:  
     - The names of congressmen (rows) and candidates (columns) are collected.  
     - An initially zero-filled matrix is constructed.  
     - Each DataFrame is iterated through to update the matrix values. When a congressman voted "A FAVOR" (`1`) for a candidate, the corresponding value in the matrix is updated.  

3. **Output**:  
   - The `adjacency_matrix` variable contains a DataFrame representing the adjacency matrix:  
     - **Rows**: Names of congressmen.  
     - **Columns**: Names of candidates.  
     - **Values**: Binary (`1` for "A FAVOR" and `0` for "CONTRA" or absence of a vote).

4. **Considerations for the User**:  
   - **Clear Structure**: The adjacency matrix provides a compact and easily interpretable representation of the relationships between congressmen and candidates.  
   - **Prior Validation**: It is essential that all DataFrames in `transformed_dfs` have been correctly transformed to ensure the matrix’s accuracy.  
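   - **Utility for Analysis**: This DataFrame is the final product needed for conducting network analysis using tools such as Gephi.
   - **Re-running the Cell**: The cell assigns the result to the name `adjacency_matrix`, which replaces the function of the same name. Running the cell a second time therefore raises a `TypeError` until the cell that defines the function is executed again.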

5. **Relevance to the Project**:  
   - The adjacency matrix is the culmination of the data extraction, cleaning, and transformation processes. It is a crucial input for analyzing voting patterns and relationships between congressmen and candidates in a network context.  

In [17]:
adjacency_matrix = adjacency_matrix(transformed_dfs)

### Visualization of the Adjacency Matrix

This cell provides a visual representation of the adjacency matrix, allowing the user to inspect the relationships between congressmen and candidates. This matrix summarizes the voting data in a binary format and serves as the final structured output of the project. Below is an explanation of its purpose and execution:

1. **Purpose**:  
   Display the adjacency matrix in a clear, styled format so that the user can visually verify the relationships captured in the data.

2. **Execution**:  
   - The `adjacency_matrix.style` accessor returns a pandas `Styler` object, which Jupyter renders as a styled view of the adjacency matrix.  
   - The resulting display shows:  
     - **Rows**: Representing the congressmen.  
     - **Columns**: Representing the candidates.  
     - **Values**: Binary (`1` for "A FAVOR" and `0` for "CONTRA" or absence of a vote).

3. **Output**:  
   - In the Jupyter Notebook, the adjacency matrix is displayed as a styled DataFrame. The matrix format provides an easily interpretable summary of voting patterns.

4. **Considerations for the User**:  
   - **Matrix Dimensions**: Depending on the number of congressmen and candidates, the matrix can be large. Scrolling may be required to view all rows and columns.  
   - **Binary Data Representation**: Each cell contains either a `1` (indicating a vote "A FAVOR") or a `0` (indicating "CONTRA" or no vote). This simple representation is designed for network analysis.  
   - **Final Output**: This matrix represents the culmination of all data extraction, cleaning, and transformation steps, providing the input for tools like Gephi.

5. **Relevance to the Project**:  
   - This visualization allows the user to confirm the accuracy of the adjacency matrix before exporting it or proceeding with network analysis. It is the final verification step to ensure the data is correctly structured and ready for use.  

In [18]:
adjacency_matrix.style

| | ASTRID SIOMARA MORALES VIRULA | CARLOS RAMIRO CONTRERAS VALENZUELA | CARLOS RODIMIRO LUCERO PAZ | CLAUDIA LUCRECIA PAREDES CASTAÑEDA | CLEMEN VANESSA JUÁREZ MIDENCE | CÉSAR AUGUSTO ÁVILA APARICIO | DIMAS JIMÉNEZ Y JIMÉNEZ | EDGAR ORLANDO RUANJO GODOY | ERWIN IVÁN ROMERO MORALES | ESTUARDO ADOLFO CÁRDENAS | FLOR DE MARÍA GARCÍA VILLATORO | FLOR DE MARÍA GÁLVEZ BARRIOS | GUSTAVO ADOLFO MORALES DUARTE | IGMAÍN GALICIA PIMENTEL | JENNY NOEMY ALVARADO TENI | JORGE ALBERTO GONZALEZ BARRIOS | JORGE EDUARDO TUCUX COYOY | LIDIA JUDITH URIZAR CASTELLANOS | LUIS MAURICIO CORADO CAMPOS | MANUEL DE JESÚS MEJICANOS JIMÉNEZ | MARIO RENÉ MANCILLA BARILLAS | MARTA SUSANA VIDES LAVARREDA | RENÉ GUILLERMO GIRÓN PALACIOS | TEODULO ILDEFONSO CIFUENTES MALDONADO | VILMA ROSSANA REYES GONZÁLEZ | WENDY ANGELICA RAMÍREZ LÓPEZ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aguirre Estrada Luis Fernando | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Ajcip Canel Hellen Magaly Alexandrá | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| Aldana Reyes Héctor Adolfo | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Alejos Lorenzana Felipe | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Alvarado Vásquez Manuel Geovany | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Alvarez y Alvarez Cristian Rodolfo | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| Amézquita Del Valle César Augusto | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Arana Roca Sergio David | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| Archila Cordón Manuel de Jesús | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Arzú Escobar Alvaro Enrique | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
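
A quick numeric check complements the visual inspection. Summing each column gives the number of "A FAVOR" votes a candidate received, and summing each row gives the number of candidates a deputy supported. Rows that sum to zero, like the all-zero row visible in the output above, are easy to list this way. The sketch uses a toy matrix with invented labels.

```python
import pandas as pd

toy = pd.DataFrame(
    [[1, 0, 1], [1, 1, 0], [0, 0, 0]],
    index=["Diputado A", "Diputado B", "Diputado C"],
    columns=["CANDIDATO X", "CANDIDATO Y", "CANDIDATO Z"],
)

# Votes received per candidate (column sums), highest first.
votes_received = toy.sum(axis=0).sort_values(ascending=False)
# Number of candidates each deputy supported (row sums).
candidates_supported = toy.sum(axis=1)
# Deputies with no "A FAVOR" vote at all.
no_support = candidates_supported[candidates_supported == 0].index.tolist()

print(votes_received)
print(no_support)
```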


### Exporting DataFrames to CSV Files

In this cell, the `dfs_to_csvs(transformed_dfs)` function is used to export the transformed DataFrames to individual CSV files. This action allows saving the intermediate results of the project in a widely used format, making them accessible for further use. Below is a detailed explanation of the purpose and execution of this cell:

1. **Purpose**:  
   Save each transformed DataFrame as an individual CSV file, with a descriptive name that clearly identifies the corresponding candidate.

2. **Execution**:  
   - The `dfs_to_csvs(transformed_dfs)` function takes the `transformed_dfs` list, which contains the transformed DataFrames, as input.  
   - For each DataFrame:  
     - It is saved as a CSV file in the `files` folder, using the candidate's number and name (stored in the DataFrame attributes) to create the file name.  
     - If the export is successful, a message is printed to the console indicating the name of the generated file.  
     - In case of an error (e.g., if the `files` folder does not exist), a message is printed to inform that the file could not be exported.

3. **Output**:  
   - Individual CSV files are created for each DataFrame in the `files` folder.  
   - Messages in the console inform the user about the success or failure of each file export.

4. **Considerations for the User**:  
   - **File Organization**: The CSV files are named using the candidate's number and name, making them easy to locate and identify.  
   - **Potential Errors**: If files are not exported correctly, check that the `files` folder exists and that the system has the necessary permissions to write to it.  
   - **Compatibility**: CSV files can be opened with any tool that supports this format, making them accessible even outside the project environment.

5. **Relevance to the Project**:  
   - Exporting the DataFrames allows preserving the intermediate results, providing a backup of the processed data and preparing them for further analysis or integration with other tools.  

In [19]:
dfs_to_csvs(transformed_dfs)

4 CÉSAR AUGUSTO ÁVILA APARICIO.csv file has been exported
5 CARLOS RODIMIRO LUCERO PAZ.csv file has been exported
6 WENDY ANGELICA RAMÍREZ LÓPEZ.csv file has been exported
7 CLAUDIA LUCRECIA PAREDES CASTAÑEDA.csv file has been exported
8 GUSTAVO ADOLFO MORALES DUARTE.csv file has been exported
9 JORGE EDUARDO TUCUX COYOY.csv file has been exported
10 JENNY NOEMY ALVARADO TENI.csv file has been exported
11 IGMAÍN GALICIA PIMENTEL.csv file has been exported
12 FLOR DE MARÍA GÁLVEZ BARRIOS.csv file has been exported
13 CARLOS RAMIRO CONTRERAS VALENZUELA.csv file has been exported
14 MARTA SUSANA VIDES LAVARREDA.csv file has been exported
15 LIDIA JUDITH URIZAR CASTELLANOS.csv file has been exported
16 FLOR DE MARÍA GARCÍA VILLATORO.csv file has been exported
17 VILMA ROSSANA REYES GONZÁLEZ.csv file has been exported
18 DIMAS JIMÉNEZ Y JIMÉNEZ.csv file has been exported
19 CLEMEN VANESSA JUÁREZ MIDENCE.csv file has been exported
20 TEODULO ILDEFONSO CIFUENTES MALDONADO.csv file has been ex

### Exporting the Adjacency Matrix to a CSV File

In this cell, the generated adjacency matrix is exported to a CSV file using the `to_csv` method. This step finalizes the project processing by saving the matrix in an accessible and widely used format for further analysis or use in external tools, such as Gephi. Below is an explanation of the purpose and execution of this cell:

1. **Purpose**:  
   Save the adjacency matrix to a CSV file for easy storage, sharing, and analysis with tools compatible with this format.

2. **Execution**:  
   - The `to_csv` method is applied to the adjacency matrix (`adjacency_matrix`).  
   - The file is saved in the `files` folder with the name `adjacency_matrix.csv`.  
   - This file contains:  
     - **Rows**: Names of the congressmen.  
     - **Columns**: Names of the candidates.  
     - **Values**: Binary (`1` for "A FAVOR" and `0` for "CONTRA" or absence of a vote).  

3. **Output**:  
   - A CSV file named `adjacency_matrix.csv` in the `files` folder containing the complete adjacency matrix.

4. **Considerations for the User**:  
   - **File Path**: Ensure the `files` folder exists and that the system has permissions to save the file in it.  
   - **Format Compatibility**: The CSV file can be opened and manipulated with various tools, such as spreadsheet software (Excel), text editors, or specialized software like Gephi.  
   - **Project Completion**: This file represents the final product of the processing and can be directly used for network analysis.

5. **Relevance to the Project**:  
   - Exporting the adjacency matrix ensures that the analysis results are backed up and available in a versatile format. It is the final step before using the data for visualizations or additional analyses in external tools.  

In [20]:
adjacency_matrix.to_csv('files/adjacency_matrix.csv')
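A slightly more defensive version of this export could create the target folder first and set the encoding explicitly. The sketch below is illustrative only: `export_matrix` is a hypothetical helper that does not appear in the notebook, the matrix is a two-row stand-in with placeholder congressman names, and the `utf-8-sig` encoding is an assumption intended to help spreadsheet software display accented names correctly.

```python
from pathlib import Path
import tempfile

import pandas as pd


def export_matrix(matrix, folder, filename="adjacency_matrix.csv"):
    """Write `matrix` to folder/filename, creating the folder if it is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / filename
    matrix.to_csv(out_path, encoding="utf-8-sig")  # assumed encoding choice
    return out_path


# Stand-in data: rows are congressmen (placeholder names), columns are candidates.
demo_matrix = pd.DataFrame(
    {
        "CÉSAR AUGUSTO ÁVILA APARICIO": [1, 0],
        "CARLOS RODIMIRO LUCERO PAZ": [0, 1],
    },
    index=pd.Index(["DIPUTADO A", "DIPUTADO B"], name="DIPUTADO"),
)

# A temporary directory is used here so the sketch does not touch real files;
# in the notebook the folder would simply be "files".
demo_path = export_matrix(demo_matrix, Path(tempfile.mkdtemp()) / "files")

# Read the file back to confirm the round trip preserved the values.
reloaded = pd.read_csv(demo_path, index_col=0, encoding="utf-8-sig")
```

Creating the folder inside the helper removes the need to check manually that `files` exists before running the cell.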

### Screenshots: Project Results

Below are screenshots illustrating the key outputs generated during the project. These images provide a visual representation of the processed data and the analyses performed:

#### 1. Example of a CSV File from an Election DataFrame

This screenshot displays the contents of one of the CSV files generated for a specific election. This file contains the structured voting data for a candidate, including information about congressmen and their respective votes:

- **Columns**:
  - `DIPUTADO`: Name of the congressman who participated in the vote.
  - A column named after the candidate: `1` if the congressman voted "A FAVOR", `0` if the vote was "CONTRA" or null.
- **Format**: This file is automatically generated during the export process and is ready for further analysis or integration.

![Screenshot of the generated CSV file](https://raw.githubusercontent.com/cdberganza/Scraping_supreme_court_election_gt_2024/refs/heads/main/images/csv_screenshot.png)

#### 2. CSV File of the Adjacency Matrix

This screenshot shows the adjacency matrix exported as a CSV file. This file represents the relationship between congressmen (rows) and candidates (columns) in a binary format:

- **Values**:
  - `1`: Indicates that the congressman voted "A FAVOR" for the corresponding candidate.
  - `0`: Indicates that the congressman voted "CONTRA" or did not cast a vote.
- **Usage**: This file is the final product of the data processing and is designed for network analysis in tools like Gephi.
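Before handing the matrix to another tool, it can be worth confirming that it really is binary and summarizing it. The following is a minimal sketch under stated assumptions: `validate_adjacency` is a hypothetical helper that does not appear in the notebook, and the candidate and congressman names are placeholders rather than real data.

```python
import pandas as pd


def validate_adjacency(matrix):
    """Raise ValueError if `matrix` is not a clean binary vote matrix.

    Hypothetical check for illustration; not part of the original notebook.
    """
    if matrix.isna().any().any():
        raise ValueError("matrix contains missing values")
    if not matrix.isin([0, 1]).all().all():
        raise ValueError("matrix contains values other than 0 and 1")
    if matrix.index.has_duplicates:
        raise ValueError("duplicate congressman names in the index")


# Placeholder data with the same layout as the exported file:
# congressmen as rows, candidates as columns.
demo = pd.DataFrame(
    {"CANDIDATE X": [1, 0, 1], "CANDIDATE Y": [0, 0, 1]},
    index=["DIPUTADO A", "DIPUTADO B", "DIPUTADO C"],
)

validate_adjacency(demo)  # passes silently for a valid matrix

# Column sums count the "A FAVOR" votes each candidate received;
# row sums count how many candidates each congressman supported.
votes_per_candidate = demo.sum(axis=0)
votes_per_congressman = demo.sum(axis=1)
```

With the real file, the same checks could be applied to a DataFrame loaded with `pd.read_csv('files/adjacency_matrix.csv', index_col=0)`.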

#### 3. Network Visualization Generated by Gephi

The final screenshot presents the network visualization created in Gephi, based on the adjacency matrix:

- **Nodes**: Represent congressmen and candidates.  
- **Edges**: Indicate the connections between congressmen and candidates, defined by "A FAVOR" votes.  
- **Relevance**: This graph is the result of the network analysis, providing visual insights into voting patterns and the relationships between participants in the election.

Together, these screenshots show how the processed data moves from per-candidate CSV files to an adjacency matrix and, finally, to a network visualization that supports social and political interpretation.
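Because the matrix is bipartite (congressmen on one axis, candidates on the other), one common way to load such data into Gephi is as an edge list with `Source` and `Target` columns. The sketch below shows one possible conversion; it is not necessarily the route the author followed, `matrix_to_edges` is a hypothetical helper, and the names in the demo matrix are placeholders.

```python
import pandas as pd


def matrix_to_edges(matrix):
    """Return one (Source, Target) row per "A FAVOR" vote in a binary matrix.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # stack() yields a Series indexed by (congressman, candidate) pairs.
    stacked = matrix.stack()
    # Keep only the pairs where a vote in favor was recorded.
    edges = stacked[stacked == 1].index.to_frame(index=False)
    edges.columns = ["Source", "Target"]
    return edges


# Placeholder data: congressmen as rows, candidates as columns.
demo = pd.DataFrame(
    {"CANDIDATE X": [1, 0, 1], "CANDIDATE Y": [0, 0, 1]},
    index=["DIPUTADO A", "DIPUTADO B", "DIPUTADO C"],
)

edges = matrix_to_edges(demo)
```

The resulting DataFrame could then be saved with `edges.to_csv(..., index=False)` and imported into Gephi as an edges table. Congressmen with no "A FAVOR" votes, such as `DIPUTADO B` here, produce no edges and would have to be added separately as isolated nodes if they should appear in the graph.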

### Conclusions

The analysis of the election process for Supreme Court magistrates in Guatemala demonstrates how web automation, data transformation, and network analysis tools can provide a clearer and more structured understanding of a politically and socially significant event. Below are the main conclusions:

#### 1. **Extraction and Standardization of Complex Data**  
   - **Overcoming Dynamic Content**: Selenium handled the JavaScript-loaded content of the Congress website, which traditional tools like Requests or BeautifulSoup could not extract directly.  
   - **Standardized Data**: Transforming votes into a binary format (`1` for "A FAVOR" and `0` for "CONTRA"/null) simplified the analysis and enabled the creation of a clear and functional adjacency matrix.

#### 2. **Creation of a Useful and Accessible Adjacency Matrix**  
   - The generated adjacency matrix accurately represents the relationships between congressmen and candidates. This structure not only facilitates network analysis but also serves as a reusable input for future studies.

#### 3. **Network Analysis to Understand Voting Dynamics**  
   - Visualization in Gephi enabled the identification of support patterns, political alliances, and potential divisions within Congress during the election process. This provides deeper insights into power dynamics and the factors influencing key decisions.

#### 4. **Transparency and Reusability of Results**  
   - The generated CSV files allow the processed data to be stored and shared transparently, facilitating its use by other researchers, journalists, or citizens interested in analyzing Guatemala’s judicial and political systems.  
   - This approach emphasizes the importance of maintaining open and accessible data for public scrutiny.

#### 5. **Social and Political Impact of the Project**  
   - By providing a visual and structured representation of such a critical election process, the project promotes transparency and informed debate about judicial independence in Guatemala.  
   - It also highlights the importance of monitoring and analyzing political processes to ensure they are fair and represent the country’s best interests.

#### Lessons Learned  
   - **Automation as an Essential Tool**: Automation with Selenium and Python proved effective for managing complex data extraction and transformation processes.  
   - **Importance of Validation**: Each step, from extraction to final analysis, required rigorous validation to ensure the accuracy and reliability of the results.  
   - **Versatility of Network Analysis**: Network analysis not only represented the connections between congressmen and candidates but also revealed underlying patterns of interaction and support.

#### Future Perspectives  
This project lays the groundwork for future research, which could:  
   - Expand the analysis to other voting processes in Congress.  
   - Incorporate additional statistical analyses to explore correlations between votes and political affiliations.  
   - Develop interactive tools to make the results more accessible to the public.

In conclusion, this project not only addresses a technical challenge by managing dynamic and fragmented data but also contributes significantly to understanding the political and social dynamics behind a fundamental process for Guatemala’s justice system.