# Assignment 5: Web scraping elections from the JNE page
The script should not give an error. Any mistake will be considered as 0.


Group 3

-Fátima Trujillo Quiñe
-Reynaldo Padilla Milla
-Claudia Córdova Yamauchi
-Mauricio Flores Jiménez

# 1. Calling the libraries

In [1]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import time
import pandas as pd
import os
from io import StringIO

These libraries are essential for automating web scraping and handling data:

selenium: Automates browser interactions, allowing you to navigate web pages, click elements, and extract data. Key components include webdriver for controlling the browser, By for locating elements, WebDriverWait and expected_conditions for managing wait times, and Keys for keyboard actions.

webdriver_manager: Simplifies the setup of ChromeDriver, ensuring the correct version is installed and used with Selenium, avoiding compatibility issues.

pandas: Provides powerful data manipulation tools, enabling the storage, processing, and exporting of scraped data into formats like Excel.

os: Handles file system operations, such as retrieving the current working directory and constructing file paths.

StringIO: Allows handling HTML content as in-memory files, which is useful for converting scraped data into Pandas DataFrames without creating temporary files.

re: Supports regular expressions, useful for pattern matching and text processing within scraped data.

time: Manages timing functions, such as pauses between actions, to ensure elements load fully before interaction.
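
As an illustration of the text processing that `re` supports, here is a minimal sketch that splits an election label into its year and round. The helper `parse_election_label` and its pattern are hypothetical and not part of the script below; the sample labels are of the kind that the JNE dropdown returns.

```python
import re

# Hypothetical pattern: a four-digit year, optionally followed by "- 2DA VUELTA"
LABEL_PATTERN = re.compile(r"(?P<year>\d{4})(?:\s*-\s*(?P<round>\d)DA VUELTA)?")

def parse_election_label(label):
    """Return (year, round) for a label, or None if the label has no year."""
    match = LABEL_PATTERN.search(label)
    if match is None:
        return None  # e.g. the "[SELECCIONE]" placeholder
    year = int(match.group("year"))
    # Labels without a "2DA VUELTA" suffix are treated as the first round
    round_number = int(match.group("round")) if match.group("round") else 1
    return year, round_number

print(parse_election_label("PRESIDENCIAL 2021 - 2DA VUELTA"))  # (2021, 2)
print(parse_election_label("PRESIDENCIAL 2016"))               # (2016, 1)
print(parse_election_label("[SELECCIONE]"))                    # None
```

A column built this way could be used to sort or filter the results by year, assuming all labels follow the same naming scheme.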

# 2. Setting the driver and the page

First, we initialize a Chrome browser instance using Selenium (driver). Immediately after, we maximize the window and set the zoom level to 100%. Maximizing the window and resetting the zoom to its default level helps keep all elements on the webpage visible and interactable, which is crucial for reliable web scraping. The driver controls the browser, allowing automated interaction with the page, such as clicking elements or extracting data.

In [13]:
# Service is already imported above, so we can use it directly
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# We set the window size and zoom.
driver.maximize_window()
driver.execute_script("document.body.style.zoom='100%'")

# 3. Indicating the web page and actions

We'll automate the process of navigating the JNE web page. We are relying on Selenium to open the webpage, select the type of election ("ELECCIONES PRESIDENCIALES"), and then interact with a dropdown menu that displays all the available election results (it goes back to 1931). The options from this menu are collected into a list, and a dictionary is created that maps the text of each option to its corresponding element for future interaction.

In [15]:
url = 'https://infogob.jne.gob.pe/Eleccion'
driver.get(url)

# We'll click on the first [SELECCIONE], the one for the elections type
# (click() returns None, so there is no need to store its result)
driver.find_element(By.XPATH, '/html/body/div[1]/section/div[2]/div[2]/div[2]/div[1]/div').click()
time.sleep(5)

# Next, we choose (click on) ELECCIONES PRESIDENCIALES
driver.find_element(By.XPATH, '/html/body/div[1]/section/div[2]/div[2]/div[2]/div[1]/div/div[2]/div[2]').click()
time.sleep(5)

# Then, click on the [SELECCIONE] on the right (under Elección)
driver.find_element(By.XPATH, '/html/body/div[1]/section/div[2]/div[2]/div[2]/div[2]/div').click()
time.sleep(5)

# Now, we show the elements of the second menu
elections_list = driver.find_element(By.XPATH, '/html/body/div[1]/section/div[2]/div[2]/div[2]/div[2]/div/div[2]')
time.sleep(5)

# Move on to create a list of elements representing election options
options_elections_list = list(elections_list.find_elements(By.CLASS_NAME, 'item'))

# Create a dictionary mapping the text of each option to the element itself
dict_options = {option.text: option for option in options_elections_list}

dict_options

{'[SELECCIONE]': <selenium.webdriver.remote.webelement.WebElement (session="6f9c56a563ca6079c6ef42136d94ffc5", element="f.479E4B9A1FADEF7C2E91688BC2818635.d.D20A11A23EF92F36FD9594C5CDAD4B0B.e.54")>,
 'PRESIDENCIAL 2021 - 2DA VUELTA': <selenium.webdriver.remote.webelement.WebElement (session="6f9c56a563ca6079c6ef42136d94ffc5", element="f.479E4B9A1FADEF7C2E91688BC2818635.d.D20A11A23EF92F36FD9594C5CDAD4B0B.e.55")>,
 'PRESIDENCIAL 2021': <selenium.webdriver.remote.webelement.WebElement (session="6f9c56a563ca6079c6ef42136d94ffc5", element="f.479E4B9A1FADEF7C2E91688BC2818635.d.D20A11A23EF92F36FD9594C5CDAD4B0B.e.56")>,
 'PRESIDENCIAL 2016 - 2DA VUELTA': <selenium.webdriver.remote.webelement.WebElement (session="6f9c56a563ca6079c6ef42136d94ffc5", element="f.479E4B9A1FADEF7C2E91688BC2818635.d.D20A11A23EF92F36FD9594C5CDAD4B0B.e.57")>,
 'PRESIDENCIAL 2016': <selenium.webdriver.remote.webelement.WebElement (session="6f9c56a563ca6079c6ef42136d94ffc5", element="f.479E4B9A1FADEF7C2E91688BC2818635.d.D
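
The dictionary construction, and the way the first `[SELECCIONE]` entry is skipped afterwards, can be tried without a browser. The sketch below uses a hypothetical `FakeOption` class as a stand-in for Selenium's WebElement (only its `.text` attribute is modelled), with three of the labels shown above.

```python
from dataclasses import dataclass

@dataclass
class FakeOption:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""
    text: str

options = [
    FakeOption("[SELECCIONE]"),
    FakeOption("PRESIDENCIAL 2021 - 2DA VUELTA"),
    FakeOption("PRESIDENCIAL 2021"),
]

# Same comprehension as in the cell above: option text -> option object
dict_options_demo = {option.text: option for option in options}

# Dictionaries keep insertion order, so slicing off the first key drops the placeholder
labels_by_position = list(dict_options_demo.keys())[1:]

# Alternative: filter by name, which does not depend on the placeholder coming first
labels_by_name = [label for label in dict_options_demo if label != "[SELECCIONE]"]

print(labels_by_position)  # ['PRESIDENCIAL 2021 - 2DA VUELTA', 'PRESIDENCIAL 2021']
print(labels_by_position == labels_by_name)  # True
```

Because the option text is the dictionary key, two options with identical text would collapse into one entry; the labels shown above are all distinct.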

# 4. Establishing the actions to get the data from each electoral process

This code initializes an empty DataFrame (df_results) to store the web scraping results. It iterates over the election options in a dictionary (dict_options), clicking each election option and using WebDriverWait to ensure elements are clickable before interacting with them. The code waits for and clicks the "VER DATOS DE LA ELECCIÓN" button and the "CANDIDATOS Y RESULTADOS" tab to load the election data. It then extracts the election results table, processes the relevant columns (ORGANIZACIÓN POLÍTICA, TOTAL VOTOS), adds a new column for the election name, and appends the data to df_results. We set longer wait times because shorter durations caused errors, aiming to guarantee that the page is fully loaded before each interaction. The code also includes time.sleep and navigation commands to handle page loads between iterations.
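
To make the difference between an explicit wait and a fixed pause concrete, here is a simplified, standard-library imitation of the idea behind WebDriverWait. It is not Selenium's implementation: the function `wait_until` and its parameters are hypothetical. The point is that the timeout is an upper bound, so the call returns as soon as the condition holds, whereas time.sleep always blocks for the full duration.

```python
import time

def wait_until(condition, timeout=25.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Hypothetical condition that becomes true 0.2 seconds after the start
start = time.monotonic()
outcome = wait_until(lambda: time.monotonic() - start >= 0.2, timeout=5, poll_interval=0.05)
waited = time.monotonic() - start

print(outcome)  # True
print(f"returned after about {waited:.2f} s, well before the 5 s timeout")
```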

In [17]:
# We create an empty dataframe to store the web scraping results
df_results = pd.DataFrame()

# Iterate over the keys of the dictionary, which represent all the election options
# For this part, we had to increase the waiting times because shorter ones repeatedly stopped the code from running
for election_txt in list(dict_options.keys())[1:]:  # The [1:] slice skips the first key, the "[SELECCIONE]" placeholder
    election = dict_options[election_txt]
    election.click()  # Click on the current election option

    # With WebDriverWait, wait up to 25 seconds for the "VER DATOS DE LA ELECCIÓN" red button to become clickable, then click it to load the selected election data
    WebDriverWait(driver, 25).until(
        EC.element_to_be_clickable((By.XPATH, '/html/body/div[1]/section/div[2]/div[2]/div[3]/div/button'))
    ).click()

    # Next, we'll wait until the "CANDIDATOS Y RESULTADOS" tab is clickable and click it to see the results
    WebDriverWait(driver, 25).until(
        EC.element_to_be_clickable((By.XPATH, '/html/body/div[1]/section/div[2]/div[3]/div[1]/ul/li[2]/a'))
    ).click()

    # Finally, wait until the table is present and extract the table HTML
    table = WebDriverWait(driver, 25).until(
        EC.presence_of_element_located((By.XPATH, '/html/body/div[1]/section/div[2]/div[3]/div[3]/div/div/div/div[1]/div[2]/div[2]'))
    )

    # Read the HTML of the table and turn it into a Pandas dataframe
    # We wrap the HTML in StringIO because passing a literal HTML string to pd.read_html is deprecated and raises a warning
    table_html = StringIO(table.get_attribute('innerHTML'))
    table = pd.read_html(table_html)

    # Process the data and store it in df_results
    # First, assign the first dataframe in table to df
    df = table[0]

    # Then select the columns 'ORGANIZACIÓN POLÍTICA' and 'TOTAL VOTOS' and append a new 'ELECCIONES' column
    df = df[['ORGANIZACIÓN POLÍTICA', 'TOTAL VOTOS']].copy()
    df.insert(0, 'ELECCIONES', election_txt)

    # Concatenating df to df_results
    df_results = pd.concat([df_results, df], ignore_index = True)

    # Finally, go back for the next iteration, waiting 25 seconds after each navigation to be sure the page is fully loaded
    # We go back twice because the first step gets us to the "NORMATIVA" label and the second one to the initial page
    driver.back()
    time.sleep(25)

    driver.back()
    time.sleep(25)
    
    # Click on the next list item to choose the election
    WebDriverWait(driver, 25).until(
        EC.element_to_be_clickable((By.XPATH, '/html/body/div[1]/section/div[2]/div[2]/div[2]/div[2]/div'))
    ).click()
    time.sleep(25)
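
A possible variation on the accumulation step: instead of concatenating inside the loop, each per-election table can be appended to a list and concatenated once at the end, which avoids copying the growing dataframe on every iteration. The sketch below is an assumption-laden illustration, not the script's method: the dictionary `scraped` is a hypothetical stand-in for the tables read from the page, filled with a few rows from the results shown in section 5.

```python
import pandas as pd

# Hypothetical stand-in for the tables that pd.read_html returns for each election
scraped = {
    "PRESIDENCIAL 2021 - 2DA VUELTA": pd.DataFrame({
        "ORGANIZACIÓN POLÍTICA": ["PARTIDO POLITICO NACIONAL PERU LIBRE", "FUERZA POPULAR"],
        "TOTAL VOTOS": [8836380, 8792117],
    }),
    "PRESIDENCIAL 1931": pd.DataFrame({
        "ORGANIZACIÓN POLÍTICA": ["UNION REVOLUCIONARIA", "PARTIDO APRISTA PERUANO"],
        "TOTAL VOTOS": [152149, 106088],
    }),
}

frames = []
for election_txt, df in scraped.items():
    # Same processing as in the loop above: keep two columns, add the election name
    df = df[["ORGANIZACIÓN POLÍTICA", "TOTAL VOTOS"]].copy()
    df.insert(0, "ELECCIONES", election_txt)
    frames.append(df)

# A single concatenation after the loop
df_results_demo = pd.concat(frames, ignore_index=True)

print(df_results_demo.shape)          # (4, 3)
print(list(df_results_demo.columns))  # ['ELECCIONES', 'ORGANIZACIÓN POLÍTICA', 'TOTAL VOTOS']
```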


# 5. Verifying the outcome

Here, we display df_results to confirm that everything went as expected.

In [19]:
df_results

Unnamed: 0,ELECCIONES,ORGANIZACIÓN POLÍTICA,TOTAL VOTOS
0,PRESIDENCIAL 2021 - 2DA VUELTA,PARTIDO POLITICO NACIONAL PERU LIBRE,8836380
1,PRESIDENCIAL 2021 - 2DA VUELTA,FUERZA POPULAR,8792117
2,PRESIDENCIAL 2021 - 2DA VUELTA,VOTOS EN BLANCO,121489
3,PRESIDENCIAL 2021 - 2DA VUELTA,VOTOS NULOS,1106816
4,PRESIDENCIAL 2021,PARTIDO POLITICO NACIONAL PERU LIBRE,2724752
...,...,...,...
152,PRESIDENCIAL 1936,PARTIDO REPUBLICANO,30803
153,PRESIDENCIAL 1931,UNION REVOLUCIONARIA,152149
154,PRESIDENCIAL 1931,PARTIDO APRISTA PERUANO,106088
155,PRESIDENCIAL 1931,PARTIDO DESCENTRALISTA,21950
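
Beyond looking at the table, a few automated checks can help confirm the outcome. The following is a minimal sketch: the function `check_results` and the specific checks are hypothetical, and `sample` contains only a handful of rows copied from the output above rather than the full dataframe. It also assumes that TOTAL VOTOS was parsed as a numeric column.

```python
import pandas as pd

# A few rows copied from the output above; the real df_results is much longer
sample = pd.DataFrame({
    "ELECCIONES": ["PRESIDENCIAL 2021 - 2DA VUELTA"] * 4 + ["PRESIDENCIAL 1931"] * 3,
    "ORGANIZACIÓN POLÍTICA": [
        "PARTIDO POLITICO NACIONAL PERU LIBRE", "FUERZA POPULAR",
        "VOTOS EN BLANCO", "VOTOS NULOS",
        "UNION REVOLUCIONARIA", "PARTIDO APRISTA PERUANO", "PARTIDO DESCENTRALISTA",
    ],
    "TOTAL VOTOS": [8836380, 8792117, 121489, 1106816, 152149, 106088, 21950],
})

def check_results(df):
    """Hypothetical sanity checks; returns the number of rows per election."""
    expected_columns = ["ELECCIONES", "ORGANIZACIÓN POLÍTICA", "TOTAL VOTOS"]
    assert list(df.columns) == expected_columns, "unexpected columns"
    assert df["ELECCIONES"].notna().all(), "missing election label"
    assert pd.api.types.is_numeric_dtype(df["TOTAL VOTOS"]), "votes are not numeric"
    assert not df.duplicated().any(), "duplicated rows"
    # sort=False keeps the elections in the order in which they were scraped
    return df.groupby("ELECCIONES", sort=False).size()

rows_per_election = check_results(sample)
print(rows_per_election)
```

Calling `check_results(df_results)` on the full dataframe would apply the same checks to every scraped election.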


# 6. Exporting the results as an Excel file

This final part of the code creates and saves an Excel file with the scraped election data. First, the name of the Excel file is defined as "PE_Vote_1931-2021.xlsx". The code then retrieves the current working directory using os.getcwd() and combines it with the file name to create the full file path. Finally, the df_results DataFrame is saved to this file path using to_excel. The parameter index = False ensures that the DataFrame's index is not saved as a separate column in the Excel file, keeping the output clean and properly formatted.

In [23]:
# Now, let's create the name of the xlsx file
file_name = "PE_Vote_1931-2021.xlsx"

# We'll place it in our current working directory
current_dir = os.getcwd()

# Afterwards, we join the directory and the file name to create the file path
file_path = os.path.join(current_dir, file_name)

# Last step: Save the dataframe "df_results"
# Note that index = False prevents the dataframe's index from being saved as a new column in our Excel file
df_results.to_excel(file_path, index = False)
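
The path construction above can also be written with pathlib, which some readers may find easier to inspect. This is only an equivalent sketch of the same step, not a required change; the variable names `path_with_os` and `path_with_pathlib` are introduced here for illustration.

```python
import os
from pathlib import Path

file_name = "PE_Vote_1931-2021.xlsx"

# The approach used above: join the working directory and the file name
path_with_os = os.path.join(os.getcwd(), file_name)

# The pathlib equivalent: the / operator joins path components
path_with_pathlib = Path.cwd() / file_name

print(path_with_pathlib.name)    # PE_Vote_1931-2021.xlsx
print(path_with_pathlib.suffix)  # .xlsx
print(Path(path_with_os) == path_with_pathlib)  # True
```

Note that writing an .xlsx file with to_excel requires an Excel writer engine such as openpyxl to be installed alongside pandas.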