#### Name: Andrew Shapiro
#### Date: 11/24/2024
#### Exercise: Project #2, Part 2A - Web Scraping the Data
#### Description: This notebook will web scrape a combined total of more than 100 web pages and create 2 .csv files that we will use to analyze the 2020 and 2024 POTUS Elections.

All of the web pages that I need to scrape load their content dynamically, so I had to use Playwright to wait until a certain element has loaded and then parse the rendered HTML with BeautifulSoup. For the 2024 election, I had to scrape two pages per state because the data I needed was split across two places: the state page and the iframe it embeds.

Lastly, since I made this notebook separate from the data analysis notebook, I imported the 'csv' module so I could export my data list as a .csv file that I can import in other notebooks.

In [2]:
from bs4 import BeautifulSoup as bs
import pandas as pd
import nest_asyncio
import asyncio
from playwright.async_api import async_playwright
import csv

## BeautifulSoup Functions
As stated earlier, I had to use BeautifulSoup for this assignment. Below are 2 functions that I copied from the notebook file you provided on Slack, with some minor modifications to make them work for my website.

The function below is used for both the 2020 and 2024 POTUS elections: it waits for a 'div' with the ID 'president' (which is not present on the initial page load).

In [5]:
# Allow nested async loops
nest_asyncio.apply()

# Define the asynchronous scraping function
async def async_scrape_wunderground(url):
    async with async_playwright() as p:
        # Launch the browser in headless mode
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Navigate to the URL
        await page.goto(url)

        # Wait for a specific element that indicates content has loaded;
        # for these pages it is the div with id='president'
        await page.wait_for_selector("div#president")

        # Get the page content after JavaScript has executed
        html_content = await page.content()

        # Close the browser
        await browser.close()

        # Parse the HTML content with BeautifulSoup
        soup = bs(html_content, "html.parser")
        return soup

The second function below is used for the 2024 POTUS election only. After the function above returns the page content once the 'president' div has loaded, we also need to scrape a separate iframe source, which contains the table that holds our data. This function waits for a 'div' with the class 'w-full' (which is not present on the initial page load).

In [8]:
# Define the asynchronous scraping function
async def async_scrape_wunderground_iframe(url):
    async with async_playwright() as p:
        # Launch the browser in headless mode
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Navigate to the URL
        await page.goto(url)

        # Wait for a specific element that indicates content has loaded;
        # for the iframe page it is a div with class='w-full'
        await page.wait_for_selector("div.w-full")

        # Get the page content after JavaScript has executed
        html_content = await page.content()

        # Close the browser
        await browser.close()

        # Parse the HTML content with BeautifulSoup
        soup = bs(html_content, "html.parser")
        return soup

### 2024 POTUS Data

I did not want to split this code block into separate functions, because it loops through multiple pages for the same data and collects everything into one list, which becomes our data. Instead, I explain below how the code works so you have a general understanding of it. The comments in the code also outline some of my thought process.

The code loops through all 50 states + DC with a while loop. It takes the first state in the list, re-applies nest_asyncio, and builds a URL from the state name. It then fetches the page content for that state using the first function (the one that waits for a div with the ID 'president') and looks for another div with the ID 'PRE-', which contains an iframe element that links to another web page. The link is stored in the iframe's 'src' attribute, so once we have it, we use the second function to scrape the iframe page and look for a table inside it. If we find a table, we loop through each row (one row per candidate) and gather the attributes for that row. If every attribute in a row is found, we append the row to the master list.

After the loop has gone through every state + DC, as a final check, I used some code from ChatGPT (cited in the comments below) to remove any duplicate rows. Once this is done, we have a list with our data.
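
The duplicate check described above can be sketched on its own. This is only a minimal illustration, not the exact code used in the cell below: the sample rows and the `dedupe_rows` helper name are made up for the example, and it builds the key from all of a row's items rather than listing each field by hand.

```python
# Hypothetical sample rows in the same shape as the scraped data.
rows = [
    {'state': 'alabama', 'name': 'Donald J. Trump', 'party': 'GOP',
     'votes': '1,457,913', 'percentage': '64.8%'},
    {'state': 'alabama', 'name': 'Kamala Harris', 'party': 'DEM',
     'votes': '769,482', 'percentage': '34.2%'},
    # Exact duplicate of the first row, added here only for the demonstration.
    {'state': 'alabama', 'name': 'Donald J. Trump', 'party': 'GOP',
     'votes': '1,457,913', 'percentage': '64.8%'},
]


def dedupe_rows(rows):
    """Return rows with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        # Dicts are not hashable, so build a hashable key from their items.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


unique_rows = dedupe_rows(rows)
print(len(rows), '->', len(unique_rows))
```

A set gives a fast "have I seen this before?" check, while the separate list keeps the rows in the order they were scraped.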

The entire scraping process takes around 5-10 minutes. I find that the internet at CCM is a bit slow, so the scrape errors a lot there; at home, it is a bit faster. For this reason, I have made it easier on myself (and you) by converting the list into a .csv file that we can import, rather than having to re-collect all of the data from the web. You are welcome to run the code below as well if you would like to! I can always make a screen recording of it working at home if you need more information.

Additionally, a state is only REMOVED from the list once it has been successfully processed, so if a state errors, the loop will attempt to scrape that state again. I also print information that helps me see what is happening in the code (whether a state was added, whether it errored, etc.).
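
One thing to be aware of with this retry approach is that a state that never loads would keep the loop running forever. Below is a rough sketch of the same "only remove on success" idea with a cap on attempts. It is not the code used in this notebook: `scrape_all`, `max_attempts`, and the `flaky_fetch` stand-in (which replaces the real scraper so the example runs without a browser) are all hypothetical names for illustration.

```python
def scrape_all(states, fetch, max_attempts=5):
    """Call fetch(state) for each state, retrying failures up to max_attempts.

    Returns (results, failed): results maps each state to its fetched value,
    failed lists the states that never succeeded.
    """
    pending = list(states)  # copy so the caller's list is left untouched
    attempts = {}
    results = {}
    failed = []
    while pending:
        state = pending[0]
        attempts[state] = attempts.get(state, 0) + 1
        try:
            results[state] = fetch(state)
        except Exception as e:
            print(f"Error with {state} ({e}), attempt {attempts[state]}")
            if attempts[state] >= max_attempts:
                # Give up on this state instead of looping forever.
                failed.append(state)
                pending.pop(0)
            continue
        print(f"Added: {state}")
        pending.pop(0)  # only removed after a successful fetch
    return results, failed


# Hypothetical stand-in for the real scraper: 'alaska' fails twice, then works.
calls = {}

def flaky_fetch(state):
    calls[state] = calls.get(state, 0) + 1
    if state == 'alaska' and calls[state] < 3:
        raise TimeoutError("page did not load")
    if state == 'nowhere':
        raise ValueError("no such page")
    return f"data for {state}"


results, failed = scrape_all(['alabama', 'alaska', 'nowhere'], flaky_fetch, max_attempts=3)
print(failed)
```

Returning the failed states separately makes it easy to see at the end which pages still need to be scraped again.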

In [11]:
# This will hold the data.
master = []

# An array we loop through containing all 50 states + DC.
states = [
    'alabama', 'alaska', 'arizona', 'arkansas', 'california', 'colorado', 'connecticut', 'delaware', 'district-of-columbia',
    'florida', 'georgia', 'hawaii', 'idaho', 'illinois', 'indiana', 'iowa', 'kansas', 'kentucky', 
    'louisiana', 'maine', 'maryland', 'massachusetts', 'michigan', 'minnesota', 'mississippi', 'missouri', 
    'montana', 'nebraska', 'nevada', 'new-hampshire', 'new-jersey', 'new-mexico', 'new-york', 'north-carolina', 
    'north-dakota', 'ohio', 'oklahoma', 'oregon', 'pennsylvania', 'rhode-island', 'south-carolina', 'south-dakota', 
    'tennessee', 'texas', 'utah', 'vermont', 'virginia', 'washington', 'west-virginia', 'wisconsin', 'wyoming'
]

while states:  # Continue looping until 'states' is empty
    state = states[0]  # We reference the first state in the list.
    nest_asyncio.apply()

    # I craft a URL using string formatting with the state.
    url = f'https://www.270towin.com/2024-election-results-live/state/{state}'
    
    try:
        # Fetch page content for each state
        soup = asyncio.run(async_scrape_wunderground(url))

        # Find the div with id 'PRE-' (where the election data is stored)
        div = soup.find('div', id='PRE-')

        # Get the 'src' URL of the iframe inside that div
        iframe = div.find('iframe')['src']
        
        if iframe:
            # Scrape iframe content
            frame = asyncio.run(async_scrape_wunderground_iframe(iframe))
            table = frame.find('table', class_='w-full flex-grow')
            
            if table:
                for row in table.find_all('tr'):
                    # Extract candidate data from the specific columns
                    name = row.find('span', class_='ml-3 mr-1 text-base')
                    party = row.find('td', class_='px-1 text-center')
                    votes = row.find('td', class_='pr-1 text-end text-base sm:pr-2')
                    pct = row.find('td', class_='w-[90px] text-center text-lg font-bold')

                    # If all pieces of information are valid, we add the data to our list.
                    if name and party and votes and pct:
                        master.append({
                            'state': state,
                            'name': name.text.strip(),
                            'party': party.text.strip(),
                            'votes': votes.text.strip(),
                            'percentage': pct.text.strip()
                        })
        else:
            print(f"Broken link for {state}: {url}")

        print(f'Added: {state}')

        # Remove the state if it's successfully processed.
        states.pop(0)

    except Exception as e:
        print(f"Error with {state}, retrying...")

# ----

# The code below removes duplicate rows of data in the list.

m = []
seen = set()
        
for row in master:
    identifier = (row['state'], row['name'], row['party'], row['votes'], row['percentage'])
    if identifier not in seen:
        seen.add(identifier)
        m.append(row)

# OpenAI. ChatGPT. 2024. OpenAI, https://www.openai.com/chatgpt.
# ---

master = m

master

Added: alabama
Added: alaska
Added: arizona
Added: arkansas
Error with california, retrying...
Error with california, retrying...
Error with california, retrying...
Error with california, retrying...
Error with california, retrying...
Added: california
Added: colorado
Error with connecticut, retrying...
Added: connecticut
Added: delaware
Added: district-of-columbia
Added: florida
Error with georgia, retrying...
Added: georgia
Error with hawaii, retrying...
Error with hawaii, retrying...
Error with hawaii, retrying...
Error with hawaii, retrying...
Error with hawaii, retrying...
Added: hawaii
Error with idaho, retrying...
Error with idaho, retrying...
Error with idaho, retrying...
Added: idaho
Added: illinois
Added: indiana
Added: iowa
Added: kansas
Error with kentucky, retrying...
Added: kentucky
Added: louisiana
Added: maine
Error with maryland, retrying...
Added: maryland
Added: massachusetts
Added: michigan
Added: minnesota
Added: mississippi
Error with missouri, retrying...
Added: 

[{'state': 'alabama',
  'name': 'Donald J. Trump',
  'party': 'GOP',
  'votes': '1,457,913',
  'percentage': '64.8%'},
 {'state': 'alabama',
  'name': 'Kamala Harris',
  'party': 'DEM',
  'votes': '769,482',
  'percentage': '34.2%'},
 {'state': 'alabama',
  'name': 'Robert F. Kennedy Jr.',
  'party': 'IND',
  'votes': '12,028',
  'percentage': '0.5%'},
 {'state': 'alabama',
  'name': 'Chase Oliver',
  'party': 'LIB',
  'votes': '4,915',
  'percentage': '0.2%'},
 {'state': 'alabama',
  'name': 'Jill Stein',
  'party': 'GRE',
  'votes': '4,298',
  'percentage': '0.2%'},
 {'state': 'alaska',
  'name': 'Donald J. Trump',
  'party': 'GOP',
  'votes': '184,204',
  'percentage': '54.5%'},
 {'state': 'alaska',
  'name': 'Kamala Harris',
  'party': 'DEM',
  'votes': '139,812',
  'percentage': '41.4%'},
 {'state': 'alaska',
  'name': 'Robert F. Kennedy Jr.',
  'party': 'IND',
  'votes': '5,663',
  'percentage': '1.7%'},
 {'state': 'alaska',
  'name': 'Chase Oliver',
  'party': 'LIB',
  'votes': 

We will now export the list above to a .csv file named 'Election2024.csv'. All of the code below is from ChatGPT (cited at the bottom).

In [13]:
output = 'Election2024.csv'

# Open the file in write mode and create a CSV writer object
with open(output, mode='w', newline='', encoding='utf-8') as file:
    # Create a writer object
    writer = csv.DictWriter(file, fieldnames=['state', 'name', 'party', 'votes', 'percentage'])
    
    # Write the header row
    writer.writeheader()
    
    # Write the data rows
    writer.writerows(master)

# OpenAI. ChatGPT. 2024. OpenAI, https://www.openai.com/chatgpt.
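
Because the votes and percentages are saved as text (for example '1,457,913' and '64.8%'), they will need to be converted to numbers before any analysis. The sketch below shows one possible way to do that when reading the file back in. The `parse_row` helper is a hypothetical name, and an in-memory sample is used in place of the real file so the example runs on its own.

```python
import csv
import io

# Hypothetical sample in the same layout as the exported .csv file.
# The csv module quotes values that contain commas, such as vote counts.
sample = io.StringIO(
    "state,name,party,votes,percentage\n"
    'alabama,Donald J. Trump,GOP,"1,457,913",64.8%\n'
    'alabama,Kamala Harris,DEM,"769,482",34.2%\n'
)


def parse_row(row):
    """Return a copy of the row with votes and percentage as numbers."""
    return {
        **row,
        'votes': int(row['votes'].replace(',', '')),
        'percentage': float(row['percentage'].rstrip('%')),
    }


parsed = [parse_row(r) for r in csv.DictReader(sample)]
print(parsed[0]['votes'], parsed[0]['percentage'])
```

To use this on the real data, `sample` would be replaced with the opened 'Election2024.csv' file.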

### 2020 POTUS Data

As with the 2024 data, I did not want to split this code block into separate functions, because it loops through multiple pages for the same data and collects everything into one list, which becomes our data. Instead, I explain below how the code works so you have a general understanding of it. The comments in the code also outline some of my thought process.

The code loops through all 50 states + DC with a while loop. It takes the first state in the list, re-applies nest_asyncio, and builds a URL from the state name. It then fetches the page content for that state using the first function (the one that waits for a div with the ID 'president') and looks for a table whose ID contains 'race_table'. ChatGPT helped me figure out how to check whether an ID contains that text (cited in the comments below). From there, I loop through each candidate row, get the name, party, vote count, and vote percentage, and append them together with the state to the master list.

After the loop has gone through every state + DC, as a final check, I used some code from ChatGPT (cited in the comments below) to remove any duplicate rows. Once this is done, we have a list with our data.

The entire scraping process takes around 5-10 minutes. I find that the internet at CCM is a bit slow, so the scrape errors a lot there; at home, it is a bit faster. For this reason, I have made it easier on myself (and you) by converting the list into a .csv file that we can import, rather than having to re-collect all of the data from the web. You are welcome to run the code below as well if you would like to! I can always make a screen recording of it working at home if you need more information.

Additionally, a state is only REMOVED from the list once it has been successfully processed, so if a state errors, the loop will attempt to scrape that state again. I also print information that helps me see what is happening in the code (whether a state was added, whether it errored, etc.).
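
The name and party for 2020 both come out of a single candidate cell. As a small illustration of that split, here is a sketch that assumes the cell text looks like 'Name (Party)', sometimes with a '*' marker next to the name; that format is inferred from the splitting logic in the code below, not checked against the site. `split_candidate` is a hypothetical helper name, and unlike the code below it returns an empty party when no parenthesis is present.

```python
def split_candidate(cell_text):
    """Split 'Name (Party)' into (name, party), dropping any '*' marker."""
    text = cell_text.strip().replace('*', '')
    # rpartition splits on the LAST '(' in the text.
    name, sep, party = text.rpartition('(')
    if not sep:
        # No parenthesis found: treat the whole cell as the name.
        return text.strip(), ''
    return name.strip(), party.replace(')', '').strip()


print(split_candidate('Donald J. Trump* (Republican)'))
print(split_candidate('  Joe Biden (Democratic)  '))
```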

In [None]:
# This is where we store our data.
master = []

# This array holds the list of states + DC that we will loop through.
states = [
    'alabama', 'alaska', 'arizona', 'arkansas', 'california', 'colorado', 'connecticut', 'delaware', 'district-of-columbia',
    'florida', 'georgia', 'hawaii', 'idaho', 'illinois', 'indiana', 'iowa', 'kansas', 'kentucky', 
    'louisiana', 'maine', 'maryland', 'massachusetts', 'michigan', 'minnesota', 'mississippi', 'missouri', 
    'montana', 'nebraska', 'nevada', 'new-hampshire', 'new-jersey', 'new-mexico', 'new-york', 'north-carolina', 
    'north-dakota', 'ohio', 'oklahoma', 'oregon', 'pennsylvania', 'rhode-island', 'south-carolina', 'south-dakota', 
    'tennessee', 'texas', 'utah', 'vermont', 'virginia', 'washington', 'west-virginia', 'wisconsin', 'wyoming'
]

while states:  # Continue looping until 'states' is empty
    state = states[0]  # Always process the first state in the list
    nest_asyncio.apply()

    # We craft our URL using string formatting and each state.
    url = f'https://www.270towin.com/2020-election-results-live/state/{state}'
    
    try:
        # Fetch page content for each state
        soup = asyncio.run(async_scrape_wunderground(url))
        
        if soup:
            # We try to find a table containing 'race_table' in the id.
            table = soup.find('table', id=lambda x: x and 'race_table' in x)
            # OpenAI. ChatGPT. 2024. OpenAI, https://www.openai.com/chatgpt.
            
            if table:
                for row in table.find_all('tr', class_='candidate-row'):
                    votes_cell = row.find('td', class_=lambda x: x and 'votes' in x) 
                    # OpenAI. ChatGPT. 2024. OpenAI, https://www.openai.com/chatgpt.
                    pct_cell = row.find('td', class_=lambda x: x and 'vote_percent' in x) 
                    # OpenAI. ChatGPT. 2024. OpenAI, https://www.openai.com/chatgpt.

                    
                    # Find candidate name and party (which are both inside of the candidate cell)
                    name_cell = row.find('td', class_='candidate')

                    if name_cell and votes_cell and pct_cell:
                        # We split the name and party values from the candidate cell.
                        name = name_cell.text.strip().split('(')[0].strip().replace('*', '')
                        party = name_cell.text.strip().split('(')[-1].replace(')', '').strip()

                        # We find the values in the votes and percentage cells.
                        votes = votes_cell.text.strip()
                        percentage = pct_cell.text.strip()

                        # We add all of the information to our master list.
                        master.append({
                            'state': state,
                            'name': name,
                            'party': party,
                            'votes': votes,
                            'percentage': percentage
                        })
        else:
            print(f"Broken link for {state}: {url}")

        print(f'Added: {state}')

        # It will only remove a state once it processes it successfully and adds the data to the list.
        states.pop(0)

    except Exception as e:
        print(f"Error with {state}, retrying...")

# ----

# The code below removes duplicate rows of data in the list.

m = []
seen = set()
        
for row in master:
    identifier = (row['state'], row['name'], row['party'], row['votes'], row['percentage'])
    if identifier not in seen:
        seen.add(identifier)
        m.append(row)

# OpenAI. ChatGPT. 2024. OpenAI, https://www.openai.com/chatgpt.
# ---

# Replace the original list with the de-duplicated one.
master = m

master

Added: alabama
Added: alaska
Added: arizona
Added: arkansas
Added: california
Added: colorado
Added: connecticut
Added: delaware
Added: district-of-columbia
Added: florida
Added: georgia
Added: hawaii
Added: idaho
Added: illinois
Added: indiana
Added: iowa
Added: kansas
Added: kentucky
Added: louisiana
Added: maine
Added: maryland
Added: massachusetts
Added: michigan
Added: minnesota
Added: mississippi
Added: missouri
Added: montana
Added: nebraska
Added: nevada
Added: new-hampshire
Added: new-jersey
Added: new-mexico
Added: new-york
Added: north-carolina
Added: north-dakota
Added: ohio
Added: oklahoma
Added: oregon
Added: pennsylvania
Added: rhode-island
Added: south-carolina
Added: south-dakota
Added: tennessee
Added: texas
Added: utah
Added: vermont
Added: virginia
Added: washington
Added: west-virginia
Added: wisconsin
Added: wyoming


[{'state': 'alabama',
  'name': 'Donald J. Trump',
  'party': 'Republican',
  'votes': '1,441,170',
  'percentage': '62.2%'},
 {'state': 'alabama',
  'name': 'Joe Biden',
  'party': 'Democratic',
  'votes': '849,378',
  'percentage': '36.7%'},
 {'state': 'alabama',
  'name': 'Jo Jorgensen',
  'party': 'Libertarian',
  'votes': '25,176',
  'percentage': '1.1%'},
 {'state': 'alaska',
  'name': 'Donald J. Trump',
  'party': 'Republican',
  'votes': '189,543',
  'percentage': '53.1%'},
 {'state': 'alaska',
  'name': 'Joe Biden',
  'party': 'Democratic',
  'votes': '153,502',
  'percentage': '43.0%'},
 {'state': 'alaska',
  'name': 'Jo Jorgensen',
  'party': 'Libertarian',
  'votes': '8,877',
  'percentage': '2.5%'},
 {'state': 'alaska',
  'name': 'Jesse  Ventura',
  'party': 'Green',
  'votes': '2,664',
  'percentage': '0.7%'},
 {'state': 'alaska',
  'name': 'Don Blankenship',
  'party': 'Constitution',
  'votes': '1,120',
  'percentage': '0.3%'},
 {'state': 'alaska',
  'name': 'Brock Pier

We will now export the list above to a .csv file named 'Election2020.csv'. All of the code below is from ChatGPT (cited at the bottom).

In [5]:
output = 'Election2020.csv'

# Open the file in write mode and create a CSV writer object
with open(output, mode='w', newline='', encoding='utf-8') as file:
    # Create a writer object
    writer = csv.DictWriter(file, fieldnames=['state', 'name', 'party', 'votes', 'percentage'])
    
    # Write the header row
    writer.writeheader()
    
    # Write the data rows
    writer.writerows(master)
    
# OpenAI. ChatGPT. 2024. OpenAI, https://www.openai.com/chatgpt.

## End
At the end of this notebook, we have successfully scraped our web pages and stored the data in 2 .csv files that we will import into our other notebook for data analysis. These files are named `Election2024.csv` and `Election2020.csv`.