<a href="https://colab.research.google.com/github/bryaanabraham/Election-Data-Analysis/blob/main/Election_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Dependencies

In [79]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

###Data Extraction:
- The following snippet is a function to extract data from the provided url.
- The data is exctracted and saved to a file named as per your choice

In [80]:
def extract_data(url, name):
  response = requests.get(url)
  html_content = response.content

  # Parse HTML content
  soup = BeautifulSoup(html_content, 'html.parser')

  # Locate the table
  table = soup.find('table', class_='table table-striped table-bordered')

  if table is None:
      print(f"Table not found for URL: {url}")
      return None

  # Extract headers
  headers = [header.text.strip() for header in table.find_all('th')]
  headers = headers[:7]
  headers.append('Source')  # Add an extra header for the 'name' column
  print(f"Headers for {name}: {headers}")

  # Extract rows
  rows = []
  for row in table.find('tbody').find_all('tr'):
      cells = row.find_all('td')
      row_data = [cell.text.strip() for cell in cells[:7]]
      row_data.append(name)  # Add the name to the row
      rows.append(row_data)

  df = pd.DataFrame(rows, columns=headers)
  print('---------------------------------------------------------------------------------------------------------------------------------------------------')
  return df


# Web Scraping

### 1. WebScrape funtion
- the web scrape funtion is designed to perfrom heirachical scraping to extract data from links within links as needed
- it returns a set of urs

### 2. Initial Fetch
- data is extracted from the given url using Beautiful Soup
- Unique links are stored in a set named urls

### 3. Next Page Navigations
- data of foloowing pages needs to be extracted (Heirachical)
- the list of urls from the initial fetch are fed into the WebScrape function for the same
- result is store in a nav_urls set
 - We can derive the relevant urls and exclude promotions and advertisements by analysing the set

In [81]:
# WebScrape Function [1]
def webScrape(links):
    unique_urls = set()
    for link in links:
        try:
            response = requests.get(link)
            html_content = response.content
            soup = BeautifulSoup(html_content, 'html.parser')

            # Extract all links from buttons
            buttons = soup.find_all('a')
            urls = {button['href'] for button in buttons if 'href' in button.attrs}
            unique_urls.update(urls)
            for url in urls:
                if 'https://results.eci.gov.in' in url:
                    if url not in unique_urls:
                        unique_urls.add(url)
                        # Recursively scrape the new URL
                        unique_urls.update(webScrape([url]))

        except Exception as e:
            print(f"Unable to fetch {link} : {e}")

    return unique_urls

In [82]:
# Initial Fetch [2]
home_url = ['https://results.eci.gov.in']
nav_from_urls = webScrape(home_url)
for link in nav_from_urls:
    print(link)

https://results.eci.gov.in/AcResultGenJune2024/index.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
https://results.eci.gov.in/AcResultGen2ndJune2024/index.htm
https://results.eci.gov.in/PcResultGenJune2024/index.htm
https://apps.apple.com/in/app/voter-helpline/id1456535004
index.htm
https://results.eci.gov.in/AcResultByeJune2024/


In [83]:
# excluding false positives (removing unneeded links)
nav_from_urls = [item for item in nav_from_urls if 'results.eci.gov' in item]
for link in nav_from_urls:
    print(link)
print(f"Total Links: {len(nav_from_urls)}")

https://results.eci.gov.in/AcResultGenJune2024/index.htm
https://results.eci.gov.in/AcResultGen2ndJune2024/index.htm
https://results.eci.gov.in/PcResultGenJune2024/index.htm
https://results.eci.gov.in/AcResultByeJune2024/
Total Links: 4


In [84]:
next_nav_urls = webScrape(nav_from_urls)

In [85]:
for link in next_nav_urls:
    print(link)
print(len(next_nav_urls))


candidateswise-S0626.htm
partywisewinresultState-911.htm
partywisewinresultState-1458.htm
partywisewinresultState-1420.htm
candidateswise-S1036.htm
candidateswise-S2971.htm
partywisewinresultState-545.htm
partywisewinresultState-743.htm
partywisewinresultState-2070.htm
./hi/index.htm
candidateswise-S22233.htm
partywiseresult-S01.htm
partywisewinresultState-1680.htm
candidateswise-S237.htm
partywisewinresultState-582.htm
partywisewinresultState-1888.htm
candidateswise-S06108.htm
candidateswise-S25113.htm
candidateswise-S04195.htm
partywisewinresultState-2757.htm
partywisewinresultState-834.htm
partywiseresult-S02.htm
hi/index.htm
partywisewinresultState-1.htm
partywiseresult-S21.htm
candidateswise-S0845.htm
candidateswise-S20165.htm
candidateswise-S06136.htm
candidateswise-S24292.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
partywisewinresultState-118.htm
partywisewinresultState-369.htm
candidateswise-S2562.htm
partywisewinresultState-2989.htm
candidateswise-S24173

    Relevant Links:
    - AcResultGenJune
    - AcResultByeJune
    - Party Wise result
    - Candidate wise result
  
Links need to be further analysed to see if they hold more data in the next pages whereas domain names can be be extracted <br>(This is subjective to Websites, this was discovered upon exploration)

# Heirarchical Extraction

- this is the process of choosing to scrape for more links from a website or extracting data or simply ignoring
- on WebScrape the set includes links and domain names
- the domain names consist tabular data (subjective to website)
- the links navigate to other websites which contain data or more links(only some of which need to be analysed)
- Item is chosen for subscraping or extraction based on the list of relevant links (substrings)

In [86]:
# Directory to store Files
import os
os.makedirs('csv_files', exist_ok=True)
location = 'csv_files'

In [87]:
relevant_address = 'https://results.eci.gov.in'
relevant_domain_names = ['partywisewinresultState','candidateswise', ]
#filtered_links = [item for item in absolute_links if 'candidateswise'in item]

In [88]:
def save2csv(df, name, location):
  csv_file_path = f"{location}/{name}.csv"
  df.to_csv(csv_file_path, index=False)

In [89]:
# Hierarchical Extraction
for url in list(next_nav_urls):

  for relevant in relevant_domain_names:
    if relevant in url:

      if relevant == 'partywisewinresultState':
        df = pd.DataFrame()
        base_url = "https://results.eci.gov.in/PcResultGenJune2024/"
        url = base_url + '/' + url
        name = url.split('-')[-1].split('.')[0]
        df_new = extract_data(url, name)
        if df_new is not None:
            df_gen_party = pd.concat([df, df_new], ignore_index=True)

      if relevant == 'candidateswise':
        df = pd.DataFrame()
        base_url = "https://results.eci.gov.in/AcResultByeJune2024"
        url = base_url + '/' + url
        url = url.replace('candidateswise-', 'Constituencywise')
        name = url.split('-')[-1].split('.')[0]
        df_bye_new = extract_data(url, name)
        if df_bye_new is not None:
            df_bye_candidate = pd.concat([df, df_bye_new], ignore_index=True)


Headers for https://results: ['S.N.', 'Candidate', 'Party', 'EVM Votes', 'Postal Votes', 'Total Votes', '% of Votes', 'Source']
---------------------------------------------------------------------------------------------------------------------------------------------------
Headers for 911: ['S.No', 'Parliament Constituency', 'Winning Candidate', 'Total Votes', 'Margin', 'Source']
---------------------------------------------------------------------------------------------------------------------------------------------------
Headers for 1458: ['S.No', 'Parliament Constituency', 'Winning Candidate', 'Total Votes', 'Margin', 'Source']
---------------------------------------------------------------------------------------------------------------------------------------------------
Headers for 1420: ['S.No', 'Parliament Constituency', 'Winning Candidate', 'Total Votes', 'Margin', 'Source']
---------------------------------------------------------------------------------------------------

In [90]:
df_bye_candidate

Unnamed: 0,S.N.,Candidate,Party,EVM Votes,Postal Votes,Total Votes,% of Votes,Source
0,1,ANURADHA RANA,Indian National Congress,8877,537,9414,47.09,https://results
1,2,RAVI THAKUR,Bharatiya Janata Party,2934,115,3049,15.25,https://results
2,3,DR. RAM LAL MARKANDA,Independent,7091,363,7454,37.28,https://results
3,4,NOTA,None of the Above,75,1,76,0.38,https://results


In [91]:
df_gen_party

Unnamed: 0,S.No,Parliament Constituency,Winning Candidate,Total Votes,Margin,Source
0,1,Srikakulam(2),KINJARAPU RAMMOHAN NAIDU,754328,327901,1745
1,2,Vizianagaram(3),APPALANAIDU KALISETTI,743113,249351,1745
2,3,Visakhapatnam(4),SRIBHARAT MATHUKUMILI,907467,504247,1745
3,4,Amalapuram (SC)(7),G M HARISH (BALAYOGI),796981,342196,1745
4,5,Eluru(10),PUTTA MAHESH KUMAR,746351,181857,1745
5,6,Vijayawada(12),KESINENI SIVANATH (CHINNI),794154,282085,1745
6,7,Guntur(13),DR CHANDRA SEKHAR PEMMASANI,864948,344695,1745
7,8,Narsaraopet(14),LAVU SRIKRISHNA DEVARAYALU,807996,159729,1745
8,9,Bapatla (SC)(15),KRISHNA PRASAD TENNETI,717493,208031,1745
9,10,Ongole(16),MAGUNTA SREENIVASULU REDDY,701894,50199,1745


In [92]:
save2csv(df_bye_candidate,'candidate_wise',location)

In [93]:
save2csv(df_gen_party,'party_wise',location)

In [94]:
# Download data to a zip file for backup
import shutil

directory_to_compress = '/content/csv_files'
output_zip_file = '/content/csv_files.zip'
shutil.make_archive(output_zip_file.replace('.zip', ''), 'zip', directory_to_compress)

'/content/csv_files.zip'

# Manual Extraction