<a href="https://colab.research.google.com/github/bryaanabraham/Election-Data-Analysis/blob/main/Election_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Dependencies

In [20]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Data Extraction
- The following function extracts the results table from the given URL.
- The extracted data is saved to `csv_files/<name>.csv`, where `name` is chosen by the caller.

In [46]:
def extract_data(url, name):
    response = requests.get(url)
    html_content = response.content

    # Parse HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Locate the table
    table = soup.find('table', class_='table table-striped table-bordered')

    if table is None:
        print(f"Table not found for URL: {url}")
        return

    # Extract headers
    headers = [header.text.strip() for header in table.find_all('th')]

    # Keep only the first 7 headers that match the data rows
    headers = headers[:7]
    print(f"Headers for {name}: {headers}")

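    # Extract rows (fall back to the table itself if there is no <tbody>)
    body = table.find('tbody')
    if body is None:
        body = table
    rows = []
    for row in body.find_all('tr'):
        cells = row.find_all('td')
        if cells:  # skip rows that contain no data cells
            rows.append([cell.text.strip() for cell in cells])
    print(f"Rows for {name}: {rows[:3]}")

    # Keep only rows whose cell count matches the headers; a mismatch can
    # make the DataFrame constructor fail
    valid_rows = []
    for row in rows:
        if len(row) == len(headers):
            valid_rows.append(row)
        else:
            print(f"Row length {len(row)} does not match header length {len(headers)}")

    df = pd.DataFrame(valid_rows, columns=headers)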
    csv_file_path = f"csv_files/{name}.csv"
    df.to_csv(csv_file_path, index=False)

    print(f"Data has been saved to {csv_file_path}")
    print('---------------------------------------------------------------------------------------------------------------------------------------------------')
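
The row-length check in `extract_data` can be tried in isolation on plain lists. The following is a minimal sketch, not part of the notebook: `split_rows` is a hypothetical helper and the sample values are made up for illustration.

```python
def split_rows(headers, rows):
    """Return (good, bad): rows matching the header count, and the rest."""
    good, bad = [], []
    for row in rows:
        (good if len(row) == len(headers) else bad).append(row)
    return good, bad

# Made-up sample data, for illustration only.
headers = ["S.No", "Party", "Won"]
rows = [["1", "Party A", "3"], ["2", "Party B"], ["3", "Party C", "1"]]

good, bad = split_rows(headers, rows)
print(f"{len(good)} usable rows, {len(bad)} skipped")
```

Only the `good` rows would then be passed to `pd.DataFrame(good, columns=headers)`, while the `bad` rows can be logged for inspection.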

# Web Scraping

### 1. WebScrape function
- The web scrape function performs hierarchical scraping, extracting links found within linked pages as needed.
- It returns a set of URLs.

### 2. Initial Fetch
- The page at the given URL is fetched and parsed using Beautiful Soup.
- Unique links are stored in a set named `urls`.

### 3. Next Page Navigations
- Data from the following pages also needs to be extracted (hierarchical).
- The URLs from the initial fetch are fed into the WebScrape function for this purpose.
- The result is stored in a set named `nav_urls`.
- By analysing this set, we can identify the relevant URLs and exclude promotions and advertisements.
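
The steps above amount to a level-by-level crawl: fetch the links on a page, then fetch the links on each of those pages. The sketch below illustrates the idea without any network access. `crawl`, `fetch_links`, and `fake_site` are hypothetical names introduced for illustration; in the notebook, the role of `fetch_links` is played by `webScrape`.

```python
def crawl(start_urls, fetch_links, max_depth=2):
    """Breadth-first crawl: follow links level by level up to max_depth."""
    seen = set(start_urls)
    frontier = list(start_urls)
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            for link in fetch_links(page):
                if link not in seen:  # visit each link only once
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

# Hypothetical site map standing in for real HTTP requests.
fake_site = {
    "home": ["results-a", "results-b"],
    "results-a": ["state-1", "state-2"],
    "results-b": ["state-3", "home"],
}
found = crawl(["home"], lambda page: fake_site.get(page, []), max_depth=2)
print(sorted(found))
```

Tracking `seen` keeps the crawl from revisiting pages that link back to each other, and `max_depth` bounds how far the crawl follows links away from the start page.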

In [59]:
# WebScrape Function [1]
def webScrape(links):
  """Return the set of unique href values found on the given pages."""
  unique_urls = set()
  for page in links:
    # The timeout keeps a slow host from blocking the loop indefinitely
    response = requests.get(page, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Collect the href of every anchor tag on the page
    hrefs = {a['href'] for a in soup.find_all('a') if 'href' in a.attrs}
    unique_urls.update(hrefs)

    for href in hrefs:
      print(href)
    print(f"Total Links: {len(unique_urls)}")
  return unique_urls

In [60]:
# Initial Fetch [2]
url = ["https://results.eci.gov.in/"]
urls = webScrape(url)

https://apps.apple.com/in/app/voter-helpline/id1456535004
https://results.eci.gov.in/AcResultGenJune2024/index.htm
https://results.eci.gov.in/PcResultGenJune2024/index.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
index.html
https://results.eci.gov.in/AcResultGen2ndJune2024/index.htm
https://results.eci.gov.in/AcResultByeJune2024/
Total Links: 7


In [61]:
# Excluding false positives: keep only absolute https links
urls = [item for item in urls if 'https' in item]
for link in urls:
    print(link)
print(f"Total Links: {len(urls)}")

https://results.eci.gov.in/PcResultGenJune2024/index.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
https://apps.apple.com/in/app/voter-helpline/id1456535004
https://results.eci.gov.in/AcResultGen2ndJune2024/index.htm
https://results.eci.gov.in/AcResultByeJune2024/
https://results.eci.gov.in/AcResultGenJune2024/index.htm
Total Links: 6
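
The `'https' in item` filter drops relative links such as `index.html` but still keeps the app store links. A stricter filter could compare the hostname instead. This is a minimal sketch under that assumption: `same_host` is a hypothetical helper, and the sample links are copied from the output above.

```python
from urllib.parse import urlparse

def same_host(urls, host):
    """Keep only absolute URLs whose hostname equals `host`."""
    return [u for u in urls if urlparse(u).hostname == host]

links = [
    "https://results.eci.gov.in/PcResultGenJune2024/index.htm",
    "https://play.google.com/store/apps/details?id=com.eci.citizen",
    "https://apps.apple.com/in/app/voter-helpline/id1456535004",
    "https://results.eci.gov.in/AcResultByeJune2024/",
    "index.html",
]
for link in same_host(links, "results.eci.gov.in"):
    print(link)
```

Relative links have no hostname, so they are excluded as well and would need to be resolved against the page URL first if they are wanted.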


In [62]:
# Extracting next page navigations from urls [3]
nav_urls = webScrape(urls)


partywisewinresultState-2484.htm
https://results.eci.gov.in/AcResultByeJune2024/index.htm
partywisewinresultState-805.htm
partywisewinresultState-1658.htm
#
partywisewinresultState-834.htm
partywisewinresultState-860.htm
partywisewinresultState-3165.htm
partywisewinresultState-743.htm
partywisewinresultState-547.htm
partywisewinresultState-804.htm
partywisewinresultState-3388.htm
partywisewinresultState-83.htm
partywisewinresultState-1142.htm
partywisewinresultState-2757.htm
partywisewinresultState-852.htm
partywisewinresultState-1888.htm
partywisewinresultState-1847.htm
index.htm
partywisewinresultState-742.htm
partywisewinresultState-1534.htm
partywisewinresultState-3529.htm
partywisewinresultState-664.htm
partywisewinresultState-1.htm
partywisewinresultState-1680.htm
partywisewinresultState-911.htm
partywisewinresultState-1584.htm
partywisewinresultState-1745.htm
partywisewinresultState-3482.htm
partywisewinresultState-1046.htm
https://apps.apple.com/in/app/voter-helpline/id1456535

KeyboardInterrupt: 

    Relevant Links:
    - AcResultGenJune
    - AcResultByeJune
    - Party Wise result
    - Candidate wise result
  
Absolute links need to be analysed further to see whether they lead to more data on subsequent pages, whereas the relative page names (called domain names here) can be extracted directly.<br>(This depends on the website and was discovered through exploration.)
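
Relative page names can be turned into absolute URLs with `urllib.parse.urljoin`. The sketch below assumes the names were found on the `AcResultByeJune2024` index page; `page_url` and `found` are illustrative values taken from the output above, not variables from the notebook.

```python
from urllib.parse import urljoin

# Assumption: the page on which the relative names were found.
page_url = "https://results.eci.gov.in/AcResultByeJune2024/index.htm"
found = [
    "partywisewinresultState-2484.htm",
    "index.htm",
    "#",
    "https://results.eci.gov.in/AcResultByeJune2024/index.htm",
]

# Resolve each href against the page URL; skip the bare "#" placeholder.
resolved = {urljoin(page_url, href) for href in found if href != "#"}
for link in sorted(resolved):
    print(link)
```

Absolute URLs pass through `urljoin` unchanged, so the same call handles both kinds of entry, and the set removes the duplicate `index.htm`.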

# Hierarchical Extraction

- This is the process of deciding, for each link, whether to scrape it for more links, extract data from it, or ignore it.
- The set returned by WebScrape includes both links and domain names.
- The domain names contain tabular data (this depends on the website).
- The links navigate to other pages that contain data or more links (only some of which need to be analysed).
- An item is chosen for sub-scraping or extraction based on the lists of relevant substrings.
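
The decision described above can be sketched as a single function. `classify` is a hypothetical helper introduced for illustration; the two substring lists mirror the ones used in this notebook.

```python
relevant_links = ["AcResultGenJune", "AcResultByeJune"]
relevant_domain_names = ["partywisewinresult", "candidateswise"]

def classify(item):
    """Return 'scrape', 'extract', or 'ignore' for one scraped link."""
    # Pages with tabular data are extracted directly.
    if any(name in item for name in relevant_domain_names):
        return "extract"
    # Absolute links to result sections are scraped for more links.
    if item.startswith("http") and any(key in item for key in relevant_links):
        return "scrape"
    return "ignore"

for item in ["partywisewinresultState-2484.htm",
             "https://results.eci.gov.in/AcResultByeJune2024/index.htm",
             "https://apps.apple.com/in/app/voter-helpline/id1456535004"]:
    print(item, "->", classify(item))
```

The extraction check runs first so that an absolute URL pointing at a results page is extracted rather than scraped again.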

In [50]:
# Directory to store Files
import os
os.makedirs('csv_files', exist_ok=True)

In [45]:
relevant_links = ['AcResultGenJune', 'AcResultByeJune']
relevant_domain_names = ['partywisewinresult', 'candidateswise']
# filtered_links = [item for item in absolute_links if 'candidateswise' in item]

In [58]:
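# Hierarchical Extraction
# webScrape iterates over its argument, so each URL is wrapped in a list.
# Passing the bare string made it iterate character by character, which raised:
#   MissingSchema: Invalid URL 'h': No scheme supplied. Perhaps you meant https://h?
urls = set()
for item in nav_urls:
  if any(link in item for link in relevant_links):
    urls.update(webScrape([item]))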
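
One way to guard against passing a single URL string where a list of URLs is expected is to normalise the argument first. This is a minimal sketch; `as_url_list` is a hypothetical helper and is not part of the notebook.

```python
def as_url_list(value):
    """Wrap a single URL string in a list; pass other iterables through."""
    if isinstance(value, str):
        return [value]
    return list(value)

single = "https://results.eci.gov.in/AcResultByeJune2024/index.htm"
print(list(single)[:5])     # iterating a string yields single characters
print(as_url_list(single))  # one-element list holding the whole URL
```

Calling such a helper at the top of `webScrape` would let it accept either a single URL or a collection of URLs.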

In [None]:
for item in urls:
  print(item)

In [None]:
from urllib.parse import urljoin

# Relative page names are resolved against this base URL
base_url = "https://results.eci.gov.in/AcResultByeJune2024/"

for item in nav_urls:
  for domain in relevant_domain_names:
    if domain in item:
      # The website has tabulated data under 'Constituencywise', whereas 'candidateswise' presents it as cards
      # item = item.replace('candidateswise-', 'Constituencywise')
      url = urljoin(base_url, item)  # also leaves absolute URLs unchanged
      name = url.split('-')[-1].split('.')[0]
      extract_data(url, name)
      break  # avoid extracting the same page twice
  print('-------------------------------------extract_data--------------------------------------------------')
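
The CSV file name is derived from the part of the URL after the last hyphen. The sketch below shows that derivation on one page name from the output above. `target_for` is a hypothetical helper, and resolving with `urljoin` rather than string concatenation is an assumption of this sketch.

```python
from urllib.parse import urljoin

base_url = "https://results.eci.gov.in/AcResultByeJune2024/"

def target_for(item):
    """Return (absolute_url, csv_name) for a relative page name."""
    url = urljoin(base_url, item)
    # Text after the last '-' and before the first '.' that follows it
    name = url.split('-')[-1].split('.')[0]
    return url, name

url, name = target_for("partywisewinresultState-2484.htm")
print(url)
print(name)
```

This naming only works for page names that contain a hyphen. For a name such as `index.htm`, the same expression returns the start of the URL, which is not a usable file name.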

In [None]:
import shutil

directory_to_compress = '/content/csv_files'
output_zip_file = '/content/csv_files.zip'
shutil.make_archive(output_zip_file.replace('.zip', ''), 'zip', directory_to_compress)

'/content/csv_files.zip'

The CSV files are compressed into a zip archive for backup.
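
Before relying on the archive as a backup, its contents can be checked with the standard `zipfile` module. The sketch below runs in a temporary directory with a made-up CSV file, so the paths and file name are illustrative rather than the notebook's own.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the csv_files directory.
    src = os.path.join(tmp, "csv_files")
    os.makedirs(src)
    with open(os.path.join(src, "2484.csv"), "w") as f:
        f.write("Party,Won\n")

    archive = shutil.make_archive(os.path.join(tmp, "csv_files"), "zip", src)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        ok = zf.testzip() is None  # None means no corrupt member was found

print(names, ok)
```

The same two calls, `namelist()` and `testzip()`, can be pointed at `/content/csv_files.zip` to confirm that every CSV file made it into the archive.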

# Merging Relevant Data