<a href="https://colab.research.google.com/github/bryaanabraham/Election-Data-Analysis/blob/main/Election_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Dependencies

In [27]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Data Extraction
- The following function extracts the results table from the provided URL.
- The extracted rows are returned as a DataFrame labelled with a name of your choice, which can then be saved to a file.

In [28]:
def extract_data(url, name):
  response = requests.get(url, timeout=30)
  html_content = response.content

  # Parse HTML content
  soup = BeautifulSoup(html_content, 'html.parser')

  # Locate the table
  table = soup.find('table', class_='table table-striped table-bordered')

  if table is None:
      print(f"Table not found for URL: {url}")
      return None

  # Extract headers (first 7 columns only)
  headers = [header.text.strip() for header in table.find_all('th')][:7]
  # Extra column label for the name appended to every row ('Name' is an assumed label)
  headers.append('Name')
  print(f"Headers for {name}: {headers}")

  # Extract rows (fall back to the table itself when there is no <tbody>)
  rows = []
  for row in (table.find('tbody') or table).find_all('tr'):
      cells = row.find_all('td')
      if not cells:
          continue  # skip rows without data cells, e.g. header rows
      row_data = [cell.text.strip() for cell in cells[:7]]
      row_data.append(name)  # Add the name to the row
      rows.append(row_data)

  df = pd.DataFrame(rows, columns=headers)
  print('---------------------------------------------------------------------------------------------------------------------------------------------------')
  return df
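
Because extract_data appends the name to every row, each row holds one more value than the seven headers kept from the table, and pandas needs the two lengths to match. The following is a minimal sketch of keeping rows and column labels aligned, using plain lists only. `label_rows`, the `"Name"` column label, and the padding of short rows are assumptions made for illustration, not part of the notebook.

```python
def label_rows(headers, rows, name, limit=7, label_column="Name"):
    """Trim headers and rows to at most `limit` columns and add a label column to both."""
    kept = list(headers[:limit])
    n = len(kept)
    out_rows = []
    for cells in rows:
        if not cells:
            continue  # skip rows that carry no data cells
        trimmed = list(cells[:n])
        # pad short rows so every row has as many values as there are headers
        trimmed += [""] * (n - len(trimmed))
        out_rows.append(trimmed + [name])
    return kept + [label_column], out_rows


# Placeholder values only; not real election data
headers = ["col_a", "col_b", "col_c"]
rows = [["1", "2", "3"], [], ["4", "5"]]
out_headers, out_rows = label_rows(headers, rows, name="S01")
print(out_headers)
print(out_rows)
```

Every row returned this way has exactly as many values as there are column labels, so it can be handed to a DataFrame constructor without a length mismatch.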

# Web Scraping

### 1. WebScrape function
- The WebScrape function performs hierarchical scraping: it collects the links on each page so that links within links can be followed as needed.
- It returns a set of URLs.

### 2. Initial Fetch
- The page at the given URL is fetched and parsed with Beautiful Soup.
- Unique links are stored in a set named urls.

### 3. Next Page Navigations
- Data on the following pages also needs to be extracted (hierarchical).
- The URLs from the initial fetch are fed back into the WebScrape function.
- The result is stored in a set named nav_urls.
 - By analysing this set we can pick out the relevant URLs and exclude promotions and advertisements.
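
The link-collection step described above can be tried without any network access. This is a minimal standard-library sketch, not the notebook's own method: `LinkCollector` and `sample_html` are hypothetical, and `html.parser` stands in for Beautiful Soup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag into a set (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value:
                self.links.add(value)


# Hypothetical page fragment, shaped like the links printed by the notebook
sample_html = """
<a href="https://results.eci.gov.in/PcResultGenJune2024/index.htm">PC</a>
<a href="index.htm">Home</a>
<a href="index.htm">Home again</a>
<a>no href</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
links = collector.links
print(sorted(links))
```

Storing the links in a set is what removes the duplicate `index.htm` entry, and anchors without an `href` are skipped.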

In [29]:
# WebScrape Function [1]
def webScrape(links):
  unique_urls = set()
  for link in links:
    response = requests.get(link)
    html_content = response.content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract the href of every anchor tag on the page
    anchors = soup.find_all('a', href=True)
    page_urls = {anchor['href'] for anchor in anchors}
    unique_urls.update(page_urls)

    # Print this page's links, then the running total across all pages
    for page_url in page_urls:
        print(page_url)
    print(f"Total Links: {len(unique_urls)}")
  return unique_urls

In [30]:
# Initial Fetch [2]
url = ["https://results.eci.gov.in/"]
urls = webScrape(url)

https://results.eci.gov.in/AcResultGen2ndJune2024/index.htm
https://results.eci.gov.in/AcResultGenJune2024/index.htm
https://results.eci.gov.in/PcResultGenJune2024/index.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
index.htm
https://results.eci.gov.in/AcResultByeJune2024/
https://apps.apple.com/in/app/voter-helpline/id1456535004
Total Links: 7


In [31]:
# keep only absolute links (relative entries such as 'index.htm' are dropped)
urls = [item for item in urls if 'https' in item]
for link in urls:
    print(link)
print(f"Total Links: {len(urls)}")

https://results.eci.gov.in/AcResultGen2ndJune2024/index.htm
https://results.eci.gov.in/AcResultGenJune2024/index.htm
https://results.eci.gov.in/AcResultByeJune2024/
https://results.eci.gov.in/PcResultGenJune2024/index.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
https://apps.apple.com/in/app/voter-helpline/id1456535004
Total Links: 6
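
Filtering on the substring 'https' drops relative entries such as index.htm and keeps the app-store links. A possible alternative, sketched here under the assumption that only pages on results.eci.gov.in are of interest, resolves each link against the page it was found on and then filters by host. `same_site_links` is a hypothetical helper and not part of the notebook.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(page_url, hrefs, host="results.eci.gov.in"):
    """Resolve hrefs against page_url and keep those on the given host."""
    resolved = {urljoin(page_url, href) for href in hrefs}
    return {link for link in resolved if urlparse(link).netloc == host}


page = "https://results.eci.gov.in/"
hrefs = [
    "https://results.eci.gov.in/AcResultByeJune2024/",
    "https://play.google.com/store/apps/details?id=com.eci.citizen",
    "index.htm",
]
kept = same_site_links(page, hrefs)
for link in sorted(kept):
    print(link)
```

With this approach the relative `index.htm` becomes an absolute URL that `requests.get` can fetch, while the Play Store link is excluded by its host rather than by inspection.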


In [32]:
# Extracting next page navigations from urls [3]
nav_urls = webScrape(urls)

./hi/index.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
partywiseresult-S02.htm
index.htm
partywiseresult-S21.htm
#
https://apps.apple.com/in/app/voter-helpline/id1456535004
Total Links: 7
./hi/index.htm
partywiseresult-S01.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
index.htm
#
partywiseresult-S18.htm
https://apps.apple.com/in/app/voter-helpline/id1456535004
Total Links: 9
./hi/index.htm
candidateswise-S04195.htm
candidateswise-S2731.htm
candidateswise-S0842.htm
candidateswise-S0839.htm
candidateswise-S24403.htm
candidateswise-S0821.htm
#
candidateswise-S0837.htm
candidateswise-S0626.htm
candidateswise-S22233.htm
candidateswise-S0845.htm
candidateswise-S24173.htm
candidateswise-S24136.htm
candidateswise-S2971.htm
candidateswise-S20165.htm
candidateswise-S0683.htm
candidateswise-S0818.htm
https://play.google.com/store/apps/details?id=com.eci.citizen
candidateswise-S237.htm
candidateswise-S2562.htm
index.htm
candidateswise-S24292.htm
candidates

    Relevant Links:
    - AcResultGenJune
    - AcResultByeJune
    - Party Wise result
    - Candidate wise result
  
Links need further analysis to see whether they lead to more data on subsequent pages, whereas domain names (page names such as partywiseresult-S01.htm) can be extracted directly. <br>(This is specific to the website and was discovered through exploration.)

# Hierarchical Extraction

- This is the process of deciding, for each item, whether to scrape it for more links, extract its data, or ignore it.
- The set returned by WebScrape contains both links and domain names.
- The domain names contain tabular data (specific to this website).
- The links navigate to other pages that contain data or more links (only some of which need to be analysed).
- An item is chosen for sub-scraping or extraction based on the lists of relevant links and domain names (substrings).
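
The decision described in this list can be sketched as a small classifier. It is an illustration under assumptions: `classify` is a hypothetical helper, the substring lists are passed in as parameters, and checking domain names before links is a choice made here.

```python
def classify(item, relevant_links, relevant_domain_names):
    """Return 'extract', 'scrape' or 'ignore' for one scraped item."""
    if any(name in item for name in relevant_domain_names):
        return "extract"  # page expected to hold tabular data
    if any(link in item for link in relevant_links):
        return "scrape"   # page expected to hold more links
    return "ignore"       # e.g. app-store links or '#'


relevant_links = ["AcResultGenJune", "AcResultByeJune"]
relevant_domain_names = ["candidateswise"]

items = [
    "candidateswise-S04195.htm",
    "https://results.eci.gov.in/AcResultGenJune2024/index.htm",
    "https://apps.apple.com/in/app/voter-helpline/id1456535004",
    "#",
]
decisions = {item: classify(item, relevant_links, relevant_domain_names) for item in items}
for item, decision in decisions.items():
    print(decision, item)
```

Keeping the decision in one function makes it easy to inspect which items would be extracted, scraped further, or skipped before any request is sent.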

In [33]:
# Directory to store Files
import os
os.makedirs('csv_files', exist_ok=True)
location = 'csv_files'

In [34]:
relevant_links = ['AcResultGenJune', 'AcResultByeJune']
relevant_domain_names = ['partywisewinresult', 'candidateswise']
# filtered_links = [item for item in absolute_links if 'candidateswise' in item]

In [35]:
def save2csv(df, name, location):
  csv_file_path = f"{location}/{name}.csv"
  df.to_csv(csv_file_path, index=False)

In [36]:
# Hierarchical Extraction
def check_urls(urls, relevant_links, relevant_domain_names):
  discovered = set()  # links found by sub-scraping, merged once the loop is done

  # Iterate over a copy: a set must not change while it is being iterated
  for url in list(urls):
    match_found = False

    for domain in relevant_domain_names:
      if domain in url:
        match_found = True
        # e.g. 'candidateswise-S04195.htm' -> 'S04195'; taken before the url is rewritten
        name = url.split('-')[-1].split('.')[0]

        if domain == 'partywisewinresult':
          # Convert relative URLs to absolute URLs (base path assumed from the initial fetch)
          base_url = "https://results.eci.gov.in/AcResultGenJune2024/"
          url = base_url + url

        if domain == 'candidateswise':
          # the website contains tabulated data under 'Constituencywise' whereas 'candidateswise' has it in the form of cards
          url = url.replace('candidateswise-', 'Constituencywise')
          # Convert relative URLs to absolute URLs
          base_url = "https://results.eci.gov.in/AcResultByeJune2024/"
          url = base_url + url

        df = extract_data(url, name)
        if df is not None:
          # one file per page so that earlier results are not overwritten
          save2csv(df, f"{domain}_{name}", location)
        break

    if not match_found:
      for link in relevant_links:
        if link in url:
          # webScrape expects a list of URLs, so wrap the single url
          discovered.update(webScrape([url]))
          break

  urls.update(discovered)

In [37]:
check_urls(nav_urls,relevant_links,relevant_domain_names)

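MissingSchema: Invalid URL 'A': No scheme supplied. Perhaps you meant https://A?

The recorded run stopped with this error. It is raised when webScrape is given a single string such as 'AcResultGenJune' instead of a list of URLs: iterating over the string yields its characters one at a time, so the first request is made to 'A'.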

In [None]:
import shutil

directory_to_compress = '/content/csv_files'
output_zip_file = '/content/csv_files.zip'
shutil.make_archive(output_zip_file.replace('.zip', ''), 'zip', directory_to_compress)

'/content/csv_files.zip'

The csv_files directory is compressed into a zip archive for backup.
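
A self-contained sketch of the same backup step, assuming nothing about the Colab `/content` paths: it builds the archive in a temporary directory and lists its contents with `zipfile` to confirm what was stored. The file name and CSV text are placeholders.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    csv_dir = os.path.join(workdir, "csv_files")
    os.makedirs(csv_dir, exist_ok=True)

    # Placeholder file standing in for a scraped table
    with open(os.path.join(csv_dir, "example.csv"), "w", encoding="utf-8") as f:
        f.write("col_a,col_b\n1,2\n")

    # make_archive takes the archive path without its extension and returns the full path
    archive_path = shutil.make_archive(os.path.join(workdir, "csv_files"), "zip", csv_dir)

    # Read the archive back to see which files it contains
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(os.path.basename(archive_path))
print(archived_names)
```

Passing the base name without `.zip` avoids the `replace('.zip', '')` step, since `shutil.make_archive` adds the extension itself.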

# Merging Relevant Data