# Stage 1: Downloading Election Data from the Web

## Learning Goals
In this notebook, you'll learn how Python can **automate repetitive tasks** — specifically, downloading hundreds of files from a government website that would take hours to do by hand.

## What We're Doing
We need precinct-level election results from Iowa's Secretary of State website for 2016, 2018, and 2020. Each county has its own Excel file, and there are 99 counties × 3 years = **297 files** to download!

## Python Concepts You'll See
- **Importing libraries**: Python's power comes from libraries (similar to Stata's packages)
- **Variables**: Storing values like file paths and URLs
- **Loops**: Repeating actions (like downloading) multiple times
- **Functions**: Reusable blocks of code
- **Conditional statements**: Making decisions (if/else)
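
As a quick preview, here is a minimal sketch that uses all five concepts at once. The three county names and the `build_filename` function are made up for illustration, and nothing is downloaded.

```python
from pathlib import Path  # Importing a library

# Variables: store values under a name
output_folder = Path("iowa_election_results")
counties = ["Adair", "Adams", "Allamakee"]  # illustrative subset of counties


def build_filename(county: str, year: str) -> str:
    """Function: build the filename we would save for one county and year."""
    return f"{county}_{year}.xls"


# Loop: repeat the same action for each county
filenames = []
for county in counties:
    # Conditional: only keep counties whose name starts with "Ad"
    if county.startswith("Ad"):
        filenames.append(build_filename(county, "2018"))

print(filenames)
print(output_folder / filenames[0])
```

Every cell later in this notebook is built from these same pieces, just with more steps inside the loop.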

## A Note for Stata Users
| Stata | Python |
|-------|--------|
| `local myvar = "text"` | `myvar = "text"` |
| `foreach x in list { }` | `for x in list:` |
| `if condition { }` | `if condition:` |
| Comments: `* comment` or `// comment` | Comments: `# comment` |

---

## Step 1: Install Required Packages

Before we can use specialized tools, we need to install them. This is like `ssc install` in Stata.

- **requests**: Lets Python download web pages
- **beautifulsoup4**: Helps parse/read HTML (the code behind web pages)
- **openpyxl**: Reads modern Excel files (.xlsx)
- **xlrd**: Reads older Excel files (.xls)

**Run this cell once** — you don't need to re-run it every time you open the notebook.
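
If you are not sure whether the packages are already installed, a small check like the one below can tell you before you re-run the install cell. This is a sketch, not part of the original notebook; the helper name `missing_packages` is made up. Note that the name you `pip install` is not always the name you `import` (beautifulsoup4 is imported as `bs4`).

```python
from importlib.util import find_spec

# Map each pip package name to the module name used in `import` statements
packages = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "openpyxl": "openpyxl",
    "xlrd": "xlrd",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None  # None means "not installed"
    ]


missing = missing_packages(packages)
if missing:
    print("Still need to install:", " ".join(missing))
else:
    print("All packages are installed.")
```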

In [1]:
pip install requests beautifulsoup4 openpyxl xlrd

Collecting xlrd
  Downloading xlrd-2.0.2-py2.py3-none-any.whl.metadata (3.5 kB)
Downloading xlrd-2.0.2-py2.py3-none-any.whl (96 kB)
Installing collected packages: xlrd
Successfully installed xlrd-2.0.2
Note: you may need to restart the kernel to use updated packages.


## Step 2: First Attempt — Simple Web Scraping

The code below shows our **first attempt** at downloading the files.

### How the Web Scraping Logic Works

1. **Build the URL** for each year's results page (e.g., `https://sos.iowa.gov/precinct-results-county-2018-general`)

2. **Download the HTML** of that page using `requests.get(url)`

3. **Parse the HTML** with BeautifulSoup to find all links (`<a href="...">` tags)

4. **Filter to find only the Excel files we want:**
   ```python
   is_excel = href.lower().endswith('.xls') or href.lower().endswith('.xlsx')
   is_correct_year = f"{year}general" in href.lower()
   
   if is_excel and is_correct_year:
       excel_links.append(href)
   ```
   This keeps ONLY links that:
   - End with `.xls` or `.xlsx` (it's an Excel file)
   - AND contain the year + "general" in the URL (it's for this election)
   
   So if the page has other Excel files (like templates or old data), they get filtered out.

5. **Download each matching file** and save it with a renamed filename
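
To see steps 4 and 5 in isolation, the sketch below applies the same filter and the same renaming rule to a handful of links without touching the network. The `example.com` links are invented for illustration, and the two helper functions are not part of the notebook's code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_wanted_link(href: str, year: str) -> bool:
    """Step 4: keep only Excel links that belong to this year's general election."""
    lowered = href.lower()
    is_excel = lowered.endswith((".xls", ".xlsx"))
    is_correct_year = f"{year}general" in lowered
    return is_excel and is_correct_year


def renamed(href: str, year: str) -> str:
    """Step 5: turn ".../Adair.xls" into "Adair_2018.xls"."""
    original = PurePosixPath(urlparse(href).path)
    return f"{original.stem}_{year}{original.suffix}"


# Hypothetical links, invented for illustration
sample_hrefs = [
    "https://example.com/files/2018general/Adair.xls",   # keep
    "https://example.com/files/2016general/Adair.xls",   # wrong year
    "https://example.com/files/2018general/notes.pdf",   # not Excel
    "https://example.com/files/template.xlsx",           # no year in the URL
]

wanted = [href for href in sample_hrefs if is_wanted_link(href, "2018")]
print([renamed(href, "2018") for href in wanted])  # ['Adair_2018.xls']
```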

### Understanding the Error Output

When you run this, you'll see:
```
ERROR: Could not load page (status 403)
```

**What does 403 mean?** HTTP status codes tell you what happened:
- `200` = Success! 
- `403` = "Forbidden" — the server refused your request
- `404` = "Not Found" — page doesn't exist

A `403` means the server understood the request but refused to serve it. On a public page like this one, that usually means the website detected that you're a script (not a human in a browser) and blocked you. This is called **bot protection**.

**How do we know to try "pretending to be a browser"?** When you get a 403 from a site that clearly exists and works in your browser, the most common cause is missing browser headers. The website checks "who's asking?" and if it doesn't look like a real browser, it says no.
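
If you run into a status code you don't recognize, Python's standard library can look up its official name. The helper `explain_status` below is a made-up convenience function, shown only as a sketch.

```python
from http import HTTPStatus


def explain_status(code: int) -> str:
    """Return the code together with its standard name, e.g. '403 Forbidden'."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus only knows the registered codes
        return f"{code} (not a standard HTTP status code)"
    return f"{code} {status.phrase}"


for code in (200, 403, 404):
    print(explain_status(code))
```

Codes in the 200s mean success, the 400s mean something was wrong with the request (or the server didn't like who was asking), and the 500s mean the server itself had a problem.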

In [2]:
#!/usr/bin/env python3
"""
Download Iowa Secretary of State precinct election results.

This script demonstrates how programming can automate repetitive tasks
like downloading hundreds of files from a website.
"""

import requests
from bs4 import BeautifulSoup
from pathlib import Path
import time

# =============================================================================
# CONFIGURATION - Edit these values as needed
# =============================================================================

years = ["2016", "2018", "2020"]
output_folder = Path("./iowa_election_results")

# =============================================================================
# SETUP - Create the output folder if it doesn't exist
# =============================================================================

output_folder.mkdir(exist_ok=True)
print(f"Files will be saved to: {output_folder.absolute()}\n")

# =============================================================================
# MAIN LOOP - Go through each year and download all Excel files
# =============================================================================

for year in years:
    
    # Build the URL for this year's page
    url = f"https://sos.iowa.gov/precinct-results-county-{year}-general"
    print(f"Processing {year}...")
    print(f"  URL: {url}")
    
    # Download the webpage
    response = requests.get(url)
    
    # Check if the page loaded successfully
    if response.status_code != 200:
        print(f"  ERROR: Could not load page (status {response.status_code})")
        continue  # skip to next year
    
    # Parse the HTML to find all links
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all links that end with .xls or .xlsx AND are for this year
    excel_links = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        is_excel = href.lower().endswith('.xls') or href.lower().endswith('.xlsx')
        is_correct_year = f"{year}general" in href.lower()
        
        if is_excel and is_correct_year:
            excel_links.append(href)
    
    print(f"  Found {len(excel_links)} Excel files")
    
    # Download each Excel file
    for i, href in enumerate(excel_links, start=1):
        
        # This assumes the href is already a full (absolute) URL;
        # a relative href would need urllib.parse.urljoin(url, href)
        file_url = href
        
        # Extract the original filename from the URL
        original_filename = Path(href).name  # e.g., "Adair.xls"
        
        # Create new filename with year: "Adair.xls" -> "Adair_2018.xls"
        name_part = Path(original_filename).stem      # "Adair"
        extension = Path(original_filename).suffix    # ".xls"
        new_filename = f"{name_part}_{year}{extension}"  # "Adair_2018.xls"
        
        # Full path where we'll save the file
        save_path = output_folder / new_filename
        
        # Download the file
        print(f"  [{i}/{len(excel_links)}] Downloading {original_filename} -> {new_filename}")
        
        file_response = requests.get(file_url)
        
        if file_response.status_code == 200:
            # Save the file
            with open(save_path, 'wb') as f:
                f.write(file_response.content)
        else:
            print(f"    ERROR: Could not download (status {file_response.status_code})")
        
        # Small pause to be polite to the server
        time.sleep(0.25)
    
    print(f"  Done with {year}\n")

# =============================================================================
# SUMMARY
# =============================================================================

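# Count the Excel files in the output folder. This includes files left over
# from earlier runs, so it can be nonzero even when every page above failed.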
downloaded_files = list(output_folder.glob("*.xls*"))
print("=" * 50)
print(f"COMPLETE! Downloaded {len(downloaded_files)} files.")
print(f"Location: {output_folder.absolute()}")


Files will be saved to: C:\Users\las02013\OneDrive - University of Connecticut\Data science major\Capstone\S26\dash_test\iowa_election_results

Processing 2016...
  URL: https://sos.iowa.gov/precinct-results-county-2016-general
  ERROR: Could not load page (status 403)
Processing 2018...
  URL: https://sos.iowa.gov/precinct-results-county-2018-general
  ERROR: Could not load page (status 403)
Processing 2020...
  URL: https://sos.iowa.gov/precinct-results-county-2020-general
  ERROR: Could not load page (status 403)
COMPLETE! Downloaded 304 files.
Location: C:\Users\las02013\OneDrive - University of Connecticut\Data science major\Capstone\S26\dash_test\iowa_election_results


## Step 3: Second Attempt — Pretending to Be a Browser

Since we got a 403 "Forbidden" error, let's try making our script look more like a real browser.

### What Are HTTP Headers?

When your browser visits a website, it sends extra information called **headers**. These tell the server:
- What browser you're using (Chrome, Firefox, Safari)
- What operating system you're on (Windows, Mac)
- What languages you prefer
- What types of files you can accept

The **User-Agent** header is the most important one — it's like your browser's ID card.
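
You can see what headers look like without sending anything. The sketch below uses the standard library's `urllib.request.Request` (rather than `requests`) purely to build a request object and print its headers. The header values are shortened examples, and `example.com` is a placeholder URL. For comparison, `requests` by default announces itself with a User-Agent that starts with `python-requests/`, which makes a script easy to spot.

```python
from urllib.request import Request

# Shortened, illustrative header values
browser_like_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.5",
}

# Build the request object only; nothing is sent over the network here
request = Request("https://example.com/", headers=browser_like_headers)

for name, value in request.header_items():
    print(f"{name}: {value}")
```

`urllib` normalizes the capitalization of header names (for example `User-agent`), which is harmless because HTTP header names are case-insensitive.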

### Where Do These Fake Headers Come From?

**Option 1: Copy from your own browser**
1. Open Chrome → Press F12 (Developer Tools)
2. Go to the "Network" tab
3. Visit any website
4. Click on a request → look at "Request Headers"
5. Copy the User-Agent string

**Option 2: Search online**
- Google "chrome user agent string" to find current ones
- Websites like [useragentstring.com](https://useragentstring.com) list common ones

**Option 3: Use a library**
- The `fake-useragent` Python library generates realistic User-Agent strings

### What the Code Does Differently

```python
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',  # Pretend to be Chrome
    'Accept': 'text/html,application/xhtml+xml...',  # What file types we accept
    'Accept-Language': 'en-US,en;q=0.5',  # Language preference
}
session.headers.update(HEADERS)
```

The code also uses a **session** (`requests.Session()`) which maintains cookies across requests — just like a real browser remembers your login.

### Result

**Still fails!** This website has particularly strong bot protection that checks more than just headers. We need a different approach — controlling a real browser.

In [3]:
#!/usr/bin/env python3
"""
Download Iowa Secretary of State precinct election results (Excel files)
for multiple election years, renaming files with the year suffix.
"""

import os
import re
import requests
from pathlib import Path
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, unquote
import time

# For .xlsx files
try:
    import openpyxl
    HAS_OPENPYXL = True
except ImportError:
    HAS_OPENPYXL = False

# For older .xls files
try:
    import xlrd
    HAS_XLRD = True
except ImportError:
    HAS_XLRD = False


# Headers to mimic a real browser (an attempt to avoid 403 errors)
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

# Create a session to maintain cookies across requests
session = requests.Session()
session.headers.update(HEADERS)


def get_excel_links(page_url: str) -> list[tuple[str, str]]:
    """
    Scrape a page and find all links to Excel files.
    
    Args:
        page_url: URL of the page to scrape
    
    Returns:
        List of tuples: (filename, full_url)
    """
    print(f"Scanning: {page_url}")
    
    try:
        # First visit the main site to get cookies
        session.get("https://sos.iowa.gov/", timeout=30)
        time.sleep(1)
        
        # Now visit the actual page
        response = session.get(page_url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"  Error fetching page: {e}")
        return []
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    excel_links = []
    excel_extensions = ('.xls', '.xlsx', '.xlsm')
    
    for link in soup.find_all('a', href=True):
        href = link['href']
        
        # Check if it's an Excel file
        if any(href.lower().endswith(ext) for ext in excel_extensions):
            full_url = urljoin(page_url, href)
            
            # Extract filename from URL
            parsed = urlparse(href)
            filename = unquote(Path(parsed.path).name)
            
            excel_links.append((filename, full_url))
    
    print(f"  Found {len(excel_links)} Excel file(s)")
    return excel_links


def download_file(url: str, local_path: Path, referer: str | None = None) -> bool:
    """
    Download a file from URL to local path.
    
    Args:
        url: URL to download
        local_path: Local path to save file
        referer: Referer URL to include in headers
    
    Returns:
        True if successful, False otherwise
    """
    try:
        headers = {}
        if referer:
            headers['Referer'] = referer
        
        response = session.get(url, headers=headers, timeout=60)
        response.raise_for_status()
        
        with open(local_path, 'wb') as f:
            f.write(response.content)
        
        return True
        
    except requests.RequestException as e:
        print(f"    Error: {e}")
        return False


def download_year_files(base_url: str, year: str, output_dir: Path) -> list[Path]:
    """
    Download all Excel files for a given election year.
    
    Args:
        base_url: Base URL prefix; "-{year}-general" is appended to it
        year: Election year (e.g., "2018")
        output_dir: Directory to save files
    
    Returns:
        List of downloaded file paths
    """
    # Construct the URL for this year
    page_url = f"{base_url}-{year}-general"
    
    # Get all Excel links from the page
    excel_links = get_excel_links(page_url)
    
    if not excel_links:
        print(f"  No Excel files found for {year}")
        return []
    
    downloaded = []
    
    for i, (original_filename, file_url) in enumerate(excel_links, 1):
        # Extract county name (remove extension)
        name_without_ext = Path(original_filename).stem
        extension = Path(original_filename).suffix
        
        # Create new filename: county_year.xls
        new_filename = f"{name_without_ext}_{year}{extension}"
        local_path = output_dir / new_filename
        
        print(f"  [{i}/{len(excel_links)}] {original_filename} -> {new_filename}")
        
        # Pass the page URL as referer
        if download_file(file_url, local_path, referer=page_url):
            downloaded.append(local_path)
        
        # Small delay to be polite to the server
        time.sleep(0.5)
    
    return downloaded


def find_toc_sheet_xlrd(workbook) -> int | None:
    """
    Find the 'Table of Contents' sheet in an xlrd workbook (case-insensitive).
    Returns sheet index or None.
    """
    target = "table of contents"
    
    for idx, name in enumerate(workbook.sheet_names()):
        if name.lower() == target:
            return idx
    
    return None


def find_toc_sheet_openpyxl(workbook) -> str | None:
    """
    Find the 'Table of Contents' sheet in an openpyxl workbook (case-insensitive).
    """
    target = "table of contents"
    
    for sheet_name in workbook.sheetnames:
        if sheet_name.lower() == target:
            return sheet_name
    
    return None


def print_toc_sheet(file_path: Path) -> None:
    """
    Print the contents of the 'Table of Contents' sheet from an Excel file.
    Supports both .xls (xlrd) and .xlsx (openpyxl) formats.
    """
    print(f"\n{'='*60}")
    print(f"File: {file_path.name}")
    print('='*60)
    
    suffix = file_path.suffix.lower()
    
    # Handle older .xls files with xlrd
    if suffix == '.xls':
        if not HAS_XLRD:
            print("  Error: xlrd package not installed. Run: pip install xlrd")
            return
        
        try:
            workbook = xlrd.open_workbook(file_path)
            
            toc_idx = find_toc_sheet_xlrd(workbook)
            
            if toc_idx is None:
                print(f"  No 'Table of Contents' sheet found.")
                print(f"  Available sheets: {workbook.sheet_names()}")
                return
            
            sheet = workbook.sheet_by_index(toc_idx)
            print(f"  Found sheet: '{sheet.name}'")
            print("-" * 40)
            
            for row_idx in range(sheet.nrows):
                row = sheet.row_values(row_idx)
                if any(cell for cell in row):
                    formatted_cells = [str(cell) if cell else "" for cell in row]
                    print("  |  ".join(formatted_cells))
                    
        except Exception as e:
            print(f"  Error reading .xls file: {e}")
    
    # Handle .xlsx files with openpyxl
    elif suffix in ('.xlsx', '.xlsm'):
        if not HAS_OPENPYXL:
            print("  Error: openpyxl package not installed. Run: pip install openpyxl")
            return
        
        try:
            workbook = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
            
            toc_sheet_name = find_toc_sheet_openpyxl(workbook)
            
            if toc_sheet_name is None:
                print(f"  No 'Table of Contents' sheet found.")
                print(f"  Available sheets: {workbook.sheetnames}")
                workbook.close()
                return
            
            print(f"  Found sheet: '{toc_sheet_name}'")
            print("-" * 40)
            
            sheet = workbook[toc_sheet_name]
            
            for row in sheet.iter_rows(values_only=True):
                if any(cell is not None for cell in row):
                    formatted_cells = [str(cell) if cell is not None else "" for cell in row]
                    print("  |  ".join(formatted_cells))
            
            workbook.close()
            
        except Exception as e:
            print(f"  Error reading .xlsx file: {e}")
    
    else:
        print(f"  Unsupported file format: {suffix}")


def main():
    """Main entry point."""
    
    # ==========================================================================
    # CONFIGURATION
    # ==========================================================================
    
    BASE_URL = "https://sos.iowa.gov/precinct-results-county"
    YEARS = ["2016", "2018", "2020"]
    OUTPUT_DIRECTORY = "./iowa_election_results"
    
    # Set to True to also print Table of Contents from each file
    PRINT_TOC = False
    
    # ==========================================================================
    
    output_path = Path(OUTPUT_DIRECTORY)
    output_path.mkdir(parents=True, exist_ok=True)
    
    all_downloaded = []
    
    for year in YEARS:
        print(f"\n{'#'*60}")
        print(f"# Processing year: {year}")
        print('#'*60)
        
        downloaded = download_year_files(BASE_URL, year, output_path)
        all_downloaded.extend(downloaded)
        
        print(f"\nDownloaded {len(downloaded)} files for {year}")
    
    print(f"\n{'='*60}")
    print(f"SUMMARY: Downloaded {len(all_downloaded)} total files")
    print(f"Location: {output_path.absolute()}")
    print('='*60)
    
    # Optionally print Table of Contents from each file
    if PRINT_TOC and all_downloaded:
        print("\n\nPrinting Table of Contents sheets...")
        for file_path in all_downloaded:
            print_toc_sheet(file_path)


if __name__ == "__main__":
    main()


############################################################
# Processing year: 2016
############################################################
Scanning: https://sos.iowa.gov/precinct-results-county-2016-general
  Error fetching page: 403 Client Error: Forbidden for url: https://sos.iowa.gov/precinct-results-county-2016-general
  No Excel files found for 2016

Downloaded 0 files for 2016

############################################################
# Processing year: 2018
############################################################
Scanning: https://sos.iowa.gov/precinct-results-county-2018-general
  Error fetching page: 403 Client Error: Forbidden for url: https://sos.iowa.gov/precinct-results-county-2018-general
  No Excel files found for 2018

Downloaded 0 files for 2018

############################################################
# Processing year: 2020
############################################################
Scanning: https://sos.iowa.gov/precinct-results-county-2020-gener

## Step 4: The Solution — Using Selenium (Browser Automation)

Since the website blocks Python scripts, we need a different approach: **control a real web browser!**

**Selenium** is a tool that lets Python control Chrome (or Firefox). It opens a real browser window and loads pages the way a person's browser would. Because the requests come from an actual browser, the website has a much harder time telling them apart from a human visitor.

### Why This Works When Headers Don't

Websites use various tricks to detect bots:
- Checking headers (we tried faking these)
- Running JavaScript to detect automation
- Checking mouse movements and timing
- Looking at browser fingerprints

Selenium bypasses most of these because it's running a real Chrome browser — all the JavaScript runs normally, the fingerprint looks real, etc.

### New Packages Needed

- **selenium**: Controls the browser
- **webdriver-manager**: Automatically downloads the correct Chrome driver

### ⚠️ Oops! We Forgot to Install

Look at the next code cell — it tries to use Selenium but fails with `ModuleNotFoundError: No module named 'selenium'`. This happens when you try to `import` a package that isn't installed.

**The fix**: Run the `pip install` cell that comes after the error.
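
If you'd rather get a friendly reminder than a traceback, you can catch the error yourself. This is a minimal sketch; `try_import` is a made-up helper and the module name is deliberately one that does not exist.

```python
import importlib


def try_import(module_name: str):
    """Import a module by name, or print an install hint and return None."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        print(f"{error}. Install the package with pip, then re-run this cell.")
        return None


# Hypothetical module name, chosen so the import fails
module = try_import("package_that_is_not_installed")
print("Import worked" if module else "Import failed")
```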

In [4]:
#!/usr/bin/env python3
"""
Download Iowa Secretary of State precinct election results (Excel files)
using Selenium to control a real browser (bypasses bot protection).

Requirements:
    pip install selenium webdriver-manager openpyxl xlrd
"""

import os
import time
from pathlib import Path
from urllib.parse import urljoin, urlparse, unquote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager


def setup_driver(download_dir: str) -> webdriver.Chrome:
    """
    Set up Chrome WebDriver with download directory configured.
    
    Args:
        download_dir: Directory where files will be downloaded
    
    Returns:
        Configured Chrome WebDriver
    """
    chrome_options = Options()
    
    # Configure download behavior
    prefs = {
        "download.default_directory": str(Path(download_dir).absolute()),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)
    
    # Optional: run headless (no visible browser window)
    # chrome_options.add_argument("--headless")
    
    # Avoid detection
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    # Initialize the driver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
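    # Further avoid detection (this only changes the page that is currently
    # loaded; it does not carry over to pages opened later with driver.get)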
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    
    return driver


def get_excel_links(driver: webdriver.Chrome, page_url: str) -> list[tuple[str, str]]:
    """
    Navigate to a page and find all Excel file links.
    
    Args:
        driver: Selenium WebDriver
        page_url: URL to scrape
    
    Returns:
        List of tuples: (filename, full_url)
    """
    print(f"Navigating to: {page_url}")
    driver.get(page_url)
    
    # Wait for page to load
    time.sleep(3)
    
    # Find all links
    links = driver.find_elements(By.TAG_NAME, "a")
    
    excel_links = []
    excel_extensions = ('.xls', '.xlsx', '.xlsm')
    
    for link in links:
        try:
            href = link.get_attribute("href")
            if href and any(href.lower().endswith(ext) for ext in excel_extensions):
                # Extract filename from URL
                parsed = urlparse(href)
                filename = unquote(Path(parsed.path).name)
                excel_links.append((filename, href))
        except Exception:
            # Skip links that went stale or cannot be read
            continue
    
    print(f"  Found {len(excel_links)} Excel file(s)")
    return excel_links


def download_files_for_year(driver: webdriver.Chrome, base_url: str, year: str, 
                            output_dir: Path, temp_dir: Path) -> list[Path]:
    """
    Download all Excel files for a given election year.
    
    Args:
        driver: Selenium WebDriver
        base_url: Base URL pattern
        year: Election year
        output_dir: Final output directory (with renamed files)
        temp_dir: Temporary download directory
    
    Returns:
        List of downloaded file paths
    """
    page_url = f"{base_url}-{year}-general"
    
    # Get all Excel links
    excel_links = get_excel_links(driver, page_url)
    
    if not excel_links:
        print(f"  No Excel files found for {year}")
        return []
    
    downloaded = []
    
    for i, (original_filename, file_url) in enumerate(excel_links, 1):
        print(f"  [{i}/{len(excel_links)}] Downloading: {original_filename}")
        
        # Clear temp directory of any existing files with this name
        temp_file = temp_dir / original_filename
        if temp_file.exists():
            temp_file.unlink()
        
        # Click the link to download (or navigate directly)
        driver.get(file_url)
        
        # Wait for download to complete
        timeout = 30
        start_time = time.time()
        while not temp_file.exists():
            if time.time() - start_time > timeout:
                print(f"    Timeout waiting for download: {original_filename}")
                break
            time.sleep(0.5)
        
        # Also wait for .crdownload to disappear (Chrome's partial download)
        crdownload = temp_dir / f"{original_filename}.crdownload"
        while crdownload.exists():
            if time.time() - start_time > timeout:
                break
            time.sleep(0.5)
        
        if temp_file.exists():
            # Rename and move to output directory
            name_without_ext = Path(original_filename).stem
            extension = Path(original_filename).suffix
            new_filename = f"{name_without_ext}_{year}{extension}"
            final_path = output_dir / new_filename
            
            # Move/rename the file
            temp_file.rename(final_path)
            downloaded.append(final_path)
            print(f"    Saved as: {new_filename}")
        
        time.sleep(1)  # Small delay between downloads
    
    return downloaded


def main():
    """Main entry point."""
    
    # ==========================================================================
    # CONFIGURATION
    # ==========================================================================
    
    BASE_URL = "https://sos.iowa.gov/precinct-results-county"
    YEARS = ["2016", "2018", "2020"]
    OUTPUT_DIRECTORY = "./iowa_election_results"
    
    # ==========================================================================
    
    output_path = Path(OUTPUT_DIRECTORY)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Download into a temporary subfolder of the output directory
    # (Chrome saves files there first; we then rename and move them)
    temp_download_path = output_path / "_temp_downloads"
    temp_download_path.mkdir(parents=True, exist_ok=True)
    
    print("Setting up browser...")
    driver = setup_driver(str(temp_download_path))
    
    all_downloaded = []
    
    try:
        for year in YEARS:
            print(f"\n{'#'*60}")
            print(f"# Processing year: {year}")
            print('#'*60)
            
            downloaded = download_files_for_year(
                driver, BASE_URL, year, output_path, temp_download_path
            )
            all_downloaded.extend(downloaded)
            
            print(f"\nDownloaded {len(downloaded)} files for {year}")
        
    finally:
        print("\nClosing browser...")
        driver.quit()
        
        # Clean up temp directory
        if temp_download_path.exists():
            for f in temp_download_path.iterdir():
                f.unlink()
            temp_download_path.rmdir()
    
    print(f"\n{'='*60}")
    print(f"SUMMARY: Downloaded {len(all_downloaded)} total files")
    print(f"Location: {output_path.absolute()}")
    print('='*60)


if __name__ == "__main__":
    main()

ModuleNotFoundError: No module named 'selenium'

In [3]:
pip install selenium webdriver-manager openpyxl xlrd


Collecting selenium
  Downloading selenium-4.40.0-py3-none-any.whl.metadata (7.7 kB)
Collecting webdriver-manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Collecting trio<1.0,>=0.31.0 (from selenium)
  Downloading trio-0.32.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket<1.0,>=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting trio-typing>=0.10.0 (from selenium)
  Downloading trio_typing-0.10.0-py3-none-any.whl.metadata (10 kB)
Collecting types-certifi>=2021.10.8.3 (from selenium)
  Downloading types_certifi-2021.10.8.3-py3-none-any.whl.metadata (1.4 kB)
Collecting types-urllib3>=1.26.25.14 (from selenium)
  Downloading types_urllib3-1.26.25.14-py3-none-any.whl.metadata (1.7 kB)
Collecting typing_extensions<5.0,>=4.15.0 (from selenium)
  Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Collecting urllib3<3.0,>=2.6.3 (from urllib3[socks]<3.0,>=2.6.3->selenium)
 

## Step 5: First Selenium Attempt (and Debugging)

Now with Selenium installed, let's try again. The code below:
1. Opens a Chrome browser window
2. Navigates to the Iowa SOS website
3. Finds all Excel file links
4. Downloads each file

### What You'll See

- A Chrome window will pop up (don't close it!)
- The browser will automatically navigate to pages
- Files will download to your `iowa_election_results` folder

### Key Functions

| Function | Purpose |
|----------|---------|
| `setup_driver()` | Configures Chrome for automated downloading |
| `get_excel_links()` | Finds all Excel file links on a page |
| `download_files_for_year()` | Downloads all files for one election year |
| `main()` | Orchestrates the whole process |

---

## ⚠️ Debugging: What Went Wrong

The next few cells show the **debugging process** — fixing errors as they appeared.

### Error 1: Typo in the Code

Look carefully at one of the cells — there's a typo where `def main():` was typed as `ef main():` (missing the "d"). Python shows:
```
SyntaxError: invalid syntax
```
The fix was simply adding the missing "d".
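
You can reproduce this error safely by asking Python to check a piece of code without running it. The sketch below does that with the built-in `compile()`; `check_syntax` is a made-up helper name.

```python
broken_cell = "ef main():\n    pass\n"   # the typo: missing "d"
fixed_cell = "def main():\n    pass\n"


def check_syntax(source: str) -> str:
    """Compile the source without running it and report any syntax error."""
    try:
        compile(source, "<cell>", "exec")
    except SyntaxError as error:
        return f"SyntaxError on line {error.lineno}: {error.msg}"
    return "OK"


print(check_syntax(broken_cell))
print(check_syntax(fixed_cell))
```

The line number in the message is usually the fastest way to find a typo like this one.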

### Error 2: Downloads Not Completing

The first version had a simple download wait that didn't work well:
- It assumed the filename would stay the same
- It didn't handle Chrome's `.crdownload` partial files properly

### What Was Added to Fix Downloads

The working version adds two new functions:

**`wait_for_download()`** — Smarter download detection:
```python
# Wait until we find an Excel file AND no partial downloads exist
excel_files = [f for f in all_files if f.suffix.lower() in ('.xls', '.xlsx')]
has_partial = any(f.suffix == '.crdownload' for f in all_files)

if excel_files and not has_partial:
    return excel_files[0]  # Download complete!
```

**`clear_temp_dir()`** — Cleans up between downloads so we know which file is new
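
To make the idea concrete, here is one way these two functions could be written in full. Treat it as a sketch under assumptions, not necessarily the exact code used later in this notebook: the parameter names `timeout` and `poll`, and the choice to return `None` on timeout, are my own.

```python
import time
from pathlib import Path
from typing import Optional

EXCEL_SUFFIXES = (".xls", ".xlsx")


def clear_temp_dir(temp_dir: Path) -> None:
    """Delete every file in the temp folder so the next download is the only one."""
    for item in temp_dir.iterdir():
        if item.is_file():
            item.unlink()


def wait_for_download(temp_dir: Path, timeout: float = 30.0,
                      poll: float = 0.5) -> Optional[Path]:
    """Wait until an Excel file exists and no partial download remains.

    Returns the path of the finished file, or None if the timeout passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        all_files = [f for f in temp_dir.iterdir() if f.is_file()]
        excel_files = [f for f in all_files if f.suffix.lower() in EXCEL_SUFFIXES]
        has_partial = any(f.suffix == ".crdownload" for f in all_files)

        if excel_files and not has_partial:
            return excel_files[0]  # Download complete!
        time.sleep(poll)  # Check again shortly
    return None
```

The intended pattern is to call `clear_temp_dir(temp_dir)` before each download and `wait_for_download(temp_dir)` right after starting it. Because the folder starts empty, whatever Excel file shows up must be the new one, no matter what Chrome named it.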

In [None]:
#!/usr/bin/env python3
"""
Download Iowa Secretary of State precinct election results (Excel files)
using Selenium to control a real browser (bypasses bot protection).

Requirements:
    pip install selenium webdriver-manager openpyxl xlrd
"""

import os
import time
from pathlib import Path
from urllib.parse import urljoin, urlparse, unquote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager


def setup_driver(download_dir: str) -> webdriver.Chrome:
    """
    Set up Chrome WebDriver with download directory configured.
    
    Args:
        download_dir: Directory where files will be downloaded
    
    Returns:
        Configured Chrome WebDriver
    """
    chrome_options = Options()
    
    # Configure download behavior
    prefs = {
        "download.default_directory": str(Path(download_dir).absolute()),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)
    
    # Optional: run headless (no visible browser window)
    # chrome_options.add_argument("--headless")
    
    # Avoid detection
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    # Initialize the driver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    # Further avoid detection
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    
    return driver


def get_excel_links(driver: webdriver.Chrome, page_url: str) -> list[tuple[str, str]]:
    """
    Navigate to a page and find all Excel file links.
    
    Args:
        driver: Selenium WebDriver
        page_url: URL to scrape
    
    Returns:
        List of tuples: (filename, full_url)
    """
    print(f"Navigating to: {page_url}")
    driver.get(page_url)
    
    # Wait for page to load
    time.sleep(3)
    
    # Find all links
    links = driver.find_elements(By.TAG_NAME, "a")
    
    excel_links = []
    excel_extensions = ('.xls', '.xlsx', '.xlsm')
    
    for link in links:
        try:
            href = link.get_attribute("href")
            if href and any(href.lower().endswith(ext) for ext in excel_extensions):
                # Extract filename from URL
                parsed = urlparse(href)
                filename = unquote(Path(parsed.path).name)
                excel_links.append((filename, href))
        except:
            continue
    
    print(f"  Found {len(excel_links)} Excel file(s)")
    return excel_links


def download_files_for_year(driver: webdriver.Chrome, base_url: str, year: str, 
                            output_dir: Path, temp_dir: Path) -> list[Path]:
    """
    Download all Excel files for a given election year.
    
    Args:
        driver: Selenium WebDriver
        base_url: Base URL pattern
        year: Election year
        output_dir: Final output directory (with renamed files)
        temp_dir: Temporary download directory
    
    Returns:
        List of downloaded file paths
    """
    page_url = f"{base_url}-{year}-general"
    
    # Get all Excel links
    excel_links = get_excel_links(driver, page_url)
    
    if not excel_links:
        print(f"  No Excel files found for {year}")
        return []
    
    downloaded = []
    
    for i, (original_filename, file_url) in enumerate(excel_links, 1):
        print(f"  [{i}/{len(excel_links)}] Downloading: {original_filename}")
        
        # Clear temp directory of any existing files with this name
        temp_file = temp_dir / original_filename
        if temp_file.exists():
            temp_file.unlink()
        
        # Click the link to download (or navigate directly)
        driver.get(file_url)
        
        # Wait for download to complete
        timeout = 30
        start_time = time.time()
        while not temp_file.exists():
            if time.time() - start_time > timeout:
                print(f"    Timeout waiting for download: {original_filename}")
                break
            time.sleep(0.5)
        
        # Also wait for .crdownload to disappear (Chrome's partial download)
        crdownload = temp_dir / f"{original_filename}.crdownload"
        while crdownload.exists():
            if time.time() - start_time > timeout:
                break
            time.sleep(0.5)
        
        if temp_file.exists():
            # Rename and move to output directory
            name_without_ext = Path(original_filename).stem
            extension = Path(original_filename).suffix
            new_filename = f"{name_without_ext}_{year}{extension}"
            final_path = output_dir / new_filename
            
            # Move/rename the file
            temp_file.rename(final_path)
            downloaded.append(final_path)
            print(f"    Saved as: {new_filename}")
        
        time.sleep(1)  # Small delay between downloads
    
    return downloaded


def main():
    """Main entry point."""
    
    # ==========================================================================
    # CONFIGURATION
    # ==========================================================================
    
    BASE_URL = "https://sos.iowa.gov/precinct-results-county"
    YEARS = ["2016", "2018", "2020"]
    OUTPUT_DIRECTORY = "./iowa_election_results"
    
    # ==========================================================================
    
    output_path = Path(OUTPUT_DIRECTORY)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Download into a temporary subfolder of the output directory
    # (Chrome saves each file here; we then rename it and move it up)
    temp_download_path = output_path / "_temp_downloads"
    temp_download_path.mkdir(parents=True, exist_ok=True)
    
    print("Setting up browser...")
    driver = setup_driver(str(temp_download_path))
    
    all_downloaded = []
    
    try:
        for year in YEARS:
            print(f"\n{'#'*60}")
            print(f"# Processing year: {year}")
            print('#'*60)
            
            downloaded = download_files_for_year(
                driver, BASE_URL, year, output_path, temp_download_path
            )
            all_downloaded.extend(downloaded)
            
            print(f"\nDownloaded {len(downloaded)} files for {year}")
        
    finally:
        print("\nClosing browser...")
        driver.quit()
        
        # Clean up temp directory
        if temp_download_path.exists():
            for f in temp_download_path.iterdir():
                f.unlink()
            temp_download_path.rmdir()
    
    print(f"\n{'='*60}")
    print(f"SUMMARY: Downloaded {len(all_downloaded)} total files")
    print(f"Location: {output_path.absolute()}")
    print('='*60)


if __name__ == "__main__":
    main()

In [None]:
#!/usr/bin/env python3
"""
Download Iowa Secretary of State precinct election results (Excel files)
using Selenium to control a real browser (bypasses bot protection).

Requirements:
    pip install selenium webdriver-manager openpyxl xlrd
"""

import os
import time
from pathlib import Path
from urllib.parse import urljoin, urlparse, unquote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager


def setup_driver(download_dir: str) -> webdriver.Chrome:
    """
    Set up Chrome WebDriver with download directory configured.
    
    Args:
        download_dir: Directory where files will be downloaded
    
    Returns:
        Configured Chrome WebDriver
    """
    chrome_options = Options()
    
    # Configure download behavior
    prefs = {
        "download.default_directory": str(Path(download_dir).absolute()),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)
    
    # Optional: run headless (no visible browser window)
    # chrome_options.add_argument("--headless")
    
    # Avoid detection
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    # Initialize the driver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    # Further avoid detection
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    
    return driver


def get_excel_links(driver: webdriver.Chrome, page_url: str) -> list[tuple[str, str]]:
    """
    Navigate to a page and find all Excel file links.
    
    Args:
        driver: Selenium WebDriver
        page_url: URL to scrape
    
    Returns:
        List of tuples: (filename, full_url)
    """
    print(f"Navigating to: {page_url}")
    driver.get(page_url)
    
    # Wait for page to load
    time.sleep(3)
    
    # Find all links
    links = driver.find_elements(By.TAG_NAME, "a")
    
    excel_links = []
    excel_extensions = ('.xls', '.xlsx', '.xlsm')
    
    for link in links:
        try:
            href = link.get_attribute("href")
            if href and any(href.lower().endswith(ext) for ext in excel_extensions):
                # Extract filename from URL
                parsed = urlparse(href)
                filename = unquote(Path(parsed.path).name)
                excel_links.append((filename, href))
        except:
            continue
    
    print(f"  Found {len(excel_links)} Excel file(s)")
    return excel_links


def wait_for_download(temp_dir: Path, timeout: int = 30) -> Path | None:
    """
    Wait for a download to complete in the temp directory.
    Returns the path to the downloaded file, or None if timeout.
    """
    start_time = time.time()
    
    while time.time() - start_time < timeout:
        # Look for any new files (excluding .crdownload partial files)
        files = [f for f in temp_dir.iterdir() 
                 if f.is_file() and not f.suffix == '.crdownload']
        
        if files:
            # Wait a moment to ensure download is fully complete
            time.sleep(0.5)
            
            # Check that no .crdownload files remain
            crdownloads = list(temp_dir.glob("*.crdownload"))
            if not crdownloads:
                return files[0]
        
        time.sleep(0.5)
    
    return None


def clear_temp_dir(temp_dir: Path):
    """Remove all files from temp directory."""
    for f in temp_dir.iterdir():
        try:
            f.unlink()
        except:
            pass


def download_files_for_year(driver: webdriver.Chrome, base_url: str, year: str, 
                            output_dir: Path, temp_dir: Path) -> list[Path]:
    """
    Download all Excel files for a given election year.
    
    Args:
        driver: Selenium WebDriver
        base_url: Base URL pattern
        year: Election year
        output_dir: Final output directory (with renamed files)
        temp_dir: Temporary download directory
    
    Returns:
        List of downloaded file paths
    """
    page_url = f"{base_url}-{year}-general"
    
    # Get all Excel links
    excel_links = get_excel_links(driver, page_url)
    
    if not excel_links:
        print(f"  No Excel files found for {year}")
        return []
    
    downloaded = []
    filename_counts = {}  # Track duplicates
    
    for i, (original_filename, file_url) in enumerate(excel_links, 1):
        print(f"  [{i}/{len(excel_links)}] Downloading: {original_filename}")
        
        try:
            # Clear temp directory before each download
            clear_temp_dir(temp_dir)
            
            # Navigate to download the file
            driver.get(file_url)
            
            # Wait for download to complete
            downloaded_file = wait_for_download(temp_dir, timeout=30)
            
            if downloaded_file:
                # Rename and move to output directory
                name_without_ext = Path(original_filename).stem
                extension = Path(original_filename).suffix
                
                # Handle duplicates by adding a suffix
                base_new_name = f"{name_without_ext}_{year}"
                
                if base_new_name in filename_counts:
                    filename_counts[base_new_name] += 1
                    new_filename = f"{base_new_name}_{filename_counts[base_new_name]}{extension}"
                else:
                    filename_counts[base_new_name] = 1
                    new_filename = f"{base_new_name}{extension}"
                
                final_path = output_dir / new_filename
                
                # Move/rename the file
                downloaded_file.rename(final_path)
                downloaded.append(final_path)
                print(f"    Saved as: {new_filename}")
            else:
                print(f"    Timeout or error downloading: {original_filename}")
            
        except Exception as e:
            print(f"    Error downloading {original_filename}: {e}")
            # Continue with next file instead of crashing
            continue
        
        time.sleep(1)  # Small delay between downloads
    
    # Return to the page listing (helps with next year)
    try:
        driver.get(page_url)
        time.sleep(1)
    except:
        pass
    
    return downloaded

ef main():
    """Main entry point."""
    
    # ==========================================================================
    # CONFIGURATION
    # ==========================================================================
    
    BASE_URL = "https://sos.iowa.gov/precinct-results-county"
    YEARS = ["2016", "2018", "2020"]
    OUTPUT_DIRECTORY = "./iowa_election_results"
    
    # ==========================================================================
    
    output_path = Path(OUTPUT_DIRECTORY)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Download into a temporary subfolder of the output directory
    # (Chrome saves each file here; we then rename it and move it up)
    temp_download_path = output_path / "_temp_downloads"
    temp_download_path.mkdir(parents=True, exist_ok=True)
    
    print("Setting up browser...")
    driver = setup_driver(str(temp_download_path))
    
    all_downloaded = []
    
    try:
        for year in YEARS:
            print(f"\n{'#'*60}")
            print(f"# Processing year: {year}")
            print('#'*60)
            
            downloaded = download_files_for_year(
                driver, BASE_URL, year, output_path, temp_download_path
            )
            all_downloaded.extend(downloaded)
            
            print(f"\nDownloaded {len(downloaded)} files for {year}")
        
    finally:
        print("\nClosing browser...")
        driver.quit()
        
        # Clean up temp directory
        if temp_download_path.exists():
            for f in temp_download_path.iterdir():
                f.unlink()
            temp_download_path.rmdir()
    
    print(f"\n{'='*60}")
    print(f"SUMMARY: Downloaded {len(all_downloaded)} total files")
    print(f"Location: {output_path.absolute()}")
    print('='*60)


if __name__ == "__main__":
    main()

In [None]:
#!/usr/bin/env python3
"""
Download Iowa Secretary of State precinct election results (Excel files)
using Selenium to control a real browser (bypasses bot protection).

Requirements:
    pip install selenium webdriver-manager openpyxl xlrd
"""

import os
import time
from pathlib import Path
from urllib.parse import urljoin, urlparse, unquote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager


def setup_driver(download_dir: str) -> webdriver.Chrome:
    """
    Set up Chrome WebDriver with download directory configured.
    
    Args:
        download_dir: Directory where files will be downloaded
    
    Returns:
        Configured Chrome WebDriver
    """
    chrome_options = Options()
    
    # Configure download behavior
    prefs = {
        "download.default_directory": str(Path(download_dir).absolute()),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)
    
    # Optional: run headless (no visible browser window)
    # chrome_options.add_argument("--headless")
    
    # Avoid detection
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    # Initialize the driver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    # Further avoid detection
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    
    return driver


def get_excel_links(driver: webdriver.Chrome, page_url: str) -> list[tuple[str, str]]:
    """
    Navigate to a page and find all Excel file links.
    
    Args:
        driver: Selenium WebDriver
        page_url: URL to scrape
    
    Returns:
        List of tuples: (filename, full_url)
    """
    print(f"Navigating to: {page_url}")
    driver.get(page_url)
    
    # Wait for page to load
    time.sleep(3)
    
    # Find all links
    links = driver.find_elements(By.TAG_NAME, "a")
    
    excel_links = []
    excel_extensions = ('.xls', '.xlsx', '.xlsm')
    
    for link in links:
        try:
            href = link.get_attribute("href")
            if href and any(href.lower().endswith(ext) for ext in excel_extensions):
                # Extract filename from URL
                parsed = urlparse(href)
                filename = unquote(Path(parsed.path).name)
                excel_links.append((filename, href))
        except:
            continue
    
    print(f"  Found {len(excel_links)} Excel file(s)")
    return excel_links


def wait_for_download(temp_dir: Path, timeout: int = 30) -> Path | None:
    """
    Wait for a download to complete in the temp directory.
    Returns the path to the downloaded file, or None if timeout.
    """
    start_time = time.time()
    
    while time.time() - start_time < timeout:
        # Look for any new files (excluding .crdownload partial files)
        files = [f for f in temp_dir.iterdir() 
                 if f.is_file() and not f.suffix == '.crdownload']
        
        if files:
            # Wait a moment to ensure download is fully complete
            time.sleep(0.5)
            
            # Check that no .crdownload files remain
            crdownloads = list(temp_dir.glob("*.crdownload"))
            if not crdownloads:
                return files[0]
        
        time.sleep(0.5)
    
    return None


def clear_temp_dir(temp_dir: Path):
    """Remove all files from temp directory."""
    for f in temp_dir.iterdir():
        try:
            f.unlink()
        except:
            pass


def download_files_for_year(driver: webdriver.Chrome, base_url: str, year: str, 
                            output_dir: Path, temp_dir: Path) -> list[Path]:
    """
    Download all Excel files for a given election year.
    
    Args:
        driver: Selenium WebDriver
        base_url: Base URL pattern
        year: Election year
        output_dir: Final output directory (with renamed files)
        temp_dir: Temporary download directory
    
    Returns:
        List of downloaded file paths
    """
    page_url = f"{base_url}-{year}-general"
    
    # Get all Excel links
    excel_links = get_excel_links(driver, page_url)
    
    if not excel_links:
        print(f"  No Excel files found for {year}")
        return []
    
    downloaded = []
    filename_counts = {}  # Track duplicates
    
    for i, (original_filename, file_url) in enumerate(excel_links, 1):
        print(f"  [{i}/{len(excel_links)}] Downloading: {original_filename}")
        
        try:
            # Clear temp directory before each download
            clear_temp_dir(temp_dir)
            
            # Navigate to download the file
            driver.get(file_url)
            
            # Wait for download to complete
            downloaded_file = wait_for_download(temp_dir, timeout=30)
            
            if downloaded_file:
                # Rename and move to output directory
                name_without_ext = Path(original_filename).stem
                extension = Path(original_filename).suffix
                
                # Handle duplicates by adding a suffix
                base_new_name = f"{name_without_ext}_{year}"
                
                if base_new_name in filename_counts:
                    filename_counts[base_new_name] += 1
                    new_filename = f"{base_new_name}_{filename_counts[base_new_name]}{extension}"
                else:
                    filename_counts[base_new_name] = 1
                    new_filename = f"{base_new_name}{extension}"
                
                final_path = output_dir / new_filename
                
                # Move/rename the file
                downloaded_file.rename(final_path)
                downloaded.append(final_path)
                print(f"    Saved as: {new_filename}")
            else:
                print(f"    Timeout or error downloading: {original_filename}")
            
        except Exception as e:
            print(f"    Error downloading {original_filename}: {e}")
            # Continue with next file instead of crashing
            continue
        
        time.sleep(1)  # Small delay between downloads
    
    # Return to the page listing (helps with next year)
    try:
        driver.get(page_url)
        time.sleep(1)
    except:
        pass
    
    return downloaded


def main():
    """Main entry point."""
    
    # ==========================================================================
    # CONFIGURATION
    # ==========================================================================
    
    BASE_URL = "https://sos.iowa.gov/precinct-results-county"
    YEARS = ["2016", "2018", "2020"]
    OUTPUT_DIRECTORY = "./iowa_election_results"
    
    # ==========================================================================
    
    output_path = Path(OUTPUT_DIRECTORY)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Download into a temporary subfolder of the output directory
    # (Chrome saves each file here; we then rename it and move it up)
    temp_download_path = output_path / "_temp_downloads"
    temp_download_path.mkdir(parents=True, exist_ok=True)
    
    print("Setting up browser...")
    driver = setup_driver(str(temp_download_path))
    
    all_downloaded = []
    
    try:
        for year in YEARS:
            print(f"\n{'#'*60}")
            print(f"# Processing year: {year}")
            print('#'*60)
            
            downloaded = download_files_for_year(
                driver, BASE_URL, year, output_path, temp_download_path
            )
            all_downloaded.extend(downloaded)
            
            print(f"\nDownloaded {len(downloaded)} files for {year}")
        
    finally:
        print("\nClosing browser...")
        driver.quit()
        
        # Clean up temp directory
        if temp_download_path.exists():
            for f in temp_download_path.iterdir():
                f.unlink()
            temp_download_path.rmdir()
    
    print(f"\n{'='*60}")
    print(f"SUMMARY: Downloaded {len(all_downloaded)} total files")
    print(f"Location: {output_path.absolute()}")
    print('='*60)


if __name__ == "__main__":
    main()
    

---

## ✅ FINAL WORKING VERSION

**This is the cell to run!** It includes all the bug fixes.

### Summary of What's Different From Earlier Versions

| Problem | Original Code | Fixed Code |
|---------|---------------|------------|
| Typo | `ef main():` | `def main():` |
| Download detection | Checked for specific filename | Looks for ANY new Excel file |
| Partial files | Watched only for `<filename>.crdownload` | Waits until no `.crdownload` or `.tmp` files remain |
| Temp folder | Files accumulated | `clear_temp_dir()` cleans up each time |
| Error handling | Script crashed on errors | `try/except` continues to next file |
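
The last row of the table is a pattern worth recognizing on its own. The sketch below shows it with a pretend download function, so it runs without a browser; `process_all` and `fake_download` are hypothetical names, and the error is simulated.

```python
def process_all(items, action):
    """Run action on every item; record failures instead of stopping."""
    succeeded, failed = [], []
    for item in items:
        try:
            action(item)
        except Exception as err:
            # Note the problem and move on to the next item
            failed.append((item, str(err)))
            continue
        succeeded.append(item)
    return succeeded, failed


def fake_download(name):
    """Stand-in for a real download: fails for one particular file."""
    if name == "Broken.xls":
        raise RuntimeError("simulated network error")


ok, bad = process_all(["Adair.xls", "Broken.xls", "Adams.xls"], fake_download)
print("succeeded:", ok)
print("failed:   ", bad)
```

Keeping a list of failures, rather than only printing them, makes it easy to retry just the files that did not come through.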

### What to Expect

- Downloads take 10-15 minutes (we're being polite to the server with pauses)
- You'll see progress printed: `[1/99] Downloading: Adair.xls`
- Final summary: `SUMMARY: Downloaded 297 total files`
- Files end up in `./iowa_election_results/` folder
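
Once the run finishes, a quick count per year tells you whether anything is missing. This is a sketch that assumes the naming pattern used by the script (`Name_YEAR.xls`, or `Name_YEAR_2.xls` for a duplicate); the helper name `count_files_by_year` is ours. It is demonstrated on a throwaway folder with empty dummy files, so the counts shown are not real results.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def count_files_by_year(folder: Path) -> Counter:
    """Count files whose names end in _YEAR (optionally followed by _N)."""
    counts = Counter()
    for f in folder.iterdir():
        match = re.search(r"_(\d{4})(?:_\d+)?$", f.stem)
        if f.is_file() and match:
            counts[match.group(1)] += 1
    return counts


# Demo on dummy files; point this at Path("./iowa_election_results") for real use
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["Adair_2016.xls", "Adams_2016.xls", "Adams_2016_2.xls",
                 "Adair_2018.xls", "notes.txt"]:
        (folder / name).touch()
    demo_counts = count_files_by_year(folder)

for year, n in sorted(demo_counts.items()):
    print(year, n)
```

If a year comes up short of 99 in your real folder, scroll back through the printed progress for any "Timeout or error" lines.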

In [5]:
#!/usr/bin/env python3
"""
Download Iowa Secretary of State precinct election results (Excel files)
using Selenium to control a real browser (bypasses bot protection).

Requirements:
    pip install selenium webdriver-manager openpyxl xlrd
"""

import os
import time
from pathlib import Path
from urllib.parse import urljoin, urlparse, unquote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager


def setup_driver(download_dir: str) -> webdriver.Chrome:
    """
    Set up Chrome WebDriver with download directory configured.
    
    Args:
        download_dir: Directory where files will be downloaded
    
    Returns:
        Configured Chrome WebDriver
    """
    chrome_options = Options()
    
    # Configure download behavior
    prefs = {
        "download.default_directory": str(Path(download_dir).absolute()),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)
    
    # Optional: run headless (no visible browser window)
    # chrome_options.add_argument("--headless")
    
    # Avoid detection
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    # Initialize the driver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    # Further avoid detection
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    
    return driver


def get_excel_links(driver: webdriver.Chrome, page_url: str) -> list[tuple[str, str]]:
    """
    Navigate to a page and find all Excel file links.
    
    Args:
        driver: Selenium WebDriver
        page_url: URL to scrape
    
    Returns:
        List of tuples: (filename, full_url)
    """
    print(f"Navigating to: {page_url}")
    driver.get(page_url)
    
    # Wait for page to load
    time.sleep(3)
    
    # Find all links
    links = driver.find_elements(By.TAG_NAME, "a")
    
    excel_links = []
    excel_extensions = ('.xls', '.xlsx', '.xlsm')
    
    for link in links:
        try:
            href = link.get_attribute("href")
            if href and any(href.lower().endswith(ext) for ext in excel_extensions):
                # Extract filename from URL
                parsed = urlparse(href)
                filename = unquote(Path(parsed.path).name)
                excel_links.append((filename, href))
        except Exception:  # skip links that vanish mid-scan, but still allow Ctrl+C / kernel interrupt
            continue
    
    print(f"  Found {len(excel_links)} Excel file(s)")
    return excel_links


def wait_for_download(temp_dir: Path, timeout: int = 60) -> Path | None:
    """
    Wait for a download to complete in the temp directory.
    Returns the path to the downloaded file, or None if timeout.
    
    Handles Chrome's behavior of using .tmp or .crdownload files during download.
    """
    start_time = time.time()
    
    while time.time() - start_time < timeout:
        # Get all files in temp directory
        all_files = list(temp_dir.iterdir())
        
        # Check for partial download indicators
        has_partial = any(
            f.suffix.lower() in ('.crdownload', '.tmp') or 
            f.name.endswith('.crdownload')
            for f in all_files
        )
        
        # Look for completed Excel files
        excel_files = [
            f for f in all_files 
            if f.is_file() and f.suffix.lower() in ('.xls', '.xlsx', '.xlsm')
        ]
        
        if excel_files and not has_partial:
            # Found a completed Excel file
            time.sleep(0.5)  # Brief pause to ensure fully written
            return excel_files[0]
        
        time.sleep(0.5)
    
    return None


def clear_temp_dir(temp_dir: Path):
    """Remove all files from temp directory."""
    for f in temp_dir.iterdir():
        try:
            f.unlink()
        except OSError:  # e.g. file still locked by Chrome, or a subfolder
            pass


def download_files_for_year(driver: webdriver.Chrome, base_url: str, year: str, 
                            output_dir: Path, temp_dir: Path) -> list[Path]:
    """
    Download all Excel files for a given election year.
    
    Args:
        driver: Selenium WebDriver
        base_url: Base URL pattern
        year: Election year
        output_dir: Final output directory (with renamed files)
        temp_dir: Temporary download directory
    
    Returns:
        List of downloaded file paths
    """
    page_url = f"{base_url}-{year}-general"
    
    # Get all Excel links
    excel_links = get_excel_links(driver, page_url)
    
    if not excel_links:
        print(f"  No Excel files found for {year}")
        return []
    
    downloaded = []
    filename_counts = {}  # Track duplicates
    
    for i, (original_filename, file_url) in enumerate(excel_links, 1):
        print(f"  [{i}/{len(excel_links)}] Downloading: {original_filename}")
        
        try:
            # Clear temp directory before each download
            clear_temp_dir(temp_dir)
            
            # Navigate to download the file
            driver.get(file_url)
            
            # Wait for download to complete
            downloaded_file = wait_for_download(temp_dir, timeout=30)
            
            if downloaded_file:
                # Rename and move to output directory
                name_without_ext = Path(original_filename).stem
                extension = Path(original_filename).suffix
                
                # Handle duplicates by adding a suffix
                base_new_name = f"{name_without_ext}_{year}"
                
                if base_new_name in filename_counts:
                    filename_counts[base_new_name] += 1
                    new_filename = f"{base_new_name}_{filename_counts[base_new_name]}{extension}"
                else:
                    filename_counts[base_new_name] = 1
                    new_filename = f"{base_new_name}{extension}"
                
                final_path = output_dir / new_filename
                
                # Move/rename the file. replace() overwrites a leftover file
                # from an earlier run; rename() would raise on Windows.
                downloaded_file.replace(final_path)
                downloaded.append(final_path)
                print(f"    Saved as: {new_filename}")
            else:
                print(f"    Timeout or error downloading: {original_filename}")
            
        except Exception as e:
            print(f"    Error downloading {original_filename}: {e}")
            # Continue with next file instead of crashing
            continue
        
        time.sleep(1)  # Small delay between downloads
    
    # Return to the page listing (helps with next year)
    try:
        driver.get(page_url)
        time.sleep(1)
    except Exception:
        # Best effort only: the next year loads its own page anyway
        pass
    
    return downloaded


def main():
    """Main entry point."""
    
    # ==========================================================================
    # CONFIGURATION
    # ==========================================================================
    
    BASE_URL = "https://sos.iowa.gov/precinct-results-county"
    YEARS = ["2016", "2018", "2020"]
    OUTPUT_DIRECTORY = "./iowa_election_results"
    
    # ==========================================================================
    
    output_path = Path(OUTPUT_DIRECTORY)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Selenium downloads into a temporary subfolder of the output directory;
    # each finished file is then renamed and moved into the output directory
    temp_download_path = output_path / "_temp_downloads"
    temp_download_path.mkdir(parents=True, exist_ok=True)
    
    print("Setting up browser...")
    driver = setup_driver(str(temp_download_path))
    
    all_downloaded = []
    
    try:
        for year in YEARS:
            print(f"\n{'#'*60}")
            print(f"# Processing year: {year}")
            print('#'*60)
            
            downloaded = download_files_for_year(
                driver, BASE_URL, year, output_path, temp_download_path
            )
            all_downloaded.extend(downloaded)
            
            print(f"\nDownloaded {len(downloaded)} files for {year}")
        
    finally:
        print("\nClosing browser...")
        driver.quit()
        
        # Give Windows a moment to release handles
        time.sleep(2)
        
        # Clean up temp directory (ignore file-system errors on Windows)
        try:
            if temp_download_path.exists():
                for f in temp_download_path.iterdir():
                    try:
                        f.unlink()
                    except OSError:
                        pass
                try:
                    temp_download_path.rmdir()
                except OSError:
                    print(f"Note: Could not remove temp folder. You can manually delete: {temp_download_path}")
        except OSError:
            pass
    
    print(f"\n{'='*60}")
    print(f"SUMMARY: Downloaded {len(all_downloaded)} total files")
    print(f"Location: {output_path.absolute()}")
    print('='*60)


if __name__ == "__main__":
    main()
    

Setting up browser...

############################################################
# Processing year: 2016
############################################################
Navigating to: https://sos.iowa.gov/precinct-results-county-2016-general
  Found 101 Excel file(s)
  [1/101] Downloading: adair.xlsx
    Saved as: adair_2016.xlsx
  [2/101] Downloading: adams.xlsx
    Saved as: adams_2016.xlsx
  [3/101] Downloading: allamakee.xlsx
    Saved as: allamakee_2016.xlsx
  [4/101] Downloading: appanoose.xlsx
    Saved as: appanoose_2016.xlsx
  [5/101] Downloading: audubon.xlsx
    Saved as: audubon_2016.xlsx
  [6/101] Downloading: benton.xlsx
    Saved as: benton_2016.xlsx
  [7/101] Downloading: black hawk.xlsx
    Saved as: black hawk_2016.xlsx
  [8/101] Downloading: boone.xlsx
    Saved as: boone_2016.xlsx
  [9/101] Downloading: bremer.xlsx
    Saved as: bremer_2016.xlsx
  [10/101] Downloading: buchanan.xlsx
    Saved as: buchanan_2016.xlsx
  [11/101] Downloading: buena vista.xlsx
    Saved 

InvalidSessionIdException: Message: invalid session id; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#invalidsessionidexception
Stacktrace:
0   chromedriver                        0x00000001007f6fd4 cxxbridge1$str$ptr + 3095476
1   chromedriver                        0x00000001007ef3ec cxxbridge1$str$ptr + 3063756
2   chromedriver                        0x00000001002d2640 _RNvCs5DBLTqoOdVp_7___rustc35___rust_no_alloc_shim_is_unstable_v2 + 74500
3   chromedriver                        0x000000010030f3d8 _RNvCs5DBLTqoOdVp_7___rustc35___rust_no_alloc_shim_is_unstable_v2 + 323740
4   chromedriver                        0x00000001003381b8 _RNvCs5DBLTqoOdVp_7___rustc35___rust_no_alloc_shim_is_unstable_v2 + 491132
5   chromedriver                        0x00000001003375a4 _RNvCs5DBLTqoOdVp_7___rustc35___rust_no_alloc_shim_is_unstable_v2 + 488040
6   chromedriver                        0x000000010029f66c chromedriver + 112236
7   chromedriver                        0x00000001007b55dc cxxbridge1$str$ptr + 2826684
8   chromedriver                        0x00000001007b8d1c cxxbridge1$str$ptr + 2840828
9   chromedriver                        0x000000010079abec cxxbridge1$str$ptr + 2717644
10  chromedriver                        0x00000001007b95a0 cxxbridge1$str$ptr + 2843008
11  chromedriver                        0x000000010078acc4 cxxbridge1$str$ptr + 2652324
12  chromedriver                        0x000000010029d2b0 chromedriver + 103088
13  dyld                                0x000000018c83eb98 start + 6076
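
An `InvalidSessionIdException` means Selenium sent a command to a browser session that no longer exists, for example because the browser window was closed or crashed. One possible way to cope is to start a fresh browser and retry the step. The helper below is a minimal sketch of that idea, not part of the original script: `call_with_restart` and its parameters are hypothetical names, and it is written generically so it can be tried without a browser.

```python
import time


def call_with_restart(session, make_session, action, session_errors,
                      max_restarts=1, pause_seconds=0.0):
    """Run action(session); if the session has died, build a new one and retry.

    Returns (session, result) so the caller can keep using the session
    that actually worked. All names here are illustrative assumptions.
    """
    restarts = 0
    while True:
        try:
            return session, action(session)
        except session_errors:
            if restarts >= max_restarts:
                raise  # give up and let the caller see the error
            restarts += 1
            time.sleep(pause_seconds)
            session = make_session()  # the old session is treated as unusable
```

In this notebook, `make_session` could be `lambda: setup_driver(str(temp_download_path))`, `action` could be `lambda d: d.get(file_url)`, and `session_errors` could be `(InvalidSessionIdException,)` imported from `selenium.common.exceptions`. The caller would then continue the loop with the returned driver.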


---

## Summary: What We Learned

### Key Takeaways

1. **Web scraping** is the process of extracting data from websites programmatically
2. Many websites have **bot protection** that blocks simple scraping scripts
3. **Selenium** can control a real browser to bypass these protections
4. Python's **loops** let us repeat tasks (downloading 297 files) automatically

### The Big Picture

What would have taken **hours** of clicking and downloading by hand took just a few minutes of computer time. This is the power of automation!

### Python Skills Used
- Installing packages with `pip`
- Importing libraries
- Working with file paths (`Path` from `pathlib`)
- Loops (`for year in YEARS:`)
- Functions (reusable blocks of code)
- Error handling (`try/except`)
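
As an illustration of the "functions" idea, the renaming step inside the download loop can be pulled out into a small function of its own. This is a sketch that mirrors the logic shown in the script (append the year, and add a numeric suffix when the same name appears again); the function name `year_tagged_name` is hypothetical.

```python
from pathlib import Path


def year_tagged_name(original_filename, year, counts):
    """Return 'name_YEAR.ext', adding '_2', '_3', ... for repeated names.

    counts is a dict that remembers how often each base name was seen.
    """
    path = Path(original_filename)
    base = f"{path.stem}_{year}"
    counts[base] = counts.get(base, 0) + 1
    if counts[base] == 1:
        return f"{base}{path.suffix}"
    return f"{base}_{counts[base]}{path.suffix}"


seen = {}
print(year_tagged_name("adair.xlsx", "2016", seen))  # adair_2016.xlsx
print(year_tagged_name("adair.xlsx", "2016", seen))  # adair_2016_2.xlsx
```

Keeping this logic in one function makes it easy to test without opening a browser.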

### Next Steps
Now that we have all the Excel files downloaded, Stage 2 will:
- Read each Excel file
- Extract the data we need
- Combine everything into one dataset
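
Before starting Stage 2, it can help to confirm how many files were actually saved for each year, since a run can stop partway through. The sketch below counts Excel files per year. It assumes the naming pattern produced by the download script (`name_YEAR.xlsx`, or `name_YEAR_2.xlsx` for duplicates); the function name `count_files_by_year` is hypothetical.

```python
import re
from pathlib import Path


def count_files_by_year(folder, years):
    """Count .xls/.xlsx files in folder whose name ends in _YEAR or _YEAR_N."""
    pattern = re.compile(r"_(\d{4})(?:_\d+)?$")
    counts = {year: 0 for year in years}
    for path in Path(folder).iterdir():
        if not path.is_file() or path.suffix.lower() not in {".xls", ".xlsx"}:
            continue
        match = pattern.search(path.stem)
        if match and match.group(1) in counts:
            counts[match.group(1)] += 1
    return counts
```

For example, `count_files_by_year("./iowa_election_results", ["2016", "2018", "2020"])` returns a dictionary with one count per year, which can be compared with the number of links reported for each page.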

---
*Questions? Ask your instructor or check the Python documentation at https://docs.python.org*