# Stage 1: Downloading Election Data from the Web

## The Goal

We need precinct-level election results from Iowa's Secretary of State website for 2016, 2018, and 2020. Each county has its own Excel file, and there are 99 counties × 3 years = **297 files** to download.

Doing this by hand would take hours of clicking. Instead, we'll write code to automate it.

## The Big Picture (Language-Agnostic)

Whether you're used to **Stata**, **R**, or **Python**, the logic is the same:

1. **Loop** through each year (2016, 2018, 2020)
2. **Go to** the webpage listing that year's files
3. **Find** all the download links for Excel files
4. **Download** each file
5. **Rename** it to include the year (so we know which election it's from)

The code below does exactly this. We'll try a simple approach first (which fails), then use a more powerful tool that works.

---

## Key Concepts (For Stata/R Users)

### Loops: Repeating Actions

| Stata | R | Python |
|-------|---|--------|
| `foreach year in 2016 2018 2020 {` | `for (year in c(2016, 2018, 2020)) {` | `for year in [2016, 2018, 2020]:` |
| `    ...` | `    ...` | `    ...` |
| `}` | `}` | |

### Building Text (String Concatenation)

Creating a URL by combining pieces:

| Stata | R | Python |
|-------|---|--------|
| `` "https://site.com/" + "`year'" `` | `paste0("https://site.com/", year)` | `f"https://site.com/{year}"` |
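
As a small illustration of string building inside a loop, the sketch below prints one results-page URL per election year. The `BASE_URL` value follows the pattern this notebook uses for the Iowa results pages; treat it as an assumption if the site has changed.

```python
# Build one results-page URL per election year with an f-string.
BASE_URL = "https://sos.iowa.gov/precinct-results-county"  # pattern used in this notebook
years = ["2016", "2018", "2020"]

page_urls = []
for year in years:
    # {year} is replaced by the current value of the loop variable
    page_url = f"{BASE_URL}-{year}-general"
    page_urls.append(page_url)
    print(page_url)
```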

### Conditional Logic

| Stata | R | Python |
|-------|---|--------|
| `if x == 1 {` | `if (x == 1) {` | `if x == 1:` |

### What is HTML?

Webpages are written in **HTML**, a structured text format with "tags":

```html
<a href="https://example.com/Adair_2018.xls">Download Adair County</a>
```

- `<a>` = a link (anchor)
- `href="..."` = where the link points to
- The text between tags is what you see on the page

**BeautifulSoup** (Python) and **rvest** (R) are tools that read HTML and let you search for specific tags, such as all the `<a>` links on a page.
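
To see what "finding all the `<a>` links" means before installing anything, here is a minimal sketch that uses Python's built-in `html.parser` module as a stand-in for BeautifulSoup. The `LinkCollector` class and the sample HTML are made up for illustration; BeautifulSoup does the equivalent with a single `find_all('a')` call.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "https://...")]
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Made-up sample page, not the real Iowa site
sample_html = """
<p>Results:</p>
<a href="https://example.com/Adair_2018.xls">Download Adair County</a>
<a href="https://example.com/about.html">About</a>
"""

collector = LinkCollector()
collector.feed(sample_html)

# Keep only links that end in an Excel extension
excel_links = [h for h in collector.hrefs if h.lower().endswith((".xls", ".xlsx"))]
print(excel_links)
```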

---

## Setup: Install Required Tools

First, we install the packages (libraries) we'll need. 

- **requests**: downloads web pages (like `httr` in R)
- **beautifulsoup4**: parses HTML to find links (like `rvest` in R)
- **openpyxl** / **xlrd**: read Excel files (like `readxl` in R or `import excel` in Stata)

Run this cell once:

In [None]:
# Install packages (only need to run once)
# The "!" means "run this as a terminal command"
!pip install requests beautifulsoup4 openpyxl xlrd

## First Attempt: Simple Web Scraping

The code below tries to:

1. **Build the URL** for each year's results page
2. **Download the HTML** of that page
3. **Parse the HTML** to find all links
4. **Filter** to keep only Excel files for that year
5. **Download each file** and save it

### Reading the Code

```python
for year in years:                    # Loop through each year
    url = f"...{year}..."             # Build the URL (like paste0 in R)
    response = requests.get(url)       # Download the page
    soup = BeautifulSoup(response.text, 'html.parser')  # Parse the HTML
    
    for link in soup.find_all('a', href=True):  # Find all <a> tags that have an href
        href = link['href']            # Get the URL from the link
        if href.endswith('.xls'):      # Check if it's an Excel file
            ...                        # Download it
```

### What to Expect

**This will fail!** You'll see "403 Forbidden" errors. That's intentional: we'll learn why and fix it.

---

In [None]:
#!/usr/bin/env python3
"""
First attempt: Download Iowa election results using requests + BeautifulSoup
"""

import requests
from bs4 import BeautifulSoup
from pathlib import Path
import time

# =============================================================================
# CONFIGURATION - What years do we want?
# =============================================================================

years = ["2016", "2018", "2020"]
output_folder = Path("./iowa_election_results")

# Create the output folder if it doesn't exist
output_folder.mkdir(exist_ok=True)
print(f"Files will be saved to: {output_folder.absolute()}\n")

# =============================================================================
# MAIN LOOP - Go through each year
# =============================================================================

for year in years:
    
    # Build the URL for this year's page
    # This is like: paste0("https://sos.iowa.gov/precinct-results-county-", year, "-general") in R
    url = f"https://sos.iowa.gov/precinct-results-county-{year}-general"
    print(f"Processing {year}...")
    print(f"  URL: {url}")
    
    # Download the webpage
    # This is like: httr::GET(url) in R
    response = requests.get(url)
    
    # Check if the page loaded successfully
    # Status 200 = OK, 403 = Forbidden, 404 = Not Found
    if response.status_code != 200:
        print(f"  ERROR: Could not load page (status {response.status_code})")
        continue  # Skip to next year (like "next" in R loops)
    
    # Parse the HTML to find all links
    # BeautifulSoup reads the HTML structure so we can search it
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all links (<a> tags) that point to Excel files
    excel_links = []
    for link in soup.find_all('a', href=True):  # Find all <a> tags that have an href
        href = link['href']                       # Get the URL
        
        # Check if this link is to an Excel file for our year
        is_excel = href.lower().endswith('.xls') or href.lower().endswith('.xlsx')
        is_correct_year = f"{year}general" in href.lower()
        
        if is_excel and is_correct_year:
            excel_links.append(href)
    
    print(f"  Found {len(excel_links)} Excel files")
    
print("\n" + "="*50)
print("RESULT: Downloads failed due to 403 errors.")
print("The website detected we're a script, not a human!")

## Why Did It Fail?

We got **403 Forbidden** errors. What happened?

### HTTP Status Codes

When you request a webpage, the server sends back a status code:

| Code | Meaning |
|------|---------|
| 200 | Success: here's the page |
| 403 | Forbidden: I'm refusing to serve you |
| 404 | Not Found: page doesn't exist |
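
Python's standard library already knows the official name of each status code. The sketch below uses `http.HTTPStatus` to turn a number into a readable message; the `describe_status` helper is a hypothetical name introduced here, not part of the notebook's download code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase   # e.g. 403 -> "Forbidden"
    except ValueError:
        phrase = "Unknown status"          # code is not a registered status
    outcome = "success" if 200 <= code < 300 else "problem"
    return f"{code} {phrase} ({outcome})"


for code in (200, 403, 404):
    print(describe_status(code))
```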

### Bot Protection

Many websites check **who** is asking for the page:

- Real browser? ✅ Allowed
- Python script? ❌ Blocked

They do this by looking at **headers** ‚Äî extra information sent with each request. Your browser sends headers like:

```
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0...
```

But Python's `requests` library sends:

```
User-Agent: python-requests/2.31.0
```

The website sees "python-requests" and says "Nope, you're a bot!"

### Solutions

1. **Add fake headers**: pretend to be a browser (sometimes works)
2. **Use a real browser**: control Chrome, Edge, or Firefox with Selenium (far more reliable)

We'll skip option 1 (this website blocks it too) and go straight to Selenium.
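
For reference, option 1 amounts to attaching a browser-like `User-Agent` to the request. The sketch below only builds a request object with the standard library's `urllib` and inspects its headers; nothing is sent over the network. The URL and header value are placeholders. With `requests`, the same dictionary would be passed as the `headers=` argument of `requests.get`.

```python
from urllib.request import Request

url = "https://example.com/"  # placeholder URL, not the Iowa site
browser_like_headers = {
    # Illustrative value only; real browsers send longer strings
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Build the request object without sending it
request = Request(url, headers=browser_like_headers)

# urllib stores header names capitalized as "User-agent"
user_agent = request.get_header("User-agent")
print(user_agent)
```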

---

## The Solution: Browser Automation with Selenium

**Selenium** is a tool that lets Python control a real web browser (Chrome, Edge, or Firefox). 

Instead of Python pretending to be a browser, Python actually *operates* a browser: clicking buttons, navigating pages, and downloading files just like a human would.

### Why This Works

The website has a much harder time telling the difference because it IS a real browser! All the JavaScript runs, cookies work normally, and the browser fingerprint looks far more legitimate than a bare script's.

### The Logic is the Same

Remember our original approach?

1. Loop through years
2. Go to the webpage
3. Find Excel links
4. Download each file

**Selenium does the exact same thing**; we just swap out the tools:

| Task | Before (requests) | After (Selenium) |
|------|-------------------|------------------|
| Go to webpage | `requests.get(url)` | `driver.get(url)` |
| Find links | `soup.find_all('a')` | `driver.find_elements(By.TAG_NAME, 'a')` |
| Download file | `requests.get(file_url)` | Browser downloads automatically |

---

## Install Selenium

Run this cell to install the browser automation tools:

In [None]:
!pip install selenium

## Choose Your Browser

Selenium can control Chrome, Edge, or Firefox. **Change the line below** to match what you have installed:

- **Edge** is pre-installed on Windows
- **Chrome** is most common
- **Firefox** also works

In [None]:
# ============================================================
# BROWSER SELECTION - Change this to match what you have!
# ============================================================

BROWSER = "chrome"   # Options: "chrome", "edge", "firefox"

print(f"✅ Will use: {BROWSER.title()}")
print(f"   (If you don't have {BROWSER.title()}, change BROWSER above and re-run this cell)")

## The Working Code

This cell does the full download. It will:

1. Open a browser window (or run "headless" if there's no display)
2. Navigate to each year's page
3. Find all Excel file links
4. Download each file
5. Rename files with the year (e.g., `Adair.xls` → `Adair_2018.xls`)

**This takes 10-15 minutes** because we pause between downloads to be polite to the server.

### Note: Headless Mode Detection

The code automatically detects if you're running in a cloud environment (no monitor):

```python
headless = os.environ.get('CODESPACES') == 'true' or (os.environ.get('DISPLAY') is None and os.name != 'nt')
```

| Check | What it catches |
|-------|-----------------|
| `CODESPACES == 'true'` | GitHub Codespaces |
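| `DISPLAY is None` | Any Linux server without a monitor (Gitpod, AWS, Replit, etc.); macOS usually leaves `DISPLAY` unset too, so Macs also match |
| `os.name != 'nt'` | Excludes Windows (which doesn't use DISPLAY) |

**On your laptop:** You'll see the browser window open and navigate automatically (on a Mac, expect headless mode unless `DISPLAY` happens to be set).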

**In the cloud:** The browser runs invisibly ("headless") since there's no screen to display it.
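
To check what the detection rule decides in different environments, it can help to wrap it in a function that takes its inputs as arguments. This is a sketch of the same rule as the one-liner above; the function name `should_run_headless` is introduced here for illustration only.

```python
import os


def should_run_headless(environ, os_name):
    """Same rule as the one-liner, with inputs passed in so it can be tried out."""
    in_codespaces = environ.get("CODESPACES") == "true"
    no_display = environ.get("DISPLAY") is None and os_name != "nt"
    return in_codespaces or no_display


# What the rule decides on the machine running this cell
print("This machine:", should_run_headless(os.environ, os.name))

# Simulated environments (plain dictionaries standing in for os.environ)
print("Windows laptop:", should_run_headless({}, "nt"))
print("Linux desktop:", should_run_headless({"DISPLAY": ":0"}, "posix"))
print("No DISPLAY on a non-Windows system:", should_run_headless({}, "posix"))
print("Codespaces:", should_run_headless({"CODESPACES": "true"}, "posix"))
```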

---

### A Note on "Practical Plumbing"

The code includes two helper functions, `wait_for_download` and `clear_folder`, that handle messy real-world details:

**`wait_for_download`**: When you download manually, you wait for it to finish before clicking the next link. Computers don't wait automatically, so this function checks every half-second: *"Is the download done yet?"*

**`clear_folder`**: Before each download, we empty the temp folder. Why? So when a new file appears, we know it's the one we just downloaded (not an old one). In plain English this is just "delete all files in folder", but Python has no single built-in command that empties a folder while keeping the folder itself, so we loop through and delete each file individually:

```python
for f in folder.iterdir():   # Loop through each file
    try:
        f.unlink()           # Delete it (.unlink = delete)
    except OSError:
        pass                 # If it fails, skip it
```
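
The polling idea behind `wait_for_download` can be tried without a browser. In the sketch below, a background thread plays the role of the browser: it writes a partial `.crdownload` file and renames it when "finished". The names `wait_for_file` and `fake_download`, the file name, and the timings are all made up for this demonstration.

```python
import tempfile
import threading
import time
from pathlib import Path


def wait_for_file(folder: Path, timeout: float = 5.0, poll: float = 0.1):
    """Poll `folder` until a finished .xls/.xlsx file appears, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = list(folder.iterdir())
        partial = [f for f in files if f.suffix in (".crdownload", ".tmp", ".part")]
        finished = [f for f in files if f.suffix.lower() in (".xls", ".xlsx")]
        if finished and not partial:
            return finished[0]
        time.sleep(poll)   # not done yet: wait a moment and look again
    return None            # timed out


def fake_download(folder: Path):
    """Stand-in for a browser: write a partial file, then rename it when done."""
    partial = folder / "adair.xls.crdownload"
    partial.write_bytes(b"fake data")
    time.sleep(0.3)                         # pretend the download takes a while
    partial.rename(folder / "adair.xls")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    worker = threading.Thread(target=fake_download, args=(folder,))
    worker.start()
    result = wait_for_file(folder)
    worker.join()
    result_name = result.name if result else None

print(result_name)
```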

**The bigger point:** If you're thinking "this seems like a lot of code for something simple", you're right! The core logic (loop through years, find links, download files) is straightforward. The extra code handles all the timing, file management, and edge cases that reality throws at you. Introductory courses rarely show this stuff, but it's what real automation requires.

---

In [None]:
#!/usr/bin/env python3
"""
Download Iowa election results using Selenium (browser automation).
Supports Chrome, Edge, and Firefox.
"""

import os
import time
from pathlib import Path
from urllib.parse import urlparse, unquote

from selenium import webdriver
from selenium.webdriver.common.by import By

# =============================================================================
# SETUP FUNCTIONS
# =============================================================================

def setup_driver(download_dir: str):
    """
    Set up the web browser for automated downloading.
    Uses the BROWSER variable you set in the previous cell.
    """
    browser = BROWSER.lower()
    download_path = str(Path(download_dir).absolute())
    
    # Check if we're in a headless environment (no display)
    headless = os.environ.get('CODESPACES') == 'true' or (os.environ.get('DISPLAY') is None and os.name != 'nt')
    
    if headless:
        print("Running in headless mode (no visible browser window)")
    
    # Set up the browser based on user's choice
    if browser == "chrome":
        from selenium.webdriver.chrome.options import Options
        options = Options()
        options.add_experimental_option("prefs", {
            "download.default_directory": download_path,
            "download.prompt_for_download": False,
        })
        if headless:
            options.add_argument("--headless=new")
            options.add_argument("--no-sandbox")
        options.add_argument("--disable-blink-features=AutomationControlled")
        driver = webdriver.Chrome(options=options)
        
    elif browser == "edge":
        from selenium.webdriver.edge.options import Options
        options = Options()
        options.add_experimental_option("prefs", {
            "download.default_directory": download_path,
            "download.prompt_for_download": False,
        })
        if headless:
            options.add_argument("--headless=new")
        driver = webdriver.Edge(options=options)
        
    elif browser == "firefox":
        from selenium.webdriver.firefox.options import Options
        options = Options()
        options.set_preference("browser.download.folderList", 2)
        options.set_preference("browser.download.dir", download_path)
        options.set_preference("browser.helperApps.neverAsk.saveToDisk", 
            "application/vnd.ms-excel,application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
        if headless:
            options.add_argument("--headless")
        driver = webdriver.Firefox(options=options)
    else:
        raise ValueError(f"Unknown browser: {browser}. Use 'chrome', 'edge', or 'firefox'")
    
    print(f"✅ Started {browser.title()} browser")
    return driver


def get_excel_links(driver, page_url: str) -> list:
    """
    Navigate to a page and find all Excel file links.
    
    This is the Selenium version of what we did with BeautifulSoup:
    - driver.get(url) instead of requests.get(url)
    - driver.find_elements() instead of soup.find_all()
    """
    print(f"  Navigating to: {page_url}")
    driver.get(page_url)
    time.sleep(3)  # Wait for page to load
    
    # Find all <a> tags (links) on the page
    links = driver.find_elements(By.TAG_NAME, "a")
    
    # Filter to keep only Excel files
    excel_links = []
    for link in links:
        try:
            href = link.get_attribute("href")
            if href and (href.lower().endswith('.xls') or href.lower().endswith('.xlsx')):
                filename = unquote(Path(urlparse(href).path).name)
                excel_links.append((filename, href))
        except Exception:  # e.g., the link went stale while we were reading it
            continue
    
    print(f"  Found {len(excel_links)} Excel files")
    return excel_links


def wait_for_download(temp_dir: Path, timeout: int = 60):
    """Wait for a download to complete."""
    start_time = time.time()
    
    while time.time() - start_time < timeout:
        files = list(temp_dir.iterdir())
        
        # Check for partial downloads (Chrome uses .crdownload)
        partial = [f for f in files if f.suffix in ('.crdownload', '.tmp', '.part')]
        excel = [f for f in files if f.suffix.lower() in ('.xls', '.xlsx')]
        
        if excel and not partial:
            time.sleep(0.5)
            return excel[0]
        
        time.sleep(0.5)
    
    return None


def clear_folder(folder: Path):
    """Remove all files from a folder."""
    for f in folder.iterdir():
        try:
            f.unlink()
        except OSError:  # e.g., file is locked or is a sub-folder; skip it
            pass


# =============================================================================
# MAIN DOWNLOAD SCRIPT
# =============================================================================

def main():
    """Download all Iowa election Excel files."""
    
    # Configuration
    BASE_URL = "https://sos.iowa.gov/precinct-results-county"
    YEARS = ["2016", "2018", "2020"]
    OUTPUT_DIR = Path("./iowa_election_results")
    TEMP_DIR = OUTPUT_DIR / "_temp"
    
    # Create folders
    OUTPUT_DIR.mkdir(exist_ok=True)
    TEMP_DIR.mkdir(exist_ok=True)
    
    print(f"Files will be saved to: {OUTPUT_DIR.absolute()}")
    print(f"Using browser: {BROWSER}\n")
    
    # Start the browser
    driver = setup_driver(str(TEMP_DIR))
    all_downloaded = []
    
    try:
        # Loop through each year
        for year in YEARS:
            print(f"\n{'='*50}")
            print(f"Processing {year}...")
            print('='*50)
            
            # Build URL and find Excel links
            page_url = f"{BASE_URL}-{year}-general"
            excel_links = get_excel_links(driver, page_url)
            
            if not excel_links:
                print(f"  No files found for {year}")
                continue
            
            # Download each file
            for i, (filename, file_url) in enumerate(excel_links, 1):
                print(f"  [{i}/{len(excel_links)}] Downloading: {filename}")
                
                # Clear temp folder
                clear_folder(TEMP_DIR)
                
                # Navigate to file URL (triggers download)
                driver.get(file_url)
                
                # Wait for download
                downloaded = wait_for_download(TEMP_DIR)
                
                if downloaded:
                    # Rename with year suffix: Adair.xls -> Adair_2018.xls
                    new_name = f"{Path(filename).stem}_{year}{Path(filename).suffix}"
                    final_path = OUTPUT_DIR / new_name
                    downloaded.replace(final_path)  # replace() overwrites an old copy on re-runs
                    all_downloaded.append(final_path)
                    print(f"       Saved as: {new_name}")
                else:
                    print(f"       Warning: Download may have failed")
                
                time.sleep(1)  # Pause between downloads
    
    finally:
        print("\nClosing browser...")
        driver.quit()
        
        # Clean up temp folder
        try:
            clear_folder(TEMP_DIR)
            TEMP_DIR.rmdir()
        except OSError:
            pass
    
    # Summary
    print(f"\n{'='*50}")
    print(f"DONE! Downloaded {len(all_downloaded)} files")
    print(f"Location: {OUTPUT_DIR.absolute()}")
    print('='*50)

# Run it!
main()

---

## Alternative: The "Just Loop Through a List" Approach

The code above discovers links by parsing HTML. But there's a simpler way to think about it:

**You already know what you want to download.** You can see the 99 county names on the webpage. Why not just make a list and loop through it?

This is closer to how you'd think about it in Stata:
```stata
foreach county in Adair Adams Allamakee ... {
    * download file for `county'
}
```

### The Simpler Approach:

1. Make a list of county names
2. For each county, build the download URL
3. Download it
4. Repeat for each election year

**Note:** You might discover that one year has a different list of counties. That's a real data quality issue to investigate!
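
Step 2 ("build the download URL") can be sketched as a small function. The URL pattern and the per-year file extensions are the ones noted in this notebook's download cell; whether every county follows them is an assumption, which is exactly what the failed downloads will reveal. The helper names `county_slug` and `build_file_url` are hypothetical.

```python
# File extension per year, as noted for this site: 2016 and 2020 use .xlsx, 2018 uses .xls
EXTENSION_BY_YEAR = {"2016": ".xlsx", "2018": ".xls", "2020": ".xlsx"}
FILES_BASE = "https://sos.iowa.gov/elections/pdf/precinctresults"


def county_slug(county: str) -> str:
    """Lowercase, no spaces, no apostrophes: "O'Brien" becomes "obrien"."""
    return county.lower().replace(" ", "").replace("'", "")


def build_file_url(county: str, year: str) -> str:
    """Assemble the assumed download URL for one county and one year."""
    return f"{FILES_BASE}/{year}general/{county_slug(county)}{EXTENSION_BY_YEAR[year]}"


for county in ["Adair", "Black Hawk", "O'Brien"]:
    print(build_file_url(county, "2018"))
```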

---

In [None]:
# ============================================================
# SIMPLER APPROACH: Just loop through a list of counties
# ============================================================

# Iowa's 99 counties (in alphabetical order)
IOWA_COUNTIES = [
    "Adair", "Adams", "Allamakee", "Appanoose", "Audubon",
    "Benton", "Black Hawk", "Boone", "Bremer", "Buchanan",
    "Buena Vista", "Butler", "Calhoun", "Carroll", "Cass",
    "Cedar", "Cerro Gordo", "Cherokee", "Chickasaw", "Clarke",
    "Clay", "Clayton", "Clinton", "Crawford", "Dallas",
    "Davis", "Decatur", "Delaware", "Des Moines", "Dickinson",
    "Dubuque", "Emmet", "Fayette", "Floyd", "Franklin",
    "Fremont", "Greene", "Grundy", "Guthrie", "Hamilton",
    "Hancock", "Hardin", "Harrison", "Henry", "Howard",
    "Humboldt", "Ida", "Iowa", "Jackson", "Jasper",
    "Jefferson", "Johnson", "Jones", "Keokuk", "Kossuth",
    "Lee", "Linn", "Louisa", "Lucas", "Lyon",
    "Madison", "Mahaska", "Marion", "Marshall", "Mills",
    "Mitchell", "Monona", "Monroe", "Montgomery", "Muscatine",
    "O'Brien", "Osceola", "Page", "Palo Alto", "Plymouth",
    "Pocahontas", "Polk", "Pottawattamie", "Poweshiek", "Ringgold",
    "Sac", "Scott", "Shelby", "Sioux", "Story",
    "Tama", "Taylor", "Union", "Van Buren", "Wapello",
    "Warren", "Washington", "Wayne", "Webster", "Winnebago",
    "Winneshiek", "Woodbury", "Worth", "Wright"
]

print(f"Iowa has {len(IOWA_COUNTIES)} counties")
print(f"First few: {IOWA_COUNTIES[:5]}")
print(f"Last few: {IOWA_COUNTIES[-5:]}")

In [None]:
# ============================================================
# DOWNLOAD FILES - The Simple Way
# ============================================================
# One loop per year. Easy to read, easy to understand.
#
# URL pattern:
# https://sos.iowa.gov/elections/pdf/precinctresults/2018general/audubon.xls
#                                                      ^^^^        ^^^^^^^
#                                                      year        county (lowercase)
#
# FILE EXTENSIONS:
#   2016 = .xlsx
#   2018 = .xls
#   2020 = .xlsx

import os
import time
from pathlib import Path
from selenium import webdriver

# --- Setup ---
OUTPUT_DIR = Path("./iowa_election_results")
TEMP_DIR = OUTPUT_DIR / "_temp"
OUTPUT_DIR.mkdir(exist_ok=True)
TEMP_DIR.mkdir(exist_ok=True)

# Start the browser (uses BROWSER variable from earlier)
browser = BROWSER.lower()
if browser == "chrome":
    from selenium.webdriver.chrome.options import Options
    options = Options()
    options.add_experimental_option("prefs", {"download.default_directory": str(TEMP_DIR.absolute())})
    driver = webdriver.Chrome(options=options)
elif browser == "edge":
    from selenium.webdriver.edge.options import Options
    options = Options()
    options.add_experimental_option("prefs", {"download.default_directory": str(TEMP_DIR.absolute())})
    driver = webdriver.Edge(options=options)
elif browser == "firefox":
    from selenium.webdriver.firefox.options import Options
    options = Options()
    options.set_preference("browser.download.dir", str(TEMP_DIR.absolute()))
    options.set_preference("browser.download.folderList", 2)
    driver = webdriver.Firefox(options=options)
else:
    raise ValueError(f"Unknown browser: {browser}. Use 'chrome', 'edge', or 'firefox'")

print(f"Started {browser} browser")
print(f"Files will be saved to: {OUTPUT_DIR.absolute()}\n")

# --- Helper functions (the "plumbing") ---

def wait_for_download(timeout=60):
    """Wait until download completes."""
    start = time.time()
    while time.time() - start < timeout:
        files = list(TEMP_DIR.iterdir())
        partial = [f for f in files if f.suffix in ('.crdownload', '.tmp', '.part')]
        done = [f for f in files if f.suffix.lower() in ('.xls', '.xlsx')]
        if done and not partial:
            return done[0]
        time.sleep(0.5)
    return None

def clear_temp():
    """Empty the temp folder."""
    for f in TEMP_DIR.iterdir():
        try:
            f.unlink()
        except:
            pass

# ============================================================
# YEAR 2016 (.xlsx files)
# ============================================================
print("="*50)
print("Downloading 2016 files...")
print("="*50)

for county in IOWA_COUNTIES:
    # County name: lowercase, no spaces, no apostrophes
    # e.g., "O'Brien" -> "obrien", "Black Hawk" -> "blackhawk"
    county_lower = county.lower().replace(" ", "").replace("'", "")
    
    # 2016 uses .xlsx format
    url = f"https://sos.iowa.gov/elections/pdf/precinctresults/2016general/{county_lower}.xlsx"
    
    print(f"  {county}...", end=" ")
    
    clear_temp()
    driver.get(url)
    downloaded = wait_for_download()
    
    if downloaded:
        new_name = f"{county}_2016.xlsx"
        downloaded.replace(OUTPUT_DIR / new_name)  # replace() overwrites an old copy on re-runs
        print("✓")
    else:
        print("FAILED - check if file exists on website")
    
    time.sleep(0.5)

# ============================================================
# YEAR 2018 (.xls files)
# ============================================================
print("\n" + "="*50)
print("Downloading 2018 files...")
print("="*50)

for county in IOWA_COUNTIES:
    # County name: lowercase, no spaces, no apostrophes
    county_lower = county.lower().replace(" ", "").replace("'", "")
    
    # 2018 uses .xls format (older Excel format)
    url = f"https://sos.iowa.gov/elections/pdf/precinctresults/2018general/{county_lower}.xls"
    
    print(f"  {county}...", end=" ")
    
    clear_temp()
    driver.get(url)
    downloaded = wait_for_download()
    
    if downloaded:
        new_name = f"{county}_2018.xls"
        downloaded.replace(OUTPUT_DIR / new_name)  # replace() overwrites an old copy on re-runs
        print("✓")
    else:
        print("FAILED - check if file exists on website")
    
    time.sleep(0.5)

# ============================================================
# YEAR 2020 (.xlsx files)
# ============================================================
print("\n" + "="*50)
print("Downloading 2020 files...")
print("="*50)

for county in IOWA_COUNTIES:
    county_lower = county.lower().replace(" ", "").replace("'", "")
    
    # 2020 uses .xlsx format
    url = f"https://sos.iowa.gov/elections/pdf/precinctresults/2020general/{county_lower}.xlsx"
    
    print(f"  {county}...", end=" ")
    
    clear_temp()
    driver.get(url)
    downloaded = wait_for_download()
    
    if downloaded:
        new_name = f"{county}_2020.xlsx"
        downloaded.replace(OUTPUT_DIR / new_name)  # replace() overwrites an old copy on re-runs
        print("✓")
    else:
        print("FAILED - check if file exists on website")
    
    time.sleep(0.5)

# --- Cleanup ---
driver.quit()
print("\n" + "="*50)
print("DONE!")
print(f"Files saved to: {OUTPUT_DIR.absolute()}")
print("="*50)

### What's Different About This Version?

| Original Approach | Simpler Approach |
|-------------------|------------------|
| Parse HTML to find links | Already know the county names |
| One function handles all years | Separate loop for each year (easier to read) |
| Abstract, flexible | Concrete, obvious |
| Adapts if website changes | Breaks if URL pattern changes |

**The simpler version is easier to understand** because it matches how you think: *"For each county in my list, download the file."*

**The original version is more robust** because it discovers links automatically, which is useful if you don't know the URL pattern in advance.

### Did Any Downloads Fail?

If a county shows "FAILED", investigate:
- Does that county exist for that year?
- Is the URL pattern different?
- Is it a PDF instead of Excel?

**This is real data work**: discovering inconsistencies and handling them.

---

## Summary: What We Learned

### The Process

1. We tried to download files using Python's `requests` library
2. The website blocked us (403 Forbidden) because it detected we were a script
3. We used **Selenium** to control a real browser, which the website accepts

### Key Concepts (Transferable to R/Stata)

| Concept | What it means |
|---------|---------------|
| **Loop** | Repeat an action for each item in a list |
| **String building** | Create text by combining pieces (URLs, filenames) |
| **Conditional logic** | Only do something if a condition is true |
| **HTTP status codes** | 200=OK, 403=Forbidden, 404=Not Found |
| **HTML parsing** | Reading the structure of a webpage to find links |

### Python-Specific Tools

| Tool | Purpose | R Equivalent |
|------|---------|--------------|
| `requests` | Download web pages | `httr::GET()` |
| `BeautifulSoup` | Parse HTML | `rvest::read_html()` |
| `selenium` | Control a browser | `RSelenium` |
| `pathlib.Path` | Work with file paths | `fs` package |

---

## Next Steps

Now that we have the Excel files downloaded, **Stage 2** will:
- Read each Excel file
- Extract the data we need
- Combine everything into one clean dataset

---

## 🔍 Challenge: Did We Get Everything?

After running the download, check your results:

1. **How many files did you download?** (Should be ~99 per year × 3 years)
2. **Are all 99 Iowa counties represented for each year?**

If something is missing, investigate:
- Go to the original webpage in your browser
- Look at what files are actually listed
- Is everything an Excel file? Or is there something different?

**Real-world data is messy.** Sometimes files are in unexpected formats, links are broken, or naming conventions change. Part of data analysis is discovering these issues and deciding how to handle them.
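
One way to start on this challenge is to let Python list which counties have no file for each year. The sketch below assumes files are named `<county>_<year>.xls` or `.xlsx`, as in the download code, and compares names after lowercasing and removing spaces and apostrophes. The `missing_counties` helper is hypothetical, and the demo runs on a throwaway folder with made-up files rather than your real downloads.

```python
import tempfile
from pathlib import Path


def normalize(name: str) -> str:
    """Lowercase and drop spaces/apostrophes so 'Black Hawk' matches 'blackhawk'."""
    return name.lower().replace(" ", "").replace("'", "")


def missing_counties(folder: Path, counties, years):
    """Return {year: [county names with no matching Excel file in folder]}."""
    missing = {}
    for year in years:
        suffix = f"_{year}"
        found = set()
        for f in folder.iterdir():
            if f.suffix.lower() in (".xls", ".xlsx") and f.stem.endswith(suffix):
                # Strip the "_<year>" suffix to recover the county part of the name
                found.add(normalize(f.stem[: -len(suffix)]))
        missing[year] = [c for c in counties if normalize(c) not in found]
    return missing


# Demo on a throwaway folder with made-up files
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    for name in ["Adair_2018.xls", "blackhawk_2018.xls", "Adair_2020.xlsx"]:
        (demo / name).touch()
    report = missing_counties(demo, ["Adair", "Black Hawk", "O'Brien"], ["2018", "2020"])

print(report)
```

On your real data, the call would look like `missing_counties(Path("./iowa_election_results"), IOWA_COUNTIES, ["2016", "2018", "2020"])`, using the county list defined earlier in this notebook.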

---
*Questions? Ask your instructor!*