**Author:** Revekka Gershovch

**Date:** Jan 28, 2025

**Purpose:** To clean and aggregate election returns data for years 1824 to 1968 from ICPSR 1, United States Historical Election Returns

# Deciphering variable names

**1.** This dataset is distributed in ASCII format with SAS or SPSS setup files. I extracted all of it to CSV using asciiSetupReader, a niche R library written specifically for reading pre-2000s datasets distributed in this format. The variable names in the CSV files are the labels defined in the setup file. You can find the extraction script in our StateLaws Dropbox at 1_code/similarity_code/Political_similarity_code/ICSPR_00001_to_csv.R

**2.** The "Scope of Project" documentation for the study (https://www.icpsr.umich.edu/web/ICPSR/studies/1) states: "There is no actual codebook for this collection. Variable information is contained in the setup files." I am therefore writing a codebook for the naming conventions in my file, so that if I or anyone else ever needs to go back to the raw data, they do not have to spend hours figuring out what the variables in the raw data mean.

# Codebook for ICPSR 1, United States Historical Election Returns

## State and County Identifiers
| **Column Name**         | **Description**                                                                                     |
|-------------------------|-----------------------------------------------------------------------------------------------------|
| `ICPR_STATE_CODE`       | ICPSR standardized state code.                                                                      |
| `COUNTY_NAME`           | Standardized county name.                                                                           |
| `IDENTIFICATION_NUMBER` | Unique numeric identifier for each county, enabling consistent referencing.                         |

## Congressional District Numbers
| **Column Name**           | **Description**                                                                                   |
|---------------------------|---------------------------------------------------------------------------------------------------|
| `CONG_DIST_NUMBER_YYYY`   | Congressional district number for a specific year (e.g., `CONG_DIST_NUMBER_1825`). May indicate the number of districts for split counties. |

## Election Results

### General Format

X###_##_TYPE_RACE_PARTYCODE_VOTE

### Components
| **Component**     | **Description**                                                                                           |
|-------------------|---------------------------------------------------------------------------------------------------------|
| `X###`           | Election year (e.g., `X824` = 1824).                                                                      |
| `##`             | Election type/level: <br> **1** = Presidential, **2** = Gubernatorial, **3** = Congressional/House elections. |
| `TYPE`           | Type of election: <br> **G** = General, **M** = Midterm, **S** = Special.                                 |
| `RACE`           | Race type: <br> Examples: `PRES` = President, `GOV` = Governor.                                           |
| `PARTYCODE`      | Code representing the political party. See the attached party codes file for definitions (e.g., `0025` = National Republican). |
| `VOTE`           | Number of votes received by the candidate.                                                                |
| `TOTAL_VOTE`     | Total votes cast for the specific race or election.                                                       |

### Examples
| **Column Name**               | **Description**                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------|
| `X824_1_G_PRES_0025_VOTE`     | Votes for the National Republican candidate in the 1824 presidential general election.      |
| `X825_2_G_GOV_0659_VOTE`      | Votes for a specific party in the 1825 gubernatorial general election.                      |
| `X827_3_M_H_AL_9001_VOTE`     | Votes in a midterm House election in district `9001` for Alabama in 1827.                   |
| `X836_2_G_GOV_TOTAL_VOTE`     | Total gubernatorial votes in the 1836 general election.                                     |
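
The naming convention above can be checked mechanically. The sketch below is a hypothetical helper (`parse_column` and `COLUMN_PATTERN` are not part of the notebook), and its regular expression is inferred from the tables in this codebook rather than from any ICPSR documentation, so it may not cover every column in the raw files.

```python
import re

# Hypothetical pattern inferred from the codebook tables above.
COLUMN_PATTERN = re.compile(
    r"^X(?P<year>\d{3})"          # three-digit year, e.g. 824 for 1824
    r"_(?P<level>\d+)"            # election level, e.g. 1 = presidential
    r"_(?P<type>[A-Z])"           # election type, e.g. G = general
    r"_(?P<race>[A-Z0-9_]+?)"     # race label, e.g. PRES, GOV, H_AL
    r"_(?P<party>\d{4}|TOTAL)"    # four-digit code in the PARTYCODE slot, or TOTAL
    r"_VOTE(?:\.(?P<dup>\d+))?$"  # optional .1, .2, ... suffix
)


def parse_column(name):
    """Split a vote column name into its components, or return None if it does not match."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = 1000 + int(parts["year"])  # X824 -> 1824
    parts["level"] = int(parts["level"])
    parts["dup"] = int(parts["dup"]) if parts["dup"] else 0
    return parts


for example in ["X824_1_G_PRES_0025_VOTE", "X836_2_G_GOV_TOTAL_VOTE", "COUNTY_NAME"]:
    print(example, "->", parse_column(example))
```

Columns for which the helper returns `None` are either identifiers or names that deviate from the general format and need a manual look.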

## Handling Duplicate or Corrected Entries
| **Column Name Example**       | **Description**                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------|
| `X825_2_G_GOV_0659_VOTE.1`    | Votes for a second candidate from party `0659` in the 1825 gubernatorial election. |
| `X831_3_M_H_AL_0100_VOTE.2`   | A duplicate or re-evaluated entry for midterm House votes in district `0100` for Alabama in 1831. |
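
One way to handle the suffixed columns is to add them to their base column. The sketch below is a minimal illustration on made-up numbers; `collapse_duplicate_columns` is a hypothetical helper, and summing is only appropriate if the suffixes mark additional candidates rather than corrected re-entries of the same count.

```python
import re

import pandas as pd


def collapse_duplicate_columns(df):
    """Sum columns that differ only by a trailing .1, .2, ... suffix."""
    groups = {}
    for col in df.columns:
        base = re.sub(r"\.\d+$", "", col)  # strip the suffix, if any
        groups.setdefault(base, []).append(col)

    out = pd.DataFrame(index=df.index)
    for base, cols in groups.items():
        if len(cols) == 1:
            out[base] = df[cols[0]]
        else:
            # min_count=1 keeps NaN when every duplicate is missing
            out[base] = df[cols].sum(axis=1, min_count=1)
    return out


# Toy data: the vote counts are invented for illustration.
toy = pd.DataFrame({
    "COUNTY_NAME": ["A", "B"],
    "X825_2_G_GOV_0659_VOTE": [10, None],
    "X825_2_G_GOV_0659_VOTE.1": [5, None],
})
print(collapse_duplicate_columns(toy))
```

Unlike a plain `sum(axis=1)`, which turns a row of all-missing duplicates into 0, `min_count=1` leaves it missing, which keeps "no data" distinguishable from "zero votes".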

## Important Notes
- **Corrections:** Some entries, such as Jackson County in Georgia (`ID: 1510`), should be corrected to `1570` when analyzing by county.
- **Missing Values:** For counties not reporting data or not participating in elections, identifiers like `98` (placeholders) are used.
- **Party Codes:** Refer to the party codes section of the documentation contained in /Users/revekkagershovich/Dropbox (MIT)/StateLaws/2_data/1_raw/political_data/ICPSR_election_returns/DS0204/00001-0204-Documentation.txt for the specific meaning of codes like `0025`, `0659`, etc. which represent political parties.
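
The county ID correction noted above could be applied as follows. This is a sketch under assumptions: `fix_jackson_county_id` is a hypothetical helper, the IDs are assumed to be stored as integers, and the filter uses ICPSR state code 44 for Georgia (the code that the processing loop maps to FIPS 13).

```python
import pandas as pd

GEORGIA_ICPSR = 44  # ICPSR state code for Georgia


def fix_jackson_county_id(df):
    """Return a copy in which Jackson County, Georgia, has ID 1570 instead of 1510."""
    out = df.copy()
    mask = (out["ICPR_STATE_CODE"] == GEORGIA_ICPSR) & (out["IDENTIFICATION_NUMBER"] == 1510)
    out.loc[mask, "IDENTIFICATION_NUMBER"] = 1570
    return out


# Toy rows for illustration only; the second row stands in for another state.
toy_ids = pd.DataFrame({
    "ICPR_STATE_CODE": [44, 41],
    "COUNTY_NAME": ["JACKSON", "JACKSON"],
    "IDENTIFICATION_NUMBER": [1510, 1510],
})
print(fix_jackson_county_id(toy_ids))
```

Filtering on the state code as well as the ID keeps the change from touching any county in another state that happens to share the number.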

In [None]:
import os
import os.path as path
import pandas as pd
import numpy as np
import re
from tqdm import tqdm

In [None]:
parent_dir = os.path.abspath("/Users/revekkagershovich/Dropbox (MIT)/StateLaws")
# Check that the project root exists before changing into it
assert os.path.exists(parent_dir), "parent_dir does not exist"
os.chdir(parent_dir)
intermed_data_dir = "./2_data/2_intermediate/political_data"
assert os.path.exists(intermed_data_dir), "Intermediate data directory does not exist"
raw_data_dir = "./2_data/1_raw/political_data"
assert os.path.exists(raw_data_dir), "Raw data directory does not exist"

In [None]:
# Generate a list of numbers from 0001 to 0203 as zero-padded strings
numbers = [f"{num:04d}" for num in range(1, 204)]

# Define an empty list to store processed DataFrames
dfs = []

# Track dataset numbers whose CSV file was not found
missing_files = []

In [None]:
for number in tqdm(numbers):
    file_path = os.path.join(raw_data_dir, f"ICPSR_election_returns/DS{number}/00001-{number}-Data.csv")
    
    if number == '0091':
        continue 

    if not os.path.exists(file_path):
        missing_files.append(number)
        continue

    df = pd.read_csv(file_path)

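    # Zero-padded strings compare in numeric order, so this selects datasets 0170 onward,
    # where the column named ICPR_COUNTY_CODE is treated as the state code
    if number >= '0170':
        df.rename(columns={'ICPR_COUNTY_CODE': 'ICPR_STATE_CODE'}, inplace=True)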
    
    # Delete all congressional district columns
    df.drop(columns=[col for col in df.columns if col.startswith('CONG')], inplace=True)


    # MELT 1

    # Step 1: Identify columns and group them by their suffix
    # (the CONG* columns were dropped above, so only the three identifiers remain)
    id_vars = ['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER']
    grouped_columns = {}

    # Group columns by their suffix (everything after the first underscore and without the year part)
    for col in df.columns:
        if col.startswith('X'):
            suffix = '_'.join(col.split('_')[1:])  # Extract the suffix after the first underscore
            if suffix not in grouped_columns:
                grouped_columns[suffix] = []
            grouped_columns[suffix].append(col)

    # Step 2: Reshape each group and combine into a single table
    reshaped_dataframes = []

    for suffix, cols in grouped_columns.items():
        # Reshape the group into long format
        temp_df = pd.melt(df, id_vars=id_vars, value_vars=cols,
                        var_name='year', value_name=suffix)
        # Extract the three-digit year and convert it to a full year (e.g. X824 -> 824 -> 1824)
        temp_df['year'] = temp_df['year'].str.extract(r'X(\d+)', expand=False).astype(int) + 1000
        reshaped_dataframes.append(temp_df)

    # Step 3: Merge all reshaped groups into a single DataFrame
    df = reshaped_dataframes[0]
    for additional_df in reshaped_dataframes[1:]:
        df = df.merge(additional_df, on=id_vars + ['year'], how='outer')

    df = df[['year'] + [col for col in df.columns if col != 'year']]

    print(f"Melt 1 for df {number} complete")

    # Normalise the segment before VOTE so that every vote column carries a four-digit code:
    # TOTAL (and the TOTA variant) -> 0000, OTHER -> 3000, 0594L -> 0594
    replacements = [
        ("TOTAL_VOTE", "0000_VOTE"),
        ("TOTA_VOTE", "0000_VOTE"),
        ("OTHER_VOTE", "3000_VOTE"),
        ("0594L_VOTE", "0594_VOTE"),
    ]

    for old, new in replacements:
        renamed_columns = {col: col.replace(old, new) for col in df.columns if old in col}
        df = df.rename(columns=renamed_columns)

    # Handling multiple candidates per party
    # Step 1: Identify columns with suffixes .1, .2, .3, etc.
    suffix_pattern = re.compile(r"(.*)\.(\d+)$")  # Matches columns ending in .1, .2, etc.
    grouped_columns = {}

    for col in df.columns:
        match = suffix_pattern.match(col)
        if match:
            base_name = match.group(1)  # Extract base column name (without suffix)
            if base_name not in grouped_columns:
                grouped_columns[base_name] = []
            grouped_columns[base_name].append(col)

    # Step 2: Identify related columns and rename base columns with .0 postfix
    for base_name in grouped_columns.keys():
        if base_name in df.columns:  # If the original base column exists
            df.rename(columns={base_name: base_name + ".0"}, inplace=True)
            grouped_columns[base_name].append(base_name + ".0")  # Include renamed base column

    # Step 3: Create new summed columns
    # (sum skips NaN, so a row with no votes in any of the related columns becomes 0)
    for base_name, related_columns in grouped_columns.items():
        df[base_name] = df[related_columns].sum(axis=1)

    # Step 4: Drop all columns with suffixes .0, .1, .2, etc.
    columns_to_drop = [col for col in df.columns if re.search(r"\.\d+$", col)]
    df.drop(columns=columns_to_drop, inplace=True)

    # MELT 2
    # Step 1: Identify columns and group them by their base name (e.g., `2_G_GOV_VOTE`)
    id_vars = ['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER', 'year'] + [col for col in df.columns if col.startswith('CONG')]
    grouped_columns = {}

    # Group columns by removing the numeric segment before "_VOTE" (the second-to-last segment)
    for col in df.columns:
        if '_VOTE' in col:  # Ensure we're only processing relevant columns
            parts = col.split('_')
            base_name = '_'.join(parts[:-2] + ['VOTE']) if parts[-2].isdigit() else col  # Remove numeric part before "VOTE"
            
            if base_name not in grouped_columns:
                grouped_columns[base_name] = []
            grouped_columns[base_name].append(col)

    # Step 2: Reshape each group and combine into a single table
    reshaped_dataframes = []

    for base_name, cols in grouped_columns.items():
        # Reshape the group into long format
        temp_df = pd.melt(df, id_vars=id_vars, value_vars=cols,
                        var_name='party', value_name=base_name)


        # Extract the 4-digit party code from column names (NaN when a column has no such code)
        temp_df['party'] = temp_df['party'].str.extract(r'_(\d{4})_', expand=False)

        reshaped_dataframes.append(temp_df)
    
    # Step 3: Merge all reshaped groups into a single DataFrame
    df = reshaped_dataframes[0]
    for additional_df in reshaped_dataframes[1:]:
        df = df.merge(additional_df, on=id_vars + ['party'], how='outer')

    df = df[['party'] + [col for col in df.columns if col != 'party']]

    print(f"Melt 2 for df {number} complete")

    # Renaming variables

    df.rename(columns={'ICPR_STATE_CODE': 'ICPSR_state_code', 'COUNTY_NAME': 'county_name', 
                            'IDENTIFICATION_NUMBER': 'county_id'
    }, inplace=True)

    df['county_name'] = df['county_name'].str.title()

    # Map ICPSR state codes to state FIPS codes
    icpsr_to_fips = {
        1: 9,  2: 23, 3: 25, 4: 33, 5: 44, 6: 50, 11: 10, 12: 34, 13: 36, 14: 42, 21: 17,
        22: 18, 23: 26, 24: 39, 31: 19, 32: 20, 33: 27, 34: 29, 35: 31, 36: 38, 37: 46,
        40: 51, 41: 1, 42: 5, 43: 12, 44: 13, 45: 22, 46: 28, 47: 37, 48: 45, 49: 48,
        51: 21, 52: 24, 53: 40, 54: 47, 56: 54, 72: 41, 73: 53, 97: 97, 98: 11
    }

    # Add 'state_fips' column to df based on 'ICPSR_state_code'
    # (codes missing from the mapping become NaN and are dropped by the groupby below)
    df['state_fips'] = df['ICPSR_state_code'].map(icpsr_to_fips)

    # Dropping rows with all zero or NaN votes
    # Define the columns to keep (these should not be considered when checking for empty/zero rows)
    columns_to_exclude = ['party', 'ICPSR_state_code', 'county_name', 'county_id', 'year', 'state_fips']

    # Identify numeric columns that should be checked for being empty or zero
    columns_to_check = [col for col in df.columns if col not in columns_to_exclude]

    # Drop rows where all of the columns in `columns_to_check` are either NaN or 0
    df = df[~(df[columns_to_check].isna() | (df[columns_to_check] == 0)).all(axis=1)]

    # Aggregate from county to state level
    # Define columns to group by (state-level aggregation by party-year pair)
    groupby_columns = ['party', 'year', 'state_fips']

    # Define columns to exclude from summation
    columns_to_exclude = ['party', 'ICPSR_state_code', 'county_name', 'county_id', 'year', 'state_fips']

    # Define columns to sum (all columns except those in columns_to_exclude)
    columns_to_sum = [col for col in df.columns if col not in columns_to_exclude]

    # Perform aggregation by summing vote counts at the state level
    df = df.groupby(groupby_columns, as_index=False)[columns_to_sum].sum()

    # Save the processed DataFrame
    df.to_csv(os.path.join(raw_data_dir, f"ICPSR_election_returns/DS{number}/00001-{number}-processed.csv"), index=False)

    # Append processed DataFrame to the list
    dfs.append(df)
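
To see the layout that the two reshaping passes in the loop aim for, here is a toy illustration. It does not reproduce the loop: it swaps in a more compact route (one `melt`, then `str.extract` and `pivot_table`) that should give the same long format on well-formed column names. The frame `wide`, its county name, and all vote counts are made up.

```python
import pandas as pd

# Toy wide-format frame with invented values.
wide = pd.DataFrame({
    "ICPR_STATE_CODE": [41],
    "COUNTY_NAME": ["EXAMPLE"],
    "IDENTIFICATION_NUMBER": [10],
    "X824_1_G_PRES_0025_VOTE": [100],
    "X824_1_G_PRES_0100_VOTE": [50],
    "X828_1_G_PRES_0025_VOTE": [70],
    "X828_1_G_PRES_0100_VOTE": [90],
})

id_vars = ["ICPR_STATE_CODE", "COUNTY_NAME", "IDENTIFICATION_NUMBER"]

# One melt: every vote column becomes a (column name, votes) row.
long = wide.melt(id_vars=id_vars, var_name="column", value_name="votes")

# Split the column name into year, race, and four-digit party code.
parts = long["column"].str.extract(r"^X(?P<year>\d{3})_(?P<race>.+)_(?P<party>\d{4})_VOTE$")
long["year"] = parts["year"].astype(int) + 1000
long["party"] = parts["party"]
long["race"] = parts["race"] + "_VOTE"

# One row per county, year, and party; one column per race.
tidy = long.pivot_table(
    index=id_vars + ["year", "party"], columns="race", values="votes", aggfunc="sum"
).reset_index()
tidy.columns.name = None

print(tidy)
```

The result has four rows (two years times two parties) and a single vote column, `1_G_PRES_VOTE`, which matches the column naming that the loop produces after its second melt.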

In [None]:
# Concatenate all stored DataFrames into a single DataFrame
df = pd.concat(dfs, ignore_index=True)

# Display the aggregated dataset
print(df.head())

In [None]:
df.columns

In [None]:
cols = ['1_G_PRES_VOTE', '2_G_GOV_VOTE',
       '3_M_H_AL_VOTE', '3_W_H_AL_VOTE', '3_S_H_AL_VOTE', '3_G_CONG_VOTE',
       '3_S_CONG_VOTE', '3_G_H_AL_VOTE', '6_G_SEN_VOTE', '4_G_SEN_VOTE',
       '4_S_SEN_VOTE', '6_S_SEN_VOTE', '5_S_SEN_VOTE', '5_G_SEN_VOTE',
       '2_S_GOV_VOTE', '2_G_GV03_VOTE', '2_G_GV11_VOTE', '7_G_ATGN_VOTE',
       'GOV_VOTE', 'PRES_VOTE', 'SEN_6_VOTE', 'SEN_4_VOTE', '3_S_CG04_VOTE',
       '3_M_CONG_VOTE', '3_S_CGNOV_VOTE', '3_S_CGJAN_VOTE', '3_S_CG11_VOTE',
       '2_G_SEN_VOTE', '3_G_GOV_VOTE', '3_S_70CG_VOTE', '3_S_71CG_VOTE',
       '1_S_PRES_VOTE', '3_S_CG08_VOTE', '3_S_CG12_VOTE', '1_G_CONG_VOTE',
       '7_G_SEN_VOTE', '2_G_PRES_VOTE', 'G_PRES_VOTE', '3_G_HAL1_VOTE',
       '3_G_HAL2_VOTE', '7_ATGN_VOTE']

# Share of non-missing values for each vote column
for col in cols:
    print(f"Share of non-missing values for {col}:")
    print(df[col].count() / len(df))
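
The same shares can be computed without a hard-coded column list. The helper below is a hypothetical sketch (`non_missing_share` is not part of the notebook) that assumes every vote column ends in `_VOTE`; the toy frame is invented.

```python
import pandas as pd


def non_missing_share(df, suffix="_VOTE"):
    """Share of non-missing values per vote column, largest first."""
    vote_cols = [col for col in df.columns if col.endswith(suffix)]
    return df[vote_cols].notna().mean().sort_values(ascending=False)


# Toy frame for illustration only.
toy_votes = pd.DataFrame({
    "party": ["0025", "0100", "0025", "0100"],
    "1_G_PRES_VOTE": [1.0, None, 3.0, 4.0],
    "2_G_GOV_VOTE": [None, None, 5.0, None],
})
print(non_missing_share(toy_votes))
```

A caveat on interpretation: because the per-file `groupby(...).sum()` returns 0 rather than NaN for groups with no data, missing values in the concatenated frame probably arise only where a file lacks the column entirely. The share therefore likely measures how many rows come from files that contain the column, not how many counties reported votes.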

NEXT STEPS:

1. Find out how many datasets I did not manage to process, and why.

2. Work out what all of these different variables mean, and keep the governor-related ones.

3. Find governor election returns for the 1970s onward and merge them in.
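
For the first step, one possible source of lost rows is worth checking alongside `missing_files`: state codes that are absent from `icpsr_to_fips` map to NaN, and `groupby` drops NaN keys by default. The sketch below shows the effect on a toy frame. `unmapped_state_codes` is a hypothetical helper, `toy_mapping` is a two-entry excerpt of the notebook's mapping, and the vote counts are invented.

```python
import pandas as pd


def unmapped_state_codes(state_codes, mapping):
    """Return the sorted state codes that have no entry in the mapping."""
    codes = pd.Series(state_codes).dropna().unique()
    return sorted(int(code) for code in codes if code not in mapping)


toy_mapping = {41: 1, 44: 13}  # excerpt of icpsr_to_fips
toy_states = pd.DataFrame({
    "ICPSR_state_code": [41, 44, 25],  # 25 has no entry in the mapping
    "votes": [10, 20, 30],
})
toy_states["state_fips"] = toy_states["ICPSR_state_code"].map(toy_mapping)

print(unmapped_state_codes(toy_states["ICPSR_state_code"], toy_mapping))

# Default behaviour: the row with a NaN state_fips disappears from the result.
dropped = toy_states.groupby("state_fips", as_index=False)["votes"].sum()

# dropna=False keeps it, which makes the loss visible.
kept = toy_states.groupby("state_fips", as_index=False, dropna=False)["votes"].sum()

print(len(dropped), len(kept))
```

Running the helper on each raw file's state code column before the aggregation step would list the codes, if any, whose votes never reach the state-level output.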