**Created by:** Revekka Gershovich **When:** Dec 4, 2024 **Why:** To clean and aggregate election returns data for the years 1824 to 1968 from ICPSR 1, United States Historical Election Returns

In [None]:
import os
import os.path as path
import pandas as pd
import numpy as np

In [None]:
parent_dir = os.path.abspath("/Users/revekkagershovich/Dropbox (MIT)/StateLaws")
# Check the project root before changing into it; os.chdir would raise first otherwise
assert os.path.exists(parent_dir), "parent_dir does not exist"
os.chdir(parent_dir)
intermed_data_dir = "./2_data/2_intermediate/political_data"
assert os.path.exists(intermed_data_dir), "Data directory does not exist"
raw_data_dir = "./2_data/1_raw/political_data"
assert os.path.exists(raw_data_dir), "Data directory does not exist"

In [None]:
df = pd.read_csv(path.join(raw_data_dir, "./ICPSR_election_returns/DS0001/00001-0001-Data.csv"))

In [None]:
df2 = pd.read_csv(path.join(raw_data_dir, "./ICPSR_election_returns/DS0002/00001-0002-Data.csv"))

In [None]:
df2.columns

In [None]:
df.columns

# Deciphering variable names

**1.** This dataset is distributed in ASCII format with SAS and SPSS setup files. I extracted the whole dataset to CSV using asciiSetupReader, a niche R library written specifically for reading pre-2000s datasets that ship in this format. For the CSV variable names, I used the labels defined in the setup file. The extraction script is in our StateLaws Dropbox at 1_code/similarity_code/Political_similarity_code/ICSPR_00001_to_csv.R

**2.** The "Scope of Project" documentation for the study can be found here: https://www.icpsr.umich.edu/web/ICPSR/studies/1. According to it, "There is no actual codebook for this collection. Variable information is contained in the setup files." I am therefore writing a codebook for the naming conventions in my file, so that if I or anyone else ever needs to go back to the raw data, they will not have to spend hours figuring out what the variables in the raw data mean.

# Codebook for ICPSR 1, United States Historical Election Returns

## State and County Identifiers
| **Column Name**         | **Description**                                                                                     |
|-------------------------|-----------------------------------------------------------------------------------------------------|
| `ICPR_STATE_CODE`       | ICPSR standardized state code.                                                                      |
| `COUNTY_NAME`           | Standardized county name.                                                                           |
| `IDENTIFICATION_NUMBER` | Unique numeric identifier for each county, enabling consistent referencing.                         |

## Congressional District Numbers
| **Column Name**           | **Description**                                                                                   |
|---------------------------|---------------------------------------------------------------------------------------------------|
| `CONG_DIST_NUMBER_YYYY`   | Congressional district number for a specific year (e.g., `CONG_DIST_NUMBER_1825`). May indicate the number of districts for split counties. |

## Election Results

### General Format

`X###_##_TYPE_RACE_PARTYCODE_VOTE`

### Components
| **Component**     | **Description**                                                                                           |
|-------------------|---------------------------------------------------------------------------------------------------------|
| `X###`           | Election year with the leading `1` replaced by `X` (e.g., `X824` = 1824).                                 |
| `##`             | Election type/level: <br> **1** = Presidential, **2** = Gubernatorial, **3** = Congressional/House elections. |
| `TYPE`           | Type of election: <br> **G** = General, **M** = Midterm, **S** = Special.                                 |
| `RACE`           | Race type: <br> Examples: `PRES` = President, `GOV` = Governor.                                           |
| `PARTYCODE`      | Code representing the political party. See the attached party codes file for definitions (e.g., `0025` = National Republican). |
| `VOTE`           | Number of votes received by the candidate.                                                                |
| `TOTAL_VOTE`     | Used in place of `PARTYCODE_VOTE`: total votes cast for the specific race or election.                    |

### Examples
| **Column Name**               | **Description**                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------|
| `X824_1_G_PRES_0025_VOTE`     | Votes for the National Republican candidate in the 1824 presidential general election.      |
| `X825_2_G_GOV_0659_VOTE`      | Votes for a specific party in the 1825 gubernatorial general election.                      |
| `X827_3_M_H_AL_9001_VOTE`     | Votes in a midterm House election in district `9001` for Alabama in 1827.                   |
| `X836_2_G_GOV_TOTAL_VOTE`     | Total gubernatorial votes in the 1836 general election.                                     |
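
The naming scheme above can be checked mechanically. Below is a minimal sketch of a parser for these column names. `COLUMN_PATTERN` and `parse_column_name` are hypothetical helpers that are not part of the cleaning code, and the regular expression is inferred only from the examples in this codebook, so a name that follows a different layout simply returns `None`.

```python
import re

# Hypothetical pattern, inferred from the codebook examples only
COLUMN_PATTERN = re.compile(
    r"^X(?P<year>\d{3})"         # year with the leading "1" replaced by "X"
    r"_(?P<level>\d)"            # 1 = presidential, 2 = gubernatorial, 3 = House
    r"_(?P<type>[A-Z])"          # election type letter, e.g. G, M, S
    r"_(?P<race>[A-Z_]+?)"       # race label, e.g. PRES, GOV, H_AL
    r"_(?P<party>\d{4}|TOTAL)"   # four-digit party code, or TOTAL
    r"_VOTE(?:\.(?P<dup>\d+))?$"  # optional ".1", ".2" suffix
)


def parse_column_name(name):
    """Split a vote column name into its components, or return None."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = 1000 + int(parts["year"])  # X824 -> 1824
    parts["level"] = int(parts["level"])
    parts["dup"] = int(parts["dup"]) if parts["dup"] else 0
    return parts


example = parse_column_name("X824_1_G_PRES_0025_VOTE")
```

Running `parse_column_name` over every column and listing the names that come back as `None` is a quick way to find columns that do not fit the documented format.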

## Handling Duplicate or Corrected Entries
| **Column Name Example**       | **Description**                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------|
| `X825_2_G_GOV_0659_VOTE.1`    | A secondary entry for verification or correction of votes in the 1825 gubernatorial election.|
| `X831_3_M_H_AL_0100_VOTE.2`   | A duplicate or re-evaluated entry for midterm House votes in district `0100` for Alabama in 1831. |
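
One thing worth checking before relying on the interpretation above: `pd.read_csv` itself appends `.1`, `.2`, and so on when a header name repeats, so these suffixes may simply mark repeated names in the CSV header. The sketch below shows that behavior on a toy CSV (not real ICPSR data) and collects the suffixed copies under their base name; `suffix_pattern` and `copies` are illustrative names.

```python
import io
import re

import pandas as pd

# Toy CSV with a repeated header name (not real ICPSR data)
csv_text = "A_VOTE,A_VOTE,B_VOTE\n1,2,3\n"
toy = pd.read_csv(io.StringIO(csv_text))

# pandas keeps the first name and renames the repeat to "A_VOTE.1"
toy_columns = list(toy.columns)

# Map each base name to the suffixed copies that follow it
suffix_pattern = re.compile(r"^(?P<base>.+)\.(?P<n>\d+)$")
copies = {}
for col in toy_columns:
    match = suffix_pattern.match(col)
    if match:
        copies.setdefault(match.group("base"), []).append(col)
```

Running the same loop over `df.columns` would list which raw names repeat. Comparing the values in a base column with its copies should then show whether they are corrections of the same quantity or separate entries.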

## Important Notes
- **Corrections:** Some entries, such as Jackson County in Georgia (`ID: 1510`), should be corrected to `1570` when analyzing by county.
- **Missing Values:** Counties that did not report data or did not participate in an election are marked with placeholder codes such as `98`.
- **Party Codes:** For the meaning of party codes such as `0025` and `0659`, refer to the party codes section of the documentation in `/Users/revekkagershovich/Dropbox (MIT)/StateLaws/2_data/1_raw/political_data/ICPSR_election_returns/DS0204/00001-0204-Documentation.txt`.

In [None]:
df.head(20)

In [None]:
# Print all column names as a list
print(df.columns)

In [None]:
# Step 1: Identify columns and group them by their suffix
# Add all variables starting with "CONG" to id_vars
id_vars = ['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER'] + [col for col in df.columns if col.startswith('CONG')]
grouped_columns = {}

# Group columns by their suffix (everything after the X### year prefix, i.e. after the first underscore)
for col in df.columns:
    if col.startswith('X'):
        suffix = '_'.join(col.split('_')[1:])  # Extract the suffix after the first underscore
        if suffix not in grouped_columns:
            grouped_columns[suffix] = []
        grouped_columns[suffix].append(col)

In [None]:
print(grouped_columns.keys())

In [None]:
# Step 2: Reshape each group and combine into a single table
reshaped_dataframes = []

for suffix, cols in grouped_columns.items():
    # Reshape the group into long format
    temp_df = pd.melt(df, id_vars=id_vars, value_vars=cols,
                      var_name='year', value_name=suffix)
    # Extract the three-digit year and add 1000 to restore the full year (e.g., X824 -> 1824)
    temp_df['year'] = temp_df['year'].str.extract(r'X(\d+)', expand=False).astype(int) + 1000
    reshaped_dataframes.append(temp_df)

In [None]:
# Step 3: Merge all reshaped groups into a single DataFrame
final_df = reshaped_dataframes[0]
for additional_df in reshaped_dataframes[1:]:
    final_df = final_df.merge(additional_df, on=id_vars + ['year'], how='outer')

final_df = final_df[['year'] + [col for col in final_df.columns if col != 'year']]
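
The repeated outer merges in Steps 2 and 3 can get slow when there are many suffixes. A possible alternative, sketched here on toy data, is to melt all `X###_...` columns once, split each name into year and suffix, and pivot the suffixes back into columns. The toy frame and the names `long`, `parts`, and `wide_by_year` are illustrative, and this is not the method used in this notebook.

```python
import pandas as pd

# Toy wide table in the same naming style (values are made up)
toy = pd.DataFrame({
    "IDENTIFICATION_NUMBER": [10, 20],
    "X824_1_G_PRES_0025_VOTE": [100, 200],
    "X828_1_G_PRES_0025_VOTE": [110, 210],
    "X824_1_G_PRES_TOTAL_VOTE": [150, 260],
})

# One melt over every year-prefixed column
long = toy.melt(id_vars=["IDENTIFICATION_NUMBER"], var_name="column", value_name="votes")

# Split "X824_1_G_PRES_0025_VOTE" into year 1824 and suffix "1_G_PRES_0025_VOTE"
parts = long["column"].str.extract(r"^X(?P<year>\d{3})_(?P<suffix>.+)$")
long["year"] = parts["year"].astype(int) + 1000
long["suffix"] = parts["suffix"]

# One row per (county, year), one column per suffix.
# pivot raises ValueError if an (id, year, suffix) combination repeats,
# so duplicated county IDs need to be resolved first.
wide_by_year = long.pivot(
    index=["IDENTIFICATION_NUMBER", "year"], columns="suffix", values="votes"
).reset_index()
wide_by_year.columns.name = None
```

Combinations that do not exist in the wide table, such as the 1828 total in this toy example, come out as `NaN`, which matches what the outer merges produce.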

In [None]:
final_df.columns

In [None]:
# Select columns that contain "TOTAL" in the name
total_columns = [col for col in final_df.columns if "TOTAL" in col]

# Print only these columns
print(final_df[total_columns])

In [None]:
# Identify duplicate pairs of IDENTIFICATION_NUMBER and year
duplicates = final_df[final_df.duplicated(subset=['IDENTIFICATION_NUMBER', 'year'], keep=False)]

# Display the duplicate pairs
print("Duplicate Pairs of IDENTIFICATION_NUMBER and Year:")
print(duplicates)

In [None]:
# Rename TOTAL_VOTE columns to match the party code format ('0000' instead of a party code)
renamed_columns = {col: col.replace("TOTAL_VOTE", "0000_VOTE") for col in final_df.columns if "TOTAL_VOTE" in col}
final_df = final_df.rename(columns=renamed_columns)

In [None]:
list(final_df.columns)

In [None]:
# Step 1: Identify columns and group them by their base name (e.g., `3_G_CONG_VOTE`)
id_vars = ['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER', 'year'] + [col for col in df.columns if col.startswith('CONG')]
grouped_columns = {}

# Group columns by base name: keep the first three underscore-separated parts
# (level, type, race) and drop the party code
for col in final_df.columns:
    if '_VOTE' in col:  # Ensure we're only processing relevant columns
        base_name = '_'.join(col.split('_')[:3] + ['VOTE'])  # e.g., 1_G_PRES_0025_VOTE -> 1_G_PRES_VOTE
        if base_name not in grouped_columns:
            grouped_columns[base_name] = []
        grouped_columns[base_name].append(col)
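
To see what this grouping does to individual names, here is a small standalone sketch that applies the same split, plus the four-digit party-code regex used in the next step, to a few column names. `split_party` is a hypothetical helper written only for illustration.

```python
import re


def split_party(col):
    """Return (base_name, party_code) for a year-stripped vote column."""
    # Same rule as the grouping loop: first three parts plus "VOTE"
    base_name = "_".join(col.split("_")[:3] + ["VOTE"])
    # Same regex as the party extraction: four digits between underscores
    match = re.search(r"_(\d{4})_", col)
    party = match.group(1) if match else None
    return base_name, party


results = {
    col: split_party(col)
    for col in [
        "1_G_PRES_0025_VOTE",
        "3_M_H_AL_9001_VOTE",
        "2_G_GOV_0000_VOTE",
        "2_G_GOV_0659_VOTE.1",
    ]
}
```

Two consequences are visible here. Extra parts of the race label, such as the `AL` in `3_M_H_AL_9001_VOTE`, are dropped from the base name. A `.1` copy maps to the same base name and party code as its original, so both end up as separate rows for the same county, year, and party after the melt.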

In [None]:
print(grouped_columns.keys())

In [None]:
# Step 2: Reshape each group and combine into a single table
reshaped_dataframes = []

for base_name, cols in grouped_columns.items():
    # Reshape the group into long format
    temp_df = pd.melt(final_df, id_vars=id_vars, value_vars=cols,
                      var_name='party', value_name=base_name)

    # TOTAL_VOTE columns were renamed to '0000_VOTE' earlier, so every vote column
    # now carries a four-digit code; extract it as the party code
    temp_df['party'] = temp_df['party'].str.extract(r'_(\d{4})_', expand=False)

    reshaped_dataframes.append(temp_df)

In [None]:
# Step 3: Merge all reshaped groups into a single DataFrame
final_df_long = reshaped_dataframes[0]
for additional_df in reshaped_dataframes[1:]:
    final_df_long = final_df_long.merge(additional_df, on=id_vars + ['party'], how='outer')

final_df_long = final_df_long[['party'] + [col for col in final_df_long.columns if col != 'party']]


In [None]:
final_df_long['party'].unique()

In [None]:
final_df_long['year'].unique()

In [None]:
final_df_long.pivot_table(index='COUNTY_NAME', columns='year', aggfunc='size', fill_value=0)

In [None]:
final_df_long.columns

In [None]:
print(final_df_long.head())

In [None]:
# Step 1: Identify all "CONG_DIST_NUMBER_####" columns
cong_columns = [col for col in final_df_long.columns if col.startswith('CONG_DIST_NUMBER_')]

# Step 2: Preserve all important vote-related columns
vote_columns = ['1_G_PRES_VOTE', '2_G_GOV_VOTE', '3_M_H_VOTE',
                '3_W_H_VOTE', '3_S_H_VOTE', '3_G_CONG_VOTE']

# Step 3: Reshape CONG_DIST_NUMBER columns while keeping vote columns intact
cong_df = final_df_long.melt(
    id_vars=['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER', 'year'] + vote_columns,
    value_vars=cong_columns, 
    var_name='cong_column', 
    value_name='district_number'
)

# Step 4: Extract the four-digit year from the column name (e.g., CONG_DIST_NUMBER_1825 -> 1825)
cong_df['cong_year'] = cong_df['cong_column'].str.extract(r'_(\d{4})', expand=False).astype(int)

# Step 5: Drop the helper column that held the original column names
cong_df = cong_df.drop(columns=['cong_column'])
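
Note that this melt repeats every vote row once per `CONG_DIST_NUMBER_YYYY` column. If the goal is to keep only the district number recorded for the election year itself (an assumption, not something the data documentation states), one option is to filter on `cong_year == year`. The sketch below shows the idea on a toy frame; `toy_long`, `toy_cong`, and `matched` are illustrative names, and `county_id` and `votes` stand in for the real identifier and vote columns.

```python
import pandas as pd

# Toy long table: one county, two election years, two district columns
toy_long = pd.DataFrame({
    "county_id": [10, 10],
    "year": [1824, 1828],
    "votes": [100, 110],
    "CONG_DIST_NUMBER_1824": [1, 1],
    "CONG_DIST_NUMBER_1828": [2, 2],
})

toy_cong = toy_long.melt(
    id_vars=["county_id", "year", "votes"],
    var_name="cong_column",
    value_name="district_number",
)
toy_cong["cong_year"] = (
    toy_cong["cong_column"].str.extract(r"_(\d{4})$", expand=False).astype(int)
)

# Every vote row now appears once per district column
rows_after_melt = len(toy_cong)

# Assumption: keep only the district number from the election year itself
matched = toy_cong[toy_cong["cong_year"] == toy_cong["year"]].drop(
    columns=["cong_column", "cong_year"]
)
```

If district numbers are only recorded for some years, an exact match would drop the other election years, so a nearest-earlier-year rule might be needed instead.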

In [None]:
cong_df.columns

In [None]:
# Filter dataset where '3_W_H_VOTE' is NOT missing
filtered_df_W = final_df_long[['COUNTY_NAME', 'year', '3_W_H_VOTE']].dropna()

print(filtered_df_W.shape)

print(filtered_df_W['year'].unique())

In [None]:
# Filter dataset where '3_S_H_VOTE' is NOT missing
filtered_df_S = final_df_long[['COUNTY_NAME', 'year', '3_S_H_VOTE']].dropna()

print(filtered_df_S.shape)

print(filtered_df_S['year'].unique())

In [None]:
# Filter dataset where '3_M_H_VOTE' is NOT missing
filtered_df_M = final_df_long[['COUNTY_NAME', 'year', '3_M_H_VOTE']].dropna()

print(filtered_df_M.shape)

print(filtered_df_M['year'].unique())

In [None]:
# Filter dataset where '3_G_CONG_VOTE' is NOT missing
filtered_df_CONG = final_df_long[['COUNTY_NAME', 'year', '3_G_CONG_VOTE']].dropna()

print(filtered_df_CONG.shape)

print(filtered_df_CONG['year'].unique())

In [None]:
cong_df.rename(columns={
    'ICPR_STATE_CODE': 'ICPSR_state_code', 'COUNTY_NAME': 'county_name',
    'IDENTIFICATION_NUMBER': 'county_id',
    '1_G_PRES_VOTE': 'general_presidential_vote', '2_G_GOV_VOTE': 'general_gubernatorial_vote',
    '3_M_H_VOTE': 'midterm_house_vote', '3_W_H_VOTE': 'w_house_vote',
    '3_S_H_VOTE': 'special_house_vote', '3_G_CONG_VOTE': 'general_congress_vote',
}, inplace=True)

cong_df['county_name'] = cong_df['county_name'].str.title()

In [None]:
cong_df.sample(10) 

In [None]:
# Create ICPSR to FIPS and ICPSR to State Name mappings
icpsr_to_fips = {
    1: 9,  2: 23, 3: 25, 4: 33, 5: 44, 6: 50, 11: 10, 12: 34, 13: 36, 14: 42, 21: 17,
    22: 18, 23: 26, 24: 39, 31: 19, 32: 20, 33: 27, 34: 29, 35: 31, 36: 38, 37: 46,
    40: 51, 41: 1, 42: 5, 43: 12, 44: 13, 45: 22, 46: 28, 47: 37, 48: 45, 49: 48,
    51: 21, 52: 24, 53: 40, 54: 47, 56: 54, 72: 41, 73: 53, 97: 97, 98: 11
}

icpsr_to_state = {
    1: "Connecticut", 2: "Maine", 3: "Massachusetts", 4: "New Hampshire", 5: "Rhode Island", 6: "Vermont",
    11: "Delaware", 12: "New Jersey", 13: "New York", 14: "Pennsylvania", 21: "Illinois", 22: "Indiana",
    23: "Michigan", 24: "Ohio", 31: "Iowa", 32: "Kansas", 33: "Minnesota", 34: "Missouri", 35: "Nebraska",
    36: "North Dakota", 37: "South Dakota", 40: "Virginia", 41: "Alabama", 42: "Arkansas", 43: "Florida",
    44: "Georgia", 45: "Louisiana", 46: "Mississippi", 47: "North Carolina", 48: "South Carolina",
    49: "Texas", 51: "Kentucky", 52: "Maryland", 53: "Oklahoma", 54: "Tennessee", 56: "West Virginia",
    72: "Oregon", 73: "Washington", 97: "Other", 98: "District of Columbia"