**Created by:** Revekka Gershovich **When:** Dec 4, 2024 **Why:** To clean and aggregate election returns data for the years 1824 to 1968 from ICPSR 1, United States Historical Election Returns

In [24]:
import os
import os.path as path
import pandas as pd
import numpy as np

In [25]:
parent_dir = os.path.abspath("/Users/revekkagershovich/Dropbox (MIT)/StateLaws")
# Check that the directory exists before changing into it
assert os.path.exists(parent_dir), "parent_dir does not exist"
os.chdir(parent_dir)
intermed_data_dir = "./2_data/2_intermediate/political_data"
assert os.path.exists(intermed_data_dir), "Data directory does not exist"
raw_data_dir = "./2_data/1_raw/political_data"
assert os.path.exists(raw_data_dir), "Data directory does not exist"

In [26]:
df = pd.read_csv(path.join(raw_data_dir, "./ICPSR_election_returns/DS0001/00001-0001-Data.csv"))

In [27]:
df.columns

Index(['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER',
       'CONG_DIST_NUMBER_1825', 'CONG_DIST_NUMBER_1829',
       'CONG_DIST_NUMBER_1833', 'CONG_DIST_NUMBER_1835',
       'CONG_DIST_NUMBER_1837', 'CONG_DIST_NUMBER_1841',
       'CONG_DIST_NUMBER_1845',
       ...
       'X860_1_G_PRES_0604_VOTE', 'X860_1_G_PRES_9001_VOTE',
       'X860_1_G_PRES_TOTAL_VOTE', 'X860_2_G_GOV_0100_VOTE',
       'X860_2_G_GOV_0200_VOTE', 'X860_2_G_GOV_0605_VOTE',
       'X860_2_G_GOV_0728_VOTE', 'X860_2_G_GOV_1195_VOTE',
       'X860_2_G_GOV_9999_VOTE', 'X860_2_G_GOV_TOTAL_VOTE'],
      dtype='object', length=499)

# Deciphering variable names

**1.** This dataset is distributed in ASCII format with SAS or SPSS setup files. I extracted the whole dataset to CSV using asciiSetupReader, a niche R library written specifically for reading pre-2000s datasets distributed in this format. For the CSV variable names, I used the labels defined in the setup file. The extraction script is in our StateLaws Dropbox at 1_code/similarity_code/Political_similarity_code/ICSPR_00001_to_csv.R

**2.** The "Scope of Project" documentation for the study is available at https://www.icpsr.umich.edu/web/ICPSR/studies/1. According to it, "There is no actual codebook for this collection. Variable information is contained in the setup files." I am therefore writing a codebook for the naming conventions in my file, so that anyone who needs to go back to the raw data, myself included, does not have to spend hours figuring out what the raw variables mean.

# Codebook for ICPSR 1, United States Historical Election Returns

## State and County Identifiers
| **Column Name**         | **Description**                                                                                     |
|-------------------------|-----------------------------------------------------------------------------------------------------|
| `ICPR_STATE_CODE`       | ICPSR standardized state code.                                                                      |
| `COUNTY_NAME`           | Standardized county name.                                                                           |
| `IDENTIFICATION_NUMBER` | Unique numeric identifier for each county, enabling consistent referencing.                         |

## Congressional District Numbers
| **Column Name**           | **Description**                                                                                   |
|---------------------------|---------------------------------------------------------------------------------------------------|
| `CONG_DIST_NUMBER_YYYY`   | Congressional district number for a specific year (e.g., `CONG_DIST_NUMBER_1825`). May indicate the number of districts for split counties. |

## Election Results

### General Format

X###_##_TYPE_RACE_PARTYCODE_VOTE

### Components
| **Component**     | **Description**                                                                                           |
|-------------------|---------------------------------------------------------------------------------------------------------|
| `X###`           | Election year (e.g., `X824` = 1824).                                                                      |
| `##`             | Election type/level: <br> **1** = Presidential, **2** = Gubernatorial, **3** = Congressional/House elections. |
| `TYPE`           | Type of election: <br> **G** = General, **M** = Midterm, **S** = Special.                                 |
| `RACE`           | Race type: <br> Examples: `PRES` = President, `GOV` = Governor.                                           |
| `PARTYCODE`      | Code representing the political party. See the attached party codes file for definitions (e.g., `0025` = National Republican). |
| `VOTE`           | Number of votes received by the candidate.                                                                |
| `TOTAL_VOTE`     | Total votes cast for the specific race or election.                                                       |

### Examples
| **Column Name**               | **Description**                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------|
| `X824_1_G_PRES_0025_VOTE`     | Votes for the National Republican candidate in the 1824 presidential general election.      |
| `X825_2_G_GOV_0659_VOTE`      | Votes for a specific party in the 1825 gubernatorial general election.                      |
| `X827_3_M_H_AL_9001_VOTE`     | Votes in a midterm House election in district `9001` for Alabama in 1827.                   |
| `X836_2_G_GOV_TOTAL_VOTE`     | Total gubernatorial votes in the 1836 general election.                                     |
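
The naming convention above can be turned into a small parser. The following is a minimal sketch, not part of the original pipeline: the function name `parse_vote_column`, the regex group names, and the `dup` field are assumptions made for illustration. It follows the notebook's convention of adding 1000 to the three-digit year.

```python
import re

# Regex for the convention X###_#_TYPE_RACE_PARTYCODE_VOTE.
# Group names are illustrative; they do not come from the ICPSR setup files.
COLUMN_PATTERN = re.compile(
    r"^X(?P<year>\d{3})"                 # three-digit year, e.g. 824 -> 1824
    r"_(?P<level>\d)"                    # 1 = President, 2 = Governor, 3 = House
    r"_(?P<type>[A-Z])"                  # election type letter, e.g. G, M, S
    r"_(?P<race>[A-Z]+(?:_[A-Z]+)*?)"    # PRES, GOV, H_AL, CONG
    r"_(?P<party>\d{4}|TOTAL)"           # four-digit party code or TOTAL
    r"_VOTE(?:\.(?P<dup>\d+))?$"         # optional .1/.2 duplicate suffix
)


def parse_vote_column(name):
    """Split a vote column name into its components, or return None."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = 1000 + int(parts["year"])
    parts["level"] = int(parts["level"])
    parts["dup"] = int(parts["dup"]) if parts["dup"] else 0
    return parts


print(parse_vote_column("X824_1_G_PRES_0025_VOTE"))
print(parse_vote_column("X831_3_M_H_AL_0100_VOTE.2"))
print(parse_vote_column("COUNTY_NAME"))  # not a vote column
```

Identifier columns such as `COUNTY_NAME` do not match the pattern, so the function returns `None` for them.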

## Handling Duplicate or Corrected Entries
| **Column Name Example**       | **Description**                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------|
| `X825_2_G_GOV_0659_VOTE.1`    | A secondary entry for verification or correction of votes in the 1825 gubernatorial election.|
| `X831_3_M_H_AL_0100_VOTE.2`   | A duplicate or re-evaluated entry for midterm House votes in district `0100` for Alabama in 1831. |
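
To see how a suffixed column relates to its base column, it can help to list them side by side. The sketch below uses a toy frame with made-up numbers, and the helper name `suffixed_duplicates` is hypothetical. One possible origin of these suffixes, which the source does not confirm, is that `pandas.read_csv` appends `.1`, `.2`, ... when the same header name appears more than once.

```python
import re

import pandas as pd


def suffixed_duplicates(columns):
    """Map each base vote column name to its `.N`-suffixed variants."""
    groups = {}
    for col in columns:
        match = re.fullmatch(r"(.+_VOTE)\.(\d+)", col)
        if match:
            groups.setdefault(match.group(1), []).append(col)
    return groups


# Toy data with invented values, only to show the comparison
toy = pd.DataFrame({
    "X825_2_G_GOV_0659_VOTE": [10, 0],
    "X825_2_G_GOV_0659_VOTE.1": [0, 7],
    "X825_2_G_GOV_TOTAL_VOTE": [10, 7],
})

dups = suffixed_duplicates(toy.columns)
for base, variants in dups.items():
    # Show the base column next to its suffixed variants
    print(toy[[base] + variants])
```

Running the same helper on `df.columns` would list every base column that has suffixed variants in the real data.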

## Important Notes
- **Corrections:** Some identification numbers need correcting before analyzing by county. For example, Jackson County in Georgia (`ID: 1510`) should be changed to `1570`.
- **Missing Values:** Placeholder codes such as `98` are used for counties that did not report data or did not participate in an election.
- **Party Codes:** For the meaning of party codes such as `0025` and `0659`, refer to the party codes section of the documentation in /Users/revekkagershovich/Dropbox (MIT)/StateLaws/2_data/1_raw/political_data/ICPSR_election_returns/DS0204/00001-0204-Documentation.txt
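
The first two notes could be applied roughly as follows. This is a sketch on a toy frame with invented rows, under two assumptions that should be checked against the raw data: that the `1510` correction applies only to rows named `JACKSON`, and that `98` acts as a placeholder only in the `CONG_DIST_NUMBER_*` columns.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values; the real frame has many more columns
toy = pd.DataFrame({
    "COUNTY_NAME": ["JACKSON", "FAIRFIELD"],
    "IDENTIFICATION_NUMBER": [1510, 10],
    "CONG_DIST_NUMBER_1825": [98, 4],
})

cleaned = toy.copy()

# Assumption: only the Jackson County row carries the wrong ID 1510
needs_fix = (cleaned["IDENTIFICATION_NUMBER"] == 1510) & (cleaned["COUNTY_NAME"] == "JACKSON")
cleaned.loc[needs_fix, "IDENTIFICATION_NUMBER"] = 1570

# Assumption: 98 is a placeholder only in the district-number columns
cong_cols = [c for c in cleaned.columns if c.startswith("CONG_DIST_NUMBER_")]
cleaned[cong_cols] = cleaned[cong_cols].replace(98, np.nan)

print(cleaned)
```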

In [28]:
df.head()

Unnamed: 0,ICPR_STATE_CODE,COUNTY_NAME,IDENTIFICATION_NUMBER,CONG_DIST_NUMBER_1825,CONG_DIST_NUMBER_1829,CONG_DIST_NUMBER_1833,CONG_DIST_NUMBER_1835,CONG_DIST_NUMBER_1837,CONG_DIST_NUMBER_1841,CONG_DIST_NUMBER_1845,...,X860_1_G_PRES_0604_VOTE,X860_1_G_PRES_9001_VOTE,X860_1_G_PRES_TOTAL_VOTE,X860_2_G_GOV_0100_VOTE,X860_2_G_GOV_0200_VOTE,X860_2_G_GOV_0605_VOTE,X860_2_G_GOV_0728_VOTE,X860_2_G_GOV_1195_VOTE,X860_2_G_GOV_9999_VOTE,X860_2_G_GOV_TOTAL_VOTE
0,1,FAIRFIELD,10,98,98,98,98,98,4,4,...,2033,0,10454,7136,6921,0,0,0,0,14057
1,1,HARTFORD,30,98,98,98,98,98,1,1,...,3088,0,15156,8975,8753,0,0,0,0,17728
2,1,LITCHFIELD,50,98,98,98,98,98,5,4,...,1567,0,8150,4656,5203,0,0,0,0,9859
3,1,MIDDLESEX,70,98,98,98,98,98,2,2,...,1335,0,5510,3490,2942,0,0,0,0,6432
4,1,NEW HAVEN,90,98,98,98,98,98,2,2,...,4368,0,16540,9765,8709,0,0,0,0,18474


In [29]:
# Print all column names as a list
print(df.columns.tolist())

# Or, print each column name on a new line
for col in df.columns:
    print(col)

['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER', 'CONG_DIST_NUMBER_1825', 'CONG_DIST_NUMBER_1829', 'CONG_DIST_NUMBER_1833', 'CONG_DIST_NUMBER_1835', 'CONG_DIST_NUMBER_1837', 'CONG_DIST_NUMBER_1841', 'CONG_DIST_NUMBER_1845', 'CONG_DIST_NUMBER_1849', 'CONG_DIST_NUMBER_1853', 'CONG_DIST_NUMBER_1857', 'X824_1_G_PRES_0020_VOTE', 'X824_1_G_PRES_0611_VOTE', 'X824_1_G_PRES_9999_VOTE', 'X824_1_G_PRES_TOTAL_VOTE', 'X824_2_G_GOV_0012_VOTE', 'X824_2_G_GOV_0200_VOTE', 'X824_2_G_GOV_1063_VOTE', 'X824_2_G_GOV_TOTAL_VOTE', 'X825_2_G_GOV_0001_VOTE', 'X825_2_G_GOV_0012_VOTE', 'X825_2_G_GOV_0659_VOTE', 'X825_2_G_GOV_0659_VOTE.1', 'X825_2_G_GOV_0659_VOTE.2', 'X825_2_G_GOV_TOTAL_VOTE', 'X825_3_M_H_AL_9001_VOTE', 'X825_3_M_H_AL_9002_VOTE', 'X825_3_M_H_AL_9003_VOTE', 'X825_3_M_H_AL_9004_VOTE', 'X825_3_M_H_AL_9005_VOTE', 'X825_3_M_H_AL_9006_VOTE', 'X825_3_M_H_AL_9007_VOTE', 'X825_3_M_H_AL_9008_VOTE', 'X825_3_M_H_AL_9009_VOTE', 'X825_3_M_H_AL_9010_VOTE', 'X825_3_M_H_AL_9011_VOTE', 'X825_3_M_H_AL_901

In [64]:
# Step 1: Identify columns and group them by their suffix
# Add all variables starting with "CONG" to id_vars
id_vars = ['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER'] + [col for col in df.columns if col.startswith('CONG')]
grouped_columns = {}

# Group columns by their suffix (everything after the first underscore and without the year part)
for col in df.columns:
    if col.startswith('X'):
        suffix = '_'.join(col.split('_')[1:])  # Extract the suffix after the first underscore
        if suffix not in grouped_columns:
            grouped_columns[suffix] = []
        grouped_columns[suffix].append(col)

In [65]:
print(grouped_columns.keys())

dict_keys(['1_G_PRES_0020_VOTE', '1_G_PRES_0611_VOTE', '1_G_PRES_9999_VOTE', '1_G_PRES_TOTAL_VOTE', '2_G_GOV_0012_VOTE', '2_G_GOV_0200_VOTE', '2_G_GOV_1063_VOTE', '2_G_GOV_TOTAL_VOTE', '2_G_GOV_0001_VOTE', '2_G_GOV_0659_VOTE', '2_G_GOV_0659_VOTE.1', '2_G_GOV_0659_VOTE.2', '3_M_H_AL_9001_VOTE', '3_M_H_AL_9002_VOTE', '3_M_H_AL_9003_VOTE', '3_M_H_AL_9004_VOTE', '3_M_H_AL_9005_VOTE', '3_M_H_AL_9006_VOTE', '3_M_H_AL_9007_VOTE', '3_M_H_AL_9008_VOTE', '3_M_H_AL_9009_VOTE', '3_M_H_AL_9010_VOTE', '3_M_H_AL_9011_VOTE', '3_M_H_AL_9012_VOTE', '3_M_H_AL_9013_VOTE', '3_M_H_AL_9014_VOTE', '3_M_H_AL_9015_VOTE', '3_M_H_AL_9016_VOTE', '3_M_H_AL_9017_VOTE', '3_M_H_AL_9018_VOTE', '3_M_H_AL_9019_VOTE', '3_M_H_AL_2020_VOTE', '3_M_H_AL_TOTAL_VOTE', '1_G_PRES_0025_VOTE', '1_G_PRES_0101_VOTE', '2_G_GOV_0025_VOTE', '3_M_H_AL_9020_VOTE', '3_M_H_AL_9021_VOTE', '3_M_H_AL_9022_VOTE', '2_G_GOV_0026_VOTE', '2_G_GOV_0026_VOTE.1', '2_G_GOV_0100_VOTE', '2_G_GOV_9001_VOTE', '3_M_H_AL_0025_VOTE', '3_M_H_AL_0025_VOTE.1', '

In [66]:
# Step 2: Reshape each group and combine into a single table
reshaped_dataframes = []

for suffix, cols in grouped_columns.items():
    # Reshape the group into long format
    temp_df = pd.melt(df, id_vars=id_vars, value_vars=cols,
                      var_name='year', value_name=suffix)
    # Extract the year and adjust to full year format
    temp_df['year'] = temp_df['year'].str.extract(r'X(\d+)').astype(int) + 1000
    reshaped_dataframes.append(temp_df)

In [70]:
# Step 3: Merge all reshaped groups into a single DataFrame
final_df = reshaped_dataframes[0]
for additional_df in reshaped_dataframes[1:]:
    final_df = final_df.merge(additional_df, on=id_vars + ['year'], how='outer')

final_df = final_df[['year'] + [col for col in final_df.columns if col != 'year']]
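
The steps above build one melted frame per suffix and then join them with repeated outer merges. A possible alternative, sketched here on a made-up toy frame, is to melt once, split the column name into year and suffix, and pivot the suffixes back into columns. It should give an equivalent county-by-year table, but that has not been verified against `final_df`. The names `toy`, `long`, and `wide` are illustrative.

```python
import pandas as pd

# Toy wide frame with invented vote counts
toy = pd.DataFrame({
    "IDENTIFICATION_NUMBER": [10, 30],
    "X824_1_G_PRES_0020_VOTE": [4, 6],
    "X824_2_G_GOV_0012_VOTE": [938, 1000],
    "X825_2_G_GOV_0012_VOTE": [900, 1100],
})
id_cols = ["IDENTIFICATION_NUMBER"]

# One melt for all vote columns
long = toy.melt(id_vars=id_cols, var_name="column", value_name="votes")

# Split 'X824_1_G_PRES_0020_VOTE' into year '824' and suffix '1_G_PRES_0020_VOTE'
parts = long["column"].str.extract(r"^X(?P<year>\d+)_(?P<suffix>.+)$")
long["year"] = parts["year"].astype(int) + 1000
long["suffix"] = parts["suffix"]

# One row per (county, year), one column per suffix
wide = (
    long.pivot(index=id_cols + ["year"], columns="suffix", values="votes")
        .reset_index()
)
wide.columns.name = None
print(wide)
```

`pivot` raises an error if the same (county, year, suffix) combination occurs twice, which doubles as a check that the keys are unique.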

In [71]:
final_df.tail()

Unnamed: 0,year,ICPR_STATE_CODE,COUNTY_NAME,IDENTIFICATION_NUMBER,CONG_DIST_NUMBER_1825,CONG_DIST_NUMBER_1829,CONG_DIST_NUMBER_1833,CONG_DIST_NUMBER_1835,CONG_DIST_NUMBER_1837,CONG_DIST_NUMBER_1841,...,3_G_CONG_0200_VOTE,3_G_CONG_0310_VOTE,1_G_PRES_0200_VOTE,1_G_PRES_0310_VOTE,3_G_CONG_0037_VOTE,3_G_CONG_0604_VOTE,1_G_PRES_0037_VOTE,1_G_PRES_0604_VOTE,2_G_GOV_0605_VOTE,2_G_GOV_1195_VOTE
291,1856,1,WINDHAM,150,98,98,98,98,98,6,...,,,3913.0,56.0,,,,,,
292,1857,1,WINDHAM,150,98,98,98,98,98,6,...,2250.0,0.0,,,,,,,,
293,1858,1,WINDHAM,150,98,98,98,98,98,6,...,,,,,,,,,,
294,1859,1,WINDHAM,150,98,98,98,98,98,6,...,3006.0,,,,0.0,0.0,,,,
295,1860,1,WINDHAM,150,98,98,98,98,98,6,...,,,3619.0,,,,20.0,303.0,0.0,0.0


In [78]:
# Identify duplicate pairs of IDENTIFICATION_NUMBER and year
duplicates = final_df[final_df.duplicated(subset=['IDENTIFICATION_NUMBER', 'year'], keep=False)]

# Display the duplicate pairs
print("Duplicate Pairs of IDENTIFICATION_NUMBER and Year:")
print(duplicates)

Duplicate Pairs of IDENTIFICATION_NUMBER and Year:
Empty DataFrame
Columns: [year, ICPR_STATE_CODE, COUNTY_NAME, IDENTIFICATION_NUMBER, CONG_DIST_NUMBER_1825, CONG_DIST_NUMBER_1829, CONG_DIST_NUMBER_1833, CONG_DIST_NUMBER_1835, CONG_DIST_NUMBER_1837, CONG_DIST_NUMBER_1841, CONG_DIST_NUMBER_1845, CONG_DIST_NUMBER_1849, CONG_DIST_NUMBER_1853, CONG_DIST_NUMBER_1857, 1_G_PRES_0020_VOTE, 1_G_PRES_0611_VOTE, 1_G_PRES_9999_VOTE, 1_G_PRES_TOTAL_VOTE, 2_G_GOV_0012_VOTE, 2_G_GOV_0200_VOTE, 2_G_GOV_1063_VOTE, 2_G_GOV_TOTAL_VOTE, 2_G_GOV_0001_VOTE, 2_G_GOV_0659_VOTE, 2_G_GOV_0659_VOTE.1, 2_G_GOV_0659_VOTE.2, 3_M_H_AL_9001_VOTE, 3_M_H_AL_9002_VOTE, 3_M_H_AL_9003_VOTE, 3_M_H_AL_9004_VOTE, 3_M_H_AL_9005_VOTE, 3_M_H_AL_9006_VOTE, 3_M_H_AL_9007_VOTE, 3_M_H_AL_9008_VOTE, 3_M_H_AL_9009_VOTE, 3_M_H_AL_9010_VOTE, 3_M_H_AL_9011_VOTE, 3_M_H_AL_9012_VOTE, 3_M_H_AL_9013_VOTE, 3_M_H_AL_9014_VOTE, 3_M_H_AL_9015_VOTE, 3_M_H_AL_9016_VOTE, 3_M_H_AL_9017_VOTE, 3_M_H_AL_9018_VOTE, 3_M_H_AL_9019_VOTE, 3_M_H_AL_2020_VO

In [79]:
final_df.shape

(296, 129)

In [72]:
# Step 1: Identify columns and group them by their base name (e.g., `3_G_CONG_VOTE`)
id_vars = ['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER', 'year'] + [col for col in df.columns if col.startswith('CONG')]
grouped_columns = {}

# Group vote columns by base name: the first three tokens (level, type, race) plus 'VOTE'.
# The party code is dropped, as are the 'AL' token of H_AL columns and any '.1'/'.2' suffix.
for col in final_df.columns:
    if '_VOTE' in col:  # Ensure we're only processing relevant columns
        base_name = '_'.join(col.split('_')[:3] + ['VOTE'])  # e.g. '2_G_GOV_0659_VOTE' -> '2_G_GOV_VOTE'
        if base_name not in grouped_columns:
            grouped_columns[base_name] = []
        grouped_columns[base_name].append(col)

In [73]:
# Step 2: Reshape each group and combine into a single table
reshaped_dataframes = []

for base_name, cols in grouped_columns.items():
    # Reshape the group into long format
    temp_df = pd.melt(final_df, id_vars=id_vars, value_vars=cols,
                      var_name='party', value_name=base_name)
    # Extract the four-digit party code from the column name.
    # Columns without one (e.g. TOTAL) have no match, which astype(str) turns into the string 'nan'.
    temp_df['party'] = temp_df['party'].str.extract(r'_(\d{4})_').astype(str)
    reshaped_dataframes.append(temp_df)

In [76]:
# Step 3: Merge all reshaped groups into a single DataFrame
final_df_long = reshaped_dataframes[0]
for additional_df in reshaped_dataframes[1:]:
    final_df_long = final_df_long.merge(additional_df, on=id_vars + ['party'], how='outer')

final_df_long = final_df_long[['party'] + [col for col in final_df_long.columns if col != 'party']]
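
Before relying on the long table, it may be worth checking which columns collapse onto the same `(base_name, party)` key. The sketch below replays the grouping and party-extraction logic on a few column names taken from the listing above. It suggests that `.1`/`.2` columns share a key with their base column, which may explain repeated party rows in the output, and that `TOTAL` columns get no party code. The names `summary` and `collisions` are illustrative.

```python
import pandas as pd

cols = pd.Series([
    "2_G_GOV_0659_VOTE",
    "2_G_GOV_0659_VOTE.1",
    "2_G_GOV_TOTAL_VOTE",
    "3_M_H_AL_9001_VOTE",
])

# Same base-name rule as in Step 1: first three tokens plus 'VOTE'
base = cols.map(lambda c: "_".join(c.split("_")[:3] + ["VOTE"]))

# Same party rule as in Step 2, kept as missing instead of cast to str
party = cols.str.extract(r"_(\d{4})_", expand=False)

summary = pd.DataFrame({"column": cols, "base_name": base, "party": party})
print(summary)

# Columns that map to the same (base_name, party) key
collisions = summary[summary.duplicated(["base_name", "party"], keep=False)]
print(collisions)
```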

In [77]:
final_df_long.head()

Unnamed: 0,party,ICPR_STATE_CODE,COUNTY_NAME,IDENTIFICATION_NUMBER,year,CONG_DIST_NUMBER_1825,CONG_DIST_NUMBER_1829,CONG_DIST_NUMBER_1833,CONG_DIST_NUMBER_1835,CONG_DIST_NUMBER_1837,...,CONG_DIST_NUMBER_1845,CONG_DIST_NUMBER_1849,CONG_DIST_NUMBER_1853,CONG_DIST_NUMBER_1857,1_G_PRES_VOTE,2_G_GOV_VOTE,3_M_H_VOTE,3_W_H_VOTE,3_S_H_VOTE,3_G_CONG_VOTE
0,1,1,FAIRFIELD,10,1824,98,98,98,98,98,...,4,4,4,4,,,,,,
1,12,1,FAIRFIELD,10,1824,98,98,98,98,98,...,4,4,4,4,,938.0,,,,
2,20,1,FAIRFIELD,10,1824,98,98,98,98,98,...,4,4,4,4,4.0,,,,,
3,25,1,FAIRFIELD,10,1824,98,98,98,98,98,...,4,4,4,4,,,,,,
4,25,1,FAIRFIELD,10,1824,98,98,98,98,98,...,4,4,4,4,,,,,,



In [62]:
# List of columns to always keep
key_columns = ['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER']

# Filter columns that contain 'G_PRES_9999'
filtered_columns = [col for col in df.columns if 'G_PRES_9999' in col]

# Combine key columns with the filtered columns
columns_to_keep = key_columns + filtered_columns

# Create a new DataFrame with the filtered columns
filtered_df = df[columns_to_keep]

In [63]:
filtered_df.columns

Index(['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER',
       'X824_1_G_PRES_9999_VOTE', 'X828_1_G_PRES_9999_VOTE',
       'X836_1_G_PRES_9999_VOTE', 'X840_1_G_PRES_9999_VOTE',
       'X844_1_G_PRES_9999_VOTE', 'X848_1_G_PRES_9999_VOTE',
       'X852_1_G_PRES_9999_VOTE', 'X856_1_G_PRES_9999_VOTE'],
      dtype='object')

In [37]:
filtered_df.head()

Unnamed: 0,ICPR_STATE_CODE,COUNTY_NAME,IDENTIFICATION_NUMBER,X824_2_G_GOV_TOTAL_VOTE,X825_2_G_GOV_TOTAL_VOTE,X826_2_G_GOV_TOTAL_VOTE,X827_2_G_GOV_TOTAL_VOTE,X828_2_G_GOV_TOTAL_VOTE,X829_2_G_GOV_TOTAL_VOTE,X830_2_G_GOV_TOTAL_VOTE,...,X851_2_G_GOV_TOTAL_VOTE,X852_2_G_GOV_TOTAL_VOTE,X853_2_G_GOV_TOTAL_VOTE,X854_2_G_GOV_TOTAL_VOTE,X855_2_G_GOV_TOTAL_VOTE,X856_2_G_GOV_TOTAL_VOTE,X857_2_G_GOV_TOTAL_VOTE,X858_2_G_GOV_TOTAL_VOTE,X859_2_G_GOV_TOTAL_VOTE,X860_2_G_GOV_TOTAL_VOTE
0,1,FAIRFIELD,10,1143,1502,2138,2840,1620,1330,1828,...,9232,9299,8456,8728,9754,9806,9651,10732,13186,14057
1,1,HARTFORD,30,1341,1807,1857,2023,1727,1957,2254,...,12468,12385,12163,11888,13035,13543,13039,14421,15572,17728
2,1,LITCHFIELD,50,1199,1610,2057,2548,1540,1880,2199,...,8330,8357,7889,7513,7771,7721,7155,8561,9387,9859
3,1,MIDDLESEX,70,447,761,980,874,607,540,624,...,4433,4662,4759,4523,5236,5305,4708,5177,5887,6432
4,1,NEW HAVEN,90,1123,1552,2097,1878,1107,1246,2078,...,10483,11499,11416,11488,12893,13430,13669,13805,15368,18474


In [40]:
# Step 1: Identify columns to reshape
id_vars = ['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER'] 
value_vars = [col for col in filtered_df.columns if col.startswith('X')]

# Step 2: Reshape the DataFrame from wide to long format
filtered_df_long = pd.melt(filtered_df, id_vars=id_vars, value_vars=value_vars, 
                           var_name='year', value_name='2_G_GOV_TOTAL_VOTE')

# Step 3: Extract year from the 'year' column and convert to integer
filtered_df_long['year'] = filtered_df_long['year'].str.extract(r'X(\d+)').astype(int)+1000

In [41]:
filtered_df_long.head()

Unnamed: 0,ICPR_STATE_CODE,COUNTY_NAME,IDENTIFICATION_NUMBER,year,2_G_GOV_TOTAL_VOTE
0,1,FAIRFIELD,10,1824,1143
1,1,HARTFORD,30,1824,1341
2,1,LITCHFIELD,50,1824,1199
3,1,MIDDLESEX,70,1824,447
4,1,NEW HAVEN,90,1824,1123
