**Created by:** Revekka Gershovich **When:** Dec 4, 2024 **Why:** To clean and aggregate election returns data for the years 1824 to 1968 from ICPSR 1, United States Historical Election Returns

In [1]:
import os
import os.path as path
import pandas as pd
import numpy as np

In [2]:
parent_dir = os.path.abspath("/Users/revekkagershovich/Dropbox (MIT)/StateLaws")
# Verify the project root exists before changing into it (os.chdir would otherwise raise first)
assert os.path.exists(parent_dir), "parent_dir does not exist"
os.chdir(parent_dir)
intermed_data_dir = "./2_data/2_intermediate/political_data"
assert os.path.exists(intermed_data_dir), "Intermediate data directory does not exist"
raw_data_dir = "./2_data/1_raw/political_data"
assert os.path.exists(raw_data_dir), "Raw data directory does not exist"

In [5]:
df = pd.read_csv(path.join(raw_data_dir, "./ICPSR_election_returns/DS0001/00001-0001-Data.csv"))

In [7]:
df.columns

Index(['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER',
       'CONG_DIST_NUMBER_1825', 'CONG_DIST_NUMBER_1829',
       'CONG_DIST_NUMBER_1833', 'CONG_DIST_NUMBER_1835',
       'CONG_DIST_NUMBER_1837', 'CONG_DIST_NUMBER_1841',
       'CONG_DIST_NUMBER_1845',
       ...
       'X860_1_G_PRES_0604_VOTE', 'X860_1_G_PRES_9001_VOTE',
       'X860_1_G_PRES_TOTAL_VOTE', 'X860_2_G_GOV_0100_VOTE',
       'X860_2_G_GOV_0200_VOTE', 'X860_2_G_GOV_0605_VOTE',
       'X860_2_G_GOV_0728_VOTE', 'X860_2_G_GOV_1195_VOTE',
       'X860_2_G_GOV_9999_VOTE', 'X860_2_G_GOV_TOTAL_VOTE'],
      dtype='object', length=499)

# Deciphering variable names

**1.** This dataset is distributed in ASCII format with SAS and SPSS setup files. I converted it to CSV with asciiSetupReader, a niche R library written specifically for reading pre-2000s datasets formatted this way. For the CSV variable names, I used the labels defined in the setup file. The conversion script is in our StateLaws Dropbox at 1_code/similarity_code/Political_similarity_code/ICSPR_00001_to_csv.R

**2.** The "Scope of Project" documentation for the study is available at https://www.icpsr.umich.edu/web/ICPSR/studies/1. According to it, "There is no actual codebook for this collection. Variable information is contained in the setup files." I am therefore writing a codebook for the naming conventions in my file here, so that if I or anyone else ever needs to go back to the raw data, they will not have to spend hours figuring out what the variables mean.

# Codebook for ICPSR 1, United States Historical Election Returns

## State and County Identifiers
| **Column Name**         | **Description**                                                                                     |
|-------------------------|-----------------------------------------------------------------------------------------------------|
| `ICPR_STATE_CODE`       | ICPSR standardized state code.                                                                      |
| `COUNTY_NAME`           | Standardized county name.                                                                           |
| `IDENTIFICATION_NUMBER` | Unique numeric identifier for each county, enabling consistent referencing.                         |

## Congressional District Numbers
| **Column Name**           | **Description**                                                                                   |
|---------------------------|---------------------------------------------------------------------------------------------------|
| `CONG_DIST_NUMBER_YYYY`   | Congressional district number for a specific year (e.g., `CONG_DIST_NUMBER_1825`). May indicate the number of districts for split counties. |
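
For analysis by year, it may help to reshape the `CONG_DIST_NUMBER_YYYY` columns into one row per county and year. The sketch below is only an illustration on a toy frame built from two rows of the second file's preview; the function name `district_columns_to_long` and the output column names `cong_dist_number` and `year` are my own choices, not part of the dataset.

```python
import pandas as pd

def district_columns_to_long(frame: pd.DataFrame) -> pd.DataFrame:
    """Reshape CONG_DIST_NUMBER_YYYY columns into one row per county-year.

    Hypothetical helper: the output column names are illustrative only.
    """
    id_cols = ["ICPR_STATE_CODE", "COUNTY_NAME", "IDENTIFICATION_NUMBER"]
    dist_cols = [c for c in frame.columns if c.startswith("CONG_DIST_NUMBER_")]
    long_df = frame.melt(
        id_vars=id_cols,
        value_vars=dist_cols,
        var_name="source_column",
        value_name="cong_dist_number",
    )
    # The year is the part after the last underscore, e.g. "1861"
    long_df["year"] = long_df["source_column"].str.rsplit("_", n=1).str[-1].astype(int)
    return long_df.drop(columns="source_column")

# Toy frame with the same layout as the raw file (values taken from the preview of df2)
toy = pd.DataFrame({
    "ICPR_STATE_CODE": [1, 1],
    "COUNTY_NAME": ["FAIRFIELD", "HARTFORD"],
    "IDENTIFICATION_NUMBER": [10, 30],
    "CONG_DIST_NUMBER_1861": [4, 1],
    "CONG_DIST_NUMBER_1863": [4, 1],
})
long_toy = district_columns_to_long(toy)
print(long_toy)
```

On the real data, the same call would be `district_columns_to_long(df)`, assuming the three identifier columns are present.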

## Election Results

### General Format

`X###_##_TYPE_RACE_PARTYCODE_VOTE`

### Components
| **Component**     | **Description**                                                                                           |
|-------------------|---------------------------------------------------------------------------------------------------------|
| `X###`           | Last three digits of the election year (e.g., `X824` = 1824, `X904` = 1904).                              |
| `##`             | Election type/level: <br> **1** = Presidential, **2** = Gubernatorial, **3** = Congressional/House elections. |
| `TYPE`           | Type of election: <br> **G** = General, **M** = Midterm, **S** = Special.                                 |
| `RACE`           | Race type: <br> Examples: `PRES` = President, `GOV` = Governor.                                           |
| `PARTYCODE`      | Code representing the political party. See the attached party codes file for definitions (e.g., `0025` = National Republican). |
| `VOTE`           | Number of votes received by the candidate.                                                                |
| `TOTAL_VOTE`     | Total votes cast for the specific race or election.                                                       |

### Examples
| **Column Name**               | **Description**                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------|
| `X824_1_G_PRES_0025_VOTE`     | Votes for the National Republican candidate in the 1824 presidential general election.      |
| `X825_2_G_GOV_0659_VOTE`      | Votes for a specific party in the 1825 gubernatorial general election.                      |
| `X827_3_M_H_AL_9001_VOTE`     | Votes in a midterm House election in district `9001` for Alabama in 1827.                   |
| `X836_2_G_GOV_TOTAL_VOTE`     | Total gubernatorial votes in the 1836 general election.                                     |
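
The naming convention above can be turned into a small parser so that column names do not have to be decoded by hand. This is a minimal sketch, not a definitive implementation: the regular expression, the function name `parse_vote_column`, and the returned keys are my own. It assumes every year falls in the 1000s (consistent with the 1824 to 1968 coverage), so the missing leading digit is restored by prepending `1`.

```python
import re

# Pattern for names like X824_1_G_PRES_0025_VOTE or X825_2_G_GOV_0659_VOTE.1
# (hypothetical pattern written from the codebook above, not from ICPSR documentation)
VOTE_COLUMN = re.compile(
    r"^X(?P<year>\d{3})"          # last three digits of the year
    r"_(?P<level>\d+)"            # election level (1, 2, 3)
    r"_(?P<type>[A-Z])"           # election type (G, M, S)
    r"_(?P<race>[A-Z_]+?)"        # race label, may contain underscores (e.g. H_AL)
    r"_(?P<party>\d{4}|TOTAL)"    # four-digit party code or TOTAL
    r"_VOTE"
    r"(?:\.(?P<dup>\d+))?$"       # optional .1, .2 suffix on repeated names
)

def parse_vote_column(name, century_prefix="1"):
    """Split a vote column name into its components; return None if it does not match."""
    match = VOTE_COLUMN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    # Assumption: all years are 1xxx, so prepend the missing leading digit
    parts["year"] = int(century_prefix + parts["year"])
    parts["level"] = int(parts["level"])
    parts["dup"] = int(parts["dup"]) if parts["dup"] else 0
    return parts

parsed = parse_vote_column("X827_3_M_H_AL_9001_VOTE")
print(parsed)
```

Identifier columns such as `CONG_DIST_NUMBER_1825` do not match the pattern and return `None`, which makes it easy to separate them from vote columns.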

## Handling Duplicate or Corrected Entries
| **Column Name Example**       | **Description**                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------|
| `X825_2_G_GOV_0659_VOTE.1`    | A secondary entry for verification or correction of votes in the 1825 gubernatorial election.|
| `X831_3_M_H_AL_0100_VOTE.2`   | A duplicate or re-evaluated entry for midterm House votes in district `0100` for Alabama in 1831. |
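
Suffixes of the form `.1` and `.2` are also what pandas `read_csv` appends when a CSV header repeats a column name, so it may be worth checking whether these columns come from repeated labels in the setup file. The sketch below lists every base name that appears more than once; the function name `group_suffixed_columns` is hypothetical, and the column list is a small excerpt from the first file.

```python
import re
from collections import defaultdict

def group_suffixed_columns(columns):
    """Group column names that differ only by a trailing .1, .2, ... suffix."""
    groups = defaultdict(list)
    for col in columns:
        base = re.sub(r"\.\d+$", "", col)  # strip a trailing numeric suffix, if any
        groups[base].append(col)
    # Keep only base names that occur more than once
    return {base: cols for base, cols in groups.items() if len(cols) > 1}

cols = [
    "X825_2_G_GOV_0012_VOTE",
    "X825_2_G_GOV_0659_VOTE",
    "X825_2_G_GOV_0659_VOTE.1",
    "X825_2_G_GOV_0659_VOTE.2",
    "X825_2_G_GOV_TOTAL_VOTE",
]
dupes = group_suffixed_columns(cols)
print(dupes)
```

Running it on `df.columns` would give the full list of repeated names to inspect before deciding how to combine them.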

## Important Notes
- **Corrections:** Some entries, such as Jackson County in Georgia (`ID: 1510`), should be corrected to `1570` when analyzing by county.
- **Missing Values:** For counties not reporting data or not participating in elections, identifiers like `98` (placeholders) are used.
- **Party Codes:** Refer to the party codes section of the documentation in `/Users/revekkagershovich/Dropbox (MIT)/StateLaws/2_data/1_raw/political_data/ICPSR_election_returns/DS0204/00001-0204-Documentation.txt` for the meaning of codes such as `0025` and `0659`, which represent political parties.
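
The county ID correction in the notes above could be applied with a small helper like the one below. This is a sketch under stated assumptions: the function name `correct_county_id` is hypothetical, the state code `44` for Georgia is my assumption and should be checked against the documentation, and the uppercase spelling `JACKSON` is assumed from the way county names appear elsewhere in the data.

```python
import pandas as pd

def correct_county_id(frame, state_code, county_name, wrong_id, right_id):
    """Return a copy of frame with one county's IDENTIFICATION_NUMBER replaced."""
    fixed = frame.copy()
    mask = (
        (fixed["ICPR_STATE_CODE"] == state_code)
        & (fixed["COUNTY_NAME"] == county_name)
        & (fixed["IDENTIFICATION_NUMBER"] == wrong_id)
    )
    fixed.loc[mask, "IDENTIFICATION_NUMBER"] = right_id
    return fixed

# Toy frame; state code 44 for Georgia is an assumption, not taken from the source
toy_ids = pd.DataFrame({
    "ICPR_STATE_CODE": [44, 1],
    "COUNTY_NAME": ["JACKSON", "FAIRFIELD"],
    "IDENTIFICATION_NUMBER": [1510, 10],
})
fixed_ids = correct_county_id(toy_ids, state_code=44, county_name="JACKSON",
                              wrong_id=1510, right_id=1570)
print(fixed_ids)
```

Matching on state, county name, and old ID together keeps the fix from touching any other county that happens to share the ID `1510`.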

In [18]:
# Print all column names as a list
print(df.columns.tolist())

# Or, print each column name on a new line
for col in df.columns:
    print(col)

['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER', 'CONG_DIST_NUMBER_1825', 'CONG_DIST_NUMBER_1829', 'CONG_DIST_NUMBER_1833', 'CONG_DIST_NUMBER_1835', 'CONG_DIST_NUMBER_1837', 'CONG_DIST_NUMBER_1841', 'CONG_DIST_NUMBER_1845', 'CONG_DIST_NUMBER_1849', 'CONG_DIST_NUMBER_1853', 'CONG_DIST_NUMBER_1857', 'X824_1_G_PRES_0020_VOTE', 'X824_1_G_PRES_0611_VOTE', 'X824_1_G_PRES_9999_VOTE', 'X824_1_G_PRES_TOTAL_VOTE', 'X824_2_G_GOV_0012_VOTE', 'X824_2_G_GOV_0200_VOTE', 'X824_2_G_GOV_1063_VOTE', 'X824_2_G_GOV_TOTAL_VOTE', 'X825_2_G_GOV_0001_VOTE', 'X825_2_G_GOV_0012_VOTE', 'X825_2_G_GOV_0659_VOTE', 'X825_2_G_GOV_0659_VOTE.1', 'X825_2_G_GOV_0659_VOTE.2', 'X825_2_G_GOV_TOTAL_VOTE', 'X825_3_M_H_AL_9001_VOTE', 'X825_3_M_H_AL_9002_VOTE', 'X825_3_M_H_AL_9003_VOTE', 'X825_3_M_H_AL_9004_VOTE', 'X825_3_M_H_AL_9005_VOTE', 'X825_3_M_H_AL_9006_VOTE', 'X825_3_M_H_AL_9007_VOTE', 'X825_3_M_H_AL_9008_VOTE', 'X825_3_M_H_AL_9009_VOTE', 'X825_3_M_H_AL_9010_VOTE', 'X825_3_M_H_AL_9011_VOTE', 'X825_3_M_H_AL_901

In [11]:
df2 = pd.read_csv(path.join(raw_data_dir, "./ICPSR_election_returns/DS0002/00001-0002-Data.csv"))

In [14]:
df2.columns

Index(['ICPR_STATE_CODE', 'COUNTY_NAME', 'IDENTIFICATION_NUMBER',
       'CONG_DIST_NUMBER_1861', 'CONG_DIST_NUMBER_1863',
       'CONG_DIST_NUMBER_1865', 'CONG_DIST_NUMBER_1867',
       'CONG_DIST_NUMBER_1869', 'CONG_DIST_NUMBER_1871',
       'CONG_DIST_NUMBER_1872',
       ...
       'X904_3_G_H_AL_9999_VOTE', 'X904_3_G_H_AL_TOTAL_VOTE',
       'X904_3_G_CONG_0100_VOTE', 'X904_3_G_CONG_0200_VOTE',
       'X904_3_G_CONG_0341_VOTE', 'X904_3_G_CONG_0361_VOTE',
       'X904_3_G_CONG_0380_VOTE', 'X904_3_G_CONG_0505_VOTE',
       'X904_3_G_CONG_9999_VOTE', 'X904_3_G_CONG_TOTAL_VOTE'],
      dtype='object', length=513)

In [12]:
df2.head(10)

| | ICPR_STATE_CODE | COUNTY_NAME | IDENTIFICATION_NUMBER | CONG_DIST_NUMBER_1861 | CONG_DIST_NUMBER_1863 | CONG_DIST_NUMBER_1865 | CONG_DIST_NUMBER_1867 | CONG_DIST_NUMBER_1869 | CONG_DIST_NUMBER_1871 | CONG_DIST_NUMBER_1872 | ... | X904_3_G_H_AL_9999_VOTE | X904_3_G_H_AL_TOTAL_VOTE | X904_3_G_CONG_0100_VOTE | X904_3_G_CONG_0200_VOTE | X904_3_G_CONG_0341_VOTE | X904_3_G_CONG_0361_VOTE | X904_3_G_CONG_0380_VOTE | X904_3_G_CONG_0505_VOTE | X904_3_G_CONG_9999_VOTE | X904_3_G_CONG_TOTAL_VOTE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | FAIRFIELD | 10 | 4 | 4 | 4 | 4 | 4 | 4 | 0 | ... | 0 | 40279 | 16138 | 23114 | 0 | 193 | 565 | 191 | 5 | 40206 |
| 1 | 1 | HARTFORD | 30 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 3 | 41394 | 16391 | 23459 | 115 | 341 | 833 | 155 | 15 | 41309 |
| 2 | 1 | LITCHFIELD | 50 | 4 | 4 | 4 | 4 | 4 | 4 | 0 | ... | 0 | 13654 | 4622 | 8708 | 0 | 174 | 74 | 0 | 34 | 13612 |
| 3 | 1 | MIDDLESEX | 70 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | ... | 0 | 8313 | 3192 | 4957 | 9 | 92 | 52 | 1 | 1 | 8304 |
| 4 | 1 | NEW HAVEN | 90 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | ... | 25 | 56482 | 21487 | 31875 | 210 | 324 | 2327 | 172 | 30 | 56425 |
| 5 | 1 | NEW LONDON | 110 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | ... | 0 | 17779 | 6905 | 10584 | 0 | 168 | 132 | 40 | 0 | 17829 |
| 6 | 1 | TOLLAND | 130 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 4 | 5054 | 1827 | 2904 | 0 | 60 | 218 | 29 | 3 | 5041 |
| 7 | 1 | WINDHAM | 150 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | ... | 0 | 7969 | 2813 | 4957 | 0 | 102 | 63 | 44 | 0 | 7979 |
