**Date:** Dec 17, 2024 
**Author:** Revekka Gershovich
**Purpose:** Align and merge ICPSR dataset (1834-1975) and Klarner dataset (1935-2011)
**Preceded by:** Cleaning_icpsr16_partisan_composition.ipynb

I have a different data source available from **Carl Klarner** for the years 1935-2011. It is in the folder Klarner_stateComposition inside raw_data_dir. The folder contains two datasets. The one I load as klarner1 contains the actual data cleaned by Klarner, while klarner2 documents the sources from which the dataset was derived, along with several parameters describing how the various sources recorded the various data points. All of this is explained in the Word documents in the same folder. That documentation also explains how odd states are handled.

The paper in the folder discusses problems such as legislature switches mid-session and the biases that come from a bad measure of party control. We have to accept many of those problems because they are only solved in this dataset from 1935 onward, and for earlier years we have to use a less reliable source.

### My next steps: 
1. Bring the Klarner data into the same format as the rest of the data, which involves renaming variables and computing the proportions of Democrats and Republicans in session, i.e. across both legislative chambers.
2. Filter both the ICPSR and Klarner datasets to the overlapping years, i.e. 1935-1975, and merge them to see the discrepancies, so as to resolve, or at least identify, any inconsistencies in how the two datasets are coded.
3. Drop the years after 1935 from the ICPSR dataset, since it is the less reliable of the two sources, and append the Klarner dataset to it.
4. Append the ncsl_state_composition data, which I have for the years after 2011, once all of the above is done.
5. Finally, append the data on all US governors that I downloaded from https://github.com/jacobkap/governors, and make sure that the rest of my data is consistent with it.
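
Steps 3 and 4 both amount to splicing two panels at a cutoff year. The following is a minimal sketch with toy data, assuming the two frames already share the same columns. The helper `splice_panels`, the cutoff value, and the toy values are hypothetical and are not part of the actual pipeline.

```python
import pandas as pd

def splice_panels(early: pd.DataFrame, late: pd.DataFrame, cutoff: int) -> pd.DataFrame:
    """Keep `early` rows strictly before `cutoff` and `late` rows from `cutoff` on."""
    kept_early = early[early["year"] < cutoff]
    kept_late = late[late["year"] >= cutoff]
    out = pd.concat([kept_early, kept_late], ignore_index=True)
    return out.sort_values(["state_abbrev", "year"]).reset_index(drop=True)

# Toy stand-ins for the ICPSR and Klarner panels (values are made up)
icpsr_toy = pd.DataFrame(
    {"year": [1934, 1935, 1936], "state_abbrev": ["AZ"] * 3, "gov_party": [1, 1, 1]}
)
klarner_toy = pd.DataFrame(
    {"year": [1935, 1936, 1937], "state_abbrev": ["AZ"] * 3, "gov_party": [1, 1, 2]}
)

spliced = splice_panels(icpsr_toy, klarner_toy, cutoff=1935)
print(spliced)
```

Splitting on a single cutoff guarantees that no state-year appears twice, which is worth asserting after each append.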

In [38]:
import os
import os.path as path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [39]:
parent_dir = os.path.abspath("/Users/revekkagershovich/Dropbox (MIT)/StateLaws")
os.chdir(parent_dir)
assert os.path.exists(parent_dir), "parent_dir does not exist"
intermed_data_dir = "./2_data/2_intermediate/political_data"
assert os.path.exists(intermed_data_dir), "Data directory does not exist"
raw_data_dir = "./2_data/1_raw/political_data/all_partisanComposition"
assert os.path.exists(raw_data_dir), "Data directory does not exist"

In [40]:
icpsr = pd.read_csv(os.path.join(intermed_data_dir, "icpsr.csv"))

# Cleaning Klarner Dataset

In [41]:
# # Keeping this cell so that you can easily load the data with info re: data sources

# klarner2 = pd.read_excel(os.path.join(raw_data_dir, "Klarner_partisan_composition/StatePartisanBalance1934to2011_SourceFiles_2011_05_24.xlsx"))

In [42]:
# Load Carl Klarner's dataset for years 1934-2011
klarner1_0 = pd.read_excel(os.path.join(raw_data_dir, "Klarner_partisan_composition/Partisan_Balance_For_Use2011_06_09b.xlsx"))

# Define columns to check for missing data
columns_to_check = [
    'govparty_c', 'sen_dem_prop_all', 'sen_rep_prop_all', 'hs_dem_prop_all', 'hs_rep_prop_all',
    'sen_dem_in_sess', 'sen_rep_in_sess', 'sen_tot_in_sess',
    'hs_dem_in_sess', 'hs_rep_in_sess', 'hs_tot_in_sess'
]

# Create a boolean mask identifying rows where all `columns_to_check` are NA
mask = klarner1_0[columns_to_check].isna().all(axis=1)

identifiers = ['year', 'election_year', 'state', 'stateno', 'fips']

# Create a boolean mask identifying rows where all columns other than identifiers are NA
# (kept for reference; it is not used below)
NA_mask = klarner1_0.loc[:, ~klarner1_0.columns.isin(identifiers)].isna().all(axis=1)

# Select rows where all `columns_to_check` are NA (this uses `mask`, not `NA_mask`)
NaN_data = klarner1_0[mask]

# Inspect the distribution of `election_year` for NaN data
print("Data for how many states is missing each year?")
print(NaN_data['election_year'].value_counts())

Data for how many states is missing each year?
election_year
2011    50
2012    50
2013    50
2014    50
1934    50
1933    49
1935    45
1942     2
1946     2
1945     2
1944     2
1943     2
1938     2
1941     2
1940     2
1939     2
1937     2
1936     2
1947     2
Name: count, dtype: int64


From the output it is clear that for years 2011-2014 and 1934, all 50 states formally exist in the data but all variables of interest apart from the identifier columns are missing. 
Most data also seems to be missing for 1933 and 1935, with 49 and 45 states affected respectively (or not all states were present in the data to begin with). For 1936-1947, data is missing for two states. I will now find out which ones. 

In [43]:
# Define years of interest to filter rows to drop
years_of_interest = [1942, 1946, 1945, 1944, 1943, 1938, 1941, 1940, 1939, 1937, 1936, 1947]

# Filter rows to drop for the specific years of interest
filtered_rows = NaN_data[NaN_data['election_year'].isin(years_of_interest)]

# Group by `election_year` and list unique states for each year
states_by_year = filtered_rows.groupby('election_year')['state'].unique()

# Display the states grouped by `election_year`
print(states_by_year)  # Alaska and Hawaii data are missing for years 1936-1947

election_year
1936    [Alaska, Hawaii]
1937    [Alaska, Hawaii]
1938    [Alaska, Hawaii]
1939    [Alaska, Hawaii]
1940    [Alaska, Hawaii]
1941    [Alaska, Hawaii]
1942    [Alaska, Hawaii]
1943    [Alaska, Hawaii]
1944    [Alaska, Hawaii]
1945    [Alaska, Hawaii]
1946    [Alaska, Hawaii]
1947    [Alaska, Hawaii]
Name: state, dtype: object


In [44]:

# Remove rows where all `columns_to_check` are NA from `klarner1_0`
klarner_noNAs = klarner1_0[~mask]

# Print the count of `election_year` values, ordered by year
print("Number of states in data for each election year:")
print(klarner_noNAs['election_year'].value_counts().sort_index())

# When we drop all rows with NAs in every non-identifier column of interest,
# we drop all observations before 1935 and after 2010, and in 1935 we only have five states.

# Filter data for 1935 and list unique states
states_1935 = klarner_noNAs[klarner_noNAs['election_year'] == 1935]['state'].unique()

# Display the result
print(states_1935)

Number of states in data for each election year:
election_year
1935     5
1936    48
1937    48
1938    48
1939    48
        ..
2006    50
2007    50
2008    50
2009    50
2010    50
Name: count, Length: 76, dtype: int64
['Kentucky' 'Mississippi' 'New Jersey' 'Virginia' 'New York']


In [45]:
# Keeping only the columns that are needed for the analysis
klarner1 = klarner_noNAs[['state', 'election_year', 'sen_dem_prop_all', 'sen_rep_prop_all', 'hs_dem_prop_all', 
'hs_rep_prop_all', 'sen_dem_in_sess', 'sen_rep_in_sess', 'sen_tot_in_sess', 'hs_dem_in_sess', 
'hs_rep_in_sess', 'hs_tot_in_sess', 'govparty_c']].copy()

# Dictionary mapping state names to abbreviations
state_to_abbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
    'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
    'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD', 'Massachusetts': 'MA',
    'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO', 'Montana': 'MT',
    'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM', 
    'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK', 
    'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC', 
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT', 
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'}

# Add a new column to the DataFrame with the abbreviations
klarner1.loc[:, 'state_abbrev'] = klarner1['state'].map(state_to_abbrev)

# Since in ICPSR I only have state abbreviations, I will drop the column containing the full state names
klarner1 = klarner1.drop(columns=['state'])

# Rename the columns to match the ICPSR dataset
klarner1 = klarner1.rename(columns={
    'election_year': 'year',
    'govparty_c': 'gov_party',
    'sen_dem_prop_all': 'dem_upphse',
    'sen_rep_prop_all': 'rep_upphse',
    'hs_dem_prop_all': 'dem_lowhse',
    'hs_rep_prop_all': 'rep_lowhse',
})

# Keeping only the final dataset columns and the columns needed to calculate the share of Democrats and Republicans in the session
klarner1 = klarner1[[
    'year', 'state_abbrev', 'gov_party',
    'dem_upphse', 'rep_upphse', 'dem_lowhse', 'rep_lowhse',
    'sen_dem_in_sess', 'sen_rep_in_sess', 'sen_tot_in_sess',
    'hs_dem_in_sess', 'hs_rep_in_sess', 'hs_tot_in_sess'
]]

# Calculate the overall share of Democratic and Republican seats in the session (upper + lower house) - this measure is not available in the dataset
klarner1['shr_dem_in_sess'] = (klarner1['sen_dem_in_sess'] + klarner1['hs_dem_in_sess']) / (klarner1['hs_tot_in_sess'] + klarner1['sen_tot_in_sess'])
klarner1['shr_rep_in_sess'] = (klarner1['sen_rep_in_sess'] + klarner1['hs_rep_in_sess']) / (klarner1['hs_tot_in_sess'] + klarner1['sen_tot_in_sess'])

# Drop the columns that were used for calculating overall share of dem/rep seats in the session and are not needed anymore
klarner1 = klarner1.drop(columns=['sen_dem_in_sess', 'sen_rep_in_sess', 'sen_tot_in_sess', 'hs_dem_in_sess', 'hs_rep_in_sess', 'hs_tot_in_sess'])

# In the Klarner dataset Democrats are coded as 1, as in my ICPSR mapping, but Republicans are coded as 0, not 2. 
# I recode them to match the ICPSR data. A value of 0.5 signifies a non-major-party governor. I dropped those 
# values in the ICPSR data, and here they become NaN automatically because the mapping has no key for them.
klarner1['gov_party'] = klarner1['gov_party'].map({1.0: 1, 0.0: 2})

print(klarner1.sample(5, random_state = 44))

      year state_abbrev  gov_party  dem_upphse  rep_upphse  dem_lowhse  \
473   1962           ID        2.0    0.477273    0.522727    0.460317   
1705  1964           UT        1.0    0.555556    0.444444    0.565217   
1683  2006           TX        2.0    0.354839    0.645161    0.460000   
366   1969           GA        1.0    0.875000    0.125000    0.856410   
1728  2003           UT        2.0    0.241379    0.758621    0.253333   

      rep_lowhse  shr_dem_in_sess  shr_rep_in_sess  
473     0.539683         0.467290         0.532710  
1705    0.434783         0.562500         0.437500  
1683    0.540000         0.441989         0.558011  
366     0.138461         0.860558         0.135458  
1728    0.746667         0.250000         0.750000  
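
The recoding of `gov_party` relies on how `Series.map` treats values that have no key in the dictionary. The following is a small sketch with toy codes (not the real data) to make that behaviour explicit.

```python
import pandas as pd

# Toy Klarner-style codes: 1.0 = Democrat, 0.0 = Republican, 0.5 = non-major party
gov = pd.Series([1.0, 0.0, 0.5, None])

# Values absent from the dict (here 0.5) and missing inputs both map to NaN
recoded = gov.map({1.0: 1, 0.0: 2})
print(recoded.tolist())
```

Because unmapped values turn into NaN silently, counting `recoded.isna()` before and after the mapping is a cheap way to see how many non-major-party governors were dropped.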


In [46]:
# Check the number of unique states
print(f"Number of unique states: {klarner1['state_abbrev'].nunique()}")

# Assertions
assert klarner1['state_abbrev'].nunique() == 50, "There should be 50 states."

assert klarner1['year'].min() == 1935, "The minimum year should be 1935."
assert klarner1['year'].max() == 2010, "The maximum year should be 2010 because the election year should be off-set one year back."
assert klarner1['year'].nunique() == 76, "There should be 76 unique years."

assert klarner1['gov_party'].nunique() == 2, "There should be 2 unique parties."

# Check if all values in 'gov_party' are valid
valid_values = {1, 2}  
assert klarner1['gov_party'].dropna().isin(valid_values).all(), "All values in 'gov_party' should be 1, 2, or NaN."

Number of unique states: 50


Now that the Klarner data is cleaned and brought to the standard format, I will filter the ICPSR and Klarner data to contain only the years that are in both datasets, i.e. 1935-1975. The idea is to make sure that this data is the same in both datasets, and then apply the same coding to the rest of the data so that it is consistent across all datasets.
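
A filter that states both bounds explicitly does not depend on where each dataset happens to start or end. The following is a minimal sketch on toy data. The helper `keep_overlap` and the constants `START` and `END` are illustrative names, not part of the notebook.

```python
import pandas as pd

START, END = 1935, 1975  # overlap window stated in the text

def keep_overlap(df: pd.DataFrame) -> pd.DataFrame:
    # Series.between is inclusive on both ends by default
    return df[df["year"].between(START, END)]

toy = pd.DataFrame({"year": [1934, 1935, 1975, 1976], "state_abbrev": ["AZ"] * 4})
print(keep_overlap(toy)["year"].tolist())
```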

# Comparing ICPSR and Klarner Datasets

In [47]:
# Filter ICPSR and Klarner data to keep only years in both datasets, i.e. 1935-1975
icpsr_filt = icpsr[icpsr['year'] >= 1935]
klarner_filt = klarner1[klarner1['year'] <= 1975]

## Comparing datasets for even years only

In [48]:
# Filtering both datasets to even years only, to see if they at least align for the even years, 
# since odd years are trickier.
icpsr_filt_even = icpsr_filt[icpsr_filt['year'] % 2 == 0]
klarner_filt_even = klarner_filt[klarner_filt['year'] % 2 == 0]

In [49]:
# Merging datasets only for even years to be able to compare the data from both sources
merged_even = pd.merge(klarner_filt_even, icpsr_filt_even, 
on=['year', 'state_abbrev'], suffixes=('_klarner', '_icpsr'), how='outer')

# Reorder the columns
merged_even = merged_even[['year', 'state_abbrev', 'gov_party_klarner', 'gov_party_icpsr',
                           'dem_upphse_klarner', 'dem_upphse_icpsr', 'rep_upphse_klarner', 
                           'rep_upphse_icpsr', 'dem_lowhse_klarner', 'dem_lowhse_icpsr', 
                           'rep_lowhse_klarner', 'rep_lowhse_icpsr', 'shr_dem_in_sess_klarner', 
                           'shr_dem_in_sess_icpsr', 'shr_rep_in_sess_klarner', 'shr_rep_in_sess_icpsr']]

In [50]:
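# Note: `!=` treats NaN as unequal to everything, including NaN, so rows with a
# missing value in either source are counted as mismatches here.
unequal_rows_gov = merged_even[merged_even['gov_party_klarner'] != merged_even['gov_party_icpsr']]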
print(f"There are {unequal_rows_gov.shape[0]} rows where 'gov_party_klarner' is not equal to 'gov_party_icpsr'.")

There are 288 rows where 'gov_party_klarner' is not equal to 'gov_party_icpsr'.
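
The count above is produced with `!=`, under which NaN is unequal to everything, including NaN. Rows where a source is simply missing are therefore counted together with genuine disagreements. The following is a minimal sketch with toy values showing one way to separate the two cases. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gov_party_klarner": [1.0, 2.0, np.nan, np.nan],
    "gov_party_icpsr":   [1.0, 1.0, np.nan, 2.0],
})
a, b = toy["gov_party_klarner"], toy["gov_party_icpsr"]

naive = a != b                            # NaN != NaN evaluates to True
both_present = a.notna() & b.notna()
true_conflict = both_present & (a != b)   # both sources report a party and disagree
one_missing = a.isna() ^ b.isna()         # exactly one source is missing

print(naive.sum(), true_conflict.sum(), one_missing.sum())
```

Only the `true_conflict` rows need to be checked against an external source; the `one_missing` rows are a coverage question rather than a coding disagreement.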


In [51]:
# Here is a dataset to double-check the problematic years with links to sources generated by ChatGPT
# Create the DataFrame
data = {
    "year": [1938, 1942, 1948, 1950, 1954, 1958, 1964, 1966, 1968, 1968],
    "state_abbrev": ["OH", "NY", "WY", "CO", "NY", "NY", "UT", "UT", "MD", "UT"],
    "governor": [
        "John W. Bricker", "Thomas E. Dewey", "Arthur G. Crane", "Daniel I.J. Thornton",
        "W. Averell Harriman", "Nelson A. Rockefeller", "Cal Rampton", "Cal Rampton",
        "Marvin Mandel", "Cal Rampton"
    ],
    "party_code": [2, 2, 2, 2, 1, 2, 1, 1, 1, 1],
    "source_url": [
        "https://en.wikipedia.org/wiki/1938_Ohio_gubernatorial_election",
        "https://en.wikipedia.org/wiki/1942_New_York_state_election",
        "https://en.wikipedia.org/wiki/List_of_governors_of_Wyoming",
        "https://en.wikipedia.org/wiki/1950_Colorado_gubernatorial_election",
        "https://en.wikipedia.org/wiki/1954_New_York_state_election",
        "https://en.wikipedia.org/wiki/1958_New_York_state_election",
        "https://en.wikipedia.org/wiki/1964_Utah_gubernatorial_election",
        "https://en.wikipedia.org/wiki/List_of_governors_of_Utah",
        "https://en.wikipedia.org/wiki/List_of_governors_of_Maryland",
        "https://en.wikipedia.org/wiki/List_of_governors_of_Utah"
    ]
}

gov_mismatch = pd.DataFrame(data)

# Perform an inner merge to compare rows based on 'year' and 'state_abbrev'
comparison = unequal_rows_gov.merge(gov_mismatch, on=['year', 'state_abbrev'], how='inner')

# Identify mismatched rows where 'gov_party_klarner' does not match 'party_code'
mismatched_rows_klarner = comparison[comparison['gov_party_klarner'] != comparison['party_code']]

# Display the mismatched rows
print(f"Number of rows that are mismatched between Klarner and the correct data: {mismatched_rows_klarner.shape[0]}")

# Identify mismatched rows where 'gov_party_icpsr' does not match 'party_code'
mismatched_rows_icpsr = comparison[comparison['gov_party_icpsr'] != comparison['party_code']]

# Display the mismatched rows
print(f"Number of rows that are mismatched between ICPSR and the correct data: {mismatched_rows_icpsr.shape[0]}")

print("In all 10 spot-checked rows Klarner matches the reference and ICPSR does not, which suggests that Klarner is the more accurate dataset and that these mismatches are due to errors in ICPSR data.")

Number of rows that are mismatched between Klarner and the correct data: 0
Number of rows that are mismatched between ICPSR and the correct data: 10
In all 10 spot-checked rows Klarner matches the reference and ICPSR does not, which suggests that Klarner is the more accurate dataset and that these mismatches are due to errors in ICPSR data.


In [52]:
# Identify rows where 'dem_upphse_klarner' and 'dem_upphse_icpsr' are not close, excluding rows where both are NaN
unequal_rows_dem_upphse = merged_even[
    ~np.isclose(merged_even['dem_upphse_klarner'], merged_even['dem_upphse_icpsr'], atol=0.05) &
    ~(merged_even['dem_upphse_klarner'].isna() & merged_even['dem_upphse_icpsr'].isna())
]

unequal_rows_dem_upphse = unequal_rows_dem_upphse[['year', 'state_abbrev', 'dem_upphse_klarner', 'dem_upphse_icpsr']]

# Print the number of rows where the values are not equal
print(f"There are {unequal_rows_dem_upphse.shape[0]} rows where 'dem_upphse_klarner' is not equal to 'dem_upphse_icpsr'.")

# Display the mismatched rows
print(unequal_rows_dem_upphse)

There are 131 rows where 'dem_upphse_klarner' is not equal to 'dem_upphse_icpsr'.
     year state_abbrev  dem_upphse_klarner  dem_upphse_icpsr
0    1936           AL                 NaN          1.000000
14   1936           KY            0.684211               NaN
22   1936           MS            1.000000               NaN
42   1936           VA            0.950000               NaN
44   1936           WA            0.804348          0.891304
..    ...          ...                 ...               ...
912  1972           MS            0.961538               NaN
918  1972           NJ            0.400000               NaN
924  1972           OR            0.600000          0.533333
932  1972           VA            0.825000               NaN
937  1972           WY            0.433333          0.366667

[131 rows x 4 columns]
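
The same tolerance check is repeated for several columns, so it can be wrapped in a small helper. The sketch below is an assumption about how that might look, not the notebook's own code. `mismatched` is a hypothetical name, and the AZ row in the toy frame is made up to show the NaN/NaN case. It uses the `equal_nan` argument of `np.isclose` instead of a separate "both are NaN" mask.

```python
import numpy as np
import pandas as pd

def mismatched(df: pd.DataFrame, col: str, atol: float = 0.05) -> pd.DataFrame:
    """Rows where `<col>_klarner` and `<col>_icpsr` are not within `atol`.

    equal_nan=True treats NaN/NaN pairs as matching; a NaN on only one
    side still counts as a mismatch.
    """
    k_name, i_name = f"{col}_klarner", f"{col}_icpsr"
    k = df[k_name].to_numpy(dtype=float)
    i = df[i_name].to_numpy(dtype=float)
    close = np.isclose(k, i, atol=atol, equal_nan=True)
    return df.loc[~close, ["year", "state_abbrev", k_name, i_name]]

toy = pd.DataFrame({
    "year": [1936, 1936, 1936, 1936],
    "state_abbrev": ["AL", "KY", "WA", "AZ"],
    "dem_upphse_klarner": [np.nan, 0.684211, 0.804348, np.nan],
    "dem_upphse_icpsr":   [1.0, np.nan, 0.891304, np.nan],
})
print(mismatched(toy, "dem_upphse"))
```

Calling the helper in a loop over `dem_upphse`, `rep_upphse`, `dem_lowhse` and `rep_lowhse` would keep the tolerance and the printed column names consistent across checks.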


In [53]:
# We can see that out of the 6 states for which a mismatch occurs most often, 
# five have odd-year state elections, i.e. MS, VA, KY, NJ, and LA, which 
# suggests that they are coded differently in the two datasets.
unequal_rows_dem_upphse['state_abbrev'].value_counts()

state_abbrev
MS    19
VA    19
KY    18
NJ    12
MN    12
LA     9
HI     7
MD     7
AK     5
VT     3
NV     2
IN     2
ME     2
ID     1
UT     1
RI     1
OR     1
CT     1
ND     1
AL     1
CO     1
WV     1
WI     1
TN     1
NH     1
WA     1
WY     1
Name: count, dtype: int64

In [54]:
unequal_rows_dem_upphse_odd_states = unequal_rows_dem_upphse[unequal_rows_dem_upphse['state_abbrev'].isin(['MS', 'VA', 'KY', 'NJ', 'LA'])]

# # Display the mismatched rows for odd-year state elections
# print(unequal_rows_dem_upphse_odd_states)

print(unequal_rows_dem_upphse_odd_states['dem_upphse_icpsr'].value_counts(dropna = False))
# We can see that the ICPSR data is missing for these states in even years where Klarner has values, 
# meaning that in ICPSR the data is only available for the actual election years.

print(unequal_rows_dem_upphse_odd_states.dropna()) 
# Dropping the NaN rows leaves an empty DataFrame, so every mismatch for these states comes from a missing ICPSR value.

dem_upphse_icpsr
NaN    77
Name: count, dtype: int64
Empty DataFrame
Columns: [year, state_abbrev, dem_upphse_klarner, dem_upphse_icpsr]
Index: []


In [55]:
unequal_rows_dem_upphse_only_even_states = unequal_rows_dem_upphse[~unequal_rows_dem_upphse['state_abbrev'].isin(['MS', 'VA', 'KY', 'NJ', 'LA'])]

# Display the mismatched rows for even-year state elections
print(unequal_rows_dem_upphse_only_even_states.head())

# Mostly those differences are small, and are probably caused by adjustments in the Klarner dataset or mistakes/missing data in icpsr data.

     year state_abbrev  dem_upphse_klarner  dem_upphse_icpsr
0    1936           AL                 NaN          1.000000
44   1936           WA            0.804348          0.891304
113  1940           MD            0.793103               NaN
171  1942           NH            0.291667          0.375000
183  1942           TN            0.848485          0.909091


In [56]:
# Filter and print rows where 'dem_upphse_klarner' is NA
print(unequal_rows_dem_upphse_only_even_states[unequal_rows_dem_upphse_only_even_states['dem_upphse_klarner'].isna()])

merged_even[(merged_even['state_abbrev'] == 'AL') & (merged_even['year'] == 1936)]

# There is missing data for Alabama in 1936 in the Klarner dataset, 
# and the ICPSR dataset has a value of 1.0 for the Democratic proportion in the upper house (0.0 Republican), which is correct. 

   year state_abbrev  dem_upphse_klarner  dem_upphse_icpsr
0  1936           AL                 NaN               1.0


Unnamed: 0,year,state_abbrev,gov_party_klarner,gov_party_icpsr,dem_upphse_klarner,dem_upphse_icpsr,rep_upphse_klarner,rep_upphse_icpsr,dem_lowhse_klarner,dem_lowhse_icpsr,rep_lowhse_klarner,rep_lowhse_icpsr,shr_dem_in_sess_klarner,shr_dem_in_sess_icpsr,shr_rep_in_sess_klarner,shr_rep_in_sess_icpsr
0,1936,AL,1.0,1.0,,1.0,,0.0,,0.990566,,0.009434,,0.992908,,0.007092


In [57]:
# Exclude rows already in unequal_rows_dem_upphse
remaining_rows = merged_even.loc[~merged_even.index.isin(unequal_rows_dem_upphse.index)]

# Identify rows where 'rep_upphse_klarner' and 'rep_upphse_icpsr' are not close, excluding NaN matches
unequal_rows_rep_upphse = remaining_rows[
    ~np.isclose(remaining_rows['rep_upphse_klarner'], remaining_rows['rep_upphse_icpsr'], atol=0.1) &
    ~(remaining_rows['rep_upphse_klarner'].isna() & remaining_rows['rep_upphse_icpsr'].isna())
]

# Display the results
print(f"There are {unequal_rows_rep_upphse.shape[0]} rows where 'rep_upphse_klarner' is not equal to 'rep_upphse_icpsr'.")
print(unequal_rows_rep_upphse[['year', 'state_abbrev', 'rep_upphse_klarner', 'rep_upphse_icpsr']])
# I double-checked that the Klarner data is correct where there is a mismatch.

There are 2 rows where 'rep_upphse_klarner' is not equal to 'rep_upphse_icpsr'.
     year state_abbrev  rep_upphse_klarner  rep_upphse_icpsr
381  1950           UT            0.304348          0.782609
945  1974           DE            0.380952          0.142857


In [58]:
unequal_rows_dem_lowhse = remaining_rows[
    ~np.isclose(remaining_rows['dem_lowhse_klarner'], remaining_rows['dem_lowhse_icpsr'], atol=0.1) &
    ~(remaining_rows['dem_lowhse_klarner'].isna() & remaining_rows['dem_lowhse_icpsr'].isna())
]

# Display the results
print(f"There are {unequal_rows_dem_lowhse.shape[0]} rows where 'dem_lowhse_klarner' is not equal to 'dem_lowhse_icpsr'.")
print(unequal_rows_dem_lowhse[['year', 'state_abbrev', 'dem_lowhse_klarner', 'dem_lowhse_icpsr']])

There are 13 rows where 'dem_lowhse_klarner' is not equal to 'dem_lowhse_icpsr'.
     year state_abbrev  dem_lowhse_klarner  dem_lowhse_icpsr
28   1936           NJ            0.350000          0.650000
38   1936           SD            0.359223          0.173333
46   1936           WV            0.765957          0.872340
239  1944           WY            0.500000          0.363636
269  1946           NM            0.612245          0.734694
337  1948           WY            0.375000          0.500000
428  1952           SD            0.026667          0.240000
437  1952           WY            0.428571          0.196429
478  1954           SD            0.240000          0.360000
578  1958           SD            0.426667          0.240000
678  1962           SD            0.226667          0.400000
902  1972           IN            0.270000          0.464646
935  1972           WI            0.626263          0.373737


## Forward Filling 1935-1975 ICPSR Dataset

In [59]:
print(icpsr_filt['year'].value_counts().sort_index().head(8))

year
1935     5
1936    42
1937     4
1938    42
1939    46
1940    42
1941     3
1942    42
Name: count, dtype: int64


We can see that the data is only available for the actual election years, and not for the odd years in between (or the even years, for states on an odd-year election cycle). I will forward-fill the ICPSR dataset for each state for each year where data is not available. 
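
The forward-fill described above can also be written with a grouped `ffill`. The following is a minimal sketch on a made-up panel, assuming one row per state-year and a single value column. The names `toy`, `grid`, `filled` and `value_cols` are illustrative.

```python
import pandas as pd

# Toy election-year-only panel (values are made up)
toy = pd.DataFrame({
    "year": [1936, 1938, 1936, 1938],
    "state_abbrev": ["AZ", "AZ", "NM", "NM"],
    "gov_party": [1.0, 2.0, 2.0, 1.0],
})

# Full state-by-year grid over the observed range of years
years = range(int(toy["year"].min()), int(toy["year"].max()) + 1)
grid = pd.MultiIndex.from_product(
    [sorted(toy["state_abbrev"].unique()), years],
    names=["state_abbrev", "year"],
).to_frame(index=False)

filled = grid.merge(toy, on=["state_abbrev", "year"], how="left")
filled = filled.sort_values(["state_abbrev", "year"]).reset_index(drop=True)

# Forward-fill within each state only; years before a state's first
# observation stay NaN because there is nothing to carry forward
value_cols = ["gov_party"]
filled[value_cols] = filled.groupby("state_abbrev")[value_cols].ffill()
print(filled)
```

Grouping by state keeps one state's values from leaking into the next, and sorting by year first makes sure each value is carried forward in time.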

In [70]:
# Create a complete grid of years and states
years = icpsr_filt['year'].unique()
states = icpsr_filt['state_abbrev'].unique()

# Create a DataFrame with all combinations
all_combos = pd.MultiIndex.from_product([years, states], names=['year', 'state_abbrev']).to_frame(index=False)

In [97]:
# Merge the complete grid with the original dataset
icpsr_complete = pd.merge(all_combos, icpsr_filt, on=['year', 'state_abbrev'], how='left')

print(
    icpsr_complete[icpsr_complete['state_abbrev'] == 'AZ']
    [['year', 'state_abbrev', 'gov_party', 'rep_upphse']]
    .sort_values(by='year', ascending=True)
    .head(3)
)

# Identify identifier columns (e.g., year and state_abbrev)
id_cols = ['year', 'state_abbrev']
gov_id_cols = ['year', 'state_abbrev', 'gov_party']

# Identify non-identifier columns
non_id_cols = [col for col in icpsr_complete.columns if col not in id_cols]
non_gov_id_cols = [col for col in icpsr_complete.columns if col not in gov_id_cols]

# Forward-fill for each state
for state in states:
    # Subset the data for the current state and sort by year
    state_data = icpsr_complete[icpsr_complete['state_abbrev'] == state].sort_values(by='year')
    
    # Forward-fill non-identifier columns
    state_data[non_id_cols] = state_data[non_id_cols].ffill()
    
    # Forward-fill non-gov_party columns
    state_data[non_gov_id_cols] = state_data[non_gov_id_cols].ffill()
    
    # Update the main DataFrame
    icpsr_complete.loc[state_data.index, non_id_cols] = state_data[non_id_cols]
    icpsr_complete.loc[state_data.index, non_gov_id_cols] = state_data[non_gov_id_cols]

# Display results for AZ
print(
    icpsr_complete[icpsr_complete['state_abbrev'] == 'AZ']
    [['year', 'state_abbrev', 'gov_party', 'rep_upphse']]
    .sort_values(by='year', ascending=True)
    .head(3)
)


      year state_abbrev  gov_party  rep_upphse
1087  1935           AZ        NaN         NaN
37    1936           AZ        1.0         0.0
1137  1937           AZ        NaN         NaN
      year state_abbrev  gov_party  rep_upphse
1087  1935           AZ        NaN         NaN
37    1936           AZ        1.0         0.0
1137  1937           AZ        1.0         0.0


In [92]:
# Merge the complete grid with the original dataset
icpsr_complete = pd.merge(all_combos, icpsr_filt, on=['year', 'state_abbrev'], how='left')

print(
    icpsr_complete[icpsr_complete['state_abbrev'] == 'AZ']
    [['year', 'state_abbrev', 'gov_party', 'rep_upphse']]
    .sort_values(by='year', ascending=True)
    .head(3)
)

# Identify identifier columns (e.g., year and state_abbrev)
id_cols = ['year', 'state_abbrev']
gov_id_cols = ['year', 'state_abbrev', 'gov_party']

# Identify non-identifier columns
non_id_cols = [col for col in icpsr_complete.columns if col not in id_cols]
non_gov_id_cols = [col for col in icpsr_complete.columns if col not in gov_id_cols] 

# Step 1: Mark rows where all non-identifier columns are NaN
icpsr_complete['all_empty_but_id'] = icpsr_complete[non_id_cols].isna().all(axis=1)

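# Note: despite the label, this prints the number of rows in which all non-identifier columns are NaN
print(f"length of non_id_cols is {icpsr_complete['all_empty_but_id'].sum()}")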

# Step 2: Forward-fill for rows with all non-identifier columns NaN
# Group by state_abbrev and fill based on the previous year
def fill_all_empty(group):
    group.loc[group['all_empty_but_id'], non_id_cols] = group[non_id_cols].ffill()
    return group

# Step 3: Apply the function to the DataFrame
icpsr_complete = icpsr_complete.groupby('state_abbrev').apply(fill_all_empty).reset_index(drop=True)

icpsr_complete = icpsr_complete.drop(columns=['all_empty_but_id'])

print(
    icpsr_complete[icpsr_complete['state_abbrev'] == 'AZ']
    [['year', 'state_abbrev', 'gov_party', 'rep_upphse']]
    .sort_values(by='year', ascending=True)
    .head(3)
)

# # Step 4: Mark rows where all non-identifier and gov_party columns are NaN
# icpsr_complete['all_empty_but_gov_id'] = icpsr_complete[non_gov_id_cols].isna().all(axis=1)

# print(f"length of non_gov_id_cols is {icpsr_complete['all_empty_but_gov_id'].sum()}")

# # Step 5: Forward-fill for rows with all non-identifier and gov_party columns NaN
# # Group by state_abbrev and fill based on the previous year
# def fill_all_empty_gov(group):
#     group.loc[group['all_empty_but_gov_id'], non_gov_id_cols] = group[non_gov_id_cols].ffill()
#     return group

# # Step 6: Apply the function to the DataFrame
# icpsr_complete = icpsr_complete.groupby('state_abbrev').apply(fill_all_empty_gov).reset_index(drop=True)

# icpsr_complete = icpsr_complete.drop(columns=['all_empty_but_gov_id'])

# print(
#     icpsr_complete[icpsr_complete['state_abbrev'] == 'AZ']
#     [['year', 'state_abbrev', 'gov_party', 'rep_upphse']]
#     .sort_values(by='year', ascending=True)
#     .head(3)
# )


      year state_abbrev  gov_party  rep_upphse
1087  1935           AZ        NaN         NaN
37    1936           AZ        1.0         0.0
1137  1937           AZ        NaN         NaN
length of non_id_cols is 1034
     year state_abbrev  gov_party  rep_upphse
141  1935           AZ        2.0         0.4
120  1936           AZ        1.0         0.0
142  1937           AZ        2.0         0.4



## Checking if after forward-filling ICPSR, Klarner and ICPSR data are aligned

In [88]:
# Merging klarner_filt, which is a dataset for the same years as icpsr_complete, to see if the data aligns
merged = pd.merge(klarner_filt, icpsr_complete, on=['year', 'state_abbrev'], suffixes=('_klarner', '_icpsr_comp'), how='outer')
merged = pd.merge(merged, icpsr_filt, on=['year', 'state_abbrev'], suffixes=('_klarner', '_icpsr_filt'), how='outer')

print(merged.columns)

Index(['year', 'state_abbrev', 'gov_party_klarner', 'dem_upphse_klarner',
       'rep_upphse_klarner', 'dem_lowhse_klarner', 'rep_lowhse_klarner',
       'shr_dem_in_sess_klarner', 'shr_rep_in_sess_klarner',
       'gov_party_icpsr_comp', 'dem_upphse_icpsr_comp',
       'rep_upphse_icpsr_comp', 'dem_lowhse_icpsr_comp',
       'rep_lowhse_icpsr_comp', 'shr_dem_in_sess_icpsr_comp',
       'shr_rep_in_sess_icpsr_comp', 'gov_party', 'dem_upphse', 'rep_upphse',
       'dem_lowhse', 'rep_lowhse', 'shr_dem_in_sess', 'shr_rep_in_sess'],
      dtype='object')
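
The unsuffixed `gov_party`, `dem_upphse`, ... columns at the end of the index come from the second merge. pandas applies `suffixes` only to column names that collide, and after the first merge nothing collides any more. The following is a small sketch with toy frames. The helper `tag` is hypothetical and shows one way to label every column by its origin before merging.

```python
import pandas as pd

keys = ["year", "state_abbrev"]
left = pd.DataFrame({"year": [1936], "state_abbrev": ["AZ"], "gov_party": [1.0]})
right = left.copy()
third = left.copy()

step1 = left.merge(right, on=keys, suffixes=("_klarner", "_icpsr_comp"))
# No column names overlap any more, so these suffixes are silently unused
step2 = step1.merge(third, on=keys, suffixes=("_klarner", "_icpsr_filt"))
print(step2.columns.tolist())

def tag(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Append `_<label>` to every non-key column."""
    return df.rename(columns={c: f"{c}_{label}" for c in df.columns if c not in keys})

explicit = (
    tag(left, "klarner")
    .merge(tag(right, "icpsr_comp"), on=keys, how="outer")
    .merge(tag(third, "icpsr_filt"), on=keys, how="outer")
)
print(explicit.columns.tolist())
```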


In [89]:
# Count rows where 'gov_party_klarner' differs from the forward-filled 'gov_party_icpsr_comp' (NaN != NaN also counts as a mismatch)
unequal_rows_gov = merged[merged['gov_party_klarner'] != merged['gov_party_icpsr_comp']]
print(f"There are {unequal_rows_gov.shape[0]} rows where 'gov_party_klarner' is not equal to 'gov_party_icpsr'.")
# print(unequal_rows_gov)
print(unequal_rows_gov['year'].value_counts().sort_index().head())

There are 764 rows where 'gov_party_klarner' is not equal to 'gov_party_icpsr'.
year
1935    50
1936    11
1937    27
1938     8
1939    22
Name: count, dtype: int64


In [90]:
print(
    merged[merged['state_abbrev'] == 'AZ']
    [['year', 'state_abbrev', 'gov_party_klarner', 'gov_party_icpsr_comp', 'gov_party']]
    .sort_values(by='year', ascending=True)
    .head(10)
)

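# This shows that something goes wrong with the forward filling: if gov_party for AZ is 1 in 1936, then 
# in 1937 it should be 1 as well, but it is 2. A likely cause is that the rows within each state are not sorted
# by year before ffill (the grid is built from .unique(), which keeps the order of appearance), so 1937 is
# filled from whichever row precedes it in the frame rather than from 1936.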

     year state_abbrev  gov_party_klarner  gov_party_icpsr_comp  gov_party
3    1935           AZ                NaN                   2.0        NaN
53   1936           AZ                1.0                   1.0        1.0
103  1937           AZ                1.0                   2.0        NaN
153  1938           AZ                1.0                   1.0        1.0
203  1939           AZ                1.0                   1.0        1.0
253  1940           AZ                1.0                   1.0        1.0
303  1941           AZ                1.0                   2.0        NaN
353  1942           AZ                1.0                   1.0        1.0
403  1943           AZ                1.0                   2.0        NaN
453  1944           AZ                1.0                   1.0        1.0
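
One plausible explanation for the pattern above is row order. `ffill` carries values forward along the rows as they are stored, not along the calendar. The following minimal reproduction uses toy values and assumes, as the row indices printed earlier for AZ suggest, that the odd years sit after the even years in the frame.

```python
import numpy as np
import pandas as pd

# Rows stored with even years first, then odd years (toy values)
toy = pd.DataFrame({
    "year": [1936, 1938, 1974, 1935, 1937],
    "state_abbrev": ["AZ"] * 5,
    "gov_party": [1.0, 1.0, 2.0, np.nan, np.nan],
})

# ffill follows row order, so 1935 and 1937 inherit the 1974 value
unsorted_fill = toy.assign(
    gov_party=toy.groupby("state_abbrev")["gov_party"].ffill()
)

# Sorting by year first carries each value forward in time instead
ordered = toy.sort_values(["state_abbrev", "year"])
sorted_fill = ordered.assign(
    gov_party=ordered.groupby("state_abbrev")["gov_party"].ffill()
)

print(unsorted_fill.set_index("year")["gov_party"].to_dict())
print(sorted_fill.set_index("year")["gov_party"].to_dict())
```

If this is indeed the cause, sorting by `['state_abbrev', 'year']` before the grouped fill should make 1937 inherit the 1936 value and leave 1935 empty.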
