# PPOL 565 Final Project: Census Voting Data
### Colette Yeager

### Background
- Past studies of voting data
- Use in campaigns to determine how to get more people to vote

### The Data

#### Current Population Survey: Voting Supplement
- "Provides demographic information on persons who did and did not register to vote"
- Years 1994 to 2020, every two years
- Model: Voted or Registered to Vote as the dependent variable
- Independent variables (features): Metropolitan status, Geographic region, Race, Gender, Age, Marital status, Number of household members, Military status, Education completed, and Family income

#### Issues that arose 
- Variable names differed between some survey years

In [32]:
import pandas as pd
import numpy as np
import requests

year = '1994'
years = ['1994', '1996', '1998', '2000', '2002', '2004', '2006', '2008', '2010', '2012', '2014', '2016', '2018', '2020']
statedict = {'AL': '1', 'AK': '2', 'AZ': '4', 'AR': '5', 'CA': '6', 'CO': '8', 'CT': '9', 'DE': '10', 'DC': '11', 'FL': '12',
             'GA': '13', 'HI': '15', 'ID': '16', 'IL': '17', 'IN': '18', 'IA': '19', 'KS': '20', 'KY': '21', 'LA': '22', 
             'ME':'23', 'MD': '24', 'MA': '25', 'MI': '26', 'MN': '27', 'MS': '28', 'MO': '29', 'MT': '30', 'NE': '31', 'NV': 
             '32', 'NH': '33', 'NJ': '34', 'NM': '35', 'NY': '36', 'NC': '37', 'ND': '38', 'OH': '39', 'OK': '40', 'OR': '41', 
             'PA': '42', 'RI': '44', 'SC': '45', 'SD': '46', 'TN': '47', 'TX': '48', 'UT': '49', 'VT': '50', 'VA': '51', 'WA': 
             '53', 'WV': '54', 'WI': '55', 'WY': '56'}
rvsestatedict = {v: k for k, v in statedict.items()}
STATE = '1'
url = f"https://api.census.gov/data/{year}/cps/voting/nov"

In [33]:
# Variable names differ across survey years, so choose the list that matches `year`.
if year == '1994':
    param_list = ["PES3,PES4,GEMETSTA,GEREG,PERACE,PRHSPNON,"+
                  "PESEX,PRTAGE,PEMARITL,HRNUMHOU,PEAFNOW,"+
                  "PEEDUCA,HUFAMINC,PREXPLF,PRFTLF"]
elif year in ('1996', '1998', '2000', '2002'):
    param_list = ["PES1,PES2,GEMETSTA,GEREG,PERACE,PRHSPNON,"+
                  "PESEX,PRTAGE,PEMARITL,HRNUMHOU,PEAFNOW,"+
                  "PEEDUCA,HUFAMINC,PREXPLF,PRFTLF"]
elif year in ('2004', '2006', '2008'):
    param_list = ["PES1,PES2,GTMETSTA,GEREG,PTDTRACE,PEHSPNON,"+
                  "PESEX,PRTAGE,PEMARITL,HRNUMHOU,PEAFNOW,"+
                  "PEEDUCA,HUFAMINC,PREXPLF,PRFTLF"]
else:
    param_list = ["PES1,PES2,GTMETSTA,GEREG,PTDTRACE,PEHSPNON,"+
                  "PESEX,PRTAGE,PEMARITL,HRNUMHOU,PEAFNOW,"+
                  "PEEDUCA,HEFAMINC,PREXPLF,PRFTLF"]

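The year-by-year branching above could also be written as a small helper. This is a minimal sketch rather than the notebook's own code: the function name `params_for_year` is hypothetical, the variable codes are copied from the cell above, and the order is kept the same because the columns are later renamed by position.

```python
def params_for_year(year):
    """Return the comma-separated Census variable list for one survey year.

    Hypothetical helper; variable codes are copied from the branching cell.
    """
    y = int(year)
    # 1994 uses PES3/PES4 for the voting and registration questions.
    voted, registered = ("PES3", "PES4") if y == 1994 else ("PES1", "PES2")
    # The metropolitan, race, and Hispanic variables change name from 2004 on.
    if y <= 2002:
        metro, race, hispanic = "GEMETSTA", "PERACE", "PRHSPNON"
    else:
        metro, race, hispanic = "GTMETSTA", "PTDTRACE", "PEHSPNON"
    # The family income variable changes name from 2010 on.
    income = "HUFAMINC" if y <= 2008 else "HEFAMINC"
    return ",".join([voted, registered, metro, "GEREG", race, hispanic,
                     "PESEX", "PRTAGE", "PEMARITL", "HRNUMHOU", "PEAFNOW",
                     "PEEDUCA", income, "PREXPLF", "PRFTLF"])


print(params_for_year("1994"))
```

The returned string could be passed as the `get` parameter in place of `param_list`.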
In [34]:
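r = requests.get(url,
                params = {"get": param_list,
                         "for": f"state:{STATE}"})
# Stop here if the request failed, rather than failing later in r.json()
r.raise_for_status()

# Create dataframe with data; the first row of the response holds the variable names
census_df = pd.DataFrame(data = r.json())
census_df.rename(columns = census_df.iloc[0], inplace = True)
census_df.drop([0], axis = 0, inplace = True)
# Replace the Census variable codes with descriptive column names
# (same order as param_list, followed by the state column)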
census_df.columns = ["Voted", "Registered_to_Vote", "Metropolitan",
                     "Geographic_Region", "Race", "Hispanic",
                     "Female", "Age", "Marital_Status", 
                     "Household_Members", "In_Armed_Forces", 
                     "Education_Completed", "Family_Income_category", "Employment_Status", 
                     "Full_Time", "State"]
# Replace number with state abbreviation
census_df.replace({'State': rvsestatedict}, inplace = True)
# Change column types
census_df = census_df.astype({"Voted": int, "Registered_to_Vote": int, "Metropolitan": int, 
                              "Geographic_Region": int, "Race": int, "Hispanic": int, "Female": int,
                              "Age" : int, "Marital_Status": int, "Household_Members": int, 
                              "In_Armed_Forces": int, "Education_Completed": int,
                              "Family_Income_category": int, "Employment_Status": int, "Full_Time": int, "State": str})
# Move the State column to the front
col2 = census_df.pop('State')
census_df.insert(0, 'State', col2)
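To make the shape of the response easier to see, here is a minimal sketch on a toy payload. It assumes the response is a list of rows whose first row holds the variable names and whose values arrive as strings; the rows below are made up, and `toy_df` is a hypothetical name.

```python
import pandas as pd

# Toy payload in the assumed shape of the API response:
# a header row followed by data rows. Values are made up.
rows = [
    ["PES3", "PES4", "PRTAGE", "state"],
    ["1", "-1", "34", "1"],
    ["2", "1", "52", "1"],
]

# Use the first row as column names and the remaining rows as data.
toy_df = pd.DataFrame(rows[1:], columns=rows[0])

# The values are strings here, so convert the numeric columns.
numeric_cols = ["PES3", "PES4", "PRTAGE"]
toy_df[numeric_cols] = toy_df[numeric_cols].astype(int)

# Replace the state code with its abbreviation.
toy_df["state"] = toy_df["state"].map({"1": "AL"})

print(toy_df)
```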

- Values for the Registered to Vote variable: anyone recorded as having voted is also marked as registered

In [35]:
census_df.loc[census_df.Voted == 1, 'Registered_to_Vote'] = 1

- Family Income

In [36]:
census_df.Family_Income_category.unique()

array([ 2,  4,  6,  8, 14, 10,  7, -3, 11, 13, 12,  9,  1,  3, -2,  5])

From the Census API variable list (family income, 1994):

values: {

      "-1": "Blank", 
      "7": "20,000 To 24,999",
      "-2": "Don't Know",
      "10": "35,000 To 39,999",
      "5": "12,500 To 14,999",
      "1": "Less Than $5,000",
      "12": "50,000 To 59,999",
      "3": "7,500 To 9,999",
      "4": "10,000 To 12,499",
      "11": "40,000 To 49,999",
      "13": "60,000 To 74,999",
      "14": "75,000 Or More",
      "8": "25,000 To 29,999",
      "2": "5,000 To 7,499",
      "-3": "Refused",
      "9": "30,000 To 34,999",
      "6": "15,000 To 19,999"
}

In [37]:
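# Map each income category to an approximate dollar value (mostly bracket midpoints).
# Codes 15 and 16 are not in the 1994 value list above.
# Negative codes (Blank, Don't Know, Refused) are not in the mapping and stay unchanged.
census_df['Family_Income_actual'] = census_df.Family_Income_category.replace({1: 5000, 2: 6250, 
                                                                              3: 8750, 4: 11250, 
                                                                              5: 13750, 6: 17500, 
                                                                              7: 22500, 8: 27500, 
                                                                              9: 32500, 10: 37500, 
                                                                              11: 45000, 12: 55000, 
                                                                              13: 67500, 14: 87500, 
                                                                              15: 125000, 16: 150000})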
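Because `replace` leaves unmapped values as they are, the negative codes for Don't Know and Refused would remain in the dollar column. One possible alternative, shown here as a sketch on made-up codes and not as the notebook's approach, is `Series.map`, which returns NaN for any value missing from the dictionary. Only three codes are included for brevity, and the names `income_values`, `codes`, and `income` are hypothetical.

```python
import pandas as pd

# Abbreviated mapping for illustration (three of the codes from the cell above).
income_values = {2: 6250, 7: 22500, 14: 87500}

# Made-up category codes, including Refused (-3) and Don't Know (-2).
codes = pd.Series([2, 14, -3, 7, -2])

# Series.map returns NaN for codes that are not keys in the dictionary,
# so the non-response codes become missing values.
income = codes.map(income_values)

print(income)
```

The missing values can then be removed together with the other NA rows.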

#### Other Preprocessing
- Create some dummy variables
- Recode variables with values 1 and 2 to 0 and 1
- Drop NA values
- Drop rows for people under 18
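
The steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the notebook's actual preprocessing: the values are made up, `toy` is a hypothetical name, and the codings (1 = voted, 2 = did not vote; 1 = male, 2 = female) are assumed.

```python
import pandas as pd

# Toy data; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "Voted": [1, 2, 2, 2],           # assumed coding: 1 = yes, 2 = no
    "Female": [2, 1, 1, 1],          # assumed coding: 1 = male, 2 = female
    "Age": [45, 17, 30, 62],
    "Marital_Status": [1, 6, 4, None],
})

# Drop rows with missing values, then keep respondents aged 18 or older.
toy = toy.dropna()
toy = toy[toy["Age"] >= 18].copy()

# Recode the 1/2 variables to 1/0.
toy["Voted"] = (toy["Voted"] == 1).astype(int)
toy["Female"] = (toy["Female"] == 2).astype(int)

# Create one dummy column per marital-status category.
toy["Marital_Status"] = toy["Marital_Status"].astype(int)
toy = pd.get_dummies(toy, columns=["Marital_Status"], prefix="Marital")

print(toy)
```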