# Data set selection

## My background

Prior to the Flatiron Data Science fellowship, I worked in the Information Technology policy space at a Federal agency in Washington, DC. In my role, I supported the agency's compliance with the Office of Management and Budget's [Open Data Policy](https://project-open-data.cio.gov/policy-memo/). The policy requires agencies to provide the public access to non-sensitive datasets on [Data.gov](https://www.data.gov/). Naturally, when I needed to find a dataset for a project for this fellowship, I turned to data.gov and other governmental sites. 

For a previous project, I focused on the [Federal Election Campaign Individual Contributions](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2020&min_date=01%2F01%2F2019&max_date=12%2F31%2F2020) data sets. I built a model using 2019 campaign contribution records from DC residents to predict which of the leading Democratic candidates an individual donated to ([GitHub repo](https://github.com/ali0003433/predict-recipient-presidential-contributions)).

## Inspiration for topic

For my capstone project, I chose to continue with campaign-related data. I researched the existing public data, starting with Nate Silver's [FiveThirtyEight](https://fivethirtyeight.com/). I also used Eitan D. Hersh's <i>Hacking the Electorate</i>, Sasha Issenberg's <i>The Victory Lab</i>, and Tim Carney's <i>Alienated America</i>. In his book, Carney examines Trump's base: those who voted for him in the primary elections. His thesis is that by focusing on the economic downturn in the Rust Belt, xenophobia, and racism, we are missing an important reality about Trump supporters.

Carney cites evidence that Trump supporters are more likely to live in areas where a reasonable person might agree with the early rallying cry "the American Dream is dead" because of factors such as a decline in life expectancy, an uptick in drug- and alcohol-related suicides, a decrease in community engagement, and a decrease in marriage rates, but only among the least educated.

I had read in some of Carney's interviews that he uses American Community Survey (ACS) data from the US Census Bureau. I could combine ACS data with county-level 2016 presidential returns from MIT's Election Lab. This would be a different population than the one Carney studied, voters in the 2016 Republican primaries, so it would allow me to apply his thesis to a slightly different population of voters.

## First data set considered (not selected)

Here is the first dataset I considered: [County-Level Presidential Returns 2000-2016](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ) 

Import libraries

In [1]:
import numpy as np
import pandas as pd

Read in Presidential returns file and show the top of the dataframe

In [2]:
df = pd.read_csv('../data/raw/countypres_2000-2016.csv')
df.head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2000,Alabama,AL,Autauga,1001.0,President,Al Gore,democrat,4942.0,17208,20190722
1,2000,Alabama,AL,Autauga,1001.0,President,George W. Bush,republican,11993.0,17208,20190722
2,2000,Alabama,AL,Autauga,1001.0,President,Ralph Nader,green,160.0,17208,20190722
3,2000,Alabama,AL,Autauga,1001.0,President,Other,,113.0,17208,20190722
4,2000,Alabama,AL,Baldwin,1003.0,President,Al Gore,democrat,13997.0,56480,20190722


Check shape of the data

In [3]:
df.shape

(50524, 11)

List attributes then consult MIT Election Lab's data dictionary

In [4]:
list(df.columns)

['year',
 'state',
 'state_po',
 'county',
 'FIPS',
 'office',
 'candidate',
 'party',
 'candidatevotes',
 'totalvotes',
 'version']

Check number of null values 

In [5]:
df.isna().sum()

year                  0
state                 0
state_po             64
county                0
FIPS                 64
office                0
candidate             0
party             15789
candidatevotes      404
totalvotes            0
version               0
dtype: int64

The FIPS (Federal Information Processing Standards) attribute will allow for mapping to Census Bureau data. What do the 64 null values look like? 

In [6]:
df.loc[df.FIPS.isna()].head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
12612,2000,Connecticut,,Statewide writein,,President,Al Gore,democrat,,0,20190722
12613,2000,Maine,,Maine UOCAVA,,President,Al Gore,democrat,,0,20190722
12614,2000,Alaska,,District 99,,President,Al Gore,democrat,,0,20190722
12615,2000,Rhode Island,,Federal Precinct,,President,Al Gore,democrat,,0,20190722
12616,2000,Connecticut,,Statewide writein,,President,George W. Bush,republican,,0,20190722


It appears that no county-level FIPS data is missing: the records with null FIPS values are vote counts that are not tied to a county, such as statewide write-ins and UOCAVA ballots.
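
Mapping these returns onto Census Bureau data requires the FIPS codes to be in the same form on both sides. Here `FIPS` is read as a float (`1001.0`), while Census tables typically identify counties with a five-digit, zero-padded string (`01001`). Below is a minimal sketch of that normalization and join. The `acs` table, its `fips` column, and the `median_income` column with its values are hypothetical placeholders, not real ACS data.

```python
import pandas as pd

# Toy stand-in for the returns data; FIPS is float because some rows are null.
returns = pd.DataFrame({
    "county": ["Autauga", "Baldwin", "Statewide writein"],
    "FIPS": [1001.0, 1003.0, None],
    "totalvotes": [17208, 56480, 0],
})

# Hypothetical ACS extract; column names and values are illustrative only.
acs = pd.DataFrame({
    "fips": ["01001", "01003"],
    "median_income": [50000, 55000],
})

# Drop rows without a county FIPS, then convert 1001.0 -> "01001".
county_rows = returns.dropna(subset=["FIPS"]).copy()
county_rows["fips"] = county_rows["FIPS"].astype(int).astype(str).str.zfill(5)

# validate= raises if the ACS side has duplicate county keys.
merged = county_rows.merge(acs, on="fips", how="left", validate="many_to_one")
print(merged[["county", "fips", "median_income"]])
```

Using a left join keeps every county in the returns even if the ACS extract lacks it, which makes unmatched counties easy to spot as nulls.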

Check the number of unique FIPS values

In [7]:
len(df.FIPS.unique())

3155
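
One caveat on this count: `Series.unique()` keeps `NaN` as a value, so the figure above counts the null FIPS entries as one extra "value". A small sketch on toy data shows the difference from `nunique()`, which skips nulls by default:

```python
import numpy as np
import pandas as pd

# Toy FIPS column with repeated codes and nulls.
fips = pd.Series([1001.0, 1001.0, 1003.0, np.nan, np.nan])

n_with_nan = len(fips.unique())   # NaN is counted once as a distinct value
n_without_nan = fips.nunique()    # dropna=True by default, so NaN is skipped

print(n_with_nan, n_without_nan)  # 3 2
```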

Check value counts for each party to confirm that the two major parties have the same number of records

In [8]:
df.party.value_counts()

republican    15789
democrat      15789
green          3157
Name: party, dtype: int64
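
With one row per county, year, and candidate, the returns can be reshaped into one row per county before joining to other data. Here is a rough sketch using the Autauga rows from the head above. The `other` label and the `share` column are names introduced here for illustration.

```python
import pandas as pd

# The four Autauga County rows for 2000 shown in the head of the dataframe.
rows = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000],
    "FIPS": [1001.0, 1001.0, 1001.0, 1001.0],
    "party": ["democrat", "republican", "green", None],
    "candidatevotes": [4942.0, 11993.0, 160.0, 113.0],
    "totalvotes": [17208, 17208, 17208, 17208],
})

# Label the null party (the "Other" candidate) so the pivot does not drop it.
rows["party"] = rows["party"].fillna("other")
rows["share"] = rows["candidatevotes"] / rows["totalvotes"]

# One row per (year, county), one column per party.
wide = rows.pivot_table(index=["year", "FIPS"], columns="party", values="share")
print(wide.round(3))
```

A quick sanity check on the full data would be that the shares in each row sum to roughly 1.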

My next step would be to research the data available from the Census Bureau's American Community Survey (ACS).

After checking the data available from the ACS, I needed to learn more about exactly what Carney looked at to form his thesis.

In a few articles and in his book, Carney mentions working with Emily Ekins on the book. My research led me to her publications: ['Religious Trump Voters'](https://www.voterstudygroup.org/publication/religious-trump-voters) and ['The Five Types of Trump Voters'](https://www.voterstudygroup.org/publication/the-five-types-trump-voters). In her work, Ekins uses the Voter Study Group survey data published by the Democracy Fund. The survey also uses ACS data to create its sampling weights, which is likely what Carney was referring to.

## Second data set considered (selected)

After discovering the data Ekins used, I took a look: [2016 Voter Study Group Survey](https://www.voterstudygroup.org/publication/2016-voter-survey).

Read in Voter Study Group file and display head

In [9]:
df = pd.read_csv('../data/raw/20161201_voter_study_group.csv')
df.head()

Unnamed: 0,case_identifier,weight,PARTY_AGENDAS_rand_2016,pp_primary16_2016,pp_demprim16_2016,pp_repprim16_2016,inputstate_2016,izip_2016,votereg2_2016,votereg_f_2016,...,post_HouseCand3Name_2012,post_HouseCand3Party_2012,post_SenCand1Name_2012,post_SenCand1Party_2012,post_SenCand2Name_2012,post_SenCand2Party_2012,post_SenCand3Name_2012,post_SenCand3Party_2012,starttime_2016,endtime_2016
0,779,0.358213,2,1,1.0,,6,94952,1,1.0,...,,,Shelley Berkley,Democratic,Dean Heller,Republican,,,29nov2016 22:59:43,29nov2016 23:28:24
1,2108,0.562867,2,2,,1.0,4,85298,1,1.0,...,,,Richard Carmona,Democratic,Jeff Flake,Republican,,,29nov2016 15:41:28,29nov2016 18:58:28
2,2597,0.552138,2,1,1.0,,55,54904,1,1.0,...,,,Tammy Baldwin,Democratic,Tommy Thompson,Republican,,,29nov2016 16:08:39,29nov2016 16:32:43
3,4148,0.207591,1,1,3.0,,40,74104,1,1.0,...,,,,,,,,,14dec2016 18:46:33,14dec2016 19:11:20
4,4460,0.333729,2,2,,4.0,48,78253,1,1.0,...,,,Paul Sadler,Democratic,Ted Cruz,Republican,,,01dec2016 10:17:47,01dec2016 10:59:48


Check the shape of the data, number of observations by number of attributes 

In [10]:
df.shape

(8000, 668)

There are 668 features. List the first ten attribute names, then consult the [data dictionary](link)

In [11]:
list(df.columns)[0:10]

['case_identifier',
 'weight',
 'PARTY_AGENDAS_rand_2016',
 'pp_primary16_2016',
 'pp_demprim16_2016',
 'pp_repprim16_2016',
 'inputstate_2016',
 'izip_2016',
 'votereg2_2016',
 'votereg_f_2016']

Check number of null values

In [12]:
df.isna().sum().head()

case_identifier               0
weight                        0
PARTY_AGENDAS_rand_2016       0
pp_primary16_2016             0
pp_demprim16_2016          5026
dtype: int64

Check 'presvote16post_2016' attribute (2016 Presidential Election candidate choice) for nulls and value counts

In [13]:
print(df['presvote16post_2016'].isna().sum())
df['presvote16post_2016'].value_counts(ascending=False)

394


1.0    3545
2.0    3479
3.0     231
6.0     182
4.0     112
7.0      33
5.0      24
Name: presvote16post_2016, dtype: int64

Pros:
* Attributes line up with some of the features Carney utilized
* Large variety of attributes to choose from, over 600
* Comes with a weight for each observation, so results can be generalized to the registered voting population
* Finer level of detail than Census Bureau data

Cons: 
* Cannot replicate Carney's methodology exactly
* Uses a complex survey design, so it will require extra time to learn how to work with that in Python
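
As a first step toward using the weights, and leaving proper complex-survey variance estimation aside, weighted vote shares can be sketched by summing `weight` within each response code. This gives only a point estimate, under the assumption that `weight` is a per-respondent sampling weight. The toy values below are illustrative and not taken from the survey.

```python
import pandas as pd

# Toy responses; codes and weights are made up for illustration.
toy = pd.DataFrame({
    "presvote16post_2016": [1.0, 2.0, 2.0, 1.0, None],
    "weight": [0.5, 1.5, 1.0, 1.0, 2.0],
})

# Respondents with a null vote choice are excluded from the denominator.
answered = toy.dropna(subset=["presvote16post_2016"])

# Weighted share: sum of weights per code over the total weight.
weighted_totals = answered.groupby("presvote16post_2016")["weight"].sum()
weighted_share = weighted_totals / weighted_totals.sum()

# Unweighted share for comparison.
unweighted_share = (
    answered["presvote16post_2016"].value_counts(normalize=True).sort_index()
)

print(weighted_share)
print(unweighted_share)
```

In this toy case the raw counts are split evenly, while the weighted shares are not, which is the kind of gap the weights are meant to correct.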

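One-hot encode two survey attributes, casting to strings first so that nulls become their own `nan` category

In [22]:
# Note: this overwrites df with just the two selected columns
df = df[['imiss_i_2016', 'imiss_c_2016']]
df = df.astype(str)
pd.get_dummies(df)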

Unnamed: 0,imiss_i_2016_1.0,imiss_i_2016_2.0,imiss_i_2016_3.0,imiss_i_2016_4.0,imiss_i_2016_nan,imiss_c_2016_1.0,imiss_c_2016_2.0,imiss_c_2016_3.0,imiss_c_2016_4.0,imiss_c_2016_nan
0,0,1,0,0,0,1,0,0,0,0
1,0,1,0,0,0,0,1,0,0,0
2,0,0,0,1,0,0,1,0,0,0
3,0,0,0,1,0,0,0,1,0,0
4,0,1,0,0,0,1,0,0,0,0
5,1,0,0,0,0,0,1,0,0,0
6,1,0,0,0,0,0,1,0,0,0
7,0,0,1,0,0,1,0,0,0,0
8,0,1,0,0,0,0,0,0,1,0
9,0,1,0,0,0,0,1,0,0,0
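
Casting to `str` is what produces the `_nan` columns above. An alternative sketch, assuming the goal is simply to keep nulls as their own category, passes `dummy_na=True` and names the columns to encode explicitly. Toy values stand in for the survey responses.

```python
import numpy as np
import pandas as pd

# Toy responses; values are made up for illustration.
toy = pd.DataFrame({
    "imiss_i_2016": [2.0, 4.0, np.nan],
    "imiss_c_2016": [1.0, 2.0, 2.0],
})

# columns= forces encoding of numeric columns;
# dummy_na=True adds an indicator column for nulls.
encoded = pd.get_dummies(
    toy, columns=["imiss_i_2016", "imiss_c_2016"], dummy_na=True
)
print(encoded.columns.tolist())
```

This keeps the original columns numeric until the encoding step, so the rest of the dataframe does not need to be converted to strings.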
