# Calculating Vote Rates by Voter and Household
Calculating vote rates needs special care. For each voter we need to know how many elections they could have voted in and how many they actually voted in, and we need to calculate these features as they stood at earlier points in time. Only then can they be used to train a model on elections where we have the results, i.e. where we know whether a voter actually voted.

We have data on 6 elections (the primary and general in 2012, 2014 and 2016). We have no ground truth for 2018 voting behavior, so to predict vote rates in the 2018 primary and general we need a trained model. We will train it on the two previous cycles, 2014 and 2016, where we do have ground truth in the form of actual voting behavior. For training we need to calculate past voting rates as they would have looked at the time of the 2014 and 2016 votes, since the past voting rate is one of the key features we have for 2018.

We also calculate the ground truth for each of the six elections in the 2012, 2014 and 2016 cycles, and the separate vote rates across primaries and across general elections as additional features.
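To make the point-in-time idea concrete, here is a minimal sketch on a made-up three-voter frame. The `windows` dict and the `rate_for` helper are illustrative names, not part of this notebook, and the reading of an empty string as "no record for that election" is an assumption. Each feature prefix only sees elections held before the cycle it is meant to predict.

```python
import pandas as pd

# Made-up data: one code per election ('A', 'V', 'N', or '' when
# there is no record for that voter in that election - an assumption).
toy = pd.DataFrame({
    'E1_060512': ['V', '',  'N'],
    'E2_110612': ['V', '',  'A'],
    'E3_060314': ['N', 'V', 'N'],
    'E4_110414': ['V', 'V', 'A'],
    'E5_060716': ['N', 'N', 'N'],
    'E6_110816': ['V', 'V', 'A'],
})

# Hypothetical mapping: feature prefix -> elections known before that cycle.
windows = {
    'E34': ['E1_060512', 'E2_110612'],                            # before 2014
    'E56': ['E1_060512', 'E2_110612', 'E3_060314', 'E4_110414'],  # before 2016
    'E78': list(toy.columns),                                     # before 2018
}

def rate_for(df, cols):
    """Return (possible, cast, rate) using only the given election columns."""
    codes = df[cols].astype(str).agg(''.join, axis=1)
    possible = codes.str.len()
    cast = codes.str.count('[AV]')
    rate = (cast / possible).fillna(-1)  # -1 marks voters with no history
    return possible, cast, rate

for pre, cols in windows.items():
    possible, cast, rate = rate_for(toy, cols)
    toy[pre + '_nVotesPos'] = possible
    toy[pre + '_nVotes'] = cast
    toy[pre + '_nVotesPct'] = rate

print(toy.filter(like='_nVotesPct'))
```

The second toy voter has no 2012 record, so their `E34` rate is the -1 sentinel, while their `E56` rate is computed from the two 2014 elections only.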

In [1]:
# imports
import pandas as pd
import numpy as np
from collections import Counter

from modules.lv_utils import load_households
from modules.lv_utils import load_voters
from modules.lv_utils import find_changes

In [2]:
# load the data
households = load_households('data_clean/20180627_households_district3.csv')
voters = load_voters('data_clean/20180628_voters_district3.csv')

In [3]:
v = voters
h = households
print(v.columns)
print(h.columns)

Index(['Vid', 'Abbr', 'Precinct', 'PrecinctSub', 'Party', 'PartyMain',
       'RegDate', 'PAV', 'RegDateOriginal', 'E6_110816', 'E5_060716',
       'E4_110414', 'E3_060314', 'E2_110612', 'E1_060512', 'District',
       'VoterScore', 'VoterScorePossible', 'VoterScorePctOfPoss', 'BirthYear',
       'OldestInHouseBirthYear', 'IsOldestInHouse', 'havePhone',
       'BirthPlaceState', 'BirthPlaceStateRegion', 'BirthPlaceCountry',
       'BirthPlaceCountryRegion', 'Gender', 'sameMailAddress', 'MailCountry',
       'isApt', 'Zip', 'StreetType', 'EmailProvider', 'E5_060716BT',
       'E1_060512BT', 'Tot_Possible_Votes', 'Act_Votes', 'Pct_Possible_Votes',
       'Hid', 'cHid'],
      dtype='object')
Index(['Hid', 'StreetType', 'Zip', 'Precinct', 'PrecinctSub', 'District',
       'CityArea', 'isApt', 'cHid'],
      dtype='object')


| Original Data Column | Description of data |
|:---:|:---|
| 'E6_110816' | A, V or N for 2016 general. |
| 'E5_060716' | A, V or N for 2016 primary. |
| 'E4_110414' | A, V or N for 2014 general. |
| 'E3_060314' | A, V or N for 2014 primary. |
| 'E2_110612' | A, V or N for 2012 general. |
| 'E1_060512' | A, V or N for 2012 primary. |

| Output Column | Description of data |
|:---:|:---|
| 'E78_nVotesPos' | Given all data (2012, 2014 & 2016), how many times could this voter have voted (A, V or N), for use in predicting 2018 vote behavior. |
| 'E78_nVotes' | Given all data (2012, 2014 & 2016), how many times did this voter vote (A or V). |
| 'E78_nVotesPct' | Their E78 vote rate ('E78_nVotes'/'E78_nVotesPos'). |
| 'E56_nVotesPos' | Given 2012 & 2014 data, how many times could this voter have voted (A, V or N), i.e. for predicting 2016 vote behavior. |
| 'E56_nVotes' | Given 2012 & 2014 data, how many times did this voter vote (A or V), i.e. for predicting 2016 vote behavior. |
| 'E56_nVotesPct' | Their E56 vote rate ('E56_nVotes'/'E56_nVotesPos'). |
| 'E34_nVotesPos' | Given 2012 data, how many times could this voter have voted (A, V or N), i.e. for predicting 2014 vote behavior. |
| 'E34_nVotes' | Given 2012 data, how many times did this voter vote (A or V), i.e. for predicting 2014 vote behavior. |
| 'E34_nVotesPct' | Their E34 vote rate ('E34_nVotes'/'E34_nVotesPos'). |
| 'Eap_nVotesPos' | Given the 2012, 2014 & 2016 primary data only, how many times could this voter have voted (A, V or N), i.e. for predicting primary election vote behavior. |
| 'Eap_nVotes' | Given the 2012, 2014 & 2016 primary data only, how many times did this voter vote (A or V), i.e. for predicting primary election vote behavior. |
| 'Eap_nVotesPct' | Their Eap vote rate ('Eap_nVotes'/'Eap_nVotesPos'). |
| 'Eag_nVotesPos' | Given the 2012, 2014 & 2016 general data only, how many times could this voter have voted (A, V or N), i.e. for predicting general election vote behavior. |
| 'Eag_nVotes' | Given the 2012, 2014 & 2016 general data only, how many times did this voter vote (A or V), i.e. for predicting general election vote behavior. |
| 'Eag_nVotesPct' | Their Eag vote rate ('Eag_nVotes'/'Eag_nVotesPos'). |
| 'E1_GndTth' | Did they vote in the 2012 primary election ((A or V)=>1 for yes, (N)=>0 for no) for model training |
| 'E2_GndTth' | Did they vote in the 2012 general election ((A or V)=>1 for yes, (N)=>0 for no) for model training |
| 'E3_GndTth' | Did they vote in the 2014 primary election ((A or V)=>1 for yes, (N)=>0 for no) for model training |
| 'E4_GndTth' | Did they vote in the 2014 general election ((A or V)=>1 for yes, (N)=>0 for no) for model training |
| 'E5_GndTth' | Did they vote in the 2016 primary election ((A or V)=>1 for yes, (N)=>0 for no) for model training |
| 'E6_GndTth' | Did they vote in the 2016 general election ((A or V)=>1 for yes, (N)=>0 for no) for model training |

In [4]:
election_f = ['E6_110816', 'E5_060716', 'E4_110414', 'E3_060314', 'E2_110612', 'E1_060512',]
clean_f = ['Tot_Possible_Votes', 'Act_Votes','Pct_Possible_Votes']
new_col_names = ['nVotesPos', 'nVotes','nVotesPct']

In [5]:
def add_vote_cols(df, pre):
    """Take in a dataframe with an 'e_sum' string column and a prefix,
    add the <pre>_nVotesPos, <pre>_nVotes and <pre>_nVotesPct columns in place"""
    # counting the elections that voter has a record for (A, V or N)
    df[pre+'_nVotesPos'] = df.e_sum.str.len()
    # counting the actual number of in person or absentee votes cast by that voter
    df[pre+'_nVotes'] = df.e_sum.str.count('[AV]')
    # calculating a percent of possible votes for that voter (-1 when there were none)
    df[pre+'_nVotesPct'] = (df[pre+'_nVotes']/df[pre+'_nVotesPos']).fillna(-1)

In [6]:
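def add_vote_cols_for(elec, df, pre):
    """Concatenate the election columns in elec row-wise into 'e_sum',
    then add the vote count and rate columns for the given prefix"""
    df['e_sum'] = df.loc[:,elec].sum(axis='columns')
    add_vote_cols(df, pre)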
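As a quick worked example of what the three derived columns hold, the sketch below applies the same string operations to a few made-up `e_sum` values. The values are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

# Hypothetical concatenated histories, shaped like the 'e_sum' column.
e_sum = pd.Series(['VNAV', 'NN', ''])

n_pos = e_sum.str.len()              # elections with a record (A, V or N)
n_votes = e_sum.str.count('[AV]')    # ballots actually cast (A or V)
pct = (n_votes / n_pos).fillna(-1)   # 0/0 gives NaN, replaced by -1

print(pd.DataFrame({'e_sum': e_sum, 'nVotesPos': n_pos,
                    'nVotes': n_votes, 'nVotesPct': pct}))
```

The -1 sentinel keeps a voter with no possible votes distinct from one who had the chance to vote and never did, whose rate is 0.0.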

In [7]:
# calculating vote rates for the 2012 data only
elec = ['E2_110612', 'E1_060512']
add_vote_cols_for(elec, v, 'E34')

# adding in the 2014 data and calculating the vote rates for the 2012 & 2014 data
elec.extend(['E4_110414', 'E3_060314'])
add_vote_cols_for(elec, v, 'E56')

# adding in the 2016 data and calculating the vote rates for all election data
elec.extend(['E6_110816', 'E5_060716'])
add_vote_cols_for(elec, v, 'E78')

# selecting the 2 2012 elections and calculating the vote rates for the 2012 cycle
elec = ['E2_110612', 'E1_060512']
add_vote_cols_for(elec, v, 'E12')

# selecting the 2 2014 elections and calculating the vote rates for the 2014 cycle
elec = ['E4_110414', 'E3_060314']
add_vote_cols_for(elec, v, 'E14')

# selecting the 2 2016 elections and calculating the vote rates for the 2016 cycle
elec = ['E6_110816', 'E5_060716']
add_vote_cols_for(elec, v, 'E16')

# selecting the 3 primary elections and calculating the vote rates for the primaries
elec = ['E1_060512', 'E3_060314', 'E5_060716']
add_vote_cols_for(elec, v, 'Eap')

# selecting the 3 general elections and calculating the vote rates for the generals
elec = ['E2_110612', 'E4_110414', 'E6_110816']
add_vote_cols_for(elec, v, 'Eag')

In [8]:
# adding vote rates for each individual election
# (v and voters are the same dataframe at this point, so v is used throughout)
ev = ['E6_110816', 'E5_060716','E4_110414', 'E3_060314', 'E2_110612', 'E1_060512']
pre = ['E6', 'E5','E4', 'E3', 'E2', 'E1']

for e, p in zip(ev,pre):
    # for a single election this is 1 if the voter has a record (A, V or N), else 0
    v[p+'_nVotesPos'] = v[e].str.len()
    # counting the actual number of in person or absentee votes cast by that voter
    v[p+'_nVotes'] = v[e].str.count('[AV]')
    # calculating a percent of possible votes for that voter (-1 when there were none)
    v[p+'_nVotesPct'] = (v[p+'_nVotes']/v[p+'_nVotesPos']).fillna(-1)

In [9]:
# adding the ground truth columns:
# 1 = voted (A or V), 0 = did not vote (N), -1 = empty value for that election
v = pd.concat([v, pd.DataFrame(np.zeros((v.shape[0],6)), columns = ['E1_GndTth', 'E2_GndTth', 'E3_GndTth', 
                                                                    'E4_GndTth', 'E5_GndTth', 'E6_GndTth'])], axis=1)
for (oc,ic) in [('E1_GndTth','E1_060512'), ('E2_GndTth','E2_110612'), ('E3_GndTth','E3_060314'), 
                ('E4_GndTth','E4_110414'), ('E5_GndTth','E5_060716'), ('E6_GndTth','E6_110816')]:
    v.loc[v[ic].isin(['A', 'V']), oc] = 1
    v.loc[v[ic] == '', oc] = -1
#v[['E1_060512', 'E1_GndTth', 'E2_110612', 'E2_GndTth', 'E3_060314', 'E3_GndTth',
#   'E4_110414', 'E4_GndTth', 'E5_060716', 'E5_GndTth', 'E6_110816', 'E6_GndTth']].head(15)
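The same encoding can also be written as a single lookup. This is a minimal sketch on made-up codes, assuming the only values present are 'A', 'V', 'N' and the empty string. Unlike the loop above, which leaves any other code at 0, `Series.map` turns an unexpected code into NaN, which makes stray values easier to spot.

```python
import pandas as pd

# Made-up codes for one election column.
codes = pd.Series(['A', 'V', 'N', ''], name='E6_110816')

# Assumed encoding: voted -> 1, did not vote -> 0, empty -> -1.
encoding = {'A': 1, 'V': 1, 'N': 0, '': -1}
gnd_tth = codes.map(encoding).rename('E6_GndTth')

print(pd.concat([codes, gnd_tth], axis=1))
```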

In [10]:
# checking calculation:
temp = v.loc[:,['E78_nVotesPos','E78_nVotes','E78_nVotesPct']]
temp.rename(columns = {'E78_nVotesPos':'Tot_Possible_Votes',
                      'E78_nVotes':'Act_Votes',
                      'E78_nVotesPct':'Pct_Possible_Votes'}, inplace = True)
print('newly calculated columns match previously calculated ones: {}'.format(
    v[['Tot_Possible_Votes', 'Act_Votes','Pct_Possible_Votes']].equals(
    temp)))

find_changes(v[['Tot_Possible_Votes', 'Act_Votes','Pct_Possible_Votes']],temp )

newly calculated columns match previously calculated ones: True
(0, 4)
[]


Unnamed: 0,id,col,from,to
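`find_changes` is a helper local to this project. Where it is not available, a similar sanity check can be sketched with pandas alone. The two frames below are made up for illustration, and `pandas.testing.assert_frame_equal` is a swapped-in technique rather than what the helper does internally.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up "previously calculated" and "newly calculated" columns.
old = pd.DataFrame({'Tot_Possible_Votes': [6, 2],
                    'Act_Votes': [4, 0],
                    'Pct_Possible_Votes': [4 / 6, 0.0]})
new = pd.DataFrame({'E78_nVotesPos': [6, 2],
                    'E78_nVotes': [4, 0],
                    'E78_nVotesPct': [4 / 6, 0.0]})

# Align the column names before comparing.
renamed = new.rename(columns=dict(zip(new.columns, old.columns)))

assert_frame_equal(old, renamed)  # raises AssertionError on any mismatch
matches = old.equals(renamed)     # plain boolean version of the same check
print('columns match: {}'.format(matches))
```

`DataFrame.equals` also compares dtypes, so an integer column checked against a float column with the same values reports a mismatch.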


In [11]:
# dropping the now extra columns
c_to_drop = ['Tot_Possible_Votes', 'Act_Votes','Pct_Possible_Votes','e_sum']
v = v.drop(c_to_drop, axis='columns')

## Saving out the enhanced data

In [12]:
v.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13307 entries, 0 to 13306
Data columns (total 86 columns):
Vid                        13307 non-null int64
Abbr                       13307 non-null int64
Precinct                   13307 non-null int64
PrecinctSub                13307 non-null int64
Party                      13307 non-null category
PartyMain                  13307 non-null object
RegDate                    13307 non-null datetime64[ns]
PAV                        13307 non-null category
RegDateOriginal            13307 non-null datetime64[ns]
E6_110816                  13307 non-null category
E5_060716                  13307 non-null category
E4_110414                  13307 non-null category
E3_060314                  13307 non-null category
E2_110612                  13307 non-null category
E1_060512                  13307 non-null category
District                   13307 non-null int64
VoterScore                 13307 non-null float64
VoterScorePossible         133

In [13]:
date = pd.Timestamp("today").strftime("%Y%m%d")
v.set_index('Vid', inplace=True)
v.to_csv('data_clean/{}_votersWithRate_district3.csv'.format(date))