# Cleaning the Voter data
Before further analysis, the data was cleaned and PII (Personally Identifiable Information) was removed or obscured.

In [1]:
# imports
import pandas as pd

In [2]:
# load the data
voters = pd.read_csv('data_raw/180613_voters_district3_all_data.txt', sep='\t')
hh_lookup = pd.read_csv('data_clean/20180616_households_lookup_NO_GIT.txt', index_col='Hid')

For the Voter data, the value_counts() of each column was reviewed and the following was done to clean and anonymize the data:

| Original Data Column | Description of action | output column(s) | Type |
|:---:|:---|:---:|:---:|
| 'VoterID' | Rows of the table were randomly shuffled, the index reset, and the new index used as the new UID. | 'Vid' | Num |
| 'Status' | Dropped, all entries are the same 'A'. | | |
| 'Abbr' | Kept as is; its meaning is not understood, but it is a clean number. | 'Abbr' | Num |
| 'Affidavit' | Dropped as PII. | | |
| 'LastVoted' | Only 6 entries; dropped as too sparse. | | |
| 'Salutation' | Dropped as PII; 6388 missing values. | | |
| 'LastName' | Dropped as PII. | | |
| 'FirstName' | Used to help fill missing gender data then dropped as PII. | | |
| 'MiddleName' | 4020 NaN, and a lot of initials only, dropped as PII. | | |
| 'Suffix' | 13057 NaN, dropped as PII. | | |
| 'HouseNumber' | Dropped as PII. | | |
| 'HouseNumberSuffix' | Dropped as empty. | | |
| 'StreetPrefix' | Dropped as empty. | | |
| 'Street' | Dropped as PII. | | |
| 'StreetType' | Populated missing values using 'Street': 'COMMON' => 'CMN', 'GREEN' => 'GRN', and the two cross-street addresses => 'UKN'. | 'StreetType' | Cat |
| 'BuildingNumber' | Only 3 entries; dropped. | | |
| 'AptNumber' | Converted to a True/False field. | 'isApt' | Bool |
| 'City' | Dropped, all entries are the same. | | |
| 'State' | 6 missing rows, dropped as all should be the same. | | |
| 'Zip' | Cleaned all to 5 digit numerical zip code entries.  | 'Zip' | Num |
| 'Precinct' | Converted to number and kept. | 'Precinct' | Num |
| 'PrecinctSub' | Converted to number and kept. | 'PrecinctSub' | Num |
| 'Party' | Converted to category and kept. | 'Party' | Cat |
| 'RegDate' | Converted to a dateTime and kept. | 'RegDate' | Date |
| 'ImageID' | An int between 0 and 48277945, meaning unknown; dropped as images are often PII. | | |
| 'Phone1' | 5266 NaNs and 8041 values; converted to True/False. | 'havePhone' | Bool |
| 'Phone2' | Only 2 values, dropped as PII. | | |
| 'Military' | Only 9 'Y'; dropped as too sparse. | | |
| 'Gender' | 5223 NaNs; 1743 'F' and 1717 'M' were added by comparing first names to a database of name genders (https://github.com/organisciak/names), and the remaining missing data was set to 'UNK'. | 'Gender' | Cat |
| 'PAV' | Is voter a Permanent Absentee Voter, converted to category and kept. | 'PAV' | Cat |
| 'BirthPlace' | This mix of two and three letter codes was first matched against two letter USA state codes and, only if that failed, against two or three letter country codes. Output is 2 clean columns; state and country code data was gathered from Wikipedia, and 'UKN' was used for the 1296 NaNs and for unrecognized codes. | 'BirthPlaceState', 'BirthPlaceCountry' | Cat, Cat |
| 'BirthDate' | Cleaned full birthday into 'BirthYear', rest dropped as PII. | 'BirthYear' | Int |
| Mailing Address columns | Compared with main address to create a True/False, Country kept as a category. | 'sameMailAddress', 'MailCountry' | Bool, Cat |
| 'LTDate' | An internal column to registration office, dropped. | | |
| 'Email' | 9009 NaNs; cleaned to keep only the service provider, with 'ukn' for NaNs. | 'EmailProvider' | Str |
| 'RegDateOriginal' | Converted to a dateTime and kept. | 'RegDateOriginal' | Date |
| 'PermCategory' | An internal column to registration office, dropped. | | |
| 'PrecinctName' | Shown to be formatted combination of Precinct and PrecinctSub so dropped. | | |
| 'ResAddrLine1' |  Dropped empty. | | |
| 'ResAddrLine2' |  Dropped empty. | | |
| 'E1_110816' | Code indicated vote, converted to category and kept. | 'E6_110816' | Cat |
| 'E2_060716' | Code indicated vote and ballot used, Cleaned into 'Vote' and 'BallotType' and kept. | 'E5_060716', 'E5_060716BT' | Cat |
| 'E3_110414' | Code indicated vote, converted to category and kept. | 'E4_110414' | Cat |
| 'E4_060314' | Code indicated vote, converted to category and kept. | 'E3_060314' | Cat |
| 'E5_110612' | Code indicated vote, converted to category and kept. | 'E2_110612' | Cat |
| 'E6_060512' | Code indicated vote and ballot used, Cleaned into 'Vote' and 'BallotType' and kept. | 'E1_060512','E1_060512BT' | Cat |
| | Added column to indicate number of elections voter has been registered for. | 'Tot_Possible_Votes' | Num |
| | Added column to indicate number of elections voter actually voted in. | 'Act_Votes' | Num |
| | Added column to indicate % of possible elections actually voted in. | 'Pct_Possible_Votes' | Num |
| 'District' | Kept as is in case we need to add in other district data. | 'District' | Num |
| 'VoterScore' | Score assigned by my friend based on which election someone has reported data for and voted (A or V) | 'VoterScore' | Num |
| 'VoterScorePossible' | Score assigned by my friend assuming all reported data was 'vote' (A or V) | 'VoterScorePossible' | Num |
| 'VoterScorePctOfPoss' | 'VoterScore'/'VoterScorePossible' | 'VoterScorePctOfPoss' | Num |
| Household | Unique key linking each voter to a household, looked up and converted to anonymized Hid. | 'Hid' | Num |
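
The column review that produced the table above can be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the exact procedure used; `toy`, `summarize_columns` and the sample values are made up for the example.

```python
import pandas as pd

# Toy stand-in for the voter table (hypothetical values).
toy = pd.DataFrame({
    'Status': ['A', 'A', 'A', 'A'],
    'Military': [None, 'Y', None, None],
    'Party': ['DEM', 'REP', 'DEM', 'NPP'],
})

def summarize_columns(df, top=5):
    """Return {column: most frequent values, NaN included} for a quick review."""
    return {col: df[col].value_counts(dropna=False).head(top) for col in df.columns}

summary = summarize_columns(toy)
for col, counts in summary.items():
    print(col, counts.to_dict())

# Columns with a single distinct value carry no information and are drop candidates.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)
```

Passing `dropna=False` matters here: it makes sparse columns such as 'Military' visible as mostly NaN rather than as a short list of values.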

In [3]:
v = voters

Dropping all the columns not needed during cleaning.

In [4]:
to_drop = ['Status','Affidavit','LastVoted','Salutation','LastName','MiddleName',
           'Suffix','HouseNumberSuffix','StreetPrefix',
          'City', 'State', 'ImageID', 'Phone2', 'Military','LTDate', 'PermCategory',
            'ResAddrLine1', 'ResAddrLine2']
v = v.drop(to_drop, axis='columns')

Converting a couple of columns we are keeping to categorical data

### Precinct Name
Confirming that this is a formatted concatenation of Precinct and PrecinctSub and then dropping the column.

In [5]:
precinctcomp = v.loc[:,['PrecinctName', 'Precinct', 'PrecinctSub']]
precinctcomp['Combined']= (['.'.join(i) for i in 
                        zip(precinctcomp["Precinct"].map(str),precinctcomp["PrecinctSub"].map("{0:0>2}".format))])
precinctcomp['Combined'] = precinctcomp['Combined'].apply(lambda x: "{}{}".format('PRECINCT NO. ', x))
assert (precinctcomp['PrecinctName'] == precinctcomp['Combined']).all()
print('PrecinctName is a concatenation and can be dropped')

PrecinctName is a concatenation and can be dropped


In [6]:
v = v.drop('PrecinctName', axis='columns')
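
The same redundancy check can probably be written without a list comprehension by using vectorized string methods. The sketch below assumes the 'PRECINCT NO. <Precinct>.<two-digit PrecinctSub>' pattern confirmed above; the precinct numbers are invented.

```python
import pandas as pd

# Toy rows following the pattern confirmed above (hypothetical precinct numbers).
toy = pd.DataFrame({
    'Precinct': [830, 831],
    'PrecinctSub': [1, 12],
    'PrecinctName': ['PRECINCT NO. 830.01', 'PRECINCT NO. 831.12'],
})

# Rebuild the name: prefix + precinct + '.' + sub-precinct zero-padded to 2 digits.
combined = ('PRECINCT NO. '
            + toy['Precinct'].astype(str)
            + '.'
            + toy['PrecinctSub'].astype(str).str.zfill(2))

name_is_redundant = bool((combined == toy['PrecinctName']).all())
print(name_is_redundant)
```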

### 'RegDate' and 'RegDateOriginal'
Converting to dates

In [7]:
v.RegDate = pd.to_datetime(v.RegDate.map(lambda x: x.replace(' 0:00', '')))
v.RegDateOriginal = pd.to_datetime(v.RegDateOriginal.map(lambda x: x.replace(' 0:00', '')))

In [8]:
v[['RegDate', 'RegDateOriginal']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13307 entries, 0 to 13306
Data columns (total 2 columns):
RegDate            13307 non-null datetime64[ns]
RegDateOriginal    13307 non-null datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 208.0 KB
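
If the raw dates might contain malformed entries, an explicit format with `errors='coerce'` makes them easy to count instead of raising. This is only a sketch: the 'month/day/year 0:00' layout of the sample strings is an assumption, since the raw file format is not shown here.

```python
import pandas as pd

# Hypothetical raw values; the real layout of RegDate may differ.
raw = pd.Series(['6/13/2018 0:00', '11/8/2016 0:00', 'not a date'])

# Strip the constant time suffix, then parse with an explicit format.
# errors='coerce' turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw.str.replace(' 0:00', '', regex=False),
                        format='%m/%d/%Y', errors='coerce')

n_bad = int(parsed.isna().sum())
print(parsed.dt.year.tolist(), n_bad)
```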


### BirthDate
Converting to dates and keeping the year only

In [9]:
v.BirthDate = pd.to_datetime(v.BirthDate.map(lambda x: x.replace(' 0:00', '')))
v['BirthYear'] = v.BirthDate.map(lambda x: x.year)
v = v.drop('BirthDate', axis='columns')

In [10]:
v[['BirthYear']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13307 entries, 0 to 13306
Data columns (total 1 columns):
BirthYear    13307 non-null int64
dtypes: int64(1)
memory usage: 104.0 KB


### Phone number

In [11]:
v['havePhone'] = v.Phone1.notnull()
v = v.drop('Phone1', axis='columns')

In [12]:
v[['havePhone']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13307 entries, 0 to 13306
Data columns (total 1 columns):
havePhone    13307 non-null bool
dtypes: bool(1)
memory usage: 13.1 KB


### Birth Place
This column is a mixture of 2 letter state codes and 2 or 3 letter country codes. I will separate them out into a dedicated 'BirthPlaceState' column, where we have that information, and a 'BirthPlaceCountry' column, which is set to USA for states. I will also introduce a 'UKN' category to capture where the information is missing or the code is not recognized.

In [13]:
# Reading in lists of USA State codes and 2 and 3 letter country codes
usa_states = pd.read_csv('data_raw/us_state_codes.csv', index_col='Code').reset_index()
country_codes = pd.read_csv('data_raw/country_codes.csv')
country_codes = country_codes[['name', 'alpha-2', 'alpha-3']]
# Namibia's 2 letter code 'NA' had to be escaped to avoid being read in as NaN!
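
The Namibia problem arises because pandas treats the string 'NA' as a missing value by default. How the file was escaped is not shown; one alternative, sketched here on an in-memory CSV, is to switch off the default NaN markers and list only the ones that are wanted.

```python
import io
import pandas as pd

# Small in-memory stand-in for country_codes.csv.
csv_text = "name,alpha-2,alpha-3\nNamibia,NA,NAM\nLatvia,LV,LVA\n"

# Default behaviour: 'NA' is read as NaN.
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False disables the built-in NaN markers;
# na_values=[''] still treats empty cells as missing.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])

print(default['alpha-2'].tolist(), kept['alpha-2'].tolist())
```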

In [14]:
def getCodeTuple(code):
    """Return (code, state, country) tuple, using 'UKN' where not in our look up lists."""
    res = (code, 'UKN','UKN')
    if not isinstance(code, str):
        pass
    elif len(code) == 2:
        if code in usa_states.Code.values:
            t = usa_states.loc[usa_states.Code == code]
            res = t.to_records(index=False)[0]
        elif code in country_codes['alpha-2'].values:
            # 2 letter code that is not a USA state: treat it as an alpha-2 country code
            res = (code,'UKN',country_codes.loc[
                country_codes['alpha-2'] == code,'name'].values[0])
    elif len(code) == 3:
        # 3 letter codes are looked up as alpha-3 country codes
        if (code in country_codes['alpha-3'].values):
            res = (code,'UKN',country_codes.loc[
                country_codes['alpha-3'] == code,'name'].values[0])
    return res


In [15]:
# test code for getCodeTuple Function
code = v.BirthPlace[v.BirthPlace.isnull()].values[0]    # NaN test
code = ['UK', 'LVA', 'IN', code, 'AC', 'KP']
[getCodeTuple(c) for c in code]

[('UK', 'UKN', 'UKN'),
 ('LVA', 'UKN', 'Latvia'),
 ('IN', 'Indiana', 'USA'),
 (nan, 'UKN', 'UKN'),
 ('AC', 'UKN', 'UKN'),
 ('KP', 'UKN', "Korea (Democratic People's Republic of)")]
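
The lookup order (state first, then country) can also be expressed with plain dictionaries, which avoids repeated DataFrame filtering. The sketch below uses tiny hypothetical lookup tables in place of the real CSV files; `lookup_birthplace` is an illustrative name, not a function from this notebook.

```python
import pandas as pd

# Hypothetical miniature lookup tables; the real ones come from the CSV files above.
STATES = {'IN': 'Indiana', 'CA': 'California'}
ALPHA2 = {'KP': "Korea (Democratic People's Republic of)", 'IN': 'India'}
ALPHA3 = {'LVA': 'Latvia', 'MEX': 'Mexico'}

def lookup_birthplace(code):
    """Return (state, country); USA states win over 2 letter country codes."""
    if not isinstance(code, str):
        return ('UKN', 'UKN')           # missing value
    if code in STATES:
        return (STATES[code], 'USA')    # 'IN' resolves to Indiana, not India
    country = ALPHA2.get(code) or ALPHA3.get(code)
    return ('UKN', country or 'UKN')    # unknown codes such as 'UK' stay 'UKN'

codes = pd.Series(['IN', 'LVA', 'KP', 'UK', None])
result = pd.DataFrame([lookup_birthplace(c) for c in codes],
                      columns=['BirthPlaceState', 'BirthPlaceCountry'])
print(result)
```

Note that this ordering means a voter born in India and recorded as 'IN' would be classified as Indiana, which is the same trade-off the function above makes.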

In [16]:
cleanbp = pd.DataFrame.from_records([
    getCodeTuple(c) for c in v.BirthPlace.unique()], columns=[
    'Code', 'BirthPlaceState', 'BirthPlaceCountry'])
cleanbp.head()

| | Code | BirthPlaceState | BirthPlaceCountry |
|:---:|:---:|:---:|:---:|
| 0 | AR | Arkansas | USA |
| 1 | CA | California | USA |
| 2 | IND | UKN | India |
| 3 | TX | Texas | USA |
| 4 | MEX | UKN | Mexico |


In [17]:
v = pd.merge(v, cleanbp, left_on=['BirthPlace'], right_on=['Code'], how='left')

In [18]:
# confirming merge and dropping the unclean columns
assert (v.BirthPlace.loc[v.BirthPlace.notnull() == True] 
        == v.Code.loc[v.Code.notnull() == True]).all()
print('BirthPlace and Code match and can be dropped')
v = v.drop(['BirthPlace','Code'], axis='columns')
v[['BirthPlaceState', 'BirthPlaceCountry']].info()

BirthPlace and Code match and can be dropped
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13307 entries, 0 to 13306
Data columns (total 2 columns):
BirthPlaceState      13307 non-null object
BirthPlaceCountry    13307 non-null object
dtypes: object(2)
memory usage: 311.9+ KB


### Gender
Some values in the Gender column are missing. We will use FirstName to attempt to assign one of the two standard genders, and set the rest to 'UNK'.

In [19]:
# loading in my gender name look up data
import json
# from https://github.com/organisciak/names
with open('data_raw/name_genders.json', 'r') as f:
    j_data = json.load(f)
names = pd.Series(j_data)
names.size

10658

In [20]:
# before cleaning
v.Gender.value_counts(dropna = False)

NaN    5223
F      4212
M      3872
Name: Gender, dtype: int64

In [21]:
def cleanGender(row):
    if str(row.Gender) == 'nan':
        row.Gender = 'UNK'
        if row.FirstName.title() in names.index:
            row.Gender = names[row.FirstName.title()]
    return row.Gender

In [22]:
v['CleanGender'] = v.apply(cleanGender, axis=1).astype('category')
v.CleanGender.value_counts(dropna=False)
#v.loc[v.CleanGender.isnull() == True,['FirstName','Gender','CleanGender']]

F      5955
M      5589
UNK    1763
Name: CleanGender, dtype: int64

In [23]:
print('The cleaning process was able to assign genders for {} females and {} males.\nA total of {} more gender entries'.format(
v.CleanGender.value_counts(dropna=False)['F']-v.Gender.value_counts(dropna=False)['F'],
v.CleanGender.value_counts(dropna=False)['M']-v.Gender.value_counts(dropna=False)['M'],
v.Gender.isnull().sum()-v.CleanGender.value_counts(dropna=False)['UNK']))

The cleaning process was able to assign genders for 1743 females and 1717 males.
A total of 3460 more gender entries


In [24]:
v = v.drop(['FirstName','Gender'], axis='columns')
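
The row-wise fill above could also be done with `map` and `fillna`, which tends to be faster and tolerates missing first names. This is a sketch with a made-up two-entry lookup standing in for name_genders.json.

```python
import pandas as pd

# Hypothetical name -> gender lookup, standing in for name_genders.json.
names = pd.Series({'Maria': 'F', 'John': 'M'})

toy = pd.DataFrame({
    'FirstName': ['MARIA', 'john', 'Xyzzy', 'Ann', None],
    'Gender': [None, None, None, 'F', None],
})

# Title-case the names to match the lookup keys; unknown names map to NaN.
inferred = toy['FirstName'].str.title().map(names)

# Keep recorded genders, fall back to the inferred value, then to 'UNK'.
toy['CleanGender'] = toy['Gender'].fillna(inferred).fillna('UNK')
print(toy['CleanGender'].tolist())
```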

### Creating 'sameMailAddress' and 'CleanMailCountry' columns
Comparing any mailing address given to the street address to see which ones are the same.

In [25]:
def cleanMail(row): 
    
    build_num = row.BuildingNumber
    if str(build_num) != 'nan':
        build_num = str(int(row.BuildingNumber))
    mail_zip = '' if str(row.MailZip) == 'nan' else row.MailZip[0:5]
    row['cc_full_add'] = ' '.join([x for x in [str(row.HouseNumber), row.Street,
                                 row.StreetType, build_num, 
                                                    str(row.AptNumber)] if str(x) != 'nan'])
    
    if (row.cc_full_add == row.MailStreet) & (mail_zip == row.Zip[0:5]):
        row['sameMailAddress'] = True
    else:
        if row.MailState in usa_states.Code.values:
            row.MailCountry = 'USA'
        row['sameMailAddress'] = False
    row['CleanMailCountry'] = row.MailCountry
    return row

In [26]:
v = v.apply(cleanMail, axis=1)
v.CleanMailCountry = v.CleanMailCountry.astype('category')
#v.iloc[:10,:].apply(sameMailAddress, axis=1)
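
The core of the comparison is: join the non-missing address parts, then require both the street line and the 5 digit zip to match. A simplified sketch on toy data follows; it leaves out 'BuildingNumber' and the country handling, and all addresses are invented.

```python
import pandas as pd

# Invented addresses; Zip and MailZip are assumed to be strings here.
toy = pd.DataFrame({
    'HouseNumber': [12, 34],
    'Street': ['MAIN', 'OAK'],
    'StreetType': ['ST', None],
    'AptNumber': [None, '5'],
    'Zip': ['94538', '94536-1234'],
    'MailStreet': ['12 MAIN ST', 'PO BOX 9'],
    'MailZip': ['94538-0001', '94536'],
})

parts = ['HouseNumber', 'Street', 'StreetType', 'AptNumber']

def join_address(row):
    """Join the non-missing address parts with single spaces."""
    return ' '.join(str(x) for x in row[parts] if pd.notna(x))

full_address = toy.apply(join_address, axis=1)

# Compare only the first 5 characters so ZIP+4 values still match.
same_zip = toy['MailZip'].str[:5] == toy['Zip'].str[:5]
toy['sameMailAddress'] = (full_address == toy['MailStreet']) & same_zip
print(toy[['MailStreet', 'sameMailAddress']])
```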

In [27]:
# Explore the households whose addresses didn't match the pattern
#v.loc[v.Household.isin(['HH-17611', 'HH-20607', 'HH-21939','HH-22036']
#                      ),['Household','sameMailAddress','cc_full_add','HouseNumber',
#       'Street','StreetType','AptNumber','MailStreet']].sort_values('Household')

Quick grouping to explore which Permanent Absentee Voters have a different mailing address.

In [28]:
v[[ 'Street','MailStreet','MailCountry','sameMailAddress',
    'CleanMailCountry', 'PAV']].groupby(['PAV','sameMailAddress']).count()

| PAV | sameMailAddress | Street | MailStreet | MailCountry | CleanMailCountry |
|:---:|:---:|:---:|:---:|:---:|:---:|
| N | False | 139 | 139 | 138 | 138 |
| N | True | 4641 | 4641 | 0 | 0 |
| Y | False | 306 | 306 | 306 | 306 |
| Y | True | 8221 | 8221 | 0 | 0 |


In [29]:
to_drop = ['CareOf','MailStreet','MailCity','MailState','MailZip','MailCountry','cc_full_add']
v = v.drop(to_drop, axis='columns')

### Set isApt field
A True/False column denoting whether the household has an apartment number.

In [30]:
v['isApt'] = v.AptNumber.notnull()
v.isApt.value_counts()

False    10481
True      2826
Name: isApt, dtype: int64

### Cleaning the Zip data
Setting all entries to a 5 digit zip code.

In [31]:
v['ZipClean'] = v.Zip.astype(str).str[0:5].astype('int')
v['ZipClean'].value_counts()

94538    7365
94536    5942
Name: ZipClean, dtype: int64
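
Slicing the first five characters works when every entry is present and well formed. A slightly more defensive variant, sketched on invented values, extracts five leading digits with a regular expression and uses the nullable 'Int64' dtype so missing entries do not break the conversion.

```python
import pandas as pd

# Invented examples: plain, ZIP+4 with and without hyphen, and a missing value.
raw_zip = pd.Series(['94538', '94536-1234', '945381234', None])

# Take the first five digits only; anything that does not match becomes NaN.
zip5 = raw_zip.str.extract(r'^(\d{5})', expand=False)

# Nullable integer dtype keeps missing values as <NA> instead of forcing floats.
zip5 = pd.to_numeric(zip5).astype('Int64')
print(zip5.tolist())
```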

### Cleaning StreetType
Significant data is missing from this column. I used 'Street' and the full address to identify the missing categories 'GRN' and 'UKN' and add them. I also found that COMMON and COMMONS had not been mapped to 'CMN', so I fixed that.

In [32]:
v.loc[:,'StreetTypeClean'] = v['StreetType']
# Cleaning the data by setting every Street that includes ' COMMON' to have StreetType 'CMN'
v.loc[v.Street.str.contains(' COMMON') == True,['StreetTypeClean']] = 'CMN'
# Same for ' GREEN' => 'GRN', and the two cross street partial addresses => 'UKN'
v.loc[v.Street.str.contains(' GREEN') == True,['StreetTypeClean']] = 'GRN'
v.loc[v.Street.str.contains('/') == True,['StreetTypeClean']] = 'UKN'
v.StreetTypeClean = v.StreetTypeClean.astype('category')

In [33]:
# check code
v.loc[(v.Street.str.contains('COMMON') == True)
      |(v.Street.str.contains(' GREEN') == True)
      |(v.Street.str.contains('/') == True)
      ,['Street','StreetType', 'StreetTypeClean']]

t = v[['Street','StreetType', 'StreetTypeClean']].groupby('Street').count()
t.sum()

StreetType         11887
StreetTypeClean    13307
dtype: int64
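
The three substitutions follow one pattern, so they can be driven from a small rule list. In this sketch `na=False` replaces the `== True` comparison and `regex=False` makes the patterns literal; the street names are invented.

```python
import pandas as pd

# Invented street names covering each rule plus an untouched row and a missing Street.
toy = pd.DataFrame({
    'Street': ['ELM COMMON', 'PARK GREEN', 'MAIN/OAK', 'FIRST', None],
    'StreetType': [None, None, None, 'ST', 'AVE'],
})

toy['StreetTypeClean'] = toy['StreetType']

# (substring to look for in Street, street type to assign), applied in order.
rules = [(' COMMON', 'CMN'), (' GREEN', 'GRN'), ('/', 'UKN')]
for pattern, street_type in rules:
    # na=False treats a missing Street as "no match" rather than NaN.
    mask = toy['Street'].str.contains(pattern, regex=False, na=False)
    toy.loc[mask, 'StreetTypeClean'] = street_type

print(toy['StreetTypeClean'].tolist())
```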

In [34]:
to_drop = ['HouseNumber', 'Street', 'StreetType', 'BuildingNumber', 'AptNumber', 'Zip']
v = v.drop(to_drop, axis='columns')

### Email Providers
Removing full email addresses and leaving the provider information where we have it.

In [35]:
#[x[1] for x in v.Email.str.split('@')]
#[y if str(y) != 'nan' else 'UKN' for y in v.Email]
v['EmailProvider'] = [x.split('@')[-1].lower() for x in [
    y if str(y) != 'nan' else 'UKN' for y in v.Email]]

In [36]:
v = v.drop('Email', axis='columns')

In [37]:
v.EmailProvider.value_counts().head(8)

ukn              9009
gmail.com        1850
yahoo.com        1311
hotmail.com       363
comcast.net       154
aol.com           111
sbcglobal.net     108
att.net            41
Name: EmailProvider, dtype: int64
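
For comparison, the provider can be extracted with vectorized string methods. In this sketch the placeholder is filled in after lowercasing, so it stays 'UKN' (the cell above lowercases it too, which is why the output shows 'ukn'). The addresses are made up.

```python
import pandas as pd

# Made-up addresses, including a missing one.
emails = pd.Series(['Jane.Doe@Gmail.com', None, 'bob@yahoo.com'])

# Split on '@', keep the last piece, lowercase it, then label missing values.
provider = emails.str.split('@').str[-1].str.lower().fillna('UKN')
print(provider.tolist())
```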

### Voter history
Separating out the Vote from the Ballot Type and setting missing values to ''.

In [38]:
#f['ballot_type'] = f['vote'].str.extract('\((.*?)\)', expand=True).fillna('')
#f['vote'] = f['vote'].replace('\(.*\)', '', regex=True)

v['E1_110816'] = v['E1_110816'].fillna('').astype('category')
# raw strings (r'...') keep the regex backslashes from being read as string escapes
v['E2_060716BT'] = v['E2_060716'].str.extract(r'\((.*?)\)', expand=True).fillna('')
v['E2_060716'] = v['E2_060716'].replace(r'\(.*\)', '', regex=True).fillna('').astype('category')
v['E3_110414'] = v['E3_110414'].fillna('').astype('category')
v['E4_060314'] = v['E4_060314'].fillna('').astype('category')
v['E5_110612'] = v['E5_110612'].fillna('').astype('category')
v['E6_060512BT'] = v['E6_060512'].str.extract(r'\((.*?)\)', expand=True).fillna('')
v['E6_060512'] = v['E6_060512'].replace(r'\(.*\)', '', regex=True).fillna('').astype('category')
v.E2_060716BT = v.E2_060716BT.astype('category')
v.E6_060512BT = v.E6_060512BT.astype('category')

In [39]:
v[['E1_110816', 'E2_060716', 'E3_110414',
       'E4_060314', 'E5_110612', 'E6_060512','E2_060716BT', 'E6_060512BT']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13307 entries, 0 to 13306
Data columns (total 8 columns):
E1_110816      13307 non-null category
E2_060716      13307 non-null category
E3_110414      13307 non-null category
E4_060314      13307 non-null category
E5_110612      13307 non-null category
E6_060512      13307 non-null category
E2_060716BT    13307 non-null category
E6_060512BT    13307 non-null category
dtypes: category(8)
memory usage: 210.5 KB
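
To make the split concrete, here is how the two regular expressions behave on a few sample codes. The 'V(DEM)' style values are an assumption about what the raw primary election codes look like; only the "vote followed by something in parentheses" shape is taken from the code above.

```python
import pandas as pd

# Hypothetical raw codes: a vote letter, optionally followed by a ballot type in parentheses.
raw = pd.Series(['V(DEM)', 'A(REP)', 'N', None])

# Capture whatever sits inside the parentheses; no match or NaN becomes ''.
ballot_type = raw.str.extract(r'\((.*?)\)', expand=False).fillna('')

# Remove the parenthesised part to leave just the vote code.
vote = raw.str.replace(r'\(.*\)', '', regex=True).fillna('')

print(vote.tolist(), ballot_type.tolist())
```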


In [40]:
election_fields = ['E1_110816','E2_060716','E3_110414','E4_060314','E5_110612','E6_060512']
bt_fields = ['E2_060716BT', 'E6_060512BT']
t = v[election_fields].stack(0).reset_index()
t.columns = ['item','election', 'vote']
t = t.groupby(['election','vote']).count().unstack('vote')
t.columns = t.columns.droplevel()
t = t.reindex(columns = ['V', 'A', 'N', ''])
t

| election | V | A | N | '' |
|:---:|:---:|:---:|:---:|:---:|
| E1_110816 | 3090 | 6130 | 3122 | 965 |
| E2_060716 | 1594 | 3252 | 6255 | 2206 |
| E3_110414 | 1533 | 2781 | 5673 | 3320 |
| E4_060314 | 824 | 1983 | 6920 | 3580 |
| E5_110612 | 3081 | 3856 | 2359 | 4011 |
| E6_060512 | 817 | 1760 | 6000 | 4730 |


Adding fields to indicate the number of elections this voter was registered for and how many elections they actually voted in, plus a column indicating the % of elections they were registered for in which they actually voted. Voters with no possible votes get -1.

In [41]:
# create a temp column that combines all vote records into one string up to 6 characters long
v['e_sum'] = v.loc[:,election_fields].sum(axis='columns')
# the length of the string gives the total number of possible votes for that voter
v['Tot_Possible_Votes'] = v.e_sum.str.len()
# counting the actual number of in person or absentee votes cast by that voter
v['Act_Votes'] = v.e_sum.str.count('[AV]')
# calculating a percent of possible votes for that voter
v['Pct_Possible_Votes'] = (v.Act_Votes/v.Tot_Possible_Votes).fillna(-1)
v = v.drop('e_sum', axis='columns')
v[['Tot_Possible_Votes','Act_Votes','Pct_Possible_Votes']].groupby(
    ['Tot_Possible_Votes','Act_Votes']).count().sort_values('Tot_Possible_Votes', ascending=False)

| Tot_Possible_Votes | Act_Votes | Pct_Possible_Votes |
|:---:|:---:|:---:|
| 6 | 6 | 1405 |
| 6 | 5 | 994 |
| 6 | 4 | 1088 |
| 6 | 3 | 1209 |
| 6 | 2 | 1491 |
| 6 | 1 | 1042 |
| 6 | 0 | 1253 |
| 5 | 0 | 75 |
| 5 | 5 | 69 |
| 5 | 4 | 88 |


In [42]:
n = election_fields + ['Tot_Possible_Votes','Act_Votes','Pct_Possible_Votes','RegDate','RegDateOriginal']
print('{} voters have RegDates after the last election held on 08 Nov 2016'.format(
    len(v.loc[v.RegDate > pd.to_datetime('2016-11-08'), n])))
print('{} voters have 0 possible votes'.format(
    v[['Tot_Possible_Votes','Act_Votes','Pct_Possible_Votes']].groupby(
    ['Tot_Possible_Votes','Act_Votes']).count().sort_values('Tot_Possible_Votes', ascending=False).loc[(0,0)][0]))

1879 voters have RegDates after the last election held on 08 Nov 2016
929 voters have 0 possible votes


### Household ID
Converting the public household ID to my private, anonymized one ('Hid').

In [43]:
v = pd.merge(v, hh_lookup.reset_index(), left_on='Household', right_on='Household_Id', how='left')

In [44]:
v[['Hid', 'Household', 'Household_Id']].info()
assert (v['Household'] == v['Household_Id']).all()
print('Household ids match and can be dropped')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13307 entries, 0 to 13306
Data columns (total 3 columns):
Hid             13307 non-null int64
Household       13307 non-null object
Household_Id    13307 non-null object
dtypes: int64(1), object(2)
memory usage: 415.8+ KB
Household ids match and can be dropped


In [45]:
v = v.drop(['Household', 'Household_Id'], axis='columns')
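
`DataFrame.merge` can perform part of this check itself: `validate='many_to_one'` raises if a household appears more than once in the lookup, and `indicator=True` flags voters without a match. A sketch with invented IDs:

```python
import pandas as pd

# Invented household IDs and anonymized Hids.
voters_toy = pd.DataFrame({'Household': ['HH-1', 'HH-2', 'HH-1']})
hh_toy = pd.DataFrame({'Household_Id': ['HH-1', 'HH-2'], 'Hid': [17, 4]})

merged = voters_toy.merge(hh_toy, left_on='Household', right_on='Household_Id',
                          how='left', validate='many_to_one', indicator=True)

# Rows that are not 'both' had no matching household in the lookup.
unmatched = int((merged['_merge'] != 'both').sum())

# Keep only the anonymized key.
merged = merged.drop(columns=['Household', 'Household_Id', '_merge'])
print(merged['Hid'].tolist(), unmatched)
```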

### Creating my own Voter ID
To finish anonymizing the data, and creating a lookup table to enable conversion back if needed.

In [46]:
# randomly shuffling the voter rows and resetting the index so the new order becomes the index
v = v.sample(frac=1).reset_index(drop=True)
v.index.name = 'Vid'
v = v.reset_index()

In [47]:
# .copy() so the later set_index(inplace=True) does not operate on a slice of v
vid_lookup = v[['Vid', 'VoterID']].copy()
v = v.drop('VoterID', axis='columns')
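
The shuffle-and-renumber step in isolation looks like the sketch below. The `random_state` argument is an addition for reproducibility that the cell above does not use; with a fixed seed, anyone holding the seed and the raw file could rebuild the mapping, so it would need to stay as private as the lookup table. IDs are invented.

```python
import pandas as pd

# Invented voter IDs.
toy = pd.DataFrame({'VoterID': [1001, 1002, 1003, 1004],
                    'Party': ['A', 'B', 'C', 'D']})

# Shuffle all rows, then let the new positional index become the anonymous ID.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled.index.name = 'Vid'
shuffled = shuffled.reset_index()

# The lookup table is the only place where Vid and VoterID appear together.
vid_lookup = shuffled[['Vid', 'VoterID']].copy()
anonymized = shuffled.drop(columns='VoterID')
print(anonymized.columns.tolist())
```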

### Saving out the cleaned data
The cleaned data is saved to file together with the lookup table

In [48]:
cols_for_cat = ['CleanGender', 'Party', 'PAV', 'BirthPlaceState', 'BirthPlaceCountry']
v[cols_for_cat] = v[cols_for_cat].apply(lambda x: x.astype('category'))
v.rename(columns={'CleanGender':'Gender', 
                  'CleanMailCountry':'MailCountry', 
                  'ZipClean':'Zip',
                  'StreetTypeClean':'StreetType',
                  'E1_110816':'E6_110816',
                  'E2_060716':'E5_060716',
                  'E2_060716BT':'E5_060716BT',
                  'E3_110414':'E4_110414',
                  'E4_060314':'E3_060314',
                  'E5_110612':'E2_110612',
                  'E6_060512':'E1_060512',
                  'E6_060512BT':'E1_060512BT'
                 }, inplace=True)

In [49]:
clean = v.copy()
clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13307 entries, 0 to 13306
Data columns (total 44 columns):
Vid                    13307 non-null int64
Abbr                   13307 non-null int64
Precinct               13307 non-null int64
PrecinctSub            13307 non-null int64
Party                  13307 non-null category
RegDate                13307 non-null datetime64[ns]
PAV                    13307 non-null category
RegDateOriginal        13307 non-null datetime64[ns]
E6_110816              13307 non-null category
E5_060716              13307 non-null category
E4_110414              13307 non-null category
E3_060314              13307 non-null category
E2_110612              13307 non-null category
E1_060512              13307 non-null category
District               13307 non-null int64
VoterScore             13307 non-null float64
VoterScorePossible     13307 non-null float64
VoterScorePctOfPoss    13307 non-null float64
UpdateStatus           13307 non-null object
cc_201

In [50]:
clean.set_index('Vid', inplace=True)
vid_lookup.set_index('Vid', inplace=True)

In [51]:
date = pd.Timestamp("today").strftime("%Y%m%d")
clean.to_csv('data_clean/{}_voters_district3.txt'.format(date))
vid_lookup.to_csv('data_clean/{}_voters_lookup_NO_GIT.txt'.format(date))