# Data Cleaning
Reads the dataset in /data/dorfman/2016-national-gop-primary.csv, removes unneeded columns, and drops polls (rows) that ended before 2016 so that only recent, relevant data remain. Also constructs a DataFrame of candidate dropout dates.

## Imports

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd
import os
import hundred

## Read Data
Source: http://elections.huffingtonpost.com/pollster/2016-national-gop-primary

Downloaded as a CSV and read into a pandas DataFrame.

In [2]:
polls = pd.read_csv('/data/dorfman/2016-national-gop-primary.csv')
polls.head()

Unnamed: 0,Pollster,Start Date,End Date,Entry Date/Time (ET),Number of Observations,Population,Mode,Trump,Cruz,Rubio,...,Pataki,Perry,Rand Paul,Santorum,Walker,Undecided,Pollster URL,Source URL,Partisan,Affiliation
0,Morning Consult,2016-03-11,2016-03-13,2016-03-14 17:45:27 UTC 2016-03-14 17:45:27 UTC,1516,Registered Voters - Republican,Internet,42,23,12,...,,,,,,9,http://elections.huffingtonpost.com/pollster/p...,,Nonpartisan,
1,YouGov/Economist,2016-03-10,2016-03-12,2016-03-14 17:17:34 UTC 2016-03-14 17:17:34 UTC,400,Likely Voters - Republican,Internet,53,22,10,...,,,,,,4,http://elections.huffingtonpost.com/pollster/p...,,Nonpartisan,
2,Ipsos/Reuters,2016-03-05,2016-03-09,2016-03-10 16:53:52 UTC 2016-03-10 16:53:52 UTC,639,Registered Voters - Republican,Internet,41,24,13,...,,,,,,5,http://elections.huffingtonpost.com/pollster/p...,,Nonpartisan,
3,Morning Consult,2016-03-04,2016-03-06,2016-03-08 15:38:21 UTC 2016-03-08 15:38:21 UTC,781,Registered Voters - Republican,Internet,40,23,14,...,,,,,,8,http://elections.huffingtonpost.com/pollster/p...,,Nonpartisan,
4,ABC/Post,2016-03-03,2016-03-06,2016-03-08 12:06:27 UTC 2016-03-08 12:06:27 UTC,400,Registered Voters - Republican,Live Phone,34,25,18,...,,,,,,7,http://elections.huffingtonpost.com/pollster/p...,,Nonpartisan,


In [3]:
assert polls.columns.size == 29

## Clean Polls
Delete columns that are not needed.

In [4]:
del polls['Start Date']
del polls['Entry Date/Time (ET)']
del polls['Number of Observations']
del polls['Population']
del polls['Mode']
del polls['Pollster URL']
del polls['Source URL']
del polls['Partisan']
del polls['Affiliation']
polls.head()

Unnamed: 0,Pollster,End Date,Trump,Cruz,Rubio,Kasich,Carson,Bush,Christie,Fiorina,Gilmore,Graham,Huckabee,Jindal,Pataki,Perry,Rand Paul,Santorum,Walker,Undecided
0,Morning Consult,2016-03-13,42,23,12,9,,,,,,,,,,,,,,9
1,YouGov/Economist,2016-03-12,53,22,10,11,,,,,,,,,,,,,,4
2,Ipsos/Reuters,2016-03-09,41,24,13,13,4.0,,,,,,,,,,,,,5
3,Morning Consult,2016-03-06,40,23,14,10,,,,,,,,,,,,,,8
4,ABC/Post,2016-03-06,34,25,18,13,,,,,,,,,,,,,,7


In [5]:
assert polls.columns.size == 20
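
The same kind of column removal can also be written as a single call. A minimal sketch on a toy frame, assuming pandas 0.21 or later for the `columns=` keyword of `DataFrame.drop`; the `toy` frame and its values are made up for illustration and only the column names mirror the real data:

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the real data
toy = pd.DataFrame({
    'Pollster': ['A', 'B'],
    'Start Date': ['2016-03-11', '2016-03-10'],
    'End Date': ['2016-03-13', '2016-03-12'],
    'Mode': ['Internet', 'Live Phone'],
    'Trump': [42, 53],
})

# drop returns a new DataFrame and leaves `toy` itself unchanged
trimmed = toy.drop(columns=['Start Date', 'Mode'])
print(list(trimmed.columns))
```

Unlike `del`, this does not modify the original frame, so the result has to be assigned to a name.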

Replace all NaNs in the 'Undecided' column with zeros.

In [6]:
polls['Undecided'] = polls['Undecided'].fillna(0)

Make sure each poll sums to 100%. Poll percentages are provided as integers, so precision is lost to rounding and a row may not total exactly 100. If the polling sum is less than 100, the remainder is added to 'Undecided'. If the polling sum is greater than 100, the surplus is subtracted from 'Undecided'.

In [7]:
hundred.Equals100(polls, 2)
# Check every row; columns 0-1 are 'Pollster' and 'End Date', so sum from column 2 on
for p in range(len(polls)):
    assert sum(polls.iloc[p][2:].dropna()) == 100
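
The `hundred` module is a local helper that is not shown in this notebook. As a rough sketch of the adjustment described above, and not the actual implementation, a function along these lines would do it; the name `equals_100_sketch`, the `start` parameter, and the toy values are all assumptions:

```python
import numpy as np
import pandas as pd

def equals_100_sketch(df, start=0):
    """Hypothetical stand-in for hundred.Equals100: adjust 'Undecided' in place
    so that the columns from position `start` onward sum to 100 in every row."""
    # Row totals over the polling columns; NaN entries are skipped by default
    totals = df.iloc[:, start:].sum(axis=1)
    # Add the shortfall to (or subtract the surplus from) 'Undecided'
    df['Undecided'] = df['Undecided'] + (100 - totals)
    return df

# Made-up rows: the first sums to 74, the second to 103
demo = pd.DataFrame({
    'Pollster': ['A', 'B'],
    'Trump': [42, 53],
    'Cruz': [23, 30],
    'Carson': [np.nan, 15],
    'Undecided': [9, 5],
})
equals_100_sketch(demo, start=1)
print(demo)
```

In this toy case the first row gains 26 points of 'Undecided' and the second loses 3, so both rows total 100.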

Remove polls that ended before 2016, then convert the dates to datetimes and average the polls that share an end date.

In [8]:
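polls = polls[polls['End Date'] >= '2016-01-01']
polls = polls.rename(columns = {'End Date': 'date'})

# Parse the ISO date strings; to_datetime keeps the existing row index aligned
polls['date'] = pd.to_datetime(polls['date'])
# Average same-day polls; numeric_only skips the text 'Pollster' column
polls = polls.groupby('date').mean(numeric_only=True)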

In [9]:
assert all(polls.index >= '2016-01-01')
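
The date grouping above collapses polls that share an end date into a single averaged row. A small illustration with made-up numbers (the `toy` and `daily` names are hypothetical):

```python
import pandas as pd

# Two made-up polls end on 2016-01-06, one on 2016-01-07
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-06', '2016-01-06', '2016-01-07']),
    'Trump': [38, 39, 35],
    'Cruz': [16, 18, 20],
})

# One row per date; each value is the mean of that day's polls
daily = toy.groupby('date').mean()
print(daily)
```

The two polls from 2016-01-06 become one row, which is why fractional values appear in the grouped data.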

Remove candidates who suspended their campaigns before January 1, 2016, and rename the 'Rand Paul' column to 'Paul'.

In [10]:
del polls['Jindal']
del polls['Pataki']
del polls['Perry']
del polls['Walker']
del polls['Graham']

polls = polls.rename(columns = {'Rand Paul': 'Paul'})
polls.head()

date,Trump,Cruz,Rubio,Kasich,Carson,Bush,Christie,Fiorina,Gilmore,Huckabee,Paul,Santorum,Undecided
2016-01-03,35.0,18.0,13.0,2.0,9.0,6,4.0,3.0,0.0,2.0,2.0,1.0,5.0
2016-01-06,38.5,17.0,10.5,2.5,8.5,6,3.5,2.0,0.0,2.0,4.0,1.0,4.5
2016-01-07,35.0,20.0,13.0,2.0,10.0,4,2.0,3.0,0.0,1.0,2.0,0.0,8.0
2016-01-08,34.0,18.0,9.0,2.0,8.0,4,4.0,2.0,,1.0,3.0,,15.0
2016-01-10,39.25,17.25,10.75,2.5,8.0,5,3.25,2.5,0.0,2.5,2.25,0.25,6.5


In [11]:
assert list(polls.columns) == ['Trump', 'Cruz', 'Rubio', 'Kasich', 'Carson', 'Bush', 'Christie',
       'Fiorina', 'Gilmore', 'Huckabee', 'Paul', 'Santorum', 'Undecided']

Make sure each poll still sums to 100% after grouping.

In [12]:
hundred.Equals100(polls)
for p in range(len(polls.index)):
    assert sum(polls.iloc[p].dropna()) == 100

The polls are now reduced to candidates who have been active in the race for the nomination since the start of 2016. The only remaining columns in the DataFrame are the candidates' polling numbers and 'Undecided'. Duplicate dates are removed by grouping on the date and taking the mean of the polls that ended that day.

## Create Dictionary on Candidates Dropped
Source: https://en.wikipedia.org/wiki/United_States_presidential_election,_2016#Withdrawn_candidates_2

The candidate data below was compiled manually from this source.

In [13]:
def inRace(name):
    """Returns whether the candidate is still in the race as of March 12, 2016

    Parameters
    ----------
    name : str
        The name of the candidate
    """
    return name in ('Trump', 'Rubio', 'Cruz', 'Kasich')

In [14]:
def InitDict():
    """Returns a list of dictionaries, one per candidate, holding the name, whether they have dropped out, and the dropout date"""
    dictOfCand = []
    # Every column except the last ('Undecided') is a candidate
    candidates = polls.columns[:-1]

    for c in candidates:
        person = {}
        person['name'] = c
        person['dropped'] = not inRace(c)
        person['date'] = ''
        dictOfCand.append(person)

    return dictOfCand

dictOfCand = InitDict()
dictOfCand

[{'date': '', 'dropped': False, 'name': 'Trump'},
 {'date': '', 'dropped': False, 'name': 'Cruz'},
 {'date': '', 'dropped': False, 'name': 'Rubio'},
 {'date': '', 'dropped': False, 'name': 'Kasich'},
 {'date': '', 'dropped': True, 'name': 'Carson'},
 {'date': '', 'dropped': True, 'name': 'Bush'},
 {'date': '', 'dropped': True, 'name': 'Christie'},
 {'date': '', 'dropped': True, 'name': 'Fiorina'},
 {'date': '', 'dropped': True, 'name': 'Gilmore'},
 {'date': '', 'dropped': True, 'name': 'Huckabee'},
 {'date': '', 'dropped': True, 'name': 'Paul'},
 {'date': '', 'dropped': True, 'name': 'Santorum'}]

In [15]:
assert len(dictOfCand) == polls.columns.size - 1

Set dates of campaign suspension for each candidate that dropped.

In [16]:
# Campaign suspension dates, keyed by candidate name
dropDates = {
    'Carson': '2016-03-04',
    'Bush': '2016-02-16',
    'Christie': '2016-02-10',
    'Fiorina': '2016-02-10',
    'Gilmore': '2016-02-12',
    'Huckabee': '2016-02-01',
    'Paul': '2016-02-03',
    'Santorum': '2016-02-03',
}

for d in dictOfCand:
    # Candidates still in the race keep the empty-string placeholder
    d['date'] = dropDates.get(d['name'], '')

dictOfCand

[{'date': '', 'dropped': False, 'name': 'Trump'},
 {'date': '', 'dropped': False, 'name': 'Cruz'},
 {'date': '', 'dropped': False, 'name': 'Rubio'},
 {'date': '', 'dropped': False, 'name': 'Kasich'},
 {'date': '2016-03-04', 'dropped': True, 'name': 'Carson'},
 {'date': '2016-02-16', 'dropped': True, 'name': 'Bush'},
 {'date': '2016-02-10', 'dropped': True, 'name': 'Christie'},
 {'date': '2016-02-10', 'dropped': True, 'name': 'Fiorina'},
 {'date': '2016-02-12', 'dropped': True, 'name': 'Gilmore'},
 {'date': '2016-02-01', 'dropped': True, 'name': 'Huckabee'},
 {'date': '2016-02-03', 'dropped': True, 'name': 'Paul'},
 {'date': '2016-02-03', 'dropped': True, 'name': 'Santorum'}]

Convert 'dictOfCand' to a DataFrame indexed by candidate name.

In [17]:
candData = pd.DataFrame(dictOfCand)
candData.index = candData.name
del candData['name']
candData

name,date,dropped
Trump,,False
Cruz,,False
Rubio,,False
Kasich,,False
Carson,2016-03-04,True
Bush,2016-02-16,True
Christie,2016-02-10,True
Fiorina,2016-02-10,True
Gilmore,2016-02-12,True
Huckabee,2016-02-01,True
Paul,2016-02-03,True
Santorum,2016-02-03,True


## Null Polling on Candidates that Drop Out
If a candidate still appears in polls dated after their dropout date, add their polling percentage to 'Undecided' and set their value to NaN.

In [18]:
for c in dictOfCand:
    if c['date'] != '':
        # Rows dated after the dropout that still report a value for this candidate
        stale = polls[c['name']].notnull() & (polls.index > c['date'])
        polls.loc[stale, 'Undecided'] += polls.loc[stale, c['name']]
        polls.loc[stale, c['name']] = np.nan
polls.tail()

date,Trump,Cruz,Rubio,Kasich,Carson,Bush,Christie,Fiorina,Gilmore,Huckabee,Paul,Santorum,Undecided
2016-03-02,41.0,16.0,20.0,10.0,10.0,,,,,,,,3.0
2016-03-06,35.75,23.75,17.5,13.5,,,,,,,,,9.5
2016-03-09,41.0,24.0,13.0,13.0,,,,,,,,,9.0
2016-03-12,53.0,22.0,10.0,11.0,,,,,,,,,4.0
2016-03-13,42.0,23.0,12.0,9.0,,,,,,,,,14.0
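
To make the reassignment concrete, here is a small sketch of the same masking idea on made-up data. The `toy` frame, the `stale` mask name, and the values are illustrative assumptions; only the dropout date is taken from the table above:

```python
import numpy as np
import pandas as pd

# Made-up polls: one before and one after the dropout date
toy = pd.DataFrame(
    {'Trump': [40.0, 45.0], 'Bush': [5.0, 3.0], 'Undecided': [55.0, 52.0]},
    index=pd.to_datetime(['2016-02-10', '2016-02-20']),
)
dropout = pd.Timestamp('2016-02-16')

# Rows after the dropout that still report a number for the candidate
stale = toy['Bush'].notnull() & (toy.index > dropout)
# Move those points to 'Undecided', then blank the candidate's value
toy.loc[stale, 'Undecided'] += toy.loc[stale, 'Bush']
toy.loc[stale, 'Bush'] = np.nan
print(toy)
```

Only the later row changes: its 3 points move to 'Undecided', so each row still totals 100 once NaN is ignored.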


## Write Data to Files

In [19]:
polls.to_csv('polls.csv')
candData.to_csv('candidates.csv')
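
When these files are loaded later, the date index has to be parsed again, since a CSV stores it as text. A sketch of the round trip, using a temporary directory and a made-up two-row frame rather than the real output files:

```python
import os
import tempfile

import pandas as pd

# Made-up frame shaped like the cleaned polls: a DatetimeIndex named 'date'
toy = pd.DataFrame(
    {'Trump': [35.0, 38.5], 'Undecided': [65.0, 61.5]},
    index=pd.DatetimeIndex(['2016-01-03', '2016-01-06'], name='date'),
)

path = os.path.join(tempfile.mkdtemp(), 'polls.csv')
toy.to_csv(path)

# index_col restores the index; parse_dates=True converts it back to datetimes
restored = pd.read_csv(path, index_col='date', parse_dates=True)
print(restored.index)
```

Without `parse_dates=True` the index would come back as plain strings.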