# Exploration
A brief look at the cleaned polling data as a whole. The cleaned data is plotted to surface any immediately noticeable trends. The data is then bootstrapped by resampling the existing polls, and the resampled average is combined with the original data in a weighted average: the original data carries 2/3 of the weight and the resampled data carries 1/3.

## Imports

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import datetime
import numpy as np
import pandas as pd
import os

import hundred

ImportError: No module named 'hundred'

## Read Files Written in 02-DataCleaning

In [None]:
polls = pd.read_csv('polls.csv')
candidates = pd.read_csv('candidates.csv', index_col='name')

Convert date from string to datetime.

In [None]:
polls.date = pd.Series(pd.DatetimeIndex(polls.date))
polls.index = polls.date
del polls['date']

candidates.date = pd.to_datetime(candidates.date)
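
As a minimal sketch of the conversion above, the same steps can be tried on a small made-up frame. The column names and values here are illustrative assumptions, not the contents of `polls.csv`.

```python
import pandas as pd

# Hypothetical stand-in for polls.csv; the real file has more columns and rows.
toy = pd.DataFrame({
    "date": ["2016-01-05", "2016-01-12", "2016-01-19"],
    "Carson": [9.0, 8.0, 8.5],
    "Undecided": [91.0, 92.0, 91.5],
})

# Parse the date strings, then use the parsed dates as the index.
toy["date"] = pd.to_datetime(toy["date"])
toy = toy.set_index("date")

# With a DatetimeIndex, rows can be selected by comparing against a date.
after_jan_10 = toy[toy.index > pd.Timestamp("2016-01-10")]
```

Indexing by date is what later allows polls to be filtered relative to each candidate's drop-out date.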

## Graph
Plots all of the polling data in a single figure, with a vertical line marking each candidate's drop-out date.

In [None]:
def GraphAllPolls():
    """Graphs out all polling data"""
    plt.figure(figsize=(20,10))

    for p in polls:
        # Label each line with its column name so plt.legend() has entries to show
        plt.plot(polls[p], label=p)

    plt.axvline(candidates['date']['Carson'], color='#c9c95b')
    plt.axvline(candidates['date']['Bush'], color='#72bcd4')
    # Christie and Fiorina dropped out on the same day. Offset Christie's line by several hours so both lines display
    plt.axvline(candidates['date']['Christie'] + datetime.timedelta(hours=8), color='b') 
    plt.axvline(candidates['date']['Fiorina'], color='#66b266')
    plt.axvline(candidates['date']['Gilmore'], color='#e54444')
    plt.axvline(candidates['date']['Huckabee'], color='purple')
    # Paul and Santorum dropped out on the same day. Offset Paul's line by several hours so both lines display
    plt.axvline(candidates['date']['Paul'] + datetime.timedelta(hours=8), color='#c9c95b')
    plt.axvline(candidates['date']['Santorum'], color='#72bcd4')

    plt.title("GOP Candidate Polling", size=20)
    plt.xlabel("Date of Poll", size=16)
    plt.ylabel("Polling Percentage", size=16)

    # x and y limits are a little greater than needed to display the legend without blocking out data
    plt.xlim(pd.Timestamp('2016-01-03'), pd.Timestamp('2016-03-22'))
    plt.ylim(0, 60)
    plt.legend(fontsize=12)

In [None]:
GraphAllPolls()

## Sampling the Data
Bootstrap the data by resampling it 100 times (the default value of `l`) and averaging the samples. That average is then combined with the original data in a weighted average: the original data carries 2/3 of the weight and the bootstrapped data carries 1/3.

In [None]:
def bootstrap(data, l=100):
    """Samples data l times and averages the sampled data with the original data. The original data is weighted twice
    as much as the sampled data.
    
    Parameters
    ----------
    data : DataFrame
        DataFrame holding all polling data.
    l : int
        Amount of times to sample data.
    """
    
    data = data.fillna(0)
    
    means = []
    for i in range(l):
        # Bootstrap draw: resample as many rows as the data has, with replacement
        means.append(data.sample(n=len(data.index), replace=True))
        means[i].index = data.index
    
    avg = sum(means)
    return (avg + data * l * 2)/(l * 3)
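
The weighting can be checked on a toy frame. This is a sketch under stated assumptions rather than the notebook's exact code: the two-candidate data, the name `n_samples`, and the per-draw `random_state` seeds are all made up for illustration, and the draws are taken with replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical two-candidate polls; each row sums to 100.
toy = pd.DataFrame({"A": [40.0, 50.0, 60.0], "B": [60.0, 50.0, 40.0]})

n_samples = 100  # plays the role of `l` in bootstrap()
resampled_sum = 0
for i in range(n_samples):
    # Draw len(toy) rows with replacement; seeding each draw keeps the run repeatable.
    draw = toy.sample(n=len(toy), replace=True, random_state=i)
    # Drop the sampled row labels so the draw lines up with toy by position.
    resampled_sum = resampled_sum + draw.reset_index(drop=True)

resampled_mean = resampled_sum / n_samples

# Original data gets weight 2/3, the resampled mean gets weight 1/3.
blended = (2 * toy + resampled_mean) / 3

# Every resampled row sums to 100, so every blended row does too (up to rounding).
row_sums = blended.sum(axis=1)
```

The expression `(2 * toy + resampled_mean) / 3` is algebraically the same as `(avg + data * l * 2)/(l * 3)` in `bootstrap`, since `avg` is the sum of the `l` draws.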

Get a glimpse of the difference between the original polling data and the bootstrapped data.

In [None]:
bootPolls = bootstrap(polls)
(polls - bootPolls).tail()

In [None]:
polls = bootPolls

If a candidate still appears in polls taken after they dropped out, add their polling percentage to 'Undecided' and set their value to NaN.

In [None]:
for c in candidates.index:
    date = candidates.loc[c, 'date']
    if pd.notnull(date):
        # Polls taken after the drop-out date that still list the candidate
        dropped = polls[c].notnull() & (polls.index > date)
        polls.loc[dropped, 'Undecided'] += polls.loc[dropped, c]
        polls.loc[dropped, c] = float('NaN')

polls.tail()
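
The reassignment step can be illustrated on a few made-up rows. In this sketch the candidates "X" and "Y", the poll values, and the drop-out date are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical polls indexed by date; "X" stands in for a candidate who dropped out.
toy = pd.DataFrame(
    {"X": [10.0, 8.0, 5.0], "Y": [60.0, 62.0, 65.0], "Undecided": [30.0, 30.0, 30.0]},
    index=pd.to_datetime(["2016-02-01", "2016-02-08", "2016-02-15"]),
)
dropout_date = pd.Timestamp("2016-02-05")  # assumed drop-out date for "X"

# Rows polled after the drop-out date where "X" still has a value.
mask = toy["X"].notnull() & (toy.index > dropout_date)

# Move the candidate's share to Undecided, then blank out the candidate.
toy.loc[mask, "Undecided"] += toy.loc[mask, "X"]
toy.loc[mask, "X"] = np.nan
```

Because the share is moved rather than discarded, each row's non-missing values still add up to the same total as before.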

Confirm polls still sum up to 100.

In [None]:
hundred.Equals100(polls)
# Compare within a tolerance: the weighted averaging introduces floating-point rounding
for p in range(len(polls.index)):
    assert np.isclose(polls.iloc[p].dropna().sum(), 100)
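
Averaged percentages are floats, so a row total can differ from 100 by rounding error alone. The following sketch shows a tolerance-based check on made-up rows; the column names and values are assumptions chosen because thirds and sevenths cannot be stored exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical rows whose shares add up to 100 on paper.
toy = pd.DataFrame({
    "A": [100 / 3, 100 / 7],
    "B": [100 / 3, 200 / 7],
    "Undecided": [100 / 3, np.nan],
    "C": [np.nan, 400 / 7],
})

# DataFrame.sum skips NaN by default, so missing candidates are ignored.
row_sums = toy.sum(axis=1)

# Compare within a tolerance instead of testing exact equality.
all_close = np.allclose(row_sums, 100)
```

`np.allclose` checks every row at once, which avoids looping over the index.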

In [None]:
GraphAllPolls()

## Write Data to Files
Write bootstrapped polling data to file.

In [None]:
polls.to_csv('bootPolls.csv')