# 2016 US Election Forecast

This is a re-implementation of [Drew Linzer's election forecasting model](http://votamatic.org/wp-content/uploads/2013/07/Linzer-JASA13.pdf), originally implemented by [Pierre-Antoine Kremp](https://github.com/pkremp/polls). The model is fit using PyMC3.

In [102]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import pymc3 as pm
from pollster import Pollster
import us
from datetime import date

## Import data

Download polling data from HuffPost Pollster. Each race has its own slug, and the polls for that slug are read from a public CSV endpoint.

In [2]:
pollster = Pollster()

In [3]:
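# Lower-case state names, used to build one Pollster slug per state
states = [state.name.lower() for state in us.STATES]
# Excluded from the default slug pattern; Florida and California are added back by hand below
bad_states = ('district of columbia', 'florida', 'california')
stubs = ["2016-{0}-president-trump-vs-clinton".format(state) for state in states if state not in bad_states]
# The national race, plus the two states whose slugs follow a different pattern
stubs += ["2016-general-election-trump-vs-clinton",
          "2016-california-presidential-general-election-trump-vs-clinton",
          "2016-florida-presidential-general-election-trump-vs-clinton"]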

In [10]:
def url(stub):
    return "http://elections.huffingtonpost.com/pollster/{0}.csv".format(stub.replace(' ', '-'))

In [29]:
raw_polls = [pd.read_csv(url(stub)).assign(state=stub.split('-')[1]) for stub in stubs]
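
To make the slug handling concrete, here is a minimal sketch of what the URL helper and the `state` label produce for a multi-word state and for the national race. `build_url` and `state_label` are hypothetical names introduced only for this illustration; they mirror the string operations in the cells above and make no network request.

```python
BASE = "http://elections.huffingtonpost.com/pollster/{0}.csv"

def build_url(stub):
    """Hypothetical stand-alone version of the URL helper: spaces become hyphens."""
    return BASE.format(stub.replace(" ", "-"))

def state_label(stub):
    """Second hyphen-separated field of the slug, as used for the `state` column."""
    return stub.split("-")[1]

state_stub = "2016-new hampshire-president-trump-vs-clinton"
national_stub = "2016-general-election-trump-vs-clinton"

print(build_url(state_stub))
print(state_label(state_stub))     # multi-word names survive because the stub still contains the space
print(state_label(national_stub))  # the national race is labelled by the word after "2016-"
```

Because the label is taken from the stub before spaces are replaced, a state such as New Hampshire keeps its full name, while national polls end up with the label `general`.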

In [44]:
all_polls = pd.concat(raw_polls)
all_polls.columns = all_polls.columns.str.lower()
all_polls.shape

(2892, 20)

In [45]:
all_polls.isnull().sum()

affiliation                  0
clinton                      0
end date                     0
entry date/time (et)         0
johnson                   2535
mcmullin                  2883
mode                         0
number of observations     528
other                      923
partisan                     0
pollster                     0
pollster url                 0
population                   0
question iteration           0
question text             1793
source url                   0
start date                   0
trump                        1
undecided                  146
state                        0
dtype: int64

In [46]:
all_polls.head()

Unnamed: 0,affiliation,clinton,end date,entry date/time (et),johnson,mcmullin,mode,number of observations,other,partisan,pollster,pollster url,population,question iteration,question text,source url,start date,trump,undecided,state
0,,35.0,2016-10-31,2016-11-01T13:54:44Z,,,Internet,485.0,,Nonpartisan,SurveyMonkey,http://elections.huffingtonpost.com/pollster/p...,Likely Voters,1,,https://www.surveymonkey.com/elections/map?pol...,2016-10-25,55.0,3.0,alabama
1,,37.0,2016-10-29,2016-11-01T12:54:47Z,,,Internet,349.0,,Nonpartisan,UPI/CVOTER,http://elections.huffingtonpost.com/pollster/p...,Likely Voters,1,,https://www.documentcloud.org/documents/321097...,2016-10-23,58.0,5.0,alabama
2,,39.0,2016-10-27,2016-10-31T21:52:12Z,,,Internet,505.0,,Nonpartisan,Ipsos/Reuters,http://elections.huffingtonpost.com/pollster/p...,Likely Voters,1,,http://big.assets.huffingtonpost.com/2016.Reut...,2016-10-07,51.0,10.0,alabama
3,,36.0,2016-10-24,2016-10-26T13:40:15Z,,,Internet,415.0,,Nonpartisan,SurveyMonkey,http://elections.huffingtonpost.com/pollster/p...,Likely Voters,1,,https://www.surveymonkey.com/elections/map?pol...,2016-10-18,52.0,2.0,alabama
4,,38.0,2016-10-16,2016-10-20T15:26:38Z,,,Internet,327.0,,Nonpartisan,UPI/CVOTER,http://elections.huffingtonpost.com/pollster/p...,Likely Voters,1,,https://assets.documentcloud.org/documents/314...,2016-10-09,57.0,,alabama


Convert the start and end dates to datetimes, then derive each poll's duration in days, its midpoint date, and the week and day of the week of that midpoint.

In [119]:
all_polls['end'] = pd.to_datetime(all_polls['end date'])
all_polls['begin'] = pd.to_datetime(all_polls['start date'])
# Length of the fieldwork window, in days
all_polls['poll_time'] = (all_polls.end - all_polls.begin).dt.days
# Midpoint of the fieldwork window
all_polls['poll_date'] = (all_polls.end - (all_polls.end - all_polls.begin) / 2)
# ISO week number; on pandas >= 1.1 use .dt.isocalendar().week instead
all_polls['week'] = all_polls.poll_date.dt.week
all_polls['day_of_week'] = all_polls.poll_date.dt.dayofweek

In [97]:
all_polls[['begin', 'end']].head()

Unnamed: 0,begin,end
0,2016-10-25,2016-10-31
1,2016-10-23,2016-10-29
2,2016-10-07,2016-10-27
3,2016-10-18,2016-10-24
4,2016-10-09,2016-10-16

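
The midpoint arithmetic can be checked by hand on a single poll. The sketch below uses only the standard library rather than pandas, so that each step is visible; `poll_midpoint` is a hypothetical helper name, and the dates are those of the fifth Alabama poll shown earlier (9 to 16 October 2016).

```python
from datetime import datetime

def poll_midpoint(begin, end):
    """Return the midpoint between a poll's start and end dates."""
    return end - (end - begin) / 2

begin = datetime(2016, 10, 9)
end = datetime(2016, 10, 16)
mid = poll_midpoint(begin, end)

duration_days = (end - begin).days   # length of the fieldwork window
iso_week = mid.isocalendar()[1]      # ISO week number of the midpoint
day_of_week = mid.weekday()          # Monday is 0, Sunday is 6

print(mid, duration_days, iso_week, day_of_week)
```

A seven-day window has its midpoint at noon on 12 October, which is why some `poll_date` values carry a 12:00 time component.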

Reconcile pollsters that appear under more than one name

In [81]:
all_polls.pollster = all_polls.pollster.replace({"Fox News": "FOX",
                                                 "WashPost": "Washington Post",
                                                 "ABC News": "ABC"})

In [82]:
# Treat a missing undecided share as zero
all_polls.undecided = all_polls.undecided.fillna(0)

Combine other candidate categories

In [68]:
all_polls['other'] = all_polls[['johnson', 'mcmullin', 'other']].fillna(0).sum(axis=1)

In [83]:
# Combined two-candidate share; NaN if either value is missing
all_polls['both'] = all_polls.clinton + all_polls.trump
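
The two sums above treat missing values differently, which matters because the null counts show one poll with no `trump` value. The following is a small sketch on made-up numbers (the two rows are illustrative, not taken from the data) showing the contrast between a row-wise `sum` after `fillna(0)` and plain column addition.

```python
import numpy as np
import pandas as pd

# Two illustrative polls; the second has no Trump figure
df = pd.DataFrame({
    "johnson":  [5.0, np.nan],
    "mcmullin": [np.nan, np.nan],
    "other":    [2.0, 3.0],
    "clinton":  [40.0, 45.0],
    "trump":    [45.0, np.nan],
})

# Missing third-party shares count as zero in the combined "other" total
df["other_total"] = df[["johnson", "mcmullin", "other"]].fillna(0).sum(axis=1)

# Column addition propagates NaN, so the second poll has no two-candidate total
df["both"] = df.clinton + df.trump

print(df[["other_total", "both"]])
```

A poll with a missing candidate value therefore keeps a `NaN` in `both` unless it is dropped or filled explicitly.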

In [110]:
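# Keep polls with a reported sample size, a midpoint on or after 1 April 2016,
# and one of the three main survey populations
start_date = pd.Timestamp('2016-04-01')  # a Timestamp compares cleanly with the datetime64 column
rows_to_keep = ((all_polls['number of observations'] > 1)
                & (all_polls.poll_date >= start_date)
                & (all_polls.population.isin(['Likely Voters', 'Registered Voters', 'Adults'])))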
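
As a rough illustration of what this mask removes, the sketch below applies the same three conditions to four made-up rows. The column name `n_obs` and the population label `"Likely Voters - Democrat"` are assumptions for the example, and the cutoff is written as a `pd.Timestamp`, chosen here so the comparison with a datetime column behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

polls = pd.DataFrame({
    "n_obs": [485.0, np.nan, 900.0, 600.0],
    "poll_date": pd.to_datetime(["2016-10-28", "2016-10-26", "2016-03-15", "2016-09-01"]),
    "population": ["Likely Voters", "Likely Voters",
                   "Registered Voters", "Likely Voters - Democrat"],
})

start_date = pd.Timestamp("2016-04-01")
mask = ((polls.n_obs > 1)                      # NaN > 1 is False, so unknown sample sizes drop out
        & (polls.poll_date >= start_date)      # removes the March poll
        & polls.population.isin(["Likely Voters", "Registered Voters", "Adults"]))

kept = polls[mask]
print(kept)
```

Only the first row survives. In the real data the first condition alone excludes the 528 polls that have no number of observations.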

In [120]:
cols_to_keep = ['begin', 'end', 'poll_time', 'poll_date', 'week', 'day_of_week',
                'mode', 'population', 'number of observations',
                'clinton', 'trump', 'both', 'other']
poll_data = (all_polls.loc[rows_to_keep, cols_to_keep]
             .rename(columns={'mode': 'method', 'population': 'vtype', 'number of observations': 'n_obs'}))

In [122]:
poll_data['poll_type'] = poll_data.vtype.replace({"Likely Voters":0, 
                                                     "Registered Voters":1,
                                                     "Adults":2})

In [123]:
poll_data.head()

Unnamed: 0,begin,end,poll_time,poll_date,week,day_of_week,method,vtype,n_obs,clinton,trump,both,other,poll_type
0,2016-10-25,2016-10-31,6,2016-10-28 00:00:00,43,4,Internet,Likely Voters,485.0,35.0,55.0,90.0,0.0,0
1,2016-10-23,2016-10-29,6,2016-10-26 00:00:00,43,2,Internet,Likely Voters,349.0,37.0,58.0,95.0,0.0,0
2,2016-10-07,2016-10-27,20,2016-10-17 00:00:00,42,0,Internet,Likely Voters,505.0,39.0,51.0,90.0,0.0,0
3,2016-10-18,2016-10-24,6,2016-10-21 00:00:00,42,4,Internet,Likely Voters,415.0,36.0,52.0,88.0,0.0,0
4,2016-10-09,2016-10-16,7,2016-10-12 12:00:00,41,2,Internet,Likely Voters,327.0,38.0,57.0,95.0,0.0,0


## Specify model

## Platform information

In [1]:
%load_ext watermark

In [7]:
%watermark -v -m -g -p pandas,numpy,pymc3

CPython 3.5.2
IPython 5.1.0

pandas 0.19.0
numpy 1.11.2
pymc3 3.0.rc2

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)
system     : Darwin
release    : 16.1.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
Git hash   : 6c363171114ef79674b6b85be416ad70c121ed5d
