### THIS IS A WORK IN PROGRESS (BROKEN)

# 2016 US Election Forecast

This is a re-implementation of [Drew Linzer's election forecasting model](http://votamatic.org/wp-content/uploads/2013/07/Linzer-JASA13.pdf), originally implemented in Stan by [Pierre-Antoine Kremp](https://github.com/pkremp/polls). The model is fit using PyMC3.

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import us
from datetime import date

## Import data

Download polling data from the Huffington Post Pollster site, one CSV file per state plus one for the national race, and process it.

In [3]:
states = [state.name.lower() for state in us.STATES]
bad_states = 'district of columbia', 'florida', 'california'
stubs = ["2016-{0}-president-trump-vs-clinton".format(state) for state in states if state not in bad_states]
stubs += ["2016-general-election-trump-vs-clinton",
           "2016-california-presidential-general-election-trump-vs-clinton",
           "2016-florida-presidential-general-election-trump-vs-clinton"]

In [4]:
url = lambda stub: "http://elections.huffingtonpost.com/pollster/{0}.csv".format('-'.join(stub.split(' ')))

In [5]:
raw_polls = [pd.read_csv(url(stub)).assign(state=stub.split('-')[1]) for stub in stubs]
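
The stubs keep multi-word state names with their spaces, so the URL helper and the `state` label treat them differently. The offline sketch below illustrates this with one state; the name `build_url` is introduced here only for illustration and mirrors the `url` lambda, and nothing is downloaded.

```python
# Illustrative only: `build_url` mirrors the `url` lambda above without any network access.
BASE = "http://elections.huffingtonpost.com/pollster/{0}.csv"

def build_url(stub):
    # Spaces inside multi-word state names become hyphens in the URL.
    return BASE.format('-'.join(stub.split(' ')))

stub = "2016-{0}-president-trump-vs-clinton".format("new hampshire")

# The state label is the second hyphen-separated field of the stub.
state_label = stub.split('-')[1]
national_label = "2016-general-election-trump-vs-clinton".split('-')[1]

print(build_url(stub))
print(state_label, '|', national_label)
```

The national stub yields the label `general`, which is the value used further down to separate national polls from state polls.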

In [14]:
all_polls = pd.concat(raw_polls, ignore_index=True)
all_polls.columns = all_polls.columns.str.lower()
all_polls.shape

(2982, 20)

In [15]:
all_polls.isnull().sum()

affiliation                  0
clinton                      0
end date                     0
entry date/time (et)         0
johnson                   2597
mcmullin                  2971
mode                         0
number of observations     541
other                      928
partisan                     0
pollster                     0
pollster url                 0
population                   0
question iteration           0
question text             1825
source url                   0
start date                   0
trump                        1
undecided                  153
state                        0
dtype: int64

Convert the date columns, then compute each poll's duration and midpoint date

In [24]:
all_polls['end'] = pd.to_datetime(all_polls['end date'])
all_polls['begin'] = pd.to_datetime(all_polls['start date'])
all_polls['poll_time'] = (all_polls.end - all_polls.begin).dt.days
poll_date = (all_polls.end - (all_polls.end - all_polls.begin) / 2)
all_polls['poll_date'] = poll_date.dt.date
all_polls['week'] = poll_date.dt.week
all_polls['day_of_week'] = poll_date.dt.dayofweek
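
As a check on the midpoint formula, here is the same arithmetic on a single poll using plain `datetime` objects instead of pandas columns. The field dates are those of the first Alabama poll displayed later in this notebook; the standard-library calls are a stand-in for the `.dt` accessors, not the notebook's own code.

```python
from datetime import datetime

# Field dates of the first Alabama poll shown later in this notebook.
begin = datetime(2016, 10, 27)
end = datetime(2016, 11, 2)

poll_time = (end - begin).days        # length of the field period in days
midpoint = end - (end - begin) / 2    # same formula as the pandas cell above

poll_date = midpoint.date()
week = midpoint.isocalendar()[1]      # ISO week number
day_of_week = midpoint.weekday()      # Monday=0 ... Sunday=6

print(poll_time, poll_date, week, day_of_week)
```

These values agree with the `poll_time`, `poll_date`, `week` and `day_of_week` entries for that poll in the cleaned data.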

Deal with inconsistency in pollster names

In [25]:
all_polls.pollster = all_polls.pollster.replace({"Fox News":"FOX",
                            "WashPost":"Washington Post",
                            "ABC News":"ABC"})

Combine other candidate categories

In [26]:
all_polls['other'] = all_polls[['johnson', 'mcmullin', 'other']].fillna(0).sum(1)

In [27]:
all_polls['both'] = all_polls.clinton + all_polls.trump

Fill missing values in the `undecided` column with zero.

In [28]:
all_polls.undecided = all_polls.undecided.fillna(0)

Important dates

In [29]:
start_date = date(2016, 4, 1)
election_date = date(2016, 11, 8)

Rows and columns we need for analysis

In [30]:
rows_to_keep = ((all_polls['number of observations']>1)
               & (all_polls.poll_date >= start_date)
               & (all_polls.population.isin(['Likely Voters', 'Registered Voters', 'Adults'])))

cols_to_keep = ['state', 'begin', 'end', 'poll_time', 'poll_date', 'week', 'day_of_week', 
               'pollster', 'mode', 'population', 'number of observations',
               'clinton', 'trump', 'both', 'other']

In [31]:
poll_data = (all_polls.loc[rows_to_keep, cols_to_keep]
                .rename(columns={'mode':'method', 'population':'vtype', 'number of observations':'n_obs'}))

Derived columns

In [32]:
poll_data['poll_type'] = poll_data.vtype.replace({"Likely Voters":0, 
                                                     "Registered Voters":1,
                                                     "Adults":2})
poll_data['p_clinton'] = poll_data.clinton / poll_data.both
poll_data['n_clinton'] = poll_data.n_obs * poll_data.clinton / 100
poll_data['n_respondents'] = poll_data.n_obs * poll_data.both / 100
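
The derived columns can be traced by hand for one poll. The sketch below repeats the arithmetic for the first Alabama row (621 respondents, Clinton 35%, Trump 53%) using plain floats rather than DataFrame columns.

```python
# Values from the first Alabama row of the cleaned data.
n_obs, clinton, trump = 621.0, 35.0, 53.0

both = clinton + trump                # two-party share of the sample, in percent
p_clinton = clinton / both            # Clinton share of the two-party vote
n_clinton = n_obs * clinton / 100     # implied number of Clinton respondents
n_respondents = n_obs * both / 100    # implied number of two-party respondents

print(round(p_clinton, 6), round(n_clinton, 2), round(n_respondents, 2))
```

Both implied counts are generally fractional, because they are reconstructed from rounded percentages.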

In [33]:
poll_data.head()

| | state | begin | end | poll_time | poll_date | week | day_of_week | pollster | method | vtype | n_obs | clinton | trump | both | other | poll_type | p_clinton | n_clinton | n_respondents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alabama | 2016-10-27 | 2016-11-02 | 6 | 2016-10-30 | 43 | 6 | SurveyMonkey | Internet | Likely Voters | 621.0 | 35.0 | 53.0 | 88.0 | 0.0 | 0 | 0.397727 | 217.35 | 546.48 |
| 1 | alabama | 2016-10-23 | 2016-10-29 | 6 | 2016-10-26 | 43 | 2 | UPI/CVOTER | Internet | Likely Voters | 349.0 | 37.0 | 58.0 | 95.0 | 0.0 | 0 | 0.389474 | 129.13 | 331.55 |
| 2 | alabama | 2016-10-07 | 2016-10-27 | 20 | 2016-10-17 | 42 | 0 | Ipsos/Reuters | Internet | Likely Voters | 505.0 | 39.0 | 51.0 | 90.0 | 0.0 | 0 | 0.433333 | 196.95 | 454.5 |
| 3 | alabama | 2016-10-18 | 2016-10-26 | 8 | 2016-10-22 | 42 | 5 | SurveyMonkey | Internet | Likely Voters | 486.0 | 36.0 | 52.0 | 88.0 | 0.0 | 0 | 0.409091 | 174.96 | 427.68 |
| 4 | alabama | 2016-10-09 | 2016-10-16 | 7 | 2016-10-12 | 41 | 2 | UPI/CVOTER | Internet | Likely Voters | 327.0 | 38.0 | 57.0 | 95.0 | 0.0 | 0 | 0.4 | 124.26 | 310.65 |


In [34]:
poll_data.shape

(1741, 19)

Remove old polls, keeping only those dated after the start date

In [35]:
recent_poll_data = poll_data[poll_data.poll_date>start_date]

Remove overlapping polls, keeping the first row for each combination of state, poll date and pollster

In [36]:
poll_data_2016 = recent_poll_data.drop_duplicates(['state', 'poll_date', 'pollster'])
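
The effect of `drop_duplicates` with a column subset is easiest to see on a toy frame. The rows and pollster label below are made up for illustration; only the call itself matches the cell above.

```python
import pandas as pd

# Toy data: two rows sharing state, midpoint date and pollster, plus one distinct poll.
toy = pd.DataFrame({
    'state': ['ohio', 'ohio', 'ohio'],
    'poll_date': ['2016-10-20', '2016-10-20', '2016-10-21'],
    'pollster': ['A', 'A', 'A'],
    'clinton': [45.0, 46.0, 44.0],
})

# Keep only the first row for each (state, poll_date, pollster) combination.
deduped = toy.drop_duplicates(['state', 'poll_date', 'pollster'])
print(deduped)
```

Which row survives depends on the row order, since the default is to keep the first occurrence.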

In [37]:
poll_data_2016.to_csv('data/clean/poll_data_2016.csv')
poll_data_2016.shape

(1288, 19)

Get pollster list

In [82]:
pollsters = poll_data_2016.pollster.unique()

Split polling data into state and national

In [74]:
national_poll_ind = poll_data_2016.state=='general'
national_data_2016 = poll_data_2016[national_poll_ind]
state_data_2016 = poll_data_2016[~national_poll_ind]

Range of days for election period

In [75]:
state_days = pd.date_range(state_data_2016.poll_date.min(), 
                         state_data_2016.poll_date.max())
national_days = pd.date_range(national_data_2016.poll_date.min(), 
                         national_data_2016.poll_date.max())

Obtain each poll's position in the date sequence, to use for indexing the daily coefficients.

In [78]:
state_day_index_series = pd.Series(range(len(state_days)), index=state_days)
STATE_DAY_IND = state_day_index_series.loc[state_data_2016.poll_date]

national_day_index_series = pd.Series(range(len(national_days)), index=national_days)
NATIONAL_DAY_IND = national_day_index_series.loc[national_data_2016.poll_date]

Same idea for states

In [79]:
state_index_series = pd.Series(range(len(states)), index=states)
STATE_IND = state_index_series.loc[state_data_2016.state]

And for pollster house effects

In [85]:
pollster_index_series = pd.Series(range(len(pollsters)), index=pollsters)
STATE_POLLSTER_IND = pollster_index_series.loc[state_data_2016.pollster]
NATIONAL_POLLSTER_IND = pollster_index_series.loc[national_data_2016.pollster]
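
All three lookups (day, state, pollster) follow the same pattern: build a Series that maps each label to an integer position, then look up one integer per poll. A minimal sketch with hypothetical pollster labels follows; the trailing `.values` is an addition here, on the assumption that a plain NumPy array is more convenient than a pandas Series for indexing model tensors.

```python
import pandas as pd

# Hypothetical pollster labels, one per poll.
poll_pollsters = pd.Series(['FOX', 'ABC', 'FOX', 'Ipsos/Reuters'])

# Unique labels in order of first appearance, mapped to 0..K-1.
labels = poll_pollsters.unique()
index_series = pd.Series(range(len(labels)), index=labels)

# One integer per poll; .values strips the pandas index (an assumption, see above).
POLLSTER_IND = index_series.loc[poll_pollsters].values
print(POLLSTER_IND)
```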

### 2012 data

The 2012 results are used to derive priors and state weights, and to look up electoral votes.

In [113]:
data_2012 = pd.read_csv('data/raw/2012.csv', index_col=-3).sort_index()
new_index = pd.Series(data_2012.index.values).str.lower().replace({'d.c.':'district of columbia'})
data_2012.index = new_index

In [114]:
national_score = data_2012.obama_count.sum() / (data_2012.romney_count + data_2012.obama_count).sum()
national_score

0.51963863890611295

In [115]:
data_2012['score'] = data_2012.obama_count / (data_2012.romney_count + data_2012.obama_count)
data_2012['diff_score'] = data_2012.score - national_score
data_2012['share_national'] = (data_2012.total_count * (1 + data_2012.adult_pop_growth_2011_15)
                               / (data_2012.total_count*(1+data_2012.adult_pop_growth_2011_15)).sum())
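
The three derived quantities can be checked by hand. The sketch below uses Alabama's 2012 counts and the national score printed above for `score` and `diff_score`; the totals and growth rates used for `share_national` are made-up values for three hypothetical states, chosen only to show the normalisation.

```python
# Alabama counts and the national two-party score, both taken from this notebook.
obama_count, romney_count = 795696, 1255925
national_score = 0.51963863890611295

score = obama_count / (obama_count + romney_count)   # two-party Obama share
diff_score = score - national_score                  # state lean relative to the nation

# Hypothetical vote totals and adult population growth rates for three states.
totals = [2000000, 300000, 2300000]
growth = [0.02, 0.03, 0.07]
projected = [t * (1 + g) for t, g in zip(totals, growth)]
share_national = [p / sum(projected) for p in projected]

print(round(score, 6), round(diff_score, 6))
print([round(s, 4) for s in share_national])
```

By construction the shares sum to one, so each state's weight is its growth-adjusted 2012 turnout as a fraction of the national total.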

In [116]:
data_2012.head()

| | state | obama | romney | obama_count | romney_count | total_count | ev | adult_pop_growth_2011_15 | score | diff_score | share_national |
|---|---|---|---|---|---|---|---|---|---|---|---|
| alabama | AL | 38.36 | 60.55 | 795696 | 1255925 | 2074338 | 9 | 0.021734 | 0.387838 | -0.131801 | 0.015766 |
| alaska | AK | 40.81 | 54.8 | 122640 | 164676 | 300495 | 3 | 0.033483 | 0.426847 | -0.092792 | 0.00231 |
| arizona | AZ | 44.59 | 53.65 | 1025232 | 1233654 | 2299254 | 11 | 0.071607 | 0.453866 | -0.065772 | 0.018329 |
| arkansas | AR | 36.88 | 60.57 | 394409 | 647744 | 1069468 | 6 | 0.020381 | 0.378456 | -0.141183 | 0.008118 |
| california | CA | 60.24 | 37.12 | 7854285 | 4839958 | 13038547 | 55 | 0.056436 | 0.618728 | 0.099089 | 0.102468 |


Extract columns of interest

In [119]:
prior_diff_score = data_2012.diff_score
state_weights = data_2012.share_national/data_2012.share_national.sum()
ev_states = data_2012.ev

### Constants

In [121]:
STATE_POLLS = state_data_2016.shape[0]
NATIONAL_POLLS = national_data_2016.shape[0]
POLLSTERS = poll_data_2016.pollster.unique().shape[0]
STATES = len(states)
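# `all_days` is not defined earlier in this notebook; as an assumption, build it
# as the full span of days covered by the state and national polls.
all_days = pd.date_range(min(state_days.min(), national_days.min()),
                         max(state_days.max(), national_days.max()))
DAYS = all_days.shape[0]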
NATIONAL, STATE = 0, 1

## Specify model

In [122]:
from pymc3 import Model, sample
from pymc3 import Binomial, Normal, Uniform, Deterministic, Flat, GaussianRandomWalk
from pymc3.math import invlogit

In [None]:
with Model() as election_model:
    
    # Pollster house effect
    μ_c = Normal('μ_c', 0, 1, shape=POLLSTERS)
    σ_c = Uniform('σ_c', 0, 0.2)
    
    σ_u = Uniform('σ_u', 0, 0.1, shape=2)
    u_state = Flat('u_state', shape=STATE_POLLS)
    u_national = Flat('u_national', shape=NATIONAL_POLLS)
    
    σ_b = Uniform('σ_b', 0, 10)
    β_backwards = GaussianRandomWalk('β_backwards', sd=σ_b, init=Normal.dist(0, 1), shape=(DAYS, STATES))
    β = β_backwards[::-1]
    
    σ_d = Uniform('σ_d', 0, 10)
    δ_backwards = GaussianRandomWalk('δ_backwards', sd=σ_d, init=Normal.dist(0, 1), shape=(DAYS, STATES))
    δ = δ_backwards[::-1]
    
    π_state = Deterministic('π_state', invlogit(β[STATE_DAY_IND, STATE_IND] + δ[STATE_DAY_IND] 
                                                + σ_c * μ_c[STATE_POLLSTER_IND] + σ_u[STATE] * u_state))
    
    ### FINISH THIS
    # TODO: unfinished; the left operand of the weighted state average is missing
    # state_avg = ... * state_weights
    π_national = Deterministic('π_national', invlogit(σ_c * μ_c[NATIONAL_POLLSTER_IND] + σ_u[NATIONAL] * u_national))
    
    # Binomial likelihoods of Clinton count
    state_clinton = Binomial('state_clinton', state_data_2016.n_respondents, π_state, 
                             observed=state_data_2016.n_clinton)
    national_clinton = Binomial('national_clinton', national_data_2016.n_respondents, π_national, 
                             observed=national_data_2016.n_clinton)
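
Two ideas in the model cell can be illustrated without PyMC3: the random walks are defined backwards from their initial value and then reversed with `[::-1]`, and the linear predictor lives on the logit scale before `invlogit` maps it to a probability. The sketch below is a plain-Python simulation under those assumptions, not the model itself; `backward_walk`, the step size, and the house-effect and noise values are all hypothetical.

```python
import math
import random

def invlogit(x):
    # Inverse logit (logistic function): maps the real line to (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def backward_walk(n_days, sd, rng):
    # Simulate a Gaussian random walk from its initial value, then reverse it
    # so that the initial value ends up at the last index.
    walk = [rng.gauss(0, 1)]                      # initial value, as with init=Normal.dist(0, 1)
    for _ in range(n_days - 1):
        walk.append(walk[-1] + rng.gauss(0, sd))  # one Gaussian step per day
    return walk[::-1], walk[0]

rng = random.Random(42)
beta, initial_value = backward_walk(n_days=5, sd=0.1, rng=rng)

# Made-up house effect and poll-level noise, added on the logit scale.
house_effect, noise = 0.02, -0.01
pi = [invlogit(b + house_effect + noise) for b in beta]
print([round(p, 3) for p in pi])
```

After the reversal, the prior on the initial value applies to the last day in the sequence, and uncertainty grows for days further back from it.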

In [None]:
with election_model:
    
    trace = sample(2000, njobs=2)

## Platform information

This analysis was performed with the following system:

In [1]:
%load_ext watermark

In [7]:
%watermark -v -m -g -p pandas,numpy,pymc3

CPython 3.5.2
IPython 5.1.0

pandas 0.19.0
numpy 1.11.2
pymc3 3.0.rc2

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)
system     : Darwin
release    : 16.1.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
Git hash   : 6c363171114ef79674b6b85be416ad70c121ed5d


## References

1. Linzer, D. A. Dynamic Bayesian Forecasting of Presidential Elections in the States. Journal of the American Statistical Association, 108(501):124-134, 2013. doi:10.1080/01621459.2012.737735.
2. Gelman, A. [The Polls of the Future Are Reproducible and Open Source](http://www.slate.com/articles/technology/future_tense/2016/11/the_polls_of_the_future_will_be_reproducible_and_open_source.html). Slate, November 1, 2016.