"Fundamentals"-based election prediction

You will explore an alternative election prediction model that uses economic and political indicators instead of polling data, and deal with the challenges of model building when there is very little training data. Political scientists have long analyzed these "fundamentals" models, and they can be reasonably accurate. For example, fundamentals [slightly favored](https://fivethirtyeight.com/features/it-wasnt-clintons-election-to-lose/) the Republicans in 2016.

Data sources which I used to generate `election-fundamentals.csv`:

- Historical presidential approval ratings (highest and lowest for each president) from [Wikipedia](https://en.wikipedia.org/wiki/United_States_presidential_approval_rating) 
- GDP growth in election year from [World Bank](https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?locations=US)

Note that there are some timing issues here which more careful forecasts would avoid. The presidential approval rating covers the entire presidential term. The GDP growth covers the entire election year. These variables might have higher predictive power if they were (for example) sampled in the last quarters before the election.

For a comprehensive view of election prediction from non-poll data, and how well it might or might not be able to do, try [this article](https://fivethirtyeight.com/features/models-based-on-fundamentals-have-failed-at-predicting-presidential-elections/) from FiveThirtyEight.

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [3]:
# First, load election-fundamentals.csv and take a look at what we have
fund = pd.read_csv('election-fundamentals.csv')
fund

Unnamed: 0,year,incumbent_president,incumbent_party,term,highest_approval,lowest_approval,year_gdp_growth,winner
0,1960,Esienhower,R,2,79,47,2.6,D
1,1964,Johnson,D,1,79,34,5.8,D
2,1968,Johnson,D,2,79,34,4.8,R
3,1972,Nixon,R,1,66,24,5.3,R
4,1976,Nixon,R,2,66,24,5.4,D
5,1980,Carter,D,1,74,28,-0.2,R
6,1984,Reagan,R,1,71,35,7.26,R
7,1988,Reagan,R,2,71,35,4.2,R
8,1992,Bush,R,1,89,29,3.6,D
9,1996,Clinton,D,1,73,37,3.8,D


In [4]:
# How many elections do we have data for?
len(fund)

15

In [5]:
# Rather than predicting the winning party, we're going to predict whether the same party stays in power or flips
# This is going to be the target variable
# Use bracket assignment: attribute assignment (fund.flips = ...) would not create a new column
fund['flips'] = fund.winner != fund.incumbent_party
fund

Unnamed: 0,year,incumbent_president,incumbent_party,term,highest_approval,lowest_approval,year_gdp_growth,winner,flips
0,1960,Esienhower,R,2,79,47,2.6,D,True
1,1964,Johnson,D,1,79,34,5.8,D,False
2,1968,Johnson,D,2,79,34,4.8,R,True
3,1972,Nixon,R,1,66,24,5.3,R,False
4,1976,Nixon,R,2,66,24,5.4,D,True
5,1980,Carter,D,1,74,28,-0.2,R,True
6,1984,Reagan,R,1,71,35,7.26,R,False
7,1988,Reagan,R,2,71,35,4.2,R,False
8,1992,Bush,R,1,89,29,3.6,D,True
9,1996,Clinton,D,1,73,37,3.8,D,False
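
The share of the more common class in the target is the accuracy that a constant guess would reach, which makes it a useful reference point for any classifier. A minimal sketch, using only the ten rows displayed above (not the full file) and plain Python so it runs without the CSV:

```python
# Rows copied from the table displayed above: (year, incumbent_party, winner).
# This is only the displayed subset, not the full election-fundamentals.csv.
rows = [
    (1960, "R", "D"), (1964, "D", "D"), (1968, "D", "R"), (1972, "R", "R"),
    (1976, "R", "D"), (1980, "D", "R"), (1984, "R", "R"), (1988, "R", "R"),
    (1992, "R", "D"), (1996, "D", "D"),
]

# Same rule as the notebook: the election "flips" when the winner
# differs from the incumbent party.
flips = [winner != incumbent for _, incumbent, winner in rows]

n_flips = sum(flips)
n_total = len(flips)

# Accuracy of always guessing the more common class.
majority_baseline = max(n_flips, n_total - n_flips) / n_total

print(f"{n_flips} flips out of {n_total} elections")
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

On these ten rows the two classes are evenly split, so a constant guess does no better than a coin flip. The balance over the full file may differ.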


In [6]:
#fund = fund.replace({'True':1,'False':0},inplace=True) 
#fund

In [7]:
# Pull out all other numeric columns as features. Create features and target numpy arrays
features = pd.concat(
    [
        fund.term,
        fund.highest_approval,
        fund.lowest_approval,
        fund.year_gdp_growth,
    ],
    axis=1)
# Code the target variable as True if the party in power flips
target = fund.flips 
features.head()

Unnamed: 0,term,highest_approval,lowest_approval,year_gdp_growth
0,2,79,47,2.6
1,1,79,34,5.8
2,2,79,34,4.8
3,1,66,24,5.3
4,2,66,24,5.4


In [8]:
# Use 3-fold cross validation to see how well we can do with a RandomForestClassifier. 
# Print out the scores

from sklearn.model_selection import cross_val_score

x = features.values
y = fund.flips.values
my_classifier = RandomForestClassifier()
scores = cross_val_score(my_classifier, x, y, cv=3)
scores

array([0.66666667, 0.4       , 0.75      ])

In [9]:
scores.mean()

0.6055555555555555

How predictable are election results just from these variables, as compared to a coin flip?

(your answer here)
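
One way to approach this question is to look at how much the fold scores vary, not just at their mean. The sketch below reuses the three fold accuracies printed above and only the standard library. The choice of the sample standard deviation as the measure of spread is an assumption, and three folds are far too few for a formal test.

```python
import statistics

# Fold accuracies copied from the cross_val_score output above.
scores = [0.66666667, 0.4, 0.75]

mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)  # sample standard deviation across folds

coin_flip = 0.5
straddles_coin_flip = min(scores) <= coin_flip <= max(scores)

print(f"mean accuracy: {mean_score:.3f}")
print(f"fold-to-fold standard deviation: {spread:.3f}")
print(f"folds range from {min(scores):.2f} to {max(scores):.2f}")
print("coin flip inside the fold range:", straddles_coin_flip)
```

The mean sits above 0.5, but the individual folds fall on both sides of it, so these scores alone do not show that the model clearly beats a coin flip.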

In [10]:
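# Now create a logistic regression using all the data
# Normally we'd split into test and training years, but here we're only interested in the coefficients
# Predict the result on the training data

# Note: the fit in the next cell uses all rows; this split is only needed for a held-out check
train, test = train_test_split(fund, test_size=0.3)
# Select columns by name rather than by position, so a change in column order cannot break this
x_train = train[features.columns].values
y_train = train['flips'].values

x_test = test[features.columns].values
y_test = test['flips'].values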

In [11]:
# What is the influence of each feature?
# Remember to use np.exp to turn the lr coefficients into odds ratios

x = features.values
y = target.values
lr = LogisticRegression()
lr.fit(x,y)

coeffs = pd.DataFrame(np.exp(lr.coef_), columns=features.columns)
coeffs


Unnamed: 0,term,highest_approval,lowest_approval,year_gdp_growth
0,3.383662,1.025426,0.957309,0.573158
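
An odds ratio is a per-unit multiplier on the odds of a flip: values above 1 raise the odds, values below 1 lower them, and a ratio of 0.5 is as strong an effect as a ratio of 2. The sketch below ranks the odds ratios printed above by their distance from 1 on the log scale. The helper `effect_size` and the 10-point approval change are illustrative assumptions, not part of the original analysis.

```python
import math

# Odds ratios copied from the logistic regression output above.
odds_ratios = {
    "term": 3.383662,
    "highest_approval": 1.025426,
    "lowest_approval": 0.957309,
    "year_gdp_growth": 0.573158,
}

def effect_size(odds_ratio):
    """Distance from 'no effect' on the log scale, so 2.0 and 0.5 count as equally strong."""
    return abs(math.log(odds_ratio))

# Rank features by per-unit effect size, strongest first.
ranked = sorted(odds_ratios, key=lambda name: effect_size(odds_ratios[name]), reverse=True)
for name in ranked:
    ratio = odds_ratios[name]
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} {direction} the odds of a flip per unit")

# An odds ratio compounds multiplicatively over a multi-unit change.
# Hypothetical example: a 10-point difference in lowest_approval.
ten_point_ratio = odds_ratios["lowest_approval"] ** 10
print(f"10-point rise in lowest_approval: odds multiplied by {ten_point_ratio:.2f}")
```

Because the features are measured in different units (terms, approval points, percentage points of growth), the per-unit ranking is only a rough guide to which feature matters most.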


Describe the effect of each one of our features on whether or not the party in power flips. What feature has the biggest effect? How does economic growth relate? Are there any factors that operate backwards from what you would expect, and if so what do you think is happening?

(your answer here)

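The feature with the biggest per-unit effect is `term`: an odds ratio of about 3.4 means the odds of a flip are more than three times higher after a president's second term than after the first. GDP growth comes next, and it works in the expected direction: an odds ratio of about 0.57 means each extra percentage point of election-year growth cuts the odds of a flip by roughly 43%, so a strong economy favors the party in power. The approval odds ratios are closest to 1 (about 1.03 and 0.96 per point), but approval varies by tens of points, so they are not directly comparable with the others. `highest_approval` operates backwards from what I would expect, since a higher peak approval slightly raises the odds of a flip. With only 15 elections, and approval measured over the whole term rather than near the election, this may simply be noise.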
