# **Biden vs. Trump: An Empirical Bayesian Approach.**

## Introduction 
In 2016, Donald John Trump became the president of the United States against all odds. Many prominent election research organisations gave his opponent, Hillary Clinton, a higher chance of winning, with confident predictions in favor of the former Secretary of State ranging from around eighty to ninety percent. Nevertheless, Donald Trump emerged as the winner.

### Bayesian Statistics
The 2016 presidential election is proof that elections can be notoriously difficult to predict accurately. However, certain techniques, such as Bayesian statistics, have been known to produce more accurate results. The core idea behind Bayesian methods is to update an uninformed or best-guess judgement based on new information or data.

### Assumptions
1. A key assumption is that the distribution of votes in the population and in its sample is approximately normal. This of course comes with its own set of assumptions. For more, see the [Wikipedia article on the normal distribution](https://en.wikipedia.org/wiki/Normal_distribution).
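As a rough illustration of this assumption, the sketch below simulates many hypothetical polls of size `N` from a population in which a proportion `p` supports a candidate, and compares the variability of the simulated sample proportions with the normal-theory standard error. The values of `p`, `N` and `n_polls` are made up for illustration and are not taken from the polling data.

```r
# Hypothetical values, chosen only for illustration
set.seed(1)
p <- 0.52          # assumed true proportion of supporters
N <- 1000          # assumed sample size of each poll
n_polls <- 10000   # number of simulated polls

# Each simulated poll reports the proportion of supporters among N respondents
p_hat <- rbinom(n_polls, size = N, prob = p) / N

# Normal-theory standard error of a sample proportion
se_theory <- sqrt(p * (1 - p) / N)

# Compare the simulated mean and standard deviation with the theoretical SE
c(mean = mean(p_hat), sd = sd(p_hat), se_theory = se_theory)

# Share of simulated polls within 1.96 standard errors of p
# (close to 0.95 if the normal approximation is reasonable)
coverage <- mean(abs(p_hat - p) <= 1.96 * se_theory)
coverage
```

If the simulated standard deviation is close to `se_theory` and the coverage is near 0.95, the normal approximation is a reasonable working assumption for polls of this size.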


### Polling Data
The data we use to model the election comes from opinion polls conducted by various organisations. It was sourced from FiveThirtyEight, a popular data journalism platform. A little cleaning will be necessary to ensure the integrity of the data.



In [None]:
#load required packages
library(tidyverse)
library(lubridate)
library(dslabs)


#read in the data 
us_polls_2020 <- read_csv("../input/presidential-polls-source-fivethirtyeight/president_polls (1).csv")
 
names(us_polls_2020)
#parse end_date as a date format
us_polls_2020$end_date <- mdy(us_polls_2020$end_date)



In [None]:
# select the relevant columns by index (plus url),
# then, within each poll, keep the rows with the latest end date and the
# largest sample size (the completed poll), dropping repeated answers
# note: no filter on pollster grade ("B-" and above) or on end date is applied in this cell

us_polls <- us_polls_2020 %>%  
  select(2,4,6,8,12,13,16,20,21,27,34,36,38, url) %>%
  group_by(poll_id) %>% 
  filter(end_date == max(end_date) & sample_size == max(sample_size) & !duplicated(answer, fromLast = FALSE)) %>% 
  ungroup()
  
#remove any duplications and spread the answer column based on "pct"
us_polls_tidish <- us_polls[!duplicated(us_polls, fromLast = FALSE),] %>%
               spread(answer, pct) 
head(us_polls_tidish)

The data set above has been filtered to remove duplicated values and incomplete polling results. However, to bring the data to a fully "tidy" state, we must ensure that each row contains a unique observation (poll) and each feature has its own column; here the observed percentage for each candidate can be considered a "feature". Reshaping the data into this format will allow for easier computation later on.
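To make the target shape concrete, here is a minimal sketch on a made-up data frame (`polls_long` and its values are hypothetical, not rows from the FiveThirtyEight file). It uses `pivot_wider()`, the newer tidyr counterpart of the `spread()` call used above, to go from one row per poll and candidate to one row per poll.

```r
library(tidyr)

# Hypothetical long-format polls: one row per poll and candidate
polls_long <- data.frame(
  poll_id = c(1, 1, 2, 2),
  answer  = c("Biden", "Trump", "Biden", "Trump"),
  pct     = c(51, 44, 48, 47)
)

# One row per poll, one column per candidate
polls_wide <- pivot_wider(polls_long, names_from = answer, values_from = pct)
polls_wide
```

In the real data, any column that differs between the rows of the same poll prevents those rows from collapsing into one, which is why a further summarising step per `poll_id` is still needed.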


In [None]:
#create one row per poll
us_polls_tidy <- us_polls_tidish %>% 
  group_by(poll_id) %>%
  summarize(state = unique(state),
            pollster = unique(pollster),
            pollster_grade =  unique(fte_grade),
            sample_size = unique(sample_size), 
            end_date = unique(end_date),
            Biden_prop = sum(ifelse(!is.na(Biden), Biden/100, 0)),
            Trump_prop = sum(ifelse(!is.na(Trump), Trump/100, 0)),
            West_prop = sum(ifelse(!is.na(West),West/100,0)),
            Hawkins_prop = sum(ifelse(!is.na(Hawkins), Hawkins/100,0)),
            Jorgensen_prop = sum(ifelse(!is.na(Jorgensen), Jorgensen/100, 0))
            )

* Now that the data is tidy we can commence modelling.

In [None]:
#plot biden and trump proportions over time
ggplot(us_polls_tidy, aes(end_date, Biden_prop * 100)) +
  geom_point(colour = "blue") +
  geom_point(aes(end_date, Trump_prop * 100), colour = "red") +
   theme(axis.line = element_line(colour = "black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank()) +
  xlab("Time") +
  ylab("Poll Results") +
  labs(title = "Biden Vs. Trump",
       subtitle = "Respondent Support over Time")

# We can see here that Biden (blue) seems to be polling higher than Trump (red).
# The plot also shows four data points with missing values for Biden.

### Modelling The Outcome
Earlier, I mentioned Bayesian statistics; here I will go into further detail. As noted, Bayesian methods enable us to make decisions by factoring in new information. These methods also allow us to model several levels of variability in the same way we would treat "new information". Here and throughout the analysis, we assume that the outcome of the election will not be affected by candidates other than Trump and Biden.


For the purpose of this analysis, we denote the true proportion of Biden voters as $p$ and the spread as $d$. We can make an a priori (uninformed) estimate of the spread, denoted $\mu$. We assume that, before seeing any polls, $d$ is approximately normal with expected value $\mu$ and standard deviation $\tau$ (variance $\tau^2$). This is represented as

$$ d\ \sim\ N(\mu,\ \tau^2).$$ 

The following formula represents the randomness due to sampling and pollster effects, where $Y$ is the observed average spread and $\sigma$ is its standard error:

$$ Y|d \sim N(d,\ \sigma^2).$$
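Under these two normal assumptions, the posterior distribution of $d$ given the observed average spread $Y$ is also normal. The standard normal-normal result is written out below in the notation used here; $B$ is the shrinkage weight that also appears in the code further down.

$$ B = \frac{\sigma^2}{\sigma^2 + \tau^2}, $$
$$ \mbox{E}(d \mid Y) = B\mu + (1 - B)\,Y, $$
$$ \mbox{SE}(d \mid Y) = \sqrt{\frac{1}{1/\sigma^2 + 1/\tau^2}}. $$

When the polls are noisy ($\sigma$ large relative to $\tau$), $B$ is close to 1 and the posterior mean stays near the prior guess $\mu$. When the polls are precise, $B$ is close to 0 and the posterior mean follows $Y$.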


With only two candidates, the spread is the difference between the proportion of Biden voters, $p$, and the proportion of Trump voters, $1-p$. This can be denoted mathematically as:
$$ d = p\ -\ (1-p),$$
$$ d = 2p\ - \ 1.$$

Therefore, $p$ can be derived as:
$$ p = \frac {d + 1} {2} $$

We can derive an estimate of $p$ from the polling data, denoted $\hat X$, and use it to compute a standard error for the observed spread.
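As a small worked sketch of these conversions, the snippet below takes a single made-up poll (the proportions and sample size are hypothetical) and computes the spread, the implied two-candidate proportion, and the standard error of the spread. Because $d = 2p - 1$, the standard error of the spread is twice that of the proportion.

```r
# Hypothetical single poll, values chosen only for illustration
biden_prop  <- 0.52
trump_prop  <- 0.46
sample_size <- 1200

spread <- biden_prop - trump_prop                   # observed spread d
p_hat  <- (spread + 1) / 2                          # implied proportion p = (d + 1) / 2
se_p   <- sqrt(p_hat * (1 - p_hat) / sample_size)   # standard error of p_hat
se_spread <- 2 * se_p                               # standard error of the spread

c(spread = spread, p_hat = p_hat, se_spread = se_spread)
```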



### Simulating the Outcome
The mathematical model for an individual poll is

$$ X_{i,j} = d + b + h_i + \epsilon_{i,j}, $$

where:
* The index $i$ represents the different pollsters.
* The index $j$ represents the different polls.
* $X_{i,j}$ is the $j$th poll by the $i$th pollster.
* $d$ is the actual spread of the election.
* $b$ is the general bias affecting all pollsters.
* $h_i$ represents the house effect for the $i$th pollster.
* $\epsilon_{i,j}$ represents the random error associated with the $(i,j)$th poll.
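To see how the terms of this model combine, the sketch below generates poll results from it in base R. Every parameter value (`d`, `b`, the house-effect standard deviation, the number of pollsters and polls, and the sample size) is a hypothetical choice for illustration and is not estimated from the polling data.

```r
set.seed(2)
# Hypothetical parameter values, chosen only for illustration
d       <- 0.04   # assumed true spread
b       <- 0.01   # one draw of the general bias shared by every poll
n_house <- 5      # number of pollsters (index i)
n_poll  <- 6      # polls per pollster (index j)
N       <- 1000   # sample size of each poll

h <- rnorm(n_house, mean = 0, sd = 0.02)   # house effects h_i
p <- (d + 1) / 2                           # proportion implied by the spread
eps_sd <- 2 * sqrt(p * (1 - p) / N)        # sampling SE of a single poll's spread

# X[i, j] = d + b + h_i + eps_ij; each column holds one poll from every pollster
X <- sapply(1:n_poll, function(j) d + b + h + rnorm(n_house, 0, eps_sd))

dim(X)        # n_house rows (pollsters) by n_poll columns (polls)
rowMeans(X)   # per-pollster averages differ because of the house effects
mean(X)       # overall average is shifted away from d by b and the h_i
```

Averaging more polls shrinks the contribution of $\epsilon_{i,j}$, but the general bias $b$ does not average out, which is why it needs its own variance term in the model.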

In [None]:
# prior for the spread: d ~ N(mu, tau^2)
mu <- 0     # best-guess spread
tau <- 0.02 # best-guess standard deviation
set.seed(2020-11-04) # evaluated arithmetically, i.e. set.seed(2005)

# per-state average spread and posterior, using polls ending on or after 2020-10-31
us_polls_tidier <- us_polls_tidy[, 1:8] %>%
  filter(!is.na(state) & end_date >= as.Date("2020-10-31")) %>%
  mutate(spread = Biden_prop - Trump_prop,
         p_hat = (spread + 1) / 2,
         se_xhat = 2 * sqrt(p_hat * (1 - p_hat) / sample_size)) %>%
  group_by(state) %>%
  summarize(N = n(),
            avg_spread = mean(spread),
            sd = sd(spread)) %>%
  mutate(median_se = median(sd, na.rm = TRUE),
         se = ifelse(is.na(sd), median_se, sd), # sd is NA for states with a single poll
         # caution: se/sqrt(N)^2 evaluates to se/N, whereas the variance of an
         # average of N polls would be se^2/N
         sigma = sqrt(se/sqrt(N)^2 + .025^2),
         B = sigma^2 / (sigma^2 + tau^2),
         posterior_mean = B * mu + (1 - B) * avg_spread,
         posterior_se = sqrt(1 / (1/sigma^2 + 1/tau^2)),
         upper_ci = posterior_mean + qnorm(0.975) * posterior_se,
         lower_ci = posterior_mean - qnorm(0.975) * posterior_se,
         pct = 1 - pnorm(0, posterior_mean, posterior_se)) # P(spread > 0), i.e. Biden wins the state

# electoral votes per state from dslabs (the 2016 allocation also applied in 2020)
electoral_vote <- results_us_election_2016[, 1:2]
us_polls_final <- left_join(us_polls_tidier, electoral_vote, by = "state")

# simulate 10,000 elections and total Biden's electoral votes in each
Biden_votes <- replicate(10000, {
  us_polls_final %>%
    mutate(simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se),
           Biden = ifelse(simulated_result > 0, electoral_votes, 0)) %>% # award votes if Biden wins state
    summarize(Biden = sum(Biden, na.rm = TRUE)) %>% # total votes for Biden
    pull(Biden)
})
mean(Biden_votes > 269) # share of simulations in which Biden reaches 270 votes
           



### Conclusion
Here we see that Biden wins about 80% of the time in our simulation of 10000 iterations.

#### ***Note***
Predicting elections is not straightforward. In 2016, most people predicted that Hillary Clinton would win with overwhelming odds. Technically speaking, you can only call an election if the result falls within a predetermined confidence interval. Famous mathematicians and quantitative analysts have long quarrelled about the soundness of the method. Nevertheless, this method, or a slightly more complex variation of it, is championed by top poll aggregators.