## Super Simple Forecast: Presidential Popular Vote 2020
by Fai Tosuratana for POL683 -- Fall 2020

## 1. General idea: economic factors affect voters’ preference
**Economic conditions => Who you vote for** 
* If the economy is good => incumbent (Trump)
* If the economy is bad => challenger (Biden)

**Assumptions:**
* Other variables *(COVID, foreign affairs, social issues, etc.)* are reflected in the economic conditions.
* Everybody votes on Nov 3rd.
* We forecast on Oct 24, which is 10 days before Election Day.

## 2. General idea for the model
### Time-lagged economic conditions => popularity => votes
* The economic conditions 10-40 days before Day N can predict how many people say they would vote for a candidate on Day N (polling data). **-- REGRESSION STEP** 
* Use the economic conditions 10-40 days before Nov 3 to predict 'popularity' on Nov 3. **-- FORECASTING STEP** 
* **Bonus step**: the number of people who will vote for a candidate is (polling share) x (number of voters who turn out)
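
As a minimal sketch of the lag window described above, the snippet below averages a toy daily series over days N-40 to N-10. The data frame `econ.toy` and the helper `window.mean` are hypothetical names for illustration only, and the toy series is in chronological order, which may differ from how the real files are sorted.

```r
# Toy daily series: the value on day d is simply d, so window means are easy to check by hand.
econ.toy <- data.frame(day = 1:100, index = as.numeric(1:100))

# Mean of the index over days N-40 ... N-10 (inclusive) for a given day N.
window.mean <- function(values, n, near = 10, far = 40) {
  stopifnot(n - far >= 1)  # the window must lie inside the series
  mean(values[(n - far):(n - near)], na.rm = TRUE)
}

# Lagged predictor for day 60: the mean of days 20 to 50
x.day60 <- window.mean(econ.toy$index, 60)
print(x.day60)  # 35
```

In the regression step this window mean plays the role of the predictor for day N, and in the forecasting step the same window is taken relative to Election Day.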

### Data
**Who you vote for = polling average from 538** 
* file = presidential_poll_averages_2020.csv 
* DV = pct_estimate, the polling average for each candidate on each day
* convention-boost data points are ignored
* unit is day-state (including national)

**Economic conditions = economic index from 538**
* file = economic_index.csv 
* IVs = "current_zscore" of stock market, spending, manufacturing, jobs, inflation, income
* current_zscore = number of standard deviations from the previous 2-year average for the current value of the indicator
* unit is day
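
To make the z-score definition concrete, here is a small sketch. It assumes the z-score is the current value minus the mean of a historical window, divided by that window's sample standard deviation; how 538 actually computes `current_zscore` is not specified here, and `zscore` and `history` are hypothetical names with made-up values.

```r
# Hypothetical helper: standard deviations between the current value and the historical mean.
zscore <- function(current, history) {
  (current - mean(history)) / sd(history)
}

# Made-up history standing in for the previous two years of an indicator (mean = 5).
history <- c(2, 4, 4, 4, 5, 5, 7, 9)

z.at.mean <- zscore(5, history)                    # exactly at the historical average
z.two.sd  <- zscore(5 + 2 * sd(history), history)  # two standard deviations above it
print(c(z.at.mean, z.two.sd))
```

A value of 0 means the indicator sits at its historical average, and positive values mean it is above that average.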

Reference:  
https://github.com/fivethirtyeight/data/tree/master/election-forecasts-2020  
https://data.fivethirtyeight.com/

### Before we go to section 3, let's load some R packages and get our data sets together

In [None]:
#### Setting directory, loading packages ####
setwd("C:/Users/tosur/OneDrive/Desktop/POL683/midterm-data")
# Install once if needed: install.packages(c("dplyr", "tidyr"))
library(dplyr) # part of the tidyverse; provides %>%, filter(), and select()
library(tidyr) # for spread function

In [None]:
#### Download data sets #### 
####** poll average data set####
poll.p <- read.csv("presidential_poll_averages_2020.csv")
states.p <- unique(poll.p['state'])
#dim(states.p)
#View(states.p)
#typeof(states.p) #is "list"
#class(states.p) #is "data.frame" (a data frame is stored as a list of columns)

####** economic index data set####
econ.p <- read.csv("economic_index.csv")
econ.wide <- econ.p %>% spread(category, current_zscore) 
#View(econ.wide)
#colnames(econ.wide)

econ <- econ.wide %>% select(modeldate, "stock market", spending, manufacturing, jobs, inflation, income, combined) 
# "stock market" has to be quoted (or wrapped in backticks) because the column name contains a space
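
The long-to-wide step can be sketched on a toy table. The data frame `econ.long` and its values are made up for illustration; only the column names `modeldate`, `category`, and `current_zscore` come from the notebook, and the `indicator` column is a hypothetical extra.

```r
library(tidyr)

# Hypothetical long-format table: one row per date and category.
econ.long <- data.frame(
  modeldate = rep(c("10/23/2020", "10/24/2020"), each = 2),
  category = rep(c("jobs", "stock market"), times = 2),
  current_zscore = c(-1.5, 0.8, -1.4, 0.9)
)

# One row per date, one column per category.
econ.toy <- spread(econ.long, category, current_zscore)
print(econ.toy)

# A column name with a space is non-syntactic, so it needs backticks here.
print(econ.toy$`stock market`)

# Any extra column that varies within a date is kept as an identifier,
# so the wide table then has several rows per date, mostly filled with NA.
econ.long$indicator <- c("a", "b", "c", "d")  # hypothetical extra column
econ.toy2 <- spread(econ.long, category, current_zscore)
print(nrow(econ.toy2))  # 4
```

The second part shows why a wide table can end up with more than one row per date, which is worth checking with `View()` before indexing rows by position.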

In [16]:
#### Build a new data set with econ index averaged for the past month starting on the 10th day ####
## that is T-10 to T-40

####** build vectors with average of past T-10 to T-40 day value####
# Helper: for each of the first n days, average x over the rows that are 10 to 40 days further down.
# Multiplying by 7 converts days to rows, assuming econ has 7 rows per modeldate
# (one per category after spread()); na.rm = TRUE skips the empty cells.
lagged.avg <- function(x, n = 136) {
  sapply(1:n, function(i) mean(x[((i + 10) * 7):((i + 40) * 7)], na.rm = TRUE))
}

stock.avg.1m         <- lagged.avg(econ$`stock market`)
spending.avg.1m      <- lagged.avg(econ$spending)
manufacturing.avg.1m <- lagged.avg(econ$manufacturing)
jobs.avg.1m          <- lagged.avg(econ$jobs)
inflation.avg.1m     <- lagged.avg(econ$inflation)
income.avg.1m        <- lagged.avg(econ$income)
combined.avg.1m      <- lagged.avg(econ$combined)

####** put vectors into a new data frame####
date <- unique(econ$'modeldate')
#View(date)
econ.avg.1m <- data.frame("modeldate" = date[1:136], "stock market" = stock.avg.1m, "spending" = spending.avg.1m, "manufacturing" = manufacturing.avg.1m, "jobs" = jobs.avg.1m, "inflation" = inflation.avg.1m, "income" = income.avg.1m, "combined" = combined.avg.1m)
#View(econ.avg.1m)

## 3. Regression step 

Very simple OLS model:

\begin{equation*}
\text{pct.estimate} = \beta_0 + \beta_1 \text{Stock Market} + \beta_2 \text{Spending} + \beta_3 \text{Manufacturing} + \beta_4 \text{Jobs} + \beta_5 \text{Inflation} + \beta_6 \text{Income} + \epsilon
\end{equation*}

In [15]:
####Regress to get the B's for the forecast model####
#Let's try to do this for one state (actually, let's do national) and see what happens
#Biden
poll.National.Biden <- poll.p %>% filter(state == "National", candidate_name == "Joseph R. Biden Jr.")
poll.National.Biden.c <- merge(poll.National.Biden[1:136,],econ.avg.1m, by.x = "modeldate", by.y = "modeldate")
#View(poll.National.Biden.c)

lm1 <- lm( pct_estimate ~ stock.market + spending + manufacturing + jobs + inflation + income , poll.National.Biden.c) 
summary(lm1)

#Trump
poll.National.Trump <- poll.p %>% filter(state == "National", candidate_name == "Donald Trump")
poll.National.Trump.c <- merge(poll.National.Trump[1:136,],econ.avg.1m, by.x = "modeldate", by.y = "modeldate")

lm2 <- lm( pct_estimate ~ stock.market + spending + manufacturing + jobs + inflation + income , poll.National.Trump.c) 
summary(lm2)


Call:
lm(formula = pct_estimate ~ stock.market + spending + manufacturing + 
    jobs + inflation + income, data = poll.National.Biden.c)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64573 -0.21509 -0.00903  0.17265  0.55651 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    63.8155     1.0421  61.237  < 2e-16 ***
stock.market   -3.7783     0.3197 -11.818  < 2e-16 ***
spending       -1.0147     0.1227  -8.267 1.51e-13 ***
manufacturing   0.7324     0.4782   1.532 0.128085    
jobs            1.4126     0.1839   7.680 3.60e-12 ***
inflation      -6.4757     1.7657  -3.667 0.000358 ***
income         -0.3893     0.1521  -2.560 0.011631 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2679 on 128 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.8294,	Adjusted R-squared:  0.8214 
F-statistic: 103.7 on 6 and 128 DF,  p-value: < 2.2e-16



Call:
lm(formula = pct_estimate ~ stock.market + spending + manufacturing + 
    jobs + inflation + income, data = poll.National.Trump.c)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51960 -0.14487 -0.00857  0.12785  0.84444 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   36.54473    0.82783  44.145  < 2e-16 ***
stock.market   3.86620    0.25397  15.223  < 2e-16 ***
spending       0.21826    0.09750   2.238  0.02692 *  
manufacturing -0.04796    0.37989  -0.126  0.89974    
jobs          -0.80394    0.14611  -5.502 1.96e-07 ***
inflation      1.95480    1.40265   1.394  0.16584    
income        -0.34556    0.12080  -2.861  0.00494 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2128 on 128 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.8922,	Adjusted R-squared:  0.8871 
F-statistic: 176.5 on 6 and 128 DF,  p-value: < 2.2e-16


## 4. Forecasting step 

### From our regression step, we get the models
#### For Biden 
\begin{equation*}
\begin{aligned}
\text{pct.estimate} = {} & 63.8155 + (-3.7783)\,\text{Stock Market} + (-1.0147)\,\text{Spending} + (0.7324)\,\text{Manufacturing} \\
& + (1.4126)\,\text{Jobs} + (-6.4757)\,\text{Inflation} + (-0.3893)\,\text{Income}
\end{aligned}
\end{equation*}

#### For Trump 
\begin{equation*}
\begin{aligned}
\text{pct.estimate} = {} & 36.54473 + (3.86620)\,\text{Stock Market} + (0.21826)\,\text{Spending} + (-0.04796)\,\text{Manufacturing} \\
& + (-0.80394)\,\text{Jobs} + (1.95480)\,\text{Inflation} + (-0.34556)\,\text{Income}
\end{aligned}
\end{equation*}

#### Then
We take the average value of each economic index over the T-10 to T-40 window counted back from Election Day, plug it into the fitted models, and predict how "popular" each candidate is on Nov 3rd.
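
The same plug-in calculation can be sketched with `predict()` on toy data. Everything here is hypothetical: the data frame `train`, its two predictors, and the coefficients 50, 2, and -1 are made up so that the result can be checked by hand, and they are not estimates from the notebook.

```r
# Hypothetical training data: 8 days, two made-up predictors,
# and a response that is exactly linear in them.
train <- data.frame(
  jobs = c(-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5),
  spending = c(1, -1, 0.5, -0.5, 2, 0, -2, 1.5)
)
train$pct_estimate <- 50 + 2 * train$jobs - 1 * train$spending

fit <- lm(pct_estimate ~ jobs + spending, data = train)

# Lagged economic conditions to forecast from:
# one row, with the same column names as the predictors.
newdata <- data.frame(jobs = 0.4, spending = -0.2)

# Two equivalent ways to get the forecast.
by.predict <- unname(predict(fit, newdata = newdata))
by.hand <- sum(coef(fit) * c(1, newdata$jobs, newdata$spending))

print(c(by.predict, by.hand))  # both approximately 51 (50 + 2*0.4 - 1*(-0.2))
```

Using `predict()` matches predictors by column name, so it does not depend on the order of the columns in the lagged data.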

In [18]:
##lagged economic conditions
lagged.econ <- econ.avg.1m[1,2:7] # extracting the average value of each economic index from T-10 to T-40 from election day

##Biden
beta0.lm1 <- summary(lm1)$coefficients[1,1] # extracting beta_0
coeff.lm1 <- summary(lm1)$coefficients[2:7,1] # extracting other beta's in a form of vector

biden.pop <- (coeff.lm1 %*% t(lagged.econ)) + beta0.lm1
biden.pop

##Trump
beta0.lm2 <- summary(lm2)$coefficients[1,1] # extracting beta_0
coeff.lm2 <- summary(lm2)$coefficients[2:7,1] # extracting other beta's in a form of vector

trump.pop <- (coeff.lm2 %*% t(lagged.econ)) + beta0.lm2
trump.pop

1
52.39488


1
42.25925


### From the 'model', Biden will have 52.39488% of the votes and Trump 42.25925%. If we rescale to a two-candidate race (dividing each share by their sum, 94.65413), Biden will have about 55.354% of the votes and Trump 44.646%.
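
The two-candidate rescaling is just each share divided by the sum of both shares. A short sketch using the two predicted values printed above (the variable names are illustrative):

```r
# Predicted 'popularity' from the two national models
biden <- 52.39488
trump <- 42.25925

# Rescale so that the two shares sum to 100
two.party <- 100 * c(Biden = biden, Trump = trump) / (biden + trump)
print(round(two.party, 3))  # Biden 55.354, Trump 44.646
```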


## 5. Some improvements

* Running the regression separately for each state might make the forecast more 'accurate,' since the coefficients can differ from the national ones (see the Wisconsin example below).
* We can also pair by-state predictions with how easy it is to vote in each state<sup>[1]</sup>. The predicted 'popularity' in each state would then be weighted by that state's expected turnout rate in addition to its voting population.

[1] https://www.liebertpub.com/doi/full/10.1089/elj.2017.0478 from https://fivethirtyeight.com/features/how-fivethirtyeights-2020-presidential-forecast-works-and-whats-different-because-of-covid-19/
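
A rough sketch of the turnout weighting, assuming each state's predicted share is weighted by the number of votes expected to be cast there. All numbers below are made up for illustration; they are not forecasts or real turnout figures, and the data frame `states` is hypothetical.

```r
# Hypothetical inputs for two imaginary states.
states <- data.frame(
  state = c("A", "B"),
  biden.pct = c(52, 46),   # predicted 'popularity' in percent
  eligible = c(4e6, 6e6),  # voting-eligible population
  turnout = c(0.70, 0.60)  # expected share of eligible voters who vote
)

# Votes cast = eligible population x turnout rate
states$votes.cast <- states$eligible * states$turnout
# Votes for the candidate = votes cast x predicted share
states$biden.votes <- states$votes.cast * states$biden.pct / 100

# Turnout-weighted share across both states
national.biden.pct <- 100 * sum(states$biden.votes) / sum(states$votes.cast)
print(national.biden.pct)  # 48.625
```

Weighting by population alone would give 48.4 here, so differences in turnout shift the combined share even when the state-level predictions stay the same.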

In [17]:
# Try it for one state: Wisconsin. The coefficients differ from the national model.
poll.WI.Biden <- poll.p %>% filter(state == "Wisconsin", candidate_name == "Joseph R. Biden Jr.")
poll.WI.Biden.c <- merge(poll.WI.Biden[1:136,],econ.avg.1m, by.x = "modeldate", by.y = "modeldate")
lm3 <- lm( pct_estimate ~ stock.market + spending + manufacturing + jobs + inflation + income , poll.WI.Biden.c) 
summary(lm3) ##Wisconsin
summary(lm1) ##National


Call:
lm(formula = pct_estimate ~ stock.market + spending + manufacturing + 
    jobs + inflation + income, data = poll.WI.Biden.c)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.72129 -0.13192  0.04922  0.16721  0.95169 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    52.0893     1.2432  41.898  < 2e-16 ***
stock.market    1.0346     0.3814   2.713 0.007593 ** 
spending       -0.5318     0.1464  -3.631 0.000406 ***
manufacturing   1.4941     0.5705   2.619 0.009889 ** 
jobs            0.9459     0.2194   4.311 3.22e-05 ***
inflation       3.0928     2.1065   1.468 0.144504    
income          0.1729     0.1814   0.953 0.342326    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3196 on 128 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.8067,	Adjusted R-squared:  0.7976 
F-statistic: 89.01 on 6 and 128 DF,  p-value: < 2.2e-16



Call:
lm(formula = pct_estimate ~ stock.market + spending + manufacturing + 
    jobs + inflation + income, data = poll.National.Biden.c)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64573 -0.21509 -0.00903  0.17265  0.55651 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    63.8155     1.0421  61.237  < 2e-16 ***
stock.market   -3.7783     0.3197 -11.818  < 2e-16 ***
spending       -1.0147     0.1227  -8.267 1.51e-13 ***
manufacturing   0.7324     0.4782   1.532 0.128085    
jobs            1.4126     0.1839   7.680 3.60e-12 ***
inflation      -6.4757     1.7657  -3.667 0.000358 ***
income         -0.3893     0.1521  -2.560 0.011631 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2679 on 128 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.8294,	Adjusted R-squared:  0.8214 
F-statistic: 103.7 on 6 and 128 DF,  p-value: < 2.2e-16
