# Exploring Relationships between US Election & COVID-19

We have all been impacted by the pandemic this past year.

Is there a correlation between political beliefs and how we are impacted by COVID-19? We use each state's results in the recent presidential election as a proxy for political beliefs, and ask whether there is reason to believe that COVID-19 has affected Americans differently along those lines.

The data used in this notebook is taken as a snapshot right after the November 2020 election.

<b>NULL Hypothesis</b>: there is no association between political beliefs (or affiliation) and how we are impacted by COVID-19 in 2020.
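
One hedged way to probe a null hypothesis like this is a permutation test on the correlation between the two quantities. The sketch below runs on made-up numbers, not this notebook's data; `vote_pct`, `case_rate`, and `permutation_pvalue` are hypothetical names introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT the notebook's data): 50 fictional "states".
n_states = 50
vote_pct = rng.uniform(30, 70, size=n_states)
case_rate = 20 + 1.5 * vote_pct + rng.normal(0, 10, size=n_states)


def permutation_pvalue(x, y, rng, n_perm=2000):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    observed = np.corrcoef(x, y)[0, 1]
    extreme = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real link to x, which simulates the null.
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return observed, (extreme + 1) / (n_perm + 1)


observed_r, p_value = permutation_pvalue(vote_pct, case_rate, rng)
print(f"correlation: {observed_r:.3f}, permutation p-value: {p_value:.4f}")
```

A small p-value would suggest the observed correlation is unlikely under the null; it says nothing about causation.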

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import ssl

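# Workaround: disables HTTPS certificate verification for any downloads.
# Only appropriate in a trusted environment.
ssl._create_default_https_context = ssl._create_unverified_context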
plt.rcParams['figure.figsize'] = (12,10)
plt.rcParams['font.size'] = 14
plt.style.use("fivethirtyeight")
%matplotlib inline

## Data Sources

Source: https://www.nbcnews.com/politics/2020-elections/president-results

Date: November 14, 2020


In [None]:
presidential_results = "../data/us-presidential-results-2020.csv"

presidential_data = pd.read_csv(presidential_results,
                               index_col="state",
                               thousands=",")
presidential_data.dtypes

In [None]:
presidential_data.tail(5)

In [None]:
presidential_data['voters'] = presidential_data.trump + presidential_data.biden
presidential_data['trump_pct'] = 100*presidential_data.trump / presidential_data.voters
presidential_data.head()

## COVID-19 Data

Source: https://covid.cdc.gov/covid-data-tracker/#cases_casesper100klast7days

Date: November 14, 2020

In [None]:
covid_raw_cases = "../data/us-covid-11-14-2020.csv"

covid_data = pd.read_csv(covid_raw_cases, skiprows=3)
covid_data.rename(columns= { "State/Territory" : "state",
                             "Total Cases" : "cases",
                             "Case Rate per 100000" : "case_rate",
                             "Case Rate per 100000 in Last 7 Days" : "case_rate_7days"
                           },
                  inplace=True)
covid_data.shape

In [None]:
covid_data.head(5)

In [None]:
covid_data = covid_data[["state", "cases", "case_rate", "case_rate_7days"]]
covid_data.head()

## Merging Voting data + COVID-19 data

In [None]:
data = presidential_data.join(covid_data.set_index('state'), on="state", how="inner")

In [None]:
data.dtypes

In [None]:
data

## Visualizing

In [None]:
data.plot(kind='scatter', x='trump_pct', y='cases', alpha=0.5)

In [None]:
data.plot(kind='scatter', x='trump_pct', y='case_rate', alpha=0.5)

In [None]:
data.plot(kind='scatter', x='trump_pct', y='case_rate_7days', alpha=0.5)

In [None]:
# plotting a regression line
sns.lmplot(x='trump_pct', y='case_rate_7days', data=data, aspect=1.6, scatter_kws={'alpha':0.2});
plt.title("COVID Cases per 100,000 in the last 7 days & Trump Vote%")
plt.xlabel("Percentage of Trump Vote in 2020")
plt.ylabel("COVID Cases per 100,000 in week of 11/14/2020")

## Linear Regression in sklearn

In [None]:
# single variable linear regression
feature_cols = ['trump_pct']
X = data[feature_cols]
y = data.case_rate_7days

X.head()

### sklearn 4-steps ML

In [None]:
# Step 1: Import the model class

from sklearn.linear_model import LinearRegression

In [None]:
# Step 2: Instantiate the estimator

model = LinearRegression()

In [None]:
# Step 3: Fit the model with data (training)

model.fit(X.values,y)

In [None]:
# Step 4: Predict response for new inputs

print(f"0%: {model.predict([[0.0]])}")
print(f"50%: {model.predict([[50.0]])}")
print(f"100%: {model.predict([[100.0]])}")

Let's look at the _intercept_ and _coefficient_: bring back the algebra!

In [None]:
print("intercept: {0}, coeff: {1}".
      format(model.intercept_, model.coef_))

Recall that each model always contains some amount of random irreducible error $\epsilon$. So, given a prediction $\hat{y}$, the actual $y = \hat{y} + \epsilon$. Below, we will assume $y$ is exactly linear.

- We are often taught the formula for a line is: $y = mx + b$.
- Note this can alternatively be written: $y = \beta_0 + \beta_1 x$.

Interpreting the intercept ($\beta_0$):

- It is the value of $y$ when all independent variables are 0.
- Here, it is the estimated number of COVID-19 cases per 100,000 people in the last 7 days when the Trump vote percentage is 0.
- **Note:** It does not always make sense to interpret the intercept. (Why?)

Interpreting the "trump_pct" coefficient ($\beta_1$):

- **Interpretation:** An increase of 1 percentage point in the Trump vote share is _associated with_ a change of $\beta_1$ in the number of cases per 100,000 people in the state population.
- Here, an increase of 1 percentage point in a state's Trump vote share is _associated with_ an increase of 1.6 COVID-19 cases per 100,000 people in the state population.
- This is not a statement of causation.
- $\beta_1$ would be **negative** if an increase in Trump vote percentage was associated with a **decrease** in the number of COVID-19 cases per 100,000 population.
- $\beta_1$ would be **zero** if Trump vote percentage is not associated with COVID-19 cases per 100,000 population.
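
The two interpretations above can be checked numerically. The following is a minimal sketch on made-up points that lie exactly on a line; it uses `np.polyfit` as a stand-in for the `LinearRegression` model fitted earlier, and `predict` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Made-up points that lie exactly on y = 10 + 2x (not the notebook's data).
x = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
y = 10.0 + 2.0 * x

# polyfit returns coefficients from the highest degree down: slope, then intercept.
beta_1, beta_0 = np.polyfit(x, y, deg=1)


def predict(x_new):
    """Prediction from the fitted line: beta_0 + beta_1 * x."""
    return beta_0 + beta_1 * x_new


# The prediction at x = 0 is the intercept.
print("prediction at 0:", predict(0.0), "intercept:", beta_0)
# Moving x up by one unit changes the prediction by exactly beta_1.
print("change per unit of x:", predict(51.0) - predict(50.0), "slope:", beta_1)
```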

---

Here, we will generalize this to $n$ independent variables as follows:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon$

- $y$ is the response.
- $\beta_0$ is the intercept.
- $\beta_1$ is the coefficient for $x_1$ (the first feature).
- $\beta_n$ is the coefficient for $x_n$ (the nth feature).
- $\epsilon$ is the _error_ term.

This equation is still called **linear** because the response is a weighted sum of the independent variables, with each $x_i$ appearing only to the first power. Note that the $\beta$ values are constants estimated from the data, so they are not independent variables in the final model, as seen above.
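
As a rough illustration of the multi-variable form, the sketch below generates synthetic data from known coefficients and recovers them with NumPy's least squares solver. The features `x1` and `x2` and the chosen coefficient values are assumptions for the example, not quantities from this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two synthetic features and a noise term (all made up for illustration).
x1 = rng.uniform(0, 100, size=n)
x2 = rng.uniform(0, 10, size=n)
epsilon = rng.normal(0, 1.0, size=n)

# True relationship: y = 5 + 1.5*x1 - 3*x2 + epsilon
y = 5.0 + 1.5 * x1 - 3.0 * x2 + epsilon

# Design matrix: a column of ones for the intercept, then one column per feature.
X = np.column_stack([np.ones(n), x1, x2])

# Solve for the betas that minimize the sum of squared residuals.
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_0, beta_1, beta_2 = betas
print(f"beta_0={beta_0:.3f}, beta_1={beta_1:.3f}, beta_2={beta_2:.3f}")
```

The estimates land close to, but not exactly on, the true values because of the noise term.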


---

The $\beta$ values are called the **model coefficients**:

- These values are estimated (or "learned") during the model fitting process using the **least squares criterion**.
- Specifically, we are trying to find the line (mathematically) that minimizes the **sum of squared residuals** (or "sum of squared errors").
- Once we've learned these coefficients, we can use the model to predict the response.

![Estimating coefficients](./images/estimating_coefficients.png)

In the diagram above:

- The black dots are the **observed values** of x and y.
- The blue line is our **least squares line**.
- The red lines are the **residuals**, which are the vertical distances between the observed values and the least squares line.
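
The least squares criterion can be sketched for a single feature using the standard closed-form estimates. The five points below are made up, and `sum_squared_residuals` is a hypothetical helper; the aim is only to show that moving the line away from the fitted coefficients increases the sum of squared residuals.

```python
import numpy as np

# Five made-up observations (not the notebook's data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates for one feature:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# The fitted line always passes through the point (mean_x, mean_y).
beta_0 = y.mean() - beta_1 * x.mean()


def sum_squared_residuals(b0, b1):
    """Sum of squared vertical distances between the points and the line."""
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)


best = sum_squared_residuals(beta_0, beta_1)
nudged = sum_squared_residuals(beta_0 + 0.1, beta_1 - 0.05)
print(f"fitted line SSR: {best:.4f}, nudged line SSR: {nudged:.4f}")
```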


### Consider these 3 predictions again:

In [None]:
print(f"0%: {model.predict([[0.0]])}")
print(f"50%: {model.predict([[50.0]])}")
print(f"100%: {model.predict([[100.0]])}")