# Problem Statement

[Election, COVID, and Demographic Data by County: What Factors Influenced the USA 2020 Election?](https://www.kaggle.com/etsc9287/2020-general-election-polls) is a county-level compilation of how the vote went in 2016 and in 2020 (the COVID-19 year), together with COVID-19 and demographic data. Voter opinion clearly shifted between 2016 and 2020. With this dataset we can look for the factors associated with that shift, as well as other factors behind the results in both years across the states.

# Objective

The objective is to develop models that predict the performance of the two main candidates. The models predict the percentage vote share obtained in 2020 by Joe Biden and Donald Trump in each county.

# Featured Techniques

*   EDA
*   Linear Regression
*   Lasso Regression
*   Random Forest Regression
*   Gradient Boost Regression
*   ANN Model



In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/county_statistics.csv')

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df.head()

In [None]:
df.info()

## Cleaning the data

We don't have 2016 voting data for 1522 counties, so we can't use them to study how vote share changed because of COVID and other factors. That's why I am going to drop them.

In [None]:
df.dropna(inplace=True, axis=0)
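Before dropping rows, it can be worth checking how many are affected and which columns are responsible. The following is a minimal sketch on a toy frame; the `toy` data and the choice of `subset` column are illustrative assumptions, not the notebook's actual data or method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the county table; all values are made up for illustration
toy = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "percentage16_Donald_Trump": [0.61, np.nan, 0.45, np.nan],
    "percentage20_Donald_Trump": [0.63, 0.52, 0.44, 0.70],
})

# Number of rows with at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Missing values per column, to see which columns drive the loss
missing_per_column = toy.isna().sum()

# Drop only the rows that lack the 2016 result, rather than rows with any missing value
cleaned = toy.dropna(subset=["percentage16_Donald_Trump"])
print(rows_with_missing, len(cleaned))
```

Restricting `dropna` with `subset` keeps counties that are only missing an unrelated column, which may or may not be desirable depending on which features the models use.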

In [None]:
# convert some cols to percentages to remove any implicit dependence on other variables

df["turnout_change"] = df["total_votes20"] - df["total_votes16"]
df["perc_turnout_change"] = df["turnout_change"] / df["TotalPop"]

df["trump_change"] = df["percentage20_Donald_Trump"] - df["percentage16_Donald_Trump"]
df["dem_change"] = df["percentage20_Joe_Biden"] - df["percentage16_Hillary_Clinton"]

df["case_rate"] = df["cases"] / df["TotalPop"]
df["death_rate"] = df["deaths"] / df["cases"]

df['Men_Ratio'] = df['Men'] / df['TotalPop']
df['Employed%'] = df['Employed'] / df['TotalPop']

df['majority'] = df[['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']].idxmax(axis=1)
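One thing to watch for in the ratios above: `deaths / cases` is undefined for a county with zero recorded cases and produces `NaN` (0/0) or `inf`. A possible guard, shown on made-up numbers and assuming that a death rate of 0 is an acceptable convention for such counties:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the second county has no recorded cases
toy = pd.DataFrame({
    "cases": [100, 0, 50],
    "deaths": [2, 0, 1],
})

# Replace zero case counts with NaN so the division gives NaN instead of inf or 0/0
safe_cases = toy["cases"].replace(0, np.nan)

# Assumption: a county with no cases is assigned a death rate of 0
toy["death_rate"] = (toy["deaths"] / safe_cases).fillna(0.0)
print(toy)
```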

## EDA

In [None]:
# removing counties from hawaii and alaska for making a better map
df_mainland = df[~df["state"].isin(["AK", "HI"])]

plt.figure(figsize = (14,10))
sns.scatterplot(data = df_mainland, x = "long", y = "lat", hue = "percentage20_Joe_Biden", size = "total_votes20", 
                sizes = (20, 200), size_norm = (10000, 800000), hue_norm = (0.031,0.944), palette = "coolwarm_r")
plt.title("The 2020 Election (Red = More Republican; Blue = More Democratic)");

From this map, it seems that cities with high population are more inclined towards Democrats, whereas the rural population is more inclined towards Republicans.

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df_mainland, x = "long", y = "lat", hue = "dem_change", size = "total_votes20", sizes = (20, 200),
                size_norm = (10000, 800000), palette = "coolwarm_r", hue_norm=(-0.3, 0.3))
plt.title("Voting Shifts From the 2016 Election to 2020 Election");

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df, x = "dem_change", y = "TotalPop", hue = "dem_change", palette = "coolwarm", sizes=(10,200))
plt.title("Voting Shifts From the 2016 Election to 2020 Election")

It seems Democrats continued their dominance in urban areas and Republicans in rural areas. Trump's 2020 gains in rural America are offset by Biden's urban dominance. It also seems that in rural and suburban areas there is a small shift from Republicans to Democrats.

https://www.brookings.edu/research/bidens-victory-came-from-the-suburbs/

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df_mainland, x = "long", y = "lat", hue = "dem_change", size = "death_rate", sizes=(10,200),palette = "coolwarm_r", hue_norm=(-0.3, 0.3))
plt.title("Effect of death rate on democrat vote shift")

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df, x = "trump_change", y = "death_rate", hue = "trump_change", size = "death_rate",palette = "coolwarm", sizes=(10,200), hue_norm=(-0.3, 0.3))
plt.title("Effect of death rate on republican vote shift")

It can also be seen that counties where Trump won by a massive margin have a higher death rate than counties where Joe Biden won in a relative landslide. It was speculated that the salience of the pandemic would be a major problem for Trump's electoral campaign, because an overwhelming number of voters judged that he had mishandled the crisis. That is not very surprising.
 
https://www.usnews.com/news/health-news/articles/2022-02-03/counties-that-voted-for-trump-have-higher-covid-death-rates

In [None]:
sum(df["dem_change"]>0)

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df, x = "percentage20_Joe_Biden", y = "death_rate", hue = "percentage20_Joe_Biden", size = "total_votes20", sizes=(10,200),palette = "coolwarm_r")
plt.title("Effect of death rate on democrat vote percentage")

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df, x = "percentage20_Joe_Biden", y = "Unemployment", hue = "percentage20_Joe_Biden", size = "total_votes20", sizes=(10,200),palette = "coolwarm_r")
plt.title("Effect of unemployment rate on democrat vote percentage")

There seems to be a weak trend between the unemployment rate and the vote percentage for Biden: counties with a higher unemployment rate tend to have a higher Democratic vote share.
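A visual trend like this can be backed up with a correlation coefficient. The sketch below uses a hypothetical helper `trend_strength` and made-up values; on the real data it would be called on `df` with the same column names.

```python
import pandas as pd

def trend_strength(frame, x_col, y_col):
    """Return the Pearson correlation between two columns (missing pairs are ignored)."""
    return frame[x_col].corr(frame[y_col])

# Toy values chosen to be perfectly linear, for illustration only
toy = pd.DataFrame({
    "Unemployment": [2.0, 4.0, 6.0, 8.0],
    "percentage20_Joe_Biden": [0.30, 0.40, 0.50, 0.60],
})

r = trend_strength(toy, "Unemployment", "percentage20_Joe_Biden")
print(round(r, 3))
```

A value close to 0 would indicate that the apparent trend in the scatter plot is weak; the coefficient says nothing about causation.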


In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df, x = "percentage20_Joe_Biden", y = "case_rate", hue = "percentage20_Joe_Biden", size = "case_rate", sizes=(10,200),palette = "coolwarm_r")
plt.title("Effect of case rate on democrat vote percentage")

Case rate vs Vote share seems to follow the same trend as Death rate vs Vote Share.

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df, x = "Men_Ratio", y = "percentage20_Joe_Biden", hue = "percentage20_Joe_Biden", size = "total_votes20", sizes=(10,200),palette = "coolwarm_r")
plt.title("Effect of sex ratio on democrat vote percentage")

The plot above suggests that counties with a higher share of men are Republican-dominated. From this we can assume that men are more inclined towards Republicans than women.

https://www.economist.com/united-states/2018/07/21/male-voters-are-sticking-with-the-republican-party

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df, x = "trump_change", y = "Hispanic", hue = "trump_change",palette = "coolwarm", sizes=(10,200), hue_norm=(-0.3, 0.3))
plt.title("Voting trend of hispanic community")

In [None]:
plt.figure(figsize = (20,15))
sns.scatterplot(data = df_mainland, x = "long", y = "lat", hue = "dem_change", size = "death_rate", style='majority',sizes=(10,200),palette = "coolwarm_r", hue_norm=(-0.3, 0.3))
plt.title("Voting Shifts From the 2016 Election to 2020 Election")

In [None]:
from matplotlib.ticker import PercentFormatter

def Political_affection(x_, y1_, y2_):
    plt.figure(figsize=(15, 8))
    ax = sns.barplot(x=x_, y=y1_, data=df, ci=None)
    width_scale = 0.45
    for bar in ax.containers[0]:
        bar.set_width(bar.get_width() * width_scale)
    ax.yaxis.set_major_formatter(PercentFormatter(1))

    ax2 = ax.twinx()
    sns.barplot(x=x_, y=y2_, data=df, alpha=0.7, hatch='xx', ax=ax2, ci=None)
    for bar in ax2.containers[0]:
        x = bar.get_x()
        w = bar.get_width()
        bar.set_x(x + w * (1- width_scale))
        bar.set_width(w * width_scale)

    plt.show()

In [None]:
Political_affection('majority', 'percentage16_Donald_Trump', 'percentage20_Donald_Trump')

In [None]:
Political_affection('majority', 'percentage16_Hillary_Clinton', 'percentage20_Joe_Biden')

In [None]:
Political_affection('majority', 'percentage20_Donald_Trump', 'percentage20_Joe_Biden')

In [None]:
Political_affection('majority', 'trump_change', 'dem_change')

It is clear from the plots above that Trump is more popular in White and Latino communities. This divide seems to have deepened further in the 2020 election. Trump had a majority in almost 70% of the White-dominant counties. He has also become considerably more popular in the Latino community compared to 2016. African Americans have always been loyal Democrats, but the Democrats' popularity among African Americans seems to have decreased in recent elections. The same goes for the Asian community. The Native vote share has also seen a huge jump in favour of Biden in the 2020 election.

https://www.nbcnews.com/news/nbcblk/black-men-drifted-democrats-toward-trump-record-numbers-polls-show-n1246447
https://www.nytimes.com/2021/04/02/us/politics/trump-latino-voters-2020.html
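Group-level statements such as the share of counties won within each majority group can be computed directly with a `groupby`. This is a sketch on made-up rows; the `trump_won` column is a hypothetical helper, and the column names are assumed to match the notebook's `df`.

```python
import pandas as pd

# Toy counties, values invented for illustration
toy = pd.DataFrame({
    "majority": ["White", "White", "White", "Black", "Hispanic", "Hispanic"],
    "percentage20_Donald_Trump": [0.70, 0.55, 0.40, 0.20, 0.60, 0.35],
    "percentage20_Joe_Biden": [0.28, 0.43, 0.58, 0.79, 0.38, 0.63],
})

# True where Trump's share exceeds Biden's in that county
toy["trump_won"] = toy["percentage20_Donald_Trump"] > toy["percentage20_Joe_Biden"]

# Fraction of counties won by Trump within each majority group
share_won = toy.groupby("majority")["trump_won"].mean()

# Number of counties per group, useful for judging how reliable each fraction is
counts = toy.groupby("majority").size()
print(share_won, counts, sep="\n")
```

Groups with very few counties (Pacific or Asian majority, for example) can produce extreme averages, so the counts are worth reading alongside the fractions.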

# Modelling

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import  mean_squared_error
from sklearn.metrics import r2_score
import lightgbm as lgb
from xgboost import XGBRegressor

In [None]:
df.columns

In [None]:
# Filtering dataset for training the model. Removing columns that might not be useful for training model.

new_cols = ['TotalPop','Employed', 'Hispanic', 'White', 'Black', 
                  'Native', 'Asian', 'Pacific', 'VotingAgeCitizen', 'Income', 'IncomeErr', 
                  'IncomePerCap', 'IncomePerCapErr', 'Poverty', 'ChildPoverty', 'Professional', 
                  'Service', 'Office', 'Construction', 'Production', 'Drive', 'Carpool', 'Transit',
                  'Walk', 'OtherTransp', 'WorkAtHome', 'MeanCommute', 'PrivateWork', 
                  'PublicWork', 'SelfEmployed', 'FamilyWork', 'Unemployment',
                  'percentage16_Donald_Trump','percentage16_Hillary_Clinton','total_votes16',
                  'votes16_Donald_Trump','votes16_Hillary_Clinton', 'cases', 'deaths']
features=df[new_cols]

In [None]:
#fit and transform
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = new_cols)

In [None]:
y = np.array(df['percentage20_Joe_Biden']).reshape(-1,1)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(scaled_features,y,test_size=0.25,random_state=42)
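Because the scaler above is fitted on all rows before the split, the test rows contribute to the scaling statistics. One common alternative is to split first and put the scaler inside a scikit-learn `Pipeline`, so it is fitted on the training rows only. A minimal sketch on synthetic data (the `X_demo`, `y_demo` and `pipe` names are hypothetical and chosen so they do not overwrite the notebook's variables):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the county features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([0.5, -0.2, 0.0, 0.1, 0.3]) + rng.normal(scale=0.05, size=200)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# The pipeline fits the scaler on X_tr only, then reuses those statistics for X_te
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
test_r2 = pipe.score(X_te, y_te)
print(round(test_r2, 3))
```

For tree-based models the scaling has little effect on the fit, so the leakage matters mostly for the linear models and the neural network.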

In [None]:
X_train.shape, X_test.shape,  y_train.shape, y_test.shape

In [None]:
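# Function to train and evaluate a given model
def fit_Evaluate(model,X_train,y_train,X_test,y_test):
    model.fit(X_train, y_train)
    # Flatten both arrays: some models predict shape (n,) and others (n, 1),
    # and subtracting (n, 1) - (n,) would broadcast to an (n, n) matrix
    y_true = np.ravel(y_test)
    prediction = np.ravel(model.predict(X_test))
    mae = mean_absolute_error(y_true, prediction)
    rss = np.sum(np.square(y_true - prediction))
    rmse = np.sqrt(mean_squared_error(y_true, prediction))
    return mae,rss,rmse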
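A detail worth keeping in mind when the target is reshaped to `(-1, 1)`: NumPy broadcasting turns the difference between a column vector and a flat array into a full matrix, which silently inflates a residual sum of squares. A small illustration with made-up numbers:

```python
import numpy as np

y_col = np.array([1.0, 2.0, 3.0]).reshape(-1, 1)   # shape (3, 1), like y_test
pred_flat = np.array([1.0, 2.0, 4.0])              # shape (3,), like many model predictions

wrong = y_col - pred_flat            # broadcasts to shape (3, 3)
right = y_col.ravel() - pred_flat    # shape (3,), one residual per sample

rss_wrong = float(np.sum(np.square(wrong)))
rss_right = float(np.sum(np.square(right)))
print(wrong.shape, rss_wrong, rss_right)
```

`mean_absolute_error` and `mean_squared_error` handle the shapes themselves, so only hand-written residual arithmetic is affected.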

**Linear model:**

In [None]:
#Linear model
linear_model = LinearRegression()
# Training the Model
mae_LR,rss_LR,rmse_LR = fit_Evaluate(linear_model,X_train,y_train,X_test,y_test)
print("The MAE for the linear model is {:.3f} ".format(mae_LR))
print("The RSS for the linear model is {:.3f}".format(rss_LR))
print("The RMSE for the linear model is {:.3f}".format(rmse_LR))

In [None]:
# linear_model was already fitted by fit_Evaluate in the previous cell
linear_pred = linear_model.predict(X_test)
# R^2 on the test set as a percentage; r2_score expects (y_true, y_pred)
r2_LR = round(r2_score(y_test, linear_pred)*100,3)
r2_LR
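`r2_score` is not symmetric in its arguments: its signature is `r2_score(y_true, y_pred)`, and the total sum of squares in the denominator is computed from the first argument. A short illustration with made-up numbers shows how much the order can matter:

```python
from sklearn.metrics import r2_score

y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.0, 2.0, 3.0, 6.0]

# Correct order: variance is measured around the mean of the true values
r2_correct = r2_score(y_true_demo, y_pred_demo)

# Swapped order: variance is measured around the mean of the predictions instead
r2_swapped = r2_score(y_pred_demo, y_true_demo)

print(round(r2_correct, 3), round(r2_swapped, 3))
```

Here the residual sum of squares is 4 in both cases, but the total sum of squares is 5 for the true values and 14 for the predictions, so the two scores differ.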

In [None]:
linear_pred_df =pd.DataFrame(linear_pred)
linear_pred_df.columns = ['percentage20_Joe_Biden']

linear_pred_df.head(5)

**Lasso Regression:**

In [None]:
#Lasso regression
lasso_reg = Lasso(alpha=0.001)
mae_Lasso,rss_Lasso,rmse_Lasso= fit_Evaluate(lasso_reg,X_train,y_train,X_test,y_test)
print("The MAE for the lasso model is {:.3f} ".format(mae_Lasso))
print("The RSS for the lasso model is {:.3f}".format(rss_Lasso))
print("The RMSE for the lasso model is {:.3f}".format(rmse_Lasso))

In [None]:
# lasso_reg was already fitted by fit_Evaluate in the previous cell
lasso_pred = lasso_reg.predict(X_test)
# R^2 on the test set as a percentage; r2_score expects (y_true, y_pred)
r2_lasso = round(r2_score(y_test, lasso_pred)*100,3)
r2_lasso

**Random Forest (RF):**

In [None]:
#Random Forest
random_forest = RandomForestRegressor()
mae_RF,rss_RF,rmse_RF= fit_Evaluate(random_forest,X_train,y_train,X_test,y_test)
print("The MAE for the RandomForestRegressor is {:.3f} ".format(mae_RF))
print("The RSS for the RandomForestRegressor is {:.3f}".format(rss_RF))
print("The RMSE for the RandomForestRegressor is {:.3f}".format(rmse_RF))

In [None]:
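# random_forest was already fitted by fit_Evaluate in the previous cell;
# refitting without a fixed random_state would give a different forest than the one scored above
rf_pred = random_forest.predict(X_test)
# R^2 on the test set as a percentage; r2_score expects (y_true, y_pred)
r2_RF = round(r2_score(y_test, rf_pred)*100,3)
r2_RF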

**Gradient Boosting:**

In [None]:
gradient_boost = GradientBoostingRegressor()
mae_GB,rss_GB,rmse_GB= fit_Evaluate(gradient_boost,X_train,y_train,X_test,y_test)
print("The mean absolute error for the gradient boost is {:.3f} ".format(mae_GB))
print("The residual sum of squares for the gradient boost is {:.3f}".format(rss_GB))
print("The root mean squared error for the gradient boost is {:.3f}".format(rmse_GB))

In [None]:
# gradient_boost was already fitted by fit_Evaluate in the previous cell
gb_pred = gradient_boost.predict(X_test)
# R^2 on the test set as a percentage; r2_score expects (y_true, y_pred)
r2_GB = round(r2_score(y_test, gb_pred)*100,3)
r2_GB

# Building ANN Model

In [None]:
# Splitting dataset into train, validation and test sets:
# hold out 25% of the rows, then divide that holdout into validation (60%) and test (40%)
train, holdout = train_test_split(df, test_size=0.25, random_state=42)
val, test = train_test_split(holdout, test_size=0.4, random_state=42)
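A quick way to confirm that a three-way split is disjoint is to compare the row indices of the parts. A sketch on a toy frame (the `*_part` names are hypothetical and chosen so they do not overwrite the notebook's variables):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows standing in for the county table
toy = pd.DataFrame({"x": range(100)})

# First cut: training rows versus a holdout
train_part, holdout_part = train_test_split(toy, test_size=0.25, random_state=42)

# Second cut is taken from the holdout only, never from the full frame
val_part, test_part = train_test_split(holdout_part, test_size=0.4, random_state=42)

train_idx, val_idx, test_idx = set(train_part.index), set(val_part.index), set(test_part.index)
overlap = (train_idx & val_idx) | (train_idx & test_idx) | (val_idx & test_idx)
total = len(train_part) + len(val_part) + len(test_part)
print(len(overlap), total)
```

If the second split were taken from the full frame instead of the holdout, `overlap` would be non-empty and the validation and test scores would partly be measured on training rows.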

In [None]:
X_train = train[new_cols]
X_test = test[new_cols]
X_val = val[new_cols]

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
# Defining an ANN model
def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3):
    model = keras.Sequential()
    model.add(keras.layers.InputLayer([X_train.shape[1],]))
    model.add(keras.layers.BatchNormalization())
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons,activation="relu"))
    model.add(keras.layers.Dense(1))
    optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
    # Accuracy is a classification metric; track mean absolute error for this regression task
    model.compile(optimizer=optimizer, metrics=["mae"], loss='mean_squared_error')
    return model

In [None]:
# Importing optuna for hyperparameter tuning
!pip install optuna
import optuna

In [None]:
from keras.backend import clear_session
from sklearn.metrics import mean_squared_error

#Defining tuning objective

class Objective:
    
    def __init__(self):
        self.best_booster = None
        self._booster = None

    
    def __call__(self, trial):

        clear_session()
        
        # tunable parameters
        lr = trial.suggest_float("lr", 3e-4,3e-2, log=True)
        n_hidden = trial.suggest_int("n_hidden", 1,4, log=True)
        n_neurons = trial.suggest_int("n_neurons", 8,128, log=True)



        # building ann model
        model = build_model(n_hidden, n_neurons, lr)

        # fitting the model
        model.fit(
            np.asarray(X_train).astype("float32"),
            np.asarray(train['percentage20_Donald_Trump']).astype("float32"),
            epochs=15, validation_data=(X_val, val['percentage20_Donald_Trump']), verbose=0)

        self._booster = model
        
        # prediction and error on validation set
        val_preds = model.predict(X_val)
        # np.sqrt(MSE) is used because the squared=False argument is deprecated in recent scikit-learn releases
        val_rmse = np.sqrt(mean_squared_error(val['percentage20_Donald_Trump'], val_preds))


        return val_rmse
    
    # callback to save the best model
    def callback(self, study, trial):
        if study.best_trial == trial:
            self.best_booster = self._booster

In [None]:
objective = Objective()

study = optuna.create_study(pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
                            direction="minimize", study_name="Keras Regressor")

study.optimize(objective, n_trials=25, callbacks=[objective.callback])

In [None]:
print("Best trial:")
trial = study.best_trial

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

best_model = objective.best_booster

In [None]:
best_model.summary()

In [None]:
ann_pred = best_model.predict(X_test)

In [None]:
rmse_ANN = np.sqrt(mean_squared_error(test['percentage20_Donald_Trump'], ann_pred))
mae_ANN = mean_absolute_error(test['percentage20_Donald_Trump'], ann_pred)
print(round(rmse_ANN, 3))
print(round(mae_ANN, 3))

In [None]:
# R^2 on the test set as a percentage; r2_score expects (y_true, y_pred)
r2_ANN = round(r2_score(test['percentage20_Donald_Trump'], ann_pred)*100,3)
r2_ANN

In [None]:
ann_pred_df =pd.DataFrame(ann_pred)
ann_pred_df.columns = ['percentage20_Donald_Trump']

ann_pred_df.head(5)

In [None]:
data = [['LinearRegression',r2_LR, round(mae_LR,3), round(rmse_LR,3)],['LassoRegression',r2_lasso, round(mae_Lasso,3), round(rmse_Lasso,3)],
        ['RandomForestRegression',r2_RF, round(mae_RF,3), round(rmse_RF,3)],['GradientBoostingRegression',r2_GB, round(mae_GB,3), round(rmse_GB,3)],
        ['ANN Model',r2_ANN, round(mae_ANN,3), round(rmse_ANN,3)]]
SummaryResults = pd.DataFrame(data,columns=['Model','R2 Score (%)', 'MAE', 'RMSE'])
SummaryResults

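# Results

The best result was obtained from the Gradient Boosting regression. Note that the ANN was trained to predict `percentage20_Donald_Trump`, while the other models predict `percentage20_Joe_Biden`, so its metrics are not directly comparable with the rest of the table.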