<h2>CS 4780/5780 Final Project: </h2>
<h3>Election Result Prediction for US Counties</h3>



<h3>Introduction:</h3>

<p>The final project asks you to carry out a real-world machine learning project on your own, with everything that involves. Unlike programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The programming projects provide templates for how to do this, and the most recent video lectures summarize some of the tricks you will need (e.g. feature normalization, feature construction). This final project therefore reflects how you will use machine learning in the real world.</p>

<p>The task you will work on is forecasting election results. Economic and sociological factors have been widely used to predict the voting results of US elections, and these factors vary a lot among counties in the United States. In addition, as you may observe from the maps of recent elections, neighboring counties show similar voting patterns. In this project you will use machine learning to predict county-level election results from economic and sociological factors and the geographic structure of US counties.</p>
<p>

<h3>Your Task:</h3>
Please read the project description PDF file carefully and make sure you write your code and your answers to all the questions in this Jupyter Notebook. Your answers to the questions make up a large portion of your grade for this final project. Please import the packages in this notebook and cite any references you used, as described in the project description. You need to print this entire Jupyter Notebook as a PDF file and submit it to Gradescope, and also submit the runnable ipynb version to Canvas for us to run.

<h3>Due Date:</h3>
The final project dataset and template Jupyter notebook are due on <strong>December 15th</strong>. Note that <strong>no late submissions will be accepted</strong>, and you cannot apply any unused slip days to this project.
</p>

<h2>Part 1: Basics</h2><p>

<h3>1.1 Import:</h3><p>
Please import the packages you need. Note that learning and using packages is recommended but not required for this project. Official tutorials for the suggested packages include:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [1]:
import os
import pandas as pd
import numpy as np
# TODO
from sklearn.preprocessing import StandardScaler
import math
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

<h3>Weighted Accuracy:</h3><p>
Since the labels in our dataset are heavily imbalanced, you need to use the following function to compute weighted accuracy throughout your training and validation process. We use the same metric for testing on Kaggle.
<p>

In [2]:
def weighted_accuracy(pred, true):
    assert(len(pred) == len(true))
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    frac_pos = num_pos/num_labels
    weight_pos = 1/frac_pos
    weight_neg = 1/(1-frac_pos)
    num_pos_correct = 0
    num_neg_correct = 0
    for pred_i, true_i in zip(pred, true):
        num_pos_correct += (pred_i == true_i and true_i == 1)
        num_neg_correct += (pred_i == true_i and true_i == 0)
    weighted_accuracy = ((weight_pos * num_pos_correct) 
                         + (weight_neg * num_neg_correct))/((weight_pos * num_pos) + (weight_neg * num_neg))
    return weighted_accuracy
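
Because each class is weighted by the inverse of its frequency, the function above works out to the average of the per-class accuracies: the denominator is always twice the number of labels, and the numerator is the number of labels times the sum of the two per-class recalls. The following is a minimal NumPy sketch of that equivalence. The name `weighted_accuracy_vectorized` and the toy labels are illustrative assumptions, not part of the project template, and the sketch assumes both classes appear in `true`.

```python
import numpy as np

def weighted_accuracy_vectorized(pred, true):
    """Average of per-class accuracies (assumes both classes occur in `true`)."""
    pred = np.asarray(pred)
    true = np.asarray(true)
    pos = true == 1
    neg = true == 0
    recall_pos = np.mean(pred[pos] == 1)  # fraction of positives predicted correctly
    recall_neg = np.mean(pred[neg] == 0)  # fraction of negatives predicted correctly
    return 0.5 * (recall_pos + recall_neg)

# Toy labels (made up for illustration): one positive among eight counties.
true = [1, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 8

# Plain accuracy would be 7/8 here, but the weighted accuracy is only 0.5.
print(weighted_accuracy_vectorized(always_negative, true))
```

A classifier that always predicts the majority class therefore scores 0.5 under this metric, no matter how imbalanced the labels are.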

<h3>2.1 Preprocessing and Feature Extraction:</h3><p>
Given the training dataset and graph information, you need to correctly preprocess the dataset (e.g. feature normalization). For the baseline solution in this part, you may not need to introduce extra features to reach the baseline test accuracy.
<p>
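
As one possible reading of "feature normalization", the sketch below standardizes each feature column to zero mean and unit variance, with the statistics computed on the training rows only and then reused for held-out rows. The helper names `fit_standardizer` and `apply_standardizer` are hypothetical; the numbers are the MedianIncome and UnemploymentRate values of the first five rows of train_2016.csv as displayed in this notebook.

```python
import numpy as np

def fit_standardizer(train_features):
    """Return the per-column mean and standard deviation of the training rows."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return mean, std

def apply_standardizer(features, mean, std):
    """Scale features using statistics that were fitted on the training rows."""
    return (features - mean) / std

# Columns: MedianIncome, UnemploymentRate
train = np.array([[51837, 4.2],
                  [49793, 6.9],
                  [44914, 5.3],
                  [74374, 2.9]], dtype=float)
held_out = np.array([[26957, 14.0]], dtype=float)

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
held_out_scaled = apply_standardizer(held_out, mean, std)
print(train_scaled.round(2))
print(held_out_scaled.round(2))
```

Fitting the statistics on the training rows alone keeps information from the held-out rows out of the preprocessing step.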

In [3]:
# You may change this but we suggest loading data with the following code and you may need to change
# datatypes and do necessary data transformation after loading the raw data to the dataframe.
dataset_path = "./train_2016.csv"
# df = pd.read_csv(dataset_path, sep=',',header=None, encoding='unicode_escape')

# Chose to include header to remember column identifiers
df = pd.read_csv(dataset_path, sep=',', encoding='unicode_escape')
df.head()

Unnamed: 0,FIPS,County,DEM,GOP,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate
0,18019,"Clark County, IN",18791,30012,51837,4.9,12.8,11.0,20.9,4.2
1,6035,"Lassen County, CA",2026,6533,49793,-18.4,9.2,6.3,12.0,6.9
2,40081,"Lincoln County, OK",2423,10838,44914,-1.3,11.4,11.7,15.1,5.3
3,31153,"Sarpy County, NE",27704,44649,74374,9.2,14.2,5.0,40.1,2.9
4,28055,"Issaquena County, MS",395,298,26957,-12.8,9.8,5.3,6.7,14.0


In [4]:
#preprocessing 
#extracted state name from county information as a feature
get_state_from_county = lambda county : county[county.index(",") + 2:] 
df["state_name"] = df["County"].apply(get_state_from_county)

# 0 for GOP, 1 for DEM
df["target"] = (df["DEM"] > df["GOP"]).astype(int)
df["float_target"] = (df["DEM"] / (df["DEM"] + df["GOP"])).astype(float)

#converted median income to int
parse_numerical_string = lambda income : int(income.replace(",", ""))
df["MedianIncome"] = df["MedianIncome"].apply(parse_numerical_string)
df.head()

Unnamed: 0,FIPS,County,DEM,GOP,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,state_name,target,float_target
0,18019,"Clark County, IN",18791,30012,51837,4.9,12.8,11.0,20.9,4.2,IN,0,0.385038
1,6035,"Lassen County, CA",2026,6533,49793,-18.4,9.2,6.3,12.0,6.9,CA,0,0.23671
2,40081,"Lincoln County, OK",2423,10838,44914,-1.3,11.4,11.7,15.1,5.3,OK,0,0.182716
3,31153,"Sarpy County, NE",27704,44649,74374,9.2,14.2,5.0,40.1,2.9,NE,0,0.382901
4,28055,"Issaquena County, MS",395,298,26957,-12.8,9.8,5.3,6.7,14.0,MS,1,0.569986


In [5]:
dataset = df 

In [6]:
#Label each column as numerical or categorical; the state name is the only categorical column
#and will be treated as an index
categorical_columns = ['state_name']
numerical_columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']
outputs = ['target']

In [7]:
#convert to type category
for category in categorical_columns:
    dataset[category] = dataset[category].astype('category')
statname = dataset['state_name'].cat.codes.values

#creates respective tensors for categorical, numerical, and output data
categorical_data = np.stack([statname], 1)
categorical_data = torch.tensor(categorical_data, dtype=torch.int64)

numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
numerical_data = torch.tensor(numerical_data, dtype=torch.float)

outputs = torch.tensor(dataset[outputs].values).flatten()


In [8]:
#embedding size: half the number of unique states (rounded up), capped at 50
categorical_column_sizes = [len(dataset[column].cat.categories) for column in categorical_columns]
categorical_embedding_sizes = [(col_size, min(50, (col_size+1)//2)) for col_size in categorical_column_sizes]
print(categorical_embedding_sizes)

[(50, 25)]
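
The sizing rule used above can be checked by hand for a few category counts. The sketch below wraps it in a small helper; the name `embedding_size` and the `max_dim` parameter are illustrative and do not appear elsewhere in this notebook.

```python
def embedding_size(num_categories, max_dim=50):
    """Return (number of categories, embedding width) using the half-size rule."""
    return (num_categories, min(max_dim, (num_categories + 1) // 2))

for n in (2, 7, 50, 3000):
    print(embedding_size(n))
# (2, 1)
# (7, 4)
# (50, 25)
# (3000, 50)
```

With 50 states the rule gives a 25-dimensional embedding, matching the printed `[(50, 25)]`; the cap of 50 only matters for columns with more than 100 distinct values.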


In [9]:
#training set size: 1322, held-out set size: 233 (15% of 1555, rounded down)
total_records = 1555
test_records = int(total_records * .15)

#partition into training and held-out sets, in that order
categorical_train_data = categorical_data[:total_records-test_records]
categorical_test_data = categorical_data[total_records-test_records:total_records]
numerical_train_data = numerical_data[:total_records-test_records]
numerical_test_data = numerical_data[total_records-test_records:total_records]
train_outputs = outputs[:total_records-test_records]
test_outputs = outputs[total_records-test_records:total_records]
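
The slicing above keeps the first rows for training and the last `int(total_records * .15)` rows as the held-out set. The sketch below reproduces that arithmetic and, as an optional variation that this notebook does not use, shows how shuffled indices could be drawn instead of a contiguous slice. The helper name `split_sizes` and the seed are assumptions.

```python
import numpy as np

def split_sizes(total_records, held_out_fraction):
    """Return (number of training rows, number of held-out rows)."""
    num_held_out = int(total_records * held_out_fraction)
    return total_records - num_held_out, num_held_out

num_train, num_held_out = split_sizes(1555, 0.15)
print(num_train, num_held_out)

# Optional variation (not used in this notebook): shuffle row indices with a
# fixed seed so the held-out rows do not depend on the order of the CSV file.
rng = np.random.default_rng(0)
shuffled = rng.permutation(1555)
train_idx, held_out_idx = shuffled[:num_train], shuffled[num_train:]
```

The two index arrays are disjoint and together cover every row, so they can be used to index the categorical, numerical, and output tensors consistently.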

<h3>2.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 1.1.

In [10]:
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        #sets up embedding with embedding size for state name
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        #dropout randomly zeros elements to avoid overfitting the training set
        self.embedding_dropout = nn.Dropout(p)
        #normalizes numerical data per batch 
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        #calculates total input size for first layer of the nn
        num_categorical_cols = sum((nf for ni, nf in embedding_size))
        input_size = num_categorical_cols + num_numerical_cols

        all_layers = []
        #each layer has a ReLU activation along with Batch Normalization with dropout
        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i
        
        #finishes with last output layer
        all_layers.append(nn.Linear(layers[-1], output_size))
        #Creates the network
        self.layers = nn.Sequential(*all_layers)
        
    #forward pass
    def forward(self, x_categorical, x_numerical):
        #adds embedding for categorical columns
        embeddings = []
        for i,e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:,i]))
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)
        #applies batch normalization
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        #performs the forward pass calculations
        x = self.layers(x)
        return x
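
To make the shapes in `forward` concrete: an embedding layer is a lookup table with one row per category, and the looked-up rows are concatenated with the numerical features before the first linear layer. The NumPy sketch below mimics only that bookkeeping with random values. It does not reproduce dropout, batch normalization, or any learned weights, and the state codes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, embedding_dim, num_numerical = 50, 25, 6

# One row of 25 values per state, standing in for nn.Embedding(50, 25).
embedding_table = rng.normal(size=(num_states, embedding_dim))

# A toy batch of four counties: integer state codes plus six numerical features.
state_codes = np.array([3, 17, 3, 42])
numerical = rng.normal(size=(4, num_numerical))

embedded = embedding_table[state_codes]  # shape (4, 25)
first_layer_input = np.concatenate([embedded, numerical], axis=1)  # shape (4, 31)
print(first_layer_input.shape)
```

Counties from the same state (codes 3 and 3 here) receive identical embedding rows, which is how the model can share information within a state; the first linear layer therefore takes 25 + 6 = 31 inputs.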

In [37]:
#instantiates the model
model = Model(categorical_embedding_sizes, numerical_data.shape[1], 2, [200,100,50,50], p=0.35)
#uses a negative log-likelihood loss, weighting the Democrat class slightly higher because the data is imbalanced.
#CrossEntropyLoss combines a LogSoftmax layer and a negative log-likelihood loss in one single class.
#Note: NLLLoss expects log-probabilities, but the model ends in a plain Linear layer, so the loss values printed below are negative.

#loss_function = nn.CrossEntropyLoss(weight = torch.Tensor([1.0, 1.1]))
loss_function = nn.NLLLoss(weight = torch.Tensor([1.0, 1.1]))

#uses the Adam optimizer; SGD with momentum is kept as a commented-out alternative
#optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=.9)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

#number of times to run through the training set
epochs = 700

#stores losses
aggregated_losses = []

for i in range(1, epochs + 1):
    
    #calculates loss off of model prediction
    y_pred = model(categorical_train_data, numerical_train_data)
    single_loss = loss_function(y_pred, train_outputs)
    #stores the loss as a plain float so the computation graph is not kept alive
    aggregated_losses.append(single_loss.item())

    #prints loss
    if i%25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')
    
    #backpropagation
    optimizer.zero_grad()
    single_loss.backward()
    optimizer.step()

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')

epoch:   1 loss: -0.06863649
epoch:  26 loss: -0.11102536
epoch:  51 loss: -0.15216479
epoch:  76 loss: -0.22497703
epoch: 101 loss: -0.29829210
epoch: 126 loss: -0.37059617
epoch: 151 loss: -0.54081571
epoch: 176 loss: -0.61411351
epoch: 201 loss: -0.75948822
epoch: 226 loss: -0.93056804
epoch: 251 loss: -1.04870176
epoch: 276 loss: -1.18483686
epoch: 301 loss: -1.28914666
epoch: 326 loss: -1.38577700
epoch: 351 loss: -1.53217435
epoch: 376 loss: -1.67537022
epoch: 401 loss: -1.74851191
epoch: 426 loss: -1.90516901
epoch: 451 loss: -2.00635886
epoch: 476 loss: -2.16849065
epoch: 501 loss: -2.38289809
epoch: 526 loss: -2.49256897
epoch: 551 loss: -2.64232898
epoch: 576 loss: -2.80099821
epoch: 601 loss: -3.03859496
epoch: 626 loss: -3.14676261
epoch: 651 loss: -3.25770736
epoch: 676 loss: -3.34842992
epoch: 700 loss: -3.5492279530
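
The negative, steadily decreasing losses above are consistent with a negative log-likelihood loss being applied to raw scores rather than log-probabilities: the model ends in a plain `nn.Linear` layer, so nothing bounds the score of the correct class. The NumPy sketch below works through the arithmetic on made-up logits. It is an illustration of the formula (a weighted mean of the negated score of the true class), not PyTorch's implementation, and the helper names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class dimension."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def weighted_nll(scores, targets, class_weights):
    """Weighted mean of the negated score assigned to each true class."""
    picked = scores[np.arange(len(targets)), targets]
    weights = class_weights[targets]
    return -(weights * picked).sum() / weights.sum()

# Made-up raw outputs for three counties and two classes (0 = GOP, 1 = DEM).
logits = np.array([[5.0, -5.0],
                   [-4.0, 6.0],
                   [3.0, -2.0]])
targets = np.array([0, 1, 0])
class_weights = np.array([1.0, 1.1])

loss_on_raw_logits = weighted_nll(logits, targets, class_weights)
loss_on_log_probs = weighted_nll(log_softmax(logits), targets, class_weights)
print(loss_on_raw_logits)  # negative, and unbounded below as logits grow
print(loss_on_log_probs)   # non-negative, approaches 0 for confident correct predictions
```

Applying log-softmax before the negative log-likelihood is what `nn.CrossEntropyLoss` does internally, which is why the commented-out `CrossEntropyLoss` line would report non-negative values on the same model.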


In [38]:
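#prints out the loss on the held-out set
#note: model.eval() is not called, so dropout and batch norm still run in training mode here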
with torch.no_grad():
    y_val = model(categorical_test_data, numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss:.8f}')

Loss: -3.39637876


In [39]:
#predicted class is the index of the larger of the two output scores
y_output = torch.argmax(y_val, dim=1)

In [40]:
weighted_accuracy(y_output, test_outputs)

tensor(0.8460)