<h2>CS 4780/5780 Final Project: </h2>
<h3>Election Result Prediction for US Counties</h3>

Names and NetIDs for your group members: Eric Osband (eo255), Anthony Cuturuffo (acc284), Eddie Freedman (ebf45???)

<h3>Introduction:</h3>

<p> The final project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The programming project provide templates for how to do this, and the most recent video lectures summarize some of the tricks you will need (e.g. feature normalization, feature construction). So, this final project brings realism to how you will use machine learning in the real world.  </p>

The task you will work on is forecasting election results. Economic and sociological factors have been widely used when making predictions on the voting results of US elections. Economic and sociological factors vary a lot among counties in the United States. In addition, as you may observe from the election map of recent elections, neighbor counties show similar patterns in terms of the voting results. In this project you will bring the power of machine learning to make predictions for the county-level election results using Economic and sociological factors and the geographic structure of US counties. </p>
<p>

<h3>Your Task:</h3>
Plase read the project description PDF file carefully and make sure you write your code and answers to all the questions in this Jupyter Notebook. Your answers to the questions are a large portion of your grade for this final project. Please import the packages in this notebook and cite any references you used as mentioned in the project description. You need to print this entire Jupyter Notebook as a PDF file and submit to Gradescope and also submit the ipynb runnable version to Canvas for us to run.

<h3>Due Date:</h3>
The final project dataset and template jupyter notebook will be due on <strong>December 15th</strong> . Note that <strong>no late submissions will be accepted</strong>  and you cannot use any of your unused slip days before.
</p>

![image.png; width="100";](attachment:image.png)

<h2>Part 1: Basics</h2><p>

<h3>1.1 Import:</h3><p>
Please import necessary packages to use. Note that learning and using packages are recommended but not required for this project. Some official tutorial for suggested packacges includes:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [34]:
import os
import pandas as pd
import numpy as np
# TODO
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import math
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

<h3>1.2 Weighted Accuracy:</h3><p>
Since our dataset labels are heavily biased, you need to use the following function to compute weighted accuracy throughout your training and validation process and we use this for testing on Kaggle.
<p>

In [35]:
def weighted_accuracy(pred, true):
    assert(len(pred) == len(true))
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    frac_pos = num_pos/num_labels
    weight_pos = 1/frac_pos
    weight_neg = 1/(1-frac_pos)
    num_pos_correct = 0
    num_neg_correct = 0
    for pred_i, true_i in zip(pred, true):
        num_pos_correct += (pred_i == true_i and true_i == 1)
        num_neg_correct += (pred_i == true_i and true_i == 0)
    weighted_accuracy = ((weight_pos * num_pos_correct) 
                         + (weight_neg * num_neg_correct))/((weight_pos * num_pos) + (weight_neg * num_neg))
    return weighted_accuracy

<h2>Part 2: Baseline Solution</h2><p>
Note that your code should be commented well and in part 2.4 you can refer to your comments. (e.g. # Here is SVM, 
# Here is validation for SVM, etc). Also, we recommend that you do not to use 2012 dataset and the graph dataset to reach the baseline accuracy for 68% in this part, a basic solution with only 2016 dataset and reasonable model selection will be enough, it will be great if you explore thee graph and possibly 2012 dataset in Part 3.

<h3>2.1 Preprocessing and Feature Extraction:</h3><p>
Given the training dataset and graph information, you need to correctly preprocess the dataset (e.g. feature normalization). For baseline solution in this part, you might not need to introduce extra features to reach the baseline test accuracy.
<p>

In [58]:
def get_NN_dataset():
    '''
    Gets Neural Network dataset.
    '''
    df = pd.read_csv("./train_2016.csv", sep=',', encoding='unicode_escape')
    #looked at many online resources for neural networks and failed many times during implementation
    #but converged to https://stackabuse.com/introduction-to-pytorch-for-classification/
    #Used the model of the neural network and changed around the parameters, layers, weights, epoch
    #size, dropout, and loss function. 
    #for better accuracy on our test set. 
    #preprocessing 
    get_state_from_county = lambda county : county[county.index(",") + 2:] 
    df["state_name"] = df["County"].apply(get_state_from_county)

    # 0 for GOP, 1 for DEM
    df["target"] = (df["DEM"] > df["GOP"]).astype(int)
    df["float_target"] = (df["DEM"] / (df["DEM"] + df["GOP"])).astype(float)

    parse_numerical_string = lambda income : int(income.replace(",", ""))
    df["MedianIncome"] = df["MedianIncome"].apply(parse_numerical_string)
    return df

In [59]:
dataset = get_NN_dataset()
#Label columns for whether they is a numerical value, or a non-numerical value for the case of the state name
#which will be treated as an index 
categorical_columns = ['state_name']
numerical_columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']
outputs = ['target']

#convert to type category
for category in categorical_columns:
    dataset[category] = dataset[category].astype('category')
statname = dataset['state_name'].cat.codes.values

#creates respective tensors for categorical, numerical, and output data
categorical_data = np.stack([statname], 1)
categorical_data = torch.tensor(categorical_data, dtype=torch.int64)

numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
numerical_data = torch.tensor(numerical_data, dtype=torch.float)

outputs = torch.tensor(dataset[outputs].values).flatten()

In [60]:
#choosing embedding size by the number of unique states divided by 2
categorical_column_sizes = [len(dataset[column].cat.categories) for column in categorical_columns]
categorical_embedding_sizes = [(col_size, min(50, (col_size+1)//2)) for col_size in categorical_column_sizes]
print(categorical_embedding_sizes)

[(50, 25)]


In [61]:
#training set size: 1244, test set size: 311
total_records = 1555
test_records = int(total_records * .2)

#partition training and validation set respectively
categorical_train_data = categorical_data[:total_records-test_records]
categorical_test_data = categorical_data[total_records-test_records:total_records]
numerical_train_data = numerical_data[:total_records-test_records]
numerical_test_data = numerical_data[total_records-test_records:total_records]
train_outputs = outputs[:total_records-test_records]
test_outputs = outputs[total_records-test_records:total_records]

<h3>2.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 1.1.

In [62]:
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        #sets up embedding with embedding size for state name
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        #dropout randomly zeros elements to avoid overfitting the training set
        self.embedding_dropout = nn.Dropout(p)
        #normalizes numerical data per batch 
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        #calculates total input size for first layer of the nn
        num_categorical_cols = sum((nf for ni, nf in embedding_size))
        input_size = num_categorical_cols + num_numerical_cols

        all_layers = []
        #each layer has a ReLU activation along with Batch Normalization with dropout
        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i
        
        #finishes with last output layer
        all_layers.append(nn.Linear(layers[-1], output_size))
        #Creates the network
        self.layers = nn.Sequential(*all_layers)
        
    #forward pass
    def forward(self, x_categorical, x_numerical):
        #adds embedding for categorical columns
        embeddings = []
        for i,e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:,i]))
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)
        #applies batch normalization
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        #performs the forward pass calculations
        x = self.layers(x)
        return x

In [63]:
#instantiates the model
model = Model(categorical_embedding_sizes, numerical_data.shape[1], 2, [200,100,50,50], p=0.35)
#sets CrossEntropyLoss to loss function with weights on the demovrats due to imbalanced data. 
#CrossEntropy Loss combines a Log Softmax layer with and negative log likelyhood loss in one single class

#loss_function = nn.CrossEntropyLoss(weight = torch.Tensor([1.0, 1.1]))
loss_function = nn.NLLLoss(weight = torch.Tensor([1.0, 1.1]))

#uses Stochastic Gradient Descent, optionally included Nesterov momentum 
#optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=.9)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

#number of times to run through the training set
epochs = 700

#stores losses
aggregated_losses = []

for i in range(epochs):
    i += 1
    
    #calculates loss off of model prediction
    y_pred = model(categorical_train_data, numerical_train_data)
    single_loss = loss_function(y_pred, train_outputs)
    aggregated_losses.append(single_loss)

    #prints loss
    if i%25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')
    
    #backpropagation
    optimizer.zero_grad() # First zero all the gradients because of the way pytorch works
    single_loss.backward() # Perform backprop 
    optimizer.step() # performs a parameter update based on the current gradient

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')

epoch:   1 loss: -0.11343611
epoch:  26 loss: -0.15240996
epoch:  51 loss: -0.22501820
epoch:  76 loss: -0.32100192
epoch: 101 loss: -0.34275785
epoch: 126 loss: -0.48976398
epoch: 151 loss: -0.60192341
epoch: 176 loss: -0.66522807
epoch: 201 loss: -0.84344512
epoch: 226 loss: -0.95282412
epoch: 251 loss: -1.11066258
epoch: 276 loss: -1.18248534
epoch: 301 loss: -1.34964943
epoch: 326 loss: -1.48795009
epoch: 351 loss: -1.69113600
epoch: 376 loss: -1.71849263
epoch: 401 loss: -1.96285653
epoch: 426 loss: -2.09885693
epoch: 451 loss: -2.18980598
epoch: 476 loss: -2.34308052
epoch: 501 loss: -2.47596216
epoch: 526 loss: -2.70942736
epoch: 551 loss: -2.82967639
epoch: 576 loss: -2.84830189
epoch: 601 loss: -3.12552714
epoch: 626 loss: -3.18730068
epoch: 651 loss: -3.47729635
epoch: 676 loss: -3.40931940
epoch: 700 loss: -3.6440963745


In [64]:
#prints out testing loss
with torch.no_grad():
    y_val = model(categorical_test_data, numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss:.8f}')
#finds the max of the two ouputted nodes for the binary classification 
y_output = np.argmax(y_val.numpy(), axis=1)
y_correct = test_outputs.numpy()
print("Weighted accuracy:",weighted_accuracy(y_output, test_outputs).numpy())

Loss: -3.53292894


In [65]:
#finds the max of the two ouputted nodes for the binary classification 
y_output = np.argmax(y_val.numpy(), axis=1)
y_correct = test_outputs.numpy()

In [66]:
weighted_accuracy(y_output, test_outputs)

tensor(0.8576)

In [57]:
#tests the model on test data
def get_NN_output(df):
    get_state_from_county = lambda county : county[county.index(",") + 2:] 
    df["state_name"] = df["County"].apply(get_state_from_county)

    parse_numerical_string = lambda income : int(str(income).replace(",", ""))
    df["MedianIncome"] = df["MedianIncome"].apply(parse_numerical_string)
    df.head()
    dataset = df
    categorical_columns = ['state_name']
    numerical_columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']

    for category in categorical_columns:
        dataset[category] = dataset[category].astype('category')
    statname = dataset['state_name'].cat.codes.values
    categorical_data = np.stack([statname], 1)
    categorical_data = torch.tensor(categorical_data, dtype=torch.int64)

    numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
    numerical_data = torch.tensor(numerical_data, dtype=torch.float)

    with torch.no_grad():
        y_val = model(categorical_data, numerical_data)

    y_test_output = np.argmax(y_val, axis=1)
    np_y_test_out = y_test_output.detach().numpy()
    return np_y_test_out



0.48333333333333334

<h3>2.3 Training, Validation and Model Selection:</h3><p>
You need to split your data to a training set and validation set or performing a cross-validation for model selection.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.4
# TODO

<h3>2.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

2.4.1 How did you preprocess the dataset and features?

2.4.2 Which two learning methods from class did you choose and why did you made the choices?

2.4.3 How did you do the model selection?

2.4.4 Does the test performance reach a given baseline 68% performanc? (Please include a screenshot of Kaggle Submission)

2.4.1 - I preprocessed the features by converting the name of the state in which the county was in into an index to add as a feature. I did this using the Embedding Libraries that pytorch has supplied for me. I also converted the income feature into an integer since it was primarily a string. I also made the target equal to 1 for Democrat and 0 for GOP to complete the binary classification. This calculation was made by finding the majority of Democrats or GOP. 

2.4.3 - I partitioned the training set into a training set and a validation set. When tuning parameters, I would train on the training set and validated the effect of changing certain parameters (learning rate, size of hidden layers, number of layers, loss function) on the validation set. I began with NLLLoss but changed to CrossEntropyLoss to add the softmax hidden layer. Additionally, I started using Stochastic Gradient Descent, but found that it was not learning fast enough and after I switched to torch's Adam optimizer, it was learning much faster per epoch.



<h2>Part 3: Creative Solution</h2><p>

<h3>3.1 Open-ended Code:</h3><p>
You may follow the steps in part 2 again but making innovative changes like creating new features, using new training algorithms, etc. Make sure you explain everything clearly in part 3.2. Note that reaching the 75% creative baseline is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 3.2
# TODO

<h3>3.2 Explanation in Words:</h3><p>

You need to answer the following questions in a markdown cell after this cell:

3.2.1 How much did you manage to improve performance on the test set compared to part 2? Did you reach the 75% accuracy for the test in Kaggle? (Please include a screenshot of Kaggle Submission)

3.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

<h2>Part 4: Kaggle Submission</h2><p>
You need to generate a prediction CSV using the following cell from your trained model and submit the direct output of your code to Kaggle. The CSV shall contain TWO column named exactly "FIPS" and "Result" and 1555 total rows excluding the column names, "FIPS" column shall contain FIPS of counties with same order as in the test_2016_no_label.csv while "Result" column shall contain the 0 or 1 prdicaitons for corresponding columns. A sample predication file can be downloaded from Kaggle.

In [None]:
# TODO

# You may use pandas to generate a dataframe with FIPS and your predictions first 
# and then use to_csv to generate a CSV file.

<h2>Part 5: Resources and Literature Used</h2><p>