<h2>CS 4780/5780 Final Project: </h2>
<h3>Election Result Prediction for US Counties</h3>

Names and NetIDs for your group members: Eric Osband (eo255), Anthony Cuturuffo (acc284), Eddie Freedman (ebf45???)

<h3>Introduction:</h3>

<p> The final project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The programming project provide templates for how to do this, and the most recent video lectures summarize some of the tricks you will need (e.g. feature normalization, feature construction). So, this final project brings realism to how you will use machine learning in the real world.  </p>

The task you will work on is forecasting election results. Economic and sociological factors have been widely used when making predictions on the voting results of US elections. Economic and sociological factors vary a lot among counties in the United States. In addition, as you may observe from the election map of recent elections, neighbor counties show similar patterns in terms of the voting results. In this project you will bring the power of machine learning to make predictions for the county-level election results using Economic and sociological factors and the geographic structure of US counties. </p>
<p>

<h3>Your Task:</h3>
Plase read the project description PDF file carefully and make sure you write your code and answers to all the questions in this Jupyter Notebook. Your answers to the questions are a large portion of your grade for this final project. Please import the packages in this notebook and cite any references you used as mentioned in the project description. You need to print this entire Jupyter Notebook as a PDF file and submit to Gradescope and also submit the ipynb runnable version to Canvas for us to run.

<h3>Due Date:</h3>
The final project dataset and template jupyter notebook will be due on <strong>December 15th</strong> . Note that <strong>no late submissions will be accepted</strong>  and you cannot use any of your unused slip days before.
</p>

![image.png; width="100";](attachment:image.png)

<h2>Part 1: Basics</h2><p>

<h3>1.1 Import:</h3><p>
Please import necessary packages to use. Note that learning and using packages are recommended but not required for this project. Some official tutorial for suggested packacges includes:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [3]:
import os
import pandas as pd
import numpy as np
# TODO
from sklearn.preprocessing import StandardScaler
import math
import torch

<h3>1.2 Weighted Accuracy:</h3><p>
Since our dataset labels are heavily biased, you need to use the following function to compute weighted accuracy throughout your training and validation process and we use this for testing on Kaggle.
<p>

In [15]:
def weighted_accuracy(pred, true):
    assert(len(pred) == len(true))
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    frac_pos = num_pos/num_labels
    weight_pos = 1/frac_pos
    weight_neg = 1/(1-frac_pos)
    num_pos_correct = 0
    num_neg_correct = 0
    for pred_i, true_i in zip(pred, true):
        num_pos_correct += (pred_i == true_i and true_i == 1)
        num_neg_correct += (pred_i == true_i and true_i == 0)
    weighted_accuracy = ((weight_pos * num_pos_correct) 
                         + (weight_neg * num_neg_correct))/((weight_pos * num_pos) + (weight_neg * num_neg))
    return weighted_accuracy

<h2>Part 2: Baseline Solution</h2><p>
Note that your code should be commented well and in part 2.4 you can refer to your comments. (e.g. # Here is SVM, 
# Here is validation for SVM, etc). Also, we recommend that you do not to use 2012 dataset and the graph dataset to reach the baseline accuracy for 68% in this part, a basic solution with only 2016 dataset and reasonable model selection will be enough, it will be great if you explore thee graph and possibly 2012 dataset in Part 3.

<h3>2.1 Preprocessing and Feature Extraction:</h3><p>
Given the training dataset and graph information, you need to correctly preprocess the dataset (e.g. feature normalization). For baseline solution in this part, you might not need to introduce extra features to reach the baseline test accuracy.
<p>

In [16]:
# You may change this but we suggest loading data with the following code and you may need to change
# datatypes and do necessary data transformation after loading the raw data to the dataframe.
dataset_path = "./train_2016.csv"
# df = pd.read_csv(dataset_path, sep=',',header=None, encoding='unicode_escape')

# Chose to include header to remember column identifiers
df = pd.read_csv(dataset_path, sep=',', encoding='unicode_escape')
df.head()

Unnamed: 0,FIPS,County,DEM,GOP,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate
0,18019,"Clark County, IN",18791,30012,51837,4.9,12.8,11.0,20.9,4.2
1,6035,"Lassen County, CA",2026,6533,49793,-18.4,9.2,6.3,12.0,6.9
2,40081,"Lincoln County, OK",2423,10838,44914,-1.3,11.4,11.7,15.1,5.3
3,31153,"Sarpy County, NE",27704,44649,74374,9.2,14.2,5.0,40.1,2.9
4,28055,"Issaquena County, MS",395,298,26957,-12.8,9.8,5.3,6.7,14.0


In [61]:
actual_pred = lambda x : np.sign(x * 2)
def preprocess(raw_df):
    df = raw_df.copy()
    # Create feature representing state number
    # Gets state initials from a county string
    get_state_from_county = lambda county : county[county.index(",") + 2:]
  
    df["state_name"] = df["County"].apply(get_state_from_county)
    states = df["state_name"].unique().tolist()
  
    
    
    #one-hot encode state data
    onehot = pd.get_dummies(df["state_name"], prefix = None)
    df[onehot.columns] = onehot
    
    #create target label
    target = "target" # Percentage DEM vote, range between 0 and 1. Apply actual_pred(df["target"]) to get actual 0-1 prediction
    df[target] = (df["DEM"] / (df["DEM"] + df["GOP"])).astype(float)
    
    # Get rid of all commas in MedianIncome column
    df['MedianIncome']=df['MedianIncome'].str.replace(',','').astype(int)

    # Get rid of county, state_name, DEM and GOP columns
    df = df.drop(columns = ["state_name", "County", "DEM", "GOP", "FIPS"])
    
    y = df["target"].to_numpy()
    df = df.drop(columns = ["target"])
    x = df.to_numpy()
    
    return x,y

In [62]:
x, y = preprocess(df)
sample_size, input_size = x.shape
print(x.shape)

[0.38503781 0.2367099  0.18271624 0.3829005  0.56998557]
(1555, 56)


In [19]:
x_tensor, y_tensor = torch.from_numpy(x).float(), torch.from_numpy(y).float()
x_tensor = torch.nn.functional.normalize(x_tensor, dim=0)

In [20]:
dataset = torch.utils.data.TensorDataset(x_tensor, y_tensor)
train, test = torch.utils.data.random_split(dataset, [1000, 555])
train_loader = torch.utils.data.DataLoader(train, batch_size=1)
test_loader = torch.utils.data.DataLoader(test, batch_size=1)


In [21]:
class NeuralNet(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.fc1 = torch.nn.Linear(input_size, hidden_size)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(hidden_size, num_classes)
        self.layer_out = torch.nn.Linear(num_classes, 1)
        self.sigmoid = torch.nn.Sigmoid()

    
    def forward(self, x):
        out = self.fc1(x)
        #out = self.relu(out)
        out = self.sigmoid(out)

        out = self.fc2(out)
        out = self.sigmoid(out)
        #out = self.relu(out)
        out = self.layer_out(out)
        #out = torch.nn.functional.log_softmax(out)
        #out = self.sigmoid(out)
        return out

In [72]:
num_classes = hidden_size = input_size
model = NeuralNet(input_size, hidden_size, num_classes)
epochs = 5
learning_rate = .001

# Loss and optimizer
def my_loss(output, target):
    loss = torch.mean((10*(output - target))**2)
    return loss
criterion = my_loss
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

#Training model
total_step = len(train_loader)
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = torch.autograd.Variable(data), torch.autograd.Variable(target)
        #print(data)
        optimizer.zero_grad()

        net_out = model(data)
        #print(net_out,target)
        loss = criterion(net_out, target)
        
        # Backprop
        loss.backward()
        optimizer.step()
        if batch_idx % 200 == 199:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                        epoch, batch_idx * len(data), len(train_loader.dataset),
                               100. * batch_idx / len(train_loader), loss.data))




In [74]:
#TESTING
# run a test loop
convert_pred = lambda x : (torch.sign(x - 0.5) + 1) / 2
test_loss = 0
correct = 0
preds = np.array([])
for data, target in test_loader:
    data, target = torch.autograd.Variable(data), torch.autograd.Variable(target)
    net_out = model(data)
    # sum up batch loss
    test_loss += criterion(net_out, target)
    pred = convert_pred(net_out.data.max(1)[1])  # get the index of the max log-probability
#     print(net_out, pred, target)
#     print(target.data[0])
#     print(np.sign(0.9))
#     print(pred.numpy())
#     preds.append([pred.numpy()])
    correct += pred.eq(convert_pred(target.data)).sum()

test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
print('\nWeighted accuracy:', preds)


Test set: Average loss: 7.1631, Accuracy: 478/555 (86%)


Weighted accuracy: []
tensor([[0.1141]], grad_fn=<AddmmBackward>)


<h3>2.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 1.1.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.4
# TODO


<h3>2.3 Training, Validation and Model Selection:</h3><p>
You need to split your data to a training set and validation set or performing a cross-validation for model selection.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.4
# TODO

<h3>2.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

2.4.1 How did you preprocess the dataset and features?

2.4.2 Which two learning methods from class did you choose and why did you made the choices?

2.4.3 How did you do the model selection?

2.4.4 Does the test performance reach a given baseline 68% performanc? (Please include a screenshot of Kaggle Submission)

<h2>Part 3: Creative Solution</h2><p>

<h3>3.1 Open-ended Code:</h3><p>
You may follow the steps in part 2 again but making innovative changes like creating new features, using new training algorithms, etc. Make sure you explain everything clearly in part 3.2. Note that reaching the 75% creative baseline is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 3.2
# TODO

<h3>3.2 Explanation in Words:</h3><p>

You need to answer the following questions in a markdown cell after this cell:

3.2.1 How much did you manage to improve performance on the test set compared to part 2? Did you reach the 75% accuracy for the test in Kaggle? (Please include a screenshot of Kaggle Submission)

3.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

<h2>Part 4: Kaggle Submission</h2><p>
You need to generate a prediction CSV using the following cell from your trained model and submit the direct output of your code to Kaggle. The CSV shall contain TWO column named exactly "FIPS" and "Result" and 1555 total rows excluding the column names, "FIPS" column shall contain FIPS of counties with same order as in the test_2016_no_label.csv while "Result" column shall contain the 0 or 1 prdicaitons for corresponding columns. A sample predication file can be downloaded from Kaggle.

In [None]:
# TODO

# You may use pandas to generate a dataframe with FIPS and your predictions first 
# and then use to_csv to generate a CSV file.

<h2>Part 5: Resources and Literature Used</h2><p>