<h2>CS 4780/5780 Final Project: </h2>
<h3>Election Result Prediction for US Counties</h3>

Names and NetIDs for your group members: Eric Osband (eo255), Anthony Cuturuffo (acc284), Eddie Freedman (ebf45???)

<h3>Introduction:</h3>

<p> The final project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The programming project provide templates for how to do this, and the most recent video lectures summarize some of the tricks you will need (e.g. feature normalization, feature construction). So, this final project brings realism to how you will use machine learning in the real world.  </p>

The task you will work on is forecasting election results. Economic and sociological factors have been widely used when making predictions on the voting results of US elections. Economic and sociological factors vary a lot among counties in the United States. In addition, as you may observe from the election map of recent elections, neighbor counties show similar patterns in terms of the voting results. In this project you will bring the power of machine learning to make predictions for the county-level election results using Economic and sociological factors and the geographic structure of US counties. </p>
<p>

<h3>Your Task:</h3>
Plase read the project description PDF file carefully and make sure you write your code and answers to all the questions in this Jupyter Notebook. Your answers to the questions are a large portion of your grade for this final project. Please import the packages in this notebook and cite any references you used as mentioned in the project description. You need to print this entire Jupyter Notebook as a PDF file and submit to Gradescope and also submit the ipynb runnable version to Canvas for us to run.

<h3>Due Date:</h3>
The final project dataset and template jupyter notebook will be due on <strong>December 15th</strong> . Note that <strong>no late submissions will be accepted</strong>  and you cannot use any of your unused slip days before.
</p>

<img src="image.png">

<h2>Part 1: Basics</h2><p>

<h3>1.1 Import:</h3><p>
Please import necessary packages to use. Note that learning and using packages are recommended but not required for this project. Some official tutorial for suggested packacges includes:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

FROM ERIC: To download PyTorch, run the following

<code>conda install pytorch torchvision -c pytorch</code>

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import make_scorer
import math
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

<h3>1.2 Weighted Accuracy:</h3><p>
Since our dataset labels are heavily biased, you need to use the following function to compute weighted accuracy throughout your training and validation process and we use this for testing on Kaggle.
<p>

In [2]:
def weighted_accuracy(pred, true):
    assert(len(pred) == len(true))
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    frac_pos = num_pos/num_labels
    weight_pos = 1/frac_pos
    weight_neg = 1/(1-frac_pos)
    num_pos_correct = 0
    num_neg_correct = 0
    for pred_i, true_i in zip(pred, true):
        num_pos_correct += (pred_i == true_i and true_i == 1)
        num_neg_correct += (pred_i == true_i and true_i == 0)
    weighted_accuracy = ((weight_pos * num_pos_correct) 
                         + (weight_neg * num_neg_correct))/((weight_pos * num_pos) + (weight_neg * num_neg))
    return weighted_accuracy

<h2>Part 2: Baseline Solution</h2><p>
Note that your code should be commented well and in part 2.4 you can refer to your comments. (e.g. # Here is SVM, 
# Here is validation for SVM, etc). Also, we recommend that you do not to use 2012 dataset and the graph dataset to reach the baseline accuracy for 68% in this part, a basic solution with only 2016 dataset and reasonable model selection will be enough, it will be great if you explore thee graph and possibly 2012 dataset in Part 3.

<h3>2.1 Preprocessing and Feature Extraction:</h3><p>
Given the training dataset and graph information, you need to correctly preprocess the dataset (e.g. feature normalization). For baseline solution in this part, you might not need to introduce extra features to reach the baseline test accuracy.
<p>

In [3]:
def add_target(df):
    '''
    add_target(df) is df but with a target column extracted from GOP and DEM vote counts.
    Labels 1 if DEM > GOP and 0 otherwise. DEM and GOP columns are removed afterwards.
    '''
    df["target"] = (df["DEM"] > df["GOP"]).astype(int)
    df = df.drop(columns = ["DEM", "GOP"])
    return df

In [4]:
df_2016 = add_target(pd.read_csv("./train_2016.csv", sep=',', encoding='unicode_escape'))

In [5]:
def add_features(df):
    """
    add_features(df) is df but with the following additional features:
        state: id corresponding to state (integer 0-49)
    """
    # Get state initials from a county string, map state initials to their place in array, create new column
    get_state_from_county = lambda county : county[county.index(",") + 2:]
    df.loc[:,"state_name"] = df["County"].apply(get_state_from_county)
    states = df["state_name"].unique().tolist()
    get_id_from_state = lambda state : states.index(state)
    df.loc[:,"state"] = df["state_name"].apply(get_id_from_state)

    # Get rid of all commas in MedianIncome column
    df.loc[:,'MedianIncome']= df['MedianIncome'].str.replace(',','').astype(int)
    
    df = df.drop(columns = ["County",'state_name']) # These columns no longer needed
    return df

def preprocess(train_df, validation_df, test_df):
    """
    preprocess(train_df, validation_df, test_df) returns the three respective dataframes but preprocessed
    in the following way:
        1) Add features as decided by add_features(df)
        2) Normalize all features to a standard normal according to train_df statistics
        3) Make target column the last column (only applies to train_df and validation_df)
    """
    # First add features
    train_df = add_features(train_df)
    validation_df = add_features(validation_df)
    test_df = add_features(test_df)

    # Hold onto column references to put them back later
    temp = train_df["target"]
    temp2 = validation_df["target"]
    train_df = train_df.drop(columns = ["target"])
    validation_df = validation_df.drop(columns = ["target"])
    
    columns = list(train_df.columns)[1:]

    std_scaler = StandardScaler()
    std_scaler.fit(train_df[columns]) # Fit scaler to training dataset only
    # Scale all three datasets
    train_df[columns] = std_scaler.transform(train_df[columns]) 
    validation_df[columns] = std_scaler.transform(validation_df[columns])
    test_df[columns] = std_scaler.transform(test_df[columns])
    train_df["target"] = temp # Add back target columns to ensure they are at the end of the dataframe
    validation_df["target"] = temp2
    return train_df, validation_df, test_df

In [6]:
test_df = pd.read_csv("test_2016_no_label.csv")
train_df, validation_df = train_test_split(df_2016, test_size=0.2) # Perform a test-train split of training data
train_df = train_df.copy()
validation_df = validation_df.copy()
df, validation, test = preprocess(train_df, validation_df, test_df)
df.head()

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,state,target
1249,13309,-1.603402,1.033455,-1.534324,-1.104511,-0.950331,2.038591,-1.444147,0
1306,6063,0.300453,1.304335,-1.49303,-0.124578,0.179436,2.468241,-1.36006,0
768,29207,-0.958994,-0.424514,-0.254205,0.964237,-0.908096,0.695936,-1.275973,0
981,51131,-0.828407,0.053509,-1.038794,1.798995,-0.021177,0.427405,-1.191886,1
616,55043,-0.183841,-0.217371,-0.254205,-0.52381,0.126643,-0.64672,-1.107799,0


BEGIN NEURAL NETWORK PREPROCESSING

In [7]:
def get_NN_dataset():
    '''
    Gets Neural Network dataset.
    '''
    df = pd.read_csv("./train_2016.csv", sep=',', encoding='unicode_escape')
    #looked at many online resources for neural networks and failed many times during implementation
    #but converged to https://stackabuse.com/introduction-to-pytorch-for-classification/
    #Used the model of the neural network and changed around the parameters, layers, weights, epoch
    #size, dropout, and loss function. 
    #for better accuracy on our test set. 
    #preprocessing 
    return add_features(add_target(df))

In [8]:
dataset = get_NN_dataset()
print(dataset.columns)
#Label columns for whether they is a numerical value, or a non-numerical value for the case of the state name
#which will be treated as an index 
categorical_columns = ['state']
numerical_columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']
outputs = ['target']

#convert to type category
for category in categorical_columns:
    dataset[category] = dataset[category].astype('category')
statname = dataset['state'].cat.codes.values

#creates respective tensors for categorical, numerical, and output data
categorical_data = np.stack([statname], 1)
categorical_data = torch.tensor(categorical_data, dtype=torch.int64)

numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
numerical_data = torch.tensor(numerical_data, dtype=torch.float)

outputs = torch.tensor(dataset[outputs].values).flatten()

Index(['FIPS', 'MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate',
       'BachelorRate', 'UnemploymentRate', 'target', 'state'],
      dtype='object')


In [9]:
#choosing embedding size by the number of unique states divided by 2
categorical_column_sizes = [len(dataset[column].cat.categories) for column in categorical_columns]
categorical_embedding_sizes = [(col_size, min(50, (col_size+1)//2)) for col_size in categorical_column_sizes]
print(categorical_embedding_sizes)

[(50, 25)]


In [10]:
#training set size: 1244, test set size: 311
total_records = 1555
test_records = int(total_records * .2)

#partition training and validation set respectively
categorical_train_data = categorical_data[:total_records-test_records]
categorical_test_data = categorical_data[total_records-test_records:total_records]
numerical_train_data = numerical_data[:total_records-test_records]
numerical_test_data = numerical_data[total_records-test_records:total_records]
train_outputs = outputs[:total_records-test_records]
test_outputs = outputs[total_records-test_records:total_records]

<h3>2.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 1.1.

**NOTE:** We apply the LDA in section 2.3 via validation

In [11]:
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        #sets up embedding with embedding size for state name
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        #dropout randomly zeros elements to avoid overfitting the training set
        self.embedding_dropout = nn.Dropout(p)
        #normalizes numerical data per batch 
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        #calculates total input size for first layer of the nn
        num_categorical_cols = sum((nf for ni, nf in embedding_size))
        input_size = num_categorical_cols + num_numerical_cols

        all_layers = []
        #each layer has a ReLU activation along with Batch Normalization with dropout
        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i
        
        #finishes with last output layer
        all_layers.append(nn.Linear(layers[-1], output_size))
        #Creates the network
        self.layers = nn.Sequential(*all_layers)
        
    #forward pass
    def forward(self, x_categorical, x_numerical):
        #adds embedding for categorical columns
        embeddings = []
        for i,e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:,i]))
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)
        #applies batch normalization
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        #performs the forward pass calculations
        x = self.layers(x)
        return x

In [12]:
#instantiates the model
model = Model(categorical_embedding_sizes, numerical_data.shape[1], 2, [200,100,50,50], p=0.35)
#sets CrossEntropyLoss to loss function with weights on the demovrats due to imbalanced data. 
#CrossEntropy Loss combines a Log Softmax layer with and negative log likelyhood loss in one single class

loss_function = nn.CrossEntropyLoss()
# loss_function = nn.NLLLoss(weight = torch.Tensor([1.0, 1.1]))

#uses Stochastic Gradient Descent, optionally included Nesterov momentum 
#optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=.9)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

#number of times to run through the training set
epochs = 700

#stores losses
aggregated_losses = []

for i in range(epochs):
    i += 1
    
    #calculates loss off of model prediction
    y_pred = model(categorical_train_data, numerical_train_data)
    single_loss = loss_function(y_pred, train_outputs)
    aggregated_losses.append(single_loss)

    #prints loss
    if i%25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')
    
    #backpropagation
    optimizer.zero_grad() # First zero all the gradients because of the way pytorch works
    single_loss.backward() # Perform backprop 
    optimizer.step() # performs a parameter update based on the current gradient

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')

epoch:   1 loss: 0.82673168
epoch:  26 loss: 0.81152558
epoch:  51 loss: 0.74759930
epoch:  76 loss: 0.75068676
epoch: 101 loss: 0.71078569
epoch: 126 loss: 0.71594638
epoch: 151 loss: 0.67671406
epoch: 176 loss: 0.68952465
epoch: 201 loss: 0.64570290
epoch: 226 loss: 0.65024805
epoch: 251 loss: 0.62783855
epoch: 276 loss: 0.62232268
epoch: 301 loss: 0.59600860
epoch: 326 loss: 0.60050726
epoch: 351 loss: 0.57287937
epoch: 376 loss: 0.57107258
epoch: 401 loss: 0.57092923
epoch: 426 loss: 0.55099827
epoch: 451 loss: 0.55500185
epoch: 476 loss: 0.52533036
epoch: 501 loss: 0.52303916
epoch: 526 loss: 0.52062190
epoch: 551 loss: 0.50915277
epoch: 576 loss: 0.51518750
epoch: 601 loss: 0.49213085
epoch: 626 loss: 0.48122963
epoch: 651 loss: 0.46421376
epoch: 676 loss: 0.44645485
epoch: 700 loss: 0.4527860582


In [13]:
#prints out testing loss
with torch.no_grad():
    y_val = model(categorical_test_data, numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss:.8f}')
#finds the max of the two ouputted nodes for the binary classification 
y_output = np.argmax(y_val.numpy(), axis=1)
y_correct = test_outputs.numpy()
print("Weighted accuracy:",weighted_accuracy(y_output, test_outputs).numpy())

Loss: 0.50656122
Weighted accuracy: 0.68617374


In [14]:
#tests the model on test data
def get_NN_output(df):
    df = add_features(df)
    categorical_columns = ['state']
    numerical_columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']


    for category in categorical_columns:
        dataset[category] = dataset[category].astype('category')
    statname = dataset['state'].cat.codes.values
    categorical_data = np.stack([statname], 1)
    categorical_data = torch.tensor(categorical_data, dtype=torch.int64)

    numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
    numerical_data = torch.tensor(numerical_data, dtype=torch.float)

    with torch.no_grad():
        y_val = model(categorical_data, numerical_data)

    y_test_output = np.argmax(y_val, axis=1)
    np_y_test_out = y_test_output.detach().numpy()
    return np_y_test_out

<h3>2.3 Training, Validation and Model Selection:</h3><p>
You need to split your data to a training set and validation set or performing a cross-validation for model selection.

In [15]:
# Now that we have added all necessary features, we perform validation to determine the optimal
# prior probabilities.
def perform_validation(df, validation):
    '''
    perform_validation(df, validation) performs validation on the prior probabilities of LDA to find the value
    of the priors that maximize validation weighted accuracy.
    Returns tuple:
        best_lda, best_priors, best_score, best_one_percentages
            best_lda: after finding optimal priors, this is an LDA fit on entire training + validation set 
            best_priors: array of length 2 with the optimal priors calculated through validation
            best_score: weighted accuracy achieved by the highest performing lda on the validation set
            best_one_percentages: using these optimal priors, this was the number of 1s predicted by the algorithm
    '''
    get_priors = lambda x : [0.9 - 0.001*x, 0.1 + 0.001*x] # Function to calculate priors from loop iteration
    scores = []
    one_percentages = []
    x_train = df[df.columns[1:-1]] # Separate data into x and y train and test
    x_test = validation[df.columns[1:-1]]
    y_train = df['target']
    y_test = validation['target']
    # Now we repeatedly create LDA models after nudging the prior probabilities, then record the score
    for x in range(899):
        priors = get_priors(x)
        lda = LinearDiscriminantAnalysis(priors = priors).fit(x_train,y_train)
        y_pred = lda.predict(x_test)
        accuracy = weighted_accuracy(y_pred, y_test)
        scores.append(accuracy)
        one_percentages.append(np.count_nonzero(y_test) / len(y_test))
    # Now we see which priors and accuracy were most successful
    x = np.argmax(scores)
    priors = get_priors(x)
    lda = LinearDiscriminantAnalysis(priors = priors).fit(pd.concat([x_train,x_test]),pd.concat([y_train,y_test]))
    return lda, priors, scores[x],one_percentages[x]

In [16]:
lda_basic, priors, score, one_percentage = perform_validation(df, validation)
basic_score = score # For use later to compare to creative
print("Best priors:",priors) # Priors are in the form [prior for GOP (0), prior for DEM (1)]
print("Weighted accuracy:", score) # Weighted accuracy score we achieved
print("Percentage of 1s predicted:", one_percentage) # We also look to see the percentage of 1s we predicted

Best priors: [0.271, 0.729]
Weighted accuracy: 0.7869623655913979
Percentage of 1s predicted: 0.10289389067524116


<h3>2.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

2.4.1 How did you preprocess the dataset and features?

2.4.2 Which two learning methods from class did you choose and why did you made the choices?

2.4.3 How did you do the model selection?

2.4.4 Does the test performance reach a given baseline 68% performanc? (Please include a screenshot of Kaggle Submission)

1. To preprocess the label, we created a new binary value that equals 1 when the number of DEM votes was greater than the number of GOP votes, and 0 otherwise. In terms of features, we extracted the state abbreviation from the "County" column via substring selection and assigned each number to a number from 1 to 50. From there, all feature values (which excludes FIPS) were standardized according to a standard normal distribution.

    For the neural network, I preprocessed the features by converting the name of the state in which the county was in into an index to add as a feature. I did this using the Embedding Libraries that pytorch has supplied for me. I also converted the income feature into an integer since it was primarily a string. I also made the target equal to 1 for Democrat and 0 for GOP to complete the binary classification. This calculation was made by finding the majority of Democrats or GOP. 


2. We chose to use LDA and Neural Networks as our two learning methods. 
    We chose LDA because we believed many of the features to be independently generated. It could also take into account the prior class probabilities, which in the end allowed us to essentially assign a greater weight to predict 1 than 0, and thus gave us such good accuracy. Additionally, most of our data was real-valued, so it was quite practical. These characteristics allowed us to assume that our data was normally distributed, which is an assumption of the LDA method. And clearly the model was quite a success.
    
    We choose neural network because they are “universal aproximators”. If implemented correctly, they can approximate any dataset. I was previously aware of the fact there were plenty of good machine learning frameworks that were built to make the process of implementing a neural network easier. This definitely helped the decision process. Another good feature of neural networks is that if the results were not as good as hoped, they allow us to change around the parameters to better fit the dataset. 


3. For the LDA model selection, we tested 899 different LDA's with slightly different priors and tested their accuracy on our validation set (20% of the total sample). Afterwards, we selected the model/parameters that achieved the greatest validation score. We also played around with different solvers, as well as using quadratic analysis rather than linear, but none of those actually increased our performance by much so we kept it as it is now. 

    For the neural network, I partitioned the training set into a training set and a validation set. When tuning parameters, I would train on the training set and validated the effect of changing certain parameters (learning rate, size of hidden layers, number of layers, loss function) on the validation set. I began with NLLLoss but changed to CrossEntropyLoss to add the softmax hidden layer. Additionally, I started using Stochastic Gradient Descent, but found that it was not learning fast enough and after I switched to torch's Adam optimizer, it was learning much faster per epoch.
    

4. Our test performance far exceeded baseline of 68% (for both NN and LDA):

![](BasicScore.jpeg)

<h2>Part 3: Creative Solution</h2><p>

<h3>3.1 Open-ended Code:</h3><p>
You may follow the steps in part 2 again but making innovative changes like creating new features, using new training algorithms, etc. Make sure you explain everything clearly in part 3.2. Note that reaching the 75% creative baseline is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [17]:
# Import necessary datasets and add target column when appropriate
df_2016 = add_target(pd.read_csv("./train_2016.csv", sep=',', encoding='unicode_escape'))
df_graph = pd.read_csv("./graph.csv", sep=',', encoding='unicode_escape')
df_2012 = add_target(pd.read_csv("./train_2012.csv", sep=',', encoding='unicode_escape'))

In [18]:
def add_features_creative(df):
    """
    add_features_creative(df) is df but with the following additional features:
        avg_neighbor: average prediction of all neighboring counties to the given county.
                      Neighbors determined by graph.csv. First looks at prediction from 2016
                      election, and if that cannot be found, looks at data from 2012 election.
                      If that cannot be found either, then the given neighboring county is 
                      discounted from the average county score.
         num_neighbors: number of contiguous counties to the given one, as given by graph.csv
    """    
    avg_neighbor = []
    num_neighbors = []
    for row in df.iterrows():
        # For each row, find all of its neighbors' FIPS codes as given in graph.csv
        fips = row[1]["FIPS"]
        neighbors1 = df_graph.loc[df_graph["SRC"] == fips,"DST"].to_numpy()
        neighbors2 = df_graph.loc[df_graph["DST"] == fips,"SRC"].to_numpy()
        neighbors = np.append(neighbors1,neighbors2)
        neighbors = np.delete(neighbors,np.where(neighbors == fips)) # Delete current county from neighbors list
        num_neighbors.append(len(neighbors)) # Add number of neighbors
        total = 0
        count = 0
        for neighbor in neighbors:
            count += 1
            ndf = df_2016[df_2016["FIPS"] == neighbor].head(1)
            if not ndf.empty: # If this county was in 2016 dataset, add its target value to count
                total += ndf.iloc[0]["target"]
            else: # If it was not, then look at 2012 data
                ndf = df_2012[df_2012["FIPS"] == neighbor].head(1)
#                 if not ndf.empty: # Add 2012 data if found
                if False:
                    total += ndf.iloc[0]["target"]
                else: # If neither was found, make sure count keeps up
                    count -= 1
        if count <= 0:             # If there was no county information, put in a placeholder value of 0 as average
            avg_neighbor.append(0) # neighbor score. This makes sense because most likely the missing county will
        else:                      # be rural and therefore its neighbors would have voted GOP.
            avg_neighbor.append(total / count)  # Average only reported counties

    df.loc[:,"avg_neighbor"] = avg_neighbor
    df.loc[:,"num_neighbors"] = num_neighbors
    return df

def preprocess_creative(train_df, validation_df, test_df):
    """
    preprocess_creative(train_df, validation_df, test_df) returns the three respective dataframes but preprocessed
    in the following way:
        1) Add features as decided by add_features_creative(df)
        2) Normalize all features to a standard normal according to train_df statistics
        3) Make target column the last column (only applies to train_df and validation_df)
    """
    # First add creative features
    train_df = add_features_creative(train_df)
    validation_df = add_features_creative(validation_df)
    test_df = add_features_creative(test_df)
    # After adding creative features, we go and do exactly what we did in the basic solution to normalize, etc.
    return preprocess(train_df, validation_df, test_df)

In [19]:
test_df = pd.read_csv("test_2016_no_label.csv")
train_df, validation_df = train_test_split(df_2016, test_size=0.2) # Perform a test-train split of training data
train_df = train_df.copy()
validation_df = validation_df.copy()
df, validation, test_creative = preprocess_creative(train_df, validation_df, test_df)
df.head()

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,avg_neighbor,num_neighbors,state,target
905,45045,0.466271,0.97977,0.526584,-0.782416,1.325086,-0.443486,-0.484891,1.533789,-1.566346,0
1135,27133,0.631568,-0.39524,0.00026,0.508143,0.137221,-1.581476,-0.484891,0.018876,-1.480097,0
346,31181,-0.640721,0.012467,-0.971415,1.659182,0.094797,-1.12628,-0.484891,0.776332,-1.393848,0
348,51035,-0.738811,0.03645,-1.335793,0.403503,-0.912767,0.068609,-0.484891,0.776332,-1.307599,0
1335,8051,0.330337,1.163638,-0.971415,-2.107855,3.562938,-1.581476,0.770794,1.533789,-1.22135,1


In [20]:
lda_creative, priors, score, one_percentage = perform_validation(df, validation)
print("Best priors:",priors) # Priors are in the form [prior for GOP (0), prior for DEM (1)]
print("Weighted accuracy:", score) # Weighted accuracy score we achieved
print("Percentage of 1s predicted:", one_percentage) # We also look to see the percentage of 1s we predicted

Best priors: [0.387, 0.613]
Weighted accuracy: 0.8016711833785005
Percentage of 1s predicted: 0.13183279742765272


In [21]:
# Note that the following number changes drastically from run to run
print(f'Creative algorithm performed {(score - basic_score)*100:.3f}% better than basic solution',)

Creative algorithm performed 1.471% better than basic solution


<h3>3.2 Explanation in Words:</h3><p>

You need to answer the following questions in a markdown cell after this cell:

3.2.1 How much did you manage to improve performance on the test set compared to part 2? Did you reach the 75% accuracy for the test in Kaggle? (Please include a screenshot of Kaggle Submission)

3.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

1) We managed to improve performance a significant amount. What was most surprising was that we always thought the neural network would by far surpass the accuracy of any other type of model, and so we initially created our basic LDA just as a dummy essentially. But after its success on the basic solution, we decided to delve deeper and see how far we could push our accuracy. In the end, after solving many, many bugs, we ended up reaching 84% accuracy on Kaggle, as seen below, compared to around 77% on the basic solution at the time of this writing.
<img src="creative_accuracy.png">

2) To do this, we tried a couple of things, but ended up including all but one of them in our basic solution as described above. Thus the only unique element we added was incorporating both graph and 2012 data into our algorithm. We first hypothesized that the way surrounding counties voted would probably affect the way the county itself voted. Additionally, we also realized that the *number* of neighboring counties might also give us useful information, since many GOP counties tend to be square-shaped (thus having only 4 neighbors) whereas DEM counties might have weird squiggles and curves and thus would have more neighbors. To implement this, we added two new features: `avg_neighbor` and `num_neighbors`.

`num_neighbors` is simply the number of neighbors listed in the graph.csv file. If you look at our code, we had to do an awkward array concatenation since the CSV file could have the given county in either column. Additionally, we had to make sure to remove the county itself from that list, as it was (for some reason) listed as one of its own neighbors. This interestingly gave us a significant bug wherein we were getting around 97% accuracy on our validation but only 73% on Kaggle, but eventually this bug was spotted and the county itself was discarded from computation.

`avg_neighbor` is the average prediction of the given county's neighboring counties in either 2016 or 2012. The process was also described above in the code, but we'll explain it here again for ease of the reader. After we had assembled the list of all neighboring county codes, we looked through this list one-by-one and saw whether or not that county was listed in our entire train_2016 dataset. If it was, we added the target 1-0 value to our sum. If it was not in our table, then we looked at the train_2012 dataset, and if it was there, we added that target value to our sum. If it was not there, we did nothing. After we had collected this sum, we divided it by the number of counties found (in either 2016 or 2012 data) to obtain our average neighbor target value. So if all of a given county's neighbors voted democratic in 2016, then its average neighbor score would be 1.
Note that this does not actually cause data-leakage into our validation set as we never use the target information of a training example itself, only of its neighbors.

To summarize: we had hypotheses about two things that would probably be good features for our dataset, so we added them. We only resorted to using the 2012 dataset when we absolutely had to, i.e. the given county was not available in our 2016 sample. Other than that, we figured adding the 2012 dataset as a whole might bias our predictions a great deal, since we know from experience that voting lines changed a great deal between 2012 and 2016.

After we assembled lists of each of these features, we added them to their respective dataframes, and normalized the data with a standard scaler as before. Surprisingly enough, adding just these two features managed to increase our prediction score by 7% or so. 

<h2>Part 4: Kaggle Submission</h2><p>
You need to generate a prediction CSV using the following cell from your trained model and submit the direct output of your code to Kaggle. The CSV shall contain TWO column named exactly "FIPS" and "Result" and 1555 total rows excluding the column names, "FIPS" column shall contain FIPS of counties with same order as in the test_2016_no_label.csv while "Result" column shall contain the 0 or 1 prdicaitons for corresponding columns. A sample predication file can be downloaded from Kaggle.

In [22]:
# First examine previously trained test set to ensure all correct features are there and in the proper order
test.head() 

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,state
0,17059,-0.805827,-0.814899,-0.336793,0.637593,-1.256529,1.340411,-1.444147
1,6103,-0.730743,0.173014,0.48909,-0.124578,-0.665249,1.071879,-1.36006
2,42047,-0.044727,-0.711328,-0.791029,0.819062,-0.337934,0.21258,-1.275973
3,47147,0.68266,0.619169,0.447796,-0.378635,-0.337934,-0.539307,-1.191886
4,39039,0.187077,-0.392646,-0.212911,-0.197165,-0.538547,-0.109657,-1.107799


In [23]:
test_creative.head()

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,avg_neighbor,num_neighbors,state
0,17059,-0.799297,-0.834923,-0.323631,0.577903,-1.252157,1.434196,-0.484891,0.018876,-1.566346
1,6103,-0.72321,0.156363,0.486098,-0.154577,-0.658225,1.149698,-0.484891,0.018876,-1.480097
2,42047,-0.028019,-0.730998,-0.768983,0.752303,-0.329441,0.239307,-0.484891,0.018876,-1.393848
3,47147,0.709096,0.604041,0.445612,-0.398736,-0.329441,-0.557285,-0.484891,0.776332,-1.307599
4,39039,0.206885,-0.411228,-0.202172,-0.224337,-0.530953,-0.10209,-0.484891,0.018876,-1.22135


In [24]:
def save_to_csv(fname, preds):
    outputdf = pd.DataFrame(test["FIPS"]) # Create dataframe with one column of FIPS county codes
    outputdf['Result'] = preds # Use our previously trained LDA to predict election result
    # Then view the percentage of 1s predicted to ensure its not astronomically different from the number of 
    # 1s predicted on our training or validation sets, and thus reinforce our hypothesis that LDA is an accurate
    # tool for modeling this data.
    print("Percentage of 1s predicted:",np.count_nonzero(outputdf['Result']) / len(outputdf['Result']))
    outputdf.to_csv(fname, index = False) # Save information in this file

In [25]:
test_features = test[test.columns[1:]] # Extract features from test set
test_features_creative = test_creative[test_creative.columns[1:]] # Extract features from test set
print("(creative LDA)")
save_to_csv("CreativeLDA",lda_creative.predict(test_features_creative))
print("(basic LDA)")
save_to_csv("BasicLDA",lda_basic.predict(test_features))
print("(basic NN)")
save_to_csv("BasicNN",get_NN_output(pd.read_csv("test_2016_no_label.csv")))

(creative LDA)
Percentage of 1s predicted: 0.28038585209003214
(basic LDA)
Percentage of 1s predicted: 0.4090032154340836
(basic NN)
Percentage of 1s predicted: 0.17363344051446947


<h2>Part 5: Resources and Literature Used</h2><p>

We only used stackoverflow and documentation. **POTENTIALLLY NN EXAMPLE @ANTHONY??**

Also, see preprocessing for neural network for credits for that.