<h2>CS 4780/5780 Final Project: </h2>
<h3>Election Result Prediction for US Counties</h3>

Names and NetIDs for your group members: Eric Osband (eo255), Anthony Cuturuffo (acc284), Eddie Freedman (ebf45???)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

<h3>Introduction:</h3>

<p> The final project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The programming projects provide templates for how to do this, and the most recent video lectures summarize some of the tricks you will need (e.g. feature normalization, feature construction). So, this final project brings realism to how you will use machine learning in the real world.  </p>

The task you will work on is forecasting election results. Economic and sociological factors have been widely used when making predictions on the voting results of US elections, and these factors vary a lot among counties in the United States. In addition, as you may observe from the election maps of recent elections, neighboring counties show similar patterns in their voting results. In this project you will bring the power of machine learning to make predictions for the county-level election results using economic and sociological factors and the geographic structure of US counties. </p>
<p>

<h3>Your Task:</h3>
Please read the project description PDF file carefully and make sure you write your code and answers to all the questions in this Jupyter Notebook. Your answers to the questions are a large portion of your grade for this final project. Please import the packages in this notebook and cite any references you used as mentioned in the project description. You need to print this entire Jupyter Notebook as a PDF file and submit it to Gradescope, and also submit the runnable ipynb version to Canvas for us to run.

<h3>Due Date:</h3>
The final project, including the completed template Jupyter notebook, is due on <strong>December 15th</strong>. Note that <strong>no late submissions will be accepted</strong> and you cannot use any of your unused slip days.
</p>

![image.png; width="100";](attachment:image.png)

<h2>Part 1: Basics</h2><p>

<h3>1.1 Import:</h3><p>
Please import the necessary packages. Note that learning and using these packages is recommended but not required for this project. Official tutorials for the suggested packages include:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

FROM ERIC: To install PyTorch, run the following:

<code>conda install pytorch torchvision -c pytorch</code>

In [1]:
import os
import pandas as pd
import numpy as np
# TODO
from sklearn.preprocessing import StandardScaler

<h3>1.2 Weighted Accuracy:</h3><p>
Since our dataset labels are heavily imbalanced, you need to use the following function to compute weighted accuracy throughout your training and validation process. We also use it for testing on Kaggle.
<p>

In [2]:
def weighted_accuracy(pred, true):
    assert(len(pred) == len(true))
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    frac_pos = num_pos/num_labels
    weight_pos = 1/frac_pos
    weight_neg = 1/(1-frac_pos)
    num_pos_correct = 0
    num_neg_correct = 0
    for pred_i, true_i in zip(pred, true):
        num_pos_correct += (pred_i == true_i and true_i == 1)
        num_neg_correct += (pred_i == true_i and true_i == 0)
    weighted_accuracy = ((weight_pos * num_pos_correct) 
                         + (weight_neg * num_neg_correct))/((weight_pos * num_pos) + (weight_neg * num_neg))
    return weighted_accuracy
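
Because each class is weighted by the inverse of its frequency, both weighted class totals equal the number of labels, so the expression above reduces to the mean of the per-class recalls (often called balanced accuracy). The sketch below checks this on made-up labels. `balanced_accuracy` is an illustrative helper, not part of the project template, and the template function is restated only so the example runs on its own.

```python
def weighted_accuracy(pred, true):
    # Same computation as the template function, restated for a standalone example.
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    weight_pos = num_labels / num_pos
    weight_neg = num_labels / num_neg
    pos_correct = sum(p == t == 1 for p, t in zip(pred, true))
    neg_correct = sum(p == t == 0 for p, t in zip(pred, true))
    return (weight_pos * pos_correct + weight_neg * neg_correct) / (
        weight_pos * num_pos + weight_neg * num_neg)


def balanced_accuracy(pred, true):
    # Illustrative helper: mean of recall on the positive and negative class.
    num_pos = sum(true)
    num_neg = len(true) - num_pos
    pos_correct = sum(p == t == 1 for p, t in zip(pred, true))
    neg_correct = sum(p == t == 0 for p, t in zip(pred, true))
    return (pos_correct / num_pos + neg_correct / num_neg) / 2


# Made-up labels: 2 positives, 8 negatives.
true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * len(true)

print(weighted_accuracy(pred, true), balanced_accuracy(pred, true))
# Predicting the majority class everywhere has 80% plain accuracy here,
# but recall is 0 on positives and 1 on negatives.
print(weighted_accuracy(always_negative, true))
```

A practical consequence is that a classifier that always predicts the majority class scores only 0.5 under this metric, however imbalanced the labels are.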

<h2>Part 2: Baseline Solution</h2><p>
Note that your code should be well commented, and in part 2.4 you can refer to your comments (e.g. # Here is SVM, 
# Here is validation for SVM, etc.). Also, we recommend that you do not use the 2012 dataset and the graph dataset to reach the baseline accuracy of 68% in this part. A basic solution with only the 2016 dataset and reasonable model selection will be enough. It would be great if you explored the graph and possibly the 2012 dataset in Part 3.

<h3>2.1 Preprocessing and Feature Extraction:</h3><p>
Given the training dataset and graph information, you need to correctly preprocess the dataset (e.g. feature normalization). For the baseline solution in this part, you might not need to introduce extra features to reach the baseline test accuracy.
<p>
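
Feature normalization should use statistics estimated on the training rows only, with the same statistics then applied to validation and test rows. A dependency-free sketch of that idea follows. `fit_standardizer` and `apply_standardizer` are hypothetical helpers that mimic what `StandardScaler.fit` and `StandardScaler.transform` do (mean removal and division by the population standard deviation), and the numbers are made up.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale rows with statistics that were fit elsewhere (the training set)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Made-up rows: [median income, unemployment rate]
train_rows = [[50000.0, 4.0], [40000.0, 6.0], [60000.0, 8.0]]
test_rows = [[50000.0, 6.0]]

means, stds = fit_standardizer(train_rows)       # fit on training data only
train_scaled = apply_standardizer(train_rows, means, stds)
test_scaled = apply_standardizer(test_rows, means, stds)  # reuse training stats

print(train_scaled)
print(test_scaled)
```

Fitting a second scaler on the test rows would put the two sets on different scales, which is why the test rows reuse `means` and `stds` from training.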

In [269]:
# You may change this but we suggest loading data with the following code and you may need to change
# datatypes and do necessary data transformation after loading the raw data to the dataframe.
dataset_path = "./train_2016.csv"
# df = pd.read_csv(dataset_path, sep=',',header=None, encoding='unicode_escape')

# Chose to include header to remember column identifiers
df = pd.read_csv(dataset_path, sep=',', encoding='unicode_escape')
# Have a quick look at the head of this dataset
df.head()

Unnamed: 0,FIPS,County,DEM,GOP,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate
0,18019,"Clark County, IN",18791,30012,51837,4.9,12.8,11.0,20.9,4.2
1,6035,"Lassen County, CA",2026,6533,49793,-18.4,9.2,6.3,12.0,6.9
2,40081,"Lincoln County, OK",2423,10838,44914,-1.3,11.4,11.7,15.1,5.3
3,31153,"Sarpy County, NE",27704,44649,74374,9.2,14.2,5.0,40.1,2.9
4,28055,"Issaquena County, MS",395,298,26957,-12.8,9.8,5.3,6.7,14.0


In [270]:
# Keep an untouched copy of the raw data for Part 3
df2 = df.copy()
# Create feature representing state number

# Gets state initials from a county string
get_state_from_county = lambda county : county[county.index(",") + 2:]

df["state_name"] = df["County"].apply(get_state_from_county)
states = df["state_name"].unique().tolist()

#one-hot encode state data
#onehot = pd.get_dummies(df["state_name"], prefix = None)
#df[onehot.columns] = onehot

get_id_from_state = lambda state : states.index(state)
df["state"] = df["state_name"].apply(get_id_from_state)

# Create a target prediction column that is 1 if DEM score strictly greater than GOP score
target = "target"
df[target] = (df["DEM"] > df["GOP"]).astype(int)

# Get rid of all commas in MedianIncome column
df['MedianIncome']=df['MedianIncome'].str.replace(',','').astype(int)

# Get rid of county, state_name, DEM and GOP columns
df = df.drop(columns = ["County", "DEM", "GOP", 'state_name'])


# Feature columns to standardize. FIPS (the ID) and target stay in df but are excluded here,
# so df[columns] holds the features and df[target] holds the prediction labels.
# Note: 'state' is an arbitrary integer ID, so scaling it treats the state order as meaningful.
columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate', 'state']


std_scaler = StandardScaler()
std_scaler.fit(df[columns])
df[columns] = std_scaler.transform(df[columns])


df.head()

# Make sure you comment your code clearly and you may refer to these comments in the part 2.4
# TODO

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,state,target
0,18019,0.188697,0.416162,0.502519,0.056358,-0.065012,-0.518119,-1.53346,0
1,6035,0.025339,-1.459045,-0.95906,-1.611216,-1.026568,0.943105,-1.450494,0
2,40081,-0.364594,-0.08282,-0.065873,0.30472,-0.691644,0.077194,-1.367528,0
3,31153,1.98987,0.76223,1.070911,-2.07246,2.009357,-1.221672,-1.284562,0
4,28055,-1.799729,-1.008352,-0.715463,-1.966019,-1.59918,4.785582,-1.201596,1


<h3>2.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 1.1.

In [272]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.4
# TODO
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import make_scorer

lda = LinearDiscriminantAnalysis()
labels = df['target']
features = df[df.columns[1:-1]]

x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2)

In [273]:
scores = cross_validate(lda, x_train, y_train, scoring = make_scorer(weighted_accuracy), return_estimator = True)
scores

{'fit_time': array([0.00588584, 0.00443316, 0.00617313, 0.00493717, 0.00421596]),
 'score_time': array([0.00481892, 0.00436282, 0.00619102, 0.00324488, 0.00462604]),
 'estimator': (LinearDiscriminantAnalysis(),
  LinearDiscriminantAnalysis(),
  LinearDiscriminantAnalysis(),
  LinearDiscriminantAnalysis(),
  LinearDiscriminantAnalysis()),
 'test_score': array([0.77027778, 0.77982143, 0.9389016 , 0.76510989, 0.79509002])}

In [274]:
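# Pick the estimator from the fold with the highest validation score
# (fold index 2 in the output above) instead of hard-coding the index
lda_best = scores['estimator'][int(np.argmax(scores['test_score']))]
y_pred = lda_best.predict(x_test)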
print(weighted_accuracy(y_pred, y_test))

0.6927170868347339


In [275]:
test = pd.read_csv("test_2016_no_label.csv")

get_state_from_county = lambda county : county[county.index(",") + 2:]

test["state_name"] = test["County"].apply(get_state_from_county)
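# NOTE: rebuilding `states` from the test file means a state's integer ID depends on its
# order of first appearance here, so it need not match the ID that state received in training.
states = test["state_name"].unique().tolist()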

#one-hot encode state data
#for state in onehot.columns:
    #test['state'] = (test["state_name"] == state).astype(int)
get_id_from_state = lambda state : states.index(state)
test["state"] = test["state_name"].apply(get_id_from_state)

# Get rid of all commas in MedianIncome column
test['MedianIncome']=test['MedianIncome'].str.replace(',','').astype(int)

# Get rid of County and state_name columns (the test file has no DEM or GOP columns)
test.drop(columns = ["County", "state_name"], inplace = True)


# Feature columns to standardize; these must match the columns the scaler was fit on
columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate', 'state']

test[columns] = std_scaler.transform(test[columns])

test.head()

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,state
0,17059,-0.786575,-0.831293,-0.309469,0.588562,-1.253452,1.32194,-1.53346
1,6103,-0.71057,0.166671,0.502519,-0.156524,-0.648428,1.051343,-1.450494
2,42047,-0.016139,-0.726668,-0.756063,0.765964,-0.313504,0.185433,-1.367528
3,47147,0.72017,0.617364,0.461919,-0.404886,-0.313504,-0.572239,-1.284562
4,39039,0.218508,-0.404744,-0.187671,-0.227485,-0.51878,-0.139283,-1.201596
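
One thing to watch in the cell above is that the state lookup is rebuilt from the test file, so the integer given to a state may differ from the one it received in training. A minimal sketch of one way to keep the encoding consistent follows, assuming a lookup table built once from the training counties. `build_state_ids`, the fallback ID for unseen states, and the test county names are illustrative choices, not part of the notebook.

```python
def state_of(county):
    """Extract the state abbreviation, e.g. 'Clark County, IN' -> 'IN'."""
    return county.rsplit(", ", 1)[1]


def build_state_ids(train_counties):
    """Assign each state an ID in order of first appearance in the training data."""
    ids = {}
    for county in train_counties:
        ids.setdefault(state_of(county), len(ids))
    return ids


# Training counties as shown in the train_2016.csv preview.
train_counties = [
    "Clark County, IN",
    "Lassen County, CA",
    "Lincoln County, OK",
    "Sarpy County, NE",
    "Issaquena County, MS",
]
# Hypothetical test counties, listed in a different state order.
test_counties = ["Example County, CA", "Sample County, IN", "Other County, ZZ"]

state_ids = build_state_ids(train_counties)
unseen_id = len(state_ids)  # assumption: one shared ID for states never seen in training

train_ids = [state_ids[state_of(c)] for c in train_counties]
test_ids = [state_ids.get(state_of(c), unseen_id) for c in test_counties]

print(state_ids)
print(test_ids)
```

With this lookup, "CA" maps to the same ID in both files regardless of where it first appears in the test data.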


In [276]:
test_features = test[test.columns[1:]]
outputdf = pd.DataFrame(test['FIPS'])
outputdf['Result'] = lda_best.predict(test_features)
outputdf.head()

Unnamed: 0,FIPS,Result
0,17059,0
1,6103,0
2,42047,0
3,47147,0
4,39039,0


In [235]:
# Write the predictions with a .csv extension so the file can be uploaded to Kaggle
outputdf.to_csv('4780PredictionsLDA.csv', index = False)

## Separating States into Regions

In [172]:
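# Reload the raw training data: the earlier preprocessing cell dropped the County column
df = pd.read_csv(dataset_path, sep=',', encoding='unicode_escape')
get_state_from_county = lambda county : county[county.index(",") + 2:]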

df["state_name"] = df["County"].apply(get_state_from_county)
states = df["state_name"].unique().tolist()

#one-hot encode region data
westcoast = ['CA', 'NV', 'WA', 'OR', 'AZ', 'AK', 'HI']
rockies = ['ID', 'CO', 'MT', 'WY', 'UT']
gulf = ['NM', 'TX', 'AR', 'LA', 'MS', 'AL']
midwest = ['ND', 'SD', 'NE', 'KS', 'OK', 'MN', 'IA', 'MO', 'WI', 'IL', 'IN', 'MI', 'KY', 'TN', 'OH']
southeast = ['FL', 'GA', 'SC', 'NC', 'VA', 'WV']
northeast = ['PA', 'MD', 'DE', 'NJ', 'NY', 'CT', 'RI', 'MA', 'VT', 'NH', 'ME']

regions = [westcoast, rockies, gulf, midwest, southeast, northeast]
region_names = ['West', 'Rockies', 'Gulf', 'Midwest', 'Southeast', 'Northeast']

for i in range(len(regions)):
    df[region_names[i]] = (df['state_name'].isin(regions[i])).astype(int)

# Create a target prediction column that is 1 if DEM score strictly greater than GOP score
target = "target"
df[target] = (df["DEM"] > df["GOP"]).astype(int)

# Get rid of all commas in MedianIncome column
df['MedianIncome']=df['MedianIncome'].str.replace(',','').astype(int)

# Get rid of county, state_name, DEM and GOP columns
df = df.drop(columns = ["County", "DEM", "GOP", "state_name"])


# Continuous feature columns to standardize. The 0/1 region columns are left unscaled;
# FIPS (the ID) and target stay in df but are excluded here.
columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']


std_scaler = StandardScaler()
std_scaler.fit(df[columns])
df[columns] = std_scaler.transform(df[columns])


df.head()

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,West,Rockies,Gulf,Midwest,Southeast,Northeast,target
0,18019,0.188697,0.416162,0.502519,0.056358,-0.065012,-0.518119,0,0,0,1,0,0,0
1,6035,0.025339,-1.459045,-0.95906,-1.611216,-1.026568,0.943105,1,0,0,0,0,0,0
2,40081,-0.364594,-0.08282,-0.065873,0.30472,-0.691644,0.077194,0,0,0,1,0,0,0
3,31153,1.98987,0.76223,1.070911,-2.07246,2.009357,-1.221672,0,0,0,1,0,0,0
4,28055,-1.799729,-1.008352,-0.715463,-1.966019,-1.59918,4.785582,0,0,1,0,0,0,1
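
The region encoding above relies on every state belonging to exactly one region list. A short sketch of a lookup-based variant that checks this follows. The region membership is copied from the cell above with repeated entries removed, and `one_hot_region` is a hypothetical helper.

```python
regions = {
    "West": ["CA", "NV", "WA", "OR", "AZ", "AK", "HI"],
    "Rockies": ["ID", "CO", "MT", "WY", "UT"],
    "Gulf": ["NM", "TX", "AR", "LA", "MS", "AL"],
    "Midwest": ["ND", "SD", "NE", "KS", "OK", "MN", "IA", "MO", "WI",
                "IL", "IN", "MI", "KY", "TN", "OH"],
    "Southeast": ["FL", "GA", "SC", "NC", "VA", "WV"],
    "Northeast": ["PA", "MD", "DE", "NJ", "NY", "CT", "RI", "MA", "VT", "NH", "ME"],
}

# Invert the lists into a state -> region lookup, failing loudly on duplicates.
state_to_region = {}
for region, members in regions.items():
    for state in members:
        if state in state_to_region:
            raise ValueError(f"{state} is listed under more than one region")
        state_to_region[state] = region


def one_hot_region(state):
    """Return one 0/1 indicator per region, in the order of `regions`."""
    return [int(state_to_region.get(state) == name) for name in regions]


print(len(state_to_region))   # number of distinct states covered
print(one_hot_region("IN"))   # Midwest indicator set
print(one_hot_region("DC"))   # not in any list, so every indicator is 0
```

Any abbreviation missing from the lists, such as "DC" here, ends up with all-zero indicators, which may or may not be the intended behaviour.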


In [180]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import make_scorer

lda = LinearDiscriminantAnalysis()
labels = df['target']
features = df[df.columns[1:-1]]

x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2)
scores = cross_validate(lda, x_train, y_train, scoring = make_scorer(weighted_accuracy), return_estimator = True)
scores

{'fit_time': array([0.00719404, 0.00646615, 0.00771689, 0.0096519 , 0.00988507]),
 'score_time': array([0.0037179 , 0.00536609, 0.00757718, 0.00519919, 0.00653481]),
 'estimator': (LinearDiscriminantAnalysis(),
  LinearDiscriminantAnalysis(),
  LinearDiscriminantAnalysis(),
  LinearDiscriminantAnalysis(),
  LinearDiscriminantAnalysis()),
 'test_score': array([0.75540035, 0.84521944, 0.80335157, 0.79629929, 0.84908213])}

In [183]:
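# NOTE: index 2 is hard-coded from an earlier run; in the scores printed above,
# the highest-scoring fold is index 4.
lda_best = scores['estimator'][2]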
y_pred = lda_best.predict(x_test)
print(weighted_accuracy(y_pred, y_test))

0.7243589743589743


In [184]:
test = pd.read_csv("test_2016_no_label.csv")

get_state_from_county = lambda county : county[county.index(",") + 2:]

test["state_name"] = test["County"].apply(get_state_from_county)
states = test["state_name"].unique().tolist()

# One-hot encode region membership (same region lists as used for the training data)

westcoast = ['CA', 'NV', 'WA', 'OR', 'AZ', 'AK', 'HI']
rockies = ['ID', 'CO', 'MT', 'WY', 'UT']
gulf = ['NM', 'TX', 'AR', 'LA', 'MS', 'AL']
midwest = ['ND', 'SD', 'NE', 'KS', 'OK', 'MN', 'IA', 'MO', 'WI', 'IL', 'IN', 'MI', 'KY', 'TN', 'OH']
southeast = ['FL', 'GA', 'SC', 'NC', 'VA', 'WV']
northeast = ['PA', 'MD', 'DE', 'NJ', 'NY', 'CT', 'RI', 'MA', 'VT', 'NH', 'ME']

regions = [westcoast, rockies, gulf, midwest, southeast, northeast]
region_names = ['West', 'Rockies', 'Gulf', 'Midwest', 'Southeast', 'Northeast']

for i in range(len(regions)):
    test[region_names[i]] = (test['state_name'].isin(regions[i])).astype(int)

# Get rid of all commas in MedianIncome column
test['MedianIncome']=test['MedianIncome'].str.replace(',','').astype(int)

# Get rid of County and state_name columns (the test file has no DEM or GOP columns)
test.drop(columns = ["County", "state_name"], inplace = True)


# Continuous feature columns to standardize; these must match the columns the scaler was fit on
columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']

test[columns] = std_scaler.transform(test[columns])

test.head()

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,West,Rockies,Gulf,Midwest,Southeast,Northeast
0,17059,-0.786575,-0.831293,-0.309469,0.588562,-1.253452,1.32194,0,0,0,1,0,0
1,6103,-0.71057,0.166671,0.502519,-0.156524,-0.648428,1.051343,1,0,0,0,0,0
2,42047,-0.016139,-0.726668,-0.756063,0.765964,-0.313504,0.185433,0,0,0,0,0,1
3,47147,0.72017,0.617364,0.461919,-0.404886,-0.313504,-0.572239,0,0,0,1,0,0
4,39039,0.218508,-0.404744,-0.187671,-0.227485,-0.51878,-0.139283,0,0,0,1,0,0


In [186]:
test_features = test[test.columns[1:]]
outputdf2 = pd.DataFrame(test['FIPS'])
outputdf2['Result'] = lda_best.predict(test_features)
outputdf2.head()

Unnamed: 0,FIPS,Result
0,17059,0
1,6103,0
2,42047,0
3,47147,0
4,39039,0


In [188]:
outputdf2.to_csv('Predictions2.csv', index = False)

<h3>2.3 Training, Validation and Model Selection:</h3><p>
You need to split your data into a training set and a validation set, or perform cross-validation, for model selection.
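
Since the labels are imbalanced, it can help to keep the class ratio roughly equal across folds. A dependency-free sketch of stratified k-fold index generation follows. `stratified_folds` is a hypothetical helper written for illustration (scikit-learn's `StratifiedKFold` provides the same idea), and the labels are made up.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split example indices into k folds with a similar class mix in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal this class's indices round-robin so each fold gets a similar share.
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


# Made-up labels: 4 positives and 16 negatives.
labels = [1] * 4 + [0] * 16
folds = stratified_folds(labels, k=4)

for held_out, val_idx in enumerate(folds):
    # Every fold serves once as the validation set; the rest form the training set.
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    positives = sum(labels[i] for i in val_idx)
    print(f"fold {held_out}: train={len(train_idx)} val={len(val_idx)} val_positives={positives}")
```

Averaging the validation score over all folds gives a steadier basis for comparing models than the score of any single fold.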

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.4
# TODO

<h3>2.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

2.4.1 How did you preprocess the dataset and features?

2.4.2 Which two learning methods from class did you choose and why did you make these choices?

2.4.3 How did you do the model selection?

2.4.4 Does the test performance reach the given baseline of 68%? (Please include a screenshot of your Kaggle submission)

<h2>Part 3: Creative Solution</h2><p>

<h3>3.1 Open-ended Code:</h3><p>
You may follow the steps in part 2 again but make innovative changes such as creating new features, using new training algorithms, etc. Make sure you explain everything clearly in part 3.2. Note that reaching the 75% creative baseline is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.
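
One possible new feature is a county's 2012 outcome, joined to the 2016 rows by FIPS code. The sketch below illustrates the join with two counties whose values appear in this notebook's data previews. `dem_share_2012` and `unemployment_change` are hypothetical feature names, FIPS 99999 is a made-up county with no 2012 record, and whether such features help is something to check on validation data.

```python
# 2016 feature rows keyed by FIPS (values from the train_2016.csv preview).
rows_2016 = {
    18019: {"UnemploymentRate": 4.2},
    28055: {"UnemploymentRate": 14.0},
    99999: {"UnemploymentRate": 5.0},  # made-up county without 2012 data
}
# 2012 rows keyed by FIPS (values from the train_2012.csv preview).
rows_2012 = {
    18019: {"DEM": 20775, "GOP": 25422, "UnemploymentRate": 7.8},
    28055: {"DEM": 479, "GOP": 302, "UnemploymentRate": 17.6},
}


def historical_features(fips):
    """Build 2012-based features for one county, or None values if it has no 2012 row."""
    past = rows_2012.get(fips)
    if past is None:
        return {"dem_share_2012": None, "unemployment_change": None}
    now = rows_2016[fips]
    return {
        # Two-party Democratic vote share in 2012.
        "dem_share_2012": past["DEM"] / (past["DEM"] + past["GOP"]),
        # Change in unemployment rate from 2012 to 2016, in percentage points.
        "unemployment_change": now["UnemploymentRate"] - past["UnemploymentRate"],
    }


features = {fips: historical_features(fips) for fips in rows_2016}
for fips, values in features.items():
    print(fips, values)
```

Counties without a 2012 row need an explicit policy (for example imputing the training mean) before the features reach a model.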

In [246]:
# Make sure you comment your code clearly and you may refer to these comments in the part 3.2
# TODO
historical = pd.read_csv('train_2012.csv')
df2.head()

get_state_from_county = lambda county : county[county.index(",") + 2:]

df2["state_name"] = df2["County"].apply(get_state_from_county)
states = df2["state_name"].unique().tolist()

#one-hot encode state data
#onehot = pd.get_dummies(df["state_name"], prefix = None)
#df[onehot.columns] = onehot

# Create a target prediction column that is 1 if DEM score strictly greater than GOP score
target = "target"
df2[target] = (df2["DEM"] > df2["GOP"]).astype(int)

# Get rid of all commas in MedianIncome column
df2['MedianIncome']=df2['MedianIncome'].str.replace(',','').astype(int)

# Get rid of County, DEM and GOP columns (state_name is kept to build the state ID later)
df2 = df2.drop(columns = ["County", "DEM", "GOP"])


# Continuous feature columns to standardize; FIPS, state_name and target are excluded here
columns = ['MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate']


std_scaler = StandardScaler()
std_scaler.fit(df2[columns])
df2[columns] = std_scaler.transform(df2[columns])


df2.head()

Unnamed: 0,FIPS,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate,state_name,target
0,18019,0.188697,0.416162,0.502519,0.056358,-0.065012,-0.518119,IN,0
1,6035,0.025339,-1.459045,-0.95906,-1.611216,-1.026568,0.943105,CA,0
2,40081,-0.364594,-0.08282,-0.065873,0.30472,-0.691644,0.077194,OK,0
3,31153,1.98987,0.76223,1.070911,-2.07246,2.009357,-1.221672,NE,0
4,28055,-1.799729,-1.008352,-0.715463,-1.966019,-1.59918,4.785582,MS,1


In [258]:
get_id_from_state = lambda state : states.index(state)
df2["state"] = df2["state_name"].apply(get_id_from_state)
historical.head()

Unnamed: 0,FIPS,County,DEM,GOP,MedianIncome,MigraRate,BirthRate,DeathRate,BachelorRate,UnemploymentRate
0,18019,"Clark County, IN",20775,25422,49164,0.1,13.1,9.7,19.7,7.8
1,6035,"Lassen County, CA",3044,7261,47480,-19.8,9.0,8.0,12.9,12.5
2,40081,"Lincoln County, OK",3265,9542,44149,-2.8,11.3,11.1,13.1,5.0
3,31153,"Sarpy County, NE",24709,40318,68118,8.0,15.9,4.9,36.6,3.9
4,28055,"Issaquena County, MS",479,302,28886,2.2,8.6,7.2,7.8,17.6


<h3>3.2 Explanation in Words:</h3><p>

You need to answer the following questions in a markdown cell after this cell:

3.2.1 How much did you manage to improve performance on the test set compared to part 2? Did you reach the 75% accuracy for the test in Kaggle? (Please include a screenshot of Kaggle Submission)

3.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

<h2>Part 4: Kaggle Submission</h2><p>
You need to generate a prediction CSV from your trained model using the following cell and submit the direct output of your code to Kaggle. The CSV must contain two columns named exactly "FIPS" and "Result" and 1555 total rows excluding the column names. The "FIPS" column must contain the FIPS codes of the counties in the same order as in test_2016_no_label.csv, while the "Result" column must contain the 0 or 1 predictions for the corresponding counties. A sample prediction file can be downloaded from Kaggle.
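
A minimal sketch of the expected file layout, written with the standard library only, is shown below. The three FIPS codes are the first test rows previewed earlier in this notebook, the predictions are placeholders, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io

# FIPS codes in the same order as the test file, with placeholder 0/1 predictions.
fips_codes = [17059, 6103, 42047]
predictions = [0, 0, 0]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["FIPS", "Result"])           # exact column names required by Kaggle
writer.writerows(zip(fips_codes, predictions))

lines = buffer.getvalue().splitlines()
print(lines)
```

For the real submission the same rows would be written to a `.csv` file, and the row count excluding the header should come out to 1555.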

In [None]:
# TODO

# You may use pandas to generate a dataframe with FIPS and your predictions first 
# and then use to_csv to generate a CSV file.

<h2>Part 5: Resources and Literature Used</h2><p>