
# **Voting Project Paper**
## Predicting 2024 Virginia Presidential Election Outcomes

Anwita Molaka & Sophie Phillips

## **Summary**

This project aims to predict the winner of the 2024 presidential election in the state of Virginia. We developed the predictive model using voting data from Virginia presidential elections from 2000 to 2020 and NHGIS county data, which contains county-level statistics for every state in the country.
The data was cleaned and preprocessed, with some accompanying EDA, to help determine which variables would be most useful for building a model. The main variables used to train and test our model were year, county FIPS code, population, and total votes. We additionally chose to focus on variables pertaining to income, age, education, and race, in hopes of building a more accurate model. The model we chose to explore was a random forest, both with and without the additional variables. County adjacencies data was also used to create choropleth plots that visualize these results over a map of Virginia and compare them with the actual results of the 2020 election. The random forest model predicted the number of Democratic votes and the number of Republican votes for each county in Virginia. The elections before 2020 (2000 through 2016) were used to train the model, and the 2020 election was used to test it, since the candidates are the same as in the upcoming 2024 presidential election. Overall, the model predicted about 2,140,696 votes for the Democratic candidate and about 1,995,143 votes for the Republican candidate. The R-squared value of about 0.9458 on the 2020 test data indicates that the model fits the data well.


*** ADD CHOROPLETH MAP RESULTS FOR PREDICTED AND ACTUAL ****


## **Data**

The voting data file contains information on presidential elections in Virginia from 2000 to 2020 and includes variables such as:


*   Year
*   County
* Candidate
* Party
* Candidate votes
* Total votes

The NHGIS county data demographic and economic variables we chose to focus on were:


*   Income (grouped into 0-50K, 50k-100k, and 100k+)
*   Education (grouped by Male and Female as 'Some early education', 'High school', 'Some college or Associate's degree', 'Bachelor's degree', and 'Master's degree or higher')
* Age (grouped by Male and Female as 18-24, 25-49, 50-64, 65 and above)
* Race (grouped into White, Black or African American, American Indian and Alaska Native, Asian, Native Hawaiian and Other Pacific Islander)

From the NHGIS county data, we also retrieved the population of each county in Virginia. This was a challenge because the data reports values for ranges of years rather than for each individual year. We therefore wrote a for loop that iterates through each year in the voting data file and takes the population from the closest year available in the NHGIS county data. This let us keep all of the voting data while still having approximate population values for each county.
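
The closest-year matching described above can be sketched as follows. This is a minimal illustration rather than the project's actual loop: the two small tables, the `pop_year` column, the `nearest_population` helper, and all values are made up for the example.

```python
import pandas as pd

# Hypothetical population table: one row per county and reference year
# (toy values; the real NHGIS files report ranges of years).
population = pd.DataFrame({
    "county_fips": [51001, 51001, 51003, 51003],
    "pop_year": [2000, 2010, 2000, 2010],
    "population": [1000, 1200, 5000, 5600],
})

# Hypothetical voting table: one row per county and election year.
votes = pd.DataFrame({
    "county_fips": [51001, 51001, 51003, 51003],
    "year": [2004, 2008, 2004, 2008],
})

def nearest_population(county_fips, year):
    """Return the county's population from the reference year closest to `year`."""
    rows = population[population["county_fips"] == county_fips]
    closest = (rows["pop_year"] - year).abs().idxmin()  # ties go to the earlier row
    return rows.loc[closest, "population"]

# Loop over every (county, election year) pair in the voting data.
votes["population"] = [
    nearest_population(fips, year)
    for fips, year in zip(votes["county_fips"], votes["year"])
]
print(votes)
```

In this toy data, 2004 is matched to the 2000 population and 2008 to the 2010 population, so no voting rows are lost.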







## Cleaning the Data & Challenges

For the county adjacencies data, cleaning mainly consisted of dropping rows with missing values (NAs). We checked for outliers with a boxplot but did not find any that called for a log transformation.

For the voting data, we started by dropping NAs. Then, we checked for outliers with a boxplot and applied log transformations to the candidatevotes and totalvotes variables. Next, we looked at the election data to determine which election was the most informative. Based on the candidates and how informative the voting data was, we decided to use the 2020 data as our test data and the rest as our training data. Our next step was to combine the party votes. Each candidate was listed in the original data with their party affiliation and votes, so each county had multiple rows. We wanted each county to have only one row per election, so we aggregated the vote counts by party and kept only Democrat and Republican, since these are the parties we wanted to model. Finally, we calculated the total votes for each county as the sum of the Democratic and Republican votes.
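
One way to collapse the per-candidate rows into a single row per county and election is a pivot. The sketch below uses toy numbers; the `party` column and its `DEMOCRAT`/`REPUBLICAN`/`LIBERTARIAN` labels are assumptions about the raw file, not confirmed values.

```python
import pandas as pd

# Toy long-format data: one row per county, year, and candidate (made-up counts).
raw = pd.DataFrame({
    "year": [2020] * 6,
    "county_fips": [51001, 51001, 51001, 51003, 51003, 51003],
    "party": ["DEMOCRAT", "REPUBLICAN", "LIBERTARIAN"] * 2,
    "candidatevotes": [100, 150, 5, 300, 200, 10],
})

# Keep only the two parties the model predicts.
major = raw[raw["party"].isin(["DEMOCRAT", "REPUBLICAN"])]

# Sum votes by party so each (year, county) pair becomes one row.
wide = (
    major.pivot_table(index=["year", "county_fips"], columns="party",
                      values="candidatevotes", aggfunc="sum")
    .rename(columns={"DEMOCRAT": "democrat_votes",
                     "REPUBLICAN": "republican_votes"})
    .reset_index()
)
wide.columns.name = None

# Total votes here means the two-party total, as described above.
wide["total_votes"] = wide["democrat_votes"] + wide["republican_votes"]
print(wide)
```

Third-party votes are dropped before the pivot, so `total_votes` is smaller than the full turnout in any county where other candidates received votes.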

Cleaning the NHGIS county data came with a few more challenges. To start, the second row of the dataframe had to be made the header, since that row contained the descriptive column names. Left as is, it would have interfered with the data types of the variables of interest, for example by preventing them from being numeric. Even after removing that first row, there was still an issue with the data types, so certain variables had to be converted from object to integer data types.
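
The header fix and type conversion might look like the sketch below. It assumes the descriptive names sit in the first data row after loading (that is, the second line of the file); the CSV text, column names, and values are invented for illustration.

```python
import io
import pandas as pd

# Toy CSV imitating the layout described above: a row of short codes
# followed by a row of descriptive names (all names and values are made up).
csv_text = """GISJOIN,STATE,AB01
GIS Join Match Code,State Name,Total Population
G5100010,Virginia,1000
G5100030,Virginia,5000
"""

df = pd.read_csv(io.StringIO(csv_text))

# Promote the first data row (the descriptive names) to the header, then drop it.
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns.name = None

# The text row forced the count column to be read as strings, so convert it back.
df["Total Population"] = pd.to_numeric(df["Total Population"]).astype(int)
print(df.dtypes)
```

Passing `header=1` to `pd.read_csv` would be an alternative that skips the code row at load time and avoids the later type conversion.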

Additionally, some of the county data files, such as ./0002_ds176_20105_county_E.csv, contained extra rows at the end in an incorrect format, which we dropped. However, on closer inspection, some of these files did not contain data for Virginia, so not all of the county files were used. Instead, we used all the files ending in 'M' rather than 'E' and combined their data, since each file covered a different set of years.
Once we combined all of the files, we kept only the Virginia data and then isolated the variables of interest. Each variable (income, age, race, and education) was spread across numerous columns covering very small, specific ranges. To make our analysis easier, we combined these columns into broader but still informative ranges: we defined the desired ranges for each variable, aggregated the counts from the combined dataset into new columns, and replaced the existing columns with these aggregated columns.
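
As an illustration of this aggregation step, the sketch below collapses narrow income brackets into the three broad ranges used in this paper. The narrow column names and all counts are hypothetical; the real NHGIS columns use different names.

```python
import pandas as pd

# Toy county table with narrow income brackets (made-up names and counts).
counties = pd.DataFrame({
    "county_fips": [51001, 51003],
    "income_0_25k": [10, 20],
    "income_25_50k": [30, 40],
    "income_50_75k": [50, 60],
    "income_75_100k": [70, 80],
    "income_100_150k": [90, 100],
    "income_150k_plus": [110, 120],
})

# Map each broad range to the narrow columns it absorbs.
income_groups = {
    "income_0_50k": ["income_0_25k", "income_25_50k"],
    "income_50k_100k": ["income_50_75k", "income_75_100k"],
    "income_100k_plus": ["income_100_150k", "income_150k_plus"],
}

# Sum the narrow counts row by row to build each broad column.
for new_col, old_cols in income_groups.items():
    counties[new_col] = counties[old_cols].sum(axis=1)

# Replace the narrow columns with the aggregated ones.
narrow_cols = [c for cols in income_groups.values() for c in cols]
counties = counties.drop(columns=narrow_cols)
print(counties)
```

The same dictionary-driven pattern would apply to the age, race, and education columns, with one mapping per variable.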

After building this new dataset with our variables of interest, we checked for outliers with boxplots. All of the variables displayed outliers, so we applied log transformations to all of them.
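
A log transformation of count columns can be sketched as below. Using `np.log1p` (the log of 1 + x) is an assumption made here so that zero counts stay defined; the text above does not say which log variant was applied, and the column names and values are toy examples.

```python
import numpy as np
import pandas as pd

# Toy table of skewed counts (made-up column names and values).
df = pd.DataFrame({
    "county_fips": [51001, 51003],
    "income_0_50k": [0, 999],
    "age_18_24": [9, 99],
})

# Transform every count column but leave the identifier untouched.
count_cols = [c for c in df.columns if c != "county_fips"]
df[count_cols] = np.log1p(df[count_cols])  # log(1 + x), so 0 maps to 0
print(df)
```

If a model is trained on log-transformed targets, `np.expm1` reverses this transformation to recover predictions on the original scale.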


### EDA

## **Results**

### Random Forest Model -- Main Variables

We decided to use a random forest for our prediction model, since it is generally more robust and accurate than a single decision tree.

In [3]:
! git clone https://github.com/amolaka/DS-3001---Voting-Project

Cloning into 'DS-3001---Voting-Project'...
remote: Enumerating objects: 236, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 236 (delta 35), reused 27 (delta 16), pack-reused 163
Receiving objects: 100% (236/236), 66.44 MiB | 15.20 MiB/s, done.
Resolving deltas: 100% (100/100), done.
Updating files: 100% (61/61), done.


In [4]:
# Import libraries and packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
# Import data
data = pd.read_csv('/content/DS-3001---Voting-Project/clean_data/voting_data_with_population-2.csv')
data = data.drop(['Unnamed: 0'], axis=1)
data.head()

| | year | state | county_fips | county_name | democrat_votes | republican_votes | total_votes | population |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | VIRGINIA | 51001 | ACCOMACK | 5092 | 6352 | 11444 | 38305.0 |
| 1 | 2000 | VIRGINIA | 51003 | ALBEMARLE | 16255 | 18291 | 34546 | 79236.0 |
| 2 | 2000 | VIRGINIA | 51005 | ALLEGHANY | 2214 | 2808 | 5022 | 12926.0 |
| 3 | 2000 | VIRGINIA | 51007 | AMELIA | 1754 | 2947 | 4701 | 11400.0 |
| 4 | 2000 | VIRGINIA | 51009 | AMHERST | 4812 | 6660 | 11472 | 31894.0 |


In [9]:
data = data.dropna()

In [10]:
# Train on every election before 2020; hold out 2020 as the test set
train_data = data[data['year'] != 2020]
test_data = data[data['year'] == 2020]

# Keep the 2020 county identifiers so the predictions can be labeled later
county_names = data[data['year'] == 2020]
county_names = county_names.drop(columns = ['democrat_votes', 'republican_votes', 'year'])

# Drop the text columns, which are not used as features
train_data = train_data.drop(columns = ['state', 'county_name'])
test_data = test_data.drop(columns = ['state', 'county_name'])

# Separate the features (X) from the two vote-count targets (y):
X_train = train_data.drop(columns = ['democrat_votes', 'republican_votes'])
y_train = train_data.loc[:, ['democrat_votes', 'republican_votes']]
X_test = test_data.drop(columns = ['democrat_votes', 'republican_votes'])
y_test = test_data.loc[:, ['democrat_votes', 'republican_votes']]

In [11]:
from sklearn.ensemble import RandomForestRegressor

# Fit model:
model = RandomForestRegressor() # Build a random forest model
model.fit(X_train,y_train) # Fit the model

In [12]:
# Look at R-squared values
train_data_score = model.score(X_train, y_train)
print(f'R^2 - Train: {train_data_score:.4f}')

test_data_score = model.score(X_test, y_test)
print(f'R^2 - Test: {test_data_score:.4f}')

R^2 - Train: 0.9969
R^2 - Test: 0.9482
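
Because the model predicts two targets at once, each score above is a single number covering both the Democratic and Republican vote counts. As we understand it, scikit-learn's `score` computes $R^2 = 1 - SS_{res}/SS_{tot}$ separately for each target column and then averages the two with equal weight. The sketch below reproduces that calculation by hand on toy numbers; `r2_per_target` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def r2_per_target(y_true, y_pred):
    """Return R^2 for each target column: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)               # residual sum of squares
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values: columns are (democrat_votes, republican_votes) for three counties.
y_true = np.array([[100, 200], [300, 100], [500, 600]])
y_pred = np.array([[110, 190], [290, 120], [500, 590]])

per_target = r2_per_target(y_true, y_pred)
overall = per_target.mean()  # equal-weight average across the two targets
print(per_target, overall)
```

Reporting the per-target values alongside the average would show whether one party's vote count is predicted noticeably better than the other's.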


In [13]:
# Look at predicted voting data
prediction = model.predict(X_test) # Model predictions

# Make a new dataframe
voting_predictions = pd.DataFrame()
voting_predictions['county_name'] = county_names['county_name']
voting_predictions['county_fips'] = county_names['county_fips']
voting_predictions['democrat_votes'] = prediction[:, 0]
voting_predictions['republican_votes'] = prediction[:, 1]

# Look at new dataframe
voting_predictions

| | county_name | county_fips | democrat_votes | republican_votes |
|---|---|---|---|---|
| 670 | ACCOMACK | 51001 | 6342.20 | 10214.90 |
| 671 | ALBEMARLE | 51003 | 35048.15 | 27253.06 |
| 672 | ALLEGHANY | 51005 | 3187.32 | 4913.92 |
| 673 | AMELIA | 51007 | 2821.54 | 4928.41 |
| 674 | AMHERST | 51009 | 6238.51 | 10097.52 |
| ... | ... | ... | ... | ... |
| 798 | SUFFOLK CITY | 51800 | 24857.13 | 20513.47 |
| 799 | VIRGINIA BEACH CITY | 51810 | 95299.99 | 96903.22 |
| 800 | WAYNESBORO CITY | 51820 | 5218.57 | 5056.38 |
| 801 | WILLIAMSBURG CITY | 51830 | 3726.47 | 3176.10 |


In [14]:
# Look at total democrat vs total republican votes
print("Total Democrat Votes: " + str(voting_predictions['democrat_votes'].sum()))
print("Total Republican Votes: " + str(voting_predictions['republican_votes'].sum()))

Total Democrat Votes: 2145550.48
Total Republican Votes: 2005412.8900000001


We ran the random forest model several times outside of this report. Because the forest is built with random sampling and no fixed seed was set, the results vary slightly from run to run. Each time, the test R-squared was around 0.94, indicating that the model fits the data well. We also observed more Democratic votes than Republican votes overall in every run.
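
The run-to-run variation could be summarized by refitting the forest under several seeds and collecting the test scores. The sketch below does this on synthetic data, not the voting data; the features, targets, seed range, and `n_estimators=50` are all arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: three features and two targets.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = np.column_stack([X[:, 0] * 10 + X[:, 1], X[:, 2] * 5 + X[:, 0]])
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Refit with different seeds and record the test R^2 of each run.
scores = []
for seed in range(5):
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print(f"R^2 mean: {np.mean(scores):.4f}, std: {np.std(scores):.4f}")
```

Fixing `random_state` to a single value in the notebook would also make the reported vote totals reproducible from one run to the next.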

### Presentation of Main Results: Choropleth Maps

## **Conclusion**
 One to two pages summarizing the project, defending it from criticism, and suggesting additional work that was outside the scope of the project.