My dataset uses the average national polling percentages for the Democratic and Republican candidates for each year, based on multiple polls during the election cycle. Instead of focusing on just the very last poll before Election Day, I wanted to see how well the overall polling trend for each year predicted the eventual winner. This helps me test whether the general polling climate—what most analysts, media, and campaigns would have seen—lines up with the final election results.

## What I did:

I scraped data from https://en.wikipedia.org/wiki/Polling_for_United_States_presidential_elections to get the overall historical picture of opinion polling for US presidential elections. On that page, the data is sourced from Gallup Trial-Heat Polls for 1936 through the early 2000s and from a mix of Gallup and other major pollsters in the later decades, and it shows the average polling numbers for each month of the election year. I wrote code to create a dataframe from this scraped data for every year (located in the Historical_Notebook folder) and then combined them into a single dataframe titled Historical_Polling_Data_Set, located in the same folder.

Once that was done, I went through the same process (scraping, writing code to clean the data, saving it in separate dataframes by year, and then combining the separate dataframes into one main dataframe) for the actual election results from https://www.270towin.com, which can be found in the same folder. Then I created Historic_Baseline_Model_Poll_vs_Actual_Combined.csv, which combines the historic polling results by year with the actual results by year.

Because the baseline model needs a very clean structure, I started this notebook (Baseline_Model_Data) to clean the data. I averaged the polling for both Democrats and Republicans per year, determined who led in polling for that year, and created columns showing which party won the election in both the electoral college vote (which determines the presidency) and the popular vote. The popular vote column is there for future iterations of the project, where I can study how polling accuracy differs between the electoral college and popular vote results for each election, and maybe see whether the data backs up the frustration people feel over the electoral college determining the winner.

Then I created a copy that keeps only the main columns I needed for my baseline models: Logistic Regression, Random Forest, and XGBoost. Here you will find the results of testing each of these models on the historical polling dataset, along with my final analysis of whether polling has predictive power for election results decided by the electoral college.
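
The "one dataframe per year, then combine" step described above can be sketched roughly as follows. This is only an illustration and not the code from the Historical_Notebook folder: the helper name `combine_yearly_frames`, the column names, and the poll values are placeholders made up for the example.

```python
import pandas as pd

def combine_yearly_frames(frames_by_year):
    """Stack per-year polling dataframes into one frame with a 'year' column."""
    tagged = [
        frame.assign(year=year)
        for year, frame in sorted(frames_by_year.items())
    ]
    return pd.concat(tagged, ignore_index=True)

# Placeholder monthly averages, NOT real polling numbers
frames_by_year = {
    1940: pd.DataFrame({"month": ["Sep", "Oct"],
                        "Democratic": [50.0, 51.0],
                        "Republican": [45.0, 44.0]}),
    1936: pd.DataFrame({"month": ["Sep", "Oct"],
                        "Democratic": [49.0, 52.0],
                        "Republican": [46.0, 43.0]}),
}

combined = combine_yearly_frames(frames_by_year)
print(combined)
```

Tagging each frame with its year before concatenating is what makes the later `groupby('year')` step possible.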

In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Read in your dataset
df = pd.read_csv('Historic_Baseline_Model_Poll_vs_Actual_Combined.csv')

# Quick check of the first few rows
#df.head(60)

In [2]:
# Group by year and calculate the average of every numeric column per year
df = df.groupby('year').mean(numeric_only=True).reset_index().round(2)

# Set Poll_Leader to 0 if the Democrat led the average polling that year, 1 if the Republican led
# Based on the Democratic and Republican columns
df['Poll_Leader'] = (df['Republican'] > df['Democratic']).astype(int)

# Set election_winner to 0 if the Democrat won that year, 1 if the Republican won
# Based on the Republican_Electoral and Democrat_Electoral columns
# (Electoral College win, which is how candidates win the US presidency)
df['election_winner'] = (df['Republican_Electoral'] > df['Democrat_Electoral']).astype(int)

# Rename election_winner to EC_election_winner
df = df.rename(columns={'election_winner': 'EC_election_winner'})

# Create a new column, PC_election_winner:
# 0 if the Democrat won the popular vote that year, 1 if the Republican won
# Based on the Republican_Popular and Democrat_Popular columns
# (Popular vote win, which would be interesting to dive into in later iterations of this project)
df['PC_election_winner'] = (df['Republican_Popular'] > df['Democrat_Popular']).astype(int)

# Drop duplicate columns (keep the first occurrence of each column name)
df = df.loc[:, ~df.columns.duplicated()]

df.columns

Index(['year', 'Democratic', 'Republican', 'Poll_Leading_Margin',
       'Poll_Leader', 'EC_election_winner', 'Republican_Electoral',
       'Democrat_Electoral', 'Republican_Popular', 'Democrat_Popular',
       'Total_Popular_Vote (Total votes cast in Presidential Election)',
       'Republican_Electoral_pct (out of 270)',
       'Democrat_Electoral_pct (out of 270)',
       'Republican_Popular_pct (Out of total votes cast in Presidential Election)',
       'Democrat_Popular_pct (Out of total votes cast in Presidential Election)',
       'Electoral_Leading_Margin (difference between dem and rep electorial pct)',
       'Popular_Leading_Margin (difference between dem and rep popular vote pct)',
       'Poll_vs_Electoral_Margin_Diff', 'Poll_vs_Popular_Margin_Diff',
       'PC_election_winner'],
      dtype='object')

In [3]:
df

Unnamed: 0,year,Democratic,Republican,Poll_Leading_Margin,Poll_Leader,EC_election_winner,Republican_Electoral,Democrat_Electoral,Republican_Popular,Democrat_Popular,Total_Popular_Vote (Total votes cast in Presidential Election),Republican_Electoral_pct (out of 270),Democrat_Electoral_pct (out of 270),Republican_Popular_pct (Out of total votes cast in Presidential Election),Democrat_Popular_pct (Out of total votes cast in Presidential Election),Electoral_Leading_Margin (difference between dem and rep electorial pct),Popular_Leading_Margin (difference between dem and rep popular vote pct),Poll_vs_Electoral_Margin_Diff,Poll_vs_Popular_Margin_Diff,PC_election_winner
0,1936,50.33,44.44,5.89,0,0,8.0,523.0,16679583.0,27751597.0,44431180.0,2.96,193.7,37.54,62.46,190.74,24.92,-184.85,-19.03,0
1,1940,48.12,42.75,5.38,0,0,82.0,449.0,22305198.0,27244160.0,49549358.0,30.37,166.3,45.02,54.98,135.93,9.96,-130.56,-4.58,0
2,1944,49.22,44.33,4.89,0,0,99.0,432.0,22006285.0,25602504.0,47608789.0,36.67,160.0,46.22,53.78,123.33,7.56,-118.44,-2.67,0
3,1948,39.6,47.2,8.6,1,0,189.0,303.0,21969170.0,24105695.0,46074865.0,70.0,112.22,47.68,52.32,42.22,4.64,-33.62,3.96,0
4,1952,40.11,52.22,12.11,1,1,442.0,89.0,33778963.0,27314992.0,61093955.0,163.7,32.96,55.29,44.71,130.74,10.58,-118.63,1.53,1
5,1956,37.33,58.17,20.83,1,1,457.0,73.0,35581003.0,25738765.0,61319768.0,169.26,27.04,58.03,41.97,142.22,16.06,-121.39,4.77,1
6,1960,47.93,47.0,2.93,0,0,219.0,303.0,34107646.0,34227096.0,68334742.0,81.11,112.22,49.91,50.09,31.11,0.18,-28.18,2.75,0
7,1964,66.0,27.78,38.22,0,0,52.0,486.0,27146969.0,42825463.0,69972432.0,19.26,180.0,38.8,61.2,160.74,22.4,-122.52,15.82,0
8,1968,35.58,41.0,8.08,1,1,301.0,191.0,31710470.0,30898055.0,62608525.0,111.48,70.74,50.65,49.35,40.74,1.3,-32.66,6.78,1
9,1972,34.44,58.33,23.89,1,1,520.0,17.0,46740323.0,28901598.0,75641921.0,192.59,6.3,61.79,38.21,186.29,23.58,-162.4,0.31,1


In [4]:
# Creates EC_Poll_Accurate column which states if Poll_Leader matched EC_election_winner
df['EC_Poll_Accurate'] = (df['Poll_Leader'] == df['EC_election_winner']).astype(int)

# Creates PC_Poll_Accurate column which states if Poll_Leader matched PC_election_winner
df['PC_Poll_Accurate'] = (df['Poll_Leader'] == df['PC_election_winner']).astype(int)

df

Unnamed: 0,year,Democratic,Republican,Poll_Leading_Margin,Poll_Leader,EC_election_winner,Republican_Electoral,Democrat_Electoral,Republican_Popular,Democrat_Popular,...,Democrat_Electoral_pct (out of 270),Republican_Popular_pct (Out of total votes cast in Presidential Election),Democrat_Popular_pct (Out of total votes cast in Presidential Election),Electoral_Leading_Margin (difference between dem and rep electorial pct),Popular_Leading_Margin (difference between dem and rep popular vote pct),Poll_vs_Electoral_Margin_Diff,Poll_vs_Popular_Margin_Diff,PC_election_winner,EC_Poll_Accurate,PC_Poll_Accurate
0,1936,50.33,44.44,5.89,0,0,8.0,523.0,16679583.0,27751597.0,...,193.7,37.54,62.46,190.74,24.92,-184.85,-19.03,0,1,1
1,1940,48.12,42.75,5.38,0,0,82.0,449.0,22305198.0,27244160.0,...,166.3,45.02,54.98,135.93,9.96,-130.56,-4.58,0,1,1
2,1944,49.22,44.33,4.89,0,0,99.0,432.0,22006285.0,25602504.0,...,160.0,46.22,53.78,123.33,7.56,-118.44,-2.67,0,1,1
3,1948,39.6,47.2,8.6,1,0,189.0,303.0,21969170.0,24105695.0,...,112.22,47.68,52.32,42.22,4.64,-33.62,3.96,0,0,0
4,1952,40.11,52.22,12.11,1,1,442.0,89.0,33778963.0,27314992.0,...,32.96,55.29,44.71,130.74,10.58,-118.63,1.53,1,1,1
5,1956,37.33,58.17,20.83,1,1,457.0,73.0,35581003.0,25738765.0,...,27.04,58.03,41.97,142.22,16.06,-121.39,4.77,1,1,1
6,1960,47.93,47.0,2.93,0,0,219.0,303.0,34107646.0,34227096.0,...,112.22,49.91,50.09,31.11,0.18,-28.18,2.75,0,1,1
7,1964,66.0,27.78,38.22,0,0,52.0,486.0,27146969.0,42825463.0,...,180.0,38.8,61.2,160.74,22.4,-122.52,15.82,0,1,1
8,1968,35.58,41.0,8.08,1,1,301.0,191.0,31710470.0,30898055.0,...,70.74,50.65,49.35,40.74,1.3,-32.66,6.78,1,1,1
9,1972,34.44,58.33,23.89,1,1,520.0,17.0,46740323.0,28901598.0,...,6.3,61.79,38.21,186.29,23.58,-162.4,0.31,1,1,1


In [5]:
df.columns

Index(['year', 'Democratic', 'Republican', 'Poll_Leading_Margin',
       'Poll_Leader', 'EC_election_winner', 'Republican_Electoral',
       'Democrat_Electoral', 'Republican_Popular', 'Democrat_Popular',
       'Total_Popular_Vote (Total votes cast in Presidential Election)',
       'Republican_Electoral_pct (out of 270)',
       'Democrat_Electoral_pct (out of 270)',
       'Republican_Popular_pct (Out of total votes cast in Presidential Election)',
       'Democrat_Popular_pct (Out of total votes cast in Presidential Election)',
       'Electoral_Leading_Margin (difference between dem and rep electorial pct)',
       'Popular_Leading_Margin (difference between dem and rep popular vote pct)',
       'Poll_vs_Electoral_Margin_Diff', 'Poll_vs_Popular_Margin_Diff',
       'PC_election_winner', 'EC_Poll_Accurate', 'PC_Poll_Accurate'],
      dtype='object')

In [6]:
# Reorder the DataFrame columns for clarity

Order = [
    'year', 'Poll_Leading_Margin', 'Poll_Leader', 'EC_election_winner',
    'PC_election_winner', 'EC_Poll_Accurate', 'PC_Poll_Accurate', 'Democratic', 'Republican',
    'Republican_Electoral',
    'Democrat_Electoral', 'Republican_Popular', 'Democrat_Popular',
    'Total_Popular_Vote (Total votes cast in Presidential Election)',
    'Republican_Electoral_pct (out of 270)',
    'Democrat_Electoral_pct (out of 270)',
    'Republican_Popular_pct (Out of total votes cast in Presidential Election)',
    'Democrat_Popular_pct (Out of total votes cast in Presidential Election)',
    'Electoral_Leading_Margin (difference between dem and rep electorial pct)',
    'Popular_Leading_Margin (difference between dem and rep popular vote pct)',
    'Poll_vs_Electoral_Margin_Diff', 'Poll_vs_Popular_Margin_Diff'
]

df = df[Order]


df

Unnamed: 0,year,Poll_Leading_Margin,Poll_Leader,EC_election_winner,PC_election_winner,EC_Poll_Accurate,PC_Poll_Accurate,Democratic,Republican,Republican_Electoral,...,Democrat_Popular,Total_Popular_Vote (Total votes cast in Presidential Election),Republican_Electoral_pct (out of 270),Democrat_Electoral_pct (out of 270),Republican_Popular_pct (Out of total votes cast in Presidential Election),Democrat_Popular_pct (Out of total votes cast in Presidential Election),Electoral_Leading_Margin (difference between dem and rep electorial pct),Popular_Leading_Margin (difference between dem and rep popular vote pct),Poll_vs_Electoral_Margin_Diff,Poll_vs_Popular_Margin_Diff
0,1936,5.89,0,0,0,1,1,50.33,44.44,8.0,...,27751597.0,44431180.0,2.96,193.7,37.54,62.46,190.74,24.92,-184.85,-19.03
1,1940,5.38,0,0,0,1,1,48.12,42.75,82.0,...,27244160.0,49549358.0,30.37,166.3,45.02,54.98,135.93,9.96,-130.56,-4.58
2,1944,4.89,0,0,0,1,1,49.22,44.33,99.0,...,25602504.0,47608789.0,36.67,160.0,46.22,53.78,123.33,7.56,-118.44,-2.67
3,1948,8.6,1,0,0,0,0,39.6,47.2,189.0,...,24105695.0,46074865.0,70.0,112.22,47.68,52.32,42.22,4.64,-33.62,3.96
4,1952,12.11,1,1,1,1,1,40.11,52.22,442.0,...,27314992.0,61093955.0,163.7,32.96,55.29,44.71,130.74,10.58,-118.63,1.53
5,1956,20.83,1,1,1,1,1,37.33,58.17,457.0,...,25738765.0,61319768.0,169.26,27.04,58.03,41.97,142.22,16.06,-121.39,4.77
6,1960,2.93,0,0,0,1,1,47.93,47.0,219.0,...,34227096.0,68334742.0,81.11,112.22,49.91,50.09,31.11,0.18,-28.18,2.75
7,1964,38.22,0,0,0,1,1,66.0,27.78,52.0,...,42825463.0,69972432.0,19.26,180.0,38.8,61.2,160.74,22.4,-122.52,15.82
8,1968,8.08,1,1,1,1,1,35.58,41.0,301.0,...,30898055.0,62608525.0,111.48,70.74,50.65,49.35,40.74,1.3,-32.66,6.78
9,1972,23.89,1,1,1,1,1,34.44,58.33,520.0,...,28901598.0,75641921.0,192.59,6.3,61.79,38.21,186.29,23.58,-162.4,0.31


In [7]:
# Main columns needed:
#   year, Poll_Leading_Margin, Poll_Leader, EC_election_winner,
#   PC_election_winner, EC_Poll_Accurate, PC_Poll_Accurate
#
# The other columns stay in df for future iterations of the project

# Copy to keep only the main columns
df_baseline = df[
    ['year', 'Poll_Leading_Margin', 'Poll_Leader', 'EC_election_winner', 'PC_election_winner', 'EC_Poll_Accurate', 'PC_Poll_Accurate']
].copy()


df_baseline

Unnamed: 0,year,Poll_Leading_Margin,Poll_Leader,EC_election_winner,PC_election_winner,EC_Poll_Accurate,PC_Poll_Accurate
0,1936,5.89,0,0,0,1,1
1,1940,5.38,0,0,0,1,1
2,1944,4.89,0,0,0,1,1
3,1948,8.6,1,0,0,0,0
4,1952,12.11,1,1,1,1,1
5,1956,20.83,1,1,1,1,1
6,1960,2.93,0,0,0,1,1
7,1964,38.22,0,0,0,1,1
8,1968,8.08,1,1,1,1,1
9,1972,23.89,1,1,1,1,1


In [8]:
df_baseline.to_csv('Baseline_Model_Data.csv', index=False)

# Baseline Model

For my baseline model, I want to figure out how well the average polling data for each year could have predicted who actually won each U.S. presidential election. To make this a fair test, I am only using information that would have been available throughout the election cycle as my input features. After that, I check if those features can actually predict what happened in reality.

For X, I am using `Poll_Leader` and `Poll_Leading_Margin`, both computed from each year's average polling. These show who was ahead in the polls and by how much, not just right before Election Day, but as a summary of the whole polling season. This matches what someone looking at the overall polling trend would have seen. Nothing in these features includes any information from the actual results, so I am not leaking any answers into my model.

For y, I am using `EC_election_winner`, which shows who really won the electoral college that year. This is what I am trying to predict.

This way, my setup honestly tests if you could have used the average polling numbers available in a given year to figure out who was actually going to win.
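
Before fitting any model, a simple reference point is to predict that whoever led the average polling wins. The sketch below applies that rule to only the ten elections displayed in the output above (1936 to 1972), typed in by hand, so the number it prints describes those ten rows and not the full dataset. The variable names are chosen just for this illustration.

```python
import pandas as pd

# Poll_Leader and EC_election_winner for the ten rows shown above
# (0 = Democrat, 1 = Republican)
sample = pd.DataFrame({
    "year": [1936, 1940, 1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972],
    "Poll_Leader": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
    "EC_election_winner": [0, 0, 0, 0, 1, 1, 0, 0, 1, 1],
})

# Naive rule: the poll leader is predicted to win the electoral college
hits = sample["Poll_Leader"] == sample["EC_election_winner"]
naive_accuracy = hits.mean()
missed_years = sample.loc[~hits, "year"].tolist()

print(f"Naive poll-leader accuracy on these rows: {naive_accuracy:.0%}")
print("Years where the poll leader lost:", missed_years)
```

This is the same comparison that the `EC_Poll_Accurate` column stores, and it gives a no-training number to compare the fitted models against.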


## Logistic Regression

In [12]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Reading in Data
df = pd.read_csv('Baseline_Model_Data.csv')

# Features and Target Defined
X = df[['Poll_Leader', 'Poll_Leading_Margin']]
y = df['EC_election_winner']

In [14]:
# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

Splitting the data into training and test sets lets me check if my model can generalize to new elections, not just memorize the old ones. Stratifying by the election winner helps keep the balance of Republican and Democrat wins similar in both sets, which is important with a small dataset like this.
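
To see what stratifying does, here is a small sketch with made-up labels. The 12 wins per party are an assumption for illustration, not the real class balance of my data. It shows that `stratify` keeps the same proportion of each class in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 12 Democratic wins (0) and 12 Republican wins (1)
y_demo = np.array([0] * 12 + [1] * 12)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

_, _, _, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification, the test rows keep the 50/50 class balance
print("Test set size:", len(y_test_demo))
print("Republican wins in test set:", int(y_test_demo.sum()))
```

Without `stratify`, a random split of so few rows could easily put most of one party's wins in the test set.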

In [15]:
# Initializing and training the logistic regression model
# Teaching the model to recognize patterns in the features linked to a Dem or Rep win

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In [16]:
# Use test set for prediction
y_pred = logreg.predict(X_test)

# Calculating accuracy
acc_logreg = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {acc_logreg:.2%}")

# Confusion matrix
confusion_matrix(y_test, y_pred)

Logistic Regression Accuracy: 83.33%


array([[2, 1],
       [0, 3]])

In [17]:
# Classification report (printed so the line breaks render as a table)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.67      0.80         3
           1       0.75      1.00      0.86         3

    accuracy                           0.83         6
   macro avg       0.88      0.83      0.83         6
weighted avg       0.88      0.83      0.83         6

The classification report and confusion matrix help break down how my logistic regression model is actually performing on the test data.

If you have never used a confusion matrix before, it is basically a table that shows how many times the model got each type of prediction right or wrong. Each row is the actual winner and each column is the predicted winner, with 0 for a Democratic win and 1 for a Republican win. In my case, I was testing the model on six different election years. Out of those six, the model correctly predicted two Democratic wins and three Republican wins. It only made one mistake, which was calling a Democratic win as a Republican win.

That gave me an accuracy score of 83 percent, meaning the model matched the real election outcome for 5 out of 6 elections in the test set. It is important to note that I am using the average polling leader and average poll margin for each year, not just the final poll before Election Day. My results show how well the general polling trend across the whole year lined up with the actual winner, not just the outcome from last-minute shifts.
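
The numbers in the classification report can be re-derived by hand from the confusion matrix. The sketch below does that arithmetic in plain Python, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes (0 = Democrat, 1 = Republican).

```python
# Confusion matrix from the test set: rows = actual, columns = predicted
cm = [[2, 1],
      [0, 3]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for cls in (0, 1):
    true_pos = cm[cls][cls]
    predicted_as_cls = cm[0][cls] + cm[1][cls]  # column sum
    actually_cls = sum(cm[cls])                 # row sum
    metrics[cls] = {
        "precision": true_pos / predicted_as_cls,
        "recall": true_pos / actually_cls,
    }

print(f"accuracy: {accuracy:.2f}")
for cls, m in metrics.items():
    print(f"class {cls}: precision={m['precision']:.2f}, recall={m['recall']:.2f}")
```

The recall of 0.67 for class 0 is the single missed Democratic win: two of the three actual Democratic wins in the test set were predicted correctly.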

Also, this accuracy score is just what my model achieved on this specific test split. With so few elections, every single mistake changes the percentage a lot. If I split the data differently or tested on a different batch of years, the number could go up or down. This is just a snapshot of how well the average polling picture for each year tracked the winner, not a universal statement about the predictive power of polls for every presidential election.
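
One way to see how much the score depends on the split is to repeat the train/test split with different seeds. The sketch below does this on only the ten elections shown in the tables above (1936 to 1972), typed in by hand, so the numbers it prints are not comparable to the 83 percent from the full dataset. The range of 20 seeds is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Ten rows from the tables above:
# features = [Poll_Leader, Poll_Leading_Margin], target = EC_election_winner
X_small = np.array([
    [0, 5.89], [0, 5.38], [0, 4.89], [1, 8.60], [1, 12.11],
    [1, 20.83], [0, 2.93], [0, 38.22], [1, 8.08], [1, 23.89],
])
y_small = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 1])

scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_small, y_small, test_size=0.25, random_state=seed, stratify=y_small
    )
    model = LogisticRegression().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(f"min={min(scores):.2f}, max={max(scores):.2f}, mean={np.mean(scores):.2f}")
```

The spread between the minimum and maximum is a rough picture of how unstable a single-split accuracy is at this sample size.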