# Florida Electoral Analysis

## Scope:
This project aims to explore the changes in the political geography of Florida over roughly the past decade (2012-2022). It will use vote totals for the major parties' top-of-ticket candidates (President or Governor, depending on the year) in each election broken down by county to answer the following questions:

1. Which counties had the `largest change in vote share` between presidential election years? What about midterm years?
2. Which counties were the `most stable for each major party`? Which were the `least stable`?
3. Which county or counties were `most consistent in predicting the outcome of the top-of-ticket race` in a given year?

## Data Source:
For this analysis, I will be drawing data from the Florida Division of Elections precinct-level election results (https://dos.myflorida.com/elections/data-statistics/elections-data/precinct-level-election-results/). The raw precinct-level data will be cleaned, aggregated to the county level, and reformatted for easier analysis.

## Notes:
1. The raw data for this analysis required a significant amount of cleaning, with each election year's data having its own unique challenges and inconsistencies (especially prior to 2016, when there was clearly an intentional effort to standardize most aspects of the data provided by county SOEs). This notebook and the code herein represent an idealized (or at least abridged) version of this analysis based on outcomes and learnings from a much longer data exploration and cleaning process. 
2. Notebooks containing the initial, longer exploration and cleaning processes for each individual election year's data can be found in the `misc` folder of this repository.
3. The initial scope of this project was much broader, including looking at U.S. House, U.S. Senate, State House, and State Senate races at the more granular precinct level. The year-specific cleaning and exploration notebooks reflect this initial scope.

# Step 1: Ingest Data

Input for this step is the raw data, in the form of ~400 tab-delimited tables (one for each of Florida's 67 counties, for each of 6 election years).

The output of this step is a dataframe containing the merged raw data from the original ~400 tab-delimited files. The resulting dataframe contains ~3.6m rows across 19 columns.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import csv
import os
import glob

In [2]:
# Combine raw tab-delimited files into a master dataframe

# Column names to assign to the raw files
RAW_COLUMNS = ['county_code', 'county_name', 'elec_num', 'elec_date', 'elec_name',
               'precinct_id', 'poll_loc', 'total_reg', 'total_reg_r', 'total_reg_d',
               'total_reg_other', 'contest_name', 'district', 'contest_code',
               'cand_or_issue', 'cand_party', 'cand_id', 'doe_num', 'vote_total']

# Define merge function
def merge_files(directory):
    target_files = glob.glob(directory)
    # Read every file first, then concatenate once instead of growing a dataframe in the loop
    frames = [pd.read_table(file, names=RAW_COLUMNS, encoding_errors='replace')
              for file in target_files]
    return pd.concat(frames) if frames else pd.DataFrame()

# Define directory path
location = "Raw Data\\FL 20** by Precinct\\*"

# Run the function
df = merge_files(location)

In [3]:
# Check output
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3606146 entries, 0 to 1031
Data columns (total 19 columns):
 #   Column           Dtype  
---  ------           -----  
 0   county_code      object 
 1   county_name      object 
 2   elec_num         int64  
 3   elec_date        object 
 4   elec_name        object 
 5   precinct_id      object 
 6   poll_loc         object 
 7   total_reg        int64  
 8   total_reg_r      int64  
 9   total_reg_d      int64  
 10  total_reg_other  int64  
 11  contest_name     object 
 12  district         object 
 13  contest_code     float64
 14  cand_or_issue    object 
 15  cand_party       object 
 16  cand_id          float64
 17  doe_num          float64
 18  vote_total       int64  
dtypes: float64(3), int64(6), object(10)
memory usage: 550.3+ MB


# Step 2: Clean Data

For additional detail, the notebooks in the `misc` directory of this repository walk through each election year's data and identify the issues that are rectified in the preprocessing step below.

In [4]:
# Define data cleaning functions
# Drop duplicates
def drop_duplicates(df):
    df.drop_duplicates(keep='first', inplace=True)
    return df


# Drop unneeded columns
def drop_columns(df):
    df = df.drop(columns=['total_reg_r',\
                           'total_reg_d',\
                           'total_reg_other',\
                           'elec_num',\
                           'poll_loc',\
                           'doe_num'])
    return df


# Standardize race names
def standardize_race_names(df):
    df['contest_name'] = df['contest_name']\
    .replace(dict.fromkeys(['PRESIDENT OF THE UNITED STATES'], 'President of the United States'))\
    .replace(dict.fromkeys(['GOVERNOR AND  LT.GOVERNOR', 'Governor'], 'Governor and Lieutenant Governor'))
    return df


# Narrow to top-of-ticket races
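def select_races(df):
    # .copy() returns an independent frame, so later column assignments
    # on the result do not trigger SettingWithCopyWarning
    df = df[df.contest_name.isin(['President of the United States',
                                  'Governor and Lieutenant Governor'])].copy()
    return df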


# Combine above functions into master cleaning function
def data_cleaning_pipeline(df):
    df = drop_duplicates(df)
    df = drop_columns(df)
    df = standardize_race_names(df)
    df = select_races(df)
    return df

In [5]:
# Run data cleaning functions
df_cleaned = data_cleaning_pipeline(df)

# Check output
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 320832 entries, 0 to 215
Data columns (total 13 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   county_code    320832 non-null  object 
 1   county_name    320832 non-null  object 
 2   elec_date      320832 non-null  object 
 3   elec_name      320832 non-null  object 
 4   precinct_id    320778 non-null  object 
 5   total_reg      320832 non-null  int64  
 6   contest_name   320832 non-null  object 
 7   district       312588 non-null  object 
 8   contest_code   320832 non-null  float64
 9   cand_or_issue  320832 non-null  object 
 10  cand_party     317949 non-null  object 
 11  cand_id        319875 non-null  float64
 12  vote_total     320832 non-null  int64  
dtypes: float64(2), int64(2), object(9)
memory usage: 34.3+ MB


Now for a final pass over each column looking for errors and non-standard data (inconsistent date formats, candidate name misspellings, etc.).
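A pass like this can be made repeatable with a small helper that reports the distinct values in each column of interest. The following is only a sketch, not the code used in this notebook: `summarize_unique` and the toy dataframe are hypothetical names introduced for illustration, and the toy values merely mimic the kinds of inconsistencies found in the cells below.

```python
import pandas as pd

def summarize_unique(frame, columns, max_values=20):
    """Return {column: (number of distinct values, sample of those values)}."""
    summary = {}
    for col in columns:
        values = frame[col].unique()
        summary[col] = (len(values), list(values[:max_values]))
    return summary

# Toy data standing in for df_cleaned (illustrative values only)
toy = pd.DataFrame({
    'county_name': ['Seminole', 'SEMINOLE', 'Alachua'],
    'elec_date': ['11/08/2016', ' 11/08/2016', '11/8/2016'],
})

audit = summarize_unique(toy, ['county_name', 'elec_date'])
for col, (count, values) in audit.items():
    print(col, count, values)
```

A count higher than expected (for example, 68 county names for 67 counties) is the signal to inspect that column more closely.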

In [6]:
# Check county_code
county_codes = df_cleaned['county_code'].unique()
print(county_codes)
print(len(county_codes))

['ALA' 'BAK' 'BAY' 'BRA' 'BRE' 'BRO' 'CAL' 'CHA' 'CIT' 'CLA' 'CLL' 'CLM'
 'DAD' 'DES' 'DIX' 'DUV' 'ESC' 'FLA' 'FRA' 'GAD' 'GIL' 'GLA' 'GUL' 'HAM'
 'HAR' 'HEN' 'HER' 'HIG' 'HIL' 'HOL' 'IND' 'JAC' 'JEF' 'LAF' 'LAK' 'LEE'
 'LEO' 'LEV' 'LIB' 'MAD' 'MAN' 'MON' 'MRN' 'MRT' 'NAS' 'OKA' 'OKE' 'ORA'
 'OSC' 'PAL' 'PAS' 'PIN' 'POL' 'PUT' 'SAN' 'SAR' 'SEM' 'STJ' 'STL' 'SUM'
 'SUW' 'TAY' 'UNI' 'VOL' 'WAK' 'WAL' 'WAS']
67


In [7]:
# Check county_name
county_names = df_cleaned['county_name'].unique()
print(county_names)
print(len(county_names))

['Alachua' 'Baker' 'Bay' 'Bradford' 'Brevard' 'Broward' 'Calhoun'
 'Charlotte' 'Citrus' 'Clay' 'Collier' 'Columbia' 'Miami-Dade' 'Desoto'
 'Dixie' 'Duval' 'Escambia' 'Flagler' 'Franklin' 'Gadsden' 'Gilchrist'
 'Glades' 'Gulf' 'Hamilton' 'Hardee' 'Hendry' 'Hernando' 'Highlands'
 'Hillsborough' 'Holmes' 'Indian River' 'Jackson' 'Jefferson' 'Lafayette'
 'Lake' 'Lee' 'Leon' 'Levy' 'Liberty' 'Madison' 'Manatee' 'Monroe'
 'Marion' 'Martin' 'Nassau' 'Okaloosa' 'Okeechobee' 'Orange' 'Osceola'
 'Palm Beach' 'Pasco' 'Pinellas' 'Polk' 'Putnam' 'Santa Rosa' 'Sarasota'
 'Seminole' 'St. Johns' 'St. Lucie' 'Sumter' 'Suwannee' 'Taylor' 'Union'
 'Volusia' 'Wakulla' 'Walton' 'Washington' 'SEMINOLE']
68


In [8]:
# Clean county_name
df_cleaned['county_name'] = df_cleaned['county_name'].replace(
    to_replace='SEMINOLE',\
    value='Seminole')

In [9]:
# Check elec_date
df_cleaned['elec_date'].unique()

array(['11/06/2012', '11/04/2014', '11/4/2014', '11/08/2016', '11/8/2016',
       ' 11/08/2016', '11/06/2018', '11/03/2020', '11/08/2022'],
      dtype=object)

In [10]:
# Clean elec_date
df_cleaned['elec_date'] = df_cleaned['elec_date'].replace(
    to_replace=['11/4/2014','11/8/2016',' 11/08/2016'],\
    value=['11/04/2014','11/08/2016','11/08/2016']
)

df_cleaned['elec_date'] = pd.to_datetime(df_cleaned['elec_date']).dt.year

In [12]:
# Check elec_name
df_cleaned['elec_name'].unique()

array(['2012 General Election', '2014 General Election',
       '2016 General Election', '2016 GENERAL ELECTION',
       '2018 General Election', '2020 General Election',
       '2022 General Election'], dtype=object)

In [13]:
# Clean elec_name
df_cleaned['elec_name'] = df_cleaned['elec_name'].replace(
    to_replace='2016 GENERAL ELECTION',\
    value='2016 General Election'
)


In [14]:
# Check contest_name
df_cleaned['contest_name'].unique()

array(['President of the United States',
       'Governor and Lieutenant Governor'], dtype=object)

In [15]:
# Check district
df_cleaned['district'].unique()

array([' ', nan], dtype=object)

In [16]:
# Clean district
df_cleaned['district'] = df_cleaned['district'].replace(
    to_replace=' ',\
    value='Statewide')\
    .fillna('Statewide')

In [17]:
# Check cand_or_issue
df_cleaned['cand_or_issue'].unique()

array(['Romney / Ryan', 'Obama / Biden', 'Stevens / Link',
       'Johnson / Gray', 'Goode, / Clymer', 'Stein / Honkala',
       'WriteinVotes', 'UnderVotes', 'Barr / Sheehan', 'OverVotes',
       'Alexander / Mendoza', 'Anderson / Rodriguez', 'Hoefling / Ellis',
       'Barnett / Cross', 'Lindsay / Osorio', 'Times Over Voted',
       'Number of Under Votes', 'Scott / Lopez-Cantera', 'Crist / Taddeo',
       'Wyllie / Roe', 'Khavari / Jones', 'Burkett / Matos',
       'Adrian Wyllie', 'Charlie Crist', 'Farid Khavari', 'Glenn Burkett',
       'Rick Scott', 'Write-in', 'Trump / Pence', 'Clinton / Kaine',
       'Johnson / Weld', 'Castle / Bradley', 'Stein / Baraka',
       'De La Fuente / Steinberg', 'WriteInVotes', 'Darrell L. Castle',
       'Donald J. Trump', 'Gary Johnson', 'Hillary R. Clinton',
       'Jill Stein', 'Roque De La Fuente', 'Times Blank Voted',
       'DeSantis / Nuñez', 'Gillum / King', 'Richardson / Argenziano',
       'Gibson / Wilds', 'Foley / Tutton', 'Stanley / Mc

In [18]:
# Clean cand_or_issue
# Standardize candidate names
df_cleaned['cand_or_issue'] = df_cleaned['cand_or_issue'].replace(
    to_replace=['Adrian Wyllie',\
                'Charlie Crist',\
                'Farid Khavari',\
                'Glenn Burkett',\
                'Rick Scott',\
                'Darrell L. Castle',\
                'Donald J. Trump',\
                'Gary Johnson',\
                'Hillary R. Clinton',\
                'Jill Stein',\
                'Roque De La Fuente',\
                'WriteinVotes',\
                'WriteInVotes'],\
    value=['Wyllie / Roe',\
            'Crist / Taddeo',\
            'Khavari / Jones',\
            'Burkett / Matos',\
            'Scott / Lopez-Cantera',\
            'Castle / Bradley',\
            'Trump / Pence',\
            'Johnson / Weld',\
            'Clinton / Kaine',\
            'Stein / Baraka',\
            'De La Fuente / Steinberg',\
            'Write-in',\
            'Write-in'])

# Standardize over/undervote counts
df_cleaned['cand_or_issue'] = df_cleaned['cand_or_issue'].replace(
    to_replace= ['Times Over Voted',\
                'Number of Under Votes',\
                'Times Blank Voted'],
    value=['OverVotes',\
           'UnderVotes',\
           'Blank Votes']
)

In [19]:
# Drop over/undervote counts, blank vote counts
df_cleaned = df_cleaned[~df_cleaned['cand_or_issue'].isin(['UnderVotes','OverVotes','Blank Votes'])]


In [20]:
# Check cand_party
df_cleaned['cand_party'].unique()


array(['REP', 'DEM', 'OBJ', 'LBT', 'CPF', 'GRE', ' ', 'PFP', 'SOC', 'JPF',
       'AIP', 'REF', 'PSL', 'LPF', 'NPA', 'NP', nan], dtype=object)

In [21]:
# Clean cand_party
df_cleaned['cand_party'] = df_cleaned['cand_party'].replace(
    to_replace=' ',\
    value='N/A')\
    .fillna('N/A')

In [22]:
# Reset Index
df_cleaned.reset_index(inplace=True)

In [23]:
df_cleaned.head(10)

Unnamed: 0,index,county_code,county_name,elec_date,elec_name,precinct_id,total_reg,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,vote_total
0,0,ALA,Alachua,2012,2012 General Election,1,1411,President of the United States,Statewide,100000.0,Romney / Ryan,REP,0.0,608
1,1,ALA,Alachua,2012,2012 General Election,1,1411,President of the United States,Statewide,100000.0,Obama / Biden,DEM,0.0,381
2,2,ALA,Alachua,2012,2012 General Election,1,1411,President of the United States,Statewide,100000.0,Stevens / Link,OBJ,0.0,1
3,3,ALA,Alachua,2012,2012 General Election,1,1411,President of the United States,Statewide,100000.0,Johnson / Gray,LBT,0.0,6
4,4,ALA,Alachua,2012,2012 General Election,1,1411,President of the United States,Statewide,100000.0,"Goode, / Clymer",CPF,0.0,1
5,5,ALA,Alachua,2012,2012 General Election,1,1411,President of the United States,Statewide,100000.0,Stein / Honkala,GRE,0.0,3
6,6,ALA,Alachua,2012,2012 General Election,1,1411,President of the United States,Statewide,100000.0,Write-in,,0.0,6
7,108,ALA,Alachua,2012,2012 General Election,2,1988,President of the United States,Statewide,100000.0,Romney / Ryan,REP,0.0,777
8,109,ALA,Alachua,2012,2012 General Election,2,1988,President of the United States,Statewide,100000.0,Obama / Biden,DEM,0.0,725
9,110,ALA,Alachua,2012,2012 General Election,2,1988,President of the United States,Statewide,100000.0,Johnson / Gray,LBT,0.0,9


# Step 3: Reformat Data

Data is fully cleaned, but not yet in a format suitable for easy analysis at the county level. 

The key values for the guiding research questions are the margins of victory for the Democratic and Republican candidates in each county in each year, and how those margins change over time. Let's reformat the data to make those values easy to view.

In [232]:
# Group by: election, county, contest, party, and candidate
df_grouped = df_cleaned.groupby(['elec_date',\
                                'county_name',\
                                'contest_name',\
                                'cand_party',\
                                'cand_or_issue'],\
            as_index=False)\
            ['vote_total'].sum()

In [233]:
# Add column for vote share
df_grouped['vote_share'] = df_grouped.groupby(['elec_date', 'county_name'])['vote_total'].transform(lambda x: x / x.sum())

In [234]:
df_grouped.head(20)

Unnamed: 0,elec_date,county_name,contest_name,cand_party,cand_or_issue,vote_total,vote_share
0,2012,Alachua,President of the United States,AIP,Hoefling / Ellis,16,0.000132
1,2012,Alachua,President of the United States,CPF,"Goode, / Clymer",43,0.000356
2,2012,Alachua,President of the United States,DEM,Obama / Biden,69699,0.577107
3,2012,Alachua,President of the United States,GRE,Stein / Honkala,344,0.002848
4,2012,Alachua,President of the United States,JPF,Anderson / Rodriguez,27,0.000224
5,2012,Alachua,President of the United States,LBT,Johnson / Gray,1306,0.010814
6,2012,Alachua,President of the United States,,Write-in,356,0.002948
7,2012,Alachua,President of the United States,OBJ,Stevens / Link,46,0.000381
8,2012,Alachua,President of the United States,PFP,Barr / Sheehan,102,0.000845
9,2012,Alachua,President of the United States,PSL,Lindsay / Osorio,4,3.3e-05


In [235]:
df_grouped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2980 entries, 0 to 2979
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   elec_date      2980 non-null   int32  
 1   county_name    2980 non-null   object 
 2   contest_name   2980 non-null   object 
 3   cand_party     2980 non-null   object 
 4   cand_or_issue  2980 non-null   object 
 5   vote_total     2980 non-null   int64  
 6   vote_share     2980 non-null   float64
dtypes: float64(1), int32(1), int64(1), object(4)
memory usage: 151.5+ KB


### Step 3.5: (Light) Exploratory Data Analysis

The data must still be reformatted in order to answer our initial research questions, but the current shape of the data is such that we can do some light EDA that we won't be able to do as readily once we transform the data in the following steps.

In [236]:
# Sum all votes cast statewide for each election year and sort by ascending values
df_grouped.groupby('elec_date')['vote_total'].sum().sort_values(ascending=True)

elec_date
2014     5962422
2022     7771398
2018     8228238
2012     8486746
2016     9504468
2020    11090844
Name: vote_total, dtype: int64

`Interesting Finding:` The 2022 midterm looks like it had lower turnout than might be expected. About 457,000 fewer top-of-ticket votes were cast in the 2022 midterm (7,771,398) than in the 2018 midterm (8,228,238), a drop of roughly 5.6%.
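The size of the 2018-to-2022 drop can be checked directly from the totals printed above. This is a minimal sketch that re-enters those six totals by hand instead of reusing `df_grouped`; the variable names are introduced here for illustration.

```python
import pandas as pd

# Statewide top-of-ticket vote totals, copied from the output above
totals = pd.Series({2014: 5962422, 2022: 7771398, 2018: 8228238,
                    2012: 8486746, 2016: 9504468, 2020: 11090844},
                   name='vote_total')

# Absolute and relative drop between the two most recent midterms
midterm_drop = totals[2018] - totals[2022]
pct_drop = midterm_drop / totals[2018] * 100
print(midterm_drop, round(pct_drop, 1))
```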

In [237]:
# Sum all votes cast by county for each election year
df_grouped.groupby(['elec_date', 'county_name'])['vote_total'].sum()

elec_date  county_name
2012       Alachua        120773
           Baker           11388
           Bay             80101
           Bradford        11664
           Brevard        286428
                           ...  
2022       Union            4558
           Volusia        226246
           Wakulla         15063
           Walton          34901
           Washington       9126
Name: vote_total, Length: 402, dtype: int64

`Interesting Finding:` The above data could be used to analyze the counties with the largest changes in overall turnout year to year, or coupled with the `df_grouped` data to analyze changes in turnout by party. 
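As a rough sketch of that turnout comparison, the long-format totals could be pivoted so each year becomes a column, after which the change between two elections is a simple column operation. The dataframe below is made-up toy data (counties `A` and `B` are hypothetical), not values from this analysis.

```python
import pandas as pd

# Hypothetical county totals in the same long format as the output above
toy_totals = pd.DataFrame({
    'elec_date': [2018, 2018, 2022, 2022],
    'county_name': ['A', 'B', 'A', 'B'],
    'vote_total': [1000, 500, 900, 600],
})

# One row per county, one column per election year
turnout = toy_totals.pivot(index='county_name', columns='elec_date', values='vote_total')

# Percent change in votes cast from 2018 to 2022, largest increase first
turnout_change = (turnout[2022] - turnout[2018]) / turnout[2018] * 100
print(turnout_change.sort_values(ascending=False))
```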

In [238]:
# Narrow to Democratic and Republican vote totals (i.e. remove third-party and write-in data, which are outside the scope of our analysis)
df_grouped_narrowed = df_grouped.query('cand_party == "DEM" or cand_party == "REP"').copy()
df_grouped_narrowed


Unnamed: 0,elec_date,county_name,contest_name,cand_party,cand_or_issue,vote_total,vote_share
2,2012,Alachua,President of the United States,DEM,Obama / Biden,69699,0.577107
11,2012,Alachua,President of the United States,REP,Romney / Ryan,48797,0.404039
15,2012,Baker,President of the United States,DEM,Obama / Biden,2310,0.202845
23,2012,Baker,President of the United States,REP,Romney / Ryan,8974,0.788022
27,2012,Bay,President of the United States,DEM,Obama / Biden,22051,0.275290
...,...,...,...,...,...,...,...
2971,2022,Wakulla,Governor and Lieutenant Governor,REP,DeSantis / Nuñez,11033,0.732457
2972,2022,Walton,Governor and Lieutenant Governor,DEM,Crist / Hernandez,6112,0.175124
2975,2022,Walton,Governor and Lieutenant Governor,REP,DeSantis / Nuñez,28647,0.820807
2976,2022,Washington,Governor and Lieutenant Governor,DEM,Crist / Hernandez,1285,0.140806


# Step 3: Reformat Data (cont'd)

`Note: Branching Paths`

At this point it is worth noting that there will be two parallel paths explored in the coming sections. The process and operations for each will be similar and the data in each is related, but it is important to understand how they differ and why they are both significant:
- The first path will examine the <b>vote margins</b> (labeled as <b>D_margin</b>) for each county in each year, which represents <b>by how many percentage points the Democratic candidate won the county</b>. A negative number indicates the number of percentage points by which the Republican candidate won the county. <p>
    E.g. a D_margin value of 5 means the Democrat outperformed the Republican by 5 points, as in a case where the Democrat received 52.5% of the vote and the Republican received 47.5%.
    E.g. a D_margin value of -5 means the Democrat underperformed the Republican by 5 points, as in a case where the Democrat received 47.5% of the vote and the Republican received 52.5%.
- The second path will examine the <b>net votes</b> (labeled as <b>D_net_votes</b>) for each county in each year, which represents <b>how many more votes the Democratic candidate received than the Republican candidate in a given county</b>. A negative number indicates how many more votes the Republican candidate received than the Democratic candidate in a given county.
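The two measures can be illustrated with a short worked example. The helper functions below are hypothetical and exist only to make the arithmetic explicit; the notebook itself computes these values with `groupby` operations. The vote totals in the last line are Alachua County's 2012 figures from the table above.

```python
def d_margin(dem_share, rep_share):
    """Democratic margin in percentage points; negative means the Republican led."""
    return dem_share - rep_share

def d_net_votes(dem_votes, rep_votes):
    """Democratic votes minus Republican votes; negative means the Republican led."""
    return dem_votes - rep_votes

print(d_margin(52.5, 47.5))        # Democrat ahead by 5 points
print(d_margin(47.5, 52.5))        # Republican ahead by 5 points
print(d_net_votes(69699, 48797))   # Alachua, 2012
```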

In [239]:
# Subtract R vote total from D vote total to get D net votes
df_grouped_narrowed['vote_total'] *= np.where(df_grouped_narrowed.cand_party == 'DEM', 1, -1)
D_net_votes_by_county = df_grouped_narrowed.groupby(['elec_date', 'county_name'], as_index=False)['vote_total'].agg('sum')
D_net_votes_by_county

Unnamed: 0,elec_date,county_name,vote_total
0,2012,Alachua,20902
1,2012,Baker,-6664
2,2012,Bay,-34825
3,2012,Bradford,-4894
4,2012,Brevard,-36307
...,...,...,...
397,2022,Union,-3451
398,2022,Volusia,-64803
399,2022,Wakulla,-7113
400,2022,Walton,-22535


In [240]:
# Rename 'vote_total' column to 'D_net_votes' to reflect the operation in previous code block
D_net_votes_by_county.rename(columns= {'vote_total' : 'D_net_votes'}, inplace=True)

In [241]:
# Create pivot table with elections vs. counties and D_net_votes as values
net_votes_by_year_county = D_net_votes_by_county.pivot_table(index='county_name', columns='elec_date', values='D_net_votes')

# Change column headers (currently years as ints) to strings
net_votes_by_year_county.columns = net_votes_by_year_county.columns.map(str)

In [242]:
# Add columns for relevant summary statistics
net_votes_by_year_county['mean'] = net_votes_by_year_county.loc[:, '2012':'2022'].mean(axis=1).round(2)
net_votes_by_year_county['variance'] = net_votes_by_year_county.loc[:, '2012':'2022'].var(axis=1).round(3)
net_votes_by_year_county['std'] = net_votes_by_year_county.loc[:, '2012':'2022'].std(axis=1).round(3)
net_votes_by_year_county

county_name,2012,2014,2016,2018,2020,2022,mean,variance,std
Alachua,20902,12955,28986,31416,38732,14475,24577.67,1.034404e+08,10170.567
Baker,-6664,-3856,-8182,-6928,-9874,-8502,-7334.33,4.249029e+06,2061.317
Bay,-34825,-27966,-40397,-28952,-40483,-38499,-35187.00,3.146203e+07,5609.103
Bradford,-4894,-2931,-5989,-5027,-7174,-6494,-5418.17,2.235996e+06,1495.325
Brevard,-36307,-25602,-62169,-47292,-59334,-75431,-51022.50,3.331530e+08,18252.478
...,...,...,...,...,...,...,...,...,...
Union,-2641,-875,-3554,-2869,-4080,-3451,-2911.67,1.258313e+06,1121.746
Volusia,-2742,-6434,-33916,-24325,-43246,-64803,-29244.33,5.458349e+08,23363.111
Wakulla,-4115,-1884,-6164,-5608,-7523,-7113,-5401.17,4.416386e+06,2101.520
Walton,-14819,-10821,-18880,-15906,-22609,-22535,-17595.00,2.152820e+07,4639.849


In [243]:
# Subtract R vote share from D vote share and group by election, county to get the D margin of victory for each county
df_grouped_narrowed['vote_share'] *= np.where(df_grouped_narrowed.cand_party == 'DEM', 1, -1)
D_margin_by_county = df_grouped_narrowed.groupby(['elec_date', 'county_name'], as_index=False)['vote_share'].agg('sum')
D_margin_by_county

Unnamed: 0,elec_date,county_name,vote_share
0,2012,Alachua,0.173068
1,2012,Baker,-0.585177
2,2012,Bay,-0.434764
3,2012,Bradford,-0.419582
4,2012,Brevard,-0.126758
...,...,...,...
397,2022,Union,-0.757130
398,2022,Volusia,-0.286427
399,2022,Wakulla,-0.472217
400,2022,Walton,-0.645684


In [244]:
# Rename 'vote_share' column to 'D_margin' to reflect the operation in previous code block
D_margin_by_county.rename(columns= {'vote_share' : 'D_margin'}, inplace=True)

# Multiply by 100 and round to 2 decimal places so values are more recognizable as percentages
D_margin_by_county['D_margin'] = (D_margin_by_county['D_margin'] * 100).round(decimals=2)
D_margin_by_county

Unnamed: 0,elec_date,county_name,D_margin
0,2012,Alachua,17.31
1,2012,Baker,-58.52
2,2012,Bay,-43.48
3,2012,Bradford,-41.96
4,2012,Brevard,-12.68
...,...,...,...
397,2022,Union,-75.71
398,2022,Volusia,-28.64
399,2022,Wakulla,-47.22
400,2022,Walton,-64.57


In [245]:
# Create pivot table with elections vs. counties and D_margin as values
vote_share_by_year_county = D_margin_by_county.pivot_table(index='county_name', columns='elec_date', values='D_margin')

# Change column headers (currently years as ints) to strings
vote_share_by_year_county.columns = vote_share_by_year_county.columns.map(str)


# Add columns for relevant summary statistics
vote_share_by_year_county['mean'] = vote_share_by_year_county.loc[:, '2012':'2022'].mean(axis=1)
vote_share_by_year_county['variance'] = vote_share_by_year_county.loc[:, '2012':'2022'].var(axis=1)
vote_share_by_year_county['std'] = vote_share_by_year_county.loc[:, '2012':'2022'].std(axis=1)
vote_share_by_year_county

county_name,2012,2014,2016,2018,2020,2022,mean,variance,std
Alachua,17.31,16.52,22.28,27.24,27.08,15.09,20.920000,29.223720,5.405897
Baker,-58.52,-45.50,-64.40,-65.69,-70.12,-79.27,-63.916667,129.176427,11.365581
Bay,-43.48,-48.93,-45.79,-45.66,-43.43,-57.38,-47.445000,27.735230,5.266425
Bradford,-41.96,-34.22,-49.26,-47.82,-52.56,-63.25,-48.178333,96.231057,9.809743
Brevard,-12.68,-11.51,-19.54,-16.78,-16.41,-28.20,-17.520000,35.893640,5.991130
...,...,...,...,...,...,...,...,...,...
Union,-48.84,-17.40,-62.11,-58.86,-65.27,-75.71,-54.698333,410.197657,20.253337
Volusia,-1.17,-3.65,-12.88,-10.62,-14.04,-28.64,-11.833333,94.114947,9.701286
Wakulla,-28.00,-16.27,-39.91,-39.38,-40.78,-47.22,-35.260000,125.087320,11.184244
Walton,-51.85,-52.93,-55.70,-52.40,-51.62,-64.57,-54.845000,24.864510,4.986433


`Note:` Summary statistics that compare data between counties (e.g. mean, median, mode, variance, or std for a given year) won't be particularly meaningful and are therefore excluded.

This analysis aims to answer questions relating to the behavior of individual counties and treats counties as the unit of observation. It doesn't account for the relative populations (or, in the context of this data, vote totals) of each county. Given that there are many small counties with a vote share that favors Republicans (negative margins in the above table) and fewer but much more populous counties that favor Democrats (positive margins in the above table), such statistics will skew towards the more numerous (but significantly less populous) Republican-leaning counties.
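The skew described here can be seen by comparing an unweighted mean of county margins with a mean weighted by votes cast. The numbers below are invented purely for illustration (three hypothetical counties) and are not drawn from the Florida data.

```python
import numpy as np

# Hypothetical margins (percentage points) and total votes for three counties:
# two small Republican-leaning counties and one large Democratic-leaning county
margins = np.array([-60.0, -50.0, 20.0])
votes = np.array([10_000, 20_000, 500_000])

# Treats every county equally, regardless of size
unweighted_mean = margins.mean()

# Weights each county's margin by the number of votes cast there
weighted_mean = np.average(margins, weights=votes)

print(round(unweighted_mean, 2), round(weighted_mean, 2))
```

In this toy case the unweighted mean is strongly negative while the vote-weighted mean is positive, which is why cross-county summary statistics are left out of the tables above.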

# Step 5: Answering Research Questions

Now that the data is cleaned and formatted, we can begin answering our research questions.

For reference, our key questions are:
1. Which counties had the `largest change in vote share` between presidential election years? What about midterm years?
2. Which counties were the `most stable for each major party`? Which were the `least stable`?
3. Which county or counties were `most consistent in predicting the outcome of the top-of-ticket race` in a given year?

### QUESTION 1: Which counties had the `largest change in vote share` between presidential election years? What about midterm years?

The question here is really which counties look the most different, in terms of D_margin, in 2020 compared with 2012 (for presidential years) and in 2022 compared with 2014 (for midterm years). To answer this, we can add two columns holding those differences to our current dataframe and then sort on those columns.

A positive value indicates the number of percentage points by which a county's margin has moved towards Democrats, and a negative value indicates the number of percentage points by which it has moved towards Republicans.

In [246]:
vote_share_by_year_county['2012_to_2020_diff'] = vote_share_by_year_county['2020'] - vote_share_by_year_county['2012']
vote_share_by_year_county['2014_to_2022_diff'] = vote_share_by_year_county['2022'] - vote_share_by_year_county['2014']
vote_share_by_year_county

county_name,2012,2014,2016,2018,2020,2022,mean,variance,std,2012_to_2020_diff,2014_to_2022_diff
Alachua,17.31,16.52,22.28,27.24,27.08,15.09,20.920000,29.223720,5.405897,9.77,-1.43
Baker,-58.52,-45.50,-64.40,-65.69,-70.12,-79.27,-63.916667,129.176427,11.365581,-11.60,-33.77
Bay,-43.48,-48.93,-45.79,-45.66,-43.43,-57.38,-47.445000,27.735230,5.266425,0.05,-8.45
Bradford,-41.96,-34.22,-49.26,-47.82,-52.56,-63.25,-48.178333,96.231057,9.809743,-10.60,-29.03
Brevard,-12.68,-11.51,-19.54,-16.78,-16.41,-28.20,-17.520000,35.893640,5.991130,-3.73,-16.69
...,...,...,...,...,...,...,...,...,...,...,...
Union,-48.84,-17.40,-62.11,-58.86,-65.27,-75.71,-54.698333,410.197657,20.253337,-16.43,-58.31
Volusia,-1.17,-3.65,-12.88,-10.62,-14.04,-28.64,-11.833333,94.114947,9.701286,-12.87,-24.99
Wakulla,-28.00,-16.27,-39.91,-39.38,-40.78,-47.22,-35.260000,125.087320,11.184244,-12.78,-30.95
Walton,-51.85,-52.93,-55.70,-52.40,-51.62,-64.57,-54.845000,24.864510,4.986433,0.23,-11.64


In [247]:
# Top 10 counties that shifted towards Democrats during presidential election years
vote_share_by_year_county['2012_to_2020_diff'].sort_values(ascending=False).head(10)

county_name
St. Johns     11.06
Okaloosa      10.15
Alachua        9.77
Seminole       9.44
Clay           8.69
Duval          7.40
Santa Rosa     6.02
Escambia       5.42
Collier        5.38
Orange         4.84
Name: 2012_to_2020_diff, dtype: float64

In [248]:
# Top 10 counties that shifted towards Republicans during presidential election years
vote_share_by_year_county['2012_to_2020_diff'].sort_values(ascending=True).head(10)

county_name
Glades       -27.56
Okeechobee   -24.30
Hernando     -21.67
Dixie        -19.24
Citrus       -19.13
Liberty      -19.04
Calhoun      -18.48
Desoto       -17.69
Hendry       -17.07
Putnam       -16.78
Name: 2012_to_2020_diff, dtype: float64

In [249]:
# Map 2012_to_2020 diff 

In [250]:
# Graph this data for all counties
vote_share_by_year_county["2012_to_2020_diff"] = vote_share_by_year_county["2012_to_2020_diff"].astype(float)
 
fig_bar = px.bar(vote_share_by_year_county, 
                             x=vote_share_by_year_county.index, 
                             y='2012_to_2020_diff', 
                             title='Change in Vote Share in Presidential Elections (2012 to 2020)',
                             color='2012_to_2020_diff',
                             color_continuous_scale=px.colors.sequential.Bluered_r,
                             color_continuous_midpoint=0,
                             labels={'2012_to_2020_diff' : 'Change in Vote Share (%)', 'county_name' : 'County Name'})


fig_bar.show()

The counties that shifted most towards Democrats are concentrated in eastern central and northeast Florida. The counties that shifted most towards Republicans are concentrated in the western central and Big Bend regions.

The shifts towards Republicans during presidential elections in this period are massive (at least in terms of margin). The top 10 counties in this category all saw the margin move towards Republicans by between 16.8 points (Putnam) and a whopping 27.6 points (Glades), whereas the ten biggest shifts towards Democrats were only between 4.8 points (Orange) and 11.1 points (St. Johns).

In [251]:
# Top 10 counties that shifted towards Democrats during midterm election years
vote_share_by_year_county['2014_to_2022_diff'].sort_values(ascending=False).head(10)

county_name
Okaloosa      2.29
Duval         0.94
Clay         -1.18
Alachua      -1.43
St. Johns    -1.91
Escambia     -2.04
Seminole     -4.69
Orange       -4.83
Santa Rosa   -4.95
Nassau       -5.16
Name: 2014_to_2022_diff, dtype: float64

In [252]:
# Top 10 counties that shifted towards Republicans during midterm election years
vote_share_by_year_county['2014_to_2022_diff'].sort_values(ascending=True).head(10)

county_name
Union        -58.31
Desoto       -47.90
Liberty      -46.49
Dixie        -44.00
Lafayette    -42.93
Okeechobee   -42.13
Hardee       -38.29
Hamilton     -38.15
Calhoun      -37.38
Hernando     -37.33
Name: 2014_to_2022_diff, dtype: float64

In [253]:
# Graph this data for all counties
vote_share_by_year_county["2014_to_2022_diff"] = vote_share_by_year_county["2014_to_2022_diff"].astype(float)
 
fig_bar = px.bar(vote_share_by_year_county, 
                             x=vote_share_by_year_county.index, 
                             y='2014_to_2022_diff', 
                             title='Change in Vote Share in Midterm Elections (2014 to 2022)',
                             color='2014_to_2022_diff',
                             color_continuous_scale=px.colors.sequential.Bluered_r,
                             color_continuous_midpoint=0,
                             labels={'2014_to_2022_diff' : 'Change in Vote Share (%)', 'county_name' : 'County Name'})
fig_bar.show()

The trends identified in the above analysis of presidential years are even stronger when looking at midterm years. Only two counties demonstrated a shift towards Democrats between 2014 and 2022 (Okaloosa with 2.3% and Duval with 0.9%). The shifts in the counties that moved most strongly towards Republicans were truly enormous: for instance, Union shifted by 58.3%, DeSoto by 47.9%, and Liberty by 46.5%. These are counties with relatively small vote totals, however, so even a massive percentage shift may represent a modest number of votes.

In [254]:
# Map 2014 to 2022 diff

### QUESTION 1b: Which counties had the `largest change in net votes` between presidential election years? What about midterm years?

The above answer gives interesting information, but doesn't give a sense of scale. For instance, it is significant that a county like Union shifted 58 points towards Republicans during midterm years, but how many votes does that 58-point shift actually represent?

To that end, I'll repeat the steps used to answer Question 1 using the D_net_votes metric instead of the D_vote_margin metric. This should give us a better sense of proportionality for how meaningful these shifts are in terms of actual vote totals across the state.
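To make the difference between the two metrics concrete, here is a minimal sketch using made-up vote totals (they are not taken from the Division of Elections data). The helper names `margin_points`, `net_votes`, and `shift` are hypothetical and exist only for this illustration. It shows how a small county can swing by dozens of points while moving far fewer net votes than a large county that barely shifts.

```python
def margin_points(dem_votes, rep_votes, total_votes):
    """Democratic margin in percentage points of all votes cast."""
    return round(100 * (dem_votes - rep_votes) / total_votes, 2)


def net_votes(dem_votes, rep_votes):
    """Democratic votes minus Republican votes."""
    return dem_votes - rep_votes


# Hypothetical (dem, rep, total) vote counts for an earlier and a later election
small_county = {"before": (1_500, 2_500, 4_000), "after": (1_000, 4_000, 5_000)}
large_county = {"before": (260_000, 240_000, 500_000), "after": (270_000, 280_000, 550_000)}


def shift(county):
    """Change in margin (points) and in net votes between the two elections."""
    (d0, r0, t0), (d1, r1, t1) = county["before"], county["after"]
    return {
        "margin_shift": round(margin_points(d1, r1, t1) - margin_points(d0, r0, t0), 2),
        "net_vote_shift": net_votes(d1, r1) - net_votes(d0, r0),
    }


small_shift = shift(small_county)
large_shift = shift(large_county)
print(small_shift)
print(large_shift)
```

In this toy case the small county moves 35 points but only 2,000 net votes, while the large county moves under 6 points but 30,000 net votes.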

In [255]:
net_votes_by_year_county['2012_to_2020_diff'] = net_votes_by_year_county['2020'] - net_votes_by_year_county['2012'] 
net_votes_by_year_county['2014_to_2022_diff'] = net_votes_by_year_county['2022'] - net_votes_by_year_county['2014'] 
net_votes_by_year_county

county_name,2012,2014,2016,2018,2020,2022,mean,variance,std,2012_to_2020_diff,2014_to_2022_diff
Alachua,20902,12955,28986,31416,38732,14475,24577.67,1.034404e+08,10170.567,17830,1520
Baker,-6664,-3856,-8182,-6928,-9874,-8502,-7334.33,4.249029e+06,2061.317,-3210,-4646
Bay,-34825,-27966,-40397,-28952,-40483,-38499,-35187.00,3.146203e+07,5609.103,-5658,-10533
Bradford,-4894,-2931,-5989,-5027,-7174,-6494,-5418.17,2.235996e+06,1495.325,-2280,-3563
Brevard,-36307,-25602,-62169,-47292,-59334,-75431,-51022.50,3.331530e+08,18252.478,-23027,-49829
...,...,...,...,...,...,...,...,...,...,...,...
Union,-2641,-875,-3554,-2869,-4080,-3451,-2911.67,1.258313e+06,1121.746,-1439,-2576
Volusia,-2742,-6434,-33916,-24325,-43246,-64803,-29244.33,5.458349e+08,23363.111,-40504,-58369
Wakulla,-4115,-1884,-6164,-5608,-7523,-7113,-5401.17,4.416386e+06,2101.520,-3408,-5229
Walton,-14819,-10821,-18880,-15906,-22609,-22535,-17595.00,2.152820e+07,4639.849,-7790,-11714


In [256]:
# Top 10 counties where Democrats gained the most net votes during presidential election years
net_votes_by_year_county['2012_to_2020_diff'].sort_values(ascending=False).head(10)

county_name
Orange          64540
Duval           33672
Broward         21132
Seminole        21036
Alachua         17830
Hillsborough    12688
Leon            10988
Escambia         4781
Okaloosa         1197
Gadsden          -452
Name: 2012_to_2020_diff, dtype: int64

We can see that Democrats only increased their net number of votes from 2012 to 2020 in 9 counties. Gadsden, the 10th county on the list, already shows a negative change (-452 net votes).

In [257]:
# Top 10 counties where Democrats lost the most net votes during presidential election years
net_votes_by_year_county['2012_to_2020_diff'].sort_values(ascending=True).head(10)

county_name
Miami-Dade   -123428
Pasco         -46384
Volusia       -40504
Polk          -32582
Lee           -31546
Marion        -26756
Hernando      -25785
Pinellas      -24605
Brevard       -23027
Citrus        -22058
Name: 2012_to_2020_diff, dtype: int64

In [258]:
# Map 2012_to_2020 net votes diff

In [259]:
# Graph this data for all counties
net_votes_by_year_county["2012_to_2020_diff"] = net_votes_by_year_county["2012_to_2020_diff"].astype(int)
 
fig_bar = px.bar(net_votes_by_year_county, 
                             x=net_votes_by_year_county.index, 
                             y='2012_to_2020_diff', 
                             title='Change in Net Votes in Presidential Elections (2012 to 2020)',
                             color='2012_to_2020_diff',
                             color_continuous_scale=px.colors.sequential.Bluered_r,
                             color_continuous_midpoint=0,
                             labels={'2012_to_2020_diff' : 'Change in Net Votes', 'county_name' : 'County Name'})



fig_bar.show()

Unsurprisingly, Miami-Dade saw the biggest shift in net votes, and by a wide margin: 123.4k net votes towards Republicans, more than two and a half times the shift in the next most significant county, Pasco (46.4k net votes towards Republicans).

In [260]:
# Top 10 counties where Democrats gained the most net votes during midterm election years
net_votes_by_year_county['2014_to_2022_diff'].sort_values(ascending=False).head(10)

county_name
Alachua      1520
Liberty     -1202
Lafayette   -1365
Glades      -1597
Hamilton    -1689
Franklin    -1725
Gulf        -1859
Jefferson   -1985
Calhoun     -2049
Hardee      -2153
Name: 2014_to_2022_diff, dtype: int64

In [261]:
# Top 10 counties where Democrats lost the most net votes during midterm election years
net_votes_by_year_county['2014_to_2022_diff'].sort_values(ascending=True).head(10)

county_name
Miami-Dade     -180264
Palm Beach     -102103
Broward         -88508
Pinellas        -81380
Pasco           -69882
Lee             -62088
Volusia         -58369
Hillsborough    -54628
Brevard         -49829
Polk            -49315
Name: 2014_to_2022_diff, dtype: int64

In [262]:
# Graph this data for all counties
net_votes_by_year_county["2014_to_2022_diff"] = net_votes_by_year_county["2014_to_2022_diff"].astype(int)
 
fig_bar = px.bar(net_votes_by_year_county, 
                             x=net_votes_by_year_county.index, 
                             y='2014_to_2022_diff', 
                             title='Change in Net Votes in Midterm Elections (2014 to 2022)',
                             color='2014_to_2022_diff',
                             color_continuous_scale=px.colors.sequential.Bluered_r,
                             color_continuous_midpoint=0,
                             labels={'2014_to_2022_diff' : 'Change in Net Votes', 'county_name' : 'County Name'})



fig_bar.show()

The graph really speaks for itself. The only county where Democrats increased their net number of votes between the 2014 and 2022 midterms is Alachua, and only by 1,520 votes.

### QUESTION 2: Which counties were the `most stable for each major party`? Which were the `least stable`?

The answer to this question would be the counties with the smallest variances. Let's grab the 10 smallest:

In [263]:
vote_share_by_year_county['variance'].sort_values(ascending=True).head(10)

county_name
Nassau           7.669590
Sumter          17.109880
Santa Rosa      18.892297
Leon            21.569057
Walton          24.864510
Bay             27.735230
Clay            27.881137
Alachua         29.223720
Indian River    29.668507
St. Johns       31.189390
Name: variance, dtype: float64

The only immediately identifiable trend I can see in this cohort is that most of these counties are in the northern part of the state. 

`For further investigation:`
1. Map counties to more readily identify any regional trends in this cohort (are they on the periphery of an urban core? concentrated in the north or south?)
2. Examine population to see if there are strong population trends in the cohort (are they smaller counties where a relatively small change in vote total causes a disproportionate change in vote margins?)
3. Examine turnout rates (are these counties where turnout drops more radically in midterms? does that correlate with a radical change in vote share?)
4. Examine whether changes in margins in these counties were in one direction or both directions (is the variance in vote share because of swings between R and D or from a strong shift toward one party?)
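As a starting point for item 4, one possible approach is to count how often each county's margin moved in each direction from one election to the next. The sketch below uses the margins for Alachua and Union copied from the vote-share table in this notebook; the name `direction_summary` and the choice of consecutive-election differences are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# D margins (percentage points) copied from the vote-share table in this notebook
margins = pd.DataFrame(
    {
        "Alachua": [17.31, 16.52, 22.28, 27.24, 27.08, 15.09],
        "Union": [-48.84, -17.40, -62.11, -58.86, -65.27, -75.71],
    },
    index=["2012", "2014", "2016", "2018", "2020", "2022"],
)

# Election-to-election change in margin for each county
changes = margins.diff().dropna()

# Count moves in each direction and compare with the overall 2012-to-2022 change
direction_summary = pd.DataFrame(
    {
        "moves_toward_dem": (changes > 0).sum(),
        "moves_toward_rep": (changes < 0).sum(),
        "net_change": (margins.iloc[-1] - margins.iloc[0]).round(2),
    }
)
print(direction_summary)
```

A county with moves in both directions but a small net change is swinging back and forth; a county whose net change is large relative to its individual moves is trending toward one party. Because this compares consecutive elections, it mixes presidential and midterm years, so comparing like with like (2012 to 2016 to 2020, and 2014 to 2018 to 2022) may be a better next step.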

In [264]:
vote_share_by_year_county['variance'].sort_values(ascending=False).head(10)

county_name
Union         410.197657
Liberty       270.087467
Desoto        269.449507
Okeechobee    252.030827
Dixie         249.484097
Glades        235.628840
Lafayette     222.755587
Hendry        221.779960
Miami-Dade    217.224787
Hernando      199.588667
Name: variance, dtype: float64

This is really interesting. With the exception of Miami-Dade, these are small- to mid-sized counties that tend to be fairly red. There also doesn't appear to be a strong regional element; these counties are all across the state. 

The same questions for further investigation from above apply here as well.

### QUESTION 3: Which county or counties were `most consistent in predicting the outcome of the top-of-ticket race` in a given year?


There are two ways to interpret this question. The first is a binary method, and would require determining which counties voted in the same direction as the entire state for each election year and then tallying the number of correct predictions (e.g., if Alachua was won by Obama in 2012, then Scott in 2014, then Trump in 2016, then DeSantis in 2018, then Trump in 2020, then DeSantis in 2022, then it would have "predicted" 6 out of 6 of the elections we're looking at). The limitation of this method is that it doesn't give a sense of scale -- if DeSantis got 60% of the statewide vote in 2022 but only won Alachua with 52% of the vote, Alachua may have "predicted" the DeSantis victory but failed to predict the degree of the victory.

The second is a correlational method. It would be to look at the correlations between the vote shares in each county and the statewide vote share in each election year. The county (or counties) whose vote shares most strongly correlate with the statewide vote shares could be considered the strongest "predictor" of the statewide outcomes. The limitation of this method is that a county's vote share correlating with the statewide vote share does not necessarily mean the outcomes were the same in the county and statewide. For example, a county could have voted very narrowly for one candidate (say, with 50.2% of the county vote total) while the same candidate narrowly lost the statewide vote (say, with 49.8% of the vote). The two figures track each other closely, even though the candidate who won the county did not end up winning the state.
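The limitation of the correlational method can be shown with a small sketch. The statewide margins here are rounded from the figures computed later in this notebook; the county is invented and simply runs 2 points more Democratic than the state in every election. The variable names are assumptions for illustration only.

```python
import pandas as pd

# Statewide D margins (percentage points), rounded from this notebook's statewide figures
statewide = pd.Series([0.9, -1.1, -1.2, -0.4, -3.4, -19.4])

# Hypothetical county that always runs 2 points more Democratic than the state
county = statewide + 2.0

# Pearson correlation between the county and statewide margins
correlation = county.corr(statewide)

# Number of elections in which the county and the state picked the same party
same_winner = ((county > 0) == (statewide > 0)).sum()

print(round(correlation, 4), same_winner)
```

The county correlates perfectly with the state, yet it only sides with the statewide winner in half of the six elections, because a constant offset does not affect the correlation but can flip the sign of a narrow margin.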

Below, we'll explore both the binary and correlational methods.

`Binary Method:` The first thing we'll have to do for this method is go back to an earlier dataframe in the notebook and reformat the data to see which counties voted in the same way as the state overall in each of our election years.

In [274]:
# Make a copy of D_margin_by_county
df_binary = D_margin_by_county.copy()

# Add column county_winner containing which party won each county in each year
df_binary['county_winner'] = np.where(df_binary.D_margin > 0, 'DEM', 'REP')

df_binary

,elec_date,county_name,D_margin,county_winner
0,2012,Alachua,17.31,DEM
1,2012,Baker,-58.52,REP
2,2012,Bay,-43.48,REP
3,2012,Bradford,-41.96,REP
4,2012,Brevard,-12.68,REP
...,...,...,...,...
397,2022,Union,-75.71,REP
398,2022,Volusia,-28.64,REP
399,2022,Wakulla,-47.22,REP
400,2022,Walton,-64.57,REP


In [284]:
# Recast the 'elec_date' column as string
df_binary['elec_date'] = df_binary['elec_date'].astype(str)

# Statewide winning party in each election year
state_winner_by_year = {
    '2012': 'DEM',
    '2014': 'REP',
    '2016': 'REP',
    '2018': 'REP',
    '2020': 'REP',
    '2022': 'REP',
}

# Label each row with the statewide winner; any unexpected year is flagged as 'Error'
df_binary['state_winner'] = df_binary['elec_date'].map(state_winner_by_year).fillna('Error')

# 1 if the county winner and state winner are the same, 0 if not
df_binary['prediction_success'] = (df_binary['county_winner'] == df_binary['state_winner']).astype(int)

# Check output
df_binary

,elec_date,county_name,D_margin,county_winner,state_winner,prediction_success
0,2012,Alachua,17.31,DEM,DEM,1
1,2012,Baker,-58.52,REP,DEM,0
2,2012,Bay,-43.48,REP,DEM,0
3,2012,Bradford,-41.96,REP,DEM,0
4,2012,Brevard,-12.68,REP,DEM,0
...,...,...,...,...,...,...
397,2022,Union,-75.71,REP,REP,1
398,2022,Volusia,-28.64,REP,REP,1
399,2022,Wakulla,-47.22,REP,REP,1
400,2022,Walton,-64.57,REP,REP,1


In [292]:
df_county_successful_predictions = df_binary.groupby('county_name')['prediction_success'].sum().sort_values(ascending=False)
df_county_successful_predictions

county_name
Lake          5
Okeechobee    5
Lafayette     5
Baker         5
Lee           5
             ..
Orange        1
Gadsden       1
Broward       1
Leon          1
Alachua       1
Name: prediction_success, Length: 67, dtype: int64

In [295]:
fig_histogram = px.histogram(df_county_successful_predictions, 
                             x='prediction_success',
                             title='Successful Predictions',
                             labels={'prediction_success' : 'Prediction Successes', 'count' : 'Count of Counties'})
fig_histogram.show()

Looks like no county voted the same way as the state at large for all 6 of our elections in question, but 54 counties voted the same way as the state at large 5 out of 6 times. That's the overwhelming majority of the state's counties, so this method doesn't provide any particularly meaningful insights.
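One likely reason the tally clusters at 5 and 1 is that the state went Republican in five of the six elections, so a county that never changes sides scores 5 (always Republican) or 1 (always Democratic) regardless of how well it tracks the state. The sketch below checks this with two hypothetical counties; the `successes` helper is an illustrative name, not part of the notebook.

```python
# Statewide winners for the six elections, as used in the binary method above
state_winner = {
    "2012": "DEM", "2014": "REP", "2016": "REP",
    "2018": "REP", "2020": "REP", "2022": "REP",
}


def successes(county_winners):
    """Count the years in which a county's winner matches the statewide winner."""
    return sum(county_winners[year] == winner for year, winner in state_winner.items())


# Two hypothetical counties that never change sides
always_rep = {year: "REP" for year in state_winner}
always_dem = {year: "DEM" for year in state_winner}

print(successes(always_rep), successes(always_dem))
```

Under this reading, a score of 5 mostly identifies consistently Republican counties rather than bellwethers, which is consistent with the binary method offering little insight here.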

`Correlational Method:` Same as with the binary method, our first step is to find an earlier dataframe in the notebook that is close to the format we'll need and alter it to fit our needs.

In [349]:
# vote_share_by_year_county contains the vote share for each county in each election year, which is a good start
df_correlation = vote_share_by_year_county.copy()
df_correlation.drop(columns=['2012_to_2020_diff', '2014_to_2022_diff','mean','variance','std'], inplace=True)
df_correlation = df_correlation.transpose()
df_correlation


elec_date,Alachua,Baker,Bay,Bradford,Brevard,Broward,Calhoun,Charlotte,Citrus,Clay,...,St. Johns,St. Lucie,Sumter,Suwannee,Taylor,Union,Volusia,Wakulla,Walton,Washington
2012,17.31,-58.52,-43.48,-41.96,-12.68,34.93,-43.7,-14.24,-21.84,-45.72,...,-37.66,7.84,-34.89,-44.93,-38.13,-48.84,-1.17,-28.0,-51.85,-47.26
2014,16.52,-45.5,-48.93,-34.22,-11.51,38.44,-35.14,-12.3,-15.18,-48.87,...,-38.32,9.43,-39.41,-37.85,-32.17,-17.4,-3.65,-16.27,-52.93,-45.45
2016,22.28,-64.4,-45.79,-49.26,-19.54,34.91,-55.71,-27.55,-39.38,-43.93,...,-33.07,-2.4,-38.96,-54.95,-51.11,-62.11,-12.88,-39.91,-55.7,-56.85
2018,27.24,-65.69,-45.66,-47.82,-16.78,36.66,-57.54,-26.3,-37.39,-38.75,...,-28.57,3.72,-40.24,-55.19,-51.13,-58.86,-10.62,-39.38,-52.4,-56.87
2020,27.08,-70.12,-43.43,-52.56,-16.41,29.73,-62.18,-26.57,-40.97,-37.03,...,-26.6,-1.56,-36.08,-56.63,-53.78,-65.27,-14.04,-40.78,-51.62,-61.04
2022,15.09,-79.27,-57.38,-63.25,-28.2,15.38,-72.52,-41.37,-49.04,-50.05,...,-40.23,-19.14,-46.7,-67.2,-66.02,-75.71,-28.64,-47.22,-64.57,-71.24


In [346]:
# We need the statewide vote share for each election, which we can't derive from the above df 
# df_grouped can be manipulated to get these figures, so we'll start there
statewide_vote_share_by_year = df_grouped.copy()

# Drop the current vote_share column, which is computed per county; we'll replace it with one computed statewide by elec_date and cand_party
statewide_vote_share_by_year = statewide_vote_share_by_year.drop(columns='vote_share')

# Get statewide vote totals by party by elec_date by grouping vote totals by elec_date, cand_party, and cand_or_issue, summing vote_total
statewide_vote_share_by_year = statewide_vote_share_by_year.groupby(['elec_date','cand_party','cand_or_issue'], as_index=False)['vote_total'].sum()

# Add vote_share column
statewide_vote_share_by_year['vote_share'] = statewide_vote_share_by_year.groupby(['elec_date'])['vote_total'].transform(lambda x: x / x.sum())

# Narrow to DEM and REP candidates (copy so the in-place edit below doesn't act on a view and trigger SettingWithCopyWarning)
statewide_vote_share_by_year = statewide_vote_share_by_year.query('cand_party == "DEM" or cand_party == "REP"').copy()

# Make REP vote_share values negative to get a marginal value when summed by elec_date
statewide_vote_share_by_year['vote_share'] *= np.where(statewide_vote_share_by_year.cand_party == 'DEM', 1, -1)

# Group by vote_share by elec_date
statewide_vote_share_by_year = statewide_vote_share_by_year.groupby(['elec_date'], as_index=False)['vote_share'].agg('sum')

# Examine output
statewide_vote_share_by_year



,elec_date,vote_share
0,2012,0.008725
1,2014,-0.010824
2,2016,-0.01181
3,2018,-0.003941
4,2020,-0.033512
5,2022,-0.194031


In [375]:
# Translate these values to percentages (vote_share * 100) and add to df_correlation
df_correlation['Statewide'] = [0.8725, -1.0824, -1.1810, -0.3941, -3.3512, -19.4031]
df_correlation



elec_date,Alachua,Baker,Bay,Bradford,Brevard,Broward,Calhoun,Charlotte,Citrus,Clay,...,St. Lucie,Sumter,Suwannee,Taylor,Union,Volusia,Wakulla,Walton,Washington,Statewide
2012,17.31,-58.52,-43.48,-41.96,-12.68,34.93,-43.7,-14.24,-21.84,-45.72,...,7.84,-34.89,-44.93,-38.13,-48.84,-1.17,-28.0,-51.85,-47.26,0.8725
2014,16.52,-45.5,-48.93,-34.22,-11.51,38.44,-35.14,-12.3,-15.18,-48.87,...,9.43,-39.41,-37.85,-32.17,-17.4,-3.65,-16.27,-52.93,-45.45,-1.0824
2016,22.28,-64.4,-45.79,-49.26,-19.54,34.91,-55.71,-27.55,-39.38,-43.93,...,-2.4,-38.96,-54.95,-51.11,-62.11,-12.88,-39.91,-55.7,-56.85,-1.181
2018,27.24,-65.69,-45.66,-47.82,-16.78,36.66,-57.54,-26.3,-37.39,-38.75,...,3.72,-40.24,-55.19,-51.13,-58.86,-10.62,-39.38,-52.4,-56.87,-0.3941
2020,27.08,-70.12,-43.43,-52.56,-16.41,29.73,-62.18,-26.57,-40.97,-37.03,...,-1.56,-36.08,-56.63,-53.78,-65.27,-14.04,-40.78,-51.62,-61.04,-3.3512
2022,15.09,-79.27,-57.38,-63.25,-28.2,15.38,-72.52,-41.37,-49.04,-50.05,...,-19.14,-46.7,-67.2,-66.02,-75.71,-28.64,-47.22,-64.57,-71.24,-19.4031
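The hard-coded list above could instead be derived from `statewide_vote_share_by_year`, which avoids transcription errors if the upstream cleaning changes. The following is a minimal sketch on stand-in data: the statewide fractions are copied from the output above, `df_correlation` is reduced to a single county, and the integer `elec_date` dtype is an assumption about the real dataframe.

```python
import pandas as pd

# Statewide margins as fractions, copied from the statewide output above
statewide_vote_share_by_year = pd.DataFrame(
    {
        "elec_date": [2012, 2014, 2016, 2018, 2020, 2022],  # dtype is an assumption
        "vote_share": [0.008725, -0.010824, -0.011810, -0.003941, -0.033512, -0.194031],
    }
)

# Stand-in for df_correlation with one county column, one row per election year
df_correlation = pd.DataFrame(
    {"Alachua": [17.31, 16.52, 22.28, 27.24, 27.08, 15.09]},
    index=pd.Index(["2012", "2014", "2016", "2018", "2020", "2022"], name="elec_date"),
)

# Convert fractions to percentage points. Using .to_numpy() sidesteps index
# alignment, so this relies on both frames being sorted by election year.
statewide_pct = (statewide_vote_share_by_year["vote_share"] * 100).round(4).to_numpy()
df_correlation["Statewide"] = statewide_pct
print(df_correlation)
```

Assigning by position is the simplest option here because the two frames use different index types; aligning on `elec_date` explicitly would be more robust if the row order could ever differ.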


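In [399]:
# Absolute difference between each county's margin and the statewide margin, per election year
df_correlation_diff = df_correlation.sub(df_correlation['Statewide'], axis=0).abs()

# Drop the Statewide column (now all zeros) and transpose so that each row is a county
df_correlation_diff = df_correlation_diff.drop(columns='Statewide').transpose()

# Mean and variance of each county's absolute difference across the six elections
df_correlation_diff['mean_diff'] = df_correlation_diff.loc[:, '2012':'2022'].mean(axis=1)
df_correlation_diff['var_diff'] = df_correlation_diff.loc[:, '2012':'2022'].var(axis=1)

# Counties whose distance from the statewide margin varied least
# (sort by 'mean_diff' instead to rank by average distance)
df_correlation_diff['var_diff'].sort_values(ascending=True)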

county_name
Lee             3.427194
Monroe          4.310466
Polk            4.517983
Palm Beach      4.618421
Broward         5.393715
                 ...    
Desoto        121.959703
Lafayette     153.854450
Dixie         162.079117
Liberty       172.735654
Union         301.062948
Name: var_diff, Length: 67, dtype: float64

In [400]:
# Line chart of every county's margin (plus the statewide margin) by election year
fig_line = px.line(df_correlation, x=df_correlation.index, y=df_correlation.columns)
fig_line.show()

In [357]:
correlations = df_correlation.corr()
correlations

county_name,Alachua,Baker,Bay,Bradford,Brevard,Broward,Calhoun,Charlotte,Citrus,Clay,...,St. Lucie,Sumter,Suwannee,Taylor,Union,Volusia,Wakulla,Walton,Washington,Statewide
Alachua,1.000000,-0.198833,0.621051,-0.083705,0.167443,0.337040,-0.242602,-0.072469,-0.320338,0.974751,...,0.186989,0.376215,-0.188539,-0.184181,-0.300064,0.061526,-0.351974,0.531340,-0.105119,0.436550
Baker,-0.198833,1.000000,0.393320,0.984978,0.862225,0.815969,0.985069,0.929404,0.956573,-0.195506,...,0.872081,0.497381,0.982047,0.972590,0.971079,0.867648,0.969724,0.623971,0.948159,0.703113
Bay,0.621051,0.393320,1.000000,0.522322,0.753085,0.765142,0.441767,0.625602,0.376498,0.704763,...,0.745536,0.945471,0.480900,0.507747,0.214604,0.738586,0.279098,0.925696,0.592156,0.904878
Bradford,-0.083705,0.984978,0.522322,1.000000,0.927451,0.880770,0.984666,0.967505,0.957023,-0.062642,...,0.942651,0.604302,0.987490,0.986269,0.932736,0.935353,0.944653,0.734310,0.980462,0.798159
Brevard,0.167443,0.862225,0.753085,0.927451,1.000000,0.885702,0.884691,0.968364,0.870050,0.243409,...,0.981229,0.822436,0.918688,0.924546,0.786349,0.966422,0.827439,0.917068,0.933365,0.889938
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Volusia,0.061526,0.867648,0.738586,0.935353,0.966422,0.899879,0.915323,0.974941,0.892279,0.134320,...,0.979626,0.806871,0.923940,0.943034,0.760233,1.000000,0.821876,0.859115,0.975326,0.904736
Wakulla,-0.351974,0.969724,0.279098,0.944653,0.827439,0.677971,0.970145,0.917975,0.983302,-0.315094,...,0.809113,0.445064,0.973934,0.962582,0.981901,0.821876,1.000000,0.539797,0.911931,0.570829
Walton,0.531340,0.623971,0.925696,0.734310,0.917068,0.870078,0.643339,0.798087,0.604089,0.604744,...,0.903974,0.899429,0.693213,0.705662,0.505741,0.859115,0.539797,1.000000,0.751730,0.938599
Washington,-0.105119,0.948159,0.592156,0.980462,0.933365,0.874839,0.980037,0.984042,0.955769,-0.056665,...,0.950530,0.687552,0.976779,0.987735,0.866236,0.975326,0.911931,0.751730,1.000000,0.830744


In [356]:
statewide_correlations = correlations['Statewide'].sort_values(ascending=False)
statewide_correlations

county_name
Statewide     1.000000
Lee           0.970925
Hendry        0.968131
Palm Beach    0.967788
Broward       0.964227
                ...   
Duval         0.493587
Clay          0.467924
Alachua       0.436550
St. Johns     0.434279
Okaloosa      0.322389
Name: Statewide, Length: 68, dtype: float64

In [364]:
fig_bar = px.bar(y= statewide_correlations, 
                 x=statewide_correlations.index,
                 color=statewide_correlations)
fig_bar.show()

## Misc: EDA and Visualizations

In [None]:
fig_histogram = px.histogram(D_margin_by_county, x='D_margin')
fig_histogram.show()

In [None]:
# Plot totals by county across all election years
fig_histogram_totals = px.histogram(D_margin_by_county,x='county_name', y='vote_share', hover_name='county_name', title='Totals')
fig_histogram_totals.show()

# Plot the same histogram separately for each election year
for year in [2012, 2014, 2016, 2018, 2020, 2022]:
    fig_histogram_year = px.histogram(D_margin_by_county[D_margin_by_county['elec_date'] == year],x='county_name', y='vote_share', title=str(year))
    fig_histogram_year.show()

In [None]:
fig_line_totals = px.line(D_margin_by_county, x='elec_date', y='vote_share', color='county_name')
fig_line_totals.show()