# Florida Electoral Analysis - Data Cleaning

The ultimate objectives of this project are to:

1. Identify precincts friendliest to Ds in R counties
2. Identify the counties and precincts with the greatest swing from R to D over the course of multiple elections
3. Identify the counties and precincts with the greatest swing from D to R over the course of multiple elections
4. Identify the counties and precincts surrounded by precincts that swung in an opposite direction -- essentially, identifying where people diverge most from their neighbors

For this analysis, I will be drawing data from the Florida Division of Elections precinct level election results (https://dos.myflorida.com/elections/data-statistics/elections-data/precinct-level-election-results/).

The data flow for this project will be as follows:

1. Ingest data from .txt files into dataframes using Python and pandas
2. Examine and clean dataframes using Python and pandas
3. Save the cleaned dataframes to .csv
4. Upload .csv data to BigQuery for storage and easier querying
5. Conduct analysis on data using Python and numpy
6. Create visualizations that convey analysis using Python, matplotlib, and seaborn
7. Use GIS tools (QGIS) to map data onto the state of Florida

**Note**: I recognize this may not be a particularly efficient workflow, and it involves more tools than may be strictly necessary; that is intentional. My goal is to develop my skills across these tools and platforms, especially when it comes to moving the same dataset from one tool to another.

## Step 1: Reformat and Ingest Data

Before we dive into analysis, we have to ingest and clean a **lot** of data, a task to which this entire notebook will be dedicated. The FDOE provides its data in .txt format (*yay*), one file per county per election, so we'll have to turn 67 .txt files into one dataframe *for each year we want to investigate*. We'll start with our earliest year of data, which is from the 2012 general election. From here, we can define a standard set of columns that will:

 * Work for our analytical purposes
 * Be kept consistent across the various years for which we are doing this analysis

In [96]:
import numpy as np
import pandas as pd
import matplotlib as mp
import csv
import os
import glob

To start ingesting the 2012 data, we'll first need to convert all of the provided .txt files into .csv files that we can more readily get into a dataframe.

Let's put together a script to:
1. Open each .txt file
2. Split each line into values, using the tab character as the delimiter
3. Write a header row of column names, followed by the split rows, to a matching .csv file

**Note:** Most of this code is commented out, as it only needs to be run once at the onset of the data cleaning process and I don't want it to run again should I hit 'Run All' in my IDE.

In [97]:
output = "C:\\Users\\canor\\Documents\\GitHub\\FL-Political-Analysis\\Florida Analysis\\FL 2012 by Precinct\\CSV Files"
outPath = glob.glob(os.path.join(output, "*.csv"))  # avoids the invalid "\*" escape sequence

folPath = "C:\\Users\\canor\\Documents\\GitHub\\FL-Political-Analysis\\Florida Analysis\\FL 2012 by Precinct\\Text Files"
filPath = glob.glob(os.path.join(folPath, "*.txt"))  # avoids the invalid "\*" escape sequence

# For future years, turn the below code into a function to be called on the directory of .txt or .csv files
# def txt_to_csv(input_path,output_path,fields):

# for file in filPath:
#     with open(file, 'r') as input_file:
#         filename = os.path.splitext(os.path.basename(file))[0] + '.csv'
#         # Drop surrounding whitespace, skip blank lines, and split each row on tabs
#         stripped = (line.strip() for line in input_file)
#         lines = (line.split("\t") for line in stripped if line)
#         # newline='' keeps csv.writer from emitting extra blank rows on Windows
#         with open(os.path.join(output, filename), 'w', newline='') as output_file:
#             fields = ['county_code', 'county_name', 'elec_num', 'elec_date', 'elec_name', 'precinct_id', 'poll_loc', 'total_reg', 'total_reg_r', 'total_reg_d', 'total_reg_other', 'contest_name', 'district', 'contest_code', 'cand_or_issue', 'cand_party', 'cand_id', 'doe_num', 'vote_total']
#             writer = csv.writer(output_file)
#             writer.writerow(fields)
#             writer.writerows(lines)
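The comment in the cell above suggests wrapping the conversion in a function for future years. One way that might look is sketched below. The name `txt_to_csv` and its parameters come from that comment; the `FIELDS` constant, the returned list of paths, and the `os.makedirs` call are assumptions added for illustration, and the default text encoding is left untouched, as in the original loop.

```python
import csv
import glob
import os

# Column names copied from the `fields` list in the cell above
FIELDS = ['county_code', 'county_name', 'elec_num', 'elec_date', 'elec_name',
          'precinct_id', 'poll_loc', 'total_reg', 'total_reg_r', 'total_reg_d',
          'total_reg_other', 'contest_name', 'district', 'contest_code',
          'cand_or_issue', 'cand_party', 'cand_id', 'doe_num', 'vote_total']


def txt_to_csv(input_path, output_path, fields=FIELDS):
    """Convert each tab-delimited .txt file in input_path to a .csv in output_path.

    Returns the list of .csv paths written (an illustrative addition).
    """
    os.makedirs(output_path, exist_ok=True)
    written = []
    for txt_file in sorted(glob.glob(os.path.join(input_path, "*.txt"))):
        csv_name = os.path.splitext(os.path.basename(txt_file))[0] + ".csv"
        csv_file = os.path.join(output_path, csv_name)
        with open(txt_file, "r") as src, open(csv_file, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(fields)  # header row first
            # Same logic as the loop above: strip, skip blank lines, split on tabs
            stripped = (line.strip() for line in src)
            writer.writerows(line.split("\t") for line in stripped if line)
        written.append(csv_file)
    return written
```

Calling `txt_to_csv(folPath, output)` would then cover the 2012 files, and other years would only need different directories.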

Now that we have clean(er) .csv files, let's turn them into dataframes and merge 'em all into one big ol' honkin' df.

In [98]:
# merging the files
joined_files = os.path.join(output, "*.csv")
  
# A list of all joined files is returned
joined_list = glob.glob(joined_files)
  
# Finally, the files are joined
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)

We did it! We now have a huge dataframe that has all of our .csv's merged together. Let's take a look at it.

In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 688411 entries, 0 to 688410
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   county_code      688411 non-null  object
 1   county_name      688411 non-null  object
 2   elec_num         688411 non-null  int64 
 3   elec_date        688411 non-null  object
 4   elec_name        688411 non-null  object
 5   precinct_id      688411 non-null  object
 6   poll_loc         660772 non-null  object
 7   total_reg        688411 non-null  int64 
 8   total_reg_r      688411 non-null  int64 
 9   total_reg_d      688411 non-null  int64 
 10  total_reg_other  688411 non-null  int64 
 11  contest_name     688411 non-null  object
 12  district         688411 non-null  object
 13  contest_code     688411 non-null  int64 
 14  cand_or_issue    688411 non-null  object
 15  cand_party       627673 non-null  object
 16  cand_id          688411 non-null  int64 
 17  doe_num   

It's a big boi, almost 700k rows! Right off the bat, I see there are some NaN values somewhere in the `poll_loc` and `cand_party` columns -- we'll do a few other things to clean up and format this dataframe first, but that's good to know and we'll be addressing those shortly.

For the moment, let's just take a look at the shape of it and check out the data in each column:

In [100]:
df.head()

Unnamed: 0,county_code,county_name,elec_num,elec_date,elec_name,precinct_id,poll_loc,total_reg,total_reg_r,total_reg_d,total_reg_other,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,doe_num,vote_total
0,ALA,Alachua,9547,11/06/2012,2012 General Election,1,01 First Baptist Church of Waldo,1411,0,0,0,President of the United States,,100000,Romney / Ryan,REP,0,55509,608
1,ALA,Alachua,9547,11/06/2012,2012 General Election,1,01 First Baptist Church of Waldo,1411,0,0,0,President of the United States,,100000,Obama / Biden,DEM,0,55511,381
2,ALA,Alachua,9547,11/06/2012,2012 General Election,1,01 First Baptist Church of Waldo,1411,0,0,0,President of the United States,,100000,Stevens / Link,OBJ,0,59923,1
3,ALA,Alachua,9547,11/06/2012,2012 General Election,1,01 First Baptist Church of Waldo,1411,0,0,0,President of the United States,,100000,Johnson / Gray,LBT,0,59927,6
4,ALA,Alachua,9547,11/06/2012,2012 General Election,1,01 First Baptist Church of Waldo,1411,0,0,0,President of the United States,,100000,"Goode, / Clymer",CPF,0,59941,1


## Step 2: Removing Duplicates
First off, let's just check for dupes:

In [101]:
df.duplicated().sum()

0

No dupes! Great!

## Step 3: Removing Extraneous Columns

Looking back at the dataframe preview, it looks like there might be some columns that don't contain any valuable information that we could drop to clean up our frame:

`total_reg_r`, `total_reg_d`, and `total_reg_other` look like they may not contain any useful values, and `elec_name` and `elec_num` may just be the same for every row and therefore not very useful either.

Let's take a look:

In [102]:
df['total_reg_r'].unique() # All values are 0, not useful!

array([0], dtype=int64)

In [103]:
df['total_reg_d'].unique() # All values are 0, not useful!

array([0], dtype=int64)

In [104]:
df['elec_num'].unique() # All values are 9547, not useful!

array([9547], dtype=int64)

In [105]:
df['elec_name'].unique() # All values are '2012 General Election', not useful!

array(['2012 General Election'], dtype=object)
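The four single-column checks above could also be done in one pass. The sketch below is a possible shortcut rather than what the notebook actually ran: `constant_columns` is a hypothetical helper, and `sample` is a tiny made-up frame that only borrows the real column names.

```python
import pandas as pd


def constant_columns(frame):
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts NaN as a value, so a column mixing NaN and one
    # other value is not reported as constant
    distinct = frame.nunique(dropna=False)
    return distinct[distinct <= 1].index.tolist()


# Toy data for illustration only
sample = pd.DataFrame({
    'elec_num': [9547, 9547, 9547],
    'elec_name': ['2012 General Election'] * 3,
    'total_reg_r': [0, 0, 0],
    'vote_total': [608, 381, 1],
})

print(constant_columns(sample))  # ['elec_num', 'elec_name', 'total_reg_r']
```

Run against the full dataframe, this would list every candidate for dropping at once, which may be handy when repeating the process for later election years.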

In [106]:
df['total_reg_other'].unique() # There are unique values!

array([     0,   2301,   1051,   3048,   2258,    942,   1398,   3753,
         1931,   4456,   4124,   6240,   1482,   3659,   3001,   1682,
         3526,   3249,   3857,   2253,   2606,   2990,   1940,   2763,
         3447,   2529,   2948,   2339,   1651,   1162,   2613,   1458,
         1032,   1402,   1332,   1011,   1704,   1197,   1963,   2072,
         1511,   1735,   2709,   1615,   1397,   2284,   2525,   2042,
          525,   1911,   2368,   2088,   2872,   1875,   1919,   2715,
         4226,   3727,   1862,   3505,   2047,   4269,   4568,   3749,
         4494,   2822,   4653,   2448,   2252,    829,   1648,   3600,
         3708,   1999,   2020,   1572,    699,    527,   3471,   2336,
         2912,    732,    955,   3399,   1113,   2666,   1932,   1769,
         1597,   2264,   2940,   2815,   1497,    834,   1645,   3232,
         1971,   1258,    726,   1846,   2955,   1239,   3514,    549,
         2133,    995,   2439,   1226,   1431,   3354,   3952,   2503,
      

In [107]:
# How many unique values are there?
len(df['total_reg_other'].unique())


245

Looks like we do indeed have some unique values in the `total_reg_other` column -- 245 of them, in fact. 

That's not many overall given the size of our dataframe, but let's make a new dataframe with just those rows to see what they look like:

In [108]:
df_reg_other = df[df['total_reg_other'] != 0]
df_reg_other.head()

Unnamed: 0,county_code,county_name,elec_num,elec_date,elec_name,precinct_id,poll_loc,total_reg,total_reg_r,total_reg_d,total_reg_other,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,doe_num,vote_total
606479,POL,Polk,9547,11/06/2012,2012 General Election,101,Precinct 101,2301,0,0,2301,President of the United States,,100000,Times Over Voted,,0,901,5
606480,POL,Polk,9547,11/06/2012,2012 General Election,101,Precinct 101,2301,0,0,2301,President of the United States,,100000,Romney / Ryan,REP,0,55509,597
606481,POL,Polk,9547,11/06/2012,2012 General Election,101,Precinct 101,2301,0,0,2301,President of the United States,,100000,Obama / Biden,DEM,0,55511,359
606482,POL,Polk,9547,11/06/2012,2012 General Election,101,Precinct 101,2301,0,0,2301,President of the United States,,100000,Stevens / Link,OBJ,0,59923,1
606483,POL,Polk,9547,11/06/2012,2012 General Election,101,Precinct 101,2301,0,0,2301,President of the United States,,100000,Johnson / Gray,LBT,0,59927,3


It's not clear to me what the values in `total_reg_other` actually signify, especially given that the other two columns `total_reg_r` and `total_reg_d` are essentially empty, but it looks like it may just mirror the value in the `total_reg` column for a given precinct. 

Let's find out by creating a new dataframe (`df_reg_temp`) that should only be populated with rows wherein `total_reg_other` is not equal to `total_reg`:

In [109]:
df_reg_temp = df_reg_other[df_reg_other['total_reg_other'] != df_reg_other['total_reg']]
df_reg_temp.head()

Unnamed: 0,county_code,county_name,elec_num,elec_date,elec_name,precinct_id,poll_loc,total_reg,total_reg_r,total_reg_d,total_reg_other,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,doe_num,vote_total


And it's empty! We can safely say there is no unique data in the `total_reg_other` column that isn't already in `total_reg` and add it to our list of columns to drop.
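The same question can be asked with a single boolean expression instead of two intermediate dataframes. This is a minimal sketch of that idea: `mirrors_or_zero` is a hypothetical helper name, and the two small frames are invented to show a passing and a failing case.

```python
import pandas as pd


def mirrors_or_zero(frame, col, reference):
    """True when every value in `col` is either 0 or equal to `reference` on the same row."""
    matches = (frame[col] == 0) | (frame[col] == frame[reference])
    return bool(matches.all())


# Toy data for illustration only
mirrored = pd.DataFrame({'total_reg': [1411, 2301, 2301],
                         'total_reg_other': [0, 2301, 2301]})
diverging = pd.DataFrame({'total_reg': [1411, 2301, 2301],
                          'total_reg_other': [0, 2300, 2301]})

print(mirrors_or_zero(mirrored, 'total_reg_other', 'total_reg'))   # True
print(mirrors_or_zero(diverging, 'total_reg_other', 'total_reg'))  # False
```

A `True` result on the real dataframe would correspond to the empty `df_reg_temp` shown above.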

In [110]:
df = df.drop(columns=['total_reg_r', 'total_reg_d', 'total_reg_other', 'elec_num', 'elec_name'])
df.head()

Unnamed: 0,county_code,county_name,elec_date,precinct_id,poll_loc,total_reg,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,doe_num,vote_total
0,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,Romney / Ryan,REP,0,55509,608
1,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,Obama / Biden,DEM,0,55511,381
2,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,Stevens / Link,OBJ,0,59923,1
3,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,Johnson / Gray,LBT,0,59927,6
4,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,"Goode, / Clymer",CPF,0,59941,1


Great! Five superfluous columns removed. 

## Step 4: Dealing With NaN cells

Remember those NaN values in `poll_loc` and `cand_party` from earlier? Let's check those out. Let's create a dataframe of just those rows which contain NaNs:

In [111]:
df_nan = df[df.isna().any(axis=1)]
df_nan

Unnamed: 0,county_code,county_name,elec_date,precinct_id,poll_loc,total_reg,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,doe_num,vote_total
28654,BRO,Broward,11/06/2012,A001,A001,1120,President of the United States,,100000,Times Over Voted,,0,901,2
28655,BRO,Broward,11/06/2012,A001,A001,1120,President of the United States,,100000,Number of Under Votes,,0,902,4
28660,BRO,Broward,11/06/2012,A001,A001,1120,United States Senator,,120000,Times Over Voted,,0,901,0
28661,BRO,Broward,11/06/2012,A001,A001,1120,United States Senator,,120000,Number of Under Votes,,0,902,39
28664,BRO,Broward,11/06/2012,A001,A001,1120,U.S. Representative,District 22,140220,Times Over Voted,,0,901,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
653401,SEM,Seminole,11/06/2012,80,PRECINCT 80,3192,TANGIBLE PERSONAL PROPERTY TAX EXEMPTION,Amendment No. 10,901000,Number of Under Votes,,0,902,0
653404,SEM,Seminole,11/06/2012,80,PRECINCT 80,3192,ADDITIONAL HOMESTEAD EXEMPTION; LOW-INCOME SEN...,Amendment No. 11,901100,Times Over Voted,,0,901,0
653405,SEM,Seminole,11/06/2012,80,PRECINCT 80,3192,ADDITIONAL HOMESTEAD EXEMPTION; LOW-INCOME SEN...,Amendment No. 11,901100,Number of Under Votes,,0,902,0
653408,SEM,Seminole,11/06/2012,80,PRECINCT 80,3192,APPOINTMENT OF STUDENT BODY PRESIDENT TO BOARD...,Amendment No. 12,901200,Times Over Voted,,0,901,0


Here we can see that, at least for the NaNs in `cand_party`, these are rows that represent over and under vote totals. While interesting, they're not pertinent to this analysis, so we can consider dropping them. Let's just make sure those conditions in `cand_or_issue` are the only things that may be holding a NaN in `cand_party`:

In [112]:
df_nan_cand_party = df[df['cand_party'].isna()]
df_nan_cand_party

Unnamed: 0,county_code,county_name,elec_date,precinct_id,poll_loc,total_reg,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,doe_num,vote_total
28654,BRO,Broward,11/06/2012,A001,A001,1120,President of the United States,,100000,Times Over Voted,,0,901,2
28655,BRO,Broward,11/06/2012,A001,A001,1120,President of the United States,,100000,Number of Under Votes,,0,902,4
28660,BRO,Broward,11/06/2012,A001,A001,1120,United States Senator,,120000,Times Over Voted,,0,901,0
28661,BRO,Broward,11/06/2012,A001,A001,1120,United States Senator,,120000,Number of Under Votes,,0,902,39
28664,BRO,Broward,11/06/2012,A001,A001,1120,U.S. Representative,District 22,140220,Times Over Voted,,0,901,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
653401,SEM,Seminole,11/06/2012,80,PRECINCT 80,3192,TANGIBLE PERSONAL PROPERTY TAX EXEMPTION,Amendment No. 10,901000,Number of Under Votes,,0,902,0
653404,SEM,Seminole,11/06/2012,80,PRECINCT 80,3192,ADDITIONAL HOMESTEAD EXEMPTION; LOW-INCOME SEN...,Amendment No. 11,901100,Times Over Voted,,0,901,0
653405,SEM,Seminole,11/06/2012,80,PRECINCT 80,3192,ADDITIONAL HOMESTEAD EXEMPTION; LOW-INCOME SEN...,Amendment No. 11,901100,Number of Under Votes,,0,902,0
653408,SEM,Seminole,11/06/2012,80,PRECINCT 80,3192,APPOINTMENT OF STUDENT BODY PRESIDENT TO BOARD...,Amendment No. 12,901200,Times Over Voted,,0,901,0


In [113]:
df_nan_cand_party['cand_or_issue'].unique()

array(['Times Over Voted', 'Number of Under Votes'], dtype=object)

Looks like rows describing over and under votes are the only rows which contain NaNs in the `cand_party` column, which means we can drop those roughly 61,000 rows.

In [114]:
df = df.dropna(subset=['cand_party'])

In [115]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 627673 entries, 0 to 688410
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   county_code    627673 non-null  object
 1   county_name    627673 non-null  object
 2   elec_date      627673 non-null  object
 3   precinct_id    627673 non-null  object
 4   poll_loc       600034 non-null  object
 5   total_reg      627673 non-null  int64 
 6   contest_name   627673 non-null  object
 7   district       627673 non-null  object
 8   contest_code   627673 non-null  int64 
 9   cand_or_issue  627673 non-null  object
 10  cand_party     627673 non-null  object
 11  cand_id        627673 non-null  int64 
 12  doe_num        627673 non-null  int64 
 13  vote_total     627673 non-null  int64 
dtypes: int64(5), object(9)
memory usage: 71.8+ MB


Great, now our count of non-nulls in `cand_party` matches our non-null counts in other columns. Going back to our nulls in `poll_loc` -- I can't tell immediately what the NaNs in `poll_loc` are, so let's drill down on those:

In [116]:
df_nan_poll_loc = df[df['poll_loc'].isna()]
df_nan_poll_loc

Unnamed: 0,county_code,county_name,elec_date,precinct_id,poll_loc,total_reg,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,doe_num,vote_total
172570,DAD,Miami-Dade,11/06/2012,1000,,3223,President of the United States,,100000,Romney / Ryan,REP,0,55509,400
172571,DAD,Miami-Dade,11/06/2012,1000,,3223,President of the United States,,100000,Obama / Biden,DEM,0,55511,703
172572,DAD,Miami-Dade,11/06/2012,1000,,3223,President of the United States,,100000,Stevens / Link,OBJ,0,59923,1
172573,DAD,Miami-Dade,11/06/2012,1000,,3223,President of the United States,,100000,Johnson / Gray,LBT,0,59927,7
172574,DAD,Miami-Dade,11/06/2012,1000,,3223,President of the United States,,100000,"Goode, / Clymer",CPF,0,59941,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
418959,LEE,Lee,11/06/2012,200,,0,President of the United States,,100000,Hoefling / Ellis,AIP,0,60016,0
418960,LEE,Lee,11/06/2012,200,,0,President of the United States,,100000,Anderson / Rodriguez,JPF,0,60019,0
418961,LEE,Lee,11/06/2012,200,,0,President of the United States,,100000,WriteinVotes,,0,900,0
418962,LEE,Lee,11/06/2012,200,,0,President of the United States,,100000,OverVotes,,0,901,0


These NaNs in `poll_loc` are a bit weird, but at least some of these rows are associated with vote totals, so we'll keep them in our dataframe. We likely won't be performing operations on the `poll_loc` column, so they may not be an issue.
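One thing worth keeping in mind here: `isna()` only flags true missing values, not empty or whitespace-only strings. The `district` column, for example, reports no nulls in `df.info()` even though it displays as blank in the previews, which suggests (an assumption, not verified against the raw files) that some blanks are stored as strings. The sketch below counts such cells per column; `count_blank_strings` is a hypothetical helper and `sample` is invented data.

```python
import pandas as pd


def is_blank(value):
    """True for strings that are empty or contain only whitespace."""
    return isinstance(value, str) and value.strip() == ''


def count_blank_strings(frame):
    """Count blank-string cells per column; these are invisible to isna()."""
    return frame.apply(lambda col: int(col.map(is_blank).sum()))


# Toy data for illustration only
sample = pd.DataFrame({
    'district': [' ', 'District 22', ''],
    'cand_party': ['REP', None, ' '],
    'vote_total': [608, 381, 1],
})

print(count_blank_strings(sample).to_dict())
```

If the real dataframe shows non-zero counts in `cand_party`, some over and under vote rows may have survived the `dropna` step and would need a separate filter.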

## Step 5: Narrowing Rows to Relevant Races

We've cleared out the superfluous columns and rows with NaN data that we don't need, but we still have a lot of clutter in here; each row is currently the vote count per precinct per race per candidate, for every single race, amendment, and other item on the ballot. As our final cleaning step, let's filter the dataframe to only include pertinent federal and state legislative races:

In [117]:
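# Contests we care about: federal races plus the state legislature
relevant_contests = ['President of the United States',
                     'United States Senator',
                     'U.S. Representative',
                     'State Senator',
                     'State Representative']

# reset_index() keeps the old row labels in a new 'index' column
df_cleaned_2012 = df[df['contest_name'].isin(relevant_contests)].reset_index()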

And let's check it out:

In [118]:
df_cleaned_2012.head()

Unnamed: 0,index,county_code,county_name,elec_date,precinct_id,poll_loc,total_reg,contest_name,district,contest_code,cand_or_issue,cand_party,cand_id,doe_num,vote_total
0,0,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,Romney / Ryan,REP,0,55509,608
1,1,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,Obama / Biden,DEM,0,55511,381
2,2,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,Stevens / Link,OBJ,0,59923,1
3,3,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,Johnson / Gray,LBT,0,59927,6
4,4,ALA,Alachua,11/06/2012,1,01 First Baptist Church of Waldo,1411,President of the United States,,100000,"Goode, / Clymer",CPF,0,59941,1


In [119]:
df_cleaned_2012.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159171 entries, 0 to 159170
Data columns (total 15 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   index          159171 non-null  int64 
 1   county_code    159171 non-null  object
 2   county_name    159171 non-null  object
 3   elec_date      159171 non-null  object
 4   precinct_id    159171 non-null  object
 5   poll_loc       152136 non-null  object
 6   total_reg      159171 non-null  int64 
 7   contest_name   159171 non-null  object
 8   district       159171 non-null  object
 9   contest_code   159171 non-null  int64 
 10  cand_or_issue  159171 non-null  object
 11  cand_party     159171 non-null  object
 12  cand_id        159171 non-null  int64 
 13  doe_num        159171 non-null  int64 
 14  vote_total     159171 non-null  int64 
dtypes: int64(6), object(9)
memory usage: 18.2+ MB


By removing extraneous data, we're down to fewer than 160k rows over 14 non-index columns from our original nearly 700k rows over 19 non-index columns. Woo!

## Step 6: Saving the Dataframe to a .csv

Given that the ultimate destination for all of our dataframes will be BigQuery, let's export this new cleaned dataframe as a .csv that can readily be uploaded into BigQuery.

**Note:** This code is commented out so that the file isn't written again if I hit 'Run All' in my IDE.

In [120]:
# df_cleaned_2012.to_csv('fl_2012_cleaned.csv')
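An alternative to commenting the export out is to guard it so that it only writes when the file is missing. The sketch below shows that idea under a few assumptions: `save_once` is a hypothetical helper, and it passes `index=False`, which differs from the call above (that call would also write the pandas index as an extra unnamed column).

```python
import os

import pandas as pd


def save_once(frame, path):
    """Write `frame` to `path` unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    # index=False leaves the pandas row labels out of the file
    frame.to_csv(path, index=False)
    return True
```

With this in place, `save_once(df_cleaned_2012, 'fl_2012_cleaned.csv')` could stay uncommented, since a second 'Run All' would leave the existing file alone.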

## Step 7: Creating Data Cleaning Pipeline

So after all that, we've ingested and cleaned... one election's worth of data, out of six. Mercifully, the FDOE data is provided in the same format for each county each year, meaning we can set up some cleaning functions based on our learnings above to make the cleaning process repeatable and *much* easier.

We'll start with a function that reads all of the files for a given election year (i.e. in a given directory) and merges them into a single dataframe:

In [121]:
def merge_files(directory):
    # Column names for the headerless, tab-delimited FDOE files
    columns = ['county_code', 'county_name', 'elec_num', 'elec_date', 'elec_name', 'precinct_id', 'poll_loc', 'total_reg', 'total_reg_r', 'total_reg_d', 'total_reg_other', 'contest_name', 'district', 'contest_code', 'cand_or_issue', 'cand_party', 'cand_id', 'doe_num', 'vote_total']
    target_files = glob.glob(directory)
    # Read every file first, then concatenate once instead of growing a dataframe in the loop
    frames = [pd.read_table(file, names=columns, encoding_errors='replace') for file in target_files]
    # ignore_index=True gives the merged frame one continuous index, as in Step 1
    return pd.concat(frames, ignore_index=True)

Now functions to drop duplicates, drop superfluous columns, drop NaNs, and narrow to relevant races:

In [122]:
def drop_duplicates(df):
    # Return a new dataframe rather than mutating the caller's, like the other helpers
    return df.drop_duplicates(keep='first')

In [123]:
def drop_columns(df):
    df = df.drop(columns=['total_reg_r','total_reg_d','total_reg_other','elec_num','elec_name'])
    return df

In [124]:
def drop_nans(df):
    df = df.dropna(subset=['cand_party'])
    return df

In [125]:
def select_races(df):
    df = df[df.contest_name.isin(['President of the United States',\
                                'United States Senator',\
                                'U.S. Representative',\
                                'State Senator',\
                                'State Representative'])]\
                                .reset_index()
    return df

Now a master function that applies all of the above cleaning functions and saves the cleaned data as an appropriately named .csv:

In [126]:
def data_cleaning_pipeline(locations):
    # One election year per entry in `locations`, in the same order
    years = ['2012', '2014', '2016', '2018', '2020', '2022']
    for year, location in zip(years, locations):
        df = merge_files(location)
        df = drop_duplicates(df)
        df = drop_columns(df)
        df = drop_nans(df)
        df = select_races(df)
        df.to_csv(f"fl_{year}_cleaned.csv")


In [127]:
location_list = ["C:\\Users\\canor\\Documents\\GitHub\\FL-Political-Analysis\\Florida Analysis\\FL 2012 by Precinct\\Text Files\\*", \
                "C:\\Users\\canor\\Documents\\GitHub\\FL-Political-Analysis\\Florida Analysis\\FL 2014 by Precinct\\*", \
                "C:\\Users\\canor\\Documents\\GitHub\\FL-Political-Analysis\\Florida Analysis\\FL 2016 by Precinct\\*", \
                "C:\\Users\\canor\\Documents\\GitHub\\FL-Political-Analysis\\Florida Analysis\\FL 2018 by Precinct\\*", \
                "C:\\Users\\canor\\Documents\\GitHub\\FL-Political-Analysis\\Florida Analysis\\FL 2020 by Precinct\\*", \
                "C:\\Users\\canor\\Documents\\GitHub\\FL-Political-Analysis\\Florida Analysis\\FL 2022 by Precinct\\*"]

In [128]:
data_cleaning_pipeline(location_list)
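The pipeline pairs each location with a year purely by position, so reordering `location_list` would silently mislabel the output files. Since every directory name above follows the pattern `FL <year> by Precinct`, the year could instead be read from the path itself. This is a minimal sketch of that approach; `year_from_path` is a hypothetical helper, and the example path is shortened for illustration.

```python
import re


def year_from_path(path):
    """Extract the four-digit election year from a 'FL <year> by Precinct' path."""
    match = re.search(r"FL (\d{4}) by Precinct", path)
    if match is None:
        raise ValueError(f"no election year found in {path!r}")
    return match.group(1)


# Shortened example path, for illustration only
print(year_from_path(r"Florida Analysis\FL 2014 by Precinct\*"))  # 2014
```

Inside the pipeline, `df.to_csv(f"fl_{year_from_path(location)}_cleaned.csv")` would then remove the need for the separate `years` list.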