### 90803 Data Cleaning and Question Definition
### Data Cleaning: 2020 Elections Data

**Team 14**

Chi-Shiun Tsai & Colton Lapp

This notebook cleans the 2020 general election precinct-level data from the MIT Election Data and Science Lab.

### 0. Importing libraries

In [1]:
import glob
import numpy as np
import pandas as pd
from datetime import datetime

### 1. Reading datasets

In [2]:
df = pd.read_csv('../data/2020-all-states-local-precinct-general.csv', low_memory=False)
df.sample(25)

Unnamed: 0,precinct,office,party_detailed,party_simplified,mode,votes,county_name,county_fips,jurisdiction_name,jurisdiction_fips,...,stage,state,special,writein,state_po,state_fips,state_cen,state_ic,date,readme_check
35376,CHRIST UN METHODIST,COUNTY COMMISSION MEMBER,REPUBLICAN,REPUBLICAN,TOTAL,0,SHELBY,1117.0,SHELBY,1117,...,GEN,ALABAMA,False,False,AL,1,63,41,2020-11-03,False
441066,77_77 - L.A. AINGER MIDDLE SCHOOL,SHERIFF,,,TOTAL,138,CHARLOTTE,12015.0,CHARLOTTE,12015,...,GEN,FLORIDA,False,False,FL,12,59,43,2020-11-03,False
1090698,OSRB 81-120,WAKE SOIL AND WATER CONSERVATION DISTRICT SUPE...,NONPARTISAN,NONPARTISAN,ABSENTEE BY MAIL,0,WAKE,37183.0,WAKE,37183,...,GEN,NORTH CAROLINA,False,True,NC,37,56,47,2020-11-03,False
619489,V130,SOIL AND WATER,,,TOTAL,116,JEFFERSON,21111.0,JEFFERSON,21111,...,GEN,KENTUCKY,False,False,KY,21,61,51,2020-11-03,False
290125,510,COURT OF APPEAL,NONPARTISAN,NONPARTISAN,TOTAL,1695,DUVAL,12031.0,DUVAL,12031,...,GEN,FLORIDA,False,False,FL,12,59,43,2020-11-03,False
1338709,NORTH CHARLESTON 1,SCHOOL BOARD,NONPARTISAN,NONPARTISAN,IN PERSON ABSENTEE,42,CHARLESTON,45019.0,CHARLESTON,45019,...,GEN,SOUTH CAROLINA,False,False,SC,45,57,48,2020-11-03,False
1263893,ISSAQUEENA,CITY COUNCIL CLEMSON,NONPARTISAN,NONPARTISAN,ELECTION DAY,44,PICKENS,45077.0,PICKENS,45077,...,GEN,SOUTH CAROLINA,False,False,SC,45,57,48,2020-11-03,False
1082958,OSAP 1-40,WAKE SOIL AND WATER CONSERVATION DISTRICT SUPE...,NONPARTISAN,NONPARTISAN,ABSENTEE BY MAIL,0,WAKE,37183.0,WAKE,37183,...,GEN,NORTH CAROLINA,False,True,NC,37,56,47,2020-11-03,False
89514,GRAND BAY MIDDLE SCH,COUNTY CONSTABLE,,,TOTAL,0,MOBILE,1097.0,MOBILE,1097,...,GEN,ALABAMA,False,False,AL,1,63,41,2020-11-03,False
1330466,BELVEDERE NO. 74,REGISTER OF MESNE CONVEYANCE,NONPARTISAN,NONPARTISAN,ELECTION DAY,1,AIKEN,45003.0,AIKEN,45003,...,GEN,SOUTH CAROLINA,False,False,SC,45,57,48,2020-11-03,False
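
Because only six columns are used in the rest of the notebook, the read itself could be restricted to them. The following is a minimal sketch, not the notebook's actual cell: the inline three-row sample is a hypothetical stand-in for the real file, and forcing `precinct` to string is an assumption made because some precinct names (such as `510`) look numeric.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real precinct file.
sample_csv = io.StringIO(
    "precinct,office,party_simplified,mode,votes,state,year\n"
    "510,COURT OF APPEAL,NONPARTISAN,TOTAL,1695,FLORIDA,2020\n"
    "V130,SOIL AND WATER,,TOTAL,116,KENTUCKY,2020\n"
    "ISSAQUEENA,CITY COUNCIL CLEMSON,NONPARTISAN,ELECTION DAY,44,SOUTH CAROLINA,2020\n"
)

wanted = ['year', 'state', 'precinct', 'office', 'party_simplified', 'votes']

# usecols skips the unneeded columns at parse time;
# dtype keeps numeric-looking precinct names as strings.
small = pd.read_csv(sample_csv, usecols=wanted, dtype={'precinct': str})
print(small.dtypes)
```

With `usecols`, the resulting columns follow the order in the file rather than the order of the list, so reindex with `small[wanted]` if column order matters.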


### 2. Data cleaning

In [3]:
# Subset to only the columns we want; .copy() avoids a SettingWithCopyWarning
# when party_simplified is modified later
df = df[['year', 'state', 'precinct', 'office', 'party_simplified', 'votes']].copy()
df

Unnamed: 0,year,state,precinct,office,party_simplified,votes
0,2020,ALABAMA,BELLAMY COMMUNITY CTR,3 MILL SCHOOL TAX,,0
1,2020,ALABAMA,BELLAMY COMMUNITY CTR,3 MILL SCHOOL TAX,,25
2,2020,ALABAMA,BELLAMY COMMUNITY CTR,3 MILL SCHOOL TAX,NONPARTISAN,62
3,2020,ALABAMA,BELLAMY COMMUNITY CTR,3 MILL SCHOOL TAX,NONPARTISAN,171
4,2020,ALABAMA,BOYD TRAILER,3 MILL SCHOOL TAX,,0
...,...,...,...,...,...,...
1485059,2020,WYOMING,9-1,CIRCUIT COURT JUDGE,NONPARTISAN,272
1485060,2020,WYOMING,9-1,CIRCUIT COURT JUDGE,NONPARTISAN,64
1485061,2020,WYOMING,9-1,CIRCUIT COURT JUDGE,NONPARTISAN,256
1485062,2020,WYOMING,9-1,CIRCUIT COURT JUDGE,NONPARTISAN,56


In [4]:
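# Group by year, state, office, and party, and sum the votes.
# Selecting 'votes' keeps the non-numeric precinct column out of the sum.
# Note: groupby drops rows whose party_simplified is NaN by default.
df.groupby(['year', 'state', 'office', 'party_simplified'])[['votes']].sum()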

year,state,office,party_simplified,votes
2020,ALABAMA,3 MILL SCHOOL TAX,NONPARTISAN,5148
2020,ALABAMA,5 MILL SCHOOL TAX,NONPARTISAN,5189
2020,ALABAMA,ASSISTANT TAX ASSESSOR BESSEMER DIVISION OF JEFFERS,DEMOCRAT,44153
2020,ALABAMA,ASSISTANT TAX ASSESSOR BESSEMER DIVISION OF JEFFERS,REPUBLICAN,35314
2020,ALABAMA,ASSISTANT TAX COLLECTOR BESSEMER DIVISION OF JEFFER,DEMOCRAT,44063
2020,...,...,...,...
2020,WISCONSIN,WAUSHARA COUNTY DISTRICT ATTORNEY,DEMOCRAT,3955
2020,WISCONSIN,WAUSHARA COUNTY DISTRICT ATTORNEY,REPUBLICAN,9188
2020,WISCONSIN,WINNEBAGO COUNTY DISTRICT ATTORNEY,REPUBLICAN,70331
2020,WISCONSIN,WOOD COUNTY DISTRICT ATTORNEY,OTHER,29128
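
The grouped output above contains no rows for missing parties because `groupby` excludes NaN keys by default. A small sketch of that behavior, using a toy frame built from the first four Alabama rows shown earlier (the shortened `office` label and the variable names are illustrative only; `dropna=False` requires pandas 1.1 or later):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the first four rows of the subset above.
toy = pd.DataFrame({
    'office': ['3 MILL SCHOOL TAX'] * 4,
    'party_simplified': [np.nan, np.nan, 'NONPARTISAN', 'NONPARTISAN'],
    'votes': [0, 25, 62, 171],
})

# Default behavior: rows with a NaN group key are silently dropped.
default_sum = toy.groupby(['office', 'party_simplified'])['votes'].sum()

# dropna=False keeps NaN as its own group.
kept_sum = toy.groupby(['office', 'party_simplified'], dropna=False)['votes'].sum()

print(default_sum.sum())  # 233
print(kept_sum.sum())     # 258
```

The 25 votes attached to the missing party disappear from the default aggregation, which is why the missing values need handling before the final groupby.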


In [5]:
# Check for missing values
df.isnull().sum(axis=0)

year                     0
state                    0
precinct                 0
office                   0
party_simplified    357354
votes                    0
dtype: int64

In [6]:
df['party_simplified'].unique()

array([nan, 'NONPARTISAN', 'DEMOCRAT', 'REPUBLICAN', 'OTHER',
       'LIBERTARIAN'], dtype=object)
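
`unique()` lists the categories but not how many rows fall into each. One possible way to see the counts, including the missing ones, is `value_counts(dropna=False)`. The series below is a made-up five-row example, not the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['party_simplified'].
toy = pd.Series(
    [np.nan, 'NONPARTISAN', 'DEMOCRAT', np.nan, 'REPUBLICAN'],
    name='party_simplified',
)

# dropna=False includes NaN as a category in the tally.
counts = toy.value_counts(dropna=False)
print(counts)
```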

In [7]:
# Label missing parties as 'NONE' so that groupby does not drop those rows
df['party_simplified'] = df['party_simplified'].fillna('NONE')
df['party_simplified'].unique()

array(['NONE', 'NONPARTISAN', 'DEMOCRAT', 'REPUBLICAN', 'OTHER',
       'LIBERTARIAN'], dtype=object)

In [8]:
# Check for missing values again to make sure we got them all
df.isnull().sum(axis=0)

year                0
state               0
precinct            0
office              0
party_simplified    0
votes               0
dtype: int64

### 3. Save cleaned dataset

In [9]:
# Sum only the votes column; precinct is a string column and should not be summed
df_cleaned = df.groupby(['year', 'state', 'office', 'party_simplified'])['votes'].sum().reset_index()

In [10]:
df_cleaned.head(10)

Unnamed: 0,year,state,office,party_simplified,votes
0,2020,ALABAMA,3 MILL SCHOOL TAX,NONE,1176
1,2020,ALABAMA,3 MILL SCHOOL TAX,NONPARTISAN,5148
2,2020,ALABAMA,5 MILL SCHOOL TAX,NONE,1135
3,2020,ALABAMA,5 MILL SCHOOL TAX,NONPARTISAN,5189
4,2020,ALABAMA,ASSISTANT TAX ASSESSOR BESSEMER DIVISION OF JE...,DEMOCRAT,44153
5,2020,ALABAMA,ASSISTANT TAX ASSESSOR BESSEMER DIVISION OF JE...,NONE,1707
6,2020,ALABAMA,ASSISTANT TAX ASSESSOR BESSEMER DIVISION OF JE...,REPUBLICAN,35314
7,2020,ALABAMA,ASSISTANT TAX COLLECTOR BESSEMER DIVISION OF J...,DEMOCRAT,44063
8,2020,ALABAMA,ASSISTANT TAX COLLECTOR BESSEMER DIVISION OF J...,NONE,1811
9,2020,ALABAMA,ASSISTANT TAX COLLECTOR BESSEMER DIVISION OF J...,REPUBLICAN,35299
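
A quick sanity check after aggregating is that the total number of votes is unchanged. The sketch below illustrates the idea on a toy frame built from the first five Alabama rows shown earlier, with the missing party already filled as `NONE`; the variable names are illustrative, and the same comparison could be run on `df` and `df_cleaned`.

```python
import pandas as pd

# Toy frame mirroring the first five rows after filling missing parties.
toy = pd.DataFrame({
    'year': [2020] * 5,
    'state': ['ALABAMA'] * 5,
    'precinct': ['BELLAMY COMMUNITY CTR'] * 4 + ['BOYD TRAILER'],
    'office': ['3 MILL SCHOOL TAX'] * 5,
    'party_simplified': ['NONE', 'NONE', 'NONPARTISAN', 'NONPARTISAN', 'NONE'],
    'votes': [0, 25, 62, 171, 0],
})

toy_cleaned = (
    toy.groupby(['year', 'state', 'office', 'party_simplified'])['votes']
    .sum()
    .reset_index()
)

# Aggregation should only regroup votes, never add or lose any.
assert toy['votes'].sum() == toy_cleaned['votes'].sum()
print(toy_cleaned)
```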


In [11]:
df_cleaned.to_csv('../data/data_cleaned/2020-all-states-local-precinct-general-cleaned.csv', index=False)
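
It can also be worth confirming that the saved file reads back identically. This is a sketch only: it writes to an in-memory buffer instead of the real output path, and the two-row frame is a hypothetical stand-in for `df_cleaned`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for df_cleaned.
toy_cleaned = pd.DataFrame({
    'year': [2020, 2020],
    'state': ['ALABAMA', 'ALABAMA'],
    'office': ['3 MILL SCHOOL TAX', '3 MILL SCHOOL TAX'],
    'party_simplified': ['NONE', 'NONPARTISAN'],
    'votes': [1176, 5148],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy_cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises an AssertionError if values, columns, or dtypes differ.
pd.testing.assert_frame_equal(reloaded, toy_cleaned)
```

Using `index=False` on write, as the notebook does, is what prevents an extra unnamed index column from appearing on reload.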

### References

- MIT Election Data and Science Lab website: https://electionlab.mit.edu/
- Codebook: https://github.com/MEDSL/2020-elections-official/blob/main/2020-precincts-codebook.md