# Clean the usafacts data

We apply the following cleaning steps to the usafacts data.
- The county code for Richmond, GA is wrong. It is listed as 13243, which is the code for Randolph, GA; it should be 13245. [USDA NRCS reference](https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697)
- Three counties appear twice: Jackson, FL; Lee, SC; Washington, VT. This may be because the county names are wrong, or because the rows come from different reporting sources within the same county. We merge the duplicated rows by adding their numbers together, so that the total number of cases matches the total reported on the webpage.
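
The merge-by-summing idea can be illustrated on a small toy table. This is only a sketch: the `toy` frame, its values, and the names `key_cols` and `date_cols` are made up for illustration and are not taken from the usafacts file.

```python
import pandas as pd

# Toy data (hypothetical values): Lee County, SC appears twice
# under the same FIPS code, as in the duplicated rows described above.
toy = pd.DataFrame({
    "countyFIPS": [45061, 45061, 50023],
    "County Name": ["Lee County", "Lee County", "Washington County"],
    "State": ["SC", "SC", "VT"],
    "stateFIPS": [45, 45, 50],
    "3/19/2020": [1, 0, 1],
    "3/20/2020": [1, 1, 1],
})

key_cols = ["countyFIPS", "County Name", "State", "stateFIPS"]
date_cols = [c for c in toy.columns if c not in key_cols]

# Sum the date columns within each (county, state) key.
merged = toy.groupby(key_cols)[date_cols].sum().reset_index()

# Merging by summation leaves the per-date totals unchanged.
assert (merged[date_cols].sum() == toy[date_cols].sum()).all()
```

Because rows are only added together, never dropped, the column totals before and after the merge agree, which is the property the cleaning step relies on.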

In [1]:
# import libraries
import numpy as np
import pandas as pd

In [2]:
# load data
raw = pd.read_csv("../data/01_usafacts_data.csv", encoding="iso-8859-1")

In [3]:
# preprocess: find every countyFIPS code that appears in more than one row
replicates = raw.groupby(['countyFIPS', 'stateFIPS'])['County Name'].count().reset_index()
redundant_countyFIPS = list(replicates.loc[replicates['County Name'] > 1, 'countyFIPS'])
assert redundant_countyFIPS == [12063, 13243, 45061, 50023], "The data source seems to have changed"
redundant_countyFIPS

[12063, 13243, 45061, 50023]
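
A repeated code can mean two different things: the same county listed twice, or two different counties sharing one code by mistake. A possible way to tell them apart, sketched on a toy table (the `toy` frame and the helper names `dup_mask`, `dup_codes`, and `collisions` are illustrative assumptions, not part of the notebook), is to count the distinct county names per repeated code:

```python
import pandas as pd

# Toy data (hypothetical): one true duplicate and one code collision.
toy = pd.DataFrame({
    "countyFIPS": [12063, 12063, 13243, 13243, 1001],
    "stateFIPS": [12, 12, 13, 13, 1],
    "County Name": [
        "Jackson County", "Jackson County",
        "Richmond County", "Randolph County",
        "Autauga County",
    ],
})

# Flag every row whose (countyFIPS, stateFIPS) pair occurs more than once.
dup_mask = toy.duplicated(subset=["countyFIPS", "stateFIPS"], keep=False)
dup_codes = sorted(toy.loc[dup_mask, "countyFIPS"].unique().tolist())

# A repeated code with more than one distinct name is a collision,
# not a true duplicate, and should be corrected rather than merged.
names_per_code = toy[dup_mask].groupby("countyFIPS")["County Name"].nunique()
collisions = names_per_code[names_per_code > 1].index.tolist()
```

Here `dup_codes` holds both repeated codes, while `collisions` holds only the one whose rows carry different county names.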

In [4]:
# visualize the rows that have duplicates
raw[raw['countyFIPS'].isin(redundant_countyFIPS)]

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/2020,1/23/2020,1/24/2020,1/25/2020,1/26/2020,1/27/2020,...,3/11/2020,3/12/2020,3/13/2020,3/14/2020,3/15/2020,3/16/2020,3/17/2020,3/18/2020,3/19/2020,3/20/2020
155,12063,Jackson County,FL,12,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,4
156,12063,Jackson County,FL,12,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,1
224,13243,Richmond County,GA,13,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,2,8
225,13243,Randolph County,GA,13,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
762,45061,Lee County,SC,45,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,1
763,45061,Lee County,SC,45,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
857,50023,Washington County,VT,50,0,0,0,0,0,0,...,0,0,0,1,1,1,1,1,1,1
858,50023,Washington County,VT,50,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1


In [5]:
# preprocess: change Richmond, GA's county code from 13243 to 13245
assert raw.iloc[224, 0] == 13243 and raw.iloc[224, 1] == 'Richmond County', 'The data source has changed.'
raw.iloc[224, 0] = 13245
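
The positional fix above depends on Richmond County staying in row 224. A variant that selects the row by its content instead could look like the sketch below; the `toy` frame and the `mask` name are assumptions for illustration, and this is not the approach the notebook itself uses.

```python
import pandas as pd

# Toy data (hypothetical): the two Georgia rows that share code 13243.
toy = pd.DataFrame({
    "countyFIPS": [13243, 13243],
    "County Name": ["Richmond County", "Randolph County"],
    "State": ["GA", "GA"],
})

# Select the row by county name and state rather than by position.
mask = (toy["County Name"] == "Richmond County") & (toy["State"] == "GA")

# Guard against a changed data source: exactly one row should match.
assert mask.sum() == 1, "Expected exactly one Richmond County, GA row"

toy.loc[mask, "countyFIPS"] = 13245
```

Selecting by name keeps working if rows are inserted or reordered upstream, at the cost of relying on the county name being spelled consistently.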

In [6]:
# preprocess: merge rows with the same (county, state) key by summing the date columns
cleaned = raw.groupby(['countyFIPS', 'County Name', 'State', 'stateFIPS']).sum().reset_index()

In [7]:
# visualize the previously duplicated rows after merging
cleaned[cleaned['countyFIPS'].isin([45061, 50023, 13243, 12063])]

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/2020,1/23/2020,1/24/2020,1/25/2020,1/26/2020,1/27/2020,...,3/11/2020,3/12/2020,3/13/2020,3/14/2020,3/15/2020,3/16/2020,3/17/2020,3/18/2020,3/19/2020,3/20/2020
196,12063,Jackson County,FL,12,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,1,5
263,13243,Randolph County,GA,13,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
771,45061,Lee County,SC,45,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,2
860,50023,Washington County,VT,50,0,0,0,0,0,0,...,0,0,0,1,1,1,1,1,2,2


In [8]:
# save the cleaned data
cleaned.to_csv("../intermediate/02_cleaned_usafacts.csv", header=True, index=False)

In [9]:
# load the cleaned data
data = pd.read_csv('../intermediate/02_cleaned_usafacts.csv')
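
A quick way to confirm that the save and reload steps do not alter the table is a round-trip comparison. The sketch below uses an in-memory buffer instead of the real file path, and the `toy` frame is a hypothetical stand-in for `cleaned`:

```python
import io

import pandas as pd

# Toy stand-in for the cleaned table (hypothetical subset of columns).
toy = pd.DataFrame({
    "countyFIPS": [12063, 13245],
    "County Name": ["Jackson County", "Richmond County"],
    "State": ["FL", "GA"],
    "stateFIPS": [12, 13],
    "3/20/2020": [5, 8],
})

# Write without the index, as in the save step above, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, header=True, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises an AssertionError if values, dtypes, or column names differ.
pd.testing.assert_frame_equal(toy, reloaded)
```

Writing with `index=False` matters here: if the index were written, reading the file back would produce an extra `Unnamed: 0` column.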

In [10]:
# visualize the head
data.head()

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/2020,1/23/2020,1/24/2020,1/25/2020,1/26/2020,1/27/2020,...,3/11/2020,3/12/2020,3/13/2020,3/14/2020,3/15/2020,3/16/2020,3/17/2020,3/18/2020,3/19/2020,3/20/2020
0,0,Statewide Unallocated,AK,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,4
1,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
2,0,Statewide Unallocated,AR,5,0,0,0,0,0,0,...,0,2,5,6,9,6,12,18,15,35
3,0,Statewide Unallocated,AZ,4,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,Statewide Unallocated,CA,6,0,0,0,0,0,0,...,16,16,0,0,0,0,0,0,0,0
