# EDA for USAFacts data

The USAFacts webpage for this data is [here](https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/).

In [1]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# load data
raw = pd.read_csv("../data/01_usafacts_data.csv", encoding="iso-8859-1")

In [23]:
# show the first 5 rows
raw.head(5)

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,Date:1/22/2020,Date:1/23/2020,Date:1/24/2020,Date:1/25/2020,Date:1/26/2020,Date:1/27/2020,...,Date:3/11/2020,Date:3/12/2020,Date:3/13/2020,Date:3/14/2020,Date:3/15/2020,Date:3/16/2020,Date:3/17/2020,Date:3/18/2020,Date:3/19/2020,Date:3/20/2020
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
1,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,0,0,0,1,1,1,1,1,1,2
2,1015,Calhoun County,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,1
3,1017,Chambers County,AL,1,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,1,1
4,1043,Cullman County,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [24]:
# show the number of rows and columns
raw.shape

(958, 63)

Here is a brief summary of the table:
- countyFIPS: a unique identifier for each county in the US. FIPS stands for the Federal Information Processing Standards coding scheme. Confirmed cases whose county is unknown are recorded with a countyFIPS of 0.
- County Name: the name of the county.
- State: the state the county belongs to.
- stateFIPS: an identifier for the state.
- The remaining fields give the number of confirmed cases up to a particular date.
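
As a rough illustration of this layout, the sketch below separates the identifier columns from the date columns and picks out the unallocated rows. The frame `toy` and all of its values are made up for illustration; only the column names follow the table above.

```python
import pandas as pd

# Hypothetical toy frame; only the column names follow the real table.
toy = pd.DataFrame({
    "countyFIPS": [0, 99001, 99002],
    "County Name": ["Statewide Unallocated", "County A", "County B"],
    "State": ["XX", "XX", "XX"],
    "stateFIPS": [99, 99, 99],
    "Date:3/19/2020": [1, 1, 0],
    "Date:3/20/2020": [1, 2, 1],
})

# Date columns carry the "Date:" prefix; everything else identifies the row.
date_cols = [c for c in toy.columns if c.startswith("Date:")]
id_cols = [c for c in toy.columns if c not in date_cols]

# Cases with no known county are stored under countyFIPS 0.
unallocated = toy[toy["countyFIPS"] == 0]

# Each date column holds cases up to that date, so the last one has the latest totals.
latest_total = toy[date_cols[-1]].sum()
```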

In [14]:
# preprocess: add a "Date:" prefix to every date column name (the first 4 columns are identifiers)
raw.columns = list(raw.columns[:4]) + ["Date:" + name for name in raw.columns[4:]]

In [46]:
# preprocess: count the rows for each (countyFIPS, stateFIPS) pair
raw.groupby(
    ['countyFIPS', 'stateFIPS']
)['County Name'].count().reset_index().sort_values(
    'County Name',
    ascending=False,
).head(5)

Unnamed: 0,countyFIPS,stateFIPS,County Name
770,45061,45,2
859,50023,50,2
263,13243,13,2
196,12063,12,2
0,0,1,1


<font color='red'>Observation: some (countyFIPS, stateFIPS) pairs appear in more than one row.</font>
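
To inspect the colliding rows themselves rather than just their counts, one option is `DataFrame.duplicated` with `keep=False`, which flags every member of a repeated key. This is a minimal sketch on a made-up frame (`toy`, with hypothetical FIPS codes and county names), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy frame with one repeated (countyFIPS, stateFIPS) pair.
toy = pd.DataFrame({
    "countyFIPS": [0, 99001, 99002, 99002],
    "County Name": ["Statewide Unallocated", "County A", "County B", "County B"],
    "State": ["XX", "XX", "XX", "XX"],
    "stateFIPS": [99, 99, 99, 99],
    "Date:3/20/2020": [0, 2, 3, 3],
})

keys = ["countyFIPS", "stateFIPS"]

# keep=False marks all rows of a repeated key, not only the later copies.
is_repeated = toy.duplicated(subset=keys, keep=False)
dupes = toy[is_repeated].sort_values(keys)

# Check whether the repeated rows are identical in every column.
exact_copies = dupes.duplicated(keep=False).all()
```

If `exact_copies` is true, the repeats carry no extra information; if not, the rows disagree somewhere and need a closer look before anything is dropped.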

In [16]:
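# try to reshape to long format; this fails because the id columns do not uniquely identify each row
pd.wide_to_long(raw, stubnames='Date:', i=['countyFIPS', 'stateFIPS'], j='confirmed')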

ValueError: the id variables need to uniquely identify each row
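
One possible way around this error is to drop the repeated rows and reshape with `DataFrame.melt`, which does not require unique ids. The sketch below assumes the repeated rows are exact copies, which this notebook has not verified, and runs on a made-up frame (`toy`) rather than on `raw`. The names `date` and `confirmed` for the output columns are choices made here for illustration.

```python
import pandas as pd

# Hypothetical toy frame mimicking the layout of `raw`, with one repeated row.
toy = pd.DataFrame({
    "countyFIPS": [0, 99001, 99002, 99002],
    "County Name": ["Statewide Unallocated", "County A", "County B", "County B"],
    "State": ["XX", "XX", "XX", "XX"],
    "stateFIPS": [99, 99, 99, 99],
    "Date:3/19/2020": [0, 1, 2, 2],
    "Date:3/20/2020": [0, 2, 3, 3],
})

id_cols = ["countyFIPS", "County Name", "State", "stateFIPS"]

# Assumption: repeated rows are exact copies, so dropping them loses nothing.
deduped = toy.drop_duplicates()

# One output row per (county, date); column names become the "date" values.
long_df = deduped.melt(id_vars=id_cols, var_name="date", value_name="confirmed")

# Strip the "Date:" prefix and parse the remainder as a calendar date.
long_df["date"] = pd.to_datetime(
    long_df["date"].str.replace("Date:", "", regex=False),
    format="%m/%d/%Y",
)
```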