### Step 2: Explore and Assess the Data
#### Explore the Data
Identify data quality issues, like missing values, duplicate data, etc.  
Identify which columns to keep for each data source  
Identify which columns are present in multiple sources but have different names

In [1]:
import pandas as pd

#### Covid cases and deaths

In [18]:
covid_df = pd.read_csv("data/covid_cases_US.csv")
covid_df.head()

| | UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_ | ... | 2/18/21 | 2/19/21 | 2/20/21 | 2/21/21 | 2/22/21 | 2/23/21 | 2/24/21 | 2/25/21 | 2/26/21 | 2/27/21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84001001 | US | USA | 840 | 1001.0 | Autauga | Alabama | US | 32.539527 | -86.644082 | ... | 6071 | 6079 | 6092 | 6117 | 6121 | 6143 | 6172 | 6203 | 6228 | 6248 |
| 1 | 84001003 | US | USA | 840 | 1003.0 | Baldwin | Alabama | US | 30.72775 | -87.722071 | ... | 19324 | 19361 | 19392 | 19433 | 19461 | 19554 | 19635 | 19670 | 19698 | 19714 |
| 2 | 84001005 | US | USA | 840 | 1005.0 | Barbour | Alabama | US | 31.868263 | -85.387129 | ... | 2057 | 2061 | 2067 | 2070 | 2074 | 2084 | 2095 | 2099 | 2106 | 2113 |
| 3 | 84001007 | US | USA | 840 | 1007.0 | Bibb | Alabama | US | 32.996421 | -87.125115 | ... | 2405 | 2411 | 2414 | 2416 | 2417 | 2432 | 2437 | 2442 | 2445 | 2449 |
| 4 | 84001009 | US | USA | 840 | 1009.0 | Blount | Alabama | US | 33.982109 | -86.567906 | ... | 6008 | 6021 | 6040 | 6042 | 6043 | 6058 | 6072 | 6086 | 6084 | 6095 |


At first glance, it looks alright. Each day's case count is its own column (the numbers keep climbing from day to day, so these look like running totals rather than new cases), which means we'll have to extract and transpose that data. It's great that the latitude and longitude are included, though. This'll make it much easier to match up counties to weather stations later on. The "Long_" column name is a bit weird, though.  
I also think we can ditch most of the first few columns, since we don't need the country codes and region.
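
The "extract and transpose" step could look roughly like the sketch below. It's a minimal example on a tiny stand-in frame built from two rows of the output above, not the final ETL code; the helper name `wide_to_long`, the `id_cols` selection, and the `cases` column name are my own assumptions.

```python
import pandas as pd

# Tiny stand-in for covid_df, using values from the head() output above.
sample = pd.DataFrame({
    "UID": [84001001, 84001003],
    "FIPS": [1001.0, 1003.0],
    "Admin2": ["Autauga", "Baldwin"],
    "Province_State": ["Alabama", "Alabama"],
    "Lat": [32.539527, 30.72775],
    "Long_": [-86.644082, -87.722071],
    "2/26/21": [6228, 19698],
    "2/27/21": [6248, 19714],
})

# Assumption: these are the identifying columns we want to keep.
id_cols = ["FIPS", "Admin2", "Province_State", "Lat", "Long_"]


def wide_to_long(df, id_cols, value_name="cases"):
    """Reshape one-column-per-date into one-row-per-county-and-date."""
    # Treat every column that looks like m/d/yy as a date column.
    is_date = df.columns.str.match(r"\d{1,2}/\d{1,2}/\d{2}$")
    date_cols = list(df.columns[is_date])
    # Columns in neither id_vars nor value_vars (e.g. UID) are dropped.
    long_df = df.melt(id_vars=id_cols, value_vars=date_cols,
                      var_name="date", value_name=value_name)
    long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
    return long_df


long_df = wide_to_long(sample, id_cols)
print(long_df)
```

Picking the date columns by pattern instead of by position means the same helper should keep working when new dates get appended to the file.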

In [19]:
covid_df.min()

UID          16
iso2         AS
iso3        ASM
code3        16
FIPS       60.0
           ... 
2/23/21       0
2/24/21       0
2/25/21       0
2/26/21       0
2/27/21       0
Length: 413, dtype: object

In [20]:
covid_df.max()

UID        84099999
iso2             VI
iso3            VIR
code3           850
FIPS        99999.0
             ...   
2/23/21     1183496
2/24/21     1185559
2/25/21     1187542
2/26/21     1189232
2/27/21     1190894
Length: 413, dtype: object

So far, this looks quite good. No strange states, no negative FIPS codes (kind of like ZIP codes, but for counties), no negative case numbers.

In [27]:
covid_df[["FIPS", "Admin2", "Province_State", "Lat", "Long_"]].nunique()

FIPS              3330
Admin2            1978
Province_State      58
Lat               3226
Long_             3226
dtype: int64

There are supposed to be up to 3243 FIPS codes in the US, depending on whether or not you count overseas territories. We count 3330, which looks close enough.  
But here's where it gets a bit weird: 1978 named counties (here called "Admin2" for some reason), but 3330 FIPS codes? Each code is meant to map to a county. How do we have more codes than county names?  
It's also a bit weird that it lists 58 states when I expected 51 (50 states plus D.C.). Maybe the extras are overseas territories?  
It is good to see that while the latitude and longitude counts don't match the FIPS count, they at least match each other.

In [28]:
covid_df[["FIPS", "Admin2", "Province_State", "Lat", "Long_"]].isna().sum()

FIPS              10
Admin2             6
Province_State     0
Lat                0
Long_              0
dtype: int64

I figured we might have some counties that are reported by FIPS but not by name, but that doesn't appear to be the case. Both columns have a few NaNs, but not a significant number.  
The rest of the data seems intact, though.

In [55]:
pd.concat(g for _, g in covid_df.groupby("Admin2") if len(g) > 1)[["FIPS", "Admin2", "Province_State"]].head(20)

| | FIPS | Admin2 | Province_State |
|---|---|---|---|
| 823 | 19001.0 | Adair | Iowa |
| 1031 | 21001.0 | Adair | Kentucky |
| 1540 | 29001.0 | Adair | Missouri |
| 2212 | 40001.0 | Adair | Oklahoma |
| 255 | 8001.0 | Adams | Colorado |
| 580 | 16003.0 | Adams | Idaho |
| 625 | 17001.0 | Adams | Illinois |
| 729 | 18001.0 | Adams | Indiana |
| 824 | 19003.0 | Adams | Iowa |
| 1456 | 28001.0 | Adams | Mississippi |


Mystery solved! Looks like a bunch of states have counties of the same name, which is why there are fewer unique names than FIPS codes.
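
To double-check that the repeats really are just shared names across states, one option is to count duplicates on the name alone and then on the name plus state. This is only a sketch on four rows taken from the output above; on the real frame I'd expect the second count to be zero or close to it, but I haven't verified that here.

```python
import pandas as pd

# Four rows copied from the duplicate-name listing above.
sample = pd.DataFrame({
    "FIPS": [19001.0, 21001.0, 8001.0, 16003.0],
    "Admin2": ["Adair", "Adair", "Adams", "Adams"],
    "Province_State": ["Iowa", "Kentucky", "Colorado", "Idaho"],
})

# Rows whose county name appears more than once anywhere in the data.
name_repeats = sample.duplicated(subset=["Admin2"], keep=False).sum()

# Rows whose (county name, state) pair appears more than once.
pair_repeats = sample.duplicated(
    subset=["Admin2", "Province_State"], keep=False
).sum()

print(f"repeated names: {name_repeats}, repeated name+state pairs: {pair_repeats}")
```

If the pair count comes out non-zero on the full data set, the name alone (even with the state) isn't a safe key and FIPS should be used instead.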

In [47]:
covid_df[covid_df["FIPS"].isna()][["Admin2", "Province_State"]].head(20)

| | Admin2 | Province_State |
|---|---|---|
| 1267 | Dukes and Nantucket | Massachusetts |
| 1304 | Federal Correctional Institution (FCI) | Michigan |
| 1336 | Michigan Department of Corrections (MDOC) | Michigan |
| 1591 | Kansas City | Missouri |
| 2954 | Bear River | Utah |
| 2959 | Central Utah | Utah |
| 2978 | Southeast Utah | Utah |
| 2979 | Southwest Utah | Utah |
| 2982 | TriCounty | Utah |
| 2990 | Weber-Morgan | Utah |


So the places without FIPS are... prisons? And... Kansas City as well as a few places in Utah?

In [42]:
covid_df[covid_df["Admin2"].isna()][["FIPS", "Province_State"]].head()

| | FIPS | Province_State |
|---|---|---|
| 100 | 60.0 | American Samoa |
| 336 | 88888.0 | Diamond Princess |
| 570 | 99999.0 | Grand Princess |
| 571 | 66.0 | Guam |
| 2121 | 69.0 | Northern Mariana Islands |


Places without a county name appear to be overseas territories and two cruise ships. It seems like overseas territories have FIPS codes below 100, while regular county codes start at 1001 (Autauga, Alabama). The two cruise ships seem to have been assigned made-up five-digit numbers (88888 and 99999).
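
Those observations could be turned into a small labelling helper, which would make it easier to decide later which rows to keep. The sketch below is hypothetical: `classify_fips` and its category names are mine, and the thresholds only reflect the handful of rows inspected above, so other special codes in the full file may need their own rule.

```python
import numpy as np
import pandas as pd

# Rows taken from the outputs above, plus one row with a missing FIPS.
sample = pd.DataFrame({
    "FIPS": [60.0, 88888.0, 99999.0, 1001.0, np.nan],
    "Province_State": ["American Samoa", "Diamond Princess",
                       "Grand Princess", "Alabama", "Utah"],
})


def classify_fips(fips):
    """Label a FIPS value using the ranges observed in this notebook."""
    if pd.isna(fips):
        return "missing"
    if fips in (88888, 99999):  # codes seen on the two cruise ships
        return "cruise_ship"
    if fips < 100:              # codes seen on overseas territories
        return "territory"
    return "county"             # assumption: everything else is a county


sample["kind"] = sample["FIPS"].map(classify_fips)
print(sample)
```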

##### Conclusion
We might be able to fill in the missing FIPS codes for Utah as well as the two outliers (Dukes and Nantucket, Kansas City) from other sources. I'm not sure about the prisons.  
The places without county names could just take the province name as a county name, for clarity's sake. Or we could just exclude them. I don't know if we need those in the data set, since we're not likely to find relevant data for them in our other sources.
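
Both options for the rows without a county name are one-liners in pandas. Here's a minimal sketch on a two-row stand-in frame, just to show what each choice would do:

```python
import pandas as pd

# Stand-in frame: one territory without a county name, one regular county.
sample = pd.DataFrame({
    "FIPS": [66.0, 1001.0],
    "Admin2": [None, "Autauga"],
    "Province_State": ["Guam", "Alabama"],
})

# Option 1: reuse the province name where the county name is missing.
filled = sample.assign(
    Admin2=sample["Admin2"].fillna(sample["Province_State"])
)

# Option 2: drop the rows without a county name entirely.
dropped = sample.dropna(subset=["Admin2"])

print(filled)
print(dropped)
```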

During ETL, I'll need to insert one entry into the table for each combined FIPS + date key, and then add the cases and deaths using that key.
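
As a rough idea of how that combined key could work, the sketch below joins a long-format cases frame and a long-format deaths frame on FIPS and date. Everything here is an assumption about a step I haven't built yet: the frame and column names are hypothetical, the deaths numbers are made-up placeholders (not real figures), and `fips_key` is my own helper for turning the float FIPS into a five-character string.

```python
import pandas as pd


def fips_key(fips):
    """Convert float FIPS values like 1001.0 to strings like '01001'.

    Rows with a missing FIPS have to be filled or dropped first,
    because NaN cannot be cast to an integer.
    """
    return fips.astype("int64").astype(str).str.zfill(5)


dates = pd.to_datetime(["2021-02-26", "2021-02-27"])

# Case counts for Autauga, Alabama, from the head() output above.
cases_long = pd.DataFrame({
    "fips": fips_key(pd.Series([1001.0, 1001.0])),
    "date": dates,
    "cases": [6228, 6248],
})

# Placeholder deaths values, for illustration only.
deaths_long = pd.DataFrame({
    "fips": fips_key(pd.Series([1001.0, 1001.0])),
    "date": dates,
    "deaths": [1, 2],
})

# One row per (fips, date); validate raises if either side repeats a key.
merged = cases_long.merge(
    deaths_long, on=["fips", "date"], how="outer", validate="one_to_one"
)
print(merged)
```

Storing the FIPS as a zero-padded string keeps the leading zero for states like Alabama, which would be lost if the key stayed numeric.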

I'd hazard a guess that the deaths data is structured quite similarly, so we should be able to use that the same way should we decide to do so.

#### Health data

#### Document findings


#### Cleaning Steps
Document steps necessary to clean the data

#### Document findings


#### Selection steps
Document steps necessary to select the correct columns and prepare some for linking and consolidation

#### Document findings
