### Step 2: Explore and Assess the Data
#### Explore the Data
Identify data quality issues, like missing values, duplicate data, etc.  
Identify which columns to keep for each data source  
Identify which columns are present in multiple sources but have different names

In [1]:
import pandas as pd

#### Covid cases and deaths

In [183]:
covid_df = pd.read_csv("data/covid_cases_US.csv")
covid_df.head()

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,...,2/18/21,2/19/21,2/20/21,2/21/21,2/22/21,2/23/21,2/24/21,2/25/21,2/26/21,2/27/21
0,84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.539527,-86.644082,...,6071,6079,6092,6117,6121,6143,6172,6203,6228,6248
1,84001003,US,USA,840,1003.0,Baldwin,Alabama,US,30.72775,-87.722071,...,19324,19361,19392,19433,19461,19554,19635,19670,19698,19714
2,84001005,US,USA,840,1005.0,Barbour,Alabama,US,31.868263,-85.387129,...,2057,2061,2067,2070,2074,2084,2095,2099,2106,2113
3,84001007,US,USA,840,1007.0,Bibb,Alabama,US,32.996421,-87.125115,...,2405,2411,2414,2416,2417,2432,2437,2442,2445,2449
4,84001009,US,USA,840,1009.0,Blount,Alabama,US,33.982109,-86.567906,...,6008,6021,6040,6042,6043,6058,6072,6086,6084,6095


At first glance, it looks alright. Each day's case count is its own column, which means we'll have to extract and transpose that data. It's great that the latitude and longitude are included, though; this will make it much easier to match up counties to weather stations later on. The "Long_" column name is a bit weird, mind you.  
I also think we can ditch most of the first few columns: we don't need the country codes and region.
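
The "extract and transpose" step mentioned above could look roughly like this. It's only a sketch on a two-county sample built from the preview, not the final ETL code; `sample_df`, `id_columns` and the `cases` column name are my own placeholders.

```python
import pandas as pd

# Toy frame mimicking the wide layout of covid_cases_US.csv (values from the preview above).
sample_df = pd.DataFrame({
    "FIPS": [1001.0, 1003.0],
    "Admin2": ["Autauga", "Baldwin"],
    "2/26/21": [6228, 19698],
    "2/27/21": [6248, 19714],
})

id_columns = ["FIPS", "Admin2"]

# Turn one column per date into one row per (county, date) pair.
long_df = sample_df.melt(id_vars=id_columns, var_name="date", value_name="cases")

# The column headers are M/D/YY strings, so parse them with an explicit format.
long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")

print(long_df.sort_values(["FIPS", "date"]))
```

The long format has one row per county and day, which is the shape a fips+date key needs later on.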

In [184]:
county_columns = ["FIPS", "Admin2", "Province_State", "Lat", "Long_"]
columns = county_columns + list(covid_df.columns[11:])
covid_df = covid_df[columns]
covid_df.head()

Unnamed: 0,FIPS,Admin2,Province_State,Lat,Long_,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,...,2/18/21,2/19/21,2/20/21,2/21/21,2/22/21,2/23/21,2/24/21,2/25/21,2/26/21,2/27/21
0,1001.0,Autauga,Alabama,32.539527,-86.644082,0,0,0,0,0,...,6071,6079,6092,6117,6121,6143,6172,6203,6228,6248
1,1003.0,Baldwin,Alabama,30.72775,-87.722071,0,0,0,0,0,...,19324,19361,19392,19433,19461,19554,19635,19670,19698,19714
2,1005.0,Barbour,Alabama,31.868263,-85.387129,0,0,0,0,0,...,2057,2061,2067,2070,2074,2084,2095,2099,2106,2113
3,1007.0,Bibb,Alabama,32.996421,-87.125115,0,0,0,0,0,...,2405,2411,2414,2416,2417,2432,2437,2442,2445,2449
4,1009.0,Blount,Alabama,33.982109,-86.567906,0,0,0,0,0,...,6008,6021,6040,6042,6043,6058,6072,6086,6084,6095


In [185]:
# Rename columns for ease of use
new_column_names = ["fips", "county_name", "state", "latitude", "longitude"]
covid_df = covid_df.rename(columns=dict(zip(county_columns, new_column_names)))
covid_df.head()

Unnamed: 0,fips,county_name,state,latitude,longitude,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,...,2/18/21,2/19/21,2/20/21,2/21/21,2/22/21,2/23/21,2/24/21,2/25/21,2/26/21,2/27/21
0,1001.0,Autauga,Alabama,32.539527,-86.644082,0,0,0,0,0,...,6071,6079,6092,6117,6121,6143,6172,6203,6228,6248
1,1003.0,Baldwin,Alabama,30.72775,-87.722071,0,0,0,0,0,...,19324,19361,19392,19433,19461,19554,19635,19670,19698,19714
2,1005.0,Barbour,Alabama,31.868263,-85.387129,0,0,0,0,0,...,2057,2061,2067,2070,2074,2084,2095,2099,2106,2113
3,1007.0,Bibb,Alabama,32.996421,-87.125115,0,0,0,0,0,...,2405,2411,2414,2416,2417,2432,2437,2442,2445,2449
4,1009.0,Blount,Alabama,33.982109,-86.567906,0,0,0,0,0,...,6008,6021,6040,6042,6043,6058,6072,6086,6084,6095


In [186]:
covid_df.min()

fips             60.0
state         Alabama
latitude      -14.271
longitude   -174.1596
1/22/20             0
               ...   
2/23/21             0
2/24/21             0
2/25/21             0
2/26/21             0
2/27/21             0
Length: 407, dtype: object

In [187]:
covid_df.max()

fips           99999.0
state          Wyoming
latitude     69.314792
longitude     145.6739
1/22/20              1
               ...    
2/23/21        1183496
2/24/21        1185559
2/25/21        1187542
2/26/21        1189232
2/27/21        1190894
Length: 407, dtype: object

So far, this looks quite good. No strange states, no negative FIPS (kinda like a zip-code, but for counties), no negative case numbers.

In [174]:
covid_df[new_column_names].nunique()

fips           3330
county_name    1978
state            58
latitude       3226
longitude      3226
dtype: int64

There are supposed to be up to 3243 FIPS codes in the US, depending on whether or not you count overseas territories. We count 3330, which looks close enough.  
But here's where it gets a bit weird: 1978 named counties (here called "Admin2" for some reason), but 3330 FIPS codes? Each code is meant to map to a county. How do we have more codes than counties?  
It's also a bit weird that it lists 58 states; I expected 51 (50 states plus DC). Maybe the rest are overseas territories?  
It is good to see that while the latitude and longitude counts don't match the FIPS count, they at least match each other.

In [175]:
covid_df[new_column_names].isna().sum()

fips           10
county_name     6
state           0
latitude        0
longitude       0
dtype: int64

I figured maybe we have some counties that we report by FIPS, not by name, but that doesn't appear to be the case. We've got a few NaNs for both, but not significant amounts.  
The rest of the data seems intact, though.

In [176]:
pd.concat(g for _, g in covid_df.groupby("county_name") if len(g) > 1)[["fips", "county_name", "state"]].head(20)

Unnamed: 0,fips,county_name,state
823,19001.0,Adair,Iowa
1031,21001.0,Adair,Kentucky
1540,29001.0,Adair,Missouri
2212,40001.0,Adair,Oklahoma
255,8001.0,Adams,Colorado
580,16003.0,Adams,Idaho
625,17001.0,Adams,Illinois
729,18001.0,Adams,Indiana
824,19003.0,Adams,Iowa
1456,28001.0,Adams,Mississippi


Mystery solved! Sounds like a bunch of states have counties of the same name.
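
To double-check that the name clashes really are just same-named counties in different states, one option is to compare duplicates on the name alone with duplicates on name plus state. A small sketch, using rows from the output above plus one deliberately duplicated row that I made up for the demonstration:

```python
import pandas as pd

# Rows from the output above, plus one invented duplicate (the second Missouri row).
sample_df = pd.DataFrame({
    "fips": [19001.0, 21001.0, 29001.0, 29001.0],
    "county_name": ["Adair", "Adair", "Adair", "Adair"],
    "state": ["Iowa", "Kentucky", "Missouri", "Missouri"],
})

# keep=False flags every member of a duplicate group, not just the repeats.
name_only = sample_df.duplicated(subset=["county_name"], keep=False)
name_and_state = sample_df.duplicated(subset=["county_name", "state"], keep=False)

print(name_only.sum())            # rows that share a county name with another row
print(sample_df[name_and_state])  # rows that are still ambiguous within one state
```

If the second check comes back empty on the real data, the (county_name, state) pair is unique and the clashes are harmless.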

In [177]:
covid_df[covid_df["fips"].isna()][["county_name", "state"]].head(10)

Unnamed: 0,county_name,state
1267,Dukes and Nantucket,Massachusetts
1304,Federal Correctional Institution (FCI),Michigan
1336,Michigan Department of Corrections (MDOC),Michigan
1591,Kansas City,Missouri
2954,Bear River,Utah
2959,Central Utah,Utah
2978,Southeast Utah,Utah
2979,Southwest Utah,Utah
2982,TriCounty,Utah
2990,Weber-Morgan,Utah


So the places without FIPS are... prisons? And... Kansas City as well as a few places in Utah? Let's drop the prisons.

In [236]:
covid_df = covid_df.drop(covid_df.index[[1304, 1336]])
covid_df[covid_df["fips"].isna()][["county_name", "state"]].head(10)

Unnamed: 0,county_name,state
1267,Dukes and Nantucket,Massachusetts
1591,Kansas City,Missouri
2954,Bear River,Utah
2959,Central Utah,Utah
2978,Southeast Utah,Utah
2979,Southwest Utah,Utah
2982,TriCounty,Utah
2990,Weber-Morgan,Utah


In [178]:
covid_df[covid_df["county_name"].isna()][["fips", "state"]].head(6)

Unnamed: 0,fips,state
100,60.0,American Samoa
336,88888.0,Diamond Princess
570,99999.0,Grand Princess
571,66.0,Guam
2121,69.0,Northern Mariana Islands
3007,78.0,Virgin Islands


Places without a county name appear to be overseas territories and two cruise ships. It seems like overseas territories have FIPS codes below 100, while regular counties start at 1001. The two cruise ships seem to have been assigned made-up numbers in the high five-digit range.

In [188]:
covid_df.loc[covid_df["fips"].notna(), "fips"] = covid_df.loc[covid_df["fips"].notna(), "fips"].astype(int).astype(str).str.pad(width=5, side='left', fillchar='0')
covid_df.head()

Unnamed: 0,fips,county_name,state,latitude,longitude,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,...,2/18/21,2/19/21,2/20/21,2/21/21,2/22/21,2/23/21,2/24/21,2/25/21,2/26/21,2/27/21
0,1001,Autauga,Alabama,32.539527,-86.644082,0,0,0,0,0,...,6071,6079,6092,6117,6121,6143,6172,6203,6228,6248
1,1003,Baldwin,Alabama,30.72775,-87.722071,0,0,0,0,0,...,19324,19361,19392,19433,19461,19554,19635,19670,19698,19714
2,1005,Barbour,Alabama,31.868263,-85.387129,0,0,0,0,0,...,2057,2061,2067,2070,2074,2084,2095,2099,2106,2113
3,1007,Bibb,Alabama,32.996421,-87.125115,0,0,0,0,0,...,2405,2411,2414,2416,2417,2432,2437,2442,2445,2449
4,1009,Blount,Alabama,33.982109,-86.567906,0,0,0,0,0,...,6008,6021,6040,6042,6043,6058,6072,6086,6084,6095


Lastly, we force the FIPS that aren't NaN to a 5-digit format. This is required to make them work with the other data sets.
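
Since the same conversion is needed for several data sets, it might be worth wrapping it in a small helper. The sketch below is one possible way to do it, not what the cell above does: `normalise_fips` is a name I made up, and it builds a new string column instead of writing strings into the existing float column through `.loc`, which recent pandas versions may complain about.

```python
import pandas as pd

def normalise_fips(fips: pd.Series) -> pd.Series:
    """Return FIPS codes as zero-padded 5-character strings; missing values stay missing."""
    # Int64 is pandas' nullable integer dtype, so NaN survives the float -> int step.
    as_int = pd.to_numeric(fips, errors="coerce").astype("Int64")
    # The nullable string dtype propagates <NA> through the .str methods.
    return as_int.astype("string").str.zfill(5)

sample = pd.Series([1001.0, 60.0, None, 56045.0])
result = normalise_fips(sample)
print(result.tolist())
```

Usage would be something like `covid_df["fips"] = normalise_fips(covid_df["fips"])`, which replaces the whole column in one go.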

##### Conclusion
We might be able to fill in the missing FIPS codes for Utah as well as the two outliers from other sources. I'm not sure about the prisons, so let's drop them.  
The places without county names could just take the state name as a county name, for clarity's sake. Or we could just exclude them; I don't know if we need those in the data set, since we're not likely to find relevant data for them in our other sources.

During ETL, I'll need to insert one entry into the table for each fips+date combined key, and then add the cases and deaths using that key.  
The FIPS codes are stored as floats without leading zeros, so I will need to force them into 5-digit codes, preferably as strings.
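
As a rough idea of what the fips+date key could look like in practice, here is a sketch that joins a long-format cases table with a long-format deaths table. The frames are tiny stand-ins; the case counts come from the preview above, while the death counts are invented placeholders because I haven't loaded that file yet.

```python
import pandas as pd

cases_long = pd.DataFrame({
    "fips": ["01001", "01001", "01003"],
    "date": pd.to_datetime(["2021-02-26", "2021-02-27", "2021-02-27"]),
    "cases": [6228, 6248, 19714],
})
deaths_long = pd.DataFrame({
    "fips": ["01001", "01003"],
    "date": pd.to_datetime(["2021-02-27", "2021-02-27"]),
    "deaths": [10, 20],  # placeholder numbers, not real data
})

key = ["fips", "date"]

# validate="one_to_one" raises if either side has a duplicated (fips, date) key,
# which is exactly the property the combined key is supposed to guarantee.
combined = cases_long.merge(deaths_long, on=key, how="left", validate="one_to_one")
print(combined)
```

A left join keeps every case row even when no matching death row exists; those rows end up with a missing `deaths` value.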

I'd hazard a guess that the deaths data is structured quite similarly, so we should be able to use that the same way should we decide to do so.

#### Health data

In [118]:
health_df = pd.read_csv("data/health_data.csv")
health_df.head()


Unnamed: 0,State FIPS Code,County FIPS Code,5-digit FIPS Code,State Abbreviation,Name,Release Year,County Ranked (Yes=1/No=0),Premature death raw value,Premature death numerator,Premature death denominator,...,Male population 18-44 raw value,Male population 45-64 raw value,Male population 65+ raw value,Total male population raw value,Female population 0-17 raw value,Female population 18-44 raw value,Female population 45-64 raw value,Female population 65+ raw value,Total female population raw value,Population growth raw value
0,statecode,countycode,fipscode,state,county,year,county_ranked,v001_rawvalue,v001_numerator,v001_denominator,...,v013_rawvalue,v016_rawvalue,v017_rawvalue,v025_rawvalue,v026_rawvalue,v027_rawvalue,v031_rawvalue,v032_rawvalue,v035_rawvalue,v097_rawvalue
1,00,000,00000,US,United States,2020,,6940.1105188,3813889,912286150,...,,,,,,,,,,
2,01,000,01000,AL,Alabama,2020,,9942.7946665,81791,13640424,...,,,,,,,,,,
3,01,001,01001,AL,Autauga County,2020,1,8128.5911903,791,155856,...,,,,,,,,,,
4,01,003,01003,AL,Baldwin County,2020,1,7354.1225298,2967,588433,...,,,,,,,,,,


It looks like this data has two different options for column names, with the alternative stored in the first row. The original column names are more descriptive, so I'll stick with those and remove the others.  
Another interesting observation is that the data set contains data for the US as a whole, which might be interesting as a reference value. I'll need to check if the values for it make sense; if they do, I'll keep it in and treat it separately. It also seems to contain all the states as rows, which we can identify by their FIPS codes ending in 000. Again, this might be useful, but it probably won't mesh well with the rest of our data.

I'm going to make a preselection of interesting columns here before I continue; the dataset is huge, but not all of it is relevant.
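
An alternative to dropping the extra header row after loading would be to skip it while reading. This is a sketch on an in-memory sample with the same two-header layout (the column names and values are taken from the preview above), and it is not what the next cell does.

```python
import io
import pandas as pd

# Two header rows followed by data, mimicking the layout of health_data.csv.
raw_csv = io.StringIO(
    "5-digit FIPS Code,Name,Premature death raw value\n"
    "fipscode,county,v001_rawvalue\n"
    "00000,United States,6940.1105188\n"
    "01001,Autauga County,8128.5911903\n"
)

health_sample = pd.read_csv(
    raw_csv,
    skiprows=[1],                      # skip the second line of the file (0-based index 1)
    dtype={"5-digit FIPS Code": str},  # read FIPS as text so leading zeros survive
)
print(health_sample.dtypes)
```

With the second header gone before parsing, the value columns should come out numeric straight away, so a later `pd.to_numeric` pass may not be needed.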

In [119]:
health_columns = ["5-digit FIPS Code", "Name", "State Abbreviation", "Poor or fair health raw value", "Adult smoking raw value", "Adult obesity raw value", "Physical inactivity raw value", "Excessive drinking raw value", "Uninsured raw value", "Primary care physicians raw value", "Unemployment raw value", "Air pollution - particulate matter raw value", "Severe housing problems raw value", "Percentage of households with overcrowding", "Food insecurity raw value", "Residential segregation - non-White/White raw value", "% 65 and older raw value", "% Rural raw value"]
health_df = health_df[health_columns]
health_df = health_df.drop(health_df.index[0])
health_df.head()

Unnamed: 0,5-digit FIPS Code,Name,State Abbreviation,Poor or fair health raw value,Adult smoking raw value,Adult obesity raw value,Physical inactivity raw value,Excessive drinking raw value,Uninsured raw value,Primary care physicians raw value,Unemployment raw value,Air pollution - particulate matter raw value,Severe housing problems raw value,Percentage of households with overcrowding,Food insecurity raw value,Residential segregation - non-White/White raw value,% 65 and older raw value,% Rural raw value
1,0,United States,US,0.1719867644,0.1708001743,0.29,0.233,0.1897709024,0.1022344603,0.0007546654,0.0389533902,8.6,0.1791360885,,0.125,46.77346382,0.1602579828,0.1926902892
2,1000,Alabama,AL,0.2202870285,0.2092735311,0.355,0.298,0.1390351529,0.1104478259,0.0006482388,0.0393356691,11.0,0.1434070208,,0.163,50.777775905,0.1691726316,0.409631829
3,1001,Autauga County,AL,0.2088298733,0.1808155718,0.333,0.347,0.1502603126,0.0872168595,0.000450418,0.0362907886,11.7,0.1466346154,0.0120192308,0.132,23.628395199,0.1556266974,0.4200216232
4,1003,Baldwin County,AL,0.1750913436,0.174890326,0.31,0.265,0.179583101,0.1133340447,0.0007289727,0.0361538216,10.3,0.1356620093,0.0127079175,0.116,31.825343231,0.2044334975,0.4227909911
5,1005,Barbour County,AL,0.2959180171,0.2199998453,0.417,0.235,0.1284401555,0.1224279246,0.0003165809,0.0517138421,11.5,0.1458333333,0.0168859649,0.22,23.449712509,0.194204413,0.677896347


Next, I'll rename the columns to be a bit easier to parse.

In [120]:
new_column_names = ["fips", "county_name", "state", "poor_health", "smokers", "obesity", "physical_inactivity", "excessive_drinking", "uninsured", "physicians", "unemployment", "air_pollution", "housing_problems", "household_overcrowding", "food_insecurity", "residential_segregation", "over_sixtyfives", "rural"]
health_df = health_df.rename(columns=dict(zip(health_columns, new_column_names)))
health_df.head()

Unnamed: 0,fips,county_name,state,poor_health,smokers,obesity,physical_inactivity,excessive_drinking,uninsured,physicians,unemployment,air_pollution,housing_problems,household_overcrowding,food_insecurity,residential_segregation,over_sixtyfives,rural
1,0,United States,US,0.1719867644,0.1708001743,0.29,0.233,0.1897709024,0.1022344603,0.0007546654,0.0389533902,8.6,0.1791360885,,0.125,46.77346382,0.1602579828,0.1926902892
2,1000,Alabama,AL,0.2202870285,0.2092735311,0.355,0.298,0.1390351529,0.1104478259,0.0006482388,0.0393356691,11.0,0.1434070208,,0.163,50.777775905,0.1691726316,0.409631829
3,1001,Autauga County,AL,0.2088298733,0.1808155718,0.333,0.347,0.1502603126,0.0872168595,0.000450418,0.0362907886,11.7,0.1466346154,0.0120192308,0.132,23.628395199,0.1556266974,0.4200216232
4,1003,Baldwin County,AL,0.1750913436,0.174890326,0.31,0.265,0.179583101,0.1133340447,0.0007289727,0.0361538216,10.3,0.1356620093,0.0127079175,0.116,31.825343231,0.2044334975,0.4227909911
5,1005,Barbour County,AL,0.2959180171,0.2199998453,0.417,0.235,0.1284401555,0.1224279246,0.0003165809,0.0517138421,11.5,0.1458333333,0.0168859649,0.22,23.449712509,0.194204413,0.677896347


In [122]:
numeric_columns = new_column_names[3:]
health_df[numeric_columns] = health_df[numeric_columns].apply(pd.to_numeric)
health_df.dtypes

fips                        object
county_name                 object
state                       object
poor_health                float64
smokers                    float64
obesity                    float64
physical_inactivity        float64
excessive_drinking         float64
uninsured                  float64
physicians                 float64
unemployment               float64
air_pollution              float64
housing_problems           float64
household_overcrowding     float64
food_insecurity            float64
residential_segregation    float64
over_sixtyfives            float64
rural                      float64
dtype: object

In [123]:
health_df.head()

Unnamed: 0,fips,county_name,state,poor_health,smokers,obesity,physical_inactivity,excessive_drinking,uninsured,physicians,unemployment,air_pollution,housing_problems,household_overcrowding,food_insecurity,residential_segregation,over_sixtyfives,rural
1,0,United States,US,0.171987,0.1708,0.29,0.233,0.189771,0.102234,0.000755,0.038953,8.6,0.179136,,0.125,46.773464,0.160258,0.19269
2,1000,Alabama,AL,0.220287,0.209274,0.355,0.298,0.139035,0.110448,0.000648,0.039336,11.0,0.143407,,0.163,50.777776,0.169173,0.409632
3,1001,Autauga County,AL,0.20883,0.180816,0.333,0.347,0.15026,0.087217,0.00045,0.036291,11.7,0.146635,0.012019,0.132,23.628395,0.155627,0.420022
4,1003,Baldwin County,AL,0.175091,0.17489,0.31,0.265,0.179583,0.113334,0.000729,0.036154,10.3,0.135662,0.012708,0.116,31.825343,0.204433,0.422791
5,1005,Barbour County,AL,0.295918,0.22,0.417,0.235,0.12844,0.122428,0.000317,0.051714,11.5,0.145833,0.016886,0.22,23.449713,0.194204,0.677896


We need to convert a lot of our health data columns to numeric so they're not treated as strings in further analysis.

In [103]:
health_df.min()

fips                                  00000
county_name                Abbeville County
state                                    AK
poor_health                        0.081206
smokers                            0.059087
obesity                               0.124
physical_inactivity                   0.095
excessive_drinking                 0.078096
uninsured                          0.022627
physicians                              0.0
unemployment                       0.013021
air_pollution                           3.0
housing_problems                   0.032203
household_overcrowding                  0.0
food_insecurity                       0.029
residential_segregation            0.068236
over_sixtyfives                    0.048297
rural                                   0.0
dtype: object

In [104]:
health_df.max()

fips                                56045
county_name                Ziebach County
state                                  WY
poor_health                      0.409907
smokers                          0.414913
obesity                             0.577
physical_inactivity                 0.499
excessive_drinking               0.286237
uninsured                        0.337496
physicians                       0.005144
unemployment                      0.19904
air_pollution                        19.7
housing_problems                 0.708934
household_overcrowding            0.51585
food_insecurity                     0.363
residential_segregation          90.41887
over_sixtyfives                  0.575873
rural                                 1.0
dtype: object

These all make sense. No negative values, and the ranges (0-1 for rural, 0-100 for segregation) are within the expected bounds.

In [106]:
health_df[["fips", "county_name", "state"]].nunique()

fips           3194
county_name    1928
state            52
dtype: int64

As before, these kinda make sense. 3194 identifiers where I'd expect 3243 is a bit low, so we might be missing some, but it's not too weird. Note that this count also includes the US row and the state-level rows, so the number of actual counties is lower still. As we know already, some county names exist in multiple states, hence the lower county name count.  
52 states makes sense: presumably 50 states plus DC, plus the US as a whole, which is treated as a state here.
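
To count the actual counties separately from the national and state rows, the FIPS pattern can be turned into a label. A minimal sketch, assuming the codes are 5-character strings; `fips_level` and the `level` column are names I made up.

```python
import pandas as pd

def fips_level(fips: str) -> str:
    """Classify a 5-digit FIPS string as 'country', 'state' or 'county'."""
    if fips == "00000":
        return "country"
    if fips.endswith("000"):
        return "state"
    return "county"

sample = pd.DataFrame({
    "fips": ["00000", "01000", "01001", "01003"],
    "county_name": ["United States", "Alabama", "Autauga County", "Baldwin County"],
})
sample["level"] = sample["fips"].map(fips_level)
print(sample["level"].value_counts())
```

Filtering on `level == "county"` would then give the rows that can be matched against the other data sets.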

In [107]:
health_df.isna().sum()

fips                         0
county_name                  0
state                        0
poor_health                  0
smokers                      0
obesity                      0
physical_inactivity          0
excessive_drinking           0
uninsured                    1
physicians                 147
unemployment                 1
air_pollution               36
housing_problems             0
household_overcrowding      51
food_insecurity              0
residential_segregation    351
over_sixtyfives              0
rural                        7
dtype: int64

Good to see that there are no missing values for FIPS, county or state. Missing values in our health data are likely explained by insufficient data, so we need to fill these in with sensible default values where we find them.  
For example, if we don't have data for overcrowding, does this mean we don't have any overcrowding at all? Or do we just leave it as None and deal with it in the analytics section?

In [113]:
health_df[pd.to_numeric(health_df["fips"]) < 100].head()

Unnamed: 0,fips,county_name,state,poor_health,smokers,obesity,physical_inactivity,excessive_drinking,uninsured,physicians,unemployment,air_pollution,housing_problems,household_overcrowding,food_insecurity,residential_segregation,over_sixtyfives,rural
1,0,United States,US,0.171987,0.1708,0.29,0.233,0.189771,0.102234,0.000755,0.038953,8.6,0.179136,,0.125,46.773464,0.160258,0.19269


Overseas territories have FIPS codes below 100. We only found one row with a FIPS code like that, and that's the row for all of the US. This means that this data set doesn't include the overseas territories. I'll have to make a decision later whether or not I want to include the overseas territories; it largely depends on how my other data sets treat them.

##### Conclusion
As we've seen, the county and state information is solid, but there are some gaps in the health data. We also need to preselect the columns since the data set is too large otherwise.  

We'll need to make a decision later on if we'll include the US and state data as well. If we do, we will have to create aggregates for Covid-19 case numbers by country and state so we can tie this data to the Covid-19 set.  
We might also need to drop data for overseas territories and the cruise ships from other data sets since the health data is quite significant for this project and those counties are missing from it.

As we've seen, we need to rename some columns and force them into the correct data format. It's convenient that the FIPS codes already have 5 digits.  
We need to fill in default data for those rows that have missing data for some columns. I still need to figure out what that default data should be; the options are either to fill in zeroes, which might mess up the data, or to calculate the state average and use that instead, which takes more effort but is likely more accurate.
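
The state-average option could be sketched like this. The numbers are invented for illustration, and the `_filled` and `_imputed` column names are my own; nothing here is decided yet.

```python
import pandas as pd

# Invented sample values; two states with one missing measurement each.
sample = pd.DataFrame({
    "state": ["AL", "AL", "AL", "AK", "AK"],
    "physicians": [0.0004, None, 0.0006, 0.001, None],
})

# transform("mean") returns one value per row: the mean of that row's state, ignoring NaN.
state_mean = sample.groupby("state")["physicians"].transform("mean")

sample["physicians_filled"] = sample["physicians"].fillna(state_mean)
# Keep a flag so the analysis can tell measured values from imputed ones.
sample["physicians_imputed"] = sample["physicians"].isna()
print(sample)
```

On the real data, the US and state-level rows would probably have to be taken out first so they don't skew the per-state means.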

#### County area data

In [127]:
raw_area_df = pd.read_json("data/us_county_area.json")
raw_area_df.head()

Unnamed: 0,type,features
0,FeatureCollection,"{'type': 'Feature', 'properties': {'GEO_ID': '..."
1,FeatureCollection,"{'type': 'Feature', 'properties': {'GEO_ID': '..."
2,FeatureCollection,"{'type': 'Feature', 'properties': {'GEO_ID': '..."
3,FeatureCollection,"{'type': 'Feature', 'properties': {'GEO_ID': '..."
4,FeatureCollection,"{'type': 'Feature', 'properties': {'GEO_ID': '..."


Ok, so the data I need is in there, but it's a bit hidden. I checked it in a text editor: GEO_ID is quite long, but its last 5 digits are the FIPS code. I could extract the county name as well, but the format seems a bit weird, so I'll just use the FIPS and fill in the name from other data sources.  
The county's surface area is stored as CENSUSAREA.

I'm not entirely sure what LSAD is; the values seem to be "County", "Borough", "Muno" and "CA". These entries do still have a FIPS code, so they hopefully match other data I have. Since I'll only use this to determine the population density as supplementary data, it's fine if I'm missing some counties.
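
One way to get at the nested fields is to flatten the `features` list with `pd.json_normalize`. This is a sketch on a minimal in-memory GeoJSON with the property names described above; the geometry is left out, and the GEO_ID prefix is an assumption about what the long identifier looks like.

```python
import json
import pandas as pd

# Minimal stand-in for us_county_area.json (areas taken from the output further down).
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"GEO_ID": "0500000US01001", "LSAD": "County", "CENSUSAREA": 594.436}},
        {"type": "Feature",
         "properties": {"GEO_ID": "0500000US01009", "LSAD": "County", "CENSUSAREA": 644.776}},
    ],
})

features = json.loads(geojson_text)["features"]

# Nested dictionaries become dotted column names such as "properties.GEO_ID".
flat = pd.json_normalize(features)

area_sample = pd.DataFrame({
    "fips": flat["properties.GEO_ID"].str[-5:],  # last 5 characters are the FIPS code
    "area": flat["properties.CENSUSAREA"],
})
print(area_sample)
```

This keeps LSAD and any other property available as a column too, in case it turns out to be useful for filtering.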

In [132]:
# Map each county's FIPS code (the last 5 characters of GEO_ID) to its census area
county_area_dict = {
    county['properties']['GEO_ID'][-5:]: county['properties']['CENSUSAREA']
    for county in raw_area_df['features']
}

county_area_df = pd.DataFrame(county_area_dict.items(), columns=["fips", "area"])
county_area_df.head()

Unnamed: 0,fips,area
0,1001,594.436
1,1009,644.776
2,1017,596.531
3,1021,692.854
4,1033,592.619


In [134]:
county_area_df[["fips"]].nunique()

fips    3221
dtype: int64

3221 unique FIPS codes sounds good, that's in the right range.

In [135]:
county_area_df.min()

fips    01001
area    1.999
dtype: object

In [137]:
county_area_df.max()

fips         72153
area    145504.789
dtype: object

No negative numbers for area, and the FIPS codes start at a reasonable value, but the maximum FIPS code looks strange. A quick Google search shows that codes in that range are all places in Puerto Rico.

In [145]:
county_area_df[pd.to_numeric(county_area_df["fips"]) > 60000].count()

fips    78
area    78
dtype: int64

Overall, there seem to be 78 places either in Puerto Rico or in overseas territories.

##### Conclusion
All in all, the data is sound. I believe that we'll have a few too many FIPS codes for places that aren't counties, but we'll figure that out when we try to merge the data sets.
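
When it comes to merging, an outer join with `indicator=True` should show which FIPS codes exist in only one of the data sets. A sketch with made-up three-row frames (the codes themselves appear in the outputs above):

```python
import pandas as pd

covid_fips = pd.DataFrame({"fips": ["01001", "01003", "00060"]})
area_fips = pd.DataFrame({"fips": ["01001", "01003", "72153"]})

# indicator=True adds a "_merge" column with the values left_only, right_only or both.
coverage = covid_fips.merge(area_fips, on="fips", how="outer", indicator=True)

only_one_side = coverage[coverage["_merge"] != "both"]
print(only_one_side)
```

Anything listed as `left_only` or `right_only` is a candidate for dropping or for a closer look.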

#### Weather data
I've got weather data in the "data/weather" folder. The data is organised into one file per weather data type, with columns for each day and one row for each county. This is great because it means I don't have to match weather stations to counties, I can just use the weather data directly.  

I really, really can't be bothered to analyse all files individually. Since they all come from the same data source, I'll do one and assume the others work the same.
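
A cheap way to back up that assumption without analysing every file is to compare only the column layout. The sketch below works on empty frames that stand in for the files; `mismatched_layouts` is a made-up helper and the "odd" entry is a hypothetical file with a different layout.

```python
import pandas as pd

def mismatched_layouts(frames, id_columns):
    """Return the names of frames whose leading columns differ from id_columns."""
    width = len(id_columns)
    return [
        name for name, df in frames.items()
        if list(df.columns[:width]) != id_columns
    ]

# Empty frames with only headers, standing in for the weather files.
frames = {
    "tMin": pd.DataFrame(columns=["FIPS", "Admin2", "1/1/20"]),
    "tMax": pd.DataFrame(columns=["FIPS", "Admin2", "1/1/20"]),
    "odd": pd.DataFrame(columns=["fips", "county", "1/1/20"]),  # hypothetical outlier
}

print(mismatched_layouts(frames, ["FIPS", "Admin2"]))
```

For the real files, the frames could be built with `pd.read_csv(path, nrows=0)`, which reads just the header row.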

In [216]:
weather_df = pd.read_csv("data/weather/tMin_US.csv")
weather_df.head()

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,...,12/16/20,12/17/20,12/18/20,12/19/20,12/20/20,12/21/20,12/22/20,12/23/20,12/24/20,12/25/20
0,84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.539527,-86.644082,...,7.13,0.85,-2.79,-1.47,6.22,6.41,2.79,3.0,12.25,-3.37
1,84001003,US,USA,840,1003.0,Baldwin,Alabama,US,30.72775,-87.722071,...,10.45,1.31,-1.73,0.65,9.66,6.59,4.36,4.94,15.13,-0.28
2,84001005,US,USA,840,1005.0,Barbour,Alabama,US,31.868263,-85.387129,...,6.93,2.18,-2.46,-1.68,6.51,7.65,3.6,2.36,12.44,-2.22
3,84001007,US,USA,840,1007.0,Bibb,Alabama,US,32.996421,-87.125115,...,7.28,0.05,-3.09,-2.31,6.93,6.24,2.15,2.83,8.77,-2.48
4,84001009,US,USA,840,1009.0,Blount,Alabama,US,33.982109,-86.567906,...,5.93,-0.66,-3.57,-2.48,5.93,5.16,2.07,2.95,7.85,-4.4


Interestingly enough, this data's columns look similar to our Covid-19 case data set. The latitude and longitude values also seem, at a glance, to match the Covid-19 data set. This means I can apply the same transformations here. I will, however, drop latitude and longitude since we already have those.

In [217]:
county_columns = ["FIPS", "Admin2", "Province_State"]
columns = county_columns + list(weather_df.columns[11:])
weather_df = weather_df[columns]

# Rename columns for ease of use
new_column_names = ["fips", "county_name", "state"]
weather_df = weather_df.rename(columns=dict(zip(county_columns, new_column_names)))

# Force the weather data columns to numeric format. The first few columns (fips, county and state) don't need to be numeric.
numeric_columns = weather_df.columns[3:]
weather_df[numeric_columns] = weather_df[numeric_columns].astype(float)

weather_df.head()

Unnamed: 0,fips,county_name,state,1/1/20,1/2/20,1/3/20,1/4/20,1/5/20,1/6/20,1/7/20,...,12/16/20,12/17/20,12/18/20,12/19/20,12/20/20,12/21/20,12/22/20,12/23/20,12/24/20,12/25/20
0,1001.0,Autauga,Alabama,0.0,8.0,16.0,11.0,0.0,1.0,9.0,...,7.13,0.85,-2.79,-1.47,6.22,6.41,2.79,3.0,12.25,-3.37
1,1003.0,Baldwin,Alabama,4.0,13.0,19.0,13.0,2.0,4.0,9.0,...,10.45,1.31,-1.73,0.65,9.66,6.59,4.36,4.94,15.13,-0.28
2,1005.0,Barbour,Alabama,2.0,8.0,19.0,13.0,0.0,2.0,8.0,...,6.93,2.18,-2.46,-1.68,6.51,7.65,3.6,2.36,12.44,-2.22
3,1007.0,Bibb,Alabama,0.0,8.0,14.0,10.0,0.0,2.0,8.0,...,7.28,0.05,-3.09,-2.31,6.93,6.24,2.15,2.83,8.77,-2.48
4,1009.0,Blount,Alabama,0.0,6.0,12.0,8.0,-1.0,2.0,7.0,...,5.93,-0.66,-3.57,-2.48,5.93,5.16,2.07,2.95,7.85,-4.4


In [218]:
weather_df.dtypes

fips           float64
county_name     object
state           object
1/1/20         float64
1/2/20         float64
                ...   
12/21/20       float64
12/22/20       float64
12/23/20       float64
12/24/20       float64
12/25/20       float64
Length: 362, dtype: object

Some of the data columns were treated as integers, some as floats, so for consistency, I'll treat them all as floats.

In [219]:
# Drop counties below 100 and above 60000
weather_df = weather_df.drop(weather_df[(weather_df["fips"] < 100) | (weather_df["fips"] > 60000)].index)

# Fix the FIPS codes to 5 digits, as a string
weather_df.loc[weather_df["fips"].notna(), "fips"] = weather_df.loc[weather_df["fips"].notna(), "fips"].astype(int).astype(str).str.pad(width=5, side='left', fillchar='0')
weather_df.head()

Unnamed: 0,fips,county_name,state,1/1/20,1/2/20,1/3/20,1/4/20,1/5/20,1/6/20,1/7/20,...,12/16/20,12/17/20,12/18/20,12/19/20,12/20/20,12/21/20,12/22/20,12/23/20,12/24/20,12/25/20
0,1001,Autauga,Alabama,0.0,8.0,16.0,11.0,0.0,1.0,9.0,...,7.13,0.85,-2.79,-1.47,6.22,6.41,2.79,3.0,12.25,-3.37
1,1003,Baldwin,Alabama,4.0,13.0,19.0,13.0,2.0,4.0,9.0,...,10.45,1.31,-1.73,0.65,9.66,6.59,4.36,4.94,15.13,-0.28
2,1005,Barbour,Alabama,2.0,8.0,19.0,13.0,0.0,2.0,8.0,...,6.93,2.18,-2.46,-1.68,6.51,7.65,3.6,2.36,12.44,-2.22
3,1007,Bibb,Alabama,0.0,8.0,14.0,10.0,0.0,2.0,8.0,...,7.28,0.05,-3.09,-2.31,6.93,6.24,2.15,2.83,8.77,-2.48
4,1009,Blount,Alabama,0.0,6.0,12.0,8.0,-1.0,2.0,7.0,...,5.93,-0.66,-3.57,-2.48,5.93,5.16,2.07,2.95,7.85,-4.4


In [220]:
weather_df.min()

county_name    Abbeville
state            Alabama
1/1/20           -1000.0
1/2/20           -1000.0
1/3/20           -1000.0
                 ...    
12/21/20         -1000.0
12/22/20         -1000.0
12/23/20         -1000.0
12/24/20         -1000.0
12/25/20         -1000.0
Length: 361, dtype: object

-1000 degrees? That can't be right.

In [221]:
weather_df[weather_df.columns[3:]].apply(lambda x: x < -100).sum()

1/1/20      1
1/2/20      1
1/3/20      1
1/4/20      1
1/5/20      1
           ..
12/21/20    1
12/22/20    1
12/23/20    1
12/24/20    1
12/25/20    1
Length: 359, dtype: int64

Ok, so, this is happening only in one county. Let's look at which one.

In [222]:
weather_df[weather_df["1/1/20"] < -100].head()

Unnamed: 0,fips,county_name,state,1/1/20,1/2/20,1/3/20,1/4/20,1/5/20,1/6/20,1/7/20,...,12/16/20,12/17/20,12/18/20,12/19/20,12/20/20,12/21/20,12/22/20,12/23/20,12/24/20,12/25/20
85,2185,North Slope,Alaska,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,...,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0


Right, so, it's Alaska. Not surprising that it would be cold there. However, all of its values are the same, so this is likely broken data. Let's just drop this one.

In [223]:
weather_df = weather_df.drop(85)
weather_df.min()

county_name    Abbeville
state            Alabama
1/1/20             -27.0
1/2/20             -31.0
1/3/20             -28.0
                 ...    
12/21/20          -32.22
12/22/20          -26.47
12/23/20          -20.52
12/24/20          -26.36
12/25/20           -25.5
Length: 361, dtype: object

Alright, that looks more reasonable.
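
Instead of dropping row 85 by its index label, the placeholder values could be masked more generally, which should also work for the other weather files. A sketch on a two-row sample; the -100 cut-off is the same threshold used in the check above, and treating anything below it as missing is my assumption.

```python
import pandas as pd

sample = pd.DataFrame({
    "fips": ["01001", "02185"],
    "1/1/20": [0.0, -1000.0],
    "1/2/20": [8.0, -1000.0],
})
date_columns = sample.columns[1:]

cleaned = sample.copy()
# mask() replaces values with NaN wherever the condition is True.
cleaned[date_columns] = cleaned[date_columns].mask(cleaned[date_columns] < -100)

# Rows where every reading is missing carry no information, so drop them.
all_missing = cleaned[date_columns].isna().all(axis=1)
cleaned = cleaned[~all_missing]
print(cleaned)
```

A county with only a few placeholder readings would keep its valid days this way, rather than being removed entirely.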

In [224]:
weather_df.max()

county_name    Ziebach
state          Wyoming
1/1/20            19.0
1/2/20            18.0
1/3/20            22.0
                ...   
12/21/20          21.4
12/22/20         19.53
12/23/20         19.02
12/24/20         20.39
12/25/20         20.52
Length: 361, dtype: object

In [227]:
weather_df[weather_df.columns[:3]].nunique()

fips           3141
county_name    1846
state            51
dtype: int64

As expected.

In [228]:
weather_df[weather_df.columns[:3]].isna().sum()

fips           8
county_name    0
state          0
dtype: int64

In [229]:
weather_df[weather_df["fips"].isna()][["county_name", "state"]].head(8)

Unnamed: 0,county_name,state
1223,Dukes and Nantucket,Massachusetts
1537,Kansas City,Missouri
2860,Bear River,Utah
2865,Central Utah,Utah
2883,Southeast Utah,Utah
2884,Southwest Utah,Utah
2887,TriCounty,Utah
2894,Weber-Morgan,Utah


Same as the Covid-19 data, minus the prisons.  

Out of curiosity, let's check the max temperature data for the broken county.

In [238]:
# Load into a separate variable so the cleaned tMin data in weather_df isn't overwritten
tmax_df = pd.read_csv("data/weather/tMax_US.csv")
tmax_df[tmax_df["FIPS"] == 2185.0]

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,...,12/16/20,12/17/20,12/18/20,12/19/20,12/20/20,12/21/20,12/22/20,12/23/20,12/24/20,12/25/20
85,84002185,US,USA,840,2185.0,North Slope,Alaska,US,69.314792,-153.483609,...,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0


Yup, still broken, let's just ditch it.

##### Conclusion
It looks like this data matches the Covid-19 columns quite well, so we can use a lot of the same code. We found one faulty county and removed it from the data set. We'll need to be careful how we join the data for this county; it might be worth dropping it completely from our evaluation.

### Cleaning and selection steps
I'll take care of cleaning and selection in the ETL script, under "scripts/etl.py". I've made initial decisions on which columns to use during the exploration phase.