## Covid data for the counties I care about

The Washington Post has convenient data by state. I care about Washington, DC, where I live, and about how certain other locations are doing, so state-level data is not fine-grained enough for me. MSA- and county-level data are available online, but they are overwhelming and not easily filterable to what I want. I'm creating this tool to provide historic data at the county level. I will use Plotly for interactive visualizations and either serve the website via FastAPI or put it into Streamlit. I'll use GitHub Actions and Prefect to fetch the data and make sure everything runs okay. I'll use Great Expectations for data quality checking and pytest to check my code.

I may use DVC to version my data.

At some later date, I may make an app that allows other users to choose which counties they want to include.

Imports and config

In [36]:
import pandas as pd
import plotly.express as px

pd.options.display.max_rows=100

Read in data

In [8]:
df_2022 = pd.read_csv('us-counties-2022.csv', index_col='date')
df_2022

date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k
2022-01-01,USA-72999,Unknown,Puerto Rico,0,328.14,,0,0.00,
2022-01-01,USA-72153,Yauco,Puerto Rico,0,66.50,196.40,0,0.00,0.00
2022-01-01,USA-72151,Yabucoa,Puerto Rico,0,63.13,196.30,0,0.00,0.00
2022-01-01,USA-72149,Villalba,Puerto Rico,0,47.50,221.18,0,0.00,0.00
2022-01-01,USA-72147,Vieques,Puerto Rico,0,7.63,91.16,0,0.00,0.00
...,...,...,...,...,...,...,...,...,...
2022-01-28,USA-69100,Rota,Northern Mariana Islands,0,0.00,0.00,0,0.00,0.00
2022-01-28,USA-78999,Unknown,Virgin Islands,0,0.00,,1,0.22,
2022-01-28,USA-78030,St. Thomas,Virgin Islands,6,32.75,63.43,0,0.43,0.83
2022-01-28,USA-78020,St. John,Virgin Islands,0,6.00,143.88,0,0.00,0.00


In [9]:
df_2022.info()

<class 'pandas.core.frame.DataFrame'>
Index: 91102 entries, 2022-01-01 to 2022-01-28
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   geoid                91102 non-null  object 
 1   county               91102 non-null  object 
 2   state                91102 non-null  object 
 3   cases                91102 non-null  int64  
 4   cases_avg            91102 non-null  float64
 5   cases_avg_per_100k   90197 non-null  float64
 6   deaths               91102 non-null  int64  
 7   deaths_avg           91102 non-null  float64
 8   deaths_avg_per_100k  90197 non-null  float64
dtypes: float64(4), int64(2), object(3)
memory usage: 7.0+ MB
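`info()` reports a plain `Index` rather than a `DatetimeIndex`, so the dates are being held as strings. That works for plotting, but date-aware slicing and resampling need real timestamps. A minimal sketch of one way to get them, using a two-row in-memory stand-in for the CSV (the rows are copied from the output above; the names `csv_text` and `df_dates` are only for illustration):

```python
import io

import pandas as pd

# Tiny stand-in for us-counties-2022.csv; rows copied from the output above.
csv_text = """date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k
2022-01-01,USA-72153,Yauco,Puerto Rico,0,66.50,196.40,0,0.00,0.00
2022-01-28,USA-78030,St. Thomas,Virgin Islands,6,32.75,63.43,0,0.43,0.83
"""

# parse_dates=True asks pandas to parse the index column as dates.
df_dates = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

print(type(df_dates.index).__name__)

# With a DatetimeIndex, partial-date slicing selects everything from mid-January on.
print(df_dates.loc["2022-01-15":])
```

The same `parse_dates=True` argument could be added to the `read_csv` call on the real file if date-aware operations turn out to be useful later.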


In [18]:
df_2022[df_2022['state'].str.startswith('District')]

date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k,fips
2022-01-01,USA-11001,District of Columbia,District of Columbia,0,2103.0,297.98,0,0.4,0.06,11001
2022-01-02,USA-11001,District of Columbia,District of Columbia,0,2103.0,297.98,0,0.4,0.06,11001
2022-01-03,USA-11001,District of Columbia,District of Columbia,9201,2103.14,298.0,7,1.29,0.18,11001
2022-01-04,USA-11001,District of Columbia,District of Columbia,2006,2122.86,300.79,2,1.38,0.19,11001
2022-01-05,USA-11001,District of Columbia,District of Columbia,1326,2110.57,299.05,2,1.71,0.24,11001
2022-01-06,USA-11001,District of Columbia,District of Columbia,1293,1975.14,279.86,3,2.0,0.28,11001
2022-01-07,USA-11001,District of Columbia,District of Columbia,1928,1969.25,279.03,2,2.0,0.28,11001
2022-01-08,USA-11001,District of Columbia,District of Columbia,0,1969.25,279.03,0,2.0,0.28,11001
2022-01-09,USA-11001,District of Columbia,District of Columbia,0,1969.25,279.03,0,2.0,0.28,11001
2022-01-10,USA-11001,District of Columbia,District of Columbia,6238,1827.29,258.91,6,2.14,0.3,11001


Filter to counties of interest

In [41]:
counties = ['District of Columbia', 'Wood', 'Putnam', 'Montgomery', "Prince George's", "Arlington", "Alexandria", "Manhattan", "Cook", 'Baltimore', 'Franklin', 'Claremont', 'Somerset']


In [42]:

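# Note: the 'fips' column is created in the "Convert geoid to FIPS code" cell further down;
# run that cell first, or this selection raises a KeyError.
cols = ['county', 'state', 'fips', 'cases_avg_per_100k']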

df_2022_smaller = df_2022.loc[df_2022['county'].isin(counties), cols]
df_2022_smaller

date,county,state,fips,cases_avg_per_100k
2022-01-01,Wood,Wisconsin,55141,82.19
2022-01-01,Wood,West Virginia,54107,55.42
2022-01-01,Putnam,West Virginia,54079,73.14
2022-01-01,Franklin,Washington,53021,32.26
2022-01-01,Montgomery,Virginia,51121,46.25
...,...,...,...,...
2022-01-28,District of Columbia,District of Columbia,11001,52.97
2022-01-28,Montgomery,Arkansas,05097,104.93
2022-01-28,Franklin,Arkansas,05047,152.41
2022-01-28,Montgomery,Alabama,01101,134.48


In [43]:
df_2022_smaller.drop_duplicates(subset = ["county", "state"])


date,county,state,fips,cases_avg_per_100k
2022-01-01,Wood,Wisconsin,55141,82.19
2022-01-01,Wood,West Virginia,54107,55.42
2022-01-01,Putnam,West Virginia,54079,73.14
2022-01-01,Franklin,Washington,53021,32.26
2022-01-01,Montgomery,Virginia,51121,46.25
2022-01-01,Franklin,Virginia,51067,68.83
2022-01-01,Arlington,Virginia,51013,190.42
2022-01-01,Franklin,Vermont,50011,96.87
2022-01-01,Wood,Texas,48499,10.67
2022-01-01,Montgomery,Texas,48339,66.54
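The output above shows why filtering on county name alone is not enough: names like Wood, Franklin and Montgomery appear in several states. A small sketch of how the ambiguous names could be counted, using a handful of county/state pairs from the output as a stand-in (the names `sample` and `ambiguous` are illustrative only):

```python
import pandas as pd

# Stand-in county/state pairs taken from the output above.
sample = pd.DataFrame(
    {
        "county": ["Wood", "Wood", "Wood", "Putnam", "Arlington"],
        "state": ["Wisconsin", "West Virginia", "Texas", "West Virginia", "Virginia"],
    }
)

# Count how many distinct states each county name appears in.
states_per_name = sample.groupby("county")["state"].nunique().sort_values(ascending=False)

# Keep only the names that occur in more than one state.
ambiguous = states_per_name[states_per_name > 1]
print(ambiguous)
```

Run against the full DataFrame, the same grouping would list every county name that needs a state (or a FIPS code) to be identified uniquely.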


Convert geoid to FIPS code for plotting

In [52]:
df_2022['fips'] = pd.to_numeric(df_2022['geoid'].str[-5:], downcast='integer')
df_2022

date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k,fips
2022-01-01,USA-72999,Unknown,Puerto Rico,0,328.14,,0,0.00,,72999
2022-01-01,USA-72153,Yauco,Puerto Rico,0,66.50,196.40,0,0.00,0.00,72153
2022-01-01,USA-72151,Yabucoa,Puerto Rico,0,63.13,196.30,0,0.00,0.00,72151
2022-01-01,USA-72149,Villalba,Puerto Rico,0,47.50,221.18,0,0.00,0.00,72149
2022-01-01,USA-72147,Vieques,Puerto Rico,0,7.63,91.16,0,0.00,0.00,72147
...,...,...,...,...,...,...,...,...,...,...
2022-01-28,USA-69100,Rota,Northern Mariana Islands,0,0.00,0.00,0,0.00,0.00,69100
2022-01-28,USA-78999,Unknown,Virgin Islands,0,0.00,,1,0.22,,78999
2022-01-28,USA-78030,St. Thomas,Virgin Islands,6,32.75,63.43,0,0.43,0.83,78030
2022-01-28,USA-78020,St. John,Virgin Islands,0,6.00,143.88,0,0.00,0.00,78020


In [53]:
df_2022.info()

<class 'pandas.core.frame.DataFrame'>
Index: 91102 entries, 2022-01-01 to 2022-01-28
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   geoid                91102 non-null  object 
 1   county               91102 non-null  object 
 2   state                91102 non-null  object 
 3   cases                91102 non-null  int64  
 4   cases_avg            91102 non-null  float64
 5   cases_avg_per_100k   90197 non-null  float64
 6   deaths               91102 non-null  int64  
 7   deaths_avg           91102 non-null  float64
 8   deaths_avg_per_100k  90197 non-null  float64
 9   fips                 91102 non-null  int32  
dtypes: float64(4), int32(1), int64(2), object(3)
memory usage: 9.3+ MB
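One thing to watch with the integer conversion: FIPS codes are five-character identifiers, and some start with a zero (the earlier filtered output shows `05097` and `01101`). Casting to an integer drops that zero, which may matter if the codes are later matched against map data keyed on five-character strings. A short sketch of the difference and of two ways to keep the padding (the variable names are illustrative, and whether the plotting step needs strings is an assumption to check against the map data actually used):

```python
import pandas as pd

geoids = pd.Series(["USA-11001", "USA-05097", "USA-01101"], name="geoid")

# Same conversion as the cell above: integers lose the leading zero.
fips_int = pd.to_numeric(geoids.str[-5:], downcast="integer")

# Alternative: keep the last five characters as text.
fips_str = geoids.str[-5:]

# An existing integer column can be padded back to five digits.
fips_padded = fips_int.astype(str).str.zfill(5)

print(fips_int.tolist())
print(fips_str.tolist())
print(fips_padded.tolist())
```

If the string form is adopted, the FIPS list used for filtering would need to hold strings as well.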


Filter to the FIPS codes of the counties I want.

If I ever make this into an app, I will change it so that users choose a state and then a county from drop-downs.
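A state-then-county drop-down needs a lookup from state to its counties and their FIPS codes. A minimal sketch of how that lookup could be built from the DataFrame, assuming the `state`, `county` and `fips` columns shown above; the helper name `build_lookup` is hypothetical, and the sample rows are county/FIPS pairs that appear in this notebook:

```python
import pandas as pd

# Stand-in rows; the county/state/fips values appear in this notebook's outputs.
sample = pd.DataFrame(
    {
        "state": ["Maryland", "Maryland", "Ohio", "Ohio", "District of Columbia"],
        "county": ["Montgomery", "Prince George's", "Montgomery", "Wood", "District of Columbia"],
        "fips": [24031, 24033, 39113, 39173, 11001],
    }
)


def build_lookup(df):
    """Return a nested dict: state -> {county name -> FIPS code}."""
    unique = df.drop_duplicates(subset=["state", "county"])
    lookup = {}
    for row in unique.itertuples(index=False):
        lookup.setdefault(row.state, {})[row.county] = int(row.fips)
    return lookup


lookup = build_lookup(sample)

# First drop-down: the states. Second drop-down: the counties of the chosen state.
print(sorted(lookup))
print(sorted(lookup["Ohio"]))
```

The selected FIPS codes can then be fed straight into the `isin` filter, which avoids the ambiguity of county names.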


In [54]:
fips_counties = [11001, 24033, 24031, 17031, 39173, 39137, 39113, 39049, 51013, 42111]

cols = ['county', 'state', 'fips', 'cases_avg_per_100k']

df_2022_counties = df_2022.loc[df_2022['fips'].isin(fips_counties), cols]
df_2022_counties

date,county,state,fips,cases_avg_per_100k
2022-01-01,Arlington,Virginia,51013,190.42
2022-01-01,Somerset,Pennsylvania,42111,76.83
2022-01-01,Wood,Ohio,39173,109.09
2022-01-01,Putnam,Ohio,39137,54.00
2022-01-01,Montgomery,Ohio,39113,90.12
...,...,...,...,...
2022-01-28,Franklin,Ohio,39049,93.89
2022-01-28,Prince George's,Maryland,24033,38.05
2022-01-28,Montgomery,Maryland,24031,52.37
2022-01-28,Cook,Illinois,17031,88.31


In [61]:
px.line(df_2022_counties, x=df_2022_counties.index, y='cases_avg_per_100k', color='county' )

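Montgomery is kind of a mess: the FIPS list includes both Montgomery, Ohio (39113) and Montgomery, Maryland (24031), and `color='county'` draws them as a single line.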
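A likely cause of the tangle is that two of the selected counties share a name (Montgomery, Ohio and Montgomery, Maryland), so grouping by `county` puts both series on one trace. One possible fix is to color by a combined "county, state" label. The sketch below assumes the same column names as `df_2022_counties`; the helpers `add_label` and `plot_by_label` are hypothetical, and the two sample rows are taken from the outputs above:

```python
import pandas as pd

# Two rows from the outputs above: same county name, different states.
sample = pd.DataFrame(
    {
        "county": ["Montgomery", "Montgomery"],
        "state": ["Ohio", "Maryland"],
        "cases_avg_per_100k": [90.12, 52.37],
    },
    index=pd.Index(["2022-01-01", "2022-01-28"], name="date"),
)


def add_label(df):
    """Return a copy with a 'label' column such as 'Montgomery, Ohio'."""
    out = df.copy()
    out["label"] = out["county"] + ", " + out["state"]
    return out


def plot_by_label(df):
    """Line chart with one trace per county/state pair."""
    # Imported here so add_label can be used without plotly installed.
    import plotly.express as px

    labelled = add_label(df)
    return px.line(labelled, x=labelled.index, y="cases_avg_per_100k", color="label")


print(add_label(sample)["label"].tolist())
```

In the notebook, `plot_by_label(df_2022_counties)` as the last line of a cell would render the chart with a separate legend entry for each Montgomery.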

7-day rolling average of cases as of yesterday's data

In [None]:
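# line_group='fips' keeps same-named counties (e.g. the two Montgomerys) on separate lines
px.line(df_2022_counties, x=df_2022_counties.index, y='cases_avg_per_100k', color='county', line_group='fips')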

Read historic data and concatenate DataFrames
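A rough sketch of how this step could look, assuming earlier years are stored in files with the same columns as `us-counties-2022.csv`. The helper `read_years` is hypothetical, and in-memory stand-ins are used here in place of real files; the 2021 row is a placeholder value, not a real figure:

```python
import io

import pandas as pd


def read_years(sources):
    """Read one CSV per year and stack them into a single date-sorted DataFrame."""
    frames = [pd.read_csv(src, index_col="date") for src in sources]
    return pd.concat(frames).sort_index()


# Stand-ins for per-year files. The 2021 row is a placeholder, not real data;
# the 2022 row matches the District of Columbia output earlier in the notebook.
csv_2022 = io.StringIO("date,geoid,cases\n2022-01-03,USA-11001,9201\n")
csv_2021 = io.StringIO("date,geoid,cases\n2021-12-31,USA-11001,0\n")

combined = read_years([csv_2022, csv_2021])
print(combined)
```

With real files, the list passed to `read_years` would hold file paths instead of `StringIO` objects, and the FIPS conversion and county filter could then be applied once to the combined frame.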