## Covid data for the counties I care about

The Washington Post has convenient data by state. I care about Washington, DC, where I live, and about how certain other locations are doing, and the state-level data is not fine-grained enough for me. MSA- and county-level data are available online, but they are overwhelming and not easily filterable to what I want. I'm creating this tool to provide historic data at the county level. I will use Plotly for interactive visualizations and either serve the website via FastAPI or put it into Streamlit. I'll use GitHub Actions and Prefect to fetch the data and make sure everything runs okay. I'll use Great Expectations for data quality checking and pytest to check my code.

I may use DVC to version my data.

At some later date, I may make an app that allows other users to choose which counties they want to include.

Imports and config

In [36]:
import pandas as pd
import plotly.express as px

pd.options.display.max_rows = 100


Read in data

In [8]:
df_2022 = pd.read_csv("us-counties-2022.csv", index_col="date")
df_2022


date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k
2022-01-01,USA-72999,Unknown,Puerto Rico,0,328.14,,0,0.00,
2022-01-01,USA-72153,Yauco,Puerto Rico,0,66.50,196.40,0,0.00,0.00
2022-01-01,USA-72151,Yabucoa,Puerto Rico,0,63.13,196.30,0,0.00,0.00
2022-01-01,USA-72149,Villalba,Puerto Rico,0,47.50,221.18,0,0.00,0.00
2022-01-01,USA-72147,Vieques,Puerto Rico,0,7.63,91.16,0,0.00,0.00
...,...,...,...,...,...,...,...,...,...
2022-01-28,USA-69100,Rota,Northern Mariana Islands,0,0.00,0.00,0,0.00,0.00
2022-01-28,USA-78999,Unknown,Virgin Islands,0,0.00,,1,0.22,
2022-01-28,USA-78030,St. Thomas,Virgin Islands,6,32.75,63.43,0,0.43,0.83
2022-01-28,USA-78020,St. John,Virgin Islands,0,6.00,143.88,0,0.00,0.00


In [9]:
df_2022.info()


<class 'pandas.core.frame.DataFrame'>
Index: 91102 entries, 2022-01-01 to 2022-01-28
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   geoid                91102 non-null  object 
 1   county               91102 non-null  object 
 2   state                91102 non-null  object 
 3   cases                91102 non-null  int64  
 4   cases_avg            91102 non-null  float64
 5   cases_avg_per_100k   90197 non-null  float64
 6   deaths               91102 non-null  int64  
 7   deaths_avg           91102 non-null  float64
 8   deaths_avg_per_100k  90197 non-null  float64
dtypes: float64(4), int64(2), object(3)
memory usage: 7.0+ MB
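The `info()` output above reports a plain `Index` of date strings rather than a `DatetimeIndex`. If real dates turn out to be useful later (for resampling or date slicing), one option is to let pandas parse the index while reading. This is a minimal sketch on a tiny in-memory stand-in for the CSV, not the notebook's actual cell; `sample_csv` and `df_dates` are names made up for the example.

```python
import io

import pandas as pd

# Tiny stand-in for us-counties-2022.csv; the rows reuse values from the output above.
sample_csv = io.StringIO(
    "date,geoid,county,state,cases_avg_per_100k\n"
    "2022-01-01,USA-72153,Yauco,Puerto Rico,196.40\n"
    "2022-01-28,USA-78030,St. Thomas,Virgin Islands,63.43\n"
)

# parse_dates=True asks pandas to parse the index column as dates.
df_dates = pd.read_csv(sample_csv, index_col="date", parse_dates=True)

print(type(df_dates.index).__name__)
print(df_dates.index.max())
```

The string index also works for the plots in this notebook, so this is optional.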


Find counties whose spelling or format could be tricky to match.

In [96]:
df_2022[df_2022['county'].str.startswith('Alexandria')].head(2)

date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k,fips
2022-01-01,USA-51510,Alexandria city,Virginia,0,281.14,176.34,0,0.38,0.24,51510
2022-01-02,USA-51510,Alexandria city,Virginia,0,281.14,176.34,0,0.38,0.24,51510


In [97]:
df_2022[df_2022["state"].str.startswith("District")].head(2)


date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k,fips
2022-01-01,USA-11001,District of Columbia,District of Columbia,0,2103.0,297.98,0,0.4,0.06,11001
2022-01-02,USA-11001,District of Columbia,District of Columbia,0,2103.0,297.98,0,0.4,0.06,11001


In [98]:
df_2022[df_2022["state"].str.contains("York")].head(2)

date,geoid,county,state,cases,cases_avg,cases_avg_per_100k,deaths,deaths_avg,deaths_avg_per_100k,fips
2022-01-01,USA-36998,New York City,New York,45341,34646.38,415.58,20,27.89,0.33,36998
2022-01-01,USA-36123,Yates,New York,28,13.25,53.19,0,0.14,0.57,36123


Filter to counties of interest

In [105]:
counties = [
    "District of Columbia",
    "Wood",
    "Putnam",
    "Montgomery",
    "Prince George's",
    "Arlington",
    "Alexandria city",
    "New York City",          # README at NYT mentions some NE are city, not county
    "Allegheny",
    "Cook",
    "Baltimore",
    "Franklin",
    "Clermont",
    "Somerset",
    "Philadelphia",
    "Denver",
    "Boulder",
    "San Francisco",
    "Los Angeles",
    "Pima",
    "Manatee"
]


In [106]:
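# Note: the "fips" column is created in the "Convert geoid to FIPS code" cell below; run that cell first.
cols = ["county", "state", "fips", "cases_avg_per_100k"]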

df_2022_smaller = df_2022.loc[df_2022["county"].isin(counties), cols]
df_2022_smaller


date,county,state,fips,cases_avg_per_100k
2022-01-01,Wood,Wisconsin,55141,82.19
2022-01-01,Wood,West Virginia,54107,55.42
2022-01-01,Putnam,West Virginia,54079,73.14
2022-01-01,Franklin,Washington,53021,32.26
2022-01-01,Montgomery,Virginia,51121,46.25
...,...,...,...,...
2022-01-28,Montgomery,Arkansas,5097,104.93
2022-01-28,Franklin,Arkansas,5047,152.41
2022-01-28,Pima,Arizona,4019,222.82
2022-01-28,Montgomery,Alabama,1101,134.48


See each state/county once.

In [107]:
df_2022_smaller.drop_duplicates(subset=["county", "state"])

date,county,state,fips,cases_avg_per_100k
2022-01-01,Wood,Wisconsin,55141,82.19
2022-01-01,Wood,West Virginia,54107,55.42
2022-01-01,Putnam,West Virginia,54079,73.14
2022-01-01,Franklin,Washington,53021,32.26
2022-01-01,Montgomery,Virginia,51121,46.25
2022-01-01,Franklin,Virginia,51067,68.83
2022-01-01,Arlington,Virginia,51013,190.42
2022-01-01,Alexandria city,Virginia,51510,176.34
2022-01-01,Franklin,Vermont,50011,96.87
2022-01-01,Wood,Texas,48499,10.67


Convert geoid to FIPS code for plotting

In [None]:
# The last five characters of geoid (e.g. "USA-51510") are the county FIPS code.
df_2022["fips"] = pd.to_numeric(df_2022["geoid"].str[-5:], downcast="integer")
df_2022


In [None]:
df_2022.info()
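Converting to an integer drops leading zeros, which is why Arkansas and Arizona codes show up as 5097 and 4019 in the outputs. That is fine for filtering, but county GeoJSON files commonly key on five-character strings, so a zero-padded string version may be handy for mapping. A minimal sketch follows; the sample geoids assume the `USA-` prefix is always followed by a five-digit code, and `fips_str` / `fips_from_int` are names made up for the example.

```python
import pandas as pd

# Sample geoids; assumed format is "USA-" followed by a five-digit FIPS code.
geoids = pd.Series(["USA-05097", "USA-11001", "USA-04019"], name="geoid")

# Integer version, as in the notebook cell: leading zeros are lost.
fips_int = pd.to_numeric(geoids.str[-5:], downcast="integer")

# String version taken straight from geoid: leading zeros are kept.
fips_str = geoids.str[-5:]

# If only the integer column is available, zero-pad it back to five characters.
fips_from_int = fips_int.astype(str).str.zfill(5)

print(fips_int.tolist())
print(fips_str.tolist())
```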


Filter to the FIPS codes of the counties I want.

If I ever make this into an app, I will change this so that users choose a state and then a county from dropdowns.
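The state-then-county selection could sit on top of two small helpers: one listing the counties in a state, one resolving a (state, county) pair to a FIPS code. This is only a sketch of the idea, not tied to any UI framework; `lookup_df`, `counties_in_state`, and `fips_for` are hypothetical names.

```python
import pandas as pd

# Small stand-in for the de-duplicated county/state/fips columns of df_2022.
lookup_df = pd.DataFrame(
    {
        "county": ["Montgomery", "Montgomery", "Wood", "District of Columbia"],
        "state": ["Maryland", "Ohio", "Ohio", "District of Columbia"],
        "fips": [24031, 39113, 39173, 11001],
    }
)


def counties_in_state(df: pd.DataFrame, state: str) -> list[str]:
    """Return the sorted county names available for one state (second dropdown)."""
    return sorted(df.loc[df["state"] == state, "county"].unique())


def fips_for(df: pd.DataFrame, state: str, county: str) -> int:
    """Resolve one (state, county) pair to its FIPS code."""
    match = df.loc[(df["state"] == state) & (df["county"] == county), "fips"].unique()
    if len(match) != 1:
        raise ValueError(f"Expected exactly one match for {county}, {state}")
    return int(match[0])


print(counties_in_state(lookup_df, "Ohio"))
print(fips_for(lookup_df, "Maryland", "Montgomery"))
```

Choosing the state first sidesteps the duplicate-name problem, since a county name should be unique within a state.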


In [108]:
fips_counties = [
    11001,
    24033,
    24031,
    17031,
    39173,
    39137,
    39113,
    39049,
    51013,
    42111,
    42003,
    39025,
    8031,
    8013,
    4019,
    24005,
    6037,
    6075,
    36998,
    12081
]

cols = ["county", "state", "fips", "cases_avg_per_100k"]

df_2022_counties = df_2022.loc[df_2022["fips"].isin(fips_counties), cols]
df_2022_counties


date,county,state,fips,cases_avg_per_100k
2022-01-01,Arlington,Virginia,51013,190.42
2022-01-01,Somerset,Pennsylvania,42111,76.83
2022-01-01,Allegheny,Pennsylvania,42003,132.63
2022-01-01,Wood,Ohio,39173,109.09
2022-01-01,Putnam,Ohio,39137,54.00
...,...,...,...,...
2022-01-28,Denver,Colorado,8031,128.22
2022-01-28,Boulder,Colorado,8013,129.72
2022-01-28,San Francisco,California,6075,156.79
2022-01-28,Los Angeles,California,6037,258.48


In [109]:
px.line(
    df_2022_counties, x=df_2022_counties.index, y="cases_avg_per_100k", color="county"
)


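Montgomery is kind of a mess: the FIPS list includes two Montgomery counties (24031 in Maryland and 39113 in Ohio), and coloring by county name alone merges them into one line.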
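One possible fix is to color by a combined "county, state" label so each place gets its own trace. The sketch below builds such a column on a toy frame; the `place` column name is my own choice and the case values are made-up placeholders, not real data.

```python
import pandas as pd

# Toy frame with two same-named counties; the numeric values are made-up placeholders.
demo = pd.DataFrame(
    {
        "county": ["Montgomery", "Montgomery", "Montgomery", "Montgomery"],
        "state": ["Maryland", "Ohio", "Maryland", "Ohio"],
        "cases_avg_per_100k": [1.0, 2.0, 3.0, 4.0],
    },
    index=pd.Index(
        ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"], name="date"
    ),
)

# Combine county and state into one label that is unique per place.
demo["place"] = demo["county"] + ", " + demo["state"]

print(sorted(demo["place"].unique()))
```

With that column in `df_2022_counties`, the plot call would pass `color="place"` instead of `color="county"`.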

7-day rolling average of cases as of yesterday's data

In [None]:
px.line(
    df_2022_counties, x=df_2022_counties.index, y="cases_avg_per_100k", color="county"
)

Read historic data and concatenate DataFrames
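A sketch of how this step might look, assuming the earlier years come as separate CSV files with the same columns as the 2022 file. The file names below are written to a temporary directory purely for demonstration, `read_years` is a hypothetical helper, and the 2021 value is a made-up placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_years(paths):
    """Read one CSV per year and stack them into a single date-sorted DataFrame."""
    frames = [pd.read_csv(path, index_col="date") for path in paths]
    return pd.concat(frames).sort_index()


# Demo with two tiny files; the 2021 row is a made-up placeholder.
with tempfile.TemporaryDirectory() as tmp:
    path_2021 = Path(tmp) / "us-counties-2021.csv"
    path_2022 = Path(tmp) / "us-counties-2022.csv"
    path_2021.write_text("date,geoid,cases_avg_per_100k\n2021-12-31,USA-11001,1.0\n")
    path_2022.write_text("date,geoid,cases_avg_per_100k\n2022-01-01,USA-11001,297.98\n")
    df_all = read_years([path_2022, path_2021])

print(df_all)
```

Sorting the index after concatenation keeps the rows in date order regardless of the order the files are passed in.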

## Map

Most recent 7-day moving average.

Scatter geo.

Future direction: I could make an animation over time, or do a choropleth too.
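For a map of the most recent moving average, the data first needs to be cut down to one row per county on the latest date. A minimal sketch on a toy frame: the 2022-01-28 values come from the output earlier in the notebook, the 2022-01-27 values are made-up placeholders, and `latest` is a name chosen for the example.

```python
import pandas as pd

# Toy frame: 2022-01-28 values are from the notebook output, 2022-01-27 values are made up.
demo = pd.DataFrame(
    {
        "county": ["Denver", "Boulder", "Denver", "Boulder"],
        "fips": [8031, 8013, 8031, 8013],
        "cases_avg_per_100k": [1.0, 2.0, 128.22, 129.72],
    },
    index=pd.Index(
        ["2022-01-27", "2022-01-27", "2022-01-28", "2022-01-28"], name="date"
    ),
)

# ISO-formatted date strings sort chronologically, so max() gives the latest date.
latest = demo[demo.index == demo.index.max()]

print(latest)
```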

In [None]:
px.scatter_geo(