# COVID-19 Cases by U.S. State: Bar Chart Race

### Apr. 28, 2020
#### By: Jeff Hale

In this notebook I demonstrate how to use Python, pandas, and the bar_chart_race package to make a bar chart race of COVID-19 cases over time by state.

The county-level dataset comes from the [New York Times](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv) via [Kaggle](https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset).

#### Import packages

In [42]:
import pandas as pd
from IPython.display import HTML
import bar_chart_race as bcr

#### Read data

In [5]:
data_path = 'data/us-counties-2020-04-28.csv' 

df = pd.read_csv(data_path, index_col='date')
df.head()

date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061.0,1,0
2020-01-22,Snohomish,Washington,53061.0,1,0
2020-01-23,Snohomish,Washington,53061.0,1,0
2020-01-24,Cook,Illinois,17031.0,1,0
2020-01-24,Snohomish,Washington,53061.0,1,0


#### Set index to datetime 

In [6]:
df.index = pd.to_datetime(df.index)
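
As an alternative to converting the index after loading, the dates can be parsed while reading the file. The following is a minimal sketch, assuming the same column layout as the file above; `sample_csv` and `df_sample` are illustrative names that do not appear in the original notebook, and the two rows are copied from the `df.head()` output.

```python
import io

import pandas as pd

# Two rows in the same column layout as the county-level file.
sample_csv = io.StringIO(
    "date,county,state,fips,cases,deaths\n"
    "2020-01-21,Snohomish,Washington,53061,1,0\n"
    "2020-01-24,Cook,Illinois,17031,1,0\n"
)

# parse_dates=True asks pandas to parse the index column as dates,
# so no separate pd.to_datetime step is needed afterwards.
df_sample = pd.read_csv(sample_csv, index_col="date", parse_dates=True)

print(type(df_sample.index).__name__)
```

Either approach should leave the DataFrame with a `DatetimeIndex`; which one to use is mostly a matter of taste.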

#### Explore data

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95420 entries, 2020-01-21 to 2020-04-27
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   county  95420 non-null  object 
 1   state   95420 non-null  object 
 2   fips    94270 non-null  float64
 3   cases   95420 non-null  int64  
 4   deaths  95420 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 4.4+ MB


In [8]:
df_cases = df.loc[:, ['state', 'cases']]

#### Group by date and state

In [30]:
df_states = df_cases.groupby(['date','state']).sum().reset_index()
df_states

,date,state,cases
0,2020-01-21,Washington,1
1,2020-01-22,Washington,1
2,2020-01-23,Washington,1
3,2020-01-24,Illinois,1
4,2020-01-24,Washington,1
...,...,...,...
3089,2020-04-27,Virginia,13535
3090,2020-04-27,Washington,13864
3091,2020-04-27,West Virginia,1078
3092,2020-04-27,Wisconsin,6081
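
To see what the groupby step does, here is a small sketch on made-up numbers (the `toy` frame and its values are illustrative, not taken from the dataset). Two county rows for the same state and date collapse into a single state-level row whose `cases` value is their sum.

```python
import pandas as pd

# Illustrative county-level rows: two Washington counties and one
# Illinois county, all reported on the same date.
toy = pd.DataFrame(
    {
        "state": ["Washington", "Washington", "Illinois"],
        "cases": [3, 2, 1],
    },
    index=pd.DatetimeIndex(["2020-03-01"] * 3, name="date"),
)

# Group on the index level "date" and the column "state", then sum the
# county counts to get one row per (date, state) pair.
toy_states = toy.groupby(["date", "state"]).sum().reset_index()

print(toy_states)
```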


In [33]:
df_states = df_states.set_index('date')
df_states.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3094 entries, 2020-01-21 to 2020-04-27
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   state   3094 non-null   object
 1   cases   3094 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 72.5+ KB


In [34]:
df_states.head()

date,state,cases
2020-01-21,Washington,1
2020-01-22,Washington,1
2020-01-23,Washington,1
2020-01-24,Illinois,1
2020-01-24,Washington,1


#### Pivot the data to get it in the correct format for the bar chart race.

The data needs to be in wide format: one column per state, dates in the index, and either cases or deaths as the values.

In [56]:
df_pivoted = df_states.pivot(values='cases', columns='state')
df_pivoted.tail(2)

date,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,District of Columbia,Florida,...,Tennessee,Texas,Utah,Vermont,Virgin Islands,Virginia,Washington,West Virginia,Wisconsin,Wyoming
2020-04-26,6421.0,339.0,6526.0,3001.0,43691.0,13441.0,25269.0,4034.0,3841.0,31520.0,...,9493.0,25206.0,4123.0,851.0,57.0,12970.0,13663.0,1054.0,5911.0,371.0
2020-04-27,6539.0,343.0,6716.0,3069.0,45211.0,13804.0,25997.0,4162.0,3892.0,32130.0,...,9796.0,25960.0,4236.0,855.0,59.0,13535.0,13864.0,1078.0,6081.0,389.0
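
The long-to-wide reshaping can be sketched on a tiny example. The frame below is illustrative (its names and values are not from the dataset); it shows that each state becomes a column and that a date with no row for a given state ends up as `NaN` in the wide table.

```python
import pandas as pd

# Illustrative long-format data: one row per (date, state) pair.
# Illinois has no row for 2020-03-01.
long_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-02"]),
        "state": ["Washington", "Washington", "Illinois"],
        "cases": [5, 7, 1],
    }
).set_index("date")

# With no explicit index argument, pivot keeps the existing date index
# and spreads the values of "state" across the columns.
wide_df = long_df.pivot(columns="state", values="cases")

print(wide_df)
```

The missing values explain why the non-null counts in the pivoted DataFrame differ from state to state.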


In [60]:
df_pivoted.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 98 entries, 2020-01-21 to 2020-04-27
Data columns (total 55 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Alabama                   46 non-null     float64
 1   Alaska                    47 non-null     float64
 2   Arizona                   93 non-null     float64
 3   Arkansas                  48 non-null     float64
 4   California                94 non-null     float64
 5   Colorado                  54 non-null     float64
 6   Connecticut               51 non-null     float64
 7   Delaware                  48 non-null     float64
 8   District of Columbia      52 non-null     float64
 9   Florida                   58 non-null     float64
 10  Georgia                   57 non-null     float64
 11  Guam                      44 non-null     float64
 12  Hawaii                    53 non-null     float64
 13  Idaho                     46 non-null     float64

#### Make a DataFrame that starts once there are a few data points, so the x-axis labels look nicer at the start.

In [69]:
df_pivoted_later = df_pivoted[df_pivoted.index >= "2020-02-20"]
df_pivoted_later.head(2)

date,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,District of Columbia,Florida,...,Tennessee,Texas,Utah,Vermont,Virgin Islands,Virginia,Washington,West Virginia,Wisconsin,Wyoming
2020-02-20,,,1.0,,8.0,,,,,,...,,2.0,,,,,1.0,,1.0,
2020-02-21,,,1.0,,9.0,,,,,,...,,4.0,,,,,1.0,,1.0,
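
The date filter can also be written as a label slice on the `DatetimeIndex`. This is a small sketch on an illustrative frame (the `wide` frame is an assumption for demonstration; the 2020-02-19 row is made up), showing that the boolean mask and the `.loc` slice select the same rows.

```python
import pandas as pd

# Illustrative wide-format frame with a date index.
wide = pd.DataFrame(
    {
        "California": [float("nan"), 8.0, 9.0],
        "Texas": [float("nan"), 2.0, 4.0],
    },
    index=pd.DatetimeIndex(
        ["2020-02-19", "2020-02-20", "2020-02-21"], name="date"
    ),
)

# Boolean mask on the index, as in the cell above.
by_mask = wide[wide.index >= "2020-02-20"]

# Equivalent label slice; the start label is inclusive.
by_slice = wide.loc["2020-02-20":]

print(by_mask.equals(by_slice))
```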


#### Make the bar chart race and output to a file

In [73]:
bcr.bar_chart_race(
    df=df_pivoted_later,
    filename='covid19_county_state_h_later.mp4',
    orientation='h',
    sort='desc',
    label_bars=True,
    use_index=True,
    steps_per_period=10,
    period_length=300,
    figsize=(8, 6),
    cmap='dark24',
    title='COVID-19 Cases by State',
    bar_label_size=7,
    tick_label_size=7,
    period_label_size=16,
)

#### Make a bar chart race for inline notebook viewing.

In [70]:
bcr_html = bcr.bar_chart_race(df=df_pivoted_later, filename=None, period_length=300, figsize=(8, 6))

In [71]:
HTML(bcr_html)

#### Save cleaned state level data to output files

In [68]:
df_pivoted.to_csv('pivoted_covid19_through_apr_27_wf.csv')

In [72]:
df_pivoted_later.to_csv('pivoted_covid19_through_feb_20_to_apr_27_wf.csv')

## Future directions:

- Could pull from the NYT repo directly to get the most up-to-date data.
- Could aggregate the data by week.
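
Both ideas could be sketched roughly as follows. The helper names `load_counties` and `to_weekly` are hypothetical, the URL is the NYT link given at the top of this notebook, and the daily numbers are made up. Because the case counts are cumulative, this sketch assumes that taking the last value in each week is the sensible weekly aggregate, rather than a sum.

```python
import pandas as pd

# URL from the top of this notebook; it may change or go stale over time.
NYT_URL = (
    "https://raw.githubusercontent.com/nytimes/covid-19-data/"
    "master/us-counties.csv"
)


def load_counties(source=NYT_URL):
    """Read the county-level CSV from a URL, path, or file-like object."""
    return pd.read_csv(source, index_col="date", parse_dates=True)


def to_weekly(wide_df):
    """Downsample cumulative daily counts to one row per week.

    Weeks end on Sunday (the pandas default for "W"); the last
    observation in each week is kept because the counts are cumulative.
    """
    return wide_df.resample("W").last()


# Illustrative cumulative counts for 14 consecutive days, Monday to Sunday.
daily = pd.DataFrame(
    {"Washington": range(1, 15)},
    index=pd.date_range("2020-03-02", periods=14, name="date"),
)
weekly = to_weekly(daily)

print(weekly)
```

Calling `load_counties()` with no arguments would fetch the live file, so it needs network access; the weekly step works on any wide-format DataFrame with a `DatetimeIndex`.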

### I hope you found this example of how to make a bar chart race, after some data munging, useful!