# california-coronavirus-data examples

By [Ben Welsh](https://palewi.re/who-is-ben-welsh)

A demonstration of how to use Python to work with the Los Angeles Times' independent tally of coronavirus cases in California, published on GitHub at [datadesk/california-coronavirus-data](https://github.com/datadesk/california-coronavirus-data#state-cdph-totalscsv). To run this notebook immediately in the cloud, click the [Binder](https://mybinder.org/) launcher below.

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/datadesk/california-coronavirus-data/master?urlpath=lab/tree/notebooks/examples.ipynb)

In [38]:
%load_ext lab_black

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black


## Import Python tools

Our data analysis and plotting tools

In [39]:
import pandas as pd
import altair as alt

Customizations to the Altair theme

In [40]:
import altair_latimes as lat

In [41]:
alt.themes.register("latimes", lat.theme)
alt.themes.enable("latimes")

ThemeRegistry.enable('latimes')

In [42]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Import data

Read in the agency totals

In [43]:
agency_df = pd.read_csv("../latimes-agency-totals.csv", parse_dates=["date"])

In [44]:
agency_df.head()

| | agency | county | fips | date | confirmed_cases | deaths | recoveries | did_not_update |
|---|---|---|---|---|---|---|---|---|
| 0 | Alameda | Alameda | 1 | 2020-06-17 | 4418 | 115.0 | | |
| 1 | Berkeley | Alameda | 1 | 2020-06-17 | 115 | 1.0 | | |
| 2 | Alpine | Alpine | 3 | 2020-06-17 | 1 | 0.0 | 1.0 | True |
| 3 | Amador | Amador | 5 | 2020-06-17 | 12 | 0.0 | 10.0 | |
| 4 | Butte | Butte | 7 | 2020-06-17 | 90 | 1.0 | 67.0 | |


In [45]:
agency_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6037 entries, 0 to 6036
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   agency           6037 non-null   object        
 1   county           6037 non-null   object        
 2   fips             6037 non-null   int64         
 3   date             6037 non-null   datetime64[ns]
 4   confirmed_cases  6037 non-null   int64         
 5   deaths           6036 non-null   float64       
 6   recoveries       1902 non-null   float64       
 7   did_not_update   829 non-null    object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 377.4+ KB


## Aggregate data

### By state

Lump all the agencies together and you get the statewide totals.

In [46]:
state_df = (
    agency_df.groupby(["date"])
    .agg({"confirmed_cases": "sum", "deaths": "sum"})
    .reset_index()
)
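
As a rough illustration of what that grouping does, here is a minimal sketch on a handful of made-up rows. The numbers and the `toy_` names are hypothetical and are not drawn from the real file. It also shows that `"sum"` skips missing values, which matters because the `deaths` column above has one null.

```python
import pandas as pd

# Hypothetical toy rows, not real figures from the dataset
toy_agency_df = pd.DataFrame(
    {
        "agency": ["Alameda", "Berkeley", "Alpine", "Alameda", "Berkeley", "Alpine"],
        "county": ["Alameda", "Alameda", "Alpine", "Alameda", "Alameda", "Alpine"],
        "date": pd.to_datetime(["2020-06-16"] * 3 + ["2020-06-17"] * 3),
        "confirmed_cases": [100, 10, 1, 120, 12, 1],
        # One missing death count, to show how the sum treats it
        "deaths": [5.0, 1.0, 0.0, 6.0, None, 0.0],
    }
)

# Same pattern as the cell above: one row per date, summed across agencies
toy_state_df = (
    toy_agency_df.groupby(["date"])
    .agg({"confirmed_cases": "sum", "deaths": "sum"})
    .reset_index()
)

print(toy_state_df)
```

The missing value is skipped rather than propagated, so the second day's death total only reflects the agencies that reported.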

In [47]:
state_df.head()

| | date | confirmed_cases | deaths |
|---|---|---|---|
| 0 | 2020-01-26 | 2 | 0.0 |
| 1 | 2020-01-27 | 3 | 0.0 |
| 2 | 2020-01-28 | 3 | 0.0 |
| 3 | 2020-01-29 | 4 | 0.0 |
| 4 | 2020-01-30 | 4 | 0.0 |


In [48]:
state_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             144 non-null    datetime64[ns]
 1   confirmed_cases  144 non-null    int64         
 2   deaths           144 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 3.5 KB


### By county

Three cities &mdash; Berkeley, Long Beach and Pasadena &mdash; run independent public health departments. Calculating county-level totals requires grouping them with their local peers.

In [49]:
county_df = (
    agency_df.groupby(["date", "county"])
    .agg({"confirmed_cases": "sum", "deaths": "sum"})
    .reset_index()
)
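
To see how the city health departments fold into their counties, here is a small sketch with invented numbers. Only the agency-to-county pairings reflect the real data; the counts and the `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical counts; only the agency/county pairings mirror the real file
toy_agency_df = pd.DataFrame(
    {
        "agency": ["Alameda", "Berkeley", "Los Angeles", "Long Beach", "Pasadena"],
        "county": ["Alameda", "Alameda", "Los Angeles", "Los Angeles", "Los Angeles"],
        "date": pd.to_datetime(["2020-06-17"] * 5),
        "confirmed_cases": [100, 10, 1000, 50, 25],
        "deaths": [5.0, 1.0, 40.0, 2.0, 3.0],
    }
)

# Grouping on county as well as date merges each city with its county peers
toy_county_df = (
    toy_agency_df.groupby(["date", "county"])
    .agg({"confirmed_cases": "sum", "deaths": "sum"})
    .reset_index()
)

print(toy_county_df)
```

Five agency rows collapse to two county rows, with Berkeley counted inside Alameda and Long Beach and Pasadena counted inside Los Angeles.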

In [50]:
county_df.head()

| | date | county | confirmed_cases | deaths |
|---|---|---|---|---|
| 0 | 2020-01-26 | Alameda | 0 | 0.0 |
| 1 | 2020-01-26 | Calaveras | 0 | 0.0 |
| 2 | 2020-01-26 | Contra Costa | 0 | 0.0 |
| 3 | 2020-01-26 | Humboldt | 0 | 0.0 |
| 4 | 2020-01-26 | Los Angeles | 1 | 0.0 |


In [51]:
county_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5704 entries, 0 to 5703
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             5704 non-null   datetime64[ns]
 1   county           5704 non-null   object        
 2   confirmed_cases  5704 non-null   int64         
 3   deaths           5704 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 178.4+ KB


## Chart the statewide totals over time

In [52]:
# Create a base chart with the common x-axis
chart = alt.Chart(state_df).encode(x=alt.X("date:T", title=None))

# Create the cases line
cases = chart.mark_line(color=lat.palette["default"]).encode(
    y=alt.Y("confirmed_cases:Q", title="Confirmed cases")
)

# Create the deaths line
deaths = chart.mark_line(color=lat.palette["schemes"]["ice-7"][3]).encode(
    y=alt.Y("deaths:Q", title="Deaths")
)

# Combine them into a single chart
(cases & deaths).properties(title="Statewide cumulative totals")

## Chart the county totals

First on a linear scale

In [53]:
# Create the base chart
chart = (
    alt.Chart(county_df)
    .mark_line()
    .encode(
        x=alt.X("date:T", title=None),
        color=alt.Color("county:N", title="County", legend=None),
    )
)

# The cases line
cases = chart.encode(y=alt.Y("confirmed_cases:Q", title="Confirmed cases"),)

# The deaths line
deaths = chart.mark_line().encode(y=alt.Y("deaths:Q", title="Deaths"),)

# Combined into a chart
(cases & deaths).properties(title="Cumulative totals by county")

Again on a logarithmic scale

In [54]:
# Make a base chart
chart = (
    alt.Chart(county_df)
    .mark_line()
    .encode(
        x=alt.X("date:T", title=None),
        color=alt.Color("county:N", title="County", legend=None),
    )
)

# The cases lines
cases = chart.transform_filter(alt.datum.confirmed_cases > 0).encode(
    y=alt.Y("confirmed_cases:Q", scale=alt.Scale(type="log"), title="Confirmed cases"),
)

# The deaths lines
deaths = chart.transform_filter(alt.datum.deaths > 0).encode(
    y=alt.Y("deaths:Q", scale=alt.Scale(type="log"), title="Deaths"),
)

# Slapping them together
(cases & deaths).properties(title="Cumulative totals by county")
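
The `transform_filter` calls above drop zero values before plotting. A likely reason, sketched here with NumPy on a hypothetical series, is that the logarithm of zero is not a finite number, so those rows have no position on a log axis.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts, including a zero
toy_cases = pd.Series([0, 1, 10, 100], name="confirmed_cases")

# log10(0) is negative infinity; suppress NumPy's divide-by-zero warning
with np.errstate(divide="ignore"):
    raw_logs = np.log10(toy_cases)

# Keeping only positive values mirrors the filter used in the chart
positive_logs = np.log10(toy_cases[toy_cases > 0])

print(raw_logs.tolist())
print(positive_logs.tolist())
```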

A common technique for clarifying these charts is to begin each line on the day the county hit a minimum number of cases. Let's try it with 10.

In [55]:
day_10_df = (
    county_df[
        # Filter down to only days with 10 or more cumulative cases
        county_df.confirmed_cases
        >= 10
    ]
    .groupby(
        # And then get the minimum date for each county
        "county"
    )
    .date.min()
    .reset_index()
)

Merge that date to each row in the data.

In [56]:
county_date_diff_df = county_df.merge(
    day_10_df, how="inner", on="county", suffixes=["", "_gte_10_cases"]
)

Calculate each day's distance, in days, from the date the county first reached 10 cases.

In [57]:
county_date_diff_df["days_since_10"] = (
    county_date_diff_df.date - county_date_diff_df.date_gte_10_cases
).dt.days
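
Here is the same three-step alignment on a toy frame, as a minimal sketch. The counties "A", "B" and "C", their counts and the `toy_` names are all hypothetical. County "C" never reaches 10 cases, so the inner merge drops it.

```python
import pandas as pd

# Hypothetical cumulative counts for three made-up counties
toy_county_df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2020-03-01", "2020-03-02", "2020-03-03",  # county A
                "2020-03-01", "2020-03-02", "2020-03-03",  # county B
                "2020-03-03",                              # county C
            ]
        ),
        "county": ["A", "A", "A", "B", "B", "B", "C"],
        "confirmed_cases": [4, 12, 30, 0, 3, 11, 2],
    }
)

# Step 1: the first date on which each county had 10 or more cases
toy_day_10_df = (
    toy_county_df[toy_county_df.confirmed_cases >= 10]
    .groupby("county")
    .date.min()
    .reset_index()
)

# Step 2: attach that date to every row for the county
toy_diff_df = toy_county_df.merge(
    toy_day_10_df, how="inner", on="county", suffixes=("", "_gte_10_cases")
)

# Step 3: count the days between each row and the county's threshold date
toy_diff_df["days_since_10"] = (
    toy_diff_df.date - toy_diff_df.date_gte_10_cases
).dt.days

print(toy_diff_df)
```

Negative values mark days before the threshold, which is why the chart filters to `days_since_10 >= 0`.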

Chart it.

In [58]:
alt.Chart(county_date_diff_df).transform_filter(
    # Keep only the days on or after each county reached 10 cases
    alt.datum.days_since_10
    >= 0
).mark_line().encode(
    x=alt.X("days_since_10:O", title="Days since 10th case"),
    y=alt.Y("confirmed_cases:Q", scale=alt.Scale(type="log"), title="Confirmed cases"),
    color=alt.Color("county:N", title="County", legend=None),
).properties(
    title="Cumulative totals by county"
)

## County trends on a linear 'Pez' plot

Fill in any date gaps so that every county has a row for every date.

In [59]:
backfilled_county_df = (
    county_df.set_index(["county", "date"])
    .unstack("county")
    .fillna(0)
    .stack("county")
    .reset_index()
)
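
The unstack, fill and stack round trip can be hard to picture, so here is a minimal sketch on hypothetical data. County "B" has no row for the first date; pivoting the counties into columns exposes that gap as a missing value, which is filled with zero before the frame is reshaped back to long form.

```python
import pandas as pd

# Hypothetical data: county "B" is missing a row for 2020-03-01
toy_county_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-02"]),
        "county": ["A", "A", "B"],
        "confirmed_cases": [1, 2, 5],
        "deaths": [0.0, 0.0, 1.0],
    }
)

toy_backfilled_df = (
    toy_county_df.set_index(["county", "date"])
    .unstack("county")  # wide: one column per county, gaps become NaN
    .fillna(0)          # treat the missing days as zero
    .stack("county")    # back to long: one row per date and county
    .reset_index()
)

print(toy_backfilled_df)
```

The result has a row for every date and county pair. Integer columns come back as floats, which matches the `0.0` values that appear later in this notebook.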

Sort the rows chronologically within each county so that day-over-day calculations line up.

In [60]:
chronological_county_df = backfilled_county_df.sort_values(["county", "date"])

Calculate the daily change in each county.

In [61]:
chronological_county_df["new_confirmed_cases"] = chronological_county_df.groupby(
    "county"
).confirmed_cases.diff()

Let's chill that out as a seven-day average.

In [62]:
chronological_county_df["new_confirmed_cases_rolling_average"] = (
    chronological_county_df.groupby("county")
    .new_confirmed_cases.rolling(7)
    .mean()
    .droplevel(0)
)
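
A small sketch of the grouped calculation, using hypothetical counts and a two-day window in place of seven so the toy data stays short. Grouping by county keeps one county's numbers from bleeding into the next, and `droplevel(0)` removes the county level that the grouped rolling result adds to the index so the values align with the original rows.

```python
import pandas as pd

# Hypothetical cumulative counts for two made-up counties
toy_df = pd.DataFrame(
    {
        "county": ["A"] * 4 + ["B"] * 4,
        "date": list(pd.date_range("2020-03-01", periods=4)) * 2,
        "confirmed_cases": [0, 2, 6, 12, 10, 10, 13, 19],
    }
).sort_values(["county", "date"])

# Daily change within each county; the first day of each county is NaN
toy_df["new_confirmed_cases"] = toy_df.groupby("county").confirmed_cases.diff()

# Rolling mean within each county (window of 2 here, 7 in the notebook)
toy_df["new_confirmed_cases_rolling_average"] = (
    toy_df.groupby("county")
    .new_confirmed_cases.rolling(2)
    .mean()
    .droplevel(0)
)

print(toy_df)
```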

Make the chart.

In [63]:
alt.Chart(chronological_county_df, title="New cases by day").mark_rect(
    stroke=None
).encode(
    x=alt.X(
        "date:O", axis=alt.Axis(ticks=False, grid=False, labels=False,), title=None
    ),
    y=alt.Y(
        "county:N",
        title="County",
        axis=alt.Axis(ticks=False, grid=False, labelPadding=5),
    ),
    color=alt.Color(
        "new_confirmed_cases_rolling_average:Q",
        scale=alt.Scale(
            type="threshold", domain=[0, 3, 10, 25, 50, 100, 500], scheme="blues"
        ),
        title="New cases (7-day average)",
    ),
).properties(
    height=800
)

## Chart new cases and deaths

Calculate the number of new cases each day using pandas' [diff](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html) method.

In [64]:
state_df["new_confirmed_cases"] = state_df.confirmed_cases.diff()

Do the same for deaths

In [65]:
state_df["new_deaths"] = state_df.deaths.diff()

Now calculate the moving seven-day average of each using pandas' [rolling](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) method.

In [66]:
state_df["new_confirmed_cases_rolling_average"] = state_df.new_confirmed_cases.rolling(
    7
).mean()

In [67]:
state_df["new_deaths_rolling_average"] = state_df.new_deaths.rolling(7).mean()
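
One consequence of chaining `diff` and `rolling(7)` is worth seeing on a small example: the first seven values of the average are empty. The first day has no previous day to subtract, and the window needs seven complete daily changes before it can produce a mean. The series below is hypothetical.

```python
import pandas as pd

# Hypothetical cumulative totals over ten days
cumulative = pd.Series([0, 1, 3, 6, 10, 15, 21, 28, 36, 45])

# Day-over-day change: NaN for the first day, then 1 through 9
new_cases = cumulative.diff()

# Seven-day moving average of the daily change
rolling_average = new_cases.rolling(7).mean()

print(rolling_average.tolist())
```

With the default settings, any window that contains a missing value yields a missing result, so the average line in the chart begins a week after the first row.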

Put it all together in one chart

In [68]:
# One base chart object with the data they all share
chart = alt.Chart(state_df).encode(x=alt.X("date:T", title=None),)

# The new cases bars
cases_bars = chart.mark_bar(color=lat.palette["default"]).encode(
    y=alt.Y("new_confirmed_cases:Q", title="New confirmed cases")
)

# The cases rolling average
cases_line = chart.mark_line(color=lat.palette["accent"]).encode(
    y=alt.Y("new_confirmed_cases_rolling_average:Q", title="7-day average")
)

# The new deaths bars
deaths_bars = chart.mark_bar(color=lat.palette["schemes"]["ice-7"][3]).encode(
    y=alt.Y("new_deaths:Q", title="New deaths")
)

# The deaths rolling average
deaths_line = chart.mark_line(color=lat.palette["schemes"]["ice-7"][6]).encode(
    y=alt.Y("new_deaths_rolling_average:Q", title="7-day average")
)

# Combine it all together into one paired chart
((cases_bars + cases_line) & (deaths_bars + deaths_line)).properties(
    title="New cases and deaths statewide by day"
)

Now do it by county

In [72]:
chronological_county_df.head()

| | date | county | confirmed_cases | deaths | new_confirmed_cases | new_confirmed_cases_rolling_average |
|---|---|---|---|---|---|---|
| 0 | 2020-01-26 | Alameda | 0.0 | 0.0 | | |
| 58 | 2020-01-27 | Alameda | 0.0 | 0.0 | 0.0 | |
| 116 | 2020-01-28 | Alameda | 0.0 | 0.0 | 0.0 | |
| 174 | 2020-01-29 | Alameda | 0.0 | 0.0 | 0.0 | |
| 232 | 2020-01-30 | Alameda | 0.0 | 0.0 | 0.0 | |


In [73]:
alt.Chart(chronological_county_df, title="New cases by day").mark_line().encode(
    x=alt.X("date:O", axis=alt.Axis(ticks=False, grid=False, labels=False), title=None),
    y=alt.Y("new_confirmed_cases_rolling_average:Q", title="7-day average"),
    color=alt.Color("county:N", title="County", legend=None),
)