# california-coronavirus-data examples

By [Ben Welsh](https://palewi.re/who-is-ben-welsh)

A demonstration of how to use Python to work with the Los Angeles Times' independent tally of coronavirus cases in California published on GitHub at [datadesk/california-coronavirus-data](https://github.com/datadesk/california-coronavirus-data#state-cdph-totalscsv). To run this notebook immediately in the cloud, click the [Binder](https://mybinder.org/) launcher below.

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/datadesk/california-coronavirus-data/master?urlpath=lab/tree/notebooks/examples.ipynb)

## Import Python tools

Our data analysis and plotting tools

In [1]:
import pandas as pd
import altair as alt

Customize the altair theme

In [2]:
import altair_latimes as lat

In [3]:
alt.themes.register('latimes', lat.theme)
alt.themes.enable('latimes')

ThemeRegistry.enable('latimes')

## Importing data

Read in the agency totals

In [4]:
agency_df = pd.read_csv("../latimes-agency-totals.csv")

In [5]:
agency_df.head()

Unnamed: 0,agency,county,fips,date,confirmed_cases,deaths,did_not_update
0,Alameda,Alameda,1,2020-04-08,640,16,
1,Berkeley,Alameda,1,2020-04-08,34,0,
2,Alpine,Alpine,3,2020-04-08,1,0,
3,Amador,Amador,5,2020-04-08,3,0,
4,Butte,Butte,7,2020-04-08,13,0,


In [6]:
agency_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1862 entries, 0 to 1861
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   agency           1862 non-null   object
 1   county           1862 non-null   object
 2   fips             1862 non-null   int64 
 3   date             1862 non-null   object
 4   confirmed_cases  1862 non-null   int64 
 5   deaths           1862 non-null   int64 
 6   did_not_update   116 non-null    object
dtypes: int64(3), object(4)
memory usage: 102.0+ KB
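
The `info()` output lists `date` as `object`, which means the dates are stored as plain strings. Altair parses them through the `:T` shorthand, but pandas date operations need a real datetime type. A minimal sketch of the conversion, using a small made-up frame (the county names and counts are hypothetical, not taken from the file):

```python
import pandas as pd

# Hypothetical sample rows shaped like the agency totals file
sample_df = pd.DataFrame({
    "county": ["County A", "County A", "County B"],
    "date": ["2020-04-07", "2020-04-08", "2020-04-08"],
    "confirmed_cases": [10, 12, 3],
})

# Convert the string column to datetime64 values
sample_df["date"] = pd.to_datetime(sample_df["date"])

# Date-aware operations now work, such as finding the latest day
latest_date = sample_df["date"].max()
```

Passing `parse_dates=["date"]` to `pd.read_csv` should perform the same conversion at load time.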


## Chart statewide totals

Chart the cumulative totals statewide

In [7]:
chart = alt.Chart(agency_df).properties(width=400, height=300)

cases = chart.mark_line(color=lat.palette['default']).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("sum(confirmed_cases):Q", title="Confirmed cases")
)

deaths = chart.mark_line(color=lat.palette['schemes']['ice-7'][3]).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("sum(deaths):Q", title="Deaths")
)

(cases | deaths).properties(title="Statewide cumulative totals")

## Chart top counties

Chart the counties individually.

In [8]:
county_df = agency_df.groupby(["date", "county"]).agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index()

In [9]:
county_df.head()

Unnamed: 0,date,county,confirmed_cases,deaths
0,2020-01-26,Alameda,0,0
1,2020-01-26,Calaveras,0,0
2,2020-01-26,Contra Costa,0,0
3,2020-01-26,Humboldt,0,0
4,2020-01-26,Los Angeles,1,0


In [10]:
county_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1739 entries, 0 to 1738
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   date             1739 non-null   object
 1   county           1739 non-null   object
 2   confirmed_cases  1739 non-null   int64 
 3   deaths           1739 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 54.5+ KB


In [11]:
county_totals_df = county_df.groupby("county").agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index().sort_values("confirmed_cases", ascending=False)

In [12]:
county_totals_df.head(10)

Unnamed: 0,county,confirmed_cases,deaths
17,Los Angeles,60721,1276
39,Santa Clara,16761,567
33,San Diego,14309,216
26,Orange,9678,135
29,Riverside,8004,264
34,San Francisco,7380,93
37,San Mateo,7110,173
0,Alameda,6534,137
30,Sacramento,5253,189
6,Contra Costa,4655,64
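
The `confirmed_cases` and `deaths` columns are running totals, so summing them across dates adds each day's count again for every later day. The result may still serve as a rough ranking, but the figures are larger than any county's actual tally. One alternative, sketched here on made-up rows (the county names and counts are hypothetical), is to keep only each county's most recent row:

```python
import pandas as pd

# Hypothetical cumulative counts for two made-up counties
toy_df = pd.DataFrame({
    "date": ["2020-04-06", "2020-04-07", "2020-04-08"] * 2,
    "county": ["County A"] * 3 + ["County B"] * 3,
    "confirmed_cases": [10, 15, 20, 12, 13, 14],
    "deaths": [0, 1, 1, 0, 0, 1],
})

# Summing running totals across dates counts early cases repeatedly
summed = toy_df.groupby("county").confirmed_cases.sum()

# Keeping only each county's most recent row gives the current tally
latest = (
    toy_df.sort_values("date")
    .groupby("county")
    .tail(1)
    .sort_values("confirmed_cases", ascending=False)
    .reset_index(drop=True)
)
```

In this toy frame, `summed` reports 45 for County A while `latest` reports 20, the value on the final day.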


In [13]:
chart = alt.Chart(county_df[county_df.county.isin(county_totals_df.head(10).county)]).properties(width=400, height=300)

cases = chart.mark_line().encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("confirmed_cases:Q", title="Confirmed cases"),
    color=alt.Color("county:N", title="County")
)

deaths = chart.mark_line().encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("deaths:Q", title="Deaths"),
    color=alt.Color("county:N", title="County")
)

(cases | deaths).properties(title="Cumulative totals in top 10 counties")

## Again on a logarithmic scale

In [56]:
# Trim down to the top 10 counties
top_ten_df = county_df[county_df.county.isin(county_totals_df.head(10).county)]

# Make a base chart
chart = alt.Chart(top_ten_df).properties(
    width=500,
    height=300
).mark_line().encode(
    x=alt.X("date:T", title="Date"),
    color=alt.Color("county:N", title="County")
)

# The cases lines
cases = chart.transform_filter(alt.datum.confirmed_cases > 0).encode(
    y=alt.Y(
        "confirmed_cases:Q",
        scale=alt.Scale(type='log'),
        title="Confirmed cases"
    ),
)

# The deaths lines
deaths = chart.transform_filter(alt.datum.deaths > 0).encode(
    y=alt.Y(
        "deaths:Q",
        scale=alt.Scale(type='log'),
        title="Deaths"
    ),
)

# Slapping them together
(cases | deaths).properties(title="Cumulative totals in top 10 counties")
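
A logarithmic scale has no position for zero, which is why both charts filter to values greater than zero before plotting. The same filter can be applied in pandas ahead of charting. A small sketch on made-up rows (the dates and counts are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative deaths for one made-up county
toy_df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"],
    "deaths": [0, 0, 1, 10],
})

# Drop the zero rows, mirroring transform_filter(alt.datum.deaths > 0)
positive_df = toy_df[toy_df.deaths > 0]

# With zeros removed, every remaining value has a finite base-10 log
log_deaths = np.log10(positive_df.deaths)
```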

## Chart new cases and deaths by day

Group the data by date

In [14]:
state_df = agency_df.groupby("date").agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index().sort_values("date", ascending=True)

In [15]:
state_df.head()

Unnamed: 0,date,confirmed_cases,deaths
0,2020-01-26,2,0
1,2020-01-27,2,0
2,2020-01-28,2,0
3,2020-01-29,2,0
4,2020-01-30,2,0


Calculate the number of new cases each day using the pandas [diff](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html) method.

In [16]:
state_df['new_confirmed_cases'] = state_df.confirmed_cases.diff()

Do the same for deaths

In [17]:
state_df['new_deaths'] = state_df.deaths.diff()
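
To illustrate what `diff` returns, here is a small sketch on a made-up cumulative series (the numbers are hypothetical). The first row has no previous value to subtract, so it comes back as `NaN`, and the column becomes floating point as a result.

```python
import pandas as pd

# Hypothetical cumulative statewide counts, one row per day
cumulative = pd.Series([2, 2, 5, 9, 9], name="confirmed_cases")

# Each value minus the previous one; the first row has no predecessor
new_cases = cumulative.diff()

# Optionally treat the first day as zero and store whole numbers
new_cases_filled = new_cases.fillna(0).astype(int)
```

If a cumulative count is ever revised downward, `diff` will produce a negative value for that day.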

Now let's do something slightly different and calculate the rolling seven-day average of each.

In [18]:
state_df['new_confirmed_cases_rolling_average'] = state_df.new_confirmed_cases.rolling(7).mean()

In [19]:
state_df['new_deaths_rolling_average'] = state_df.new_deaths.rolling(7).mean()
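
By default, `rolling(7).mean()` returns `NaN` until a full window of seven observations is available, so the average line starts later than the bars. A short sketch on made-up daily counts (the numbers are hypothetical) compares that default with the `min_periods` option:

```python
import pandas as pd

# Hypothetical daily new-case counts for eight days
daily = pd.Series([7, 14, 21, 28, 35, 42, 49, 56])

# Default: the result is NaN until seven observations are available
strict = daily.rolling(7).mean()

# min_periods=1 averages whatever is in the window so far
lenient = daily.rolling(7, min_periods=1).mean()
```

The default is the more conservative choice, since early values computed with `min_periods=1` average fewer than seven days and are not directly comparable to later ones.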

Put it all together on the chart 

In [20]:
# One base chart object with the data they all share
chart = alt.Chart(state_df).properties(width=400, height=300)

# The new cases bars
cases_bars = chart.mark_bar(color=lat.palette['default']).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("new_confirmed_cases:Q", title="New confirmed cases")
)

# The cases rolling average
cases_line = chart.mark_line(color=lat.palette['accent']).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("new_confirmed_cases_rolling_average:Q", title="7-day average")
)

# The new deaths bars
deaths_bars = chart.mark_bar(color=lat.palette['schemes']['ice-7'][3]).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("new_deaths:Q", title="New deaths")
)

# The deaths rolling average
deaths_line = chart.mark_line(color=lat.palette['schemes']['ice-7'][6]).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("new_deaths_rolling_average:Q", title="7-day average")
)

# Combine it all together into one paired chart
((cases_bars + cases_line) | (deaths_bars + deaths_line)).properties(
    title="New cases and deaths statewide by day"
)