# Example notebook

By [Ben Welsh](https://palewi.re/who-is-ben-welsh)

A demonstration of how to use Python to work with the Los Angeles Times' independent tally of coronavirus cases in California. Draws from the [datadesk/california-coronavirus-data](https://github.com/datadesk/california-coronavirus-data#state-cdph-totalscsv) repository on GitHub.

## Import Python tools

Load pandas for data analysis and Altair for plotting.

In [2]:
import pandas as pd
import altair as alt

Customize the Altair theme using the `altair_latimes` package.

In [3]:
import altair_latimes as lat

In [4]:
alt.themes.register('latimes', lat.theme)
alt.themes.enable('latimes')

ThemeRegistry.enable('latimes')

## Import data

Read in the agency totals

In [5]:
agency_df = pd.read_csv("../latimes-agency-totals.csv")

In [6]:
agency_df.head()

| | agency | county | fips | date | confirmed_cases | deaths | did_not_update |
|---|---|---|---|---|---|---|---|
| 0 | Alameda | Alameda | 1 | 2020-04-04 | 510 | 12 | |
| 1 | Berkeley | Alameda | 1 | 2020-04-04 | 27 | 0 | True |
| 2 | Alpine | Alpine | 3 | 2020-04-04 | 1 | 0 | True |
| 3 | Amador | Amador | 5 | 2020-04-04 | 3 | 0 | True |
| 4 | Butte | Butte | 7 | 2020-04-04 | 11 | 0 | True |


In [7]:
agency_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1638 entries, 0 to 1637
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   agency           1638 non-null   object
 1   county           1638 non-null   object
 2   fips             1638 non-null   int64 
 3   date             1638 non-null   object
 4   confirmed_cases  1638 non-null   int64 
 5   deaths           1638 non-null   int64 
 6   did_not_update   93 non-null     object
dtypes: int64(3), object(4)
memory usage: 89.7+ KB
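The `info()` output shows that `date` is stored as a plain string (`object`) and that `did_not_update` is filled in on only 93 of 1,638 rows. A minimal sketch of one way to tidy both columns at load time follows. It uses a small inline sample shaped like the rows above rather than the real file, and it assumes the flag column only ever holds `True` or a missing value, as in the output shown.

```python
import io

import pandas as pd

# Inline stand-in for latimes-agency-totals.csv (rows mirror the head() output above)
sample_csv = io.StringIO(
    "agency,county,fips,date,confirmed_cases,deaths,did_not_update\n"
    "Alameda,Alameda,1,2020-04-04,510,12,\n"
    "Berkeley,Alameda,1,2020-04-04,27,0,True\n"
    "Alpine,Alpine,3,2020-04-04,1,0,True\n"
)

# parse_dates converts the date strings to datetime64 while reading
sample_df = pd.read_csv(sample_csv, parse_dates=["date"])

# Assumption: the column holds only True or a missing value,
# so "not missing" is equivalent to "did not update"
sample_df["did_not_update"] = sample_df["did_not_update"].notna()

print(sample_df.dtypes)
```

Altair's `date:T` encoding parses date strings on its own, so this step is optional for the charts below. It mainly helps if you want to sort or filter by date in pandas.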


## Chart statewide totals

Chart the cumulative totals statewide by summing every agency's counts for each date.

In [48]:
chart = alt.Chart(agency_df).properties(width=400, height=300)

cases = chart.mark_line(color=lat.palette['default']).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("sum(confirmed_cases):Q", title="Confirmed cases")
)

deaths = chart.mark_line(color=lat.palette['highlight']).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("sum(deaths):Q", title="Deaths")
)

(cases | deaths).properties(title="Statewide cumulative totals")

## Chart top counties

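Chart the counties individually. A county can contain more than one reporting agency (Berkeley reports separately within Alameda County), so first roll the agency totals up to one row per county per date.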

In [25]:
county_df = agency_df.groupby(["date", "county"]).agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index()
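To see what the county rollup does, here is a small sketch that applies the same `groupby` to three rows copied from the `agency_df.head()` output earlier in the notebook. The names `sample_df` and `rollup_df` are illustrative only.

```python
import pandas as pd

# Three rows copied from the agency_df.head() output earlier in the notebook
sample_df = pd.DataFrame({
    "agency": ["Alameda", "Berkeley", "Alpine"],
    "county": ["Alameda", "Alameda", "Alpine"],
    "date": ["2020-04-04"] * 3,
    "confirmed_cases": [510, 27, 1],
    "deaths": [12, 0, 0],
})

# Same aggregation as above: one row per date and county
rollup_df = sample_df.groupby(["date", "county"]).agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index()

print(rollup_df)
```

The Alameda and Berkeley rows collapse into a single Alameda County row with 537 confirmed cases and 12 deaths.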

In [26]:
county_df.head()

| | date | county | confirmed_cases | deaths |
|---|---|---|---|---|
| 0 | 2020-01-26 | Alameda | 0 | 0 |
| 1 | 2020-01-26 | Calaveras | 0 | 0 |
| 2 | 2020-01-26 | Contra Costa | 0 | 0 |
| 3 | 2020-01-26 | Humboldt | 0 | 0 |
| 4 | 2020-01-26 | Los Angeles | 1 | 0 |


In [27]:
county_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1527 entries, 0 to 1526
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   date             1527 non-null   object
 1   county           1527 non-null   object
 2   confirmed_cases  1527 non-null   int64 
 3   deaths           1527 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 47.8+ KB


In [54]:
county_totals_df = county_df.groupby("county").agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index().sort_values("confirmed_cases", ascending=False)

In [55]:
county_totals_df.head(10)

| | county | confirmed_cases | deaths |
|---|---|---|---|
| 16 | Los Angeles | 33858 | 624 |
| 38 | Santa Clara | 11665 | 397 |
| 32 | San Diego | 8663 | 111 |
| 25 | Orange | 6108 | 75 |
| 33 | San Francisco | 4931 | 57 |
| 36 | San Mateo | 4770 | 105 |
| 0 | Alameda | 4072 | 81 |
| 28 | Riverside | 4064 | 160 |
| 29 | Sacramento | 3244 | 112 |
| 6 | Contra Costa | 2948 | 37 |
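Note that the figures in `county_df` are cumulative, so summing them across every date produces a number that works for ranking counties but is larger than the case count on any single day. One possible alternative, sketched here with made-up numbers, is to keep each county's most recent row instead. The frame `toy_county_df` and its values are hypothetical and are not drawn from the Times data.

```python
import pandas as pd

# Hypothetical cumulative counts for two counties over two days
toy_county_df = pd.DataFrame({
    "date": ["2020-04-03", "2020-04-03", "2020-04-04", "2020-04-04"],
    "county": ["Alameda", "Alpine", "Alameda", "Alpine"],
    "confirmed_cases": [500, 1, 537, 1],
    "deaths": [11, 0, 12, 0],
})

# Keep the last row per county by date, which holds its latest cumulative totals
latest_df = (
    toy_county_df.sort_values("date")
    .groupby("county")
    .tail(1)
    .sort_values("confirmed_cases", ascending=False)
    .reset_index(drop=True)
)

print(latest_df)
```

Whether this yields the same top 10 as the summed ranking depends on the data, so treat it as a cross-check rather than a drop-in replacement.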


In [58]:
chart = alt.Chart(county_df[county_df.county.isin(county_totals_df.head(10).county)]).properties(width=400, height=300)

cases = chart.mark_line().encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("confirmed_cases:Q", title="Confirmed cases"),
    color=alt.Color("county:N", title="County")
)

deaths = chart.mark_line().encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("deaths:Q", title="Deaths"),
    color=alt.Color("county:N", title="County")
)

(cases | deaths).properties(title="Cumulative totals in top 10 counties")

## Chart new cases and deaths by day

In [60]:
state_df = agency_df.groupby("date").agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index().sort_values("date", ascending=True)

In [61]:
state_df.head()

| | date | confirmed_cases | deaths |
|---|---|---|---|
| 0 | 2020-01-26 | 2 | 0 |
| 1 | 2020-01-27 | 2 | 0 |
| 2 | 2020-01-28 | 2 | 0 |
| 3 | 2020-01-29 | 2 | 0 |
| 4 | 2020-01-30 | 2 | 0 |


In [62]:
state_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70 entries, 0 to 69
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   date             70 non-null     object
 1   confirmed_cases  70 non-null     int64 
 2   deaths           70 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 2.2+ KB


In [66]:
state_df['new_confirmed_cases'] = state_df.confirmed_cases.diff()

In [75]:
state_df['new_deaths'] = state_df.deaths.diff()
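Because the statewide counts are cumulative, `diff()` turns them into day-over-day changes by subtracting the previous row from each row. The first row has no predecessor, so its value is `NaN`. A minimal sketch with a hypothetical three-day frame (`toy_state_df` is not part of the notebook) shows the behavior and one optional way to fill the gap:

```python
import pandas as pd

# Hypothetical cumulative counts for three consecutive days
toy_state_df = pd.DataFrame({
    "date": ["2020-01-26", "2020-01-27", "2020-01-28"],
    "confirmed_cases": [2, 2, 5],
})

# Each value minus the previous row's value; the first row becomes NaN
toy_state_df["new_confirmed_cases"] = toy_state_df["confirmed_cases"].diff()

# Optional: treat the first day as zero new cases and restore integer dtype
filled = toy_state_df["new_confirmed_cases"].fillna(0).astype(int)

print(toy_state_df)
print(filled.tolist())
```

Leaving the `NaN` in place is also reasonable, since the true change on the first day is unknown.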

In [76]:
chart = alt.Chart(state_df).properties(width=400, height=300)

cases = chart.mark_bar(color=lat.palette['default']).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("new_confirmed_cases:Q", title="New confirmed cases")
)

deaths = chart.mark_bar(color=lat.palette['highlight']).encode(
    x=alt.X("date:T", title="Date"),
    y=alt.Y("new_deaths:Q", title="New deaths")
)

(cases | deaths).properties(title="New cases and deaths statewide by day")