# california-coronavirus-data examples, modified slightly for presentation to CESMII

The interactive examples for the presentation are shown at the bottom under "Additions for the CESMII/RTG workshop".  If you want to run an interactive example of this notebook, you can click here:
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/benjum/winjum_cesmii_presentation/master?filepath=LATimes-examples-winjum.ipynb)

The first part of this notebook is taken directly from the LA Times notebook by [Ben Welsh](https://palewi.re/who-is-ben-welsh).  That notebook is "A demonstration of how to use Python to work with the Los Angeles Times' independent tally of coronavirus cases in California published on GitHub at [datadesk/california-coronavirus-data](https://github.com/datadesk/california-coronavirus-data#state-cdph-totalscsv)." To run that original notebook immediately in the cloud,  click the [Binder](https://mybinder.org/) launcher here:  [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/datadesk/california-coronavirus-data/master?urlpath=lab/tree/notebooks/examples.ipynb)



## Import Python tools

Our data analysis and plotting tools

In [None]:
import pandas as pd
import altair as alt

Customizations to the Altair theme

In [None]:
import altair_latimes as lat

In [None]:
alt.themes.register('latimes', lat.theme)
alt.themes.enable('latimes')

## Import data

Read in the agency totals

In [None]:
agency_df = pd.read_csv(
    "latimes-agency-totals.csv",
    parse_dates=["date"]
)

In [None]:
agency_df.head()

In [None]:
agency_df.info()

## Aggregate data

### By state

Lump all the agencies together and you get the statewide totals.

In [None]:
state_df = agency_df.groupby(["date"]).agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index()

In [None]:
state_df.head()

In [None]:
state_df.info()

### By county

Three cities &mdash; Berkeley, Long Beach and Pasadena &mdash; run independent public health departments. Calculating county-level totals requires grouping them with their local peers.

In [None]:
county_df = agency_df.groupby(["date", "county"]).agg({
    "confirmed_cases": "sum",
    "deaths": "sum"
}).reset_index()
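To see what this grouping does to a city-run health department, here is a minimal sketch on made-up rows. The `agency` column name and all of the numbers are assumptions for illustration; the only thing carried over from the notebook is the `groupby` call itself.

```python
import pandas as pd

# Invented rows for illustration only; these are not real case counts.
# Berkeley reports separately but sits inside Alameda County.
toy_agency_df = pd.DataFrame({
    "agency": ["Alameda", "Berkeley", "Alameda", "Berkeley"],
    "county": ["Alameda", "Alameda", "Alameda", "Alameda"],
    "date": pd.to_datetime(
        ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"]
    ),
    "confirmed_cases": [100, 10, 120, 12],
    "deaths": [2, 0, 3, 0],
})

# Same aggregation as above: one row per date and county
toy_county_df = toy_agency_df.groupby(["date", "county"]).agg({
    "confirmed_cases": "sum",
    "deaths": "sum",
}).reset_index()

print(toy_county_df)
```

The two agencies collapse into a single Alameda row per date, with their counts added together.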

In [None]:
county_df.head()

In [None]:
county_df.info()

## Chart the statewide totals over time

In [None]:
# Create a base chart with the common x-axis
chart = alt.Chart(state_df).encode(
    x=alt.X("date:T", title=None)
)

# Create the cases line
cases = chart.mark_line(color=lat.palette['default']).encode(
    y=alt.Y("confirmed_cases:Q", title="Confirmed cases")
)

# Create the deaths line
deaths = chart.mark_line(color=lat.palette['schemes']['ice-7'][3]).encode(
    y=alt.Y("deaths:Q", title="Deaths")
)

# Combine them into a single chart
(cases & deaths).properties(title="Statewide cumulative totals")

## Chart the county totals

First on a linear scale

In [None]:
# Create the base chart
chart = alt.Chart(county_df).mark_line().encode(
    x=alt.X("date:T", title=None),
    color=alt.Color("county:N", title="County", legend=None)
)

# The cases line
cases = chart.encode(
    y=alt.Y(
        "confirmed_cases:Q",
        title="Confirmed cases"
    ),
)

# The deaths line
deaths = chart.mark_line().encode(
    y=alt.Y("deaths:Q", title="Deaths"),
)

# Combined into a chart
(cases & deaths).properties(title="Cumulative totals by county")

Again on a logarithmic scale

In [None]:
# Make a base chart
chart = alt.Chart(county_df).mark_line().encode(
    x=alt.X("date:T", title=None),
    color=alt.Color("county:N", title="County", legend=None)
)

# The cases lines
cases = chart.transform_filter(alt.datum.confirmed_cases > 0).encode(
    y=alt.Y(
        "confirmed_cases:Q",
        scale=alt.Scale(type='log'),
        title="Confirmed cases"
    ),
)

# The deaths lines
deaths = chart.transform_filter(alt.datum.deaths > 0).encode(
    y=alt.Y(
        "deaths:Q",
        scale=alt.Scale(type='log'),
        title="Deaths"
    ),
)

# Slapping them together
(cases & deaths).properties(title="Cumulative totals by county")

A common technique for clarifying these charts is to begin each line on the day the county hit a minimum number of cases. Let's try it with 10.

In [None]:
day_10_df = county_df[
    # Filter down to only days with 10 or more cumulative cases
    county_df.confirmed_cases >= 10
].groupby(
    # And then get the minimum date for each county
    'county'
).date.min().reset_index()

Merge that date to each row in the data.

In [None]:
county_date_diff_df = county_df.merge(
    day_10_df,
    how='inner',
    on='county',
    suffixes=['', '_gte_10_cases']
)

Calculate how many days each row is from the date its county first reached 10 cases.

In [None]:
county_date_diff_df['days_since_10'] = (
    county_date_diff_df.date - county_date_diff_df.date_gte_10_cases
).dt.days
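The three steps above (find the first day at 10 or more cases, merge it back, subtract) can be checked on a tiny made-up frame. The county names `A` and `B` and all of the numbers are invented for illustration.

```python
import pandas as pd

# Invented data: county A reaches 10 cases on day 2, county B on day 4
toy_county_df = pd.DataFrame({
    "county": ["A"] * 4 + ["B"] * 4,
    "date": list(pd.date_range("2020-03-01", periods=4)) * 2,
    "confirmed_cases": [2, 10, 25, 40, 0, 3, 8, 15],
})

# First date on which each county had 10 or more cumulative cases
toy_day_10_df = toy_county_df[
    toy_county_df.confirmed_cases >= 10
].groupby("county").date.min().reset_index()

# Attach that date to every row of the same county
toy_diff_df = toy_county_df.merge(
    toy_day_10_df,
    how="inner",
    on="county",
    suffixes=("", "_gte_10_cases"),
)

# Negative values are days before the threshold was reached
toy_diff_df["days_since_10"] = (
    toy_diff_df.date - toy_diff_df.date_gte_10_cases
).dt.days

print(toy_diff_df[["county", "date", "days_since_10"]])
```

Rows with a negative `days_since_10` fall before the threshold, which is why the chart filters on `days_since_10 >= 0`.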

Chart it.

In [None]:
alt.Chart(county_date_diff_df).transform_filter(
    # Only keep everything once they hit 10 cases
    alt.datum.days_since_10 >= 0
).mark_line().encode(
    x=alt.X(
        "days_since_10:O",
        title="Days since 10th case"
    ),
    y=alt.Y(
        "confirmed_cases:Q",
        scale=alt.Scale(type='log'),
        title="Confirmed cases"
    ),
    color=alt.Color("county:N", title="County", legend=None)
).properties(title="Cumulative totals by county")

## County trends on a linear 'Pez' plot

Fill in any date gaps so that every county has a row for every date.

In [None]:
backfilled_county_df = county_df.set_index([
    "county",
    "date"
]).unstack("county").fillna(0).stack("county").reset_index()
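Here is a minimal sketch of what the `unstack` / `fillna` / `stack` round trip does, using invented rows in which county `B` has no entry for the first date. The data and the `toy_` names are assumptions for illustration.

```python
import pandas as pd

# Invented data: county B is missing a row for 2020-03-01
toy_county_df = pd.DataFrame({
    "county": ["A", "A", "B"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-02"]),
    "confirmed_cases": [1, 2, 5],
})

# Pivot counties into columns (creating NaN for the gap), fill the gap
# with 0, then pivot the counties back into rows
toy_backfilled_df = toy_county_df.set_index([
    "county",
    "date"
]).unstack("county").fillna(0).stack("county").reset_index()

print(toy_backfilled_df)
```

The result has one row for every date and county pair, and the missing row is filled with 0 rather than with the previous day's total, matching the cell above.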

Sort the rows chronologically within each county.

In [None]:
chronological_county_df = backfilled_county_df.sort_values(["county", "date"])

Calculate the daily change in each county.

In [None]:
chronological_county_df['new_confirmed_cases'] = chronological_county_df.groupby("county").confirmed_cases.diff()

Let's chill that out as a seven-day average.

In [None]:
chronological_county_df['new_confirmed_cases_rolling_average'] = chronological_county_df.new_confirmed_cases.rolling(7).mean()
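The rolling mean above runs over the whole sorted column rather than county by county. A small sketch with invented numbers suggests why the windows still should not mix counties here: the grouped `diff` leaves a `NaN` on each county's first row, and with the default `min_periods`, any 7-row window that contains a `NaN` yields `NaN`. The explicit grouped version is shown only for comparison and is an assumption about how one might write it, not the notebook's code.

```python
import pandas as pd

# Invented data: two counties, nine days each
dates = pd.date_range("2020-03-01", periods=9)
toy_df = pd.DataFrame({
    "county": ["A"] * 9 + ["B"] * 9,
    "date": list(dates) * 2,
    "confirmed_cases": list(range(0, 90, 10)) + list(range(0, 18, 2)),
}).sort_values(["county", "date"])

# Daily change within each county; the first row per county is NaN
toy_df["new_cases"] = toy_df.groupby("county").confirmed_cases.diff()

# Rolling mean over the whole column, as in the cell above
ungrouped = toy_df["new_cases"].rolling(7).mean()

# Rolling mean computed separately inside each county
grouped = toy_df.groupby("county")["new_cases"].transform(
    lambda s: s.rolling(7).mean()
)

print(pd.DataFrame({"ungrouped": ungrouped, "grouped": grouped}))
```

On this toy data the two columns agree. The grouped form does not rely on the leading `NaN`, so it may be the safer choice if the window length or `min_periods` ever changes.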

Make the chart.

In [None]:
alt.Chart(chronological_county_df, title="New cases by day").mark_rect(stroke=None).encode(
    x=alt.X(
        'date:O',
        axis=alt.Axis(
            ticks=False,
            grid=False,
            labels=False,
        ),
        title=None
    ),
    y=alt.Y(
        'county:N',
        title="County",
        axis=alt.Axis(ticks=False, grid=False, labelPadding=5)
    ),
    color=alt.Color(
        "new_confirmed_cases_rolling_average:Q",
        scale=alt.Scale(
            type="threshold",
            domain=[0, 3, 10, 25, 50, 100, 500],
            scheme="blues"
        ),
        title="New cases (7-day average)"
    )
).properties(height=800)

## Chart new cases and deaths

Calculate the number of new cases each day using pandas' [diff](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html) method.

In [None]:
state_df['new_confirmed_cases'] = state_df.confirmed_cases.diff()

Do the same for deaths

In [None]:
state_df['new_deaths'] = state_df.deaths.diff()

Now calculate the moving seven-day average of each using pandas' [rolling](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) method.

In [None]:
state_df['new_confirmed_cases_rolling_average'] = state_df.new_confirmed_cases.rolling(7).mean()

In [None]:
state_df['new_deaths_rolling_average'] = state_df.new_deaths.rolling(7).mean()

Put it all together on the chart 

In [None]:
# One base chart object with the data they all share
chart = alt.Chart(state_df).encode(
    x=alt.X("date:T", title=None),
)

# The new cases bars
cases_bars = chart.mark_bar(color=lat.palette['default']).encode(
    y=alt.Y(
        "new_confirmed_cases:Q",
        title="New confirmed cases"
    )
)

# The cases rolling average
cases_line = chart.mark_line(color=lat.palette['accent']).encode(
    y=alt.Y(
        "new_confirmed_cases_rolling_average:Q",
        title="7-day average"
    )
)

# The new deaths bars
deaths_bars = chart.mark_bar(color=lat.palette['schemes']['ice-7'][3]).encode(
    y=alt.Y(
        "new_deaths:Q",
        title="New deaths"
    )
)

# The deaths rolling average
deaths_line = chart.mark_line(color=lat.palette['schemes']['ice-7'][6]).encode(
    y=alt.Y(
        "new_deaths_rolling_average:Q",
        title="7-day average"
    )
)

# Combine it all together into one paired chart
((cases_bars + cases_line) & (deaths_bars + deaths_line)).properties(
    title="New cases and deaths statewide by day"
)

# Additions for the CESMII/RTG workshop

## Adding a few interactive elements to the Altair plots

Here is a chart from above:

In [None]:
# Create the base chart
chart = alt.Chart(county_df).mark_line().encode(
    x=alt.X("date:T", title=None),
    color=alt.Color("county:N", title="County", legend=None)
)

# The cases line
cases = chart.encode(
    y=alt.Y(
        "confirmed_cases:Q",
        title="Confirmed cases"
    ),
)

# The deaths line
deaths = chart.mark_line().encode(
    y=alt.Y("deaths:Q", title="Deaths"),
)

# Combined into a chart
(cases & deaths).properties(title="Cumulative totals by county")

The next cell takes this line chart and adds new elements for interactivity.
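* a selector was added with `selopac = ...` that lets you select a county's line, either by interacting with the line itself or with the county label in the legend, depending on the option
  * four options are included (one active, three commented out) so you can see different effects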
* `how_to_select` gets added to the chart title so you know what to look for
* In the specification for `chart`:
  * `county_df` was changed to `county_df.loc[county_df['confirmed_cases']>250]` to make fewer lines show up on the plot (only rows where a county's cumulative count is over 250 are kept, so counties that never passed 250 confirmed cases drop out)
  * we removed `legend=None` from the color so as to show the legend
  * `opacity=alt.condition(selopac, alt.value(1), alt.value(0.1))` was added to change the line opacity based on county selection
  * `.add_selection(selopac)` was added to associate the selector with the chart
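Note that the `county_df.loc[...]` mask filters rows, so each remaining line starts on the first day its county exceeded 250 cases. If you would rather keep the full history of every county that ever passed the threshold, one option is a grouped `transform`. This is a sketch on invented data and an assumption about what you might want, not what the next cell does.

```python
import pandas as pd

# Invented data: county A passes 250 on its third day, county B never does
toy_county_df = pd.DataFrame({
    "county": ["A", "A", "A", "B", "B", "B"],
    "date": list(pd.date_range("2020-04-01", periods=3)) * 2,
    "confirmed_cases": [100, 240, 300, 5, 9, 12],
})

# Row filter, as in the notebook: keeps only the days above the threshold
rows_over = toy_county_df.loc[toy_county_df["confirmed_cases"] > 250]

# County filter: keep every row of any county whose peak is above the threshold
peak = toy_county_df.groupby("county")["confirmed_cases"].transform("max")
counties_over = toy_county_df.loc[peak > 250]

print(len(rows_over), len(counties_over))
```

Either way county `B` disappears from the plot; the difference is only whether county `A` keeps its early days.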

In [None]:
# UNCOMMENT ANOTHER PAIR (selopac + how_to_select) TO SEE OTHER EFFECTS
selopac = alt.selection_single(fields=['county'])
how_to_select = 'CLICK ON LINE TO SELECT'
# selopac = alt.selection_single(fields=['county'],bind='legend') 
# how_to_select = 'CLICK ON COUNTY IN LEGEND TO SELECT'
# selopac = alt.selection_single(on='mouseover',fields=['county'])
# how_to_select = 'MOVE MOUSE OVER LINE TO SELECT'
# selopac = alt.selection_single(on='mouseover',fields=['county'],bind='legend') 
# how_to_select = 'MOVE MOUSE OVER LINE TO SELECT LINE + LEGEND ENTRY'

# Create the base chart
chart = alt.Chart(county_df.loc[county_df['confirmed_cases']>250]).mark_line().encode(
    x=alt.X("date:T", title=None),
    color=alt.Color("county:N", title="County"),
    opacity=alt.condition(selopac, alt.value(1), alt.value(0.1))
).add_selection(selopac)

# The cases line
cases = chart.encode(
    y=alt.Y(
        "confirmed_cases:Q",
        title="Confirmed cases"
    ),
)

# The deaths line
deaths = chart.mark_line().encode(
    y=alt.Y("deaths:Q", title="Deaths"),
)

# Combined into a chart
(cases & deaths).properties(title="Cumulative totals by county, "+how_to_select)

## Plotly

Plotly Express is an easy way to make quick plots for inspection. After making the plot, try moving your mouse over it to see the tooltips, explore the small menu that appears in its corner, and zoom in on regions of interest.

In [None]:
import plotly.express as px

fig = px.bar(state_df, x="date", y="new_confirmed_cases")
fig.show()

### Add 7-day average line and some styling

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Bar(x=state_df["date"], y=state_df["new_confirmed_cases"],name='new confirmed cases'))
fig.add_trace(go.Scatter(x=state_df["date"], y=state_df["new_confirmed_cases_rolling_average"], name='7-day rolling avg'))

fig.update_xaxes(showline=True, linecolor='black', showgrid=True, gridwidth=1, gridcolor='LightBlue')
fig.update_yaxes(showline=True, linecolor='black', showgrid=True, gridwidth=1, gridcolor='LightBlue')
fig.update_layout(legend=dict(x=0.1, y=0.9, bordercolor="Black", borderwidth=2))
fig.update_layout(width=750, height=500, plot_bgcolor="White", title="New confirmed cases")

fig.show()

### Similar plot for counties, now combining the plotly plot with the ipywidgets library

First, add new-case and rolling-average columns to the county data, taking care to compute the diffs and averages within each county so that no value mixes data from two different counties.

In [None]:
# Group by county so that differences and averages never span two counties
county_df['new_confirmed_cases'] = county_df.groupby("county").confirmed_cases.diff()
county_df['new_confirmed_cases_rolling_average'] = county_df.groupby("county").new_confirmed_cases.transform(
    lambda s: s.rolling(7).mean()
)

Then, we use ipywidgets to select which county to plot:
* make a function to hold the plotting routine
* call ipywidgets.interact with the name of the created function and a parameter equal to the list of available counties

In [None]:
import plotly.graph_objects as go
import ipywidgets

def countyplotly(county2plot='Los Angeles'):
    dfhere = county_df.loc[county_df.county == county2plot]
    
    fig = go.Figure()
    
    fig.add_trace(go.Bar(x=dfhere["date"], y=dfhere["new_confirmed_cases"],name='new confirmed cases'))
    fig.add_trace(go.Scatter(x=dfhere["date"], y=dfhere["new_confirmed_cases_rolling_average"], name='7-day rolling avg'))

    fig.update_xaxes(showline=True, linecolor='black', showgrid=True, gridwidth=1, gridcolor='LightBlue')
    fig.update_yaxes(showline=True, linecolor='black', showgrid=True, gridwidth=1, gridcolor='LightBlue')
    fig.update_layout(legend=dict(x=0.1, y=0.9, bordercolor="Black", borderwidth=2))
    fig.update_layout(width=750, height=500, plot_bgcolor="White", title="New confirmed cases")

    fig.show()

ipywidgets.interact(countyplotly,county2plot=sorted(county_df.county.unique()));