# Visualize various metrics

This notebook shows how different model metrics can be used to identify the effects of events such as stay-at-home orders and outlier death counts.

Make sure to run batch model fitting to generate `data/metro_areas.csv` before running this notebook.

```
python fit_models.py --specfile=metro_areas
```

In [None]:
import itertools
import numpy as np
import pandas as pd
from datetime import datetime, timezone, timedelta
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from scipy import optimize
import statsmodels.api as sm
import os
import pickle
import requests

from modeling import dataproc, sir_model

# Utility functions

In [None]:
def plot_sir_model(r, i, total_model_days, df, metric, sampling_rate, name):
    """Plot the model's daily rate and cumulative total for a metric vs actual data.
    
    Args:
        r: Array holding daily recovered population values from SIR model
        i: Array holding daily infected population values from SIR model
        total_model_days: Total number of modeled days to plot
        df: Dataframe holding metric values.
        metric: The type of metric to plot ('Cases' or 'Deaths')
        sampling_rate: Number of samples per day used to simulate the model.
        name: A name to attach to the plot.
    """
    plot_start_time = df['Date'].min().timestamp()
    plot_step_size = 24 * 60 * 60 / sampling_rate
    plot_end_time = plot_start_time + total_model_days * 24 * 60 * 60 
    plot_timestamps = np.arange(plot_start_time, plot_end_time, plot_step_size)
    plot_dates = [datetime.utcfromtimestamp(x) for x in plot_timestamps]
    print('peak date', plot_dates[np.argmax(i)])
    # Plot the daily rate of the metric: model vs. actual
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.ticklabel_format(useOffset=False)
    ax.ticklabel_format(style='plain')
    ax.plot(plot_dates[:-sampling_rate],
            (r[sampling_rate:] - r[:-sampling_rate]),
            c='g',
            label='model ' + metric + ' rate',
            linewidth=4)
    ax.plot(df['Date'].to_list()[:-1],
            (df[metric] - df[metric].shift())[1:],
            c='r',
            label='actual ' + metric + ' rate',
            linewidth=4)
    ax.set_title('SIR model for ' + name)
    ax.set_xlabel('Date')
    ax.set_ylabel('Number of individuals')
    plt.legend()
    plt.show()
    
    # Plot the cumulative metric: model vs. actual
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.ticklabel_format(useOffset=False)
    ax.ticklabel_format(style='plain')
    ax.plot(plot_dates, r, c='g',
            label='model ' + metric, linewidth=4)
    ax.plot(df['Date'].to_list(), df[metric], label='actual ' + metric, c='r', linewidth=4)
    ax.set_title('SIR model for ' + name)
    ax.set_xlabel('Date')
    ax.set_ylabel('Number of individuals')
    plt.legend()
    plt.show()
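
The model rate curve in `plot_sir_model` is a one-day difference of the cumulative array `r`, which holds `sampling_rate` samples per day. The sketch below isolates that step on made-up data; the helper name `daily_rate` and the sample values are assumptions for illustration and are not part of the `modeling` package.

```python
import numpy as np


def daily_rate(cumulative, sampling_rate):
    """Return the change over one day in a cumulative series.

    The series holds `sampling_rate` samples per day, so one day
    corresponds to a lag of `sampling_rate` samples.
    """
    cumulative = np.asarray(cumulative, dtype=float)
    return cumulative[sampling_rate:] - cumulative[:-sampling_rate]


# Made-up cumulative counts: 2 samples per day, growing by 5 per sample.
r_demo = np.arange(0, 35, 5)
rate_demo = daily_rate(r_demo, sampling_rate=2)
print(rate_demo)
```

The result is `sampling_rate` samples shorter than the input, which is why the plot pairs it with `plot_dates[:-sampling_rate]`.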

# Load Covid-19 and Census Data

In [None]:
datastore = dataproc.DataStore()

# Load model params dataframe

In [None]:
model_df = pd.read_csv('data/metro_areas.csv', index_col='Unnamed: 0', parse_dates=['Date'])
model_df

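In [None]:
DATE_OFFSET = 0  # Number of days to shift each curve along the date axis
area_names = model_df['Area'].unique()
plt.figure(figsize=(15, 8))
for area_name in area_names:
    r_np = model_df[model_df['Area'] == area_name][['Date', 'R']].to_numpy()
    plt.plot(r_np[:, 0] + timedelta(days=DATE_OFFSET), r_np[:, 1], linewidth=4, label=area_name)
# Dotted reference line at R = 1.0, drawn over the dates of the last area plotted
plt.plot(r_np[:, 0] + timedelta(days=DATE_OFFSET), [1.0] * r_np.shape[0], linewidth=4, linestyle=':')
plt.title('R values')
plt.legend()

In [None]:
# Shifted dates for the last area plotted (r_np and DATE_OFFSET come from the previous cell)
r_np[:, 0] + timedelta(days=DATE_OFFSET)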

In [None]:
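# Average each numeric column for one area; numeric_only skips the 'Area' and 'Date' columns
model_df[model_df['Area'] == 'New Orleans'].mean(numeric_only=True)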

## Sheltering in place effect on infection rate

How many people does an infected person infect per day? Models trained with new data may reveal a sudden change in this parameter following sheltering-in-place orders. For example, New York, Michigan, and Louisiana all implemented sheltering-in-place orders around 3/22-3/24, and the "infection rate" estimated from deaths suddenly dropped about a week later.
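
To make the role of this parameter concrete, here is a minimal sketch of a textbook SIR model stepped with forward Euler. It is not the implementation in `modeling.sir_model`; the function `simulate_sir`, its signature, and every parameter value are assumptions chosen for illustration. In the textbook model, the epidemic grows while the infection rate exceeds the recovery rate and decays once it falls below it.

```python
import numpy as np


def simulate_sir(population, infected0, infection_rate, recovery_rate,
                 days, sampling_rate=10):
    """Simulate a textbook SIR model with forward-Euler steps.

    infection_rate: new infections caused per infected person per day
                    (when nearly everyone is still susceptible).
    recovery_rate:  fraction of infected people leaving the infected pool per day.
    Returns arrays (s, i, r) with `sampling_rate` samples per day.
    """
    dt = 1.0 / sampling_rate
    steps = days * sampling_rate
    s = np.empty(steps + 1)
    i = np.empty(steps + 1)
    r = np.empty(steps + 1)
    s[0], i[0], r[0] = population - infected0, infected0, 0.0
    for t in range(steps):
        new_infections = infection_rate * s[t] * i[t] / population * dt
        new_recoveries = recovery_rate * i[t] * dt
        s[t + 1] = s[t] - new_infections
        i[t + 1] = i[t] + new_infections - new_recoveries
        r[t + 1] = r[t] + new_recoveries
    return s, i, r


# Hypothetical values: the same outbreak with a high and a low infection rate.
_, i_before, _ = simulate_sir(1_000_000, 10, infection_rate=0.5,
                              recovery_rate=0.2, days=120)
_, i_after, _ = simulate_sir(1_000_000, 10, infection_rate=0.15,
                             recovery_rate=0.2, days=120)
print('peak infected at infection rate 0.50:', round(i_before.max()))
print('peak infected at infection rate 0.15:', round(i_after.max()))
```

With the lower infection rate the infected count never rises above its starting value, which is the kind of change a sheltering-in-place order would be expected to produce in the fitted parameter.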

In [None]:
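plt.figure(figsize=(10, 8))

# Dates about 7 days after each area's sheltering-in-place order
# (no entry is recorded for New Orleans)
shelter_dates_7_days = {'NYC': '2020-03-29',
                        'Detroit': '2020-03-31'
                       }

for area in ['NYC', 'Detroit', 'New Orleans']:
    model_area_df = model_df[model_df['Area'] == area]
    line, = plt.plot(model_area_df['Date'], model_area_df['Infection Rate'], linewidth=5, label=area)
    # Mark the recorded date, when there is one, in the same color as the curve
    if area in shelter_dates_7_days:
        plt.axvline(pd.Timestamp(shelter_dates_7_days[area]), color=line.get_color(), linestyle=':')
plt.legend()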

## Anomaly detection

Can we use the model MSE to detect anomalies in the death rate?

Notice the sudden shift in prediction error around 4/15 for New Orleans and around 4/16 for Detroit and New York. Why?
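
One simple way to flag such a shift automatically is a trailing z-score: compare each day's MSE with the mean and standard deviation of the preceding days. The sketch below is a generic technique shown on made-up numbers, not the method used to produce `data/metro_areas.csv`; the helper `flag_mse_shifts`, the window length, and the threshold are all assumptions.

```python
import numpy as np
import pandas as pd


def flag_mse_shifts(mse, window=7, threshold=3.0):
    """Flag values more than `threshold` trailing standard deviations
    above the trailing mean of the previous `window` values."""
    mse = pd.Series(mse, dtype=float)
    # shift(1) so each value is compared only with earlier values
    trailing = mse.shift(1).rolling(window, min_periods=window)
    z_score = (mse - trailing.mean()) / trailing.std()
    return z_score > threshold


# Made-up MSE series: small oscillation, then a jump on the last 3 days.
dates = pd.date_range('2020-04-01', periods=18, freq='D')
values = np.where(np.arange(18) % 2 == 0, 99.0, 101.0)
values[15:] += 50
mse_demo = pd.Series(values, index=dates)

flags = flag_mse_shifts(mse_demo)
print(flags[flags].index.strftime('%Y-%m-%d').tolist())
```

Only the first day of the jump is flagged here, because the jump itself inflates the trailing standard deviation for the days that follow. That suits a "when did it change?" question, but a different statistic would be needed to mark the whole shifted period.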

In [None]:
for area in ['NYC', 'Detroit', 'New Orleans']:
    model_area_df = model_df[(model_df['Area'] == area) & (model_df['Date'] <= '2020-04-18')]
    plt.plot(model_area_df['Date'], model_area_df['MSE'], linewidth=5, label=area)
    plt.legend()
    plt.show()


In [None]:
# Check data for one such area and date range
model_df[(model_df['Area'] == 'New Orleans') & (model_df['Date'] >= '2020-04-12') & (model_df['Date'] <= '2020-04-18')]

## Example reports

A news search for these regions suggests that the shift comes from faulty data pulled from the NYC API, yet newspapers still relied on that data to report a [new record number of deaths](https://user-images.githubusercontent.com/47190785/79694498-8dfd6a80-823e-11ea-8810-3a305fe7d13f.png).