# Macro Predictions

This notebook generates the LSTM predictions for the visualizations in the `Prediction_Visualization` notebook. We iterate over a list of the US states (by abbreviation), and then over a list of the countries detailed in the JHU data.

We start by importing all of the libraries we're going to need (predominantly pandas and the data loading / inference libraries in this repo).

In [1]:
# basic python libs
import os, wget, json

# deep learning framework tools
import torch
from torch import nn, optim

# data analysis libs
import numpy as np, pandas as pd

# visualization libs
import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns
from pylab import rcParams
from pandas.plotting import register_matplotlib_converters

# ML pre-processing
from sklearn.preprocessing import MinMaxScaler

We're also going to need the predictor and data loading classes in this repo.

In [2]:
from ML.lstm_torch import LSTM_data_loader, LSTM_Predictor, train_lstm, predict_future

import warnings
warnings.filterwarnings('ignore')

In [3]:
%matplotlib inline

In [4]:
sns.set(style='whitegrid', palette='muted', font_scale=1.6)
sns.set_palette(sns.color_palette("husl", 8))

rcParams['figure.figsize'] = 16, 12
register_matplotlib_converters()

In [5]:
# set our random seed
RANDOM_SEED = 26
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

<torch._C.Generator at 0x1242c37f0>

## Data Context and Versions

We have two data frames that we can use for these analyses. For the country-level data, we can use either the most recent data available or a March 22nd snapshot, which is the one we use for the US state-level analysis.

The pros and cons are fairly straightforward. The newer data provides more and better information, because the earliest recorded dates are fairly sparse. We can see this in the lower training and testing Mean Squared Errors recorded for our predictions. That said, state-level information was dropped from the data after March 22nd, so training models at that level of granularity is no longer possible unless we use deprecated data. For this analysis, and for the analysis in the visual, we use the March 22nd snapshot. If you wish to do otherwise, though, switch out `./data/jhu_data/time_series_19-covid-Confirmed_3_22.csv` for `time_series_covid19_confirmed_global.csv`.

In [6]:
if os.path.isfile('time_series_covid19_confirmed_global.csv'):
    os.remove('time_series_covid19_confirmed_global.csv')
    
wget.download('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')

'time_series_covid19_confirmed_global.csv'

In [7]:
tdf = pd.read_csv('./data/jhu_data/time_series_19-covid-Confirmed_3_22.csv')
tdf.drop(columns='Unnamed: 0', inplace=True)

In [8]:
tdf.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/13/20,3/14/20,3/15/20,3/16/20,3/17/20,3/18/20,3/19/20,3/20/20,3/21/20,3/22/20
0,,Thailand,15.0,101.0,2,3,5,7,8,8,...,75,82,114,147,177,212,272,322,411,599
1,,Japan,36.0,138.0,2,1,2,2,4,4,...,701,773,839,825,878,889,924,963,1007,1086
2,,Singapore,1.2833,103.8333,0,1,3,3,4,5,...,200,212,226,243,266,313,345,385,432,455
3,,Nepal,28.1667,84.25,0,0,0,1,1,1,...,1,1,1,1,1,1,1,1,1,2
4,,Malaysia,2.5,112.5,0,0,0,3,4,4,...,197,238,428,566,673,790,900,1030,1183,1306


Load in our state geojson data to make our mapping dictionary.

In [9]:
with open('./data/geo_data/us-states.json', 'r') as f:
    us_states = json.load(f)
    
state_abrs = [x['id'] for x in us_states['features']]

# one single-entry dict per state, e.g. {'New York': 'NY'}
state_mapper_lst = [{x['properties']['name']: x['id']} for x in us_states['features']]

# merge them into a single name -> abbreviation lookup
state_mapper = {}
for s in state_mapper_lst:
    state_mapper.update(s)
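
A more compact equivalent is a single dict comprehension. The sketch below runs on a small hand-written stand-in for the GeoJSON (`sample_geojson` and `sample_mapper` are hypothetical names, and the two features are only illustrative); it assumes each feature carries an `id` and a `properties.name`, as the cell above implies.

```python
# Hypothetical two-feature stand-in for ./data/geo_data/us-states.json
sample_geojson = {
    "features": [
        {"id": "NY", "properties": {"name": "New York"}},
        {"id": "CA", "properties": {"name": "California"}},
    ]
}

# Map full state name -> abbreviation in a single pass
sample_mapper = {
    feature["properties"]["name"]: feature["id"]
    for feature in sample_geojson["features"]
}

print(sample_mapper)  # {'New York': 'NY', 'California': 'CA'}
```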

As an example, we can run the full pipeline for New York using its `NY` abbreviation. This run trains on daily new cases (`delta=True`).

In [10]:
state_data_loader = LSTM_data_loader(df=tdf,
                                       region_abr='NY',
                                       country='US',
                                       state_mapper=state_mapper)

state_data_loader.subset_df()
state_data_loader.transform_df_datetime(delta=True)

state_data_loader.gen_data_sets(test_data_size=0)

X_train, y_train = state_data_loader.set_seq(train=True, sequence_length=3)
X_train = torch.from_numpy(X_train).float()
y_train = torch.from_numpy(y_train).float()

model = LSTM_Predictor(features=1,
                       neurons=512,
                       sequences=3,
                       layers=2,
                       dropout=0.0)

model, _, _ = train_lstm(model,
                         X_train,
                         y_train,
                         epochs=300)

seq_length = model.sequences
days_to_predict = 10

outs = predict_future(n_future=days_to_predict, 
                      time_data=X_train, 
                      sequece_lenth=model.sequences, 
                      model=model)

predicted_cases = state_data_loader.scaler.inverse_transform(
  np.expand_dims(outs, axis=0)
).flatten()
print([int(x) for x in predicted_cases.tolist()])

Data is converted to daily delta
Epoch 0 train loss: 1.6421136856079102
Epoch 50 train loss: 0.8585584759712219
Epoch 100 train loss: 0.48346182703971863
Epoch 150 train loss: 0.48300594091415405
Epoch 200 train loss: 0.4827078580856323
Epoch 250 train loss: 0.482661634683609
[3003, 3665, 4670, 6008, 7720, 9850, 12416, 15372, 18580, 21824]
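
To make the data preparation above more concrete, here is a minimal sketch of what a daily-delta conversion and a sequence length of 3 could look like. The helpers `daily_delta` and `make_windows` are hypothetical and written for illustration only; they are an assumption about the general technique, not the code inside `LSTM_data_loader`, which may handle the first value, scaling, and array shapes differently.

```python
def daily_delta(cumulative):
    """Convert cumulative counts to day-over-day differences."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def make_windows(series, sequence_length):
    """Pair each run of `sequence_length` values with the value that follows it."""
    X, y = [], []
    for i in range(len(series) - sequence_length):
        X.append(series[i:i + sequence_length])
        y.append(series[i + sequence_length])
    return X, y


# Last six cumulative counts from the Thailand row shown in tdf.head()
cumulative = [177, 212, 272, 322, 411, 599]

deltas = daily_delta(cumulative)
X, y = make_windows(deltas, sequence_length=3)

print(deltas)  # [35, 60, 50, 89, 188]
print(X)       # [[35, 60, 50], [60, 50, 89]]
print(y)       # [89, 188]
```

Each row of `X` is a three-day input sequence and the matching entry of `y` is the next day's value the model learns to predict.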


## Saving Data For Viz

The for loop below repeats the process we just ran for New York for every US state. Note that the loader is called with `delta=False` here, so these models are trained on total confirmed cases rather than daily new cases. We use each trained model to predict the next ten days and write the predictions to a JSON file. These results are then visualized in the `Prediction_Visualization` notebook.
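
A common way to predict several days ahead with a model that outputs one step at a time is to feed each prediction back in as the newest input. The sketch below illustrates that idea with a toy stand-in for the trained model. `forecast_recursive` and `linear_step` are hypothetical names, and this is an assumption about the general approach rather than a description of how `predict_future` is implemented in this repo.

```python
def forecast_recursive(step_fn, history, sequence_length, n_future):
    """Roll a one-step predictor forward by feeding its outputs back in."""
    window = list(history[-sequence_length:])
    predictions = []
    for _ in range(n_future):
        next_value = step_fn(window)
        predictions.append(next_value)
        # drop the oldest value and append the newest prediction
        window = window[1:] + [next_value]
    return predictions


def linear_step(window):
    """Toy stand-in for a trained model: continue the most recent trend."""
    return 2 * window[-1] - window[-2]


toy_forecast = forecast_recursive(linear_step, [1, 2, 4],
                                  sequence_length=3, n_future=3)
print(toy_forecast)  # [6, 8, 10]
```

Because every step builds on earlier predictions, errors can compound over the forecast horizon, which is worth keeping in mind when reading ten-day forecasts.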

In [11]:
data_saver_nodelta = {}

counter = 1

for state in state_abrs:
    
    print('{}: {} out of {}'.format(state, counter, len(state_abrs)))
    
    state_data_loader = LSTM_data_loader(df=tdf,
                                           region_abr=state,
                                           country='US',
                                           state_mapper=state_mapper)
    
    state_data_loader.subset_df()

    state_data_loader.transform_df_datetime(delta=False)

    state_data_loader.gen_data_sets(test_data_size=0)
    
    X_train, y_train = state_data_loader.set_seq(train=True, sequence_length=3)
    X_train = torch.from_numpy(X_train).float()
    y_train = torch.from_numpy(y_train).float()
    
    model = LSTM_Predictor(features=1,
                           neurons=512,
                           sequences=3,
                           layers=2,
                           dropout=0.3)

    model, train_hist, test_hist = train_lstm(model,
                                              X_train,
                                              y_train,
                                              epochs=300)
    
    seq_length = model.sequences
    days_to_predict = 10

    outs = predict_future(n_future=days_to_predict, 
                          time_data=X_train, 
                          sequece_lenth=model.sequences, 
                          model=model)

    predicted_cases = state_data_loader.scaler.inverse_transform(
      np.expand_dims(outs, axis=0)
    ).flatten()
    print(predicted_cases)
    
    data_saver_nodelta[state] = predicted_cases
    counter+=1
    
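    # rewrite the JSON file after every state so partial results are kept
    data_try = {key: value.tolist() for key, value in data_saver_nodelta.items()}

    # dumps() followed by dump() stores the dict as a JSON-encoded string
    for_writing = json.dumps(data_try)
    
    with open('states_predictions.json', 'w') as fp:
        json.dump(for_writing, fp)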
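
Because the loop calls `json.dumps` and then passes the resulting string to `json.dump`, the file holds a JSON-encoded string rather than a JSON object, so it has to be decoded twice when read back. The sketch below shows the round trip on a temporary file; the `predictions` values and the file name are made up for illustration.

```python
import json
import os
import tempfile

# Made-up values, only to illustrate the file format
predictions = {"NY": [3003.0, 3665.0], "CA": [120.0, 150.0]}

path = os.path.join(tempfile.mkdtemp(), "example_predictions.json")

# Same two-step write as the loop above
with open(path, "w") as fp:
    json.dump(json.dumps(predictions), fp)

with open(path) as fp:
    first_pass = json.load(fp)       # a str containing JSON text
restored = json.loads(first_pass)    # the original dict

print(type(first_pass).__name__)  # str
print(restored == predictions)    # True
```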

## Country Predictions

We then repeat the same process, but instead of processing 50 states, we process each country recorded in the JHU data. Additionally, this model is trained on new daily cases, as opposed to the total confirmed cases in our data.

In [12]:
country_list = tdf['Country/Region'].tolist()
print(len(country_list))
country_list = list(dict.fromkeys(country_list))
print(len(country_list))

501
183
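
The row count drops from 501 to 183 because a country appears once per `Province/State` row. `list(dict.fromkeys(...))` removes the repeats while keeping first-seen order, which a plain `set` would not guarantee. A small illustration with made-up input (`regions` and `unique_regions` are hypothetical names):

```python
# Made-up list with repeated entries, standing in for tdf['Country/Region']
regions = ["China", "US", "China", "Australia", "US"]

# dict keys are unique and preserve insertion order (Python 3.7+)
unique_regions = list(dict.fromkeys(regions))

print(unique_regions)  # ['China', 'US', 'Australia']
```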


In [13]:
data_saver = {}

counter = 1

for country in country_list:
    
    print('{}: {} out of {}'.format(country, counter, len(country_list)))
    
    country_data_loader = LSTM_data_loader(df=tdf,
                                           region_abr=None,
                                           country=country,
                                           state_mapper=None)
    
    country_data_loader.subset_df()

    # train on daily new cases, as described above
    country_data_loader.transform_df_datetime(delta=True)
    try:
        country_data_loader.gen_data_sets(test_data_size=0)

        X_train, y_train = country_data_loader.set_seq(train=True, sequence_length=3)
        X_train = torch.from_numpy(X_train).float()
        y_train = torch.from_numpy(y_train).float()

        model = LSTM_Predictor(features=1,
                               neurons=512,
                               sequences=3,
                               layers=2,
                               dropout=0.3)

        model, _, _ = train_lstm(model,
                                 X_train,
                                 y_train,
                                 epochs=300)

        seq_length = model.sequences
        days_to_predict = 10

        outs = predict_future(n_future=days_to_predict, 
                              time_data=X_train, 
                              sequece_lenth=model.sequences, 
                              model=model)

        predicted_cases = country_data_loader.scaler.inverse_transform(
          np.expand_dims(outs, axis=0)
        ).flatten()
        print(predicted_cases)

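        data_saver[country] = predicted_cases

        # rewrite the JSON file after every country so partial results are kept
        data_try = {key: value.tolist() for key, value in data_saver.items()}

        # dumps() followed by dump() stores the dict as a JSON-encoded string
        for_writing = json.dumps(data_try)

        with open('country_predictions_delta.json', 'w') as fp:
            json.dump(for_writing, fp)
            
    except Exception as e:
        # skip countries whose series cannot be processed
        print(e)
    finally:
        # advance the progress counter whether or not this country succeeded
        counter += 1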