# COVID-19 Case Prediction
This notebook builds models to predict COVID-19 cases in 313 different areas of the world, using GluonTS models.

The dataset comes from Kaggle (https://www.kaggle.com/c/covid19-global-forecasting-week-5). Download it and put all the CSV files in a folder called "covid19-global-forecasting-week-5" in the same directory as this notebook.

**NOTE: this notebook is for illustration purposes only, it has not been reviewed by epidemiological experts, and we do not claim that accurate epidemiological predictions can be made with the code that follows.**

In [None]:
%matplotlib inline
import mxnet as mx
from mxnet import gluon
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
import os
from tqdm.autonotebook import tqdm
from pathlib import Path

In [None]:
prediction_length = 20

## Load data and preprocessing
We first load the data from the CSV files. The raw data is in long format, with one row per place, date, and target, whereas GluonTS models expect one time series per item. We therefore preprocess the data into a new dataframe in which each row holds the time series for one place.

In [None]:
total = pd.read_csv("./covid19-global-forecasting-week-5/train.csv", index_col=False)
test = pd.read_csv("./covid19-global-forecasting-week-5/test.csv", index_col=False)

In [None]:
total = total[total["Target"]=="ConfirmedCases"]
test = test[test["Target"]=="ConfirmedCases"]

In [None]:
total.head()

In [None]:
total = total.fillna("")
total["name"] = total["Country_Region"] + "_" + total["Province_State"] + "_" + total["County"]
total.head()

In [None]:
country_list = sorted(list(set(total["name"])))
date_list = sorted(list(set(total["Date"])))
data_dic = {"name": country_list}

for date in date_list:
    tmp = total[total["Date"]==date][["name", "Date", "TargetValue"]]
    tmp = tmp.pivot(index="name", columns="Date", values="TargetValue")
    tmp_values = tmp[date].values
    data_dic[date] = tmp_values
new_df = pd.DataFrame(data_dic)
new_df.head()
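
The per-date loop above assembles the wide table one column at a time. As a minimal sketch on toy data (the area names and values below are made up), a single `pivot` call gives the same layout, assuming every (name, date) pair occurs exactly once:

```python
import pandas as pd

# Toy long-format frame; names and values are hypothetical.
long_df = pd.DataFrame({
    "name": ["A__", "A__", "B__", "B__"],
    "Date": ["2020-01-23", "2020-01-24", "2020-01-23", "2020-01-24"],
    "TargetValue": [1, 2, 0, 5],
})

# One row per area, one column per date.
wide_df = long_df.pivot(index="name", columns="Date", values="TargetValue")
wide_df = wide_df.sort_index(axis=0).sort_index(axis=1)
print(wide_df)
```

`pivot` raises an error on duplicate (name, date) pairs, which also serves as a quick sanity check on the input.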

The original data contains daily confirmed cases; we transform it into cumulative counts.

In [None]:
total_values = new_df.drop("name", axis=1).values
row, col = total_values.shape
for i in range(row):
    tmp = total_values[i]
    for j in range(col):
        if j > 0:
            tmp[j] = tmp[j] + tmp[j - 1]
    total_values[i] = tmp
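
The nested loop above is a running sum along each row. A minimal sketch of the same operation with `np.cumsum`, on a made-up array of daily counts:

```python
import numpy as np

# Toy daily counts for two hypothetical areas (rows) over four days (columns).
daily = np.array([[1, 2, 0, 3],
                  [0, 0, 5, 1]])

# Running total along the date axis.
cumulative = np.cumsum(daily, axis=1)
print(cumulative)
```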

Extract the population and weight features and apply min-max scaling to them. Then divide the places into three categories according to their scaled weight.

In [None]:
feature_dic_population = {}
feature_dic_weight = {}
for date in date_list:
    tmp = total[total["Date"]==date][["Date", "name", "Population", "Weight"]]
    population = tmp.pivot(index="name", columns="Date", values="Population")
    weight = tmp.pivot(index="name", columns="Date", values="Weight")
    feature_dic_population[date] = population[date].values
    feature_dic_weight[date] = weight[date].values
feature_df_population = pd.DataFrame(feature_dic_population)
feature_df_weight = pd.DataFrame(feature_dic_weight)
# feature_df_population.head()

In [None]:
populations = []
weights = []
for i in range(feature_df_population.shape[0]):
    populations.append(feature_df_population.values[i][0])
    weights.append(feature_df_weight.values[i][0])

In [None]:
def min_max_scale(lst):
    minimum = min(lst)
    maximum = max(lst)
    new = []
    for i in range(len(lst)):
        new.append((lst[i] - minimum) / (maximum - minimum))
    return new
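
Min-max scaling divides by `maximum - minimum`, which is zero when all values are equal. The following is a hedged sketch of a guarded variant; the name `min_max_scale_safe` and the choice to return zeros in that case are assumptions for illustration, not part of the original notebook:

```python
def min_max_scale_safe(values):
    """Scale values to [0, 1]; return all zeros when every value is equal."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant input: no spread to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


print(min_max_scale_safe([10, 20, 40]))
print(min_max_scale_safe([5, 5, 5]))
```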

In [None]:
scaled_populations = min_max_scale(populations)
scaled_weights = min_max_scale(weights)
stat_real_features = []
stat_cat_features = []
for i in range(len(scaled_weights)):
    # Category labels are zero-based so that they stay within
    # cat_cardinality = [3] if use_feat_static_cat is enabled.
    if 0 <= scaled_weights[i] <= 0.33:
        # place with a low scaled weight
        stat_cat_features.append([0])
    elif 0.33 < scaled_weights[i] <= 0.67:
        # place with a medium scaled weight
        stat_cat_features.append([1])
    else:
        # place with a high scaled weight
        stat_cat_features.append([2])
    stat_real_features.append([scaled_weights[i]])
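
The three-way split on the thresholds 0.33 and 0.67 can also be written with `np.digitize`. This is a sketch on made-up scaled values; it returns zero-based bucket indices, so add 1 if one-based labels are wanted:

```python
import numpy as np

# Toy scaled weights in [0, 1]; the values are hypothetical.
scaled = np.array([0.0, 0.33, 0.5, 0.67, 0.9, 1.0])

# right=True puts a value equal to a threshold in the lower bucket,
# matching the "<= 0.33" and "<= 0.67" comparisons.
buckets = np.digitize(scaled, bins=[0.33, 0.67], right=True)
print(buckets)
```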

## Create training dataset and train the model

In [None]:
from gluonts.dataset.common import load_datasets, ListDataset
from gluonts.dataset.field_names import FieldName
from copy import copy

train_df = new_df.drop(["name"], axis=1)
test_target_values = total_values.copy()
train_target_values = [ts[:-prediction_length] for ts in total_values]
cat_cardinality = [3]

start_date = [pd.Timestamp("2020-01-23", freq='1D') for _ in range(len(new_df))]
train_ds = ListDataset([
    {
        FieldName.TARGET: target,
        FieldName.START: start,
        FieldName.FEAT_STATIC_REAL: static_real,
        FieldName.FEAT_STATIC_CAT: static_cat
    }
    for (target, start, static_real,  static_cat) in zip(train_target_values,
                                         start_date,
                                         stat_real_features,
                                        stat_cat_features)
], freq="D")

test_ds = ListDataset([
    {
        FieldName.TARGET: target,
        FieldName.START: start,
        FieldName.FEAT_STATIC_REAL: static_real,
        FieldName.FEAT_STATIC_CAT: static_cat
    }
    for (target, start, static_real,  static_cat) in zip(test_target_values,
                                         start_date,
                                        stat_real_features, 
                                        stat_cat_features)
], freq="D")
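
The training targets drop the last `prediction_length` points of every series, while the test targets keep the full series, so the held-out tail can be compared with the forecast. A minimal sketch of that split on toy data (the array and the shorter horizon are illustrative only):

```python
import numpy as np

toy_prediction_length = 3  # hypothetical horizon, shorter than the notebook's 20

# Two toy series of length 10.
series = np.arange(20).reshape(2, 10)

# Training part: everything except the last toy_prediction_length points.
train_part = [ts[:-toy_prediction_length] for ts in series]
# Held-out part: the points the model is asked to forecast.
held_out = [ts[-toy_prediction_length:] for ts in series]

print(train_part[0], held_out[0])
```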

In [None]:
next(iter(train_ds))

In [None]:
from gluonts.model.deepar import DeepAREstimator
from gluonts.distribution.neg_binomial import NegativeBinomialOutput
from gluonts.mx.trainer import Trainer

n = 50
estimator = DeepAREstimator(
    prediction_length=prediction_length,
    freq="D",
    distr_output = NegativeBinomialOutput(),
    use_feat_static_real=True,
#     use_feat_static_cat=True,
#     cardinality=cat_cardinality,
    trainer=Trainer(
        learning_rate=1e-5,
        epochs=n,
        num_batches_per_epoch=50,
        batch_size=32
    )
)

predictor = estimator.train(train_ds)

## Evaluate the model

In [None]:
from gluonts.evaluation.backtest import make_evaluation_predictions

forecast_it, ts_it = make_evaluation_predictions(
    dataset=test_ds,
    predictor=predictor,
    num_samples=100
)

print("Obtaining time series conditioning values ...")
tss = list(tqdm(ts_it, total=len(test_ds)))
print("Obtaining time series predictions ...")
forecasts = list(tqdm(forecast_it, total=len(test_ds)))

In [None]:
from gluonts.evaluation import Evaluator


class CustomEvaluator(Evaluator):

    def get_metrics_per_ts(self, time_series, forecast):
        values = time_series.values.reshape(len(time_series))
        # Scale: mean squared one-step difference over the training part only.
        successive_diff = np.diff(values) ** 2
        successive_diff = successive_diff[:-prediction_length]
        denom = np.mean(successive_diff)
        # Point forecast: mean over the sample paths.
        pred_values = forecast.samples.mean(axis=0)
        true_values = values[-prediction_length:]
        num = np.mean((pred_values - true_values) ** 2)
        # Root mean squared scaled error.
        rmsse = np.sqrt(num / denom)
        metrics = super().get_metrics_per_ts(time_series, forecast)
        metrics["RMSSE"] = rmsse
        return metrics

    def get_aggregate_metrics(self, metric_per_ts):
        # Unweighted mean of the per-series RMSSE values.
        mean_rmsse = metric_per_ts["RMSSE"].mean()
        agg_metric, _ = super().get_aggregate_metrics(metric_per_ts)
        agg_metric["MRMSSE"] = mean_rmsse
        return agg_metric, metric_per_ts


evaluator = CustomEvaluator(quantiles=[0.5, 0.67, 0.95, 0.99])
agg_metrics, item_metrics = evaluator(iter(tss), iter(forecasts), num_series=len(test_ds))
print(json.dumps(agg_metrics, indent=4))
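
For reference, here is a standalone sketch of the RMSSE as it is commonly defined: the mean squared forecast error divided by the mean squared one-step difference of the history, followed by a square root. The function name and the toy numbers are assumptions for illustration:

```python
import numpy as np


def rmsse(history, actual, predicted):
    """Root mean squared scaled error of a forecast against held-out values."""
    history = np.asarray(history, dtype=float)
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Scale: mean squared one-step change in the history.
    scale = np.mean(np.diff(history) ** 2)
    # Mean squared error over the forecast horizon.
    mse = np.mean((actual - predicted) ** 2)
    return float(np.sqrt(mse / scale))


# Toy data: history differences are 1, 2, 3; forecast errors are 1 and -2.
score = rmsse(history=[1, 2, 4, 7], actual=[9, 12], predicted=[8, 14])
print(score)
```

A value below 1 means the forecast error is smaller than the typical day-to-day change in the history.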

## Plot graphs for the results

In [None]:
plot_log_path = "./plots/"
directory = os.path.dirname(plot_log_path)
if not os.path.exists(directory):
    os.makedirs(directory)
    

def plot_prob_forecasts(ts_entry, forecast_entry, path, sample_id, inline=True):
    plot_length = 150
    prediction_intervals = (50, 67, 95, 99)
    legend = ["observations", "median prediction"] + [f"{k}% prediction interval" for k in prediction_intervals][::-1]

    _, ax = plt.subplots(1, 1, figsize=(10, 7))
    ts_entry[-plot_length:].plot(ax=ax)
    forecast_entry.plot(prediction_intervals=prediction_intervals, color='g')
    ax.axvline(ts_entry.index[-prediction_length], color='r')
    plt.legend(legend, loc="upper left")
    if inline:
        plt.show()
        plt.clf()
    else:
        plt.savefig('{}forecast_{}.pdf'.format(path, sample_id))
        plt.close()

print("Plotting time series predictions ...")
for i in tqdm(range(5)):
    ts_entry = tss[i]
    forecast_entry = forecasts[i]
    plot_prob_forecasts(ts_entry, forecast_entry, plot_log_path, i)

## Comments
The results look reasonable, but there is still much room for improvement. The main problem is that the Kaggle data contains only a few features, which limits how precise the model can be. The current model is very close to a baseline because it uses only one extra feature, the scaled weight. The next step is to find additional data, on Kaggle or elsewhere, to improve the model.