# Online model definition, data preparation and example

## Table of Contents
1. [Model definition](#model-def)
    * [Linear Regression](#linear-regression)
    * [Passive-aggressive Regressor](#pa-regressor)
2. [Upload model to chantilly](#chantilly-rest)
3. [Download GoogleCloudPlatform/covid-19-open-data](#download-dataset)
4. [Simulate data stream and view live dashboard for model evaluation](#simulate)
    * [Open Chantilly dashboard](#dashboard)
5. [Data preparation](#data-prep)
    * [Filter by state Bremen](#bremen-data)
    * [Merge vaccination data](#vaccination-data)
6. [Example predict, learn and measure metric including vaccination data](#example-vaccination)
    * [SNARIMAX model definition](#snarimax)
    * [Pick one feature x and label y](#pick-one)
    * [Define metrics for evaluation of the model](#metric-eval)
    * [Walkthrough prediction, metric update & training](#walkthrough)
    * [Full cycle of predict, metric update & train on vaccination data](#full-loop)
7. [Playground](#playground)

## 1. Model definition <a class="anchor" id="model-def"></a>

creme is used to define our online machine learning models, which learn from and predict on one sample at a time.

More information about creme can be found here: https://github.com/MaxHalford/creme

##### Pipeline definition
1. To include the date in the linear regression, we first transform it into a Unix timestamp with compose.FuncTransformer.
2. Then compose.Select specifies which input features the model is trained on.
3. To standardize the input values, preprocessing.StandardScaler scales each feature by subtracting its mean and dividing by its standard deviation, so that the distribution has a mean of zero and a standard deviation of one.
4. As the last step, the desired model is defined.
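
Because an online model only ever sees one sample at a time, the mean and standard deviation used in step 3 have to be maintained incrementally. The following is a minimal sketch of that idea using Welford's algorithm. The class `RunningStandardScaler` is hypothetical and written only for illustration; it is not creme's implementation of preprocessing.StandardScaler.

```python
import math


class RunningStandardScaler:
    """Standardizes one numeric feature using statistics updated per sample."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's algorithm: update mean and squared deviations in one pass.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        # Population standard deviation of the samples seen so far.
        std = math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0
        # Avoid dividing by zero while the feature has shown no variation yet.
        return (x - self.mean) / std if std > 0 else 0.0


scaler = RunningStandardScaler()
for value in [2.0, 4.0, 6.0]:
    scaler.update(value)

scaled = scaler.transform(6.0)
print(scaler.mean, scaled)
```

No pass over the full dataset is needed: each new value only adjusts the stored statistics.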

Choose one of the following models for uploading to chantilly by executing the cell or define your own model with creme.

### Linear Regression <a class="anchor" id="linear-regression"></a>
Linear regression is a basic linear model, which assumes a linear relationship between the input variables and the output variable.

In [None]:
from creme import compose
from creme import linear_model
from creme import preprocessing


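def parse(row):
    # Imported inside the function so that the import is available
    # when the pickled model is loaded by chantilly
    import datetime as dt
    row['date'] = dt.datetime.fromisoformat(row['date']).timestamp()
    return row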

model = compose.FuncTransformer(parse) \
    | compose.Select('date', 'new_deceased', 'new_recovered',
                     'cumulative_confirmed', 'cumulative_deceased',
                     'cumulative_recovered') \
    | preprocessing.StandardScaler() \
    | linear_model.LinearRegression()
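
To make "learning from one sample at a time" concrete, here is a small sketch of a linear model that is updated with one gradient step per sample. The class `OnlineLinearRegression`, its learning rate, and the toy data are assumptions made for illustration; creme's linear_model.LinearRegression is more elaborate (configurable optimizers, regularization, and so on).

```python
class OnlineLinearRegression:
    """Linear model y ~ w * x + b, updated by one SGD step per sample."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}   # one weight per feature name, created lazily
        self.bias = 0.0

    def predict_one(self, x):
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # Gradient step on the squared error of this single sample.
        error = y - self.predict_one(x)
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) + self.learning_rate * error * v
        self.bias += self.learning_rate * error


sketch = OnlineLinearRegression()
# Stream noiseless samples of y = 2 * x + 1, one at a time.
for _ in range(500):
    for i in range(11):
        x = {"feature": i / 10}
        sketch.learn_one(x, 2 * x["feature"] + 1)

prediction = sketch.predict_one({"feature": 0.5})
print(round(prediction, 3))
```

The features in this toy stream lie between 0 and 1; with unscaled inputs the same learning rate could diverge, which is one reason the pipeline standardizes its inputs first.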


### Passive-aggressive Regressor <a class="anchor" id="pa-regressor"></a>
The passive-aggressive regressor is a model that aims to learn from large-scale data. Its name describes its update rule:

* Passive: If the prediction is correct, keep the model and do not make any changes.
* Aggressive: If the prediction is incorrect, make changes to the model.
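
For regression, "correct" usually means "within a small tolerance". The sketch below shows one update step of the PA-I variant as it is commonly described in the literature. The function name `pa_regression_step` and the values chosen for `C` (maximum step size) and `eps` (tolerance) are assumptions for illustration, not a description of creme's PARegressor internals or defaults.

```python
def pa_regression_step(weights, x, y, C=1.0, eps=0.1):
    """One passive-aggressive (PA-I) update for regression; returns new weights."""
    y_pred = sum(weights.get(k, 0.0) * v for k, v in x.items())
    # Epsilon-insensitive loss: errors smaller than eps count as "correct".
    loss = max(0.0, abs(y - y_pred) - eps)
    norm_sq = sum(v * v for v in x.values())
    if loss == 0.0 or norm_sq == 0.0:
        return dict(weights)  # passive: keep the model unchanged
    # Aggressive: move just far enough to fix the error, capped by C.
    tau = min(C, loss / norm_sq)
    sign = 1.0 if y > y_pred else -1.0
    updated = dict(weights)
    for k, v in x.items():
        updated[k] = updated.get(k, 0.0) + sign * tau * v
    return updated


w = {"a": 1.0}
passive = pa_regression_step(w, {"a": 1.0}, y=1.05)    # error 0.05 is within eps
aggressive = pa_regression_step(w, {"a": 1.0}, y=3.0)  # error 2.0 exceeds eps
print(passive, aggressive)
```

In the first call the weights stay untouched; in the second the step is capped by `C`, so the weight moves toward the target without jumping all the way.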

In [None]:
from creme import compose
from creme import linear_model
from creme import preprocessing

def parse(row):
    import datetime as dt
    row['date'] = dt.datetime.fromisoformat(row['date']).timestamp()
    return row

model = compose.FuncTransformer(parse) \
    | compose.Select('date', 'new_deceased', 'new_recovered',
                     'cumulative_confirmed', 'cumulative_deceased',
                     'cumulative_recovered') \
    | preprocessing.StandardScaler() \
    | linear_model.PARegressor()

## 2. Upload model to chantilly <a class="anchor" id="chantilly-rest"></a>

The model is uploaded to chantilly through its REST API. Afterwards it can be used for predictions or trained through REST requests. To serialize the payload for the request, dill is used to pickle the model.

More information about chantilly can be found here: https://github.com/online-ml/chantilly

In [None]:
import dill
import requests

requests.post('http://localhost:5000/api/model', data=dill.dumps(model))

## 3. Download GoogleCloudPlatform/covid-19-open-data <a class="anchor" id="download-dataset"></a>

In this tutorial the COVID-19 Open Data dataset is used. It attempts to assemble the largest COVID-19 epidemiological database.
More information about the dataset can be found here: https://github.com/GoogleCloudPlatform/covid-19-open-data

First we need to download the base epidemiology.csv, which lists the numbers for new cases, tests, recovered & deceased.
The schema definition can be found here: [Epidemiology Schema](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/docs/table-epidemiology.md)

In [None]:
!wget https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv

For the example inside this notebook we're also using the vaccination dataset, which lists the numbers of new vaccinations, among others. The schema definition can be found here: [Vaccinations Schema](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/docs/table-vaccinations.md)

In [None]:
!wget https://storage.googleapis.com/covid19-open-data/v3/vaccinations.csv

## 4. Simulate data stream and view live dashboard for model evaluation <a class="anchor" id="simulate"></a>

For simulating a data stream and evaluating different models the python script `simulate.py` was developed.

```
# python simulate.py --help
usage: simulate.py [-h] [--file [FILE]] [speed_up]

positional arguments:
  speed_up

optional arguments:
  -h, --help     show this help message and exit
  --file [FILE]
```

The script uses `epidemiology.csv` as the default file path to simulate a data stream for our specific case. You can also pass a custom file path with `--file`, for example `python simulate.py --file bremen_epidemiology.csv`, once you have continued with the notebook and created the dataset for the state of Bremen in Germany.
The script currently sleeps for 1 second between consecutive samples; this can be sped up with the positional `speed_up` argument when starting the script.

There are two ways to use this script: either run it in the cell below, which will flood the notebook with output, or start a new terminal in JupyterLab and execute the Python script there.

In [None]:
!python simulate.py
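
The source of `simulate.py` is not shown in this notebook, so the following is only a guess at its core idea: read the CSV row by row and pause between rows, with the pause shortened by a speed-up factor. The function `simulate_stream`, its signature, and the sample rows are hypothetical.

```python
import csv
import io
import time


def simulate_stream(csv_file, speed_up=1.0, sleep=time.sleep):
    """Yield CSV rows as dicts, pausing 1 / speed_up seconds between rows."""
    for row in csv.DictReader(csv_file):
        yield row
        sleep(1.0 / speed_up)


# Tiny in-memory stand-in for epidemiology.csv (values are made up).
sample = io.StringIO(
    "date,location_key,new_confirmed\n"
    "2021-01-01,DE_HB,10\n"
    "2021-01-02,DE_HB,12\n"
)

rows = list(simulate_stream(sample, speed_up=100))
print(rows[0]["date"], len(rows))
```

Each yielded row could then be sent to the model for a prediction and, once the true value is known, for training.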

### Open Chantilly dashboard <a class="anchor" id="dashboard"></a>

To evaluate the performance of the model, which was uploaded to chantilly, open [Chantilly Dashboard](http://localhost:5000).

## 5. Data preparation <a class="anchor" id="data-prep"></a>

In this section we parse and prepare the data so that it only contains the state of Bremen in Germany, and we clean up columns and rows that contain NaN values. For this we're using pandas, an open-source library that is widely used in the data science community.

More information can be found here: https://pandas.pydata.org

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('epidemiology.csv')

### Filter by state Bremen (DE_HB) <a class="anchor" id="bremen-data"></a>

The location key of a specific region is built from a combination of codes. See [GoogleCloudPlatform/covid-19-open-data](https://github.com/GoogleCloudPlatform/covid-19-open-data) for more information.

In [None]:
# Keep only the rows for Bremen; missing location keys compare as False
bremen_data = data[data.location_key == "DE_HB"]
# Drop the testing columns, which contain no data
bremen_data = bremen_data.drop(columns=["new_tested", "cumulative_tested"])

#### Quick look inside our dataframe

A quick and easy way to see what our dataframe looks like is head(), which returns the first five rows of the dataframe.

In [None]:
bremen_data.head()

An easy way to get a quick overview of the value we're trying to predict is to call describe() on the Series, which outputs basic statistical values.

In [None]:
bremen_data.new_confirmed.describe()

To export the dataframe to a CSV file, to_csv(path) can be used, which creates a new CSV file with the data from the dataframe.

In [None]:
bremen_data.to_csv("bremen_epidemiology.csv")

### Merge vaccination data <a class="anchor" id="vaccination-data"></a>

Here we parse the vaccination data and merge it with the epidemiology dataframe.

In [None]:
vacc_data = pd.read_csv('vaccinations.csv')

To merge the dataframes we need to select the unique identifiers shared by both dataframes, which are the date and the location_key. pandas provides a merge function, which takes the two dataframes and a list of key columns (`on`) that are used to match the rows.

In [None]:
merged_data = pd.merge(data, vacc_data, on=['date', 'location_key'])
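
Note that `pd.merge` performs an inner join by default, so rows without a matching key pair in both dataframes are dropped. The following small example illustrates the difference to a left join; the two miniature dataframes `epi` and `vacc` and their values are made up.

```python
import pandas as pd

# Made-up miniature versions of the two tables.
epi = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "location_key": ["DE_HB", "DE_HB", "DE_HB"],
    "new_confirmed": [10, 12, 9],
})
vacc = pd.DataFrame({
    "date": ["2021-01-02", "2021-01-03"],
    "location_key": ["DE_HB", "DE_HB"],
    "new_persons_vaccinated": [100, 150],
})

# Default how="inner": only key pairs present in both frames survive.
inner = pd.merge(epi, vacc, on=["date", "location_key"])
# how="left" keeps every epidemiology row and fills the gaps with NaN.
left = pd.merge(epi, vacc, on=["date", "location_key"], how="left")

print(len(inner), len(left))
```

For this notebook the inner join is convenient, because only dates with vaccination figures are of interest.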

Filter again for the state Bremen in Germany (DE_HB)

In [None]:
bremen_vacc_data = merged_data[merged_data.location_key == "DE_HB"]

Select the columns that we want to keep in our dataframe.

In [None]:
bremen_vacc_data = bremen_vacc_data[['date', 'location_key', 'new_confirmed', 'new_deceased',
       'new_recovered', 'cumulative_confirmed', 'cumulative_deceased',
       'cumulative_recovered', 'new_persons_vaccinated',
       'cumulative_persons_vaccinated', 'new_persons_fully_vaccinated',
       'cumulative_persons_fully_vaccinated', 'new_vaccine_doses_administered',
       'cumulative_vaccine_doses_administered']]
# Drop rows that contain NaN values
bremen_vacc_data = bremen_vacc_data.dropna()

Quick look inside the dataframe

In [None]:
bremen_vacc_data.head()

Describe can also be used on a dataframe, which will generate all the basic statistical values for each column

In [None]:
bremen_vacc_data.describe()

## 6. Example predict, learn and measure metric including vaccination data <a class="anchor" id="example-vaccination"></a>

Since the dataset including vaccinations for Bremen is currently rather small, this section only demonstrates what the basic workflow with creme looks like. One of the models we wanted to test on chantilly, SNARIMAX, had compatibility problems there, so we use it only in this example.

### SNARIMAX model definition <a class="anchor" id="snarimax"></a>

SNARIMAX stands for (S)easonal (N)on-linear (A)uto(R)egressive (I)ntegrated (M)oving-(A)verage with e(X)ogenous inputs model.

This model generalizes many established time series models in a single interface that can be trained online. It assumes that the provided training data is ordered in time and is uniformly spaced.

Documentation for the model can be found here: https://riverml.xyz/dev/api/time-series/SNARIMAX/

In [None]:
from creme import compose
from creme import linear_model
from creme import preprocessing
from creme import time_series

def parse(row):
    import datetime as dt
    row['date'] = dt.datetime.fromisoformat(row['date']).toordinal()
    return row

model = compose.FuncTransformer(parse) \
    | compose.Select('date', 'new_deceased', 'new_recovered', 'cumulative_confirmed',
                     'cumulative_deceased', 'cumulative_recovered', 'new_persons_vaccinated',
                     'cumulative_persons_vaccinated', 'new_persons_fully_vaccinated',
                     'cumulative_persons_fully_vaccinated', 'new_vaccine_doses_administered',
                     'cumulative_vaccine_doses_administered') \
    | time_series.SNARIMAX(p=1, d=1, q=14)

### Pick one feature x and label y <a class="anchor" id="pick-one"></a>

In [None]:
test_dict = bremen_vacc_data.to_dict(orient="records")[-1]
y = test_dict.pop("new_confirmed")

In [None]:
test_dict

In [None]:
y

### Define metrics for evaluation of the model <a class="anchor" id="metric-eval"></a>

SMAPE stands for (S)ymmetric (M)ean (A)bsolute (P)ercentage (E)rror. It expresses the absolute error as a percentage relative to the magnitudes of the ground truth and the prediction. This metric gives a good picture of how the model performs and learns over time.

In [None]:
from creme import metrics
smape = metrics.SMAPE()
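
To show what such a metric computes, here is a sketch of one common SMAPE definition, kept as a running mean so it can be updated sample by sample. The class `RunningSMAPE` is hypothetical; creme's metrics.SMAPE may differ in details such as scaling or the handling of zeros.

```python
class RunningSMAPE:
    """Running mean of 2 * |y_true - y_pred| / (|y_true| + |y_pred|), in percent."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, y_true, y_pred):
        denominator = abs(y_true) + abs(y_pred)
        # Treat the 0/0 case (both values zero) as a perfect prediction.
        error = 2.0 * abs(y_true - y_pred) / denominator if denominator else 0.0
        self.n += 1
        self.total += error
        return self

    def get(self):
        return 100.0 * self.total / self.n if self.n else 0.0


sketch_metric = RunningSMAPE()
sketch_metric.update(y_true=100, y_pred=100)  # perfect prediction
sketch_metric.update(y_true=100, y_pred=50)   # contributes 2 * 50 / 150
print(sketch_metric.get())
```

With this definition the value ranges from 0 (perfect) to 200 (prediction and truth have opposite signs or one of them is zero).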

### Walkthrough prediction, metric update & training <a class="anchor" id="walkthrough"></a>

Predict value

In [None]:
y_pred = model.forecast(horizon=1, xs=[test_dict.copy()])[0]

Update metric

In [None]:
smape.update(y_true=y, y_pred=y_pred)

Train model on actual value

In [None]:
model.fit_one(x=test_dict.copy(), y=y)

Repeat prediction

In [None]:
model.forecast(horizon=1, xs=[test_dict.copy()])[0]

### Full cycle of predict, metric update & train on vaccination data <a class="anchor" id="full-loop"></a>

This is what a continuous loop of prediction, metric update and training on the merged vaccination dataframe could look like. A sleep timer of one second is used in the for loop, so that the training process and the change of the metric can be followed cycle by cycle.

In [None]:
import time

for data_dict in bremen_vacc_data.to_dict(orient="records"):
    y = data_dict.pop("new_confirmed")
    # Predict before the model has seen the true value of this sample
    y_pred = model.forecast(horizon=1, xs=[data_dict.copy()])[0]
    smape.update(y_true=y, y_pred=y_pred)
    # Then train on the same sample with its true value
    model.fit_one(x=data_dict.copy(), y=y)
    print(f"Prediction: {y_pred}, Truth: {y}, SMAPE: {smape.get()}")
    time.sleep(1.0)
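
The order of the steps matters: the model predicts first and only afterwards learns from the true value, so every score is measured on data the model has not yet seen. The sketch below isolates this pattern with a deliberately naive baseline. The helper `progressive_validation`, the baseline functions, and the toy stream are hypothetical and independent of creme.

```python
def progressive_validation(stream, predict, learn, score):
    """Predict first, score the prediction, then learn from the true value."""
    scores = []
    for x, y in stream:
        y_pred = predict(x)          # the model has not seen y yet
        scores.append(score(y, y_pred))
        learn(x, y)                  # only now reveal the label
    return scores


# Hypothetical baseline: always predict the previously observed value.
state = {"last": 0.0}


def predict_last(x):
    return state["last"]


def learn_last(x, y):
    state["last"] = y


stream = [({"day": 1}, 10.0), ({"day": 2}, 12.0), ({"day": 3}, 12.0)]
abs_errors = progressive_validation(
    stream, predict_last, learn_last, score=lambda y, p: abs(y - p)
)
print(abs_errors)
```

Such a baseline is also a useful reference point: a trained model should at least beat "tomorrow equals today".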

## 7. Playground <a class="anchor" id="playground"></a>

Try out stuff yourself!