# Getting started with cleanair

This is a quick-start guide to get hands-on with the data, models and visualisation tools in the cleanair repo.

We recommend you copy and paste code snippets from this notebook into your own notebook to run your models and evaluate the fits.

## Installation

The full installation (including docker) is given in the README of this repo, but here is a quick summary:

### Clone the repository

```bash
git clone https://github.com/alan-turing-institute/clean-air-infrastructure.git
```

### Install cleanair and dependencies

Create a new Python 3.7 conda/pyenv virtual environment. Install the requirements, then install cleanair:
```bash
cd clean-air-infrastructure
git checkout -b 182_dev
git pull origin 182_dev
pip install -r containers/requirements.txt
pip install -e containers
```

> Please check that pip is using the virtual environment you have set up by running `which pip`.
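
The same check can be made from inside Python, which is handy once you are already in a notebook. The sketch below is only illustrative: the helper name `environment_summary` is hypothetical, and it assumes that a conda environment exposes its name through the `CONDA_DEFAULT_ENV` variable, while a venv or virtualenv is detected by comparing `sys.prefix` with `sys.base_prefix`.

```python
import os
import sys


def environment_summary():
    """Collect basic facts about the interpreter that is currently running."""
    return {
        # full path of the Python binary in use
        "executable": sys.executable,
        # root directory of the active environment
        "prefix": sys.prefix,
        # venv/virtualenv change sys.prefix but leave sys.base_prefix alone
        "in_venv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        # set by `conda activate`; None when conda is not in use
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for name, value in environment_summary().items():
    print(f"{name}: {value}")
```

If `executable` does not point inside the environment you created, the packages installed with pip probably went somewhere else.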

### Jupyterlab (optional)


Add the jupyterlab extensions for plotly, dash and widgets:

```bash
jupyter labextension install jupyterlab-dash --no-build
jupyter labextension install jupyterlab-plotly --no-build
jupyter labextension install @jupyter-widgets/jupyterlab-manager --no-build
jupyter labextension install plotlywidget
```

Installing these extensions requires Node.js, so check that you have [nodejs installed](https://treehouse.github.io/installation-guides/mac/node-mac.html):

```bash
node -v
```

### Test install

Run the import statements below to test that everything has installed.

> Ignore the tensorflow warnings. We are currently using an old version of TF.

In [1]:
# check that all of your imports are working
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import json
import sys
import os
import pickle
from datetime import datetime
import ipywidgets as ipw
import logging

sys.path.append("../")
from cleanair.models import ModelData
from cleanair.models import SVGP
from cleanair import metrics
from cleanair.dashboard import timeseries
from cleanair.dashboard.components import ModelFitComponent

## Setup

You need to add some files and configs before you can start running models.

### DB credentials

You will need to create a local secrets file. Run the following to create a file with the database secrets:
```bash
mkdir -p terraform/.secrets
touch terraform/.secrets/db_secrets.json
echo '{
    "username": "<db_admin_username>@<db_server_name>",
    "password": "<db_admin_password>",
    "host": "<db_server_name>.postgres.database.azure.com",
    "port": 5432,
    "db_name": "<dbname>",
    "ssl_mode": "require"
}' > terraform/.secrets/db_secrets.json
```

Open the file and replace each `<...>` placeholder with the corresponding secret value, which can be found in the keyvault in the `RG_CLEANAIR_INFRASTRUCTURE` Azure resource group. If you don't have access to the vault, ask someone in the cleanair team to help you out.
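
Before connecting, it can help to confirm that no placeholder was left behind. The following is a minimal sketch, not part of cleanair: `check_secrets` is a hypothetical helper, and the list of required keys is simply copied from the template above.

```python
import json
import os
import re
import tempfile

# keys taken from the db_secrets.json template shown above
REQUIRED_KEYS = {"username", "password", "host", "port", "db_name", "ssl_mode"}
# matches template placeholders such as <db_admin_password>
PLACEHOLDER = re.compile(r"<[^<>]+>")


def check_secrets(path):
    """Return a list of problems found in the secrets file (empty means none found)."""
    with open(path, "r") as secrets_file:
        secrets = json.load(secrets_file)
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(secrets))]
    for key, value in secrets.items():
        if isinstance(value, str) and PLACEHOLDER.search(value):
            problems.append(f"placeholder not replaced: {key}")
    return problems


# demonstration on a throwaway file, so no real secrets are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "db_secrets.json")
    with open(demo_path, "w") as demo_file:
        json.dump({"username": "<db_admin_username>@<db_server_name>", "port": 5432}, demo_file)
    for problem in check_secrets(demo_path):
        print(problem)
```

To check your own file, call `check_secrets` with the path to your `db_secrets.json`.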

> At this point you should have enough to start the `run_model_fitting.py` entrypoint.

### Get some data

Ask Patrick to send you a sample of data.

### Parser config

We recommend you store some default settings when you intend to run models locally. Put these settings in the `config.json` file in your secrets folder:
```bash
touch terraform/.secrets/config.json
echo '{
    "config_dir": "<DATA_DIRECTORY>",
    "results_dir": "<DATA_DIRECTORY>",
    "no_db_write": true,
    "predict_write": true,
    "local_read": true,
    "local_write": true,
    "tag": "<INSERT_YOUR_TAG>",
    "return_y": true,
    "predict_training": false,
    "predict_read_local": true
}' > terraform/.secrets/config.json
```

Make sure to change `<DATA_DIRECTORY>` and `<INSERT_YOUR_TAG>`. The data directory should be the absolute filepath to your data store. The tag should be a name that you give your model fits (useful later when uploading results to the database).
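
If you prefer to generate the file from Python instead of the shell, something along these lines should produce the same settings. This is a sketch under assumptions: `write_parser_config` is a hypothetical helper, not a cleanair function, and it uses one directory for both `config_dir` and `results_dir`, as the template above does.

```python
import json
import os


def write_parser_config(secrets_dir, data_dir, tag):
    """Write config.json with the default settings listed above and return its path."""
    # the guide asks for an absolute path to the data store
    if not os.path.isabs(data_dir):
        raise ValueError("data_dir must be an absolute path")
    config = {
        "config_dir": data_dir,
        "results_dir": data_dir,
        "no_db_write": True,
        "predict_write": True,
        "local_read": True,
        "local_write": True,
        "tag": tag,
        "return_y": True,
        "predict_training": False,
        "predict_read_local": True,
    }
    os.makedirs(secrets_dir, exist_ok=True)
    config_path = os.path.join(secrets_dir, "config.json")
    # opening with "w" overwrites any earlier config instead of appending to it
    with open(config_path, "w") as config_file:
        json.dump(config, config_file, indent=4)
    return config_path
```

For example, `write_parser_config("terraform/.secrets", "/absolute/path/to/data", "my_tag")` would write the file to the secrets folder used above; both the path and the tag here are made-up values.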


In [2]:
# path to your secrets directory
secrets_dir = "../../.secrets/"

# open the parser config
with open(os.path.join(secrets_dir, "config.json"), "r") as config_file:
    parser_config = json.load(config_file)

# set up your filepaths
data_dir = parser_config["config_dir"]
results_dir = parser_config["results_dir"]
secretfile = os.path.join(secrets_dir, "db_secrets.json")


## The ModelData class

The ModelData class is the interface between your data and your model. It has methods and constructors for reading and writing data/predictions from files and databases. It abstracts away a lot of the details so that you (hopefully) don't have to worry about data processing.

The code snippet below shows how to initialise a ModelData object by reading from a data directory called `config_dir` and from the `db_secrets.json` file. Make sure your `config.json` file is correctly initialised before running the code.

In [3]:
# read input data from a directory instead of the database
if parser_config["local_read"]:
    model_data = ModelData(config_dir=data_dir, secretfile=secretfile)
else:
    raise NotImplementedError("Reading from database is not supported in this notebook yet.")

2020-02-28 14:42:09     INFO: Database connection information loaded from None


## Create a model

Below is an example of a simple SVGP (that doesn't perform very well). You can access and change the parameters of the model by changing the `model_params` attribute (which can also be passed as an argument to the constructor).

All models in the cleanair repo inherit from the base class `cleanair.models.Model`. This base class is documented and describes the data format passed to the `fit` and `predict` methods. 

If you want to change the SVGP, you should be able to create a new class that inherits from `SVGP` and overrides the appropriate methods. The `cleanair.models.SVGP` class itself inherits from `Model`.

In [4]:
# run with the default SVGP
model = SVGP()

# get the default parameters
print("Default model params:")
print(json.dumps(model.model_params, indent=4))

# change a parameter
model.model_params["maxiter"] = 10

Default model params:
{
    "jitter": 1e-05,
    "likelihood_variance": 0.1,
    "minibatch_size": 100,
    "n_inducing_points": 2000,
    "restore": false,
    "train": true,
    "model_state_fp": null,
    "maxiter": 100,
    "kernel": {
        "name": "mat32+linear",
        "variance": 0.1,
        "lengthscale": 0.1
    }
}
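
Because `kernel` is itself a dictionary, replacing it wholesale would drop the settings you did not mention. One possible way to change a few values while keeping the rest is a recursive merge, sketched below. The helper `with_overrides` is hypothetical and not part of cleanair, and the defaults here are an abridged copy of the printed output.

```python
import copy


def with_overrides(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied, merging nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # merge nested settings (e.g. the kernel) key by key
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# abridged copy of the default parameters printed above
default_params = {
    "maxiter": 100,
    "kernel": {"name": "mat32+linear", "variance": 0.1, "lengthscale": 0.1},
}

params = with_overrides(default_params, {"maxiter": 10, "kernel": {"lengthscale": 0.5}})
print(params)
```

The original dictionary is left untouched, so the defaults stay available for comparison. The merged result could then be assigned to `model.model_params` or given to the constructor, as described above.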


In [5]:
# get the data into the right format (dicts)
train_dict = model_data.get_training_data_arrays()
pred_dict = model_data.get_pred_data_arrays()
x_train, y_train = train_dict['X'], train_dict['Y']
x_test = pred_dict['X']

# fit the model on training set
model.fit(x_train, y_train)

# predict on testing set
y_pred = model.predict(x_test)

print(y_pred)

laqn []
0 :  48
1 :  48
...
47 :  48
{'laqn': {'NO2': {'mean': array([[0.53808959],
       [0.61048998],
       [0.68293151],
       ...,
       [2.55365583],
       [2.62612424],
       [2.69859265]]), 'var': array([[1.04601679],
       [1.07076113],
       [1.09427337],
       ...,
       [2.40804219],
       [2.46603607],
       [2.52486315]])}}}
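
The printed predictions are nested by data source (`laqn`), then pollutant (`NO2`), then `mean` and `var`, each holding one column of values. As a rough sketch of how this structure might be consumed, the hypothetical helper below turns each mean and variance pair into an approximate 95% interval. It assumes the predictive distribution is Gaussian and that `var` is the variance of that distribution; the example values are stand-ins, not real results.

```python
import math


def prediction_intervals(pred_dict, z=1.96):
    """Map (source, species) to a list of (lower, mean, upper) tuples."""
    intervals = {}
    for source, species_dict in pred_dict.items():
        for species, stats in species_dict.items():
            rows = []
            # each row is a one-element column entry, e.g. [0.538]
            for mean_row, var_row in zip(stats["mean"], stats["var"]):
                mean, var = float(mean_row[0]), float(var_row[0])
                half_width = z * math.sqrt(var)
                rows.append((mean - half_width, mean, mean + half_width))
            intervals[(source, species)] = rows
    return intervals


# stand-in values with the same nesting as the output above (not real results)
example_pred = {"laqn": {"NO2": {"mean": [[0.5], [1.0]], "var": [[1.0], [4.0]]}}}

for key, rows in prediction_intervals(example_pred).items():
    print(key, rows)
```

The loop indexes rows rather than relying on array operations, so it accepts nested lists as well as N x 1 NumPy arrays like those returned by `predict`.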


## Saving the predictions

Finally, we need to update the ModelData object with our predictions, then write the results to a pickle file.

In the validation notebook, we will show how to visualise the results of your model fit.

In [6]:
# now we need to update the model data object with the predictions
model_data.update_training_df_with_preds(y_pred, datetime.now())

# we could write the results to the database, but for now we are going to write to a file
pred_filepath = os.path.join(parser_config["results_dir"], "test_pred.pickle")
with open(pred_filepath, "wb") as handle:
    pickle.dump(y_pred, handle)
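
To confirm that the file was written correctly, you can load it back and compare it with what you saved. The sketch below uses a temporary directory and a small stand-in dictionary, so it runs without a fitted model. `save_and_reload` is a hypothetical helper.

```python
import os
import pickle
import tempfile


def save_and_reload(obj, filepath):
    """Pickle `obj` to `filepath`, then read it back and return the loaded copy."""
    with open(filepath, "wb") as handle:
        pickle.dump(obj, handle)
    with open(filepath, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    # stand-in with the same nesting as y_pred (not real results)
    stand_in = {"laqn": {"NO2": {"mean": [[0.5]], "var": [[1.0]]}}}
    reloaded = save_and_reload(stand_in, os.path.join(tmp_dir, "test_pred.pickle"))
    print("round trip preserved the data:", reloaded == stand_in)
```

With real predictions the values are NumPy arrays, so compare them with `np.array_equal` instead of `==`, which is ambiguous for arrays.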