# Getting started with cleanair

This is a quick-start guide for getting hands-on with the data, models and visualisation tools in the cleanair repo.

We recommend you copy and paste code snippets from this notebook into your own notebook to run your models and evaluate the fits.

## Installation

The full installation (including Docker) is described in the README of this repo; here is a quick summary:

### Clone the repository

```bash
git clone https://github.com/alan-turing-institute/clean-air-infrastructure.git
```

### Install cleanair and dependencies

Create a new Python 3.7 virtual environment (conda or pyenv). Install the requirements, then install cleanair:

```bash
cd clean-air-infrastructure
git checkout -b 182_dev
git pull origin 182_dev
pip install -r containers/requirements.txt
pip install -e containers
```

> Please check that pip is using the virtual environment you have set up by running `which pip`.
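
If the output of `which pip` is ambiguous, you can also ask the interpreter itself. The snippet below is a minimal sketch and not part of cleanair: the helper name `in_virtual_env` is our own, and the conda check assumes that conda sets the `CONDA_PREFIX` environment variable when an environment is activated.

```python
import os
import sys


def in_virtual_env() -> bool:
    """Return True if the interpreter appears to run inside a virtual environment."""
    # venv and recent virtualenv (used by pyenv-virtualenv) change sys.prefix;
    # older virtualenv releases set sys.real_prefix instead
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix) or hasattr(sys, "real_prefix")
    # conda leaves base_prefix alone but sets CONDA_PREFIX on activation
    # (this is also true for the conda "base" environment)
    in_conda = bool(os.environ.get("CONDA_PREFIX"))
    return in_venv or in_conda


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_env())
```

Run it with the same `python` you intend to use for the notebook; the printed interpreter path should point inside your new environment.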

### Jupyterlab (optional)


Add the jupyterlab extensions for plotly, dash and widgets:

```bash
jupyter labextension install jupyterlab-dash --no-build
jupyter labextension install jupyterlab-plotly --no-build
jupyter labextension install @jupyter-widgets/jupyterlab-manager --no-build
jupyter labextension install plotlywidget
```

These commands require Node.js, so also check that you have [nodejs installed](https://treehouse.github.io/installation-guides/mac/node-mac.html):

```bash
node -v
```

### Test install

Run the import statements below to test everything has installed.

> Ignore the tensorflow warnings. We are currently using an old version of TF.

In [1]:
# check that all of your imports are working
import numpy as np
import pandas as pd
import json
import os
import pickle
from datetime import datetime

from cleanair.models import ModelData
from cleanair.models import SVGP
from cleanair import metrics
from cleanair.dashboard import timeseries
from cleanair.dashboard.components import ModelFitComponent
from cleanair.instance import ValidationInstance
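
If one of the imports fails, it can be hard to tell which package is missing. As a rough sketch, the helper below (the hypothetical name `missing_modules` is ours, not a cleanair function) uses the standard-library `importlib.util.find_spec` to list top-level packages that cannot be located, without actually importing them.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]


# top-level packages used by the import cell above
required = ["numpy", "pandas", "cleanair"]
missing = missing_modules(required)
print("missing packages:", missing if missing else "none")
```

Only top-level names are checked here; a package that is found can still fail at import time if one of its own dependencies is broken.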

## Setup

You need to add some files and configs before you can start running models.

### DB credentials

You will need to create a local secrets file. Run the following to create a file with the database secrets:
```bash
mkdir -p terraform/.secrets
touch terraform/.secrets/db_secrets.json
echo '{
    "username": "<db_admin_username>@<db_server_name>",
    "password": "<db_admin_password>",
    "host": "<db_server_name>.postgres.database.azure.com",
    "port": 5432,
    "db_name": "<dbname>",
    "ssl_mode": "require"
}' > terraform/.secrets/db_secrets.json
```

Open the file and replace each `<...>` placeholder with the corresponding secret value, which can be found in the keyvault in the `RG_CLEANAIR_INFRASTRUCTURE` Azure resource group. If you don't have access to the vault, ask someone in the cleanair team to help you out.
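
It is easy to leave a placeholder behind when editing the file by hand. The following is a small sketch, assuming the keys shown in the template above; `find_secret_problems` is a hypothetical helper, not a cleanair function, and the placeholder test is only a heuristic.

```python
import json

REQUIRED_KEYS = ["username", "password", "host", "port", "db_name", "ssl_mode"]


def find_secret_problems(secrets):
    """Return a list of human-readable problems found in a secrets dictionary."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in secrets:
            problems.append(f"missing key: {key}")
            continue
        value = str(secrets[key])
        # heuristic: template placeholders look like <db_admin_password>
        if "<" in value and ">" in value:
            problems.append(f"placeholder not replaced: {key}")
    return problems


# an incomplete template stands in for the real file in this example
template = json.loads('{"username": "<db_admin_username>@<db_server_name>", "port": 5432}')
for problem in find_secret_problems(template):
    print(problem)
```

To check your own file, load it with `json.load` and pass the resulting dictionary to the helper; an empty list means no problems were detected.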

> At this point you should have enough to start the `run_model_fitting.py` entrypoint

### Get some data

Ask Patrick to send you a sample of data. Alternatively, if you have access to the DB, you can request your own data.

### Parser config

We recommend you store some default settings when you intend to run models locally. Put these settings in the `config.json` file in your secrets folder:
```bash
touch terraform/.secrets/config.json
echo '{
    "secretfile": "../../terraform/.secrets/db_secrets.json",
    "config_dir": "<DATA_DIRECTORY>",
    "results_dir": "<RESULTS_DIRECTORY>",
    "model_dir": "<MODEL_DIRECTORY>",
    "no_db_write": false,
    "predict_write": true,
    "local_read": false,
    "local_write": false,
    "write_model_params": false,
    "read_model_params": false,
    "tag": "validation",
    "predict_training": false,
    "predict_read_local": false,
    "include_prediction_y": false,
    "model_name": "svgp",
    "cluster_id": "<CLUSTER_ID>",
    "trainend": "2020-02-19T00:00:00",
    "predstart": "2020-02-19T00:00:00"
}' > terraform/.secrets/config.json
```

Make sure to change `<DATA_DIRECTORY>`, `<RESULTS_DIRECTORY>` and `<MODEL_DIRECTORY>` to valid directories if you wish to read from and write to files. The file paths should be absolute and, like every string in the file, must be wrapped in double quotes for the JSON to parse. Also change `<CLUSTER_ID>` to the name of the machine you are running the models on, e.g. `patrick_macbookpro`.
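
Relative paths resolve against whatever directory you launch from, so it can help to check the config before running anything. A minimal sketch, assuming the three directory keys from the template above (`relative_dir_keys` is a hypothetical helper, not part of cleanair):

```python
import os

DIR_KEYS = ["config_dir", "results_dir", "model_dir"]


def relative_dir_keys(config):
    """Return the directory keys whose value is missing or not an absolute path."""
    bad = []
    for key in DIR_KEYS:
        value = config.get(key)
        if not isinstance(value, str) or not os.path.isabs(value):
            bad.append(key)
    return bad


example = {
    "config_dir": os.path.abspath("data"),
    "results_dir": "results",  # relative, so this key is reported
    "model_dir": os.path.abspath("models"),
}
print(relative_dir_keys(example))
```

The check only looks at the form of each path; it does not verify that the directories exist.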


In [2]:
# path to your secrets directory
secrets_dir = "../../terraform/.secrets/"

# open the parser config
with open(os.path.join(secrets_dir, "config.json"), "r") as config_file:
    parser_config = json.load(config_file)

# set up your filepaths
data_dir = parser_config["config_dir"]
results_dir = parser_config["results_dir"]
secretfile = os.path.join(secrets_dir, "db_secrets.json")


## Create an instance

An instance is a model + data + parameters + settings. You can quickly create an instance object by passing the `model_params`, `experiment_config` and `data_config` dictionaries and selecting a model with the `model_name` parameter.

In [3]:
# create a sparse variational GP
model_name = "svgp"

# change the parameters of the model
model_params = dict(
    ValidationInstance.DEFAULT_MODEL_PARAMS,
    maxiter=10,   # you can change or add individual params
)

# change the settings for loading data
data_config = dict(
    ValidationInstance.DEFAULT_DATA_CONFIG,
    include_satellite=False,   # turn off satellite data
)

# update the experiment config with the settings stored in your config file
experiment_config = ValidationInstance.DEFAULT_EXPERIMENT_CONFIG.copy()
experiment_config.update(parser_config)

# create the instance using the dictionaries
instance = ValidationInstance(
    data_config=data_config,
    experiment_config=experiment_config,
    model_params=model_params,
    model_name=model_name,
    tag="validation",
    cluster_id=parser_config["cluster_id"],
)

2020-03-23 11:48:28     INFO: Database connection information loaded from None
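
The cell above relies on two standard Python idioms: `dict(defaults, key=value)` builds a new dictionary with individual overrides, and `copy()` followed by `update()` layers one dictionary over another. The sketch below shows both with made-up values (the `DEFAULTS` dictionary here is illustrative and does not contain the real cleanair defaults); in each case the original defaults stay unchanged.

```python
# illustrative defaults only; not the real cleanair values
DEFAULTS = {"maxiter": 100, "include_satellite": True}

# idiom 1: copy the defaults and override (or add) individual keys
params = dict(DEFAULTS, maxiter=10)

# idiom 2: copy first, then layer another dictionary on top
overrides = {"include_satellite": False, "tag": "validation"}
config = DEFAULTS.copy()
config.update(overrides)

print(params)
print(config)
print(DEFAULTS)  # unchanged by either idiom
```

Both idioms make shallow copies, so nested lists or dictionaries inside the defaults would still be shared with the new dictionary.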


Once the instance is created, you can set up the model, load the data, train the model, predict on the test set and save the results, all by calling the `run()` method.

In [4]:
print(instance.data_config)
instance.run()

{'train_start_date': datetime.datetime(2020, 1, 29, 0, 0), 'train_end_date': datetime.datetime(2020, 1, 30, 0, 0), 'pred_start_date': datetime.datetime(2020, 1, 30, 0, 0), 'pred_end_date': datetime.datetime(2020, 1, 31, 0, 0), 'include_satellite': False, 'include_prediction_y': True, 'train_sources': ['laqn'], 'pred_sources': ['laqn'], 'train_interest_points': 'all', 'train_satellite_interest_points': 'all', 'pred_interest_points': 'all', 'species': ['NO2'], 'features': ['value_1000_total_a_road_length', 'value_500_total_a_road_length', 'value_500_total_a_road_primary_length', 'value_500_total_b_road_length'], 'norm_by': 'laqn'}


2020-03-23 11:48:34     INFO: Setting up model.
2020-03-23 11:48:34     INFO: Writing model parameters to json file.
2020-03-23 11:48:34     INFO: Writing model parameters to a json file.
2020-03-23 11:48:34     INFO: Inserting 1 row into the model table.
2020-03-23 11:48:34     INFO: Reading from local file.
2020-03-23 11:48:34     INFO: Database connection information loaded from None
2020-03-23 11:48:37     INFO: State files saved to /Users/pohara/documents/tests/laqn_test_instance/
2020-03-23 11:48:38     INFO: State files saved to /Users/pohara/documents/tests/laqn_test_instance/
2020-03-23 11:48:38     INFO: Inserting 1 row into data config table.
2020-03-23 11:48:38     INFO: Training started.
2020-03-23 11:48:38     INFO: Training the model for 10 iterations.

One of the clusters is empty. Re-run kmeans with a different initialization.

2020-03-23 11:48:40     INFO: Training ended.
2020-03-23 11:48:40     INFO: Inserting 1 record into the instance table.
2020-03-23 11:48:40    