# Instances: an overview

An instance is a model + parameters + data + code version.

## Class hierarchy



### Base class

`Instance` is the base class for all instances. It inherits from `DBWriter` and `DBQueryMixin` so that it can read from and write to the database.

Every instance has the following attributes:

- model_name
- param_id
- data_id
- cluster_id
- tag
- git_hash
- fit_start_time
- instance_id

All of these attributes are defined as properties in the `Instance` class.

#### Hashing

The `instance_id` is created by hashing the combination of `data_id`, `param_id`, `model_name` and `git_hash`. That is, an instance is uniquely identified by its data, parameters, model and code version.

Whenever any of these four properties is set, the `instance_id` is recomputed. Changing the `data_id`, for example, changes the value of `instance_id`.

The `data_id` and `param_id` are created by hashing the `data_config` and `model_params` dictionaries respectively.
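The hashing scheme described above can be sketched as follows. This is a minimal illustration, not the `cleanair` implementation: the helper names `hash_dict` and `hash_instance_id`, the choice of SHA-256, the JSON serialisation and the example dictionary keys are all assumptions made for this example.

```python
import hashlib
import json


def hash_dict(value: dict) -> str:
    """Hash a JSON-serialisable dictionary (hypothetical helper)."""
    # sort_keys makes the hash independent of key insertion order
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def hash_instance_id(model_name: str, param_id: str, data_id: str, git_hash: str) -> str:
    """Combine the four identifying fields into one id (hypothetical helper)."""
    combined = "|".join([model_name, param_id, data_id, git_hash])
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


# made-up dictionaries, only for illustration
param_id = hash_dict({"maxiter": 10, "lengthscale": 0.1})
data_id = hash_dict({"species": ["NO2"], "include_satellite": False})
instance_id = hash_instance_id("svgp", param_id, data_id, git_hash="abc123")

# changing any one input gives a different instance id
other_data_id = hash_dict({"species": ["NO2"], "include_satellite": True})
other_instance_id = hash_instance_id("svgp", param_id, other_data_id, git_hash="abc123")
print(instance_id != other_instance_id)
```

Because the ids are derived purely from content, two people who use the same data config, model parameters, model name and code version would end up with the same `instance_id` under this scheme.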

### Runnable instance

The `RunnableInstance` class extends the `Instance` class. It provides functions for running the model, loading data and saving results. It also provides the `run()` method which will do all of the loading, fitting, prediction and saving in one step.

Runnable instances require three dictionaries that are stored as *properties*: `experiment_config`, `data_config` and `model_params`. Defaults for each are stored in the following class attributes:
- `RunnableInstance.DEFAULT_EXPERIMENT_CONFIG`
- `RunnableInstance.DEFAULT_DATA_CONFIG`
- `RunnableInstance.DEFAULT_MODEL_PARAMS`

Subclasses of the runnable instance can override these default dictionaries.

Whenever `model_params` or `data_config` is set, it will update `param_id` or `data_id` respectively (and thus update `instance_id`).
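One way to get this behaviour is a property setter that recomputes the id whenever the dictionary is assigned. The class below is a simplified stand-in, not the real `RunnableInstance`: the class name `SketchInstance`, the `_hash` helper and the use of SHA-256 over JSON are assumptions.

```python
import hashlib
import json


def _hash(text: str) -> str:
    # hypothetical helper: hex digest of a string
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class SketchInstance:
    """Toy class: ids are derived from the dictionaries, never set by hand."""

    def __init__(self, model_name: str, model_params: dict, data_config: dict, git_hash: str = ""):
        self.model_name = model_name
        self.git_hash = git_hash
        self.model_params = model_params  # goes through the setter below
        self.data_config = data_config  # goes through the setter below

    @property
    def model_params(self) -> dict:
        return self._model_params

    @model_params.setter
    def model_params(self, value: dict) -> None:
        self._model_params = value
        self.param_id = _hash(json.dumps(value, sort_keys=True))

    @property
    def data_config(self) -> dict:
        return self._data_config

    @data_config.setter
    def data_config(self, value: dict) -> None:
        self._data_config = value
        self.data_id = _hash(json.dumps(value, sort_keys=True))

    @property
    def instance_id(self) -> str:
        # computed on access, so it always reflects the current ids
        return _hash(self.model_name + self.param_id + self.data_id + self.git_hash)


sketch = SketchInstance("svgp", {"maxiter": 10}, {"include_satellite": False})
old_param_id, old_instance_id = sketch.param_id, sketch.instance_id

sketch.model_params = {"maxiter": 100}  # triggers the setter
print(sketch.param_id != old_param_id, sketch.instance_id != old_instance_id)
```

In this sketch, editing a dictionary in place (for example `sketch.model_params["maxiter"] = 5`) does not go through the setter, so `param_id` would be stale. Assigning a whole new dictionary avoids that.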

#### Validation instance

You do not need database access to run instances: you can use local data files instead. The `ValidationInstance` provides methods for reading from and writing to local files by overriding some `RunnableInstance` methods and adding new ones.

#### Production instance

This is the instance that is intended to be used in the live pipeline. It assumes all data and results are read from and written to the database.

#### Test instance

Test instances are designed for quickly testing that your code works. They inherit from `ValidationInstance` and have some simple default settings.
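The hierarchy, and the way a subclass overrides the default dictionaries, can be pictured with the toy classes below. The `Sketch...` class names, their bodies and every dictionary key and value are invented for illustration; they are not the `cleanair` classes or defaults.

```python
class SketchRunnableInstance:
    DEFAULT_MODEL_PARAMS = {"maxiter": 1000}  # hypothetical value
    DEFAULT_DATA_CONFIG = {"include_satellite": True}  # hypothetical value

    def __init__(self, model_params=None, data_config=None):
        # fall back to a copy of the class-level defaults
        self.model_params = dict(
            self.DEFAULT_MODEL_PARAMS if model_params is None else model_params
        )
        self.data_config = dict(
            self.DEFAULT_DATA_CONFIG if data_config is None else data_config
        )


class SketchValidationInstance(SketchRunnableInstance):
    """Stand-in for the class that adds local file reading and writing."""


class SketchTestInstance(SketchValidationInstance):
    # override the inherited defaults with something quick to run
    DEFAULT_MODEL_PARAMS = {"maxiter": 1}
    DEFAULT_DATA_CONFIG = {"include_satellite": False}


quick = SketchTestInstance()
print(quick.model_params)  # {'maxiter': 1}
print(SketchValidationInstance().model_params)  # {'maxiter': 1000}
```

Looking up the defaults through `self` rather than through a fixed class name is what lets each subclass supply its own dictionaries without redefining `__init__`.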

In [1]:
import os
import json

import pandas as pd

from cleanair.instance import ValidationInstance
from cleanair.instance import InstanceQuery

In [2]:
# path to your secrets directory
secrets_dir = "../../terraform/.secrets/"

# open the parser config
with open(os.path.join(secrets_dir, "config.json"), "r") as config_file:
    parser_config = json.load(config_file)

# set up your filepaths
data_dir = parser_config["config_dir"]
results_dir = parser_config["results_dir"]
secretfile = os.path.join(secrets_dir, "db_secrets.json")


## Creating instances

A runnable instance can be created from three dictionaries and the name of your model.

In [14]:
model_name = "svgp"

model_params = dict(
    ValidationInstance.DEFAULT_MODEL_PARAMS,
    maxiter=10,   # you can change or add individual params
)

data_config = dict(
    ValidationInstance.DEFAULT_DATA_CONFIG,
    include_satellite=False,   # turn off satellite data
)

experiment_config = ValidationInstance.DEFAULT_EXPERIMENT_CONFIG.copy()

# create the instance using the dictionaries
instance = ValidationInstance(
    data_config=data_config,
    experiment_config=experiment_config,
    model_params=model_params,
    model_name=model_name,
    tag="validation",
    cluster_id=parser_config["cluster_id"],
)

2020-03-19 09:25:45     INFO: Database connection information loaded from None
2020-03-19 09:25:48    ERROR: Could not find a git repository in the parent directory. Setting git_hash to empty string.
2020-03-19 09:25:48    ERROR: <traceback object at 0x7fa7e9e2bfa0>
2020-03-19 09:25:48     INFO: Tag is validation
2020-03-19 09:25:48     INFO: Model name is svgp
2020-03-19 09:25:48     INFO: Param id is 5d40db9001000dd23798709d7f6041fada73ecb528dba17adae000a3d3ec5856
2020-03-19 09:25:48     INFO: Data id is 2f5950d5efe59a54155dc4dfa7dd23ef26dc7cb531533592aba67796abd23c02
2020-03-19 09:25:48     INFO: Instance id is 01270fcb954718e74b69aac20a5788babbb4939cbd1437672365c87405573c9a
2020-03-19 09:25:48     INFO: Cluster id is patrick_laptop
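The `dict(defaults, key=value)` pattern used in the cell above builds a new dictionary, so the class-level defaults are left untouched. The copy is shallow, though, so any nested values are still shared. The snippet below shows both points with a made-up `defaults` dictionary; its keys are illustrative and are not the real default parameters.

```python
import copy

defaults = {"maxiter": 1000, "kernel": {"lengthscale": 0.1}}  # made-up defaults

# override one key without touching the original
params = dict(defaults, maxiter=10)
print(defaults["maxiter"], params["maxiter"])  # 1000 10

# nested dictionaries are shared by the shallow copy...
params["kernel"]["lengthscale"] = 5.0
print(defaults["kernel"]["lengthscale"])  # 5.0

# ...so deep-copy first if you plan to edit nested values
safe_params = copy.deepcopy(defaults)
safe_params["kernel"]["lengthscale"] = 0.5
print(defaults["kernel"]["lengthscale"])  # still 5.0
```

If your defaults contain nested dictionaries, starting from `copy.deepcopy` is the safer choice.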


## Running instances

The `run` method will set up the model, load the data, fit the model, predict, and write the instance and its results to the database.

If you only want to do some of these steps, you can call the individual functions, e.g. `run_model_fitting()`. Take a look at the source code (inside `RunnableInstance.run()`) for more details.


In [15]:
instance.run()

2020-03-19 09:25:48     INFO: Setting up model.
2020-03-19 09:25:48     INFO: Inserting 1 row into the model table.
2020-03-19 09:25:48     INFO: Loading input data from database.
2020-03-19 09:25:48     INFO: Database connection information loaded from None
2020-03-19 09:25:51     INFO: Validating config
2020-03-19 09:25:52     INFO: Validate config complete
2020-03-19 09:25:53     INFO: Loading training data for species: ['NO2'] from sources: ['laqn']
2020-03-19 09:25:53     INFO: Using data from 2020-01-29 00:00:00 (inclusive) to 2020-01-30 00:00:00 (exclusive)
                If dynamic features were not requested then ignore.
2020-03-19 09:25:57     INFO: Getting prediction data for sources: ['laqn'], species: ['NO2'], from 2020-01-30 00:00:00 (inclusive) to 2020-01-31 00:00:00 (exclusive)
                If dynamic features were not requested then ignore.
2020-03-19 09:26:01     INFO: Inserting 1 row into data config table.
2020-03-19 09:26:01     INFO: Training started.
2020-03-

## Loading instances from DB

With only the `instance_id`, you can load the data, model and results of an instance that somebody else has run.

To get all possible instances, execute the following query.

In [16]:
iq = InstanceQuery(secretfile=secretfile)
instance_df = iq.get_all_instances()
instance_df.sample(3)  # get 3 random rows

2020-03-19 09:26:07     INFO: Database connection information loaded from None


Unnamed: 0,instance_id,model_name,tag,param_id,data_id,git_hash,fit_start_time,cluster_id
14,95e55ac41790f5bd8e20c4aad5d7ae5b260060bde34e74...,svgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,d3e4a2b915b95d1b665f21335a7a7f814dd379883bab07...,5e70276403a9ade573ceddbbd1bcc16e6dca5a38,2020-03-18 11:54:48.845217,patrick_laptop
15,426f473f4db28f6a254567113b3f92356da3daff57029a...,svgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,4921095fc71f4d251890811e2b5e95646794e55efe4c6c...,5e70276403a9ade573ceddbbd1bcc16e6dca5a38,2020-03-18 12:10:54.602584,patrick_laptop
2,fef9af3669249d06e9e20efe12904a79fd69448e1a4258...,mr_dgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,196cef9b97e3c60b89c1bed13f380e106634f0c2818942...,7631025862f8302811920a79ca6eafda0fdc3fce,2020-03-10 15:57:35.496357,patrick_laptop


### Filtering instances

We can now filter instances by, for example, `tag`, `model_name` or `fit_start_time`, and sort them by the time at which they were fitted.





In [17]:
dgp_df = instance_df.loc[
    (instance_df.model_name == "mr_dgp") & (instance_df.tag == "validation")
]
dgp_df = dgp_df.sort_values(by="fit_start_time", ascending=False)
dgp_df

Unnamed: 0,instance_id,model_name,tag,param_id,data_id,git_hash,fit_start_time,cluster_id
17,7239dc5350dba8b5efaf8ae5c68504bb502246fbbdbe3d...,mr_dgp,validation,63965896b83b12ecea2438950e7815de44b7f8a34ae26e...,4921095fc71f4d251890811e2b5e95646794e55efe4c6c...,bdfa0316e32a5a8f3ca8ab864a6da96c51530254,2020-03-18 18:45:11.537208,kangrui_laptop
8,b3ef88af90f6cbd9fae66d4d76b4988d21d75a5b2111f5...,mr_dgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,ed8c582190037215743fee222532017ae8f5d87f6a1cd6...,cc5e139b448e5a2ef1519a0fdec70592bf018b8f,2020-03-16 17:11:13.440396,patrick_laptop
5,9896e39160ef419f0af458badee0c9133ecc6350743238...,mr_dgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,196cef9b97e3c60b89c1bed13f380e106634f0c2818942...,ef6314f19ef7258714913c5a784e88b9ca4838f1,2020-03-16 09:32:47.551807,patrick_laptop
4,7886078cddd4fe9a50c6f9e44d25426b2cf29776c6796e...,mr_dgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,196cef9b97e3c60b89c1bed13f380e106634f0c2818942...,a6fba4359c58c24b0f59620cde57498ec1283fc8,2020-03-13 15:19:18.957921,patrick_laptop
2,fef9af3669249d06e9e20efe12904a79fd69448e1a4258...,mr_dgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,196cef9b97e3c60b89c1bed13f380e106634f0c2818942...,7631025862f8302811920a79ca6eafda0fdc3fce,2020-03-10 15:57:35.496357,patrick_laptop
1,31f60f310ee9c24c075f7c9cbeb2c864aa3cff67f45d2b...,mr_dgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,196cef9b97e3c60b89c1bed13f380e106634f0c2818942...,bc17463940c76521655f0c9a83bb5a7b40e72f68,2020-03-10 11:00:40.087366,patrick_laptop
0,01691e714314345e889a4271680930788201080bb3320c...,mr_dgp,validation,16bfd462ce906f099e346c69e08318e0598e5270ac0fb6...,196cef9b97e3c60b89c1bed13f380e106634f0c2818942...,ec577aaa25590274521c9d3581cdcc8824cefc1f,2020-03-10 10:21:59.691353,patrick_laptop
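Filtering on `fit_start_time` works the same way once the column holds datetimes. The example below uses a small made-up dataframe with the same column names, so it runs without database access; the rows and the cut-off date are invented.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "instance_id": ["a1", "b2", "c3"],
        "model_name": ["mr_dgp", "svgp", "mr_dgp"],
        "tag": ["validation", "validation", "validation"],
        "fit_start_time": [
            "2020-03-10 10:21:59",
            "2020-03-18 11:54:48",
            "2020-03-18 18:45:11",
        ],
    }
)
# make sure the column is a datetime, not a string
toy_df["fit_start_time"] = pd.to_datetime(toy_df["fit_start_time"])

# keep only mr_dgp fits started on or after 16 March 2020, newest first
recent_df = toy_df.loc[
    (toy_df["model_name"] == "mr_dgp")
    & (toy_df["fit_start_time"] >= pd.Timestamp("2020-03-16"))
].sort_values(by="fit_start_time", ascending=False)
print(recent_df["instance_id"].tolist())  # ['c3']
```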


### Choose instance and load data

To get the instance that was executed most recently, we choose the first item in the sorted dataframe of deep GP fits.

We can then load the data from the database and get the predictions for that model fit.

In [18]:
# get the most recent fit to examine in detail
instance_row = dgp_df.iloc[0]
print(instance_row)

# now load the data, model params and results
instance = ValidationInstance.instance_from_id(
    instance_id=instance_row.instance_id,
    experiment_config=parser_config
)
# load the data and the model predictions
instance.load_data()
results_df = instance.load_results()

# need to merge the results df onto the data df
instance.model_data.normalised_pred_data_df = pd.merge(
    instance.model_data.normalised_pred_data_df,
    results_df,
    how="inner",
    on=["point_id", "measurement_start_utc"],
)

instance_id       7239dc5350dba8b5efaf8ae5c68504bb502246fbbdbe3d...
model_name                                                   mr_dgp
tag                                                      validation
param_id          63965896b83b12ecea2438950e7815de44b7f8a34ae26e...
data_id           4921095fc71f4d251890811e2b5e95646794e55efe4c6c...
git_hash                   bdfa0316e32a5a8f3ca8ab864a6da96c51530254
fit_start_time                           2020-03-18 18:45:11.537208
cluster_id                                           kangrui_laptop
Name: 17, dtype: object


2020-03-19 09:26:11     INFO: Database connection information loaded from None
2020-03-19 09:26:20     INFO: Load data config from database.
2020-03-19 09:26:20     INFO: Load model params from database
2020-03-19 09:26:20     INFO: Database connection information loaded from None
2020-03-19 09:26:23     INFO: Tag is validation
2020-03-19 09:26:23     INFO: Model name is mr_dgp
2020-03-19 09:26:23     INFO: Param id is 2ad484505311d2eae9e417b65f14f51af28fa35dd8d0d8997048c854b0bc4bb6
2020-03-19 09:26:23     INFO: Data id is 4921095fc71f4d251890811e2b5e95646794e55efe4c6c21b29fe148d4636256
2020-03-19 09:26:23     INFO: Instance id is 0f822a567e8ded92d9c9dfa7425bde48c8afdc0e6ae4e0d33c6956b318a39b81
2020-03-19 09:26:23     INFO: Cluster id is kangrui_laptop
2020-03-19 09:26:23    ERROR: Param id and hashed model params do not match.
2020-03-19 09:26:23     INFO: Loading input data from database.
2020-03-19 09:26:23     INFO: Database connection information loaded from None
2020-03-19 09:26: