# Incremental Learning 3: Benchmark incremental learning algorithm with transformed data

<img src="03-incremental-learning-with-transformed-data/3-schema.png" />

In this notebook, we evaluate the incremental learning algorithm on real-world data that has been transformed by a predefined transformation operation. The available transformation operations are:
* `partition_on_metric_ranges`: Partition the data frame into rows whose values fall outside and inside
random intervals of the metric features.
* `partition_on_categories`: Partition the data frame into rows that match and do not match a random
subset of categories.
* `resample_metric_features`: Resample by randomly weighting equally spaced quantile buckets of the features.
* `shift_metric_features`: Apply a random shift to the metric features in the dataset.
* `rotate_metric_features`: Downsample and apply a random rotation to the metric feature values in the dataset.
* `regression_category_drift`: Downsample and apply a random shift to the target variable for each distinct
    category of the categorical features in the dataset.
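
To make two of these descriptions concrete, here is a minimal sketch on synthetic data. The helpers `toy_partition_on_metric_range` and `toy_shift_metric_features`, their parameters (`width`, `scale`), and the synthetic values are hypothetical and exist only for illustration. They are not the implementation in `incremental_learning.transforms`, whose sampling scheme may differ.

```python
import numpy as np
import pandas as pd


def toy_partition_on_metric_range(df, feature, width, rng):
    """Split df into rows outside and inside a random quantile interval of `feature`."""
    # Assumption: the interval is drawn in quantile space and covers `width` of the rows.
    lo = rng.uniform(0.0, 1.0 - width)
    lower, upper = df[feature].quantile([lo, lo + width])
    inside = df[feature].between(lower, upper)
    return df[~inside], df[inside]


def toy_shift_metric_features(df, features, scale, rng):
    """Return a copy of df with one random constant offset added to each feature."""
    shifted = df.copy()
    for feature in features:
        # Assumption: the offset is drawn relative to the feature's standard deviation.
        offset = rng.normal(0.0, scale * df[feature].std())
        shifted[feature] = shifted[feature] + offset
    return shifted


rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AT": rng.normal(20.0, 7.0, 1000),
    "AP": rng.normal(1013.0, 6.0, 1000),
})

outside, inside = toy_partition_on_metric_range(toy, "AT", width=0.45, rng=rng)
shifted = toy_shift_metric_features(toy, ["AT", "AP"], scale=0.5, rng=rng)
print(len(outside), len(inside))
```

In this sketch, the partition produces two disjoint subsets that together cover every row, while the shift keeps all rows and moves each feature by a single offset.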

We split the original dataset into two parts: `train_dataset` and `test1_dataset`. We transform both parts using the transformation operation specified in the variable `config`. Transforming `train_dataset` yields `update_dataset`, and transforming `test1_dataset` yields `test2_dataset`.

Next, we combine `train_dataset` and `update_dataset` to obtain the `baseline_dataset`, and `test1_dataset` and `test2_dataset` to obtain the `test_dataset`.
The `baseline_dataset` is used to train the `baseline_model` from scratch. This model is the "gold standard" against which we compare all other models.
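
The bookkeeping of the four dataset parts can be sketched on synthetic data as follows. The function `toy_transform` and all `toy_*` names are hypothetical. In this toy version the transform maps every row, whereas the transforms listed above may partition or downsample the data instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy_original = pd.DataFrame({"x": rng.normal(size=200)})
toy_original["y"] = 2.0 * toy_original["x"] + rng.normal(scale=0.1, size=200)

# Split the original data into a training part and a first test part.
toy_test1 = toy_original.sample(frac=0.2, random_state=0)
toy_train = toy_original.drop(toy_test1.index)


def toy_transform(df):
    """Hypothetical transform: shift the single metric feature by a constant."""
    out = df.copy()
    out["x"] = out["x"] + 1.0
    return out


# Transforming each part yields its counterpart.
toy_update = toy_transform(toy_train)
toy_test2 = toy_transform(toy_test1)

# The baseline sees original and transformed training data; the test set mixes both as well.
toy_baseline = pd.concat([toy_train, toy_update])
toy_test = pd.concat([toy_test1, toy_test2])
print(len(toy_train), len(toy_update), len(toy_baseline), len(toy_test))
```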

We train the `trained_model` on the `train_dataset` without showing it any transformed data. Next, we update this model using the `update_dataset` and obtain the `updated_model`.

To evaluate the `baseline_model`, `trained_model`, and `updated_model`, we use the `test_dataset`. The goal is for the generalization errors of the `trained_model` and the `updated_model` to be as close as possible to that of our "gold standard", the `baseline_model`.
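
The comparison protocol can be tried out end to end with a small stand-in, sketched below under strong assumptions. It does not use the `incremental_learning` package: the model is a linear least-squares fit, and the "update" is a plain gradient-descent warm start on the new data only, which is a substitute technique chosen for brevity and not the algorithm benchmarked in this notebook. All function names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_xy(low, high, n):
    """Toy regression data: y = x**2 plus a little noise, x uniform in [low, high]."""
    x = rng.uniform(low, high, n)
    return x, x ** 2 + rng.normal(0.0, 0.05, n)


def design(x):
    # Linear model with an intercept: columns are [x, 1].
    return np.column_stack([x, np.ones_like(x)])


def fit(x, y):
    """Train from scratch with ordinary least squares."""
    weights, *_ = np.linalg.lstsq(design(x), y, rcond=None)
    return weights


def warm_start_update(weights, x, y, steps=50, learning_rate=0.1):
    """Stand-in for an incremental update: a few gradient steps on the new data only."""
    weights = weights.copy()
    features = design(x)
    for _ in range(steps):
        gradient = 2.0 * features.T @ (features @ weights - y) / len(y)
        weights -= learning_rate * gradient
    return weights


def mse(weights, x, y):
    return float(np.mean((design(x) @ weights - y) ** 2))


x_train, y_train = make_xy(0.0, 1.0, 400)    # region seen during initial training
x_update, y_update = make_xy(1.0, 2.0, 400)  # region that plays the role of transformed data
x_test1, y_test1 = make_xy(0.0, 1.0, 100)
x_test2, y_test2 = make_xy(1.0, 2.0, 100)
x_test = np.concatenate([x_test1, x_test2])
y_test = np.concatenate([y_test1, y_test2])

baseline_weights = fit(np.concatenate([x_train, x_update]),
                       np.concatenate([y_train, y_update]))
trained_weights = fit(x_train, y_train)
updated_weights = warm_start_update(trained_weights, x_update, y_update)

# All three models are scored on the same combined test set.
toy_scores = {
    "baseline": mse(baseline_weights, x_test, y_test),
    "trained_model": mse(trained_weights, x_test, y_test),
    "updated_model": mse(updated_weights, x_test, y_test),
}
print(toy_scores)
```

How close the stand-in update gets to the baseline depends on the step count and learning rate, so the numbers it prints say nothing about the algorithm evaluated below.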


In [33]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import pprint

from incremental_learning.config import jobs_dir, logger
from incremental_learning.job import train, update, evaluate
from incremental_learning.storage import read_dataset, upload_job, delete_job
from incremental_learning.transforms import transform_dataset

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [17]:
def compute_regression_metrics(y_true,
                               baseline_model_predictions,
                               trained_model_predictions,
                               updated_model_predictions):
    scores = {
        'baseline': {
            'mae': metrics.mean_absolute_error(y_true, baseline_model_predictions),
            'mse': metrics.mean_squared_error(y_true, baseline_model_predictions)
        },
        'trained_model': {
            'mae': metrics.mean_absolute_error(y_true, trained_model_predictions),
            'mse': metrics.mean_squared_error(y_true, trained_model_predictions)
        },
        'updated_model': {
            'mae': metrics.mean_absolute_error(y_true, updated_model_predictions),
            'mse': metrics.mean_squared_error(y_true, updated_model_predictions)
        },
    }
    return scores


def compute_classification_metrics(y_true,
                                   baseline_model_predictions,
                                   trained_model_predictions,
                                   updated_model_predictions):
    scores = {
        'baseline': {
            'acc': metrics.accuracy_score(y_true, baseline_model_predictions)
        },
        'trained_model': {
            'acc': metrics.accuracy_score(y_true, trained_model_predictions)
        },
        'updated_model': {
            'acc': metrics.accuracy_score(y_true, updated_model_predictions)
        },
    }

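    # Per-label precision and recall. With scikit-learn's default average='binary',
    # these calls assume a binary target.
    for label in np.unique(y_true):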
        scores['baseline']['precision_' + label] = \
            metrics.precision_score(y_true, baseline_model_predictions, pos_label=label)
        scores['trained_model']['precision_' + label] = \
            metrics.precision_score(y_true, trained_model_predictions, pos_label=label)
        scores['updated_model']['precision_' + label] = \
            metrics.precision_score(y_true, updated_model_predictions, pos_label=label)
        scores['baseline']['recall_' + label] = \
            metrics.recall_score(y_true, baseline_model_predictions, pos_label=label)
        scores['trained_model']['recall_' + label] = \
            metrics.recall_score(y_true, trained_model_predictions, pos_label=label)
        scores['updated_model']['recall_' + label] = \
            metrics.recall_score(y_true, updated_model_predictions, pos_label=label)

    return scores

In [18]:
test_fraction = 0.2
config = {
    "dataset_name": "ccpp",
    "seed": 90982247,
    "threads": 1,
    "transform_name": "partition_on_metric_ranges",
    "transform_parameters": {
        "fraction": 0.45,
        "metric_features": [
            "AT",
            "AP"
        ]
    }
}
dataset_name = config['dataset_name']
verbose = False
force_update = False


In [19]:
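original_dataset = read_dataset(config['dataset_name'])
# Work on a random 10% sample of the rows; seeding the sample keeps the run reproducible.
original_dataset = original_dataset.sample(frac=0.1, random_state=config['seed'])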
train_dataset, update_dataset, test1_dataset, test2_dataset = transform_dataset(
    dataset=original_dataset,
    test_fraction=test_fraction,
    transform_name=config['transform_name'],
    transform_parameters=config['transform_parameters'],
    seed=config['seed'])
baseline_dataset = pd.concat([train_dataset, update_dataset])
test_dataset = pd.concat([test1_dataset, test2_dataset])


In [20]:
baseline_model = train(config['dataset_name'], baseline_dataset, verbose=verbose)
elapsed_time = baseline_model.wait_to_complete()
logger.info('Elapsed time: {}'.format(elapsed_time))

[I] incremental_learning >> Elapsed time: 176.3931188583374


In [21]:
trained_model = train(dataset_name, train_dataset, verbose=verbose)
elapsed_time = trained_model.wait_to_complete()
logger.info('Elapsed time: {}'.format(elapsed_time))

[I] incremental_learning >> Elapsed time: 126.05649495124817


In [22]:
updated_model = update(dataset_name, update_dataset, trained_model, force=force_update, verbose=verbose)
elapsed_time = updated_model.wait_to_complete()
logger.info('Elapsed time: {}'.format(elapsed_time))

[I] incremental_learning >> Elapsed time: 5.320300340652466


In [23]:
baseline_eval = evaluate(dataset_name, test_dataset, baseline_model, verbose=verbose)
baseline_eval.wait_to_complete()

trained_model_eval = evaluate(dataset_name, test_dataset, trained_model, verbose=verbose)
trained_model_eval.wait_to_complete()

updated_model_eval = evaluate(dataset_name, test_dataset, updated_model, verbose=verbose)
updated_model_eval.wait_to_complete()

6.257479667663574

In [27]:
dependent_variable = baseline_model.dependent_variable

scores = {}

if baseline_model.is_regression():
    y_true = np.array([y for y in test_dataset[dependent_variable]])
    scores = compute_regression_metrics(y_true,
                                        baseline_eval.get_predictions(),
                                        trained_model_eval.get_predictions(),
                                        updated_model_eval.get_predictions())
elif baseline_model.is_classification():
    y_true = np.array([str(y) for y in test_dataset[dependent_variable]])
    scores = compute_classification_metrics(y_true,
                                            baseline_eval.get_predictions(),
                                            trained_model_eval.get_predictions(),
                                            updated_model_eval.get_predictions())

In [32]:
pprint.pprint(scores)

{'baseline': {'mae': 3.4624664751688647, 'mse': 21.21438482287693},
 'trained_model': {'mae': 4.137316767374676, 'mse': 29.138932961441892},
 'updated_model': {'mae': 3.6626670138041177, 'mse': 23.4148928431643}}


The results show that the generalization error of the `updated_model` (MAE 3.66, MSE 23.41) is lower than that of the `trained_model` (MAE 4.14, MSE 29.14) and much closer to that of the `baseline_model` (MAE 3.46, MSE 21.21).
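
One way to quantify "much closer" is the fraction of the error gap between the `trained_model` and the `baseline_model` that the update removes. The sketch below copies the scores printed above; the helper `gap_closed` and the name `reported_scores` are introduced here for illustration only.

```python
# Scores copied from the pprint output above.
reported_scores = {
    'baseline': {'mae': 3.4624664751688647, 'mse': 21.21438482287693},
    'trained_model': {'mae': 4.137316767374676, 'mse': 29.138932961441892},
    'updated_model': {'mae': 3.6626670138041177, 'mse': 23.4148928431643},
}


def gap_closed(scores, metric):
    """Fraction of the trained-to-baseline error gap removed by the update."""
    trained = scores['trained_model'][metric]
    updated = scores['updated_model'][metric]
    baseline = scores['baseline'][metric]
    return (trained - updated) / (trained - baseline)


for metric in ('mae', 'mse'):
    print(metric, round(gap_closed(reported_scores, metric), 3))
# mae 0.703
# mse 0.722
```

For this run, the update therefore recovers roughly 70% of the MAE gap and 72% of the MSE gap. A value of 1.0 would mean the `updated_model` matches the `baseline_model`.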

In [30]:
# path = jobs_dir/'demo_baseline_model'
# baseline_model.store(destination=path)
# success = upload_job(local_job_path=path)

In [34]:
# delete_job('demo_baseline_model')