# Evaluation

<img src="../_static/images/banner/plots.png" class="banner-photo"/>

## Overview


Every training `Job` automatically generates metrics when it is evaluated against each split/fold.

The `Algorithm.analysis_type` determines which metrics and plots are prepared:

* Although `'classification_multi'` and `'classification_binary'` share the same metrics and plots, they go about producing these artifacts differently: e.g. ROC curves `roc_multi_class=None` vs `roc_multi_class='ovr'`.

* `'regression'`, unlike the classification analyses, does not have an 'accuracy' metric, so we substitute 'r2', R^2 (the coefficient of determination), for it. There are no regression-specific plots in AIQC yet. Note that unsupervised/self-supervised models are also treated as regression.
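To make the one-vs-rest ('ovr') idea mentioned above concrete, here is a minimal, library-free sketch. It is not AIQC's implementation: the helper names `binary_auc` and `ovr_macro_auc`, the macro averaging, and the toy probabilities are all assumptions made for illustration.

```python
from typing import List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative."""
    positives = [s for lab, s in zip(labels, scores) if lab == 1]
    negatives = [s for lab, s in zip(labels, scores) if lab == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def ovr_macro_auc(y_true: Sequence[int], y_prob: Sequence[Sequence[float]]) -> float:
    """Score each class as 'this class vs the rest', then average the AUCs."""
    n_classes = len(y_prob[0])
    per_class: List[float] = []
    for cls in range(n_classes):
        binary_labels = [1 if y == cls else 0 for y in y_true]
        class_scores = [row[cls] for row in y_prob]
        per_class.append(binary_auc(binary_labels, class_scores))
    return sum(per_class) / n_classes


# Toy data invented for this sketch: 6 samples, 3 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_prob = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.5, 0.3],
]
macro_auc = ovr_macro_auc(y_true, y_prob)
print(round(macro_auc, 4))
```

A binary problem needs only a single call to `binary_auc`, which is why the two classification analysis types can share the same metric names while computing them differently.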

In order to accommodate the [dashboards](dashboard.html), the following arguments were added:

- `call_display:bool=True` when `True`, performs `fig.display()`; when `False`, returns the raw `fig` object instead. The learning curve, feature importance, and confusion matrix functions return a list of figs.

- `height:int=None` pixel-based adjustment for boomerang chart and feature importance.
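The `call_display` pattern can be sketched without Plotly. Everything here is hypothetical: `StubFigure` stands in for a real figure object, `plot_sketch` stands in for a plotting method, and returning `None` after displaying is an assumption rather than documented AIQC behavior.

```python
from typing import List, Optional


class StubFigure:
    """Hypothetical stand-in for a figure so the sketch runs without Plotly."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.displayed = False

    def display(self) -> None:
        self.displayed = True
        print(f"displaying: {self.title}")


def plot_sketch(titles: List[str], call_display: bool = True) -> Optional[List[StubFigure]]:
    """Build one figure per title, then either render them or hand them back."""
    figs = [StubFigure(title) for title in titles]
    if call_display:
        for fig in figs:
            fig.display()
        return None  # assumption: nothing is returned once figures are shown
    return figs  # raw figures, e.g. for a dashboard to embed


# Notebook-style usage: figures are rendered immediately.
plot_sketch(["loss", "accuracy"])

# Dashboard-style usage: figures are returned untouched.
figs = plot_sketch(["loss", "accuracy"], call_display=False)
```

Returning the raw objects lets a dashboard decide where and when each figure is rendered.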

> The actual arguments of the methods in this notebook are documented in the [Low-Level Docs](api_low_level.html#10.-Assess-the-Results.).

---

## Prerequisites

`Plotly` is used for interactive charts (hover, toggle, zoom). Reference the [Installation](installation.html#Plotting) section for information about configuring Plotly. However, static images are used in this notebook because the documentation portal does not support third-party JavaScript.

We'll use the `datum` and `tests` modules to rapidly generate a couple of examples.

In [2]:
from aiqc import datum
from aiqc import tests

---

## Classification

Let's quickly generate a trained classification model to inspect.

In [3]:
%%capture
queue_multiclass = tests.tf_multi_tab.make_queue()
queue_multiclass.run_jobs()

### Queue Visualization

`plot_performance`, aka the "boomerang chart", is unique to AIQC, and it really brings the benefits of the library to light. Each model from the Queue is evaluated against all splits/folds.

When evaluating a classification-based `Queue.analysis_type`, the following `score_type:str` values are available: accuracy, f1, roc_auc, precision, and recall.

In [None]:
queue_multiclass.plot_performance(
    max_loss=1.5, score_type='accuracy', min_score=0.70
)

![Classify Boomerang](../_static/images/visualization/classify_boomerang.png)
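As a rough sketch of what the `max_loss` and `min_score` thresholds do, the snippet below filters models by their per-split metrics. It assumes a model is kept only when every split satisfies both thresholds, which may not match AIQC's exact rule. The function `passing_predictors` and the rows for predictor 99 are invented for illustration; the rows for predictor 23 reuse values that appear in this notebook.

```python
from typing import Dict, List

rows = [
    {"predictor_id": 23, "split": "train", "accuracy": 0.912, "loss": 0.271},
    {"predictor_id": 23, "split": "validation", "accuracy": 0.81, "loss": 0.317},
    {"predictor_id": 23, "split": "test", "accuracy": 0.963, "loss": 0.24},
    # Hypothetical second model, invented to show a model being filtered out.
    {"predictor_id": 99, "split": "train", "accuracy": 0.95, "loss": 0.20},
    {"predictor_id": 99, "split": "validation", "accuracy": 0.62, "loss": 1.80},
    {"predictor_id": 99, "split": "test", "accuracy": 0.65, "loss": 1.60},
]


def passing_predictors(
    rows: List[Dict], score_type: str, min_score: float, max_loss: float
) -> List[int]:
    """Return the ids of models whose splits all meet both thresholds."""
    by_predictor: Dict[int, List[Dict]] = {}
    for row in rows:
        by_predictor.setdefault(row["predictor_id"], []).append(row)

    kept = []
    for predictor_id, splits in by_predictor.items():
        if all(
            r[score_type] >= min_score and r["loss"] <= max_loss for r in splits
        ):
            kept.append(predictor_id)
    return kept


kept = passing_predictors(rows, score_type="accuracy", min_score=0.70, max_loss=1.5)
print(kept)  # [23]
```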

### Queue Metrics

In [5]:
queue_multiclass.metrics_df(
    selected_metrics = None
    , sort_by        = 'predictor_id'
    , ascending      = True
).head(6)

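| | hyperparamcombo_id | job_id | predictor_id | split | accuracy | f1 | loss | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 | 23 | 23 | train | 0.912 | 0.911 | 0.271 | 0.917 | 0.912 | 0.983 |
| 1 | 17 | 23 | 23 | validation | 0.81 | 0.806 | 0.317 | 0.822 | 0.81 | 0.966 |
| 2 | 17 | 23 | 23 | test | 0.963 | 0.963 | 0.24 | 0.967 | 0.963 | 1.0 |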


These are also aggregated by metric across all splits/folds.

In [6]:
queue_multiclass.metrics_aggregate_to_pandas(
    selected_metrics = None
    , selected_stats = None
    , sort_by        = 'predictor_id'
    , ascending      = True
).head(12)

| | hyperparamcombo_id | job_id | predictor_id | metric | maximum | minimum | pstdev | median | mean |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 | 23 | 25 | accuracy | 0.963 | 0.81 | 0.063608 | 0.912 | 0.895 |
| 1 | 17 | 23 | 25 | f1 | 0.963 | 0.806 | 0.065301 | 0.911 | 0.893333 |
| 2 | 17 | 23 | 25 | loss | 0.317 | 0.24 | 0.031633 | 0.271 | 0.276 |
| 3 | 17 | 23 | 25 | precision | 0.967 | 0.822 | 0.060139 | 0.917 | 0.902 |
| 4 | 17 | 23 | 25 | recall | 0.963 | 0.81 | 0.063608 | 0.912 | 0.895 |
| 5 | 17 | 23 | 25 | roc_auc | 1.0 | 0.966 | 0.01388 | 0.983 | 0.983 |
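The aggregate columns look like plain descriptive statistics over the per-split values. The sketch below reproduces the accuracy row from the three split accuracies shown earlier, using only the standard library. That AIQC computes them this way is an assumption, although the numbers agree, including the population (not sample) standard deviation.

```python
import statistics

# Per-split accuracy values taken from the metrics output in this notebook.
split_accuracy = {"train": 0.912, "validation": 0.81, "test": 0.963}
values = list(split_accuracy.values())

summary = {
    "maximum": max(values),
    "minimum": min(values),
    "pstdev": statistics.pstdev(values),  # population standard deviation
    "median": statistics.median(values),
    "mean": statistics.mean(values),
}

for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

A small `pstdev` relative to the mean suggests the model behaves similarly across splits, which is one quick check for overfitting.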


### Job Visualization

A learning curve will be generated for each train-evaluation pair of metrics in the `Predictor.history` dictionary. Reference the [low-level API](api_low_level.html#Customizable-history) for more details.

Loss values in the first few epochs can often be extremely high before they plummet and then level off. This stretches out the graph and makes it hard to see whether the evaluation set is diverging. The `skip_head:bool` parameter skips the first 15% of epochs so that the figure is easier to interpret.
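The trimming that `skip_head` describes can be sketched as follows. The function name `skip_head_sketch`, the rounding-down rule, and the synthetic loss curve are assumptions for illustration; AIQC's own rounding may differ.

```python
from typing import Dict, List


def skip_head_sketch(
    history: Dict[str, List[float]], fraction: float = 0.15
) -> Dict[str, List[float]]:
    """Drop the first `fraction` of epochs from every metric series."""
    trimmed = {}
    for name, series in history.items():
        n_skip = int(len(series) * fraction)  # assumption: round down
        trimmed[name] = series[n_skip:]
    return trimmed


# Synthetic 20-epoch curves that start very high and then flatten out.
history = {
    "loss": [10.0 / (epoch + 1) for epoch in range(20)],
    "val_loss": [12.0 / (epoch + 1) for epoch in range(20)],
}

trimmed = skip_head_sketch(history)
print(len(history["loss"]), "->", len(trimmed["loss"]))
print("first plotted loss:", trimmed["loss"][0])
```

With the early spike removed, the y-axis range shrinks and the gap between the training and evaluation curves becomes much easier to read.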

In [None]:
queue_multiclass.jobs[0].predictors[0].plot_learning_curve(skip_head=True)

![Classify Learn](../_static/images/visualization/classify_learn.png)

In [None]:
queue_multiclass.jobs[0].predictors[0].predictions[0].plot_feature_importance(top_n=4)

![Classify Features](../_static/images/visualization/classify_features.png)

These classification metrics are preformatted for plotting.

In [9]:
queue_multiclass.jobs[0].predictors[0].predictions[0].plot_data['test'].keys()

dict_keys(['confusion_matrix', 'roc_curve', 'precision_recall_curve'])

In [None]:
queue_multiclass.jobs[0].predictors[0].predictions[0].plot_roc_curve()

![Classify ROC](../_static/images/visualization/classify_roc.png)

In [None]:
queue_multiclass.jobs[0].predictors[0].predictions[0].plot_confusion_matrix()

![Plot Confusion](../_static/images/visualization/classify_confusion.png)
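For readers new to the chart, a confusion matrix is just a table of counts. This is a minimal sketch with made-up labels; it assumes rows are actual classes and columns are predicted classes, which is a common convention but not confirmed here as AIQC's layout.

```python
from typing import List, Sequence


def confusion_matrix(
    y_true: Sequence[int], y_pred: Sequence[int], n_classes: int
) -> List[List[int]]:
    """Count samples by (actual class, predicted class)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Toy labels invented for this sketch.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
for row in matrix:
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 2]
```

Correct predictions sit on the diagonal, so off-diagonal cells show exactly which classes are being confused with each other.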

In [None]:
queue_multiclass.jobs[0].predictors[0].predictions[0].plot_precision_recall()

![Precision Recall](../_static/images/visualization/classify_pr.png)

### Job Metrics

Each training `Prediction` contains the following metrics by split/fold:

In [13]:
from pprint import pprint as p

In [14]:
p(queue_multiclass.jobs[0].predictors[0].predictions[0].metrics)

{'test': {'accuracy': 0.963,
          'f1': 0.963,
          'loss': 0.24,
          'precision': 0.967,
          'recall': 0.963,
          'roc_auc': 1.0},
 'train': {'accuracy': 0.912,
           'f1': 0.911,
           'loss': 0.271,
           'precision': 0.917,
           'recall': 0.912,
           'roc_auc': 0.983},
 'validation': {'accuracy': 0.81,
                'f1': 0.806,
                'loss': 0.317,
                'precision': 0.822,
                'recall': 0.81,
                'roc_auc': 0.966}}
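In the output above, `recall` equals `accuracy` on every split. That is the pattern produced by support-weighted averaging of per-class recall, because the class weights cancel the per-class denominators. Whether AIQC averages this way is an assumption drawn from the numbers, not something this notebook states. The sketch below uses invented labels to demonstrate the identity.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class recall, averaged with weights equal to each class's share."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_samples in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        score += (n_samples / total) * (hits / n_samples)
    return score


# Toy labels invented for this sketch (imbalanced on purpose).
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(round(acc, 3), round(rec, 3))
```

Precision and f1 do not share this identity, which is consistent with their values differing slightly from accuracy in the output.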


It also contains per-epoch `History` metrics calculated during model training.

In [15]:
queue_multiclass.jobs[0].predictors[0].history.keys()

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
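The train-evaluation pairs used for learning curves can be derived from these keys. This sketch assumes the Keras convention of prefixing evaluation metrics with `val_`; the helper `pair_history_keys` is hypothetical and is not how AIQC necessarily does it.

```python
from typing import Iterable, List, Tuple


def pair_history_keys(keys: Iterable[str]) -> List[Tuple[str, str]]:
    """Match each training metric with its 'val_'-prefixed counterpart."""
    keys = list(keys)
    pairs = []
    for key in keys:
        if key.startswith("val_"):
            continue  # evaluation keys are picked up via their training key
        val_key = "val_" + key
        if val_key in keys:
            pairs.append((key, val_key))
    return pairs


pairs = pair_history_keys(["loss", "accuracy", "val_loss", "val_accuracy"])
print(pairs)  # [('loss', 'val_loss'), ('accuracy', 'val_accuracy')]
```

Each pair would correspond to one learning curve figure, which is why the plotting function returns a list of figs.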

---

## Regression

Let's quickly generate a trained quantification model to inspect.

In [18]:
%%capture
queue_regression = tests.tf_reg_tab.make_queue()
queue_regression.run_jobs()

### Queue Visualization

When evaluating a regression-based `Queue.analysis_type`, the following `score_type:str` values are available: r2, mse, and explained_variance.

In [None]:
queue_regression.plot_performance(
    max_loss=1.5, score_type='r2', min_score=0.65
)

![Regression Boomerang](../_static/images/visualization/regression_boomerang.png)

### Queue Metrics

In [None]:
queue_regression.metrics_df().head(9)

These are also aggregated by metric across all splits/folds.

In [None]:
queue_regression.metrics_aggregate_to_pandas().tail(12)

### Job Visualization

In [None]:
queue_regression.jobs[0].predictors[0].plot_learning_curve(skip_head=True)

![Regression Learn](../_static/images/visualization/regression_learn.png)

In [None]:
queue_regression.jobs[0].predictors[0].predictions[0].plot_feature_importance(top_n=12)

![Regression Features](../_static/images/visualization/regression_features.png)

### Job Metrics

Each training `Prediction` contains the following metrics.

In [19]:
p(queue_regression.jobs[0].predictors[0].predictions[0].metrics)

{'test': {'explained_variance': 0.048,
          'loss': 0.754,
          'mse': 1.045,
          'r2': -0.045},
 'train': {'explained_variance': 0.036,
           'loss': 0.733,
           'mse': 0.971,
           'r2': 0.029},
 'validation': {'explained_variance': 0.048,
                'loss': 0.678,
                'mse': 0.822,
                'r2': 0.043}}
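Note that the test split has a negative `r2` alongside a positive `explained_variance`. The two metrics differ only in how they treat a systematic offset in the predictions: R^2 penalizes it, explained variance ignores it. The sketch below uses the standard textbook formulas on invented data; it is not taken from AIQC's source.

```python
import statistics
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = statistics.mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def explained_variance(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - (variance of residuals / variance of targets)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - statistics.pvariance(residuals) / statistics.pvariance(y_true)


# Toy data: a constant predictor that is offset from the true mean of 3.0.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [3.5, 3.5, 3.5, 3.5, 3.5]

r2 = r2_score(y_true, y_pred)
ev = explained_variance(y_true, y_pred)
print("r2:", r2)                  # r2: -0.125
print("explained_variance:", ev)  # explained_variance: 0.0
```

A negative R^2 simply means the model did worse than always predicting the mean of the targets, so values near zero, as in the output above, indicate a model that has learned very little.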


It also contains per-epoch metrics calculated during model training.

In [20]:
queue_regression.jobs[0].predictors[0].history.keys()

dict_keys(['loss', 'mean_squared_error', 'val_loss', 'val_mean_squared_error'])