# Model Error Analysis - Tutorial

Install the [Model Error Analysis plugin](https://www.dataiku.com/product/plugins/model-error-analysis/) and import the required libraries.

In [0]:
import dataiku

In [0]:
dataiku.use_plugin_libs('model-error-analysis')

In [0]:
from dku_error_analysis_mpp.dku_error_visualizer import DkuErrorVisualizer
from dku_error_analysis_mpp.dku_error_analyzer import DkuErrorAnalyzer
from dku_error_analysis_model_parser.model_handler_utils import get_model_handler

In [0]:
%matplotlib inline

## Load a trained primary model

Load any trained DSS classification or regression model. This is your _Primary Model_.

Replace the `lookup` and `version_id` placeholders with your own model identifiers.

In [0]:
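lookup = '5V4iiYuK'  # identifier of your model (placeholder, replace with your own)
version_id = None  # or a specific version identifier, e.g. 'initial'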

Build an accessor to your model through `get_model_handler`.

In [0]:
m = dataiku.Model(lookup)
model_handler = get_model_handler(m, version_id)

## Use a DkuErrorAnalyzer

Instantiate a `DkuErrorAnalyzer` object with your model accessor.

In [0]:
dku_error_analyzer = DkuErrorAnalyzer(model_handler)

Fit the underlying <font color=red>_Error Tree_</font> to your DSS model's performance on its test set.

In [0]:
dku_error_analyzer.fit()

### Error Tree

The underlying Error Tree is available through the `_error_tree` attribute.

Its estimator is a decision tree from `sklearn` that predicts whether the primary model's output is a _Correct Prediction_ or a _Wrong Prediction_.

In [0]:
error_clf = dku_error_analyzer._error_tree._estimator
print(error_clf)
print(error_clf.classes_)
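
To make the idea concrete, here is a minimal sketch of how the two target classes of an Error Tree could be derived from a primary classifier's predictions. It is not the plugin's implementation: the helper `make_error_labels`, the label constants, and the toy data are assumptions made for illustration.

```python
# Sketch only: derive "Correct / Wrong Prediction" labels from primary predictions.
# The helper name, label strings, and toy data are illustrative assumptions.

CORRECT = "Correct Prediction"
WRONG = "Wrong Prediction"


def make_error_labels(y_true, y_pred):
    """Label each sample by whether the primary model predicted it correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return [CORRECT if t == p else WRONG for t, p in zip(y_true, y_pred)]


# Toy ground truth and primary model predictions (made up).
y_true = ["cat", "dog", "dog", "cat", "dog"]
y_pred = ["cat", "cat", "dog", "cat", "cat"]

error_labels = make_error_labels(y_true, y_pred)
print(error_labels)
```

A decision tree trained on the primary model's features with these labels as its target then partitions the test set into segments with different error rates. For a regression model, some rule on the prediction error would be needed to decide what counts as wrong; how the plugin handles this is not covered in this tutorial.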

The features used in the Error Tree are available through the `preprocessed_feature_names` attribute.

They are the same features used by your primary model.

In [0]:
feature_names = dku_error_analyzer.preprocessed_feature_names
feature_names[:3]

### Error Tree Metrics

Use the `evaluate` function to get a report on the Error Tree, either as text (`output_format='str'`) or as a dictionary (`output_format='dict'`).

The report contains metrics computed on a part of the test set of your DSS model:
1. the Error Tree accuracy score
2. the estimated accuracy of your primary model according to the Error Tree
3. the true accuracy of your primary model
4. the _Fidelity_ of the Error Tree (absolute deviation between 2 and 3)

Ideally, values 2 and 3 are equal. Their deviation (_Fidelity_) therefore indicates how well the Error Tree represents your model's performance.

The _Confidence Decision_ states whether you can trust the Error Tree as a representation of your model's performance.

In [0]:
print(dku_error_analyzer.evaluate(output_format='str'))

In [0]:
dku_error_analyzer.evaluate(output_format='dict')
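
The four metrics listed above can be reproduced by hand on toy data. The sketch below follows the definitions given in this tutorial, with fidelity computed as the absolute deviation between the estimated and true accuracies. The function name `error_tree_report`, the dictionary keys, and the labels are assumptions, and the plugin's exact formula or scaling may differ (for example, it could report one minus this deviation).

```python
# Sketch only: recompute the report metrics from per-sample labels.
# true_labels: whether the primary model was actually right on each sample.
# tree_labels: what the Error Tree predicts for the same samples.

C = "Correct Prediction"
W = "Wrong Prediction"


def error_tree_report(true_labels, tree_labels, correct=C):
    n = len(true_labels)
    if n == 0 or n != len(tree_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    # 1. How often the Error Tree agrees with reality.
    tree_accuracy = sum(t == p for t, p in zip(true_labels, tree_labels)) / n
    # 2. Share of samples the Error Tree believes are correctly predicted.
    estimated_accuracy = sum(p == correct for p in tree_labels) / n
    # 3. Share of samples the primary model really predicted correctly.
    true_accuracy = sum(t == correct for t in true_labels) / n
    # 4. Absolute deviation between 2 and 3, as defined in this tutorial.
    fidelity = abs(estimated_accuracy - true_accuracy)
    return {
        "error_tree_accuracy": tree_accuracy,
        "estimated_primary_accuracy": estimated_accuracy,
        "true_primary_accuracy": true_accuracy,
        "fidelity": fidelity,
    }


# Made-up labels for ten test samples.
true_labels = [C, C, C, W, W, C, C, W, C, C]
tree_labels = [C, C, C, W, C, C, C, W, C, C]

report = error_tree_report(true_labels, tree_labels)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

In this toy case the Error Tree misses one of the three real errors, so it overestimates the primary model's accuracy and the deviation is non-zero.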

## Use a DkuErrorVisualizer

Instantiate a `DkuErrorVisualizer` on your `DkuErrorAnalyzer` object to access plotting and analysis functions.

In [0]:
dku_error_visualizer = DkuErrorVisualizer(dku_error_analyzer)

### Plot the Decision Tree

Plot the Error Tree and look at the red nodes, which represent your primary model's failures.

In [0]:
dku_error_visualizer.plot_error_tree(size=(25, 25))

### Explore the Error Tree nodes

Use the `get_error_leaf_summary` function of the `DkuErrorAnalyzer` to explore the nodes and get information about the samples they contain.

The summary covers:
1. the number of correct predictions,
2. the number of wrong predictions,
3. the node _Local error_: the ratio of wrongly predicted samples in the node to the total number of samples in the node. This is the error rate in the node. Focus on nodes where the local error is higher than the average error rate of the model on all samples.
4. the node _Global error_: the ratio of wrongly predicted samples in the node to the total number of mispredicted samples in the test set. Nodes with a high global error are where most of the wrong predictions are located.
5. the path to the node: the sequence of feature conditions leading to the node, which roughly describes the samples it contains. It helps you understand which feature ranges contribute the most to the error.

The nodes correspond to meaningful segments of the test set and represent different types of errors made by the primary model.

The most interesting nodes combine a high Global Error (they hold a large share of the errors) with a high Local Error (the error rate among the node's samples), especially when the local error is much higher than the model's average error rate.
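
The local and global error definitions above translate directly into a few lines of Python. The following sketch uses made-up per-leaf counts; the function `summarize_leaves`, the field names, and the leaf ids are hypothetical and do not reflect the plugin's output format.

```python
# Sketch only: local and global error per leaf, from made-up counts.
leaf_counts = {
    8: {"n_correct": 20, "n_wrong": 30},
    5: {"n_correct": 150, "n_wrong": 10},
    11: {"n_correct": 30, "n_wrong": 10},
}


def summarize_leaves(counts):
    total_wrong = sum(c["n_wrong"] for c in counts.values())
    total = sum(c["n_correct"] + c["n_wrong"] for c in counts.values())
    if total == 0:
        raise ValueError("no samples in any leaf")
    # Average error rate of the primary model over all samples.
    average_error = total_wrong / total
    rows = []
    for leaf, c in counts.items():
        n = c["n_correct"] + c["n_wrong"]
        rows.append({
            "leaf_id": leaf,
            # Error rate inside the leaf.
            "local_error": c["n_wrong"] / n if n else 0.0,
            # Share of all mispredicted samples that fall in this leaf.
            "global_error": c["n_wrong"] / total_wrong if total_wrong else 0.0,
        })
    # Leaves holding the largest share of errors come first.
    rows.sort(key=lambda r: r["global_error"], reverse=True)
    return average_error, rows


average_error, rows = summarize_leaves(leaf_counts)
# Leaves whose error rate exceeds the model's average error rate.
suspicious = [r["leaf_id"] for r in rows if r["local_error"] > average_error]

print(f"average error rate: {average_error:.2f}")
for r in rows:
    print(r)
print("leaves above average:", suspicious)
```

Sorting by global error and then filtering on local error mirrors the reading order suggested above: first find where most errors sit, then check whether the segment is unusually error-prone.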

You can pass leaf nodes identified in the tree plot above through the `leaf_selector` argument.

Replace `leaf_id` with the leaf node you would like to explore. 

In [0]:
leaf_id = 8

In [0]:
dku_error_analyzer.get_error_leaf_summary(leaf_selector=leaf_id, add_path_to_leaves=True, output_format='dict')

You can also let the analyzer show you all the leaf nodes ranked by importance (higher global error).

In [0]:
dku_error_analyzer.get_error_leaf_summary(add_path_to_leaves=True, output_format='dict')

### Display the Feature Distributions of samples in the nodes

You can use the `DkuErrorVisualizer` to plot histograms of the features in the nodes and compare them to the global population, which serves as a mostly successful baseline.

Again, you can select leaf nodes with the `leaf_selector` argument.

In [0]:
dku_error_visualizer.plot_feature_distributions_on_leaves(leaf_selector=leaf_id, top_k_features=1, show_class=True, show_global=False)

You can also let the visualizer show you the feature distributions in all the leaf nodes ranked by importance.

In [0]:
dku_error_visualizer.plot_feature_distributions_on_leaves(top_k_features=1, show_class=True)

In this example, we observe that the primary model yields wrong predictions for houses with very large living rooms and highly rated views. Comparing with the global baseline, we see that these samples are under-represented in the primary training set.
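
The kind of comparison these plots support can be sketched without the plugin: bin one feature for the samples in a leaf and for the global population, then compare the share of samples in each bin. The values, bin edges, and the helper `bin_shares` below are made up for illustration and are not taken from the example dataset.

```python
from bisect import bisect_right

# Sketch only: compare a feature's distribution in a leaf with the global one.


def bin_shares(values, edges):
    """Share of values per bin [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # ignore values outside the binning range
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Hypothetical living-area values for all samples and for one leaf.
global_values = [900, 1200, 1500, 1800, 2000, 2200, 2500, 3000, 4500, 5200]
leaf_values = [4500, 5200]
edges = [0, 2000, 4000, 6000]

global_shares = bin_shares(global_values, edges)
leaf_shares = bin_shares(leaf_values, edges)

for lo, hi, g, l in zip(edges, edges[1:], global_shares, leaf_shares):
    print(f"[{lo}, {hi}): global {g:.0%}, leaf {l:.0%}")
```

A bin that dominates the leaf while holding only a small share of the global population points to a segment the primary model has seen little of.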

## Moving forward

The Model Error Analysis plugin automatically highlights information relevant to the model errors, so that the user can focus on which features are problematic and which values of these features are typical of the mispredicted samples.

This information can later be used to support the strategy selected by the user:
* **improve model design**: removing a problematic feature, removing samples likely to be mislabeled, ensembling with a model trained on a problematic subpopulation, ...
* **enhance data collection**: gather more data on the most error-prone under-represented populations,
* **select critical samples for manual inspection** thanks to the Error Tree and avoid primary predictions on them, generating model assertions.
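
As an illustration of the last strategy, the path to a high-error leaf can be turned into a simple rule that routes matching samples to manual inspection instead of the primary model. This is a sketch under stated assumptions: the feature names `living_area` and `view_score`, the thresholds, and the sample rows are hypothetical, and a real rule would come from the path reported for your own leaf.

```python
# Sketch only: route samples matching a high-error leaf's path to manual review.
# Feature names and thresholds are hypothetical placeholders.


def in_high_error_segment(row):
    """Rule mimicking the path to a leaf where the primary model often fails."""
    return row["living_area"] > 4000 and row["view_score"] >= 3


incoming = [
    {"id": 1, "living_area": 1500, "view_score": 0},
    {"id": 2, "living_area": 5200, "view_score": 4},
    {"id": 3, "living_area": 4800, "view_score": 1},
    {"id": 4, "living_area": 6100, "view_score": 3},
]

to_review = [r["id"] for r in incoming if in_high_error_segment(r)]
to_score = [r["id"] for r in incoming if not in_high_error_segment(r)]

print("manual inspection:", to_review)
print("scored by primary model:", to_score)
```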