# Trust Score Comparison

This notebook provides an overview of using and understanding the trust score comparison check.

**Structure:**

- [What is trust score?](#what_is_trust_score)
- [Loading the data](#load_data_model)
- [Run the check](#run_check)
- [Define a condition](#define_condition)


<a id='what_is_trust_score'></a>
## What is trust score?

Trust score is an alternative measure of model confidence, used in classification problems to assign a higher score to samples whose prediction is more likely to be correct.

#### What is model confidence

Model confidence commonly refers to the probability that a classification model assigns to its predicted class. This quantity is useful for a variety of tasks:
1. Detecting "problematic samples" before labels become available - predictions with low probability are more likely to be wrong.
2. Risk management - in use cases such as loan approval, we may want to weigh the probability that the loan will be repaid against the loaned sum and the expected return.
3. Early warning of concept drift - a significant decline in the average confidence of samples encountered in production or test data indicates that the model is predicting on more and more samples about which it is unsure.
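As a minimal sketch of the definition above: the confidence of a prediction is the largest entry in that sample's row of class probabilities. The probability values, the helper name `model_confidence` and the 0.7 cutoff are hypothetical and chosen only for illustration; in practice the rows would come from a model's `predict_proba` output.

```python
# Hypothetical predict_proba-style output: one row per sample, one column per class.
predicted_probabilities = [
    [0.95, 0.05],
    [0.40, 0.60],
    [0.52, 0.48],
]


def model_confidence(probabilities):
    """Return (predicted class index, confidence) for each row of class probabilities."""
    results = []
    for row in probabilities:
        predicted_class = max(range(len(row)), key=row.__getitem__)
        results.append((predicted_class, row[predicted_class]))
    return results


confidences = model_confidence(predicted_probabilities)

# Flag samples whose confidence falls below an arbitrary, illustrative cutoff.
low_confidence = [i for i, (_, conf) in enumerate(confidences) if conf < 0.7]
print(confidences)
print(low_confidence)
```

Samples flagged this way are the candidates for the "problematic samples" review described in the first item of the list.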

#### Trust Score compared to predicted probability

"Regular" model confidence is easy to compute - just use the model's `predict_proba` function. The danger of relying on the values produced by the model itself is that they are often uncalibrated, meaning that predicted probabilities don't correspond to the actual percentage of correct predictions (see the <a href="https://docs.deepchecks.com/en/stable/examples/checks/performance/calibration_score.html" target="_blank">calibration score</a> check for more info). This is because the methods and loss functions used by these models are often not designed to produce actual probabilities. Additionally, the most common classification metrics (such as precision, recall and accuracy) measure only the quality of the final prediction (after a threshold is applied to the predicted probability), not the quality of the probability itself. This reinforces the tendency to ignore the quality of the probabilities themselves.
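To make "uncalibrated" concrete, here is a small sketch that bins predictions by confidence and compares the mean confidence in each bin with the observed accuracy. It is not the deepchecks calibration check; the function names and the toy data are assumptions made for illustration, and the summary number is the commonly used expected calibration error.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins and return
    (bin index, count, mean confidence, accuracy) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))
    table = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        table.append((idx, len(members), mean_conf, accuracy))
    return table


def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between confidence and accuracy across bins."""
    total = len(confidences)
    table = calibration_table(confidences, correct, n_bins)
    return sum(n / total * abs(mean_conf - acc) for _, n, mean_conf, acc in table)


# Hypothetical data: the model claims 95% confidence but is right only half the time.
toy_confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
toy_correct = [True, True, False, False, True, False]
ece = expected_calibration_error(toy_confidences, toy_correct)
print(round(ece, 3))
```

A well-calibrated model would show a gap close to zero in every bin; a large gap means the predicted probabilities cannot be read as actual chances of being correct.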

Trust Score is an alternative method for scoring the trustworthiness of the model's predictions that is completely independent of the model implementation. The method and code used by the deepchecks package were published in <a href="https://arxiv.org/abs/1805.11783" target="_blank">To Trust Or Not To Trust A Classifier</a>.
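The core idea of the paper can be sketched as a ratio of distances: how far the sample is from the nearest class other than the predicted one, divided by how far it is from the predicted class. The sketch below is a deliberate simplification and not the deepchecks implementation: it uses plain nearest-neighbour distances to raw training points and skips the density-based filtering of each class that the paper applies first. The function name and the toy points are hypothetical.

```python
import math


def simplified_trust_score(sample, predicted_class, train_points_by_class):
    """Distance to the nearest training point of any other class, divided by
    the distance to the nearest training point of the predicted class."""

    def nearest_distance(points):
        return min(math.dist(sample, p) for p in points)

    dist_to_predicted = nearest_distance(train_points_by_class[predicted_class])
    dist_to_other = min(
        nearest_distance(points)
        for cls, points in train_points_by_class.items()
        if cls != predicted_class
    )
    # A small epsilon avoids division by zero when the sample coincides
    # with a training point of the predicted class.
    return dist_to_other / (dist_to_predicted + 1e-12)


# Hypothetical two-class training data in two dimensions.
toy_train = {
    0: [(0.0, 0.0), (0.0, 1.0)],
    1: [(4.0, 0.0), (4.0, 1.0)],
}
sample = (1.0, 0.0)

score_if_predicted_0 = simplified_trust_score(sample, 0, toy_train)
score_if_predicted_1 = simplified_trust_score(sample, 1, toy_train)
print(score_if_predicted_0, score_if_predicted_1)
```

A score above 1 means the sample sits closer to the predicted class than to any other class; a score below 1 means the prediction disagrees with the sample's neighbourhood. Because only the data and the predicted label are used, the score does not depend on how the model computes its probabilities.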
 
Trust score has been shown to perform better than predicted probability at identifying correctly classified samples, and is used by the TrustScoreComparison check for:
1. Identifying the samples with the highest (and lowest) scores, which are the samples most (and least) likely to be correctly classified by the model. This is useful for visually detecting common qualities among the highest and lowest confidence samples.
2. Detecting a degradation in trust score on the test data relative to the training data, which may indicate that the model will perform worse on test than on train and can serve as a way to detect concept drift. This is especially useful when test labels are not available, such as when performing inference on new, unseen data.

<a id='load_data_model'></a>
## Loading the data

We'll load the scikit-learn breast cancer dataset to test out the Trust Score check.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from deepchecks.tabular.datasets.classification.breast_cancer import load_data
from deepchecks.tabular import Dataset

label = 'target'

train_df, test_df = load_data(data_format='Dataframe')
train = Dataset(train_df, label=label)
test = Dataset(test_df, label=label)

clf = AdaBoostClassifier()
features = train_df.drop(label, axis=1)
target = train_df[label]
clf = clf.fit(features, target)

<a id='run_check'></a>
## Run the check

Next, we'll run the check on the dataset and model, changing the default value of `min_test_samples` so that the check can run on this small dataset. Here we'll run the check "as is" and introduce the condition in the [next section](#define_condition).\
Additional optional parameters include the maximum sample size, the random state, the number of highest and lowest Trust Score samples to show, and various hyperparameters controlling the trust score algorithm.

In [2]:
from deepchecks.tabular.checks import TrustScoreComparison

TrustScoreComparison(min_test_samples=100).run(train, test, clf)



Unnamed: 0,Trust Score,Model Prediction,target,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
92,0.81,0,1,14.99,22.11,97.53,693.7,0.09,0.1,0.07,0.04,0.19,0.06,0.32,1.34,2.31,28.51,0.0,0.03,0.03,0.01,0.02,0.0,16.76,31.55,110.2,867.1,0.11,0.33,0.31,0.13,0.32,0.09
40,0.8,0,0,15.13,29.81,96.71,719.5,0.08,0.05,0.05,0.03,0.19,0.05,0.47,1.63,3.04,45.38,0.01,0.01,0.02,0.01,0.03,0.0,17.26,36.91,110.1,931.4,0.11,0.1,0.15,0.07,0.32,0.06
108,0.78,0,0,15.61,19.38,100.0,758.6,0.08,0.06,0.04,0.03,0.15,0.05,0.23,1.0,1.53,22.18,0.0,0.01,0.01,0.01,0.01,0.0,17.91,31.67,115.9,988.6,0.11,0.18,0.23,0.09,0.27,0.07
136,0.77,0,1,14.74,25.42,94.7,668.6,0.08,0.07,0.04,0.03,0.18,0.06,0.3,1.39,2.18,27.41,0.0,0.01,0.02,0.01,0.02,0.0,16.51,32.29,107.4,826.4,0.11,0.14,0.16,0.11,0.27,0.07
65,0.69,0,1,12.04,28.14,76.85,449.9,0.09,0.06,0.02,0.02,0.19,0.06,0.61,2.64,4.1,44.96,0.01,0.02,0.01,0.01,0.02,0.0,13.6,33.33,87.24,567.6,0.1,0.1,0.06,0.06,0.24,0.07

Unnamed: 0,Trust Score,Model Prediction,target,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
138,3.28,0,0,23.21,26.97,153.5,1670.0,0.1,0.17,0.2,0.12,0.19,0.06,1.06,0.96,7.25,155.8,0.01,0.03,0.04,0.02,0.02,0.0,31.01,34.51,206.0,2944.0,0.15,0.41,0.58,0.26,0.31,0.09
127,3.14,1,1,11.29,13.04,72.23,388.0,0.1,0.08,0.03,0.03,0.18,0.06,0.19,0.53,1.16,13.17,0.01,0.01,0.01,0.01,0.02,0.0,12.32,16.18,78.27,457.5,0.14,0.15,0.13,0.09,0.27,0.08
142,2.9,0,0,19.59,18.15,130.7,1214.0,0.11,0.17,0.25,0.13,0.2,0.06,0.74,1.05,4.79,97.07,0.0,0.02,0.04,0.01,0.02,0.0,26.73,26.39,174.9,2232.0,0.14,0.38,0.68,0.22,0.36,0.09
86,2.84,1,1,13.5,12.71,85.69,566.2,0.07,0.04,0.0,0.0,0.14,0.05,0.22,0.69,1.51,20.39,0.0,0.0,0.0,0.0,0.01,0.0,14.97,16.94,95.48,698.7,0.09,0.06,0.01,0.02,0.23,0.06
43,2.68,1,1,10.32,16.35,65.31,324.9,0.09,0.05,0.01,0.01,0.19,0.06,0.21,0.97,1.36,12.97,0.01,0.01,0.01,0.01,0.02,0.0,11.25,21.77,71.12,384.9,0.13,0.09,0.04,0.02,0.27,0.07


### Analyzing the output

From here we can see that the highest trust score predictions (second table) are all correct, while the lowest trust score samples (first table) are wrong more often than not and are always predicted to belong to the negative class (0).

Furthermore, we may notice other common characteristics: `worst texture` and `mean texture` both tend to be lower in the top-scoring samples, while the worst-scoring samples have high values for both. These two features have high feature importance for the AdaBoost model. Might it be that high-texture samples get worse predictions from the model?
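The two observations above can be checked directly against the rows shown in the check output. The sketch below copies the trust score, model prediction, target and `worst texture` values from the two tables and summarizes each group; the helper name `summarize` is hypothetical, and with only five rows per group this is an illustration rather than evidence.

```python
# (trust score, model prediction, target, worst texture), copied from the tables above.
lowest_scores = [
    (0.81, 0, 1, 31.55),
    (0.80, 0, 0, 36.91),
    (0.78, 0, 0, 31.67),
    (0.77, 0, 1, 32.29),
    (0.69, 0, 1, 33.33),
]
highest_scores = [
    (3.28, 0, 0, 34.51),
    (3.14, 1, 1, 16.18),
    (2.90, 0, 0, 26.39),
    (2.84, 1, 1, 16.94),
    (2.68, 1, 1, 21.77),
]


def summarize(rows):
    """Return (accuracy, mean worst texture) for a group of rows."""
    accuracy = sum(pred == target for _, pred, target, _ in rows) / len(rows)
    mean_worst_texture = sum(texture for *_, texture in rows) / len(rows)
    return accuracy, mean_worst_texture


low_accuracy, low_texture = summarize(lowest_scores)
high_accuracy, high_texture = summarize(highest_scores)
print(low_accuracy, round(low_texture, 2))
print(high_accuracy, round(high_texture, 2))
```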

In [3]:
importances = pd.Series(clf.feature_importances_, index=features.columns, name='Model Feature importance')
importances.sort_values(ascending=False).to_frame().head(7)

| Feature | Model Feature importance |
|---|---|
| compactness error | 0.08 |
| worst texture | 0.08 |
| fractal dimension error | 0.08 |
| area error | 0.08 |
| mean concave points | 0.06 |
| worst perimeter | 0.06 |
| mean texture | 0.06 |


<a id='define_condition'></a>
## Define a condition

### Introducing concept drift 

First, we introduce concept drift into the data by changing the relation between the `worst texture` and `mean concave points` features, both important features for the model.

In [4]:
mod_test_df = test_df.copy()
np.random.seed(0)
sample_idx = np.random.choice(test_df.index, 80, replace=False)
mod_test_df.loc[sample_idx, 'worst texture'] = mod_test_df.loc[sample_idx, 'target'] * (mod_test_df.loc[sample_idx, 'mean concave points'] > 0.05)
mod_test = Dataset(mod_test_df, label=label)

### Checking for decline in Trust Score

Now, we define a condition on the Trust Score check to alert us to a significant degradation in the mean Trust Score of the test data compared to the training data. The allowed decline can be changed by passing a different threshold to the condition (the default is 0.2, i.e. a 20% decline); here we use 0.19.
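The logic behind such a condition can be sketched as follows. This assumes the decline is measured as the drop in the mean score relative to the training mean; the exact formula used inside deepchecks may differ, and the function names and scores below are hypothetical.

```python
def mean_score_percent_decline(train_scores, test_scores):
    """Relative drop of the mean test score compared to the mean train score."""
    train_mean = sum(train_scores) / len(train_scores)
    test_mean = sum(test_scores) / len(test_scores)
    return (train_mean - test_mean) / train_mean


def decline_within_threshold(train_scores, test_scores, threshold=0.2):
    """True when the decline does not exceed the allowed threshold."""
    return mean_score_percent_decline(train_scores, test_scores) <= threshold


# Hypothetical trust scores: mean 2.0 on train, mean 1.5 on test.
toy_train_scores = [2.0, 1.5, 2.5]
toy_test_scores = [1.5, 1.2, 1.8]

decline = mean_score_percent_decline(toy_train_scores, toy_test_scores)
print(round(decline, 2))
print(decline_within_threshold(toy_train_scores, toy_test_scores, threshold=0.2))
print(decline_within_threshold(toy_train_scores, toy_test_scores, threshold=0.3))
```

Note that the comparison needs only the scores themselves, not the test labels, which is what makes this kind of condition usable on unlabeled data.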

In [5]:
from deepchecks.tabular.checks import TrustScoreComparison

TrustScoreComparison(min_test_samples=100).add_condition_mean_score_percent_decline_not_greater_than(threshold=0.19).run(train, mod_test, clf)






| Status | Condition | More Info |
|---|---|---|
| ! | Mean trust score decline is not greater than 19% | Found decline of: -21.09% |





Unnamed: 0,Trust Score,Model Prediction,target,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
183,0.9,0,0,13.61,24.98,88.05,582.7,0.09,0.09,0.09,0.04,0.16,0.06,0.46,1.29,2.86,43.14,0.01,0.01,0.03,0.01,0.01,0.0,16.99,0.0,108.6,906.5,0.13,0.19,0.32,0.12,0.27,0.07
151,0.83,0,1,14.26,19.65,97.83,629.9,0.08,0.22,0.3,0.08,0.17,0.08,0.36,1.49,3.4,29.25,0.01,0.07,0.14,0.02,0.03,0.01,15.3,23.73,107.0,709.0,0.09,0.42,0.68,0.15,0.24,0.11
92,0.81,0,1,14.99,22.11,97.53,693.7,0.09,0.1,0.07,0.04,0.19,0.06,0.32,1.34,2.31,28.51,0.0,0.03,0.03,0.01,0.02,0.0,16.76,31.55,110.2,867.1,0.11,0.33,0.31,0.13,0.32,0.09
136,0.77,0,1,14.74,25.42,94.7,668.6,0.08,0.07,0.04,0.03,0.18,0.06,0.3,1.39,2.18,27.41,0.0,0.01,0.02,0.01,0.02,0.0,16.51,32.29,107.4,826.4,0.11,0.14,0.16,0.11,0.27,0.07
65,0.69,0,1,12.04,28.14,76.85,449.9,0.09,0.06,0.02,0.02,0.19,0.06,0.61,2.64,4.1,44.96,0.01,0.02,0.01,0.01,0.02,0.0,13.6,33.33,87.24,567.6,0.1,0.1,0.06,0.06,0.24,0.07

Unnamed: 0,Trust Score,Model Prediction,target,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
138,3.28,0,0,23.21,26.97,153.5,1670.0,0.1,0.17,0.2,0.12,0.19,0.06,1.06,0.96,7.25,155.8,0.01,0.03,0.04,0.02,0.02,0.0,31.01,34.51,206.0,2944.0,0.15,0.41,0.58,0.26,0.31,0.09
127,3.14,1,1,11.29,13.04,72.23,388.0,0.1,0.08,0.03,0.03,0.18,0.06,0.19,0.53,1.16,13.17,0.01,0.01,0.01,0.01,0.02,0.0,12.32,16.18,78.27,457.5,0.14,0.15,0.13,0.09,0.27,0.08
142,2.9,0,0,19.59,18.15,130.7,1214.0,0.11,0.17,0.25,0.13,0.2,0.06,0.74,1.05,4.79,97.07,0.0,0.02,0.04,0.01,0.02,0.0,26.73,26.39,174.9,2232.0,0.14,0.38,0.68,0.22,0.36,0.09
43,2.68,1,1,10.32,16.35,65.31,324.9,0.09,0.05,0.01,0.01,0.19,0.06,0.21,0.97,1.36,12.97,0.01,0.01,0.01,0.01,0.02,0.0,11.25,21.77,71.12,384.9,0.13,0.09,0.04,0.02,0.27,0.07
180,2.63,0,0,18.63,25.11,124.8,1088.0,0.11,0.19,0.23,0.12,0.22,0.06,0.83,1.47,5.57,105.0,0.01,0.03,0.05,0.01,0.02,0.0,23.15,34.01,160.5,1670.0,0.15,0.43,0.61,0.18,0.34,0.1


### Analyzing the output

The condition alerts us to the fact that the mean Trust Score has declined by ~21%, which is more than the 19% we allowed!

The decline is also evident in the plot showing the distribution of Trust Scores in each dataset, in which the test data has significantly more samples with a Trust Score around 1 than the training data. The distribution for the modified test data is visibly skewed to the left (low Trust Score) due to the concept drift we introduced, and the condition helps us detect this new skew. Did this skew in the data really change the performance of the model?

In [6]:
from deepchecks.tabular.checks.performance import MultiModelPerformanceReport

In [7]:
MultiModelPerformanceReport().run([train, train], [test, mod_test], {'unmodified test': clf, 'modified test': clf})

Using the MultiModelPerformanceReport we can clearly see that several metrics (such as F1 and recall) have declined on the modified test dataset. In a use case where labels were not available for the test data, we would still have known to be wary, thanks to the condition raised by the Trust Score check on the modified data!