To demonstrate the ``ShapExplainer.fit()`` method, we load the ``DOM_GSEC`` example dataset and its respective feature set (see [Breimann24c]_):

In [1]:
import shap
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(10)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)

Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c
1,Q14802,MQKVTLGLLVFLAGF...PGETPPLITPGSAQS,0,37,59,NSPFYYDWHS,LQVGGLICAGVLCAMGIIIVMSA,KCKCKFGQKS
2,Q86UE4,MAARSWQDELAQQAE...SPKQIKKKKKARRET,0,50,72,LGLEPKRYPG,WVILVGTGALGLLLLFLLGYGWA,AACAGARKKR
3,Q969W9,MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL,0,41,63,FQSMEITELE,FVQIIIIVVVMMVMVVVITCLLS,HYKLSARSFI
4,P05067,MLPGLALLLLAAWTA...GYENPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
5,P14925,MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
6,P70180,MRSLLLFTFSACVLL...RELREDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER


We can now create a ``ShapExplainer`` object and fit it. The SHAP values and the expected (base) value are then available via the ``shap_values`` and ``exp_value`` attributes:

In [2]:
se = aa.ShapExplainer()
se.fit(X, labels=labels)

shap_values = se.shap_values
exp_value = se.exp_value

# Print SHAP values and expected value
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))

print("\nThe expected value approximates the expected model output (average prediction score).")
print("For a binary classification with balanced datasets, it is around 0.5:")
print(exp_value)

SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.05 -0.06 -0.04 -0.04 -0.04 -0.   -0.06 -0.06 -0.06 -0.04]
 [-0.06 -0.07 -0.05 -0.05 -0.04 -0.04 -0.05 -0.07 -0.05  0.01]
 [-0.07 -0.08 -0.01 -0.04 -0.01 -0.04 -0.05 -0.06 -0.   -0.04]
 [ 0.06  0.08  0.03  0.04  0.02  0.04  0.06  0.06  0.06  0.02]
 [ 0.07  0.08  0.05  0.05  0.05  0.01  0.05  0.06  0.02  0.03]
 [ 0.06  0.08  0.04  0.04  0.03  0.04  0.06  0.06  0.05  0.03]]

The expected value approximates the expected model output (average prediction score).
For a binary classification with balanced datasets, it is around 0.5:
0.4935000000000002
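
SHAP values are additive: for each sample, the base value plus the sum of its SHAP values reconstructs the model output being explained. The following is a minimal sketch, assuming this property carries over to the round-averaged values returned by ``ShapExplainer`` (averaging is linear, but the printed values are rounded, so the result is only approximate). It turns the rounded output above back into per-sample prediction scores:

```python
import numpy as np

# Rounded SHAP values copied from the output above (6 samples x 10 features)
shap_values = np.array([
    [-0.05, -0.06, -0.04, -0.04, -0.04, -0.00, -0.06, -0.06, -0.06, -0.04],
    [-0.06, -0.07, -0.05, -0.05, -0.04, -0.04, -0.05, -0.07, -0.05,  0.01],
    [-0.07, -0.08, -0.01, -0.04, -0.01, -0.04, -0.05, -0.06, -0.00, -0.04],
    [ 0.06,  0.08,  0.03,  0.04,  0.02,  0.04,  0.06,  0.06,  0.06,  0.02],
    [ 0.07,  0.08,  0.05,  0.05,  0.05,  0.01,  0.05,  0.06,  0.02,  0.03],
    [ 0.06,  0.08,  0.04,  0.04,  0.03,  0.04,  0.06,  0.06,  0.05,  0.03],
])
exp_value = 0.4935

# Additivity: base value + sum of per-feature contributions ~ prediction score
pred_scores = exp_value + shap_values.sum(axis=1)
print(pred_scores.round(2))
```

Under this assumption, the three negative samples receive scores close to 0 and the three positive samples scores close to 1, which matches their labels.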


SHAP values are computed with respect to one class of the classifier, which can be selected using the ``class_index`` parameter (default=1, the positive class):

In [3]:
se = aa.ShapExplainer()
# Reverse sign of shap values by setting class to 0
se.fit(X, labels=labels, class_index=0)

shap_values = se.shap_values
exp_value = se.exp_value

print("Reverse sign of shap values by changing reference class from 1 to 0")
print(shap_values.round(2))
print("\nBase value stays around 0.5:")
print(exp_value)

Reverse sign of shap values by changing reference class from 1 to 0
[[ 0.05  0.06  0.04  0.06  0.04  0.    0.05  0.06  0.05  0.03]
 [ 0.06  0.07  0.04  0.06  0.04  0.03  0.05  0.07  0.05 -0.01]
 [ 0.07  0.08  0.02  0.05  0.02  0.03  0.05  0.06  0.    0.03]
 [-0.07 -0.07 -0.03 -0.05 -0.02 -0.03 -0.06 -0.07 -0.05 -0.02]
 [-0.07 -0.08 -0.05 -0.06 -0.05 -0.01 -0.05 -0.07 -0.01 -0.02]
 [-0.06 -0.07 -0.04 -0.06 -0.04 -0.03 -0.06 -0.07 -0.05 -0.02]]

Base value stays around 0.5:
0.5056666666666668
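
For a binary classifier the two class probabilities sum to 1, so the SHAP values for class 0 are expected to mirror those for class 1, and the two base values should sum to about 1. The sketch below checks this on the rounded outputs printed above. Since each ``fit()`` call trains new models, small deviations are expected. The variable names are illustrative only.

```python
import numpy as np

# Rounded outputs copied from the runs with class_index=1 and class_index=0
sv_class1 = np.array([
    [-0.05, -0.06, -0.04, -0.04, -0.04, -0.00, -0.06, -0.06, -0.06, -0.04],
    [-0.06, -0.07, -0.05, -0.05, -0.04, -0.04, -0.05, -0.07, -0.05,  0.01],
    [-0.07, -0.08, -0.01, -0.04, -0.01, -0.04, -0.05, -0.06, -0.00, -0.04],
    [ 0.06,  0.08,  0.03,  0.04,  0.02,  0.04,  0.06,  0.06,  0.06,  0.02],
    [ 0.07,  0.08,  0.05,  0.05,  0.05,  0.01,  0.05,  0.06,  0.02,  0.03],
    [ 0.06,  0.08,  0.04,  0.04,  0.03,  0.04,  0.06,  0.06,  0.05,  0.03],
])
sv_class0 = np.array([
    [ 0.05,  0.06,  0.04,  0.06,  0.04,  0.00,  0.05,  0.06,  0.05,  0.03],
    [ 0.06,  0.07,  0.04,  0.06,  0.04,  0.03,  0.05,  0.07,  0.05, -0.01],
    [ 0.07,  0.08,  0.02,  0.05,  0.02,  0.03,  0.05,  0.06,  0.00,  0.03],
    [-0.07, -0.07, -0.03, -0.05, -0.02, -0.03, -0.06, -0.07, -0.05, -0.02],
    [-0.07, -0.08, -0.05, -0.06, -0.05, -0.01, -0.05, -0.07, -0.01, -0.02],
    [-0.06, -0.07, -0.04, -0.06, -0.04, -0.03, -0.06, -0.07, -0.05, -0.02],
])
exp_class1, exp_class0 = 0.4935000000000002, 0.5056666666666668

# If the values mirror each other, their element-wise sum is close to zero
max_dev = np.abs(sv_class1 + sv_class0).max()
base_sum = exp_class1 + exp_class0
print(f"Largest deviation from exact mirroring: {max_dev:.2f}")
print(f"Sum of both base values: {base_sum:.3f}")
```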


To obtain Monte Carlo estimates of both quantities, the ``ShapExplainer.fit()`` method performs 5 rounds of model fitting and averages the ``shap_values`` and ``exp_value`` across all rounds. The number of rounds can be adjusted using the ``n_rounds`` parameter (default=5):

In [4]:
se = aa.ShapExplainer()
se = se.fit(X, labels=labels, n_rounds=10)
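
The effect of averaging over rounds can be illustrated with a small simulation. This is only a sketch of the Monte Carlo idea and does not use ``ShapExplainer``: ``estimate_once`` is a hypothetical stand-in for one round of model fitting, modeled as fixed "true" values plus random noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" attributions for 6 samples and 10 features
true_values = rng.normal(scale=0.05, size=(6, 10))

def estimate_once(rng):
    """Stand-in for one fitting round: the true values plus random noise."""
    return true_values + rng.normal(scale=0.05, size=true_values.shape)

def monte_carlo_estimate(n_rounds, rng):
    """Average the per-round estimates over n_rounds independent rounds."""
    rounds = [estimate_once(rng) for _ in range(n_rounds)]
    return np.mean(rounds, axis=0)

for n_rounds in (1, 5, 100):
    estimate = monte_carlo_estimate(n_rounds, rng)
    error = np.abs(estimate - true_values).mean()
    print(f"n_rounds={n_rounds:>3}: mean absolute error = {error:.4f}")
```

In this toy setting the error shrinks as the number of rounds grows, at the cost of proportionally more model fits.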

A pre-selection of features can be provided using the ``is_selected`` parameter, where each row is a binary mask over the features:

In [5]:
# Create pre-selection arrays (top 3 and top 5 features will be selected) 
is_selected = [[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
               [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]
se = aa.ShapExplainer()
se = se.fit(X, labels=labels, is_selected=is_selected)

print("Impact of feature pre-selection")
print(se.shap_values.round(2))

Impact of feature pre-selection
[[-0.12 -0.13 -0.12 -0.05 -0.04  0.    0.    0.    0.    0.  ]
 [-0.14 -0.15 -0.12 -0.05 -0.04  0.    0.    0.    0.    0.  ]
 [-0.16 -0.17 -0.05 -0.05 -0.01  0.    0.    0.    0.    0.  ]
 [ 0.15  0.15  0.07  0.05  0.02  0.    0.    0.    0.    0.  ]
 [ 0.14  0.15  0.12  0.05  0.04  0.    0.    0.    0.    0.  ]
 [ 0.14  0.15  0.12  0.05  0.04  0.    0.    0.    0.    0.  ]]
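
The masks from the cell above can also be built programmatically. The following sketch assumes the feature columns are already sorted by rank, so that the "top k" features are simply the first k columns. ``top_k_per_round`` is a hypothetical helper variable, not a library parameter.

```python
import numpy as np

n_features = 10
top_k_per_round = [3, 5]  # hypothetical choice: top 3 and top 5 features

# One binary mask per row; assumes features are sorted by rank
is_selected = np.array(
    [np.arange(n_features) < k for k in top_k_per_round], dtype=int
)
print(is_selected)

# Fraction of masks in which each feature is included
selection_rate = is_selected.mean(axis=0)
print(selection_rate)
```

Features included in both masks have a selection rate of 1, features 4 and 5 a rate of 0.5, and the remaining features a rate of 0. This is consistent with the output above, where features never selected receive SHAP values of exactly 0.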


To obtain a reliable SHAP value estimation for a fuzzy labeled sample (0 < label < 1), set ``fuzzy_labeling=True``:

In [6]:
# Create fuzzy label
labels[0] = 0.5
se = aa.ShapExplainer()
se = se.fit(X, labels=labels, is_selected=is_selected, fuzzy_labeling=True)

print("First sample is labeled as 0.5 between negative (0) and positive (1)")
print(se.shap_values.round(2))

First sample is labeled as 0.5 between negative (0) and positive (1)
[[ 0.09  0.07  0.04  0.    0.01  0.    0.    0.    0.    0.  ]
 [-0.21 -0.21 -0.11 -0.05 -0.03  0.    0.    0.    0.    0.  ]
 [-0.2  -0.21 -0.07 -0.03 -0.02  0.    0.    0.    0.    0.  ]
 [ 0.12  0.13  0.03  0.02  0.01  0.    0.    0.    0.    0.  ]
 [ 0.12  0.12  0.06  0.02  0.01  0.    0.    0.    0.    0.  ]
 [ 0.12  0.12  0.06  0.02  0.01  0.    0.    0.    0.    0.  ]]
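
One way to think about fuzzy labels is sketched below. It rests on an assumption that is not confirmed by the text above: that a fuzzy label is treated as the probability of counting the sample as positive in a given round. The sketch only shows this binarization step and is not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0.5, 0, 0, 1, 1, 1])  # first sample is fuzzy
n_rounds = 1000

# Assumption: in each round, a fuzzy label p becomes 1 with probability p.
# Crisp labels are unaffected, since random values lie in [0, 1).
binarized = (rng.random((n_rounds, labels.size)) < labels).astype(int)

# Share of rounds in which each sample counts as positive
print(binarized.mean(axis=0).round(2))
```

Under this assumption, the fuzzy sample counts as positive in roughly half of the rounds, while the crisp samples keep their labels in every round.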


If the model-agnostic ``KernelExplainer`` is used, a background subset of the given dataset can be obtained by internal clustering and selecting one representative sample per cluster. The number of background samples can be set by ``n_background_data`` (default=``None``, i.e., disabled):

In [8]:
from sklearn.svm import SVC

# Use KernelExplainer to obtain shap values for any prediction model 
se = aa.ShapExplainer(explainer_class=shap.KernelExplainer, list_model_classes=[SVC])