To demonstrate the ``ShapExplainer.add_feat_impact()`` method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann24c]_):

In [1]:
import shap
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)

Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c
1,Q14802,MQKVTLGLLVFLAGF...PGETPPLITPGSAQS,0,37,59,NSPFYYDWHS,LQVGGLICAGVLCAMGIIIVMSA,KCKCKFGQKS
2,Q86UE4,MAARSWQDELAQQAE...SPKQIKKKKKARRET,0,50,72,LGLEPKRYPG,WVILVGTGALGLLLLFLLGYGWA,AACAGARKKR
3,Q969W9,MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL,0,41,63,FQSMEITELE,FVQIIIIVVVMMVMVVVITCLLS,HYKLSARSFI
4,P05067,MLPGLALLLLAAWTA...GYENPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
5,P14925,MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
6,P70180,MRSLLLFTFSACVLL...RELREDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER


We can now create a ``ShapExplainer`` object and fit it to compute the SHAP values, which are stored internally in the ``shap_values`` attribute:

In [2]:
se = aa.ShapExplainer()
se.fit(X, labels=labels)

shap_values = se.shap_values

# Print SHAP values (one row per sample, one column per feature)
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))

SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.1  -0.1  -0.09 -0.1  -0.08]
 [-0.11 -0.11 -0.08 -0.1  -0.08]
 [-0.14 -0.13 -0.03 -0.1  -0.03]
 [ 0.12  0.13  0.05  0.09  0.04]
 [ 0.12  0.12  0.08  0.1   0.08]
 [ 0.12  0.12  0.08  0.09  0.07]]


We can now include the feature impact (i.e., SHAP values normalized per sample such that their absolute values sum up to 100%) by providing ``df_feat`` to the ``ShapExplainer.add_feat_impact()`` method:

In [3]:
# Add feature impact of each sample (Protein0 to Protein5)
df_feat = se.add_feat_impact(df_feat=df_feat)
aa.display_df(df_feat)

Unnamed: 0,feature,category,subcategory,scale_name,scale_description,abs_auc,abs_mean_dif,mean_dif,std_test,std_ref,p_val_mann_whitney,p_val_fdr_bh,positions,feat_importance,feat_importance_std,feat_impact_Protein0,feat_impact_Protein1,feat_impact_Protein2,feat_impact_Protein3,feat_impact_Protein4,feat_impact_Protein5
1,"TMD_C_JMD_C-Seg...3,4)-KLEP840101",Energy,Charge,Charge,"Net charge (Kle...n et al., 1984)",0.244,0.103666,0.103666,0.106692,0.110506,0.0,0.0,3132333435,0.9704,1.438918,-20.91,-23.22,-33.29,28.24,23.86,24.47
2,"TMD_C_JMD_C-Seg...3,4)-FINA910104",Conformation,α-helix (C-cap),α-helix termination,"Helix terminati...n et al., 1991)",0.243,0.085064,0.085064,0.098774,0.096946,0.0,0.0,3132333435,0.0,0.0,-21.13,-21.85,-30.19,28.69,23.88,24.48
3,"TMD_C_JMD_C-Seg...6,9)-LEVM760105",Shape,Side chain length,Side chain length,"Radius of gyrat... (Levitt, 1976)",0.233,0.137044,0.137044,0.161683,0.176964,0.0,1e-06,3233,1.5548,2.109848,-19.34,-16.91,-7.37,11.76,16.41,16.66
4,"TMD_C_JMD_C-Seg...3,4)-HUTJ700102",Energy,Entropy,Entropy,"Absolute entrop...Hutchens, 1970)",0.229,0.098224,0.098224,0.106865,0.124608,0.0,1e-06,3132333435,3.1112,3.109955,-21.19,-20.92,-23.04,21.3,19.53,19.46
5,"TMD_C_JMD_C-Seg...6,9)-RADA880106",ASA/Volume,Volume,Accessible surface area (ASA),"Accessible surf...olfenden, 1988)",0.223,0.095071,0.095071,0.114758,0.132829,0.0,2e-06,3233,0.0,0.0,-17.43,-17.1,-6.1,10.01,16.32,14.93
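
The normalization behind the feature impact can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: it assumes that each sample's SHAP values are divided by the sum of their absolute values and scaled to 100, which is consistent with the columns above (each ``feat_impact_*`` column sums to 100 in absolute terms). The array holds the rounded SHAP values of the first two samples printed earlier, so the results differ slightly from the table.

```python
import numpy as np

# Rounded SHAP values of the first two samples printed above (5 features each)
shap_values = np.array([
    [-0.10, -0.10, -0.09, -0.10, -0.08],
    [-0.11, -0.11, -0.08, -0.10, -0.08],
])

# Assumption: each row (sample) is divided by the sum of its absolute values
abs_sum_per_sample = np.abs(shap_values).sum(axis=1, keepdims=True)
feat_impact = shap_values / abs_sum_per_sample * 100

# Signs are preserved; the absolute values of each sample add up to 100
print(feat_impact.round(2))
print(np.abs(feat_impact).sum(axis=1))
```

Because the sign is kept, a negative feature impact indicates that the feature pushes the prediction for that sample toward the negative class, and a positive one toward the positive class.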


To include the impact of a specific sample, use the ``pos`` parameter, which gives the position index of the sample within the ``shap_values`` attribute (the same order as the ``labels`` provided to the ``ShapExplainer.fit()`` method). Set ``drop=True`` to overwrite the existing feature impact columns:

In [4]:
# First protein
df_feat = se.add_feat_impact(df_feat=df_feat, drop=True, pos=0)
aa.display_df(df_feat, n_cols=-1)

Unnamed: 0,feat_impact_Protein0
1,-20.91
2,-21.13
3,-19.34
4,-21.19
5,-17.43


You can provide a specific ``name`` for the corresponding sample:

In [5]:
# Single sample
df_feat = se.add_feat_impact(df_feat=df_feat, drop=True, pos=0, name="Selected_sample")
aa.display_df(df_feat, n_cols=-3)

Unnamed: 0,feat_importance,feat_importance_std,feat_impact_Selected_sample
1,0.9704,1.438918,-20.91
2,0.0,0.0,-21.13
3,1.5548,2.109848,-19.34
4,3.1112,3.109955,-21.19
5,0.0,0.0,-17.43


**Computing feature impact**

Three different scenarios are possible:

a) **Single sample**: Compute the feature impact for one sample (shown above).
b) **Multiple samples**: Compute the feature impact for multiple samples (all by default).
c) **Group of samples**: Compute the average feature impact and standard deviation for a group.

To focus on specific samples, specify their indices in ``pos``. If ``name`` is provided as a list, its length must match the length of ``pos``.

In [6]:
# Multiple samples
df_feat = se.add_feat_impact(df_feat=df_feat, drop=True, pos=[0, 1], name=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-4)

Unnamed: 0,feat_importance,feat_importance_std,feat_impact_Sample 1,feat_impact_Sample 2
1,0.9704,1.438918,-20.91,-23.22
2,0.0,0.0,-21.13,-21.85
3,1.5548,2.109848,-19.34,-16.91
4,3.1112,3.109955,-21.19,-20.92
5,0.0,0.0,-17.43,-17.1
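
Since ``pos`` refers to positions in the same order as ``labels``, the indices and matching names can be derived from the labels instead of being typed by hand. The sketch below uses only plain Python; the ``entries`` and ``labels`` lists are copied from the ``df_seq`` table above, and using the protein entries as names is an illustrative choice, not a requirement of the method.

```python
# Values copied from the df_seq table above
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]
labels = [0, 0, 0, 1, 1, 1]

# Position indices of all positive samples (label == 1)
pos = [i for i, label in enumerate(labels) if label == 1]
# One name per selected position, here the protein entry
names = [entries[i] for i in pos]

print(pos)
print(names)

# With the fitted explainer from above, these could be passed as:
# df_feat = se.add_feat_impact(df_feat=df_feat, drop=True, pos=pos, name=names)
```

Building both lists from the same indices keeps ``pos`` and ``name`` at the same length by construction.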


To calculate the group average, set ``group_average=True`` and specify the sample indices in ``pos``. Provide a ``name`` for the group; otherwise, ``Group`` is used by default:

In [7]:
# Group of samples
df_feat = se.add_feat_impact(df_feat=df_feat, drop=True, pos=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-4)

Unnamed: 0,feat_importance,feat_importance_std,feat_impact_Group,feat_impact_std_Group
1,0.9704,1.438918,-22.11,2.013108
2,0.0,0.0,-21.51,1.20009
3,1.5548,2.109848,-18.08,0.503155
4,3.1112,3.109955,-21.05,0.684384
5,0.0,0.0,-17.25,0.508527


Setting ``shap_feat_importance=True`` computes the SHAP value-based feature importance:

In [10]:
# SHAP value-based feature importance
df_feat = se.add_feat_impact(df_feat=df_feat, drop=True, shap_feat_importance=True)
aa.display_df(df_feat, n_cols=-3)

Unnamed: 0,feat_impact_Group,feat_impact_std_Group,feat_importance
1,-22.11,2.013108,25.47
2,-21.51,1.20009,24.88
3,-18.08,0.503155,14.94
4,-21.05,0.684384,20.84
5,-17.25,0.508527,13.87
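
As a rough illustration of what a SHAP value-based feature importance looks like, the sketch below uses the common definition of the mean absolute SHAP value per feature and normalizes it to 100%. This definition and the normalization are assumptions, not a description of the library's internals; they are consistent with the ``feat_importance`` column above, which sums to 100. The array contains the rounded SHAP values printed earlier, so the results only approximate the table.

```python
import numpy as np

# Rounded SHAP values printed above (6 samples x 5 features)
shap_values = np.array([
    [-0.10, -0.10, -0.09, -0.10, -0.08],
    [-0.11, -0.11, -0.08, -0.10, -0.08],
    [-0.14, -0.13, -0.03, -0.10, -0.03],
    [0.12, 0.13, 0.05, 0.09, 0.04],
    [0.12, 0.12, 0.08, 0.10, 0.08],
    [0.12, 0.12, 0.08, 0.09, 0.07],
])

# Assumption: importance is the mean absolute SHAP value of each feature ...
mean_abs_shap = np.abs(shap_values).mean(axis=0)
# ... rescaled so that all features together sum to 100
feat_importance = mean_abs_shap / mean_abs_shap.sum() * 100

print(feat_importance.round(2))
```

Unlike the feature impact, this importance is unsigned and summarizes all samples in a single value per feature.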
