To demonstrate the ``ShapModel().add_feat_impact()`` method, we load the DOM_GSEC example dataset and its corresponding feature set (see [Breimann25a]_):

In [25]:
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)

Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c
1,Q14802,MQKVTLGLLVFLAGF...PGETPPLITPGSAQS,0,37,59,NSPFYYDWHS,LQVGGLICAGVLCAMGIIIVMSA,KCKCKFGQKS
2,Q86UE4,MAARSWQDELAQQAE...SPKQIKKKKKARRET,0,50,72,LGLEPKRYPG,WVILVGTGALGLLLLFLLGYGWA,AACAGARKKR
3,Q969W9,MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL,0,41,63,FQSMEITELE,FVQIIIIVVVMMVMVVVITCLLS,HYKLSARSFI
4,P05067,MLPGLALLLLAAWTA...GYENPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
5,P14925,MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
6,P70180,MRSLLLFTFSACVLL...RELREDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER


We can now create a ``ShapModel`` object and fit it to compute the SHAP values, which are stored internally in the ``shap_values`` attribute:

In [26]:
sm = aa.ShapModel()
sm.fit(X, labels=labels)

shap_values = sm.shap_values

# Print SHAP values
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))

SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.11 -0.1  -0.07 -0.1  -0.07]
 [-0.13 -0.12 -0.07 -0.09 -0.08]
 [-0.14 -0.13 -0.03 -0.09 -0.02]
 [ 0.13  0.13  0.04  0.09  0.04]
 [ 0.13  0.13  0.08  0.1   0.07]
 [ 0.13  0.13  0.08  0.09  0.06]]


We can now include the feature impact (i.e., the SHAP values of each sample, normalized such that their absolute values sum to 100%) by providing ``df_feat`` to the ``ShapModel().add_feat_impact()`` method:

In [27]:
# Add feature impact of each sample (Protein0 to Protein5)
df_feat = sm.add_feat_impact(df_feat=df_feat)
aa.display_df(df_feat)

Unnamed: 0,feature,category,subcategory,scale_name,scale_description,abs_auc,abs_mean_dif,mean_dif,std_test,std_ref,p_val_mann_whitney,p_val_fdr_bh,positions,feat_importance,feat_importance_std,feat_impact_Protein0,feat_impact_Protein1,feat_impact_Protein2,feat_impact_Protein3,feat_impact_Protein4,feat_impact_Protein5
1,"TMD_C_JMD_C-Seg...3,4)-KLEP840101",Energy,Charge,Charge,"Net charge (Kle...n et al., 1984)",0.244,0.103666,0.103666,0.106692,0.110506,0.0,0.0,"31,32,33,34,35",0.9704,1.438918,-24.2,-26.17,-33.82,30.3,25.46,26.02
2,"TMD_C_JMD_C-Seg...3,4)-FINA910104",Conformation,α-helix (C-cap),α-helix termination,"Helix terminati...n et al., 1991)",0.243,0.085064,0.085064,0.098774,0.096946,0.0,0.0,"31,32,33,34,35",0.0,0.0,-22.48,-23.82,-31.8,30.13,25.18,25.68
3,"TMD_C_JMD_C-Seg...6,9)-LEVM760105",Shape,Side chain length,Side chain length,"Radius of gyrat... (Levitt, 1976)",0.233,0.137044,0.137044,0.161683,0.176964,0.0,1e-06,"32,33",1.5548,2.109848,-16.03,-15.13,-7.37,8.98,15.66,15.9
4,"TMD_C_JMD_C-Seg...3,4)-HUTJ700102",Energy,Entropy,Entropy,"Absolute entrop...Hutchens, 1970)",0.229,0.098224,0.098224,0.106865,0.124608,0.0,1e-06,"31,32,33,34,35",3.1112,3.109955,-21.45,-19.4,-22.28,21.48,19.2,19.26
5,"TMD_C_JMD_C-Seg...6,9)-RADA880106",ASA/Volume,Volume,Accessible surface area (ASA),"Accessible surf...olfenden, 1988)",0.223,0.095071,0.095071,0.114758,0.132829,0.0,2e-06,"32,33",0.0,0.0,-15.85,-15.48,-4.73,9.12,14.5,13.14
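
The normalization described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: it assumes that each sample's SHAP values are divided by the sum of their absolute values and scaled to 100, and the helper ``normalize_to_impact`` is a hypothetical name introduced only for this illustration. Because it starts from the rounded SHAP values printed earlier, its results only approximate the ``feat_impact_*`` columns.

```python
import numpy as np

# Rounded SHAP values copied from the printed output (6 samples x 5 features)
shap_values = np.array([
    [-0.11, -0.10, -0.07, -0.10, -0.07],
    [-0.13, -0.12, -0.07, -0.09, -0.08],
    [-0.14, -0.13, -0.03, -0.09, -0.02],
    [0.13, 0.13, 0.04, 0.09, 0.04],
    [0.13, 0.13, 0.08, 0.10, 0.07],
    [0.13, 0.13, 0.08, 0.09, 0.06],
])


def normalize_to_impact(shap_row):
    """Scale one sample's SHAP values so their absolute values sum to 100 (assumed scheme)."""
    total = np.abs(shap_row).sum()
    if total == 0:
        # Avoid division by zero for a sample without any SHAP signal
        return np.zeros_like(shap_row, dtype=float)
    return shap_row / total * 100


# One row of feature impacts per sample; the sign of each SHAP value is preserved
feat_impact = np.array([normalize_to_impact(row) for row in shap_values])
print(feat_impact.round(2))
```

In this sketch, every row of ``feat_impact`` has absolute values that sum to 100, so impacts are comparable across samples even when the overall SHAP magnitude differs.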


To include the impact of a specific sample only, pass its position index to the ``sample_positions`` parameter. This index refers to the position of the sample within the ``shap_values`` attribute (the same order as in the ``labels`` provided to the ``ShapModel().fit()`` method). Set ``drop=True`` to overwrite the existing feature impact columns:

In [28]:
# First protein
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, sample_positions=0)
aa.display_df(df_feat, n_cols=-1)

Unnamed: 0,feat_impact_Protein0
1,-24.2
2,-22.48
3,-16.03
4,-21.45
5,-15.85


You can provide a specific name for the selected sample via the ``names`` parameter:

In [29]:
# Single sample
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, sample_positions=0, names="Selected_sample")
aa.display_df(df_feat, n_cols=-1)

Unnamed: 0,feat_impact_Selected_sample
1,-24.2
2,-22.48
3,-16.03
4,-21.45
5,-15.85


**Computing feature impact**

Three different scenarios are possible:

a) **Single sample**: Compute the feature impact for a single sample (above).
b) **Multiple samples**: Compute the feature impact for multiple samples (all by default).
c) **Group of samples**: Compute the average feature impact and standard deviation for a group.

To focus on specific samples, specify their indices in ``sample_positions``. If ``names`` is provided, its length should match the number of indices in ``sample_positions``.

In [30]:
# Multiple samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, sample_positions=[0, 1], names=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-2)

Unnamed: 0,feat_impact_Sample 1,feat_impact_Sample 2
1,-24.2,-26.17
2,-22.48,-23.82
3,-16.03,-15.13
4,-21.45,-19.4
5,-15.85,-15.48


To calculate the group average, set ``group_average=True`` and specify the sample indices in ``sample_positions``. Provide a group name via ``names``, or 'Group' will be used by default:

In [31]:
# Group of samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, sample_positions=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-2)

Unnamed: 0,feat_impact_Group,feat_impact_std_Group
1,-25.22,1.97532
2,-23.18,1.57887
3,-15.56,0.162996
4,-20.39,0.219933
5,-15.66,0.432344
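
As a rough illustration of a group-level summary, the following NumPy sketch normalizes each sample's SHAP values and then takes the mean and standard deviation across the selected samples. This is an assumption made for illustration only: the order of averaging and normalization used inside ``ShapModel`` is not stated here, and the sketch starts from the rounded SHAP values printed earlier, so its numbers need not match the output above.

```python
import numpy as np

# Rounded SHAP values copied from the printed output (6 samples x 5 features)
shap_values = np.array([
    [-0.11, -0.10, -0.07, -0.10, -0.07],
    [-0.13, -0.12, -0.07, -0.09, -0.08],
    [-0.14, -0.13, -0.03, -0.09, -0.02],
    [0.13, 0.13, 0.04, 0.09, 0.04],
    [0.13, 0.13, 0.08, 0.10, 0.07],
    [0.13, 0.13, 0.08, 0.09, 0.06],
])

# Samples that form the group (same indices as in the example above)
sample_positions = [0, 1]

# Assumed scheme: normalize per sample so absolute values sum to 100
feat_impact = shap_values / np.abs(shap_values).sum(axis=1, keepdims=True) * 100

# Aggregate the selected samples feature by feature
group = feat_impact[sample_positions]
group_mean = group.mean(axis=0)
group_std = group.std(axis=0)  # population standard deviation (ddof=0)

print(group_mean.round(2))
print(group_std.round(2))
```

Normalizing first and averaging afterwards gives every sample the same weight in the group; averaging the raw SHAP values first would instead weight samples by their overall SHAP magnitude.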


Setting ``shap_feat_importance=True`` computes the SHAP value-based feature importance instead:

In [32]:
# SHAP value-based feature importance
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, shap_feat_importance=True)
aa.display_df(df_feat, n_cols=-1)

Unnamed: 0,feat_importance
1,27.5
2,26.37
3,13.36
4,20.44
5,12.32
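
A common way to derive feature importance from SHAP values is the mean absolute SHAP value per feature. The sketch below follows that convention and additionally scales the result to sum to 100. Whether ``ShapModel`` uses exactly this scheme is an assumption, and because the input is the rounded SHAP output printed earlier, the results only approximate the ``feat_importance`` column above.

```python
import numpy as np

# Rounded SHAP values copied from the printed output (6 samples x 5 features)
shap_values = np.array([
    [-0.11, -0.10, -0.07, -0.10, -0.07],
    [-0.13, -0.12, -0.07, -0.09, -0.08],
    [-0.14, -0.13, -0.03, -0.09, -0.02],
    [0.13, 0.13, 0.04, 0.09, 0.04],
    [0.13, 0.13, 0.08, 0.10, 0.07],
    [0.13, 0.13, 0.08, 0.09, 0.06],
])

# Mean absolute SHAP value per feature (averaged over all samples)
mean_abs_shap = np.abs(shap_values).mean(axis=0)

# Scale to percent so that the importance of all features sums to 100
feat_importance = mean_abs_shap / mean_abs_shap.sum() * 100
print(feat_importance.round(2))
```

Unlike the per-sample feature impact, this importance is always non-negative: taking absolute values discards the direction of each contribution and keeps only its magnitude.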
