To demonstrate the ``ShapModel().add_sample_mean_dif()`` method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]_):

In [7]:
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)

Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c
1,Q14802,MQKVTLGLLVFLAGF...PGETPPLITPGSAQS,0,37,59,NSPFYYDWHS,LQVGGLICAGVLCAMGIIIVMSA,KCKCKFGQKS
2,Q86UE4,MAARSWQDELAQQAE...SPKQIKKKKKARRET,0,50,72,LGLEPKRYPG,WVILVGTGALGLLLLFLLGYGWA,AACAGARKKR
3,Q969W9,MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL,0,41,63,FQSMEITELE,FVQIIIIVVVMMVMVVVITCLLS,HYKLSARSFI
4,P05067,MLPGLALLLLAAWTA...GYENPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
5,P14925,MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
6,P70180,MRSLLLFTFSACVLL...RELREDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER


You need to provide ``X``, ``labels``, and ``df_feat`` to the ``ShapModel().add_sample_mean_dif()`` method, which then computes the difference between each sample's feature values and the average of the reference group:

In [8]:
sm = aa.ShapModel()

# Compute difference against average for negative (0) group
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat)
aa.display_df(df_feat)

Unnamed: 0,feature,category,subcategory,scale_name,scale_description,abs_auc,abs_mean_dif,mean_dif,std_test,std_ref,p_val_mann_whitney,p_val_fdr_bh,positions,feat_importance,feat_importance_std,mean_dif_Protein0,mean_dif_Protein1,mean_dif_Protein2,mean_dif_Protein3,mean_dif_Protein4,mean_dif_Protein5
1,"TMD_C_JMD_C-Seg...3,4)-KLEP840101",Energy,Charge,Charge,"Net charge (Kle...n et al., 1984)",0.244,0.103666,0.103666,0.106692,0.110506,0.0,0.0,"31,32,33,34,35",0.9704,1.438918,0.1,-0.1,0.0,0.2,0.2,0.2
2,"TMD_C_JMD_C-Seg...3,4)-FINA910104",Conformation,α-helix (C-cap),α-helix termination,"Helix terminati...n et al., 1991)",0.243,0.085064,0.085064,0.098774,0.096946,0.0,0.0,"31,32,33,34,35",0.0,0.0,0.0876,-0.0876,0.0,0.1752,0.1752,0.1752
3,"TMD_C_JMD_C-Seg...6,9)-LEVM760105",Shape,Side chain length,Side chain length,"Radius of gyrat... (Levitt, 1976)",0.233,0.137044,0.137044,0.161683,0.176964,0.0,1e-06,"32,33",1.5548,2.109848,0.12389,-0.36078,0.23689,0.28289,0.36289,0.33856
4,"TMD_C_JMD_C-Seg...3,4)-HUTJ700102",Energy,Entropy,Entropy,"Absolute entrop...Hutchens, 1970)",0.229,0.098224,0.098224,0.106865,0.124608,0.0,1e-06,"31,32,33,34,35",3.1112,3.109955,0.131267,-0.269733,0.138467,0.231467,0.312467,0.277867
5,"TMD_C_JMD_C-Seg...6,9)-RADA880106",ASA/Volume,Volume,Accessible surface area (ASA),"Accessible surf...olfenden, 1988)",0.223,0.095071,0.095071,0.114758,0.132829,0.0,2e-06,"32,33",0.0,0.0,0.067557,-0.230443,0.162887,0.165557,0.278227,0.208887
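
The per-sample values can be illustrated with a minimal NumPy sketch. It assumes that the difference is obtained by subtracting the per-feature mean of the reference group from each sample, which is consistent with the output above (the values of the three reference samples sum to zero for each feature). The helper ``sample_mean_dif`` and the toy data are hypothetical and not part of the ``aaanalysis`` API:

```python
import numpy as np


def sample_mean_dif(X, labels, label_ref=0):
    """Hypothetical helper: difference of each sample to the reference group mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Per-feature mean over all samples belonging to the reference group
    ref_mean = X[labels == label_ref].mean(axis=0)
    # Broadcasting subtracts the reference mean from every sample (row)
    return X - ref_mean


# Toy data (assumption): 4 samples, 2 features
X_toy = np.array([[0.25, 1.0], [0.75, 3.0], [1.0, 4.0], [1.5, 2.0]])
labels_toy = [0, 0, 1, 1]

dif = sample_mean_dif(X_toy, labels_toy, label_ref=0)
print(dif)
```

Each row of ``dif`` corresponds to one sample and each column to one feature, whereas ``df_feat`` stores the transposed layout (one ``mean_dif_<name>`` column per sample).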


To change the reference group, use the ``label_ref`` parameter (default=0). Since ``df_feat`` already contains mean difference columns, we must set ``drop=True`` to remove them:

In [9]:
# Compute difference against average for positive (1) group
df_feat = sm.add_sample_mean_dif(X, labels=labels, label_ref=1, df_feat=df_feat, drop=True)
aa.display_df(df_feat, n_cols=-6)

Unnamed: 0,mean_dif_Protein0,mean_dif_Protein1,mean_dif_Protein2,mean_dif_Protein3,mean_dif_Protein4,mean_dif_Protein5
1,-0.1,-0.3,-0.2,-0.0,-0.0,-0.0
2,-0.0876,-0.2628,-0.1752,-0.0,-0.0,-0.0
3,-0.204223,-0.688893,-0.091223,-0.045223,0.034777,0.010447
4,-0.142667,-0.543667,-0.135467,-0.042467,0.038533,0.003933
5,-0.15,-0.448,-0.05467,-0.052,0.06067,-0.00867
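
What removing existing per-sample columns might look like can be sketched in plain pandas. This is only an illustration of the idea behind ``drop=True``, not the library's implementation; the prefix ``mean_dif_`` is inferred from the column names shown above, and the toy DataFrame is hypothetical:

```python
import pandas as pd

# Toy feature table (assumption) with two general and two per-sample columns
df_toy = pd.DataFrame({
    "feature": ["f1", "f2"],
    "abs_mean_dif": [0.10, 0.08],
    "mean_dif": [0.10, 0.08],
    "mean_dif_Protein0": [0.1, 0.0876],
    "mean_dif_Protein1": [-0.1, -0.0876],
})

# Per-sample columns carry the prefix 'mean_dif_' (with trailing underscore),
# so the general 'mean_dif' and 'abs_mean_dif' columns are not matched
prefix = "mean_dif_"
cols_old = [col for col in df_toy.columns if col.startswith(prefix)]
df_clean = df_toy.drop(columns=cols_old)
print(list(df_clean.columns))
```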


Select a specific sample by its position index in ``labels`` using the ``sample_positions`` parameter. You can assign it a name via the ``names`` parameter:

In [10]:
# Single sample
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, sample_positions=0, names="Selected_sample")
aa.display_df(df_feat, n_cols=-1)

Unnamed: 0,mean_dif_Selected_sample
1,0.1
2,0.0876
3,0.12389
4,0.131267
5,0.067557


Three different scenarios are possible:

a) **Single sample**: Compute the difference for a single sample (above).
b) **Multiple samples**: Compute the difference for multiple samples (all by default).
c) **Group of samples**: Compute the difference using the average of a group of samples.

To target specific samples, provide their indices in ``sample_positions``. If the ``names`` parameter is used, it must have the same length as ``sample_positions``:

In [11]:
# Multiple samples
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, sample_positions=[0, 1], names=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-2)

Unnamed: 0,mean_dif_Sample 1,mean_dif_Sample 2
1,0.1,-0.1
2,0.0876,-0.0876
3,0.12389,-0.36078
4,0.131267,-0.269733
5,0.067557,-0.230443


To compute the group average, set ``group_average=True`` and specify the sample indices in ``sample_positions``. Assign a name to the group using the ``names`` parameter; if not provided, 'Group' will be used as the default name:

In [12]:
# Group of samples
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat, drop=True, sample_positions=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-1)

Unnamed: 0,mean_dif_Group
1,0.0
2,0.0
3,-0.118445
4,-0.069233
5,-0.081443
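
The group result can be reproduced with a short NumPy sketch. It assumes that the group difference is the mean feature vector of the selected samples minus the reference group mean, which equals the average of the per-sample differences (in the output above, the first feature gives 0.0 as the average of 0.1 and -0.1). All variable names and the toy data are hypothetical:

```python
import numpy as np

# Toy data (assumption): 4 samples, 2 features
X_toy = np.array([[0.25, 1.0], [0.75, 3.0], [1.0, 4.0], [1.5, 2.0]])
labels_toy = np.array([0, 0, 1, 1])
sample_positions = [2, 3]  # samples forming the group

# Per-feature mean of the reference group (label 0)
ref_mean = X_toy[labels_toy == 0].mean(axis=0)

# Group difference: mean of the selected samples minus the reference mean
group_dif = X_toy[sample_positions].mean(axis=0) - ref_mean

# Equivalent route: average the per-sample differences of the group
per_sample_dif = X_toy[sample_positions] - ref_mean
group_dif_alt = per_sample_dif.mean(axis=0)

print(group_dif, group_dif_alt)
```

Both routes give the same values because taking the mean is a linear operation.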
