To demonstrate the ``TreeModel.predict_proba()`` method, we load the ``DOM_GSEC`` example dataset and its corresponding feature set (see [Breimann24c]_):

In [3]:
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now fit the ``TreeModel``, which will internally fit 3 tree-based models over 5 training rounds by default:

In [4]:
tm = aa.TreeModel()
tm = tm.fit(X, labels=labels)

<aaanalysis.explainable_ai._tree_model.TreeModel at 0x7f752a2acac0>
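
As a rough illustration of what fitting several tree-based models over repeated rounds can look like, the following sketch uses scikit-learn directly. The choice of ``RandomForestClassifier``, ``ExtraTreesClassifier``, and ``GradientBoostingClassifier``, the helper name ``fit_rounds``, and the synthetic data are assumptions made for illustration; they are not taken from the ``TreeModel`` implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Assumed model classes; the actual TreeModel defaults may differ
MODEL_CLASSES = [RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier]


def fit_rounds(X, y, n_rounds=5, n_estimators=20):
    """Fit every model class once per round, varying only the random seed."""
    rounds = []
    for seed in range(n_rounds):
        models = [
            cls(n_estimators=n_estimators, random_state=seed).fit(X, y)
            for cls in MODEL_CLASSES
        ]
        rounds.append(models)
    return rounds


# Synthetic stand-in for the feature matrix and labels (hypothetical data)
X_demo, y_demo = make_classification(n_samples=60, n_features=10, random_state=0)
fitted = fit_rounds(X_demo, y_demo)
print(len(fitted), len(fitted[0]))  # number of rounds, models per round
```

Changing the seed in each round means that every model is trained with different internal randomness, which is what later makes it possible to measure how stable the predictions are.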

The ``TreeModel.predict_proba()`` method computes probability predictions by averaging across the fitted models and training rounds, a Monte Carlo approach that yields a robust estimate. It returns the mean prediction and its standard deviation for each sample:

In [5]:
pred, pred_std = tm.predict_proba(X)

df_seq["prediction"] = pred
df_seq["pred_std"] = pred_std

print("Prediction scores for 10 substrates")
aa.display_df(df_seq.head(10))
print("Prediction scores for 10 non-substrates")
aa.display_df(df_seq.tail(10))

Prediction scores for 10 substrates


Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c,prediction,pred_std
1,P05067,MLPGLALLLLAAWTA...GYENPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH,0.995923,0.002485
2,P14925,MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD,0.987712,0.006009
3,P70180,MRSLLLFTFSACVLL...RELREDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER,0.993268,0.003653
4,Q03157,MGPTSPAARGQGRRW...HGYENPTYRFLEERP,1,585,607,APSGTGVSRE,ALSGLLIMGAGGGSLIVLSLLLL,RKKKPYGTIS,0.986547,0.005568
5,Q06481,MAATGTAAAAATGRL...GYENPTYKYLEQMQI,1,694,716,LREDFSLSSS,ALIGLLVIAVAIATVIVISLVML,RKRQYGTISH,0.999939,2.1e-05
6,P35613,MAAALFVLLGFALLG...HQNDKGKNVRQRNSS,1,323,345,IITLRVRSHL,AALWPFLGIVAEVLVLVTIIFIY,EKRRKPEDVL,0.980406,0.006478
7,P35070,MDRAARCSGASSLPL...DITPINEDIEETNIA,1,119,141,LFYLRGDRGQ,ILVICLIAVMVVFIILVIGVCTC,CHPLRKRRKR,0.905012,0.012997
8,P09803,MGARCRSFSALLLLL...RFKKLADMYGGGEDD,1,711,733,GIVAAGLQVP,AILGILGGILALLILILLLLLFL,RRRTVVKEPL,0.992552,0.003257
9,P19022,MCRIAGALRTLLPLL...PRFKKLADMYGGGDD,1,724,746,RIVGAGLGTG,AIIAILLCIIILLILVLMFVVWM,KRRDKERQAK,0.98884,0.007673
10,P16070,MDKFWWHAAWGLCLV...DETRNLQNVDMKIGV,1,650,672,GPIRTPQIPE,WLIILASLLALALILAVCIAVNS,RRRCGQKKKL,0.971014,0.004342


Prediction scores for 10 non-substrates


Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c,prediction,pred_std
117,Q4KMG9,MGVRVHVVAASALLY...PPTEKESTRIVDSWN,0,42,64,EHCLTTDWVH,LWYIWLLVVIGALLLLCGLTSLC,FRCCCLSRQQ,0.009489,0.004405
118,Q8NEW7,MAGWPGAGPLCVLGG...EEDEKNEAKKKKGEK,0,56,78,KETVVFWDMR,LWHVVGIFSLFVLSIIITLCCVF,NCRVPRTRKE,0.006065,0.002505
119,A2RUT3,MLHVLASLPLLLLLV...TQRQIQIKGTSTQSG,0,64,86,CPGYWLGPGA,SRIYPVAAVMITTTMLMICRKIL,QGRRRSQATK,0.050539,0.010062
120,B7ZWI3,MLDTWVWGTLTLTFG...EGPAGQMRGRAYATL,0,64,86,VWDPANDRFR,FLVILACIIFPILFICALVSLFC,PNCTELQHDV,0.002724,0.00388
121,O35305,MAPRARRRRQLPAPL...AQTSLHTQGSGQCAE,0,212,234,RRPPKEAQAY,LPSLIVLLLFISVVVVAAIIFGV,YYRKGGKALT,0.078782,0.015155
122,P36941,MLLPWATSAPGLAWG...TPSNRGPRNQFITHD,0,226,248,PLPPEMSGTM,LMLAVLLPLAFFLLLATVFSCIW,KSHPSLCRKL,0.002753,0.001338
123,P25446,MLWIWAVLPLVLAGS...STPDTGNENEGQCLE,0,170,187,NCRKQSPRNR,LWLLTILVLLIPLVFIYR,KYRKRKCWKR,0.087387,0.01959
124,Q9P2J2,MVWCLGLAVLSLVIS...AYRQPVPHPEQATLL,0,738,760,PGLLPQPVLA,GVVGGVCFLGVAVLVSILAGCLL,NRRRAARRRR,0.070756,0.011771
125,Q96J42,MVPAAGRRPPRVMRL...SIRWLIPGQEQEHVE,0,324,342,LPSTLIKSVD,WLLVFSLFFLISFIMYATI,RTESIRWLIP,0.019023,0.011497
126,P0DPA2,MRVGGAFHLLLVCLS...DCAEGPVQCKNGLLV,0,265,287,KVSDSRRIGV,IIGIVLGSLLALGCLAVGIWGLV,CCCCGGSGAG,0.004117,0.005324
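
The averaging step behind the ``prediction`` and ``pred_std`` columns can be sketched with NumPy alone. The sketch below assumes that probabilities are first averaged across models within each round and that the mean and standard deviation are then taken across rounds. This order of aggregation, the helper name ``aggregate_proba``, and the toy numbers are assumptions for illustration, not the documented behavior of ``TreeModel.predict_proba()``:

```python
import numpy as np


def aggregate_proba(proba):
    """Aggregate positive-class probabilities of shape (n_rounds, n_models, n_samples).

    Models are averaged within each round; the per-sample mean and
    standard deviation are then computed across rounds.
    """
    proba = np.asarray(proba, dtype=float)
    per_round = proba.mean(axis=1)  # shape: (n_rounds, n_samples)
    return per_round.mean(axis=0), per_round.std(axis=0)


# Toy input (hypothetical values): 2 rounds, 2 models, 3 samples
toy = [
    [[0.9, 0.2, 0.5], [0.7, 0.0, 0.5]],  # round 1
    [[1.0, 0.1, 0.5], [1.0, 0.1, 0.5]],  # round 2
]
pred_demo, pred_std_demo = aggregate_proba(toy)
print(pred_demo)      # mean probability per sample
print(pred_std_demo)  # spread across rounds per sample
```

Under this assumption, a small standard deviation means that the rounds agree on a sample, which is how the low ``pred_std`` values in the tables above can be read.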
