# Quick Start with AAanalysis
Dive into the powerful capabilities of ``AAanalysis``—a Python framework dedicated to sequence-based, alignment-free protein prediction. In this tutorial, using gamma-secretase substrates and non-substrates as an example, we'll focus on extracting interpretable features from protein sequences using the ``AAclust`` and ``CPP`` models and how they can be harnessed for binary classification tasks.

What You Will Learn:
- ``Loading Sequences and Scales``: How to easily load protein sequences and their amino acid scales.
- ``Feature Engineering``: Extract essential features using the ``AAclust`` and ``CPP`` models.
- ``Protein Prediction``: Make predictions using the RandomForest model.
- ``Explainable AI``: Interpret predictions at the group and individual levels by combining ``CPP`` with ``SHAP``.

## 1. Loading Sequences and Scales
With AAanalysis, you have access to numerous benchmark datasets for protein sequence analysis. Using our γ-secretase substrates and non-substrates dataset as a hands-on example, you can effortlessly retrieve these datasets using the ``aa.load_dataset()`` function. Furthermore, amino acid scales, predominantly from AAindex, along with their hierarchical classification (known as ``AAontology``), are available at your fingertips with the ``aa.load_scales()`` function.

In [8]:
import aaanalysis as aa
# Load scales and scale categories (AAontology) 
df_scales = aa.load_scales()
df_cat = aa.load_scales(name="scales_cat")
# Load training data
df_seq = aa.load_dataset(name="DOM_GSEC", n=50)
df_seq

,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c
0,Q14802,MQKVTLGLLVFLAGFPVLDANDLEDKNSPFYYDWHSLQVGGLICAG...,0,37,59,NSPFYYDWHS,LQVGGLICAGVLCAMGIIIVMSA,KCKCKFGQKS
1,Q86UE4,MAARSWQDELAQQAEEGSARLREMLSVGLGFLRTELGLDLGLEPKR...,0,50,72,LGLEPKRYPG,WVILVGTGALGLLLLFLLGYGWA,AACAGARKKR
2,Q969W9,MHRLMGVNSTAAAAAGQPNVSCTCNCKRSLFQSMEITELEFVQIII...,0,41,63,FQSMEITELE,FVQIIIIVVVMMVMVVVITCLLS,HYKLSARSFI
3,P53801,MAPGVARGPTPYWRLRLGGAALLLLLIPVAAAQEPPGAACSQNTNK...,0,97,119,RWGVCWVNFE,ALIITMSVVGGTLLLGIAICCCC,CCRRKRSRKP
4,Q8IUW5,MAPRALPGSAVLAAAVFVGGAVSSPLVAPDNGSSRTLHSRTETTPS...,0,59,81,NDTGNGHPEY,IAYALVPVFFIMGLFGVLICHLL,KKKGYRCTTE
...,...,...,...,...,...,...,...,...
95,P15209,MSPWLKWHGPAMARLWGLCLLVLGFWRASLACPTSCKCSSARIWCT...,1,431,453,VADQSNREHL,SVYAVVVIASVVGFCLLVMLLLL,KLARHSKFGM
96,Q86YL7,MWKVSALLFVLGSASLWVLAEGASTGQPEDDTETTGLEGGVAMPGA...,1,130,152,TVEKDGLSTV,TLVGIIVGVLLAIGFIGAIIVVV,MRKMSGRYSP
97,Q13308,MGAARGSPARPRRLPLLSVLLLPLLGGTQTAIVFIKQPSSQDALQG...,1,704,726,GSPPPYKMIQ,TIGLSVGAAVAYIIAVLGLMFYC,KKRCKAKRLQ
98,P10586,MAPEPAPGRTMVPLVPALVMLGLVAGAHGDSKPVFIKVPEDQTGLS...,1,1262,1284,PAQQQEEPEM,LWVTGPVLAVILIILIVIAILLF,KRKRTHSPSS
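
To make the column layout concrete, the following minimal sketch shows how ``jmd_n``, ``tmd``, and ``jmd_c`` relate to ``sequence``, ``tmd_start``, and ``tmd_stop``. It assumes that the positions are 1-based and inclusive, which is consistent with the first row (Q14802) of the table. The helper ``slice_parts`` is hypothetical and not part of ``AAanalysis``.

```python
# Residues 1-69 of Q14802, reassembled from the sequence prefix and the
# jmd_n, tmd, and jmd_c columns of the table (the table truncates the sequence).
seq_fragment = (
    "MQKVTLGLLVFLAGFPVLDANDLEDK"  # residues 1-26
    "NSPFYYDWHS"                  # residues 27-36 (jmd_n)
    "LQVGGLICAGVLCAMGIIIVMSA"     # residues 37-59 (tmd)
    "KCKCKFGQKS"                  # residues 60-69 (jmd_c)
)


def slice_parts(sequence, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Cut jmd_n, tmd, and jmd_c out of a sequence (hypothetical helper).

    Assumes tmd_start and tmd_stop are 1-based and inclusive.
    """
    start = tmd_start - 1  # convert to a 0-based index
    return {
        "jmd_n": sequence[max(start - jmd_n_len, 0):start],
        "tmd": sequence[start:tmd_stop],
        "jmd_c": sequence[tmd_stop:tmd_stop + jmd_c_len],
    }


parts = slice_parts(seq_fragment, tmd_start=37, tmd_stop=59)
print(parts)
```

With ``tmd_start=37`` and ``tmd_stop=59``, the sliced TMD is 23 residues long, matching the ``tmd`` column.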


## 2. Feature Engineering
The centerpiece of AAanalysis is the Comparative Physicochemical Profiling (``CPP``) model, which is supported by ``AAclust`` for the pre-selection of amino acid scales. 

### AAclust
Since redundancy is a common problem in machine learning, the ``AAclust`` object provides a lightweight wrapper for sklearn clustering algorithms such as agglomerative clustering. AAclust clusters a set of scales and selects the most representative scale for each cluster (i.e., the scale closest to the cluster center).

In [9]:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
aac = aa.AAclust(model=AgglomerativeClustering, model_kwargs=dict(linkage="ward"))
X = np.array(df_scales).T
scales = aac.fit(X, n_clusters=10, names=list(df_scales)) 
df_scales = df_scales[scales]
df_scales

AA,SUEM840101,NISK860101,KANM800101,CHOP780101,MIYS990105,FAUJ880103,QIAN880126,MUNV940105,LINS030104,JOND920101
A,0.788,0.406,0.875,0.174,0.492,0.124,0.451,0.175,0.093,0.818
C,0.544,0.906,0.312,0.661,0.016,0.301,0.324,0.089,0.0,0.078
D,0.146,0.006,0.542,0.908,0.825,0.344,0.745,0.337,0.588,0.494
E,0.622,0.055,1.0,0.248,0.857,0.468,0.471,0.182,0.804,0.623
F,0.813,0.968,0.552,0.119,0.0,0.729,0.186,0.066,0.082,0.338
G,0.0,0.262,0.115,1.0,0.492,0.0,0.676,0.393,0.144,0.779
H,0.425,0.559,0.615,0.44,0.492,0.577,0.696,0.125,0.423,0.117
I,0.901,1.0,0.583,0.0,0.079,0.495,0.314,0.05,0.01,0.506
K,0.571,0.0,0.729,0.495,1.0,0.59,0.088,0.155,1.0,0.584
L,0.901,0.942,0.719,0.11,0.016,0.495,0.059,0.152,0.041,1.0
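
The idea of picking "the scale closest to the cluster center" can be sketched in plain Python. The toy scales below are invented for illustration, and Euclidean distance is an assumption; ``AAclust`` may use a different measure internally.

```python
import math

# Toy cluster of three made-up scales (four residue values each).
# Names and numbers are illustrative and not taken from AAindex.
cluster = {
    "scale_a": [0.0, 0.2, 0.9, 1.0],
    "scale_b": [0.1, 0.3, 0.8, 0.9],
    "scale_c": [0.5, 0.4, 0.6, 0.5],
}


def cluster_center(vectors):
    """Element-wise mean of equally long vectors."""
    return [sum(values) / len(values) for values in zip(*vectors)]


def most_representative(members):
    """Return the name of the member closest to the cluster center.

    Assumption: closeness is measured by Euclidean distance.
    """
    center = cluster_center(list(members.values()))
    return min(members, key=lambda name: math.dist(members[name], center))


representative = most_representative(cluster)
print(representative)  # scale_b
```

Keeping one representative per cluster reduces the scale set while preserving one member of each group of similar scales.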


### Comparative Physicochemical Profiling (CPP)
CPP is a sequence-based feature engineering algorithm. It aims to identify the set of features that best discriminates between two sets of sequences: the test set and the reference set. Supported by the ``SequenceFeature`` object (``sf``), a CPP feature integrates:

- ``Parts``: Combinations of a target middle domain (TMD) and its N- and C-terminal adjacent regions (JMD-N and JMD-C, respectively), obtained with ``sf.get_df_parts()``.
- ``Splits``: These `Parts` can be split into various continuous segments or discontinuous patterns, specified with ``sf.get_split_kws()``.
- ``Scales``: Sets of amino acid scales.
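
As a rough illustration of how a Part, a Split, and a Scale combine into one numeric feature, the sketch below takes a segment of a TMD and averages a scale over its residues. Both helpers are hypothetical, and the handling of leftover residues in ``get_segment`` is an assumption rather than the library's documented behavior.

```python
# SUEM840101 values for the residues A to L, copied from the scale table
# in the AAclust section.
suem840101 = {
    "A": 0.788, "C": 0.544, "D": 0.146, "E": 0.622, "F": 0.813,
    "G": 0.0, "H": 0.425, "I": 0.901, "K": 0.571, "L": 0.901,
}


def get_segment(part, i_th, n_split):
    """Return the i-th (1-based) of n_split contiguous segments.

    Assumption: leftover residues are assigned to the earliest segments.
    """
    base, extra = divmod(len(part), n_split)
    start = (i_th - 1) * base + min(i_th - 1, extra)
    stop = start + base + (1 if i_th <= extra else 0)
    return part[start:stop]


def feature_value(segment, scale):
    """Average the scale values over all residues of a segment."""
    return sum(scale[residue] for residue in segment) / len(segment)


tmd = "LQVGGLICAGVLCAMGIIIVMSA"  # TMD of Q14802
segment = get_segment(tmd, i_th=2, n_split=5)  # written here as Segment(2,5)
value = feature_value(segment, suem840101)
print(segment, round(value, 4))  # LICAG 0.6268
```

Computing this value for every sequence yields one column of the feature matrix.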

In [10]:
# Feature Engineering
y = list(df_seq["label"])
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq, jmd_n_len=10, jmd_c_len=10)
split_kws = sf.get_split_kws(n_split_max=1, split_types=["Segment"])
df_parts

entry,tmd,jmd_n_tmd_n,tmd_c_jmd_c
Q14802,LQVGGLICAGVLCAMGIIIVMSA,NSPFYYDWHSLQVGGLICAGVL,CAMGIIIVMSAKCKCKFGQKS
Q86UE4,WVILVGTGALGLLLLFLLGYGWA,LGLEPKRYPGWVILVGTGALGL,LLLFLLGYGWAAACAGARKKR
Q969W9,FVQIIIIVVVMMVMVVVITCLLS,FQSMEITELEFVQIIIIVVVMM,VMVVVITCLLSHYKLSARSFI
P53801,ALIITMSVVGGTLLLGIAICCCC,RWGVCWVNFEALIITMSVVGGT,LLLGIAICCCCCCRRKRSRKP
Q8IUW5,IAYALVPVFFIMGLFGVLICHLL,NDTGNGHPEYIAYALVPVFFIM,GLFGVLICHLLKKKGYRCTTE
...,...,...,...
P15209,SVYAVVVIASVVGFCLLVMLLLL,VADQSNREHLSVYAVVVIASVV,GFCLLVMLLLLKLARHSKFGM
Q86YL7,TLVGIIVGVLLAIGFIGAIIVVV,TVEKDGLSTVTLVGIIVGVLLA,IGFIGAIIVVVMRKMSGRYSP
Q13308,TIGLSVGAAVAYIIAVLGLMFYC,GSPPPYKMIQTIGLSVGAAVAY,IIAVLGLMFYCKKRCKAKRLQ
P10586,LWVTGPVLAVILIILIVIAILLF,PAQQQEEPEMLWVTGPVLAVIL,IILIVIAILLFKRKRTHSPSS
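
The three columns of ``df_parts`` can be reproduced from the ``jmd_n``, ``tmd``, and ``jmd_c`` columns of ``df_seq``. The sketch below is a hypothetical helper, not the implementation of ``sf.get_df_parts()``. The rule that an odd-length TMD gives its extra residue to the N-terminal half is inferred from the table.

```python
import math


def build_parts(jmd_n, tmd, jmd_c):
    """Compose the three part columns shown in df_parts (hypothetical helper).

    Assumption inferred from the table: the TMD is halved, and the extra
    residue of an odd-length TMD goes to the N-terminal half.
    """
    half = math.ceil(len(tmd) / 2)
    return {
        "tmd": tmd,
        "jmd_n_tmd_n": jmd_n + tmd[:half],
        "tmd_c_jmd_c": tmd[half:] + jmd_c,
    }


# Values taken from the Q14802 row of df_seq
parts = build_parts("NSPFYYDWHS", "LQVGGLICAGVLCAMGIIIVMSA", "KCKCKFGQKS")
print(parts["jmd_n_tmd_n"])  # NSPFYYDWHSLQVGGLICAGVL
print(parts["tmd_c_jmd_c"])  # CAMGIIIVMSAKCKCKFGQKS
```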


Running the CPP algorithm creates all `Part`, `Split`, `Scale` combinations and then filters them down to a user-defined maximum number of non-redundant features:

In [12]:
# Small set of features (30 features created)
cpp = aa.CPP(df_parts=df_parts, df_cat=df_cat, df_scales=df_scales, split_kws=split_kws)
df_feat = cpp.run(labels=y, tmd_len=20, n_filter=100)  

1. CPP creates 30 features for 100 samples
   |#########################| 100.00%
2. CPP pre-filters 1 features (5%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2
3. CPP filtering algorithm


ValueError: 'jmd_n_seq' should be string (type=<class 'list'>)
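
The 30 candidate features reported in step 1 follow directly from the Part, Split, Scale combinations: three parts, one split (a single segment covering the whole part), and ten scales. The sketch below enumerates them. The identifier format joined with ``-`` is illustrative only and may differ from the one ``CPP`` uses.

```python
from itertools import product

parts = ["tmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]  # columns of df_parts
splits = ["Segment(1,1)"]  # n_split_max=1 with segments only
scales = [
    "SUEM840101", "NISK860101", "KANM800101", "CHOP780101", "MIYS990105",
    "FAUJ880103", "QIAN880126", "MUNV940105", "LINS030104", "JOND920101",
]

# One candidate feature per (part, split, scale) combination
candidates = ["-".join(combo) for combo in product(parts, splits, scales)]
print(len(candidates))  # 30
print(candidates[0])    # tmd-Segment(1,1)-SUEM840101
```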

## 3. Protein Prediction
A feature matrix from a given set of CPP features can be created using ``sf.feat_matrix``:

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = sf.feat_matrix(df_parts=df_parts, df_scales=df_scales, features=df_feat["feature"])

NameError: name 'df_feat' is not defined

This feature matrix can now be used for common machine learning models.

In [None]:
# ML evaluation
rf = RandomForestClassifier()
cv = cross_val_score(rf, X, y, scoring="accuracy", cv=5, n_jobs=8) # Set n_jobs=1 to disable multi-processing
print(f"Mean accuracy of {round(np.mean(cv), 2)}")

Creating more initial features takes more time but can improve prediction performance.

In [5]:
# Default split settings for features (9,900 features created in the run shown below)
split_kws = sf.get_split_kws()
cpp = aa.CPP(df_cat=df_cat, df_parts=df_parts, df_scales=df_scales, split_kws=split_kws)
df_feat = cpp.run(labels=y, tmd_len=20, n_processes=8, n_filter=100)
X = sf.feat_matrix(df_parts=df_parts, df_scales=df_scales, features=df_feat["feature"])
# ML evaluation
rf = RandomForestClassifier()
cv = cross_val_score(rf, X, y, scoring="accuracy", cv=5, n_jobs=1)  # Set n_jobs=1 to disable multi-processing
print(f"Mean accuracy of {round(np.mean(cv), 2)}")

1. CPP creates 9900 features for 200 samples
   |#########################| 100.00%
2. CPP pre-filters 495 features (5%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2
3. CPP filtering algorithm
4. CPP returns df with 32 unique features including general information and statistics
Mean accuracy of 0.74
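
Step 2 of the log can be read as: discard features whose standard deviation in the test group exceeds 0.2, then keep the top 5% by absolute mean difference between test and reference group. The sketch below implements this reading on toy data. It is an interpretation of the log message, not the library's code, and the function name, the ``pct_pre_filter`` parameter, and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def pre_filter(feature_values, labels, pct_pre_filter=5, max_std_test=0.2):
    """Rank features by absolute mean difference (hypothetical sketch).

    feature_values maps a feature name to one value per sample;
    labels holds 1 for test samples and 0 for reference samples.
    """
    ranked = []
    for name, values in feature_values.items():
        test = [v for v, label in zip(values, labels) if label == 1]
        ref = [v for v, label in zip(values, labels) if label == 0]
        # Keep only features that are stable within the test group
        if pstdev(test) <= max_std_test:
            ranked.append((abs(mean(test) - mean(ref)), name))
    ranked.sort(reverse=True)
    n_keep = max(1, int(len(feature_values) * pct_pre_filter / 100))
    return [name for _, name in ranked[:n_keep]]


labels = [1, 1, 1, 0, 0, 0]
toy_features = {
    "feat_a": [0.9, 0.8, 0.85, 0.2, 0.3, 0.25],  # large difference, low spread
    "feat_b": [0.5, 0.5, 0.5, 0.45, 0.5, 0.55],  # no difference
    "feat_c": [0.0, 1.0, 0.5, 0.1, 0.1, 0.1],    # spread in test group too high
}
selected = pre_filter(toy_features, labels, pct_pre_filter=50)
print(selected)  # ['feat_a']
```

Under this reading, 5% of 9900 candidates gives the 495 pre-filtered features reported in the log.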


## 4. Explainable AI

### Explainable AI on group level

### Explainable AI on individual level