# SequenceFeature: Creation of CPP feature components

A CPP feature is a combination of three components:

- **Part**: A continuous subset of a sequence, such as a protein domain.
- **Split**: A continuous or discontinuous subset of a **Part**, either a segment or a pattern.
- **Scale**: A physicochemical scale, i.e., a set of numerical values (typically [0-1]) assigned to amino acids.

While **Scales** can be obtained using the ``load_scales()`` function and selected with the ``AAclust`` class, the ``SequenceFeature`` class is designed to create various forms of **Parts** and **Splits**, all of which can then be provided to ``CPP``. See the [SequenceFeature API](https://aaanalysis.readthedocs.io/en/latest/generated/aaanalysis.SequenceFeature.html#) for more details.
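
To make the three components concrete, the following minimal sketch computes one feature value by hand. It is not the AAanalysis implementation: the toy scale values, the example part, and the helper ``feature_value`` are hypothetical, and averaging the scale values over the residues of a split is an assumption made for illustration.

```python
# Hypothetical toy scale: values in [0, 1] for a few amino acids (not a real scale)
toy_scale = {"A": 0.2, "L": 0.9, "K": 0.1, "G": 0.4}


def feature_value(part, positions, scale):
    """Average the scale values of the residues selected by a split.

    part: sequence of the Part; positions: 0-based indices forming the Split.
    """
    values = [scale[part[i]] for i in positions]
    return sum(values) / len(values)


part = "ALKGALKG"        # Part: continuous stretch of a sequence
segment = [0, 1, 2, 3]   # Split (segment): continuous residues
pattern = [0, 4]         # Split (pattern): discontinuous residues

seg_value = feature_value(part, segment, toy_scale)
pat_value = feature_value(part, pattern, toy_scale)
print(round(seg_value, 3), round(pat_value, 3))
```

The same part yields different values depending on the split, which is why each Part-Split-Scale combination is treated as its own feature.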

## Creation of Parts

To define **Parts**, the ``SequenceFeature`` class provides the ``SequenceFeature.get_df_parts()`` method. To demonstrate this method, we first obtain an example sequence dataset using the ``load_dataset()`` function:

In [1]:
import aaanalysis as aa
aa.options["verbose"] = False

sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="SEQ_CAPSID", min_len=40, max_len=100)
aa.display_df(df_seq, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (172, 3)


Unnamed: 0,entry,sequence,label
1,CAPSID_4,MERGDIP...EMDAGLI,0
2,CAPSID_26,MDTGDRL...PANAGMY,0
3,CAPSID_35,MTKLLLT...LDDGQAA,0



By default, three sequence parts (``tmd``, ``jmd_n_tmd_n``, ``tmd_c_jmd_c``) are provided, with ``jmd_n`` and ``jmd_c`` each 10 residues long:

In [2]:
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df=df_parts, n_rows=5, show_shape=True, char_limit=15)

DataFrame shape: (172, 3)


entry,tmd,jmd_n_tmd_n,tmd_c_jmd_c
CAPSID_4,VGRHRRI...KRRQALE,MERGDIP...KAEDVSK,YQRIRDE...EMDAGLI
CAPSID_26,GEVAALF...PAAPTGP,MDTGDRL...PGGHRRF,RESEVRA...PANAGMY
CAPSID_35,AADLLGV...LLAFVHR,MTKLLLT...LNSGDLE,SVRIGRA...LDDGQAA
CAPSID_58,KYLEALF...NTLRKGQ,MRWDGLS...TVYRWLQ,TGVIPAY...VNDEDQP
CAPSID_141,YLTLSEA...SVLNEPI,MYLTIKE...FDGQQHL,INKEQFN...PDVKDED


Any combination of valid sequence parts can be obtained using the ``list_parts`` parameter:

In [3]:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'])
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (172, 4)


entry,jmd_n,tmd,jmd_c,tmd_jmd
CAPSID_4,MERGDIPFKY,VGRHRRI...KRRQALE,ELAEMDAGLI,MERGDIP...EMDAGLI
CAPSID_26,MDTGDRLLTP,GEVAALF...PAAPTGP,GPGPANAGMY,MDTGDRL...PANAGMY
CAPSID_35,MTKLLLTPTE,AADLLGV...LLAFVHR,LRGLDDGQAA,MTKLLLT...LDDGQAA


Set the length of both JMDs by the ``jmd_c_len`` and ``jmd_n_len`` parameters:

In [4]:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'], jmd_c_len=8, jmd_n_len=8)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)

DataFrame shape: (172, 4)


entry,jmd_n,tmd,jmd_c,tmd_jmd
CAPSID_4,MERGDIPF,KYVGRHR...RQALEEL,AEMDAGLI,MERGDIP...EMDAGLI
CAPSID_26,MDTGDRLL,TPGEVAA...APTGPGP,GPANAGMY,MDTGDRL...PANAGMY
CAPSID_35,MTKLLLTP,TEAADLL...AFVHRLR,GLDDGQAA,MTKLLLT...LDDGQAA
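
The outputs above suggest that, when only sequences are given, the N-terminal and C-terminal JMDs are taken from the two ends of the sequence and the TMD is whatever lies between them. The sketch below mimics this slicing in plain Python. The helper ``get_parts`` and the toy sequence are hypothetical, and splitting the TMD at its midpoint to form ``jmd_n_tmd_n`` and ``tmd_c_jmd_c`` is an assumption, not the confirmed behavior of ``get_df_parts()``.

```python
def get_parts(sequence, jmd_n_len=10, jmd_c_len=10):
    """Slice a sequence into JMD-N, TMD, JMD-C, and derived parts.

    Assumes no explicit TMD start/stop positions are given, so the TMD is
    everything between the two terminal JMDs.
    """
    if len(sequence) <= jmd_n_len + jmd_c_len:
        raise ValueError("sequence too short for the requested JMD lengths")
    tmd_stop = len(sequence) - jmd_c_len
    jmd_n = sequence[:jmd_n_len]
    tmd = sequence[jmd_n_len:tmd_stop]
    jmd_c = sequence[tmd_stop:]
    half = len(tmd) // 2  # assumption: TMD halves are split at the midpoint
    return {
        "jmd_n": jmd_n,
        "tmd": tmd,
        "jmd_c": jmd_c,
        "jmd_n_tmd_n": jmd_n + tmd[:half],
        "tmd_c_jmd_c": tmd[half:] + jmd_c,
        "tmd_jmd": sequence,
    }


# Toy sequence (the alphabet) so that each slice is easy to verify by eye
parts = get_parts("ABCDEFGHIJKLMNOPQRSTUVWXYZ", jmd_n_len=3, jmd_c_len=3)
for name, seq in parts.items():
    print(f"{name:12s} {seq}")
```

Shortening the JMDs lengthens the TMD by the same number of residues, which matches the shift visible between the two outputs above.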


For more details, see the [SequenceFeature.get_df_parts API](https://aaanalysis.readthedocs.io/en/latest/generated/aaanalysis.SequenceFeature.get_df_parts.html#).

## Creation of Splits 

Three different types of splits exist:

- **Segment**: continuous sub-sequence.
- **Pattern**: non-periodic discontinuous sub-sequence.
- **PeriodicPattern**: periodic discontinuous sub-sequence.

Due to the plethora of combinatorial options, ``SequenceFeature`` has a special method (``SequenceFeature.get_split_kws()``) to create a dictionary containing all relevant **Split** information.

You can get the default arguments for all split types as follows:

In [5]:
split_kws = sf.get_split_kws()
split_kws

{'Segment': {'n_split_min': 1, 'n_split_max': 15},
 'Pattern': {'steps': [3, 4], 'n_min': 2, 'n_max': 4, 'len_max': 15},
 'PeriodicPattern': {'steps': [3, 4]}}

You can also retrieve arguments for specific split types:

In [6]:
split_kws = sf.get_split_kws(split_types=["Segment", "Pattern"])
split_kws

{'Segment': {'n_split_min': 1, 'n_split_max': 15},
 'Pattern': {'steps': [3, 4], 'n_min': 2, 'n_max': 4, 'len_max': 15}}

The arguments for each split type can be adjusted. For ``Segment``, the ``n_split_min`` (default=1) and ``n_split_max`` (default=15) parameters set the minimum and maximum number of segments into which a part is divided, which determines the maximum and minimum segment length, respectively:

In [7]:
split_kws = sf.get_split_kws(split_types="Segment", n_split_min=5, n_split_max=10)
split_kws

{'Segment': {'n_split_min': 5, 'n_split_max': 10}}
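
As a rough illustration of what these two bounds imply, the sketch below enumerates segments named by a pair ``(i, n_split)``, read as "the i-th of ``n_split`` roughly equal pieces". This reading is inferred from feature names such as ``Segment(2,12)``. The helpers ``segment`` and ``list_segments`` and the integer rounding rule are hypothetical and may differ from the library's internals.

```python
def segment(part, i, n_split):
    """Return the i-th (1-based) of n_split roughly equal continuous segments."""
    start = (i - 1) * len(part) // n_split
    end = i * len(part) // n_split
    return part[start:end]


def list_segments(n_split_min=1, n_split_max=15):
    """Enumerate all (i, n_split) pairs within the given bounds."""
    return [(i, n) for n in range(n_split_min, n_split_max + 1)
            for i in range(1, n + 1)]


part = "ABCDEFGHIJKLMNOPQRST"  # hypothetical 20-residue part
whole = segment(part, 1, 1)    # one split: the whole part
second_quarter = segment(part, 2, 4)
n_default = len(list_segments())
n_custom = len(list_segments(n_split_min=5, n_split_max=10))
print(whole, second_quarter, n_default, n_custom)
```

Under these assumptions, narrowing the range from 1-15 to 5-10 reduces the number of segments per part from 120 to 45.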

For ``PeriodicPattern``, the step sizes of the odd and even steps can be specified using the ``steps_periodicpattern`` parameter (default=[3, 4]):

In [8]:
split_kws = sf.get_split_kws(split_types="PeriodicPattern", steps_periodicpattern=[5, 10])
split_kws

{'PeriodicPattern': {'steps': [5, 10]}}
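
The alternation of two step sizes can be sketched as follows. The helper ``periodic_pattern`` is hypothetical. It counts 1-based positions from the N-terminus of a part and ignores any terminus handling the library may perform, so treat it as an illustration of the idea rather than the actual algorithm.

```python
def periodic_pattern(part_len, steps=(3, 4), start=1):
    """Return 1-based positions obtained by alternating the two step sizes."""
    positions = [start]
    k = 0
    # Odd-numbered steps use steps[0], even-numbered steps use steps[1]
    while positions[-1] + steps[k % 2] <= part_len:
        positions.append(positions[-1] + steps[k % 2])
        k += 1
    return positions


default_positions = periodic_pattern(20, steps=(3, 4))
custom_positions = periodic_pattern(30, steps=(5, 10))
print(default_positions)
print(custom_positions)
```

Larger step sizes spread the selected residues further apart, so fewer residues of a part contribute to each feature.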

For ``Pattern``, the step sizes, the minimum and maximum number of steps, and the maximum length of the pattern in residues can be adjusted using the ``steps_pattern`` (default=[3, 4]), ``n_min`` (default=2), ``n_max`` (default=4), and ``len_max`` (default=15) parameters:

In [9]:
split_kws = sf.get_split_kws(split_types="Pattern", steps_pattern=[5, 10], n_min=3, n_max=5, len_max=30)
split_kws

{'Pattern': {'steps': [5, 10], 'n_min': 3, 'n_max': 5, 'len_max': 30}}
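
How these four constraints interact can be illustrated with a small validator. The helper ``build_pattern`` is hypothetical. It follows the description above (a number of steps between ``n_min`` and ``n_max``, each drawn from the allowed step sizes, with no position beyond ``len_max``) and is a sketch under these assumptions, not the library's enumeration routine.

```python
def build_pattern(start, chosen_steps, steps=(3, 4), n_min=2, n_max=4, len_max=15):
    """Build 1-based pattern positions from a start position and a series of steps.

    Raises ValueError if the pattern violates one of the constraints.
    """
    if not n_min <= len(chosen_steps) <= n_max:
        raise ValueError("number of steps must lie between n_min and n_max")
    if any(step not in steps for step in chosen_steps):
        raise ValueError("each step must be one of the allowed step sizes")
    positions = [start]
    for step in chosen_steps:
        positions.append(positions[-1] + step)
    if positions[-1] > len_max:
        raise ValueError("pattern exceeds len_max residues")
    return positions


valid = build_pattern(start=2, chosen_steps=[4, 4, 3])
print(valid)

try:
    build_pattern(start=2, chosen_steps=[4, 4, 4, 4])  # would end at position 18
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
print(too_long_rejected)
```

Raising ``len_max`` or ``n_max`` admits longer patterns and therefore increases the number of candidate features.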

## Combining Parts + Splits + Scales

Any combination of the three feature components can be provided to ``CPP``, which will create all **Part-Split-Scale** combinations and filter them down to a user-defined number (default=100) of non-redundant features through the ``CPP.run()`` method:

In [10]:
# Load default scales, parts, and splits
df_scales = aa.load_scales()
df_parts = sf.get_df_parts(df_seq=df_seq)
split_kws = sf.get_split_kws()

# Get labels for test and reference class
labels = df_seq["label"].to_list()

# Create CPP object by providing three feature components
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales)
df_feat = cpp.run(labels=labels)

aa.display_df(df=df_feat, show_shape=True)

DataFrame shape: (100, 13)


Unnamed: 0,feature,category,subcategory,scale_name,scale_description,abs_auc,abs_mean_dif,mean_dif,std_test,std_ref,p_val_mann_whitney,p_val_fdr_bh,positions
1,"TMD_C_JMD_C-Seg...,15)-AURR980107",Conformation,α-helix (N-term),"α-helix (N-terminal, inside)","Normalized posi...ora-Rose, 1998)",0.268,0.136,-0.136,0.143,0.152,0.0,4.5e-05,37
2,"TMD-Segment(2,12)-PALJ810113",Conformation,α-helix (left-handed),β-turn (α class),"Normalized freq...u et al., 1981)",0.263,0.144,0.144,0.152,0.132,0.0,4e-05,"12,13"
3,"TMD-Segment(1,6)-TANS770107",Conformation,α-helix (left-handed),α-helix (left-handed),"Normalized freq...Scheraga, 1977)",0.258,0.101,0.101,0.113,0.089,0.0,4.2e-05,"11,12,13"
4,"TMD-PeriodicPat...3,2)-HUTJ700103",Energy,Entropy,Entropy,"Entropy of form...Hutchens, 1970)",0.257,0.065,-0.065,0.065,0.074,0.0,3.6e-05,"11,15,18,22,25,29"
5,"JMD_N_TMD_N-Pat...,13)-KARS160119",Shape,Graph (1. eigenvalue),Eigenvalue (maximum),"Weighted maximu...-Knisley, 2016)",0.256,0.115,-0.115,0.13,0.12,0.0,2.9e-05,"2,6,10,13"
6,"JMD_N_TMD_N-Seg...1,4)-ROBB760102",Conformation,α-helix (N-term),α-helix (N-terminal),"Information mea...n-Suzuki, 1976)",0.256,0.086,-0.086,0.108,0.086,0.0,3e-05,"1,2,3,4,5"
7,"TMD-Segment(1,6)-BULH740102",ASA/Volume,Partial specific volume,Partial specific volume,"Apparent partia...l-Breese, 1974)",0.254,0.097,-0.097,0.124,0.098,0.0,2.1e-05,"11,12,13"
8,"TMD_C_JMD_C-Seg...4,4)-RICJ880105",Conformation,α-helix (N-term),α-helix (N-terminal),"Relative prefer...chardson, 1988)",0.253,0.09,-0.09,0.082,0.111,0.0,2e-05,"36,37,38,39,40"
9,"JMD_N_TMD_N-Seg...1,3)-FAUJ880112",Energy,Charge (negative),Charge (negative),"Negative charge...e et al., 1988)",0.253,0.086,-0.086,0.081,0.104,0.0,2e-05,"1,2,3,4,5,6"
10,"TMD_C_JMD_C-Seg...1,2)-SNEP660104",Others,PC 4,Principal Component 4 (Sneath),"Principal compo... (Sneath, 1966)",0.251,0.069,0.069,0.078,0.064,0.0,2.3e-05,"21,22,23,24,25,26,27,28,29,30"
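
The entries in the ``feature`` column follow a ``PART-SPLIT-SCALE`` naming scheme, so a feature ID can be taken apart with plain string handling. The sketch below parses two of the untruncated IDs from the table and recomputes their residue positions. The helpers ``parse_feature`` and ``segment_positions`` are hypothetical, and both the rounding rule and the assumption that the TMD covers positions 11 to 30 (a 10-residue JMD-N followed by a 20-residue TMD) are guesses chosen for illustration.

```python
import re


def parse_feature(feature):
    """Split a 'PART-SPLIT-SCALE' feature ID into its three components."""
    part, split, scale = feature.split("-")
    return part, split, scale


def segment_positions(split, part_start, part_len):
    """Return 1-based sequence positions for a 'Segment(i,n_split)' split."""
    match = re.fullmatch(r"Segment\((\d+),(\d+)\)", split)
    i, n_split = int(match.group(1)), int(match.group(2))
    start = (i - 1) * part_len // n_split
    end = i * part_len // n_split
    return list(range(part_start + start, part_start + end))


results = {}
for feature in ["TMD-Segment(2,12)-PALJ810113", "TMD-Segment(1,6)-TANS770107"]:
    part, split, scale = parse_feature(feature)
    # Assumption: the TMD starts at position 11 and spans 20 residues
    results[feature] = segment_positions(split, part_start=11, part_len=20)
    print(part, split, scale, results[feature])
```

Under these assumptions the computed positions agree with the ``positions`` entries of rows 2 and 3 in the table.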


Further information on the CPP feature concept can be found in the [CPP Usage Principles](https://aaanalysis.readthedocs.io/en/latest/index/usage_principles/feature_identification.html) section.