# SequenceFeature: Creation of CPP feature components

****
A CPP feature is a combination of three components:

- **Part**: A continuous subset of a sequence, such as a protein domain.
- **Split**: A continuous or discontinuous subset of a **Part**, either a segment or a pattern.
- **Scale**: A physicochemical scale, i.e., a set of numerical values (typically [0-1]) assigned to amino acids.
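
To make the three components concrete, the following minimal sketch computes a single feature value by hand. It assumes that a feature value is the mean of the scale values over the residues selected by the split. The sequence, the toy scale values, and the helper name `feature_value` are hypothetical and only illustrate the idea; this is not the AAanalysis implementation.

```python
# Toy scale: hypothetical values in [0, 1] for a few amino acids
toy_scale = {"A": 0.5, "L": 1.0, "G": 0.0, "K": 0.25}


def feature_value(part, start, stop, scale):
    """Average the scale values over part[start:stop] (0-based, stop exclusive)."""
    split = part[start:stop]                # Split: here a continuous segment
    values = [scale[res] for res in split]  # Scale: map residues to numbers
    return sum(values) / len(values)


part = "ALLGKA"  # Part: a continuous stretch of a (made-up) sequence
value = feature_value(part, start=0, stop=3, scale=toy_scale)  # split is "ALL"
print(round(value, 3))
```

In the library itself, these steps are handled by ``CPP`` once the parts, splits, and scales have been provided.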

While **Scales** can be obtained using the ``load_scales()`` function and selected using the ``AAclust`` class, the ``SequenceFeature`` class is designed to create various forms of **Parts** and **Splits**, all of which can then be provided to ``CPP``. See the [SequenceFeature API](https://aaanalysis.readthedocs.io/en/latest/generated/aaanalysis.SequenceFeature.html#) for more details.

**Creation of Parts**

To define **Parts**, the ``SequenceFeature`` class provides the ``SequenceFeature.get_df_parts()`` method. To demonstrate this method, we first obtain an example sequence dataset using the ``load_dataset()`` function:

In [None]:
import aaanalysis as aa
aa.options["verbose"] = False

sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="SEQ_CAPSID", min_len=40, max_len=100)
aa.display_df(df_seq, n_rows=3, show_shape=True, char_limit=15)


By default, three sequence parts (``tmd``, ``jmd_n_tmd_n``, ``tmd_c_jmd_c``) are provided, with ``jmd_n`` and ``jmd_c`` lengths of 10 residues each:

In [None]:
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df=df_parts, n_rows=5, show_shape=True, char_limit=15)

Any combination of valid sequence parts can be obtained using the ``list_parts`` parameter:

In [None]:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'])
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)

Set the lengths of the two JMDs with the ``jmd_n_len`` and ``jmd_c_len`` parameters:

In [None]:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'], jmd_c_len=8, jmd_n_len=8)
aa.display_df(df=df_parts, n_rows=3, show_shape=True, char_limit=15)
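
The relationship between the part names and the JMD length parameters can be sketched in plain Python. This is an illustration only, not the library's code: the helper `get_parts`, the toy sequence, the 1-based inclusive `tmd_start`/`tmd_stop` convention, and the way the TMD is halved are all assumptions made for this example.

```python
def get_parts(sequence, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Return a dict of sequence parts.

    Assumption: tmd_start and tmd_stop are 1-based and inclusive.
    """
    start = tmd_start - 1
    jmd_n = sequence[max(start - jmd_n_len, 0):start]
    tmd = sequence[start:tmd_stop]
    jmd_c = sequence[tmd_stop:tmd_stop + jmd_c_len]
    # Assumption: the TMD is split in the middle; for odd lengths the
    # extra residue goes to the C-terminal half in this sketch.
    half = len(tmd) // 2
    return {
        "jmd_n": jmd_n,
        "tmd": tmd,
        "jmd_c": jmd_c,
        "jmd_n_tmd_n": jmd_n + tmd[:half],
        "tmd_c_jmd_c": tmd[half:] + jmd_c,
        "tmd_jmd": jmd_n + tmd + jmd_c,
    }


# Made-up sequence: 10 x A (JMD-N), 20 x L (TMD), 10 x K (JMD-C)
seq = "A" * 10 + "L" * 20 + "K" * 10
parts = get_parts(seq, tmd_start=11, tmd_stop=30)
parts_short = get_parts(seq, tmd_start=11, tmd_stop=30, jmd_n_len=8, jmd_c_len=8)
print(parts["jmd_n_tmd_n"])
print(parts_short["jmd_n"])
```

Shortening the JMD lengths changes every part that contains a JMD, while ``tmd`` itself stays the same.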

For more details, see the [SequenceFeature.get_df_parts API](https://aaanalysis.readthedocs.io/en/latest/generated/aaanalysis.SequenceFeature.get_df_parts.html#).

**Creation of Splits** 

Three different types of splits exist:

- **Segment**: continuous sub-sequence.
- **Pattern**: non-periodic discontinuous sub-sequence.
- **PeriodicPattern**: periodic discontinuous sub-sequence.

Due to the large number of combinatorial options, ``SequenceFeature`` has a dedicated method (``SequenceFeature.get_split_kws()``) to create a dictionary containing all relevant **Split** information.

You can get the default arguments for all split types as follows:

In [None]:
split_kws = sf.get_split_kws()
split_kws

You can also retrieve arguments for specific split types:

In [None]:
split_kws = sf.get_split_kws(split_types=["Segment", "Pattern"])
split_kws

The arguments for each split type can be adjusted. For ``Segments``, the range of segment lengths is controlled by the ``n_split_min`` (default=1) and ``n_split_max`` (default=15) parameters, which set the minimum and maximum number of segments a part is divided into:

In [None]:
split_kws = sf.get_split_kws(split_types="Segment", n_split_min=5, n_split_max=10)
split_kws
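
As a rough illustration of what the segment range implies, the sketch below assumes that a part is divided into `n_split` roughly equal segments and that each one is identified by an (`i_th`, `n_split`) pair. The helpers `segment` and `list_segments` are hypothetical and written only for this example; the library's internal enumeration may differ.

```python
def segment(seq, i_th, n_split):
    """Return the i_th (1-based) of n_split roughly equal-length segments."""
    seg_len = len(seq) / n_split
    return seq[int(seg_len * (i_th - 1)):int(seg_len * i_th)]


def list_segments(n_split_min=1, n_split_max=15):
    """Enumerate all (i_th, n_split) pairs in the given range."""
    return [
        (i_th, n_split)
        for n_split in range(n_split_min, n_split_max + 1)
        for i_th in range(1, n_split + 1)
    ]


part = "ABCDEFGHIJKLMNOPQRST"              # made-up part with 20 residues
print(segment(part, i_th=2, n_split=4))    # second of four segments: FGHIJ
print(len(list_segments()))                # 120 pairs for the range 1 to 15
print(len(list_segments(5, 10)))           # 45 pairs for the range 5 to 10
```

Under this assumption, a larger `n_split` means shorter segments, so narrowing the range reduces the number of candidate splits.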

For ``PeriodicPattern``, the step sizes of the odd and even steps can be specified using the ``steps_periodicpattern`` parameter (default=[3, 4]):

In [None]:
split_kws = sf.get_split_kws(split_types="PeriodicPattern", steps_periodicpattern=[5, 10])
split_kws
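
The effect of alternating step sizes can be sketched as follows. The helper `periodic_positions` is hypothetical, and the convention of 1-based positions starting at residue 1 is an assumption made for this illustration rather than a description of the library's internals.

```python
def periodic_positions(len_part, step1, step2, start=1):
    """Return 1-based positions start, start+step1, start+step1+step2, ...

    The step alternates between step1 (odd steps) and step2 (even steps)
    until the end of the part is reached.
    """
    steps = (step1, step2)
    positions = []
    pos = start
    i = 0
    while pos <= len_part:
        positions.append(pos)
        pos += steps[i % 2]
        i += 1
    return positions


pos_default = periodic_positions(20, step1=3, step2=4)
pos_custom = periodic_positions(20, step1=5, step2=10)
print(pos_default)  # [1, 4, 8, 11, 15, 18]
print(pos_custom)   # [1, 6, 16]
```

Larger steps select fewer, more widely spaced residues from the same part.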

For ``Patterns``, the step sizes, the minimum and maximum number of steps, and the maximum pattern length in residues can be adjusted using the ``steps_pattern`` (default=[3, 4]), ``n_min`` (default=2), ``n_max`` (default=4), and ``len_max`` (default=10) parameters:

In [None]:
split_kws = sf.get_split_kws(split_types="Pattern", steps_pattern=[5, 10], n_min=3, n_max=5, len_max=30)
split_kws
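
One plausible reading of these four parameters is sketched below: every pattern starts at position 1, takes between `n_min` and `n_max` steps drawn from `steps_pattern`, and is kept only if its last position does not exceed `len_max`. The helper `list_patterns` and this interpretation are assumptions for illustration; the exact enumeration used by the library may differ.

```python
from itertools import product


def list_patterns(steps=(3, 4), n_min=2, n_max=4, len_max=10, start=1):
    """Enumerate position tuples built from all step combinations."""
    patterns = []
    for n in range(n_min, n_max + 1):
        for combo in product(steps, repeat=n):
            positions = [start]
            for step in combo:
                positions.append(positions[-1] + step)
            # Keep only patterns that fit within len_max residues
            if positions[-1] <= len_max:
                patterns.append(tuple(positions))
    return patterns


default_patterns = list_patterns()
custom_patterns = list_patterns(steps=(5, 10), n_min=3, n_max=5, len_max=30)
for pattern in default_patterns:
    print(pattern)
print(len(custom_patterns))
```

With the default values, `len_max` is the limiting factor in this sketch: most step combinations are discarded because they would extend beyond 10 residues.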

**Combining Parts + Splits + Scales**

Any combination of the three feature components can be provided to ``CPP``, which will create all **Part-Split-Scale** combinations and filter them down to a user-defined number (default=100) of non-redundant features through the ``CPP.run()`` method:

In [None]:
# Load default scales, parts, and splits
df_scales = aa.load_scales()
df_parts = sf.get_df_parts(df_seq=df_seq)
split_kws = sf.get_split_kws()

# Get labels for test and reference class
labels = df_seq["label"].to_list()

# Create CPP object by providing three feature components
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales)
df_feat = cpp.run(labels=labels)

aa.display_df(df=df_feat, show_shape=True)
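
The combinatorial step can be illustrated with a small sketch that builds every Part-Split-Scale combination from toy inputs. The split labels, the scale identifiers, and the hyphen-joined naming are hypothetical placeholders chosen for this example, and the subsequent filtering performed by ``CPP.run()`` is not shown.

```python
from itertools import product

parts = ["tmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]
splits = ["Segment(1,2)", "Segment(2,2)", "Pattern(N,1,4)"]  # toy subset of splits
scales = ["scale_1", "scale_2"]                              # hypothetical scale ids

# Every feature candidate is one (part, split, scale) triple
features = ["-".join(combo) for combo in product(parts, splits, scales)]

print(len(features))  # 3 parts x 3 splits x 2 scales = 18
print(features[0])    # tmd-Segment(1,2)-scale_1
```

Because the number of candidates grows multiplicatively with each component, restricting the parts, splits, or scales reduces the search space considerably.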

Further information on the CPP feature concept can be found in the [CPP Usage Principles](https://aaanalysis.readthedocs.io/en/latest/index/usage_principles/feature_identification.html) section.