Protein sequences can contain letters that your scales do not cover, such as ``X`` for an unknown residue, and these letters leave gaps during feature engineering. The ``NumericalFeature().extend_alphabet()`` method closes such gaps by adding a new letter to the alphabet and assigning it, for each scale, a summary statistic (such as the minimum or the mean) computed over the existing amino acids. This prevents missing values in downstream feature engineering. To demonstrate, we first load the default scale DataFrame using ``load_scales``:

In [11]:
import aaanalysis as aa
df_scales = aa.load_scales()
aa.display_df(df_scales, n_cols=4, show_shape=True)

DataFrame shape: (20, 586)


AA,ANDN920101,ARGP820101,ARGP820102,ARGP820103
A,0.494,0.23,0.355,0.504
C,0.864,0.404,0.579,0.387
D,1.0,0.174,0.0,0.0
E,0.42,0.177,0.019,0.032
F,0.877,0.762,0.601,0.67
G,0.025,0.026,0.138,0.17
H,0.84,0.23,0.082,0.053
I,0.0,0.838,0.44,0.543
K,0.506,0.434,0.003,0.004
L,0.272,0.577,1.0,0.989


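Using the ``NumericalFeature`` utility class, you can add a new letter (``letter_new``) to the ``df_scales`` DataFrame and choose how its values are computed via ``value_type`` (default='mean'). The new letter is appended as the last row, so it does not appear in the truncated display below, but the shape grows from (20, 586) to (21, 586).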

In [12]:
nf = aa.NumericalFeature()
# Add new letter in last row of DataFrame
df_scales_x_mean = nf.extend_alphabet(df_scales=df_scales, letter_new="X")
aa.display_df(df_scales_x_mean, n_cols=4, show_shape=True)

DataFrame shape: (21, 586)


AA,ANDN920101,ARGP820101,ARGP820102,ARGP820103
A,0.494,0.23,0.355,0.504
C,0.864,0.404,0.579,0.387
D,1.0,0.174,0.0,0.0
E,0.42,0.177,0.019,0.032
F,0.877,0.762,0.601,0.67
G,0.025,0.026,0.138,0.17
H,0.84,0.23,0.082,0.053
I,0.0,0.838,0.44,0.543
K,0.506,0.434,0.003,0.004
L,0.272,0.577,1.0,0.989
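
Conceptually, the extension appends one entry per scale, computed from the letters already present. The following is a minimal sketch of that idea in plain Python, not the library's implementation: the function name ``extend_alphabet_sketch`` is hypothetical, nested dictionaries stand in for the DataFrame, and the toy values are the first three rows of the first two columns shown above.

```python
from statistics import mean

# Toy scales: scale id -> {letter: value}, copied from the rows A, C, D above
scales = {
    "ANDN920101": {"A": 0.494, "C": 0.864, "D": 1.0},
    "ARGP820101": {"A": 0.23, "C": 0.404, "D": 0.174},
}


def extend_alphabet_sketch(scales, letter_new, value_type="mean"):
    """Return a copy of `scales` with `letter_new` added to every scale.

    Hypothetical helper for illustration; only 'min' and 'mean' are handled.
    """
    stat_funcs = {"min": min, "mean": mean}
    if value_type not in stat_funcs:
        raise ValueError(f"Unsupported value_type: {value_type!r}")
    stat = stat_funcs[value_type]
    extended = {}
    for scale_id, values in scales.items():
        if letter_new in values:
            raise ValueError(f"{letter_new!r} is already in scale {scale_id}")
        new_values = dict(values)  # copy so the input stays unchanged
        new_values[letter_new] = stat(list(values.values()))
        extended[scale_id] = new_values
    return extended


scales_x_mean = extend_alphabet_sketch(scales, letter_new="X")
scales_x_min = extend_alphabet_sketch(scales, letter_new="X", value_type="min")
```

Because the input is copied, the original scales stay untouched, which mirrors how the cell above assigns the result to a new variable instead of overwriting ``df_scales``.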


In [13]:
# This should set each value of X to 0 since scales are min-max normalized
df_scales_x_min = nf.extend_alphabet(df_scales=df_scales, letter_new="X", value_type="min")
aa.display_df(df_scales_x_min, n_cols=4)

AA,ANDN920101,ARGP820101,ARGP820102,ARGP820103
A,0.494,0.23,0.355,0.504
C,0.864,0.404,0.579,0.387
D,1.0,0.174,0.0,0.0
E,0.42,0.177,0.019,0.032
F,0.877,0.762,0.601,0.67
G,0.025,0.026,0.138,0.17
H,0.84,0.23,0.082,0.053
I,0.0,0.838,0.44,0.543
K,0.506,0.434,0.003,0.004
L,0.272,0.577,1.0,0.989
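
The comment in the cell above relies on a property of min-max normalization: after rescaling, the smallest value of every scale is exactly 0 and the largest is exactly 1, so a letter added with ``value_type="min"`` receives 0 everywhere. A small sketch of that reasoning follows; the raw values are arbitrary illustrative numbers and ``min_max_normalize`` is a hypothetical helper, not part of the library.

```python
def min_max_normalize(values):
    """Rescale values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("Cannot normalize a constant scale")
    return [(v - lo) / (hi - lo) for v in values]


# Arbitrary raw scale values, used only for illustration
raw = [1.8, -4.5, 2.5, -3.5]
normalized = min_max_normalize(raw)

# The value a new letter would receive under the 'min' strategy
x_min = min(normalized)
```

In the truncated output above, the ``ANDN920101`` column is consistent with this: it contains both 0.0 (for I) and 1.0 (for D).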


This modified ``df_scales`` DataFrame can now be set as the global default using ``options``:

In [14]:
# Set the internal default df_scales (load_scales itself is not affected)
aa.options["df_scales"] = df_scales_x_mean
cpp_plot = aa.CPPPlot()
# Read the private attribute only to show which scales CPPPlot picked up
df_scales_default = cpp_plot._df_scales
aa.display_df(df_scales_default, n_cols=4)

AA,ANDN920101,ARGP820101,ARGP820102,ARGP820103
A,0.494,0.23,0.355,0.504
C,0.864,0.404,0.579,0.387
D,1.0,0.174,0.0,0.0
E,0.42,0.177,0.019,0.032
F,0.877,0.762,0.601,0.67
G,0.025,0.026,0.138,0.17
H,0.84,0.23,0.082,0.053
I,0.0,0.838,0.44,0.543
K,0.506,0.434,0.003,0.004
L,0.272,0.577,1.0,0.989
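
The distinction between a global default and the loader function can be pictured with a generic fallback pattern. The sketch below is an assumption about how such an option might be resolved, not a description of the library's internals; every name in it (``options``, ``load_default_scales``, ``resolve_scales``) is a hypothetical stand-in, and dictionaries replace the DataFrame.

```python
# Hypothetical global options; None means "no override set"
options = {"df_scales": None}


def load_default_scales():
    """Always return the built-in table, ignoring `options`."""
    return {"ANDN920101": {"A": 0.494, "C": 0.864}}


def resolve_scales(df_scales=None):
    """Pick the explicit argument, then the global option, then the built-in table."""
    if df_scales is not None:
        return df_scales
    if options["df_scales"] is not None:
        return options["df_scales"]
    return load_default_scales()


# Extended table with 'X' set to the mean of A and C
extended = {"ANDN920101": {"A": 0.494, "C": 0.864, "X": 0.679}}
options["df_scales"] = extended

resolved = resolve_scales()        # picks up the global override
builtin = load_default_scales()    # still the unmodified table
```

Under this pattern, objects that resolve their scales at construction time see the extended alphabet, while a direct call to the loader keeps returning the original 20 letters.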
