Protein sequences can contain letters that your scales do not encode, which leaves gaps when the sequences are converted to numerical values. The ``NumericalFeature().extend_alphabet()`` method closes these gaps by adding a new letter to the alphabet of the scales. For each scale, the value of the new letter is a summary statistic, such as the minimum or the mean, computed over the existing amino acids. This prevents missing values and makes feature engineering more reliable. To demonstrate this, we load the default scale DataFrame using ``load_scales``:

In [3]:
import aaanalysis as aa
df_scales = aa.load_scales()
aa.display_df(df_scales, n_cols=3, show_shape=True)

DataFrame shape: (20, 586)


AA,ANDN920101,ARGP820101,ARGP820102
A,0.494,0.23,0.355
C,0.864,0.404,0.579
D,1.0,0.174,0.0
E,0.42,0.177,0.019
F,0.877,0.762,0.601
G,0.025,0.026,0.138
H,0.84,0.23,0.082
I,0.0,0.838,0.44
K,0.506,0.434,0.003
L,0.272,0.577,1.0
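
To see why a letter missing from the scales is a problem, consider the following minimal sketch. It uses plain pandas rather than the aaanalysis API, the sequence ``"ACXD"`` is a made-up example, and the scale values are copied from the ``ANDN920101`` column shown above. Looking up each residue leaves a missing value for ``X``:

```python
import pandas as pd

# Three values of the ANDN920101 scale, copied from the table above
scale = pd.Series({"A": 0.494, "C": 0.864, "D": 1.0}, name="ANDN920101")

# Hypothetical sequence containing a letter (X) that the scale does not encode
sequence = "ACXD"

# Look up one scale value per residue; letters without an entry become NaN
values = scale.reindex(list(sequence))
print(values)
print("Missing values:", int(values.isna().sum()))
```

Any feature computed from these values, such as an average over a sequence segment, would have to handle the ``NaN`` entry explicitly.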


Using the ``NumericalFeature`` utility class, you can add a new letter (``new_letter``) to the ``df_scales`` DataFrame and select a ``value_type`` (default: ``'mean'``) that determines how its values are computed.

In [4]:
nf = aa.NumericalFeature()
# Add the new letter as the last row of the DataFrame (value_type defaults to 'mean')
df_scales_x_mean = nf.extend_alphabet(df_scales=df_scales, new_letter="X")
aa.display_df(df_scales_x_mean, n_cols=3, show_shape=True, row_to_show="X")

DataFrame shape: (21, 586)


AA,ANDN920101,ARGP820101,ARGP820102
X,0.5773,0.37635,0.28865


In [5]:
# With value_type="min", every value of X is 0 because the scales are min-max normalized
df_scales_x_min = nf.extend_alphabet(df_scales=df_scales, new_letter="X", value_type="min")
aa.display_df(df_scales_x_min, n_cols=3, row_to_show="X")

AA,ANDN920101,ARGP820101,ARGP820102
X,0.0,0.0,0.0
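
Conceptually, both calls append one row that holds a column-wise statistic of the existing rows. The sketch below mimics this with plain pandas. It is not the library's implementation: ``extend_alphabet_sketch`` is a hypothetical helper, and the toy table contains only the first three rows shown earlier, so its results differ from the outputs above.

```python
import pandas as pd

# Toy scales table: first three rows of the default scales shown earlier
df_toy = pd.DataFrame(
    {
        "ANDN920101": [0.494, 0.864, 1.0],
        "ARGP820101": [0.23, 0.404, 0.174],
        "ARGP820102": [0.355, 0.579, 0.0],
    },
    index=pd.Index(["A", "C", "D"], name="AA"),
)


def extend_alphabet_sketch(df, new_letter, value_type="mean"):
    """Hypothetical helper: return a copy of df with one extra row for new_letter.

    The new row holds the column-wise statistic named by value_type
    (any pandas aggregation name such as "mean" or "min").
    """
    if new_letter in df.index:
        raise ValueError(f"'{new_letter}' is already in the alphabet")
    df_out = df.copy()
    # Aggregate each column over the existing letters and append the result as a new row
    df_out.loc[new_letter] = df.agg(value_type)
    return df_out


df_toy_x_mean = extend_alphabet_sketch(df_toy, new_letter="X")
df_toy_x_min = extend_alphabet_sketch(df_toy, new_letter="X", value_type="min")
print(df_toy_x_mean.loc["X"])
print(df_toy_x_min.loc["X"])
```

The helper works on a copy, so the input table keeps its original alphabet, just as ``df_scales`` above still has 20 rows after the calls to ``extend_alphabet``.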


A modified ``df_scales`` DataFrame, such as ``df_scales_x_mean``, can now be set as the global default using ``options``:

In [6]:
aa.options["df_scales"] = df_scales_x_mean
# The line above sets the internal default df_scales (load_scales is not affected)
cpp_plot = aa.CPPPlot()
df_scales_default = cpp_plot._df_scales
aa.display_df(df_scales_default, n_cols=3, row_to_show="X")

AA,ANDN920101,ARGP820101,ARGP820102
X,0.5773,0.37635,0.28865
