# Scale Loading Tutorial
This tutorial covers the `load_scales` function, which loads sets of amino acid scales, their classification (AAontology), and their evaluation (AAclust top60).

Six datasets can be loaded via the ``name`` parameter: ``scales``, ``scales_raw``, ``scales_pc``, ``scales_cat``, ``top60``,
and ``top60_eval``.

## Three sets of numerical amino acid scales
- ``scales_raw``: Original amino acid scales sourced from AAindex and two other datasets.
- ``scales``: Min-max normalized version of the raw scales.
- ``scales_pc``: Scales compressed using principal component analysis (PCA).

Each amino acid scale is identified by a unique id (column) and assigns a numerical value to each canonical amino acid:

In [1]:
import aaanalysis as aa
df_scales = aa.load_scales()
df_raw = aa.load_scales(name="scales_raw")
df_pc = aa.load_scales(name="scales_pc")
df_scales

AA,ANDN920101,ARGP820101,ARGP820102,ARGP820103,BEGF750101,BEGF750102,BEGF750103,BHAR880101,BIGC670101,BIOV880101,...,KOEH090103,KOEH090104,KOEH090105,KOEH090106,KOEH090107,KOEH090108,KOEH090109,KOEH090110,KOEH090111,KOEH090112
A,0.494,0.23,0.355,0.504,1.0,0.512,0.0,0.249,0.164,0.476,...,0.194,0.3,0.551,0.222,0.308,0.273,0.221,0.14,0.522,0.345
C,0.864,0.404,0.579,0.387,0.0,0.233,0.783,0.205,0.323,0.936,...,0.279,0.222,0.174,0.0,0.449,0.273,0.346,0.285,0.416,1.0
D,1.0,0.174,0.0,0.0,0.404,0.233,1.0,0.867,0.324,0.191,...,0.583,0.889,0.72,0.556,1.0,1.0,0.875,0.919,0.796,0.44
E,0.42,0.177,0.019,0.032,0.713,0.0,0.267,0.811,0.488,0.106,...,0.542,0.889,0.732,0.593,0.718,0.857,0.853,0.913,1.0,0.681
F,0.877,0.762,0.601,0.67,0.574,1.0,0.267,0.076,0.783,1.0,...,0.049,0.189,0.0,0.148,0.0,0.182,0.0,0.029,0.186,0.017
G,0.025,0.026,0.138,0.17,0.309,0.233,1.0,1.0,0.0,0.388,...,0.23,0.544,0.608,0.222,0.538,0.571,0.191,0.221,0.481,0.112
H,0.84,0.23,0.082,0.053,0.574,0.651,0.633,0.112,0.561,0.579,...,0.455,0.856,0.402,0.37,0.5,0.545,0.618,0.651,0.726,0.267
I,0.0,0.838,0.44,0.543,0.713,1.0,0.0,0.671,0.663,0.885,...,0.0,0.0,0.246,0.074,0.167,0.091,0.051,0.029,0.398,0.276
K,0.506,0.434,0.003,0.004,0.574,0.0,0.633,0.687,0.694,0.0,...,0.737,0.933,0.873,1.0,0.936,0.779,1.0,1.0,0.734,0.405
L,0.272,0.577,1.0,0.989,1.0,0.651,0.267,0.281,0.663,0.867,...,0.082,0.078,0.233,0.148,0.167,0.0,0.118,0.0,0.333,0.259
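
The ``scales`` dataset is described above as the min-max normalized version of ``scales_raw``. The following is a minimal sketch of that idea in plain Python, assuming the normalization is applied per scale (per column) to the range [0, 1]. The raw values and the helper name ``min_max_normalize`` are hypothetical and only serve as illustration; this is not the library's implementation.

```python
# Minimal sketch of per-scale min-max normalization (plain Python, toy data).
# The raw values are arbitrary placeholders, not real scale values.
def min_max_normalize(values):
    """Rescale a dict {residue: raw_value} linearly to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # A constant scale has no spread; map every residue to 0.0
        return {residue: 0.0 for residue in values}
    return {residue: (v - lo) / (hi - lo) for residue, v in values.items()}

raw_scale = {"A": 2.0, "C": 5.0, "D": -1.0, "E": 0.5, "F": 8.0}
norm_scale = min_max_normalize(raw_scale)

for residue, value in norm_scale.items():
    print(residue, round(value, 3))
```

After normalization, the smallest raw value maps to 0 and the largest to 1, which matches the pattern visible in each column of the output above.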


## AAontology
- ``scales_cat`` provides a two-level classification for all ``scales``, termed AAontology.
 
The entries in the ``scale_id`` column match the column names of ``df_scales`` and ``df_raw``. The remaining columns provide further information from AAontology, such as the scale category and description.

In [2]:
df_cat = aa.load_scales(name="scales_cat")
df_cat.head(10)

,scale_id,category,subcategory,scale_name,scale_description
0,LINS030110,ASA/Volume,Accessible surface area (ASA),ASA (folded coil/turn),Total median accessible surfaces of whole resi...
1,LINS030113,ASA/Volume,Accessible surface area (ASA),ASA (folded coil/turn),% total accessible surfaces of whole residues ...
2,JANJ780101,ASA/Volume,Accessible surface area (ASA),ASA (folded protein),"Average accessible surface area (Janin et al.,..."
3,JANJ780103,ASA/Volume,Accessible surface area (ASA),ASA (folded protein),"Percentage of exposed residues (Janin et al., ..."
4,LINS030104,ASA/Volume,Accessible surface area (ASA),ASA (folded protein),Total median accessible surfaces of whole resi...
5,LINS030107,ASA/Volume,Accessible surface area (ASA),ASA (folded protein),% total accessible surfaces vs win3 of whole r...
6,CHOC760102,ASA/Volume,Accessible surface area (ASA),ASA (folded proteins),Residue accessible surface area in folded prot...
7,LINS030116,ASA/Volume,Accessible surface area (ASA),ASA (folded β-strand),Total median accessible surfaces of whole resi...
8,LINS030119,ASA/Volume,Accessible surface area (ASA),ASA (folded β-strand),% total accessible surfaces of whole residues ...
9,LINS030103,ASA/Volume,Accessible surface area (ASA),Hydrophilic ASA,Hydrophilic accessible surfaces of whole resid...
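
Because the ``scale_id`` entries match the scale column names, the classification can be used to collect scale ids that belong together. The sketch below groups a few ``(scale_id, scale_name)`` pairs copied from the output above; it uses plain Python rather than pandas, and the variable names are illustrative only.

```python
from collections import defaultdict

# (scale_id, scale_name) pairs copied from the first rows of the output above
rows = [
    ("LINS030110", "ASA (folded coil/turn)"),
    ("LINS030113", "ASA (folded coil/turn)"),
    ("JANJ780101", "ASA (folded protein)"),
    ("JANJ780103", "ASA (folded protein)"),
    ("LINS030103", "Hydrophilic ASA"),
]

# Collect all scale ids sharing the same scale name
ids_by_name = defaultdict(list)
for scale_id, scale_name in rows:
    ids_by_name[scale_name].append(scale_id)

for scale_name, scale_ids in ids_by_name.items():
    print(f"{scale_name}: {scale_ids}")
```

A list of ids obtained this way could then serve as a column selection on the scale DataFrame, for example ``df_scales[scale_ids]``.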


## Redundancy-reduced scale subsets
The remaining two datasets stem from an in-depth analysis of redundancy-reduced subsets of ``scales`` using AAclust. 

- ``top60`` comprises the 60 best performing scale sets, benchmarked on our protein datasets, which are available via the ``aa.load_dataset`` function.

Each scale set has a unique AAclust id (``top60_id`` index, 'AAC' for AAclust), and the presence (1) or absence (0) of each scale in the set is indicated.

In [3]:
df_top60 = aa.load_scales(name="top60")
n_scales = df_top60.sum(axis=1)    # Number of scales in each subset
df_top60.head(10)

top60_id,ANDN920101,ARGP820101,ARGP820102,ARGP820103,BEGF750101,BEGF750102,BEGF750103,BHAR880101,BIGC670101,BIOV880101,...,KOEH090103,KOEH090104,KOEH090105,KOEH090106,KOEH090107,KOEH090108,KOEH090109,KOEH090110,KOEH090111,KOEH090112
AAC01,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
AAC02,1,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
AAC03,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
AAC04,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
AAC05,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
AAC06,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
AAC07,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
AAC08,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
AAC09,1,0,0,1,0,1,1,1,0,0,...,0,1,0,0,0,0,0,0,0,1
AAC10,1,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
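
To illustrate how such a presence/absence row translates into a scale subset, here is a small plain-Python sketch. It uses only the first four columns of the first two rows shown above, so the counts are for this excerpt and not the real subset sizes. The helper ``selected_scales`` is a hypothetical name.

```python
# Excerpt of the output above: first four scale columns, first two subsets
scale_ids = ["ANDN920101", "ARGP820101", "ARGP820102", "ARGP820103"]
membership = {
    "AAC01": [0, 0, 0, 1],
    "AAC02": [1, 0, 0, 1],
}

def selected_scales(top60_id):
    """Return the ids of all scales flagged with 1 for the given subset."""
    flags = membership[top60_id]
    return [scale_id for scale_id, flag in zip(scale_ids, flags) if flag == 1]

# Summing the 0/1 flags of a row gives the number of scales in that subset
n_selected = {top60_id: sum(flags) for top60_id, flags in membership.items()}

print(selected_scales("AAC01"))
print(selected_scales("AAC02"))
print(n_selected)
```

Summing the flags per row is the same operation that ``df_top60.sum(axis=1)`` performs on the full DataFrame.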


- ``top60_eval`` shows the average accuracy of each scale subset (rows, indexed by ``top60_id``) on each tested protein benchmark (columns):

In [4]:
df_eval = aa.load_scales(name="top60_eval")
avg_acc = df_eval.mean(axis=1)  # Overall average performance used for ranking
df_eval

top60_id,SEQ_AMYLO,SEQ_CAPSID,SEQ_DISULFIDE,SEQ_LOCATION,SEQ_SOLUBLE,SEQ_TAIL,AA5_CASPASE3,AA5_FURIN,AA5_LDR,AA5_MMP2,...,AA9_LDR,AA9_MMP2,AA9_RNABIND,AA9_SA,AA13_CASPASE3,AA13_FURIN,AA13_LDR,AA13_MMP2,AA13_RNABIND,AA13_SA
AAC01,0.761,0.827,0.732,0.746,0.646,0.884,0.862,0.901,0.612,0.68,...,0.644,0.703,0.659,0.664,0.79,0.918,0.694,0.681,0.652,0.615
AAC02,0.747,0.83,0.733,0.742,0.653,0.886,0.855,0.907,0.608,0.688,...,0.642,0.706,0.657,0.671,0.792,0.916,0.69,0.676,0.656,0.608
AAC03,0.741,0.829,0.734,0.746,0.648,0.884,0.857,0.904,0.601,0.685,...,0.636,0.71,0.659,0.67,0.791,0.914,0.695,0.684,0.656,0.613
AAC04,0.747,0.828,0.731,0.747,0.654,0.885,0.859,0.906,0.605,0.686,...,0.64,0.714,0.654,0.664,0.79,0.915,0.689,0.68,0.656,0.61
AAC05,0.739,0.83,0.735,0.752,0.646,0.888,0.859,0.906,0.601,0.684,...,0.64,0.713,0.658,0.671,0.79,0.918,0.689,0.682,0.649,0.607
AAC06,0.743,0.827,0.736,0.746,0.652,0.883,0.857,0.906,0.608,0.684,...,0.642,0.709,0.659,0.665,0.789,0.915,0.691,0.68,0.653,0.611
AAC07,0.742,0.833,0.736,0.747,0.65,0.882,0.858,0.901,0.606,0.688,...,0.638,0.711,0.661,0.671,0.789,0.914,0.689,0.682,0.655,0.606
AAC08,0.743,0.831,0.733,0.746,0.65,0.884,0.858,0.903,0.603,0.687,...,0.64,0.71,0.658,0.669,0.787,0.916,0.689,0.681,0.654,0.608
AAC09,0.753,0.826,0.736,0.747,0.647,0.882,0.858,0.905,0.608,0.684,...,0.642,0.703,0.653,0.664,0.787,0.915,0.69,0.679,0.647,0.614
AAC10,0.75,0.828,0.734,0.748,0.646,0.886,0.86,0.908,0.602,0.684,...,0.632,0.711,0.655,0.67,0.787,0.913,0.689,0.676,0.653,0.61
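
The ranking by overall average performance can be sketched as a row-wise mean followed by a descending sort. The example below uses only the first three benchmark columns of the first three rows shown above, so the resulting means are for this excerpt and do not reproduce the averages over all benchmarks. It is written in plain Python for illustration.

```python
from statistics import mean

# Excerpt of the output above: SEQ_AMYLO, SEQ_CAPSID, SEQ_DISULFIDE only
accuracies = {
    "AAC01": [0.761, 0.827, 0.732],
    "AAC02": [0.747, 0.83, 0.733],
    "AAC03": [0.741, 0.829, 0.734],
}

# Average accuracy per subset across the (excerpted) benchmarks
mean_acc = {top60_id: mean(values) for top60_id, values in accuracies.items()}

# Rank subsets from highest to lowest average accuracy
ranking = sorted(mean_acc, key=mean_acc.get, reverse=True)

for top60_id in ranking:
    print(f"{top60_id}: {mean_acc[top60_id]:.3f}")
```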


Use the ``top60_n`` parameter to select the n-th best scale set, returned as ``scales``, ``scales_raw``, or ``scales_cat``:

In [5]:
df_cat_1 = aa.load_scales(name="scales_cat", top60_n=1)
df_raw_1 = aa.load_scales(name="scales_raw", top60_n=1)
df_scales_1 = aa.load_scales(top60_n=1)
# The normalized scale set (df_scales_1) can equivalently be selected via top60:
df_top60 = aa.load_scales(name="top60")
selected_scales = df_top60.columns[df_top60.loc["AAC01"] == 1].tolist()
df_aac1 = df_scales[selected_scales] 
df_scales_1

AA,ARGP820103,BUNA790103,BURA740102,CHAM820102,CHAM830102,CHAM830103,CHAM830105,CHOP780101,CHOP780204,CHOP780206,...,KARS160110,KARS160112,KARS160118,KARS160119,KARS160120,KARS160122,LINS030105,LINS030109,KOEH090110,KOEH090112
A,0.504,0.691,0.264,0.262,0.182,0.0,0.0,0.174,0.425,0.298,...,0.863,0.697,0.429,0.205,0.952,0.0,0.111,0.149,0.14,0.345
C,0.387,0.819,1.0,0.947,0.442,0.5,0.0,0.661,0.11,0.256,...,0.557,0.697,1.0,0.863,0.952,0.5,0.0,0.0,0.285,1.0
D,0.0,0.745,0.213,0.601,0.397,0.5,0.0,0.908,0.79,0.529,...,0.293,0.473,0.566,0.889,0.952,0.247,0.2,0.809,0.919,0.44
E,0.032,0.745,0.186,0.561,0.017,0.5,0.667,0.248,1.0,0.579,...,0.396,0.311,0.544,0.893,0.952,0.152,0.4,0.894,0.913,0.681
F,0.67,1.0,0.354,0.462,0.354,0.5,0.667,0.119,0.085,0.488,...,0.208,0.139,0.429,0.821,0.952,0.169,0.133,0.0,0.029,0.017
G,0.17,0.596,0.336,0.24,0.47,0.0,0.0,1.0,0.16,0.884,...,0.863,0.393,0.0,0.0,0.952,0.0,0.133,0.298,0.221,0.112
H,0.053,0.851,0.0,0.313,0.514,0.5,0.667,0.44,0.145,0.727,...,0.355,0.213,0.206,0.708,0.562,0.134,0.4,0.489,0.651,0.267
I,0.543,0.745,0.634,0.424,0.633,1.0,0.0,0.0,0.115,0.364,...,0.033,0.163,0.429,0.733,0.583,0.281,0.044,0.0,0.029,0.276
K,0.004,0.691,0.195,0.313,0.376,0.5,0.667,0.495,0.11,0.554,...,0.433,0.312,0.458,0.647,0.912,0.114,1.0,1.0,1.0,0.405
L,0.989,0.691,0.601,0.463,0.267,0.5,0.0,0.11,0.07,0.421,...,0.218,0.553,0.429,0.74,0.952,0.259,0.067,0.021,0.0,0.259


## Filtering of scales
Two parameters are provided to filter ``df_scales``, ``df_cat``, and ``df_raw``. To exclude scales from the two other data sources (i.e., scales not contained in AAindex), set ``just_aaindex=True``; this filter is disabled by default. AAontology also contains scales that were not assigned to any subcategory, named 'unclassified' scales. To exclude these scales (together with scales from the 'Others' category), set ``unclassified_in=False``:

In [6]:
df_scales = aa.load_scales(just_aaindex=True)
df_raw = aa.load_scales(name="scales_raw", unclassified_in=False)
df_cat = aa.load_scales(name="scales_cat", just_aaindex=True, unclassified_in=False)
df_cat.head(10)

,scale_id,category,subcategory,scale_name,scale_description
0,JANJ780101,ASA/Volume,Accessible surface area (ASA),ASA (folded protein),"Average accessible surface area (Janin et al.,..."
1,JANJ780103,ASA/Volume,Accessible surface area (ASA),ASA (folded protein),"Percentage of exposed residues (Janin et al., ..."
2,CHOC760102,ASA/Volume,Accessible surface area (ASA),ASA (folded proteins),Residue accessible surface area in folded prot...
3,GUYH850104,ASA/Volume,Accessible surface area (ASA),Partition energy,Apparent partition energies calculated from Ja...
4,GUYH850105,ASA/Volume,Accessible surface area (ASA),Partition energy,Apparent partition energies calculated from Ch...
5,RACS770103,ASA/Volume,Accessible surface area (ASA),Side chain orientation,Side chain orientational preference (Rackovsky...
6,VHEG790101,ASA/Volume,Accessible surface area (ASA),TFE to lipophilic phase,Transfer free energy to lipophilic phase (von ...
7,BIOV880101,ASA/Volume,Buried,Buriability,Information value for accessibility; average f...
8,BIOV880102,ASA/Volume,Buried,Buriability,Information value for accessibility; average f...
9,WERD780101,ASA/Volume,Buried,Buriability,Propensity to be buried inside (Wertz-Scheraga...
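
The combined effect of the two filters can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the records, the ``in_aaindex`` field, the category and subcategory labels, and the rule for detecting unclassified scales (the word "unclassified" in the subcategory name) are all hypothetical.

```python
# Hypothetical records; ids, labels, and the in_aaindex field are placeholders
records = [
    {"scale_id": "SCALE_A", "category": "CAT_1", "subcategory": "SUBCAT_1", "in_aaindex": True},
    {"scale_id": "SCALE_B", "category": "CAT_1", "subcategory": "Unclassified (CAT_1)", "in_aaindex": True},
    {"scale_id": "SCALE_C", "category": "Others", "subcategory": "SUBCAT_2", "in_aaindex": True},
    {"scale_id": "SCALE_D", "category": "CAT_2", "subcategory": "SUBCAT_3", "in_aaindex": False},
]

def filter_scales(records, just_aaindex=False, unclassified_in=True):
    """Return the scale ids that pass both filters (defaults keep everything)."""
    kept = []
    for rec in records:
        # Drop scales from the other data sources if requested
        if just_aaindex and not rec["in_aaindex"]:
            continue
        # Assumption: unclassified scales are recognizable by their subcategory name
        is_unclassified = (
            "unclassified" in rec["subcategory"].lower()
            or rec["category"] == "Others"
        )
        if not unclassified_in and is_unclassified:
            continue
        kept.append(rec["scale_id"])
    return kept

print(filter_scales(records))
print(filter_scales(records, just_aaindex=True))
print(filter_scales(records, unclassified_in=False))
print(filter_scales(records, just_aaindex=True, unclassified_in=False))
```

The two filters are independent, so combining them keeps only scales that satisfy both conditions.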
