# Scale Loading Tutorial
This tutorial covers the `load_scales` function, which loads sets of amino acid scales, their classification (AAontology), or their evaluation (AAclust top60).

Six datasets can be loaded in total via the ``name`` parameter: ``scales``, ``scales_raw``, ``scales_pc``, ``scales_cat``, ``top60``, and ``top60_eval``.

## Three sets of numerical amino acid scales
- ``scales_raw``: Original amino acid scales sourced from AAindex and two other datasets.
- ``scales``: Min-max normalized version of the raw scales.
- ``scales_pc``: Scales compressed using principal component analysis (PCA).
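
The relationship between ``scales_raw`` and ``scales`` can be illustrated with a minimal sketch of min-max normalization in plain Python. The raw values below are made-up toy numbers for five amino acids and are not taken from any dataset. The helper ``min_max_normalize`` is hypothetical and not part of aaanalysis. The sketch assumes the usual formula (x - min) / (max - min), which maps each scale onto the range from 0 to 1.

```python
def min_max_normalize(scale):
    """Rescale the values of one scale (dict: amino acid -> value) to [0, 1]."""
    lowest, highest = min(scale.values()), max(scale.values())
    value_range = highest - lowest
    if value_range == 0:
        raise ValueError("A constant scale cannot be min-max normalized")
    return {residue: (value - lowest) / value_range
            for residue, value in scale.items()}


# Toy raw scale (hypothetical values, for illustration only)
raw_scale = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8}
normalized_scale = min_max_normalize(raw_scale)

for residue, value in normalized_scale.items():
    print(residue, round(value, 3))
```

After normalization, the smallest raw value maps to 0 and the largest to 1, which matches the value range visible in the ``scales`` output shown in this tutorial.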

Each amino acid scale is identified by a unique id (columns) and assigns a numerical value to each canonical amino acid:

In [11]:
import aaanalysis as aa
df_scales = aa.load_scales()
df_raw = aa.load_scales(name="scales_raw")
df_pc = aa.load_scales(name="scales_pc")
df_scales.iloc[:5, :4]

| AA | ANDN920101 | ARGP820101 | ARGP820102 | ARGP820103 |
|---|---|---|---|---|
| A | 0.494 | 0.23 | 0.355 | 0.504 |
| C | 0.864 | 0.404 | 0.579 | 0.387 |
| D | 1.0 | 0.174 | 0.0 | 0.0 |
| E | 0.42 | 0.177 | 0.019 | 0.032 |
| F | 0.877 | 0.762 | 0.601 | 0.67 |


## AAontology
- ``scales_cat`` provides a two-level classification for all ``scales``, termed AAontology.
 
The entries in the ``scale_id`` column match the column names of ``df_scales`` and ``df_raw``. The remaining columns provide further information from AAontology, such as the scale category or description.

In [12]:
df_cat = aa.load_scales(name="scales_cat")
df_cat.head(5)

|   | scale_id | category | subcategory | scale_name | scale_description |
|---|---|---|---|---|---|
| 0 | LINS030110 | ASA/Volume | Accessible surface area (ASA) | ASA (folded coil/turn) | Total median accessible surfaces of whole resi... |
| 1 | LINS030113 | ASA/Volume | Accessible surface area (ASA) | ASA (folded coil/turn) | % total accessible surfaces of whole residues ... |
| 2 | JANJ780101 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Average accessible surface area (Janin et al.,... |
| 3 | JANJ780103 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Percentage of exposed residues (Janin et al., ... |
| 4 | LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median accessible surfaces of whole resi... |
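
Because every row carries a ``scale_id`` together with its classification, scales can be grouped by any AAontology level. The following is a minimal sketch in plain Python that uses only the five rows displayed above (descriptions omitted). With the full DataFrame, a pandas ``groupby`` on ``df_cat`` would be the more natural route.

```python
from collections import defaultdict

# (scale_id, category, subcategory, scale_name), copied from the five rows shown above
rows = [
    ("LINS030110", "ASA/Volume", "Accessible surface area (ASA)", "ASA (folded coil/turn)"),
    ("LINS030113", "ASA/Volume", "Accessible surface area (ASA)", "ASA (folded coil/turn)"),
    ("JANJ780101", "ASA/Volume", "Accessible surface area (ASA)", "ASA (folded protein)"),
    ("JANJ780103", "ASA/Volume", "Accessible surface area (ASA)", "ASA (folded protein)"),
    ("LINS030104", "ASA/Volume", "Accessible surface area (ASA)", "ASA (folded protein)"),
]

# Collect the scale ids that share the same scale name
scales_by_name = defaultdict(list)
for scale_id, category, subcategory, scale_name in rows:
    scales_by_name[scale_name].append(scale_id)

for scale_name, scale_ids in scales_by_name.items():
    print(f"{scale_name}: {scale_ids}")
```

The resulting id lists could then be used to select the matching columns of ``df_scales`` or ``df_raw``.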


## Redundancy-reduced scale subsets
The remaining two datasets stem from an in-depth analysis of redundancy-reduced subsets of ``scales`` using AAclust. 

- ``top60`` comprises the 60 best-performing scale sets, benchmarked on our protein datasets, which are available via the ``aa.load_dataset`` function.

Each scale set has a unique AAclust id (``top60_id`` index, 'AAC' for AAclust), and the presence (1) or absence (0) of each scale is indicated:

In [13]:
df_top60 = aa.load_scales(name="top60")
df_top60.sum(axis=1)    # Shows number of scales in each subset
df_top60.iloc[:5, :4]

| top60_id | ANDN920101 | ARGP820101 | ARGP820102 | ARGP820103 |
|---|---|---|---|---|
| AAC01 | 0 | 0 | 0 | 1 |
| AAC02 | 1 | 0 | 0 | 1 |
| AAC03 | 1 | 0 | 0 | 1 |
| AAC04 | 1 | 0 | 0 | 1 |
| AAC05 | 1 | 0 | 0 | 1 |
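
The presence/absence encoding can be read as a mask over the scale ids. The sketch below reproduces this idea in plain Python for the first three subsets and the four columns displayed above. The helper ``scales_in_subset`` is a hypothetical name used for illustration only, and the real subsets contain more scales than these four columns can show.

```python
# Column names and 0/1 flags copied from the output shown above
scale_ids = ["ANDN920101", "ARGP820101", "ARGP820102", "ARGP820103"]
presence = {
    "AAC01": [0, 0, 0, 1],
    "AAC02": [1, 0, 0, 1],
    "AAC03": [1, 0, 0, 1],
}


def scales_in_subset(top60_id):
    """Return the ids of all scales flagged with 1 for the given subset."""
    flags = presence[top60_id]
    return [scale_id for scale_id, flag in zip(scale_ids, flags) if flag == 1]


print(scales_in_subset("AAC01"))
print(scales_in_subset("AAC02"))

# Number of selected scales per subset (restricted to the four columns shown)
n_scales = {top60_id: sum(flags) for top60_id, flags in presence.items()}
print(n_scales)
```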


- ``top60_eval`` shows the average accuracy of each scale subset, identified by its id (index), across all tested protein benchmarks (columns):

In [14]:
df_eval = aa.load_scales(name="top60_eval")
df_eval.mean(axis=1)  # Shows the overall average performance used for ranking
df_eval.iloc[:5, :4]

| top60_id | SEQ_AMYLO | SEQ_CAPSID | SEQ_DISULFIDE | SEQ_LOCATION |
|---|---|---|---|---|
| AAC01 | 0.761 | 0.827 | 0.732 | 0.746 |
| AAC02 | 0.747 | 0.83 | 0.733 | 0.742 |
| AAC03 | 0.741 | 0.829 | 0.734 | 0.746 |
| AAC04 | 0.747 | 0.828 | 0.731 | 0.747 |
| AAC05 | 0.739 | 0.83 | 0.735 | 0.752 |
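
Ranking by mean accuracy can be sketched in plain Python with the values displayed above. Only four benchmark columns are shown here, so the ordering computed below is an illustration of the procedure and need not match the ranking derived from all benchmark columns.

```python
from statistics import mean

# Accuracies copied from the output shown above
# (SEQ_AMYLO, SEQ_CAPSID, SEQ_DISULFIDE, SEQ_LOCATION)
accuracy = {
    "AAC01": [0.761, 0.827, 0.732, 0.746],
    "AAC02": [0.747, 0.83, 0.733, 0.742],
    "AAC03": [0.741, 0.829, 0.734, 0.746],
    "AAC04": [0.747, 0.828, 0.731, 0.747],
    "AAC05": [0.739, 0.83, 0.735, 0.752],
}

# Average over the benchmarks, then sort the subsets from best to worst
mean_accuracy = {top60_id: mean(values) for top60_id, values in accuracy.items()}
ranking = sorted(mean_accuracy, key=mean_accuracy.get, reverse=True)

for top60_id in ranking:
    print(top60_id, round(mean_accuracy[top60_id], 4))
```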


Use the ``top60_n`` parameter to select the n-th best scale set, returned either as ``scales``, ``scales_raw``, or ``scales_cat``:

In [15]:
df_cat_1 = aa.load_scales(name="scales_cat", top60_n=1)
df_raw_1 = aa.load_scales(name="scales_raw", top60_n=1)
df_scales_1 = aa.load_scales(top60_n=1)
# Which is the same as 
df_top60 = aa.load_scales(name="top60")
selected_scales = df_top60.columns[df_top60.loc["AAC01"] == 1].tolist()
df_aac1 = df_scales[selected_scales] 

## Filtering of scales
Two parameters are provided to filter ``df_scales``, ``df_cat``, and ``df_raw``. Setting ``just_aaindex=True`` (disabled by default) excludes the scales from the other two data sources, i.e., scales not contained in AAindex. AAontology also comprises scales that were not assigned to any subcategory ('unclassified' scales); these can be excluded by setting ``unclassified_out=True``:

In [9]:
df_scales = aa.load_scales(just_aaindex=True, unclassified_out=True)
df_raw = aa.load_scales(name="scales_raw", just_aaindex=True, unclassified_out=True)
df_cat = aa.load_scales(name="scales_cat", just_aaindex=True, unclassified_out=True)