We load an example scale dataset to showcase the ``AAclust().filter_coverage()`` method:

In [6]:
import aaanalysis as aa
aa.options["verbose"] = False

# Create test dataset of 100 amino acid scales
df_scales = aa.load_scales()
scale_ids = list(df_scales.columns)[:100]
X = df_scales[scale_ids].T.values

# Create list of subcategories to be covered
df_cat = aa.load_scales(name="scales_cat")
names_ref = df_cat[df_cat["scale_id"].isin(scale_ids)]["subcategory"].tolist()

# Create AAclust model
aac = aa.AAclust()

Internally, the method fits the ``AAclust`` model and iteratively computes the subcategory coverage until a given threshold is reached. Provide the scales as a feature matrix (``X``) and as a list of IDs (``scale_ids``), along with a category DataFrame (``df_cat``) and the list of subcategories to cover (``names_ref``):

In [7]:
# Select scales based on subcategory coverage
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref)
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]
aa.display_df(df_cat_selected, show_shape=True, n_rows=5)

From 100 scales, the following 69 scales were selected:
DataFrame shape: (69, 5)


Unnamed: 0,scale_id,category,subcategory,scale_name,scale_description
7,CHOC760102,ASA/Volume,Accessible surface area (ASA),ASA (folded proteins),"Residue accessi...(Chothia, 1976)"
24,BIOV880101,ASA/Volume,Buried,Buriability,"Information val...u et al., 1988)"
28,ARGP820103,ASA/Volume,Buried,Buried,"Membrane-buried...s et al., 1982)"
30,CHOC760104,ASA/Volume,Buried,Buried,"Proportion of r...(Chothia, 1976)"
46,COHE430101,ASA/Volume,Partial specific volume,Partial specific volume,"Partial specifi...n-Edsall, 1943)"
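
To make the notion of coverage concrete, the following is a minimal sketch of how a coverage percentage could be computed from two lists of subcategory names. The helper ``coverage_percent`` is hypothetical and written only for illustration; it is not part of ``aaanalysis`` and may differ from what ``filter_coverage()`` does internally. It assumes that coverage means the share of unique reference names that occur among the selected names.

```python
def coverage_percent(selected_names, names_ref):
    """Return the percentage of unique reference names found among the selected names.

    Hypothetical helper for illustration; not the aaanalysis implementation.
    """
    ref = set(names_ref)
    if not ref:
        raise ValueError("names_ref must not be empty")
    covered = ref & set(selected_names)
    return 100.0 * len(covered) / len(ref)


# Toy example: four unique subcategories to cover, two of them are selected
names_ref_toy = ["Buried", "Buried", "Volume", "Partial specific volume",
                 "Accessible surface area (ASA)"]
selected_toy = ["Buried", "Volume"]
print(coverage_percent(selected_toy, names_ref_toy))  # 50.0
```

Duplicates in either list do not change the result under this assumption, because only unique names are compared.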


Adjust the subcategory coverage threshold (in percent) using the ``min_coverage`` (default=100) parameter:

In [8]:
# Select scales using a subcategory coverage of 50%
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref, min_coverage=50)
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]
aa.display_df(df_cat_selected, show_shape=True, n_rows=5)

From 100 scales, the following 41 scales were selected:
DataFrame shape: (41, 5)


Unnamed: 0,scale_id,category,subcategory,scale_name,scale_description
24,BIOV880101,ASA/Volume,Buried,Buriability,"Information val...u et al., 1988)"
28,ARGP820103,ASA/Volume,Buried,Buried,"Membrane-buried...s et al., 1982)"
29,CHOC760103,ASA/Volume,Buried,Buried,"Proportion of r...(Chothia, 1976)"
46,COHE430101,ASA/Volume,Partial specific volume,Partial specific volume,"Partial specifi...n-Edsall, 1943)"
48,CHOC760101,ASA/Volume,Volume,Accessible surface area (ASA),"Residue accessi...(Chothia, 1976)"
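
One way to picture the effect of the threshold is a loop that evaluates candidate selections one after another and stops at the first one whose coverage reaches ``min_coverage``. The sketch below is an assumption about the general idea, not the library's algorithm: the function ``first_selection_reaching``, the scale IDs ``S1`` to ``S4``, and the way candidates are generated are all hypothetical.

```python
def first_selection_reaching(candidates, id_to_name, names_ref, min_coverage=100):
    """Return the first candidate selection whose coverage is >= min_coverage.

    Hypothetical sketch; candidates is an ordered list of scale-ID lists.
    """
    ref = set(names_ref)
    for selection in candidates:
        # Names (e.g., subcategories) represented by the selected scale IDs
        covered = {id_to_name[scale_id] for scale_id in selection} & ref
        coverage = 100.0 * len(covered) / len(ref)
        if coverage >= min_coverage:
            return list(selection), coverage
    raise ValueError("No candidate selection reaches the requested coverage")


# Toy data: four hypothetical scales spanning three unique subcategories
id_to_name = {"S1": "Buried", "S2": "Volume", "S3": "Buried", "S4": "Hydrophobicity"}
names_ref_toy = list(id_to_name.values())
candidates = [["S1"], ["S1", "S2"], ["S1", "S2", "S4"]]

sel_50, cov_50 = first_selection_reaching(candidates, id_to_name, names_ref_toy, min_coverage=50)
sel_100, cov_100 = first_selection_reaching(candidates, id_to_name, names_ref_toy, min_coverage=100)
print(sel_50, round(cov_50, 1))
print(sel_100, round(cov_100, 1))
```

In this toy setting, a lower threshold is satisfied by a smaller selection, which is consistent with fewer scales being selected at ``min_coverage=50`` than at the default of 100.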


Instead of using the 'subcategory' coverage, you can compute the coverage of 'category' or 'scale_name' by adjusting the ``col_name`` (default='subcategory') parameter:

In [9]:
# Create list of categories to be covered
df_cat = aa.load_scales(name="scales_cat")
names_ref = df_cat[df_cat["scale_id"].isin(scale_ids)]["category"].tolist()

# Select scales using a category coverage of 100%
selected_scale_ids = aac.filter_coverage(X, scale_ids=scale_ids, df_cat=df_cat, names_ref=names_ref, col_name="category")
print(f"From 100 scales, the following {len(selected_scale_ids)} scales were selected:")
df_cat_selected = df_cat[df_cat["scale_id"].isin(selected_scale_ids)]
aa.display_df(df_cat_selected, show_shape=True, n_rows=5)

From 100 scales, the following 23 scales were selected:
DataFrame shape: (23, 5)


Unnamed: 0,scale_id,category,subcategory,scale_name,scale_description
24,BIOV880101,ASA/Volume,Buried,Buriability,"Information val...u et al., 1988)"
30,CHOC760104,ASA/Volume,Buried,Buried,"Proportion of r...(Chothia, 1976)"
46,COHE430101,ASA/Volume,Partial specific volume,Partial specific volume,"Partial specifi...n-Edsall, 1943)"
48,CHOC760101,ASA/Volume,Volume,Accessible surface area (ASA),"Residue accessi...(Chothia, 1976)"
54,DAWD720101,ASA/Volume,Volume,Volume,"Size (Dawson, 1972)"
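
The choice of ``col_name`` determines how many distinct names have to be covered. The sketch below counts the unique names per column for the five rows displayed above, using plain Python dictionaries as a stand-in for ``df_cat``. The helper ``unique_names`` is hypothetical and only illustrates the granularity of each column.

```python
# The five rows shown in the output above (scale_description omitted)
rows = [
    {"scale_id": "BIOV880101", "category": "ASA/Volume", "subcategory": "Buried",
     "scale_name": "Buriability"},
    {"scale_id": "CHOC760104", "category": "ASA/Volume", "subcategory": "Buried",
     "scale_name": "Buried"},
    {"scale_id": "COHE430101", "category": "ASA/Volume", "subcategory": "Partial specific volume",
     "scale_name": "Partial specific volume"},
    {"scale_id": "CHOC760101", "category": "ASA/Volume", "subcategory": "Volume",
     "scale_name": "Accessible surface area (ASA)"},
    {"scale_id": "DAWD720101", "category": "ASA/Volume", "subcategory": "Volume",
     "scale_name": "Volume"},
]


def unique_names(rows, col_name="subcategory"):
    """Return the sorted unique names in the given column (hypothetical helper)."""
    return sorted({row[col_name] for row in rows})


for col in ["category", "subcategory", "scale_name"]:
    print(col, len(unique_names(rows, col)))
```

For these rows, 'category' yields a single name, 'subcategory' three, and 'scale_name' five, so a coarser column leaves fewer names to cover.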
