We load an example scale dataset to showcase the ``AAclust.fit()`` method:

In [50]:
import aaanalysis as aa
aa.options["verbose"] = False
# Create test dataset of 25 randomly sampled amino acid scales
df_scales = aa.load_scales().T.sample(25).T
X = df_scales.T

Fitting ``AAclust`` runs its three-step algorithm to select an optimized ``n_clusters`` (k): (1) estimation of a lower bound for k, (2) refinement of k, and (3) optional merging of clusters. The results are stored as attributes of the fitted model:

In [51]:
# Fit clustering model
aac = aa.AAclust()
aac.fit(X)
# Get output parameters
n_clusters = aac.n_clusters
print("n_clusters: ", n_clusters)
labels = aac.labels_
print("Labels: ", labels)
centers = aac.centers_ # Cluster centers (average scales for each cluster)
labels_centers = aac.labels_centers_
medoids = aac.medoids_ # Representative scale for each cluster
labels_medoids = aac.labels_medoids_
print("Labels of medoids: ", labels_medoids)
is_medoid = aac.is_medoid_
df_scales_medoids = df_scales.T[is_medoid].T
aa.display_df(df_scales_medoids)

n_clusters:  4
Labels:  [2 2 1 1 1 2 3 0 1 0 0 0 0 1 3 3 3 2 0 1 1 0 0 1 0]
Labels of medoids:  [2 1 3 0]


AA,LINS030117,KOEH090110,CHOP780201,CASG920101
A,0.186,0.14,0.904,0.514
C,0.0,0.285,0.138,1.0
D,0.186,0.919,0.468,0.057
E,0.349,0.913,1.0,0.086
F,0.326,0.029,0.596,0.743
G,0.023,0.221,0.0,0.429
H,0.419,0.651,0.457,0.571
I,0.14,0.029,0.543,0.857
K,1.0,1.0,0.628,0.0
L,0.186,0.0,0.681,0.6
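
To make the distinction between ``centers_`` and ``medoids_`` concrete, the following is a minimal sketch for a single cluster. It assumes that the center is the element-wise mean of the member scales (as the code comment above states) and that the medoid is the member most strongly Pearson-correlated with that center; the latter is an assumption made for illustration. ``pearson`` and ``center_and_medoid`` are hypothetical helpers, not part of ``aaanalysis``, and the values are made up. Plain Python lists are used so that the sketch runs without extra dependencies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def center_and_medoid(members):
    """Return the cluster center and the index of the medoid (assumed definition)."""
    # Center: element-wise mean over all members of the cluster
    center = [sum(values) / len(members) for values in zip(*members)]
    # Medoid: the member with the highest correlation to the center
    medoid_idx = max(range(len(members)), key=lambda i: pearson(members[i], center))
    return center, medoid_idx


# Toy cluster: three made-up "scales" with four values each
toy_cluster = [
    [0.0, 1.0, 3.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 2.0, 3.0],
]
center, medoid_idx = center_and_medoid(toy_cluster)
print("Center:", [round(v, 3) for v in center])
print("Medoid index:", medoid_idx)
```

Unlike the center, the medoid is always one of the original scales, which is why ``is_medoid_`` can serve as a mask for selecting columns of ``df_scales``.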


Pass ``names`` to the ``AAclust.fit()`` method to retrieve the names of the medoids:

In [52]:
names = [f"scale {i+1}" for i in range(len(df_scales.T))]
aac.fit(X, names=names)
medoid_names = aac.medoid_names_
print(medoid_names)

['scale 7', 'scale 23', 'scale 3', 'scale 18']


The ``n_clusters`` parameter can also be pre-defined:

In [53]:
aac.fit(X, n_clusters=5, names=names)
medoid_names = aac.medoid_names_
print(medoid_names)

['scale 16', 'scale 23', 'scale 3', 'scale 6', 'scale 17']


The second step of the ``AAclust`` algorithm (recursive k optimization) can be adjusted using the ``min_th`` and ``on_center`` parameters:

In [54]:
# Minimum Pearson correlation between all cluster members >= 0.5
aac.fit(X, on_center=False, min_th=0.5)
print(aac.n_clusters)
# Minimum Pearson correlation between each cluster member and its cluster center >= 0.5
aac.fit(X, on_center=True, min_th=0.5)
print(aac.n_clusters)
# The latter is less strict, leading to bigger and thus fewer clusters

14
4
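
The difference between the two criteria can be illustrated on a single toy cluster. The sketch below assumes that ``on_center=False`` compares ``min_th`` with the lowest correlation over all pairs of members, and that ``on_center=True`` compares it with the lowest correlation between any member and the cluster center (taken here as the element-wise mean). The helpers ``min_cor_pairwise`` and ``min_cor_center`` are hypothetical names for illustration, not part of the ``aaanalysis`` API, and the values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def min_cor_pairwise(members):
    """Lowest correlation over all pairs of cluster members (stricter criterion)."""
    return min(pearson(a, b) for a, b in combinations(members, 2))


def min_cor_center(members):
    """Lowest correlation between any member and the cluster center."""
    center = [sum(values) / len(members) for values in zip(*members)]
    return min(pearson(member, center) for member in members)


# Toy cluster: three made-up "scales" with four values each
toy_cluster = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 3.0, 2.0],
    [2.0, 0.0, 1.0, 3.0],
]
min_th = 0.5
print("Pairwise criterion met:", min_cor_pairwise(toy_cluster) >= min_th)  # False
print("Center criterion met:", min_cor_center(toy_cluster) >= min_th)  # True
```

With ``min_th=0.5``, this toy cluster fails the pairwise criterion (its lowest pairwise correlation is 0.4) but passes the center criterion, which illustrates why ``on_center=True`` tends to produce bigger and thus fewer clusters.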


The third, optional merging step can be adjusted using the ``metric`` parameter or disabled by setting ``merge=False``. Because the ``AAclust.fit()`` method returns the fitted clustering model, attributes can be retrieved directly from the call:

In [55]:
# Load over 500 scales
X = aa.load_scales().T
n_with_merging_euclidean = aac.fit(X).n_clusters
n_with_merging_cosine = aac.fit(X, metric="cosine").n_clusters
n_without_merging = aac.fit(X, merge=False).n_clusters
print(n_with_merging_euclidean)
print(n_with_merging_cosine)
print(n_without_merging)

56
56
53
