We load an example scale dataset to showcase the ``AAclust.fit()`` method:

In [1]:
import aaanalysis as aa
aa.options["verbose"] = False
# Create test dataset of 25 amino acid scales
df_scales = aa.load_scales().T.sample(25).T
X = df_scales.T

Fitting ``AAclust`` runs its three-step algorithm to select an optimized ``n_clusters`` (k): (1) estimation of a lower bound of k, (2) refinement of k, and (3) optional cluster merging. Various results are saved as attributes:

In [2]:
# Fit clustering model
aac = aa.AAclust()
aac.fit(X)
# Get output parameters
n_clusters = aac.n_clusters
print("n_clusters: ", n_clusters)
labels = aac.labels_
print("Labels: ", labels)
centers = aac.centers_ # Cluster centers (average scales for each cluster)
labels_centers = aac.labels_centers_
medoids = aac.medoids_ # Representative scale for each cluster
labels_medoids = aac.labels_medoids_
print("Labels of medoids: ", labels_medoids)
is_medoid = aac.is_medoid_
df_scales_medoids = df_scales.T[is_medoid].T
aa.display_df(df_scales_medoids)

n_clusters:  5
Labels:  [1 1 1 3 0 1 4 3 0 2 0 4 2 0 0 2 1 0 1 0 2 3 3 2 4]
Labels of medoids:  [1 3 0 4 2]


AA,KARS160117,AURR980110,LINS030107,COHE430101,PTIO830101
A,0.082,1.0,0.2,0.5,0.87
C,0.344,0.242,0.0,0.033,0.739
D,0.443,0.455,0.8,0.0,0.478
E,0.541,0.958,0.911,0.2,0.783
F,0.672,0.491,0.067,0.567,0.87
G,0.0,0.103,0.422,0.133,0.435
H,0.656,0.188,0.467,0.233,0.652
I,0.377,0.57,0.022,1.0,0.87
K,0.492,0.661,1.0,0.733,0.783
L,0.377,0.8,0.044,1.0,1.0
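
The relationship between cluster centers and medoids can be illustrated with a small standalone sketch. It uses only the standard library and toy data, and it rests on an assumption that is not confirmed by the text above: the center is taken as the mean of the cluster members, and the medoid as the member with the highest Pearson correlation to that center. The helper names ``pearson`` and ``center_and_medoid`` are hypothetical and not part of ``aaanalysis``.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def center_and_medoid(X, labels, cluster):
    """Return (center, medoid_index) for one cluster.

    Assumption for illustration only: center = mean of the members,
    medoid = member most correlated with the center.
    """
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    n_features = len(X[0])
    center = [sum(X[i][j] for i in members) / len(members) for j in range(n_features)]
    medoid = max(members, key=lambda i: pearson(X[i], center))
    return center, medoid


# Toy data: 4 hypothetical scales (rows) over 5 amino acids (columns)
X_toy = [
    [0.0, 0.2, 0.4, 0.6, 1.0],
    [0.1, 0.2, 0.5, 0.7, 0.9],
    [0.0, 0.4, 0.3, 0.8, 1.0],
    [1.0, 0.8, 0.5, 0.2, 0.0],
]
labels_toy = [0, 0, 0, 1]

center_0, medoid_0 = center_and_medoid(X_toy, labels_toy, cluster=0)
center_1, medoid_1 = center_and_medoid(X_toy, labels_toy, cluster=1)
print("Medoid index of cluster 0:", medoid_0)
print("Medoid index of cluster 1:", medoid_1)
```

Unlike a center, a medoid is always one of the original scales, which is why ``is_medoid_`` can be used as a boolean mask to select columns from ``df_scales``.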


The ``names`` parameter can be passed to the ``AAclust.fit()`` method to retrieve the names of the medoids:

In [3]:
names = [f"scale {i+1}" for i in range(len(df_scales.T))]
aac.fit(X, names=names)
medoid_names = aac.medoid_names_
print(medoid_names)

['scale 24', 'scale 3', 'scale 14', 'scale 11']


The ``n_clusters`` parameter can also be pre-defined:

In [4]:
aac.fit(X, n_clusters=5, names=names)
medoid_names = aac.medoid_names_
print(medoid_names)

['scale 5', 'scale 17', 'scale 23', 'scale 10', 'scale 25']


The second step of the ``AAclust`` algorithm (recursive k optimization) can be adjusted using the ``min_th`` and ``on_center`` parameters:

In [5]:
# Pearson correlation between all pairs of cluster members >= 0.5
aac.fit(X, on_center=False, min_th=0.5)
print(aac.n_clusters)
# Pearson correlation between all cluster members and the respective center >= 0.5
aac.fit(X, on_center=True, min_th=0.5)
print(aac.n_clusters)
# The latter is less strict, leading to larger and thus fewer clusters

17
7
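
Why the center-based criterion is less strict can be seen in a toy example. The sketch below is not the ``AAclust`` implementation; it only computes the two quality measures described in the comments above (minimum pairwise correlation versus minimum correlation to the cluster center) for a single hand-made cluster. The function names ``min_cor_all`` and ``min_cor_center`` and the threshold value 0.6 are chosen for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def min_cor_all(cluster):
    """Minimum Pearson correlation over all pairs of cluster members."""
    return min(pearson(a, b) for a, b in combinations(cluster, 2))


def min_cor_center(cluster):
    """Minimum Pearson correlation between each member and the cluster mean."""
    n = len(cluster)
    center = [sum(col) / n for col in zip(*cluster)]
    return min(pearson(member, center) for member in cluster)


# One toy cluster of three hypothetical scales over four amino acids
cluster = [
    [1.0, 0.0, 0.5, 0.5],
    [1.0, 0.5, 0.0, 0.5],
    [1.0, 0.5, 0.5, 0.0],
]

min_th = 0.6  # illustrative threshold
pairwise = min_cor_all(cluster)      # approx. 0.5
to_center = min_cor_center(cluster)  # approx. 0.82
print(f"pairwise:  {pairwise:.2f} -> passes: {pairwise >= min_th}")
print(f"to center: {to_center:.2f} -> passes: {to_center >= min_th}")
```

In this example the same cluster fails the pairwise check but passes the center-based check, so a pairwise criterion would force it to be split further, which is consistent with the higher cluster count observed for ``on_center=False``.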


The third, optional merging step can be adjusted using the ``metric`` parameter and disabled by setting ``merge=False``. Because the ``AAclust.fit()`` method returns the fitted clustering model, attributes can be retrieved directly from the call:

In [6]:
# Load over 500 scales
X = aa.load_scales().T
n_with_merging_euclidean = aac.fit(X).n_clusters
n_with_merging_cosine = aac.fit(X, metric="cosine").n_clusters
n_without_merging = aac.fit(X, merge=False).n_clusters
print(n_with_merging_euclidean)
print(n_with_merging_cosine)
print(n_without_merging)

49
57
59
