We load an example scale dataset to showcase the ``AAclust().fit()`` method:

In [1]:
import aaanalysis as aa
aa.options["verbose"] = False

# Create test dataset of 25 amino acid scales
df_scales = aa.load_scales().T.sample(25).T
X = df_scales.T

By fitting ``AAclust``, its three-step algorithm is performed to select an optimized ``n_clusters`` (k). The three steps involve (1) an estimation of lower bound of k, (2) refinement of k, and (3) an optional clustering merging. Various results are saved as attributes: 

In [2]:
# Fit clustering model
aac = aa.AAclust()
aac.fit(X)
# Get output parameters
n_clusters = aac.n_clusters
print("n_clusters: ", n_clusters)
labels = aac.labels_
print("Labels: ", labels)
centers = aac.centers_ # Cluster centers (average scales for each cluster)
labels_centers = aac.labels_centers_
medoids = aac.medoids_ # Representative scale for each cluster
labels_medoids = aac.labels_medoids_
print("Labels of medoids: ", labels_medoids)
is_medoid = aac.is_medoid_
df_scales_medoids = df_scales.T[is_medoid].T
aa.display_df(df_scales_medoids, show_shape=True, n_rows=5)

n_clusters:  4
Labels:  [0 0 0 1 1 3 2 0 2 1 2 0 1 1 0 3 1 1 1 0 1 0 0 2 2]
Labels of medoids:  [0 1 3 2]
DataFrame shape: (20, 4)


Unnamed: 0_level_0,ISOY800107,MIYS850101,MIYS990103,EISD860101
AA,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,0.482,0.36,0.5,0.589
C,0.518,0.678,0.029,0.528
D,0.637,0.14,0.786,0.191
E,0.914,0.162,0.871,0.285
F,0.155,1.0,0.057,0.936


``names`` can be provided to the ``AAclust().fit()`` method to retrieve the names of the medoids:

In [3]:
names = [f"scale {i+1}" for i in range(len(df_scales.T))]
aac.fit(X, names=names)
medoid_names = aac.medoid_names_
print("Name of medoid scales:")
print(medoid_names)

Name of medoid scales:
['scale 10', 'scale 15', 'scale 4']


The ``n_clusters`` parameter can as well be pre-defined:

In [4]:
aac.fit(X, n_clusters=7, names=names)
medoid_names = aac.medoid_names_
print("Name of medoid scales:")
print(medoid_names)

Name of medoid scales:
['scale 20', 'scale 15', 'scale 22', 'scale 14', 'scale 6', 'scale 24', 'scale 9']


The second step of the ``AAclust`` algorithm (recursive k optimization) can be adjusted using the ``min_th`` and ``on_center`` parameters:

In [5]:
# Pearson correlation within all cluster members >= 0.5
aac.fit(X, on_center=False, min_th=0.5)
print("n clusters (pairwise correlation): ", aac.n_clusters)
# Pearson correlation between all cluster members and the respective center >= 0.5
aac.fit(X, on_center=True, min_th=0.5)
print("n clusters (center correlation): ", aac.n_clusters)
# The latter is less strict, leading to bigger and thus fewer clusters 

n clusters (pairwise correlation):  10
n clusters (center correlation):  5


The third and optional merging step can be adjusted using the ``metric`` parameter and disabled setting ``merge=False``. The attributes can be directly retrieved since the ``AAclust.fit()`` method returns the fitted clustering model:

In [6]:
# Load over 500 scales
X = aa.load_scales().T
n_with_merging_euclidean = aac.fit(X, metric="euclidean").n_clusters
n_without_merging_euclidean = aac.fit(X, merge=False, metric="euclidean").n_clusters
n_with_merging_cosine = aac.fit(X, metric="cosine").n_clusters
n_without_merging_cosine = aac.fit(X, merge=False, metric="cosine").n_clusters
print("n clusters (merging, euclidean): ", n_with_merging_euclidean)
print("n clusters (no merging, euclidean): ", n_with_merging_euclidean)
print("n clusters (merging, cosine): ", n_with_merging_cosine)
print("n clusters (no merging, cosine): ", n_without_merging_cosine)

n clusters (merging, euclidean):  54
n clusters (no merging, euclidean):  54
n clusters (merging, cosine):  52
n clusters (no merging, cosine):  59
