We load a default scale dataset to showcase the ``AAclust.fit()`` method:

In [15]:
import aaanalysis as aa
aa.options["verbose"] = False
# Create a test dataset of 15 randomly sampled amino acid scales
df_scales = aa.load_scales().T.sample(15).T
aa.display_df(df_scales)
X = df_scales.T

AA,WERD780102,NAKH900112,KARS160121,GEOR030102,ROBB790101,GEIM800110,QIAN880102,NADH010102,RADA880107,TANS770109,NAKH900106,LIFS790101,LINS030107,ROBB760101,GEIM800108
A,0.522,0.292,0.248,0.25,0.038,0.507,1.0,0.749,0.82,0.292,0.237,0.369,0.2,0.921,0.306
C,0.368,0.02,0.776,0.246,0.635,0.471,0.349,1.0,0.919,0.285,0.303,0.539,0.0,0.445,0.324
D,0.302,0.008,0.683,0.091,0.0,0.735,0.825,0.371,0.573,0.478,0.0,0.057,0.8,0.555,0.759
E,0.187,0.057,0.71,0.404,0.096,0.728,0.921,0.263,0.614,0.326,0.09,0.149,0.911,1.0,0.361
F,0.297,0.346,0.842,0.536,0.769,0.14,0.778,0.915,0.919,0.13,0.724,0.603,0.067,0.622,0.13
G,0.346,0.21,0.0,0.0,0.288,0.654,0.0,0.561,0.803,1.0,0.259,0.149,0.422,0.0,0.861
H,0.335,0.034,0.683,0.201,0.442,0.324,0.746,0.439,0.6,0.081,0.401,0.376,0.467,0.598,0.296
I,0.33,0.588,0.604,0.161,1.0,0.191,0.54,0.909,1.0,0.155,0.697,1.0,0.022,0.561,0.065
K,0.368,0.035,0.66,0.195,0.058,0.39,0.778,0.0,0.224,0.293,0.127,0.213,1.0,0.665,0.222
L,0.192,1.0,0.604,0.513,0.615,0.081,0.556,0.901,0.878,0.198,0.905,0.638,0.044,0.72,0.009


Fitting ``AAclust`` runs its three-step algorithm to select an optimized ``n_clusters`` (k). The three steps are (1) estimation of a lower bound for k, (2) refinement of k, and (3) optional merging of clusters. The results are stored as attributes:

In [26]:
# Fit clustering model
aac = aa.AAclust()
aac.fit(X)
# Get output parameters
n_clusters = aac.n_clusters
print("n_clusters: ", n_clusters)
labels = aac.labels_
print("Labels: ", labels)
centers = aac.centers_ # Cluster centers (average scales for each cluster)
labels_centers = aac.labels_centers_
medoids = aac.medoids_ # Representative scale for each cluster
labels_medoids = aac.labels_medoids_
print("Labels of medoids: ", labels_medoids)
is_medoid = aac.is_medoid_
df_scales_medoids = df_scales.T[is_medoid].T
aa.display_df(df_scales_medoids)

n_clusters:  3
Labels:  [0 1 1 1 2 0 1 2 2 0 2 2 0 1 0]
Labels of medoids:  [0 1 2]


AA,NADH010102,ROBB760101,GEIM800108
A,0.749,0.921,0.306
C,1.0,0.445,0.324
D,0.371,0.555,0.759
E,0.263,1.0,0.361
F,0.915,0.622,0.13
G,0.561,0.0,0.861
H,0.439,0.598,0.296
I,0.909,0.561,0.065
K,0.0,0.665,0.222
L,0.901,0.72,0.009
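
To make the difference between ``centers_`` and ``medoids_`` concrete, the following NumPy sketch computes both for one cluster of a toy dataset. It assumes that a center is the mean of the member scales (as stated in the comment above) and that the medoid is the member with the highest Pearson correlation to that center; the second point is an assumption for illustration, and the helper ``center_and_medoid`` and the toy arrays are hypothetical, not part of ``aaanalysis``.

```python
import numpy as np


def center_and_medoid(X, labels, cluster):
    """Return the mean profile of one cluster and the row index of its medoid.

    Assumption: the medoid is the member most correlated with the center.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    member_idx = np.flatnonzero(labels == cluster)
    # Center: element-wise average over all member scales
    center = X[member_idx].mean(axis=0)
    # Pearson correlation of every member with the center
    corr = np.array([np.corrcoef(X[i], center)[0, 1] for i in member_idx])
    medoid_idx = int(member_idx[np.argmax(corr)])
    return center, medoid_idx


# Toy data: 4 scales (rows) x 5 amino acids (columns)
X_toy = np.array([
    [0.0, 0.2, 0.5, 0.8, 1.0],
    [0.1, 0.3, 0.5, 0.7, 0.9],
    [1.0, 0.8, 0.4, 0.2, 0.0],
    [0.9, 0.6, 0.5, 0.1, 0.0],
])
labels_toy = np.array([0, 0, 1, 1])

center, medoid_idx = center_and_medoid(X_toy, labels_toy, cluster=0)
print("Center of cluster 0:", center)
print("Row index of medoid:", medoid_idx)
```

Unlike a center, a medoid is always one of the original scales, which is why it can be selected directly from ``df_scales`` with the ``is_medoid_`` mask.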


``names`` can be provided to the ``AAclust.fit()`` method to retrieve the names of the medoids:

In [31]:
names = [f"scale {i+1}" for i in range(len(df_scales.T))]
aac.fit(X, names=names)
medoid_names = aac.medoid_names_
print(medoid_names)

['scale 15', 'scale 11', 'scale 14', 'scale 5']


The ``n_clusters`` parameter can also be predefined:

In [40]:
aac.fit(X, n_clusters=5, names=names)
medoid_names = aac.medoid_names_
print(medoid_names)

['scale 10', 'scale 14', 'scale 6', 'scale 5', 'scale 8']


The second step of the ``AAclust`` algorithm (recursive k optimization) can be adjusted using the ``min_th`` and ``on_center`` parameters:

In [42]:
# Pearson correlation between all pairs of cluster members >= 0.5
aac.fit(X, on_center=False, min_th=0.5)
print(aac.n_clusters)
# Pearson correlation between all cluster members and the respective center >= 0.5
aac.fit(X, on_center=True, min_th=0.5)
print(aac.n_clusters)
# The latter is less strict, yielding larger and therefore fewer clusters

8
3
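
The two criteria can be compared on a single toy cluster with plain NumPy. This is a minimal sketch of the checks described in the comments above, not the ``AAclust`` implementation; the helper names, the toy cluster, and the threshold of 0.8 are hypothetical and chosen only so that the two checks disagree.

```python
import numpy as np


def min_pairwise_corr(members):
    """Lowest Pearson correlation between any two members of a cluster."""
    corr = np.corrcoef(members)
    upper = np.triu_indices_from(corr, k=1)  # each pair once, no diagonal
    return float(corr[upper].min())


def min_center_corr(members):
    """Lowest Pearson correlation between a member and the cluster center."""
    center = members.mean(axis=0)
    return float(min(np.corrcoef(row, center)[0, 1] for row in members))


# Toy cluster: 3 scales (rows) x 5 amino acids (columns)
cluster_toy = np.array([
    [0.0, 0.2, 0.5, 0.8, 1.0],
    [0.1, 0.3, 0.5, 0.7, 0.9],
    [0.0, 0.6, 0.3, 1.0, 0.7],
])
min_th_toy = 0.8  # illustrative threshold

pairwise_ok = min_pairwise_corr(cluster_toy) >= min_th_toy
center_ok = min_center_corr(cluster_toy) >= min_th_toy
print("Passes pairwise check:", pairwise_ok)
print("Passes center check:", center_ok)
```

For this toy cluster the center-based check passes while the pairwise check fails, which mirrors why ``on_center=True`` tends to accept larger clusters.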


The third, optional merging step can be adjusted using the ``metric`` parameter and disabled by setting ``merge=False``. Because the ``AAclust.fit()`` method returns the fitted clustering model, attributes can be retrieved directly from the call:

In [49]:
# Load over 500 scales
X = aa.load_scales().T
n_with_merging_euclidean = aac.fit(X).n_clusters
n_with_merging_cosine = aac.fit(X, metric="cosine").n_clusters
n_without_merging = aac.fit(X, merge=False).n_clusters
print(n_with_merging_euclidean)
print(n_with_merging_cosine)
print(n_without_merging)

49
47
54
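
The choice of ``metric`` can change the result because Euclidean and cosine distance may disagree about which cluster is closest. The sketch below illustrates this with NumPy on hypothetical 2D points; it does not reproduce how ``AAclust`` merges clusters internally, and all names in it are assumptions made for illustration.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance, sensitive to the magnitude of the vectors."""
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b):
    """1 - cosine similarity, sensitive only to the direction of the vectors."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


point = np.array([1.0, 1.0])
centers_toy = np.array([
    [3.0, 3.0],  # same direction as point, but far away
    [1.0, 0.0],  # close to point, but different direction
])

nearest_euclidean = int(np.argmin([euclidean_distance(point, c) for c in centers_toy]))
nearest_cosine = int(np.argmin([cosine_distance(point, c) for c in centers_toy]))
print("Nearest center (Euclidean):", nearest_euclidean)
print("Nearest center (cosine):", nearest_cosine)
```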
