We load an example scale dataset to showcase the ``AAclust().fit()`` method:

In [1]:
import aaanalysis as aa
aa.options["verbose"] = False

# Create test dataset of 25 amino acid scales
df_scales = aa.load_scales().T.sample(25).T
X = df_scales.T

Fitting ``AAclust`` runs its three-step algorithm to select an optimized ``n_clusters`` (k): (1) estimation of a lower bound for k, (2) refinement of k, and (3) an optional merging of clusters. The results are stored as attributes:

In [2]:
# Fit clustering model
aac = aa.AAclust()
aac.fit(X)
# Get output parameters
n_clusters = aac.n_clusters
print("n_clusters: ", n_clusters)
labels = aac.labels_
print("Labels: ", labels)
centers = aac.centers_ # Cluster centers (average scales for each cluster)
labels_centers = aac.labels_centers_
medoids = aac.medoids_ # Representative scale for each cluster
labels_medoids = aac.labels_medoids_
print("Labels of medoids: ", labels_medoids)
is_medoid = aac.is_medoid_
df_scales_medoids = df_scales.T[is_medoid].T
aa.display_df(df_scales_medoids, show_shape=True, n_rows=5)

n_clusters:  5
Labels:  [0 1 0 1 2 2 1 3 0 2 1 3 0 3 4 3 4 1 2 4 4 3 2 0 4]
Labels of medoids:  [0 1 2 3 4]
DataFrame shape: (20, 5)


AA,KUMS000104,QIAN880135,PALJ810112,LINS030109,GRAR740103
A,1.0,0.276,0.379,0.149,0.168
C,0.0,0.297,0.408,0.0,0.311
D,0.302,0.611,0.146,0.809,0.305
E,0.556,0.36,0.0,0.894,0.479
F,0.294,0.23,0.621,0.0,0.772
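
The ``labels_`` array assigns one cluster label to each scale. As a minimal sketch using only the standard library, the labels printed above can be regrouped by cluster. The values are hard-coded from that example output (a different random sample would yield different labels), and ``members`` is a hypothetical helper name, not part of ``aaanalysis``:

```python
from collections import defaultdict

# Labels copied from the example output above (one label per scale)
labels = [0, 1, 0, 1, 2, 2, 1, 3, 0, 2, 1, 3, 0, 3, 4, 3, 4, 1, 2, 4, 4, 3, 2, 0, 4]

# Map each cluster label to the positions of its member scales
members = defaultdict(list)
for idx, label in enumerate(labels):
    members[label].append(idx)

# Report the size and member positions of each cluster
for label in sorted(members):
    print(f"cluster {label}: {len(members[label])} scales at positions {members[label]}")
```

The number of keys in ``members`` equals ``n_clusters``, and each list gives the column positions in ``df_scales`` that belong to the cluster.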


``names`` can be provided to the ``AAclust().fit()`` method to retrieve the names of the medoids:

In [3]:
names = [f"scale {i+1}" for i in range(len(df_scales.T))]
aac.fit(X, names=names)
medoid_names = aac.medoid_names_
print("Name of medoid scales:")
print(medoid_names)

Name of medoid scales:
['scale 23', 'scale 2', 'scale 24', 'scale 12', 'scale 20']


The ``n_clusters`` parameter can also be pre-defined:

In [4]:
aac.fit(X, n_clusters=7, names=names)
medoid_names = aac.medoid_names_
print("Name of medoid scales:")
print(medoid_names)

Name of medoid scales:
['scale 11', 'scale 2', 'scale 24', 'scale 5', 'scale 6', 'scale 12', 'scale 20']


The second step of the ``AAclust`` algorithm (recursive k optimization) can be adjusted using the ``min_th`` and ``on_center`` parameters:

In [5]:
# Pearson correlation within all cluster members >= 0.5
aac.fit(X, on_center=False, min_th=0.5)
print("n clusters (pairwise correlation): ", aac.n_clusters)
# Pearson correlation between all cluster members and the respective center >= 0.5
aac.fit(X, on_center=True, min_th=0.5)
print("n clusters (center correlation): ", aac.n_clusters)
# The latter is less strict, leading to bigger and thus fewer clusters 

n clusters (pairwise correlation):  13
n clusters (center correlation):  6
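
The difference between the two settings can be illustrated without ``AAclust`` itself. The following is a minimal sketch under stated assumptions, not the library's implementation: the toy cluster and the helpers ``pearson``, ``min_pairwise_corr``, and ``min_center_corr`` are hypothetical, and the center is assumed to be the element-wise mean of the members. It checks the same threshold once against all member pairs and once against the center:

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def min_pairwise_corr(members):
    """Lowest correlation between any two cluster members."""
    return min(pearson(u, v) for u, v in combinations(members, 2))


def min_center_corr(members):
    """Lowest correlation between any member and the cluster center (mean)."""
    center = [sum(col) / len(members) for col in zip(*members)]
    return min(pearson(m, center) for m in members)


# Hypothetical toy cluster: three "scales" with five values each
cluster = [
    [0.0, 0.2, 0.5, 0.7, 1.0],
    [0.1, 0.0, 0.6, 0.5, 0.9],
    [0.4, 0.1, 0.3, 1.0, 0.6],
]
min_th = 0.5
print("passes pairwise check:", min_pairwise_corr(cluster) >= min_th)
print("passes center check:  ", min_center_corr(cluster) >= min_th)
```

For this toy cluster the pairwise check fails while the center check passes, which is consistent with the center-based criterion being the less strict of the two.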


The third, optional merging step can be adjusted using the ``metric`` parameter and disabled by setting ``merge=False``. Because the ``AAclust().fit()`` method returns the fitted clustering model, attributes can be retrieved directly from the call:

In [17]:
# Load over 500 scales
X = aa.load_scales().T
n_with_merging_euclidean = aac.fit(X, metric="euclidean").n_clusters
n_without_merging_euclidean = aac.fit(X, merge=False, metric="euclidean").n_clusters
n_with_merging_cosine = aac.fit(X, metric="cosine").n_clusters
n_without_merging_cosine = aac.fit(X, merge=False, metric="cosine").n_clusters
print("n clusters (merging, euclidean): ", n_with_merging_euclidean)
print("n clusters (no merging, euclidean): ", n_without_merging_euclidean)
print("n clusters (merging, cosine): ", n_with_merging_cosine)
print("n clusters (no merging, cosine): ", n_without_merging_cosine)

n clusters (merging, euclidean):  50
n clusters (no merging, euclidean):  50
n clusters (merging, cosine):  58
n clusters (no merging, cosine):  53
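
To see why the choice of ``metric`` can change the result, it helps to compare the two distance definitions directly. This is a minimal standard-library sketch with hypothetical toy vectors. It shows only the textbook definitions of Euclidean and cosine distance and makes no claim about how ``AAclust`` applies them internally:

```python
from math import sqrt


def euclidean_distance(u, v):
    """Straight-line distance; sensitive to magnitude."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus cosine similarity; depends only on direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


u = [0.1, 0.4, 0.2, 0.8]
v = [2 * a for a in u]    # same direction, twice the magnitude
w = [0.8, 0.2, 0.4, 0.1]  # same values in reversed order

print("u vs v:", euclidean_distance(u, v), cosine_distance(u, v))
print("u vs w:", euclidean_distance(u, w), cosine_distance(u, w))
```

Vectors that differ only in scale are identical under cosine distance but not under Euclidean distance, so the two metrics can rank the same pairs of clusters differently.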
