# CLUSTERING MODELS

### Clustering Workflow Overview:
1. **Define Clustering Methods**  
   Define functions for the following 5 clustering methods:
   - K-Means
   - K-Medoids with Euclidean distance
   - K-Medoids with cosine similarity (converted to a distance)
   - Spectral Clustering with Euclidean distance (converted to a similarity)
   - Spectral Clustering with cosine similarity

2. **Set Number of Clusters**  
   Calculate the number of original categories and set the clustering count to match this number.

3. **Execute Clustering and Save Results**  
   Perform clustering using each method and store results in dictionary format.  
   Save all clustering results as a pickle file in the directory `./output/clustering/`.
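
The three steps above can be sketched as a small driver loop. This is only an illustration under stated assumptions, not the notebook's own code: `run_all`, the stand-in `dummy_method`, the toy embedding, and the temporary output directory are all hypothetical names introduced here.

```python
import os
import pickle
import tempfile

def dummy_method(data, n_clusters):
    # Hypothetical stand-in for a real clustering function:
    # assigns item i to cluster i % n_clusters.
    return [i % n_clusters for i in range(len(data))]

def run_all(embedding, methods, n_clusters, out_dir):
    """Run every method, map article name -> label, and pickle each result."""
    os.makedirs(out_dir, exist_ok=True)
    names = list(embedding.keys())
    vectors = list(embedding.values())
    results = {}
    for method_name, method in methods.items():
        labels = method(vectors, n_clusters)
        results[method_name] = dict(zip(names, labels))
        with open(os.path.join(out_dir, method_name + ".pkl"), "wb") as f:
            pickle.dump(results[method_name], f)
    return results

# Toy data (illustrative only, not the real embeddings)
toy_embedding = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [1.0, 1.0]}
out_dir = os.path.join(tempfile.mkdtemp(), "clustering")
results = run_all(toy_embedding, {"Dummy": dummy_method}, n_clusters=2, out_dir=out_dir)
print(results["Dummy"])
```

In the notebook itself each method is called in its own cell; a loop like this would only be a convenience if the five functions share the same signature, as they do here.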

In [1]:
import pandas as pd
import numpy as np
import warnings
import os

warnings.filterwarnings("ignore")

## 1 K-MEANS (with raw data)

In [2]:
from sklearn.cluster import KMeans

def Kmeans_Raw(data,n_clusters,state=520):
    # Fit K-means with the requested number of clusters
    kmeans = KMeans(n_clusters=n_clusters, random_state=state)
    kmeans.fit(data)

    return kmeans.labels_
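
As a quick sanity check, K-Means can be run on a few toy points before applying it to the embeddings. The sketch below assumes scikit-learn is installed. The toy array and the explicit `n_init=10` are choices made for this illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups in 2-D (illustrative data, not the embeddings).
toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# n_init is set explicitly so the behaviour does not depend on the
# library version's default.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=520)
toy_labels = kmeans.fit_predict(toy)
print(toy_labels)
```

The label integers themselves are arbitrary. Only the grouping matters: the first two points should share one label and the last two the other.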

## 2 KMedoids (with distance matrix)
- This method takes a precomputed pairwise distance matrix as input

In [3]:
try:
    from sklearn_extra.cluster import KMedoids
except ImportError:
    print('Installing the scikit-learn-extra package')
    
    !pip install scikit-learn-extra
    
    from sklearn_extra.cluster import KMedoids

### 2.1 Euclidean Distance

In [4]:
from sklearn.metrics.pairwise import euclidean_distances

def Kmedoids_Euc(data,n_clusters,state=520):

    # Calculate the pairwise semantic distance using Euclidean distance
    euclidean_distance = euclidean_distances(data)

    # Perform KMedoids on the precomputed distance matrix
    kmedoids_euc = KMedoids(n_clusters=n_clusters, metric="precomputed",random_state=state)
    kmedoids_euc.fit(euclidean_distance)

    return kmedoids_euc.labels_
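
To make the "precomputed" input concrete, the sketch below builds a pairwise Euclidean distance matrix by hand with NumPy and picks the medoid of a single group, meaning the point with the smallest total distance to all others. The three toy points are made up for illustration. The notebook itself relies on `euclidean_distances` and `KMedoids`.

```python
import numpy as np

# Three toy points on a line through the origin (illustrative data).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Pairwise differences via broadcasting, then the Euclidean norm.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(dist)

# The medoid of this single group is the row with the smallest distance sum.
medoid_index = int(dist.sum(axis=1).argmin())
print(medoid_index)
```

The matrix is square, symmetric, and has a zero diagonal, which is the shape a precomputed metric expects. Here the middle point is the medoid because it is closest, in total, to the other two.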

### 2.2 Cosine Similarity - Distance

In [5]:
from sklearn.metrics.pairwise import cosine_similarity

def Kmedoids_Cos(data,n_clusters,state=520):
    
    # Calculate the pairwise semantic similarity using cosine similarity
    cosine_sim = cosine_similarity(data)

    # Convert similarity in [-1, 1] to a distance in [0, 1]
    cosine_distance = 1 - 0.5 * (cosine_sim + 1)
    cosine_distance = np.clip(cosine_distance, 0, None) # Guard against tiny negative values caused by rounding

    # Perform KMedoids
    kmedoids_cos = KMedoids(n_clusters=n_clusters, metric="precomputed",random_state=state)
    kmedoids_cos.fit(cosine_distance)

    return kmedoids_cos.labels_
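
The conversion `1 - 0.5 * (cosine_sim + 1)` is algebraically the same as `(1 - cosine_sim) / 2`: a similarity of 1 maps to distance 0, a similarity of 0 to 0.5, and a similarity of -1 to 1. The sketch below checks this on hand-picked values and shows why the clip is useful. The input values are illustrative, not taken from the embeddings.

```python
import numpy as np

# Identical, orthogonal, and opposite directions.
sims = np.array([1.0, 0.0, -1.0])
distances = 1 - 0.5 * (sims + 1)
print(distances)

# Same mapping written as (1 - s) / 2.
same_mapping = np.allclose(distances, (1 - sims) / 2)

# A similarity that overshoots 1 by rounding error gives a slightly
# negative distance, which the clip resets to zero.
noisy = np.array([1.0 + 1e-10])
raw = 1 - 0.5 * (noisy + 1)
clipped = np.clip(raw, 0, None)
print(raw, clipped)
```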

## 3 Spectral Clustering (with similarity matrix)
- This method takes a precomputed similarity (affinity) matrix as input

In [6]:
from sklearn.cluster import SpectralClustering

### 3.1 Euclidean Distance - Similarity

In [7]:
from sklearn.metrics.pairwise import euclidean_distances

def Spectral_Euc(data,n_clusters,state=520):

    # Calculate the pairwise semantic distance using Euclidean distance
    euclidean_distance = euclidean_distances(data)

    # Here we use a Gaussian (RBF) kernel to transform distances to similarities.
    sigma = np.median(euclidean_distance)  # Using the median distance as sigma for the RBF kernel
    euclidean_similarity = np.exp(-euclidean_distance ** 2 / (2. * sigma ** 2))

    # Perform clustering on the precomputed similarity matrix
    spectral_euc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed',random_state=state)
    spectral_euc_labels = spectral_euc.fit_predict(euclidean_similarity)

    return spectral_euc_labels
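
The Gaussian (RBF) transform can be checked by hand on a tiny distance matrix. The sketch below uses a made-up 3x3 matrix. As in the function above, the median is taken over all entries, including the zero diagonal.

```python
import numpy as np

# Toy symmetric distance matrix (illustrative values).
dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])

# Median over all nine entries: 0, 0, 0, 1, 1, 1, 1, 2, 2 -> 1.0
sigma = np.median(dist)

# similarity = exp(-d^2 / (2 * sigma^2))
sim = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
print(sigma)
print(sim)
```

Every point has similarity 1 with itself, and similarity decays smoothly towards 0 as distance grows. With `sigma = 1`, a distance of 1 gives `exp(-0.5)` and a distance of 2 gives `exp(-2)`.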

### 3.2 Cosine Similarity

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

def Spectral_Cos(data,n_clusters,state=520):

    # Calculate the semantic similarity using cosine similarity
    cos_similarity = cosine_similarity(data)
    cos_similarity = np.clip(cos_similarity, 0, None) # Set negative similarities to zero so the affinity matrix is non-negative

    # Perform clustering on the precomputed similarity matrix
    spectral_cos = SpectralClustering(n_clusters=n_clusters, affinity='precomputed',random_state=state)
    spectral_cos_labels = spectral_cos.fit_predict(cos_similarity)

    return spectral_cos_labels
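
The effect of clipping can be seen on three toy vectors, one of which points in the opposite direction to another. This is a NumPy-only sketch with made-up vectors. The notebook itself uses `cosine_similarity` from scikit-learn.

```python
import numpy as np

# Toy vectors: the third is the exact opposite of the first.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# Cosine similarity = dot product of the row-normalised vectors.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
cos_sim = unit @ unit.T

# Negative similarities become zero, so the affinity matrix is non-negative.
affinity = np.clip(cos_sim, 0, None)
print(cos_sim)
print(affinity)
```

Clipping treats every pair with non-positive cosine similarity as equally unrelated. A rescaling such as `0.5 * (cos_sim + 1)` would instead preserve the ordering among negative values. Which choice suits the embeddings better is a judgment call that this sketch does not settle.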

## 4. Calculate the number of original categories

In [9]:
# The number of clusters must equal the number of original categories
file_path = './/dataset//wikispeedia_paths-and-graph//categories.tsv'
category_df = pd.read_csv(file_path, sep='\t', skiprows=12,header=None)
category_df.columns = ['concept','category']

In [10]:
# category_df is a pandas DataFrame; report its number of rows
print('The total number of articles in the category_dataset is: {}'.format(category_df.shape[0]))

The total number of articles in the category_dataset is: 5204


In [11]:
# Collect all primary categories
category_df['primary_category'] = category_df['category'].apply(lambda x: x.split('.')[1])
print('The original primary categories are:')
print(category_df['primary_category'].unique())

# Set the number of clusters
n_clusters = len(category_df['primary_category'].unique())
print('The clustering number we need to set is: {}'.format(n_clusters))

The original primary categories are:
['History' 'People' 'Countries' 'Geography' 'Business_Studies' 'Science'
 'Everyday_life' 'Design_and_Technology' 'Music' 'IT'
 'Language_and_literature' 'Mathematics' 'Religion' 'Art' 'Citizenship']
The clustering number we need to set is: 15
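
The extraction of the primary category can be illustrated on a few made-up rows. The sketch assumes category strings of the form `subject.<primary>.<sub>`, which is consistent with `split('.')[1]` returning values such as `History`. The article names and sub-categories below are hypothetical.

```python
import pandas as pd

# Hypothetical rows; Article_A is listed under two categories.
toy_df = pd.DataFrame({
    "concept": ["Article_A", "Article_A", "Article_B"],
    "category": [
        "subject.History.British_History",
        "subject.People.Historical_figures",
        "subject.Science.Physics",
    ],
})

# Vectorised equivalent of apply(lambda x: x.split('.')[1])
toy_df["primary_category"] = toy_df["category"].str.split(".").str[1]

n_primary = toy_df["primary_category"].nunique()
n_articles = toy_df["concept"].nunique()
print(n_primary, n_articles, len(toy_df))
```

If an article is listed under more than one category, the number of rows exceeds the number of distinct articles, so the row count and the article count should not be assumed to match.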


## 5. Perform Clustering and Save results

### 5.1 Read the embedding data

In [2]:
import pickle

# Set the embedding file path
file_path = './/output//embeddings//all_mpnet_base_v2//20241109_150434//embeddings.pkl'
saved_path = './/output//clustering//all_mpnet_base_v2'

# Read the file and collect the embedding vectors
with open(file_path, 'rb') as file:
    embedding = pickle.load(file)
    embedding_values = list(embedding.values())

print('The total number of articles in the embedding dataset is: {}'.format(len(embedding_values)))

The total number of articles in the embedding dataset is: 4604


In [34]:
# Set the embedding file path

file_path = './/output//embeddings//all_MiniLM_L6_v2//20241109_104244//embeddings.pkl'
saved_path = './/output//clustering//all_MiniLM_L6_v2'

# Read the file and collect the embedding vectors
with open(file_path, 'rb') as file:
    embedding = pickle.load(file)
    embedding_values = list(embedding.values())

print('The total number of articles in the embedding dataset is: {}'.format(len(embedding_values)))

The total number of articles in the embedding dataset is: 5232
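
The loaded object is treated as a dictionary from article name to embedding vector. The sketch below shows, on a made-up three-item dictionary, how the values can be stacked into a 2-D array while the keys keep the same order. That ordering is what later allows labels to be zipped back onto the article names. The names `toy_embedding`, `names`, and `matrix` are illustrative.

```python
import numpy as np

# Toy stand-in for the pickled embeddings (illustrative values).
toy_embedding = {"a": [0.1, 0.2], "b": [0.3, 0.4], "c": [0.5, 0.6]}

# Dicts preserve insertion order (Python 3.7+), so keys and values align.
names = list(toy_embedding.keys())
matrix = np.asarray(list(toy_embedding.values()), dtype=float)

print(names)
print(matrix.shape)  # (number of articles, embedding dimension)
```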


### 5.2 Perform K-MEANS Clustering

In [29]:
# Call the clustering function
Kmeans_Clustering = Kmeans_Raw(embedding_values,n_clusters)

# Convert the result into a dict mapping article name to cluster label
Kmeans_result = dict(zip(list(embedding.keys()), Kmeans_Clustering))
print("K-means Cluster labels:")
print(Kmeans_result)

# Count the number of articles in each cluster
value_counts = pd.Series(list(Kmeans_result.values())).value_counts()
print(value_counts)

# Save the clustering result
os.makedirs(saved_path, exist_ok=True)

with open(os.path.join(saved_path, 'KMeans.pkl'), 'wb') as f:
    pickle.dump(Kmeans_result, f)


K-means Cluster labels:
{'%C3%81ed%C3%A1n_mac_Gabr%C3%A1in': 1, '%C3%85land': 9, '%C3%89douard_Manet': 3, '%C3%89ire': 9, '%C3%93engus_I_of_the_Picts': 1, '%E2%82%AC2_commemorative_coins': 9, '10th_century': 1, '11th_century': 1, '12th_century': 1, '13th_century': 1, '14th_century': 1, '15th_century': 1, '15th_Marine_Expeditionary_Unit': 10, '16th_century': 1, '16_Cygni': 6, '16_Cygni_Bb': 6, '1755_Lisbon_earthquake': 7, '17th_century': 9, '1896_Summer_Olympics': 1, '18th_century': 0, '1928_Okeechobee_Hurricane': 7, '1973_oil_crisis': 9, '1980_eruption_of_Mount_St._Helens': 7, '1997_Pacific_hurricane_season': 7, '19th_century': 9, '1st_century': 1, '1st_century_BC': 1, '1_Ceres': 6, '2-6-0': 10, '2-8-0': 10, '2003_Atlantic_hurricane_season': 7, '2004_Atlantic_hurricane_season': 7, '2004_Indian_Ocean_earthquake': 7, '2005_Atlantic_hurricane_season': 7, '2005_Hertfordshire_Oil_Storage_Terminal_fire': 7, '2005_Kashmir_earthquake': 7, '2005_Lake_Tanganyika_earthquake': 7, '2005_Sumatra_ear
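
Printing the full dictionary produces very long output. A shorter preview plus the cluster sizes is often enough to inspect a result. The sketch below uses only the standard library and a handful of entries copied from the output above. `preview` and `sizes` are names introduced here.

```python
from collections import Counter
from itertools import islice

# A few (article, label) pairs taken from the K-means output above.
toy_result = {
    "10th_century": 1,
    "11th_century": 1,
    "16_Cygni": 6,
    "1_Ceres": 6,
    "2-6-0": 10,
}

# First three entries only, instead of the whole dictionary.
preview = dict(islice(toy_result.items(), 3))
print(preview)

# Number of articles per cluster label.
sizes = Counter(toy_result.values())
print(sizes)
```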

### 5.3 Perform K-Medoids Clustering with Euclidean Distance

In [30]:
# Call the clustering function
Kmedoids_Euc_Clustering = Kmedoids_Euc(embedding_values,n_clusters)

# Convert the result into a dict mapping article name to cluster label
Kmedoids_Euc_result = dict(zip(list(embedding.keys()), Kmedoids_Euc_Clustering))
print("Kmedoids_Euc Cluster labels:")
print(Kmedoids_Euc_result)

# Count the number of articles in each cluster
value_counts = pd.Series(list(Kmedoids_Euc_result.values())).value_counts()
print(value_counts)

# Save the clustering result
os.makedirs(saved_path, exist_ok=True)

with open(os.path.join(saved_path, 'Kmedoids_Euc.pkl'), 'wb') as f:
    pickle.dump(Kmedoids_Euc_result, f)


Kmedoids_Euc Cluster labels:
{'%C3%81ed%C3%A1n_mac_Gabr%C3%A1in': 2, '%C3%85land': 2, '%C3%89douard_Manet': 1, '%C3%89ire': 2, '%C3%93engus_I_of_the_Picts': 8, '%E2%82%AC2_commemorative_coins': 2, '10th_century': 9, '11th_century': 2, '12th_century': 8, '13th_century': 13, '14th_century': 1, '15th_century': 9, '15th_Marine_Expeditionary_Unit': 2, '16th_century': 9, '16_Cygni': 11, '16_Cygni_Bb': 11, '1755_Lisbon_earthquake': 10, '17th_century': 8, '1896_Summer_Olympics': 1, '18th_century': 8, '1928_Okeechobee_Hurricane': 10, '1973_oil_crisis': 11, '1980_eruption_of_Mount_St._Helens': 6, '1997_Pacific_hurricane_season': 11, '19th_century': 0, '1st_century': 3, '1st_century_BC': 13, '1_Ceres': 6, '2-6-0': 13, '2-8-0': 13, '2003_Atlantic_hurricane_season': 10, '2004_Atlantic_hurricane_season': 10, '2004_Indian_Ocean_earthquake': 11, '2005_Atlantic_hurricane_season': 10, '2005_Hertfordshire_Oil_Storage_Terminal_fire': 0, '2005_Kashmir_earthquake': 7, '2005_Lake_Tanganyika_earthquake': 4, '

### 5.4 Perform K-Medoids Clustering with cosine similarity

In [31]:
# Call the clustering function
Kmedoids_Cos_Clustering = Kmedoids_Cos(embedding_values,n_clusters)

# Convert the result into a dict mapping article name to cluster label
Kmedoids_Cos_result = dict(zip(list(embedding.keys()), Kmedoids_Cos_Clustering))
print("Kmedoids_Cos Cluster labels:")
print(Kmedoids_Cos_result)

# Count the number of articles in each cluster
value_counts = pd.Series(list(Kmedoids_Cos_result.values())).value_counts()
print(value_counts)

# Save the clustering result
os.makedirs(saved_path, exist_ok=True)

with open(os.path.join(saved_path, 'Kmedoids_Cos.pkl'), 'wb') as f:
    pickle.dump(Kmedoids_Cos_result, f)

Kmedoids_Cos Cluster labels:
{'%C3%81ed%C3%A1n_mac_Gabr%C3%A1in': 0, '%C3%85land': 0, '%C3%89douard_Manet': 5, '%C3%89ire': 0, '%C3%93engus_I_of_the_Picts': 6, '%E2%82%AC2_commemorative_coins': 0, '10th_century': 1, '11th_century': 0, '12th_century': 6, '13th_century': 12, '14th_century': 5, '15th_century': 1, '15th_Marine_Expeditionary_Unit': 0, '16th_century': 1, '16_Cygni': 13, '16_Cygni_Bb': 13, '1755_Lisbon_earthquake': 3, '17th_century': 6, '1896_Summer_Olympics': 5, '18th_century': 6, '1928_Okeechobee_Hurricane': 3, '1973_oil_crisis': 13, '1980_eruption_of_Mount_St._Helens': 9, '1997_Pacific_hurricane_season': 13, '19th_century': 2, '1st_century': 8, '1st_century_BC': 12, '1_Ceres': 9, '2-6-0': 12, '2-8-0': 12, '2003_Atlantic_hurricane_season': 3, '2004_Atlantic_hurricane_season': 3, '2004_Indian_Ocean_earthquake': 13, '2005_Atlantic_hurricane_season': 3, '2005_Hertfordshire_Oil_Storage_Terminal_fire': 2, '2005_Kashmir_earthquake': 10, '2005_Lake_Tanganyika_earthquake': 4, '2005

### 5.5 Perform Spectral Clustering with Euclidean Distance

In [32]:
# Call the clustering function
Spectral_Euc_Clustering = Spectral_Euc(embedding_values,n_clusters)

# Convert the result into a dict mapping article name to cluster label
Spectral_Euc_result = dict(zip(list(embedding.keys()), Spectral_Euc_Clustering))
print("Spectral_Euc Cluster labels:")
print(Spectral_Euc_result)

# Count the number of articles in each cluster
value_counts = pd.Series(list(Spectral_Euc_result.values())).value_counts()
print(value_counts)

# Save the clustering result
os.makedirs(saved_path, exist_ok=True)

with open(os.path.join(saved_path, 'Spectral_Euc.pkl'), 'wb') as f:
    pickle.dump(Spectral_Euc_result, f)

Spectral_Euc Cluster labels:
{'%C3%81ed%C3%A1n_mac_Gabr%C3%A1in': 14, '%C3%85land': 6, '%C3%89douard_Manet': 4, '%C3%89ire': 3, '%C3%93engus_I_of_the_Picts': 14, '%E2%82%AC2_commemorative_coins': 6, '10th_century': 4, '11th_century': 4, '12th_century': 4, '13th_century': 14, '14th_century': 4, '15th_century': 4, '15th_Marine_Expeditionary_Unit': 8, '16th_century': 4, '16_Cygni': 0, '16_Cygni_Bb': 0, '1755_Lisbon_earthquake': 12, '17th_century': 6, '1896_Summer_Olympics': 14, '18th_century': 6, '1928_Okeechobee_Hurricane': 12, '1973_oil_crisis': 6, '1980_eruption_of_Mount_St._Helens': 12, '1997_Pacific_hurricane_season': 12, '19th_century': 6, '1st_century': 14, '1st_century_BC': 14, '1_Ceres': 0, '2-6-0': 7, '2-8-0': 7, '2003_Atlantic_hurricane_season': 12, '2004_Atlantic_hurricane_season': 12, '2004_Indian_Ocean_earthquake': 12, '2005_Atlantic_hurricane_season': 12, '2005_Hertfordshire_Oil_Storage_Terminal_fire': 12, '2005_Kashmir_earthquake': 12, '2005_Lake_Tanganyika_earthquake': 12

### 5.6 Perform Spectral Clustering with Cosine Similarity

In [33]:
# Call the clustering function
Spectral_Cos_Clustering = Spectral_Cos(embedding_values,n_clusters)

# Convert the result into a dict mapping article name to cluster label
Spectral_Cos_result = dict(zip(list(embedding.keys()), Spectral_Cos_Clustering))
print("Spectral_Cos Cluster labels:")
print(Spectral_Cos_result)

# Count the number of articles in each cluster
value_counts = pd.Series(list(Spectral_Cos_result.values())).value_counts()
print(value_counts)

# Save the clustering result
os.makedirs(saved_path, exist_ok=True)

with open(os.path.join(saved_path, 'Spectral_Cos.pkl'), 'wb') as f:
    pickle.dump(Spectral_Cos_result, f)

Spectral_Cos Cluster labels:
{'%C3%81ed%C3%A1n_mac_Gabr%C3%A1in': 6, '%C3%85land': 2, '%C3%89douard_Manet': 0, '%C3%89ire': 9, '%C3%93engus_I_of_the_Picts': 6, '%E2%82%AC2_commemorative_coins': 5, '10th_century': 6, '11th_century': 6, '12th_century': 6, '13th_century': 6, '14th_century': 6, '15th_century': 6, '15th_Marine_Expeditionary_Unit': 5, '16th_century': 6, '16_Cygni': 12, '16_Cygni_Bb': 12, '1755_Lisbon_earthquake': 2, '17th_century': 13, '1896_Summer_Olympics': 6, '18th_century': 6, '1928_Okeechobee_Hurricane': 7, '1973_oil_crisis': 5, '1980_eruption_of_Mount_St._Helens': 7, '1997_Pacific_hurricane_season': 7, '19th_century': 2, '1st_century': 6, '1st_century_BC': 6, '1_Ceres': 12, '2-6-0': 8, '2-8-0': 8, '2003_Atlantic_hurricane_season': 7, '2004_Atlantic_hurricane_season': 7, '2004_Indian_Ocean_earthquake': 7, '2005_Atlantic_hurricane_season': 7, '2005_Hertfordshire_Oil_Storage_Terminal_fire': 8, '2005_Kashmir_earthquake': 14, '2005_Lake_Tanganyika_earthquake': 7, '2005_Suma
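
To reuse a saved result later, the pickle file can be loaded back into a dictionary. The sketch below round-trips a few entries from the output above through a temporary directory. The temporary path is an assumption made for this illustration and stands in for the notebook's `saved_path`.

```python
import os
import pickle
import tempfile

# A few (article, label) pairs taken from the output above.
toy_result = {"16_Cygni": 12, "1_Ceres": 12, "2-6-0": 8}

# Write to a throwaway location rather than the real output directory.
path = os.path.join(tempfile.mkdtemp(), "Spectral_Cos.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_result, f)

# Load the dictionary back and look up one article's cluster label.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["1_Ceres"])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.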