
This notebook, topic_modelling_step1_clustering.ipynb, is a refined version of the earlier notebook topic_modelling_advanced.ipynb. The main extensions are:

- We fit models on both the subset where "Relevant_PNAS" = 1 and the whole dataset.
- We mainly explore clusterings with around 20 topics.
- We use both intrinsic and extrinsic metrics to evaluate the clustering results.
- We add a visualisation of topics across power conditions.


## Import and process the dataset

In [None]:
import pandas as pd
import numpy as np
raw_df = pd.read_csv('06 Data analysis/00 data/python_datasets/dataset1_text_ID_created.csv')
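# take an explicit copy so that adding the embedding column later does not trigger a SettingWithCopyWarning
text_df=raw_df[["text_id","Text","Condition","Relevant_PNAS"]].copy()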

## Pre-calculated embeddings
The embedding model we use (OpenAI's embedding model here) can be called from within BERTopic. However, we compute the embeddings for our dataset separately, store them, and then pass them into the BERTopic pipeline. This makes the workflow safer and more reproducible.

In this notebook we have removed the code that calls OpenAI's API to get the embeddings, since they were already saved in a previous run; we simply load the saved embeddings.

In [None]:
import pickle
with open("06 Data analysis/04 Topic Modeling/outputs/embeddings/ds1_text_embedding", 'rb') as file:
    embedding_dict=pickle.load(file)

#join the embedding back to the dataset
text_df["embedding"]=text_df["text_id"].map(embedding_dict)
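The removed embedding step is not shown in this notebook. As a rough sketch only, the dictionary loaded above could be built along the following lines. `build_embedding_dict`, `missing_ids`, `embed_batch` and `batch_size` are hypothetical names that do not appear in the original code; `embed_batch` stands in for whatever call returns one vector per text (for example, a thin wrapper around an embedding API).

```python
from typing import Callable, Dict, Hashable, List, Sequence


def build_embedding_dict(
    text_ids: Sequence[Hashable],
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> Dict[Hashable, List[float]]:
    """Map each text_id to its embedding, calling embed_batch on small batches."""
    if len(text_ids) != len(texts):
        raise ValueError("text_ids and texts must have the same length")
    embedding_dict: Dict[Hashable, List[float]] = {}
    for start in range(0, len(texts), batch_size):
        batch_ids = list(text_ids[start:start + batch_size])
        batch_texts = list(texts[start:start + batch_size])
        vectors = embed_batch(batch_texts)  # hypothetical: one vector per text
        if len(vectors) != len(batch_texts):
            raise ValueError("embed_batch must return one vector per text")
        embedding_dict.update(zip(batch_ids, vectors))
    return embedding_dict


def missing_ids(text_ids: Sequence[Hashable], embedding_dict: Dict) -> List[Hashable]:
    """Return the ids that have no stored embedding (these would map to NaN)."""
    return [tid for tid in text_ids if tid not in embedding_dict]
```

Checking `missing_ids(text_df["text_id"], embedding_dict)` before the `.map` call is one way to catch rows that would otherwise silently receive a missing embedding.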

# Clustering Pipeline

In [11]:
#define a pipeline to fit bertopic model with different clustering choices, and save each model and its outputs
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans,AgglomerativeClustering

from bertopic.dimensionality import BaseDimensionalityReduction
from umap import UMAP

from bertopic.representation import PartOfSpeech
from sklearn.feature_extraction.text import CountVectorizer

default_vectorizer_model = CountVectorizer(stop_words="english",max_df=0.9)
default_representation_model=PartOfSpeech("en_core_web_sm",pos_patterns = [ [{'POS': 'NOUN'}]])
default_dimension_model=BaseDimensionalityReduction()

#since we are trying a lot of models on different range of datasets and also calculating a lot of metrics, 
#we want to define a class here, so that we can save all the models together in the same place 

class topic_clustering_models:
    def __init__(self,path_to_write:str,dataset:pd.DataFrame,text_col:str="Text",embedding_col:str="embedding"):
        self.path_to_write=path_to_write
        self.dataset=dataset.reset_index(drop=True)
        self.text_col=text_col
        self.embedding_col=embedding_col
        self.models_dic=dict()
        
    def fit_and_save(self,model_name:str,clustering_model,dimension_model=default_dimension_model,
                     vectorizer_model=default_vectorizer_model,representation_model=default_representation_model):
        bert_topic_model= BERTopic(umap_model=dimension_model,hdbscan_model=clustering_model,
                                   vectorizer_model=vectorizer_model,representation_model=representation_model)
        clustering_results=bert_topic_model.fit_transform(documents=self.dataset[self.text_col], 
                                                          embeddings=np.vstack(self.dataset[self.embedding_col]))[0]
        bert_topic_model.save( f"{self.path_to_write}/{model_name}",save_embedding_model=False)
        self.models_dic.update({model_name:bert_topic_model})
        self.dataset[model_name]=clustering_results




## on the sub dataset that "Relevant_PNAS"=1

In [None]:
clustering_sub_ds=topic_clustering_models(dataset=text_df[text_df["Relevant_PNAS"]==1],path_to_write="06 Data analysis/04 Topic Modeling/outputs/bertopic_models/ds1_PNAS_Relevant_sub")

We fit KMeans and agglomerative models with around 20 topics (10 to 24 clusters, in steps of 2). No dimensionality reduction is needed for these models.

In [70]:
for cluster_num in range(10,25,2):
    clustering_sub_ds.fit_and_save(model_name=f"m_km_{cluster_num}",clustering_model=KMeans(n_clusters=cluster_num,random_state=123))
    clustering_sub_ds.fit_and_save(model_name=f"m_agg_{cluster_num}",clustering_model=AgglomerativeClustering(n_clusters=cluster_num))
    print(f"Kmeans model and Agglomerative model of cluster number {cluster_num} fitted" )



Kmeans model and Agglomerative model of cluster number 10 fitted
Kmeans model and Agglomerative model of cluster number 12 fitted
Kmeans model and Agglomerative model of cluster number 14 fitted
Kmeans model and Agglomerative model of cluster number 16 fitted
Kmeans model and Agglomerative model of cluster number 18 fitted
Kmeans model and Agglomerative model of cluster number 20 fitted
Kmeans model and Agglomerative model of cluster number 22 fitted
Kmeans model and Agglomerative model of cluster number 24 fitted


We also want to include some HDBSCAN models for comparison.

HDBSCAN offers a lot of flexibility: we can change min_cluster_size, min_samples, the distance metric, and the dimensionality reduction model used to speed up clustering. However, none of these directly controls the number of topics (clusters), so we first used trial and error to get a sense of which parameter combinations give reasonable results with around 20 topics. We then fit and save a set of parameter combinations that produce around 20 topics.


In [71]:
#dimension reduction models
for dim_reduction in [15,50,100]:
    umap_model= UMAP(n_neighbors=dim_reduction, n_components=dim_reduction, min_dist=0.0, metric='cosine',random_state=1234)
    for size_para in range(40, 61, 5):
        clustering_sub_ds.fit_and_save(f"m_hdb_dm{dim_reduction}_sz{size_para}",
                                       clustering_model=HDBSCAN(min_cluster_size=size_para, min_samples=size_para,max_cluster_size=3000,core_dist_n_jobs=-1,allow_single_cluster=True),
                                       dimension_model=umap_model )
        print(f"HDBSCAN model of n_components {dim_reduction} and min_cluster_size and min_samples {size_para} fitted" )

HDBSCAN model of n_components 15 and min_cluster_size and min_samples 40 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 45 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 50 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 55 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 60 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 40 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 45 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 50 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 55 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 60 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 40 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 45 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 50 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 55 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 60 fitted
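Since no HDBSCAN parameter sets the number of clusters directly, the trial-and-error step amounts to counting clusters (and noise points) for each fitted model. A minimal sketch of such a check is given below; `summarise_labels` is a hypothetical helper that is not part of the original notebook, and it assumes the HDBSCAN convention that the label -1 marks noise.

```python
import numpy as np


def summarise_labels(labels):
    """Return (number of clusters excluding noise, share of points labelled -1)."""
    labels = np.asarray(labels)
    cluster_ids = set(labels.tolist()) - {-1}  # -1 is HDBSCAN's noise label
    noise_share = float(np.mean(labels == -1))
    return len(cluster_ids), noise_share
```

For example, `summarise_labels(clustering_sub_ds.dataset["m_hdb_dm100_sz60"])` would report how close that model is to the 20-topic target and how many answers it left unassigned.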


In [None]:
with open("06 Data analysis/04 Topic Modeling/outputs/bertopic_models/ds1_PNAS_Relevant_sub/all_models_object", 'wb') as file:
    pickle.dump(clustering_sub_ds, file, protocol=pickle.HIGHEST_PROTOCOL)

## on the whole dataset

In [None]:
#create the pipeline
clustering_whole_ds=topic_clustering_models(dataset=text_df,path_to_write="06 Data analysis/04 Topic Modeling/outputs/bertopic_models/ds1_whole")

In [19]:
#fit the series of models 

#kmeans and agglomerative models 
for cluster_num in range(10,25,2):
    clustering_whole_ds.fit_and_save(model_name=f"m_km_{cluster_num}",clustering_model=KMeans(n_clusters=cluster_num,random_state=123))
    clustering_whole_ds.fit_and_save(model_name=f"m_agg_{cluster_num}",clustering_model=AgglomerativeClustering(n_clusters=cluster_num))
    print(f"Kmeans model and Agglomerative model of cluster number {cluster_num} fitted" )

#hdbscan models with dimension reduction
for dim_reduction in [15,50,100]:
    umap_model= UMAP(n_neighbors=dim_reduction, n_components=dim_reduction, min_dist=0.0, metric='cosine',random_state=1234)
    for size_para in range(40, 61, 5):
        clustering_whole_ds.fit_and_save(f"m_hdb_dm{dim_reduction}_sz{size_para}",
                                       clustering_model=HDBSCAN(min_cluster_size=size_para, min_samples=size_para,max_cluster_size=3000,core_dist_n_jobs=-1,allow_single_cluster=True),
                                       dimension_model=umap_model )
        print(f"HDBSCAN model of n_components {dim_reduction} and min_cluster_size and min_samples {size_para} fitted" )

Kmeans model and Agglomerative model of cluster number 10 fitted
Kmeans model and Agglomerative model of cluster number 12 fitted
Kmeans model and Agglomerative model of cluster number 14 fitted
Kmeans model and Agglomerative model of cluster number 16 fitted
Kmeans model and Agglomerative model of cluster number 18 fitted
Kmeans model and Agglomerative model of cluster number 20 fitted
Kmeans model and Agglomerative model of cluster number 22 fitted
Kmeans model and Agglomerative model of cluster number 24 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 40 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 45 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 50 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 55 fitted
HDBSCAN model of n_components 15 and min_cluster_size and min_samples 60 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 40 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 45 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 50 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 55 fitted
HDBSCAN model of n_components 50 and min_cluster_size and min_samples 60 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 40 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 45 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 50 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 55 fitted
HDBSCAN model of n_components 100 and min_cluster_size and min_samples 60 fitted


# Evaluate the clustering models 
We now have a set of representative clustering models, each with a different clustering method and hyperparameters. While it is not straightforward to determine which clustering result is best, we want to get a sense of how these results differ from each other and which are more reasonable.

### intrinsic metrics 

In [9]:
# first we make use of the most common internal scores: silhouette_score, calinski_harabasz_score, davies_bouldin_score
# note: for HDBSCAN models the noise label (-1) is passed to the scores as if it were one more cluster
from sklearn.metrics import silhouette_score,calinski_harabasz_score,davies_bouldin_score
def get_intrinsic_metrics(models_object:topic_clustering_models ):
    measure_df=pd.DataFrame(None, index=models_object.models_dic.keys(),columns=["Silhouette","Calinski","Davies"])
    dataset=models_object.dataset
    embedding_array=np.array(dataset[models_object.embedding_col].to_list())
    for model_name in measure_df.index:
        measure_df.loc[model_name]=(silhouette_score(embedding_array,dataset[model_name]),
                                calinski_harabasz_score(embedding_array,dataset[model_name]),
                                davies_bouldin_score(embedding_array,dataset[model_name]))
    return measure_df.sort_index()
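To make the direction of each score concrete, the toy example below compares a sensible labelling with a shuffled one on synthetic, well-separated blobs. It is only an illustration on made-up data (the blob centres, sample size and the helper name `intrinsic_scores` are assumptions, not part of the original analysis): Silhouette and Calinski-Harabasz should be higher for the better labelling, Davies-Bouldin lower.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic data: three well-separated 2D blobs (illustrative only).
points, true_labels = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0, random_state=0)

# A deliberately poor labelling: the same labels in random order.
rng = np.random.default_rng(0)
shuffled_labels = rng.permutation(true_labels)


def intrinsic_scores(data, labels):
    """Compute the three intrinsic scores used in get_intrinsic_metrics."""
    return {
        "Silhouette": silhouette_score(data, labels),        # higher is better
        "Calinski": calinski_harabasz_score(data, labels),   # higher is better
        "Davies": davies_bouldin_score(data, labels),        # lower is better
    }


good = intrinsic_scores(points, true_labels)
bad = intrinsic_scores(points, shuffled_labels)
print(good)
print(bad)
```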

### extrinsic metrics 

In [10]:
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, normalized_mutual_info_score
#define a function that, given a model object (the topic_clustering_models class we defined), returns the pairwise similarity of all the models
#each cell holds three common metrics in this order: adjusted_rand_score, adjusted_mutual_info_score, normalized_mutual_info_score
def get_similarity_matrix(models_object:topic_clustering_models  ):
    model_name_list=sorted(list(models_object.models_dic.keys()))
    sim_df=pd.DataFrame(None, index=model_name_list,columns=model_name_list)
    dataset=models_object.dataset
    n=len(model_name_list)
    for i in range(n):
        sim_df.iloc[i, i] = 1 
        for j in range(i + 1, n): 
            cluster_i=dataset[model_name_list[i]]
            cluster_j=dataset[model_name_list[j]]
            sim_df.iloc[j, i]= list(map(lambda x: round(x, 2),
                                        [ adjusted_rand_score(cluster_i,cluster_j),
                                adjusted_mutual_info_score(cluster_i,cluster_j),
                                normalized_mutual_info_score(cluster_i,cluster_j) ]))
    return sim_df
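These three scores compare partitions rather than label values, which is why they can be used across models whose cluster numbers are arbitrary. The small sketch below illustrates this on made-up label vectors (the vectors and variable names are illustrative assumptions): renaming the clusters leaves the scores at their maximum, while a genuinely different grouping scores lower.

```python
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             normalized_mutual_info_score)

labels_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]    # same grouping, different names
labels_different = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # a genuinely different grouping

# Same order as the cells of get_similarity_matrix: ARI, AMI, NMI.
same = [adjusted_rand_score(labels_a, labels_renamed),
        adjusted_mutual_info_score(labels_a, labels_renamed),
        normalized_mutual_info_score(labels_a, labels_renamed)]
different = [adjusted_rand_score(labels_a, labels_different),
             adjusted_mutual_info_score(labels_a, labels_different),
             normalized_mutual_info_score(labels_a, labels_different)]
print(same)
print(different)
```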


## evaluate the models fitted on the subset that "Relevant_PNAS"=1

In [11]:
intrinsics_sub_df=get_intrinsic_metrics(clustering_sub_ds)

# for the intrinsic metrics, we also look at the rankings to get a sense of how the models perform on the different scores
# note that for Silhouette and Calinski higher is better, while for Davies lower is better; the rankings account for this,
# so rank 1 goes to the highest Silhouette and Calinski scores and to the lowest Davies score
intrinsics_rank_sub_df=intrinsics_sub_df.copy()
intrinsics_rank_sub_df["Silhouette"]=intrinsics_rank_sub_df["Silhouette"].rank(ascending=False)
intrinsics_rank_sub_df["Calinski"]=intrinsics_rank_sub_df["Calinski"].rank(ascending=False)
intrinsics_rank_sub_df["Davies"]=intrinsics_rank_sub_df["Davies"].rank(ascending=True)
intrinsics_rank_sub_df

model,Silhouette,Calinski,Davies
m_agg_10,8.0,2.0,20.0
m_agg_12,19.0,5.0,17.0
m_agg_14,12.0,7.0,25.0
m_agg_16,18.0,15.0,29.0
m_agg_18,21.0,22.0,27.0
m_agg_20,31.0,27.0,31.0
m_agg_22,30.0,29.0,26.0
m_agg_24,29.0,30.0,24.0
m_hdb_dm100_sz40,10.0,19.0,12.0
m_hdb_dm100_sz45,14.0,18.0,3.0


In [12]:
similarties_sub_df=get_similarity_matrix(clustering_sub_ds )
similarties_sub_df

model,m_agg_10,m_agg_12,m_agg_14,m_agg_16,m_agg_18,m_agg_20,m_agg_22,m_agg_24,m_hdb_dm100_sz40,m_hdb_dm100_sz45,...,m_hdb_dm50_sz55,m_hdb_dm50_sz60,m_km_10,m_km_12,m_km_14,m_km_16,m_km_18,m_km_20,m_km_22,m_km_24
m_agg_10,1,,,,,,,,,,...,,,,,,,,,,
m_agg_12,"[0.94, 0.97, 0.97]",1,,,,,,,,,...,,,,,,,,,,
m_agg_14,"[0.85, 0.93, 0.93]","[0.91, 0.96, 0.96]",1,,,,,,,,...,,,,,,,,,,
m_agg_16,"[0.81, 0.91, 0.91]","[0.87, 0.94, 0.94]","[0.96, 0.98, 0.98]",1,,,,,,,...,,,,,,,,,,
m_agg_18,"[0.78, 0.9, 0.9]","[0.84, 0.93, 0.93]","[0.94, 0.96, 0.96]","[0.98, 0.98, 0.98]",1,,,,,,...,,,,,,,,,,
m_agg_20,"[0.7, 0.87, 0.87]","[0.75, 0.9, 0.9]","[0.84, 0.94, 0.94]","[0.88, 0.96, 0.96]","[0.91, 0.97, 0.97]",1,,,,,...,,,,,,,,,,
m_agg_22,"[0.69, 0.86, 0.86]","[0.74, 0.89, 0.89]","[0.83, 0.93, 0.93]","[0.87, 0.95, 0.95]","[0.9, 0.96, 0.96]","[0.99, 0.99, 0.99]",1,,,,...,,,,,,,,,,
m_agg_24,"[0.67, 0.85, 0.85]","[0.72, 0.88, 0.88]","[0.81, 0.92, 0.92]","[0.85, 0.94, 0.94]","[0.88, 0.95, 0.95]","[0.97, 0.98, 0.98]","[0.98, 0.99, 0.99]",1,,,...,,,,,,,,,,
m_hdb_dm100_sz40,"[0.47, 0.57, 0.57]","[0.46, 0.57, 0.57]","[0.48, 0.58, 0.58]","[0.48, 0.58, 0.58]","[0.49, 0.59, 0.59]","[0.41, 0.58, 0.58]","[0.41, 0.58, 0.58]","[0.41, 0.58, 0.59]",1,,...,,,,,,,,,,
m_hdb_dm100_sz45,"[0.45, 0.58, 0.58]","[0.45, 0.57, 0.58]","[0.47, 0.58, 0.59]","[0.46, 0.58, 0.58]","[0.47, 0.59, 0.59]","[0.39, 0.58, 0.58]","[0.39, 0.58, 0.58]","[0.39, 0.58, 0.58]","[0.88, 0.92, 0.92]",1,...,,,,,,,,,,


## evaluate the models fitted on the whole set 

In [20]:
intrinsics_whole_df=get_intrinsic_metrics(clustering_whole_ds)
intrinsics_rank_whole_df=intrinsics_whole_df.copy()
intrinsics_rank_whole_df["Silhouette"]=intrinsics_rank_whole_df["Silhouette"].rank(ascending=False)
intrinsics_rank_whole_df["Calinski"]=intrinsics_rank_whole_df["Calinski"].rank(ascending=False)
intrinsics_rank_whole_df["Davies"]=intrinsics_rank_whole_df["Davies"].rank(ascending=True)
intrinsics_rank_whole_df

model,Silhouette,Calinski,Davies
m_agg_10,22.0,3.0,19.0
m_agg_12,9.0,5.0,27.0
m_agg_14,5.0,7.0,24.0
m_agg_16,4.0,9.0,25.0
m_agg_18,24.0,17.0,30.0
m_agg_20,23.0,25.0,29.0
m_agg_22,25.0,29.0,28.0
m_agg_24,31.0,30.0,31.0
m_hdb_dm100_sz40,12.0,26.0,1.0
m_hdb_dm100_sz45,6.0,11.0,13.0


In [21]:
similarties_whole_df=get_similarity_matrix(clustering_whole_ds )
similarties_whole_df

model,m_agg_10,m_agg_12,m_agg_14,m_agg_16,m_agg_18,m_agg_20,m_agg_22,m_agg_24,m_hdb_dm100_sz40,m_hdb_dm100_sz45,...,m_hdb_dm50_sz55,m_hdb_dm50_sz60,m_km_10,m_km_12,m_km_14,m_km_16,m_km_18,m_km_20,m_km_22,m_km_24
m_agg_10,1,,,,,,,,,,...,,,,,,,,,,
m_agg_12,"[0.91, 0.96, 0.96]",1,,,,,,,,,...,,,,,,,,,,
m_agg_14,"[0.86, 0.94, 0.94]","[0.96, 0.97, 0.97]",1,,,,,,,,...,,,,,,,,,,
m_agg_16,"[0.77, 0.9, 0.9]","[0.86, 0.94, 0.94]","[0.9, 0.97, 0.97]",1,,,,,,,...,,,,,,,,,,
m_agg_18,"[0.68, 0.87, 0.87]","[0.77, 0.91, 0.91]","[0.81, 0.94, 0.94]","[0.9, 0.97, 0.97]",1,,,,,,...,,,,,,,,,,
m_agg_20,"[0.64, 0.85, 0.85]","[0.73, 0.89, 0.89]","[0.77, 0.92, 0.92]","[0.86, 0.95, 0.95]","[0.96, 0.98, 0.98]",1,,,,,...,,,,,,,,,,
m_agg_22,"[0.63, 0.84, 0.84]","[0.71, 0.88, 0.88]","[0.75, 0.91, 0.91]","[0.84, 0.94, 0.94]","[0.94, 0.97, 0.97]","[0.98, 0.99, 0.99]",1,,,,...,,,,,,,,,,
m_agg_24,"[0.52, 0.82, 0.82]","[0.59, 0.86, 0.86]","[0.63, 0.88, 0.88]","[0.72, 0.92, 0.92]","[0.81, 0.95, 0.95]","[0.85, 0.96, 0.97]","[0.87, 0.98, 0.98]",1,,,...,,,,,,,,,,
m_hdb_dm100_sz40,"[0.42, 0.57, 0.57]","[0.44, 0.58, 0.58]","[0.45, 0.58, 0.58]","[0.47, 0.59, 0.6]","[0.43, 0.58, 0.58]","[0.44, 0.59, 0.59]","[0.42, 0.59, 0.59]","[0.32, 0.58, 0.58]",1,,...,,,,,,,,,,
m_hdb_dm100_sz45,"[0.43, 0.58, 0.58]","[0.45, 0.58, 0.58]","[0.46, 0.59, 0.59]","[0.48, 0.6, 0.6]","[0.43, 0.58, 0.59]","[0.44, 0.59, 0.59]","[0.43, 0.59, 0.59]","[0.32, 0.58, 0.58]","[0.94, 0.94, 0.94]",1,...,,,,,,,,,,


In [None]:
#save the metrics
intrinsics_whole_df.to_csv("06 Data analysis/04 Topic Modeling/outputs/clustering_results_metrics/intrinsics_on_ds1_whole.csv" )
similarties_whole_df.to_csv("06 Data analysis/04 Topic Modeling/outputs/clustering_results_metrics/similarties_on_ds1_whole.csv")
intrinsics_sub_df.to_csv("06 Data analysis/04 Topic Modeling/outputs/clustering_results_metrics/intrinsics_on_ds1_sub.csv")
similarties_sub_df.to_csv("06 Data analysis/04 Topic Modeling/outputs/clustering_results_metrics/similarties_on_ds1_sub.csv")

# some simple visualisations of how these metrics change with the model parameters are also created and saved in the same folder

### What do we learn from these metrics? 

From the intrinsic metrics:

- KMeans with 10 clusters appears to be the top-performing clustering on both the subset and the whole dataset.
- On both datasets, agglomerative clustering with a high number of clusters produces the worst scores.
- HDBSCAN models with 100-dimensional input and larger min_samples/min_cluster_size values produce good results.

From the extrinsic (similarity) metrics:

- HDBSCAN models with the same dimensionality reduction model and similar min_samples/min_cluster_size values are quite similar to each other.
- Agglomerative models with a similar number of clusters are similar to each other.
- Clusterings from different methods are quite different from each other, even when they have the same number of clusters.



## Compare the clustering result on the two datasets
The clusterings fitted on the two datasets cannot be compared directly using the metrics above. To compare them, we compute metrics on the same points: for the "Relevant_PNAS" = 1 points, we compare the cluster labels obtained when fitting on the whole dataset with those obtained when fitting only on the "Relevant_PNAS" = 1 subset.

For each method and hyperparameter setting, we calculate the similarity between the two resulting clusterings.

In [22]:
df_compare_clustering_on_two_range=pd.DataFrame(None,
                                                index=clustering_sub_ds.models_dic.keys(),
                                                columns=["ARI","AMI","NMI"])
for model_name in df_compare_clustering_on_two_range.index:
    cluster_whole_ds=clustering_whole_ds.dataset.query("Relevant_PNAS==1").sort_values("text_id")[model_name]
    cluster_sub_ds=clustering_sub_ds.dataset.sort_values("text_id")[model_name]
    df_compare_clustering_on_two_range.loc[model_name,:]=[ adjusted_rand_score(cluster_whole_ds,cluster_sub_ds),
                                adjusted_mutual_info_score(cluster_whole_ds,cluster_sub_ds),
                                normalized_mutual_info_score(cluster_whole_ds,cluster_sub_ds) ]
df_compare_clustering_on_two_range.sort_index()

model,ARI,AMI,NMI
m_agg_10,0.61817,0.69291,0.69372
m_agg_12,0.659374,0.697812,0.698914
m_agg_14,0.636309,0.695402,0.696857
m_agg_16,0.66944,0.704931,0.706706
m_agg_18,0.675564,0.71101,0.713138
m_agg_20,0.656764,0.712631,0.715164
m_agg_22,0.670509,0.723979,0.726931
m_agg_24,0.549729,0.718336,0.721834
m_hdb_dm100_sz40,0.849319,0.871838,0.872908
m_hdb_dm100_sz45,0.874691,0.884247,0.885062
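The comparison above relies on both tables containing exactly the same text_id values, so that sorting puts the labels in matching order. A slightly more defensive variant, sketched below, joins the two label columns on text_id before scoring. `align_labels` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def align_labels(whole_df, sub_df, label_col, id_col="text_id"):
    """Pair up the labels of the same documents from two fitted datasets."""
    merged = sub_df[[id_col, label_col]].merge(
        whole_df[[id_col, label_col]],
        on=id_col,
        how="inner",                   # keep only documents present in both
        suffixes=("_sub", "_whole"),
        validate="one_to_one",         # fail loudly on duplicated ids
    )
    return (merged[f"{label_col}_sub"].to_numpy(),
            merged[f"{label_col}_whole"].to_numpy())
```

The two returned arrays can be passed directly to `adjusted_rand_score` and the other similarity scores.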


### What do we learn from the above comparison? 

HDBSCAN models with the same hyperparameters produce quite similar clusterings whether fitted on the whole dataset or on the subset. KMeans and agglomerative models are less consistent across the two datasets. This is not surprising, because HDBSCAN is designed to handle noise; it also suggests that the whole dataset contains noise.

# Some Visualisation 

We select a few representative models to visualise so that we have a focus, but the same code can be applied to all the models we have fitted.

The models we check here are: 

- kmeans with 10 clusters

This is the top-performing model based on the intrinsic metrics. We display the results on the subset.

- kmeans with 20 clusters and agg with 14 clusters

20 is the desired number of clusters. Although it does not produce good scores, we still want to have a look, and kmeans with 20 clusters scores better than the agg model with 20 clusters. Similarly, we plot the agg model with 14 clusters.

- m_hdb_dm100_sz60

One of the top-performing HDBSCAN models based on the metrics.

## topics across conditions

In [None]:
#define a function that generates the plot of topics across conditions
#the input should be a topic_clustering_models object and the name of a model that the object contains
import plotly.express as px
import seaborn as sns
import plotly.io as pio
pio.templates.default = 'simple_white'
import nbformat
def plot_topics_across_condition(models_object:topic_clustering_models,model_name:str):
    model=models_object.models_dic[model_name]
    dataset=models_object.dataset
    model.update_topics(docs=list(dataset["Text"]),vectorizer_model=default_vectorizer_model,representation_model=default_representation_model)
    topics_per_class=model.topics_per_class(
        docs=list(dataset["Text"]),
        classes=list(dataset["Condition"]))
    fig = model.visualize_topics_per_class(topics_per_class, top_n_topics=30, 
                                      normalize_frequency = True)
    fig.write_html(f'06 Data analysis/04 Topic Modeling/outputs/visual_topicXcondition/ds1_raw_clustering_results/{model_name}.html')

In [75]:
plot_topics_across_condition(clustering_sub_ds,"m_km_10" )
plot_topics_across_condition(clustering_sub_ds,"m_agg_14" )
plot_topics_across_condition(clustering_sub_ds,"m_km_20" )
plot_topics_across_condition(clustering_sub_ds,"m_hdb_dm100_sz60" )
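Alongside the interactive plots, a plain table of how each condition's answers are distributed over clusters can be useful for quick checks. The sketch below is a complementary view, not a reproduction of what BERTopic's `visualize_topics_per_class` computes internally; `topic_share_by_condition` is a hypothetical helper.

```python
import pandas as pd


def topic_share_by_condition(df, topic_col, condition_col="Condition"):
    """Rows are conditions, columns are clusters; each row sums to 1."""
    return pd.crosstab(df[condition_col], df[topic_col], normalize="index")
```

For instance, `topic_share_by_condition(clustering_sub_ds.dataset, "m_km_10")` would give, for each condition, the share of its answers assigned to each of the 10 clusters.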

## clusters on 2D

In [None]:
umap_model_for_2Dplot= UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine',random_state=1234)

def plot_2D_clusters(models_object:topic_clustering_models,model_name:str):
    model=models_object.models_dic[model_name]
    dataset=models_object.dataset
    fig = model.visualize_documents(docs=list(dataset["Text"]),
                                    reduced_embeddings=umap_model_for_2Dplot.fit_transform(np.vstack(dataset["embedding"]))  )
    fig.write_html(f'06 Data analysis/04 Topic Modeling/outputs/visual_cluster_in_2D/models_on_ds1_sub{model_name}.html')
    print(model_name, "done!" )


In [None]:
plot_2D_clusters(clustering_sub_ds,"m_km_10" )
plot_2D_clusters(clustering_sub_ds,"m_agg_14" )
plot_2D_clusters(clustering_sub_ds,"m_km_20" )
plot_2D_clusters(clustering_sub_ds,"m_hdb_dm100_sz60" )


### What do we know from these plots?

First of all, these plots do not tell us which clustering is better. The first set of plots shows the topics across conditions, not clustering quality. The second set shows how the points are grouped in the 2D space, and the meaning of that 2D space is highly abstract.

However, we can still get some insights from the plots:

- For all clustering results, answers belonging to the grocery and last meal conditions are grouped together and receive corresponding summaries. This at least shows that our clustering process can identify the topics implied in the answers, although other results are very sensitive to the method we use and the hyperparameters we set.

- From the 2D plots, 20 topics seems to be too many, because the points of the two conditions, last meal and grocery, are divided into many subclusters, which may not be necessary. HDBSCAN clusters by density, so its clusters look clean on the 2D plot, but it also labels many points as noise, which may not be reasonable. KMeans with 10 clusters is the best result according to the intrinsic metrics, but it does not look that good on the plot: many clusters overlap, and the points are not grouped in the way a human would expect.


In [74]:
# some additional visualisation
for modelname in clustering_sub_ds.models_dic.keys():
    plot_2D_clusters(clustering_sub_ds,modelname )


m_km_10 done!
m_agg_10 done!
m_km_12 done!
m_agg_12 done!
m_km_14 done!
m_agg_14 done!
m_km_16 done!
m_agg_16 done!
m_km_18 done!
m_agg_18 done!
m_km_20 done!
m_agg_20 done!
m_km_22 done!
m_agg_22 done!
m_km_24 done!
m_agg_24 done!
m_hdb_dm15_sz40 done!
m_hdb_dm15_sz45 done!
m_hdb_dm15_sz50 done!
m_hdb_dm15_sz55 done!
m_hdb_dm15_sz60 done!
m_hdb_dm50_sz40 done!
m_hdb_dm50_sz45 done!
m_hdb_dm50_sz50 done!
m_hdb_dm50_sz55 done!
m_hdb_dm50_sz60 done!
m_hdb_dm100_sz40 done!
m_hdb_dm100_sz45 done!
m_hdb_dm100_sz50 done!
m_hdb_dm100_sz55 done!
m_hdb_dm100_sz60 done!


# Generative model summarisation


In the previous topic_modelling_with_metrics scripts, we fitted a series of topic clustering models, all of which were summarised with keywords. While these keywords are already meaningful, they are not plain text. Now that state-of-the-art generative models have become powerful, using one to summarise the themes/topics is a new and effective option. We use OpenAI's GPT-4 model here.

Here, we use a generative model to summarise the topics of two representative models derived in the topic_modelling_with_metrics script. One is the kmeans model with 10 clusters, which has the best intrinsic scores in many respects. The other is the HDBSCAN model with embeddings reduced to 100 dimensions, min_samples of 60 and min_cluster_size of 60; it is one of the top-performing HDBSCAN models in many respects and generates quite reasonable 2D clustering plots.

In [13]:
from openai import OpenAI
client = OpenAI()
import time


def summarise_cluster_topics(dataframe,cluster_column,text_reference_column,sample_number=100):
    sample=dataframe.groupby(cluster_column).apply(
    lambda group: group.sample(n=min(sample_number, len(group)), replace=False)).reset_index(drop=True)
    entries_list=sample.groupby(cluster_column).apply(lambda sub_sample:'\n'.join(sub_sample[text_reference_column].tolist()) )
    
    title_dataframe=pd.DataFrame({"cluster_name" : entries_list.index,"cluster_topic" : None})
    for idx in entries_list.index:
        time.sleep(10)
        entries=entries_list[idx]
        response = client.chat.completions.create( model="gpt-4-turbo",
                                messages=[
                                    {"role": "user",
                                    "content": f"I have a set of responses from a questionnaire that have been grouped into a single cluster due to their similarities. Please analyze these responses and provide a single short topic that focuses on the shared situation or activity of this cluster. Return me only the summative topic text without anything else. Below are the sample replies from this cluster: \n\n{entries} "
                                    }])
        title_temp=response.choices[0].message.content
        # use .loc to avoid chained assignment, which may silently fail to write the value
        title_dataframe.loc[title_dataframe["cluster_name"]==idx,"cluster_topic"]=title_temp
        print(f"title {idx} : {title_temp}")
    dataframe2= pd.merge(dataframe, title_dataframe, left_on=cluster_column, right_on='cluster_name', how='left')
    dataframe2.drop("cluster_name", axis=1, inplace=True)
    return dataframe2[["text_id",text_reference_column,"Condition","Relevant_PNAS",cluster_column,"cluster_topic"]]
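The sampling inside `summarise_cluster_topics` is not seeded, so each run sends a different sample of answers to the model and may return different titles. If reproducible samples are wanted, one option is sketched below. `sample_per_cluster` and its `seed` parameter are hypothetical and not part of the original function.

```python
import pandas as pd


def sample_per_cluster(df, cluster_col, n=100, seed=0):
    """Draw up to n rows per cluster, reproducibly for a given seed."""
    shuffled = df.sample(frac=1, random_state=seed)   # seeded shuffle of all rows
    sampled = shuffled.groupby(cluster_col).head(n)   # first n rows of each cluster
    return sampled.sort_values(cluster_col, kind="stable").reset_index(drop=True)
```

Clusters with fewer than `n` rows are returned in full, which matches the `min(sample_number, len(group))` behaviour of the original sampling.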

In [33]:
m_km_10_sub_GPT_sum=summarise_cluster_topics(clustering_sub_ds.dataset,"m_km_10","Text")


title 0 : Grocery shopping experiences and feelings.
title 1 : "Experiences of feeling powerless and frustrated due to workplace dynamics and authoritative supervision."
title 2 : Managing and evaluating team members.
title 3 : "Experiences of Exercising Power and Authority in Various Leadership Roles"
title 4 : Feeling powerless due to external circumstances or individuals.
title 5 : Experiences of Powerlessness due to External Control or Unforeseen Events
title 6 : Parental control and decision-making power over children's actions and desires.
title 7 : Sharing meals and eating together with family or friends.
title 8 : Work-related evaluations and interviews.
title 9 : "Collaborative decision-making with equal power and control."
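Some of the returned titles are wrapped in quotation marks (and, in later runs, in markdown bold markers) even though the prompt asks for the topic text only. A small post-processing step, sketched below, can normalise them before they are merged back into the dataset. `clean_title` is a hypothetical helper, not part of the original notebook.

```python
def clean_title(raw_title: str) -> str:
    """Strip surrounding whitespace, markdown bold markers and quotes."""
    title = raw_title.strip()
    title = title.strip("*").strip()     # remove leading/trailing ** markers
    title = title.strip("\"'").strip()   # remove wrapping quotation marks
    return title
```

It could be applied to `title_temp` right after the response is read, leaving the inner text of each title unchanged.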


In [43]:
m_km_16_sub_GPT_sum=summarise_cluster_topics(clustering_sub_ds.dataset,"m_km_16","Text")

title 0 : Workplace Power Struggles and Employee Disempowerment
title 1 : Grocery Shopping: Routine Experiences and Challenges
title 2 : Grocery Shopping Experiences and Budget Challenges
title 3 : Feeling powerless and lacking control in challenging life situations.
title 4 : Leadership and Decision-Making Authority in Various Contexts
title 5 : Feeling Powerless in Various Life Situations
title 6 : "Managing and Supervising Employees"
title 7 : Workplace Powerlessness
title 8 : Leadership and Management Experiences
title 9 : Experiences of exerting or encountering power dynamics in relationships.
title 10 : Situations involving feelings of power dynamics and the impact of decision-making by or over individuals.
title 11 : Enjoying Meals Together
title 12 : Job Interview Experiences and Power Dynamics
title 13 : Workplace Performance Evaluations and Their Impact on Raises and Promotions
title 14 : **Collaborative Decision-Making Among Equals**
title 15 : Breakfast Meals and Morning Ro

In [45]:
m_km_20_sub_GPT_sum=summarise_cluster_topics(clustering_sub_ds.dataset,"m_km_20","Text")

title 0 : Workplace Disempowerment and Power Struggles
title 1 : Grocery Shopping Experiences
title 2 : Experiences of Powerlessness and Vulnerability
title 3 : Feeling Powerless in Various Adverse Life Situations
title 4 : Workplace Powerlessness
title 5 : Routine Grocery Shopping Trips
title 6 : "Leadership and Decision-Making Responsibilities in Various Contexts"
title 7 : Parental and Familial Control and Negotiation
title 8 : Experiences of Supervisors and Managers in Roles of Authority
title 9 : Being in a Leadership Role
title 10 : "Experiences of Exercising Power and Control Over Others"
title 11 : "Experiences of powerlessness in educational settings"
title 12 : Crowded and Stressful Grocery Shopping Experiences.
title 13 : "Decision-Making in Evaluating and Selecting Candidates"
title 14 : Employee Performance Evaluations
title 15 : Job Interview Experiences Under Perceived Power Imbalance
title 16 : Recent dining experiences.
title 17 : Collaborative Decision-Making in Partn

In [None]:
import pickle

# Reload the saved object that holds all fitted models for the Relevant_PNAS subset.
with open("06 Data analysis/04 Topic Modeling/outputs/bertopic_models/ds1_PNAS_Relevant_sub/all_models_object", 'rb') as file:
    clustering_sub_ds = pickle.load(file)

In [14]:
m_hdb_100_60_sub_GPT_sum=summarise_cluster_topics(clustering_sub_ds.dataset,"m_hdb_dm100_sz60","Text",200)

title -1 : Experiences of Powerlessness
title 0 : Experiencing Power and Responsibility in Professional Roles
title 1 : Grocery Shopping Experiences
title 2 : Struggles with Powerlessness in Professional and Group Dynamics
title 3 : Meal experiences and satisfaction
title 4 : Babysitting and caregiving responsibilities.
title 5 : Job Application and Interview Experiences
title 6 : Equal Power and Shared Decision-Making in Collaborative Scenarios
title 7 : Feelings of powerlessness in medical and health-related crises.
title 8 : Experiences of feeling powerless in situations involving vehicles and traffic incidents.
title 9 : Parental and Authority Control
title 10 : Domestic Powerlessness and Control Issues
title 11 : Financial Distress and Powerlessness
title 12 : Power Dynamics and Control over Resources and Outcomes
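
In the HDBSCAN results, the title with index -1 is not a regular cluster: HDBSCAN assigns the label -1 to texts it treats as noise. When comparing `min_cluster_size` settings, it can be useful to check how many texts end up there. The sketch below uses toy labels; we assume that a column such as `m_hdb_dm100_sz60` holds integer cluster ids, and `outlier_share` is a hypothetical helper name.

```python
import pandas as pd


def outlier_share(labels: pd.Series) -> float:
    """Fraction of rows carrying the HDBSCAN noise label (-1)."""
    return float((labels == -1).mean())


# Toy labels standing in for one clustering column (assumption: integer cluster ids).
toy_labels = pd.Series([-1, 0, 0, 1, -1, 2, 2, 2])

share = outlier_share(toy_labels)                # 2 of the 8 toy rows are noise
sizes = toy_labels.value_counts().sort_index()   # cluster sizes, including -1

print(f"share of texts labelled as noise: {share:.2f}")
print(sizes)
```

A large noise share means the GPT title for -1 summarises a mixed bag of texts and should be read with more caution than the titles of the regular clusters.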


In [47]:
m_hdb_100_50_sub_GPT_sum=summarise_cluster_topics(clustering_sub_ds.dataset,"m_hdb_dm100_sz50","Text")

title -1 : Feeling Powerless and Lacking Control
title 0 : Supervision and Leadership Responsibility
title 1 : Grocery Shopping Experiences
title 2 : Experiences of feeling powerless or lacking control over work situations and performance evaluations
title 3 : Enjoying Meals at Home with Loved Ones
title 4 : Exercising authority or decision-making power, often in the context of caregiving or managing relationships with children, siblings, or friends.
title 5 : Job Interviews and Feelings of Powerlessness
title 6 : Collaborative Decision-Making in Equal Power Dynamics
title 7 : Powerlessness During Illness and Caregiving
title 8 : Experiences of Powerlessness and Lack of Control in Critical and Unforeseen Situations
title 9 : Parental Control and Powerlessness in Childhood
title 10 : Experiences of Powerlessness in Controlling and Abusive Relationships
title 11 : Financial Instability and Housing Insecurity
title 12 : Incidents of Power Dynamics and Influence Over Others


In [51]:
m_hdb_100_40_sub_GPT_sum=summarise_cluster_topics(clustering_sub_ds.dataset,"m_hdb_dm100_sz40","Text",50)

title -1 : Experiences of Feeling Powerless and Lack of Control
title 0 : Experiences of Exercising Authority and Making Critical Decisions in Leadership and Management Roles
title 1 : Recent grocery store visits.
title 2 : Workplace Powerlessness and Domineering Bosses
title 3 : Recent Meals and Eating Experiences
title 4 : "Situations of Exercising Parental or Guardian-Like Authority"
title 5 : Experiences of Powerlessness During Job Interviews
title 6 : Shared Control and Collaboration in Equal-Power Relationships
title 7 : Experiencing Powerlessness in Medical and Life-Or-Death Situations
title 8 : Experiences of Powerlessness in Situations Beyond Control
title 9 : Parental Control and Childhood Powerlessness
title 10 : Feeling Powerless Due to Financial Instability and Dependency
title 11 : Experiences of Powerlessness in Controlling and Abusive Relationships
title 12 : Incidents involving the dynamics of power and control in interpersonal interactions.
title 13 : Power Imbalance 

In [None]:
# Define a function that saves the GPT summaries for one model.


def write_GPT_summaries(GPT_sum_table: pd.DataFrame, model_name: str):
    # Output folder for the summaries of the Relevant_PNAS subset.
    out_dir = "06 Data analysis/04 Topic Modeling/outputs/GPT_summarization/ds1_sub_clustering_sum"
    # Number of rows per value of "cluster_topic".
    GPT_sum_table["cluster_topic"].value_counts().to_csv(f"{out_dir}/{model_name}_sub_topics_info.csv")
    # Full table with all rows.
    GPT_sum_table.to_csv(f"{out_dir}/{model_name}_sub_all_rows.csv")

In [58]:
write_GPT_summaries(m_km_10_sub_GPT_sum,"m_km_10" )

In [59]:
write_GPT_summaries(m_km_16_sub_GPT_sum,"m_km_16" )

In [60]:
write_GPT_summaries(m_km_20_sub_GPT_sum,"m_km_20" )

In [61]:
write_GPT_summaries(m_hdb_100_60_sub_GPT_sum,"m_hdb_dm100_sz60" )

In [17]:
write_GPT_summaries(m_hdb_100_60_sub_GPT_sum,"m_hdb_dm100_sz60_GPT4turobo_refined" )

In [62]:
write_GPT_summaries(m_hdb_100_50_sub_GPT_sum,"m_hdb_dm100_sz50" )

In [63]:
write_GPT_summaries(m_hdb_100_40_sub_GPT_sum,"m_hdb_dm100_sz40" )
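
The save calls above repeat the same pattern for every model, so they could also be written as a single loop over a dictionary of summary tables. The sketch below is one possible variant, not the function used above: `write_summaries_batch` is a hypothetical name, the output folder is passed in as an argument and created if it is missing, and the demo writes a toy table to a temporary folder instead of the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_summaries_batch(summaries: dict, out_dir) -> list:
    """Save the topic counts and the full table for each model (hypothetical helper)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for model_name, table in summaries.items():
        info_file = out_path / f"{model_name}_sub_topics_info.csv"
        rows_file = out_path / f"{model_name}_sub_all_rows.csv"
        # Same two files per model as in write_GPT_summaries.
        table["cluster_topic"].value_counts().to_csv(info_file)
        table.to_csv(rows_file)
        written.extend([info_file, rows_file])
    return written


# Toy table standing in for one of the *_GPT_sum tables (assumption: it has a "cluster_topic" column).
demo = {"m_km_demo": pd.DataFrame({"cluster_topic": ["A", "A", "B"], "Text": ["x", "y", "z"]})}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_summaries_batch(demo, tmp)
    written_names = sorted(p.name for p in paths)
    counts = pd.read_csv(paths[0], index_col=0)

print(written_names)
```

In the notebook, the dictionary would map names such as `"m_km_16"` to the corresponding summary tables, which keeps the model name and the table together and avoids saving a table under the wrong name.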