# Experimentation Notebook for the Q-Cluster Pipeline

This notebook provides an interactive environment to run, analyze, and experiment with the text clustering pipeline defined in `pipeline.py`. 

The goal of the pipeline is to:
1. Load a dataset of customer support interactions with predefined categories.
2. Generate embeddings for the text.
3. Use an unsupervised clustering algorithm to group the interactions.
4. Use a Large Language Model (LLM) to describe the contents of each cluster.
5. Match the generated clusters back to the original predefined categories.
6. Evaluate the performance of the clustering.

You can use this notebook to easily swap out different algorithms for feature extraction, clustering, and similarity matching to find the best combination for your data.

## 1. Setup

Before running the pipeline, you need to set up your environment.

### 1.1. Install Dependencies

First, make sure all the required Python libraries are installed. The cell below uses `uv sync`, which installs the project's declared dependencies into its virtual environment.

In [None]:
!uv sync

### 1.2. Set Environment Variables

The pipeline reads its key settings, such as the prompt template and the output directory, from environment variables. Create a `.env` file in the root of your project with the following content. The imports cell in Section 1.4 calls `load_dotenv()` from the `python-dotenv` library to load these values.

```.env
# Path to the directory where evaluation results will be stored
EVALUATION_RESULTS_DIR="./evaluation_results"

# The prompt template for generating cluster descriptions
DESCRIPTION_PROMPT_TEMPLATE="description_prompt_from_instructions.txt"

# (Optional) The LLM to use for generating qualitative reports
OLLAMA_REPORTING_MODEL="llama2"
```

**Note:** Ensure the prompt template file (e.g., `description_prompt_from_instructions.txt`) exists and contains the template you want the LLM to use.
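
The configuration cell in Section 2.1 reads `DESCRIPTION_PROMPT_TEMPLATE` with `os.environ[...]`, so a missing variable only surfaces as a bare `KeyError` once that cell runs. A minimal sketch of an earlier check, assuming the two variable names from the `.env` example are the required ones; `missing_env_vars` is a hypothetical helper and not part of `qcluster`:

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Variables the notebook reads via os.environ[...]; names taken from the .env example.
REQUIRED_VARS = ("EVALUATION_RESULTS_DIR", "DESCRIPTION_PROMPT_TEMPLATE")


def missing_env_vars(
    required: tuple[str, ...] = REQUIRED_VARS,
    env: Mapping[str, str] | None = None,
) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` right after `load_dotenv()` and raising if the list is non-empty gives a clearer error message than the later `KeyError`.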

### 1.3. Prepare the Data

The notebook expects the dataset at `../data/Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv` relative to `ROOT_DIR`, that is, in a `data` directory that sits next to `ROOT_DIR`. Download the data and place it there before running the pipeline.
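
Checking that the file exists before loading avoids a late `FileNotFoundError`. The sketch below mirrors how Section 2.1 builds `CSV_PATH` (`ROOT_DIR.parent / "data" / ...`). Both helpers are hypothetical, and the `root_dir` argument stands in for `qcluster.ROOT_DIR`:

```python
from pathlib import Path

DATASET_NAME = "Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv"


def dataset_path(root_dir: Path) -> Path:
    """Build the expected CSV location: <root_dir>/../data/<DATASET_NAME>."""
    return root_dir.parent / "data" / DATASET_NAME


def check_dataset(root_dir: Path) -> bool:
    """Return True if the dataset file exists at the expected location."""
    return dataset_path(root_dir).is_file()
```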

### 1.4. Imports

Let's import the standard-library, third-party, and `qcluster` modules used throughout the notebook, then load the environment variables.

In [None]:
import functools
import os
import time
from pathlib import Path
from os import PathLike

import torch
from loguru import logger
from tqdm.notebook import tqdm
from dotenv import load_dotenv

from qcluster import ROOT_DIR
from qcluster.algorithms.clustering import kmeans_clustering, hdbscan_clustering
from qcluster.llm.describer import get_description
from qcluster.algorithms.feature_extractors import create_embeddings, umap_reduction, pca_reduction
from qcluster.algorithms.similarity import get_top_n_similar_embeddings
from qcluster.custom_types import CategoryType, IdToCategoryResultType, category_to_idx, ClusterType
from qcluster.datamodels.instruction import InstructionCollection
from qcluster.datamodels.sample import SampleCollection
from qcluster.evaluation import evaluate_results, cluster_to_class_similarity_measures, store_results
from qcluster.preload import MODEL

# Load environment variables from .env file
load_dotenv()
logger.info("Environment variables loaded.")

## 2. Defining the Pipeline Components

Here, we define the core components of our pipeline. This is where you can experiment by swapping out functions and changing parameters.

### 2.1. Configure Experiment Parameters

Choose which algorithms you want to use for this run.
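
The configuration cell that follows uses `functools.partial` to pre-bind parameters, and reads the wrapped function back through `.func` and `.keywords` for logging. A small self-contained illustration of that mechanism, where `toy_clustering` is a hypothetical stand-in and not the real `kmeans_clustering`:

```python
import functools


def toy_clustering(points: list[float], n_clusters: int = 2) -> list[int]:
    """Hypothetical stand-in: assign points to clusters round-robin."""
    return [i % n_clusters for i in range(len(points))]


# Pre-bind n_clusters so callers only need to pass the data.
clustering = functools.partial(toy_clustering, n_clusters=3)

labels = clustering([0.1, 0.2, 0.3, 0.4])

# A partial object has no __name__ of its own: the wrapped function is
# available as .func and the pre-bound keyword arguments as .keywords.
name = clustering.func.__name__
bound = clustering.keywords
```

This is why the notebook logs `clustering_function.func.__name__` rather than `clustering_function.__name__`.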

In [None]:
# --- EXPERIMENT HERE ---

# 1. Choose a Feature Extractor
def feature_extractor_umap(texts: list[str]) -> torch.Tensor:
    """Creates embeddings and reduces dimensionality with UMAP."""
    embeddings = create_embeddings(texts, model=MODEL)
    embeddings = umap_reduction(embeddings, n_components=28)
    return embeddings

def feature_extractor_pca(texts: list[str]) -> torch.Tensor:
    """Creates embeddings and reduces dimensionality with PCA."""
    embeddings = create_embeddings(texts, model=MODEL)
    embeddings = pca_reduction(embeddings, n_components=28)
    return embeddings

# Set the feature_extractor for this run
feature_extractor = feature_extractor_umap 
logger.info(f"Using feature extractor: {feature_extractor.__name__}")

# 2. Choose a Clustering Function
N_CATEGORIES = len(SampleCollection.all_category_classes()) - 1 # Exclude 'UNKNOWN'

clustering_kmeans = functools.partial(kmeans_clustering, n_clusters=N_CATEGORIES)
clustering_hdbscan = functools.partial(hdbscan_clustering, min_cluster_size=100)

# Set the clustering_function for this run
clustering_function = clustering_kmeans
logger.info(f"Using clustering function: {clustering_function.func.__name__}")


# 3. Configure the Similarity Matching Function
similarity_function = functools.partial(
    get_top_n_similar_embeddings,
    use_mmr=False, # Try setting to True
    # mmr_lambda=0.3, # Tune this if use_mmr is True
)
logger.info(f"Using similarity function with MMR: {similarity_function.keywords.get('use_mmr', False)}")


# 4. Configure the Describer Function
describer = functools.partial(
    get_description,
    template_name=os.environ["DESCRIPTION_PROMPT_TEMPLATE"],
)

# 5. Define Data Path
CSV_PATH = (
    ROOT_DIR.parent
    / "data"
    / "Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv"
)

## 3. Running the Pipeline Step-by-Step

In [None]:
# Step 1: Load Samples
logger.info(f"Loading samples from {CSV_PATH}...")
samples = SampleCollection.from_csv(CSV_PATH)

# Optional: uncomment the line below to run on a smaller subset for faster testing
# samples = samples[:1000]

logger.info(f"Loaded {len(samples)} samples.")

In [None]:
# Step 2: Process Samples (Group them and create embeddings)
logger.info("Grouping samples by category and updating embeddings...")
samples_by_category = samples.group_by_category()
logger.info(f"Grouped samples into {len(samples_by_category)} categories.")

logger.info("Describing samples in each category...")
for category, sample_collection in tqdm(samples_by_category.items()):
    sample_collection.update_embeddings(feature_extractor)
    sample_collection.describe(describer)
logger.info("Embeddings updated and samples described.")

In [None]:
# Step 3: Create Instructions and Cluster them
logger.info("Creating instruction collection from samples...")
instructions = InstructionCollection.from_samples(samples)
logger.info(f"Created an instruction collection with {len(instructions)} instructions.")

logger.info("Updating instruction embeddings and clustering...")
instructions.update_embeddings(feature_extractor)
instructions.update_clusters(
    clustering_function=clustering_function, 
    use_raw_instructions=False
)
logger.info("Instruction embeddings updated and clusters created.")

In [None]:
# Step 4: Get Clusters and Describe Them
logger.info("Grouping instructions by cluster...")
instructions_by_cluster = instructions.group_by_cluster()
logger.info(f"Grouped instructions into {len(instructions_by_cluster)} clusters.")

logger.info("Describing instructions in each cluster...")
for cluster, instruction_collection in tqdm(instructions_by_cluster.items()):
    # Skip the noise cluster if it exists (often labeled -1 by density-based algorithms)
    if cluster == -1:
        continue
    instruction_collection.describe(describer)
logger.info("Instructions described.")

In [None]:
# Step 5: Match Clusters to Original Categories
logger.info("Finding top similar sample categories for each instruction cluster...")
id_to_category_pairs: IdToCategoryResultType = {}
for cluster, instruction_collection in tqdm(instructions_by_cluster.items()):
    # Noise cluster: there is nothing to match, so record the 'NOISE_CLUSTER'
    # placeholder as the prediction for each of its members.
    if cluster == -1:
        for sample in instruction_collection:
            id_to_category_pairs[sample.id] = (samples.get_sample_by_id(sample.id).category, 'NOISE_CLUSTER')
        continue

    predicted_category = instruction_collection.get_cluster_category(
        sample_collections=list(samples_by_category.values()),
        similarity_function=similarity_function,
    )
    logger.info(
        f"Cluster N {instruction_collection.cluster} title: `{instruction_collection.title}` -> matched to category: {predicted_category}"
    )
    
    for sample in instruction_collection:
        id_to_category_pairs[sample.id] = (
            samples.get_sample_by_id(sample.id).category,
            predicted_category,
        )
logger.info("Matching completed.")

## 4. Evaluate and Store Results
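
The cells in this section consume `id_to_category_pairs`, which maps each sample id to an `(actual_category, predicted_category)` pair and uses `'NOISE_CLUSTER'` as the prediction for noise points. As a sanity check that is independent of `evaluate_results` (whose internals are not shown in this notebook), plain accuracy and the share of noise points can be computed directly from that mapping. The helper name and the toy data are illustrative assumptions, and string categories are used here for simplicity:

```python
def accuracy_and_noise_share(
    pairs: dict[str, tuple[str, str]],
    noise_label: str = "NOISE_CLUSTER",
) -> tuple[float, float]:
    """Return (accuracy over non-noise samples, fraction of samples marked as noise)."""
    if not pairs:
        return 0.0, 0.0
    # Keep only the pairs that received a real predicted category.
    scored = [(actual, pred) for actual, pred in pairs.values() if pred != noise_label]
    noise_share = 1 - len(scored) / len(pairs)
    if not scored:
        return 0.0, noise_share
    correct = sum(1 for actual, pred in scored if actual == pred)
    return correct / len(scored), noise_share


# Toy data: three scored samples (two correct) and one noise point.
toy_pairs = {
    "s1": ("ORDER", "ORDER"),
    "s2": ("ORDER", "REFUND"),
    "s3": ("REFUND", "REFUND"),
    "s4": ("REFUND", "NOISE_CLUSTER"),
}
accuracy, noise_share = accuracy_and_noise_share(toy_pairs)
```

Reporting the noise share next to the score matters when comparing HDBSCAN with K-means: a high score on a small non-noise subset is not directly comparable to a score computed over every sample.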

In [None]:
# Step 6: Evaluate the results
logger.info("Evaluating results...")
cm = evaluate_results(id_to_category_pairs)

predicted_cluster_list = []
actual_category_list = []

for id_, (actual_category, predicted_category) in id_to_category_pairs.items():
    # Noise points have no predicted category, so exclude them from these scores.
    # Note that `cm` above was built from all pairs, including 'NOISE_CLUSTER'.
    if predicted_category == 'NOISE_CLUSTER':
        continue
    predicted_cluster_list.append(predicted_category)
    actual_category_list.append(actual_category)

cluster_to_class_scores = cluster_to_class_similarity_measures(
    predicted_cluster_list, actual_category_list
)

logger.info("--- Evaluation Results ---")
for measure, score in cluster_to_class_scores.items():
    print(f"{measure.capitalize()}: {score:.4f}")

cm.print_matrix(sparse=True)
cm.stat(summary=True)

In [None]:
# Step 7: Store the results
output_path = Path(os.environ["EVALUATION_RESULTS_DIR"])
timestamp = time.strftime("%Y%m%d-%H%M%S")
# os.popen does not raise if git fails or is missing; it returns an empty
# string, so fall back to "unknown" explicitly.
git_commit = os.popen("git rev-parse --short HEAD").read().strip() or "unknown"

unique_folder_name = f"{timestamp}-{git_commit}-{feature_extractor.__name__}-{clustering_function.func.__name__}"
unique_folder_path = output_path / unique_folder_name

logger.info(f"Storing results in: {unique_folder_path}")

store_results(
    cm=cm,
    cluster_to_class_scores=cluster_to_class_scores,
    storage_path=unique_folder_path,
    instructions_by_cluster=instructions_by_cluster,
)

logger.info("Run complete.")

## 5. How to Experiment

The most important part of this notebook is the configuration cell in **Section 2.1**. By changing the function definitions and assignments in that cell, you can fundamentally alter the pipeline's behavior.

### Ideas for Experiments

1.  **Clustering Algorithm**: 
    - Change `clustering_function = clustering_kmeans` to `clustering_function = clustering_hdbscan`.
    - Observe the results. HDBSCAN does not require a fixed number of clusters and can label points as noise (cluster `-1`), which this notebook excludes from the cluster-to-class scores. How does this affect your evaluation scores?
    - Tune the `min_cluster_size` parameter for HDBSCAN.

2.  **Dimensionality Reduction**:
    - Change `feature_extractor = feature_extractor_umap` to `feature_extractor = feature_extractor_pca`.
    - PCA is a linear technique, while UMAP is non-linear. Does one work better for this specific text data?

3.  **Similarity Matching**:
    - In the `similarity_function` definition, set `use_mmr=True`.
    - Maximal Marginal Relevance (MMR) is designed to promote diversity in the results. Does this help in correctly matching clusters to categories?
    - Uncomment `mmr_lambda` in the same definition and try tuning it (between 0 and 1). A higher value emphasizes similarity, while a lower value emphasizes diversity.

4.  **Prompt Engineering**:
    - Create a new prompt template file (e.g., `my_new_prompt.txt`).
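    - Update the `.env` file to point `DESCRIPTION_PROMPT_TEMPLATE` to your new file, then restart the kernel or call `load_dotenv(override=True)`: by default, `load_dotenv()` does not overwrite variables that are already set in the running process.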
    - See if a different prompt results in better, more coherent cluster descriptions, which might improve the matching process.
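
To make the MMR trade-off from item 3 concrete, here is a minimal sketch of the textbook greedy MMR selection in plain Python. It is not taken from `get_top_n_similar_embeddings`, whose implementation may differ. The function `mmr_select`, its signature, and the toy vectors are assumptions for illustration:

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr_select(
    query: list[float],
    candidates: list[list[float]],
    top_n: int,
    mmr_lambda: float = 0.5,
) -> list[int]:
    """Return indices of up to `top_n` candidates chosen greedily by MMR.

    score(c) = mmr_lambda * sim(c, query)
               - (1 - mmr_lambda) * max(sim(c, s) for s already selected)
    """
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < top_n:

        def score(i: int) -> float:
            relevance = cosine(candidates[i], query)
            # Redundancy is 0 until something has been selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return mmr_lambda * relevance - (1 - mmr_lambda) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

by_similarity = mmr_select(query, candidates, top_n=2, mmr_lambda=1.0)
with_diversity = mmr_select(query, candidates, top_n=2, mmr_lambda=0.3)
```

With `mmr_lambda=1.0` the two near-duplicates are returned (`[0, 1]`). With `mmr_lambda=0.3` the second pick switches to the dissimilar candidate (`[0, 2]`), which is the diversity effect described in item 3.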

After each change in **Section 2.1**, re-run that cell and then all the cells from **Section 3** onwards to see the impact of your experiment.
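
If re-running cells by hand becomes tedious, the combinations can be enumerated up front. The sketch below only builds a grid of option names taken from Section 2.1 and a label for each run. Wiring each combination into the pipeline is left out, and `run_label` is a hypothetical helper:

```python
import itertools

# Option names as they appear in the configuration cell.
feature_extractors = ["feature_extractor_umap", "feature_extractor_pca"]
clustering_functions = ["kmeans_clustering", "hdbscan_clustering"]
use_mmr_options = [False, True]

# Every combination of the three choices: 2 * 2 * 2 = 8 runs.
experiment_grid = [
    {"feature_extractor": fe, "clustering_function": cf, "use_mmr": mmr}
    for fe, cf, mmr in itertools.product(
        feature_extractors, clustering_functions, use_mmr_options
    )
]


def run_label(config: dict) -> str:
    """Build a short, unique label for one configuration."""
    suffix = "mmr" if config["use_mmr"] else "nommr"
    return f"{config['feature_extractor']}-{config['clustering_function']}-{suffix}"


run_labels = [run_label(config) for config in experiment_grid]
```

A label like this could be appended to the results folder name from Step 7, so that runs with different settings are easy to tell apart.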