# Embedding Fine-Tuning Methodology

This document describes the methodology used to fine-tune a sentence embedding model for customer support utterance clustering. The implementation is in [`finetune_embeddings.ipynb`](finetune_embeddings.ipynb).

---

## 1. Problem Statement

We use sentence embeddings to cluster 26,872 customer support utterances into groups that align with 27 ground-truth intents (e.g., `cancel_order`, `track_refund`, `change_shipping_address`). The clustering pipeline is:

```
Utterance text → Sentence Embedding (384d) → UMAP (15d) → HDBSCAN → Cluster labels
```

The pre-trained `all-MiniLM-L6-v2` model produces general-purpose embeddings. While it captures broad semantic similarity, it was not trained on customer support language or optimized to distinguish between the specific intents in our dataset. Fine-tuning adapts the embedding space so that **utterances with the same intent are pulled closer together** and **utterances with different intents are pushed apart**, directly improving downstream clustering quality.

## 2. Dataset

**Source:** [`bitext/Bitext-customer-support-llm-chatbot-training-dataset`](https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset) on the Hugging Face Hub.

| Property | Value |
|----------|-------|
| Total samples | 26,872 |
| Intents (fine-grained) | 27 classes |
| Categories (coarse) | 11 classes |
| Avg samples per intent | ~995 |
| Text column used | `instruction` |

Each row contains a customer support utterance (`instruction`) labeled with a ground-truth `intent` (e.g., `cancel_order`) and a broader `category` (e.g., `ORDER`). We use the `intent` labels as supervision for fine-tuning and both `intent` and `category` labels for evaluation.

## 3. Base Model

**Model:** [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

| Property | Value |
|----------|-------|
| Architecture | MiniLM (6 layers, 384 hidden dim) |
| Embedding dimensions | 384 |
| Max sequence length | 256 tokens |
| Parameters | ~22.7M |
| Pre-training | Built on `nreimers/MiniLM-L6-H384-uncased` (a 6-layer MiniLM), then fine-tuned with a contrastive objective on 1B+ sentence pairs |

This model was chosen because it is lightweight, fast to encode, and trained for sentences and short paragraphs, which suits customer support utterances that are typically 5–25 words long.

## 4. Train / Validation Split

We perform a **stratified 80/20 split at the utterance level**, not at the pair level.

| Split | Utterances | Purpose |
|-------|-----------|----------|
| Train | ~21,498 | Generate training pairs |
| Validation | ~5,374 | Generate evaluation triplets |

### Why split at the utterance level?

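If we split at the pair level, the same utterance could appear in both a training pair and a validation pair, so validation scores could reward memorizing individual utterances rather than learning generalizable intent representations. By splitting utterances first and then generating pairs within each split, we guarantee that **no utterance appears in both splits**. Near-duplicate paraphrases of the same request can still land on opposite sides of the split, so this rules out exact leakage rather than all overlap.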

Stratification ensures every intent is proportionally represented in both splits.
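
A minimal sketch of a stratified utterance-level split, using only the standard library. The notebook may well use a library helper such as scikit-learn's `train_test_split(..., stratify=...)` instead; the function name `stratified_split` and the toy data are illustrative assumptions, not the notebook's code.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, val_fraction=0.2, seed=42):
    """Split utterances so each label keeps roughly the same share in both splits."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend((text, label) for text in items[:n_val])
        train.extend((text, label) for text in items[n_val:])
    return train, val


# Toy data (hypothetical): 3 intents with 10 distinct utterances each.
intent_names = ("cancel_order", "track_refund", "change_shipping_address")
texts = [f"utterance {i}" for i in range(30)]
labels = [intent_names[i % 3] for i in range(30)]
train, val = stratified_split(texts, labels)

# No utterance may appear on both sides of the split.
assert not {t for t, _ in train} & {t for t, _ in val}
print(len(train), len(val))  # 24 6
```

Pairs and triplets are then generated separately from `train` and `val`, so an utterance can only ever contribute to one side.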

## 5. Training Data: Pair Generation

### Pair format

Each training example is an **(anchor, positive)** pair where both utterances share the same intent:

```
Anchor:   "I need assistance talking with an agent"
Positive: "I want help to speak to a live person"
```

Both belong to the `contact_human_agent` intent.

### Sampling strategy

For each of the 27 intents, we randomly sample up to **1,000 pairs** from all possible C(n, 2) combinations within the training split. This yields approximately **27,000 training pairs** total.

| Detail | Value |
|--------|-------|
| Pairs per intent | up to 1,000 |
| Total training pairs | ~27,000 |
| Sampling method | Random without replacement |
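
One way to draw up to 1,000 distinct pairs per intent without materializing every combination is to sample index pairs and discard repeats. This is a sketch under that assumption; `sample_pairs` and the toy utterances are hypothetical names, and the notebook may sample differently.

```python
import random
from itertools import combinations


def sample_pairs(utterances, max_pairs=1000, seed=42):
    """Sample up to max_pairs distinct unordered (anchor, positive) pairs."""
    rng = random.Random(seed)
    n = len(utterances)
    total = n * (n - 1) // 2  # C(n, 2)
    if total <= max_pairs:
        # Small intent: every possible pair fits under the cap.
        return list(combinations(utterances, 2))

    chosen = set()
    while len(chosen) < max_pairs:
        i, j = rng.sample(range(n), 2)       # two distinct indices
        chosen.add((min(i, j), max(i, j)))   # store each unordered pair once
    return [(utterances[i], utterances[j]) for i, j in sorted(chosen)]


# Toy intent with 800 placeholder utterances (roughly one intent's training share).
toy = [f"cancel_order utterance {k}" for k in range(800)]
pairs = sample_pairs(toy)
print(len(pairs))  # 1000
```

Rejection sampling is cheap here because 1,000 is a small fraction of the available combinations; if the cap were close to C(n, 2), sampling from the full list of combinations would be the safer choice.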

### Why 1,000 pairs per intent?

- **Exhaustive pairing is impractical:** With ~800 utterances per intent in the training split, exhaustive pairing yields C(800, 2) = 319,600 pairs per intent, or ~8.6M total. This is excessive for a small model.
- **1,000 per intent balances signal and compute:** 27,000 pairs at batch_size=64 gives ~422 training steps per epoch, so 3 epochs take ~1,266 steps and training completes in minutes.
- **Diminishing returns:** Contrastive learning with in-batch negatives is sample-efficient, so most of the benefit tends to arrive early in training and extra pairs add little.
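
The pair counts above can be checked with a few lines of arithmetic. The figure of 800 training utterances per intent is the document's own approximation, not an exact count.

```python
import math

n_intents = 27
per_intent_train = 800   # approximate training utterances per intent
pairs_cap = 1000         # sampled pairs per intent

exhaustive_per_intent = math.comb(per_intent_train, 2)
exhaustive_total = exhaustive_per_intent * n_intents
sampled_total = pairs_cap * n_intents

print(exhaustive_per_intent)             # 319600
print(exhaustive_total)                  # 8629200
print(sampled_total)                     # 27000
print(exhaustive_total / sampled_total)  # 319.6
```

Sampling therefore uses roughly 1 in 320 of the pairs that exhaustive pairing would produce.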

## 6. Validation Data: Triplet Generation

For monitoring training progress, we generate **(anchor, positive, negative)** triplets from the validation split:

```
Anchor:   "help me checking the available payment options"   (check_payment_methods)
Positive: "I want to see what payment options are accepted"  (check_payment_methods)
Negative: "I need help cancelling my purchase"               (cancel_order)
```

| Detail | Value |
|--------|-------|
| Triplets per intent | up to 200 |
| Total validation triplets | ~5,400 |
| Evaluator | `TripletEvaluator` |

The `TripletEvaluator` measures the percentage of triplets where `cosine(anchor, positive) > cosine(anchor, negative)`. A higher score means the model is better at placing same-intent utterances closer together than different-intent ones.
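
The sketch below shows the idea behind triplet generation and the accuracy it feeds, in plain Python. It is a stand-in, not the library's evaluator: `make_triplets`, `triplet_accuracy`, and the 2-dimensional toy vectors are hypothetical, and this version draws a fixed number of triplets per intent (repeats are possible) rather than "up to 200" distinct ones.

```python
import math
import random


def make_triplets(by_intent, per_intent=200, seed=42):
    """Build (anchor, positive, negative) triplets; the negative comes from another intent."""
    rng = random.Random(seed)
    intents = sorted(by_intent)
    triplets = []
    for intent in intents:
        others = [i for i in intents if i != intent]
        if len(by_intent[intent]) < 2 or not others:
            continue
        for _ in range(per_intent):
            anchor, positive = rng.sample(by_intent[intent], 2)
            negative = rng.choice(by_intent[rng.choice(others)])
            triplets.append((anchor, positive, negative))
    return triplets


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def triplet_accuracy(triplets, embed):
    """Fraction of triplets where the anchor is closer to the positive than to the negative."""
    hits = sum(
        cosine(embed(a), embed(p)) > cosine(embed(a), embed(n))
        for a, p, n in triplets
    )
    return hits / len(triplets)


# Hypothetical toy embeddings: each intent points along its own axis.
toy_vectors = {
    "check payment options": (1.0, 0.1),
    "which payment methods": (0.9, 0.2),
    "cancel my purchase": (0.1, 1.0),
    "cancel the order": (0.2, 0.9),
}
by_intent = {
    "check_payment_methods": ["check payment options", "which payment methods"],
    "cancel_order": ["cancel my purchase", "cancel the order"],
}
triplets = make_triplets(by_intent, per_intent=5)
print(triplet_accuracy(triplets, toy_vectors.__getitem__))  # 1.0
```

In the real pipeline the `embed` callable would be the model's encoder, and the score is tracked across checkpoints rather than computed once.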

## 7. Loss Function: MultipleNegativesRankingLoss

### What it does

`MultipleNegativesRankingLoss` (MNRL) implements the **InfoNCE** contrastive objective. Given a batch of (anchor, positive) pairs, it:

1. Encodes all anchors and all positives in the batch
2. Computes the cosine similarity matrix between all anchors and all positives, multiplied by a scale factor (20 by default in sentence-transformers)
3. Treats each anchor's corresponding positive as the correct match
4. Treats all other positives in the batch as **in-batch negatives**
5. Applies cross-entropy loss over each row of scaled similarity scores

### Illustration

With a batch of 4 pairs, the similarity matrix looks like:

```
              Positive_0  Positive_1  Positive_2  Positive_3
Anchor_0     [  0.92  ]    0.31        0.45        0.28       ← label = 0
Anchor_1       0.35     [  0.88  ]    0.40        0.33       ← label = 1
Anchor_2       0.41       0.37      [  0.90  ]    0.29       ← label = 2
Anchor_3       0.30       0.34        0.32      [  0.85  ]   ← label = 3
```

The diagonal entries (bracketed) are the correct anchor-positive matches. The loss pushes diagonal scores up and off-diagonal scores down.
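
To make the objective concrete, the sketch below computes the loss for the illustrated matrix in plain Python: each row is scaled, passed through a softmax, and scored by cross-entropy against its diagonal entry. The scale of 20 is assumed to match the sentence-transformers default; `mnrl_loss` is an illustrative name, not the library function.

```python
import math

# Cosine similarities from the illustration (rows: anchors, columns: positives).
sim = [
    [0.92, 0.31, 0.45, 0.28],
    [0.35, 0.88, 0.40, 0.33],
    [0.41, 0.37, 0.90, 0.29],
    [0.30, 0.34, 0.32, 0.85],
]


def mnrl_loss(sim, scale=20.0):
    """Mean cross-entropy where the correct class for row i is column i."""
    losses = []
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


loss_good = mnrl_loss(sim)

# Swap the first two columns so anchors 0 and 1 score highest on the wrong positive.
sim_bad = [[row[1], row[0], row[2], row[3]] for row in sim]
loss_bad = mnrl_loss(sim_bad)

print(loss_good < loss_bad)  # True
```

With a well-separated matrix like the one illustrated, the loss is already close to zero; it grows sharply as soon as an off-diagonal score overtakes the diagonal.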

### Why MNRL?

- **Only requires positive pairs** — no need to explicitly mine hard negatives
- **In-batch negatives are free** — a batch of 64 gives 63 negatives per anchor
- **Scales with batch size** — larger batches provide more diverse negatives
- **Standard choice** for sentence-transformers fine-tuning when labeled pairs are available
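
One caveat worth quantifying: with only 27 intents, some of the 63 in-batch "negatives" will share the anchor's intent. The estimate below assumes batches are drawn uniformly at random from 27,000 pairs with every intent at the 1,000-pair cap; a sampler that groups or excludes same-intent pairs would change the number.

```python
n_intents = 27
pairs_per_intent = 1000
batch_size = 64

total_pairs = n_intents * pairs_per_intent

# For a given anchor, each of the other 63 pairs in the batch shares its intent
# with probability (pairs_per_intent - 1) / (total_pairs - 1).
p_same_intent = (pairs_per_intent - 1) / (total_pairs - 1)
expected_false_negatives = (batch_size - 1) * p_same_intent

print(round(expected_false_negatives, 2))  # 2.33
```

Under these assumptions, about 2 of the 63 in-batch negatives per anchor are same-intent utterances that the loss pushes away anyway.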

## 8. Training Hyperparameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **Epochs** | 3 | Standard for fine-tuning small pre-trained models on moderate-sized data. Enough to converge without overfitting. |
| **Batch size** | 64 | Larger batches provide more in-batch negatives per anchor (63 negatives), strengthening the contrastive signal. Limited by GPU/MPS memory. |
| **Learning rate** | 2e-5 | Default for sentence-transformers fine-tuning. Low enough to preserve pre-trained knowledge while adapting to the domain. |
| **Warmup** | 10% of total steps | Gradually ramps the learning rate to avoid large early parameter updates that could destabilize the pre-trained weights. |
| **Scheduler** | WarmupLinear | Linear decay after warmup. Standard default for transformer fine-tuning. |
| **Optimizer** | AdamW | Default optimizer in sentence-transformers. |
| **Evaluation frequency** | Every ~211 steps (twice per epoch) | Frequent enough to monitor convergence and save the best checkpoint. |
| **Save strategy** | Best model by TripletEvaluator score | Only the checkpoint with the highest triplet accuracy is saved, avoiding overfitting to later epochs. |

### Training scale

| Metric | Value |
|--------|-------|
| Steps per epoch | ~422 |
| Total training steps | ~1,266 |
| Warmup steps | ~127 |
| Trainable parameters | ~22.7M |
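
The step counts in the table follow from the pair count and the hyperparameters. The sketch below derives them and shows how they could be passed to the `SentenceTransformer.fit` API, which is an assumption suggested by the `WarmupLinear` scheduler name; the notebook may use a different training entry point. `fit_kwargs` and `train` are hypothetical helpers, and the library imports are deferred so the arithmetic runs without them installed.

```python
import math

BATCH_SIZE = 64
EPOCHS = 3
LEARNING_RATE = 2e-5
WARMUP_FRACTION = 0.10
EVALS_PER_EPOCH = 2


def fit_kwargs(n_pairs):
    """Derive step counts from the pair count and bundle them as fit() arguments."""
    steps_per_epoch = math.ceil(n_pairs / BATCH_SIZE)
    total_steps = steps_per_epoch * EPOCHS
    return {
        "epochs": EPOCHS,
        "warmup_steps": math.ceil(total_steps * WARMUP_FRACTION),
        "scheduler": "WarmupLinear",
        "optimizer_params": {"lr": LEARNING_RATE},
        "evaluation_steps": steps_per_epoch // EVALS_PER_EPOCH,
        "save_best_model": True,
    }


def train(train_pairs, val_triplets, output_path):
    """Fine-tune with MNRL; requires sentence-transformers and torch."""
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from sentence_transformers.evaluation import TripletEvaluator

    model = SentenceTransformer("all-MiniLM-L6-v2")
    examples = [InputExample(texts=[a, p]) for a, p in train_pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=BATCH_SIZE)
    evaluator = TripletEvaluator(
        anchors=[a for a, _, _ in val_triplets],
        positives=[p for _, p, _ in val_triplets],
        negatives=[n for _, _, n in val_triplets],
    )
    model.fit(
        train_objectives=[(loader, losses.MultipleNegativesRankingLoss(model))],
        evaluator=evaluator,
        output_path=output_path,
        **fit_kwargs(len(train_pairs)),
    )
    return model


print(fit_kwargs(27_000))
```

For 27,000 pairs this gives 422 steps per epoch, 1,266 total steps, 127 warmup steps, and an evaluation every 211 steps, matching the tables above.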

## 9. Evaluation Methodology

### Clustering pipeline (identical to baseline)

To ensure a fair comparison, we re-run the **exact same** clustering pipeline from `clustering_analysis.ipynb` on both base and fine-tuned embeddings:

| Stage | Parameters |
|-------|------------|
| **UMAP** (clustering) | 15 dims, n_neighbors=15, min_dist=0.0, cosine metric |
| **HDBSCAN** | min_cluster_size=500, min_samples=10, EOM selection |
| **UMAP** (visualization) | 2 dims, n_neighbors=15, min_dist=0.1, cosine metric |

We intentionally do **not** re-tune HDBSCAN hyperparameters for the fine-tuned embeddings. Using fixed parameters isolates the effect of the embeddings from hyperparameter tuning.
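
A sketch of how the fixed pipeline could be expressed so that both embedding sets pass through identical settings. It assumes the `umap-learn` and standalone `hdbscan` packages (the notebook might use scikit-learn's `HDBSCAN` instead); the function names are hypothetical, and imports are deferred so the parameter dictionaries can be inspected without those packages.

```python
SEED = 42
UMAP_CLUSTER = dict(n_components=15, n_neighbors=15, min_dist=0.0, metric="cosine")
UMAP_VIZ = dict(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine")
HDBSCAN_PARAMS = dict(
    min_cluster_size=500, min_samples=10, cluster_selection_method="eom"
)


def cluster(embeddings):
    """Reduce to 15 dims with UMAP, then label with HDBSCAN (-1 marks noise)."""
    import hdbscan  # assumed: the standalone hdbscan package
    import umap     # assumed: umap-learn

    reduced = umap.UMAP(random_state=SEED, **UMAP_CLUSTER).fit_transform(embeddings)
    return hdbscan.HDBSCAN(**HDBSCAN_PARAMS).fit_predict(reduced)


def project_2d(embeddings):
    """Separate 2-D projection used only for plotting."""
    import umap

    return umap.UMAP(random_state=SEED, **UMAP_VIZ).fit_transform(embeddings)


def compare(base_embeddings, finetuned_embeddings):
    """Run the identical clustering pipeline on both embedding sets."""
    return {
        "base": cluster(base_embeddings),
        "finetuned": cluster(finetuned_embeddings),
    }
```

Keeping the parameters in module-level constants makes it harder for the base and fine-tuned runs to drift apart by accident.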

### Metrics

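All metrics exclude HDBSCAN noise points (label = -1). Because dropping noise can flatter a clustering that discards many points, read them alongside the noise levels.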

| Metric | Range | What it measures |
|--------|-------|------------------|
| **ARI** (Adjusted Rand Index) | [-0.5, 1] | Agreement between predicted clusters and ground-truth labels, adjusted for chance. 1 = perfect agreement, ~0 = chance level, negative = worse than chance. |
| **NMI** (Normalized Mutual Information) | [0, 1] | Mutual information between cluster assignments and ground-truth labels, normalized. 1 = perfect agreement, 0 = no shared information. |
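| **Per-intent purity** | [0, 1] | For each ground-truth intent, the fraction of its non-noise samples in the single most common cluster. Higher = the intent is concentrated in one cluster rather than fragmented across several; it does not penalize a cluster that absorbs multiple intents. |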
| **Embedding separation** | [-2, 2] | Difference between mean intra-class cosine similarity and mean inter-class cosine similarity (each lies in [-1, 1]). Higher = better class separation in the embedding space. |

Metrics are computed against both **intent** (27 classes) and **category** (11 classes) ground-truth labels.
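
The two custom metrics can be sketched in plain Python as follows. The function names and toy inputs are hypothetical; the pairwise loop is quadratic, so on the full dataset one would more likely sample pairs or use matrix operations. ARI and NMI are available in scikit-learn as `adjusted_rand_score` and `normalized_mutual_info_score`.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def per_intent_purity(intents, clusters, noise=-1):
    """For each intent: share of its non-noise samples in its most common cluster."""
    counts = defaultdict(Counter)
    for intent, cluster in zip(intents, clusters):
        if cluster != noise:
            counts[intent][cluster] += 1
    return {
        intent: max(c.values()) / sum(c.values())
        for intent, c in counts.items()
    }


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def embedding_separation(vectors, intents):
    """Mean same-intent cosine similarity minus mean different-intent similarity."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        bucket = intra if intents[i] == intents[j] else inter
        bucket.append(cosine(vectors[i], vectors[j]))
    return sum(intra) / len(intra) - sum(inter) / len(inter)


# Toy example: one noise point (-1) and one track_refund sample in the wrong cluster.
intents = ["cancel_order"] * 4 + ["track_refund"] * 4
clusters = [0, 0, 0, -1, 1, 1, 0, 1]
print(per_intent_purity(intents, clusters))
# {'cancel_order': 1.0, 'track_refund': 0.75}

toy_vectors = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]
toy_intents = ["a", "a", "b", "b"]
print(embedding_separation(toy_vectors, toy_intents) > 0)  # True
```

Note that `cancel_order` scores 1.0 even though cluster 0 also contains a `track_refund` sample, which is why purity is read together with ARI and NMI.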

## 10. Visualizations Produced

The fine-tuning notebook generates the following comparative visualizations:

| Visualization | Purpose |
|---------------|----------|
| **Side-by-side UMAP** (intent-colored) | Shows whether intent clusters become tighter and more separated after fine-tuning |
| **Side-by-side UMAP** (HDBSCAN-colored) | Compares discovered cluster structure and noise levels |
| **ARI / NMI bar chart** | Direct numeric comparison of clustering quality |
| **Per-intent purity bars** | Shows which intents improved or degraded, and by how much |
| **Cosine similarity histograms** | Visualizes the shift in intra-class vs inter-class similarity distributions |

## 11. Design Decisions & Trade-offs

### Why fine-tune rather than use a larger model?

A larger pre-trained model (e.g., `all-mpnet-base-v2` at 768d) might improve clustering without fine-tuning, but:
- A smaller model is faster and cheaper both to fine-tune and to run at inference time
- Domain-specific adaptation often outperforms generic larger models on the target domain
- 384d embeddings use less memory for UMAP/HDBSCAN on the full dataset

### Why MNRL over triplet loss?

- **Triplet loss** requires explicit (anchor, positive, negative) triplets and is sensitive to hard-negative mining strategy
- **MNRL** only needs positive pairs and automatically leverages the entire batch as negatives
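- With batch_size=64, MNRL sees up to 63 in-batch negatives per anchor, which is more varied than one hand-selected negative per triplet. Because batches are drawn from only 27 intents, some of those in-batch negatives share the anchor's intent unless the batch sampler prevents it.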

### Why not use the full C(n,2) pairs?

Exhaustive pairing from ~800 samples per intent would produce ~320K pairs per intent (~8.6M total). This is:
- Computationally wasteful for a 22.7M parameter model
- Highly redundant (many near-duplicate pairs from paraphrases)
- Unnecessary since MNRL is sample-efficient via in-batch negatives

### Why fixed HDBSCAN parameters for evaluation?

Re-tuning HDBSCAN for fine-tuned embeddings would conflate two effects: better embeddings vs. better hyperparameters. Using identical parameters (min_cluster_size=500) isolates the embedding improvement. In production, you would re-tune HDBSCAN to maximize the benefit of the new embeddings.

## 12. Reproducibility

### Random seeds

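All stochastic operations are seeded with `SEED = 42`:
- `random.seed(42)`: pair/triplet sampling
- `np.random.seed(42)`: NumPy operations
- `torch.manual_seed(42)`: training-time randomness such as batch shuffling and dropout (the model weights are loaded from the pre-trained checkpoint, not randomly initialized)
- `random_state=42`: UMAP projections and train/val split

Seeds make runs repeatable on the same hardware and library versions; results are not guaranteed to be bit-identical across CUDA, MPS, and CPU.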
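
A small helper can keep the seed calls in one place. This is a sketch: `seed_everything` is a hypothetical name, and NumPy and PyTorch are seeded only if they are installed.

```python
import random

SEED = 42


def seed_everything(seed=SEED):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything()
first = random.random()
seed_everything()
assert first == random.random()  # same seed, same draw
```

`random_state=42` still has to be passed explicitly to UMAP and to the train/val split, since those take their seed as an argument rather than reading global state.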

### Environment

| Component | Version / Detail |
|-----------|------------------|
| Python | 3.13 |
| Package manager | `uv` |
| sentence-transformers | (see `uv.lock`) |
| PyTorch device | MPS (Apple Silicon), CUDA, or CPU (auto-detected) |
| Dependencies | Declared in `pyproject.toml`, pinned in `uv.lock` |
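
Device auto-detection could look like the sketch below. The order of preference (CUDA, then MPS, then CPU) is an assumption, and `pick_device` is a hypothetical helper rather than the notebook's code.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

The returned string can be passed as the `device` argument when constructing a `SentenceTransformer`.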

### To reproduce

```bash
# 1. Install dependencies
uv sync

# 2. Register the Jupyter kernel
uv run python -m ipykernel install --user --name contact-center --display-name "Contact Center (Python 3.13)"

# 3. Run the notebook end-to-end
#    Select the "Contact Center (Python 3.13)" kernel and run all cells in finetune_embeddings.ipynb
```

The fine-tuned model is saved to `./finetuned-MiniLM-L6-v2-customer-support/` (excluded from git via `.gitignore`).

## 13. Potential Improvements

| Improvement | Description |
|-------------|-------------|
| **Hard-negative mining** | Mine hard negatives (high similarity, different intent) for a stronger contrastive signal, or use `GISTEmbedLoss`, which relies on a guide model to filter likely false negatives out of the batch |
| **Larger base model** | Try `all-mpnet-base-v2` (768d) or `BAAI/bge-base-en-v1.5` as starting points |
| **More epochs with early stopping** | Monitor validation loss and stop when it plateaus |
| **Augmented pairs** | Use paraphrase generation to create additional training pairs for low-frequency intents |
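| **Multi-task training** | Jointly optimize for intent and category separation, for example by training on intent-level and category-level pairs together. (`MatryoshkaLoss` addresses a different goal: embeddings that remain useful when truncated to fewer dimensions.) |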
| **Re-tune HDBSCAN** | After fine-tuning, sweep HDBSCAN parameters again to find the optimal clustering for the new embedding space |