# 🧭 Project CX: Travel agent tool classifier

Welcome to your first ML project coding exercise of the week! You are building an intelligent routing system that solves a real travel industry challenge: automatically directing booking requests to the right service API.

A user might type *"Find me a hotel in Zurich for the weekend"* or *"Book a direct flight to Tokyo"*, and your classifier needs to instantly decide which backend tool to call: a flight booker, a hotel booker, or a car rental service. This is the core intelligence behind modern agentic travel apps.

**Pipeline:**

You'll walk through the complete Machine Learning Engineer pipeline:

1. 🗂️ **Load the data**: Get your travel booking dataset ready
2. 🔍 **Inspect the dataset**: Understand what you're working with
3. 📊 **Visualize the embeddings**: See your data in 2D space using PCA
4. 🧱 **Define architecture**: Build your model, dataset, and metrics
5. 🏋️ **Train the classifier**: Watch your MLP learn the routing patterns
6. 📋 **Evaluate performance**: Test how well it routes on unseen data
7. 🚀 **Live routing demo**: Try it yourself with real travel requests!

By the end, you will have trained an MLP classifier that routes travel requests among the 3 booking classes: flights, hotels, and car rentals. In a full agentic travel app, users describe their dream vacation and the system books everything with a single click; this router is the piece that decides which tool handles each request.

---
## 1 🗂️ Load the Data

We load pre-computed 384-dim sentence embeddings from CSV files in the `data/` directory. Each row contains an embedding vector (columns `"0"`–`"383"`), a string label (`Flight`, `Hotel`, or `CarRental`), and the original sentence.

In [None]:
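# Run this cell to clone the repo (which includes the dataset)
!git clone -b week1/tool-picker/demo2 https://github.com/eth-bmai-fs26/project.git
# If the repo was already cloned, make sure the right branch is checked out (git must run inside the repo)
%cd project
!git fetch && git checkout week1/tool-picker/demo2
%cd week1/tool-picker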

In [None]:
import os, sys

PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

DATA_DIR = os.path.join(os.getcwd(), "data")

# Store file paths for each data split (train / validation / test)
train_path = os.path.join(DATA_DIR, "train.csv")
val_path = os.path.join(DATA_DIR, "val.csv")
test_path = os.path.join(DATA_DIR, "test.csv")

---

### How did sentences become numbers?

Before training a classifier, we need to turn each sentence into something a model can work with: a list of numbers called an **embedding**.

Our classifier will work by learning to categorize these sentence embeddings rather than the raw text itself.

Think of it like this:

> **"Book a flight to Paris"** → `[0.12, -0.03, 0.47, …, 0.08]` (384 numbers)

A small pretrained language model reads the sentence and compresses its meaning into a fixed-size vector of 384 numbers. Sentences that mean similar things end up with similar vectors. For example, *"Reserve a plane ticket to Rome"* would land close to *"Book a flight to Paris"* in this 384-dimensional space, while *"Find me a hotel downtown"* would be farther away.

You don't need to run this step yourself. It was already done ahead of time, and the CSVs you loaded above contain the resulting embedding vectors.
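To make "similar sentences end up with similar vectors" concrete, here is a minimal sketch that compares vectors with cosine similarity, a common way to measure closeness in embedding space. The 4-dimensional vectors are made up for illustration; they stand in for real 384-dim embeddings and do not come from the dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: invented values, 4 dims instead of 384)
flight_paris = [0.9, 0.1, 0.0, 0.2]    # "Book a flight to Paris"
flight_rome = [0.8, 0.2, 0.1, 0.1]     # "Reserve a plane ticket to Rome"
hotel_downtown = [0.1, 0.9, 0.2, 0.0]  # "Find me a hotel downtown"

sim_flights = cosine_similarity(flight_paris, flight_rome)
sim_mixed = cosine_similarity(flight_paris, hotel_downtown)

print(f"flight vs flight: {sim_flights:.3f}")
print(f"flight vs hotel:  {sim_mixed:.3f}")
```

The two flight sentences score much higher than the flight/hotel pair, which is the property the classifier relies on.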

---
## 2 🔍 Load & Inspect

Read the three CSVs into DataFrames and print shapes, class distribution, and a preview of the first few rows to verify the data looks correct.

In [None]:
import numpy as np
import pandas as pd

LABEL_NAMES = {0: "FLIGHT_BOOKER", 1: "HOTEL_BOOKER", 2: "CAR_RENTAL_BOOKER"}
LABEL_MAP = {"Flight": 0, "Hotel": 1, "CarRental": 2}

train_df = pd.read_csv(train_path, index_col=0)
val_df   = pd.read_csv(val_path,   index_col=0)
test_df  = pd.read_csv(test_path,  index_col=0)

for df in (train_df, val_df, test_df):
    df["label"] = df["label"].map(LABEL_MAP)

print("Overview of the dataset:")
print(f"Train shape: {train_df.shape}")
print(f"Val shape:   {val_df.shape}")
print(f"Test shape:  {test_df.shape}")
print(f"\nClass distribution (train):")
print(train_df["label"].value_counts().sort_index().rename(LABEL_NAMES))
train_df.head()

#### 🔎 What to look for here

A few things to notice in the output above:

- **Shape `(4664, 386)`**: 4664 training samples, each with 384 embedding dimensions plus the `label` and `sentence` columns. The validation and test sets are much smaller (583 each), giving the common 80/10/10 split.
- **Balanced classes**: Each class has roughly 1555 training samples (~33% each). With balanced classes we don't need to worry about the model favoring one class over the others, and plain **accuracy** is a fair metric. If classes were imbalanced (e.g. 90% flights, 5% hotels, 5% cars), accuracy would be misleading and we'd need metrics like F1-score or balanced accuracy.
- **The DataFrame preview**: Each row is one training example. The `sentence` column holds the original text, `label` is the integer class (0/1/2), and columns `0` through `383` hold the 384 embedding values. The embedding values are small floating-point numbers roughly in the range [-0.15, 0.15], which is typical for normalized sentence embeddings.
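The point about imbalanced classes can be checked with a small sketch. It assumes a hypothetical 90/5/5 label split (not this project's data) and a "model" that always predicts the majority class; the helper names are illustrative only.

```python
def toy_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 90 flights (0), 5 hotels (1), 5 car rentals (2)
y_true_imbalanced = [0] * 90 + [1] * 5 + [2] * 5
# A useless "classifier" that always answers class 0
y_pred_majority = [0] * 100

plain = toy_accuracy(y_true_imbalanced, y_pred_majority)
balanced = toy_balanced_accuracy(y_true_imbalanced, y_pred_majority)

print(f"accuracy:          {plain:.2f}")
print(f"balanced accuracy: {balanced:.2f}")
```

Plain accuracy looks strong even though the model never predicts two of the three classes, while balanced accuracy exposes the problem. With the roughly equal class counts in this dataset, the two metrics stay close, which is why plain accuracy is adequate here.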

---
## 3 📊 - 🎯: Visualize the embeddings

**Exercise:** Project the 384-dim embeddings down to 2D so we can visualize them.

Your task:
1. Create a `PCA` object that reduces to **2 components** (use `random_state=42`)
2. Call `fit_transform` on `X_train` to get the 2D coordinates

The plotting code is already provided. Once you fill in the two lines, run the cell to see your scatter plot!

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

emb_cols = []
for col_name in train_df.columns:
    if str(col_name).isdigit():
        emb_cols.append(col_name)
emb_cols.sort(key=int)
X_train = train_df[emb_cols].values
y_train = train_df["label"].values

# 🎯 TODO: Create a PCA object that projects down to 2 components (you only need to fill in the n_components and random_state arguments)
# Hint: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
pca = ...

# 🎯 TODO: Call fit_transform method of pca on X_train to get the 2D projection
# Hint: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform
X_2d = ...

plt.figure(figsize=(8, 6))
for k in range(3):
    mask = y_train == k
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=LABEL_NAMES[k], alpha=0.5, s=15)

plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} var)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} var)")
plt.title("Training Embeddings: PCA Projection")
plt.legend()
plt.tight_layout()
plt.show()

#### 📖 Reading the PCA plot

**What is PCA doing here?** Each training sample lives in a 384-dimensional space (one dimension per embedding feature). We obviously can't visualize 384 dimensions, so PCA (Principal Component Analysis) finds the two directions in that space that capture the most variance and projects every point onto them. Think of it as finding the "best camera angle" to photograph a 3D object in 2D, except here we're going from 384D to 2D.

**What to look for:**
- **Distinct clusters**: If the three classes form well-separated blobs, that's a strong signal that a simple classifier will work well. The embedding model has already done most of the heavy lifting by mapping semantically similar sentences to nearby vectors.
- **Overlap between clusters**: Where clusters overlap, the classifier is more likely to struggle, and misclassified test samples will often come from these boundary regions. Keep in mind that classes overlapping in the 2D projection can still be separable in the full 384 dimensions.
- **Explained variance**: The axis labels show how much of the total variance each principal component captures (e.g. "PC1 (15.2% var)"). Don't be alarmed if these percentages seem low. With 384 dimensions, the information is spread across many axes, so even 10-15% per axis is meaningful.

> **Key insight:** PCA is just a visualization tool here, not a preprocessing step. The classifier will train on the full 384-dim embeddings and can exploit structure in all dimensions, not just the two shown in the plot.

---
## 4 🧱 Model, Dataset & Metrics Definitions

### 4.1 🗃️ PyTorch Dataset + DataLoaders

We import `make_dataloaders` from `dataset.py`, which handles reading the CSVs and packaging each split (train/val/test) into a PyTorch `DataLoader`. Each `DataLoader` serves data in mini-batches during training and evaluation.

In [None]:
from lib.dataset import make_dataloaders

print("✅ Dataset & DataLoader ready.")

#### 💡 Why wrap data in a PyTorch Dataset?

You might wonder why we don't just pass NumPy arrays directly to the model. The `Dataset` + `DataLoader` pattern gives us several things for free:

1. **Batching**: Instead of feeding all 4664 samples at once (which works here but not with larger datasets), the `DataLoader` serves them in mini-batches of 64. This controls memory usage and provides the stochastic gradient updates that help training.
2. **Shuffling**: The training `DataLoader` shuffles data each epoch, so the model doesn't memorize the order of examples. The validation and test loaders don't shuffle because evaluation order doesn't matter.
3. **Automatic tensor conversion**: The dataset returns PyTorch tensors ready for GPU/CPU computation, keeping the data pipeline clean.
4. **Scalability**: This same pattern scales from thousands of samples to millions. For huge datasets you could load from disk lazily instead of holding everything in memory.
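The batching and shuffling behaviour described in the list above can be sketched in plain Python. This is an illustration of the idea only, not how `make_dataloaders` or PyTorch's `DataLoader` is implemented; `iterate_minibatches` is a hypothetical helper.

```python
import math
import random


def iterate_minibatches(num_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, one list per mini-batch."""
    indices = list(range(num_samples))
    if shuffle:
        # Shuffle a copy of the index order; the data itself stays in place
        random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


# 4664 training samples served in mini-batches of 64
batches = list(iterate_minibatches(4664, 64, shuffle=True))

print(f"batches per epoch: {len(batches)}")
print(f"size of last batch: {len(batches[-1])}")
```

Every sample index appears exactly once per epoch, and the final batch is smaller because 4664 is not a multiple of 64. Passing a different `seed` for each epoch would mimic reshuffling between epochs.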


### 4.2 🧠 - 🎯 Building the Neural Network

A minimal two-layer MLP: `Linear(d → hidden) → ReLU → Dropout → Linear(hidden → 3)`. The output layer produces one score per class, and the highest score wins.

In [None]:
import torch.nn as nn


class ToolRouterMLP(nn.Module):
    """Simple two-layer MLP: Linear → ReLU → Dropout → Linear.

    Outputs raw logits.
    """

    def __init__(self, input_dim, hidden_dim=128, num_classes=3, dropout_p=0.1):
        super().__init__()
        # 🎯 TODO: Inside the nn.Sequential define the network architecture using the provided arguments
        self.network = nn.Sequential(
            nn.Linear(..., ...),
            ...,
            nn.Dropout(...),
            ...
        )

    def forward(self, x):
        # 🎯 TODO: return the output of passing x through the network called self.network
        return ...


print("✅ ToolRouterMLP ready.")

#### 🏗️ Understanding the architecture

Let's unpack what each layer does and *why* it's there:

| Layer | What it does | Why we need it |
|-------|-------------|----------------|
| `Linear(384 → 128)` | Multiplies the 384-dim input by a weight matrix (and adds a bias) to produce 128 features | Learns which combinations of embedding dimensions are useful for classification |
| `ReLU()` | Replaces negative values with zero: `max(0, x)` | Introduces **non-linearity**. Without it, stacking linear layers is mathematically equivalent to a single linear layer, so the network couldn't learn complex decision boundaries |
| `Dropout(0.1)` | Randomly zeroes 10% of neurons during training | **Regularization**: prevents the network from relying too heavily on any single neuron, reducing overfitting |
| `Linear(128 → 3)` | Maps 128 hidden features to 3 output scores (one per class) | Produces the final prediction: the class with the highest score wins |

**Why no softmax at the end?** The last layer outputs one raw score per class. PyTorch's loss function handles converting these scores into probabilities internally, so we don't need to add that step ourselves.
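As a rough illustration of what happens to the raw scores, the sketch below computes softmax and cross-entropy by hand in plain Python. PyTorch's `nn.CrossEntropyLoss` combines a log-softmax with the negative log-likelihood internally; the functions and logit values here are illustrative assumptions, not PyTorch's actual code.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the true class, computed directly from logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Hypothetical raw scores for one sentence: [flight, hotel, car rental]
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# Softmax preserves the ordering, so argmax works on logits directly
predicted_from_logits = logits.index(max(logits))
predicted_from_probs = probs.index(max(probs))

loss_if_flight = cross_entropy(logits, target=0)      # true class matches the top score
loss_if_car_rental = cross_entropy(logits, target=2)  # true class has the lowest score

print([round(p, 3) for p in probs])
print(f"loss when correct: {loss_if_flight:.3f}, loss when wrong: {loss_if_car_rental:.3f}")
```

Because softmax never changes which score is largest, the model can output logits and still pick the winning class with `argmax`, while the loss function takes care of the probability conversion.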

**Why only two layers?** Our input embeddings are already highly informative (a pretrained language model produced them). The classifier just needs to learn a relatively simple decision boundary in embedding space. Adding more layers would risk overfitting on our relatively small dataset of 4664 training samples.

### 4.3 📐 Metrics

Two simple evaluation tools: `accuracy` (how many predictions are correct) and `confusion_matrix` (shows which classes get confused with each other).

In [14]:
def accuracy(y_true, y_pred) -> float:
    """Fraction of correct predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    num_correct = (y_true == y_pred).sum()
    return float(num_correct) / len(y_true)


from lib.metrics import confusion_matrix

#### 📊 Why these two metrics?

- **Accuracy** is the go-to metric when classes are balanced (and ours are, at ~33% each). It simply answers: *"What fraction of predictions are correct?"* With 3 balanced classes, a random-guessing baseline would score ~33%, so anything above that shows the model has learned something.

- **Confusion matrix** tells us *where* the model makes mistakes. Each cell `[i, j]` counts how many samples from true class `i` were predicted as class `j`. The diagonal shows correct predictions; off-diagonal entries are errors. For example, if many hotel bookings get misclassified as car rentals, the confusion matrix will reveal that pattern, which overall accuracy alone would hide.
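The cell definition above (`[i, j]` = true class `i`, predicted class `j`) can be sketched in a few lines of plain Python. `toy_confusion_matrix` is a hypothetical stand-in written for illustration; the notebook itself uses `confusion_matrix` from `lib.metrics`, whose implementation is not shown here.

```python
def toy_confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix where cell [i][j] = samples of true class i predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


# Hypothetical labels: 0 = flight, 1 = hotel, 2 = car rental
toy_true = [0, 0, 1, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 2, 1, 2, 2, 1]

toy_cm = toy_confusion_matrix(toy_true, toy_pred, num_classes=3)
for row in toy_cm:
    print(row)

# Accuracy is the diagonal sum divided by the total number of samples
diag_accuracy = sum(toy_cm[k][k] for k in range(3)) / len(toy_true)
print(f"accuracy from the diagonal: {diag_accuracy}")
```

In this toy example the errors sit only in the hotel/car-rental cells, a pattern that the single accuracy number cannot show.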

---
## 5 🏋️ Train the classifier

### 5.1 ⚙️ Hyperparameters & Reproducibility

Set all training hyperparameters in one place and fix random seeds so you get the same results every time you run the code.

In [None]:
from lib.utils import set_seed
import torch

SEED       = 42
EPOCHS     = 20
BATCH_SIZE = 64
HIDDEN_DIM = 128
LR         = 1e-3
DROPOUT_P  = 0.1

set_seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🖥️  Device: {device}")

#### 🎛️ Why these specific values?

| Hyperparameter | Value | Rationale |
|---|---|---|
| `SEED = 42` | Fixed seed | Ensures you get the same results every run, which is critical for debugging and reproducibility |
| `EPOCHS = 20` | 20 passes over the training data | Enough for the loss to converge on this dataset without excessive training time |
| `BATCH_SIZE = 64` | 64 samples per gradient update | With 4664 training samples, that's ~73 batches per epoch, a good balance between stable gradients and frequent updates |
| `HIDDEN_DIM = 128` | 128 hidden neurons | Gives the network enough capacity to learn the routing patterns without being so large it overfits |
| `LR = 1e-3` | Learning rate of 0.001 | The default for Adam and a solid starting point. Too high causes instability; too low makes training painfully slow |
| `DROPOUT_P = 0.1` | 10% dropout rate | Light regularization as a starting point; heavier dropout (e.g. 0.3–0.5) is worth exploring if you observe overfitting |

> **Tip:** In practice, hyperparameter tuning is often the difference between a mediocre and a great model. Common strategies include grid search, random search, or more sophisticated methods like Bayesian optimization.


### 5.2 📦 DataLoaders + Model Init

Wrap the three CSVs into PyTorch `DataLoader`s and instantiate the MLP, Adam optimizer, and loss function.

In [None]:
train_loader, val_loader, test_loader = make_dataloaders(
    train_path, val_path, test_path,
    batch_size=BATCH_SIZE,
)
input_dim = train_loader.dataset.dim

print(f"Input dim: {input_dim}")
print(f"Train: {len(train_loader.dataset)}  "
      f"Val: {len(val_loader.dataset)}  "
      f"Test: {len(test_loader.dataset)}")

# Note:     earlier on in part 4.2 you defined the ToolRouterMLP class, now you will create an instance of it
#           a class is like a blueprint for our model, and an instance is a specific model created based on that blueprint
model = ToolRouterMLP(
    input_dim=input_dim,
    hidden_dim=HIDDEN_DIM,
    num_classes=3,
    dropout_p=DROPOUT_P,
).to(device)

# 🎯 TODO: complete the missing parts to define the optimizer
# Hint: what is the optimizer trying to optimize?
optimizer = torch.optim.Adam(..., lr=LR)

loss_fn = nn.CrossEntropyLoss()

print(f"\n{model}")

#### 🧩 What we just assembled

Let's take stock of the pieces we now have ready for training:

- **DataLoaders**: Three iterators that serve mini-batches of `(embedding, label)` pairs from train/val/test splits
- **Model**: A 2-layer MLP with 49,667 trainable parameters (`384×128 + 128 + 128×3 + 3`)
- **Optimizer**: Adam, which adapts the learning rate per-parameter using running estimates of gradient mean and variance. It's the workhorse optimizer in deep learning for good reason: it works well out of the box for most problems
- **Loss function**: Cross-entropy (`nn.CrossEntropyLoss`), the standard loss for multi-class classification. It penalizes the model more when it's confidently wrong and rewards confident correct predictions
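The parameter count of a two-layer MLP can be verified by summing weights and biases layer by layer. The sketch below does that arithmetic for a 384 → 128 → 3 network; `linear_param_count` is a hypothetical helper, and ReLU and Dropout are left out because they have no trainable parameters.

```python
def linear_param_count(in_features, out_features):
    """Weights (in x out) plus one bias per output unit of a fully connected layer."""
    return in_features * out_features + out_features


hidden_layer_params = linear_param_count(384, 128)  # 384*128 weights + 128 biases
output_layer_params = linear_param_count(128, 3)    # 128*3 weights + 3 biases
total_params = hidden_layer_params + output_layer_params

print(f"hidden layer: {hidden_layer_params}")
print(f"output layer: {output_layer_params}")
print(f"total:        {total_params}")
```

Once the model is instantiated, the same number can be cross-checked in PyTorch with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.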

### 5.3 🔁 Training Loop

Define `train_one_epoch` and `evaluate` helper functions, then run the full training loop for `EPOCHS` epochs, printing train/val loss and accuracy each epoch.

In [None]:
def train_one_epoch(model, loader, optimizer, loss_fn, device):
    """Train for one epoch. Returns (avg_loss, accuracy)."""
    model.train() # Set model to training mode
    total_loss = 0.0
    correct = 0
    total = 0

    for x, y in loader:
        x, y = x.to(device), y.to(device)

        # This block is the core of the training loop for one batch!
        logits = model(x)           # Forward pass: compute raw class scores (logits) from the model
        loss = loss_fn(logits, y)   # Compute cross-entropy loss between predicted logits and true labels
        optimizer.zero_grad()       # Reset gradients from the previous batch before backpropagation
        loss.backward()             # Backward pass: compute gradients of loss w.r.t. all model parameters
        optimizer.step()            # Update model parameters using the computed gradients

        # Accumulate weighted loss (multiply by batch size to undo the mean)
        batch_size = x.size(0)
        total_loss = total_loss + loss.item() * batch_size

        # Pick the class with the highest logit as the prediction
        preds = logits.argmax(dim=1)
        num_correct_in_batch = (preds == y).sum()
        correct = correct + num_correct_in_batch.item()
        total = total + batch_size

    # Return mean loss and fraction of correct predictions over the full epoch
    return total_loss / total, correct / total


def evaluate(model, loader, loss_fn, device):
    """Evaluate on a dataset. Returns (avg_loss, accuracy, y_true, y_pred)."""
    model.eval() # Set model to evaluation mode
    total_loss = 0.0
    all_true = []
    all_pred = []

    # Disable gradient tracking; we only need forward passes during evaluation
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = loss_fn(logits, y)

            batch_size = x.size(0)
            total_loss = total_loss + loss.item() * batch_size

            # Predicted class is the index with the highest logit score
            preds = logits.argmax(dim=1)
            # Collect true labels and predictions for metric computation
            all_true.extend(y.cpu().numpy())
            all_pred.extend(preds.cpu().numpy())

    total = len(all_true)
    acc = accuracy(all_true, all_pred)
    return total_loss / total, acc, all_true, all_pred


# --- run training ---
for epoch in range(1, EPOCHS + 1):
    # Train for one full pass over the training set
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, loss_fn, device)
    # Evaluate on the validation set to monitor generalisation (no weight updates)
    val_loss, val_acc, _, _ = evaluate(model, val_loader, loss_fn, device)
    print(f"Epoch {epoch:2d}/{EPOCHS}  "
          f"train_loss={train_loss:.4f}  train_acc={train_acc:.4f}  "
          f"val_loss={val_loss:.4f}  val_acc={val_acc:.4f}")

#### 📈 Interpreting the training log

Here's what to watch for as you read through the epoch-by-epoch output:

1. **Train loss decreasing**: The loss should steadily fall from epoch to epoch. If it plateaus early, the learning rate might be too low or the model too small. If it oscillates wildly, the learning rate is too high.

2. **Train accuracy climbing**: Starting around 72% in epoch 1 and reaching ~98-99% by epoch 20 shows the model is successfully learning the routing patterns.

3. **Val loss & accuracy**: This is the real report card. The validation set was *never* used for training, so val accuracy reflects how well the model generalizes:
   - **Val accuracy ≈ train accuracy** → Good generalization, no significant overfitting
   - **Val accuracy << train accuracy** → Overfitting: the model is memorizing training data rather than learning general patterns. Consider more dropout, fewer epochs, or more training data.
   - **Val accuracy > train accuracy** → Can happen early on (especially with dropout active during training but not evaluation) and is nothing to worry about.

4. **Convergence**: Notice how both train and val metrics stabilize toward the later epochs. The model has essentially learned all it can from this data. Training further would yield diminishing returns or even start overfitting.

> **Why do we track validation performance during training?** In practice, you'd use val performance to decide *when to stop training* (early stopping) and *which hyperparameters are best*. The test set is only touched once, at the very end, to get an unbiased estimate of real-world performance.
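Early stopping, mentioned above, can be sketched as a simple rule over the per-epoch validation losses. This is one common variant (stop after `patience` epochs without improvement), shown under the assumption of made-up loss values; it is not part of this notebook's training loop, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Ran through every epoch without triggering the stop rule
    return best_epoch, len(val_losses)


# Hypothetical validation losses (invented, not from an actual run)
example_val_losses = [0.90, 0.55, 0.40, 0.35, 0.36, 0.37, 0.38, 0.34]

best_epoch, stop_epoch = early_stopping_epoch(example_val_losses, patience=3)
print(f"best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
```

In a real loop you would also save the model weights at the best epoch and restore them after stopping. Note the trade-off visible in the toy data: a small `patience` stops before the late improvement at epoch 8.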

---
## 6 📋 Evaluate performance

Run the trained model on the held-out test split and print overall accuracy plus a confusion matrix (rows = true, cols = predicted).

In [None]:
test_loss, test_acc, y_true, y_pred = evaluate(model, test_loader, loss_fn, device)

print(f"📋 Test Loss:     {test_loss:.4f}")
print(f"📋 Test Accuracy: {test_acc:.4f}")

cm = confusion_matrix(y_true, y_pred, num_classes=3)
cm_df = pd.DataFrame(
    cm,
    index=[LABEL_NAMES[i] for i in range(3)],
    columns=[LABEL_NAMES[i] for i in range(3)],
)
print(f"\n📋 Confusion Matrix (rows=true, cols=pred):")
cm_df

#### 🏆 Interpreting the results

**97% test accuracy**: Almost every test sample was routed to the correct tool! The confusion matrix confirms this: the diagonal holds nearly all of the counts, and the off-diagonal entries are close to zero.

**Is this too good to be true?** Not necessarily. Remember:
- The sentence embeddings come from a powerful pretrained language model that already captures semantic meaning very well
- The three classes (flights, hotels, car rentals) are semantically quite distinct: sentences about booking flights sound very different from sentences about renting cars
- The dataset is relatively small and clean (synthetically generated)

In real-world production, you'd typically see lower accuracy because:
- User queries are messy, ambiguous, or multi-intent ("I need a flight and a hotel")
- The domain boundaries are fuzzier
- There may be out-of-distribution inputs that don't belong to any class

> **Good practice:** Even with 100% accuracy, always look at the confusion matrix. When accuracy drops (which it will in production), the confusion matrix tells you *which* classes are getting confused, guiding you on where to focus improvement efforts.
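One way to turn a confusion matrix into "where to focus" is to compute per-class recall and precision and to find the largest off-diagonal cell. The sketch below assumes a made-up 3×3 matrix (rows = true, columns = predicted); the counts are not this model's results and the helper names are hypothetical.

```python
def per_class_recall(cm):
    """Recall for class i = correct predictions in row i / all true samples of class i."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def per_class_precision(cm):
    """Precision for class j = correct predictions in column j / all predictions of class j."""
    n = len(cm)
    col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]
    return [cm[c][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n)]


# Hypothetical counts: rows/cols ordered as flight, hotel, car rental
example_cm = [
    [48, 1, 1],
    [0, 45, 5],
    [2, 3, 45],
]

recall = per_class_recall(example_cm)
precision = per_class_precision(example_cm)

# Largest off-diagonal cell = the most frequent confusion (true class, predicted class)
off_diagonal = [(i, j) for i in range(3) for j in range(3) if i != j]
worst_pair = max(off_diagonal, key=lambda ij: example_cm[ij[0]][ij[1]])

print("recall:   ", [round(r, 3) for r in recall])
print("precision:", [round(p, 3) for p in precision])
print("most common confusion (true, predicted):", worst_pair)
```

In this invented example, hotels predicted as car rentals is the most frequent error, so that pair would be the first place to look for ambiguous sentences or missing training examples.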

### 6.1 🔥 Confusion Matrix Heatmap

Render the confusion matrix as a color-coded heatmap for a quick visual read on where the model confuses classes.

In [None]:
from lib.metrics import plot_confusion_matrix
plot_confusion_matrix(cm, LABEL_NAMES)

#### 🗺️ How to read the heatmap

The heatmap is a visual representation of the same confusion matrix shown as a table above. Here's how to read it:

- **Rows** = true (actual) class, **Columns** = predicted class
- **Dark diagonal** = correct predictions. The darker/higher the diagonal values, the better.
- **Off-diagonal cells** = misclassifications. If cell `(Hotel, Flight)` were dark, it would mean the model often confuses hotel bookings for flight bookings.
- **Color intensity** maps to count: darker blue means more samples fell into that cell.

In our case the heatmap should show a clean diagonal pattern (nearly all predictions correct). In more challenging scenarios, the heatmap immediately reveals patterns like "car rentals often get confused with hotels", which would suggest those two classes have similar sentence patterns and might need more training data or better features to distinguish.