# Notebook: Fine-tune BERT on Your Labeled CSV Data

### Objective:

1. Walk through the entire fine-tuning pipeline step by step.
2. Explain what's going on in accessible language.
3. Print shapes, label distributions, and sample rows at each stage so you can "see" the data evolving.
4. Use the core functions from `local_llm`, so the notebook stays relatively clean and focused on _workflow_, not implementation details.

### Step 1 - Setup and Imports 

In this first step, we import all the tools we need:
- `pandas` to load and inspect the CSV.
- The fine-tuning helpers from `local_llm.training.text_finetune`.

We'll also define the paths to your data and BERT assets later. 

In [None]:
from pathlib import Path
import pandas as pd

from local_llm.training.text_finetune import (
    FineTuneConfig,
    set_seed,
    prepare_label_mapping,
    stratified_split_indices,
    encode_splits,
    build_dataloaders,
    build_bert_text_classifier_from_assets,
    train_text_classifier,
    evaluate_on_split,
    save_finetuned_classifier,
    export_predictions_csv,
)

print("Imports complete.")


### Step 2 - Load Your Labeled Data

Now we: 
1. Load the labeled CSV file into a pandas DataFrame.
2. Print:
    - The first few rows
    - Shape of the data (rows, columns)
    - The column names
 
This helps confirm that:
- The file path is correct.
- The column names match what we expect later.
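
One way to confirm the column names up front is a small check like the sketch below. The helper `find_missing_columns` is hypothetical (it is not part of `local_llm`), the example column list is made up, and the required names simply mirror the text and label columns used in the Step 3 configuration.

```python
def find_missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Columns this notebook relies on later (text columns plus the label column).
required_cols = ["wbs_name", "wbs_title_hierarchy", "keyword", "level_1"]

# Stand-in for `df.columns.tolist()`; this example list is made up.
example_cols = ["wbs_name", "keyword", "level_1", "cost"]

missing = find_missing_columns(example_cols, required_cols)
print(missing)  # ['wbs_title_hierarchy']
```

In the notebook itself, the same call would take `df.columns.tolist()` as its first argument once the CSV is loaded.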

In [None]:
data_path = Path("C:/Users/Cameron.Webster/Python/local-llm/data/wbs_data.csv")  # adjust if your file lives elsewhere

print(f"Loading data from: {data_path.resolve()}")
df = pd.read_csv(data_path)

print("\n✅ Data loaded.")
print(f"Data shape: {df.shape[0]} rows × {df.shape[1]} columns\n")
print("Columns:")
print(df.columns.tolist())

print("\nFirst 5 rows:")
display(df.head())

### Step 3 - Define Fine-tuning Configuration

Here we define a `FineTuneConfig`, which is a single object that:
- Tells the library which text columns to use and which column is the label.
- Controls how we split into train/validation/test sets.
- Points to your BERT assets directory.
- Sets training hyperparameters like learning rate and number of epochs.
- Defines how much of BERT we fine-tune (`finetune_policy` and `finetune_last_n`).

We also call `set_seed` to make the run reproducible (same random splits each time).
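
To make the fine-tuning policy more concrete, here is a minimal sketch of how a `"last_n"` policy could decide which parameters are trainable, based only on parameter names. It rests on several assumptions that the library may not share: Hugging Face style names such as `encoder.layer.<i>.`, a 12-layer encoder, a classification head whose parameters start with `classifier.`, and `train_embeddings` only mattering under `"last_n"`. The function `is_trainable` is hypothetical.

```python
import re


def is_trainable(name, policy, last_n, num_layers=12, train_embeddings=False):
    """Return True if the parameter called `name` should be updated in training."""
    if name.startswith("classifier."):
        return True  # assumption: the new classification head always trains
    if policy == "full":
        return True
    if policy == "none":
        return False
    # policy == "last_n"
    if name.startswith("embeddings."):
        return train_embeddings
    match = re.match(r"encoder\.layer\.(\d+)\.", name)
    # Parameters outside the numbered encoder layers stay frozen in this sketch.
    return match is not None and int(match.group(1)) >= num_layers - last_n


# Hypothetical parameter names, used only to exercise the function.
param_names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.9.attention.self.query.weight",
    "encoder.layer.10.output.dense.weight",
    "encoder.layer.11.output.dense.weight",
    "classifier.weight",
]
trainable = [n for n in param_names if is_trainable(n, "last_n", last_n=2)]
print(trainable)
```

With `last_n=2` and 12 layers, only layers 10 and 11 plus the head are selected, which matches the intent of "last 2 transformer layers trainable" in the configuration below.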

In [None]:
cfg = FineTuneConfig(
    # Which text columns to concatenate into a single input string
    text_cols=("wbs_name", "wbs_title_hierarchy", "keyword"),

    # Which column contains the labels
    label_col="level_1",

    # Where the converted BERT assets live
    assets_dir=Path("C:/Users/Cameron.Webster/Python/local-llm/assets/bert-base-local"),

    # Where to save fine-tuning outputs (models, predictions, etc.)
    output_dir=Path("C:/Users/Cameron.Webster/Python/local-llm/artifacts/finetune_bert"),

    # Train/val/test fractions (they should sum to <= 1, remainder is test)
    train_frac=0.7,
    val_frac=0.15,

    # Training loop hyperparameters
    epochs=5,
    base_lr=2e-5,
    batch_size=32,

    # How much of BERT to fine-tune
    finetune_policy="last_n",  # options: "none", "last_n", "full"
    finetune_last_n=2,         # last 2 transformer layers trainable
    train_embeddings=False,    # embeddings stay frozen
)

print("FineTuneConfig created:")
print(cfg)

# Set random seed for reproducibility
set_seed(cfg.seed)
print(f"\nRandom seed set to: {cfg.seed}")


### Step 4 - Map String Labels to Integer IDs

Neural networks operate on numbers, not strings. 

Here we:
- Convert the label column (e.g., `"Construction"`, `"Design"`) into integer IDs.
- Build two dictionaries:
    - `label_to_id`: string label --> integer
    - `id_to_label`: integer --> string label
- Add a new column `label_id` to the DataFrame.

We also print:
- The mappings
- How many examples there are per class (class balance)
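
The idea behind the mapping can be sketched in plain Python. This is an illustration, not the implementation of `prepare_label_mapping`: the helper `build_label_mapping` is hypothetical, the labels are made up, and assigning IDs in sorted order is an assumption.

```python
from collections import Counter


def build_label_mapping(labels):
    """Map each distinct label string to an integer ID and back."""
    unique = sorted(set(labels))  # sorting keeps the IDs stable across runs
    to_id = {label: idx for idx, label in enumerate(unique)}
    to_label = {idx: label for label, idx in to_id.items()}
    return to_id, to_label


demo_labels = ["Design", "Construction", "Design", "Procurement"]  # made-up labels
demo_to_id, demo_to_label = build_label_mapping(demo_labels)
demo_ids = [demo_to_id[label] for label in demo_labels]

print(demo_to_id)         # {'Construction': 0, 'Design': 1, 'Procurement': 2}
print(demo_ids)           # [1, 0, 1, 2]
print(Counter(demo_ids))  # examples per class ID
```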

In [None]:
label_col = cfg.label_col
print(f"Preparing label mapping using label column: '{label_col}'")

df_mapped, label_to_id, id_to_label = prepare_label_mapping(df, label_col)

print("\n✅ Label mapping created.")
print("label_to_id:")
for lab, idx in label_to_id.items():
    print(f"  {lab!r} -> {idx}")

print("\nid_to_label:")
for idx, lab in id_to_label.items():
    print(f"  {idx} -> {lab!r}")

print("\nValue counts for label_id (class distribution):")
display(df_mapped["label_id"].value_counts().sort_index())

print("\nPreview with label_id:")
display(df_mapped[[label_col, "label_id"]].head())


### Step 5 - Stratified Train / Validation / Test Split

We now split the data into three sets:
- **Training**: used to fit the model. 
- **Validation**: used to tune hyperparameters and monitor overfitting.
- **Test**: held out until the very end for final evaluation. 

The split is stratified, meaning:
- Each set keeps approximately the same label distribution as the full dataset.

We print:
- The sizes of each split.
- Label distributions in each split.
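
For intuition, a stratified split can be sketched by shuffling the row indices of each class separately and cutting every class by the same fractions. The function `stratified_split` below is a hypothetical stand-in, not the code behind `stratified_split_indices`, and the label list is made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac, val_frac, seed):
    """Split row indices so each class is divided by the same fractions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_val = round(len(indices) * val_frac)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to test
    return train, val, test


demo_labels = [0] * 20 + [1] * 40  # made-up: 20 rows of class 0, 40 of class 1
train, val, test = stratified_split(demo_labels, train_frac=0.7, val_frac=0.15, seed=42)
print(len(train), len(val), len(test))  # 42 9 9
```

Because each class is cut independently, class 0 contributes 14/3/3 rows and class 1 contributes 28/6/6, so the 1:2 class ratio is preserved in every split.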

In [None]:
labels = df_mapped["label_id"].values

train_idx, val_idx, test_idx = stratified_split_indices(
    labels,
    train_frac=cfg.train_frac,
    val_frac=cfg.val_frac,
    seed=cfg.seed,
)

print("✅ Stratified indices created.")
print(f"Train size: {len(train_idx)}")
print(f"Val size:   {len(val_idx)}")
print(f"Test size:  {len(test_idx)}")

# Inspect label distribution across splits
train_labels = df_mapped.iloc[train_idx]["label_id"]
val_labels   = df_mapped.iloc[val_idx]["label_id"]
test_labels  = df_mapped.iloc[test_idx]["label_id"]

print("\nLabel distribution in TRAIN:")
display(train_labels.value_counts().sort_index())

print("\nLabel distribution in VAL:")
display(val_labels.value_counts().sort_index())

print("\nLabel distribution in TEST:")
display(test_labels.value_counts().sort_index())


### Step 6 - Encode Text with BERT Input Encoder

BERT expects tokenized inputs:
- `input_ids`: integers representing word pieces.
- `attention_mask`: 1 for real tokens, 0 for padding.
- `token_type_ids`: segment IDs (here mostly 0s since we use single sentences).

We use:
- `encode_splits` to:
    - Build a tokenizer based on your local BERT vocab.
    - Turn text into tensors for each split (train/validation/test).

We print:
- Tensor shapes for each split.
- Sequence length (should equal `cfg.max_len`).
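
The three tensors are easier to picture on a toy example. The sketch below uses whitespace splitting and a six-entry vocabulary as stand-ins; real BERT tokenization uses WordPiece and the vocabulary from your assets directory, so the function `encode_text`, the vocabulary, and the IDs here are all illustrative assumptions.

```python
def encode_text(text, vocab, max_len):
    """Turn one string into fixed-length BERT-style inputs."""
    # Reserve two positions for the special [CLS] and [SEP] tokens.
    tokens = ["[CLS]"] + text.lower().split()[: max_len - 2] + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(input_ids)

    # Pad on the right so every example has exactly `max_len` positions.
    pad_count = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad_count
    attention_mask += [0] * pad_count
    token_type_ids = [0] * max_len  # single segment, so all zeros
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "site": 4, "grading": 5}
encoded = encode_text("Site grading works", toy_vocab, max_len=8)
print(encoded["input_ids"])       # [2, 4, 5, 1, 3, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The word "works" is not in the toy vocabulary, so it maps to the `[UNK]` ID, and the three trailing zeros in the mask mark padding.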

In [None]:
print("Encoding text splits using BERT input encoder...")
splits = encode_splits(df_mapped, train_idx, val_idx, test_idx, cfg)

for split_name, tensors in splits.items():
    print(f"\nSplit: {split_name}")
    for key, tensor in tensors.items():
        print(f"  {key}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")


### Step 7 - Build PyTorch Datasets and DataLoaders

PyTorch's `DataLoader`:
- Handles batching (groups of examples processed together).
- Optionally shuffles data (for training).

We:
- Wrap each split (train, validation, test) in a `TensorDictDataset`.
- Build `DataLoader`s with a batch size from `cfg.batch_size`.

We print:
- Number of batches for each split.
- The shape of one batch from the training loader.
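
The number of batches follows directly from the split size and the batch size. The plain-Python sketch below mimics batching without PyTorch; `iter_batches` is a hypothetical helper and the 70-row split is made up. With PyTorch's default `drop_last=False`, a `DataLoader` likewise keeps the final, smaller batch.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


rows = list(range(70))  # stand-in for 70 encoded examples
batches = list(iter_batches(rows, batch_size=32))

print(len(batches), [len(b) for b in batches])  # 3 [32, 32, 6]
print(math.ceil(len(rows) / 32))                # 3
```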

In [None]:
print("Building PyTorch DataLoaders...")
loaders = build_dataloaders(splits, cfg)

for name, loader in loaders.items():
    num_batches = len(loader)
    print(f"{name.capitalize()} loader: {num_batches} batch(es) with batch_size ~ {loader.batch_size}")

# Peek at one training batch to understand the tensor shapes
train_loader = loaders["train"]
first_batch = next(iter(train_loader))

input_ids_batch, token_type_ids_batch, attention_mask_batch, labels_batch = first_batch

print("\nExample TRAIN batch shapes:")
print(f"  input_ids:      {tuple(input_ids_batch.shape)}")
print(f"  token_type_ids: {tuple(token_type_ids_batch.shape)}")
print(f"  attention_mask: {tuple(attention_mask_batch.shape)}")
print(f"  labels:         {tuple(labels_batch.shape)}")


### Step 8 - Load BERT and Build the Classifier

Now we:
- Load the base BERT encoder from your local assets:
    - Reads `config.json` and `pytorch_model.bin`.
- Wrap it in a `BertTextClassifier`, which:
    - Uses either the `[CLS]` token or mean pooling (here: `cfg.pooling`).
    - Adds a classifier head on top to predict your label classes. 
- Apply your fine-tuning policy (e.g., train only the last N transformer layers).

We print:
- Number of labels.
- Model device (CPU or GPU).
- Pooling strategy.
- Fine-tuning policy.
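
The difference between the two pooling strategies can be shown on a single toy sequence. This is a plain-Python sketch with made-up numbers; the functions `cls_pool` and `mean_pool` are hypothetical, and a real model would apply the same idea to batched tensors.

```python
def cls_pool(hidden_states):
    """Use the vector at position 0, where the [CLS] token sits."""
    return hidden_states[0]


def mean_pool(hidden_states, attention_mask):
    """Average the token vectors, skipping padded positions."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    return [sum(dim) / len(kept) for dim in zip(*kept)]


# One sequence of 4 token vectors with hidden size 2; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

print(cls_pool(hidden))         # [1.0, 2.0]
print(mean_pool(hidden, mask))  # [3.0, 4.0]
```

Note how the attention mask keeps the padded vector `[100.0, 100.0]` out of the average; without it, padding would distort the pooled representation.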

In [None]:
num_labels = len(label_to_id)
print(f"Building BERT text classifier for {num_labels} label(s)...")

model = build_bert_text_classifier_from_assets(cfg, num_labels=num_labels)

print("\n✅ Model built.")
print(f"Pooling strategy: {model.pooling}")
print(f"Model device: {next(model.parameters()).device}")
print(f"Fine-tune policy: {cfg.finetune_policy}, last_n={cfg.finetune_last_n}, train_embeddings={cfg.train_embeddings}")


### Step 9 - Train the Classifier

**Time to train!**

`train_text_classifier` will:
- Build an optimizer (AdamW) using only trainable parameters.
- Loop over epochs:
    - Train on the training loader.
    - Evaluate on the validation loader.
- Track:
    - Training/validation loss.
    - Training/validation accuracy.
- Keep the best model state (based on validation accuracy). 

We print:
- The per-epoch metrics (the function already does this).
- The final training history in a DataFrame.
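
Keeping the best model state amounts to remembering a copy of the weights from the epoch with the highest validation accuracy. The sketch below shows that bookkeeping in isolation; `track_best` is a hypothetical helper, the accuracies are made-up numbers, and the small dictionaries stand in for real model weights.

```python
import copy


def track_best(epoch_results):
    """Return the best validation accuracy and a copy of the matching state."""
    best_acc, best_state = float("-inf"), None
    for result in epoch_results:
        if result["val_acc"] > best_acc:
            best_acc = result["val_acc"]
            # Copy so later training updates cannot overwrite the snapshot.
            best_state = copy.deepcopy(result["state"])
    return best_acc, best_state


# Made-up numbers purely for illustration; "state" stands in for model weights.
demo_epochs = [
    {"epoch": 1, "val_acc": 0.71, "state": {"w": 0.1}},
    {"epoch": 2, "val_acc": 0.78, "state": {"w": 0.2}},
    {"epoch": 3, "val_acc": 0.75, "state": {"w": 0.3}},
]
best_acc, best_weights = track_best(demo_epochs)
print(best_acc, best_weights)  # 0.78 {'w': 0.2}
```

In this example the last epoch is not the best one, which is why saving the best state rather than the final state can matter.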

In [None]:
print("Starting training...")
history, best_state = train_text_classifier(model, loaders, cfg)

print("\n✅ Training complete.")

# Convert history to a DataFrame for easier viewing
history_df = pd.DataFrame(history)
print("\nTraining history:")
display(history_df)


### Step 10 - Save the Fine-tuned Model and Metadata 

After training, we want to:
- Save:
    - The full classifier (encoder + head).
    - The fine-tuned encoder weights in BERT format.
    - A small JSON metadata file with label mappings and training settings. 

This makes it easy to:
- Reload the model later.
- Run inference in another script or environment.
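
One detail worth knowing when reloading label mappings from JSON: object keys are always stored as strings, so an integer-keyed `id_to_label` needs its keys converted back. The sketch below demonstrates this with the standard library; the file name `metadata.json`, the metadata fields, and the labels are assumptions for illustration and may not match what `save_finetuned_classifier` writes.

```python
import json
import tempfile
from pathlib import Path

demo_id_to_label = {0: "Construction", 1: "Design"}  # made-up mapping
metadata = {"id_to_label": demo_id_to_label, "epochs": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = Path(tmp_dir) / "metadata.json"  # hypothetical file name
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    loaded = json.loads(meta_path.read_text(encoding="utf-8"))

# JSON object keys are always strings, so convert them back to integers.
restored = {int(idx): label for idx, label in loaded["id_to_label"].items()}

print(loaded["id_to_label"])  # {'0': 'Construction', '1': 'Design'}
print(restored)               # {0: 'Construction', 1: 'Design'}
```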

In [None]:
print("Saving fine-tuned classifier and encoder weights...")
save_finetuned_classifier(model, best_state, cfg, label_to_id, id_to_label)

print("\n✅ Fine-tuned model and metadata saved to:")
print(cfg.output_dir.resolve())


### Step 11 - Evaluate on Test Set and Inspect Predictions

Now we evaluate on the test split:
- Compute:
    - Test loss.
    - Test accuracy.
- Collect predictions:
    - True label IDs.
    - Predicted label IDs.
    - Confidence scores (max softmax probability).

Then: 
- Merge predictions back with the original test rows.
- Save the combined table as a CSV (so you can inspect errors, etc.).
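
The confidence score is the largest softmax probability across classes. As a small worked sketch (with made-up logits and hypothetical helper names), the computation looks like this:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits):
    """Return (predicted class ID, max softmax probability)."""
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    return pred_id, probs[pred_id]


pred_id, confidence = predict_with_confidence([2.0, 0.5, -1.0])  # made-up logits
print(pred_id, round(confidence, 3))  # 0 0.786
```

A high confidence does not guarantee a correct prediction, so low-confidence rows and confidently wrong rows are both worth inspecting in the exported CSV.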

In [None]:
print("Evaluating model on TEST split...")
test_metrics, test_preds = evaluate_on_split(model, splits["test"], cfg)

print("\n✅ Test evaluation complete.")
print("Test metrics:")
for k, v in test_metrics.items():
    print(f"  {k}: {v:.4f}")

print("\nFirst 10 prediction rows (label_id, pred_label_id, pred_confidence):")
display(test_preds.head(10))

# Align test raw rows with test_idx
test_raw_df = df_mapped.iloc[test_idx]

print("\nExporting test predictions merged with raw data to CSV...")
csv_path = export_predictions_csv(test_preds, test_raw_df, id_to_label, cfg, split_name="test")

print("\n✅ Test predictions saved to CSV:")
print(csv_path.resolve())

### Step 12 - Optional Quick Sanity Check on the Output CSV

Finally, we can quickly reload the exported CSV and:
- Check a few rows.
- Look at the distribution of predicted labels.

This helps sanity-check that:
- The file wrote correctly.
- The predictions look reasonable.
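
Beyond label counts, it can help to see which classes get confused with each other. The sketch below counts the most frequent (true, predicted) pairs among the errors. The helper `top_confusions`, the column names `label` and `pred_label`, and the rows are assumptions for illustration; check the actual column names in your exported CSV first.

```python
from collections import Counter


def top_confusions(rows, true_col="label", pred_col="pred_label", k=3):
    """Count the most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (row[true_col], row[pred_col])
        for row in rows
        if row[true_col] != row[pred_col]
    )
    return errors.most_common(k)


# Made-up rows; each dict mimics one row of a predictions table.
demo_rows = [
    {"label": "Design", "pred_label": "Design"},
    {"label": "Design", "pred_label": "Construction"},
    {"label": "Design", "pred_label": "Construction"},
    {"label": "Procurement", "pred_label": "Design"},
]
confusions = top_confusions(demo_rows)
print(confusions)
# [(('Design', 'Construction'), 2), (('Procurement', 'Design'), 1)]
```

With pandas, `preds_loaded.to_dict("records")` produces rows in the dictionary form this function expects.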

In [None]:
print("Reloading exported predictions for sanity check...")
preds_loaded = pd.read_csv(csv_path)

print(f"\nLoaded {len(preds_loaded)} rows from predictions CSV.")
print("Columns in predictions CSV:")
print(preds_loaded.columns.tolist())

print("\nFirst 5 rows from predictions CSV:")
display(preds_loaded.head())

print("\nPredicted label counts:")
if "pred_label" in preds_loaded.columns:
    display(preds_loaded["pred_label"].value_counts())
elif "pred_label_id" in preds_loaded.columns:
    display(preds_loaded["pred_label_id"].value_counts())

