# CBDC-Discourse Model

This section documents the full training and evaluation pipeline used to fine-tune a BERT-based model on the **CBDC Classification Dataset**. The objective is to classify sentences into three conceptual categories: **Process**, **Feature**, and **Risk-Benefit**. For reproducibility, the pipeline uses stratified data splits, fixed random seeds, and checkpoint selection on a held-out validation set.

---

## Environment and Dependencies

All experiments were executed on **Google Colab** with a **Tesla T4 GPU**. The following Python packages were installed:

```bash
pip install -U transformers datasets accelerate evaluate scikit-learn
```

---

## Data Preparation

* **Dataset:** `cbdc_classification_training.csv`
* **Columns retained:** `row_id`, `url`, `sentence`, `label`
* Sentences were stripped of leading/trailing whitespace and filtered for non-null values.
* Labels were normalized and mapped to integer IDs:

| Label        | ID |
| ------------ | -- |
| Feature      | 0  |
| Process      | 1  |
| Risk-Benefit | 2  |

* **Splits:** Stratified into **80% training**, **10% validation**, and **10% test**.
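
The label table above is what results from sorting the distinct label names alphabetically and numbering them in order. A minimal sketch of that mapping, assuming a small hypothetical list of raw labels (`raw_labels` is illustrative and not taken from the dataset):

```python
# Hypothetical raw labels with stray whitespace, standing in for the CSV column.
raw_labels = [" Process", "Feature ", "Risk-Benefit", "Process", "Feature"]

# Normalize: strip surrounding whitespace.
clean_labels = [label.strip() for label in raw_labels]

# Sort the distinct names and assign consecutive integer IDs.
label_names = sorted(set(clean_labels))
label2id = {name: idx for idx, name in enumerate(label_names)}
id2label = {idx: name for name, idx in label2id.items()}

# Encode every sentence's label as its integer ID.
label_ids = [label2id[name] for name in clean_labels]
print(label2id)
print(label_ids)
```

Deriving the IDs from a sorted list keeps the mapping stable across runs, as long as the set of label names does not change.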

---

## Class Balancing

To avoid bias toward the majority class, the training set was **fully downsampled** so that each label has as many examples as the smallest class. The validation and test sets were left at their original class distribution.
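
The idea behind the downsampling step can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's pandas implementation: the helper `downsample_to_minority` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict

def downsample_to_minority(rows, seed=42):
    """Return a shuffled copy of rows in which every label keeps only as many
    examples as the rarest label has. Each row is a (sentence, label) pair."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Draw without replacement; the smallest group keeps all of its rows.
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): 6 Process, 3 Feature, 2 Risk-Benefit sentences.
toy_rows = (
    [(f"process sentence {i}", "Process") for i in range(6)]
    + [(f"feature sentence {i}", "Feature") for i in range(3)]
    + [(f"risk sentence {i}", "Risk-Benefit") for i in range(2)]
)
balanced_rows = downsample_to_minority(toy_rows)
print(Counter(label for _, label in balanced_rows))
```

Downsampling discards majority-class examples, so it trades training data for balance. Class-weighted loss is a common alternative that keeps every example.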

---

## Model and Tokenization

* **Base Model:** [`bilalzafar/CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT), a domain-adapted BERT variant specialized on central banking text.
* **Tokenizer:** WordPiece (Hugging Face AutoTokenizer).
* **Maximum sequence length:** 256 tokens (dynamic padding used).
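
Dynamic padding means each batch is padded only to the length of its own longest sequence, not to the 256-token maximum. A rough sketch of the mechanism, assuming a padding ID of 0 and hypothetical token ID lists (a real tokenizer also preserves special tokens when truncating, which this sketch ignores):

```python
MAX_LEN = 256   # maximum sequence length from the settings above
PAD_ID = 0      # assumed padding token ID, for illustration only

def pad_batch(batch, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate each sequence to max_len, then pad to the longest sequence in this batch."""
    truncated = [seq[:max_len] for seq in batch]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Hypothetical token ID sequences of length 5, 9, and 300.
short_batch = [list(range(1, 6)), list(range(1, 10))]
long_batch = [list(range(1, 6)), list(range(1, 301))]

ids_short, mask_short = pad_batch(short_batch)
ids_long, _ = pad_batch(long_batch)
print(len(ids_short[0]), len(ids_long[0]))
```

A batch of short sentences is padded to 9 tokens here instead of 256, which is where the speed-up for sentence-level data comes from.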

---

## Training Procedure

* **Framework:** Hugging Face `Trainer` API, subclassed as a custom `WeightedTrainer`. The loss is standard cross-entropy, left unweighted because the training set is balanced.
* **Hyperparameters:**

| Parameter                     | Value    |
| ----------------------------- | -------- |
| Epochs                        | 6        |
| Train batch size (per device) | 8        |
| Eval batch size (per device)  | 16       |
| Gradient accumulation steps   | 2        |
| Effective batch size          | 16       |
| Learning rate                 | 2e-5     |
| Weight decay                  | 0.01     |
| Warmup ratio                  | 0.06     |
| Scheduler                     | Cosine   |
| Mixed precision (fp16)        | Enabled  |
| Early stopping patience       | 2 epochs |

* **Evaluation Metric:** Macro-F1 (monitored for early stopping and best checkpoint selection).
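
The hyperparameters above determine the number of optimizer steps and the shape of the learning-rate curve. The sketch below works through that arithmetic for the 2,886 balanced training examples reported in the run log. The rounding conventions are assumptions about how the `Trainer` counts steps, and `lr_at` is an illustrative helper, not a library function.

```python
import math

TRAIN_EXAMPLES = 2886      # balanced training set size from the run log
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 2
EPOCHS = 6
BASE_LR = 2e-5
WARMUP_RATIO = 0.06

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

# Assumption: one GPU, the last partial batch is kept, and a leftover
# micro-batch at the end of an epoch still triggers an optimizer step.
micro_batches_per_epoch = math.ceil(TRAIN_EXAMPLES / PER_DEVICE_BATCH)
steps_per_epoch = math.ceil(micro_batches_per_epoch / GRAD_ACCUM_STEPS)
total_steps = steps_per_epoch * EPOCHS
# Assumption: the warmup length is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def lr_at(step):
    """Linear warmup to BASE_LR, then half-cosine decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions, `lr_at(499)` comes out at roughly 1.235e-05, which is consistent with the learning rate recorded in the training log. Early stopping can end the run before `total_steps` is reached, in which case the tail of the cosine curve is never used.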

---

## Results

### Validation Set

* **Accuracy:** \~0.851
* **Macro-F1:** \~0.839
* **Weighted-F1:** \~0.852

### Test Set

* **Accuracy:** \~0.823
* **Macro-F1:** \~0.803
* **Weighted-F1:** \~0.825

**Per-class performance (test):**

| Class        | Precision | Recall | F1    |
| ------------ | --------- | ------ | ----- |
| Feature      | 0.759     | 0.782  | 0.770 |
| Process      | 0.927     | 0.845  | 0.884 |
| Risk-Benefit | 0.700     | 0.817  | 0.754 |

The test-set **confusion matrix** was exported to `confusion_matrix.csv` for transparency.
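
The aggregate test scores can be cross-checked against the per-class table. The sketch below recomputes macro-F1 (unweighted mean) and weighted-F1 (mean weighted by class size) from the rounded per-class values, using the test-set class counts printed in the run log. Because the inputs are rounded, agreement is only expected to about three decimals.

```python
# Per-class test F1 from the table above, with test-set class counts from the run log.
per_class_f1 = {"Feature": 0.770, "Process": 0.884, "Risk-Benefit": 0.754}
support = {"Feature": 133, "Process": 271, "Risk-Benefit": 120}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Macro-F1 treats every class equally; weighted-F1 scales each class by its size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / sum(support.values())

print(f"macro-F1    = {macro_f1:.3f}")
print(f"weighted-F1 = {weighted_f1:.3f}")
print(f"Process F1  = {f1_from(0.927, 0.845):.3f}")
```

Weighted-F1 sits above macro-F1 because Process, the largest class in the test set, is also the best-classified one.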

---

## Reproducibility

* Random seeds were fixed to `42` across Python, NumPy, and PyTorch, and the same seed was passed to the data splits and to `TrainingArguments`.
* Code tested with PyTorch **2.8.0+cu126**.
* All scripts are available in the accompanying Jupyter notebook.
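
Seeding can be wrapped in a single helper so that every entry point uses the same value. A minimal sketch, assuming a hypothetical helper named `seed_everything` that skips NumPy or PyTorch when they are not installed:

```python
import random

def seed_everything(seed=42):
    """Seed Python's RNG and, when they are installed, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass

# Re-seeding reproduces the same draw from Python's generator.
seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
print(first_draw == second_draw)
```

Fixed seeds make runs repeatable in principle, but some GPU operations are not deterministic, so small differences between runs on different hardware or library versions remain possible.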

---

## Remarks

This pipeline evaluates **CBDC-related textual classification** on stratified validation and test splits that keep the original class distribution, while training on a balanced subset. On the test set the model reaches a macro-F1 of about 0.80, with Process the strongest class (F1 0.884) and Risk-Benefit the weakest (F1 0.754). The same framework could be extended to other central banking corpora or adapted for cross-lingual analysis.


In [None]:
# ====================== Colab setup ======================
# !pip -q install -U transformers datasets accelerate evaluate scikit-learn

import os
import json
import random

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from transformers import (
    AutoTokenizer, AutoConfig, AutoModelForSequenceClassification,
    DataCollatorWithPadding, TrainingArguments, EarlyStoppingCallback
)
from datasets import Dataset, DatasetDict

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# ====================== Paths & I/O ======================
# --- Mount Google Drive ---
from google.colab import drive
drive.mount('/content/drive')
BASE_DIR   = "/content/drive/MyDrive/cbdc-classification"
DATA_PATH  = os.path.join(BASE_DIR, "cbdc_classification_training.csv")
OUTPUT_DIR = os.path.join(BASE_DIR, "cbbert_cls_out_balanced")
os.makedirs(OUTPUT_DIR, exist_ok=True)

# ====================== Load & inspect ======================
df = pd.read_csv(DATA_PATH)
df = df.rename(columns={'#': 'row_id'})
df = df[['row_id', 'url', 'sentence', 'label']].copy()
# Drop missing values first: astype(str) would turn NaN into the literal string "nan"
df = df.dropna(subset=['sentence','label']).reset_index(drop=True)
df['sentence'] = df['sentence'].astype(str).str.strip()
df['label']    = df['label'].astype(str).str.strip()

print("=== Sentence Count per Label (full data) ===")
print(df['label'].value_counts())
print("\n=== Percentage Distribution (full data) ===")
print((df['label'].value_counts(normalize=True) * 100).round(2))

# ====================== Label encoding ======================
labels   = sorted(df['label'].unique())
label2id = {l:i for i,l in enumerate(labels)}
id2label = {i:l for l,i in label2id.items()}
df['label_id'] = df['label'].map(label2id)
print("\nLabel map:", label2id)

# ====================== Stratified splits ======================
train_df, temp_df = train_test_split(
    df, test_size=0.20, random_state=SEED, stratify=df['label_id']
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=SEED, stratify=temp_df['label_id']
)

def show_split(name, sdf):
    print(f"\n{name} size: {len(sdf)}")
    print(sdf['label'].value_counts())

show_split("Train (before balance)", train_df)
show_split("Val", val_df)
show_split("Test", test_df)

# ====================== FULL BALANCE the TRAIN split ======================
def full_downsample_balance(train_df, label_col="label", seed=SEED):
    counts = train_df[label_col].value_counts()
    target = counts.min()  # minority count
    parts = []
    for lab, grp in train_df.groupby(label_col, sort=False):
        if len(grp) > target:
            parts.append(grp.sample(n=target, random_state=seed))
        else:
            parts.append(grp)
    out = pd.concat(parts, axis=0).sample(frac=1.0, random_state=seed).reset_index(drop=True)
    print("\n[Balance] Train before:\n", counts.to_string())
    print("\n[Balance] Train after:\n", out[label_col].value_counts().to_string())
    return out

train_df_bal = full_downsample_balance(train_df, label_col="label", seed=SEED)

# ====================== HF Datasets ======================
# Map labels to ids
train_df_bal = train_df_bal.assign(label_id=train_df_bal['label'].map(label2id))
val_df       = val_df.assign(label_id=val_df['label'].map(label2id))
test_df      = test_df.assign(label_id=test_df['label'].map(label2id))

MODEL_NAME = "bilalzafar/CentralBank-BERT"
MAX_LEN = 256
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tokenize_batch(batch):
    return tokenizer(
        batch["sentence"],
        truncation=True,
        padding=False,
        max_length=MAX_LEN
    )

ds = DatasetDict({
    "train": Dataset.from_pandas(train_df_bal[['sentence','label_id']].rename(columns={'label_id':'labels'}), preserve_index=False),
    "val":   Dataset.from_pandas(val_df[['sentence','label_id']].rename(columns={'label_id':'labels'}), preserve_index=False),
    "test":  Dataset.from_pandas(test_df[['sentence','label_id']].rename(columns={'label_id':'labels'}), preserve_index=False),
})
ds = ds.map(tokenize_batch, batched=True, remove_columns=['sentence'])

collator = DataCollatorWithPadding(tokenizer)

# ====================== Model ======================
from transformers import Trainer
import torch.nn as nn

config = AutoConfig.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=config)

class WeightedTrainer(Trainer):
    def __init__(self, class_weights=None, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights  # keep param for API symmetry
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.get("labels")
        outputs = model(**{k:v for k,v in inputs.items() if k != "labels"})
        logits = outputs.get("logits")
        # Since train is fully balanced, use unweighted CE
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss

# ====================== Metrics ======================
def compute_metrics(eval_pred):
    preds, labels_np = eval_pred
    preds = np.argmax(preds, axis=1)
    acc = accuracy_score(labels_np, preds)
    f1_macro = f1_score(labels_np, preds, average="macro")
    f1_weighted = f1_score(labels_np, preds, average="weighted")
    return {"accuracy": acc, "f1_macro": f1_macro, "f1_weighted": f1_weighted}

# ====================== Training arguments ======================
args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    num_train_epochs=6,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.06,
    lr_scheduler_type="cosine",
    fp16=torch.cuda.is_available(),
    log_level="error",
    report_to="none",
    save_total_limit=2,
    dataloader_num_workers=2,
    seed=SEED,
)

callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]

trainer = WeightedTrainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["val"],
    processing_class=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
    callbacks=callbacks,
    class_weights=None,           # fully balanced => no class weights
)

# ====================== Train ======================
train_out = trainer.train()
print("\nBest checkpoint:", trainer.state.best_model_checkpoint)

# ====================== Evaluate ======================
print("\n=== Validation ===")
val_metrics = trainer.evaluate(eval_dataset=ds["val"])
print(val_metrics)

print("\n=== Test ===")
test_metrics = trainer.evaluate(eval_dataset=ds["test"])
print(test_metrics)

# Detailed test report
preds_logits = trainer.predict(ds["test"]).predictions
test_pred_ids = preds_logits.argmax(axis=1)
y_true = test_df['label_id'].to_numpy()

print("\n--- Classification report (test) ---")
print(classification_report(y_true, test_pred_ids, target_names=labels, digits=4))

# Confusion matrix
cm = confusion_matrix(y_true, test_pred_ids, labels=list(range(len(labels))))
cm_df = pd.DataFrame(cm, index=[f"true_{l}" for l in labels], columns=[f"pred_{l}" for l in labels])
cm_path = os.path.join(OUTPUT_DIR, "confusion_matrix.csv")
cm_df.to_csv(cm_path, index=True)
print(f"\nConfusion matrix saved to: {cm_path}")

# ====================== Save best model & mappings ======================
save_dir = os.path.join(OUTPUT_DIR, "best_model_balanced")
os.makedirs(save_dir, exist_ok=True)
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

with open(os.path.join(save_dir, "label_mapping.json"), "w") as f:
    json.dump({"label2id": label2id, "id2label": id2label}, f, indent=2)

with open(os.path.join(OUTPUT_DIR, "metrics_summary.json"), "w") as f:
    json.dump(
        {"val": val_metrics, "test": test_metrics, "best_checkpoint": trainer.state.best_model_checkpoint},
        f, indent=2
    )

print("\nSaved model to:", save_dir)



Torch: 2.8.0+cu126 | CUDA available: True
GPU: Tesla T4
=== Sentence Count per Label (full data) ===
label
Process         2706
Feature         1323
Risk-Benefit    1202
Name: count, dtype: int64

=== Percentage Distribution (full data) ===
label
Process         51.73
Feature         25.29
Risk-Benefit    22.98
Name: proportion, dtype: float64

Label map: {'Feature': 0, 'Process': 1, 'Risk-Benefit': 2}

Train (before balance) size: 4184
label
Process         2164
Feature         1058
Risk-Benefit     962
Name: count, dtype: int64

Val size: 523
label
Process         271
Feature         132
Risk-Benefit    120
Name: count, dtype: int64

Test size: 524
label
Process         271
Feature         133
Risk-Benefit    120
Name: count, dtype: int64

[Balance] Train before:
 label
Process         2164
Feature         1058
Risk-Benefit     962

[Balance] Train after:
 label
Risk-Benefit    962
Feature         962
Process         962


{'eval_loss': 0.4360942244529724, 'eval_accuracy': 0.8279158699808795, 'eval_f1_macro': 0.8067051528777398, 'eval_f1_weighted': 0.8285695484600237, 'eval_runtime': 1.1758, 'eval_samples_per_second': 444.789, 'eval_steps_per_second': 28.065, 'epoch': 1.0}
{'eval_loss': 0.426888108253479, 'eval_accuracy': 0.8432122370936902, 'eval_f1_macro': 0.825527753634535, 'eval_f1_weighted': 0.8444083217979258, 'eval_runtime': 1.5439, 'eval_samples_per_second': 338.758, 'eval_steps_per_second': 21.375, 'epoch': 2.0}
{'loss': 0.4339, 'grad_norm': 3.64510440826416, 'learning_rate': 1.2349425326215867e-05, 'epoch': 2.7645429362880884}
{'eval_loss': 0.5811964869499207, 'eval_accuracy': 0.8393881453154876, 'eval_f1_macro': 0.8227211484480778, 'eval_f1_weighted': 0.8427597007900519, 'eval_runtime': 0.889, 'eval_samples_per_second': 588.321, 'eval_steps_per_second': 37.122, 'epoch': 3.0}
{'eval_loss': 0.6854845285415649, 'eval_accuracy': 0.8508604206500956, 'eval_f1_macro': 0.8383143880500755, 'eval_f1_wei

## Inference

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F
import json

# ====================== Paths ======================
MODEL_DIR = "/content/drive/MyDrive/cbdc-classification/cbbert_cls_out_balanced/best_model_balanced"

# ====================== Load model & tokenizer ======================
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

# Load label mapping
with open(f"{MODEL_DIR}/label_mapping.json", "r") as f:
    mapping = json.load(f)
id2label = {int(k): v for k, v in mapping["id2label"].items()}

# ====================== Inference helper ======================
def predict(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        probs = F.softmax(outputs.logits, dim=-1)
        preds = torch.argmax(probs, dim=1)
    for s, p, prob in zip(sentences, preds, probs):
        print(f"\nSentence: {s}")
        print(f"Predicted: {id2label[p.item()]} (score={prob[p].item():.4f})")

# ====================== Example sentences ======================
examples = [
    "The central bank is exploring CBDC for retail use in cross-border payments.",
    "CBDC could introduce risks to financial stability if not carefully designed.",
    "Programmability of CBDC offers many new financial features."
]

predict(examples)


Sentence: The central bank is exploring CBDC for retail use in cross-border payments.
Predicted: Process (score=0.9991)

Sentence: CBDC could introduce risks to financial stability if not carefully designed.
Predicted: Risk-Benefit (score=0.9986)

Sentence: Programmability of CBDC offers many new financial features.
Predicted: Feature (score=0.9991)
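
The `score` printed above is the softmax probability of the predicted class. The sketch below shows that step in isolation with plain Python. The logits are hypothetical, and the label mapping is copied from the label table.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "Feature", 1: "Process", 2: "Risk-Benefit"}  # mapping from the label table
logits = [-2.1, 5.3, -1.7]  # hypothetical logits for one sentence

probs = softmax(logits)
# The predicted class is the index with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 4))
```

A high softmax score reflects a large gap between the top logit and the others. It is not a calibrated estimate of correctness, so scores near 1.0 should be interpreted with that in mind.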
