# DistilBERT-CRF Project Notebook

This notebook consolidates the DistilBERT-CRF baseline results. It walks through data statistics, training diagnostics, evaluation metrics, and error cases so the baseline can be reviewed or presented directly.

## How to Use
- Execute cells sequentially after running `./scripts/train_baseline.sh`.
- Figures are inlined for presentation; CSV sources live under `analysis/figures/`.
- The notebook mirrors Milestone 1 deliverables from `plan.md`.


## Agenda
1. Dataset exploration (entity stats, length distribution).
2. Training recap (loss/F1 curves, hyperparameters).
3. Evaluation visualizations (confusion, span analysis).
4. Error analysis & case studies.

In [1]:
import sys, pathlib
PROJECT_ROOT = pathlib.Path('..').resolve()
SRC_DIR = PROJECT_ROOT / 'src'
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))
print('Project root:', PROJECT_ROOT)
print('Python:', sys.version)

Project root: /Users/mac/studyspace/CityU/CS5489 Machine Learning/Project/ner-extractor/DistilBERT-CRF
Python: 3.13.7 (main, Aug 17 2025, 15:48:18) [Clang 17.0.0 (clang-1700.0.13.5)]


## 1. Dataset Overview

In [2]:
from data_module import load_processed_conll, collect_unique_labels
import pandas as pd
processed_dir = PROJECT_ROOT / 'data' / 'processed' / 'conll03'
splits = load_processed_conll(processed_dir)
label_info = collect_unique_labels(splits['train'])
print('Splits:', {k: len(v) for k, v in splits.items()})
print('Labels:', label_info.labels)

# Sentence length distribution
lengths = {split: [len(sentence.tokens) for sentence in sentences] for split, sentences in splits.items()}
length_df = pd.DataFrame({name: pd.Series(vals) for name, vals in lengths.items()})
length_df.describe()

Splits: {'train': 13832, 'validation': 3459, 'test': 3453}
Labels: ['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']


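| Statistic | train | validation | test |
| --- | --- | --- | --- |
| count | 13832.0 | 3459.0 | 3453.0 |
| mean | 14.73489 | 14.793293 | 13.447727 |
| std | 11.807186 | 11.812312 | 11.552513 |
| min | 1.0 | 1.0 | 1.0 |
| 25% | 6.0 | 6.0 | 5.0 |
| 50% | 10.0 | 10.0 | 9.0 |
| 75% | 23.0 | 23.0 | 20.0 |
| max | 113.0 | 62.0 | 124.0 |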


### Dataset Notes
- Processed splits reside in `data/processed/conll03/`.
- The table above summarises sentence length statistics per split; refer to `analysis/figures/sentence_length_distribution.png` for the corresponding box plot.
- Entity frequencies are stored in `analysis/figures/entity_frequency.csv` and visualised in `analysis/figures/entity_frequency.png`.
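
As a rough illustration of how entity frequencies can be derived from the BIO labels listed above, the sketch below counts one mention per `B-` tag. The function name `count_entities` and the toy label sequences are hypothetical, and the sketch assumes every entity starts with a `B-` tag; the script that actually produced `entity_frequency.csv` is not shown in this notebook and may differ.

```python
from collections import Counter


def count_entities(label_sequences):
    """Count entity mentions per type, one per B- tag (assumes BIO tagging)."""
    counts = Counter()
    for labels in label_sequences:
        for label in labels:
            if label.startswith("B-"):
                counts[label[2:]] += 1  # strip the "B-" prefix to get the type
    return counts


# Toy label sequences for illustration only (not taken from the dataset files).
toy = [
    ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"],
    ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"],
]
print(dict(count_entities(toy)))
```

Counting `B-` tags rather than all non-`O` tags avoids inflating the frequency of multi-token entities.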


![Sentence Length](../analysis/figures/sentence_length_distribution.png)
*Figure: Sentence length distribution across train/validation/test.*

![Entity Frequency](../analysis/figures/entity_frequency.png)
*Figure: Entity frequency aggregated across all splits.*


## 2. Training Recap

In [3]:
import re
import pandas as pd
log_path = PROJECT_ROOT / 'training_logs' / 'distilbert_crf_full.log'
print('Log file:', log_path)

train_pattern = re.compile(r"step=(\d+) loss=([0-9.]+) lr=([0-9.e-]+)")
eval_pattern = re.compile(r"validation metrics \| loss=([0-9.]+) precision=([0-9.]+) recall=([0-9.]+) f1=([0-9.]+) accuracy=([0-9.]+)")
train_rows, eval_rows = [], []
with log_path.open() as fh:
    for line in fh:
        if ' step=' in line and ' lr=' in line:
            match = train_pattern.search(line)
            if match:
                step, loss, lr = match.groups()
                train_rows.append({'step': int(step), 'loss': float(loss), 'lr': float(lr)})
        elif 'validation metrics' in line:
            match = eval_pattern.search(line)
            if match:
                loss, precision, recall, f1, acc = match.groups()
                eval_rows.append({
                    'loss': float(loss),
                    'precision': float(precision),
                    'recall': float(recall),
                    'f1': float(f1),
                    'accuracy': float(acc),
                })
train_df = pd.DataFrame(train_rows)
eval_df = pd.DataFrame(eval_rows)
display(train_df.head())
display(eval_df.head())

results_csv = PROJECT_ROOT / 'results_summary.csv'
display(pd.read_csv(results_csv))

Log file: /Users/mac/studyspace/CityU/CS5489 Machine Learning/Project/ner-extractor/DistilBERT-CRF/training_logs/distilbert_crf_full.log


| | step | loss | lr |
| --- | --- | --- | --- |
| 0 | 50 | 48.36531 | 2e-06 |
| 1 | 100 | 41.90435 | 3e-06 |
| 2 | 150 | 31.60964 | 5e-06 |
| 3 | 200 | 25.33014 | 7e-06 |
| 4 | 250 | 21.12522 | 9e-06 |


| | loss | precision | recall | f1 | accuracy |
| --- | --- | --- | --- | --- | --- |
| 0 | 4.93605 | 0.2761 | 0.3721 | 0.317 | 0.897 |
| 1 | 1.78655 | 0.7581 | 0.7961 | 0.7767 | 0.9645 |
| 2 | 1.20953 | 0.8466 | 0.855 | 0.8508 | 0.9762 |
| 3 | 1.00575 | 0.8977 | 0.8744 | 0.8859 | 0.9804 |
| 4 | 0.92708 | 0.8908 | 0.8918 | 0.8913 | 0.982 |


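| | run_name | split | precision | recall | f1 | accuracy | loss | best_step |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | distilbert_crf_full | validation | 0.9414 | 0.9463 | 0.9438 | NaN | NaN | best |
| 1 | distilbert_crf_full | test | 0.8919 | 0.9 | 0.8959 | 0.9794 | 1.8198 | best |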


### Training Log Highlights
- Raw logs: `training_logs/distilbert_crf_full.log`.
- `analysis/scripts/plot_metrics.py` parses the same log to produce `training_loss_curve.png` and `validation_metrics_curve.png`.
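- The first two DataFrames above show the first five logged training steps and the first five validation checkpoints as a quick sanity check of convergence: training loss falls from 48.4 to 21.1 between steps 50 and 250, and validation F1 rises from 0.317 to 0.891. The third DataFrame loads `results_summary.csv`.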
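
To make the expected log format concrete, the sketch below applies the same training-step regular expression to a single line. The sample line is an assumption reconstructed from the pattern and the first displayed DataFrame row; the real log may carry different prefixes (timestamps, logger names), and `parse_train_line` is a hypothetical helper name.

```python
import re

# Same pattern as in the parsing cell above.
TRAIN_PATTERN = re.compile(r"step=(\d+) loss=([0-9.]+) lr=([0-9.e-]+)")


def parse_train_line(line):
    """Return step/loss/lr as a dict, or None if the line does not match."""
    match = TRAIN_PATTERN.search(line)
    if match is None:
        return None
    step, loss, lr = match.groups()
    return {"step": int(step), "loss": float(loss), "lr": float(lr)}


# Hypothetical log line; only the step/loss/lr values come from the notebook output.
sample = "INFO trainer | step=50 loss=48.36531 lr=2e-06"
print(parse_train_line(sample))
print(parse_train_line("validation metrics | loss=4.93605"))
```

Returning `None` for non-matching lines keeps malformed or unrelated log entries out of the DataFrame instead of raising mid-parse.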


![Training Loss](../analysis/figures/training_loss_curve.png)
*Figure: Training loss vs. global step.*

![Validation Metrics](../analysis/figures/validation_metrics_curve.png)
*Figure: Validation precision/recall/F1 progression.*


## 3. Evaluation & Visuals

In [4]:
import torch
from safetensors.torch import load_file
from modeling import DistilBertCrfConfig, DistilBertCrfForTokenClassification
from tokenization import prepare_tokenizer
from data_module import create_dataloaders
from metrics import compute_ner_metrics

checkpoint_dir = PROJECT_ROOT / 'models' / 'distilbert_crf' / 'distilbert_crf_full' / 'best'
state_dict = load_file(checkpoint_dir / 'model.safetensors')
print('Loaded weight tensors:', len(state_dict))

config = DistilBertCrfConfig(
    pretrained_model_name=str((PROJECT_ROOT / 'models' / 'hf_cache' / 'distilbert-base-cased').resolve()),
    num_labels=len(label_info.labels),
    dropout=0.1,
    crf_dropout=0.0,
    pad_label_id=label_info.label_to_id.get('O', 0),
)
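model = DistilBertCrfForTokenClassification(config)
# strict=False tolerates key mismatches between the checkpoint and the model;
# inspect the returned missing/unexpected keys if the metrics below look off.
model.load_state_dict(state_dict, strict=False)
model.eval()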

tokenizer = prepare_tokenizer(config.pretrained_model_name, max_length=256)
dataloaders, _ = create_dataloaders(
    processed_dir=processed_dir,
    tokenizer=tokenizer,
    max_length=256,
    batch_size=16,
    eval_batch_size=32,
    label_all_tokens=False,
)
trainer_eval = dataloaders['test']
preds, refs = [], []
with torch.no_grad():
    for batch in trainer_eval:
        labels = batch.pop('labels')
        batch.pop('sentence_index', None)
        outputs = model(return_predictions=True, **batch)
        preds.extend(outputs.predictions)
        refs.extend(labels.tolist())
test_metrics = compute_ner_metrics(preds, refs, {idx: label for label, idx in label_info.label_to_id.items()})
test_metrics

Some weights of the model checkpoint at /Users/mac/studyspace/CityU/CS5489 Machine Learning/Project/ner-extractor/DistilBERT-CRF/models/hf_cache/distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Loaded weight tensors: 105


MetricsResult(precision=np.float64(0.8919108615546587), recall=np.float64(0.8999645892351275), f1=np.float64(0.8959196263329515), accuracy=0.9794336168838161)

### Evaluation Summary
- The metrics object reproduces the baseline test performance (F1≈0.896) recorded in `results_summary.csv`.
- For reporting, refer to `docs/baseline_summary.md`, which consolidates validation/test scores, runtime, and checkpoint paths.
- Use `scripts/eval_baseline.sh` to regenerate the same evaluation run without re-training.
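
As a quick consistency check on the reported numbers, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall printed in the `MetricsResult` output. The helper `f1_score` is hypothetical and does not replicate `compute_ner_metrics`, whose internals are not shown here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the MetricsResult output of the evaluation cell.
precision = 0.8919108615546587
recall = 0.8999645892351275
reported_f1 = 0.8959196263329515

recomputed = f1_score(precision, recall)
print(f"recomputed F1 = {recomputed:.4f}")
print("matches reported F1:", abs(recomputed - reported_f1) < 1e-6)
```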


### Baseline Metrics
| Split | Precision | Recall | F1 | Accuracy | Loss |
| --- | --- | --- | --- | --- | --- |
| Validation | 0.9414 | 0.9463 | 0.9438 | – | – |
| Test | 0.8919 | 0.9000 | 0.8959 | 0.9794 | 1.8198 |

### Commands & Runtime
- Train: `./scripts/train_baseline.sh`
- Evaluate: `./scripts/eval_baseline.sh`
- Runtime: ≈ 4.5 h (CPU)

### Artifacts
- Checkpoint: `models/distilbert_crf/distilbert_crf_full/best/`
- Training log: `training_logs/distilbert_crf_full.log`
- Figures: `analysis/figures/` (inlined in Sections 1 and 2)
- Error cases: see the table in Section 4


## 4. Error Analysis

| Sentence | Gold type | Predicted type | Error type |
| --- | --- | --- | --- |
| EU rejects German call to boycott British lamb . | ORG | LOC | type_confusion |
| He lives in New York . | LOC | O | missed_entity |
| Tony Blair meets IBM executives . | ORG | PER | type_confusion |
| Shares rose in Frankfurt market . | LOC | ORG | type_confusion |
| UN officials visited Baghdad . | LOC | ORG | type_confusion |
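
The error labels in the table can be assigned mechanically from the gold and predicted entity types. The sketch below is one possible rule set, not the project's actual analysis code: `categorize_error` is a hypothetical helper, and the `spurious_entity` category is an assumption added for completeness (it does not appear in the table).

```python
def categorize_error(gold_type, pred_type):
    """Map a (gold, predicted) entity-type pair to an error category."""
    if gold_type == pred_type:
        return "correct"
    if pred_type == "O":
        return "missed_entity"      # gold entity, nothing predicted
    if gold_type == "O":
        return "spurious_entity"    # assumed category, not in the table
    return "type_confusion"         # both are entities, types differ


# (gold, predicted) pairs copied from the table rows, in order.
cases = [("ORG", "LOC"), ("LOC", "O"), ("ORG", "PER"), ("LOC", "ORG"), ("LOC", "ORG")]
for gold, pred in cases:
    print(gold, pred, categorize_error(gold, pred))
```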


---
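**Next Steps**: Implement the Milestone 2 training strategies (differential learning rates, layer-wise learning-rate decay (LLRD), exponential moving average (EMA) of weights, R-Drop, and data augmentation), then extend this notebook with comparative plots once the new experiments are logged.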