# 02 — Model Training

This notebook fine-tunes the **BART-base** model on the *scientific_papers/arxiv* dataset.

We'll:
1. Load the tokenized dataset from cache.
2. Load and configure the BART model.
3. Fine-tune it using Hugging Face's `Seq2SeqTrainer`.
4. Save the trained model checkpoint for evaluation.

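Dataset: *scientific_papers/arxiv* (≈ 203k training samples). This run trains on the first 40,000 training samples and validates on the first 4,000 validation samples.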

In [1]:
import os, sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

import torch
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, EarlyStoppingCallback

from src.model import get_model, get_tokenizer
from src.data_loader import load_or_build_tokenized
from src.seed_utils import set_seed

set_seed(42)
print("Imports successful.")

Imports successful.


## Load tokenized dataset

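We'll use the cached tokenized dataset created in the previous notebook.
If the cache is not found, the dataset is rebuilt automatically.
To keep training time manageable, we then select the first 40,000 training examples and the first 4,000 validation examples.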

This dataset already includes:
- `input_ids`
- `attention_mask`
- `labels` (with pad tokens replaced by -100)
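
The -100 value matters because PyTorch's cross-entropy loss ignores targets equal to -100 by default, so padded label positions contribute nothing to the loss. The sketch below shows the replacement on plain Python lists. `mask_pad_labels` is a hypothetical helper for illustration only (the real logic lives in `src.data_loader`, which is not shown here), and the token ids are made up apart from BART's special tokens.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_pad_labels(label_ids, pad_token_id):
    """Return a copy of label_ids with every pad token replaced by IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


# Hypothetical target padded to length 6.
# In BART's vocabulary: 0 = <s>, 2 = </s>, 1 = <pad>.
padded = [0, 713, 16, 2, 1, 1]
masked = mask_pad_labels(padded, pad_token_id=1)
print(masked)
```

Only the `labels` column is masked this way; `input_ids` keep the real pad id, and `attention_mask` tells the encoder which positions to ignore.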

In [2]:
raw, tok = load_or_build_tokenized(
    dataset_name="scientific_papers",
    subset="arxiv",
    model_name="facebook/bart-base",
    max_input_len=1024,
    max_target_len=256
)

print({split: len(tok[split]) for split in tok.keys()})
train_subset = tok["train"].select(range(40000))
val_subset = tok["validation"].select(range(4000))

print(f"Training samples: {len(train_subset)}")
print(f"Validation samples: {len(val_subset)}")



{'train': 203037, 'validation': 6436, 'test': 6440}
Training samples: 40000
Validation samples: 4000


## Load BART-base model and tokenizer

We’ll initialize from `facebook/bart-base`.  
Later you can experiment with `facebook/bart-large` or `t5-base`.

In [3]:
model_name = "facebook/bart-base"
tokenizer = get_tokenizer(model_name)
model = get_model(model_name)

## Data collator

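`DataCollatorForSeq2Seq` pads each batch dynamically to the length of its longest sequence, rounded up to a multiple of 8 (`pad_to_multiple_of=8`); labels are padded with -100 so the padding is ignored by the loss.  
This avoids padding every example to the 1024-token maximum, which keeps memory usage down, and gives every tensor within a batch the same shape.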
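
As a rough illustration of what dynamic padding does, the sketch below pads a toy batch in plain Python. `pad_batch` is a hypothetical stand-in written for this example, not the collator's actual implementation, and the token ids are invented.

```python
def pad_batch(sequences, pad_value, multiple_of=8):
    """Right-pad every sequence to the batch maximum, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple
    return [seq + [pad_value] * (target - len(seq)) for seq in sequences]


# Toy batch with sequence lengths 4 and 9.
batch = [[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 26, 27, 2]]

input_ids = pad_batch(batch, pad_value=1)   # pad id 1, as in BART's vocabulary
labels = pad_batch(batch, pad_value=-100)   # label padding is ignored by the loss
print([len(row) for row in input_ids])
```

With the batch size of 1 used in this notebook, each batch holds a single sequence, so the only padding added is the rounding up to the next multiple of 8.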

In [4]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, pad_to_multiple_of=8)

## Define training arguments

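We train on the 40,000-sample subset and keep the epoch count and batch size modest so the run stays feasible on limited hardware (the settings are also macOS-friendly: no fp16, no dataloader workers, no pinned memory).
With the arguments below, the `Seq2SeqTrainer` will:
- Evaluate after each epoch.
- Save a checkpoint after each epoch, keeping at most two (`save_total_limit=2`).
- Reload the best checkpoint at the end of training, selected by validation loss (`metric_for_best_model="loss"`), not by ROUGE-L.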

In [5]:
output_dir = "../outputs/model"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="../outputs/logs",
    logging_strategy="epoch",
    predict_with_generate=True,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    report_to="none",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    fp16=False,
    dataloader_num_workers=0,
    dataloader_pin_memory=False
)
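
To make these settings concrete, the sketch below works out the effective batch size, the number of optimizer steps, and the learning rate at a few points of the linear schedule. It assumes a single device and uses two hypothetical helpers, `training_schedule` and `linear_lr`, written for this example; the Trainer's internal rounding may differ slightly in other configurations.

```python
def training_schedule(num_samples, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Derive optimizer-step counts for a single-device run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = num_samples // effective_batch  # exact here: 40000 / 4
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


# Values taken from this notebook: 40,000 samples, batch size 1,
# 4 accumulation steps, 2 epochs, warmup_ratio 0.1, learning rate 5e-5.
effective_batch, total_steps, warmup_steps = training_schedule(40_000, 1, 4, 2, 0.1)
print(effective_batch, total_steps, warmup_steps)

for step in (0, 1_000, 2_000, 11_000, 20_000):
    print(step, linear_lr(step, 5e-5, warmup_steps, total_steps))
```

The 20,000 total steps agree with the `global_step=20000` reported by `trainer.train()` in this run.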

## Initialize Trainer

We’ll use the `Seq2SeqTrainer`, passing in:
- The model
- Tokenizer
- Datasets
- Data collator
- Early stopping callback
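
The patience rule behind early stopping can be sketched in a few lines. `early_stop_epoch` below is a hypothetical, simplified model of the behavior for a lower-is-better metric such as validation loss; it is not the callback's actual code.

```python
def early_stop_epoch(eval_losses, patience=1, threshold=0.0):
    """Return the 1-based epoch at which training would stop early, or None."""
    best = None
    bad_evals = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if best is None or loss < best - threshold:
            best = loss       # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1    # no improvement at this evaluation
            if bad_evals >= patience:
                return epoch
    return None


# Validation losses from this run: the loss improves, so no early stop.
print(early_stop_epoch([2.502567, 2.409363]))

# Hypothetical longer run: the loss gets worse at epoch 3.
print(early_stop_epoch([2.50, 2.41, 2.43, 2.40]))
```

Under this model, with `num_train_epochs=2` and a patience of 1, the earliest possible stop is after epoch 2, which is the last epoch anyway. The callback only starts to matter if the epoch count is raised.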

In [6]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=val_subset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Start training

Per-epoch checkpoints are written under `outputs/model/`, and the best one (lowest validation loss) is reloaded when training ends.

In [7]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,2.8676,2.502567
2,2.4752,2.409363


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=20000, training_loss=2.67137275390625, metrics={'train_runtime': 3969.9789, 'train_samples_per_second': 20.151, 'train_steps_per_second': 5.038, 'total_flos': 4.829588153155584e+16, 'train_loss': 2.67137275390625, 'epoch': 2.0})
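
One way to read the loss values is to convert them to perplexity. The sketch below treats the reported validation loss as an average per-token cross-entropy in nats, which is an approximation, and exponentiates it.

```python
import math

# Validation losses reported in the training table above.
val_losses = {1: 2.502567, 2: 2.409363}

# Perplexity is the exponential of the average cross-entropy.
perplexities = {epoch: math.exp(loss) for epoch, loss in val_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity ≈ {ppl:.1f}")
```

Lower is better. Loss and perplexity measure token-level fit only and say little about summary quality, which is why ROUGE and BERTScore are computed in the next notebook.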

## Save trained model

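Once training completes, we save the model and tokenizer to `outputs/model/`. Because `load_best_model_at_end=True`, the saved weights are those of the best checkpoint (lowest validation loss), which in this run is the epoch-2 checkpoint.  
This directory is later used by the evaluation and app scripts.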

In [8]:
trainer.save_model("../outputs/model")
tokenizer.save_pretrained("../outputs/model")

('../outputs/model/tokenizer_config.json',
 '../outputs/model/special_tokens_map.json',
 '../outputs/model/vocab.json',
 '../outputs/model/merges.txt',
 '../outputs/model/added_tokens.json',
 '../outputs/model/tokenizer.json')

## Training Summary

- Model: `facebook/bart-base`
- Dataset: `scientific_papers/arxiv` (first 40,000 training and 4,000 validation samples)
- Input length: 1024 tokens
- Summary length: 256 tokens
- Epochs: 2 (validation loss 2.503 → 2.409)
- Saved checkpoint: `outputs/model/`

Next notebook: **03_model_evaluation.ipynb** — compute ROUGE & BERTScore and analyze examples.
