# Domain‑Adaptive Pretraining (DAPT)

This notebook demonstrates how to continue masked-language-model (MLM) pretraining of `microsoft/deberta-v3-large` on an unlabeled news corpus, adapting the model to the news domain before any task-specific fine-tuning.

> Reference: Gururangan et al. (2020), “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.”

In [ ]:
# Install dependencies (needed once per Colab runtime)
# sentencepiece is required by the DeBERTa-v3 tokenizer
!pip install -q transformers datasets torch accelerate peft sentencepiece

In [ ]:
# Optional: mount Google Drive to save or load checkpoints
# from google.colab import drive
# drive.mount('/content/drive')

In [ ]:
import os
from src.pretrain_lm import run_dapt

# Configuration parameters
MODEL_NAME     = 'microsoft/deberta-v3-large'
DATA_FILE      = 'data/external/unlabeled.txt'
OUTPUT_DIR     = 'outputs/dapt_checkpoints/'
NUM_EPOCHS     = 3        # Increase to 5–10 for production
BATCH_SIZE     = 4        # Smaller batch on Colab GPU
LEARNING_RATE  = 5e-5

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Run domain‑adaptive pretraining
run_dapt(
    model_name=MODEL_NAME,
    data_file=DATA_FILE,
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
)
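
The source of `src.pretrain_lm.run_dapt` is not shown in this notebook. As a rough guide to what such a function typically does, here is a minimal sketch built on the Hugging Face `Trainer`. Everything in it is an assumption rather than the project's actual implementation: the names `run_dapt_sketch` and `read_nonempty_lines`, the one-example-per-line data format, the `max_length` of 512, and the 15% masking rate (the `DataCollatorForLanguageModeling` default).

```python
from pathlib import Path


def read_nonempty_lines(path):
    """Return the stripped, non-empty lines of a UTF-8 text file."""
    with Path(path).open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def run_dapt_sketch(model_name, data_file, output_dir, num_train_epochs=3,
                    batch_size=4, learning_rate=5e-5,
                    max_length=512, mlm_probability=0.15):
    """Hypothetical stand-in for run_dapt: continue MLM training on a text file."""
    # Heavy dependencies are imported here so that defining the function is cheap.
    from datasets import Dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # Assumption: each non-empty line of the corpus is one training example.
    dataset = Dataset.from_dict({"text": read_nonempty_lines(data_file)})
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=max_length),
        batched=True,
        remove_columns=["text"],
    )

    # The collator pads each batch and randomly masks tokens for the MLM loss.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=mlm_probability
    )

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        learning_rate=learning_rate,
        save_strategy="epoch",
        report_to="none",
    )
    trainer = Trainer(
        model=model, args=args, train_dataset=tokenized, data_collator=collator
    )
    trainer.train()

    # Save model and tokenizer together so the directory can be reloaded later.
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Reading the whole file into memory keeps the sketch short. For a large corpus, a streaming loader would be a more reasonable choice.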

After training completes, the adapted checkpoint is saved to **`outputs/dapt_checkpoints/`**.  
You can now fine‑tune this checkpoint with LoRA by pointing your training script at that directory (e.g. `--model_name_or_path outputs/dapt_checkpoints/`).
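
If you would rather load the adapted checkpoint directly than go through a training script, the sketch below shows one way to wrap it with LoRA adapters using `peft`. It assumes a sequence-classification task with two labels. The function name `load_dapt_checkpoint_with_lora` and the hyperparameters in `LORA_KWARGS` are illustrative choices, not values taken from this project. `query_proj` and `value_proj` are the attention projection module names in the Hugging Face DeBERTa-v2/v3 implementation.

```python
# Assumption: illustrative LoRA hyperparameters, not values from this notebook.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["query_proj", "value_proj"],
}


def load_dapt_checkpoint_with_lora(checkpoint_dir="outputs/dapt_checkpoints/",
                                   num_labels=2):
    """Load the DAPT checkpoint as a classifier and attach LoRA adapters."""
    # Imported lazily so that defining the function does not load the libraries.
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForSequenceClassification

    # The encoder weights come from the checkpoint; the classification head
    # is newly initialized, so a warning about missing weights is expected.
    base_model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint_dir, num_labels=num_labels
    )

    config = LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_KWARGS)
    model = get_peft_model(base_model, config)

    # Reports how many parameters remain trainable after wrapping.
    model.print_trainable_parameters()
    return model
```

The returned model can be passed to a `Trainer` like any other model. Only the adapter weights and the classification head receive gradient updates.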