# SmolLM2-135M Fine-Tuning on Colab

This notebook walks through adapting the official `config_smollm2_135M.yaml` recipe to Google Colab. It installs dependencies, downloads the model configuration and tokenizer (the weights are initialized from that configuration rather than loaded from the pretrained checkpoint), prepares the `input.txt` corpus, trains for 5,000 steps with text generation every 500 steps, checkpoints the run, and then resumes for 50 additional steps.

> **Tip:** Set your Colab runtime to GPU (an A100 is preferred, but a T4 also works with the settings below). The cells rely on CUDA, so a TPU runtime is not suitable.


In [None]:
!nvidia-smi


In [None]:
%pip install --upgrade --force-reinstall -q numpy==1.26.4 transformers==4.44.2 accelerate==0.33.0 datasets==3.0.1 evaluate==0.4.3 sentencepiece==0.2.0 omegaconf==2.3.0 bitsandbytes==0.43.1


In [None]:
import numpy as np
print("NumPy version:", np.__version__)
assert np.__version__ == "1.26.4", "Please restart the runtime and rerun the install cell before proceeding."


> If this cell re-runs after you already imported libraries, restart the Colab runtime (`Runtime → Restart runtime`) so the pinned NumPy wheel loads before importing `datasets` or `transformers`.
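
The restart advice above comes down to one question: does the version already loaded in memory match the version that pip just installed? A minimal sketch of that check follows. The helper name `stale_imports` and its parameters are hypothetical (not part of the notebook), and it assumes each package exposes `__version__` and uses the same name for its module and its distribution.

```python
import sys
from importlib import metadata


def stale_imports(packages, installed_version=metadata.version, modules=None):
    """Return {name: (loaded, installed)} for imported packages whose
    in-memory version differs from the version installed on disk."""
    modules = sys.modules if modules is None else modules
    stale = {}
    for name in packages:
        module = modules.get(name)
        if module is None:
            continue  # not imported yet, so nothing stale is loaded
        loaded = getattr(module, "__version__", None)
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            continue  # not installed under this name; nothing to compare
        if loaded is not None and loaded != installed:
            stale[name] = (loaded, installed)
    return stale


# Any entry reported here suggests the runtime should be restarted.
print(stale_imports(["numpy", "transformers", "datasets"]))
```

An empty dictionary means that no already-imported package disagrees with its installed version.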


In [None]:
from pathlib import Path
import json
import textwrap

import requests
import yaml

CONFIG_URL = "https://raw.githubusercontent.com/huggingface/smollm/main/text/pretraining/smollm2/config_smollm2_135M.yaml"
CONFIG_PATH = Path("config_smollm2_135M.yaml")

if not CONFIG_PATH.exists():
    resp = requests.get(CONFIG_URL, timeout=30)
    resp.raise_for_status()
    CONFIG_PATH.write_text(resp.text)

with CONFIG_PATH.open() as fh:
    smollm_config = yaml.safe_load(fh)

summary = {
    "hidden_size": smollm_config["model"]["model_config"]["hidden_size"],
    "layers": smollm_config["model"]["model_config"]["num_hidden_layers"],
    "heads": smollm_config["model"]["model_config"]["num_attention_heads"],
    "intermediate_size": smollm_config["model"]["model_config"]["intermediate_size"],
    "seq_length": smollm_config["tokens"]["sequence_length"],
    "lr": smollm_config["optimizer"]["learning_rate_scheduler"]["learning_rate"],
    "warmup_steps": smollm_config["optimizer"]["learning_rate_scheduler"]["lr_warmup_steps"],
    "grad_accum": smollm_config["tokens"]["batch_accumulation_per_replica"],
    "micro_batch": smollm_config["tokens"]["micro_batch_size"],
}

print("Loaded config from", CONFIG_PATH)
print(json.dumps(summary, indent=2))


## Upload and Inspect `input.txt`

Upload the provided corpus from your local machine. After the upload completes, the file will be available in the Colab working directory.


In [None]:
from google.colab import files

uploaded = files.upload()
assert "input.txt" in uploaded, "Please upload input.txt before continuing."


In [None]:
from pathlib import Path

corpus_path = Path("input.txt")
print(f"input.txt size: {corpus_path.stat().st_size / 1024:.1f} KB")
print("--- sample ---")
print(corpus_path.read_text(encoding="utf-8")[:500])


## Build Tokenized Dataset

We tokenize with the official `HuggingFaceTB/cosmo2-tokenizer`, chunk the text into 2,048-token sequences, and create train/eval splits. Gradient accumulation is used to respect the original micro-batch settings while fitting in Colab memory.
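
The chunking step can be pictured on toy data before running it on the real corpus. The sketch below uses made-up token ids and a tiny block size; `chunk_token_ids` is a hypothetical helper that mirrors the concatenate-then-slice idea only, not the notebook's exact function.

```python
from itertools import chain


def chunk_token_ids(sequences, block_size):
    """Concatenate tokenized lines and split them into full blocks.

    The trailing remainder (fewer than block_size ids) is dropped.
    """
    flat = list(chain.from_iterable(sequences))
    usable = (len(flat) // block_size) * block_size
    return [flat[i : i + block_size] for i in range(0, usable, block_size)]


# Toy example with made-up token ids and a tiny block size.
toy_lines = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = chunk_token_ids(toy_lines, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Unlike this sketch, the notebook's `group_texts` pads a batch that is shorter than one block instead of discarding it.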


In [None]:
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

raw_dataset = load_dataset("text", data_files={"train": str(corpus_path)})
stage0 = smollm_config["data_stages"][0]
seed = stage0.get("seed")
if seed is None:
    seed = stage0.get("data", {}).get("seed")
if seed is None:
    seed = smollm_config.get("general", {}).get("seed", 42)
split_dataset = raw_dataset["train"].train_test_split(test_size=0.1, seed=seed)
print("Using seed:", seed)

tokenizer = AutoTokenizer.from_pretrained(
    smollm_config["tokenizer"]["tokenizer_name_or_path"],
    revision=smollm_config["tokenizer"].get("tokenizer_revision"),
)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
    model_pad_added = True
else:
    model_pad_added = False

block_size = smollm_config["tokens"]["sequence_length"]


def tokenize_function(examples):
    return tokenizer(examples["text"])  # returns input_ids and attention_mask


tokenized_dataset = split_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=2,
    remove_columns=["text"],
)

pad_token_id = tokenizer.pad_token_id
attention_pad_token_id = 0


def group_texts(examples):
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    if total_length < block_size:
        pad_length = block_size - total_length
        padded = {}
        for k, sequence in concatenated.items():
            pad_token = attention_pad_token_id if k == "attention_mask" else pad_token_id
            padded[k] = [sequence + [pad_token] * pad_length]
        padded["labels"] = padded["input_ids"].copy()
        return padded

    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_dataset.map(group_texts, batched=True, num_proc=2)
print(lm_datasets)


## Stage 1: Train for 5,000 Steps

We mirror the YAML hyperparameters (optimizer, scheduler, gradient clipping, etc.) and adapt them to Colab with gradient accumulation. A callback prints generated text every 500 steps. Training can take ~1-2 hours on a T4 and is much faster on an A100.
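
To see what gradient accumulation buys, it can help to compute how many tokens feed a single optimizer update. This is a rough sketch with illustrative numbers; the function name and the value of 8 accumulation steps are assumptions, not values read from the YAML.

```python
def tokens_per_optimizer_step(per_device_batch, grad_accum_steps, seq_length, num_devices=1):
    """Number of tokens that contribute to a single optimizer update."""
    return per_device_batch * grad_accum_steps * seq_length * num_devices


# Illustrative numbers only: batch size 1 per device, 8 accumulation steps,
# 2,048-token sequences, one GPU.
print(tokens_per_optimizer_step(1, 8, 2048))  # 16384
```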


In [None]:
import math
import torch
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    TrainerCallback,
)

if torch.cuda.is_available():
    major_capability, _ = torch.cuda.get_device_capability()
    bf16_enabled = major_capability >= 8
    fp16_enabled = False  # disable AMP FP16 to avoid GradScaler issues on older GPUs
    torch.backends.cuda.matmul.allow_tf32 = True
    if not bf16_enabled:
        print("CUDA device lacks native bfloat16; using full precision training to avoid FP16 GradScaler issues.")
else:
    bf16_enabled = False
    fp16_enabled = False

if bf16_enabled:
    target_dtype = torch.bfloat16
else:
    target_dtype = torch.float32
print(f"Loading model weights in dtype: {target_dtype}")

model_name = "HuggingFaceTB/SmolLM2-135M"
flash_attn_available = False
try:
    # `import importlib` alone does not guarantee that the `util` submodule is loaded.
    import importlib.util

    flash_attn_available = importlib.util.find_spec("flash_attn") is not None
except Exception:
    flash_attn_available = False

attn_impl = "flash_attention_2" if (torch.cuda.is_available() and flash_attn_available) else None
if attn_impl is None and torch.cuda.is_available():
    print("FlashAttention 2 not available; falling back to default attention.")

config = AutoConfig.from_pretrained(model_name)
config.vocab_size = len(tokenizer)
config.pad_token_id = tokenizer.pad_token_id
if tokenizer.bos_token_id is not None:
    config.bos_token_id = tokenizer.bos_token_id
if tokenizer.eos_token_id is not None:
    config.eos_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_config(config)
model.config.pad_token_id = tokenizer.pad_token_id
print("Initialized SmolLM2-135M from config with randomly-initialized weights.")
if model_pad_added:
    print(f"Extended vocab size to {len(tokenizer)} to include the new pad token.")

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


if torch.cuda.is_available():
    torch.cuda.empty_cache()

train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["test"]

output_dir = Path("smollm2_135m_runs")
output_dir.mkdir(parents=True, exist_ok=True)
stage1_dir = output_dir / "stage1_5000_steps"
stage1_dir.mkdir(exist_ok=True)


In [None]:
class SampleGenerationCallback(TrainerCallback):
    def __init__(self, tokenizer, sample_prompt, sample_interval=500, max_new_tokens=200):
        self.tokenizer = tokenizer
        self.sample_prompt = sample_prompt
        self.sample_interval = sample_interval
        self.max_new_tokens = max_new_tokens
        prompt = tokenizer(sample_prompt, return_tensors="pt", padding=True)
        self.input_ids = prompt["input_ids"]
        self.attention_mask = prompt.get("attention_mask")

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step == 0:
            return control
        if state.global_step % self.sample_interval != 0:
            return control

        model = kwargs["model"]
        was_training = model.training
        model.eval()
        try:
            generate_kwargs = {
                "max_new_tokens": self.max_new_tokens,
                "do_sample": True,
                "temperature": 0.8,
                "top_p": 0.95,
                "eos_token_id": self.tokenizer.eos_token_id,
            }
            if self.attention_mask is not None:
                generate_kwargs["attention_mask"] = self.attention_mask.to(model.device)
            generated = model.generate(
                self.input_ids.to(model.device),
                **generate_kwargs,
            )
            decoded = self.tokenizer.decode(generated[0], skip_special_tokens=True)
            print(f"\n===== Sample @ step {state.global_step} =====\n{decoded}\n==============================\n")
        finally:
            if was_training:
                model.train()

        return control


sample_prompt = corpus_path.read_text(encoding="utf-8")[:400]
sample_callback = SampleGenerationCallback(tokenizer, sample_prompt)
callbacks = [sample_callback]



In [None]:
grad_accum_target = smollm_config["tokens"]["batch_accumulation_per_replica"]  # read for reference; not applied below
micro_batch_target = smollm_config["tokens"]["micro_batch_size"]
per_device_train_batch_size = 1  # fits T4
per_device_eval_batch_size = 1

# Accumulate enough small batches to cover one YAML micro-batch per optimizer step.
gradient_accumulation_steps = math.ceil(micro_batch_target / per_device_train_batch_size)

optim_name = "adamw_torch_fused" if (torch.cuda.is_available() and smollm_config["optimizer"]["optimizer_factory"].get("torch_adam_is_fused", False)) else "adamw_torch"

training_args = TrainingArguments(
    output_dir=str(stage1_dir),
    overwrite_output_dir=True,
    max_steps=5000,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    eval_strategy="steps",
    eval_steps=500,
    logging_steps=50,
    logging_first_step=True,
    learning_rate=smollm_config["optimizer"]["learning_rate_scheduler"]["learning_rate"],
    warmup_steps=smollm_config["optimizer"]["learning_rate_scheduler"]["lr_warmup_steps"],
    weight_decay=smollm_config["optimizer"]["weight_decay"],
    max_grad_norm=smollm_config["optimizer"]["clip_grad"],
    lr_scheduler_type="linear",
    optim=optim_name,
    adam_beta1=smollm_config["optimizer"]["optimizer_factory"]["adam_beta1"],
    adam_beta2=smollm_config["optimizer"]["optimizer_factory"]["adam_beta2"],
    adam_epsilon=smollm_config["optimizer"]["optimizer_factory"]["adam_eps"],
    report_to=["tensorboard"],
    bf16=bf16_enabled,
    fp16=fp16_enabled and not bf16_enabled,
    gradient_checkpointing=smollm_config["parallelism"].get("recompute_layer", False),
    save_strategy="steps",
    save_steps=5000,
    save_total_limit=2,
    logging_dir=str(stage1_dir / "logs"),
    push_to_hub=False,
)

print(training_args)
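
The arguments above combine `warmup_steps` with `lr_scheduler_type="linear"`. As a rough illustration of that shape (linear warmup, then linear decay to zero), here is a standalone sketch. The function name and the numbers are illustrative assumptions and are not read from the YAML.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative values only.
for s in (0, 500, 1000, 3000, 5000):
    print(s, linear_schedule_lr(s, base_lr=3e-3, warmup_steps=1000, total_steps=5000))
```

The rate peaks at `base_lr` when warmup ends and reaches zero at `total_steps`, so changing `max_steps` also changes the decay slope.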


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    callbacks=callbacks,
    preprocess_logits_for_metrics=None,
)

train_result = trainer.train()
trainer.save_model(str(stage1_dir / "checkpoint-final"))
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

final_eval = trainer.evaluate()
trainer.log_metrics("eval", final_eval)
trainer.save_metrics("eval", final_eval)

print("Stage 1 completed. Final checkpoint stored under:", stage1_dir)
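
The evaluation metrics include `eval_loss`, the mean cross-entropy in nats, which is often easier to read as perplexity. A small sketch of the conversion follows. The helper name is hypothetical and the loss value is made up for illustration, not a result from this run.

```python
import math


def perplexity_from_metrics(metrics, loss_key="eval_loss"):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    loss = metrics[loss_key]
    try:
        return math.exp(loss)
    except OverflowError:
        return float("inf")


example_metrics = {"eval_loss": 2.0}  # made-up value, not a result from this run
print(f"perplexity: {perplexity_from_metrics(example_metrics):.2f}")  # perplexity: 7.39
```

In the notebook, the same helper could be applied to `final_eval` once Stage 1 finishes.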



In [None]:
# Sync trainer state into the final checkpoint directory for easy resumption
final_checkpoint_dir = stage1_dir / "checkpoint-final"
final_checkpoint_dir.mkdir(parents=True, exist_ok=True)

trainer.state.save_to_json(str(final_checkpoint_dir / "trainer_state.json"))
try:
    # Write optimizer and scheduler state under the file names Trainer reads on resume.
    torch.save(trainer.optimizer.state_dict(), final_checkpoint_dir / "optimizer.pt")
    torch.save(trainer.lr_scheduler.state_dict(), final_checkpoint_dir / "scheduler.pt")
except Exception as err:
    print(f"Warning: could not persist optimizer/scheduler to {final_checkpoint_dir}: {err}")
else:
    print(f"Saved trainer state, optimizer, and scheduler to {final_checkpoint_dir}.")



## Stage 2: Resume for 50 Additional Steps

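After the initial 5,000 steps, reload the checkpoint and continue for 50 more steps (total 5,050). The same evaluation cadence and sampling callback are reused, but because both run every 500 steps, neither fires between steps 5,001 and 5,050. The explicit `evaluate()` call at the end still reports evaluation metrics.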
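
A quick way to check which steps trigger a periodic hook during a resumed run is to enumerate the multiples of the interval inside the step range. This sketch assumes hooks fire whenever the global step is a multiple of the interval, as the sampling callback in this notebook does. The helper name is hypothetical.

```python
def trigger_steps(start_step, end_step, interval):
    """Steps in (start_step, end_step] at which an every-`interval` hook fires."""
    return [s for s in range(start_step + 1, end_step + 1) if s % interval == 0]


print(trigger_steps(0, 5000, 500))     # ten steps: 500, 1000, ..., 5000
print(trigger_steps(5000, 5050, 500))  # []
```

Lowering the interval for the resumed run (for example to 25) would produce samples within the 50 extra steps.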


In [None]:
stage1_model_dir = stage1_dir / "checkpoint-final"
if not stage1_model_dir.exists():
    stage1_model_dir = stage1_dir

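# Resume from checkpoint-final, which holds the weights plus the synced trainer_state.json.
# stage1_dir itself contains trainer_state.json but no weights, so Trainer cannot resume from it.
resume_checkpoint = stage1_model_dir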

print(f"Stage 2 will load weights from: {stage1_model_dir}")
print(f"Stage 2 will resume optimizer state from: {resume_checkpoint}")

stage2_dir = output_dir / "stage2_5050_steps"
stage2_dir.mkdir(exist_ok=True)

model_stage2 = AutoModelForCausalLM.from_pretrained(
    stage1_model_dir,
    torch_dtype=target_dtype,
    attn_implementation=attn_impl,
    device_map="auto" if torch.cuda.device_count() > 1 else None,
)
if model_pad_added:
    model_stage2.resize_token_embeddings(len(tokenizer))

training_args_stage2 = TrainingArguments(
    output_dir=str(stage2_dir),
    overwrite_output_dir=True,
    max_steps=5050,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    eval_strategy="steps",
    eval_steps=500,
    logging_steps=50,
    logging_first_step=True,
    learning_rate=smollm_config["optimizer"]["learning_rate_scheduler"]["learning_rate"],
    warmup_steps=smollm_config["optimizer"]["learning_rate_scheduler"]["lr_warmup_steps"],
    weight_decay=smollm_config["optimizer"]["weight_decay"],
    max_grad_norm=smollm_config["optimizer"]["clip_grad"],
    lr_scheduler_type="linear",
    optim=optim_name,
    adam_beta1=smollm_config["optimizer"]["optimizer_factory"]["adam_beta1"],
    adam_beta2=smollm_config["optimizer"]["optimizer_factory"]["adam_beta2"],
    adam_epsilon=smollm_config["optimizer"]["optimizer_factory"]["adam_eps"],
    report_to=["tensorboard"],
    bf16=bf16_enabled,
    fp16=fp16_enabled and not bf16_enabled,
    gradient_checkpointing=smollm_config["parallelism"].get("recompute_layer", False),
    save_strategy="steps",
    save_steps=5050,
    save_total_limit=2,
    logging_dir=str(stage2_dir / "logs"),
    push_to_hub=False,
)

trainer_stage2 = Trainer(
    model=model_stage2,
    args=training_args_stage2,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    callbacks=callbacks,
    preprocess_logits_for_metrics=None,
)

train_result_stage2 = trainer_stage2.train(resume_from_checkpoint=str(resume_checkpoint))
trainer_stage2.save_model(str(stage2_dir / "checkpoint-final"))
metrics_stage2 = train_result_stage2.metrics
trainer_stage2.log_metrics("train", metrics_stage2)
trainer_stage2.save_metrics("train", metrics_stage2)
trainer_stage2.save_state()

eval_stage2 = trainer_stage2.evaluate()
trainer_stage2.log_metrics("eval", eval_stage2)
trainer_stage2.save_metrics("eval", eval_stage2)

print("Stage 2 completed. Final checkpoint stored under:", stage2_dir)



## Next Steps & Notes

- Use `tensorboard --logdir smollm2_135m_runs/stage1_5000_steps/logs` (or stage2) to inspect metrics.
- Download checkpoints via the Colab file browser or `from google.colab import files; files.download(...)`.
- To push to the Hugging Face Hub, call `trainer.push_to_hub()` or manually upload the stage directories.
- If you need shorter context windows (e.g., due to limited RAM), lower `block_size` before tokenization and adjust `max_position_embeddings` in the model config accordingly.
