# Free FP8 training with unit scaling

Zero-effort, zero-cost FP8 training using the `unit_scaling` library.

💻 **Use the library**: [graphcore-research.github.io/unit-scaling](https://graphcore-research.github.io/unit-scaling/)

📖 **Read the paper**: [arxiv.org/abs/2303.11257](https://arxiv.org/abs/2303.11257)

## TL;DR

Naïvely casting a model to FP8 causes training to fail, because some values fall
outside the range that FP8 can represent. Using
[unit-scaled](https://graphcore-research.github.io/unit-scaling/) layers fixes this:

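```python
from notebook_utils import config, train
from nanoGPT.model import GPT
from unit_scaling.transforms import simulate_fp8, unit_scale

# Baseline model in regular precision.
gpt = GPT(config)
# The same model with a naive (simulated) cast to FP8.
fp8_gpt = simulate_fp8(gpt)
# The FP8 model with its layers replaced by unit-scaled equivalents.
unit_scaled_fp8_gpt = unit_scale(fp8_gpt)

# Train all three models so they can be compared.
models = [gpt, fp8_gpt, unit_scaled_fp8_gpt]
for model in models:
    train(model)
```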

```
{'out_dir': 'out-shakespeare-char', 'eval_interval': 250, 'eval_iters': 200, 'log_interval': 10, 'always_save_checkpoint': False, 'wandb_log': False, 'wandb_project': 'shakespeare-char', 'wandb_run_name': 'mini-gpt', 'dataset': 'shakespeare_char', 'gradient_accumulation_steps': 1, 'batch_size': 64, 'block_size': 256, 'n_layer': 6, 'n_head': 6, 'n_embd': 384, 'dropout': 0.2, 'learning_rate': 0.001, 'max_iters': 5000, 'lr_decay_iters': 5000, 'min_lr': 0.0001, 'beta2': 0.99, 'warmup_iters': 100}
GPTConfig(block_size=256, vocab_size=65, n_layer=6, n_head=6, n_embd=384, dropout=0.2, bias=True)
number of parameters: 10.67M
```
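
To make the "out-of-range" claim from the TL;DR concrete, here is a minimal sketch in plain Python. It is not part of the `unit_scaling` library, and it does not reproduce what `simulate_fp8` does internally. It assumes the E4M3 variant of FP8 (largest finite value 448, smallest positive subnormal 2⁻⁹); the helper `fraction_out_of_range` and the two example tensors are hypothetical, chosen only to illustrate the effect of scale.

```python
import random

# Assumed FP8 E4M3 limits (illustrative; other FP8 formats differ).
E4M3_MAX = 448.0                 # largest finite magnitude
E4M3_MIN_SUBNORMAL = 2.0 ** -9   # smallest positive magnitude


def fraction_out_of_range(values, max_val=E4M3_MAX, min_val=E4M3_MIN_SUBNORMAL):
    """Return (overflow_fraction, underflow_fraction) for a list of floats.

    A value counts as overflowing if its magnitude exceeds max_val, and as
    underflowing if it is non-zero but smaller than half the smallest
    subnormal, so that rounding to nearest would turn it into zero.
    """
    n = len(values)
    overflow = sum(abs(v) > max_val for v in values) / n
    underflow = sum(0 < abs(v) < min_val / 2 for v in values) / n
    return overflow, underflow


random.seed(0)
# Hypothetical data: a badly scaled tensor (std 1e-5) and a unit-variance one.
small_grads = [random.gauss(0.0, 1e-5) for _ in range(10_000)]
unit_scaled = [random.gauss(0.0, 1.0) for _ in range(10_000)]

print("std=1e-5:", fraction_out_of_range(small_grads))
print("std=1   :", fraction_out_of_range(unit_scaled))
```

Under these assumptions, essentially every element of the small-scale tensor would round to zero, while the unit-variance tensor sits comfortably inside the representable range. Keeping values near unit variance is the property that unit-scaled layers aim for.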
