# GistNet Training Demo

This notebook reproduces the Phase 2 GistNet workflow end to end in Colab: it clones the MegaContext repository, downloads a curated subset of Project Gutenberg titles (under 1 GB), prepares Arrow shards, and trains the lightweight gist model.

## 1. Clone the repository & install dependencies

If you're running locally with the repo already on disk, skip the clone cell. The `git pull` cell after it brings an existing checkout up to date.

In [None]:
!git clone https://github.com/brandf/MegaContext.git
%cd MegaContext
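
If the same notebook is used both in Colab and locally, the decision to clone can be made by a small check instead of by hand. The following is a minimal sketch, meant to run before the clone cell from the directory that would contain the checkout. The helper name `needs_clone` is hypothetical and is not part of MegaContext.

```python
from pathlib import Path


def needs_clone(repo_dir: str = "MegaContext") -> bool:
    """Return True when no git checkout exists at repo_dir.

    Assumption: a checkout is recognised by the presence of a `.git`
    entry (a directory for normal clones, a file for worktrees).
    """
    return not (Path(repo_dir) / ".git").exists()


if needs_clone():
    print("No checkout found: run the clone cell.")
else:
    print("Checkout found: the clone cell can be skipped.")
```

The check only looks for a `.git` entry. It does not verify that the checkout is actually the MegaContext repository or that it is up to date.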


In [None]:
!git pull

In [None]:
!pip install -r requirements.txt
!pip install -e ".[dev]"

## 2. Download the Gutenberg subset

Feel free to trim or expand the list inside `tools/download_gutenberg.sh`. The default selection is under 1 GB.

In [None]:
!bash tools/download_gutenberg.sh data/raw/gutenberg
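
After editing the title list, it can be useful to confirm that the download still fits the size budget mentioned above. This is a minimal sketch using only the standard library. The helper name `dir_size_bytes` is hypothetical, and 1 GB is taken here to mean 10^9 bytes.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Path taken from the download cell; a missing directory simply reports 0 bytes.
size = dir_size_bytes("data/raw/gutenberg")
print(f"{size / 1e9:.2f} GB in data/raw/gutenberg")
```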

## 3. Prepare the dataset shard

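The `configs/data/gutenberg_sample.yaml` file uses `block_size=32` and `horizon=64`, matching the requirements in the paper plan. The first cell below deletes any existing shard so that the prepare step rebuilds it from scratch.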

In [None]:
!rm -f data/gutenberg_sample/train.arrow

In [None]:
%run tools/prepare_dataset.py --config configs/data/gutenberg_sample.yaml
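
To give a feel for what `block_size=32` and `horizon=64` could mean for the data, here is an illustrative sketch of one possible windowing scheme. It assumes that `block_size` is the number of tokens in a block and that `horizon` is the number of tokens immediately following that block. The function `iter_examples` is hypothetical; the layout that `tools/prepare_dataset.py` actually produces may differ.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_examples(
    tokens: Sequence[int], block_size: int = 32, horizon: int = 64
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (block, future) pairs from a flat token sequence.

    Assumption: windows advance by one block at a time, and a window is
    emitted only when a full block plus a full horizon is available.
    """
    last_start = len(tokens) - block_size - horizon
    for start in range(0, last_start + 1, block_size):
        block = list(tokens[start : start + block_size])
        future = list(tokens[start + block_size : start + block_size + horizon])
        yield block, future


# 160 tokens leave room for three full (32 + 64)-token windows.
examples = list(iter_examples(list(range(160))))
print(len(examples), len(examples[0][0]), len(examples[0][1]))
```

Under these assumptions, any trailing tokens that cannot fill a complete window are dropped rather than padded.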


## 4. Train the GistNet model

This step displays the loss curve inline in Colab and saves artifacts, including `metrics.json` and `loss.png`, under `artifacts/gistnet/`.

In [None]:
%run tools/train_gistnet.py \
    --dataset data/gutenberg_sample/train.arrow \
    --config configs/runs/gistnet_example.yaml \
    --metrics-path artifacts/gistnet/metrics.json \
    --save-plot artifacts/gistnet/loss.png
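
Once training finishes, the metrics file can be inspected without knowing its layout in advance. The sketch below assumes only that `metrics.json` is valid JSON. Its schema is not documented here, so the hypothetical helper `describe_metrics` just reports the top-level keys and their types.

```python
import json
from pathlib import Path
from typing import Dict, Union


def describe_metrics(path: Union[str, Path]) -> Dict[str, str]:
    """Map each top-level key of a JSON file to the name of its Python type."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # Non-object roots (for example a list of records) are reported as one entry.
    return {"<root>": type(data).__name__}


# Path taken from the --metrics-path flag above; nothing is printed if the file is absent.
metrics_path = Path("artifacts/gistnet/metrics.json")
if metrics_path.exists():
    print(describe_metrics(metrics_path))
```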

## 5. Next steps

* Run the ΔNLL smoke eval (coming in Task 2.4).
* Push metrics & checkpoints to W&B or Novita storage with `--use-wandb`.
* Swap `configs/runs/gistnet_example.yaml` for a larger hidden size when you have a bigger teacher.