## Train d32

I'll do the training on Lambda Cloud with new storage.

### Storage

How much? Really rough...

- 800 parquet files @ 90 MB each = 72 GB
- room for 10 checkpoints @ 8 GB each (?) = 80 GB
- eval stuff = 200 MB

That's about 152 GB total, so 500 GB should be plenty.
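
As a quick sanity check on the arithmetic, here is a small shell sketch that adds up the three items from the list. It assumes decimal units (1 GB = 1000 MB), and the 8 GB checkpoint size is still a guess.

```shell
# Rough storage estimate from the figures in the list above (decimal units).
parquet_mb=$((800 * 90))          # 800 parquet files at ~90 MB each
checkpoint_mb=$((10 * 8 * 1000))  # 10 checkpoints at ~8 GB each (a guess)
eval_mb=200                       # eval data
total_mb=$((parquet_mb + checkpoint_mb + eval_mb))
echo "estimated total: $((total_mb / 1000)) GB"
```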

### Machine setup

```
ssh ubuntu@[ip]

# ssh key for git
ssh-keygen -t ed25519 -C "lambda-cloud"
cat ~/.ssh/id_ed25519.pub
# copy into github UI (https://github.com/settings/keys)

git config --global user.email "ericsilberstein@gmail.com"
git config --global user.name "Eric Silberstein"

# clone this repo
git clone git@github.com:ericsilberstein1/learn-nanochat.git

# UV
curl -LsSf https://astral.sh/uv/install.sh | sh

# rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
echo '. "$HOME/.cargo/env"' >> ~/.bashrc

echo 'export NANOCHAT_BASE_DIR="/home/ubuntu/mynanochat"' >> ~/.bashrc
echo 'export RUN_OUTPUTS_DIR="$NANOCHAT_BASE_DIR/run_outputs"' >> ~/.bashrc
echo 'export OMP_NUM_THREADS=1' >> ~/.bashrc

# in ~/.bashrc add
# export WANDB_API_KEY="XXX"

source ~/.bashrc

cd learn-nanochat

uv sync
source .venv/bin/activate

# for now, until I organize this better
uv tool install maturin
cd challenge-07-rust-and-python-simplified-tokenizer/rust_tokenizer
maturin develop
cd -

# looks like lambda automatically runs jupyter but for now at least let me run it
# in the way I understand
uv run jupyter lab --port=7001
# in another shell: list running servers to get the URL and token
jupyter server list

# ON MY LAPTOP make a tunnel to jupyter
ssh -N -L 7001:localhost:7001 ubuntu@[ip]
```
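
Before starting a long run it may be worth confirming that the setup above took effect. The sketch below is not part of the repo: `check_var` is a hypothetical helper that only reports whether a variable is set (it never prints the value, so the W&B key is not echoed), and the tool list is just the tools installed above.

```shell
# Hypothetical helper, not part of the repo.
check_var() {
    # $1 = variable name, $2 = its current value (may be empty)
    if [ -n "$2" ]; then
        echo "ok: $1 is set"
    else
        echo "MISSING: $1"
    fi
}

check_var NANOCHAT_BASE_DIR "${NANOCHAT_BASE_DIR:-}"
check_var RUN_OUTPUTS_DIR "${RUN_OUTPUTS_DIR:-}"
check_var OMP_NUM_THREADS "${OMP_NUM_THREADS:-}"
check_var WANDB_API_KEY "${WANDB_API_KEY:-}"

# Confirm the tools installed above are on PATH.
for tool in uv cargo maturin; do
    if command -v "$tool" > /dev/null 2>&1; then
        echo "ok: $tool found"
    else
        echo "MISSING: $tool"
    fi
done
```

Anything reported as `MISSING` usually means `~/.bashrc` was not re-sourced in the current shell.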

### Run...

In a tmux shell, from the repo root (`learn-nanochat`):

```
source .venv/bin/activate

mkdir -p $NANOCHAT_BASE_DIR
mkdir -p $RUN_OUTPUTS_DIR

cd challenge-38-train-d32

curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

export PYTHONPATH=../my_nanochat/

python -m nanochat.dataset -n 800

python -m scripts.my_tok_train --max_chars=4000000000

# do a short run to confirm it runs to completion

python -m my_nanochat.my_report reset

torchrun --standalone --nproc_per_node=8 -m scripts.my_base_train -- --depth=32 --device_batch_size=8 --num_iterations=10 --run=challenge-38-1 > $RUN_OUTPUTS_DIR/base_train_output_001.txt 2>&1

# if good do the full run and then the base evals

python -m my_nanochat.my_report reset

torchrun --standalone --nproc_per_node=8 -m scripts.my_base_train -- --depth=32 --device_batch_size=8 --run=challenge-38-2 > $RUN_OUTPUTS_DIR/base_train_output_002.txt 2>&1

torchrun --standalone --nproc_per_node=8 -m scripts.my_base_loss -- --model_tag=d32 > $RUN_OUTPUTS_DIR/base_loss_output_001.txt 2>&1

torchrun --standalone --nproc_per_node=8 -m scripts.my_base_eval -- --model-tag=d32 > $RUN_OUTPUTS_DIR/base_eval_output_001.txt 2>&1

# do mid training and chat eval

torchrun --standalone --nproc_per_node=8 -m scripts.my_mid_train -- --model_tag=d32 --device_batch_size=8 --run=challenge-38-3 > $RUN_OUTPUTS_DIR/mid_train_output_001.txt 2>&1

torchrun --standalone --nproc_per_node=8 -m scripts.my_chat_eval -- --model-tag=d32 --source=mid > $RUN_OUTPUTS_DIR/mid_chat_eval_output_001.txt 2>&1

# do sft and chat eval on sft

torchrun --standalone --nproc_per_node=8 -m scripts.my_chat_sft -- --model-tag=d32 --run=challenge-38-4 > $RUN_OUTPUTS_DIR/sft_train_output_001.txt 2>&1

torchrun --standalone --nproc_per_node=8 -m scripts.my_chat_eval -- --model-tag=d32 --source=sft > $RUN_OUTPUTS_DIR/sft_chat_eval_output_001.txt 2>&1

# run final report

python -m my_nanochat.my_report

```
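
The steps above are run one at a time, and each writes to its own log file. If several are pasted into the shell together, a failed step will not stop the ones after it. One possible way to guard against that is a small wrapper like the sketch below. `run_step` is a hypothetical helper, not something in the repo, and it assumes `$RUN_OUTPUTS_DIR` is already set and exists.

```shell
# Hypothetical helper, not part of the repo: run one pipeline step, send its
# output to a log file under $RUN_OUTPUTS_DIR, and report success or failure.
run_step() {
    log="$RUN_OUTPUTS_DIR/$1"   # $1 = log file name
    shift                       # remaining arguments = the command to run
    echo "starting: $*"
    if "$@" > "$log" 2>&1; then
        echo "finished: $log"
    else
        echo "FAILED: see $log"
        return 1
    fi
}

# Example (commented out so that pasting this block only defines the function):
# run_step base_train_output_001.txt \
#     torchrun --standalone --nproc_per_node=8 -m scripts.my_base_train -- \
#     --depth=32 --device_batch_size=8 --num_iterations=10 --run=challenge-38-1 \
#   && echo "short run ok"
```

Chaining calls with `&&` means a later step only starts if the previous one exited cleanly.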