## Getting ready

This notebook covers what I should do locally to this learning repo before I start working on a server with a GPU, plus instructions for configuring that server. The other notebook will be like `baby-pretrain.ipynb` from challenge 13, but I'll run it on a GPU.

### A

During challenge 8 I downloaded 10 parquet files in the notebook and later moved them to `~/.cache/my_nanochat`. Since I'll need to download files again once on the GPU server, and will probably need to do this many times in the future, copy the `if __name__ == "__main__"` part of [dataset.py](https://github.com/karpathy/nanochat/blob/master/nanochat/dataset.py) to `my_dataset.py` so I can download on the command line.

Also, update `download_single_file` to put the files in the directory returned by `get_base_dir()` (which we didn't worry about back in challenge 8).
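
A minimal sketch of that path handling, under some assumptions: the `MY_NANOCHAT_BASE_DIR` override, the `shard_path` and `already_downloaded` helpers, and the `base_dir` parameter are made up for illustration and may not match what `my_dataset.py` actually does. Only the `~/.cache/my_nanochat` location and the `shard_NNNNN.parquet` naming come from this notebook.

```python
import os


def get_base_dir():
    """Return the cache directory, creating it if it does not exist yet.

    Assumption: the env var name is illustrative; the real get_base_dir()
    may resolve the directory differently.
    """
    default = os.path.join(os.path.expanduser("~"), ".cache", "my_nanochat")
    base_dir = os.environ.get("MY_NANOCHAT_BASE_DIR", default)
    os.makedirs(base_dir, exist_ok=True)
    return base_dir


def shard_path(index, base_dir=None):
    """Hypothetical helper: full path of shard `index` inside the base dir."""
    if base_dir is None:
        base_dir = get_base_dir()
    return os.path.join(base_dir, f"shard_{index:05d}.parquet")


def already_downloaded(index, base_dir=None):
    """True if the shard file is already on disk, so a download can be skipped."""
    return os.path.exists(shard_path(index, base_dir))
```

With something like this, `download_single_file` only needs to ask `shard_path(index)` where to write, and the same code works on my laptop and on the GPU server.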

In [3]:
import sys
sys.path.append('../my_nanochat')

In [4]:
from my_nanochat.my_dataset import download_single_file

In [5]:
!ls ~/.cache/my_nanochat/

my-tokenizer.pkl    shard_00002.parquet shard_00005.parquet shard_00008.parquet
shard_00000.parquet shard_00003.parquet shard_00006.parquet shard_00009.parquet
shard_00001.parquet shard_00004.parquet shard_00007.parquet


In [6]:
download_single_file(1)

downloading shard_00001.parquet...
shard_00001.parquet already downloaded


True

In [7]:
download_single_file(10)

downloading shard_00010.parquet...
downloaded /Users/ericsilberstein/.cache/my_nanochat/shard_00010.parquet


True

In [8]:
!ls ~/.cache/my_nanochat/

my-tokenizer.pkl    shard_00002.parquet shard_00005.parquet shard_00008.parquet
shard_00000.parquet shard_00003.parquet shard_00006.parquet shard_00009.parquet
shard_00001.parquet shard_00004.parquet shard_00007.parquet shard_00010.parquet


In [9]:
!PYTHONPATH=../my_nanochat python -m my_nanochat.my_dataset --help

usage: my_dataset.py [-h] [-n NUM_FILES] [-w NUM_WORKERS]

Download FineWeb-Edu 100BT dataset shards

options:
  -h, --help            show this help message and exit
  -n NUM_FILES, --num-files NUM_FILES
                        Number of shards to download (default: -1), -1 = all
  -w NUM_WORKERS, --num-workers NUM_WORKERS
                        Number of parallel download workers (default: 4)


In [10]:
!PYTHONPATH=../my_nanochat python -m my_nanochat.my_dataset --num-files 12 --num-workers 2

Downloading 12 shards using 2 workers
downloading shard_00000.parquet...
downloading shard_00002.parquet...
shard_00000.parquet already downloaded
shard_00002.parquet already downloaded
downloading shard_00001.parquet...
downloading shard_00003.parquet...
shard_00001.parquet already downloaded
shard_00003.parquet already downloaded
downloading shard_00004.parquet...
shard_00004.parquet already downloaded
downloading shard_00005.parquet...
shard_00005.parquet already downloaded
downloading shard_00006.parquet...
downloading shard_00008.parquet...
shard_00008.parquet already downloaded
downloading shard_00009.parquet...
shard_00006.parquet already downloaded
downloading shard_00007.parquet...
shard_00009.parquet already downloaded
shard_00007.parquet already downloaded
downloading shard_00010.parquet...
shard_00010.parquet already downloaded
downloading shard_00011.parquet...
downloaded /Users/ericsilberstein/.cache/my_nanochat/shard_00011.parquet
done! download 12 of 12


In [11]:
!ls ~/.cache/my_nanochat/

my-tokenizer.pkl    shard_00003.parquet shard_00007.parquet shard_00011.parquet
shard_00000.parquet shard_00004.parquet shard_00008.parquet
shard_00001.parquet shard_00005.parquet shard_00009.parquet
shard_00002.parquet shard_00006.parquet shard_00010.parquet
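
For reference, the command-line part can be sketched roughly like this. The two flags and their help text mirror the `--help` output above; everything else is an assumption: `TOTAL_SHARDS` is a placeholder count, `fake_download` stands in for `download_single_file` so nothing touches the network, `run` is a hypothetical wrapper, and `ThreadPoolExecutor` is my choice for the sketch, not necessarily the pool the real script uses.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

# Assumption: placeholder value; the real script knows how many shards exist.
TOTAL_SHARDS = 20


def fake_download(index):
    """Stand-in for download_single_file: pretends every download succeeds."""
    return True


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Download FineWeb-Edu 100BT dataset shards")
    parser.add_argument("-n", "--num-files", type=int, default=-1,
                        help="Number of shards to download (default: -1), -1 = all")
    parser.add_argument("-w", "--num-workers", type=int, default=4,
                        help="Number of parallel download workers (default: 4)")
    return parser.parse_args(argv)


def run(argv=None, download=fake_download):
    """Parse the flags, fan the shard indices out to workers, count successes."""
    args = parse_args(argv)
    # -1 means "all"; otherwise never ask for more shards than exist
    num = TOTAL_SHARDS if args.num_files == -1 else min(args.num_files, TOTAL_SHARDS)
    ids = list(range(num))
    print(f"Downloading {len(ids)} shards using {args.num_workers} workers")
    with ThreadPoolExecutor(max_workers=args.num_workers) as pool:
        results = list(pool.map(download, ids))
    ok = sum(1 for r in results if r)
    print(f"{ok} of {len(ids)} shards ok")
    return ok, len(ids)
```

Calling `run(["--num-files", "12", "--num-workers", "2"])` corresponds to the invocation in cell 10, minus the actual downloads.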


### B

In `my_gpt.py` there are places where I skipped the `device` argument. Add it in now:

```
--- a/my_nanochat/my_nanochat/my_gpt.py
+++ b/my_nanochat/my_nanochat/my_gpt.py
@@ -115,9 +115,11 @@ class GPT(nn.Module):
         self.register_buffer("sin", sin, persistent=False)
 
     def _precompute_rotary_embeddings(self, seq_len, head_dim, base=10_000, device=None):
-        channel_range = torch.arange(0, head_dim, 2, dtype=torch.float32)
+        if device is None:
+            device = self.get_device()
+        channel_range = torch.arange(0, head_dim, 2, dtype=torch.float32, device=device)
         inv_freq = 1.0 / (base ** (channel_range / head_dim))
-        t = torch.arange(seq_len, dtype=torch.float32)
+        t = torch.arange(seq_len, dtype=torch.float32, device=device)
         freqs = torch.outer(t, inv_freq)
         cos, sin = freqs.cos(), freqs.sin()
         cos, sin = cos.bfloat16(), sin.bfloat16()
@@ -136,6 +138,9 @@ class GPT(nn.Module):
         elif isinstance(module, nn.Embedding):
             torch.nn.init.normal_(module.weight, mean=0.0, std=1.0)
 
+    def get_device(self):
+        return self.transformer.wte.weight.device
+
     def init_weights(self):
         self.apply(self._init_weights)
         # zero out the classifier weights
```
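
To sanity check the pattern in the diff without a full model, a toy module is enough. `TinyModel` here is a hypothetical stand-in, not the real `GPT` class: it keeps only an embedding, `get_device()`, and the rotary precompute, and the sizes are arbitrary. It falls back to CPU when no GPU is present.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for GPT, just enough to show the device handling."""

    def __init__(self, vocab_size=16, head_dim=8, seq_len=4):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, head_dim)
        cos, sin = self._precompute_rotary_embeddings(seq_len, head_dim)
        # persistent=False keeps these buffers out of the state dict
        self.register_buffer("cos", cos, persistent=False)
        self.register_buffer("sin", sin, persistent=False)

    def get_device(self):
        # The embedding weights tell us where the model currently lives
        return self.wte.weight.device

    def _precompute_rotary_embeddings(self, seq_len, head_dim, base=10_000, device=None):
        if device is None:
            device = self.get_device()
        channel_range = torch.arange(0, head_dim, 2, dtype=torch.float32, device=device)
        inv_freq = 1.0 / (base ** (channel_range / head_dim))
        t = torch.arange(seq_len, dtype=torch.float32, device=device)
        freqs = torch.outer(t, inv_freq)  # shape (seq_len, head_dim // 2)
        return freqs.cos().bfloat16(), freqs.sin().bfloat16()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyModel().to(device)  # .to() moves registered buffers as well

# Recomputing after the move needs no explicit device argument
cos, sin = model._precompute_rotary_embeddings(6, 8)
print(model.get_device(), model.cos.device, cos.device)
```

All three printed devices should match, whether that is CPU on my laptop or the GPU on the server.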

### C

I'll use DigitalOcean Paperspace because I've used it before. Based on other projects, I'll probably do something like this, and will update/fix it once I actually get on there.

```
ssh paperspace@[public_ip]

# ssh key for git
ssh-keygen -t ed25519 -C "paperspace-vm"
cat ~/.ssh/id_ed25519.pub
# copy the public key into the GitHub UI (https://github.com/settings/keys)

git config --global user.email "ericsilberstein@gmail.com"
git config --global user.name "Eric Silberstein"

# clone this repo
git clone git@github.com:ericsilberstein1/nanogpt-learning.git

# Basic packages
sudo apt-get update && sudo apt-get install -y build-essential git curl wget ca-certificates

# confirm / install smi
nvidia-detector
nvidia-smi
sudo apt-get install nvidia-driver-580 nvidia-utils-580
sudo modprobe nvidia # (before did sudo shutdown -r now)
nvidia-smi

# later, after working on challenge 15, looked into why torch was failing to compile the model
# and realized needed to do this so python headers get installed
# the specific error was: /tmp/tmp0normshd/cuda_utils.c:6:10: fatal error: Python.h: No such file 
# or directory 6 | #include <Python.h>
sudo apt-get install python3.10-dev

# UV
curl -LsSf https://astral.sh/uv/install.sh | sh
source .zshrc

# rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
echo '. "$HOME/.cargo/env"' >> .zshrc
source .zshrc

cd nanogpt-learning

uv sync
source .venv/bin/activate

# download data files
PYTHONPATH=my_nanochat python -m my_nanochat.my_dataset --num-files 12 --num-workers 4

uv run jupyter lab

# ON MY LAPTOP make port 8889 a tunnel to jupyter
ssh -N -L 8889:localhost:8888 paperspace@[public_ip]

# now back on paperspace machine, for now until organize this better
uv tool install maturin
cd challenge-07-rust-and-python-simplified-tokenizer/rust_tokenizer
maturin develop

