[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Andrei-Aksionov/nanoGPTplus/blob/main/notebooks/examples/run_on_google_colab.ipynb)

<h1><center>RUNNING TRAINING AND SAMPLING</center></h1>
<h5><center>in Google Colaboratory</center></h5>

If you don't have a GPU or don't want to install this project on your local machine, you can run this notebook in Google Colab. The service provides instances with a CPU, GPU or TPU (we will not use the latter).

To do this, you can either:
1. Click on the `Open in Colab` badge.
2. Copy this notebook to [Google Colab](https://colab.research.google.com/) manually and run it there.

# 1. Preparations

First we need to verify that the instance is ready, then clone the repository, create a virtual environment and install all dependencies along with the project itself.

## 1.1. Runtime type

Google Colab provides CPU, GPU and TPU instances.

The code will work on both CPU and GPU, but I recommend using a GPU instance for the sake of speed.

Here is a [link](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm) on how to change runtime type.

If a GPU is selected and available, the code below will output info about the available GPU and its current status.

In [1]:
!nvidia-smi

Sat Mar 18 14:12:12 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P0    23W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 1.2. Clone repository

In [2]:
!git clone https://github.com/Andrei-Aksionov/nanoGPTplus.git

Cloning into 'nanoGPTplus'...
remote: Enumerating objects: 295, done.
remote: Counting objects: 100% (177/177), done.
remote: Compressing objects: 100% (118/118), done.
remote: Total 295 (delta 90), reused 76 (delta 52), pack-reused 118
Receiving objects: 100% (295/295), 672.00 KiB | 18.16 MiB/s, done.
Resolving deltas: 100% (124/124), done.


Colab allows you to `cd` into a directory, so from now on all commands will be executed from this directory.

In [3]:
%cd nanoGPTplus/
!ls

/content/nanoGPTplus
LICENSE    poetry.lock	   README.md   src
notebooks  pyproject.toml  references  tests


## 1.3. Prepare virtual environment

Each Google Colab instance comes with a plethora of preinstalled packages. Since I don't control the versions of those packages, I'd rather create a new empty virtual environment and install the project's dependencies into it, for the sake of future reproducibility.

**Important Note**: I can deal with packages, but I definitely cannot control the version of the Python interpreter. For now it's 3.9. If you have any issues running the cells, first check that the output of the cell below is `Python 3.9.*`.

The project should work on Python 3.8 and up, but only 3.9 has been tested in Colab.

In [4]:
!python --version

Python 3.9.16
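
The same check can be done programmatically. Below is a minimal sketch, assuming the 3.8 minimum mentioned above; `is_supported` is a hypothetical helper written for this notebook, not part of the project.

```python
import sys


def is_supported(version, minimum=(3, 8)):
    """Return True if `version` (a tuple like sys.version_info) meets the minimum."""
    return tuple(version[:2]) >= minimum


current = sys.version.split()[0]
if is_supported(sys.version_info):
    print(f"Python {current} should be fine.")
else:
    print(f"Python {current} is older than 3.8, the project may not run.")
```

Comparing tuples rather than version strings avoids the classic pitfall where `"3.10"` sorts before `"3.8"` alphabetically.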


In [6]:
# install the package that creates virtual environments, then create `venv` inside the project's folder
%pip install --quiet virtualenv
!virtualenv venv

created virtual environment CPython3.9.16.final.0-64 in 254ms
  creator CPython3Posix(dest=/content/nanoGPTplus/venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.0.1, setuptools==67.4.0, wheel==0.38.4
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


One trick to activate the virtual environment permanently, so that all commands are executed within it, is to change the `$PATH` environment variable so that the venv comes first in line.

In [7]:
import os

# there are some difficulties with the standard way of exporting an environment variable,
# so I use Python's built-in `os` module
os.environ["PATH"] = f"{os.getcwd()}/venv/bin:{os.environ['PATH']}"
!echo $PATH

/content/nanoGPTplus/venv/bin:/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin
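
To see why putting the venv first is enough, here is a small sketch of how executables are resolved: directories in `PATH` are searched in order and the first match wins. The `demo-tool` file and the `prepend_to_path` helper are made up for this illustration.

```python
import os
import shutil
import stat
import tempfile


def prepend_to_path(directory, path):
    """Return a PATH string in which `directory` is searched first."""
    return os.pathsep.join([directory, path])


with tempfile.TemporaryDirectory() as fake_bin:
    # Create a stand-in executable; `demo-tool` is a made-up name for this sketch.
    tool = os.path.join(fake_bin, "demo-tool")
    with open(tool, "w") as handle:
        handle.write("#!/bin/sh\necho hello\n")
    os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)

    new_path = prepend_to_path(fake_bin, os.environ.get("PATH", ""))
    # `shutil.which` walks the PATH entries in order, like the shell does.
    resolved = shutil.which("demo-tool", path=new_path)
    print(resolved)
```

In the notebook the same thing happens with `python` and `pip`: once `venv/bin` comes first, those names resolve to the venv's executables.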


Now we can check that the venv is activated: the command below shows that it's indeed an empty virtual environment.

In [8]:
!pip list

Package    Version
---------- -------
pip        23.0.1
setuptools 67.4.0
wheel      0.38.4


So now we can install dependencies that are specified in `pyproject.toml` into our venv.

In [10]:
%pip install --quiet -e .

  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
  Building editable for nanogptplus (pyproject.toml) ... done


# 2. Running models

Now we are all set.

We can train the bigram model and sample new tokens from it, do the same for GPT, and even load weights from the pretrained GPT2 model from Huggingface and use it to sample new tokens.

## 2.1. Download dataset

For simplicity this project uses the tiny shakespeare dataset. You can definitely use your own; check the README for what needs to be done.

In [11]:
!python src/data/scripts/download_tiny_shakespeare.py

2023-03-18 14:15:59.049 | DEBUG    | src.data.downloader:download:34 - Downloading https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt into /content/nanoGPTplus/data/raw/tiny_shakespeare
2023-03-18 14:15:59.420 | DEBUG    | src.data.downloader:download:44 - Downloading is finished


## 2.2. Bigram language model

First we start with something fairly simple: a bigram language model. This model just learns which tokens tend to follow the current one and uses these statistics during sampling. More about it in `src/model/bigram_language_model/README.md`.
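
To make the idea concrete, here is a count-based sketch on a toy string. It is an illustration only and not the project's implementation (the project trains its model with `src/model/train.py`); `fit_bigram_counts` and `sample` are hypothetical helpers, and characters stand in for tokens.

```python
import random
from collections import Counter, defaultdict


def fit_bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def sample(counts, start, max_new_tokens, seed=0):
    """Generate characters one at a time, looking only at the previous one."""
    rng = random.Random(seed)
    generated = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(generated[-1])
        if not followers:  # the last character was never followed by anything
            break
        tokens, weights = zip(*followers.items())
        # Draw the next character in proportion to how often it was observed.
        generated.append(rng.choices(tokens, weights=weights, k=1)[0])
    return "".join(generated)


corpus = "to be, or not to be, that is the question"
counts = fit_bigram_counts(corpus)
print(counts["t"].most_common(1))
print(sample(counts, start="t", max_new_tokens=20))
```

Because each step sees only one previous token, the output tends to look locally plausible but globally meaningless.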

In [12]:
# only the large size is available for the bigram model
!python src/model/train.py bigram --size large

2023-03-18 14:16:44.527 | DEBUG    | __main__:train:60 - Random seed is fixed for training.
2023-03-18 14:16:44.528 | INFO     | __main__:train:66 - Loading the data...
2023-03-18 14:16:44.533 | INFO     | __main__:train:70 - Data is loaded.
2023-03-18 14:16:44.533 | INFO     | __main__:train:73 - Starting tokenizing...
2023-03-18 14:16:44.662 | INFO     | __main__:train:76 - Tokenizing is done.
2023-03-18 14:16:44.663 | INFO     | __main__:train:79 - Saving tokenizer...
2023-03-18 14:16:44.665 | INFO     | __main__:train:81 - Tokenizer is saved.
2023-03-18 14:16:44.666 | INFO     | __main__:

And now we can sample from the trained model.

In [13]:
!python src/model/generate.py bigram --size large --max-new-tokens 100 --fix-seed

2023-03-18 14:18:36.245 | DEBUG    | __main__:generate_new_tokens:69 - Random seed is fixed for token generation.
2023-03-18 14:18:38.415 | DEBUG    | __main__:generate_new_tokens:104 - Generating tokens on 'cuda' device
2023-03-18 14:18:38.443 | INFO     | __main__:generate_new_tokens:120 - New generated tokens:  d
O: as; nte tis, te othut mod thand he, preckn,

Henthif o--wishelapinisers we s, orean,
TAUCluprt,
2023-03-18 14:18:38.444 | DEBUG    | __main__:generate_new_tokens:121 - Token generation took: 0.0290 seconds


Yes, the model is fast, can't deny it.
The output somewhat resembles real words, which is kinda ok for such a simple model.

But still, this is not what we want, is it?

Let's check what GPT can offer.

## 2.3. GPT

GPT accepts three sizes: `small`, `medium` and `large`.

Small is good for debugging, but for a reasonably good result the bigger the model the better. You can also play with the `--dataset-fraction` argument, which specifies what fraction of the dataset to use for training.

Since the tokenizer is fairly simple and the training might take a while, this time let's take only 10% of the dataset. You can try the full dataset if you have a more powerful GPU at your disposal (for example on Google Colab Pro/Pro+).
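
As a rough illustration of what a dataset fraction means, the sketch below keeps the leading part of a sequence. This is an assumption about the general idea, not the project's code: `take_fraction` is a hypothetical helper, and the project may select its subset differently.

```python
def take_fraction(tokens, fraction):
    """Keep the first `fraction` of a sequence (hypothetical helper, not the project's code)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    return tokens[: int(len(tokens) * fraction)]


toy_dataset = list(range(1000))  # stand-in for a tokenized dataset
subset = take_fraction(toy_dataset, 0.1)
print(len(subset))
```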

In [21]:
os.environ["gpt_size"] = "medium"

In [22]:
!python src/model/train.py gpt --size $gpt_size --dataset-fraction 0.1

2023-03-18 14:29:47.277 | DEBUG    | __main__:train:60 - Random seed is fixed for training.
2023-03-18 14:29:47.277 | INFO     | __main__:train:66 - Loading the data...
2023-03-18 14:29:47.283 | INFO     | __main__:train:70 - Data is loaded.
2023-03-18 14:29:47.283 | INFO     | __main__:train:73 - Starting tokenizing...
2023-03-18 14:29:47.412 | INFO     | __main__:train:76 - Tokenizing is done.
2023-03-18 14:29:47.413 | INFO     | __main__:train:79 - Saving tokenizer...
2023-03-18 14:29:47.415 | INFO     | __main__:train:81 - Tokenizer is saved.
2023-03-18 14:29:47.415 | INFO     | __main__:

And new tokens are:

In [23]:
!python src/model/generate.py gpt --size $gpt_size --max-new-tokens 1000 --fix-seed

2023-03-18 14:46:14.743 | DEBUG    | __main__:generate_new_tokens:69 - Random seed is fixed for token generation.
2023-03-18 14:46:15.673 | DEBUG    | src.model.gpt_language_model.gpt:__init__:114 - GPT language model is created with number of parameters: 10.65 million
2023-03-18 14:46:17.848 | DEBUG    | __main__:generate_new_tokens:104 - Generating tokens on 'cuda' device
100% 1000/1000 [00:07<00:00, 137.21it/s]
2023-03-18 14:46:25.140 | INFO     | __main__:generate_new_tokens:120 - New generated tokens:  d
Our sufferance. There's ne'er arm in the war,
Our still on in sufferity, or be some of them our truth: and deliver him
Which our distinction; and it our shall answer
The treasure of our strange.

MENENIUS:
Now, be gone, beseech you.

CORIOLANUS:
That 

Ok, that looks much better than what the bigram LM produced. Keep in mind that the dataset is fairly small and the tokenizer is a basic one, so the power of GPT isn't fully utilized.

You can achieve better results with a bigger model trained on the full dataset, but that will take a while on an Nvidia T4.

## 2.4. GPT with pretrained weights

This GPT implementation supports loading pretrained weights for the GPT2 model from Huggingface (the weights are provided by OpenAI). That model was trained on a large corpus of data and uses a much more sophisticated [byte-pair tokenizer](https://huggingface.co/course/chapter6/5?fw=pt).

**Note**: the weights were not pretrained on the shakespeare dataset, so the output will differ from what we saw before.

GPT2 has 4 configs:
1. gpt2 (124M parameters) 
2. gpt2-medium (350M)
3. gpt2-large (774M)
4. gpt2-xl (1.5B)

*Google Colab with an Nvidia T4 can handle up to gpt2-large.
Using even the largest one is possible, but it requires changing how the model is loaded.*

The larger the model, the better the sampling, but memory consumption grows as well.
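
For a rough feel of the memory involved, the sketch below estimates the size of the weights alone, assuming they are stored as 32-bit floats (4 bytes each). It uses the parameter counts listed above and ignores activations and any temporary copies made while loading, so real usage will be higher. `weights_gib` is a hypothetical helper.

```python
BYTES_PER_FP32 = 4  # a 32-bit float occupies four bytes

# Parameter counts as listed above.
GPT2_PARAMETERS = {
    "gpt2": 124e6,
    "gpt2-medium": 350e6,
    "gpt2-large": 774e6,
    "gpt2-xl": 1.5e9,
}


def weights_gib(num_parameters, bytes_per_parameter=BYTES_PER_FP32):
    """Memory needed just to hold the weights, in GiB."""
    return num_parameters * bytes_per_parameter / 2**30


for name, count in GPT2_PARAMETERS.items():
    print(f"{name}: ~{weights_gib(count):.1f} GiB of weights in float32")
```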

In [29]:
os.environ["gpt2_config"] = "gpt2-medium"
os.environ["max_new_tokens"] = "1000"
os.environ["continue_tokens"] = "My name is Giovanni Giorgio but everybody calls me "

In [30]:
!python src/model/generate.py gpt --gpt2-config $gpt2_config --max-new-tokens "$((max_new_tokens))" --fix-seed --continue-tokens "$continue_tokens"

2023-03-18 14:51:48.867 | DEBUG    | __main__:generate_new_tokens:69 - Random seed is fixed for token generation.
2023-03-18 14:51:51.251 | DEBUG    | src.model.gpt_language_model.gpt:from_pretrained:340 - Creating GPT model with parameters: {'vocab_size': 50257, 'embeddings_size': 1024, 'context_size': 1024, 'num_layers': 24, 'num_heads': 16, 'head_size': None, 'feed_forward_scaling': 4, 'bias': True, 'dropout': 0.1}
2023-03-18 14:52:00.173 | DEBUG    | src.model.gpt_language_model.gpt:__init__:114 - GPT language model is created with number of parameters: 353.77 million
2023-03-18 14:52:00.175 | DEBUG    | src.model.gpt_language_model.gpt:from_pretrained:348 - Loading pretrained Huggingface model of size 'gpt2-medium' ...
2023-03-18 14

### 2.4.1. Key-Value cache

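For GPT it's possible to use a kv-cache to speed up new token generation: the keys and values computed for earlier tokens are stored and reused instead of being recomputed at every step.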

In [31]:
!python src/model/generate.py gpt --gpt2-config $gpt2_config --max-new-tokens "$((max_new_tokens))" --fix-seed --continue-tokens "$continue_tokens" --use-kv-cache

2023-03-18 14:54:23.120 | DEBUG    | __main__:generate_new_tokens:69 - Random seed is fixed for token generation.
2023-03-18 14:54:25.108 | DEBUG    | src.model.gpt_language_model.gpt:from_pretrained:340 - Creating GPT model with parameters: {'vocab_size': 50257, 'embeddings_size': 1024, 'context_size': 1024, 'num_layers': 24, 'num_heads': 16, 'head_size': None, 'feed_forward_scaling': 4, 'bias': True, 'dropout': 0.1}
2023-03-18 14:54:35.310 | DEBUG    | src.model.gpt_language_model.gpt:__init__:114 - GPT language model is created with number of parameters: 353.77 million
2023-03-18 14:54:35.312 | DEBUG    | src.model.gpt_language_model.gpt:from_pretrained:348 - Loading pretrained Huggingface model of size 'gpt2-medium' ...
2023-03-18 14

Look at the difference: 120-150 seconds without caching versus 18-19 seconds with it. That's why kv-caching is widely adopted.
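
The saving comes from not redoing work for tokens that were already processed. Below is a minimal sketch of the idea, assuming a single attention head on toy vectors in plain Python. It is not the project's implementation, and `CachedAttention`, `attend` and the other names are hypothetical. It checks that the cached path produces the same output as recomputing everything from scratch.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [dot(row, vector) for row in matrix]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    size = len(values[0])
    return [
        sum(w * value[i] for w, value in zip(weights, values)) / total
        for i in range(size)
    ]


def last_output_without_cache(tokens, w_q, w_k, w_v):
    """Recompute keys and values for the whole sequence on every call."""
    keys = [matvec(w_k, token) for token in tokens]
    values = [matvec(w_v, token) for token in tokens]
    return attend(matvec(w_q, tokens[-1]), keys, values)


class CachedAttention:
    """Keep the keys and values of earlier tokens, project only the new one."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []

    def step(self, token):
        self.keys.append(matvec(self.w_k, token))
        self.values.append(matvec(self.w_v, token))
        return attend(matvec(self.w_q, token), self.keys, self.values)


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
w_q, w_k, w_v = (random_matrix(rng, dim, dim) for _ in range(3))
tokens = random_matrix(rng, 6, dim)  # six toy token embeddings

cached = CachedAttention(w_q, w_k, w_v)
for position, token in enumerate(tokens, start=1):
    with_cache = cached.step(token)
    without_cache = last_output_without_cache(tokens[:position], w_q, w_k, w_v)
    assert all(
        math.isclose(a, b, abs_tol=1e-12)
        for a, b in zip(with_cache, without_cache)
    )
```

Without the cache, step `t` projects all `t` tokens again. With it, each step projects a single token and only the attention over stored keys and values grows with the sequence length.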