### Notice
Everything should be run from the main directory.
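A quick way to confirm the working directory is to look for the entries the commands in this notebook rely on. The sketch below is not part of the repository. It assumes that `requirements.txt`, `utils` and `data` all sit directly in the main directory, which is what the paths used in this notebook suggest, and the helper name is made up for illustration.

```python
from pathlib import Path

# Entries the commands in this notebook reference relative to the main
# directory (assumption: all three live at the top level of the repository).
EXPECTED = ("requirements.txt", "utils", "data")


def missing_from_main_directory(path="."):
    """Return the expected entries that are absent from `path`."""
    root = Path(path)
    return [name for name in EXPECTED if not (root / name).exists()]


missing = missing_from_main_directory()
if missing:
    print("Probably not the main directory. Missing:", ", ".join(missing))
else:
    print("Working directory looks right.")
```

If anything is reported missing, change into the main directory before running the cells that follow.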

You need to install all required dependencies.

In [None]:
%pip install -r requirements.txt

# Training 
Train the custom nanoGPT model with each tokenizer for 1000 steps.

We used the Wikitext.txt file for training with each tokenizer, but you can also use TaylorSwiftWiki.txt (the Taylor Swift wiki page) or tiny_shakespeare.txt (a sample of Shakespeare's works).
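To switch datasets, only the `--dataset` argument of the training command needs to change. The sketch below builds the argument string for each combination. Only `data/Wikitext.txt` appears in this notebook's commands, so the locations of the other two files inside `data/` are assumptions, and `DATASETS` and `train_args` are hypothetical helpers rather than part of the repository.

```python
# Dataset files named in the text. Only data/Wikitext.txt is confirmed by the
# commands in this notebook; the other two paths are assumed to be in data/.
DATASETS = {
    "wikitext": "data/Wikitext.txt",
    "taylor_swift": "data/TaylorSwiftWiki.txt",  # assumed location
    "shakespeare": "data/tiny_shakespeare.txt",  # assumed location
}


def train_args(dataset, tokenizer, max_steps=1000, vocab_size=None):
    """Build the argument string for utils/train_nanogpt_mod.py."""
    args = ["--dataset", DATASETS[dataset], "--tokenizer", tokenizer]
    if vocab_size is not None:
        # Only the BPE command in this notebook passes --vocab_size.
        args += ["--vocab_size", str(vocab_size)]
    args += ["--max_steps", str(max_steps)]
    return " ".join(args)


# Print a command that can be pasted into a notebook cell.
print("%run utils/train_nanogpt_mod.py " + train_args("shakespeare", "char"))
```

The printed line has the same shape as the training cells below, with a different dataset path.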

### Byte level
Should take 1-2 minutes

In [None]:
%run utils/train_nanogpt_mod.py --dataset data/Wikitext.txt --tokenizer byt5 --max_steps 1000

### Character level
Should take 1-2 minutes

In [None]:
%run utils/train_nanogpt_mod.py --dataset data/Wikitext.txt --tokenizer char --max_steps 1000

### Byte-Pair Encoding
Should take 6-7 hours with a strong GPU. We recommend skipping this step and using the pre-generated checkpoint file instead. If you do train, you can set the vocabulary size by changing the --vocab_size value.

In [None]:
%run utils/train_nanogpt_mod.py --dataset data/Wikitext.txt --tokenizer bpe --vocab_size 32000 --max_steps 1000
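Since the BPE run is long, it can help to check which checkpoints are already on disk before training anything. This is a minimal sketch: the file naming pattern is inferred from the checkpoint paths used in the evaluation commands of this notebook, and it is an assumption that the training script writes to the same location. Both function names are hypothetical.

```python
from pathlib import Path


def checkpoint_path(tokenizer, steps=1000, root="checkpoints/WikiText"):
    """Expected checkpoint location (pattern inferred, not confirmed)."""
    return Path(root) / f"nanogpt_{tokenizer}_step{steps}.pt"


def tokenizers_to_train(tokenizers=("byt5", "char", "bpe"), steps=1000,
                        root="checkpoints/WikiText"):
    """Return the tokenizers that have no checkpoint on disk yet."""
    return [t for t in tokenizers
            if not checkpoint_path(t, steps=steps, root=root).exists()]


print("Still need training:", tokenizers_to_train())
```

If `bpe` is not in the printed list, the pre-generated checkpoint is already in place and the cell above can be skipped.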

# Evaluating
Evaluate the best checkpoints that the model generated for each tokenization method.

### Byte level
Should take 1-2 minutes

In [None]:
%run utils/enhanced_eval.py --checkpoint checkpoints/WikiText/nanogpt_byt5_step1000.pt --dataset data/Wikitext.txt


### Character level
Should take 1-2 minutes

In [None]:
%run utils/enhanced_eval.py --checkpoint checkpoints/WikiText/nanogpt_char_step1000.pt --dataset data/Wikitext.txt


### Byte-Pair Encoding
Should take 6-7 hours with a strong GPU. We recommend skipping this step and using the pre-generated evaluation file (JSON) instead.

In [None]:
%run utils/enhanced_eval.py --checkpoint checkpoints/WikiText/nanogpt_bpe_step1000.pt --dataset data/Wikitext.txt

At this point, a metrics file has been generated for each tokenization method inside the results folder.
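Before comparing, it may be worth confirming that the metrics files exist and parse as JSON. The sketch below loads whichever files are present, using the file names passed to the comparison script in this notebook. The JSON schema is not described here, so the code only reports the top-level shape, and `load_metrics` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_metrics(results_dir="results", tokenizers=("byt5", "char", "bpe"),
                 steps=1000):
    """Load whichever metrics files exist, keyed by tokenizer name."""
    loaded = {}
    for name in tokenizers:
        path = Path(results_dir) / f"enhanced_metrics_{name}_step{steps}.json"
        if path.exists():
            loaded[name] = json.loads(path.read_text(encoding="utf-8"))
    return loaded


for name, metrics in load_metrics().items():
    # The schema is an unknown here, so only top-level keys (or the type) are shown.
    summary = sorted(metrics) if isinstance(metrics, dict) else type(metrics).__name__
    print(name, "->", summary)
```

A tokenizer that does not show up in the output has no metrics file yet, so its evaluation cell still needs to be run or its pre-generated file copied into place.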

# Comparison
Now you can compare the tokenizers and generate some plots!

In [None]:
%run utils/compare_tokenizers.py results/enhanced_metrics_byt5_step1000.json results/enhanced_metrics_char_step1000.json results/enhanced_metrics_bpe_step1000.json --out_dir results/plots

# Text Generation
You can also try generating some text from any prompt!

Note: generation currently works only with the byte-level tokenizer :(

In [None]:
%run utils/generate_nanogpt_mod.py --checkpoint checkpoints/WikiText/nanogpt_byt5_step1000.pt --prompt "Hello"