# What is a tokenizer?

A tokenizer is a system that converts text into a sequence of discrete symbols (tokens).


A tokenizer usually includes the following components:
- text normalization / cleanup
- token splitting (rules on how text is split)
- vocabulary
- encoder (tokens -> integers)
- decoder (integers -> tokens)
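
To make the list above concrete, here is a minimal sketch of the five pieces working together. It is not the BPE implementation developed in this notebook: the function names (`normalize`, `split_tokens`, `build_vocab`, `encode`, `decode`) are hypothetical, and the splitting rule is a deliberately naive whitespace split chosen only for illustration.

```python
import re

def normalize(text):
    # Normalization: trim the ends and collapse whitespace runs.
    return re.sub(r"\s+", " ", text.strip())

def split_tokens(text):
    # Token splitting: a deliberately naive whitespace rule (assumption).
    return text.split(" ") if text else []

def build_vocab(tokens):
    # Vocabulary: map each distinct token to an integer id.
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def encode(text, vocab):
    # Encoder: tokens -> integers.
    return [vocab[tok] for tok in split_tokens(normalize(text))]

def decode(ids, vocab):
    # Decoder: integers -> tokens, joined back into text.
    id_to_token = {i: tok for tok, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

corpus = "the cat  sat on\tthe mat"
vocab = build_vocab(split_tokens(normalize(corpus)))
ids = encode("the mat", vocab)
print(ids)                 # [4, 1]
print(decode(ids, vocab))  # the mat
```

A word-level vocabulary like this cannot encode unseen words (`encode` would raise a `KeyError`), which is one motivation for subword schemes such as BPE.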

The implementation here is for a byte-pair encoding (BPE) tokenizer.

References:
- https://en.wikipedia.org/wiki/Byte-pair_encoding
- https://huggingface.co/learn/llm-course/en/chapter6/5
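
For orientation, the core of BPE training is a loop that counts adjacent symbol pairs and merges the most frequent one. The sketch below shows a single merge step on made-up toy counts. The helper names `count_pairs` and `merge_pair` are assumptions for illustration, not necessarily the functions this notebook ends up using, and the end-of-word marker is omitted to keep the example short.

```python
from collections import Counter

def count_pairs(word_freqs):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # Replace every occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (made up): each word is a tuple of symbols mapped to its count.
word_freqs = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
pairs = count_pairs(word_freqs)
best = max(pairs, key=pairs.get)
word_freqs = merge_pair(best, word_freqs)
print(best, pairs[best])  # ('u', 'g') 20
```

Repeating this step until the vocabulary reaches a target size yields the ordered list of merges that defines a BPE tokenizer.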



![Status](https://img.shields.io/badge/status-dev%20in%20progress-yellow)

# 1. Setup

## 1.1 Loading the Dataset

In [1]:
!wget -q https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
!wget -q https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt

!wget -q https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz
!gunzip -f owt_train.txt.gz
!wget -q https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz
!gunzip -f owt_valid.txt.gz

!ls -lh

total 14G
-rw-r--r-- 1 root root  12G Feb  7 04:48 owt_train.txt
-rw-r--r-- 1 root root 277M Feb  7 04:50 owt_valid.txt
drwxr-xr-x 1 root root 4.0K Jan 16 14:24 sample_data
-rw-r--r-- 1 root root 2.1G Feb  7 04:47 TinyStoriesV2-GPT4-train.txt
-rw-r--r-- 1 root root  22M Feb  7 04:47 TinyStoriesV2-GPT4-valid.txt


## 1.2 Overview of the Dataset

In [2]:
!echo "== TinyStories train =="; head -n 5 TinyStoriesV2-GPT4-train.txt # head -n 5: show the first 5 lines of the file
!echo "== TinyStories valid =="; head -n 5 TinyStoriesV2-GPT4-valid.txt
!echo "== OWT train =="; head -n 5 owt_train.txt
!echo "== OWT valid =="; head -n 5 owt_valid.txt

== TinyStories train ==

Once upon a time there was a little boy named Ben. Ben loved to explore the world around him. He saw many amazing things, like beautiful vases that were on display in a store. One day, Ben was walking through the store when he came across a very special vase. When Ben saw it he was amazed!  
He said, “Wow, that is a really amazing vase! Can I buy it?” 
The shopkeeper smiled and said, “Of course you can. You can take it home and show all your friends how amazing it is!”
So Ben took the vase home and he was so proud of it! He called his friends over and showed them the amazing vase. All his friends thought the vase was beautiful and couldn't believe how lucky Ben was. 
== TinyStories valid ==
u don't have to be scared of the loud dog, I'll protect you". The mole felt so safe with the little girl. She was very kind and the mole soon came to trust her. He leaned against her and she kept him safe. The mole had found his best friend.
<|endoftext|>
Once upon a time, i

## 1.3 Training and Validation Files

In [3]:
from pathlib import Path

train_files = [
    "TinyStoriesV2-GPT4-train.txt",
    "owt_train.txt",
]

valid_files = [
    "TinyStoriesV2-GPT4-valid.txt",
    "owt_valid.txt",
]

# Sanity checks
for f in train_files + valid_files:
    p = Path(f)
    assert p.exists(), f"Missing file: {f}"
    assert p.stat().st_size > 0, f"Empty file: {f}"

print("All files found and non-empty.")

All files found and non-empty.


# 2. Text Normalization

## 2.1 Whitespace

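This code streams the files line by line, strips leading and trailing whitespace from each line, and collapses every remaining run of spaces or tabs into a single space. Blank lines are not skipped; they are yielded as empty strings.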


In [12]:
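import re

# Matches any run of whitespace characters (spaces, tabs, newlines).
whitespace_pattern = re.compile(r'\s+')

def stream_line(paths):
  """Yield whitespace-normalized lines from each file in `paths`, one at a time."""
  for path in paths:
    # Read as UTF-8 explicitly: the stories contain curly quotes.
    with open(path, 'r', encoding='utf-8') as f:
      for line in f:
        # Trim the ends, then collapse internal whitespace runs to one space.
        # A line read from a file is never empty (it holds at least '\n'),
        # so blank lines come out as empty strings rather than being skipped.
        line = whitespace_pattern.sub(' ', line.strip())
        yield line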
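
To see the normalization in isolation, without reading the multi-gigabyte files, the same two steps can be applied to in-memory strings. This is a small sketch. `normalize_line` is a hypothetical helper, not part of the notebook.

```python
import re

whitespace_pattern = re.compile(r"\s+")

def normalize_line(line):
    # Hypothetical helper: the same strip + collapse steps, on one string.
    return whitespace_pattern.sub(" ", line.strip())

samples = ["  Once   upon\ta time \n", "\n", "no-change"]
for s in samples:
    print(repr(normalize_line(s)))
# 'Once upon a time'
# ''
# 'no-change'
```

The second sample shows that a blank line normalizes to an empty string.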


In [13]:
# Test: print the first 5 normalized lines
stream = stream_line(train_files)
for i in range(5):
  print(next(stream))


Once upon a time there was a little boy named Ben. Ben loved to explore the world around him. He saw many amazing things, like beautiful vases that were on display in a store. One day, Ben was walking through the store when he came across a very special vase. When Ben saw it he was amazed!
He said, “Wow, that is a really amazing vase! Can I buy it?”
The shopkeeper smiled and said, “Of course you can. You can take it home and show all your friends how amazing it is!”
So Ben took the vase home and he was so proud of it! He called his friends over and showed them the amazing vase. All his friends thought the vase was beautiful and couldn't believe how lucky Ben was.


## 2.2 End of Word
Take a word, split it into characters, and append an end-of-word marker so that word boundaries stay visible once symbols are merged.

In [15]:
EOW = "</eow>"
def word_to_symbols(word):
    return tuple(word) + (EOW,)

print(word_to_symbols("hello"))

('h', 'e', 'l', 'l', 'o', '</eow>')
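
One plausible next step, sketched here under assumptions, is to count how often each word's symbol tuple occurs, since BPE pair counts are usually weighted by word frequency. `count_words` is a hypothetical helper, and splitting on whitespace is an assumed word rule rather than something the notebook has fixed yet.

```python
from collections import Counter

EOW = "</eow>"

def word_to_symbols(word):
    return tuple(word) + (EOW,)

def count_words(lines):
    # Hypothetical helper: count each word's symbol tuple across lines.
    freqs = Counter()
    for line in lines:
        for word in line.split():  # assumed rule: words are whitespace-separated
            freqs[word_to_symbols(word)] += 1
    return freqs

freqs = count_words(["the cat", "the mat"])
print(freqs[word_to_symbols("the")])  # 2
```

Because every tuple ends in `</eow>`, a merge that includes the marker applies only at the end of a word, so the `e` ending `the` is kept distinct from the `e` inside `then`.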


#