<a href="https://colab.research.google.com/github/au1206/build_llm_from_scratch/blob/main/tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# prompt: explain the regex text = re.sub(r'([,.?_!"()\']|--|\s)', r'\1', text)

The provided regex `text = re.sub(r'([,.?_!"()\']|--|\s)', r'\1', text)` performs the following operations:

1. **Pattern**:
   - `([,.?_!"()\']|--|\s)`: An alternation of three alternatives. The character class `[,.?_!"()\']` matches a single comma (,), period (.), question mark (?), underscore (_), exclamation mark (!), double quote ("), opening or closing parenthesis, or single quote ('). The second alternative, `--`, matches a literal double hyphen, and the third, `\s`, matches a single whitespace character.

2. **Grouping**:
   - The parentheses around the whole alternation create a capturing group, so whatever matched can be referred to later as `\1`.

3. **Replacement**:
   - `r'\1'`: This replacement string reinserts the captured text (`\1`) at the position where it was matched.

4. **Overall Effect**:
   - Every match is replaced with itself, so the call returns the text unchanged. It does not remove punctuation or whitespace. To remove the whitespace in front of punctuation, the whitespace has to be matched outside the capturing group, for example with `re.sub(r'\s+([,.?!"()\'])', r'\1', text)`.

For example, `Hello , world !` comes back unchanged from the original call, while the second pattern turns it into `Hello, world!`.
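The behavior of the pattern can be checked directly. The following is a minimal sketch: the sample string is made up for illustration, and the second pattern is an assumption about what the call was meant to do (remove whitespace before punctuation), not something the prompt states.

```python
import re

# Made-up sample with a space before each punctuation mark.
sample = 'Hello , world ! Is this -- a test ?'

# The pattern from the prompt: every match is captured and put back unchanged.
original = re.sub(r'([,.?_!"()\']|--|\s)', r'\1', sample)

# Assumed intent: drop the whitespace that sits directly before punctuation.
# The whitespace is matched outside the group, so only the punctuation is kept.
fixed = re.sub(r'\s+([,.?!"()\'])', r'\1', sample)

print(original == sample)  # True
print(fixed)               # Hello, world! Is this -- a test?
```

The double hyphen keeps its surrounding spaces because `-` is not part of the character class in the second pattern.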



# Tokenization

## Vocab
Create a vocab, i.e. a dictionary with each word/token as key and an integer index as value.

In [None]:
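import re

text_corpus = "<text corpus path .txt>"
with open(text_corpus, 'r') as f:
  text = f.read()

# split on punctuation, double hyphens and whitespace; the capturing group keeps the delimiters
tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
# whitespace matches strip to '', which therefore also gets a vocab entry
tokens = [x.strip() for x in tokens]
# sort the unique tokens so each one gets a stable integer index
vocab = {x: i for i, x in enumerate(sorted(set(tokens)))}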
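To see what this produces, here is a small sketch that runs the same steps on an in-memory string instead of a file. The string `sample_text` is a hypothetical stand-in for the corpus contents. Note that stripping turns every whitespace match into an empty string, so `''` shows up as a vocab entry.

```python
import re

# Hypothetical stand-in for the text read from the corpus file.
sample_text = 'Hello, world. Is this-- a test?'

# The capturing group makes re.split return the delimiters as separate items.
pieces = re.split(r'([,.?_!"()\']|--|\s)', sample_text)
tokens = [p.strip() for p in pieces]

# Sorting the unique tokens gives each one a stable integer index.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
print(vocab)
# {'': 0, ',': 1, '--': 2, '.': 3, '?': 4, 'Hello': 5, 'Is': 6,
#  'a': 7, 'test': 8, 'this': 9, 'world': 10}
```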

## Simple tokenizer
A tokenizer that takes each word without whitespace and treats punctuation marks as separate tokens.

It needs a vocab of the format `{word: int, ...}`.

In [2]:
import re
from typing import List

In [3]:
class SimpleTokenizer:
  def __init__(self, vocab) -> None:
    self.str_to_int = vocab
    self.int_to_str = {v: k for k, v in vocab.items()}

  def encode(self, text: str) -> List[int]:
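    preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
    # strip each piece and skip the empty strings left by whitespace matches
    preprocessed = [x.strip() for x in preprocessed if x.strip()]
    ids = [self.str_to_int[x] for x in preprocessed]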
    return ids

  def decode(self, tokens: List[int]) -> str:
    text = ' '.join([self.int_to_str[x] for x in tokens])
    # remove the whitespace that ' '.join inserted before punctuation
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text
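A round trip through `encode` and `decode` might look like the sketch below. The class is repeated so the block runs on its own, and it rests on two assumptions that are marked in the comments: `encode` skips empty strings, and `decode` removes the whitespace in front of punctuation. The sample sentence and the `PATTERN` constant are illustrative only.

```python
import re
from typing import Dict, List

# The split pattern used throughout, kept in one (hypothetical) constant.
PATTERN = r'([,.?_!"()\']|--|\s)'


class SimpleTokenizer:
  def __init__(self, vocab: Dict[str, int]) -> None:
    self.str_to_int = vocab
    self.int_to_str = {v: k for k, v in vocab.items()}

  def encode(self, text: str) -> List[int]:
    pieces = re.split(PATTERN, text)
    # Assumption: skip the empty strings left by whitespace matches.
    pieces = [p.strip() for p in pieces if p.strip()]
    return [self.str_to_int[p] for p in pieces]

  def decode(self, tokens: List[int]) -> str:
    text = ' '.join(self.int_to_str[t] for t in tokens)
    # Assumption: remove the space that join put in front of punctuation.
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)


# Made-up sample; the vocab is built from the same sentence.
sample = 'Hello, world. Is this a test?'
words = [p.strip() for p in re.split(PATTERN, sample) if p.strip()]
vocab = {tok: i for i, tok in enumerate(sorted(set(words)))}

tokenizer = SimpleTokenizer(vocab)
ids = tokenizer.encode(sample)
print(ids)                    # [3, 0, 8, 1, 4, 7, 5, 6, 2]
print(tokenizer.decode(ids))  # Hello, world. Is this a test?
```

A word that is not in the vocab raises `KeyError`, since the dictionary lookup in `encode` has no fallback.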
