<a href="https://colab.research.google.com/github/elliemci/building-LLM/blob/main/tokenizing_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization of Text Data

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
%cd /content/drive/MyDrive/Colab\ Notebooks/LLM

pytorch_wormup.ipynb  the-verdict.txt


Transform discrete text data, such as words, into a continuous vector space.

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()

print(f"Total number of characters in original text including white spaces: {len(raw_text)}")
print(raw_text[:99])

Total number of characters in original text including white spaces: 20480
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


Break the raw text into tokens, which can be words or characters, and convert these string tokens into integer token IDs.

## Word Tokenization

In [None]:
import re

A tokenization scheme that splits text into individual words and punctuation characters.

In [None]:
# split the text on whitespace (\s), commas, periods, question marks, underscores,
# exclamation marks, quotation marks, parentheses, apostrophes, and double dashes;
# the capturing group keeps the delimiters as tokens
text = "Hello, world! Is this-- a test?"
split_text = re.split(r'(\s|[,.?_!"()\']|--)', text)
print(split_text)

['Hello', ',', '', ' ', 'world', '!', '', ' ', 'Is', ' ', 'this', '--', '', ' ', 'a', ' ', 'test', '?', '']


In [None]:
# remove the whitespace characters
split_text = [item.strip() for item in split_text if item.strip()]
print(f"split into {len(split_text)} individual tokens:\n{split_text}")

split into 10 individual tokens:
['Hello', ',', 'world', '!', 'Is', 'this', '--', 'a', 'test', '?']
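
The parentheses around the whole pattern matter: `re.split` only returns the delimiters when the pattern contains a capturing group. The following is a minimal sketch of that difference, reusing the sample sentence from above; the variable names `pattern`, `without_group`, and `with_group` are introduced here for illustration only.

```python
import re

text = "Hello, world! Is this-- a test?"
pattern = r'\s|[,.?_!"()\']|--'

# Without a capturing group, re.split discards the delimiters,
# so the punctuation tokens are lost.
without_group = [t for t in re.split(pattern, text) if t.strip()]

# Wrapping the pattern in (...) makes re.split return the delimiters too,
# so punctuation survives as separate tokens.
with_group = [t.strip() for t in re.split(f"({pattern})", text) if t.strip()]

print(without_group)
print(with_group)
```

Keeping punctuation as tokens lets the model see sentence structure, which is why the capturing form is the one used throughout this notebook.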


### Apply Tokenizer to text

In [None]:
preprocessed = re.split(r'(\s|[,.?_!"()\']|--)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(f"Total number of tokens, excluding white spaces: {len(preprocessed)}")

Total number of tokens, excluding white spaces: 4649


In [None]:
# print the first 20 tokens
print(f"First 20 tokens:\n{preprocessed[:20]}")

First 20 tokens:
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was']


## Map tokens into IDs

Build a vocabulary by sorting the unique tokens of the tokenized training text alphabetically and mapping each unique word and special character to a unique integer, called a token ID.

### Vocabulary

In [None]:
# sort the list of unique tokens
sorted_tokens = sorted(set(preprocessed))
vocab_size = len(sorted_tokens)
print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 1159


In [None]:
# create the vocabulary by mapping the alphabetically sorted tokens to unique integers
vocab = {token: integer for integer, token in enumerate(sorted_tokens)}
# show the first 50 tokens and their IDs
list(vocab.items())[:50]

### Tokens into IDs

In [None]:
class TokenizerV1:
  """ A tokenizer class with encode method that splits text
      into tokens and carries out string-to-integer mapping
      to produce token IDs via vocabulary """

  def __init__(self, vocab):
      self.str_to_int = vocab
      self.int_to_str = {i:s for s, i  in vocab.items()}

  def encode(self, text):
      preprocessed = re.split(r'(\s|[,.?_!"()\']|--)', text)
      preprocessed = [item.strip() for item in preprocessed if item.strip()]
      ids = [self.str_to_int[s] for s in preprocessed]
      return ids

  def decode(self, ids):
      text = " ".join([self.int_to_str[i] for i in ids])
      text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
      return text


In [None]:
# instantiate a tokenizer object to tokenize a given text
tokenizer = TokenizerV1(vocab)

test_text1 = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
test_text2 = "Hello, do you like tea?"
test_ids = tokenizer.encode(test_text1)

print(f"sample text: {test_text1}")
print(f"encoded tokens: {test_ids}")
print(f"decoded token: {tokenizer.decode(test_ids)}")

# N.B.: a word that is not in the vocabulary raises a KeyError in encode()

sample text: "It's the last he painted, you know," Mrs. Gisburn said with pardonable pride.
encoded tokens: [1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]
decoded token: " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
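
The failure mode for out-of-vocabulary words can be reproduced without the training text. The sketch below assumes a hypothetical four-token vocabulary (`toy_vocab`) and a standalone `encode` helper that mirrors the splitting logic of `TokenizerV1.encode`; neither is part of the notebook's own code.

```python
import re

# Hypothetical toy vocabulary, standing in for the one built from the-verdict.txt.
toy_vocab = {token: i for i, token in enumerate(sorted(["Hello", ",", "world", "!"]))}

def encode(text, vocab):
    """Split text into tokens and look each one up in vocab (no fallback)."""
    tokens = re.split(r'(\s|[,.?_!"()\']|--)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [vocab[t] for t in tokens]

# Every token is in the vocabulary, so encoding succeeds.
known = encode("Hello, world!", toy_vocab)
print(known)

# "tea" is not in the vocabulary, so the dictionary lookup raises KeyError.
try:
    encode("Hello, tea!", toy_vocab)
    failed_on = None
except KeyError as err:
    failed_on = err.args[0]
    print(f"KeyError: {failed_on!r} is not in the vocabulary")
```

A plain dictionary lookup has no notion of "unknown", which is the motivation for the special tokens introduced later.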


In [None]:
test_text2

'Hello, do you like tea?'

In [None]:
# round-trip the training text through the tokenizer (output truncated)
tokenizer.decode(tokenizer.encode(raw_text))

'I HAD always thought Jack Gisburn rather a cheap genius -- though a good fellow enough -- so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera.( Though I rather thought it would have been Rome or Florence.)" The height of his glory" -- that was what the women called it. I can hear Mrs. Gideon Thwing -- his last Chicago sitter -- deploring his unaccountable abdication." Of course it\' s going to send the value of my picture\' way up; but I don\' t think of that, Mr. Rickham -- the loss to Arrt is all I think of." The word, on Mrs. Thwing\' s lips, multiplied its _ rs _ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn\' s" Moon-dancers" to say, with tears in her eyes:" We shall not look upon its like again"? 

#### Special Context Tokens

Modify the vocabulary and the tokenizer by adding two special tokens: <|endoftext|>, which marks document boundaries, and <|unk|>, which stands in for unknown words.

#### Extend Vocabulary

In [None]:
# add special tokens to the vocabulary built on the training text
sorted_tokens = sorted(set(preprocessed))
sorted_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(sorted_tokens)}
print(f"Vocabulary size: {len(vocab)}")

In [None]:
# show the last 5 entries of the updated vocab
list(vocab.items())[-5:]

## Tokenizer that handles unknown words

In [None]:
class TokenizerV2:
  """ A tokenizer class with encode method that splits text
      into tokens and carries out string-to-integer mapping
      to produce token IDs via vocabulary. It replaces unknown
      words by <|unk|> token. """

  def __init__(self, vocab):
      self.str_to_int = vocab
      self.int_to_str = {i:s for s, i  in vocab.items()}

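  def encode(self, text):
      preprocessed = re.split(r'(\s|[,.?_!"()\']|--)', text)
      preprocessed = [item.strip() for item in preprocessed if item.strip()]
      # replace tokens that are not in the vocabulary with <|unk|>
      preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
      ids = [self.str_to_int[s] for s in preprocessed]
      return ids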

  def decode(self, ids):
      text = " ".join([self.int_to_str[i] for i in ids])
      text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
      return text

In [None]:
# test the tokenizer on sample text
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))

print(f"Sample text:\n{text}")

tokenizer = TokenizerV2(vocab)

print(f"Tokenized text:\n{tokenizer.encode(text)}")
print(f"De-tokenized text:\n{tokenizer.decode(tokenizer.encode(text))}")

## Byte Pair Encoding

Byte pair encoding (BPE) builds its vocabulary by iteratively merging frequent pairs of characters into subwords, and frequent subwords into longer subwords or whole words.
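
To make the merging idea concrete, here is a minimal sketch of the BPE training loop on a toy word-frequency table. It is a simplified illustration, not the GPT-2 implementation: the corpus, the helper names `most_frequent_pair` and `merge_pair`, and the tie-breaking rule (first pair seen wins) are all assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # max() returns the first pair seen among equally frequent ones
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): each word split into characters, with its count.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

In this toy run the frequent suffix "est" is assembled from single characters in two merges, which is the sense in which BPE grows subwords out of characters.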

In [None]:
!pip install tiktoken

In [None]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

In [None]:
# instantiate BPE tokenizer from tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

The BPE tokenizer used to train models like GPT-2, GPT-3 and ChatGPT has a vocabulary size of 50,257, with <|endoftext|> assigned the largest token ID (50256).

In [None]:
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(f"Token IDs:\n{integers}")

strings = tokenizer.decode(integers)

print(f"De-tokenized text:\n{strings}")

The algorithm underlying BPE breaks words that are not in its predefined vocabulary into smaller subword units, or even individual characters, which lets it handle out-of-vocabulary words. As a result, the tokenizer, and consequently the LLM trained with it, can process any text, even text containing words that were not present in its training data.
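
The fallback idea can be sketched without tiktoken. The example below is a deliberately simplified stand-in, not the GPT-2 algorithm: it uses greedy longest-match against a hypothetical list of subwords and falls back to raw UTF-8 bytes (IDs 0 to 255) for anything it cannot match. The names `build_vocab`, `encode`, and `decode`, and the ID layout, are assumptions made for this illustration.

```python
def build_vocab(subwords):
    """IDs 0-255 are reserved for raw bytes; subwords get IDs from 256 upward."""
    return {s: 256 + i for i, s in enumerate(subwords)}

def encode(text, subword_ids):
    """Greedy longest-match over known subwords, with a byte fallback."""
    max_len = max(map(len, subword_ids), default=1)
    ids, i = [], 0
    while i < len(text):
        # try the longest candidate first
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in subword_ids:
                ids.append(subword_ids[piece])
                i = j
                break
        else:
            # no known subword starts here: emit the character's UTF-8 bytes
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids

def decode(ids, subword_ids):
    """Rebuild the byte string from byte IDs and subword IDs, then decode it."""
    id_to_subword = {v: k for k, v in subword_ids.items()}
    data = bytearray()
    for token_id in ids:
        if token_id < 256:
            data.append(token_id)
        else:
            data.extend(id_to_subword[token_id].encode("utf-8"))
    return data.decode("utf-8")

# Hypothetical subword list; a real BPE vocabulary is learned from data.
subword_ids = build_vocab(["some", "unknown", "Place", "the"])

text = "someunknownPlace é"
ids = encode(text, subword_ids)
print(ids)
print(decode(ids, subword_ids) == text)
```

Because every string has a UTF-8 byte representation, an encoder with this kind of fallback never needs an <|unk|> token, at the cost of longer ID sequences for unfamiliar text.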

In [None]:
# test the BPE tokenizer on the unknown words "Akwirw ier"
text = "Akwirw ier"
integers = tokenizer.encode(text)

print(f"Token IDs:\n{integers}")

strings = tokenizer.decode(integers)

print(f"De-tokenized text:\n{strings}")