<a href="https://colab.research.google.com/github/ajayrfhp/LearningDeepLearning/blob/main/working_with_text_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Contents**
- Prepare text for LLM training
- Splitting text into word and subword tokens
- Byte pair encoding
- Sampling training examples
- Convert tokens into vectors that are fed into the LLM

- The smallest GPT-2 models (117M and 125M parameters) use an embedding size of 768.
- GPT-3 (175B parameters) uses an embedding size of 12,288.
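
To put these embedding sizes in perspective, the token embedding table alone holds `vocab_size * embedding_dim` weights. The sketch below is only rough arithmetic. It assumes the 50,257-token BPE vocabulary of GPT-2 and, as an additional assumption, reuses the same vocabulary size for GPT-3. The helper name `embedding_table_params` is made up for illustration.

```python
# Assumption: 50,257 is the GPT-2 BPE vocabulary size; reused here for GPT-3.
ASSUMED_VOCAB_SIZE = 50_257

def embedding_table_params(vocab_size, embedding_dim):
    # One row of `embedding_dim` weights per token in the vocabulary.
    return vocab_size * embedding_dim

for name, dim in [("GPT-2 (smallest)", 768), ("GPT-3 (175B)", 12_288)]:
    n_params = embedding_table_params(ASSUMED_VOCAB_SIZE, dim)
    print(f"{name}: {n_params:,} embedding weights")
```

Under these assumptions the embedding table accounts for a sizeable share of the smallest GPT-2 model but only a small fraction of GPT-3.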

## Tokenize text

In [None]:
import urllib.request
import re
from collections import defaultdict

In [None]:
def download_data(url, file_path):
  # Fetch the file at `url` and save it to `file_path`.
  urllib.request.urlretrieve(url, file_path)

def read_data(file_path):
  # Return the whole file as a single UTF-8 string.
  with open(file_path, 'r', encoding='utf-8') as f:
    return f.read()


url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
file_path = "verdict.txt"
download_data(url, file_path)
data = read_data(file_path)
print(data[:1000])  # preview the first 1,000 characters




I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"?

Well!--even through th

- Whitespace tokens are useful when training models to generate code, where indentation and spacing carry meaning. For plain text data, dropping them shortens the token sequence and so reduces memory and compute requirements.
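
As a small illustration of this trade-off, the sketch below tokenizes a made-up sentence twice with the same split pattern: once keeping whitespace tokens and once dropping them. The sample string and variable names are assumptions chosen for the example.

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'
sample = "Hello, world. Is this-- a test?"

# Keep whitespace tokens: only the empty strings produced by re.split are dropped.
with_ws = [t for t in re.split(PATTERN, sample) if t != ""]

# Drop whitespace tokens as well: strip() is empty for whitespace-only strings.
without_ws = [t for t in re.split(PATTERN, sample) if t.strip()]

print(len(with_ws), with_ws)
print(len(without_ws), without_ws)
```

Keeping whitespace makes the token list lossless (joining the tokens reproduces the input exactly), at the cost of a longer sequence.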

In [None]:
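def tokenize_text(text):
  # Split on punctuation, double dashes, and whitespace. The capturing group
  # keeps each delimiter in the result, so punctuation becomes its own token.
  tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
  # Drop empty strings and whitespace-only tokens.
  tokens = [token for token in tokens if token.strip()]
  return tokens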

tokens = tokenize_text(data)
print(tokens[:30])
print(len(tokens))


['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
4690
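
The parentheses around the split pattern matter: `re.split` returns the text matched by a capturing group, so punctuation survives as tokens. A minimal sketch on a made-up sample string, comparing the pattern with and without the group:

```python
import re

sample = 'Well!--even "so", he said.'

# Without a capturing group the delimiters are discarded.
no_capture = re.split(r'[,.:;?_!"()\']|--|\s', sample)
# With a capturing group each matched delimiter is returned as its own item.
with_capture = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

dropped = [t for t in no_capture if t.strip()]
kept = [t for t in with_capture if t.strip()]

print(dropped)
print(kept)
```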


## Convert tokens into integers

In [None]:
class TokenizerV1:
  def __init__(self, text):
    self.text = text
    self.tokens = self.tokenize_text(self.text)
    print(f"Text tokenized {len(self.tokens)}")
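    # Tokens not seen while building the vocab get a fresh id on first lookup,
    # so encoding unseen words silently grows this mapping.
    self.word_to_idx = defaultdict(lambda: len(self.word_to_idx))
    # Any id without a known word decodes to "UNK".
    self.idx_to_word = defaultdict(lambda : "UNK")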
    self.build_vocab()
    print(f"Vocab size: {self.vocab_size} constructed")


  def tokenize_text(self, text):
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [token for token in tokens if token.strip()]


  def build_vocab(self):
    self.all_words = sorted(set(self.tokens))
    self.vocab_size = len(self.all_words)
    for idx, word in enumerate(self.all_words):
      self.word_to_idx[word] = idx
      self.idx_to_word[idx] = word

  def encode(self, text_to_be_encoded):
    tokens = self.tokenize_text(text_to_be_encoded)
    return [self.word_to_idx[token] for token in tokens]

  def decode(self, encoded_text):
    decoded_text = " ".join([self.idx_to_word[idx] for idx in encoded_text])
    print(decoded_text)
    # Remove the space that " ".join inserted before punctuation.
    decoded_text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', decoded_text)
    return decoded_text


tokenizer = TokenizerV1(data)


Text tokenized 4690
Vocab size: 1130 constructed


In [None]:
tokenizer.encode("Jack is a genius")

[57, 584, 115, 486]

In [None]:
tokenizer.decode(tokenizer.encode("Jack is a genius")) == "Jack is a genius"

Jack is a genius


True
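
The round trip above contains no punctuation and no unseen words. The sketch below repeats the encode/decode idea on a toy corpus that has both. It is a standalone illustration, not the `TokenizerV1` class: the corpus, the plain-dict vocabulary, and the explicit `<|unk|>` entry are assumptions made for the example.

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'

def tokenize(text):
    return [t for t in re.split(PATTERN, text) if t.strip()]

corpus = "Jack is a genius, though a cheap one."
vocab = sorted(set(tokenize(corpus)))
vocab.append("<|unk|>")  # hypothetical placeholder for unseen tokens

word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
unk_id = word_to_idx["<|unk|>"]

def encode(text):
    # Unseen tokens all map to the single unknown id.
    return [word_to_idx.get(t, unk_id) for t in tokenize(text)]

def decode(ids):
    text = " ".join(idx_to_word[i] for i in ids)
    # Remove the space that " ".join inserted before punctuation.
    return re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)

print(decode(encode("Jack is a genius, though.")))
print(decode(encode("Jack is a painter.")))
```

Mapping every unseen word to one fixed id keeps the vocabulary size constant at inference time, but the original word cannot be recovered on decode.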