# Dependencies setup

In [28]:
!pip install uv



Instead of using a `requirements.txt`, install all required packages in one
command:


In [29]:
!uv pip install \
  "torch>=2.3.0" \
  "jupyterlab>=4.0" \
  "tiktoken>=0.5.1" \
  "matplotlib>=3.7.1" \
  "tensorflow>=2.18.0" \
  "tqdm>=4.66.1" \
  "numpy>=1.26,<2.1" \
  "pandas>=2.2.1" \
  "psutil>=5.9.5"

Using Python 3.11.13 environment at: /usr
Audited 9 packages in 109ms


# Text tokenization

Text tokenization is the process of breaking raw text into discrete units (“tokens”) that a language model can process. It matters for four reasons:

- Vocabulary alignment: Each token maps to an index in the model’s embedding matrix, so every piece of input corresponds to a learned representation.
- Efficiency: Subword schemes such as byte-pair encoding balance vocabulary size against sequence length, which keeps memory use and inference time in check.
- Robustness: Subword tokenization handles unknown or rare words gracefully (e.g., a tokenizer might split “unfamiliarity” into “un”, “familiar”, “ity”), which helps the model generalize.
- Context control: Accurate token counts make it possible to manage the model’s context window precisely and avoid truncating important information.

Together, these factors let LLMs process and generate text effectively and efficiently.
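To make the context-control point concrete, here is a minimal sketch that counts tokens and trims a sequence to a fixed budget. It reuses the punctuation-aware split pattern applied later in this notebook; the helper names `tokenize` and `fit_to_context` and the 8-token limit are assumptions chosen for illustration, not part of any library.

```python
import re

# Split rule borrowed from the later cells of this notebook.
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'


def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]


def fit_to_context(tokens, max_tokens):
    """Hypothetical helper: keep only the last `max_tokens` tokens."""
    return tokens[max(0, len(tokens) - max_tokens):]


sample = "Hello, world. Is this-- a test?"
tokens = tokenize(sample)
print(len(tokens))                # 10
print(fit_to_context(tokens, 8))  # drops the two oldest tokens
```

Keeping the most recent tokens is only one possible policy; which end to truncate depends on the task.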

In [30]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()

In [31]:
raw_text

'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)\n\n"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it\'s going to send the value of my picture \'way up; but I don\'t think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing\'s lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn\'s "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"?\n\nWell!--even 

In [32]:
len(raw_text)

20479

In [33]:
import re

In [34]:
text = 'Hello world. This is a test'
pattern = r"\w+|[^\w\s]"
tokens = re.findall(pattern, text)

In [35]:
tokens

['Hello', 'world', '.', 'This', 'is', 'a', 'test']

In [36]:
pattern = r"\s+|\w+|[^\w\s]"

In [37]:
tokens = re.findall(pattern, text)

In [38]:
tokens

['Hello', ' ', 'world', '.', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test']

In [39]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]

In [40]:
result

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

In [41]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

In [42]:
preprocessed[:30]

['I',
 'HAD',
 'always',
 'thought',
 'Jack',
 'Gisburn',
 'rather',
 'a',
 'cheap',
 'genius',
 '--',
 'though',
 'a',
 'good',
 'fellow',
 'enough',
 '--',
 'so',
 'it',
 'was',
 'no',
 'great',
 'surprise',
 'to',
 'me',
 'to',
 'hear',
 'that',
 ',',
 'in']

In [43]:
len(preprocessed)

4690

In [44]:
preprocessed[:10]

['I',
 'HAD',
 'always',
 'thought',
 'Jack',
 'Gisburn',
 'rather',
 'a',
 'cheap',
 'genius']

# Converting tokens into token IDs

Converting tokens into token IDs is the process of mapping each text token to its unique integer index in the model’s vocabulary. It’s important because:

- Embedding lookup: IDs serve as pointers into the embedding matrix, turning discrete tokens into continuous vector representations the model can process.
- Model compatibility: Neural networks operate on numeric tensors, so IDs provide a standardized, language-agnostic input format.
- Efficiency: Integer IDs enable fast batch-processing and optimized memory usage compared to handling raw strings.
- Reproducibility: A fixed token-to-ID mapping ensures experiments and deployments yield consistent behavior across runs.
- Error handling: Unknown tokens can be assigned a special ID (e.g., `<unk>`), allowing the model to gracefully deal with out-of-vocabulary inputs.
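The embedding-lookup and error-handling points can be sketched with plain Python lists. Everything here is a toy assumption: the five-entry vocabulary, the reserved `<unk>` entry, the 3-dimensional random vectors, and the helper names `to_ids` and `embed`. A real model would use a learned tensor rather than random floats.

```python
import random

# Toy vocabulary; "<unk>" is an assumed reserved token for unknown words.
toy_tokens = ["<unk>", "a", "is", "test", "this"]
token_to_id = {tok: i for i, tok in enumerate(toy_tokens)}
UNK_ID = token_to_id["<unk>"]

# Stand-in embedding matrix: one row of 3 floats per vocabulary entry.
rng = random.Random(0)  # fixed seed so the rows are the same on every run
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(3)] for _ in toy_tokens]


def to_ids(tokens):
    """Map tokens to IDs, falling back to UNK_ID for unknown tokens."""
    return [token_to_id.get(tok, UNK_ID) for tok in tokens]


def embed(ids):
    """Look up one embedding row per token ID."""
    return [embedding_matrix[i] for i in ids]


ids = to_ids(["this", "is", "a", "quiz"])
print(ids)  # [4, 2, 1, 0] -> "quiz" is not in the vocabulary, so it maps to <unk>
vectors = embed(ids)
print(len(vectors), len(vectors[0]))  # 4 3
```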

In [45]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

In [46]:
vocab_size

1130

In [47]:
vocab = {token:integer for integer,token in enumerate(all_words)}

In [48]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


In [49]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that " ".join() leaves before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [50]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)

In [51]:
ids

[1,
 56,
 2,
 850,
 988,
 602,
 533,
 746,
 5,
 1126,
 596,
 5,
 1,
 67,
 7,
 38,
 851,
 1108,
 754,
 793,
 7]

In [52]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [53]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
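Two limits of this tokenizer are visible in the output above: the round trip does not reproduce the input exactly (note the space after the opening quote and inside `It' s`), and `encode` can only handle tokens that are already in the vocabulary. The sketch below reproduces both effects on a short string. `TinyTokenizer` and `split_tokens` are hypothetical names for a condensed copy of the `SimpleTokenizerV1` logic, and the vocabulary is built from the sample sentence alone.

```python
import re


def split_tokens(text):
    """Split on punctuation, double dashes, and whitespace, as in the cells above."""
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in parts if p.strip()]


class TinyTokenizer:
    """Condensed copy of the SimpleTokenizerV1 logic, for illustration only."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        return [self.str_to_int[s] for s in split_tokens(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)


text = "Hello, world. Is this-- a test?"
vocab = {t: i for i, t in enumerate(sorted(set(split_tokens(text))))}
tok = TinyTokenizer(vocab)

# 1) The round trip is lossy: "--" comes back with a space in front of it.
round_trip = tok.decode(tok.encode(text))
print(round_trip)  # Hello, world. Is this -- a test?

# 2) A word outside the vocabulary raises KeyError.
missing = None
try:
    tok.encode("Hello, planet.")
except KeyError as err:
    missing = err.args[0]
print(missing)  # planet
```

Reserving an ID for unknown tokens, as mentioned in the list of reasons earlier in this section, is one common way to avoid the `KeyError`.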