# Reading in a short story as a text sample in Python

# Step 1: Creating Tokens

<div style="background-color:#e0f7e9; padding: 10px; border-left: 5px solid green;">
    <b>Note:</b> The code below prints the total number of characters, followed by the first 99 characters of the file for illustration purposes.
</div>


In [1]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("The total number of characters in the story:", len(raw_text))
print(raw_text[:99])

FileNotFoundError: [Errno 2] No such file or directory: 'the-verdict.txt'

<div style="background-color:#e0f7e9; padding: 10px; border-left: 5px solid green;">
    <b>Note:</b> Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training.
</div>

<div style="background-color:#d0f0fd; padding: 10px; border-left: 5px solid #2196F3;">
    <b>Note:</b> It's common to process millions of articles and hundreds of thousands of
    books -- many gigabytes of text -- when working with LLMs. However, for educational
    purposes, it's sufficient to work with smaller text samples like a single book to
    illustrate the main ideas behind the text processing steps and to make it possible to
    run it in reasonable time on consumer hardware.
</div>

<div style="background-color:#ffe0b2; padding: 10px; border-left: 5px solid orange;">
    <b>Note:</b> How can we best split this text to obtain a list of tokens? For this, we go on a small
    excursion and use Python's regular expression library <code>re</code> for illustration purposes. (Note
    that you don't have to learn or memorize any regular expression syntax since we will
    transition to a pre-built tokenizer later in this chapter.)
</div>

<div style="background-color:#d0f0fd; padding: 10px; border-left: 5px solid #2196F3;">
    <b>Info:</b> Using some simple example text, we can use the <code>re.split</code> command with the following
    syntax to split a text on whitespace characters.
</div>

In [None]:
import re
text = "Hello, World. This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'World.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #9c27b0;">
    <b>Note:</b> The result is a list of individual words and whitespace characters. Punctuation is still attached to the preceding word (for example, <code>'Hello,'</code>).
</div>
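
The parentheses in the pattern matter: `re.split` only returns the separators when the pattern contains a capturing group. A minimal sketch comparing both forms on the same sample text (the variable names `without_group` and `with_group` are illustrative):

```python
import re

text = "Hello, World. This, is a test."

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched separator is kept as its own list entry.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)
```

Keeping the separators is what later allows us to decide explicitly whether whitespace should become a token or be dropped.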

<div style="background-color:#fff9c4; padding: 10px; border-left: 5px solid #fbc02d;">
    <b>Note:</b> Let's modify the regular expression so that it splits on whitespace (<code>\s</code>) as well as commas and periods (<code>[,.]</code>).
</div>

In [None]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'World', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


<div style="background-color:#e0f7e9; padding: 10px; border-left: 5px solid green;">
    <b>Note:</b> We can see that the words and the punctuation characters matched by <code>[,.]</code> are now separate list entries, just as we wanted.
</div>
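
The empty strings in the output above appear wherever two separators are adjacent, such as a comma directly followed by a space: `re.split` emits the (empty) text between them. A small sketch on a shortened sample string (chosen here for illustration):

```python
import re

# A comma immediately followed by a space: two separators in a row.
parts = re.split(r'([,.]|\s)', "Hello, World")
print(parts)  # ['Hello', ',', '', ' ', 'World']

# item.strip() is an empty (falsy) string for both '' and ' ',
# so one filter removes empty strings and whitespace tokens alike.
cleaned = [item for item in parts if item.strip()]
print(cleaned)  # ['Hello', ',', 'World']
```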

<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #9c27b0;">
    <b>Note:</b> A small remaining issue is that the list still includes whitespace characters. Optionally, we
    can remove these redundant characters safely as follows:
</div>

In [None]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'World', '.', 'This', ',', 'is', 'a', 'test', '.']


<div style="background-color:#ffe0b2; padding: 10px; border-left: 5px solid orange;">
    <b>REMOVING WHITESPACES OR NOT</b><br><br>
    When developing a simple tokenizer, whether we should encode whitespaces as
    separate characters or just remove them depends on our application and its
    requirements. Removing whitespaces reduces the memory and computing
    requirements. However, keeping whitespaces can be useful if we train models that
    are sensitive to the exact structure of the text (for example, Python code, which is
    sensitive to indentation and spacing). Here, we remove whitespaces for simplicity
    and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme
    that includes whitespaces.
</div>
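
To make the trade-off concrete, the following sketch tokenizes a two-line Python snippet with and without whitespace tokens. The snippet, the reduced pattern, and the variable names are illustrative assumptions, not part of the tokenizer developed in this chapter.

```python
import re

code = "if x:\n    y = 1"

# Keep every piece, including whitespace tokens (drop only empty strings).
with_ws = [t for t in re.split(r'([,.:]|\s)', code) if t != ""]

# Drop whitespace tokens, as done in this section.
without_ws = [t for t in with_ws if t.strip()]

print("".join(with_ws) == code)  # True: the original text is recoverable
print(without_ws)                # ['if', 'x', ':', 'y', '=', '1']
```

Once the whitespace tokens are gone, the newline and the four-space indentation cannot be reconstructed from the token list alone.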

<div style="background-color:#bbdefb; padding: 10px; border-left: 5px solid #2196f3;">
    <b>Note:</b> The tokenization scheme we devised above works well on the simple sample text. Let's
    modify it a bit further so that it can also handle other types of punctuation, such as
    question marks, quotation marks, and the double-dashes we have seen earlier in the first
    99 characters of Edith Wharton's short story, along with additional special characters.
</div>


In [None]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'Is', ' ', 'this', '--', '', ' ', 'a', ' ', 'test', '?', '']


In [None]:
# Filter out whitespace-only and empty strings
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [None]:
text = "And I, Am...Iron Ma!"
result = re.split(r'(\.\.\.|[,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['And', 'I', ',', 'Am', '...', 'Iron', 'Ma', '!']


<div style="background-color:#e8f5e9; padding: 10px; border-left: 5px solid #43a047;">
    <b>Note:</b> Now that we have a basic tokenizer, let's apply it to the story.
</div>

In [None]:
preprocessed = re.split(r'(\.\.\.|[,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [None]:
print(len(preprocessed))

4690


# Step 2: Creating Token IDs

<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #8e24aa;">
    <b>Note:</b> In the previous section, we tokenized Edith Wharton's short story and assigned it to a
    Python variable called <code>preprocessed</code>. Let's now create a list of all unique tokens and sort
    them alphabetically to determine the vocabulary size.
</div>

In [None]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


<div style="background-color:#fff3e0; padding: 10px; border-left: 5px solid #fb8c00;">
    <b>Note:</b> After determining that the vocabulary size is 1,130 via the above code, we create the
    vocabulary and print its first entries for illustration purposes (the loop below stops after 301 entries).
</div>

In [None]:
vocab = {token:integer for integer,token in enumerate(all_words)}

In [None]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i>=300:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
('His', 51)
('How', 52)
('I', 53)
('If', 54)
('In', 55)
('It', 56)
('Jack', 57)
('Jove', 58)
('Just', 59)
('Lord', 60)
('Made', 61)
('Miss', 62)
('Money', 63)
('Monte', 64)
('Moon-dancers', 65)
('Mr', 66)
('Mrs', 67)
('My', 68)
('Never', 69)
('No', 70)
('Now', 71)
('Nutley', 72)
('Of', 73)
('Oh', 74)
('On', 75)
('Once', 76)
('Only', 77)
...

<div style="background-color:#fff9c4; padding: 10px; border-left: 5px solid #fbc02d;">
    <b>Note:</b> As we can see, based on the output above, the dictionary contains individual tokens
    associated with unique integer labels.
</div>

<div style="background-color:#d0f0c0; padding: 10px; border-left: 5px solid #388e3c;">
    <b>Note:</b> Later in this book, when we want to convert the outputs of an LLM from numbers back into
    text, we also need a way to turn token IDs into text. <br><br>
    For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.
</div>
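
A minimal sketch of such an inverse mapping, using a tiny hand-made vocabulary (`toy_vocab` is a hypothetical stand-in for the `vocab` dictionary built above):

```python
# Hypothetical three-token vocabulary standing in for `vocab`.
toy_vocab = {"!": 0, "Hello": 1, "world": 2}

# Swap keys and values to map token IDs back to tokens.
inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}

ids = [toy_vocab[t] for t in ["Hello", "world", "!"]]
tokens = [inverse_vocab[i] for i in ids]
print(ids)     # [1, 2, 0]
print(tokens)  # ['Hello', 'world', '!']
```

This works because every token has a unique ID, so the mapping can be inverted without collisions.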

<div style="background-color:#e3f2fd; padding: 10px; border-left: 5px solid #1e88e5;">
    <b>Note:</b> Let's implement a complete tokenizer class in Python. <br><br>
    The class will have an <code>encode</code> method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. <br><br>
    In addition, we implement a <code>decode</code> method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.
</div>

<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #8e24aa;">
    <b>Steps:</b><br>
    Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods<br>
    Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens<br>
    Step 3: Process input text into token IDs<br>
    Step 4: Convert token IDs back into text<br>
    Step 5: Replace spaces before the specified punctuation
</div>


In [None]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'(\.\.\.|--|[.,;:?_!"()\']|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation characters
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

<div style="background-color:#fff3e0; padding: 10px; border-left: 5px solid #fb8c00;">
    Let's instantiate a new tokenizer object from the <code>SimpleTokenizerV1</code> class and tokenize a
    passage from Edith Wharton's short story to try it out in practice:
</div>

In [None]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


<div style="background-color:#e8f5e9; padding: 10px; border-left: 5px solid #43a047;">
    <b>Note:</b> The code above prints the token IDs for the passage.<br>
    Next, let's see if we can turn these token IDs back into text using the <code>decode</code> method.
</div>


In [None]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

<div style="background-color:#fff9c4; padding: 10px; border-left: 5px solid #fbc02d;">
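    <b>Note:</b> Based on the output above, we can see that the <code>decode</code> method converted the
    token IDs back into text that closely matches the original. The remaining differences are in whitespace, for example the extra spaces in <code>" It' s</code>.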
</div>

<div style="background-color:#e3f2fd; padding: 10px; border-left: 5px solid #2196f3;">
    <b>Note:</b> So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing
    text based on a snippet from the training set. <br><br>
    Let's now apply it to a new text sample that is not contained in the training set:
</div>

In [None]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #9c27b0;">
    <b>Note:</b> The problem is that the word "Hello" was not used in the short story <i>The Verdict</i>. <br><br>
    Hence, it is not contained in the vocabulary. <br><br>
    This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.
</div>
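
Before encoding new text, it can help to list which tokens are missing from the vocabulary. The sketch below assumes a hypothetical helper, `find_unknown_tokens`, and a tiny stand-in vocabulary (`toy_vocab`); it reuses the splitting pattern applied to the story above.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens in `text` that have no entry in `vocab` (hypothetical helper)."""
    tokens = re.split(r'(\.\.\.|[,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [t for t in tokens if t not in vocab]

# Tiny stand-in vocabulary for illustration only.
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}
missing = find_unknown_tokens("Hello, do you like tea?", toy_vocab)
print(missing)  # ['Hello']
```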

# ADDING SPECIAL CONTEXT TOKENS
In the previous section, we implemented a simple tokenizer and applied it to a passage
from the training set.

In this section, we will modify this tokenizer to handle unknown
words.


In particular, we will extend the vocabulary and the SimpleTokenizerV1 class from the
previous section into a new class, SimpleTokenizerV2, that supports two new tokens, <|unk|> and
<|endoftext|>.

<div style="background-color:#fff3e0; padding: 10px; border-left: 5px solid #fb8c00;">
    <b>Note:</b> We can modify the tokenizer to use an <code>&lt;|unk|&gt;</code> token if it encounters a word that is not part of the vocabulary. <br><br>
    Furthermore, we add a token between unrelated texts. <br><br>
    For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source.
</div>

<div style="background-color:#fce4ec; padding: 10px; border-left: 5px solid #f06292;">
    <b>Note:</b> Let's now modify the vocabulary to include these two special tokens, <code>&lt;|unk|&gt;</code> and <code>&lt;|endoftext|&gt;</code>, by adding them to the list of all unique words that we created in the previous section.
</div>


In [None]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [None]:
len(vocab.items())

1132

<div style="background-color:#fff9c4; padding: 10px; border-left: 5px solid #fbc02d;">
    <b>Note:</b> Based on the output of the print statement above, the new vocabulary size is 1132 (the vocabulary size in the previous section was 1130).
</div>

<div style="background-color:#eeeeee; padding: 10px; border-left: 5px solid #9e9e9e;">
    <b>Note:</b> As an additional quick check, let's print the last 5 entries of the updated vocabulary.
</div>


In [None]:
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


<div style="background-color:#e8f5e9; padding: 10px; border-left: 5px solid #43a047;">
    <b>Note:</b> The following class implements a simple text tokenizer that handles unknown words.
</div>

<div style="background-color:#e3f2fd; padding: 10px; border-left: 5px solid #1e88e5;">
    <b>Info:</b>
    <ul style="margin: 0; padding-left: 20px;">
        <li>Step 1: Replace unknown words by <code>&lt;|unk|&gt;</code> tokens</li>
        <li>Step 2: Replace spaces before the specified punctuations</li>
    </ul>
</div>

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'(\.\.\.|--|[.,;:?_!"()\']|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation characters
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [None]:
tokenizer = SimpleTokenizerV2(vocab)
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [None]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [None]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

<div style="background-color:#d0f0c0; padding: 10px; border-left: 5px solid #4caf50;">
    <b>Info:</b> Based on comparing the de-tokenized text above with the original input text, we know that
    the training dataset, Edith Wharton's short story <i>The Verdict</i>, did not contain the words
    <code>"Hello"</code> and <code>"palace"</code>.
</div>

<div style="background-color:#e0f7fa; padding: 10px; border-left: 5px solid #0097a7;">
  <b>Info:</b> So far, we have discussed tokenization as an essential step in processing text as input to
  LLMs. Depending on the LLM, some researchers also consider additional special tokens such
  as the following:
  <ul>
    <li><b>[BOS]</b> (beginning of sequence): This token marks the start of a text. It
    signifies to the LLM where a piece of content begins.</li>
    <li><b>[EOS]</b> (end of sequence): This token is positioned at the end of a text,
    and is especially useful when concatenating multiple unrelated texts,
    similar to <code>&lt;|endoftext|&gt;</code>. For instance, when combining two different
    Wikipedia articles or books, the [EOS] token indicates where one article
    ends and the next one begins.</li>
    <li><b>[PAD]</b> (padding): When training LLMs with batch sizes larger than one,
    the batch might contain texts of varying lengths. To ensure all texts have
    the same length, the shorter texts are extended or "padded" using the
    [PAD] token, up to the length of the longest text in the batch.</li>
  </ul>
</div>
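
As a sketch of the padding idea, the following pure-Python function pads a batch of token-ID lists to a common length. The function name `pad_batch`, the sample IDs, and the padding ID `0` are assumptions for illustration; real tokenizers define their own padding token and ID.

```python
def pad_batch(batch, pad_id):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Hypothetical token IDs; 0 is assumed to be the [PAD] ID here.
batch = [[5, 8, 2], [7], [3, 9]]
padded = pad_batch(batch, pad_id=0)
print(padded)  # [[5, 8, 2], [7, 0, 0], [3, 9, 0]]
```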


<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #9c27b0;">
  <b>Note:</b> The tokenizer used for GPT models does not need any of the tokens mentioned
  above; it only uses an <code>&lt;|endoftext|&gt;</code> token for simplicity.
</div>

<div style="background-color:#fff3e0; padding: 10px; border-left: 5px solid #fb8c00;">
  <b>Info:</b> The tokenizer used for GPT models also doesn't use an <code>&lt;|unk|&gt;</code> token for out-of-vocabulary words.
  Instead, GPT models use a byte pair encoding tokenizer, which breaks down words into subword units.
</div>


In [None]:
import importlib.metadata  # import the submodule explicitly so importlib.metadata.version is available
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [None]:
tokenizer = tiktoken.get_encoding("gpt2")

<div style="background-color:#ffe0b2; padding: 10px; border-left: 5px solid #fb8c00;">
    <b>Note:</b> The usage of this tokenizer is similar to <code>SimpleTokenizerV2</code> we implemented previously via an <code>encode</code> method.
</div>

In [None]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit of terraces"
    "of someunknownPlace."
)
integers = tokenizer.encode(text,allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 286, 8812, 2114, 1659, 617, 34680, 27271, 13]


The code above prints the token IDs shown.

We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2 earlier.

In [None]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit of terracesof someunknownPlace.


**Note:** We can make two noteworthy observations based on the token IDs and decoded text above.

First, the <|endoftext|> token is assigned a relatively large token ID, namely 50256.

In fact, the BPE tokenizer that was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT has a total vocabulary size of 50,257,
with <|endoftext|> being assigned the largest token ID.


**Note:** Second, the BPE tokenizer above encodes and decodes unknown words, such as
"someunknownPlace" correctly.

The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?

The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary
into smaller subword units or even individual characters.

This enables it to handle out-of-vocabulary words.

So, thanks to the BPE algorithm, if the tokenizer encounters an
unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or
characters.
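
The fallback idea can be illustrated with a toy segmenter. Note that this is not the BPE merge procedure itself and not how `tiktoken` is implemented: it is a greedy longest-match sketch over a hypothetical subword vocabulary (`toy_subwords`) that falls back to single characters when nothing longer matches.

```python
def segment(word, subwords):
    """Greedily split `word` into the longest known subwords, else single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; a single character always succeeds.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# Hypothetical subword vocabulary for illustration only.
toy_subwords = {"some", "unknown", "Pl", "ace"}
pieces = segment("someunknownPlace", toy_subwords)
print(pieces)                        # ['some', 'unknown', 'Pl', 'ace']
print(segment("xyz", toy_subwords))  # ['x', 'y', 'z']
```

Because single characters are always accepted, every input can be segmented and no `<|unk|>` placeholder is needed, which is the property the text above attributes to BPE.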

**Let us take another simple example to illustrate how the BPE tokenizer deals with unknown tokens**

In [None]:
integers = tokenizer.encode("Ayush Morya")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

[42012, 1530, 337, 652, 64]
Ayush Morya



# **CREATING INPUT-TARGET PAIRS**

In this section we implement a data loader that fetches the input-target pairs using a sliding window approach.

To get started, we first use the BPE tokenizer introduced in the previous sections to tokenize the entire short story The Verdict that we worked with earlier.

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))


5145


Executing the code above will return 5145, the total number of tokens in the training set, after applying the BPE tokenizer.

Next, we remove the first 50 tokens from the dataset for demonstration purposes, as this results in a slightly more interesting text passage in the next steps.

In [None]:
enc_sample = enc_text[50:]
print(enc_sample)

[290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686, 41976, 13, 357, 10915, 314, 2138, 1807, 340, 561, 423, 587, 10598, 393, 28537, 2014, 198, 198, 1, 464, 6001, 286, 465, 13476, 1, 438, 5562, 373, 644, 262, 1466, 1444, 340, 13, 314, 460, 3285, 9074, 13, 46606, 536, 5469, 438, 14363, 938, 4842, 1650, 353, 438, 2934, 489, 3255, 465, 48422, 540, 450, 67, 3299, 13, 366, 5189, 1781, 340, 338, 1016, 284, 3758, 262, 1988, 286, 616, 4286, 705, 1014, 510, 26, 475, 314, 836, 470, 892, 286, 326, 11, 1770, 13, 8759, 2763, 438, 1169, 2994, 284, 943, 17034, 318, 477, 314, 892, 286, 526, 383, 1573, 11, 319, 9074, 13, 536, 5469, 338, 11914, 11, 33096, 663, 4808, 3808, 62, 355, 996, 484, 547, 12548, 287, 281, 13079, 410, 12523, 286, 22353, 13, 843, 340, 373, 407, 691, 262, 9074, 13, 536, 48819, 508, 25722, 276, 13, 11161, 407, 262, 40123, 18113, 544, 9325, 701, 11, 379, 262, 938, 402, 1617, 261, 12917, 905, 11, 5025, 502, 878, 402, 271, 10899, 338, 366, 31640, 12, 67, 20811, 1, 284, 910, 11, 351, 10

**One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1:**

The context size determines how many tokens are included in the input.

In [None]:
context_size = 10  # length of the input
# A context_size of 10 means that the model is trained to look at a sequence of 10 tokens
# to predict the next token in the sequence.
# For example, with a context size of 4, the input x would be the first 4 tokens [1, 2, 3, 4]
# and the target y would be the next 4 tokens [2, 3, 4, 5].

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")
# Decode tokens to readable text
decoded_x = tokenizer.decode(x)
decoded_y = tokenizer.decode(y)

print(f"Decoded x: {decoded_x}")
print(f"Decoded y: {decoded_y}")

x: [290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686]
y:      [4920, 2241, 287, 257, 4489, 64, 319, 262, 34686, 41976]
Decoded x:  and established himself in a villa on the Riv
Decoded y:  established himself in a villa on the Riviera


Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:

In [None]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257
[290, 4920, 2241, 287, 257] ----> 4489
[290, 4920, 2241, 287, 257, 4489] ----> 64
[290, 4920, 2241, 287, 257, 4489, 64] ----> 319
[290, 4920, 2241, 287, 257, 4489, 64, 319] ----> 262
[290, 4920, 2241, 287, 257, 4489, 64, 319, 262] ----> 34686
[290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686] ----> 41976


Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.

For illustration purposes, let's repeat the previous code but convert the token IDs into text.

In [None]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a
 and established himself in a ---->  vill
 and established himself in a vill ----> a
 and established himself in a villa ---->  on
 and established himself in a villa on ---->  the
 and established himself in a villa on the ---->  Riv
 and established himself in a villa on the Riv ----> iera


We have now created the input-target pairs that we can use for LLM training in upcoming chapters.

There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.
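
The sliding-window idea can be sketched in plain Python before bringing in PyTorch. The function name `sliding_windows` and the toy token IDs are assumptions for illustration; `max_length` and `stride` are illustrative parameter names for the window size and the step between consecutive windows.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10))  # hypothetical token IDs 0..9
pairs = sliding_windows(toy_ids, max_length=4, stride=2)
for inp, tgt in pairs:
    print(inp, "---->", tgt)
```

With a stride smaller than `max_length`, consecutive windows overlap; with a stride equal to `max_length`, each token appears in exactly one input chunk.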

# **IMPLEMENTING A DATA LOADER**

For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes.

**Step 1:** Tokenize the entire text

**Step 2:** Use a sliding window to chunk the book into overlapping sequences of max_length

**Step 3:** Return the total number of rows in the dataset

**Step 4:** Return a single row from the dataset

In [None]:
import torch

In [2]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
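
The number of rows the dataset holds follows directly from the loop bounds in `__init__`. A small sketch, where `num_windows` is a hypothetical helper and 5145 is the token count reported earlier for the story:

```python
def num_windows(num_tokens, max_length, stride):
    """Count the iterations of range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))

# Non-overlapping windows of 4 tokens over the 5145-token story.
print(num_windows(5145, max_length=4, stride=4))
# Maximally overlapping windows (stride 1).
print(num_windows(5145, max_length=4, stride=1))
# A text no longer than max_length yields no rows at all.
print(num_windows(4, max_length=4, stride=1))
```

The upper bound `num_tokens - max_length` ensures that every target chunk, which is shifted one position to the right, still fits inside the token list.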