## **Download the Data (The Gale Encyclopedia of Medicine)**

In [1]:
import os
import urllib.request

In [2]:
file_path = "medical_book.pdf"
url = "https://raw.githubusercontent.com/bluemusk24/GenerativeAI/main/Medical-Q%26A-bot/data/medical_book.pdf"

# Always download the file to ensure the latest version is used
with urllib.request.urlopen(url) as response:
    pdf_data = response.read()
with open(file_path, "wb") as file:
    file.write(pdf_data)

In [3]:
print(file,'\n\n', file.name)

<_io.BufferedWriter name='medical_book.pdf'> 

 medical_book.pdf


## **Read PDF Data and Extract Text Content**

In [4]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-6.5.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.5.0-py3-none-any.whl (329 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.5.0


In [5]:
from pypdf import PdfReader

In [6]:
def extract_text_from_pdf(pdf_path):

  # open PDF file in binary mode
  with open(pdf_path, 'rb') as file:
    reader = PdfReader(file)

    # Initialize an empty string to store all text
    full_text = ""

    # Iterate through all the pages and extract text
    for page in reader.pages:
      full_text += page.extract_text() or ""    # use 'or ""' to handle pages with no extractable text

  return full_text

In [7]:
text_content = extract_text_from_pdf(file_path)

print(text_content, '\n\n Total Length of Characters:', len(text_content))

The GALE
ENCYCLOPEDIA
of MEDICINE
SECOND EDITIONThe GALE
ENCYCLOPEDIA
of MEDICINE
SECOND EDITION
JACQUELINE L. LONGE, EDITOR
DEIRDRE S. BLANCHFIELD, ASSOCIATE EDITOR
VOLUME
A-B
1STAFF
Jacqueline L. Longe, Project Editor
Deirdre S. Blanchfield, Associate Editor
Christine B. Jeryan, Managing Editor
Donna Olendorf, Senior Editor
Stacey Blachford, Associate Editor
Kate Kretschmann, Melissa C. McDade, Ryan
Thomason, Assistant Editors
Mark Springer, Technical Specialist
Andrea Lopeman, Programmer/Analyst
Barbara J. Yarrow,Manager, Imaging and Multimedia
Content
Robyn V . Young,Project Manager, Imaging and
Multimedia Content
Dean Dauphinais, Senior Editor, Imaging and
Multimedia Content
Kelly A. Quin, Editor, Imaging and Multimedia Content
Leitha Etheridge-Sims, Mary K. Grimes, Dave Oblender,
Image Catalogers
Pamela A. Reed, Imaging Coordinator
Randy Bassett, Imaging Supervisor
Robert Duncan, Senior Imaging Specialist
Dan Newell, Imaging Specialist
Christine O’Bryan,Graphic Specialist
Maria F

### **Creating Tokens for Text Data**

<div class="alert alert-block alert-warning">

Using some simple example text, we can use the ```re.split``` function with the following
syntax to split a text on whitespace characters:</div>

In [8]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


<div class="alert alert-block alert-info">
The result is a list of words and whitespace characters; punctuation marks are still attached to the neighboring words (for example, 'Hello,').
</div>

<div class="alert alert-block alert-warning">

Let's modify the regular expression so that it splits on whitespace (\s), commas, and periods
([,.]):</div>

In [9]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


<div class="alert alert-block alert-info">
We can see that the words and punctuation characters are now separate list entries just as
we wanted
</div>


<div class="alert alert-block alert-warning">

A small remaining issue is that the list still includes whitespace characters. Optionally, we
can remove these redundant characters safely as follows:</div>

In [10]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


<div class="alert alert-block alert-warning">

The tokenization scheme devised above works well on the simple sample text. Let's
modify it a bit further so that it can also handle other types of punctuation, such as
question marks, quotation marks, and the double-dashes along with additional special characters: </div>

In [11]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [12]:
# Filter out empty or whitespace-only items (already done above, so the result is unchanged).
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


Apply the Basic Tokenizer devised above to the dataset

In [14]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text_content)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:50])

['The', 'GALE', 'ENCYCLOPEDIA', 'of', 'MEDICINE', 'SECOND', 'EDITIONThe', 'GALE', 'ENCYCLOPEDIA', 'of', 'MEDICINE', 'SECOND', 'EDITION', 'JACQUELINE', 'L', '.', 'LONGE', ',', 'EDITOR', 'DEIRDRE', 'S', '.', 'BLANCHFIELD', ',', 'ASSOCIATE', 'EDITOR', 'VOLUME', 'A-B', '1STAFF', 'Jacqueline', 'L', '.', 'Longe', ',', 'Project', 'Editor', 'Deirdre', 'S', '.', 'Blanchfield', ',', 'Associate', 'Editor', 'Christine', 'B', '.', 'Jeryan', ',', 'Managing', 'Editor']


In [15]:
print(len(preprocessed))

480622


### **Creating Token IDs**

<div class="alert alert-block alert-warning">

In the above, we tokenized The Gale Encyclopedia of Medicine and assigned the result to a
Python variable called ```preprocessed```. Now, create a list of all unique tokens and sort
them alphabetically to determine the vocabulary size:</div>

In [16]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

31981


<div class="alert alert-block alert-success">

After determining that the vocabulary size is 31981 via the above code, we create the
vocabulary and display its first entries for illustration purposes:

</div>

In [17]:
vocab = {token:integer for integer,token in enumerate(all_words)}
vocab

{'!': 0,
 '#1': 1,
 '#1600': 2,
 '#2001': 3,
 '#201': 4,
 '#231': 5,
 '#280': 6,
 '#29': 7,
 '#400': 8,
 '#406': 9,
 '$1': 10,
 '$10': 11,
 '$100': 12,
 '$114': 13,
 '$125': 14,
 '$18': 15,
 '$185': 16,
 '$2': 17,
 '$200': 18,
 '$200-$400': 19,
 '$250': 20,
 '$250-': 21,
 '$30': 22,
 '$300': 23,
 '$30–70': 24,
 '$350': 25,
 '$36': 26,
 '$4': 27,
 '$40': 28,
 '$40-80': 29,
 '$400': 30,
 '$414': 31,
 '$45': 32,
 '$50': 33,
 '$50-$100': 34,
 '$500': 35,
 '$700': 36,
 '$750': 37,
 '$800': 38,
 '$900': 39,
 '&': 40,
 '&id=57': 41,
 "'": 42,
 '(': 43,
 ')': 44,
 '*Also': 45,
 '*Atarax': 46,
 '*Insulin': 47,
 '+00': 48,
 ',': 49,
 '-': 50,
 '-0433': 51,
 '-1': 52,
 '-828-7866': 53,
 '-II': 54,
 '-John’s-wort': 55,
 '-Oct': 56,
 '-cell': 57,
 '-cells': 58,
 '-foot': 59,
 '.': 60,
 '/': 61,
 '//': 62,
 '//216': 63,
 '//acousticneuromaseattle': 64,
 '//actis': 65,
 '//allergy': 66,
 '//anausa': 67,
 '//androgenetic-alopecia': 68,
 '//cancernet': 69,
 '//cis': 70,
 '//csi': 71,
 '//health': 72,
 

Print the first 51 entries (token IDs 0 through 50)

In [18]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('#1', 1)
('#1600', 2)
('#2001', 3)
('#201', 4)
('#231', 5)
('#280', 6)
('#29', 7)
('#400', 8)
('#406', 9)
('$1', 10)
('$10', 11)
('$100', 12)
('$114', 13)
('$125', 14)
('$18', 15)
('$185', 16)
('$2', 17)
('$200', 18)
('$200-$400', 19)
('$250', 20)
('$250-', 21)
('$30', 22)
('$300', 23)
('$30–70', 24)
('$350', 25)
('$36', 26)
('$4', 27)
('$40', 28)
('$40-80', 29)
('$400', 30)
('$414', 31)
('$45', 32)
('$50', 33)
('$50-$100', 34)
('$500', 35)
('$700', 36)
('$750', 37)
('$800', 38)
('$900', 39)
('&', 40)
('&id=57', 41)
("'", 42)
('(', 43)
(')', 44)
('*Also', 45)
('*Atarax', 46)
('*Insulin', 47)
('+00', 48)
(',', 49)
('-', 50)


<div class="alert alert-block alert-info">
As we can see, based on the output above, the dictionary contains individual tokens
associated with unique integer labels.
</div>


<div class="alert alert-block alert-success">

Later when we want to convert the outputs of an LLM from numbers back into
text, we also need a way to turn token IDs into text.

For this, we can create an inverse
version of the vocabulary that maps token IDs back to corresponding text tokens.

</div>
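
The inverse mapping described above can be sketched on a tiny stand-in vocabulary. Note that `toy_vocab`, `toy_inverse_vocab`, and `toy_ids` are hypothetical names used only for this illustration; the real vocabulary is the one built from the encyclopedia text.

```python
# Toy vocabulary (hypothetical); token string -> token ID.
toy_vocab = {"!": 0, ",": 1, "dust": 2, "exposure": 3, "textile": 4}

# Invert the mapping with a dictionary comprehension: token ID -> token string.
toy_inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}

# Look up a few IDs to turn them back into text tokens.
toy_ids = [4, 2, 3]
tokens = [toy_inverse_vocab[i] for i in toy_ids]
print(tokens)  # ['textile', 'dust', 'exposure']
```

Because every token receives a unique ID, the inversion loses no information: mapping a token to its ID and back returns the same token.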


<div class="alert alert-block alert-success">

Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits
text into tokens and carries out the string-to-integer mapping to produce token IDs via the
vocabulary.

In addition, we implement a decode method that carries out the reverse
integer-to-string mapping to convert the token IDs back into text.

</div>

<div class="alert alert-block alert-info">
    
Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods
    
Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Replace spaces before the specified punctuation

</div>




In [19]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

<div class="alert alert-block alert-success">

Instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a
section from The Gale Encyclopedia of Medicine to try it out in practice:
</div>

In [20]:
tokenizer = SimpleTokenizerV1(vocab)

text = """Prevention
Eliminating exposure to textile dust is the surest way
to prevent byssinosis. Using exhaust hoods, improving
ventilation, and employing wetting procedures are very
successful methods of controlling dust levels to prevent
byssinosis. Protective equipment required during certain
procedures also prevents exposure to levels of contami-
nation that exceed the current United States standard for
cotton dust exposure."""

ids = tokenizer.encode(text)
print(ids)

[10160, 6136, 18850, 29462, 29122, 17970, 21571, 29137, 28697, 30750, 29462, 25579, 15165, 60, 12152, 18781, 20477, 49, 20887, 30474, 49, 13499, 18282, 30821, 25651, 13915, 30513, 28551, 22883, 24065, 16534, 17970, 22059, 29462, 25579, 15165, 60, 10254, 18502, 26652, 17969, 15545, 25651, 13324, 25591, 18850, 29462, 22059, 24065, 16473, 23450, 29135, 18716, 29137, 16856, 12102, 11365, 28222, 19407, 16654, 17970, 18850, 60]


<div class="alert alert-block alert-info">
    
The code above prints the token IDs for the sample passage.
Next, let's see if we can turn these token IDs back into text using the decode method:
</div>

In [21]:
tokenizer.decode(ids)

'Prevention Eliminating exposure to textile dust is the surest way to prevent byssinosis. Using exhaust hoods, improving ventilation, and employing wetting procedures are very successful methods of controlling dust levels to prevent byssinosis. Protective equipment required during certain procedures also prevents exposure to levels of contami- nation that exceed the current United States standard for cotton dust exposure.'

<div class="alert alert-block alert-info">
    
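Based on the output above, we can see that the decode method successfully converted the
token IDs back into the original text, except that the line breaks are now single spaces (for example, "contami- nation") because the tokenizer discards whitespace.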
</div>
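
To make the punctuation clean-up step of the decode method concrete, here is a minimal sketch on a hand-written token list. The token list is made up for illustration; the regular expression is the one used in the decode method of SimpleTokenizerV1.

```python
import re

# Hand-written tokens standing in for the output of an ID-to-string lookup.
tokens = ["Hello", ",", "world", ".", "Is", "this", "a", "test", "?"]

# Joining with spaces puts an unwanted space before every punctuation mark.
joined = " ".join(tokens)

# Remove whitespace that directly precedes the listed punctuation characters.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)   # Hello , world . Is this a test ?
print(cleaned)  # Hello, world. Is this a test?
```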


<div class="alert alert-block alert-success">

So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing text based on a snippet from the training set.

Let's now apply it to a new text sample that
is not contained in the training set:
</div>

In [22]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

<div class="alert alert-block alert-info">
    
The problem is that the word "Hello" was not used in The Gale Encyclopedia of Medicine document.

Hence, it
is not contained in the vocabulary.

This highlights the need to consider large and diverse
training sets to extend the vocabulary when working on LLMs.

</div>
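
Before encoding a new text, it can help to check which of its tokens are missing from the vocabulary. The following is a minimal sketch under assumptions: `find_unknown_tokens` is a hypothetical helper (not part of the notebook), and `toy_vocab` is a small stand-in for the real vocabulary. It reuses the same splitting pattern as the tokenizer.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens in `text` that are missing from `vocab` (hypothetical helper)."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [t for t in tokens if t not in vocab]

# Toy vocabulary that covers everything except "Hello".
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}

missing = find_unknown_tokens("Hello, do you like tea?", toy_vocab)
print(missing)  # ['Hello']
```

Any token returned by this check would raise a `KeyError` in the encode method of SimpleTokenizerV1.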

### ADDING SPECIAL CONTEXT TOKENS

In the previous section, we implemented a simple tokenizer and applied it to a section from the training set.

In this section, we will modify this tokenizer to handle unknown
words.

In particular, we will modify the vocabulary and turn the tokenizer from the
previous section into a new class, **SimpleTokenizerV2**, that supports two new tokens: ***<|unk|> and <|endoftext|>***

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an ***<|unk|>*** token if it
encounters a word that is not part of the vocabulary.

Furthermore, we add a token between
unrelated texts.

For example, when training ```GPT-like``` LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that follows a previous text source.

</div>

<div class="alert alert-block alert-success">

Let's now modify the vocabulary to include these two special tokens, ***<|unk|>*** and ***<|endoftext|>***, by adding these to the list of all unique words that we created in the
previous section:
</div>

In [23]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
#all_tokens

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [24]:
len(vocab.items())

31983

<div class="alert alert-block alert-info">
    
Based on the output above, the new vocabulary size is 31983 (the
vocabulary size in the previous section was 31981).

</div>

<div class="alert alert-block alert-success">

As an additional quick check, let's print the last 5 entries of the updated vocabulary:
</div>

In [26]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('•f', 31978)
('•l', 31979)
('•lLow', 31980)
('<|endoftext|>', 31981)
('<|unk|>', 31982)


<div class="alert alert-block alert-success">

A simple text tokenizer that handles unknown words</div>

<div class="alert alert-block alert-info">
    
Step 1: Replace unknown words by <|unk|> tokens
    
Step 2: Replace spaces before the specified punctuations

</div>


In [27]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [28]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "I need an urgent vacation to Dubai."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> I need an urgent vacation to Dubai.


In [29]:
tokenizer.encode(text)

[31982,
 49,
 17753,
 31063,
 22127,
 28949,
 2901,
 31981,
 7419,
 23508,
 13452,
 31982,
 30334,
 29462,
 31982,
 60]

In [30]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> I need an <|unk|> vacation to <|unk|>.'


<div class="alert alert-block alert-info">
    
Based on comparing the de-tokenized text above with the original input text, we know that
the training dataset, The Gale Encyclopedia of Medicine, did not contain the words
```"Hello", "urgent", and "Dubai"```.

</div>


<div class="alert alert-block alert-warning">

So far, we have discussed tokenization as an essential step in processing text as input to
LLMs. Depending on the LLM, some researchers also consider additional special tokens such
as the following:

***[BOS]*** (beginning of sequence): This token marks the start of a text. It
signifies to the LLM where a piece of content begins.

***[EOS]*** (end of sequence): This token is positioned at the end of a text,
and is especially useful when concatenating multiple unrelated texts,
similar to <|endoftext|>. For instance, when combining two different
Wikipedia articles or books, the [EOS] token indicates where one article
ends and the next one begins.

***[PAD]*** (padding): When training LLMs with batch sizes larger than one,
the batch might contain texts of varying lengths. To ensure all texts have
the same length, the shorter texts are extended or "padded" using the
```[PAD]``` token, up to the length of the longest text in the batch.

</div>
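
As a minimal sketch of the padding idea described above, the snippet below extends shorter sequences to the length of the longest one. Both `PAD_ID` and `pad_batch` are hypothetical names chosen for this illustration, and the token IDs are made up.

```python
PAD_ID = 0  # hypothetical ID reserved for the [PAD] token

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three made-up token ID sequences of different lengths.
batch = [[5, 8, 2], [7], [3, 9, 4, 6]]
padded = pad_batch(batch)
print(padded)  # [[5, 8, 2, 0], [7, 0, 0, 0], [3, 9, 4, 6]]
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.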


<div class="alert alert-block alert-warning">

Note that the tokenizer used for **GPT models** does not need any of the tokens mentioned above; it only uses an ```<|endoftext|>``` token for simplicity.

</div>

<div class="alert alert-block alert-warning">

The tokenizer used for **GPT models** also doesn't use an ```<|unk|>``` token for out-of-vocabulary words. Instead, GPT models use a ***Byte Pair Encoding tokenizer***, which breaks down words into subword units.
</div>

## **BYTE PAIR ENCODING (BPE TOKENIZER)**

In [31]:
import tiktoken

print("tiktoken version:", tiktoken.__version__)

tiktoken version: 0.12.0


In [32]:
# Initialize the encodings for GPT-2, GPT-3, and GPT-4
encodings = {
    "gpt2": tiktoken.get_encoding("gpt2"),
    "gpt3": tiktoken.get_encoding("p50k_base"),  # Used by later GPT-3-family models such as text-davinci-003
    "gpt4": tiktoken.get_encoding("cl100k_base")  # Used by GPT-4 and GPT-3.5-turbo
}

# Get the vocabulary size for each encoding
vocab_sizes = {model: encoding.n_vocab for model, encoding in encodings.items()}

# Print the vocabulary sizes
for model, size in vocab_sizes.items():
    print(f"The vocabulary size for {model.upper()} is: {size}")


The vocabulary size for GPT2 is: 50257
The vocabulary size for GPT3 is: 50281
The vocabulary size for GPT4 is: 100277


In [33]:
# Get BPE Tokenizer for GPT-3

tokenizer = tiktoken.get_encoding("p50k_base")

In [34]:
text = (
    """Hello, do you like tea? <|endoftext|> I need an urgent vacation to Dubai
    of someunknownPlace."""
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 314, 761, 281, 18039, 14600, 284, 24520, 198, 50258, 286, 617, 34680, 27271, 13]


<div class="alert alert-block alert-success">
We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2 earlier
</div>


In [35]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> I need an urgent vacation to Dubai
    of someunknownPlace.


<div class="alert alert-block alert-warning">

We can make two noteworthy observations based on the token IDs and decoded text
above.

First, the <|endoftext|> token is assigned a relatively large token ID, namely,
**50256**.

In fact, the BPE tokenizer used to train ```GPT-2``` has a total vocabulary size of **50257**, with <|endoftext|> being assigned the largest token ID. The ```p50k_base``` encoding used here (labeled ```GPT-3``` above) has a total vocabulary size of **50281**; it keeps <|endoftext|> at **50256** and adds further tokens above it, which is why an ID such as 50258 appears in the output.

</div>

<div class="alert alert-block alert-warning">

Second, the ***BPE Tokenizer*** above encodes and decodes unknown words, such as
"someunknownPlace" correctly.

The BPE tokenizer can handle any unknown word. How does
it achieve this without using <|unk|> tokens?
    
</div>

<div class="alert alert-block alert-warning">

The algorithm underlying ***BPE*** breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.

This enables it to handle out-of-vocabulary words.

So, thanks to the ***BPE algorithm***, if the tokenizer encounters an
unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters.
    
</div>
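
To build intuition for how subword units arise, here is a toy character-level sketch of BPE: it repeatedly merges the most frequent adjacent pair of symbols in a tiny corpus, then applies the learned merges to a word that never appeared in that corpus. This is an illustration under simplifying assumptions, not the tiktoken implementation (which works on bytes and ships pretrained merges); `learn_bpe_merges`, `apply_merge`, `bpe_encode`, and `toy_corpus` are hypothetical names.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` pair merges from a list of words (toy BPE training)."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        corpus = Counter({tuple(apply_merge(s, best)): f for s, f in corpus.items()})
    return merges

def bpe_encode(word, merges):
    """Split `word` into characters, then apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

toy_corpus = ["low", "lower", "lowest", "new", "newer", "newest"]
merges = learn_bpe_merges(toy_corpus, num_merges=6)

# "slowest" is not in the corpus, yet it is still representable as subword units.
tokens = bpe_encode("slowest", merges)
print(merges)
print(tokens)
```

The unseen word falls back to smaller pieces (down to single characters where no merge applies), so no `<|unk|>` token is needed.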

**Let us take another simple example to illustrate how the BPE tokenizer deals with unknown tokens**

In [36]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

[33901, 86, 343, 86, 220, 959]
Akwirw ier


## **DATA SAMPLING WITH SLIDING WINDOW (INPUT-TARGET PAIRS)**

<div class="alert alert-block alert-success">
* In this section we implement a data loader that fetches the input-target pairs using a sliding window approach.</div>


<div class="alert alert-block alert-success">
* To get started, we will first tokenize the whole of The Gale Encyclopedia of Medicine we worked with earlier using the BPE tokenizer introduced in the previous section:</div>



In [37]:
enc_text = tokenizer.encode(text_content)
print(len(enc_text))

667072



<div class="alert alert-block alert-info">
    
Executing the code above shows that the encoded text contains ***667072*** token IDs after applying the BPE tokenizer.

</div>


<div class="alert alert-block alert-success">

Next, we remove the first 2000 tokens from the dataset for demonstration purposes.

</div>

In [38]:
enc_sample = enc_text[2000:]

<div class="alert alert-block alert-warning">

Create two variables, ```x``` and ```y```, where x contains the input tokens and y contains the target tokens. ***y is x shifted by one position***, so each target is the token that follows the corresponding input token.

The ***Context Size*** determines how many tokens are included in the input. The number can be changed.

With a ***Context Size*** of 4, the model looks at up to 4 tokens of a sequence to predict the token that comes next. For example, if the input (X) holds the tokens at positions [1,2,3,4], the target (Y) holds the tokens at positions [2,3,4,5].

</div>

In [39]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [3315, 17555, 290, 31814]
y:      [17555, 290, 31814, 355]


<div class="alert alert-block alert-info">

Using the inputs and the targets (the inputs shifted by one position), we can create the next-word prediction tasks with the code below.

The LLM receives all context tokens on the left of the arrow as input and predicts the token on the right as the target.

</div>

In [40]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[3315] ----> 17555
[3315, 17555] ----> 290
[3315, 17555, 290] ----> 31814
[3315, 17555, 290, 31814] ----> 355


<div class="alert alert-block alert-success">
For illustration purposes, let's repeat the previous code but convert the token IDs into text
</div>

In [41]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 medical ---->  guides
 medical guides ---->  and
 medical guides and ---->  textbooks
 medical guides and textbooks ---->  as


<div class="alert alert-block alert-warning">

We've now created the input-target pairs that we can use for LLM training.
    
</div>

<div class="alert alert-block alert-warning">

There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that
iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which
can be thought of as multidimensional arrays.
    
</div>

<div class="alert alert-block alert-warning">

In particular, we are interested in returning two tensors: an ***input tensor*** containing the text that the LLM sees and a ***target tensor*** that includes the targets for the LLM to predict.
    
</div>


### **IMPLEMENTING A DATA LOADER THAT ITERATES OVER ENTIRE DATASET TO GET INPUT AND OUTPUT PAIRS AS PYTORCH TENSORS**

<div class="alert alert-block alert-success">
For the efficient data loader implementation, we will use PyTorch's built-in Dataset and
DataLoader classes.</div>

In [42]:
import torch

In [43]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

<div class="alert alert-block alert-warning">

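* Feed the Dataset created above into the DataLoader
* max_length = context size (the number of token IDs in each input sequence)
* stride = number of positions the window shifts between consecutive sequences
* num_workers = number of worker subprocesses used for data loading (0 loads data in the main process)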

</div>

<div class="alert alert-block alert-warning">

The ***GPTDatasetV1*** class above is based on the PyTorch Dataset class.

It defines how individual rows are fetched from the dataset.

Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.

The target_chunk tensor contains the corresponding targets.

I recommend reading on to see what the data
returned from this dataset looks like when we combine the dataset with a PyTorch
DataLoader; this will bring additional intuition and clarity.
    
</div>
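
The chunking logic of GPTDatasetV1 can be previewed without PyTorch. The sketch below mirrors the loop in the class on a short list of stand-in token IDs; `sliding_windows` and `toy_ids` are hypothetical names used only here.

```python
def sliding_windows(token_ids, max_length, stride):
    """Yield (input_chunk, target_chunk) pairs, mirroring the loop in GPTDatasetV1."""
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        yield input_chunk, target_chunk

toy_ids = list(range(10))  # stand-in token IDs 0..9

rows = list(sliding_windows(toy_ids, max_length=4, stride=2))
for input_chunk, target_chunk in rows:
    print(input_chunk, "->", target_chunk)

# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

With `max_length=4` and `stride=2`, consecutive inputs overlap by `max_length - stride = 2` tokens. Setting the stride equal to `max_length` removes the overlap between inputs.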

<div class="alert alert-block alert-success">
The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:</div>


<div class="alert alert-block alert-info">
    
Step 1: Initialize the tokenizer

Step 2: Create dataset

Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes
during training

Step 4: The number of CPU processes to use for preprocessing
    
</div>
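
The effect of `drop_last` on the number of batches comes down to simple arithmetic, sketched below in plain Python. The helper `num_batches` is a hypothetical name for this illustration and is not a PyTorch function.

```python
import math

def num_batches(num_rows, batch_size, drop_last):
    """Number of batches produced for `num_rows` samples (illustrative arithmetic)."""
    if drop_last:
        # Only complete batches are kept.
        return num_rows // batch_size
    # The final, shorter batch is kept as well.
    return math.ceil(num_rows / batch_size)

num_rows = 10
print(num_batches(num_rows, batch_size=4, drop_last=True))   # 2
print(num_batches(num_rows, batch_size=4, drop_last=False))  # 3 (last batch holds 2 rows)
```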

In [44]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("p50k_base")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

<div class="alert alert-block alert-success">
    
Let's test the dataloader with a ```batch size``` of 1 for an LLM with a ```context size``` of 4.

This will develop an intuition of how the GPTDatasetV1 class and the
create_dataloader_v1 function work together: </div>

<div class="alert alert-block alert-info">

* Get input-target pairs (tensors) for an individual batch, with max_length of 4 (number of token IDs) and stride=1

</div>

In [45]:
import torch
print("PyTorch version:", torch.__version__, '\n')
dataloader = create_dataloader_v1(
    text_content, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

PyTorch version: 2.9.0+cpu 

[tensor([[  464,   402, 21358,   198]]), tensor([[  402, 21358,   198, 45155]])]


<div class="alert alert-block alert-warning">

The first_batch variable contains two tensors: the first tensor stores the input token IDs,
and the second tensor stores the target token IDs.

Since the max_length is set to 4, each of the two tensors contains 4 token IDs.

Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least
256.
    
</div>

<div class="alert alert-block alert-success">
    
To illustrate the meaning of stride=1, let's fetch another batch from this dataset: </div>

In [46]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[  402, 21358,   198, 45155]]), tensor([[21358,   198, 45155,  5097]])]


In [47]:
third_batch = next(data_iter)
print(third_batch)

[tensor([[21358,   198, 45155,  5097]]), tensor([[  198, 45155,  5097,  3185]])]


<div class="alert alert-block alert-warning">

If we compare the first batch with the second, and the second with the third, we can see that each batch's token
IDs are shifted by one position relative to the previous batch.

For example, the second ID in
the first batch's input is 402, which is the first ID of the second batch's input.

The stride
setting dictates the number of positions the inputs shift across batches, emulating a sliding
window approach
    
</div>

<div class="alert alert-block alert-warning">

Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes.
                                                                                
If you have previous experience with deep learning, you may know
that small batch sizes require less memory during training but lead to more noisy model
updates.

Just like in regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs.
    
</div>

<div class="alert alert-block alert-success">
Input-Target Pairs for the entire dataset
</div>

In [48]:
dataloader = create_dataloader_v1(text_content, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[  464,   402, 21358,   198],
        [45155,  5097,  3185,  1961],
        [ 3539,   198,  1659, 26112],
        [ 2149,  8881,   198, 23683],
        [18672, 39219,   464,   402],
        [21358,   198, 45155,  5097],
        [ 3185,  1961,  3539,   198],
        [ 1659, 26112,  2149,  8881]])

Targets:
 tensor([[  402, 21358,   198, 45155],
        [ 5097,  3185,  1961,  3539],
        [  198,  1659, 26112,  2149],
        [ 8881,   198, 23683, 18672],
        [39219,   464,   402, 21358],
        [  198, 45155,  5097,  3185],
        [ 1961,  3539,   198,  1659],
        [26112,  2149,  8881,   198]])


<div class="alert alert-block alert-info">
    
* Note that we increase the stride to 4. This is to utilize the data set fully (we don't skip a single word) but also avoid any overlap between the batches, since more overlap could lead
to increased overfitting.
    
</div>

## **CREATE TOKEN EMBEDDINGS**

<div class="alert alert-block alert-warning">

* Illustration of how token IDs are converted to vector embeddings.

* Here we assume 2,3,5,1 are token IDs

</div>

In [49]:
input_ids = torch.tensor([2, 3, 5, 1])

<div class="alert alert-block alert-info">
    
* Let's assume we have a vocabulary size of 6 words and want to create an embedding size of 3
    
* Using a vocab_size and output_dim, we can instantiate an embedding layer in PyTorch

* Set random seed = 123
</div>

In [50]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [51]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


<div class="alert alert-block alert-success">

* The weight matrix of the embedding layer printed above contains small, random values. These values are optimized during LLM training.

* The other weights of the LLM, which predict the next word, are optimized during training as well
</div>
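To make "optimized during training" concrete, here is a minimal sketch showing that the embedding weights receive gradients like any other trainable parameter. The loss below (a plain sum of the looked-up values) is an assumption chosen only for illustration; it is not the LLM's actual training loss.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)   # vocab_size=6, output_dim=3
input_ids = torch.tensor([2, 3, 5, 1])

# Toy loss (illustration only): sum of all looked-up embedding values
loss = embedding_layer(input_ids).sum()
loss.backward()

grad = embedding_layer.weight.grad
print(grad)

# Rows 0 and 4 were never looked up, so their gradient stays zero;
# rows 1, 2, 3 and 5 were each looked up once.
unused_rows = grad[[0, 4]]
used_rows = grad[[1, 2, 3, 5]]
```

Only the rows that correspond to token IDs seen in the batch get a nonzero gradient, so an optimizer step would update just those embedding vectors.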

In [52]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info">

* Vector embeddings for all the token IDs

</div>

In [53]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info">
    
Each row in this output matrix is obtained via a lookup operation from the embedding
weight matrix
    
</div>
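The lookup can be checked directly. The sketch below reuses the toy sizes from above and compares the layer's output with plain row indexing into the weight matrix and with a one-hot matrix multiplication, which gives the same result less efficiently.

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
input_ids = torch.tensor([2, 3, 5, 1])

# 1) Regular forward pass through the embedding layer
looked_up = embedding_layer(input_ids)

# 2) Row indexing into the weight matrix
indexed = embedding_layer.weight[input_ids]

# 3) One-hot encoding followed by a matrix multiplication
one_hot = torch.nn.functional.one_hot(input_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding_layer.weight

print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```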

### **POSITIONAL EMBEDDINGS (ENCODING WORD POSITIONS)**

<div class="alert alert-block alert-success">

* To create the embedding layer, we assume an output dimension of 256 for each token ID and a vocabulary size of 50281
* We also assume the token IDs were created using a Byte-Pair Encoding (BPE) tokenizer
                                                                                              
</div>

<div class="alert alert-block alert-success">

Previously, we focused on very small embedding sizes in this chapter for illustration
purposes.

We now consider more realistic and useful embedding sizes and encode the input
tokens into a 256-dimensional vector representation.

This is smaller than what the original
GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable
for experimentation.

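Furthermore, we assume that the token IDs were created by the BPE
tokenizer that we implemented earlier. Note that the standard GPT-2 BPE vocabulary contains 50,257 tokens; the vocab_size of 50,281 used below is slightly larger, which still works because every token ID stays within range: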

</div>

In [54]:
vocab_size = 50281
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
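As a rough sanity check on the size of this layer, the sketch below counts its trainable parameters using the same vocab_size and output_dim as the cell above. The memory estimate assumes the default float32 weights.

```python
import torch

vocab_size = 50281
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

weight = token_embedding_layer.weight
num_params = weight.numel()                       # vocab_size * output_dim
size_mib = num_params * weight.element_size() / 2**20

print(weight.shape)
print(f"{num_params:,} trainable parameters, about {size_mib:.1f} MiB")
```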

<div class="alert alert-block alert-warning">
    
* Using the embedding layer above, we sample a batch from the dataloader and embed each token into a 256-dimensional vector space. With a batch size of 8 and 4 tokens per sample, the resulting output will be a ```8 x 4 x 256 tensor```

* Instantiate the dataloader (Data Sampling with a sliding window)
</div>

In [55]:
max_length = 4
dataloader = create_dataloader_v1(
    text_content, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [56]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[  464,   402, 21358,   198],
        [45155,  5097,  3185,  1961],
        [ 3539,   198,  1659, 26112],
        [ 2149,  8881,   198, 23683],
        [18672, 39219,   464,   402],
        [21358,   198, 45155,  5097],
        [ 3185,  1961,  3539,   198],
        [ 1659, 26112,  2149,  8881]])

Inputs shape:
 torch.Size([8, 4])


<div class="alert alert-block alert-info">

* From the above, the token ID tensor is 8x4-dimensional, meaning the data batch consists of 8 text samples with 4 token IDs each

* We use the embedding layer created above to embed each token ID into a 256-dimensional vector

</div>

In [57]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [61]:
token_embeddings

tensor([[[ 3.5963e-02,  1.0094e+00,  1.5110e-01,  ...,  1.4460e+00,
           6.6325e-04,  1.2537e+00],
         [ 1.9847e+00, -6.4828e-01, -1.4146e-01,  ..., -3.8410e-01,
          -9.3553e-01,  1.4478e+00],
         [-1.5610e+00, -2.0675e+00, -3.1993e-01,  ..., -1.0575e+00,
           6.2431e-01,  9.4251e-01],
         [-1.4371e-01, -9.7823e-01,  1.5918e+00,  ...,  3.0685e-01,
          -1.1135e+00, -7.2020e-01]],

        [[ 1.0354e+00,  1.7596e-01, -1.8731e+00,  ..., -4.9843e-01,
          -2.2604e+00, -1.2212e+00],
         [-1.2465e+00, -5.6918e-01,  2.5386e+00,  ..., -3.0578e-01,
           3.2366e-01,  1.3455e+00],
         [ 9.0371e-01, -1.8361e-01, -1.8243e+00,  ...,  8.6197e-01,
          -4.2641e-01, -2.3662e+00],
         [ 8.5968e-01, -3.0877e-01, -6.6412e-01,  ..., -3.3888e-01,
          -1.3219e+00, -1.5608e+00]],

        [[ 9.7635e-01, -4.5960e-01, -3.4314e-01,  ...,  1.1715e+00,
          -5.3655e-01,  7.4415e-01],
         [-1.4371e-01, -9.7823e-01,  1.5918e+00,  ...

<div class="alert alert-block alert-success">

* For GPT's absolute positional embedding approach, we create another embedding layer with the same embedding dimension as the token embedding layer to get positional embeddings

* The positional embeddings encode the position of each token within an input sequence. Here, we have just 4 positions, which is the context length and equal
to the max length
                                                                                              
</div>

In [58]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [59]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [60]:
pos_embeddings

tensor([[-1.2202, -0.2870, -0.7909,  ...,  1.2403, -0.0785,  1.1018],
        [ 0.1823, -1.0179,  1.2019,  ...,  0.8508,  0.3723,  0.9018],
        [ 0.7011,  1.1959, -0.7350,  ..., -1.5821, -0.7583, -1.6731],
        [-0.0352,  1.0147,  0.8490,  ..., -0.8283,  0.0285,  0.0072]],
       grad_fn=<EmbeddingBackward0>)

<div class="alert alert-block alert-warning">

* The input embeddings (token embeddings plus positional embeddings) are the embedded input examples that can be processed by the main LLM

</div>

<div class="alert alert-block alert-info">
    
As shown in the preceding code example, the input to pos_embedding_layer is usually a
placeholder vector torch.arange(context_length), which contains the sequence of
numbers 0, 1, ..., up to the maximum input length − 1.

The context_length is a variable
that represents the supported input size of the LLM.

Here, we set it equal to the
maximum length of the input text (max_length).

In practice, input text can be longer than the supported
context length, in which case we have to truncate the text.
    
</div>
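A minimal sketch of the truncation case, assuming a context length of 4 and a made-up sequence of 6 token IDs. It also shows what happens without truncation on CPU: position indices beyond the table size cannot be looked up.

```python
import torch

context_length = 4
output_dim = 256
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Example sequence that is longer than the supported context length
token_ids = torch.tensor([464, 402, 21358, 198, 45155, 5097])

# Keep only the first context_length tokens
truncated = token_ids[:context_length]
positions = torch.arange(truncated.shape[0])
pos_embeddings = pos_embedding_layer(positions)
print(truncated.shape, pos_embeddings.shape)

# Without truncation, positions 4 and 5 fall outside the embedding table
try:
    pos_embedding_layer(torch.arange(token_ids.shape[0]))
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True
print("Untruncated lookup failed:", out_of_range_failed)
```

Keeping the first tokens is only one possible policy; keeping the most recent tokens instead is an equally simple alternative.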

<div class="alert alert-block alert-info">
    
As we can see, the positional embedding tensor consists of four 256-dimensional vectors.
We can now add these directly to the token embeddings: PyTorch broadcasts the 4x256-dimensional
pos_embeddings tensor across the batch dimension, adding it to the 4x256-dimensional token embedding tensor of
each of the 8 samples in the batch:

* input_embeddings = token_embeddings + pos_embeddings
</div>
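The broadcasting step can be verified in isolation. The sketch below uses random stand-in tensors with the same shapes as above (the values are not the notebook's) and compares the broadcast sum with an explicit expansion of the positional tensor along the batch dimension.

```python
import torch

torch.manual_seed(0)
batch_size, context_length, output_dim = 8, 4, 256

# Stand-in tensors with the same shapes as token_embeddings and pos_embeddings
token_embeddings = torch.randn(batch_size, context_length, output_dim)
pos_embeddings = torch.randn(context_length, output_dim)

# Broadcasting: [8, 4, 256] + [4, 256] -> [8, 4, 256]
broadcast_sum = token_embeddings + pos_embeddings

# Same computation with the batch dimension made explicit
explicit_sum = token_embeddings + pos_embeddings.unsqueeze(0).expand(batch_size, -1, -1)

print(broadcast_sum.shape)
print(torch.equal(broadcast_sum, explicit_sum))
```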

In [62]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
