# Reading a short story into Python as a text sample

## Step 1: Creating Tokens

In [None]:
# Opening Dataset
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99]) # prints the first 99 characters of the file

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### LLMs are typically trained on thousands or even millions of books; here we use a single short story to practice and understand the process

In [3]:
import re # regular expressions; re.split breaks a text into a list of tokens at a given pattern

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text) # split at every whitespace character; the parentheses keep each separator as its own item

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
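The parentheses in '(\s)' matter: they form a capturing group, and re.split keeps whatever a capturing group matches as separate items in the result. A minimal comparison on the same sample sentence (the variable names here are only for illustration):

```python
import re

text = "Hello, world. This, is a test."

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, every matched separator is kept as its own item.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)
```

Keeping the separators is what later lets us decide for ourselves whether whitespace and punctuation should become tokens or be dropped.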


### Now let's extend the regular expression '(\s)' so that commas and periods are also split off as individual tokens

In [5]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


### Now that commas and periods are separate tokens, we can remove the whitespace and the empty strings

In [None]:
# Remove whitespace-only and empty items from the list
result = [item for item in result if item.strip()] # item.strip() is an empty string (falsy) for whitespace-only or empty items, so they are filtered out
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


When developing a simple tokenizer, whether we should encode whitespaces as
separate characters or just remove them depends on our application and its
requirements. Removing whitespaces reduces the memory and computing
requirements. However, keeping whitespaces can be useful if we train models that
are sensitive to the exact structure of the text (for example, Python code, which is
sensitive to indentation and spacing). Here, we remove whitespaces for simplicity
and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme
that includes whitespaces.
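To make the trade-off concrete, here is a small sketch that tokenizes a two-line Python snippet both ways. The snippet and the variable names are made up for illustration; only the whitespace split from above is used.

```python
import re

code = "if x:\n    return 1"

pieces = re.split(r'(\s)', code)

# Variant 1: drop whitespace entirely (the approach used in this notebook).
no_whitespace = [p for p in pieces if p.strip()]

# Variant 2: keep whitespace characters as tokens and drop only empty strings.
with_whitespace = [p for p in pieces if p != ""]

print(no_whitespace)
print(with_whitespace)

# Only the second variant can reproduce the original text exactly.
print("".join(with_whitespace) == code)
```

The first variant yields fewer tokens, but the newline and the indentation are lost, which would change the meaning of Python code.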

### Let's treat more characters as separate tokens

In [None]:
text = "Hello, world. Is this-- a test?"

# The following two lines of code are our tokenization scheme
result = re.split(r'([,.:;?_!"()\']|--|\s)', text) # split at punctuation, double dashes, and whitespace
result = [item.strip() for item in result if item.strip()] # drop whitespace-only and empty items

print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


### Now let's apply these two statements of the tokenization scheme to 'raw_text'

In [12]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

print(preprocessed[:30]) # prints the first 30 tokens of the file
print("Token Length:", len(preprocessed))

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
Token Length: 4690


At this point our tokens are still strings (words and punctuation characters), not numbers.
The next step is to build a **vocabulary**.

The vocabulary maps each unique token to an integer token ID. Here we sort the unique tokens alphabetically before assigning the IDs.

## Step 2: Creating Token IDs

In the previous section, we tokenized Edith Wharton's short story and assigned it to a
Python variable called preprocessed. Let's now create a list of all unique tokens and sort
them alphabetically to determine the vocabulary size:

In [13]:
# Sorting the set in alphabetical order
all_words = sorted(set(preprocessed))

# Print Vocab size
vocab_len = len(all_words)
print("Size of Vocab:", vocab_len)

Size of Vocab: 1130


The vocabulary size is 1130, far smaller than the 4690 tokens in the text. This is expected, because the vocabulary contains only *unique* tokens.

Remember that a vocabulary is a dictionary that maps each token to its associated token ID.

After determining that the vocabulary size is 1,130 via the above code, we create the vocabulary and print its first 51 entries for illustration purposes:

In [14]:
# assign an integer to every token in the sorted list (this integer is the token ID)
vocab = {token: integer for integer, token in enumerate(all_words)}

# print the first 51 (token, token ID) pairs to show how the IDs are assigned
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


As we can see, based on the output above, the dictionary contains individual tokens associated with unique integer labels. Later in this book, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text. For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.
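The inverse mapping mentioned above can be sketched with a tiny, made-up vocabulary (toy_vocab and its four entries are hypothetical; the real vocabulary is built from the short story):

```python
# Toy vocabulary: token -> token ID (hypothetical entries for illustration).
toy_vocab = {"!": 0, ",": 1, "Hello": 2, "world": 3}

# Flip keys and values so that token IDs map back to tokens.
toy_inverse = {token_id: token for token, token_id in toy_vocab.items()}

ids = [toy_vocab[t] for t in ["Hello", ",", "world", "!"]]
tokens = [toy_inverse[i] for i in ids]

print(ids)
print(tokens)
```

Flipping the dictionary only works because every token ID is unique; otherwise two tokens would collide on the same key.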


- Let's implement a complete tokenizer class in Python.

    - The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. 

    - In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.
- Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods

- Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

- Step 3: Process input text into token IDs

- Step 4: Convert token IDs back into text

- Step 5: Remove spaces before the specified punctuation

In [20]:
class SimpleTokenizerV1:
    # __init__ runs when the class is instantiated, e.g. SimpleTokenizerV1(vocab)
    def __init__(self, vocab):
        self.str_to_int = vocab # token -> token ID
        self.int_to_str = {i:s for s, i in vocab.items()} # inverse mapping (token ID -> token), used by decode

    # Split the text with the same preprocessing as before, then map tokens to IDs
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        # convert each token to its token ID (raises KeyError for tokens not in the vocab)
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    # Map token IDs back to tokens with the inverse dictionary, then join them with spaces
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        # Remove the spaces in front of the listed punctuation characters
        text = re.sub(r'\s+([,.:;"()\'])', r'\1', text)
        return text


### Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice:

In [21]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


### Next, let's see if we can turn these token IDs back into text using the decode method:

In [None]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

The passage assigned to 'text' above comes from the training text, 'raw_text', so every one of its tokens is in the vocabulary.

### Let's now apply it to a new text sample that is not contained in the training set:

In [None]:
text = "Hello do you like Tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

### A KeyError is raised because "Hello" does not appear in the short story (raw_text) and is therefore missing from 'vocab'. This highlights the need for large and diverse training sets to extend the vocabulary when working on LLMs. To deal with unknown words, we add special context tokens.

## ADDING SPECIAL CONTEXT TOKENS

- In this section, we will modify the tokenizer to handle unknown
words.

- In particular, we will turn the vocabulary and the tokenizer from the
previous section, SimpleTokenizerV1, into a new SimpleTokenizerV2 that supports two new tokens, <|unk|> and
<|endoftext|>

- We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary.

- Furthermore, we add a token between
unrelated texts.

- For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source

- When working with multiple text sources, we add <|endoftext|> tokens between these texts. These tokens act as markers that signal where one segment ends and the next begins, which helps the LLM treat the sources as separate texts.
  - Without this token, the texts would run together with no indication that they come from different sources.
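One practical detail: the splitting rule from the previous section has no special knowledge of <|endoftext|>. It only becomes a token of its own when it is surrounded by whitespace. The sketch below illustrates this with two made-up documents (doc1, doc2, and split_tokens are hypothetical names; the regular expression is the one used in this notebook):

```python
import re

doc1 = "First document."
doc2 = "Second document."

def split_tokens(text):
    # Same splitting rule as the tokenizer in this notebook.
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in pieces if p.strip()]

# Joined with surrounding spaces: the marker is split off as its own token.
spaced = split_tokens(" <|endoftext|> ".join((doc1, doc2)))

# Joined without spaces: the marker sticks to the following word.
unspaced = split_tokens("<|endoftext|>".join((doc1, doc2)))

print(spaced)
print(unspaced)
```

In the second case the combined string '<|endoftext|>Second' is not a vocabulary entry, so a tokenizer with unknown-word handling would map it to <|unk|> instead of the end-of-text ID.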

### Now let's modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique words that we created in the previous section:


In [25]:
all_tokens = sorted(list(set(preprocessed)))

# append the two special tokens to the end of the sorted token list
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)}

# print the last five entries; the two special tokens receive the largest IDs
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


### Now let's write the second version of the tokenizer, SimpleTokenizerV2, which replaces unknown words with <|unk|>


In [None]:
class SimpleTokenizerV2:
    # Same as in SimpleTokenizerV1
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        # Replace every token that is not in the vocab with "<|unk|>"
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]

        # convert each token to its token ID
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    # Same as in SimpleTokenizerV1
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;"()\'])', r'\1', text) # remove spaces before the listed punctuation ("?" and "!" are not in this list, so a space remains before them)
        return text

In [43]:
tokenizer = SimpleTokenizerV2(vocab)

# Join two unrelated text sources with <|endoftext|>, as is done for GPT-like models
text1 = "Hello, do you like Tea?"
text2 = "In the sunlit terraces of the palace."

text = "<|endoftext|>".join((text1,text2))

print(text)

Hello, do you like Tea?<|endoftext|>In the sunlit terraces of the palace.


In [45]:
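# No more KeyError: tokens that are not in the vocab map to <|unk|> (ID 1131)
# Note: the texts were joined without spaces, so "<|endoftext|>In" is split off as one token that is not in the vocab and also maps to <|unk|>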
print(tokenizer.encode(text))
tokenizer.decode(tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 1131, 10, 1131, 988, 956, 984, 722, 988, 1131, 7]


'<|unk|>, do you like <|unk|> ? <|unk|> the sunlit terraces of the <|unk|>.'


So far, we have discussed tokenization as an essential step in processing text as input to
LLMs. Depending on the LLM, some researchers also consider additional special tokens such
as the following:

[BOS] (beginning of sequence): This token marks the start of a text. It
signifies to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text,
and is especially useful when concatenating multiple unrelated texts,
similar to <|endoftext|>. For instance, when combining two different
Wikipedia articles or books, the [EOS] token indicates where one article
ends and the next one begins.

[PAD] (padding): When training LLMs with batch sizes larger than one,
the batch might contain texts of varying lengths. To ensure all texts have
the same length, the shorter texts are extended or "padded" using the
[PAD] token, up to the length of the longest text in the batch.
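As a rough illustration of padding, the sketch below extends every sequence in a batch to the length of the longest one. PAD_ID and the token IDs are made-up values, not taken from any real tokenizer.

```python
PAD_ID = 0  # hypothetical ID reserved for the [PAD] token

# Hypothetical batch of token ID sequences with different lengths.
batch = [[5, 8, 2], [7], [4, 9]]

# Extend every sequence to the length of the longest one.
max_len = max(len(seq) for seq in batch)
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

print(padded)
```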

Note that the tokenizer used for GPT models does not need any of the tokens mentioned
above; it only uses an <|endoftext|> token for simplicity.

The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks
down words into subword units.

### BYTE PAIR ENCODING (BPE)




In [1]:
! pip3 install tiktoken

Collecting tiktoken
  Obtaining dependency information for tiktoken from https://files.pythonhosted.org/packages/de/46/21ea696b21f1d6d1efec8639c204bdf20fde8bafb351e1355c72c5d7de52/tiktoken-0.12.0-cp311-cp311-macosx_10_12_x86_64.whl.metadata
  Downloading tiktoken-0.12.0-cp311-cp311-macosx_10_12_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.12.0-cp311-cp311-macosx_10_12_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.12.0


In [2]:
import importlib.metadata # the metadata submodule must be imported explicitly
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.12.0


In [5]:
tokenizer = tiktoken.get_encoding("gpt2")

Using this tokenizer is similar to using the SimpleTokenizerV2 we implemented previously: it also provides an encode method, here supplied by the tiktoken library:

In [None]:
# Note: the two adjacent string literals are concatenated without a space, so the text contains "terracesof"
text = ("Hello, do you like tea? <|endoftext|> In the sunlit terraces"
        "of someunkownPlace.") # the out-of-vocabulary word is tokenized without raising an error

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"}) # allow <|endoftext|> to be encoded as a special token
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 2954, 593, 27271, 13]


Now we can decode the token IDs back into text:

In [9]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunkownPlace.



We can make two noteworthy observations based on the token IDs and decoded text
above.

- **First**, the <|endoftext|> token is assigned a relatively large token ID, namely, 50256.
    - In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID.

- **Second**, the BPE tokenizer above encodes and decodes unknown words, such as "someunkownPlace", correctly.
    - The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?

- The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.

- This enables it to handle out-of-vocabulary words.

- So, thanks to the BPE algorithm, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters.

In [None]:
integers = tokenizer.encode("AKwire ier") # made-up words
print(integers)

string = tokenizer.decode(integers)
print(string)

# The round trip reproduces the input exactly

[10206, 21809, 220, 959]
AKwire ier
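To get a feeling for how a word is broken into subword units, here is a toy sketch of applying BPE merge rules. It is not the tiktoken implementation: the real GPT-2 tokenizer works on bytes and uses about 50,000 learned merges, while bpe_segment and toy_merges below are hypothetical and hand-written for illustration.

```python
def bpe_segment(word, merges):
    """Split a word into characters, then apply merge rules in priority order."""
    symbols = list(word)
    for left, right in merges:  # earlier rules have higher priority
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair (left, right) wherever it occurs side by side.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, as if they had been learned from a training corpus.
toy_merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

print(bpe_segment("the", toy_merges))    # a frequent word collapses into one unit
print(bpe_segment("thing", toy_merges))  # a rarer word becomes two subword units
print(bpe_segment("xyz", toy_merges))    # no rule applies: single characters remain
```

Because the fallback is always the individual character (or byte, in the real tokenizer), no input can be out of vocabulary, which is why no <|unk|> token is needed.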


### DATA SAMPLING WITH SLIDING WINDOW
