Step 1 Tokenizing the Short Story

In [1]:
with open("the-verdict.txt","r",encoding="utf-8") as f:
    verdict=f.read()

print("total number of characters:",len(verdict))
print(verdict[:100])  # Print the first 100 characters for a preview

total number of characters: 20480
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


Our goal is to tokenize every word in the story.

Q: How can we best split the text into individual tokens?
Ans: Use Python's regular expression library, `re`.

In [2]:
import re

text="Hello, world! This is a test. Let's see how many sentences we can count. Can you count them all? I hope so."

result=re.split(r'([,.?]|\s)',text)
print(result)


['Hello', ',', '', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '', ' ', "Let's", ' ', 'see', ' ', 'how', ' ', 'many', ' ', 'sentences', ' ', 'we', ' ', 'can', ' ', 'count', '.', '', ' ', 'Can', ' ', 'you', ' ', 'count', ' ', 'them', ' ', 'all', '?', '', ' ', 'I', ' ', 'hope', ' ', 'so', '.', '']


In [3]:
result=[item for item in result if item.strip()]
print(result)

['Hello', ',', 'world!', 'This', 'is', 'a', 'test', '.', "Let's", 'see', 'how', 'many', 'sentences', 'we', 'can', 'count', '.', 'Can', 'you', 'count', 'them', 'all', '?', 'I', 'hope', 'so', '.']
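
The parentheses in the pattern form a capturing group, which is why `re.split` keeps the delimiters in its result, and the empty strings appear wherever two delimiters sit next to each other. A minimal sketch on a shorter sample sentence (the sentence and the variable names are made up for illustration):

```python
import re

sample = "Hello, world. Is this a test?"

# Without a capturing group, the delimiters are discarded.
without_group = re.split(r'[,.?]|\s', sample)

# With a capturing group, each matched delimiter is kept as its own item.
with_group = re.split(r'([,.?]|\s)', sample)

# Empty strings appear where two delimiters are adjacent (", " or ". ")
# or where the text ends with a delimiter. Filtering on strip() removes
# them together with the whitespace-only items.
tokens = [item for item in with_group if item.strip()]

print(without_group)
print(with_group)
print(tokens)
```

Only the version with the capturing group keeps the punctuation marks as tokens.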


Removing whitespace or not
When developing a simple tokenizer for prose we don't need whitespace tokens, but if we want to build a tokenizer for Python code, where indentation is significant, we need to keep them. It depends on the use case.
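
To make the trade-off concrete, here is a small sketch that tokenizes a two-line Python snippet both ways. The snippet and the pattern `(\s+|[:()])` are assumptions chosen for illustration, not part of the tokenizer built in this notebook.

```python
import re

code = "if x:\n    return 1"

# Split on runs of whitespace or on single punctuation marks, keeping the delimiters.
pieces = re.split(r'(\s+|[:()])', code)

# Variant 1: drop whitespace, as for the prose tokenizer.
no_space = [p for p in pieces if p.strip()]

# Variant 2: keep whitespace tokens and drop only the empty strings.
with_space = [p for p in pieces if p != ""]

print(no_space)
print(with_space)

# Only the second variant allows the source to be rebuilt exactly.
print("".join(with_space) == code)
```

With whitespace removed, the indentation before `return` is lost and the original code can no longer be reconstructed from the tokens.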

In [4]:
result=re.split(r'([,.:;!?_"()\--]|\s)',text)
print(result)
result=[item for item in result if item.strip()]
print(result)

['Hello', ',', '', ' ', 'world', '!', '', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '', ' ', "Let's", ' ', 'see', ' ', 'how', ' ', 'many', ' ', 'sentences', ' ', 'we', ' ', 'can', ' ', 'count', '.', '', ' ', 'Can', ' ', 'you', ' ', 'count', ' ', 'them', ' ', 'all', '?', '', ' ', 'I', ' ', 'hope', ' ', 'so', '.', '']
['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.', "Let's", 'see', 'how', 'many', 'sentences', 'we', 'can', 'count', '.', 'Can', 'you', 'count', 'them', 'all', '?', 'I', 'hope', 'so', '.']


Now that we have a basic tokenizer working, let's apply it to the verdict text.

In [5]:
import re
preprocessed_text = re.split(r'(--|[,.?;:!_()\]\s]+)', verdict)
preprocessed_text = [item for item in preprocessed_text if item.strip()]
print(preprocessed_text[:100])  # Print the first 100 items for a preview


['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ', ', 'in', 'the', 'height', 'of', 'his', 'glory', ', ', 'he', 'had', 'dropped', 'his', 'painting', ', ', 'married', 'a', 'rich', 'widow', ', ', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '. (', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.)\n\n', '"The', 'height', 'of', 'his', 'glory"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '. ', 'I', 'can', 'hear', 'Mrs', '. ', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter', '--', 'deploring', 'his', 'unaccountable', 'abdication']


In [6]:
print(len(preprocessed_text))  # Print the total number of items

4411
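
The preview above contains tokens such as `', '` and `'. ('`. They arise because the pattern in In [5] puts `+` after the character class, so a run of punctuation and whitespace is captured as one token. The sketch below compares that pattern with a per-character pattern in the style of In [4]. The sample string and the second pattern are assumptions for illustration, and the counts printed in this notebook come from the In [5] pattern only.

```python
import re

sample = 'glory, he said. (Though I doubt it.)'

# Pattern from In [5]: a run of punctuation and whitespace becomes ONE token.
grouped = re.split(r'(--|[,.?;:!_()\]\s]+)', sample)
grouped = [item for item in grouped if item.strip()]

# Per-character pattern: every punctuation mark is its own token.
single = re.split(r'([,.:;!?_"()]|--|\s)', sample)
single = [item for item in single if item.strip()]

print(grouped)
print(single)
```

Which behaviour is preferable depends on the goal. The per-character variant generally yields fewer distinct punctuation tokens, at the cost of a longer token sequence.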


Step 2 Convert Tokens into Token IDs

Now we take all the unique tokens and sort them alphabetically; the number of unique tokens is the vocabulary size.

In [7]:
all_words=sorted(set(preprocessed_text))
vocab_size=len(all_words)
print("Vocabulary size:",vocab_size)

Vocabulary size: 1198


In [8]:
# print(all_words[:50])
print(type(all_words))

<class 'list'>


Let's map each token to an integer ID.

In [9]:
vocab={token:integer for integer,token in enumerate(all_words)}

This line creates a dictionary called `vocab` that maps each unique token in the list `all_words` to a unique integer. The `enumerate(all_words)` function returns pairs of an integer index and the corresponding token from `all_words`. The dictionary comprehension `{token:integer for integer,token in enumerate(all_words)}` then constructs key-value pairs where each token is a key and its assigned integer is the value.

This mapping is commonly used in natural language processing tasks to convert words or tokens into numerical representations, which are easier for machine learning models to process. By assigning each token a unique integer, you can efficiently encode text data for further analysis or model training.
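
The same comprehension can be tried on a toy token list to see the mapping at a glance. The token list below is made up for illustration.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Unique tokens, sorted so that the ID assignment is reproducible.
unique_tokens = sorted(set(tokens))

# Token -> ID, using the same comprehension as for the story vocabulary.
toy_vocab = {token: integer for integer, token in enumerate(unique_tokens)}

# Encoding a token sequence is then a series of dictionary lookups.
ids = [toy_vocab[token] for token in tokens]

print(toy_vocab)
print(ids)
```

Repeated tokens such as "the" receive the same ID each time they occur.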

In [10]:
print(type(vocab))

<class 'dict'>


In [11]:
for key,value in enumerate(vocab.items()):
    print(f"{key}: {value}")
    if key>50:
        break

0: (' (', 0)
1: (' _', 1)
2: ('!', 2)
3: ('!\n\n', 3)
4: ('! ', 4)
5: ('"', 5)
6: ('"Ah', 6)
7: ('"Be', 7)
8: ('"Begin', 8)
9: ('"By', 9)
10: ('"Come', 10)
11: ('"Destroyed', 11)
12: ('"Don\'t', 12)
13: ('"Gisburns"', 13)
14: ('"Grindles', 14)
15: ('"Hang', 15)
16: ('"Has', 16)
17: ('"How', 17)
18: ('"I', 18)
19: ('"I\'d', 19)
20: ('"If', 20)
21: ('"It', 21)
22: ('"It\'s', 22)
23: ('"Jack', 23)
24: ('"Money\'s', 24)
25: ('"Moon-dancers"', 25)
26: ('"Mr', 26)
27: ('"Mrs', 27)
28: ('"My', 28)
29: ('"Never', 29)
30: ('"Of', 30)
31: ('"Oh', 31)
32: ('"Once', 32)
33: ('"Only', 33)
34: ('"Or', 34)
35: ('"That', 35)
36: ('"The', 36)
37: ('"Then', 37)
38: ('"There', 38)
39: ('"This', 39)
40: ('"We', 40)
41: ('"Well', 41)
42: ('"What', 42)
43: ('"When', 43)
44: ('"Why', 44)
45: ('"Yes', 45)
46: ('"You', 46)
47: ('"but', 47)
48: ('"deadening', 48)
49: ('"dragged', 49)
50: ('"effects"', 50)
51: ('"interesting"', 51)


This code iterates over the items in the `vocab` dictionary using the `enumerate()` function. The `vocab.items()` method returns each key-value pair from the dictionary as a tuple, where the key is a token (such as a word) and the value is its corresponding integer ID.

By wrapping `vocab.items()` with `enumerate()`, each iteration provides a running integer counter (named `key` here) starting from 0, and the actual dictionary item (named `value` here), which is a `(token, id)` tuple. Inside the loop, the code prints the counter and the dictionary item in a formatted string. Because the `if key>50: break` check runs after the print, the loop stops once item 51 has been printed, so the first 52 items are shown. This is useful for previewing a sample of the vocabulary without overwhelming the output.

This approach is commonly used to inspect a subset of large dictionaries, such as vocabularies in natural language processing tasks, to verify their contents or debug the token-to-index mapping.
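
An alternative way to preview the first few items, sketched here on a toy vocabulary, is `itertools.islice`, which avoids the manual counter and the off-by-one that comes with `break`. The names `toy_vocab`, `n`, and `preview` are placeholders.

```python
from itertools import islice

# Toy stand-in for the real vocabulary.
toy_vocab = {token: i for i, token in enumerate(["!", "a", "cat", "dog", "the"])}

# islice stops after exactly n items, so no counter or break is needed.
n = 3
preview = list(islice(toy_vocab.items(), n))

for token, token_id in preview:
    print(f"{token_id}: {token!r}")
```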

### Let's create a tokenizer class and implement two methods
 1. encode
 2. decode

In [12]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab  # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # ID -> token

    def encode(self, text):
        # Split on "--" or on runs of punctuation/whitespace, keeping the delimiters
        preprocessed = re.split(r'(--|[,.?;:!_()\]\s]+)', text)
        # Strip surrounding whitespace and drop empty items
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space before the specified punctuation
        text = re.sub(r'\s([,.?;:!_()\]])', r'\1', text)
        return text


The `SimpleTokenizerV1` class provides a basic way to convert text into a sequence of integers (tokenization) and back (detokenization) using a given vocabulary mapping. When an instance is created, it receives a `vocab` dictionary mapping string tokens to unique integers. The constructor (`__init__`) stores this mapping as `str_to_int` and also creates a reverse mapping, `int_to_str`, to convert integers back to their corresponding string tokens.

The `encode` method takes a text string and splits it into tokens using a regular expression that separates words and punctuation. It then strips whitespace from each token and filters out any empty strings. Each token is then mapped to its integer ID using the vocabulary, resulting in a list of integers representing the input text.

The `decode` method performs the reverse operation. It takes a list of integer IDs, converts each back to its string token, and joins them into a single string with spaces. To improve readability, it uses a regular expression to remove spaces that appear before certain punctuation marks, ensuring the reconstructed text looks natural.

Overall, this class demonstrates a simple approach to text tokenization and detokenization, which is a foundational step in many natural language processing tasks.
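
The round trip can be checked on a toy vocabulary. The sketch below uses two stand-alone functions that mirror the `encode` and `decode` methods. The vocabulary and the sentence are assumptions for illustration, not taken from the story.

```python
import re

# Toy vocabulary (an assumption for illustration).
str_to_int = {",": 0, ".": 1, "cat": 2, "sat": 3, "the": 4, "yes": 5}
int_to_str = {i: s for s, i in str_to_int.items()}

def encode(text):
    # Same pattern as SimpleTokenizerV1.encode.
    pieces = re.split(r'(--|[,.?;:!_()\]\s]+)', text)
    return [str_to_int[p.strip()] for p in pieces if p.strip()]

def decode(ids):
    text = " ".join(int_to_str[i] for i in ids)
    # Remove the space that join() put in front of punctuation.
    return re.sub(r'\s([,.?;:!_()\]])', r'\1', text)

ids = encode("yes, the cat sat.")
print(ids)
print(decode(ids))
```

For this simple sentence the decoded text matches the input exactly. That is not guaranteed in general, since `decode` always joins tokens with single spaces.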

## -------------------------------------------------------------------------

In [13]:
tokenizer=SimpleTokenizerV1(vocab)
text='''' "Oh, I knew him, and he knew me--only it happened after he was dead."

I dropped my voice instinctively. "When she sent for you?"

"Yes--quite insensible to the irony. She wanted him vindicated--and by me!" '''

ids=tokenizer.encode(text)
print(ids)


[57, 31, 66, 116, 664, 612, 66, 223, 597, 664, 732, 68, 799, 653, 590, 203, 597, 1141, 382, 69, 5, 116, 434, 766, 1135, 644, 69, 43, 942, 935, 519, 1192, 79, 5, 45, 68, 881, 642, 1080, 1050, 649, 69, 149, 1139, 612, 1131, 68, 223, 307, 732, 2, 5]


In [16]:
text2=tokenizer.decode(ids)
print(text2)

' "Oh, I knew him, and he knew me -- only it happened after he was dead. " I dropped my voice instinctively. "When she sent for you? " "Yes -- quite insensible to the irony. She wanted him vindicated -- and by me! "
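
The decoded text is not identical to the input: `decode` joins all tokens with single spaces and removes only the space in front of the listed punctuation, so `--` keeps a space on both sides, closing quotes stay detached, and the blank lines are gone. As a sketch of one possible post-processing step for the `--` case (`tighten_dashes` is a hypothetical helper, not part of `SimpleTokenizerV1`):

```python
import re

def tighten_dashes(text):
    # Remove the spaces that " ".join() placed around a double hyphen.
    return re.sub(r'\s*--\s*', '--', text)

decoded = 'he knew me -- only it happened after he was dead.'
print(tighten_dashes(decoded))
```

This assumes the source text never puts spaces around `--`. Quotes are harder to repair this way, because an opening and a closing `"` look the same once they are separate tokens.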


In [17]:
text3=''' Hello do you like tea? I like tea, but I also like coffee. Do you like coffee? Yes, I do!'''
ids2=tokenizer.encode(text3)
print(ids2)

KeyError: 'Hello'
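
The error is raised by the direct lookup `self.str_to_int[s]` in `encode`. One way to see in advance which tokens would fail is to compare the tokens against the vocabulary first. The sketch below uses a toy vocabulary, and `find_unknown_tokens` is a hypothetical helper name.

```python
import re

# Toy vocabulary standing in for the one built from the story.
toy_vocab = {"I": 0, "do": 1, "like": 2, "tea": 3, "you": 4, "?": 5, ",": 6}

def find_unknown_tokens(text, vocab):
    # Tokenize the same way as SimpleTokenizerV1.encode.
    pieces = re.split(r'(--|[,.?;:!_()\]\s]+)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    # Keep first-seen order while removing duplicates.
    return list(dict.fromkeys(t for t in tokens if t not in vocab))

print(find_unknown_tokens("Hello do you like tea? I like coffee", toy_vocab))
```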

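The `KeyError: 'Hello'` occurs because the word "Hello" never appears in the short story, so it is missing from the vocabulary built from it (a small training text yields few tokens).
To avoid this error we would need a much larger training set that covers more tokens.
But we can also handle it with
 ### special context tokens.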