# Tokenizer Module Specification

#### We introduce a `Tokenizer` class for simple word-level tokenization. Its `encode` and `decode` methods convert between raw text and sequences of token IDs. It holds two attributes: `str_to_int`, the vocabulary dictionary mapping string tokens to integer IDs, and `int_to_str`, the reverse mapping used for decoding. Decoding is not strictly lossless: out-of-vocabulary words come back as `<|unk|>`, and the original whitespace is normalized.

#### The `Tokenizer` is instantiated with a vocabulary built from the text of "The Verdict". The vocabulary also includes two special tokens:
###### `<|unk|>`: a reserved token that stands in for out-of-vocabulary (OOV) words.
###### `<|endoftext|>`: a sentinel token marking the end of a text, useful as a boundary marker when texts from several sources are concatenated into one input stream.

#### To create the vocabulary, we'll use the short story "The Verdict" by Edith Wharton.

In [10]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))

Total number of characters: 20479


#### Use `re` (regular expressions) to split the text into tokens

In [15]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

print(len(preprocessed))

4690
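
To make the splitting step easier to follow, here is a minimal sketch that applies the same pattern to a short string. The sample sentence is a hypothetical example, not taken from the corpus. Because the pattern is wrapped in a capturing group, `re.split` keeps the delimiters in its result, which is why punctuation marks survive as tokens of their own.

```python
import re

# Hypothetical sample sentence (an assumption for illustration only).
sample = "Hello, world. Is this-- a test?"

# The capturing group makes re.split return the delimiters as well.
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only pieces.
tokens = [p.strip() for p in pieces if p.strip()]

print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Note that whitespace itself is discarded, so the token list alone cannot reproduce the original spacing.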


#### Constructing the vocabulary

In [25]:
all_words = sorted(set(preprocessed))
all_words.extend(["<|endoftext|>", "<|unk|>"])

vocab_size = len(all_words)
vocab = {token: integer for integer, token in enumerate(all_words)}

print(vocab_size)

# Print the first five and the last four entries
for i, item in enumerate(vocab.items()):
    if i < 5 or i >= vocab_size - 4:
        print(item)

1132
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)
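
The same construction can be traced by hand on a tiny input. The sketch below uses a hypothetical token list (`toy_tokens`, an assumption standing in for `preprocessed`) to show why the two special tokens receive the highest IDs: they are appended after sorting, so they always land at the end.

```python
# Hypothetical toy token list standing in for `preprocessed`.
toy_tokens = ["the", "cat", ",", "the", "dog", "."]

# Deduplicate and sort; Python sorts strings by code point,
# so punctuation comes before letters.
toy_words = sorted(set(toy_tokens))

# Append the special tokens after sorting so they get the last two IDs.
toy_words.extend(["<|endoftext|>", "<|unk|>"])

toy_vocab = {token: idx for idx, token in enumerate(toy_words)}

print(toy_vocab)
# {',': 0, '.': 1, 'cat': 2, 'dog': 3, 'the': 4, '<|endoftext|>': 5, '<|unk|>': 6}
```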


#### Tokenizer class

In [28]:
class Tokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        # Reverse mapping (ID -> token) used by decode
        self.int_to_str = {i: s for s, i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace tokens that are not in the vocabulary with <|unk|>
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the whitespace before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
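
The decoding step can be examined in isolation. The following sketch, using a hypothetical token list, joins tokens with spaces and then applies the same substitution as `decode`. It also illustrates one limitation: `--` is not in the punctuation class, so the spaces around it stay, and the decoded text may differ from the original spacing.

```python
import re

# Hypothetical tokens (an assumption for illustration only).
tokens = ["Hello", ",", "world", "--", "is", "this", "a", "test", "?"]

# Joining with spaces puts a space before every punctuation mark.
joined = " ".join(tokens)

# Remove whitespace that precedes the listed punctuation characters.
cleaned = re.sub(r'\s+([,.:;?!"()\'])', r'\1', joined)

print(joined)   # Hello , world -- is this a test ?
print(cleaned)  # Hello, world -- is this a test?
```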

#### Demonstration

In [40]:
tokenizer = Tokenizer(vocab)

text1 = "Hello, my friend how do you do?"
text2 = "In the sunlit terraces of the Gittusburgs."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, my friend how do you do? <|endoftext|> In the sunlit terraces of the Gittusburgs.


In [42]:
ids = tokenizer.encode(text)

In [44]:
tokenizer.decode(ids)

'<|unk|>, my friend how do you do? <|endoftext|> In the sunlit terraces of the <|unk|>.'
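
To see how out-of-vocabulary words end up as `<|unk|>`, it can help to look at the IDs directly. The sketch below is not the class above: it uses a hypothetical toy vocabulary and a hypothetical helper `toy_encode`, and it swaps the membership check for `dict.get` with a default, which is one possible alternative way to express the same fallback.

```python
import re

# Hypothetical toy vocabulary (an assumption for illustration only).
toy_vocab = {",": 0, ".": 1, "do": 2, "you": 3, "<|endoftext|>": 4, "<|unk|>": 5}
unk_id = toy_vocab["<|unk|>"]

def toy_encode(text):
    """Split text as in the tokenizer, mapping unknown tokens to unk_id."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    # dict.get returns unk_id whenever the token is missing from the vocabulary.
    return [toy_vocab.get(t, unk_id) for t in tokens]

ids = toy_encode("Hello, do you do. <|endoftext|> Bye.")

print(ids)                # [5, 0, 2, 3, 2, 1, 4, 5, 1]
print(ids.count(unk_id))  # 2
```

The two occurrences of `unk_id` correspond to "Hello" and "Bye", neither of which is in the toy vocabulary. The `<|endoftext|>` marker passes through the split intact because none of its characters appear in the delimiter pattern.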