### Large language models are neural networks: they need an input to produce an output.
- But these networks cannot take natural language directly as input, so we need a way to break text into chunks and convert those chunks into numbers.
- Breaking the input text into smaller chunks called 'tokens' is called tokenization.
- When we tokenize a 'document' (a single unit of input to a language model), we end up with a list of tokens.
- These tokens are then encoded as integers called 'token ids'.
- The token ids are used to look up embeddings, which are given to the language model as input.
- In this notebook we will build a simple tokenizer and use it to tokenize "The Prophet" by Kahlil Gibran.  
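
The steps listed above can be sketched end to end on a toy sentence. This is only an illustration under stated assumptions: whitespace splitting stands in for a real tokenizer, and `embedding_table` is a hypothetical table of random vectors (in a real model these vectors are learned during training).

```python
import random

# Toy document; whitespace splitting stands in for a real tokenizer.
document = "the cat sat on the mat"
tokens = document.split()

# Map each unique token to an integer id (sorted for reproducibility).
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [toy_vocab[token] for token in tokens]

# Hypothetical embedding table: one random 4-dimensional vector per token id.
random.seed(0)
embedding_dim = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in toy_vocab
]

# Looking up an embedding is just indexing the table with the token id.
embeddings = [embedding_table[i] for i in token_ids]

print(tokens)
print(token_ids)                          # [4, 0, 3, 2, 4, 1]
print(len(embeddings), len(embeddings[0]))  # 6 4
```

Both occurrences of "the" map to the same id and therefore to the same embedding vector.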

In [3]:
with open('dprpht.txt','r',encoding = 'utf-8') as f:
    raw_data = f.read()
len(raw_data)

86102

In [4]:
import re
split_space = re.split(r'(\s)', raw_data)
print(split_space[:50])

['\ufeffThe', ' ', 'Project', ' ', 'Gutenberg', ' ', 'eBook', ' ', 'of', ' ', 'The', ' ', 'Prophet', '\n', '', ' ', '', ' ', '', ' ', '', ' ', '', '\n', 'This', ' ', 'ebook', ' ', 'is', ' ', 'for', ' ', 'the', ' ', 'use', ' ', 'of', ' ', 'anyone', ' ', 'anywhere', ' ', 'in', ' ', 'the', ' ', 'United', ' ', 'States', ' ']


- We can see there are some special characters (for example the leading '\ufeff' and the empty strings between consecutive whitespace characters) that we need to take into consideration.
- The book also contains illustrations, marked with "\[Illustration: ####]". We will replace each marker with "\<ILLUSTRATION>".
- We will also strip away everything before the beginning and after the end of the book, which are marked by *** START OF THE PROJECT GUTENBERG EBOOK THE PROPHET *** and *** END OF THE PROJECT GUTENBERG EBOOK THE PROPHET ***
- Also, let's put everything into a function.

In [5]:
def tokenize(raw_data):
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK THE PROPHET ***"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK THE PROPHET ***"

    start_idx = raw_data.find(start_marker)
    end_idx = raw_data.find(end_marker)

    if start_idx == -1 or end_idx == -1:
        raise ValueError("Start or end index not found")
   
    # Slice the content between markers
    content = raw_data[start_idx + len(start_marker):end_idx]

    content = content.replace('\n', " ")
    content = re.sub(r'\[Illustration:\s*\d{4}\]', ' <ILLUSTRATION> ', content)
    preprocessed = re.split(r'(\s+|[.,:;?!“”"()\'’\-_—*[\]])', content)
    preprocessed = [item for item in preprocessed if item.strip()]
    return preprocessed
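
To see what the split pattern in `tokenize` does, it can be run on a single short sentence. The sketch below reuses the same pattern; `sample` is a made-up input. Because the whole pattern sits inside a capturing group, `re.split` keeps the delimiters in its result, so punctuation survives as separate tokens while whitespace-only pieces are filtered out afterwards.

```python
import re

# Same pattern as in tokenize(): split on runs of whitespace or on a single
# punctuation character. The outer parentheses form a capturing group, so
# re.split returns the delimiters as well.
pattern = r'(\s+|[.,:;?!“”"()\'’\-_—*[\]])'

sample = "Speak to us of Love, said Almitra."
pieces = re.split(pattern, sample)

# Drop empty strings and whitespace-only pieces, keep words and punctuation.
demo_tokens = [piece for piece in pieces if piece.strip()]
print(demo_tokens)
# ['Speak', 'to', 'us', 'of', 'Love', ',', 'said', 'Almitra', '.']
```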

In [6]:
# start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK THE PROPHET ***"
# end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK THE PROPHET ***"
# start_idx = raw_data.find(start_marker)
# end_idx = raw_data.find(end_marker)
# print(start_idx, end_idx)
preprocessed = tokenize(raw_data)


In [7]:
print(preprocessed[:50])

['THE', 'PROPHET', 'By', 'Kahlil', 'Gibran', 'New', 'York', ':', 'Alfred', 'A', '.', 'Knopf', '1923', '_', 'The', 'Twelve', 'Illustrations', 'In', 'This', 'Volume', 'Are', 'Reproduced', 'From', 'Original', 'Drawings', 'By', 'The', 'Author', '_', '“', 'His', 'power', 'came', 'from', 'some', 'great', 'reservoir', 'of', 'spiritual', 'life', 'else', 'it', 'could', 'not', 'have', 'been', 'so', 'universal', 'and', 'so']


### Now we have the list of our preprocessed tokens. For the next step we will create the vocabulary and the token ids.
- We first collect the unique tokens in the 'preprocessed' list and sort them; each token's position in the sorted list becomes its token id.
- Then we create the vocabulary, which is simply a dictionary that maps each token to its id.


In [8]:
all_words = sorted(set(preprocessed))
print(f"number of words in vocab: {len(all_words)}")

number of words in vocab: 2162


In [9]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [10]:
for i,item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break
    

('*', 0)
(',', 1)
('-', 2)
('.', 3)
('1918', 4)
('1919', 5)
('1920', 6)
('1923', 7)
('1926', 8)
('1928', 9)
('1931', 10)
('1932', 11)
('1933', 12)
('1934', 13)
('1948', 14)
(':', 15)
(';', 16)
('<ILLUSTRATION>', 17)
('?', 18)
('A', 19)
('After', 20)
('Alfred', 21)
('All', 22)
('Almitra', 23)
('Almustafa', 24)
('Alone', 25)
('Always', 26)
('Am', 27)
('Among', 28)
('And', 29)
('Archer', 30)
('Are', 31)
('At', 32)
('Author', 33)
('Ay', 34)
('Aye', 35)
('BOOKS', 36)
('Be', 37)
('Beauty', 38)
('Blessed', 39)
('Bragdon', 40)
('Brief', 41)
('Build', 42)
('But', 43)
('Buying', 44)
('By', 45)
('CONTENTS', 46)
('Children', 47)
('Claude', 48)
('Clothes', 49)
('Come', 50)
('Coming', 51)
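
The order of the entries in this listing follows from how `sorted` compares strings: by Unicode code point. ASCII punctuation and digits therefore sort before uppercase letters, which sort before lowercase letters. A small sketch with a made-up token list (`toy_tokens` is not taken from the book) shows the same pattern:

```python
# Made-up token list with punctuation, a number, and mixed-case words.
toy_tokens = ["the", "Prophet", ",", "the", "1923", "And", "."]

# set() removes the duplicate "the"; sorted() orders by Unicode code point.
toy_words = sorted(set(toy_tokens))
toy_vocab = {token: idx for idx, token in enumerate(toy_words)}

print(toy_vocab)
# {',': 0, '.': 1, '1923': 2, 'And': 3, 'Prophet': 4, 'the': 5}
```

Sorting before enumerating also makes the ids reproducible across runs, since iterating over a `set` of strings has no guaranteed order.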


### Great! Now we have a vocabulary of all the tokens (words and punctuation) present in our document.
- However, we may encounter an input word that is not in our vocabulary.
- To handle this case we will extend the vocabulary with a special token "<|unk|>", which stands in for any word that is not in the vocabulary.
- We will also add another token, "<|endoftext|>", to mark the end of our data source.

In [11]:
all_words = sorted(list(set(preprocessed)))
all_words.extend(["<|endoftext|>", "<|unk|>"])                   
len(all_words)

2164

In [12]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [13]:
for item in list(vocab.items())[-5:]:
    print(item)

('’', 2159)
('“', 2160)
('”', 2161)
('<|endoftext|>', 2162)
('<|unk|>', 2163)
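
How the two special tokens are used can be sketched on a tiny hand-written vocabulary. Everything here is hypothetical: `toy_vocab`, the helper `to_ids`, and the two short documents. Using "<|endoftext|>" as a separator between documents is one common convention, shown here as an assumption about how the marker might be applied.

```python
# Tiny hand-written vocabulary that already contains both special tokens.
toy_vocab = {"love": 0, "of": 1, "us": 2, "<|endoftext|>": 3, "<|unk|>": 4}

def to_ids(tokens, vocab):
    """Map tokens to ids, falling back to the <|unk|> id for unseen tokens."""
    unk_id = vocab["<|unk|>"]
    return [vocab.get(token, unk_id) for token in tokens]

# Two hypothetical documents joined with the end-of-text marker.
doc_a = ["of", "love"]
doc_b = ["us", "neon"]  # "neon" is not in toy_vocab
combined = doc_a + ["<|endoftext|>"] + doc_b

print(to_ids(combined, toy_vocab))  # [1, 0, 3, 2, 4]
```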


### Now we also know how to deal with words that are not in our vocabulary!
- Further down the line, we will need a mechanism to convert token ids back into text.
- Let's write a **SimpleTokenizer** class with two methods:
  - encode
  - decode
- The **encode** method takes the input text, performs the preprocessing and returns the token ids.
- The **decode** method takes the token ids and returns the corresponding text.

In [25]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab                               # token -> id
        self.int_to_str = {i: t for t, i in vocab.items()}    # id -> token

    def encode(self, text):
        # Split on whitespace and punctuation, keeping punctuation as tokens
        preprocessed = re.split(r'(\s+|[.,:;?!“”"()\'’\-_—*[\]])', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace out-of-vocabulary tokens with <|unk|>
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[item] for item in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that the join inserted before closing punctuation
        text = re.sub(r'\s+([.,:;?!”)\]])', r'\1', text)
        return text


In [27]:
simpletokenizer = SimpleTokenizer(vocab)
sample_text = "And the wanderer, wrapped in silence, looked to the horizon, where the neon lights of the city flickered like distant memories."
token_ids = simpletokenizer.encode(sample_text)
print(token_ids)

[29, 1843, 2029, 1, 2163, 1060, 1667, 1, 1191, 1893, 1843, 2163, 1, 2074, 1843, 2163, 1161, 1333, 1843, 498, 2163, 1162, 660, 1240, 3]


In [28]:
decoded_text = simpletokenizer.decode(token_ids)
print(decoded_text)

And the wanderer, <|unk|> in silence, looked to the <|unk|>, where the <|unk|> lights of the city <|unk|> like distant memories.


### This is a simplified implementation of a tokenizer. Large language models like GPT instead use byte pair encoding (BPE), which breaks words down even further into subword units.
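
The core idea of byte pair encoding can be sketched in a few lines: start from individual characters, then repeatedly merge the most frequent adjacent pair of symbols into a new symbol. The code below is a minimal character-level illustration, not the tokenizer GPT actually uses (GPT-2 and later operate on bytes and add further rules). The helper names `most_frequent_pair` and `merge_pair` and the toy word frequencies are assumptions made up for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pair_counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return pair_counts.most_common(1)[0][0] if pair_counts else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters with a made-up frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After a few merges the frequent suffix "est" becomes a single symbol, so a word such as "newest" is represented by a handful of subword units instead of six characters. Because every word can always fall back to smaller pieces, a BPE vocabulary rarely needs an unknown token.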