### Large language models are neural networks that need an input to produce an output.
- But we cannot feed raw natural language into these networks, so we need a way to break text into chunks and then convert those chunks into numbers.
- Breaking the input text into smaller chunks, called 'tokens', is known as tokenization.
- When we tokenize a 'document' (a single unit of input to a language model), we end up with a sequence of tokens.
- These tokens are then encoded as integers called 'token ids'.
- We can use these token ids to look up embeddings, which are given to the language model as input.
- In this notebook we will build a simple tokenizer to tokenize "The Prophet" by Kahlil Gibran.
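
The steps above (document, tokens, token ids, embeddings) can be sketched end to end on a toy sentence. This is a minimal illustration under stated assumptions, not the tokenizer built later in this notebook: the sentence, the whitespace split, and the randomly initialised `embedding_table` are all made up for the example, and a real model learns its embeddings during training.

```python
import random

# Toy document (an assumption for illustration, not taken from the book).
document = "the cat sat on the mat"

# 1. Tokenize: break the document into tokens (naively, on whitespace).
tokens = document.split()

# 2. Encode: give every unique token an integer id.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# 3. Embed: look up one vector per token id.
#    The vectors are random here; a trained model would have learned them.
random.seed(0)
embedding_dim = 3
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]
embeddings = [embedding_table[token_id] for token_id in token_ids]

print(tokens)
print(token_ids)
print(len(embeddings), "vectors of size", embedding_dim)
```

Note that both occurrences of "the" map to the same token id and therefore to the same embedding vector.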

In [1]:
import os 
os.listdir()

['SimpleTokenizer.ipynb',
 'README.md',
 '.ipynb_checkpoints',
 '.git',
 'dprpht.txt']

In [21]:
with open('dprpht.txt', 'r', encoding = "utf-8") as f:
    raw_data = f.read()
print(f"Total characters in the raw_data string: {len(raw_data)}")
print(raw_data[:99])

Total characters in the raw_data string: 86102
﻿The Project Gutenberg eBook of The Prophet
    
This ebook is for the use of anyone anywhere in th
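
The file starts with a UTF-8 byte order mark (BOM), which is why the first token appears as `'\ufeffThe'` once the text is split. A minimal sketch of one way to avoid this, assuming you are free to change the decoding step: Python's `utf-8-sig` codec strips a leading BOM. The byte string below is a made-up stand-in for the start of the file.

```python
import codecs

# Hypothetical stand-in for the first bytes of a BOM-prefixed text file.
raw_bytes = codecs.BOM_UTF8 + "The Project Gutenberg eBook".encode("utf-8")

with_bom = raw_bytes.decode("utf-8")         # keeps the BOM as '\ufeff'
without_bom = raw_bytes.decode("utf-8-sig")  # strips a leading BOM

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

The same codec name can be passed to `open(..., encoding="utf-8-sig")` when reading the file.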


- This is a look at the first 99 characters of the text.
- Next, we will try to break it down into its constituents.
- First, let's start by splitting raw_data on whitespace.
- We will use Python's 're' regular expression library for this.

In [22]:
import re 
preprocessed = re.split(r'(\s)', raw_data)
print(preprocessed[:500])

['\ufeffThe', ' ', 'Project', ' ', 'Gutenberg', ' ', 'eBook', ' ', 'of', ' ', 'The', ' ', 'Prophet', '\n', '', ' ', '', ' ', '', ' ', '', ' ', '', '\n', 'This', ' ', 'ebook', ' ', 'is', ' ', 'for', ' ', 'the', ' ', 'use', ' ', 'of', ' ', 'anyone', ' ', 'anywhere', ' ', 'in', ' ', 'the', ' ', 'United', ' ', 'States', ' ', 'and', '\n', 'most', ' ', 'other', ' ', 'parts', ' ', 'of', ' ', 'the', ' ', 'world', ' ', 'at', ' ', 'no', ' ', 'cost', ' ', 'and', ' ', 'with', ' ', 'almost', ' ', 'no', ' ', 'restrictions', '\n', 'whatsoever.', ' ', 'You', ' ', 'may', ' ', 'copy', ' ', 'it,', ' ', 'give', ' ', 'it', ' ', 'away', ' ', 'or', ' ', 're-use', ' ', 'it', ' ', 'under', ' ', 'the', ' ', 'terms', '\n', 'of', ' ', 'the', ' ', 'Project', ' ', 'Gutenberg', ' ', 'License', ' ', 'included', ' ', 'with', ' ', 'this', ' ', 'ebook', ' ', 'or', ' ', 'online', '\n', 'at', ' ', 'www.gutenberg.org.', ' ', 'If', ' ', 'you', ' ', 'are', ' ', 'not', ' ', 'located', ' ', 'in', ' ', 'the', ' ', 'United', ' '
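
The empty strings in the output come from how `re.split` treats a capturing group: each single whitespace character is kept as its own item, so consecutive whitespace characters leave an empty string between them. A small sketch on a made-up sample string (not the book text) compares three patterns:

```python
import re

# Made-up sample: a newline followed by four spaces, as in the file header.
sample = "The Prophet\n    This ebook"

# No capturing group: separators are dropped, empty strings remain.
dropped = re.split(r'\s', sample)

# Capturing group on a single character: separators are kept,
# and consecutive whitespace still yields empty strings.
kept = re.split(r'(\s)', sample)

# Capturing group on a run of whitespace: no empty strings in between.
runs = re.split(r'(\s+)', sample)

print(dropped)
print(kept)
print(runs)
```

Matching runs with `\s+` and then filtering out whitespace-only items is one way to end up with a clean token list.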

- We can see there are a bunch of special characters we might need to take into consideration.
- There are also illustrations in the book, marked in the text as "\[Illustration: ####]". We will replace each marker with "\<ILLUSTRATION>".
- We will also strip away the beginning and the end of the document, marked by \*\*\* START OF THE PROJECT GUTENBERG EBOOK THE PROPHET \*\*\* and \*\*\* END OF THE PROJECT GUTENBERG EBOOK THE PROPHET \*\*\*.
- Also, let's put everything into a function.

In [24]:
def tokenize(raw_data):
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK THE PROPHET ***"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK THE PROPHET ***"

    start_idx = raw_data.find(start_marker)
    end_idx = raw_data.find(end_marker)

    if start_idx == -1 or end_idx == -1:
        raise ValueError("Start or end index not found")
   
    # Slice the content between markers
    content = raw_data[start_idx + len(start_marker):end_idx]

    # Flatten newlines, then replace illustration markers in the sliced content
    content = content.replace('\n', " ")
    text = re.sub(r'\[Illustration:\s*\d{4}\]', ' <ILLUSTRATION> ', content)
    preprocessed = re.split(r'(\s+|[.,:;?!“”"()\'’\-_—*[\]])', text)
    preprocessed = [item for item in preprocessed if item.strip()]
    return preprocessed
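
The two regular expressions used in the function can be tried in isolation on a short string. This is a minimal sketch: the sample sentence is made up for illustration and is not taken from the book.

```python
import re

# Made-up sample containing punctuation and one illustration marker.
sample = "Hello, world! [Illustration: 0012] Is this-a test?"

# Replace the illustration marker with a single placeholder token.
text = re.sub(r'\[Illustration:\s*\d{4}\]', ' <ILLUSTRATION> ', sample)

# Split on whitespace runs and on individual punctuation characters,
# keeping the punctuation because the pattern is a capturing group.
pieces = re.split(r'(\s+|[.,:;?!“”"()\'’\-_—*[\]])', text)

# Drop empty and whitespace-only items.
tokens = [item for item in pieces if item.strip()]

print(tokens)
```

Because `<` and `>` are not in the punctuation class, the placeholder survives as one token.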

- Now let's call the function. It already removes whitespace-only items from the preprocessed list.

In [25]:
preprocessed = tokenize(raw_data)
print(preprocessed[:50])

ValueError: Start or end index not found
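
The `ValueError` means `raw_data.find` returned -1 for at least one marker, so the exact marker string is not present in the file. One possible cause, offered as an assumption since the relevant part of the file is not shown here, is that the marker lines are worded or spaced differently from the hard-coded strings. A minimal sketch for listing candidate marker lines so they can be compared by eye; `find_marker_lines` and `sample` are hypothetical names introduced for this example.

```python
import re

def find_marker_lines(text):
    """Return every line that looks like a '*** ... ***' marker."""
    return [
        line.strip()
        for line in text.splitlines()
        if re.match(r'\s*\*\*\*.*\*\*\*\s*$', line)
    ]

# Toy stand-in for raw_data (hypothetical; the real file may differ).
sample = (
    "header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE TITLE ***\n"
    "body text\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE TITLE ***\n"
    "footer text\n"
)

for line in find_marker_lines(sample):
    print(repr(line))
```

Running `find_marker_lines(raw_data)` in the notebook would show the marker lines as they actually appear, which can then be copied into `start_marker` and `end_marker`.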