It’s common to process millions of articles and hundreds of thousands
of books (many gigabytes of text) when working with LLMs. However, for
educational purposes, it’s sufficient to work with a smaller text sample, such
as a single book, to illustrate the main ideas behind the text processing steps
and to keep run times reasonable on consumer hardware.

In [8]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text))

# Print the first 99 characters of the text
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [6]:
# Use Python’s built-in regular expression module, re
import re

In [12]:
text = "Hellow, World. This is test"
result = re.split(r'(\s)', text)
print(result)

['Hellow,', ' ', 'World.', ' ', 'This', ' ', 'is', ' ', 'test']
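
The whitespace items appear in the result because the pattern is wrapped in a capture group. As a small illustrative sketch (the variable names `sample`, `without_group`, and `with_group` are made up for this example), the two forms of the pattern can be compared side by side:

```python
import re

sample = "Hellow, World. This is test"

# Without a capture group, the matched whitespace delimiters are discarded.
without_group = re.split(r'\s', sample)

# With a capture group, each matched delimiter is kept as its own list item.
with_group = re.split(r'(\s)', sample)

print(without_group)
print(with_group)

# Because nothing is thrown away, joining the pieces restores the input.
print("".join(with_group) == sample)
```

Keeping the delimiters matters later on, when punctuation is added to the pattern and needs to survive as tokens.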


In [16]:
# Split on commas, periods, and whitespace so that words and punctuation become separate items
result = re.split(r'([,.]|\s)', text)
print(result)

['Hellow', ',', '', ' ', 'World', '.', '', ' ', 'This', ' ', 'is', ' ', 'test']


In [22]:
# A small remaining problem is that the list still includes whitespace and empty strings; remove them
result = [item for item in result if item.strip()]
print(result)

['Hellow', ',', 'World', '.', 'This', 'is', 'test']


In [24]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
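
The split-and-filter steps from the cell above could be wrapped in a small helper so they can be reused on other strings. This is only a sketch: the names `tokenize` and `TOKEN_PATTERN` are hypothetical and chosen for this example, and the pattern is the same one used above.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
# TOKEN_PATTERN and tokenize are illustrative names, not part of any library.
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = TOKEN_PATTERN.split(text)
    # strip() removes surrounding whitespace; empty results are filtered out.
    return [piece.strip() for piece in pieces if piece.strip()]

print(tokenize("Hello, world. Is this-- a test?"))
```

Compiling the pattern once is a minor convenience when the same pattern is applied to many strings.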


# Now that we have a basic tokenizer working, let’s apply it to Edith Wharton’s entire short story:

In [27]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


In [29]:
# Let’s print the first 30 tokens for a quick visual check:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
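
A visual check can be complemented by a couple of programmatic ones. The sketch below is self-contained and assumes only the opening of the story as printed earlier in this section, together with the pattern from the test-sentence cell. It checks that no token contains whitespace and that a double dash only ever appears as a token of its own. The variable names `opening` and `tokens` are made up for this example.

```python
import re

# Opening of the story, copied from the sample printed earlier in this section.
opening = ("I HAD always thought Jack Gisburn rather a cheap genius--though "
           "a good fellow enough--so it was no ")

tokens = re.split(r'([,.:;?_!"()\']|--|\s)', opening)
tokens = [item.strip() for item in tokens if item.strip()]

# No token should contain whitespace.
assert all(not any(ch.isspace() for ch in tok) for tok in tokens)

# A double dash should be a standalone token, never glued to a word.
assert all(tok == "--" or "--" not in tok for tok in tokens)

print(len(tokens), tokens[9:12])
```

If either assertion fails on a longer text, the split pattern is the first place to look, for example for stray spaces inside the pattern.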
