# Data Pre-processing
This notebook preprocesses and transforms text data into the form required by an LLM.

How to prepare text as input for LLMs:

1. Split text into individual word and subword tokens
2. Convert tokens into token IDs
3. Convert token IDs into vector embeddings
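
The three steps above can be sketched end to end on a toy sentence. This is only an illustration under stated assumptions: the sentence, the name `toy_vocab`, and the embedding size of 3 are made up for the example, and the embedding table is filled with random numbers, whereas a real model learns these values during training.

```python
import random
import re

text = "Hello, world. Hello again."

# Step 1: split into tokens, keeping punctuation and dropping whitespace.
tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]

# Step 2: map each unique token to an integer ID.
toy_vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [toy_vocab[tok] for tok in tokens]

# Step 3: look up one vector per token ID in an embedding table.
# The table is random here (assumption); in a real model it is learned.
embedding_dim = 3  # hypothetical, kept tiny for illustration
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in toy_vocab
]
embeddings = [embedding_table[i] for i in token_ids]

print(tokens)
print(token_ids)
print(len(embeddings), "vectors of size", len(embeddings[0]))
```

Both occurrences of `Hello` receive the same ID and therefore the same vector, which is the point of going through a vocabulary.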

In [37]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()


print("total number of characters in the text : ", len(raw_text))

print("\nfirst 50 characters:")
print(raw_text[:50]) # first 50 characters

total number of characters in the text :  20479

first 50 characters:
I HAD always thought Jack Gisburn rather a cheap g


The goal is to tokenize the text above for use with LLMs.

In [38]:
import re

print(raw_text[:60])  # first 60 characters, including spaces
# Split on punctuation, double dashes, and whitespace; the capture group keeps the separators as tokens
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
# Optional: drop the empty strings that adjacent separators leave behind
# preprocessed = [item for item in preprocessed if item]
preprocessed[:50]  # first 50 elements of the list

I HAD always thought Jack Gisburn rather a cheap genius--tho


['I',
 ' ',
 'HAD',
 ' ',
 'always',
 ' ',
 'thought',
 ' ',
 'Jack',
 ' ',
 'Gisburn',
 ' ',
 'rather',
 ' ',
 'a',
 ' ',
 'cheap',
 ' ',
 'genius',
 '--',
 'though',
 ' ',
 'a',
 ' ',
 'good',
 ' ',
 'fellow',
 ' ',
 'enough',
 '--',
 'so',
 ' ',
 'it',
 ' ',
 'was',
 ' ',
 'no',
 ' ',
 'great',
 ' ',
 'surprise',
 ' ',
 'to',
 ' ',
 'me',
 ' ',
 'to',
 ' ',
 'hear',
 ' ']
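
The output above still contains whitespace tokens, and the full list also contains empty strings. Whether to keep them is a design choice: whitespace matters for text where layout is significant, but dropping it gives a shorter token list and a smaller vocabulary. A minimal sketch of the filtering step, assuming whitespace can be discarded (the names `sample`, `raw_tokens`, and `clean_tokens` are introduced here for illustration):

```python
import re

sample = "I HAD always thought Jack Gisburn rather a cheap genius--though"
pattern = r'([,.:;?_!"()\']|--|\s)'

# The capture group makes re.split return the separators as list items too.
raw_tokens = re.split(pattern, sample)

# Keep only items that contain something other than whitespace;
# this removes both ' ' tokens and empty strings.
clean_tokens = [tok for tok in raw_tokens if tok.strip()]

print(len(raw_tokens), len(clean_tokens))
print(clean_tokens)
```

Applying the same filter to `preprocessed` would change the counts reported in the cells that follow, so the rest of this notebook keeps the unfiltered list.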

In [39]:
# Number of unique tokens (words, punctuation, and whitespace) in the book
len(set(preprocessed))

1133

In [40]:
sorted(set(preprocessed))

['',
 '\n',
 ' ',
 '!',
 '"',
 "'",
 '(',
 ')',
 ',',
 '--',
 '.',
 ':',
 ';',
 '?',
 'A',
 'Ah',
 'Among',
 'And',
 'Are',
 'Arrt',
 'As',
 'At',
 'Be',
 'Begin',
 'Burlington',
 'But',
 'By',
 'Carlo',
 'Chicago',
 'Claude',
 'Come',
 'Croft',
 'Destroyed',
 'Devonshire',
 'Don',
 'Dubarry',
 'Emperors',
 'Florence',
 'For',
 'Gallery',
 'Gideon',
 'Gisburn',
 'Gisburns',
 'Grafton',
 'Greek',
 'Grindle',
 'Grindles',
 'HAD',
 'Had',
 'Hang',
 'Has',
 'He',
 'Her',
 'Hermia',
 'His',
 'How',
 'I',
 'If',
 'In',
 'It',
 'Jack',
 'Jove',
 'Just',
 'Lord',
 'Made',
 'Miss',
 'Money',
 'Monte',
 'Moon-dancers',
 'Mr',
 'Mrs',
 'My',
 'Never',
 'No',
 'Now',
 'Nutley',
 'Of',
 'Oh',
 'On',
 'Once',
 'Only',
 'Or',
 'Perhaps',
 'Poor',
 'Professional',
 'Renaissance',
 'Rickham',
 'Riviera',
 'Rome',
 'Russian',
 'Sevres',
 'She',
 'Stroud',
 'Strouds',
 'Suddenly',
 'That',
 'The',
 'Then',
 'There',
 'They',
 'This',
 'Those',
 'Though',
 'Thwing',
 'Thwings',
 'To',
 'Usually',
 'Veneti

### Converting tokens into token IDs
In the step above, the text was broken down into individual tokens (words, punctuation marks, and whitespace).
Each unique token is now assigned an integer, called its token ID.

In [41]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1133


In [21]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [42]:
vocab

{'': 0,
 '\n': 1,
 ' ': 2,
 '!': 3,
 '"': 4,
 "'": 5,
 '(': 6,
 ')': 7,
 ',': 8,
 '--': 9,
 '.': 10,
 ':': 11,
 ';': 12,
 '?': 13,
 'A': 14,
 'Ah': 15,
 'Among': 16,
 'And': 17,
 'Are': 18,
 'Arrt': 19,
 'As': 20,
 'At': 21,
 'Be': 22,
 'Begin': 23,
 'Burlington': 24,
 'But': 25,
 'By': 26,
 'Carlo': 27,
 'Chicago': 28,
 'Claude': 29,
 'Come': 30,
 'Croft': 31,
 'Destroyed': 32,
 'Devonshire': 33,
 'Don': 34,
 'Dubarry': 35,
 'Emperors': 36,
 'Florence': 37,
 'For': 38,
 'Gallery': 39,
 'Gideon': 40,
 'Gisburn': 41,
 'Gisburns': 42,
 'Grafton': 43,
 'Greek': 44,
 'Grindle': 45,
 'Grindles': 46,
 'HAD': 47,
 'Had': 48,
 'Hang': 49,
 'Has': 50,
 'He': 51,
 'Her': 52,
 'Hermia': 53,
 'His': 54,
 'How': 55,
 'I': 56,
 'If': 57,
 'In': 58,
 'It': 59,
 'Jack': 60,
 'Jove': 61,
 'Just': 62,
 'Lord': 63,
 'Made': 64,
 'Miss': 65,
 'Money': 66,
 'Monte': 67,
 'Moon-dancers': 68,
 'Mr': 69,
 'Mrs': 70,
 'My': 71,
 'Never': 72,
 'No': 73,
 'Now': 74,
 'Nutley': 75,
 'Of': 76,
 'Oh': 77,
 'On': 78

In [24]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 20:
        break

('', 0)
('\n', 1)
(' ', 2)
('!', 3)
('"', 4)
("'", 5)
('(', 6)
(')', 7)
(',', 8)
('--', 9)
('.', 10)
(':', 11)
(';', 12)
('?', 13)
('A', 14)
('Ah', 15)
('Among', 16)
('And', 17)
('Are', 18)
('Arrt', 19)
('As', 20)
('At', 21)


These token IDs can now be used to encode text.

For example, consider the following sample sentence:

The brown dog playfully chased the swift fox

In [None]:
# Look up the token ID of each word in the sample sentence.
sentence = "The brown dog playfully chased the swift fox"
tokenized_sentence = sentence.split(" ")

token_id = []

for word in tokenized_sentence:
    # Words missing from the vocabulary fall back to ID 0
    # (note that 0 is also the ID of the empty string '').
    token_id.append(vocab.get(word, 0))

In [33]:
print(sentence)
print(tokenized_sentence)
print(token_id)

The brown dog playfully chased the swift fox
['The', 'brown', 'dog', 'playfully', 'chased', 'the', 'swift', 'fox']
[96, 238, 0, 0, 0, 991, 0, 0]
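
In the output above, every word missing from the vocabulary is mapped to 0, which is also the ID of the empty string, so unknown words cannot be told apart from a real token. One common way around this is to reserve a dedicated unknown token. The sketch below is a hedged illustration, not the notebook's method: the `<|unk|>` string, the helper names, and the toy corpus are all assumptions, and whitespace tokens are dropped for brevity.

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'
UNK = "<|unk|>"  # hypothetical placeholder for out-of-vocabulary tokens


def tokenize(text):
    """Split text with the notebook's regex and drop whitespace-only pieces."""
    return [tok for tok in re.split(PATTERN, text) if tok.strip()]


def build_vocab(tokens):
    """Assign IDs to the sorted unique tokens, then append the unknown token."""
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
    vocab[UNK] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map each token to its ID, using the UNK ID for unseen tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]


def decode(ids, vocab):
    """Map IDs back to tokens and join them with single spaces."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


# Toy corpus standing in for the tokenized book (assumption).
toy_vocab = build_vocab(tokenize("The brown dog chased the fox."))
ids = encode("The brown dog playfully chased the swift fox", toy_vocab)
print(ids)
print(decode(ids, toy_vocab))
```

Decoding makes the unknown words visible as `<|unk|>` instead of silently turning them into empty strings.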


Total number of tokens in the book (including whitespace and empty strings):

In [45]:
print(len(preprocessed))

9235
