# Working with text data

We are currently at step 1, data preparation and sampling.

To prepare the input text, we need to split it into individual word and punctuation tokens so that we can encode them.

Embedding refers to the process of converting data, in this case text, into a vector format.

The purpose is to have a data format that neural networks can process.

There are different kinds of embeddings; we will focus on word embeddings, since we want to generate text one word at a time.

Word embeddings can have varying dimensions, from one to thousands. A higher
dimensionality might capture more nuanced relationships but at the cost of computational efficiency.
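
As a rough illustration of what "converting text into a vector format" means, the sketch below treats an embedding as a lookup table with one row per token ID. The sizes, the random initial values, and the helper name `embed` are assumptions made for this example only; in a real model the values are learned during training.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocab_size = 6      # assumption: number of distinct tokens in a toy vocabulary
embedding_dim = 3   # assumption: real models use hundreds or thousands of dimensions

# One row of `embedding_dim` floats per token ID (random here, learned in practice).
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Hypothetical helper: look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([2, 5, 0])
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Raising `embedding_dim` makes every row longer, which is where the extra memory and compute cost of higher-dimensional embeddings comes from.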

In [5]:
# To practice this, we will use the-verdict.txt file
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of character:", len(raw_text))
print(raw_text[:99])

# We want to turn all these characters into tokens that we can embed

# To split the text into tokens we use the re library
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
print("Separating words: ",result)

# We also want commas and periods to become separate tokens
result = re.split(r'([,.]|\s)', text)
print("Separating dots and commas: ",result)

# Remove empty strings and whitespace-only items
result = [item for item in result if item.strip()]
print("Removing blank spaces: ",result)

# Whether to remove whitespace depends on the goal: removing it reduces memory use,
# but keeping it can be necessary to avoid errors when exact whitespace matters.

# Taking into account all punctuation marks
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print("With punctuation: ", result)

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 
Separating words:  ['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
Separating dots and commas:  ['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
Removing blank spaces:  ['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
With punctuation:  ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
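
The empty strings in the "Separating dots and commas" output appear wherever two delimiters are adjacent, such as a comma directly followed by a space: `re.split` returns the (empty) text between them. A minimal sketch on a shorter, made-up string, comparing the pattern with and without a capturing group:

```python
import re

text = "Hello, world."

# Without a capturing group, the delimiters themselves are dropped.
without_group = re.split(r'[,.]|\s', text)

# With a capturing group, each delimiter is kept as its own item.
with_group = re.split(r'([,.]|\s)', text)

# item.strip() is falsy for '' and for whitespace-only items, so both are filtered out.
tokens = [item for item in with_group if item.strip()]

print(without_group)  # ['Hello', '', 'world', '']
print(with_group)     # ['Hello', ',', '', ' ', 'world', '.', '']
print(tokens)         # ['Hello', ',', 'world', '.']
```

The capturing group is what keeps punctuation available as tokens; the filtering step then only has to discard the empty and whitespace-only leftovers.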


In [8]:
# Applying this to our whole text
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print("Total amount of tokens (no whitespaces): ",len(preprocessed))

print("First 30 tokens: ",preprocessed[:30])

Total amount of tokens (no whitespaces):  4690
First 30 tokens:  ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


We need to assign a token ID to each token, in other words, map each unique token to an integer.

In [None]:
# set() keeps the unique tokens; sorted() orders them by character code,
# so punctuation comes first, then capitalized words, then lowercase words
all_words = sorted(set(preprocessed))
print(len(all_words))

# Build the vocabulary and print its first 51 entries
vocab = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

1130
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
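
The same construction can be followed by hand on a very small input. The sketch below uses a made-up token list and the hypothetical names `toy_vocab` and `toy_inverse_vocab`; it also shows the reverse mapping that is needed to turn IDs back into tokens.

```python
toy_tokens = ["The", "cat", "sat", "on", "the", "mat", "."]

# Unique tokens, ordered by character code: punctuation, then uppercase, then lowercase.
unique_tokens = sorted(set(toy_tokens))

# Token string -> integer ID, and the reverse mapping.
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}
toy_inverse_vocab = {idx: token for token, idx in toy_vocab.items()}

ids = [toy_vocab[t] for t in toy_tokens]

print(unique_tokens)                        # ['.', 'The', 'cat', 'mat', 'on', 'sat', 'the']
print(ids)                                  # [1, 2, 5, 4, 6, 3, 0]
print([toy_inverse_vocab[i] for i in ids])  # the original token list
```

Note that "The" and "the" receive different IDs, which matches the listing above where 'HAD' and 'Had' are separate entries.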


We will create a class with an encoder that splits text into tokens and maps them to token IDs, and a decoder that reverses this operation.

In [13]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab  # Maps token strings to integer IDs
        self.int_to_str = {i: s for s, i in vocab.items()}  # Reverse mapping: integer IDs to token strings

    def encode(self, text):
        # Split with the same pattern used to build the vocabulary (including ':' and ';')
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]  # Drops empty strings and whitespace-only items
        ids = [self.str_to_int[s] for s in preprocessed]  # Converts each token into its integer ID
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])  # Integer IDs back to tokens, joined with spaces

        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)  # Removes the space before punctuation
        return text

In [None]:
# Trying this class on a short passage from the text
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
 Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print("Token ID: ",ids)
print("Texts decoded: ", tokenizer.decode(ids))

# This works; now trying a text that is not part of the training set
text = "Hello, do you like tea?"
# print(tokenizer.encode(text))

# Uncommenting the line above raises a KeyError, because "Hello" does not appear in the original text

Token ID:  [1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
Texts decoded:  " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


KeyError: 'Hello'
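
One way to see in advance which tokens would trigger this error is to compare the split text against the vocabulary before encoding. The sketch below is self-contained: `split_tokens` and `find_unknown_tokens` are hypothetical helpers, and `toy_vocab` is a small stand-in for the vocabulary built from the-verdict.txt.

```python
import re

def split_tokens(text):
    """Split on punctuation, '--' and whitespace, dropping empty items."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in pieces if p.strip()]

def find_unknown_tokens(text, vocab):
    """Hypothetical helper: return the tokens of `text` that are missing from `vocab`."""
    return [tok for tok in split_tokens(text) if tok not in vocab]

# Toy vocabulary (assumption): only a handful of tokens, none of them "Hello".
toy_vocab = {tok: i for i, tok in enumerate(sorted({",", "?", "do", "like", "tea", "you"}))}

missing = find_unknown_tokens("Hello, do you like tea?", toy_vocab)
print(missing)  # ['Hello']
```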

We need to adapt to unknown words, so we will modify both the vocabulary and the tokenizer.

Special tokens will handle this.

We can add one token, "<|unk|>", that stands in for unknown words, and another, "<|endoftext|>", that separates unrelated texts. The latter helps because independent texts that are concatenated into a single input would otherwise look like one continuous text, even though they are actually unrelated.



In [None]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"]) # Appending the two special tokens at the end of the vocabulary
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab.items()))

1132
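
With the two special tokens in the vocabulary, the tokenizer can be adapted to use them. The following is a minimal sketch rather than a definitive implementation: the class name `SimpleTokenizerV2`, the toy vocabulary, and the two sample sentences are assumptions chosen to keep the example self-contained.

```python
import re

class SimpleTokenizerV2:
    """Sketch: like SimpleTokenizerV1, but maps out-of-vocabulary tokens to <|unk|>."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [p.strip() for p in pieces if p.strip()]
        # Replace any token missing from the vocabulary with the <|unk|> placeholder.
        tokens = [t if t in self.str_to_int else "<|unk|>" for t in tokens]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() placed in front of punctuation.
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

# Toy vocabulary (assumption) with the two special tokens appended at the end.
toy_tokens = sorted({",", ".", "?", "do", "like", "palace", "tea", "the", "you"})
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {tok: i for i, tok in enumerate(toy_tokens)}

toy_tokenizer = SimpleTokenizerV2(toy_vocab)

# Two unrelated texts, joined with the <|endoftext|> separator.
text1 = "Hello, do you like tea?"
text2 = "In the palace."
joined = " <|endoftext|> ".join((text1, text2))

ids = toy_tokenizer.encode(joined)
print(ids)
print(toy_tokenizer.decode(ids))
```

Decoding gives `<|unk|>, do you like tea? <|endoftext|> <|unk|> the palace.`: the unknown words "Hello" and "In" are replaced by the placeholder instead of raising a KeyError, and the separator marks where the first text ends.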
