# Tokenizer

In [None]:
# Original code shown in the course at 18:00
string_to_int = { ch:i for i,ch in enumerate(chars) }
int_to_string = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: " ".join([int_to_string[i] for i in l])

The code below is based on the code shown at the 18:00 timestamp of the freeCodeCamp course video (the cell above).

This notebook writes that code out in a longer, more basic form, explains how it works, and then shows how it is refactored.

The code reads the raw text of the classic novel "Dorothy and the Wizard of Oz" by L. Frank Baum. It then reduces the text to the set of unique characters that appear at least once in the book and assigns each one an integer value. This tokenizes the text at the character level: every letter, digit, punctuation mark, and whitespace character is mapped to a number.

The encoder function then takes a string from the user and encodes its characters into their corresponding integer values. The decoder function takes the output of encoder and translates it back into characters.
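The round trip described above can be tried on a short sample string before working with the whole book. This is only a minimal sketch: the sample text and the `sample_*` names are made up for illustration and are not part of the notebook. It joins with an empty string so the decoded text matches the input exactly.

```python
# Minimal sketch of a character-level round trip on a tiny, made-up corpus.
sample_text = "hello world"  # hypothetical stand-in for the book

# Unique characters, sorted so the mapping is reproducible.
sample_chars = sorted(set(sample_text))

# Character -> integer and integer -> character lookup tables.
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}

# Encode a phrase, then decode it again.
encoded_sample = [sample_stoi[ch] for ch in "hello"]
decoded_sample = "".join(sample_itos[i] for i in encoded_sample)

print(sample_chars)
print(encoded_sample)
print(decoded_sample)
```

Only characters that occur in `sample_text` can be encoded, which is the same restriction the book-based vocabulary has.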

In [3]:
# Open book & save content to book variable.
def tokenize():
    with open("./wizardOfOz/pg22566_edited.txt", "r", encoding="utf-8") as f:
        # The with block closes the file automatically on exit.
        book = f.read()
    
    # Identify unique characters that appeared at least once in the book.
    chars = sorted(set(book))
    
    # Create the list of integer values corresponding to characters.
    tokens = list(range(0, len(chars)))
    
    # Create a dictionary with characters as key, token as values.
    string_to_int = dict(zip(chars, tokens))
    
    # Create a dictionary with tokens as keys, characters as values.
    int_to_string = dict(zip(tokens, chars))

    return string_to_int, int_to_string


# Encoder function takes a string input from user, then returns its token equivalent as a list.
def encoder(string_to_int):
    phrase = list(input("Provide a phrase to encode into token integers\n: >>>"))
    encoded = []
    for character in phrase:
        encoded.append(string_to_int[character])
    print("This is the encoded value as a list of integer tokens: ")
    print(encoded)
    return encoded

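# Decoder function takes the token list output from encoder and prints the text it maps back to.
def decoder(encoded):
    decoded = []
    # Use "token" as the loop name so the built-in int() is not shadowed.
    for token in encoded:
        decoded.append(int_to_string[token])
    # Joining with " " puts a space between every character in the printed text.
    decoded = " ".join(decoded)
    print("This is the original text:")
    print(decoded)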

# To execute the script
if __name__=='__main__':
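    # Capture both dictionaries; decoder() reads int_to_string as a global.
    string_to_int, int_to_string = tokenize()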
    tokenized_phrase = encoder(string_to_int)
    decoder(tokenized_phrase)

Provide a phrase to encode into token integers
: >>> TOKENS


This is the encoded value as a list of integer tokens: 
[44, 39, 35, 29, 38, 43]
This is the original text:
T O K E N S


## Code breakdown

Open the book, "Dorothy and the Wizard of Oz", and store its contents in a string variable named book.

In [9]:
with open("./wizardOfOz/pg22566_edited.txt", "r", encoding="utf-8") as f:
    # Store the book in a string variable, book.
    # The with block closes the file on exit, so no f.close() is needed.
    book = f.read()

Convert book from one big block of text into a set, so that each character is kept only once, without duplicates. For example, there is only one element for the space " " even though the book contains many spaces.

A set has no defined order, so use sorted() to get a list ordered by Unicode code point, which is why uppercase letters come before lowercase ones in the output.

In [15]:
chars = sorted(set(book))

print("These are the characters that appeared at least once in Wizard of Oz.\n")

print(chars)

print(f"\nThere are {len(chars)} characters that appeared at least once in Wizard of Oz.")

These are the characters that appeared at least once in Wizard of Oz.

['\n', ' ', '!', '"', '&', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

There are 80 characters that appeared at least once in Wizard of Oz.
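The same two steps can be checked on a short string where the result is easy to verify by eye. The sample string and variable names here are assumptions chosen for illustration.

```python
# Sketch: deduplicate characters with set(), then order them with sorted().
tiny_text = "Oz, a b!"  # hypothetical sample containing two spaces

unique_chars = set(tiny_text)        # duplicates (the second space) are dropped
ordered_chars = sorted(unique_chars)  # ordered by Unicode code point

print(len(tiny_text), len(unique_chars))
print(ordered_chars)
```

The space and punctuation sort ahead of the letters, and the uppercase "O" sorts ahead of the lowercase letters, matching the pattern in the book's character list above.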


Create a list of equal length containing integer values corresponding to each character.

In [4]:
# Create the list of integer values corresponding to characters.
tokens = list(range(0, len(chars)))

print("This is the list of tokens corresponding to the chars list.")

print(tokens)

This is the list of tokens corresponding to the chars list.
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]


The first character's token value is 0.

The last character's token value is len(chars) - 1, here 79.
Although range(x, y) excludes y, counting starts at 0, so range(0, len(chars)) still produces exactly len(chars) values.

range() returns a lazy range object rather than a list, so it is wrapped in list() to get the actual list from 0 to 79.
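A short check of these points, assuming a vocabulary of 80 characters as in the output above (the variable names are illustrative only):

```python
# Sketch: range() is lazy and excludes its stop value.
n_chars = 80  # same count as len(chars) in the output above

token_range = range(0, n_chars)
print(token_range)          # a range object, not a list

token_list = list(token_range)
print(token_list[0])        # first token
print(token_list[-1])       # last token: n_chars - 1
print(len(token_list))      # number of tokens: n_chars
```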

Next, create 2 dictionaries, one for the encoder and one for the decoder. Note that zip() pairs up the elements of the 2 equal-length lists and yields them as tuples. dict() then builds a dictionary from those (key, value) pairs.

In [6]:
# Create a dictionary with characters as key, token as values. For encoder.
string_to_int = dict(zip(chars, tokens))

# Create a dictionary with tokens as keys, characters as values. For decoder.
int_to_string = dict(zip(tokens, chars))
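To make the pairing visible, here is a small sketch on three made-up letters. The names `letters`, `ids`, and the resulting dictionaries are hypothetical and exist only for this example.

```python
# Sketch: what zip() produces and how dict() consumes it.
letters = ["a", "b", "c"]
ids = [0, 1, 2]

# zip() yields one tuple per position; list() makes the pairs visible.
pairs = list(zip(letters, ids))
print(pairs)

# dict() accepts any iterable of (key, value) pairs.
letter_to_id = dict(zip(letters, ids))
id_to_letter = dict(zip(ids, letters))
print(letter_to_id)
print(id_to_letter)

# The same mapping built with enumerate(), as in the course one-liner.
letter_to_id_enum = {ch: i for i, ch in enumerate(letters)}
print(letter_to_id == letter_to_id_enum)
```

Swapping the argument order of zip() is all it takes to get the reverse lookup table.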

Next, the encoder function takes user input and returns the encoded string as a list in which each element is the integer corresponding to a character.

In [9]:
# code in encoder()
phrase = list(input("Provide a phrase to encode into token integers\n: >>>"))
encoded = []
for character in phrase:
    encoded.append(string_to_int[character])
print("This is the encoded value as a list of integer tokens: ")
print(encoded)

Provide a phrase to encode into token integers
: >>> ENCODE


This is the encoded value as a list of integer tokens: 
[29, 38, 27, 39, 28, 29]
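Because the lookup is a plain dictionary access, a character that never appears in the book (for example "@", which is absent from the list above) would raise a KeyError. One possible way to report this more clearly is sketched below; `encode_known` and `small_vocab` are hypothetical names and this helper is not part of the notebook.

```python
# Sketch: encode text, naming any characters that are missing from the vocabulary.
small_vocab = {"a": 0, "b": 1, "c": 2}  # made-up vocabulary for illustration


def encode_known(text, vocab):
    """Return the token list for text, or raise ValueError listing unknown characters."""
    missing = sorted(set(text) - set(vocab))
    if missing:
        raise ValueError(f"characters not in vocabulary: {missing}")
    return [vocab[ch] for ch in text]


print(encode_known("cab", small_vocab))

try:
    encode_known("cat", small_vocab)
except ValueError as err:
    print(err)
```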


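The output of encoder() is then captured through return and passed to decoder() as an argument.

decoder() looks up each token in int_to_string and collects the characters in the list decoded. The .join() method then combines every element of that list into a single string with a space separating each element, which is why the output reads E N C O D E; joining with an empty string "" instead would reproduce the input exactly.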

In [10]:
decoded = []
for token in encoded:  # "token" avoids shadowing the built-in int()
    decoded.append(int_to_string[token])
decoded = " ".join(decoded)
print("This is the original text:")
print(decoded)

This is the original text:
E N C O D E
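The effect of the separator passed to .join() can be seen side by side in this small sketch. The list literal stands in for the decoded characters from the cell above, and the variable names are illustrative.

```python
# Sketch: the string before .join() is the separator placed between elements.
decoded_chars = ["E", "N", "C", "O", "D", "E"]

spaced = " ".join(decoded_chars)  # one space between every character
exact = "".join(decoded_chars)    # no separator: the original phrase

print(spaced)
print(exact)
```

With the empty-string separator, decoding the output of the encoder returns exactly the phrase that was typed in.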


## Code Refactoring