# TRAINING BIGRAM MODEL - Part 1

## Code for the tokenizer

Language models operate on text that has been broken down into basic units called tokens.

For this simple bigram model, each token is a single character of the text (letters, digits, punctuation, spaces and newlines).

The tokenizer function below maps each character to an integer. These integers are the tokens used to train the bigram model.

In [3]:
def tokenizer():

    # Store the content of "Wizard of Oz" in the variable book.
    # The with block closes the file automatically.
    with open("./wizardOfOz/pg22566_edited.txt", "r", encoding="utf-8") as f:
        book = f.read()

    # Collect the unique characters that appear in the book.
    chars = sorted(set(book))

    # Assign an integer value to each character.
    tokens = []
    meaning = -1
    for i in chars:
        meaning += 1
        tokens.append(meaning)
    
    # Link each character with its integer meaning.
    mapping = dict(zip(chars, tokens))

    # Take input from the user and look up the token for each character.
    text = input("What is the phrase to convert? ")
    tokenized = []
    for char in text:
        tokenized.append(mapping[char])
    
    return tokenized

In [4]:
tokenizer()

What is the phrase to convert?  TOKENS


[44, 39, 35, 29, 38, 43]

## Code breakdown

Open the book "Dorothy and the Wizard of Oz" and store its contents in a string variable named book.

In [6]:
with open("./wizardOfOz/pg22566_edited.txt", "r", encoding="utf-8") as f:
    # Store the book in a string variable, book.
    # The with block closes the file automatically, so no f.close() is needed.
    book = f.read()

Convert book from a string into a set to remove duplicate values, so that each unique character in the book is collected exactly once.

A set has no guaranteed order, so use sorted() to get a list ordered by character code. This is why uppercase letters come before lowercase letters in the output.

In [7]:
chars = sorted(set(book))

print("These are the characters that appeared at least once in Wizard of Oz.\n")

print(chars)

These are the characters that appeared at least once in Wizard of Oz.

['\n', ' ', '!', '"', '&', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
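
The same deduplicate-then-sort step can be tried on a short string. This is a minimal sketch; `sample` is a made-up string standing in for the book, not text from it.

```python
# Hypothetical sample text, used here instead of the book.
sample = "hello, World"

# set() removes duplicates; the order of a set is not guaranteed.
unique_chars = set(sample)

# sorted() returns a list ordered by Unicode code point,
# so the space and comma come first and 'W' comes before lowercase letters.
sorted_chars = sorted(unique_chars)

print(sorted_chars)
# [' ', ',', 'W', 'd', 'e', 'h', 'l', 'o', 'r']
```

The twelve characters of the sample collapse to nine unique ones, and the ordering follows the same pattern as the book's character list.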


Assign an integer value to each of these characters.

In [8]:
tokens = []
meaning = -1
for i in chars:
    meaning += 1
    tokens.append(meaning)
print(tokens)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
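
The counter loop produces one integer per character, starting at 0. As a sketch of an alternative, the same list can likely be built with range(). The short `chars` list below is a hypothetical stand-in for the full list from the book.

```python
# Hypothetical stand-in for the sorted character list built from the book.
chars = ['\n', ' ', '!', 'A', 'a']

# Counter-based loop, mirroring the cell above.
tokens_loop = []
meaning = -1
for _ in chars:
    meaning += 1
    tokens_loop.append(meaning)

# Shorter equivalent: the integers 0 .. len(chars) - 1.
tokens_range = list(range(len(chars)))

print(tokens_loop == tokens_range)  # True
print(tokens_range)                 # [0, 1, 2, 3, 4]
```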


A dictionary, which stores key-value pairs, is well suited to mapping characters to their integer tokens.

To combine the chars and tokens lists into a dictionary, use the zip() function to pair the values together, then call dict() on the resulting zip object. (See the Basic Reminders section.)

In [20]:
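# Combine the chars and tokens lists into a zip object.
# The name pairs avoids shadowing Python's built-in map().
pairs = zip(chars, tokens)

# Visualize the zip object as a tuple (this consumes the zip object).
print(tuple(pairs))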

(('\n', 0), (' ', 1), ('!', 2), ('"', 3), ('&', 4), ("'", 5), ('(', 6), (')', 7), ('*', 8), (',', 9), ('-', 10), ('.', 11), ('0', 12), ('1', 13), ('2', 14), ('3', 15), ('4', 16), ('5', 17), ('6', 18), ('7', 19), ('8', 20), ('9', 21), (':', 22), (';', 23), ('?', 24), ('A', 25), ('B', 26), ('C', 27), ('D', 28), ('E', 29), ('F', 30), ('G', 31), ('H', 32), ('I', 33), ('J', 34), ('K', 35), ('L', 36), ('M', 37), ('N', 38), ('O', 39), ('P', 40), ('Q', 41), ('R', 42), ('S', 43), ('T', 44), ('U', 45), ('V', 46), ('W', 47), ('X', 48), ('Y', 49), ('Z', 50), ('[', 51), (']', 52), ('_', 53), ('a', 54), ('b', 55), ('c', 56), ('d', 57), ('e', 58), ('f', 59), ('g', 60), ('h', 61), ('i', 62), ('j', 63), ('k', 64), ('l', 65), ('m', 66), ('n', 67), ('o', 68), ('p', 69), ('q', 70), ('r', 71), ('s', 72), ('t', 73), ('u', 74), ('v', 75), ('w', 76), ('x', 77), ('y', 78), ('z', 79))
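
A zip object is an iterator, so it can only be walked through once. That is one reason to call zip() again when building the dictionary rather than reusing an object that has already been printed. The sketch below uses short made-up lists to show this.

```python
# Hypothetical short lists standing in for chars and tokens.
chars = ['a', 'b', 'c']
tokens = [0, 1, 2]

pairs = zip(chars, tokens)

first_pass = tuple(pairs)   # walks through every pair and uses up the iterator
second_pass = tuple(pairs)  # nothing is left to iterate over

print(first_pass)   # (('a', 0), ('b', 1), ('c', 2))
print(second_pass)  # ()

# A fresh zip object is needed to build the dictionary.
mapping = dict(zip(chars, tokens))
print(mapping)      # {'a': 0, 'b': 1, 'c': 2}
```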


In [22]:
# Convert the zip object into a dictionary.

mapping = dict(zip(chars, tokens))

print(mapping)

{'\n': 0, ' ': 1, '!': 2, '"': 3, '&': 4, "'": 5, '(': 6, ')': 7, '*': 8, ',': 9, '-': 10, '.': 11, '0': 12, '1': 13, '2': 14, '3': 15, '4': 16, '5': 17, '6': 18, '7': 19, '8': 20, '9': 21, ':': 22, ';': 23, '?': 24, 'A': 25, 'B': 26, 'C': 27, 'D': 28, 'E': 29, 'F': 30, 'G': 31, 'H': 32, 'I': 33, 'J': 34, 'K': 35, 'L': 36, 'M': 37, 'N': 38, 'O': 39, 'P': 40, 'Q': 41, 'R': 42, 'S': 43, 'T': 44, 'U': 45, 'V': 46, 'W': 47, 'X': 48, 'Y': 49, 'Z': 50, '[': 51, ']': 52, '_': 53, 'a': 54, 'b': 55, 'c': 56, 'd': 57, 'e': 58, 'f': 59, 'g': 60, 'h': 61, 'i': 62, 'j': 63, 'k': 64, 'l': 65, 'm': 66, 'n': 67, 'o': 68, 'p': 69, 'q': 70, 'r': 71, 's': 72, 't': 73, 'u': 74, 'v': 75, 'w': 76, 'x': 77, 'y': 78, 'z': 79}
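
The separate tokens list is not strictly required. As a sketch of one alternative, enumerate() yields each character together with its position, so the mapping can be built in a single step. The `chars` list here is again a hypothetical stand-in.

```python
# Hypothetical stand-in for the sorted character list built from the book.
chars = ['\n', ' ', '!', 'A', 'a']

# Two-step version: build the token list, then zip it with the characters.
tokens = list(range(len(chars)))
mapping_zip = dict(zip(chars, tokens))

# One-step version: enumerate() pairs each character with its index.
mapping_enum = {ch: idx for idx, ch in enumerate(chars)}

print(mapping_zip == mapping_enum)  # True
print(mapping_enum['A'])            # 3
```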


Then take user input. The input() function always returns a string. For each character in that string, look up its integer token in mapping and collect the results in a list.

In [28]:
text = input("What is the phrase to convert? ")
tokenized = []
for char in text:
    # Look up the current character, char, in the mapping.
    tokenized.append(mapping[char])

print(f"Your input is {text}, hence its token value according to mapping is:")
print(tokenized)

What is the phrase to convert?  TOKENS


Your input is TOKENS, hence its token value according to mapping is:
[44, 39, 35, 29, 38, 43]
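
A dictionary lookup raises a KeyError for any character that never appears in the book. The sketch below wraps the lookup in a small function that reports such characters up front. The function name `encode` and the error message are assumptions made for illustration, and the `mapping` shown is only the slice of the full dictionary needed for the word TOKENS.

```python
def encode(text, mapping):
    """Return the list of integer tokens for text, one per character."""
    # Characters that are in the input but not in the vocabulary.
    unknown = sorted(set(text) - set(mapping))
    if unknown:
        raise ValueError(f"characters not in vocabulary: {unknown}")
    return [mapping[ch] for ch in text]


# Slice of the mapping printed earlier (only the letters needed here).
mapping = {'E': 29, 'K': 35, 'N': 38, 'O': 39, 'S': 43, 'T': 44}

print(encode("TOKENS", mapping))  # [44, 39, 35, 29, 38, 43]
```

Passing the text and the mapping as arguments, instead of calling input() inside the function, makes the lookup easy to test on fixed strings.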


## Code Refactoring