# Tokenizer

## Vocabulary

First, we read a book file to build our vocabulary, which will later be used to train the model.

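From this vocabulary we build a tokenizer that works in two steps:

- encode text into a list of integers, where each integer is the index of a character in the sorted set of unique characters,

- and decode a list of integers back into the original text.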

This tokenizer works at the character level: for each prompt, it encodes every character individually. This is not the most efficient approach, since it produces one token per character, but we will stay at the character level to keep the exercise simple.

In [None]:
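# Change the file name if you want to try on the entire book content
file_name = "./data/journey_to_the_center_of_the_earth.txt"

# Fetch book content (the context manager closes the file afterwards)
with open(file_name, encoding="utf8") as fd:
    file_content = fd.read()

# Vocabulary: every unique character of the book, sorted
vocab = sorted(set(file_content))

# Build both lookup tables once instead of rebuilding them on every call
string_to_int = { char:i for i, char in enumerate(vocab) }
int_to_string = { i:char for i, char in enumerate(vocab) }

def encode(chars):
    # Map each character to its index in the vocabulary
    return [ string_to_int[c] for c in chars ]

def decode(integers):
    # Map each index back to its character and join the result
    return ''.join(int_to_string[i] for i in integers)

print(''.join(vocab))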




 !"'()*+,-./0123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyz£﻿


Here is an example of the tokenizer in use:

In [25]:

# Encode only the first 70 characters of the book
data = encode(file_content[:70])

print("Encoded data:", data)
print("Decoded data:", decode(data))

Encoded data: [83, 30, 35, 28, 43, 47, 32, 45, 1, 14, 0, 0, 40, 52, 1, 48, 41, 30, 39, 32, 1, 40, 28, 38, 32, 46, 1, 28, 1, 34, 45, 32, 28, 47, 1, 31, 36, 46, 30, 42, 49, 32, 45, 52, 0, 0, 0, 39, 70, 70, 66, 64, 69, 62, 1, 57, 56, 58, 66, 1, 75, 70, 1, 56, 67, 67, 1, 75, 63, 56]
Decoded data: ﻿CHAPTER 1

MY UNCLE MAKES A GREAT DISCOVERY


Looking back to all tha
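
The two steps can also be checked on a tiny input. The following is a minimal sketch, assuming a toy string in place of the book; the names `sample_text`, `char_to_id`, `id_to_char`, `encode_sample` and `decode_sample` are hypothetical and only used here. It shows that decoding an encoded text returns the original text, and that a character missing from the vocabulary cannot be encoded.

```python
# Toy corpus standing in for the book content (an assumption for this sketch)
sample_text = "hello world"

# Same vocabulary construction as above: sorted unique characters
sample_vocab = sorted(set(sample_text))
char_to_id = {ch: i for i, ch in enumerate(sample_vocab)}
id_to_char = {i: ch for i, ch in enumerate(sample_vocab)}

def encode_sample(text):
    # One integer per character
    return [char_to_id[ch] for ch in text]

def decode_sample(ids):
    # One character per integer, joined back into a string
    return "".join(id_to_char[i] for i in ids)

ids = encode_sample("hello")
print(ids)                 # [3, 2, 4, 4, 5]
print(decode_sample(ids))  # hello

# Round trip: decoding the encoded text gives the original text back
round_trip_ok = decode_sample(encode_sample(sample_text)) == sample_text
print(round_trip_ok)       # True

# A character that is not in the vocabulary raises a KeyError
try:
    encode_sample("hello!")
    unknown_char_rejected = False
except KeyError:
    unknown_char_rejected = True
print(unknown_char_rejected)  # True
```

Because the vocabulary is built from the text itself, any prompt containing a character that never appears in the book would fail in the same way.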


# Training and validation

We split the data into a training set and a validation set, because we do not want the model to copy the content of the data exactly, but to produce content that is similar to it.

For the same reason, we never feed the entire data into the transformer at once, only chunks of it.

In [4]:
percent = 90 # Cut data into 90% training / 10% validation
n = int(len(data) * (percent/100))

train_data = data[:n]
validation_data = data[n:]

# Number of tokens in one chunk
block_size = 8
train_data[:block_size]

[83, 30, 35, 28, 43, 47, 32, 45]
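
One possible way to draw such chunks is to pick random start positions in the training data. This is a minimal sketch, not necessarily how the rest of this notebook does it: the helper `sample_chunks`, its parameters and the short `tokens` list (the first tokens of the encoded output shown earlier) are assumptions made for illustration.

```python
import random

def sample_chunks(tokens, block_size, num_chunks, seed=0):
    """Return `num_chunks` random slices of `block_size` consecutive tokens."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    chunks = []
    for _ in range(num_chunks):
        # The highest valid start index is len(tokens) - block_size
        start = rng.randrange(len(tokens) - block_size + 1)
        chunks.append(tokens[start:start + block_size])
    return chunks

# Stand-in for the encoded training data (hypothetical short sample)
tokens = [83, 30, 35, 28, 43, 47, 32, 45, 1, 14, 0, 0, 40, 52, 1, 48]

chunks = sample_chunks(tokens, block_size=8, num_chunks=4)
for chunk in chunks:
    print(chunk)
```

Each chunk is a contiguous slice of the data, so the order of the characters inside a chunk is preserved even though the chunks themselves are drawn at random.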

## Context and target

In [6]:
percent = 90 # Cut data into 90% training / 10% validation
n = int(len(file_content) * (percent/100))

train_data = file_content[:n+1]
validation_data = file_content[n:]

block_size = 20

# yChunk is xChunk shifted by one character:
# yChunk[i] is the character that follows xChunk[:i+1]
xChunk = train_data[:block_size]
yChunk = train_data[1:block_size+1]

print("~~~~~~~~~~~~~~~~~~~")
print("x chunk:", xChunk)
print("y chunk", yChunk)
print("~~~~~~~~~~~~~~~~~~~")

for i in range(block_size):
    # The context grows by one character at each step,
    # and the target is the character that comes right after it
    context = xChunk[:i+1]
    target = yChunk[i]

    print("-------")
    print("Step", i)
    print("["+context+"], ["+target+"]")

~~~~~~~~~~~~~~~~~~~
x chunk: ﻿CHAPTER 1

MY UNCLE
y chunk CHAPTER 1

MY UNCLE 
~~~~~~~~~~~~~~~~~~~
-------
Step 0
[﻿], [C]
-------
Step 1
[﻿C], [H]
-------
Step 2
[﻿CH], [A]
-------
Step 3
[﻿CHA], [P]
-------
Step 4
[﻿CHAP], [T]
-------
Step 5
[﻿CHAPT], [E]
-------
Step 6
[﻿CHAPTE], [R]
-------
Step 7
[﻿CHAPTER], [ ]
-------
Step 8
[﻿CHAPTER ], [1]
-------
Step 9
[﻿CHAPTER 1], [
]
-------
Step 10
[﻿CHAPTER 1
], [
]
-------
Step 11
[﻿CHAPTER 1

], [M]
-------
Step 12
[﻿CHAPTER 1

M], [Y]
-------
Step 13
[﻿CHAPTER 1

MY], [ ]
-------
Step 14
[﻿CHAPTER 1

MY ], [U]
-------
Step 15
[﻿CHAPTER 1

MY U], [N]
-------
Step 16
[﻿CHAPTER 1

MY UN], [C]
-------
Step 17
[﻿CHAPTER 1

MY UNC], [L]
-------
Step 18
[﻿CHAPTER 1

MY UNCL], [E]
-------
Step 19
[﻿CHAPTER 1

MY UNCLE], [ ]
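
The output above works on raw characters. The same pairs can be built on encoded integers, which is closer to what a model would receive. This is a minimal sketch under that assumption: the helper `context_target_pairs` is hypothetical, and `tokens` reuses the first integers of the encoded output shown earlier.

```python
def context_target_pairs(tokens, block_size):
    """Return the (context, target) pairs contained in one chunk."""
    x_chunk = tokens[:block_size]
    # Same chunk shifted by one position: y_chunk[i] follows x_chunk[:i + 1]
    y_chunk = tokens[1:block_size + 1]
    return [(x_chunk[:i + 1], y_chunk[i]) for i in range(len(y_chunk))]

# First encoded tokens of the book, taken from the output shown earlier
tokens = [83, 30, 35, 28, 43, 47, 32, 45, 1]

pairs = context_target_pairs(tokens, block_size=4)
for context, target in pairs:
    print(context, "->", target)
# [83] -> 30
# [83, 30] -> 35
# [83, 30, 35] -> 28
# [83, 30, 35, 28] -> 43
```

A single chunk of `block_size` tokens therefore yields `block_size` training examples, with context lengths ranging from 1 to `block_size`.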
