# Introduction

This notebook develops a character-based LLM trained on a corpus of [Shakespeare's writing](https://github.com/karpathy/ng-video-lecture/blob/master/input.txt).

**Resources**
- [Reference tutorial from Andrej Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY)

In [1]:
# Import standard libraries
import os
from pathlib import Path

# Import third-party libraries
import tiktoken
import torch

# Read Data

In [2]:
# Define local data file path
train_data_file_path = Path(os.path.abspath('')).parents[1] / 'data' / 'character_based_llm_train_data.txt'

In [3]:
# Read data
with open(train_data_file_path, 'r', encoding='utf-8') as train_data_file:
    train_data = train_data_file.read()

In [4]:
# Define the vocabulary of characters in the train data
train_vocaulary = sorted(list(set(train_data)))
train_vocaulary_size = len(train_vocaulary)

# Tokenizer
Tokenization is a data preprocessing step that converts each unit of a sequence 
(a character, or a word or sub-word token) into an integer, based on the set of all possible units in the train vocabulary.

## Custom Tokenizer

In [5]:
# String to integer encoder
string_integer_encoder = {character: integer for integer, character in enumerate(train_vocaulary)}

In [6]:
# Integer to string decoder
string_integer_decoder = {integer: character for integer, character in enumerate(train_vocaulary)}

In [7]:
# Define the encoder
encoder = lambda string: [string_integer_encoder[character] for character in string]

In [8]:
# Define the decoder
decoder = lambda integers_list: ''.join([string_integer_decoder[integer] for integer in integers_list])

In [9]:
# Define a sample sentence
tokeniser_sample_sentence = 'Hello there'

In [10]:
print('Example of Encoding and Decoding')
print('Example sentence: {}'.format(tokeniser_sample_sentence))
print('Encode: {}'.format(encoder(tokeniser_sample_sentence)))
print('Decode: {}'.format(decoder(encoder(tokeniser_sample_sentence))))

Example of Encoding and Decoding
Example sentence: Hello there
Encode: [20, 43, 50, 50, 53, 1, 58, 46, 43, 56, 43]
Decode: Hello there
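
A character-level lookup like the one above only covers characters that appear in the training text. The sketch below illustrates this on a hypothetical toy corpus (`toy_text`, `toy_encode`, and `toy_decode` are illustrative names, not part of the notebook): the round trip is lossless for in-vocabulary strings, while an unseen character has no integer id and raises a `KeyError`.

```python
# Toy corpus standing in for the training text (assumption for illustration)
toy_text = 'hello there'

# Build the character vocabulary and the two lookup tables
toy_vocabulary = sorted(set(toy_text))
toy_char_to_int = {character: integer for integer, character in enumerate(toy_vocabulary)}
toy_int_to_char = {integer: character for integer, character in enumerate(toy_vocabulary)}


def toy_encode(string):
    """Map each character to its integer id."""
    return [toy_char_to_int[character] for character in string]


def toy_decode(integers):
    """Map each integer id back to its character."""
    return ''.join(toy_int_to_char[integer] for integer in integers)


# The round trip is lossless for characters present in the vocabulary
round_trip = toy_decode(toy_encode('hello'))
print(round_trip)

# A character outside the vocabulary has no id, so the lookup fails
try:
    toy_encode('hello!')
    unseen_character_failed = False
except KeyError as error:
    unseen_character_failed = True
    print('Unknown character: {}'.format(error))
```

If inputs outside the training text are expected, one possible design choice is to reserve an extra id for unknown characters instead of letting the lookup fail.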


## TikToken

Ready-made tokenizers are also available, such as [TikToken](https://github.com/openai/tiktoken) from OpenAI. The goal is the same, to produce a numerical representation of a string sequence, but TikToken is built on a different vocabulary and splits the sequence differently: it maps sub-word chunks, rather than single characters, to integers.

In [11]:
# Get the Tokenizer
tiktoken_tokenizer = tiktoken.get_encoding('gpt2')

In [12]:
print('Vocabulary size of TikToken: {}'.format(tiktoken_tokenizer.n_vocab))
print('Vocabulary size of Custom Tokenizer: {}'.format(train_vocaulary_size))

Vocabulary size of TikToken: 50257
Vocabulary size of Custom Tokenizer: 65


In [13]:
print('Example of Encoding and Decoding')
print('Example sentence: {}'.format(tokeniser_sample_sentence))
print('Encode: {}'.format(tiktoken_tokenizer.encode(tokeniser_sample_sentence)))
print('Decode: {}'.format(tiktoken_tokenizer.decode(tiktoken_tokenizer.encode(tokeniser_sample_sentence))))

Example of Encoding and Decoding
Example sentence: Hello there
Encode: [15496, 612]
Decode: Hello there
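
For the same sentence, the character tokenizer produced 11 integers and TikToken produced 2, while the vocabulary grew from 65 entries to 50257. The sketch below illustrates the sequence-length side of that trade-off with a toy whitespace-split word tokenizer. This is a deliberate simplification and not the byte-pair encoding that TikToken uses; `toy_corpus` and the other names are hypothetical. The toy corpus is too small to show the vocabulary growth, so it only demonstrates how coarser units shorten the encoded sequence.

```python
# Toy corpus and sample sentence (assumptions for illustration only)
toy_corpus = 'to be or not to be that is the question'
sample = 'to be or not'

# Character-level: one integer id per character
char_vocabulary = sorted(set(toy_corpus))
char_to_int = {character: integer for integer, character in enumerate(char_vocabulary)}
char_tokens = [char_to_int[character] for character in sample]

# Word-level: one integer id per whitespace-separated word
word_vocabulary = sorted(set(toy_corpus.split()))
word_to_int = {word: integer for integer, word in enumerate(word_vocabulary)}
word_tokens = [word_to_int[word] for word in sample.split()]

print('Character tokens: {}'.format(len(char_tokens)))
print('Word tokens: {}'.format(len(word_tokens)))
```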


## Tokenize the Train Vocabulary

In [None]:
# Tokenize the train_vocaulary and store it in a PyTorch Tensor
train_vocaulary_encoded_tensor = torch.tensor(encoder(train_vocaulary), dtype=torch.long)