# NGram BoW Deep Neural Network Experiment

Objective
--
Create a neural network to predict the next token in a sequence using the bag of words representation of the current ngram

Limitations
--
1. Character level tokenization will be used
2. Bag of words representation of an ngram will be used

In [1]:
import math
import random
from typing import Sequence

import pandas as pd
import torch
from torch import nn

## Initialize experiment parameters

In [2]:
random.seed(42)

train_size = 0.8

## Load the training data

In [3]:
dataset = pd.read_csv('../data/text/text_emotion.csv')['content']
dataset.head()

0    @tiffanylue i know  i was listenin to bad habi...
1    Layin n bed with a headache  ughhhh...waitin o...
2                  Funeral ceremony...gloomy friday...
3                 wants to hang out with friends SOON!
4    @dannycastillo We want to trade with someone w...
Name: content, dtype: object

## Preprocess the data

### Build the codec

In [4]:
END_TOKEN = "<E>"

# Build the vocabulary for this training set.
vocab = [END_TOKEN] + list(set(dataset.str.cat(sep=' ')))
vocab_size = len(vocab)
# Assign a unique id to each token in the vocabulary.
id_by_token = {token: i for (i, token) in enumerate(vocab)}
token_by_id = {id_: token for (token, id_) in id_by_token.items()}

class Codec:
    @staticmethod
    def encode(token: str) -> int:
        return id_by_token[token]

    @staticmethod    
    def encode_all(tokens: list[str]) -> int:
        return [Codec.encode(token) for token in tokens]

    @staticmethod
    def decode(encoded_token: int) -> str:
        return token_by_id[encoded_token]
    
    @staticmethod
    def decode_all(encoded_tokens: list[str]) -> int:
        return [Codec.decode(encoded_token) for encoded_token in encoded_tokens]

### Create the Tokenizer

In [5]:
class Tokenizer:
    @staticmethod
    def tokenize(document: str, is_complete: str = True) -> list[str]:
        tokens = [END_TOKEN] + list(document) 
        if is_complete:
            tokens.append(END_TOKEN)
        return tokens

### Split the data into training and test sets

In [6]:
dataset_size = len(dataset)
train_dataset = dataset[: math.floor(dataset_size * train_size)]
test_dataset = dataset[math.floor(dataset_size * train_size):]
test_dataset.reset_index(inplace=True, drop=True)

## Create the training batches

In [7]:
def get_random_document(dataset, min_tokens = None) -> tuple[list, list[str]]:
    """Returns a random document from the specified dataset, containing at least `min_tokens` tokens"""
    # Select documents at random until one with the minimum number of tokens is found.
    while True:
        document = dataset[random.randrange(0, dataset.shape[0])]
        if min_tokens is not None and len(Tokenizer.tokenize(document)) < min_tokens:
            continue
        break
    return document

def sample_tokens(document: str, num_tokens: int) -> str:
    """Returns a random sub-sequence of tokens from the specified document"""
    tokens = Tokenizer.tokenize(document)
    if len(tokens) < num_tokens:
        raise ValueError("The provided document does not contain enough tokens to return the requested amount") 
    start = random.randrange(0, len(tokens) - num_tokens + 1)
    return tokens[start: start + num_tokens]

def to_ngrams(tokens: list[str], ngram_size: int) -> list[list[str]]:
    """Returns ngrams from the specified document"""
    return [tokens[max(0, i - ngram_size): i] for i in range(1, len(tokens) + 1)]

def to_context_token_pairs(tokens: list[str], ngram_size: int) -> list[tuple[list[str], str]]:
    """Converts a token sequence to a sequence of context-token pairs.

    Each 'context-token' pair is a tuple containing a 'context', which is an n-gram of 
    length [1: `ngram_size`], and a 'token' that immediately follows that n-gram in the 
    sequence.

    Pairs are generated by iterating over the tokens in the range [1, `len(tokens)`]. The token 
    at the current index becomes the 'token' in the pair, while the preceding `ngram_size` 
    tokens serve as the 'context'.
    """
    ngrams = to_ngrams(tokens, ngram_size=ngram_size)
    ngrams = ngrams[0: len(ngrams) - 1]  # Truncate the last ngram since this wont have a trailing token 
                                         # and therefore cant form a pair.
    return [(ngrams[i], tokens[i+1]) for i in range(len(ngrams))]

def encode_context_token_pairs(pairs):
    encoded_contexts = torch.zeros(len(pairs), vocab_size)
    encoded_tokens = torch.zeros(len(pairs), dtype=torch.long)
    for i, (context, token) in enumerate(pairs):
        for t in context:
            encoded_contexts[i, Codec.encode(t)] += 1
        encoded_tokens[i] = Codec.encode(token)
    return encoded_contexts, encoded_tokens

def sample_context_token_pairs(dataset: pd.Series, num_pairs: int, ngram_size: int):
    """Returns a random sequence of context-token pairs from the specified dataset"""
    num_tokens_required = num_pairs + 1 # n context-token pairs requires n + 1 tokens
    document = get_random_document(dataset, min_tokens=num_tokens_required)
    tokens = sample_tokens(document, num_tokens=num_tokens_required)
    return to_context_token_pairs(tokens, ngram_size=ngram_size)

def sample_batch(dataset, batch_size, ngram_size, device=None):
    pairs = sample_context_token_pairs(dataset, num_pairs=batch_size, ngram_size=ngram_size)
    X, y = encode_context_token_pairs(pairs)
    if device is not None:
        X, y = X.to(device), y.to(device)
    return X, y

def sample_batch_v2(dataset, batch_size, ngram_size, device=None):
    document = get_random_document(train_dataset, min_tokens=batch_size+1)
    tokens = Tokenizer.tokenize(document)
    context_token_pairs = to_context_token_pairs(tokens, ngram_size=ngram_size)
    start = random.randint(0, len(context_token_pairs) - batch_size + 1)
    pairs = context_token_pairs[start: start + batch_size]
    X, y = encode_context_token_pairs(pairs)
    if device is not None:
        X, y = X.to(device), y.to(device)
    return X, y
    

## Define the model

For the model we will use a simple neural network, 5 layers deep as a test. Once the learning ability of this model is proven we can evaulate using a larger neural network

In [8]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using mps device


In [12]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(vocab_size, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, vocab_size)
        )
    
    def forward(self, x):
        return self.layers(x)

## Train the model

In [20]:
ngram_size = 3
batch_size = 64
num_train_batches = 20_000
num_test_batches = 6000
num_epochs = 64
print_interval = 5_000

model = Model().to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

In [21]:
for epoch_no in range(1, num_epochs+1):
    print("-------------------------------------------")
    print(f"Epoch #{epoch_no}")
    # train 
    for batch_no in range(1, num_train_batches+1):
        X, y = sample_batch_v2(train_dataset, batch_size=batch_size, ngram_size=ngram_size, device=device)
        y_pred = model(X)
        loss = loss_fn(y_pred, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if batch_no % print_interval == 0:
            print(f"  Train batch #{batch_no} -- loss: {loss:>7f} [{batch_no} / {num_train_batches}]")
    
    # evaluate
    correct = 0
    loss = 0
    for batch_no in range(num_test_batches):
        X, y = sample_batch_v2(test_dataset, batch_size=batch_size, ngram_size=ngram_size, device=device)
        with torch.no_grad():
            y_pred = model(X)
            loss += loss_fn(y_pred, y).item()
            correct += (y_pred.argmax(1) == y).type(torch.float).sum().item()
    print(f"  Evaluation -- Test Accuracy: {((correct / (num_test_batches * batch_size)) * 100):>3f}%, Avg loss[{loss / num_test_batches}]")
    print()
    

-------------------------------------------
Epoch #1
  Train batch #5000 -- loss: 3.147500 [5000 / 20000]
  Train batch #10000 -- loss: 3.357711 [10000 / 20000]
  Train batch #15000 -- loss: 3.388016 [15000 / 20000]
  Train batch #20000 -- loss: 2.940302 [20000 / 20000]
  Evaluation -- Test Accuracy: 21.369271%, Avg loss[3.023184333920479]

-------------------------------------------
Epoch #2
  Train batch #5000 -- loss: 3.052405 [5000 / 20000]
  Train batch #10000 -- loss: 3.120704 [10000 / 20000]
  Train batch #15000 -- loss: 3.017201 [15000 / 20000]
  Train batch #20000 -- loss: 2.909417 [20000 / 20000]
  Evaluation -- Test Accuracy: 25.079948%, Avg loss[2.8625915988286335]

-------------------------------------------
Epoch #3
  Train batch #5000 -- loss: 3.174700 [5000 / 20000]
  Train batch #10000 -- loss: 3.110636 [10000 / 20000]
  Train batch #15000 -- loss: 2.641510 [15000 / 20000]
  Train batch #20000 -- loss: 2.495705 [20000 / 20000]
  Evaluation -- Test Accuracy: 26.475000%,

In [23]:
def to_bag_of_words(encoded_tokens, size: int) -> torch.Tensor:
    bow = torch.zeros(size)
    for encoded_token in encoded_tokens:
        bow[encoded_token] += 1
    return bow

In [24]:
def generate_text(prompt: str = None):
    if prompt is None:
        prompt = Codec.decode(random.randint(0, vocab_size))
        
    while len(prompt) < 20:
        tokens = Tokenizer.tokenize(prompt, is_complete=False)
        ngrams = to_ngrams(tokens, ngram_size=3)
        last_ngram = ngrams[len(ngrams) - 1]
        with torch.no_grad():
            # print(f"Current ngram: {last_ngram}")
            # print(f"Current ngram as bow: {to_bag_of_words(Codec.encode_all(last_ngram), size=vocab_size)}")
            encoded_ngram = to_bag_of_words(Codec.encode_all(last_ngram), size=vocab_size).reshape(1, -1).to(device)
            # print(encoded_ngram)
            # break
            logits = model(encoded_ngram)
            proba = nn.Softmax(dim=1)(logits)[0]
            top_predictions = proba.topk(5).indices.tolist()
            # print(top_predictions)
            # break
            # print("Top 5 most likely next tokens:")
            # for i, encoding in enumerate(top_predictions):
            #     print(f'    #{i}. "{Codec.decode(encoding)}"')
            # break
            next_token = Codec.decode(torch.multinomial(proba, num_samples=1)[0].item())
            # print(f'"{next_token}" was chosen')
            prompt += next_token
    return prompt

In [31]:
for i in range(20):
    print(f"#{i+1}: {generate_text()}")

#1: SeA  just etgt os is
#2: bestr hady oughs. Wo
#3: {Tos  lafsoe nICas. 
#4: jutale asI amknay! 2
#5: 0mfwed saIdd oisng m
#6: lsee shtpe yofollml 
#7: )I'm  istley any oua
#8: Jyuit  o fro mtyy os
#9: PL@EOV HE I  rlhile 
#10: 0.@ borat thatteeh a
#11: dje untnree nse hvbi
#12: zthi mtahtsogte out 
#13: GO *OLT GOD Ladrs mn
#14: ½geols. sIe  tsos mo
#15: 5 un tmhe wdie b oyu
#16: bfeor my si dti scho
#17: zeon riged  Junate o
#18: on ein gGoerdse yabe
#19: Y_uya nsei n Beer ha
#20: Rry on methe flseena


Possible reasons for poor performance:

#### Loss of ordering due to bag of words representation
The current implementation takes an ngram, an inherently ordered structure and destroy the order by representing it as a bag of words / bag of tokens. The algorithm them is expected to 

#### Sparsity of the input data
Input is the size of the vocabulary which in our case is ~100 characters. Since the ngram size used for experiment was often less than 5, at any given time only 5 of ~100 input values would be non-zero. This could have made learning difficult (investigate this)

#### Type of neural network used / Depth & Shape of the neural network
The neural network chosen does not have an inherrent notion of ordering and works with data at a specific point in time. This combined with the loss of ordering information in the input may result in poor performance. There is also an admitted lack of experience with regards to optimizing the shape of a neural network for next character prediction. There may be a way to optimize just a regular MLP for this task but the experience is lacking.

What can I learn coming out of this?

1. Why does sparsity affect the learning ability of a neural network?
2. How does learning rate affect the learning ability of neural network? What happens when learning rate values become too large or small?
3. How does a neural network learn in the first place? Can you learn to guestimate how well a given neural network will do for a specific dataset / task
4. How does cross entropy loss work? Why does the formula use the exponent rather than simple absolution.