## **Character-Level Name Generation with Positional Embeddings and MLP**

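This notebook implements a simple multi-layer perceptron (MLP) that generates names one character at a time.

The model is trained on a dataset of names: it embeds a fixed-size window of preceding characters, with dedicated position indices filling the slots before a name begins, and predicts the next character.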

---

We start by loading names from a file, which we will use to train our model. 

Each name is split into characters, and each unique character is assigned a numerical index.

In [1]:
import random 
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

random.seed(42)

with open('markov-to-transformers/data/names.txt', 'r') as f:
    names = f.read().splitlines()
print(f"Loaded {len(names)} names.")

Loaded 32033 names.


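Each character is assigned a unique index, allowing us to map between characters and numbers. 

We also add `block_size + 1` copies of `.` before sorting. Because `.` sorts ahead of the letters, indices 0 to `block_size - 1` are reserved for the start positions of the context window, index `block_size` becomes the stop token `.`, and the letters follow.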

In [2]:
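block_size = 5 # number of previous characters the model sees

chrs = list({ch for name in names for ch in name})
chrs_pos_enc = chrs + ['.' for _ in range(block_size + 1)] # block_size start slots + 1 stop token
i_to_s = {i: ch for i, ch in enumerate(sorted(chrs_pos_enc))} # indices 0..5 all decode to '.'
s_to_i = {v: k for k, v in i_to_s.items()} # a bit hacky: duplicates collapse, so '.' -> 5 and len is 27.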
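
To make the resulting index layout concrete, the sketch below repeats the same construction on a three-name list. `toy_names` is a hypothetical stand-in, not the real dataset. Sorting puts the six `.` entries first, and inverting the dictionary keeps only the last of them, so `.` ends up at index `block_size` while indices 0 to `block_size - 1` stay free for the start positions.

```python
block_size = 5
toy_names = ["ava", "bob", "cab"]  # hypothetical stand-in for names.txt

chars = sorted({ch for name in toy_names for ch in name})  # ['a', 'b', 'c', 'o', 'v']
padded = sorted(chars + ["."] * (block_size + 1))          # the six '.' sort before the letters
i_to_s = dict(enumerate(padded))
s_to_i = {s: i for i, s in i_to_s.items()}                 # later duplicates overwrite earlier ones

print(s_to_i)
# {'.': 5, 'a': 6, 'b': 7, 'c': 8, 'o': 9, 'v': 10}
print(len(i_to_s), len(s_to_i))
# 11 6
```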

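Instead of padding the start of every name with one repeated token, we fill the initial context with the position indices 0 to `block_size - 1`, so each slot before the first character has its own embedding and the model can tell how far into a name it is.

These indices never occur as targets, so the model is never trained to predict them, and they stay separate from the stop token `.`.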

In [3]:
def build_posemb_dataset(names):
    ''' Builds a dataset with positional embedding for start characters. yay!'''
    X, Y = [], []

    for name in names:
        name = f"{name}."  # Append a stop token
        context = list(range(block_size)) # Position indices 0..block_size-1 fill the empty start slots
        for ch in name:
            X.append(context)
            Y.append(s_to_i[ch])
            context = context[1:] + [s_to_i[ch]] # Shift context window
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    return X, Y

# splitting data into train, validation, and test sets

n1 = int(len(names) * 0.8)
n2 = int(len(names) * 0.9)
random.shuffle(names)
Xtr, Ytr = build_posemb_dataset(names[:n1])
Xval, Yval = build_posemb_dataset(names[n1:n2])
Xte, Yte = build_posemb_dataset(names[n2:])
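
As an illustration of the windows this produces, here is a pure-Python sketch for the single name `emma`. The helper `windows` and the small `stoi` mapping are hypothetical. They assume the index layout described above (start positions 0 to 4, `.` at 5) and that the letters are exactly `a` to `z`, starting at 6.

```python
import string

block_size = 5
# Assumed layout: 0..4 are start positions, 5 is '.', letters a..z start at 6.
stoi = {".": block_size}
stoi.update({ch: i + block_size + 1 for i, ch in enumerate(string.ascii_lowercase)})

def windows(name):
    """Return (context, target) pairs for one name, without tensors."""
    context = list(range(block_size))
    pairs = []
    for ch in name + ".":                      # append the stop token
        pairs.append((context, stoi[ch]))
        context = context[1:] + [stoi[ch]]     # shift the window by one character
    return pairs

for ctx, target in windows("emma"):
    print(ctx, "->", target)
# [0, 1, 2, 3, 4] -> 10
# [1, 2, 3, 4, 10] -> 18
# [2, 3, 4, 10, 18] -> 18
# [3, 4, 10, 18, 18] -> 6
# [4, 10, 18, 18, 6] -> 5
```

Note that the stop index 5 only ever shows up as a target, never inside a context.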

#### Creating an Embedding Table

Instead of using one-hot vectors, we map character indices to dense embeddings, which are learned during training.

In [6]:
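feature_len = Xtr.unique().shape[0]+1 # 31 indices occur in Xtr; +1 for the stop token (index 5), which only occurs in Y
C = torch.rand((feature_len, 8)) # 8-dimensional embeddings, initialized uniformly in [0, 1)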

Xtr.shape, C.shape

(torch.Size([182625, 5]), torch.Size([32, 8]))
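
The equivalence between an embedding lookup and a one-hot matrix product can be checked directly. The sketch below uses a random stand-in `table` with the same shape as `C` and two made-up contexts. It is an illustration, not part of the training code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 32, 8                  # same sizes as C above
table = torch.rand((vocab_size, emb_dim))    # stand-in for C

idx = torch.tensor([[0, 1, 2, 3, 4],         # a start-of-name context
                    [4, 10, 18, 18, 6]])     # a mid-name context

looked_up = table[idx]                                                  # row lookup
via_one_hot = F.one_hot(idx, num_classes=vocab_size).float() @ table    # one-hot times table

print(looked_up.shape)                          # torch.Size([2, 5, 8])
print(torch.allclose(looked_up, via_one_hot))   # True
```

Indexing gives the same result while skipping the multiplication by a mostly-zero matrix.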

#### Defining the Neural Network (MLP)

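The model is a simple feedforward neural network with one hidden layer. 

The first layer maps the concatenated embeddings of the context (5 characters × 8 dimensions = 40 inputs) to 300 hidden units, and the second layer maps those to one logit per index in the embedding table.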


In [None]:
squashed_dims = C.shape[1] * Xtr.shape[1] # embedding size * block_size = 40 inputs

W1 = torch.rand((squashed_dims, 300)) # 300 neurons in the hidden layer
b1 = torch.rand(W1.shape[1])
W2 = torch.rand((W1.shape[1], feature_len))
b2 = torch.rand(feature_len)

params = [C, W1, b1, W2, b2]

loss_i = []
step_i = []

for p in params:
    p.requires_grad = True

print(f"{sum(p.nelement() for p in params)} params.")

22188 params.
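
The parameter count can be reproduced by hand from the tensor shapes. The following sketch hard-codes the sizes used in this notebook (32 indices, 8-dimensional embeddings, a context of 5, 300 hidden units). The `shapes` dictionary is only an illustration.

```python
import math

vocab_size, emb_dim, block_size, hidden = 32, 8, 5, 300

shapes = {
    "C":  (vocab_size, emb_dim),            # embedding table
    "W1": (block_size * emb_dim, hidden),   # concatenated embeddings -> hidden
    "b1": (hidden,),
    "W2": (hidden, vocab_size),             # hidden -> logits
    "b2": (vocab_size,),
}

counts = {name: math.prod(shape) for name, shape in shapes.items()}
print(counts)                 # {'C': 256, 'W1': 12000, 'b1': 300, 'W2': 9600, 'b2': 32}
print(sum(counts.values()))   # 22188
```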


#### Training Loop

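This loop trains the model with mini-batch stochastic gradient descent: each step samples 32 random examples, computes the cross-entropy loss on them, backpropagates, and moves every parameter a small step against its gradient.

Because `loss_i` and `step_i` are created in the previous cell, re-running this cell continues training from the current weights and keeps appending to both lists, which is why the step count printed below can exceed 50,000. The reported train loss is the loss on the last mini-batch only, so it is noisier than the validation loss.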

In [None]:
chunk_size = 32 # mini batch size
learning_rate = 0.001

for i in range(50000):
    # forward pass
    batch = torch.randint(0, Xtr.shape[0], (chunk_size,)) # sample batch indices
    batch_emb = C[Xtr[batch]].view(chunk_size, -1) # look up and concatenate embeddings
    acts = torch.sigmoid(batch_emb @ W1 + b1) # hidden layer with sigmoid activation
    logits = acts @ W2 + b2 # output layer
    loss = F.cross_entropy(logits, Ytr[batch]) # calculate loss

    # backward pass
    for p in params:
        p.grad = None   # zeroing the gradients
    loss.backward()     # backpropagation

    # optimization
    for p in params:
        p.data += -learning_rate * p.grad # surfing the gradient waves

    # track steps
    loss_i.append(loss.item())
    step_i.append(i)

# evaluate the model on a separate validation set to observe generalization
val_emb = C[Xval].view(-1, squashed_dims) # get embeddings for validation set
acts = torch.sigmoid(val_emb @ W1 + b1)
logits = acts @ W2 + b2
valid_loss = F.cross_entropy(logits, Yval)

# note: the train loss printed here is the loss of the last mini-batch only
print(f"trained for {len(step_i)} steps.\ntrain loss: {loss.item()}\nvalid loss: {valid_loss.item()}")

trained for 140000 steps.
train loss: 2.2013587951660156
valid loss: 2.152350902557373
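
To put loss values around 2.15 in context, it helps to know what cross-entropy measures and what an uninformed model would score. The sketch below uses made-up logits and targets (not the notebook's data) to show that `F.cross_entropy` is the mean negative log-probability of the correct class, and that uniform predictions over 32 classes give ln 32 ≈ 3.47.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 32
logits = torch.randn(4, num_classes)       # made-up logits for 4 examples
targets = torch.tensor([5, 6, 10, 31])     # made-up target indices

# Cross-entropy by hand: softmax, pick each target's probability, average -log.
probs = F.softmax(logits, dim=1)
manual = -probs[torch.arange(4), targets].log().mean()
print(torch.allclose(manual, F.cross_entropy(logits, targets)))   # True

# All-equal logits mean uniform predictions, which cost exactly ln(num_classes).
uniform = F.cross_entropy(torch.zeros(4, num_classes), targets)
print(round(uniform.item(), 4), round(math.log(num_classes), 4))  # 3.4657 3.4657
```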


Once training is complete, we evaluate the model's performance on a test set. 

The test loss provides an estimate of how well the model generalizes to unseen data.

In [17]:
test_emb = C[Xte].view(-1, squashed_dims) # get embeddings for test set
acts = torch.sigmoid(test_emb @ W1 + b1)
logits = acts @ W2 + b2
test_loss = F.cross_entropy(logits, Yte)
print(f"Test loss: {test_loss.item()}")

Test loss: 2.1484696865081787
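
The validation and test cells repeat the same forward pass, so it could be wrapped in one function. Below is a minimal sketch of such a helper. The name `split_loss` is hypothetical, and the parameters and data are random stand-ins with the notebook's shapes, so the printed value says nothing about the trained model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()  # evaluation needs no gradient bookkeeping
def split_loss(X, Y, C, W1, b1, W2, b2):
    """Cross-entropy of the MLP on one data split (hypothetical helper)."""
    emb = C[X].view(X.shape[0], -1)          # look up and concatenate embeddings
    hidden = torch.sigmoid(emb @ W1 + b1)    # hidden layer
    logits = hidden @ W2 + b2                # output layer
    return F.cross_entropy(logits, Y).item()

# Random stand-ins, only to show the call; not the trained parameters.
torch.manual_seed(0)
C = torch.rand((32, 8), requires_grad=True)
W1 = torch.rand((40, 300), requires_grad=True)
b1 = torch.rand(300, requires_grad=True)
W2 = torch.rand((300, 32), requires_grad=True)
b2 = torch.rand(32, requires_grad=True)
X = torch.randint(0, 32, (16, 5))
Y = torch.randint(0, 32, (16,))

demo_loss = split_loss(X, Y, C, W1, b1, W2, b2)
print(type(demo_loss).__name__)  # float
```

In the notebook this would be called as `split_loss(Xval, Yval, *params)` or `split_loss(Xte, Yte, *params)`.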


With the model trained and evaluated, we can sample new names.

We start from the same initial context used in training, the position indices 0 to `block_size - 1`, and iteratively sample the next character.

The process stops when the model predicts a stop token (`.`).

In [64]:
for i in range(5):
    context = list(range(block_size)) # initialize context with position indices
    word = ''
    while True:
        ctx_emb = C[context].view(-1, squashed_dims) # embed the current context
        acts = torch.sigmoid(ctx_emb @ W1 + b1)
        logits = acts @ W2 + b2
        probs = F.softmax(logits, dim=1)
        pred = torch.multinomial(probs, 1).item() # sample from probability distribution
        context = context[1:] + [pred] # shift context window
        if i_to_s[pred] == '.': # stop if end token is predicted (indices 0-5 all decode to '.')
            break
        word += i_to_s[pred] # append predicted character
    print(word)

katarud
jasxer
iliella
lyziah
merri
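
The sampled names change on every run because `torch.multinomial` draws from the global random state. One way to make sampling repeatable is to pass an explicitly seeded `torch.Generator`. The sketch below assumes the index layout used in this notebook and uses random stand-in parameters, so its output is gibberish rather than name-like. The helper `sample_name` and its `max_len` guard are hypothetical additions.

```python
import string
import torch
import torch.nn.functional as F

block_size, vocab_size = 5, 32
# Assumed layout: indices 0..5 decode to '.', letters a..z occupy 6..31.
i_to_s = {i: "." for i in range(block_size + 1)}
i_to_s.update({i + block_size + 1: ch for i, ch in enumerate(string.ascii_lowercase)})

def sample_name(params, generator, max_len=20):
    """Sample one name; max_len guards against a stop token that never comes."""
    C, W1, b1, W2, b2 = params
    context = list(range(block_size))
    word = ""
    with torch.no_grad():
        while len(word) < max_len:
            emb = C[context].view(1, -1)
            logits = torch.sigmoid(emb @ W1 + b1) @ W2 + b2
            probs = F.softmax(logits, dim=1)
            pred = torch.multinomial(probs, 1, generator=generator).item()
            if i_to_s[pred] == ".":
                break
            word += i_to_s[pred]
            context = context[1:] + [pred]
    return word

torch.manual_seed(0)
params = [torch.rand((32, 8)), torch.rand((40, 300)), torch.rand(300),
          torch.rand((300, 32)), torch.rand(32)]  # untrained stand-ins

names_a = [sample_name(params, torch.Generator().manual_seed(42)) for _ in range(3)]
names_b = [sample_name(params, torch.Generator().manual_seed(42)) for _ in range(3)]
print(names_a == names_b)  # True: the same seed gives the same samples
```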
