# Baby Name Generator Using Character-Level Neural Network Language Model 

## Introduction

This scary-looking figure comes from a [2003 paper](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) that uses a multilayer perceptron (MLP) for language modeling. It shows a neural network that takes a sequence of words as input and outputs a probability distribution over the next word in the sequence. I will put a twist on this: instead, I will create a language model (LM) that takes a sequence of characters as input and outputs a probability distribution over the next character. This will allow us to generate baby names!

## Imports

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F

## Creating the Dataset 

I will be using a relatively small dataset of roughly 32,000 baby names. Since the text is, well, non-numeric, we will have to numericalize our data in some way.

Since we're working on a character-level language model, we will create an encoding for each individual character, so the *vocabulary* is tiny: the 26 lowercase letters of the English alphabet plus one special character, for 27 entries in total.

We will need to maintain a record of *indices*: these will help us convert from characters to numbers and vice versa. Note that we also include a special character `.` at index 0. It marks the end of a name (the model must be able to predict when to stop emitting characters), and the same index pads the context at the start of a name.

In [5]:
with open("names.txt") as f:
    words = f.read().splitlines()

print(words[:10])
print(len(words))
print((words[-10:]))
# print([i for i in words if "x" in i])

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia', 'harper', 'evelyn']
32033
['zuber', 'zubeyr', 'zyell', 'zyheem', 'zykeem', 'zylas', 'zyran', 'zyrie', 'zyron', 'zzyzx']


In [6]:
chars = set(''.join(words)) # set of all characters in the corpus
chars = sorted(list(chars)) # sorted list of all characters

# creating a dictionary of char to index and index to char
# Note that . is a special character that should be assigned index 0
# The characters should start from 1
stoi = {char: idx for idx, char in enumerate(chars, start=1)}
itos = {idx: char for idx, char in enumerate(chars, start=1)}

stoi['.'] = 0 # special character for EOS or SOS
itos[0] = '.'


print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
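
To make the mapping concrete, here is a minimal sketch of encoding a name into indices and decoding it back. It rebuilds `stoi` and `itos` from the lowercase alphabet instead of from `names.txt` so that it runs on its own, which assumes the corpus contains exactly the letters a-z, as the printed `itos` suggests. The helper names `encode` and `decode` are illustrative and are not used elsewhere in this notebook.

```python
import string

# Assumption: the corpus uses exactly the 26 lowercase letters a-z.
chars = sorted(string.ascii_lowercase)
stoi = {ch: idx for idx, ch in enumerate(chars, start=1)}
stoi['.'] = 0  # special start/end marker
itos = {idx: ch for ch, idx in stoi.items()}

def encode(name):
    """Turn a name into a list of integer indices, ending with the '.' marker."""
    return [stoi[ch] for ch in name + '.']

def decode(indices):
    """Turn a list of indices back into a string."""
    return ''.join(itos[i] for i in indices)

encoded = encode("emma")
print(encoded)          # [5, 13, 13, 1, 0]
print(decode(encoded))  # emma.
```

Building `itos` by inverting `stoi` keeps the two tables in sync by construction.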


Our dataset will be set up in the following way:

- The input to the model will be a fixed-length sequence of characters, which we represent as integers for now. This is the **context vector**.

- The ground truth will be the next character in the sequence, also represented as an integer.

For example, take the name `olivia`. If we set the context length (`ctx_len`) equal to $3$, then we can create multiple samples from this name:
- `...` $\rightarrow$ `o`
- `..o` $\rightarrow$ `l`
- `.ol` $\rightarrow$ `i`
- `oli` $\rightarrow$ `v`
- `liv` $\rightarrow$ `i`
- `ivi` $\rightarrow$ `a`
- `via` $\rightarrow$ `.`

In [7]:
ctx_len = 3
X, Y = [], []

# Loop through each of the words
for word in words:

    # vector of zeros of length ctx_len
    ctx = [0] * ctx_len
    
    word = word + '.' # add a . at the end of the word
    
    # Looping through each character in the word
    for char in word:

        # Convert the character to an index
        char_idx = stoi[char]

        #  Append the context and the index to the X and Y lists respectively
        X.append(ctx)
        Y.append(char_idx)

        # Update the context by removing the first character and adding the current character
        ctx = ctx[1:] + [char_idx]

# Convert to tensors
X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape) # should be (228146,3), (228146)

torch.Size([228146, 3]) torch.Size([228146])
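
As a sanity check on that shape: every name of length $n$ contributes $n + 1$ samples (one per character, plus one for the closing `.`), regardless of `ctx_len`. The sketch below counts samples for the first three names in the corpus; `count_samples` is a hypothetical helper introduced only for this illustration.

```python
def count_samples(words):
    """Each name of length n yields n + 1 (context, target) pairs:
    one per character plus one for the closing '.'."""
    return sum(len(word) + 1 for word in words)

sample_words = ["emma", "olivia", "ava"]  # first three names in the corpus
print(count_samples(sample_words))  # 16  (5 + 7 + 4)
```

Assuming `words` is the full list loaded earlier, `count_samples(words)` should match `X.shape[0]`.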


In [17]:
for i in range(12):

    context_indices = X[i]
    target_index = Y[i]
    # Map each index back to its character
    context_chs = [itos[idx.item()] for idx in context_indices]
    context_str = ''.join(context_chs)
    target_ch = itos[target_index.item()]
    print(context_str, "->", target_ch)

... -> e
..e -> m
.em -> m
emm -> a
mma -> .
... -> o
..o -> l
.ol -> i
oli -> v
liv -> i
ivi -> a
via -> .


## Creating and Training a Model 

We will be using a simple multilayer perceptron to create our language model. The model will take in a sequence of characters and output one score (logit) per character in the vocabulary; applying softmax turns these scores into a probability distribution over the next character. We will be using an embedding layer, `nn.Embedding`, to learn a representation for each character in our vocabulary, one that is suited to the task at hand.

You can read more about embeddings [here](https://en.wikipedia.org/wiki/Word_embedding) for starters, but for now, think of an embedding as a way to *project* some identifier (whether that be an integer or a one-hot vector for a character index) into *some* vector space. This vector space is learned by the model and is optimized for the task at hand. If we set the size of the embedding to $d$, then the embedding layer outputs a vector of size $d$ for each character.

Put simply, for the task at hand, Embeddings let us turn characters into learnable vectors.
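
One way to see the "projection" view is to check that an embedding lookup gives the same result as multiplying a one-hot vector by the embedding's weight matrix. The sketch below is a small standalone check, not part of the model; the indices `5, 13, 13` correspond to `e`, `m`, `m` under the mapping built earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim = 27, 2
embedding = nn.Embedding(vocab_size, emb_dim)

idx = torch.tensor([5, 13, 13])  # 'e', 'm', 'm'

# Direct lookup: picks rows 5, 13, 13 of the weight matrix
looked_up = embedding(idx)

# Equivalent projection: one-hot rows times the same weight matrix
one_hot = F.one_hot(idx, num_classes=vocab_size).float()
projected = one_hot @ embedding.weight

print(looked_up.shape)                       # torch.Size([3, 2])
print(torch.allclose(looked_up, projected))  # True
```

The lookup is simply the cheaper way of computing the same thing, since it skips the multiplication by all the zeros in the one-hot vector.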

In [18]:
# Create the language model and perform a forward pass 
class MLPLM(nn.Module):
    def __init__(self, vocab_size=27, emb_dim=2, ctx_len=3):
        super().__init__()

        # Create an embedding layer to extract better features (vocab_size -> emb_dim)
        self.embedding = nn.Embedding(vocab_size, emb_dim)

        # Create a regular neural network (emb_dim*ctx_len -> ... -> vocab_size)
        self.fc = nn.Sequential(
            nn.Linear(emb_dim * ctx_len, 200),
            nn.ReLU(),
            nn.Linear(200, vocab_size),
        )
    
    def forward(self, x):

        # Pass the input through the embedding layer
        x = self.embedding(x)

        # Flatten the input (batch_size, ctx_len, emb_dim) -> (batch_size, ctx_len*emb_dim)
        x = x.view(x.size(0), -1)

        # Pass the flattened input through the feedforward network
        return self.fc(x)

model = MLPLM()

# Dummy forward pass
out = model(X)
print(out.shape) # should be (228146, 27)

torch.Size([228146, 27])
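
To see what happens to the shapes inside `forward`, and how small this network is, the sketch below rebuilds the same layers outside the class and pushes a tiny dummy batch through them. The batch of four all-zero contexts and the variable names are chosen only for illustration.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, ctx_len, hidden = 27, 2, 3, 200

embedding = nn.Embedding(vocab_size, emb_dim)
fc = nn.Sequential(
    nn.Linear(emb_dim * ctx_len, hidden),
    nn.ReLU(),
    nn.Linear(hidden, vocab_size),
)

batch = torch.zeros(4, ctx_len, dtype=torch.long)  # four all-padding contexts
emb = embedding(batch)             # (batch, ctx_len, emb_dim)
flat = emb.view(emb.size(0), -1)   # (batch, ctx_len * emb_dim)
logits = fc(flat)                  # (batch, vocab_size)
print(emb.shape, flat.shape, logits.shape)
# torch.Size([4, 3, 2]) torch.Size([4, 6]) torch.Size([4, 27])

# Parameters: 27*2 (embedding) + (6*200 + 200) + (200*27 + 27)
n_params = sum(p.numel() for p in list(embedding.parameters()) + list(fc.parameters()))
print(n_params)  # 6881
```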


In [None]:
# Training the model 

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def fit(model, X, Y, optimizer, loss_fn, epochs=200):
    history = {"train_loss": []}

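    for epoch in range(epochs):

        # Full-batch training: every step uses the entire dataset
        logits = model(X)
        loss = loss_fn(logits, Y)

        # Reset gradients, backpropagate, and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss = loss.item()
        history["train_loss"].append(train_loss)

        # Log roughly ten times per run; max(1, ...) avoids a modulo by zero when epochs < 10
        if epoch % max(1, epochs // 10) == 0 or epoch == epochs - 1:
            print(f"Epoch {epoch} | Train loss: {train_loss:.4f}")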

    return history

history = fit(model, X, Y, optimizer, loss_fn, epochs=500)

Epoch 0 | Train loss: 3.3127
Epoch 50 | Train loss: 2.4753
Epoch 100 | Train loss: 2.3380
Epoch 150 | Train loss: 2.2843
Epoch 200 | Train loss: 2.2506
Epoch 250 | Train loss: 2.2308
Epoch 300 | Train loss: 2.2191
Epoch 350 | Train loss: 2.2059
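
A useful reference point for these numbers is the loss of a model that guesses uniformly over all 27 characters: its cross-entropy is $-\ln(1/27) = \ln 27$. The short calculation below shows that value; the first logged loss of 3.3127 sits close to it, which suggests the freshly initialized network starts out near a uniform guess.

```python
import math

vocab_size = 27

# Cross-entropy of a uniform guess over the vocabulary: -log(1/27) = log(27)
uniform_loss = math.log(vocab_size)
print(f"{uniform_loss:.4f}")  # 3.2958
```

Anything below this value means the model has learned something about which characters tend to follow which.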


## Generating Names 

The algorithm I will be using is simple:

1. Start with a context of all zeros (from training, the model has learned that this means the start of a name)

2. Feed this context into the model to get a probability distribution over the next character

3. Sample a character from this distribution

4. Slide the context forward by dropping its oldest entry and appending the sampled character, then repeat steps 2 and 3 until we sample the `.` character

In [66]:
num_names = 20

for i in range(num_names):

    # Start from an all-padding context (index 0 is '.')
    ctx = [0] * ctx_len
    gen_name = ''

    # Infinite loop till we get a . character
    while True:

        # Convert the context to a tensor
        ctx_tensor = torch.tensor(ctx).unsqueeze(0)

        # Pass the context through the model (no gradients are needed for sampling)
        with torch.no_grad():
            logits = model(ctx_tensor)

        # Get the probabilities by applying softmax
        probas = torch.softmax(logits, dim=1)
        
        # Sample from the distribution to get the next character (use torch.multinomial)
        idx = torch.multinomial(probas, 1)

        # Convert the index to a character
        char = itos[idx.item()]

        # Append the character to the generated name
        gen_name += char

        # Update the context
        ctx = ctx[1:] + [idx.item()]

        # Break if we get a . character
        if char == '.':  
            break
    
    # Print the generated name!
    print(gen_name)

fendi.
ille.
mohaneilan.
betleelandrayya.
avcleiel.
mayza.
ade.
yre.
nyson.
gherynd.
abtus.
lua.
isabdon.
sunad.
jahimpkaishiel.
mirzin.
darielanizae.
wan.
eeta.
merdren.
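
If you want to reuse the sampling loop, one option is to wrap it in a function. The sketch below is one possible variant, not the cell used above: it adds a `max_len` cap so a model that never emits `.` cannot loop forever, accepts an optional `torch.Generator` for reproducible sampling, and drops the trailing `.` from the result. `TinyLM` is an untrained stand-in with the same architecture so that the block runs on its own (its output will be gibberish); in the notebook you would pass the trained `model` instead. The names `generate_name` and `TinyLM` are hypothetical.

```python
import string
import torch
import torch.nn as nn

itos = {idx: ch for idx, ch in enumerate(string.ascii_lowercase, start=1)}
itos[0] = '.'

class TinyLM(nn.Module):
    """Untrained stand-in with the same layout as the notebook's model."""
    def __init__(self, vocab_size=27, emb_dim=2, ctx_len=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.fc = nn.Sequential(
            nn.Linear(emb_dim * ctx_len, 200), nn.ReLU(), nn.Linear(200, vocab_size)
        )

    def forward(self, x):
        x = self.embedding(x)
        return self.fc(x.view(x.size(0), -1))

@torch.no_grad()
def generate_name(model, itos, ctx_len=3, max_len=30, generator=None):
    """Sample one name, stopping at '.' or after max_len characters."""
    ctx = [0] * ctx_len
    chars = []
    for _ in range(max_len):
        logits = model(torch.tensor([ctx]))    # shape (1, vocab_size)
        probas = torch.softmax(logits, dim=1)
        idx = torch.multinomial(probas, num_samples=1, generator=generator).item()
        if idx == 0:                           # sampled the end marker
            break
        chars.append(itos[idx])
        ctx = ctx[1:] + [idx]                  # slide the context window
    return ''.join(chars)

torch.manual_seed(0)                           # fixes the stand-in's random weights
stand_in = TinyLM()
g = torch.Generator().manual_seed(0)           # fixes the sampling sequence
names = [generate_name(stand_in, itos, generator=g) for _ in range(5)]
print(names)
```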


## Fin.