## Theoretical Foundations of Negative Log-Likelihood (NLL)

### What is Negative Log-Likelihood (NLL)?

NLL is a loss function commonly used in machine learning and statistics. It is essentially the negative logarithm of the likelihood function. The likelihood function measures how well your model explains the observed data. In the context of NLG models, the likelihood function often represents the probability of generating a particular sequence of words given a certain context.

### Why is NLL used as a loss function?

NLL is closely related to Maximum Likelihood Estimation (MLE). While MLE aims to maximize the likelihood function, minimizing NLL is equivalent to maximizing the likelihood. This is particularly useful in NLG models where you want to maximize the probability of generating the most likely sequence of words.


## Mathematical Equations for NLL

The NLL loss function is defined as:

$[
\text{NLL} = -\log L(\theta; \mathbf{X}) = -\sum_{i=1}^{n} \log p(x_i; \theta)
$]

where $L(\theta; \mathbf{X})$ is the likelihood of the observed data $\mathbf{X}$ given the model parameters $\theta$.


## Practical Applications of NLL in NLG

### Role in Sequence Generation

In NLG models like GPT or LSTM-based sequence generators, NLL is used to train the model to generate sequences that are most likely to occur in the real world.

### Training Objective

During training, the model parameters are updated to minimize the NLL. This ensures that the generated text is coherent and contextually relevant.

### Softmax Activation

In the final layer of the model, a softmax activation function is often used to convert raw scores to probabilities. The NLL is then computed based on these probabilities.


## Computational Aspects of NLL

### Stochastic Gradient Descent (SGD)

SGD or its variants (like Adam) are commonly used to minimize the NLL loss function.

### Efficiency

NLL is computationally efficient to calculate and differentiate, making it suitable for large-scale NLG models.


## Clarifications and Misconceptions about NLL

NLL is essentially a specific case of cross-entropy loss when dealing with true labels that are one-hot encoded. In the context of NLG, both terms are often used interchangeably.


## Hands on example
Below is a Python code snippet that demonstrates the concept of Negative Log-Likelihood (NLL) in the context of a simple Natural Language Generation (NLG) task. In this example, we'll use a toy dataset and a basic model to predict the next word in a sequence.

- Data Preparation: We have a toy dataset where each sequence is a list of words. The last word in each sequence is what we want to predict.

- Model: We use a simple RNN model for this task. The model takes the word indices as input, embeds them into a continuous space, passes them through an RNN layer, and finally through a fully connected layer to predict the next word.

- Loss Function: We use the CrossEntropyLoss, which is a generalization of NLL loss suitable for multi-class classification.

- Training: We train the model using the Adam optimizer. The model parameters are updated to minimize the NLL.

In [26]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

In [27]:

# Toy dataset: Each row is a sequence, and the last element is the target word
data = [
    ['hello', 'how', 'are', 'you'],
    ['I', 'am', 'fine'],
    ['hello', 'I', 'am', 'Amelie']
]
word_to_idx = {'<PAD>': 0, 'hello': 1, 'how': 2, 'are': 3, 'you': 4, 'I': 5, 'am': 6, 'fine': 7, 'Amelie': 8}
vocab_size = len(word_to_idx)

# Convert words to integers
data_idx = [[word_to_idx[w] for w in seq] for seq in data]

# Prepare data for training
X_train = [torch.tensor(seq[:-1], dtype=torch.long) for seq in data_idx]
y_train = [torch.tensor(seq[-1], dtype=torch.long) for seq in data_idx]




In [28]:
# Simple RNN model
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(SimpleRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size)
        self.fc = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x):
        x = self.embedding(x)
        output, _ = self.rnn(x)
        output = self.fc(output[:, -1, :])
        return output

# Hyperparameters
embed_size = 10
hidden_size = 20
learning_rate = 0.001

# Initialize model, loss, and optimizer
model = SimpleRNN(vocab_size, embed_size, hidden_size)
criterion = nn.CrossEntropyLoss()  # NLL loss is a specific case of CrossEntropyLoss
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(100):
    for x, y in zip(X_train, y_train):
        x = x.view(-1, 1)
        output = model(x)
        
        # Repeat y to match the batch size
        y_repeated = y.repeat(output.shape[0])
        
        loss = criterion(output, y_repeated)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}')

print("Training complete.")

Epoch [10/100], Loss: 1.9169
Epoch [20/100], Loss: 1.6993
Epoch [30/100], Loss: 1.4279
Epoch [40/100], Loss: 1.1112
Epoch [50/100], Loss: 0.8632
Epoch [60/100], Loss: 0.7106
Epoch [70/100], Loss: 0.6101
Epoch [80/100], Loss: 0.5385
Epoch [90/100], Loss: 0.4850
Epoch [100/100], Loss: 0.4435
Training complete.
