# Language Modeling


Whether for transcribing spoken utterances as correct word sequences or generating coherent human-like text, language models are extremely useful.

In this assignment, you will build your own language models powered by n-grams and RNNs.

In [1]:
# !unzip data.zip



### Step 2: RNN Language Model


#### Preparing the Data
The following Python code loads and processes [GloVe (Global Vectors for Word Representation) embeddings](https://nlp.stanford.edu/projects/glove/). GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Here the pretrained vectors are used to initialize the embedding layer of the RNN language model.

The `load_glove_embeddings(path)` function loads the GloVe embeddings from a file. It reads the file line by line, splits each line into a word followed by its embedding values, and stores the result in a dictionary, `embeddings_dict`, that maps each word to its vector representation.

The `create_embedding_matrix(word_to_ix, embeddings_dict, embedding_dim)` function builds an embedding matrix from the loaded GloVe embeddings. It takes a dictionary mapping words to their indices (`word_to_ix`), the dictionary of GloVe embeddings (`embeddings_dict`), and the dimension of the embeddings (`embedding_dim`). It creates a zero matrix of size `(vocab_size, embedding_dim)`, then checks each word in `word_to_ix` against `embeddings_dict`. If the word is present, its GloVe vector is written to the word's row in the embedding matrix. Otherwise, the row is filled with a random vector drawn uniformly from [0, 1) by `torch.rand`.

The `glove_path` variable holds the path to the GloVe file, and `glove_embeddings` is the dictionary returned by `load_glove_embeddings`. The `embedding_dim` variable is the dimension of the embeddings (50 for `glove.6B.50d.txt`), and `embedding_matrix` is the matrix built by `create_embedding_matrix`.
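
To make the file format and the fallback for missing words concrete, here is a minimal, torch-free sketch. The three-line sample, its values, the toy vocabulary, and the helper names `parse_glove_lines` and `coverage` are all invented for illustration and are not part of the assignment code. It parses lines of the form `word v1 v2 ...` and reports which vocabulary words would end up with a random row.

```python
import io

# Hypothetical three-line file in GloVe text format (values invented for illustration)
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "love 0.4 0.5 0.6\n"
    "story -0.7 0.8 0.9\n"
)

def parse_glove_lines(lines):
    """Map each word to its list of float components."""
    table = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        table[parts[0]] = [float(v) for v in parts[1:]]
    return table

def coverage(word_to_ix, table):
    """Fraction of vocabulary words that have a pretrained vector."""
    if not word_to_ix:
        return 0.0
    found = sum(1 for w in word_to_ix if w in table)
    return found / len(word_to_ix)

toy_glove = parse_glove_lines(sample)
toy_vocab = {"the": 0, "love": 1, "swiftie": 2, "story": 3}  # hypothetical vocabulary
missing = [w for w in toy_vocab if w not in toy_glove]
print(len(toy_glove["the"]), coverage(toy_vocab, toy_glove), missing)
```

A check like this can be worth running on the real vocabulary: the `glove.6B` vectors are uncased, so capitalized tokens and special markers would typically fall into the random-initialization branch.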

In [10]:
# Load the data
vocab, word_to_ix, ix_to_word, dataloader = loadfile("data/lyrics/taylor_swift.txt")

In [11]:
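import torch

def load_glove_embeddings(path):
    """Read a GloVe text file into a dict mapping each word to a 1-D float tensor."""
    embeddings_dict = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line holds a word followed by its space-separated vector components
            values = line.split()
            word = values[0]
            vector = torch.tensor([float(val) for val in values[1:]], dtype=torch.float)
            embeddings_dict[word] = vector
    return embeddings_dict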

# Path to the GloVe file
glove_path = 'glove.6B.50d.txt'  # Update this path
glove_embeddings = load_glove_embeddings(glove_path)

def create_embedding_matrix(word_to_ix, embeddings_dict, embedding_dim):
    vocab_size = len(word_to_ix)
    embedding_matrix = torch.zeros((vocab_size, embedding_dim))
    for word, ix in word_to_ix.items():
        if word in embeddings_dict:
            embedding_matrix[ix] = embeddings_dict[word]
        else:
            embedding_matrix[ix] = torch.rand(embedding_dim)  # Random initialization for words not in GloVe
    return embedding_matrix

# Create the embedding matrix
embedding_dim = 50
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)

#### TO DO: Defining the RNN Model

In [12]:
#######################################
# TODO: RNNLanguageModel()
#######################################

import math
import torch
import numpy as np
import torch.nn as nn
from collections import Counter
from torch.utils.data import DataLoader, Dataset

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, embedding_matrix):
        """
        RNN language model with a GRU and GloVe embeddings.
        """
        super().__init__()
        self.device = torch.device(
            "mps" if torch.backends.mps.is_available()
            else "cuda" if torch.cuda.is_available()
            else "cpu"
        )
        print(f"Using device: {self.device}")

        # Embedding layer initialized with GloVe
        # embedding_matrix: torch.Tensor [vocab_size, embedding_dim]
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        with torch.no_grad():
            self.embedding.weight.copy_(embedding_matrix)

        # Unidirectional GRU
        self.hidden_dim = hidden_dim
        self.rnn = nn.GRU(input_size=embedding_dim,
                          hidden_size=hidden_dim,
                          num_layers=1,
                          batch_first=True)

        # Final projection layer onto the vocabulary
        self.fc = nn.Linear(hidden_dim, vocab_size)

        self.to(self.device)

    def forward(self, x, hidden=None):
        """
        x: [B, T] token indices
        hidden: [1, B, H], optional
        returns: logits [B, T, V], hidden
        """
        x = x.to(self.device)
        if hidden is not None:
            hidden = hidden.to(self.device)

        emb = self.embedding(x)            # [B, T, D]
        out, hidden = self.rnn(emb, hidden)  # out: [B, T, H]
        logits = self.fc(out)              # [B, T, V]
        return logits, hidden

    @torch.no_grad()
    def generate_sentence(self, sequence, word_to_ix, ix_to_word, num_words, mode='max'):
        """
        Autoregressive generation starting from the given sequence.
        The prompt is run through the GRU once; each new word is then
        predicted from the latest logits while the hidden state is carried forward.
        """
        self.eval()

        # Initial tokens
        tokens = sequence.strip().split()
        # Map tokens to ids, falling back to UNK for out-of-vocabulary words
        unk = UNK if 'UNK' in globals() else '<unk>'
        start_ids = [word_to_ix.get(w, word_to_ix.get(unk, 0)) for w in tokens]
        if len(start_ids) == 0:
            # If the prompt is empty, start from START (<s>) if it is in the vocabulary
            start_ids = [word_to_ix.get(START, 0)]

        # Build the initial state by running the whole prompt
        x = torch.tensor(start_ids, dtype=torch.long, device=self.device).unsqueeze(0)  # [1, T]
        logits, hidden = self.forward(x)  # hidden already accounts for the last prompt token

        generated = []
        for _ in range(num_words):
            # logits[0, -1] scores the word that follows everything consumed so far
            probs = torch.softmax(logits[0, -1], dim=-1)

            if mode == 'multinomial':
                next_id = torch.multinomial(probs, num_samples=1)
            else:
                next_id = torch.argmax(probs, dim=-1, keepdim=True)

            wid = next_id.item()
            word = ix_to_word.get(wid, unk)
            generated.append(word)

            # Stop at end of sentence
            if word == EOS:
                break

            # Feed only the new word; re-feeding the last prompt token would count it twice
            logits, hidden = self.forward(next_id.view(1, 1), hidden)  # [1, 1, V]

        return generated
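
The two decoding modes in `generate_sentence` differ only in how the next word id is picked from the softmax distribution: `'max'` always takes the most probable id, while `'multinomial'` draws an id at random in proportion to its probability. The sketch below reproduces that choice without torch, using the standard library only. The logits, the helper names `softmax` and `pick_next`, and the seed are assumptions made for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next(scores, mode="max", rng=random):
    """Return the index of the next token under greedy or multinomial decoding."""
    probs = softmax(scores)
    if mode == "multinomial":
        # Sample one index with probability proportional to probs
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy: index of the highest probability
    return max(range(len(probs)), key=probs.__getitem__)

toy_scores = [2.0, 1.0, 0.1]  # hypothetical logits for a 3-word vocabulary
greedy = pick_next(toy_scores)

rng = random.Random(0)  # fixed seed so the draw is repeatable
samples = [pick_next(toy_scores, "multinomial", rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(toy_scores))]
print(greedy, counts)
```

Greedy decoding is deterministic and tends to repeat itself, whereas multinomial sampling produces more varied text at the cost of occasionally picking unlikely words.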


#### Training the Model
The training cell in this section fits the RNN language model with cross-entropy loss and the Adam optimizer, printing the loss and perplexity at the end of each epoch.
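
The perplexity printed during training is the exponential of the cross-entropy loss. Because the training loop logs the loss of the final batch of each epoch rather than an epoch average, the reported numbers can jump around from one epoch to the next. As a minimal sketch, assuming hypothetical per-batch losses and token counts (the helper `epoch_perplexity` and its inputs are not part of the assignment code), a token-weighted epoch perplexity could be computed like this:

```python
import math

def epoch_perplexity(batch_losses, batch_token_counts):
    """Return (mean loss per token, perplexity) for one epoch.

    batch_losses: mean cross-entropy of each batch
    batch_token_counts: number of target tokens in each batch
    """
    total_tokens = sum(batch_token_counts)
    # Undo the per-batch averaging so that larger batches weigh more
    total_nll = sum(loss * n for loss, n in zip(batch_losses, batch_token_counts))
    mean_nll = total_nll / total_tokens
    return mean_nll, math.exp(mean_nll)

# Hypothetical values, for illustration only
losses = [2.0, 1.0, 3.0]
counts = [10, 10, 20]
mean_nll, ppl = epoch_perplexity(losses, counts)
print(mean_nll, ppl)
```

In the training loop, the same idea would amount to accumulating `loss.item() * targets.numel()` and `targets.numel()` across batches and exponentiating the ratio once per epoch.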

In [13]:
#######################################
# TEST: RNNLanguageModel() and training
#######################################
torch.manual_seed(11411)
# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 50
hidden_dim = 32
num_epochs = 20

# Initialize the model, loss function, and optimizer
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

lines = ""
# Training loop
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        inputs = inputs.to(RNN.device)
        targets = targets.to(RNN.device)

        RNN.zero_grad()
        output, _ = RNN(inputs)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()

    line = f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}, Perplexity: {np.exp(loss.item())}'
    lines += line + "\n"
    print(line)

Using device: cuda
Epoch 1/20, Loss: 1.8877289295196533, Perplexity: 6.60435268587878
Epoch 2/20, Loss: 2.3066248893737793, Perplexity: 10.040479673564507
Epoch 3/20, Loss: 1.6212486028671265, Perplexity: 5.05940356016781
Epoch 4/20, Loss: 3.0175232887268066, Perplexity: 20.440603467026385
Epoch 5/20, Loss: 1.1223251819610596, Perplexity: 3.0719888384478238
Epoch 6/20, Loss: 2.61008882522583, Perplexity: 13.60025884424322
Epoch 7/20, Loss: 1.8380917310714722, Perplexity: 6.284534229808462
Epoch 8/20, Loss: 1.6776500940322876, Perplexity: 5.352962221983752
Epoch 9/20, Loss: 2.4872968196868896, Perplexity: 12.028716343585533
Epoch 10/20, Loss: 2.544201612472534, Perplexity: 12.733058112032381
Epoch 11/20, Loss: 1.3362374305725098, Perplexity: 3.8047010881214876
Epoch 12/20, Loss: 2.1243982315063477, Perplexity: 8.367860457933368
Epoch 13/20, Loss: 2.313692331314087, Perplexity: 10.111691527118902
Epoch 14/20, Loss: 2.348663330078125, Perplexity: 10.47156334210341
Epoch 15/20, Loss: 2.635