# Language Modeling


Whether for transcribing spoken utterances as correct word sequences or generating coherent human-like text, language models are extremely useful.

In this assignment, you will be building your own language models powered by n-grams and RNNs.

In [1]:
!unzip data.zip

Archive:  data.zip
replace data/bbc/business.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


## Part 1: Language Models

### Step 0: Preprocessing

In [2]:
# !pip install transformers
# !pip install requests
# !pip install torch
# !pip install tqdm

import math
import torch
import numpy as np
import torch.nn as nn
from collections import Counter
from torch.utils.data import DataLoader, Dataset

We provide you with a few functions in `utils.py` to read and preprocess your input data. Do not edit this file!

In [3]:
from utils import *

We have performed a round of preprocessing on the datasets.

- Each file contains one sentence per line.
- All punctuation marks have been removed.
- Each line is a sequence of tokens separated by whitespace.

#### Special Symbols (already defined in `utils.py`)
The start and end tokens act as padding around each sentence. Print them here to confirm that they are defined correctly:

In [4]:
print("Sentence START symbol: {}".format(START))
print("Sentence END symbol: {}".format(EOS))
print("Unknown word symbol: {}".format(UNK))

Sentence START symbol: <s>
Sentence END symbol: </s>
Unknown word symbol: <UNK>


#### Reading and processing an example file

In [5]:
# Read the sample file
sample = read_file("data/sample.txt")
print(sample)

['We are never ever ever ever ever getting back together\n', 'We are the ones together we are back']


In [6]:
# Preprocess the content to add the corresponding number of start and end tokens. Try out the method with n = 2 and n = 4 as well.
# Preprocessing example for trigrams (n=3)
sample = preprocess(sample, n=3)
for s in sample:
    print(s)

['<s>', '<s>', 'we', 'are', 'never', 'ever', 'ever', 'ever', 'ever', 'getting', 'back', 'together', '</s>']
['<s>', '<s>', 'we', 'are', 'the', 'ones', 'together', 'we', 'are', 'back', '</s>']


In [7]:
# Flattens a nested list into a 1D list.
flattened = flatten(sample)
print(flattened)

['<s>', '<s>', 'we', 'are', 'never', 'ever', 'ever', 'ever', 'ever', 'getting', 'back', 'together', '</s>', '<s>', '<s>', 'we', 'are', 'the', 'ones', 'together', 'we', 'are', 'back', '</s>']
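
The outputs above hint at what `preprocess` and `flatten` do. Since `utils.py` is not shown here, the following is only a sketch of that behaviour under stated assumptions: lines are lowercased and split on whitespace, each sentence receives `n - 1` start tokens (at least one) and a single end token, and flattening is plain concatenation. The names `pad_sentence` and `flatten_sentences` are hypothetical and are not part of `utils.py`.

```python
from itertools import chain

START, EOS = "<s>", "</s>"  # same symbols as printed above


def pad_sentence(line, n):
    """Lowercase, split on whitespace, and pad one sentence (assumed behaviour)."""
    tokens = line.lower().split()
    # Assumption: n - 1 start tokens (at least one) and a single end token.
    return [START] * max(n - 1, 1) + tokens + [EOS]


def flatten_sentences(sentences):
    """Concatenate a list of token lists into one flat list."""
    return list(chain.from_iterable(sentences))


lines = ["We are the ones together we are back"]
padded = [pad_sentence(line, n=3) for line in lines]
print(padded[0])
print(flatten_sentences(padded + padded)[:4])
```

The `max(n - 1, 1)` rule is inferred from the vocabulary counts in the tests later in this notebook, where `<s>` appears once per sentence for both `n=1` and `n=2`.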


### Step 1: N-Gram Language Model

#### TO DO: Defining `get_ngrams()`

In [8]:
#######################################
# TODO: get_ngrams()
#######################################
def get_ngrams(list_of_words, n):
    """
    Returns a list of n-grams for a list of words.
    Args
    ----
    list_of_words: List[str]
        List of already preprocessed and flattened (1D) list of tokens e.g. ["<s>", "hello", "</s>", "<s>", "bye", "</s>"]
    n: int
        n-gram order e.g. 1, 2, 3

    Returns:
        n_grams: List[Tuple]
            Returns a list containing n-gram tuples
    """


    raise NotImplementedError
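
As an illustration of the sliding-window idea behind n-gram extraction, here is a minimal sketch. The helper name `sliding_ngrams` is hypothetical, and this is one possible approach rather than the reference solution for `get_ngrams()`.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted views of the list; zip stops at the shortest view,
    # so exactly len(tokens) - n + 1 tuples are produced (none if n is too large).
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["<s>", "we", "are", "back", "</s>"]
print(sliding_ngrams(tokens, 2))
# [('<s>', 'we'), ('we', 'are'), ('are', 'back'), ('back', '</s>')]
print(sliding_ngrams(tokens, 1))
# [('<s>',), ('we',), ('are',), ('back',), ('</s>',)]
```

Because the window runs over the flattened list, it also produces n-grams that straddle sentence boundaries, such as `('</s>', '<s>')`.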

In [9]:
#######################################
# TEST: get_ngrams()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=3)
flattened = flatten(sample)

assert get_ngrams(flattened, 3) == [('<s>', '<s>', 'we'),
        ('<s>', 'we', 'are'),
        ('we', 'are', 'never'),
        ('are', 'never', 'ever'),
        ('never', 'ever', 'ever'),
        ('ever', 'ever', 'ever'),
        ('ever', 'ever', 'ever'),
        ('ever', 'ever', 'getting'),
        ('ever', 'getting', 'back'),
        ('getting', 'back', 'together'),
        ('back', 'together', '</s>'),
        ('together', '</s>', '<s>'),
        ('</s>', '<s>', '<s>'),
        ('<s>', '<s>', 'we'),
        ('<s>', 'we', 'are'),
        ('we', 'are', 'the'),
        ('are', 'the', 'ones'),
        ('the', 'ones', 'together'),
        ('ones', 'together', 'we'),
        ('together', 'we', 'are'),
        ('we', 'are', 'back'),
        ('are', 'back', '</s>')]

NotImplementedError: 

#### **TO DO:** Class `NGramLanguageModel()`

Now we will define our `NGramLanguageModel` class.

**Some Useful Variables:**
- self.model: `dict` of n-grams and their corresponding probabilities, keys being the tuple containing the n-gram, and the value being the probability of the n-gram.
- self.vocab: `dict` of unigram vocabulary with counts, keys being the words themselves and the values being their frequency.
- self.n: `int` value for n-gram order (e.g. 1, 2, 3).
- self.train_data: `List[List]` containing preprocessed **unflattened** train sentences. You will have to flatten it to use in the language model
- self.smoothing: `float` smoothing parameter (the `alpha` value passed to the constructor).

Note that we will not be using log probabilities in this section. Store the probabilities as they are, not in log space.

**Laplace Smoothing**

There are two ways to perform this:
- Either you calculate all possible n-grams at train time and calculate smoothed probabilities for all of them, hence inflating the model (eager smoothing). You then use the probabilities as and when required at test time. **OR**
- You calculate the probabilities for the **observed n-grams** at train time, using the smoothed likelihood formula. If an unseen n-gram is then encountered at test time, you calculate its probability using the same smoothed likelihood formula and store it in the model for future use (lazy smoothing).

You will be implementing lazy smoothing.
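
To make the lazy variant concrete, here is a minimal sketch. It assumes the add-alpha estimate `(count + alpha) / (prefix_count + alpha * |V|)`, which is consistent with the expected values in the smoothing test later in this notebook. The names `laplace_prob`, `lazy_prob` and `cache` are hypothetical; `cache` plays the role of `self.model`.

```python
from collections import Counter


def laplace_prob(ngram_count, prefix_count, vocab_size, alpha):
    """Add-alpha estimate: (count + alpha) / (prefix count + alpha * |V|)."""
    return (ngram_count + alpha) / (prefix_count + alpha * vocab_size)


# Toy bigram counts taken from data/sample.txt
ngram_counts = Counter({("<s>", "we"): 2, ("we", "are"): 3})
prefix_counts = Counter({("<s>",): 2, ("we",): 3})
vocab_size, alpha = 11, 1
cache = {}  # plays the role of self.model


def lazy_prob(ngram):
    """Look the n-gram up; compute and store it only on first sight."""
    if ngram not in cache:
        cache[ngram] = laplace_prob(
            ngram_counts[ngram], prefix_counts[ngram[:-1]], vocab_size, alpha
        )
    return cache[ngram]


print(lazy_prob(("<s>", "we")))    # seen bigram: (2 + 1) / (2 + 11)
print(lazy_prob(("we", "never")))  # unseen bigram: (0 + 1) / (3 + 11)
```

For the unigram case there is no prefix, so the denominator would use the total number of tokens in place of the prefix count.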

**Perplexity**

Steps:
1. Flatten the test data.
2. Extract ngrams from the flattened data.
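3. Calculate perplexity according to the formula below. For unseen n-grams, calculate the probability using the smoothed likelihood and store it in the language model's `model` attribute:

$ppl(W_{test}) = P(W_1 W_2 \ldots W_N)^{-1/N}$

Here $P(\cdot)$ is the probability the model assigns to the test sequence $W_1 \ldots W_N$, and $N$ is the sequence length, not the n-gram order.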

Tips:
- Remember that product changes to summation under `log`. Take the log of probabilities, sum them up, and then exponentiate it to get back to the original scale.
- Make sure to `flatten()` your data before creating the n_grams using `get_ngrams()`.
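
The log-space tip can be sketched as follows. This is a minimal illustration that takes a list of per-n-gram probabilities as input; the helper name `perplexity_from_probs` is hypothetical and the model lookup is left out.

```python
import math


def perplexity_from_probs(probs):
    """exp of the negative mean log-probability of the scored n-grams."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# A model that is uniformly unsure between 4 choices at every step
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
```

Summing logs avoids the numerical underflow that multiplying many small probabilities would cause. Note that `math.log(0)` raises a `ValueError`, so a zero probability (for example an unseen n-gram with no smoothing) has to be handled before this step.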


In [None]:
#######################################
# TODO: NGramLanguageModel()
#######################################
class NGramLanguageModel():
    def __init__(self, n, train_data, alpha=1):
        """
        Language model class.

        Args
        ____
        n: int
            n-gram order
        train_data: List[List]
            already preprocessed unflattened list of sentences. e.g. [["<s>", "hello", "my", "</s>"], ["<s>", "hi", "there", "</s>"]]
        alpha: float
            Smoothing parameter

        Other attributes:
            self.tokens: list of individual tokens present in the training corpus
            self.vocab: vocabulary dict with counts
            self.model: n-gram language model, i.e., n-gram dict with probabilities
            self.n_grams_counts: dictionary for storing the frequency of ngrams in the training data, keys being the tuple of words(n-grams) and value being their frequency
            self.prefix_counts: dictionary for storing the frequency of the (n-1) grams in the data, similar to the self.n_grams_counts
            As an example:
            For a trigram model, the n-gram would be (w1,w2,w3), the corresponding [n-1] gram would be (w1,w2)
        """
        raise NotImplementedError

    def build(self):
        """
        Returns a n-gram dict with their smoothed probabilities. Remember to consider the edge case of n=1 as well

        You are expected to update self.n_grams_counts and self.prefix_counts, and use them to calculate the probabilities.
        """
        raise NotImplementedError

    def get_smooth_probabilities(self, ngrams):
        """
        Returns the smoothed probability of the n-gram, using Laplace Smoothing.
        Remember to consider the edge case of  n = 1
        HINT: Use self.n_grams_counts, self.tokens and self.prefix_counts
        """
        raise NotImplementedError

    def get_prob(self, ngram):
        """
        Returns the probability of the n-gram, using Laplace Smoothing.

        Args
        ____
        ngram: tuple
            n-gram tuple

        Returns
        _______
        float
            probability of the n-gram
        """

        # Hint: Check if this n-gram exists in self.model, if it does simply return it!
        # Otherwise, calculate the probability as in get_smooth_probabilities()
        raise NotImplementedError

    def perplexity(self, test_data):
        """
        Returns perplexity calculated on the test data.
        Args
        ----------
        test_data: List[List]
            Already preprocessed nested list of sentences

        Returns
        -------
        float
            Calculated perplexity value
        """
        raise NotImplementedError


In [None]:
#######################################
# TEST: NGramLanguageModel()
#######################################
# For the sake of understanding we will pass alpha as 0 (no smoothing), so that you gain intuition about the probabilities
sample = preprocess(read_file("data/sample.txt"), n=2)
test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=0)

expected_vocab = Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

expected_model = {('<s>', 'we'): 1.0,
        ('we', 'are'): 1.0,
        ('are', 'never'): 0.3333333333333333,
        ('never', 'ever'): 1.0,
        ('ever', 'ever'): 0.75,
        ('ever', 'getting'): 0.25,
        ('getting', 'back'): 1.0,
        ('back', 'together'): 0.5,
        ('together', '</s>'): 0.5,
        ('</s>', '<s>'): 1.0,
        ('are', 'the'): 0.3333333333333333,
        ('the', 'ones'): 1.0,
        ('ones', 'together'): 1.0,
        ('together', 'we'): 0.5,
        ('are', 'back'): 0.3333333333333333,
        ('back', '</s>'): 0.5}

assert test_lm.vocab == expected_vocab, f"Vocabulary mismatch! Expected: {expected_vocab}, but got: {test_lm.vocab}"

assert test_lm.model == expected_model, (
    f"Model mismatch! \n"
    f"Expected keys but missing: {set(expected_model.keys()) - set(test_lm.model.keys())}\n"
    f"Unexpected keys in model: {set(test_lm.model.keys()) - set(expected_model.keys())}\n"
    f"Discrepancies in probabilities: "
    f"{ {k: (expected_model[k], test_lm.model[k]) for k in expected_model if k in test_lm.model and expected_model[k] != test_lm.model[k]} }"
)

In [None]:
#######################################
# TEST smoothing: NGramLanguageModel()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=2)
test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=1)

expected_vocab_smoothing = Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

expected_model_smoothing ={('<s>', 'we'): 0.23076923076923078,
        ('we', 'are'): 0.2857142857142857,
        ('are', 'never'): 0.14285714285714285,
        ('never', 'ever'): 0.16666666666666666,
        ('ever', 'ever'): 0.26666666666666666,
        ('ever', 'getting'): 0.13333333333333333,
        ('getting', 'back'): 0.16666666666666666,
        ('back', 'together'): 0.15384615384615385,
        ('together', '</s>'): 0.15384615384615385,
        ('</s>', '<s>'): 0.16666666666666666,
        ('are', 'the'): 0.14285714285714285,
        ('the', 'ones'): 0.16666666666666666,
        ('ones', 'together'): 0.16666666666666666,
        ('together', 'we'): 0.15384615384615385,
        ('are', 'back'): 0.14285714285714285,
        ('back', '</s>'): 0.15384615384615385}


assert test_lm.vocab == expected_vocab_smoothing, f"Vocabulary mismatch! Expected: {expected_vocab_smoothing}, but got: {test_lm.vocab}"

assert test_lm.model == expected_model_smoothing, (
    f"Model mismatch! \n"
    f"Expected keys but missing: {set(expected_model_smoothing.keys()) - set(test_lm.model.keys())}\n"
    f"Unexpected keys in model: {set(test_lm.model.keys()) - set(expected_model_smoothing.keys())}\n"
    f"Discrepancies in probabilities: "
    f"{ {k: (expected_model_smoothing[k], test_lm.model[k]) for k in expected_model_smoothing if k in test_lm.model and expected_model_smoothing[k] != test_lm.model[k]} }"
)

In [None]:
#######################################
# TEST unigram: NGramLanguageModel()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=1)
test_lm = NGramLanguageModel(n=1, train_data=sample, alpha=1)

expected_vocab_unigram = Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

expected_model_unigram = {('<s>',): 0.09090909090909091,
        ('we',): 0.12121212121212122,
        ('are',): 0.12121212121212122,
        ('never',): 0.06060606060606061,
        ('ever',): 0.15151515151515152,
        ('getting',): 0.06060606060606061,
        ('back',): 0.09090909090909091,
        ('together',): 0.09090909090909091,
        ('</s>',): 0.09090909090909091,
        ('the',): 0.06060606060606061,
        ('ones',): 0.06060606060606061}


assert test_lm.vocab == expected_vocab_unigram, f"Vocabulary mismatch! Expected: {expected_vocab_unigram}, but got: {test_lm.vocab}"

assert test_lm.model == expected_model_unigram, (
    f"Model mismatch! \n"
    f"Expected keys but missing: {set(expected_model_unigram.keys()) - set(test_lm.model.keys())}\n"
    f"Unexpected keys in model: {set(test_lm.model.keys()) - set(expected_model_unigram.keys())}\n"
    f"Discrepancies in probabilities: "
    f"{ {k: (expected_model_unigram[k], test_lm.model[k]) for k in expected_model_unigram if k in test_lm.model and expected_model_unigram[k] != test_lm.model[k]} }"
)

In [None]:
#######################################
# TEST: perplexity()
#######################################
test_lm = NGramLanguageModel(n=3, train_data=sample, alpha=0)
test_ppl = test_lm.perplexity(sample)
print(test_ppl)
assert test_ppl < 1.7
assert test_ppl > 0

test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=1)
test_ppl = test_lm.perplexity(sample)
print(test_ppl)
assert test_ppl < 5.0
assert test_ppl > 0

## Train the n-gram language model on the data/bbc/business.txt dataset for n = 2 and n = 3. Then do the same for the data/bbc/sports.txt dataset

In [None]:
#######################################
# TRAIN bigram: NGramLanguageModel() for business data
#######################################
business_prepro = preprocess(read_file("data/bbc/business.txt"), n=2)
train_bussi = NGramLanguageModel(n=2, train_data=business_prepro, alpha=0.5)
print(len(set(train_bussi.model.keys())))
print(len(train_bussi.n_grams_counts))
print('Vocab size: ', len(train_bussi.vocab))

In [None]:
#######################################
# TRAIN trigram: NGramLanguageModel() for business data
#######################################
business_prepro = preprocess(read_file("data/bbc/business.txt"), n=3)
train_bussi = NGramLanguageModel(n=3, train_data=business_prepro, alpha=0.5)
print(len(set(train_bussi.model.keys())))
print(len(train_bussi.n_grams_counts))
print('Vocab size: ', len(train_bussi.vocab))

In [None]:
#######################################
# TRAIN bigram: NGramLanguageModel() for sports data
#######################################
spo_prepro = preprocess(read_file("data/bbc/sport.txt"), n=2)
train_spo = NGramLanguageModel(n=2, train_data=spo_prepro, alpha=0.5)
print(len(set(train_spo.model.keys())))
print(len(train_spo.n_grams_counts))
print('Vocab size: ', len(train_spo.vocab))

In [None]:
#######################################
# TRAIN trigram: NGramLanguageModel() for sports data
#######################################
spo_prepro = preprocess(read_file("data/bbc/sport.txt"), n=3)
train_spo = NGramLanguageModel(n=3, train_data=spo_prepro, alpha=0.5)
print(len(set(train_spo.model.keys())))
print(len(train_spo.n_grams_counts))
print('Vocab size: ', len(train_spo.vocab))

How many possible 2- and 3- grams could there be, given the same vocabulary?


How do the empirical counts given above compare to the number of possible 2- and 3- grams?


## Train tri-gram (n=3, smoothing=0.1) language models on collections of song lyrics from three popular artists (`data/lyrics/`) and use the models to score a new unattributed song.

In [None]:
taylor_pre = preprocess(read_file("data/lyrics/taylor_swift.txt"), n=3)
train_tay = NGramLanguageModel(n=3, train_data=taylor_pre, alpha=0.1)

green_pre = preprocess(read_file("data/lyrics/green_day.txt"), n=3)
train_green = NGramLanguageModel(n=3, train_data=green_pre, alpha=0.1)

ed_pre = preprocess(read_file("data/lyrics/ed_sheeran.txt"), n=3)
train_ed = NGramLanguageModel(n=3, train_data=ed_pre, alpha=0.1)

What are the perplexity scores of the test lyrics against each of the language models?

In [None]:
test_prepro = preprocess(read_file("data/lyrics/test_lyrics.txt"), n=3)

tay_ppl = train_tay.perplexity(test_prepro)
print('Perplexity of taylor swift: ', tay_ppl)

green_ppl = train_green.perplexity(test_prepro)
print('Perplexity of green day: ', green_ppl)

ed_ppl = train_ed.perplexity(test_prepro)
print('Perplexity of ed sheeran: ', ed_ppl)

## Train bi-gram (n=2, smoothing=0.1) language models on collections of song lyrics from three popular artists (`data/lyrics/`) and use the models to score a new unattributed song.

In [None]:
taylor_pre = preprocess(read_file("data/lyrics/taylor_swift.txt"), n=2)
train_tay = NGramLanguageModel(n=2, train_data=taylor_pre, alpha=0.1)

green_pre = preprocess(read_file("data/lyrics/green_day.txt"), n=2)
train_green = NGramLanguageModel(n=2, train_data=green_pre, alpha=0.1)

ed_pre = preprocess(read_file("data/lyrics/ed_sheeran.txt"), n=2)
train_ed = NGramLanguageModel(n=2, train_data=ed_pre, alpha=0.1)

In [None]:
test_prepro = preprocess(read_file("data/lyrics/test_lyrics.txt"), n=2)

tay_ppl = train_tay.perplexity(test_prepro)
print('Perplexity of taylor swift: ', tay_ppl)

green_ppl = train_green.perplexity(test_prepro)
print('Perplexity of green day: ', green_ppl)

ed_ppl = train_ed.perplexity(test_prepro)
print('Perplexity of ed sheeran: ', ed_ppl)
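
One way to turn the three perplexity scores into an attribution is to pick the model with the lowest perplexity, since that model found the test lyrics least surprising. The sketch below uses a hypothetical helper `most_likely_source` and placeholder numbers that are not results from this notebook.

```python
def most_likely_source(perplexities):
    """Return the label whose model assigns the lowest perplexity."""
    return min(perplexities, key=perplexities.get)


# Placeholder numbers for illustration only, not results from this notebook
made_up_scores = {"artist_a": 310.0, "artist_b": 125.5, "artist_c": 480.2}
print(most_likely_source(made_up_scores))  # artist_b
```

Scores are only comparable across models trained with the same n-gram order and smoothing value, as is the case in the cells above.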

### Step 2: RNN Language Model


#### Preparing the Data
The code below loads and processes [GloVe (Global Vectors for Word Representation) embeddings](https://nlp.stanford.edu/projects/glove/). GloVe is an unsupervised learning algorithm for obtaining vector representations of words. These embeddings can be used in a wide range of natural language processing and machine learning tasks.

The `load_glove_embeddings(path)` function loads the GloVe embeddings from a file. It reads the file line by line, splits each line into a word and its embedding values, and stores them in a dictionary, `embeddings_dict`, that maps each word to its vector.

The `create_embedding_matrix(word_to_ix, embeddings_dict, embedding_dim)` function builds an embedding matrix from the loaded GloVe embeddings. It takes a dictionary mapping words to their indices (`word_to_ix`), the dictionary of GloVe embeddings (`embeddings_dict`), and the dimension of the embeddings (`embedding_dim`). It creates a zero matrix of size `(vocab_size, embedding_dim)` and then, for each word in `word_to_ix`, checks whether the word is in `embeddings_dict`. If it is, the corresponding GloVe vector is written to the word's row of the matrix; otherwise that row is filled with a random vector.

The `glove_path` variable is the path to the GloVe file, and `glove_embeddings` is the dictionary returned by `load_glove_embeddings`. The `embedding_dim` variable is the dimension of the embeddings, and `embedding_matrix` is the matrix returned by `create_embedding_matrix`.

In [10]:
# Load the data
vocab, word_to_ix, ix_to_word, dataloader = loadfile("data/lyrics/taylor_swift.txt")

In [11]:
def load_glove_embeddings(path):
    embeddings_dict = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = torch.tensor([float(val) for val in values[1:]], dtype=torch.float)
            embeddings_dict[word] = vector
    return embeddings_dict

# Path to the GloVe file
glove_path = 'glove.6B.50d.txt'  # Update this path
glove_embeddings = load_glove_embeddings(glove_path)

def create_embedding_matrix(word_to_ix, embeddings_dict, embedding_dim):
    vocab_size = len(word_to_ix)
    embedding_matrix = torch.zeros((vocab_size, embedding_dim))
    for word, ix in word_to_ix.items():
        if word in embeddings_dict:
            embedding_matrix[ix] = embeddings_dict[word]
        else:
            embedding_matrix[ix] = torch.rand(embedding_dim)  # Random initialization for words not in GloVe
    return embedding_matrix

# Create the embedding matrix
embedding_dim = 50
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)
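
Since words missing from GloVe fall back to random vectors, it can be useful to check how much of the vocabulary is actually covered. The following is a small sketch with a hypothetical helper `glove_coverage`; the toy dictionaries stand in for `word_to_ix` and `glove_embeddings`.

```python
def glove_coverage(word_to_ix, embeddings_dict):
    """Return (fraction of vocabulary found in GloVe, sorted list of misses)."""
    missing = sorted(w for w in word_to_ix if w not in embeddings_dict)
    covered = len(word_to_ix) - len(missing)
    return covered / max(len(word_to_ix), 1), missing


# Toy stand-ins for word_to_ix and glove_embeddings
toy_vocab = {"<s>": 0, "we": 1, "are": 2, "</s>": 3}
toy_glove = {"we": [0.1, 0.2], "are": [0.3, 0.4]}
fraction, missing = glove_coverage(toy_vocab, toy_glove)
print(fraction, missing)
```

Special symbols such as `<s>` and `</s>` are likely candidates for the missing list, which is one reason the random fallback exists.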

#### TO DO: Defining the RNN Model

In [12]:
#######################################
# TODO: RNNLanguageModel()
#######################################

import math
import torch
import numpy as np
import torch.nn as nn
from collections import Counter
from torch.utils.data import DataLoader, Dataset

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, embedding_matrix):
        """
        RNN language model with a GRU and GloVe embeddings.
        """
        super().__init__()
        self.device = torch.device(
            "mps" if torch.backends.mps.is_available()
            else "cuda" if torch.cuda.is_available()
            else "cpu"
        )
        print(f"Using device: {self.device}")

        # Embedding layer initialized with GloVe
        # embedding_matrix: torch.Tensor [vocab_size, embedding_dim]
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        with torch.no_grad():
            self.embedding.weight.copy_(embedding_matrix)

        # Unidirectional GRU
        self.hidden_dim = hidden_dim
        self.rnn = nn.GRU(input_size=embedding_dim,
                          hidden_size=hidden_dim,
                          num_layers=1,
                          batch_first=True)

        # Final layer projecting to the vocabulary
        self.fc = nn.Linear(hidden_dim, vocab_size)

        self.to(self.device)

    def forward(self, x, hidden=None):
        """
        x: [B, T] token indices
        hidden: [1, B, H], optional
        returns: logits [B, T, V], hidden
        """
        x = x.to(self.device)
        if hidden is not None:
            hidden = hidden.to(self.device)

        emb = self.embedding(x)            # [B, T, D]
        out, hidden = self.rnn(emb, hidden)  # out: [B, T, H]
        logits = self.fc(out)              # [B, T, V]
        return logits, hidden

    @torch.no_grad()
    def generate_sentence(self, sequence, word_to_ix, ix_to_word, num_words, mode='max'):
        """
        Autoregressive generation starting from the given sequence.
        Conditions on the last token and carries the hidden state forward.
        """
        self.eval()

        # Initial tokens
        tokens = sequence.strip().split()
        # Map to ids, falling back to UNK for words not in the vocabulary
        unk = UNK if 'UNK' in globals() else '<unk>'
        start_ids = [word_to_ix.get(w, word_to_ix.get(unk, 0)) for w in tokens]
        if len(start_ids) == 0:
            # If the prompt is empty, start from <s> if it exists
            start_ids = [word_to_ix.get(START, 0)]

        # Build the initial state by running the whole prompt through the model once
        x = torch.tensor(start_ids, dtype=torch.long, device=self.device).unsqueeze(0)  # [1, T]
        logits, hidden = self.forward(x)  # logits[0, -1] already predicts the word after the prompt

        generated = []

        for _ in range(num_words):
            # Distribution over the next word, taken from the most recent step.
            # The last prompt token is not fed in a second time.
            probs = torch.softmax(logits[0, -1], dim=-1)

            if mode == 'multinomial':
                next_id = torch.multinomial(probs, num_samples=1)
            else:
                next_id = torch.argmax(probs, dim=-1, keepdim=True)

            wid = next_id.item()
            word = ix_to_word.get(wid, unk)
            generated.append(word)

            # Stop at the end-of-sentence token
            if word == EOS:
                break

            # Feed the new word back in, carrying the hidden state forward
            logits, hidden = self.forward(next_id.view(1, 1), hidden)  # [1,1,V]

        return generated
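
The `mode` argument of `generate_sentence` switches between greedy decoding and sampling. The difference can be illustrated without PyTorch; the sketch below uses a hypothetical helper `pick_next` that works on a plain list of probabilities and is not part of the model class.

```python
import random


def pick_next(probs, mode="max", rng=random):
    """Choose an index from a probability list, greedily or by sampling."""
    if mode == "multinomial":
        # Sample an index in proportion to the probabilities
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy: always the single most probable index
    return max(range(len(probs)), key=probs.__getitem__)


probs = [0.1, 0.6, 0.3]
print(pick_next(probs))  # 1
rng = random.Random(0)
print([pick_next(probs, "multinomial", rng) for _ in range(5)])
```

Greedy decoding is deterministic and tends to repeat itself, while sampling produces more varied text at the cost of occasionally choosing unlikely words.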


#### Training the Model
The following code trains the RNN language model.
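
The training cell below prints the loss of the last mini-batch in each epoch, so the reported loss and perplexity can jump around from one epoch to the next. As a possible alternative, here is a sketch of summarizing an epoch by its mean batch loss. The helper `epoch_summary` is hypothetical and the input numbers are placeholders, not values from this run.

```python
import math


def epoch_summary(batch_losses):
    """Return (mean loss, perplexity) for one epoch's batch losses."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return mean_loss, math.exp(mean_loss)


# Placeholder batch losses for illustration, not values from this run
mean_loss, ppl = epoch_summary([2.0, 1.0, 1.5])
print(f"Loss: {mean_loss:.3f}, Perplexity: {ppl:.3f}")
```

In the training loop this would mean appending `loss.item()` to a list for every batch and calling the helper once per epoch.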

In [13]:
#######################################
# TEST: RNNLanguageModel() and training
#######################################
torch.manual_seed(11411)
# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 50
hidden_dim = 32
num_epochs = 20

# Initialize the model, loss function, and optimizer
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

lines = ""
# Training loop
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        inputs = inputs.to(RNN.device)
        targets = targets.to(RNN.device)

        RNN.zero_grad()
        output, _ = RNN(inputs)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()

    line = f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}, Perplexity: {np.exp(loss.item())}'
    lines += line + "\n"
    print(line)

Using device: cuda
Epoch 1/20, Loss: 1.8877289295196533, Perplexity: 6.60435268587878
Epoch 2/20, Loss: 2.3066248893737793, Perplexity: 10.040479673564507
Epoch 3/20, Loss: 1.6212486028671265, Perplexity: 5.05940356016781
Epoch 4/20, Loss: 3.0175232887268066, Perplexity: 20.440603467026385
Epoch 5/20, Loss: 1.1223251819610596, Perplexity: 3.0719888384478238
Epoch 6/20, Loss: 2.61008882522583, Perplexity: 13.60025884424322
Epoch 7/20, Loss: 1.8380917310714722, Perplexity: 6.284534229808462
Epoch 8/20, Loss: 1.6776500940322876, Perplexity: 5.352962221983752
Epoch 9/20, Loss: 2.4872968196868896, Perplexity: 12.028716343585533
Epoch 10/20, Loss: 2.544201612472534, Perplexity: 12.733058112032381
Epoch 11/20, Loss: 1.3362374305725098, Perplexity: 3.8047010881214876
Epoch 12/20, Loss: 2.1243982315063477, Perplexity: 8.367860457933368
Epoch 13/20, Loss: 2.313692331314087, Perplexity: 10.111691527118902
Epoch 14/20, Loss: 2.348663330078125, Perplexity: 10.47156334210341
Epoch 15/20, Loss: 2.635