<img src='data/images/section-notebook-header.png' />

# Recurrent Neural Networks (RNNs): Language Models

Language models are computational models that learn the statistical patterns and structures of natural language. These models are trained on large amounts of text data and aim to predict the likelihood of a given word or sequence of words occurring in a given context. Language models can be used for a wide range of NLP tasks, including text generation, machine translation, speech recognition, sentiment analysis, and question answering.
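
Stated a little more formally, "predicting the likelihood of a sequence of words" is usually written as a product of next-word probabilities. The following is the standard chain-rule factorization; the symbols $w_t$ (the $t$-th token) and $T$ (the sequence length) are notation introduced here for illustration and are not taken from the lecture slides:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$

Each factor is a "predict the next token given everything so far" problem, which is the task the models in this notebook are trained on.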

The primary purpose of language models is to capture the inherent structure and semantic relationships within language. By learning from vast amounts of textual data, language models develop an understanding of word co-occurrences, syntactic rules, and contextual cues. This knowledge enables them to generate coherent and contextually relevant text, complete sentences, correct grammatical errors, and even provide auto-suggestions while typing.

Recurrent Neural Networks (RNNs) are particularly suitable for training language models due to their inherent ability to handle sequential data. Here are some reasons why RNNs are well-suited for language modeling:

* **Sequential Processing:** Language is inherently sequential, and the meaning of a word depends on the words that came before it. RNNs process sequences of inputs one element at a time, allowing them to capture the temporal dependencies and context within the text. This sequential processing makes RNNs suitable for tasks like language modeling, where understanding the context is crucial for generating coherent and meaningful text.

* **Variable-Length Input:** RNNs can handle inputs of variable lengths, which is important in language modeling since sentences or documents can vary in length. RNNs can process each word or character in a sequential manner and update their hidden state accordingly, accommodating different lengths of input sequences.

* **Long-Term Dependencies:** In principle, RNNs can capture long-term dependencies in the input sequence. The hidden state of an RNN at any given time step summarizes the information from all previous time steps. This allows the RNN to retain a memory of past inputs and use it to influence the predictions at future time steps. Language models benefit from this property, as they need to consider not just recent context but also distant dependencies in the text. In practice, this memory fades over long distances, as noted at the end of this list.

* **Parameter Sharing:** RNNs share the same set of parameters across all time steps, which enables them to generalize well to different parts of the input sequence. This parameter sharing is particularly useful in language modeling, where the same set of weights can be applied to process different words or characters in the text. It allows the model to learn patterns and dependencies in the language more effectively, even when the training data is sparse or the vocabulary is large.

* **Efficient Training with Backpropagation Through Time (BPTT):** RNNs can be trained efficiently using the backpropagation through time algorithm. BPTT applies the chain rule of calculus to compute the gradients of the loss function with respect to the model parameters, effectively unrolling the recurrent computations. This enables efficient training of RNNs on long sequences by propagating the gradients through time, updating the parameters based on the accumulated information from the entire sequence.

While RNNs have been widely used for language modeling, they do have limitations, such as difficulties in capturing very long-term dependencies and vanishing/exploding gradient problems. To address some of these limitations, variants of RNNs, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have been introduced, which incorporate mechanisms to better capture and control the flow of information through the recurrent connections.
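
To make the points about parameter sharing and vanishing/exploding gradients more concrete, here is a sketch of the usual vanilla RNN equations. The symbol names ($W_{xh}$, $W_{hh}$, $W_{hy}$, $b_h$, $b_y$) are assumptions chosen for illustration and may differ from the notation on the lecture slides:

$$h_t = \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h\right), \qquad \hat{y}_t = \mathrm{softmax}\left(W_{hy}\, h_t + b_y\right)$$

The same matrices are used at every time step $t$, which is the parameter sharing mentioned above. When backpropagating through time, the influence of an earlier hidden state $h_k$ on a later one $h_t$ is a product of per-step Jacobians (multiplied in order of decreasing $j$):

$$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} \mathrm{diag}\left(1 - h_j \odot h_j\right) W_{hh}$$

If the factors in this product tend to shrink vectors, the gradient vanishes as $t - k$ grows; if they tend to stretch them, it explodes. Gated cells such as LSTM and GRU change the update rule to make this product better behaved.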

In this notebook, we build a simple RNN-based language model. The corpus we use was generated in the Data Preparation notebook.

## Setting up the Notebook

### Import Required Packages

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

We utilize PyTorch as our deep learning framework of choice by importing the `torch` package.

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.vocab import vocab

We also need to import some custom implementations of classes and methods. This makes it easier to reuse these classes and methods and keeps the notebook clean.

In [3]:
from src.utils import Dict2Class
from src.rnn import VanillaRnnLanguageModel, RnnLanguageModel

### Checking/Setting the Computation Device

PyTorch allows us to train neural networks on supported GPUs, which significantly speeds up the training process. If you have a supported GPU, feel free to use it.

In [4]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU 
#use_cuda = False

device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

Available device: cuda:0


---

## Load & Prepare Data

### Load Vocabulary

In the Data Preparation notebook, we already preprocessed and vectorized our corpus of movie reviews, and saved the resulting vocabulary and dataset into files. We essentially only need to load the generated files. Let's start with the vocabulary. Recall, the `vocabulary` is a `vocab` object from the `torchtext` package, allowing us to map words/tokens to their unique integer identifiers and back.

In [5]:
vocabulary = torch.load('data/corpora/imdb-reviews/vectorized-rnn-lm/imdb-rnn-lm-20000.vocab')

vocab_size = len(vocabulary)

print('Size of vocabulary:\t{}'.format(vocab_size))

Size of vocabulary:	20004


### Load & Filter Sentences

To simplify things, we only consider sentences of a certain length, here 5 to 30 words. This choice is somewhat arbitrary, so feel free to change it. However, recall that we already limited ourselves to sentences with lengths between 5 and 50 when generating the dataset in the Data Preparation notebook, so we are bound by this interval.

In [6]:
df = pd.read_csv('data/corpora/imdb-reviews/vectorized-rnn-lm/imdb-rnn-lm-sentences.txt', header=None)

# Keep only sentences with 5 to 30 tokens/words
df = df[(df[1] <= 30) & (df[1] >= 5)]

# Convert the strings of token indices to list of token indices
df[0] = df[0].apply(lambda x: [ int(i) for i in x.split(' ')])

# Let's have a look at the data
df.head()

| | 0 | 1 |
|---|---|---|
| 3 | [1, 1085, 4, 489, 9, 4, 7371, 10, 1, 4, 489, 9... | 30 |
| 4 | [63, 1, 1642, 10, 35, 6862, 5, 34, 676, 8, 346... | 17 |
| 6 | [4, 2449, 1970, 124, 484, 1, 63, 4, 140, 18, 9... | 30 |
| 8 | [63, 14, 101, 564, 54, 17, 3024, 5, 17, 14, 26... | 20 |
| 9 | [170, 9, 4980, 246, 18, 47, 19, 91, 19, 13766,... | 16 |


Recall that the 2nd column in our file records the length of each sequence in terms of the number of token indices. This information will help us in the next step to organize our training batches; see below. Of course, we could also calculate the lengths of the sequence on the fly, but having the value already at hand makes the next step just a bit more convenient.
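
For completeness, computing the lengths on the fly could look roughly like the sketch below. It uses a small made-up `toy_df` instead of the real sentence file, and the interval bounds `min_len` and `max_len` are hypothetical values picked to fit the toy data.

```python
import pandas as pd

# Toy stand-in for the sentence file: column 0 holds space-separated token indices
toy_df = pd.DataFrame({0: ["1 1085 4 489 9", "63 1 1642", "4 2449 1970 124 484 1 63"]})

# Convert each string to a list of ints, then derive the length from the list itself
toy_df[0] = toy_df[0].apply(lambda s: [int(i) for i in s.split(" ")])
toy_df[1] = toy_df[0].apply(len)

# Keep only sentences within a length interval (bounds are inclusive)
min_len, max_len = 4, 6
filtered = toy_df[toy_df[1].between(min_len, max_len)]

print(filtered)
```

The stored length column simply saves this extra pass over the data.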

Let's quickly check the total number of sentences. If you used the default settings in the Data Preparation notebook and considered all reviews -- and not just a subset -- the dataset should include 961,348 sentences.

In [7]:
print('Total number of sentences: {}'.format(len(df)))

Total number of sentences: 961348


### Organize Sentences by Length

In the Sentiment Analysis notebook, we already discussed the need to handle sequences of variable length to support mini-batches when training our model. Again, we organize our dataset in such a way that we can create batches where all samples are guaranteed to have the same length. However, to show that even this approach can be implemented in different ways, we use a more "manual" approach in this notebook. To this end, we create a dictionary with all sentence lengths as the keys, where the value for a key is the list of (positional) indices of the sentences with that length. This allows us to create batches during training where all samples have the same length.

In [8]:
sentence_mapping = {}

for i in tqdm(range(0, len(df))):
    sent_len = df.iloc[i][1]
    if sent_len in sentence_mapping:
        sentence_mapping[sent_len].append(i)
    else:
        sentence_mapping[sent_len] = [i]

100%|██████████████████████████████████████████████████████████████████████████████████████| 961348/961348 [02:07<00:00, 7560.90it/s]
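
The row-by-row loop above takes a couple of minutes on the full dataset. As an alternative sketch (not the approach used in this notebook), pandas can build the same kind of mapping in one call via the `indices` attribute of a group-by object, which maps each group key to the integer positions of its rows. The `toy_df` and `toy_mapping` names below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the filtered DataFrame; note the non-contiguous index labels
toy_df = pd.DataFrame(
    {
        0: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16], [1, 2, 3, 4, 5, 6, 7]],
        1: [5, 6, 5, 7],
    },
    index=[3, 4, 6, 8],
)

# Group by the length column; .indices holds row POSITIONS (not index labels),
# which is what later df.iloc[...] lookups expect
toy_mapping = {
    int(length): positions.tolist()
    for length, positions in toy_df.groupby(toy_df[1]).indices.items()
}

print(toy_mapping)
```

Because the positions are compatible with `iloc`, a mapping built this way could be used in the training loop in the same way as `sentence_mapping`.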


In [9]:
sent_lengths = df[1].unique()

print('List of unique sentence lengths across the whole dataset')
print(sorted(sent_lengths))

List of unique sentence lengths across the whole dataset
[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]


---

## Training a Vanilla RNN Language Model

In this notebook, we will train 2 language models using 2 different implementations of an RNN-based network architecture. Fundamentally, both networks implement the exact same training task of learning a language model as described and visualized on the lecture slides; see the figure below:

<img src='data/images/lecture-slide-03.png' width='90%' />

The first network implements this setup essentially 1:1 by reducing the architecture to its most basic components: it uses the most basic function to update the hidden state in each iteration and only a single linear layer at the end as the output layer. The figure below, again taken from the lecture slides, visualizes the basic computation steps behind the Vanilla RNN network.

<img src='data/images/lecture-slide-04.png' width='90%' />


### Create Model

You can check the file `src/rnn.py` for the implementation of the class `VanillaRnnLanguageModel`. It should be relatively easy to match the layers defined in the `__init__()` method and the processing steps in the `forward()` method to the image above. We have 2 parameters to specify: the dimension of the word embeddings (i.e., `embed_size`) and the dimension of the hidden state (i.e., `hidden_size`). As usual, feel free to change `embed_size` and `hidden_size` and see how this affects the quality of the resulting language model. As we don't use any pretrained word embeddings in this notebook -- but we could, of course, and feel free to extend the notebook accordingly -- you can also pick an embedding size other than 300.

In [None]:
embed_size, hidden_size = 300, 512
# Create model
vanilla_rnn_lm_model = VanillaRnnLanguageModel(vocab_size, embed_size, hidden_size).to(device)
# Define loss function
criterion = nn.CrossEntropyLoss()
# Define optimizer
optimizer = optim.Adam(vanilla_rnn_lm_model.parameters(), lr=0.001)
# Print model
print(vanilla_rnn_lm_model)
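
If you do not have `src/rnn.py` at hand, the sketch below shows what a minimal model with the same interface might look like. It is a hypothetical stand-in, not the actual `VanillaRnnLanguageModel`: the class name `TinyVanillaRnnLM`, the layer names `i2h` and `h2o`, and the choice of `tanh` are all assumptions. Only the call pattern (`model(token_ids, hidden)` returning logits and the new hidden state, plus `init_hidden(batch_size)`) follows how the model is used in this notebook.

```python
import torch
import torch.nn as nn


class TinyVanillaRnnLM(nn.Module):
    """Hypothetical stand-in for illustration; NOT the class from src/rnn.py."""

    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # One linear layer maps the concatenation [x_t ; h_{t-1}] to the new hidden state
        self.i2h = nn.Linear(embed_size + hidden_size, hidden_size)
        # A single linear output layer produces one logit per vocabulary entry
        self.h2o = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids, hidden):
        # token_ids: (batch,), hidden: (batch, hidden_size)
        embedded = self.embedding(token_ids)
        hidden = torch.tanh(self.i2h(torch.cat((embedded, hidden), dim=1)))
        logits = self.h2o(hidden)
        return logits, hidden

    def init_hidden(self, batch_size):
        # Start every sequence from an all-zero hidden state
        return torch.zeros(batch_size, self.hidden_size)


# Process a single time step for a toy batch of 4 token indices
toy_model = TinyVanillaRnnLM(vocab_size=50, embed_size=8, hidden_size=16)
toy_hidden = toy_model.init_hidden(4)
toy_logits, toy_hidden = toy_model(torch.tensor([1, 2, 3, 4]), toy_hidden)

print(toy_logits.shape, toy_hidden.shape)
```

Note that the same `i2h` and `h2o` weights are applied at every time step; the only thing that changes from step to step is the hidden state passed back in.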

### Train Model

The code cell below contains the training loop. As usual, we have a nested loop: the outer loop for the epochs and the inner loop for the batches. In the previous image you already saw that the input sequence and the target sequence for a training sample are almost the same, except:

* The target sequence is shifted to the left by 1 time step or token

* An end-of-sequence (here: `<EOS>`) token is added to the end of the target sequences

* A start-of-sequence (here: `<SOS>`) token is added to the front of the input sequences

**Important:** Of course, the chosen start-of-sequence and end-of-sequence tokens must match the ones defined as special tokens when generating the vocabulary; see the Data Preparation notebook.

If you want to continue training after the code cell finished, you can simply run the code cell below again.

In [None]:
num_epochs = 5
batch_size = 128

for epoch in range(num_epochs):
    epoch_loss = 0.0

    # Shuffle the list of sentence lengths (good practice)
    np.random.shuffle(sent_lengths)
    
    with tqdm(total=df.shape[0]) as progress_bar:
        # Iterate over all possible sentence lengths
        for sent_len in sent_lengths:

            # Get the indices of all sentences of length sent_len
            sent_indices = sentence_mapping[sent_len]
            sent_indices = np.array(sent_indices)

            # Shuffle array of sentence indices (good practice)
            np.random.shuffle(sent_indices)

            # Compute the number of batches
            num_batches = int(np.ceil(len(sent_indices) / batch_size))

            # Loop over all possible batches
            for batch_indices in np.array_split(sent_indices, num_batches):

                # Get sentence data based on the indices in the batch
                targets = np.array(df.iloc[batch_indices][0].to_list())

                # Since we build a language model, inputs and targets are (almost) the same;
                # adding the SOS/EOS tokens below shifts the targets one step to the left
                inputs = targets[:]

                # Add SOS token to the FRONT of all input sequences; add EOS token to the END of all target sequences
                sos = np.array([vocabulary.lookup_indices(['<SOS>'])]*(len(batch_indices))).reshape(1, -1).T
                eos = np.array([vocabulary.lookup_indices(['<EOS>'])]*(len(batch_indices))).reshape(1, -1).T            
                targets = np.hstack((targets, eos))
                inputs = np.hstack((sos ,inputs))

                # Convert data to tensor and move to GPU (if available)
                inputs = torch.Tensor(inputs).long().to(device)
                targets = torch.Tensor(targets).long().to(device)

                # Initialize the first hidden state h0
                hidden = vanilla_rnn_lm_model.init_hidden(len(batch_indices)).to(device)
                
                loss = 0
                for i in range(inputs.shape[1]):
                    output, hidden = vanilla_rnn_lm_model(inputs[:,i], hidden)
                    l = criterion(output, targets[:,i])
                    loss += l

                # Let PyTorch do its magic to calculate the gradients and update all trainable parameters                
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                # Keep track of overall epoch loss
                epoch_loss += loss.item()
                
                # Update progress bar
                progress_bar.update(len(batch_indices))

    print('[Epoch {}] Loss: {}'.format((epoch+1), epoch_loss))
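
The way inputs and targets are built in the loop above can be checked in isolation on a toy batch. The sketch below uses made-up token indices and hypothetical special-token indices `SOS_IDX` and `EOS_IDX`; in the notebook, the real values come from `vocabulary.lookup_indices()`.

```python
import numpy as np

# Hypothetical special-token indices; the real ones come from the vocabulary
SOS_IDX, EOS_IDX = 2, 3

# Toy batch of 2 sentences, each with 4 token indices
toy_sentences = np.array([[10, 11, 12, 13],
                          [20, 21, 22, 23]])

# One column of SOS / EOS indices, one row per sentence in the batch
sos_col = np.full((toy_sentences.shape[0], 1), SOS_IDX)
eos_col = np.full((toy_sentences.shape[0], 1), EOS_IDX)

toy_inputs = np.hstack((sos_col, toy_sentences))   # SOS prepended to the inputs
toy_targets = np.hstack((toy_sentences, eos_col))  # EOS appended to the targets

print(toy_inputs)
print(toy_targets)
```

At every position `i`, `toy_targets[:, i]` is the token that follows `toy_inputs[:, i]`, which is why the loop can compare the output at step `i` directly against `targets[:, i]`.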

### Save/Load Model

As retraining the model all the time can be tedious, we can save and load our model. If you trained the model for the first time, only saving is of course possible.

In [None]:
action = 'save'
#action = 'load'
#action = 'none'

if action == 'save':
    torch.save(vanilla_rnn_lm_model.state_dict(), 'data/models/language-model/imdb-vanilla-rnn-lm.pt')
elif action == 'load':
    vanilla_rnn_lm_model = VanillaRnnLanguageModel(vocab_size, embed_size, hidden_size).to(device)
    vanilla_rnn_lm_model.load_state_dict(torch.load('data/models/language-model/imdb-vanilla-rnn-lm.pt', map_location=device))
else:
    pass

### Generate Sentence Using the Language Model

At its core, generating new sentences is quite similar to the training loop. If we have a set of seed words, we first feed those into the model. After that, we use the last predicted word as the input for the next iteration, and so on. The image below is taken from the lecture slides and visualizes the idea of generating sentences using a trained language model.

<img src='data/images/lecture-slide-05.png' width='90%' />

We stop when we predict the EOS token -- or when we reach the maximum length `max_len`. The function `generate()` below implements this idea. You can check the implementation of the `forward()` method of the class `VanillaRnnLanguageModel` to see the similarity of the code.

In [None]:
def generate(model, seed_tokens, max_len=30):
    # Create initial input as SOS token + seed words
    inputs = np.array(vocabulary.lookup_indices(['<SOS>']) + vocabulary.lookup_indices(seed_tokens))

    # Convert input to tensor and move to GPU (if available)
    inputs = torch.Tensor(inputs).long().unsqueeze(0).to(device)
    
    # Initialize the first hidden state h0
    hidden = model.init_hidden(1).to(device)
    
    # Keep track of the predicted words, which form our final result
    # (copy the list so the caller's seed_tokens is not modified)
    tokens = list(seed_tokens)

    # Push the SOS token (and the optional seed words) through the RNN
    for i in range(inputs.shape[1]):
        logits, hidden = model(inputs[:,i], hidden)
    

    # Iterate over the time steps to predict the next word step by step
    for k in range(max_len):
        
        # Get index of word with the highest probability (no sampling here to keep it simple)
        _, topi = logits[-1].topk(1)
        word_index = topi.item()

        # If we predict the EOS token, we can stop
        if word_index == vocabulary.lookup_indices(['<EOS>'])[0]:
            break

        # Get the respective word/token and add it to the result list
        tokens.append(vocabulary.lookup_token(word_index))
            
        # Create the tensor for the last predicted word
        next_input = torch.tensor([word_index]).to(device)
        
        # Use last predicted word as input for the next iteration
        logits, hidden = model(next_input, hidden)
      
    # Return the result words/tokens as a string
    return ' '.join(tokens)

### Generating some Example Sentences

The example seed words are the ones used to generate the example on the lecture slide. Of course, the generated sentence will greatly depend on how long and with which hyperparameter values the language model has been trained. You can of course add your own examples or modify the existing ones. You are also encouraged to try the same example after multiple rounds of training. Particularly if you train the model for only very few epochs, you are very likely to see mostly *"funny"* sentences -- that is, sentences that might be mostly fine grammatically but don't really make much sense, or no sense at all.

In [None]:
print(generate(vanilla_rnn_lm_model, ['the', 'cast']))

In [None]:
print(generate(vanilla_rnn_lm_model, ['i', 'love', 'how']))

In [None]:
print(generate(vanilla_rnn_lm_model, ['my', 'dad']))

In [None]:
print(generate(vanilla_rnn_lm_model, ['this', 'was']))

In [None]:
print(generate(vanilla_rnn_lm_model, ['some', 'of', 'the']))

In [None]:
print(generate(vanilla_rnn_lm_model, ['the', 'script']))

---

## Training a (Slightly More) Advanced RNN Language Model

The Vanilla RNN Model was limited to the basics in terms of the function to update the hidden state and the use of additional hidden linear layers -- well, it had none of those. We already know that LSTMs and GRUs generally offer better approaches to update the hidden state, so we should make use of them. Thus, in the following, we will build and train a slightly more advanced RNN-based language model. Check out the implementation of the class `RnnLanguageModel` in the file `src/rnn.py`.

### Create Model

Similar to the class `RnnTextClassifier`, the class `RnnLanguageModel` is implemented in a rather flexible manner to easily change the network architecture by changing a series of input parameters. In fact -- and this shouldn't be a surprise -- the set of input parameters is very similar to the one of `RnnTextClassifier`. With the right parameter setting, we can also mimic the Vanilla RNN Model from above; see the code cell below for that.

In [10]:
params = {
    "device": device,                   # the class also generates sentences, so it must be able to move data to the correct device
    "vocab_size": vocab_size,           # the size of the vocabulary determines the input size of the embedding
    "embed_size": 300,                  # size of the word embeddings
    "rnn_cell": "GRU",                  # in practice, GRU or LSTM usually outperforms a plain RNN
    "rnn_num_layers": 1,                # 1 or 2 layers are most common; more layers rarely yield any benefit
    "rnn_hidden_size": 512,             # size of the hidden state
    "rnn_dropout": 0.0,                 # only relevant if rnn_num_layers > 1
    "linear_hidden_sizes": [1024],      # list of sizes of subsequent hidden layers; can be [] (empty)!
    "linear_dropout": 0.5,              # if hidden linear layers are used, we can also include Dropout
}

##
## Side note: the parameter settings below essentially correspond to the Vanilla RNN Model
##
#params = {
#    "device": device,
#    "vocab_size": vocab_size,
#    "embed_size": 300,
#    "rnn_cell": "RNN",
#    "rnn_num_layers": 1,
#    "rnn_hidden_size": 512,
#    "rnn_dropout": 0.0,
#    "linear_hidden_sizes": [],
#    "linear_dropout": 0.0,
#}


# Define model parameters
params = Dict2Class(params)
# Define model
rnn_lm_model = RnnLanguageModel(params).to(device)
# Define optimizer
optimizer = torch.optim.Adam(rnn_lm_model.parameters(), lr=0.001)
# Define loss function
criterion = nn.CrossEntropyLoss()
# Print model
print(rnn_lm_model)

RnnLanguageModel(
  (embedding): Embedding(20004, 300)
  (rnn): GRU(300, 512, batch_first=True)
  (linears): ModuleList(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=512, out_features=1024, bias=True)
    (2): ReLU()
  )
  (out): Linear(in_features=1024, out_features=20004, bias=True)
)
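
To see how the printed layers might fit together, here is a minimal sketch of a module with the same layer structure. It is a hypothetical stand-in, not the actual `RnnLanguageModel` from `src/rnn.py`: the class name `TinyGruLM` and the details of its `forward()` method are assumptions. Only the layer names and their order are taken from the printout above, and the toy sizes are much smaller than the real ones.

```python
import torch
import torch.nn as nn


class TinyGruLM(nn.Module):
    """Hypothetical stand-in mirroring the printed layer structure; NOT src/rnn.py."""

    def __init__(self, vocab_size, embed_size, hidden_size, linear_size, dropout=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.linears = nn.ModuleList([
            nn.Dropout(p=dropout),
            nn.Linear(hidden_size, linear_size),
            nn.ReLU(),
        ])
        self.out = nn.Linear(linear_size, vocab_size)

    def forward(self, token_ids, hidden):
        # token_ids: (batch, seq_len); hidden: (num_layers=1, batch, hidden_size)
        x = self.embedding(token_ids)
        x, hidden = self.rnn(x, hidden)      # x: (batch, seq_len, hidden_size)
        for layer in self.linears:
            x = layer(x)
        return self.out(x), hidden           # logits: (batch, seq_len, vocab_size)

    def init_hidden(self, batch_size):
        return torch.zeros(1, batch_size, self.hidden_size)


# Run a toy batch of 4 sequences with 6 tokens each through the whole network
toy_gru_lm = TinyGruLM(vocab_size=50, embed_size=8, hidden_size=16, linear_size=32)
toy_tokens = torch.randint(0, 50, (4, 6))
toy_gru_logits, toy_gru_hidden = toy_gru_lm(toy_tokens, toy_gru_lm.init_hidden(4))

print(toy_gru_logits.shape, toy_gru_hidden.shape)
```

In contrast to the vanilla model, a module like this processes the whole sequence in one call, so no explicit loop over time steps is needed.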


### Train Model

The training loop in the code cell below is almost identical to the one above, particularly the parts preparing the input and target sequences. However, running the model and computing the loss is a bit different due to the implementation and therefore use of the class `RnnLanguageModel`. As before, if you want to continue training after the code cell finished, you can simply run the code cell below again.

In [None]:
num_epochs = 5
batch_size = 128

for epoch in range(num_epochs):
    epoch_loss = 0.0

    # Shuffle the list of sentence lengths (good practice)
    np.random.shuffle(sent_lengths)
    
    with tqdm(total=df.shape[0]) as progress_bar:
        # Iterate over all possible sentence lengths
        for sent_len in sent_lengths:

            # Get the indices of all sentences of length sent_len
            sent_indices = sentence_mapping[sent_len]
            sent_indices = np.array(sent_indices)

            # Shuffle array of sentence indices (good practice)
            np.random.shuffle(sent_indices)

            # Compute the number of batches
            num_batches = int(np.ceil(len(sent_indices) / batch_size))

            # Loop over all possible batches
            for batch_indices in np.array_split(sent_indices, num_batches):

                # Get sentence data based on the indices in the batch
                targets = np.array(df.iloc[batch_indices][0].to_list())

                # Since we build a language model, inputs and targets are (almost) the same;
                # adding the SOS/EOS tokens below shifts the targets one step to the left
                inputs = targets[:]

                # Add SOS token to the FRONT of all input sequences; add EOS token to the END of all target sequences
                sos = np.array([vocabulary.lookup_indices(['<SOS>'])]*(len(batch_indices))).reshape(1, -1).T
                eos = np.array([vocabulary.lookup_indices(['<EOS>'])]*(len(batch_indices))).reshape(1, -1).T            
                targets = np.hstack((targets, eos))
                inputs = np.hstack((sos ,inputs))

                # Convert data to tensor and move to GPU (if available)
                inputs = torch.Tensor(inputs).long().to(device)
                targets = torch.Tensor(targets).long().to(device)
        
                # Initialize hidden states w.r.t. batch size (batches might not always be full)
                hidden = rnn_lm_model.init_hidden(inputs.shape[0]) 
        
                outputs, _ = rnn_lm_model(inputs, hidden)
                
                loss = criterion(outputs.permute(0,2,1), targets)

                ### Pytorch magic! ###
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                # Keep track of overall epoch loss
                epoch_loss += loss.item()                
                
                # Update progress bar
                progress_bar.update(len(batch_indices))
    
    print('[Epoch {}] Loss: {}'.format((epoch+1), epoch_loss))

100%|██████████████████████████████████████████████████████████████████████████████████████| 961348/961348 [05:58<00:00, 2683.20it/s]


[Epoch 1] Loss: 34369.55098962784


 63%|██████████████████████████████████████████████████████▌                               | 610058/961348 [03:44<02:18, 2539.06it/s]
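
A short note on the `outputs.permute(0,2,1)` call in the loop above: `nn.CrossEntropyLoss` expects the class dimension at position 1, so logits of shape `(batch, seq_len, vocab)` have to be rearranged to `(batch, vocab, seq_len)` when the targets have shape `(batch, seq_len)`. The sketch below uses random toy tensors (all names and sizes are made up) to illustrate that this should give the same value as flattening all time steps into one long batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, seq_len, vocab = 4, 6, 50
toy_outputs = torch.randn(batch, seq_len, vocab)          # (batch, seq_len, vocab)
toy_targets = torch.randint(0, vocab, (batch, seq_len))   # (batch, seq_len)

toy_criterion = nn.CrossEntropyLoss()

# Variant 1: move the class dimension to position 1 -> (batch, vocab, seq_len)
loss_permuted = toy_criterion(toy_outputs.permute(0, 2, 1), toy_targets)

# Variant 2: flatten all time steps into one long batch of predictions
loss_flat = toy_criterion(toy_outputs.reshape(-1, vocab), toy_targets.reshape(-1))

print(loss_permuted.item(), loss_flat.item())
```

Both variants average over all tokens in the batch. The vanilla training loop, in contrast, sums the per-step losses over the time steps, so the epoch losses reported by the two loops are not directly comparable.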

### Save/Load Model

With the code cell below, we can also save and load this more advanced language model.

In [None]:
action = 'save'
#action = 'load'
#action = 'none'

if action == 'save':
    torch.save(rnn_lm_model.state_dict(), 'data/models/language-model/imdb-rnn-lm.pt')
elif action == 'load':
    rnn_lm_model = RnnLanguageModel(params).to(device)
    rnn_lm_model.load_state_dict(torch.load('data/models/language-model/imdb-rnn-lm.pt'))
else:
    pass

### Generate Sentence Using the Language Model

If you had a look at the implementation of the class `RnnLanguageModel`, you will have noticed that it already includes a `generate()` method to generate sentences based on a given set of seed words. This means we can now simply call this class method to again look at some example sentences our model will generate. The code cell below uses the same seed tokens as above, but you can always add or modify the examples.

In [None]:
print(rnn_lm_model.generate(['the', 'cast'], vocabulary))

In [None]:
print(rnn_lm_model.generate(['i', 'love', 'how'], vocabulary))

In [None]:
print(rnn_lm_model.generate(['my', 'dad'], vocabulary))

In [None]:
print(rnn_lm_model.generate(['this', 'was'], vocabulary))

In [None]:
print(rnn_lm_model.generate(['some', 'of', 'the'], vocabulary))

In [None]:
print(rnn_lm_model.generate(['the', 'script'], vocabulary))

---

## Discussion

This notebook -- together with the Data Preparation notebook -- has shown that building and training an RNN-based language model is actually not that difficult. However, we have made several simplifying assumptions to focus on the core steps rather than on training a competitive language model. Here are a couple of points to consider when trying to build more practical, large(r)-scale language models:

* Using the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) to train a language model has, of course, its limitations. Firstly, the dataset is very small for this task. Secondly, and maybe more importantly, using a dataset from a specific domain (i.e., movie reviews) limits the model's applicability to this domain. However, the focus here is to go through some of the basic steps and not to build a state-of-the-art language model.

* The goal in this notebook was to build a simple language model to generate single sentences, not paragraphs or even longer texts. Hence, each data sample we generate reflects a single sentence. For training very large language models to generate paragraphs, samples are typically chunks of text containing multiple sentences that might even be cut off arbitrarily.

* The dataset used in this notebook was small enough that we could easily load it into the main memory. However, datasets to build proper language models are huge and would not fit into the memory all at once. In this case, some logic is required to first split the whole dataset into multiple chunks (e.g., different files) and then iterate over all chunks within each epoch.

* In practice, large language models are typically trained in a distributed setting such as computing clusters housing many GPUs. Deep learning frameworks such as PyTorch and TensorFlow support distributed training and inference out of the box, but again, some additional logic is required to facilitate this.

* You might have already noticed that using the same seed words will always generate the same sentence for the same model. The reason is that the `generate()` methods of the 2 implementations will always pick as the next word/token the one with the highest probability. To add some variety, we could change the methods to randomly sample a word/token, for example, from the 10 tokens with the highest probabilities.
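
The sampling idea from the last point could be sketched as follows. The helper `sample_top_k()` is hypothetical and not part of `src/rnn.py`; it assumes a 1-dimensional tensor of logits (one score per vocabulary entry), such as `logits[-1]` in the `generate()` function above.

```python
import torch


def sample_top_k(logits, k=10):
    """Sample one token index from the k highest-scoring entries of a 1-D logits tensor."""
    top_values, top_indices = logits.topk(k)
    # Turn the k scores into probabilities (renormalized over the k candidates only)
    probs = torch.softmax(top_values, dim=-1)
    # Draw one of the k candidates according to these probabilities
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice].item()


torch.manual_seed(0)
toy_vocab_logits = torch.randn(100)  # random stand-in for the model output
samples = [sample_top_k(toy_vocab_logits, k=10) for _ in range(5)]

print(samples)
```

In the `generate()` function of this notebook, the line that picks the most likely word via `topk(1)` could then be replaced by something like `word_index = sample_top_k(logits[-1])`. With `k=1`, this reduces to the current greedy behavior.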

---

## Summary

Recurrent Neural Networks (RNNs) have proven to be highly effective for training language models. RNNs are particularly well-suited for language modeling due to their ability to handle sequential data. Language is inherently sequential, and RNNs can capture the dependencies and context within text by maintaining an internal hidden state that evolves as it processes each word or character in a sequence.

When training language models with RNNs, the models are typically fed with a sequence of input tokens, such as words or characters. The RNN processes the input sequence step by step, updating its hidden state at each time step. The hidden state acts as a memory, capturing information about the preceding context. The RNN predicts the next word or character in the sequence based on the current input and the previous hidden state. The model is trained by comparing its predicted output with the actual next element in the sequence, and the gradients of the loss function are backpropagated through time to update the model's parameters.

By leveraging the recurrent nature of RNNs, these models can capture both short-term and long-term dependencies in language. This allows them to generate coherent and contextually appropriate text, as well as understand and process complex linguistic structures. RNN-based language models have been successfully applied to various tasks, such as text generation, machine translation, sentiment analysis, and speech recognition. Furthermore, variants of RNNs, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have been introduced to address some of the challenges in training RNNs, such as the vanishing/exploding gradient problem. These advancements have further improved the effectiveness of RNNs for training language models, making them a cornerstone of natural language processing research and applications.