<img src="https://drive.google.com/uc?id=1DvKhAzLtk-Hilu7Le73WAOz2EBR5d41G" width="500"/>


---

# RNNs and LSTMs (Solutions):




#### A few imports and functions before we get started

In [1]:
!pip install pycm livelossplot
%matplotlib inline


In [2]:
import random
import time
import math

import glob
import string
import unicodedata
import re
from collections import Counter

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torch.autograd import Variable
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

from livelossplot import PlotLosses

import pandas as pd


def set_seed(seed):
    """
    Use this to set ALL the random seeds to a fixed value and take out any randomness from cuda kernels
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    torch.backends.cudnn.benchmark = False  # disable the cuDNN auto-tuner, whose algorithm choice can vary between runs
    torch.backends.cudnn.enabled   = False  # do not use cuDNN kernels at all

    return True

device = 'cpu'
if torch.cuda.device_count() > 0 and torch.cuda.is_available():
    print("Cuda installed! Running on GPU!")
    print(torch.cuda.get_device_name())
    device = 'cuda'
else:
    print("No GPU available! Running on CPU")

No GPU available! Running on CPU


#### Mounting Google Drive for storage

In [3]:
# from google.colab import drive
# drive.mount('/content/gdrive/')

## Word-level text generation with RNNs


[Let's see if you can differentiate between machine-generated text and human-written text.](http://goopt2.xyz/)


We will use RNNs to build our generator network, but you can also consider using LSTMs. Their gating mechanism controls which information is kept, updated or discarded at each time step, and they have been shown to outperform vanilla RNNs for text generation.

### Downloading the data and some utility functions

In [4]:
# download the data
# -p: no error if the directory already exists; -nc: skip the download if the file is already there
!mkdir -p data_gen
!cd data_gen && wget -nc https://raw.githubusercontent.com/amoudgl/short-jokes-dataset/master/data/reddit-cleanjokes.csv

filename = 'data_gen/reddit-cleanjokes.csv'



#### Function to read the lines of a file as a list of words

In [5]:
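def readFile_csv(filename, header):
    """
    Read a csv file and return the text in column `header` as a list of words
    """
    dataframe = pd.read_csv(filename)
    # join all entries of the column into one string, then split it on single spaces
    data = dataframe[header].str.cat(sep=' ')
    data = data.split(' ')
    return data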

filename, header = 'data_gen/reddit-cleanjokes.csv', 'Joke'


#### Data inspection and some more utility functions

We use a `Counter` from the `collections` module to create a dictionary-like object in which the words are stored as the keys and their counts are the values. You can read more about collections [here](https://docs.python.org/3/library/collections.html).
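
As a quick illustration of how `Counter` behaves, here is a minimal sketch on a toy word list. The list `toy_words` is made up for this example and is not part of the jokes dataset.

```python
from collections import Counter

# hypothetical toy data, only used to illustrate Counter
toy_words = ["knock", "knock", "who", "is", "there", "who"]

toy_counter = Counter(toy_words)

# a Counter behaves like a dictionary mapping each word to its count
print(toy_counter["knock"])

# most_common() returns (word, count) pairs sorted by decreasing count
print(toy_counter.most_common(2))

# the keys are the unique words
print(len(toy_counter))
```

The same calls work on the full list of words returned by `readFile_csv`.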

In [6]:
data = readFile_csv(filename, header)
print(len(data))
# create a Counter for our data
data_counter = Counter(data)
# find all unique words
unique_words = set(data)
# print the list of unique words
print(unique_words)
# print the number of unique words
print(len(unique_words))

23914
{'', 'feeling', 'Theresa', 'cry,', 'out.', 'cloud', 'slow', 'degrees.', 'leg.', 'bartender.', 'closer', 'Asian', 'rash?', 'free', 'Humblebee.', 'Thunderwear!', 'skip', 'sink', 'me:', 'Mercury...', 'grey?', 'THERE!', 'Humpty', 'reception', 'constipated', '**Person', 'attire.', 'trunk.', 'citizens', 'proofing', '&gt;Nacho', 'Some', 'ceiling', 'parts', 'prefer', 'alright', 'Lucifer', 'At', 'locomotives?', 'murder?', 'gas', 'themselves.', 'even?', 'left', "victim's", 'scallion.', 'MARSHian', 'say...', 'stick!', 'meat?', 'make?', 'add', 'person', '"I\'ll', '**Tim', 'winter?', 'knock*', 'bunny', 'shows', 'boomer-WRONG!', 'bucket~~', 'tuba', 'Shanghai.', 'forty', 'else', 'expert', 'eight', 'inches', 'Sauron', 'home?', 'Ice', 'my-oh-my!', 'blind', 'shape.', 'mop..."', 'feet?', 'goatherd', 'direction', 'hole', '9', "Picard's", 'bought', 'bars', 'JaPAN!', 'clauset.', '**Jimmy', 'Three', 'like?', 'MOOOOOOOOOOOOOOOO!!!!', '"Breathe,', '"p"', 'Up', 'Another', 'page."', 'highly', 'you?', 'spie

In [7]:
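# create a word-to-index dictionary
word_to_idx = {word: idx for idx, word in enumerate(unique_words)}
# create an index-to-word dictionary
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# find the indices of all words in the dataset
data_idxs = [word_to_idx[word] for word in data]
# print the indices (first 20 only, to keep the output short)
print(data_idxs[:20])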


Our inputs will be sequences of words of a chosen length, and the targets will be the same sequences shifted by one word, so that the network learns to predict the next word at every position. We therefore split the text into groups of consecutive words.

For example, consider the sentence: **What did the bartender say to the jumper cables?** For a chosen sequence length of 4:
- input sequence: ['What', 'did', 'the', 'bartender']
- target sequence: ['did', 'the', 'bartender', 'say']

The target sequence is always one time step ahead of the input, and each pair of input and target sequences gives one data point.

In [8]:
sequence_length= 4
index = 0
# transform the words into a tensor
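
The windowing described above can be sketched on the example sentence alone. This is a minimal illustration: the vocabulary `vocab` is built from this one sentence only (index = order of first appearance), which is an assumption made for the example and not the mapping used for the full dataset.

```python
# the example sentence from the text, split into words
sentence = "What did the bartender say to the jumper cables?".split(" ")
sequence_length = 4

# hypothetical vocabulary for this sentence only: index = order of first appearance
vocab = {}
for word in sentence:
    vocab.setdefault(word, len(vocab))
indices = [vocab[word] for word in sentence]

# slide a window over the text; the target is the window shifted by one word
pairs = []
for start in range(len(indices) - sequence_length):
    inputs = indices[start:start + sequence_length]
    targets = indices[start + 1:start + sequence_length + 1]
    pairs.append((inputs, targets))

first_inputs, first_targets = pairs[0]
print(sentence[:sequence_length], first_inputs, first_targets)
print(len(pairs))
```

Wrapping each list in `torch.tensor(..., dtype=torch.long)` gives the integer tensors that `nn.Embedding` expects.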


## Creating a custom TensorDataset that allows us to perform the following:
- Take as inputs a list of all words in our text
- Generate a dictionary of all unique words and their counts, sorted in descending order
- Return a sample of input and output data

In [9]:
# a = zip([1,2,3],["t","b","a","a"])
# print(list(a))
# Counter(["t","b","a","a"])
# for i, l in enumerate("abracadabra"):
#     print(i, l)

#list of words
words = ["abra", "ca", "dabra", "ca"]

# words to idx (a repeated word keeps the index of its last occurrence, so the indices are not contiguous)
d = dict([(l,i) for i, l in enumerate(words)])
print(d)

# idx instead of words
# idx_list = []
# for i in words:
#     idx_list.append(d[i])
[d[i] for i in words]
# print(idx_list)

{'abra': 0, 'ca': 3, 'dabra': 2}


[0, 3, 2, 3]

In [11]:
class WordsTensorDataset(TensorDataset):
    def __init__(self, data_list, sequence_length=4):
        """
        Args:
            data_list (list): A list of all the words in the file
            sequence_length (int): the number of words in each input sample, and output sample
        """
        self.sequence_length = sequence_length
        self.data_list = data_list
        self.unique_words = self.get_unique_words()

        # create a dictionary of mappings of words to indices
        self.word_to_idx = {word: idx for idx, word in enumerate(self.unique_words)}

        # create a dictionary of mappings of indices to words
        self.idx_to_word = {idx: word for word, idx in self.word_to_idx.items()}

        # represent every word of the text (not just the unique ones) by its index
        self.words_as_idxs = [self.word_to_idx[word] for word in self.data_list]

    def get_unique_words(self):
        # unique words sorted by decreasing count, so the most frequent word gets index 0
        word_counts = Counter(self.data_list)
        return sorted(word_counts, key=word_counts.get, reverse=True)

    def __len__(self):
        return len(self.words_as_idxs) - self.sequence_length

    def __getitem__(self, idx):
        # nn.Embedding and nn.CrossEntropyLoss expect integer (long) indices
        sample_input = torch.tensor(self.words_as_idxs[idx:idx+self.sequence_length], dtype=torch.long)
        sample_output = torch.tensor(self.words_as_idxs[idx+1:idx+self.sequence_length+1], dtype=torch.long)

        return sample_input, sample_output

In [12]:
words_dataset = WordsTensorDataset(data)
words_dataloader = DataLoader(words_dataset)
next(iter(words_dataloader))

In [13]:
# create a dataset
# use the dataset with a data loader

# next(iter(words_dataloader))

## Implementing a Text "Generator" Network with RNNs

We will use PyTorch's built-in embedding layer for this exercise. [nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) maps each word index to a learnable vector, so it acts as a look-up table. You will learn more about embeddings in the next lecture.
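
To make the look-up table idea concrete, here is a minimal sketch. The sizes `num_unique_words` and `embedding_dim` are hypothetical values chosen for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical sizes, chosen only for this illustration
num_unique_words, embedding_dim = 10, 3
embedding = nn.Embedding(num_unique_words, embedding_dim)

# a batch of 2 sequences of 4 word indices; the indices must be integers (long)
word_indices = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]], dtype=torch.long)
embedded = embedding(word_indices)

# each index is replaced by the corresponding row of the weight matrix
print(embedded.shape)
same_as_lookup = torch.equal(embedded[0, 2], embedding.weight[2])
print(same_as_lookup)
```

Passing a float tensor instead of a long tensor raises a `RuntimeError`, so the word indices should be created with an integer dtype.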

To implement our RNN, we will start by creating an `nn.Module` that will represent a single RNN cell. This cell will update its hidden state by:

- Applying a linear, fully connected layer to the cell input.
- Applying a linear, fully connected layer to the previous hidden state.
- Applying a non-linear activation to the result.
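
The three steps above can be checked against PyTorch's own `nn.RNNCell`. This is a sketch under assumed sizes: the reference cell is only used here to verify the manual computation and is not the cell we implement below.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical sizes for the illustration
batch_size, input_size, hidden_size = 2, 5, 3
x = torch.randn(batch_size, input_size)        # cell input at one time step
h_prev = torch.randn(batch_size, hidden_size)  # previous hidden state

# PyTorch's reference cell, used here only to check the manual computation
cell = nn.RNNCell(input_size, hidden_size, nonlinearity="tanh")

with torch.no_grad():
    # linear layer applied to the cell input
    from_input = x @ cell.weight_ih.T + cell.bias_ih
    # linear layer applied to the previous hidden state
    from_hidden = h_prev @ cell.weight_hh.T + cell.bias_hh
    # non-linear activation of the sum gives the new hidden state
    h_manual = torch.tanh(from_input + from_hidden)
    h_reference = cell(x, h_prev)

print(torch.allclose(h_manual, h_reference, atol=1e-6))
```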

In [14]:
class RNNCell(nn.Module):

    def __init__(self, input_size, hidden_size, bias=True, activation="tanh"):
        super(RNNCell, self).__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias

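        # select appropriate activation function
        if activation == "tanh":
            self.activation = nn.Tanh()
        elif activation == "relu":
            self.activation = nn.ReLU()
        else:
            raise ValueError(f"Unknown activation: {activation}")
        # create linear layer from input to hidden state
        self.fc1 = nn.Linear(self.input_size, self.hidden_size, bias=self.bias)
        # create linear layer from previous to current hidden state
        self.fc2 = nn.Linear(self.hidden_size, self.hidden_size, bias=self.bias)

        # initialise the parameters
        self.reset_parameters()

    def reset_parameters(self):
        # initialise all weights and biases uniformly in
        # [-1/sqrt(hidden_size), 1/sqrt(hidden_size)], as PyTorch's built-in RNN layers do
        std = 1.0 / np.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)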

    def forward(self, input, h):
        # map from input to hidden state space
        input = self.fc1(input)
        
        # map from previous to current hidden state space
        h = self.fc2(h)
        
        # calculate new hidden state by applying activation
        h = self.activation(input + h)
        
        return h

Once we have implemented a single RNN cell, we can write another `nn.Module` for our RNN network: it will contain one or more cells, stacked to form multiple layers, and will apply them to each element of a given input sequence.

The output of the RNN is obtained by applying a final fully connected layer to the hidden state of the last layer at every time step.

In [32]:
class RNN(nn.Module):

    def __init__(self, input_size, hidden_size, num_layers, output_size, bias=False, activation='tanh'):
        super(RNN, self).__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        self.bias = bias

        # create a list of modules
        self.rnn_cells = nn.ModuleList()

        # create each layer in the network
        # the first layer receives the network input, later layers receive
        # the hidden state of the layer below
        for i in range(self.num_layers):
            self.rnn_cells.append(
                RNNCell(
                    self.input_size if i == 0 else self.hidden_size,
                    self.hidden_size,
                    self.bias,
                    activation
                )
            )
        
                
        # create a final linear layer from hidden state to network output
        self.h20 = nn.Linear(self.hidden_size, self.output_size)
        
    def init_hidden(self,  batch_size=1):
        # initialise the hidden state
        return torch.zeros(self.num_layers, batch_size, self.hidden_size, requires_grad=False).to(device)

    def forward(self, input, h0):
        # Input of shape (batch_size, sequence length, input_size)
        # Output of shape (batch_size, sequence length, output_size)

        outs = []
        # number of elements in the sequence
        step_size = input.size(1)
        # current hidden state of each layer, starting from h0
        hidden = []
        for layer in range(self.num_layers):
            hidden.append(h0[layer, :, :])
        
        # iterate over all elements in the sequence
        for t in range(step_size):
            # iterate over each layer
            for layer in range(self.num_layers):
                # apply each layer
                if layer == 0:
                    hidden_l = self.rnn_cells[layer](input[:, t, :], hidden[layer])
                else:
                    hidden_l = self.rnn_cells[layer](hidden[layer-1], hidden[layer])
                # take care to apply the layer to the input or the
                # previous hidden state depending on the layer number

                # store the hidden state of each layer
                hidden[layer] = hidden_l
                
            # the hidden state of the last layer needs to be recorded
            # to be used in the output
            outs.append(hidden_l)
        # calculate output for each element in the sequence
        out = torch.stack([self.h20(out) for out in outs], dim=1)
        
        return out

We can now use our RNN network, together with `nn.Embedding`, to form our text-generating network.

In [33]:
class RNN_GEN(nn.Module):

    def __init__(self, input_size, embedding_dim, hidden_size, num_layers, num_unique_words):
        super(RNN_GEN, self).__init__()

        # input_size is the size of the RNN input, so it must equal embedding_dim
        self.input_size = input_size
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_unique_words = num_unique_words

        # add a nn.Embedding
        self.embedding = nn.Embedding(self.num_unique_words, self.embedding_dim)
        # add our RNN (one output score per unique word)
        self.rnn = RNN(self.input_size, self.hidden_size, self.num_layers, self.num_unique_words,
                       bias=False, activation='tanh')

    def forward(self, x):
        # initialise hidden state
        batch_size = x.size(0)
        hidden = self.rnn.init_hidden(batch_size)
        # store the word embeddings
        embedded = self.embedding(x)
        # apply the RNN
        output = self.rnn(embedded, hidden)
        return output

def count_trainable_parameters(model):
    return sum([p.numel() for p in model.parameters() if p.requires_grad])

In [36]:
# build a tiny model to check that the constructor works
model = RNN_GEN(
    input_size=4,
    embedding_dim=4,
    hidden_size=4,
    num_layers=1,
    num_unique_words=4)

Define the train and predict functions:

In [40]:
def train_rnn_gen(model, optimizer, criterion, dataloader):
    # set model to train mode
    model.train()
    # initialise the loss
    train_loss = 0
    # loop over dataset
    for i, (X, y) in enumerate(dataloader):
        # send data to device as integer (long) word indices
        X, y = X.long().to(device), y.long().to(device)
        # reset the gradients
        optimizer.zero_grad()
        # get output of shape (batch_size, sequence length, number of classes)
        outputs = model(X)
        # compute the loss (change shape as crossentropy takes input as batch_size, number of classes, d1, d2, ...)
        loss = criterion(outputs.permute(0, 2, 1), y)
        # backpropagate
        loss.backward()
        # update weights
        optimizer.step()
        # accumulate the loss, detached from the computational graph
        train_loss += loss.detach()

    return train_loss/len(dataloader)


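def predict_rnn_gen(dataset, model, text, next_words=100):
    # set model to evaluation mode
    model.eval()
    # every word in `text` must be in the vocabulary, otherwise a KeyError is raised
    words = text.split(' ')
    # the index of a word is its position in dataset.unique_words
    word_to_idx = {w: i for i, w in enumerate(dataset.unique_words)}

    with torch.no_grad():
        # loop over words
        for _ in range(next_words):
            # take the last sequence_length words and send them to device
            idxs = [word_to_idx[w] for w in words[-dataset.sequence_length:]]
            x = torch.tensor([idxs], dtype=torch.long).to(device)
            # compute output of shape (1, sequence length, number of unique words)
            y_pred = model(x)

            # take last output
            last_logits = y_pred[0, -1]

            # obtain probability vector for last output
            p = torch.softmax(last_logits, dim=0)
            # sample probability vector to get index in dataset
            word_index = torch.multinomial(p, 1).item()

            # get word corresponding to the sampled index
            words.append(dataset.unique_words[word_index])

    return ' '.join(words)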
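
The shape change mentioned in the training function's loss comment can be checked in isolation. The sketch below uses random tensors with hypothetical sizes and compares two equivalent ways of feeding a `(batch_size, sequence length, number of classes)` output to `nn.CrossEntropyLoss`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical sizes for the illustration
batch_size, sequence_length, num_classes = 2, 4, 7
outputs = torch.randn(batch_size, sequence_length, num_classes)         # network output
targets = torch.randint(0, num_classes, (batch_size, sequence_length))  # next-word indices

criterion = nn.CrossEntropyLoss()

# option 1: move the class dimension to position 1 -> (batch_size, num_classes, sequence_length)
loss_permuted = criterion(outputs.permute(0, 2, 1), targets)

# option 2: flatten batch and sequence dimensions -> (batch_size * sequence_length, num_classes)
loss_flat = criterion(outputs.reshape(-1, num_classes), targets.reshape(-1))

print(torch.allclose(loss_permuted, loss_flat))
```

Both options average the loss over every word in the batch, so either can be used in the training function.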


### Hyperparameters, model initialisation and the training loop (Afternoon exercise)

Let's train our network for a fixed number of epochs:

In [None]:
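device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_size = 128
n_hidden = 128
n_layers = 1
embedding_dim = input_size
n_unique_words = len(words_dataset.unique_words)
batch_size = 256
sequence_length = 4

lr = 5e-3
momentum = 0.5
n_epochs = 20

# create RNN network
rnn_gen = RNN_GEN(input_size, embedding_dim, n_hidden, n_layers, n_unique_words).to(device)
print(f'The model has {count_trainable_parameters(rnn_gen):,} trainable parameters')

# select a criterion
criterion = nn.CrossEntropyLoss()
# create an optimiser (SGD is one choice that uses the lr and momentum defined above)
optimizer = torch.optim.SGD(rnn_gen.parameters(), lr=lr, momentum=momentum)

# create our dataset
words_dataset = WordsTensorDataset(data, sequence_length)
# use a data loader
words_dataloader = DataLoader(words_dataset, batch_size=batch_size, shuffle=True)

# Keep track of losses for plotting
liveloss = PlotLosses()
for epoch in range(n_epochs):
    logs = {}
    # train the RNN
    train_loss = train_rnn_gen(rnn_gen, optimizer, criterion, words_dataloader)

    print(epoch,train_loss)

    logs['log loss'] = float(train_loss)
    liveloss.update(logs)
    liveloss.draw()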

#### Let's try predicting with our network

And, let's see how the network performs:

In [None]:
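# use the RNN to predict some text, seeded with the first words of the dataset
seed_text = ' '.join(data[:sequence_length])
print(predict_rnn_gen(words_dataset, rnn_gen, text=seed_text, next_words=20))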

## Implementing a Text "Generator" Network with LSTMs

Simple RNNs have trouble learning long-range dependencies. Suppose we try to predict the last word in the sentence *There are so many clouds in the sky*. This is easy for a simple RNN, as the necessary context word *clouds* appeared only a few words earlier.

However, look at the following example: *I grew up in France... I speak fluent French.* The distance between the contextual clue *France* and the predicted word *French* can be arbitrarily long. Furthermore, vanishing and exploding gradients during backpropagation degrade the performance of RNNs: in a very long sequence, information at the start of the sequence may have almost no impact at the end of the sequence.
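
The vanishing gradient effect can be seen in a toy setting. The sketch below assumes a scalar "RNN" with a single recurrent weight `w = 0.5` and no input, which is a deliberately simplified, hypothetical model: it measures how much the final hidden state depends on the initial one.

```python
import torch

# hypothetical scalar RNN: h_t = tanh(w * h_{t-1}), with a recurrent weight below 1
w = torch.tensor(0.5)

def gradient_wrt_initial_state(num_steps):
    # track gradients with respect to the initial hidden state
    h0 = torch.tensor(0.1, requires_grad=True)
    h = h0
    for _ in range(num_steps):
        h = torch.tanh(w * h)
    # backpropagate through all time steps
    h.backward()
    return h0.grad.item()

grad_short = gradient_wrt_initial_state(2)
grad_long = gradient_wrt_initial_state(50)
print(grad_short, grad_long)
```

Each time step multiplies the gradient by a factor of at most `w`, so after 50 steps the initial state has practically no influence on the final one.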

Let's re-implement our previous example using an LSTM instead of a vanilla RNN. To do this, we need to start by implementing the LSTM cell. This cell updates both its hidden state and its cell state using the gates that we have seen in class:

- Input gate
- Forget gate
- Output gate
- Candidate update
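
Written out, one common formulation of these four operations and the resulting state updates is given below. This is a sketch in standard notation: the symbols ($x_t$ for the input, $h_t$ for the hidden state, $c_t$ for the cell state, $W$ and $b$ for weights and biases, $\sigma$ for the sigmoid and $\odot$ for the element-wise product) are introduced here and may differ from the notation used in class.

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) && \text{(output gate)} \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) && \text{(candidate update)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$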

In [None]:
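class LSTMCell(nn.Module):

    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTMCell, self).__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias

        # we will streamline the implementation of the LSTM by combining the
        # weights for all 4 operations (input gate, forget gate, output gate, candidate update)
        # create a linear layer to map from input to hidden space
        self.x2h = nn.Linear(self.input_size, 4 * self.hidden_size, bias=self.bias)
        # create a linear layer to map from previous to current hidden space
        self.h2h = nn.Linear(self.hidden_size, 4 * self.hidden_size, bias=self.bias)

        # initialise the parameters
        self.reset_parameters()

    def reset_parameters(self):
        # uniform initialisation in [-1/sqrt(hidden_size), 1/sqrt(hidden_size)], as in the RNN cell
        std = 1.0 / np.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, input, h, c):
        # apply the weights to both input and previous state
        gates = self.x2h(input) + self.h2h(h)

        # separate the output into each of the LSTM operations
        input_gate, forget_gate, output_gate, candidate = gates.chunk(4, dim=1)

        # apply the corresponding activations
        i_t = torch.sigmoid(input_gate)
        f_t = torch.sigmoid(forget_gate)
        o_t = torch.sigmoid(output_gate)
        g_t = torch.tanh(candidate)

        # calculate the next cell state
        c = f_t * c + i_t * g_t

        # calculate the next hidden state
        h = o_t * torch.tanh(c)

        return h, c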


To construct a fully functional LSTM network, we create an `nn.Module` similar to the one used for the RNN, which will stack one or more LSTM cells and apply them to each element of a given input sequence.

In [None]:
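class LSTM(nn.Module):

    def __init__(self, input_size, hidden_size, num_layers, output_size, bias=False):
        super(LSTM, self).__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.output_size = output_size

        # create a list of modules
        self.lstm_cells = nn.ModuleList()

        # create each layer in the network
        # take care when defining the input size of the first vs later layers
        for i in range(self.num_layers):
            self.lstm_cells.append(
                LSTMCell(self.input_size if i == 0 else self.hidden_size, self.hidden_size, self.bias)
            )

        # create a final linear layer from hidden state to network output
        self.h2o = nn.Linear(self.hidden_size, self.output_size)

    def init_hidden(self,  batch_size=1):
        # initialise the hidden state and cell state
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        return h0, c0

    def forward(self, input, h0, c0):
        # Input of shape (batch_size, sequence length, input_size)
        # Output of shape (batch_size, sequence length, output_size)
        outs = []
        hidden = [h0[layer] for layer in range(self.num_layers)]
        cell = [c0[layer] for layer in range(self.num_layers)]

        # iterate over all elements in the sequence
        for t in range(input.size(1)):
            # iterate over each layer
            for layer in range(self.num_layers):
                # apply each layer
                # take care to apply the layer to the input or the
                # previous hidden state depending on the layer number
                layer_input = input[:, t, :] if layer == 0 else hidden[layer - 1]

                # store the hidden and cell state of each layer
                hidden[layer], cell[layer] = self.lstm_cells[layer](layer_input, hidden[layer], cell[layer])

            # the hidden state of the last layer needs to be recorded
            # to be used in the output
            outs.append(hidden[-1])

        # calculate output for each element in the sequence
        out = torch.stack([self.h2o(out) for out in outs], dim=1)

        return out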

We can now substitute the RNN with an LSTM in our text-generation network:

In [None]:
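class LSTM_GEN(nn.Module):
    def __init__(self, input_size, embedding_dim, hidden_size, num_layers, num_unique_words):
        super(LSTM_GEN, self).__init__()

        # define your layers and activations
        # input_size is the size of the LSTM input, so it must equal embedding_dim
        self.input_size = input_size
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_unique_words = num_unique_words

        # add a nn.Embedding
        self.embedding = nn.Embedding(self.num_unique_words, self.embedding_dim)
        # add our LSTM (one output score per unique word)
        self.lstm = LSTM(self.input_size, self.hidden_size, self.num_layers, self.num_unique_words, bias=False)

    def forward(self, x):

        # initialise hidden state and cell state
        batch_size = x.size(0)
        h0, c0 = self.lstm.init_hidden(batch_size)

        # store the word embeddings
        embedded = self.embedding(x)
        # apply the LSTM
        output = self.lstm(embedded, h0, c0)

        return output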


### The train and predict functions

In [None]:
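def train_lstm_gen(model, optimizer, criterion, dataloader):
    # set model to train mode
    model.train()
    # initialise the loss
    train_loss = 0

    # loop over dataset
    for X, y in dataloader:
        # send data to device as integer (long) word indices
        X, y = X.long().to(device), y.long().to(device)
        # reset the gradients
        optimizer.zero_grad()
        # get output (the hidden and cell states are initialised inside the model)
        outputs = model(X)

        # compute the loss (change shape as crossentropy takes input as batch_size, number of classes, d1, d2, ...)
        loss = criterion(outputs.permute(0, 2, 1), y)

        # backpropagate
        loss.backward()
        # update weights
        optimizer.step()
        # accumulate the loss, detached from the computational graph
        train_loss += loss.detach()

    return train_loss/len(dataloader)


def predict_lstm_gen(dataset, model, text, next_words=10):
    # set model to evaluation mode
    model.eval()
    # every word in `text` must be in the vocabulary, otherwise a KeyError is raised
    words = text.split(' ')
    # the index of a word is its position in dataset.unique_words
    word_to_idx = {w: i for i, w in enumerate(dataset.unique_words)}

    with torch.no_grad():
        # loop over words
        for _ in range(next_words):
            # take the last sequence_length words and send them to device
            idxs = [word_to_idx[w] for w in words[-dataset.sequence_length:]]
            x = torch.tensor([idxs], dtype=torch.long).to(device)
            # compute output of shape (1, sequence length, number of unique words)
            y_pred = model(x)

            # take last output
            last_logits = y_pred[0, -1]

            # obtain probability vector for last output
            p = torch.softmax(last_logits, dim=0)
            # sample probability vector to get index in dataset
            word_index = torch.multinomial(p, 1).item()

            # get word corresponding to the sampled index
            words.append(dataset.unique_words[word_index])

    return ' '.join(words)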

### Hyperparameters, model initialisation and training loop (Afternoon exercise)

In [None]:
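device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_size = 128
n_hidden = 128
n_layers = 1
embedding_dim = input_size
n_unique_words = len(words_dataset.unique_words)
batch_size = 128
sequence_length = 4

lr = 5e-3
momentum = 0.5
n_epochs = 20

# create LSTM network
lstm_gen = LSTM_GEN(input_size, embedding_dim, n_hidden, n_layers, n_unique_words).to(device)
print(f'The model has {count_trainable_parameters(lstm_gen):,} trainable parameters')

# select a criterion
criterion = nn.CrossEntropyLoss()
# create an optimiser (SGD is one choice that uses the lr and momentum defined above)
optimizer = torch.optim.SGD(lstm_gen.parameters(), lr=lr, momentum=momentum)

# create our dataset
words_dataset = WordsTensorDataset(data, sequence_length)
# use it with a data loader
words_dataloader = DataLoader(words_dataset, batch_size=batch_size, shuffle=True)

# Keep track of losses for plotting
liveloss = PlotLosses()
for epoch in range(n_epochs):
    logs = {}
    # train the LSTM
    train_loss = train_lstm_gen(lstm_gen, optimizer, criterion, words_dataloader)

    print(epoch,train_loss)

    logs['log loss'] = float(train_loss)
    liveloss.update(logs)
    liveloss.draw()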

Let's try to predict the next words with the LSTM:

In [None]:
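# predict some text with the LSTM, seeded with the first words of the dataset
seed_text = ' '.join(data[:sequence_length])
print(predict_lstm_gen(words_dataset, lstm_gen, text=seed_text, next_words=20))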

We can see that the LSTM generates more coherent text, and funnier jokes too!

# Recent Advances

The long-term memory in LSTMs is [a specific instance](https://arxiv.org/pdf/1601.06733.pdf) of a more generic concept called _Attention_. Attention was introduced to solve one problem: in _Neural Machine Translation_, the next word in the output sentence (in the output language) is not necessarily related to the last (or second-to-last) word in the input sentence (in the input language). Since simple RNNs mostly capture relationships between nearby words, various styles of attention were tried to teach the model to look at a specific part of the input sentence in order to predict the next output word. [Many of these attention approaches](https://arxiv.org/abs/1409.0473) were successful and today far outperform LSTMs on the above tasks.

![](https://github.com/acse-2020/ACSE-8/blob/main/implementation/practical_5/morning_lecture/images/self-attention.png?raw=1)

One extremely successful kind of attention is _self-attention_. Here, instead of mapping relationships between an output sequence and an input sequence, we map relationships between the different words of the same sentence. Going down this path, it was realised that the self-attention mechanism is more than just an add-on to RNNs and that it might be possible to build entire networks around self-attention without any recurrence. In ["Attention is all you need" (Vaswani et al., 2017)](https://arxiv.org/pdf/1706.03762.pdf) a neural network architecture called the _Transformer_ was introduced that replaced recurrence with self-attention layers, together with other innovations such as multi-head attention and positional encodings.
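
As a rough illustration of self-attention over the words of one sentence, here is a minimal sketch of scaled dot-product self-attention with a single head. The sizes and the random projection matrices `w_q`, `w_k` and `w_v` are assumptions made for the example; in a real Transformer they are learned parameters.

```python
import math
import torch

torch.manual_seed(0)

# hypothetical sizes: one sentence of 5 words, each represented by an 8-dimensional vector
sequence_length, model_dim = 5, 8
x = torch.randn(sequence_length, model_dim)

# projections to queries, keys and values (random here, for illustration only)
w_q = torch.randn(model_dim, model_dim)
w_k = torch.randn(model_dim, model_dim)
w_v = torch.randn(model_dim, model_dim)
q, k, v = x @ w_q, x @ w_k, x @ w_v

# similarity between every pair of words, scaled by sqrt(model_dim)
scores = q @ k.T / math.sqrt(model_dim)
# each row is a probability distribution over the words of the same sentence
attention_weights = torch.softmax(scores, dim=-1)
# each output vector is a weighted average of the value vectors
attended = attention_weights @ v

print(attention_weights.shape, attended.shape)
```

Unlike the RNN, every word can attend directly to every other word, whatever the distance between them.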

![](https://github.com/acse-2020/ACSE-8/blob/main/implementation/practical_5/morning_lecture/images/transformer.png?raw=1)

In February 2019, OpenAI introduced a variation of the Transformer called GPT-2 and initially [declined to release](https://slate.com/technology/2019/02/openai-gpt2-text-generating-algorithm-ai-dangerous.html) the full model, _citing concerns that it could be misused_. This text generation model could generate entire (_fake_) news articles from a prompt of one or a few words: think of it as autocomplete on steroids. They did eventually release it, and it is now available to try online: https://talktotransformer.com

### In summary, we have learnt how to:
- Implement a word-level text generator using RNNs
- Implement word-level text generation with LSTMs, which outperforms vanilla RNNs
