# Recurrent Neural Networks - Text Generation
(by: [Ruben Nuredini](mailto:ruben.nuredini@hs-heilbronn.de), Heilbronn University, Germany, October 2021)

Recurrent Neural Networks (RNNs) have been used extensively by the natural language processing (NLP) community for various applications. One such application is building language models. A language model allows us to predict the probability of a word in a text given the previous words. Language models are important for various higher level tasks such as machine translation, spelling correction, and so on.

In this notebook you will train a very simple character-based language model. The language model will be represented by a Recurrent Neural Network (RNN) and will be trained on input text (.txt file). 
You will use this model to predict the next character given the sequence of $n$ previous characters. 
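
In symbols, this prediction task can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original notebook: $c_t$ is the character at position $t$ and $V$ is the set of unique characters in the corpus.

$$
\hat{c}_t = \arg\max_{c \in V} \; P(c \mid c_{t-n}, \dots, c_{t-1})
$$

The network outputs the whole distribution $P(\cdot \mid c_{t-n}, \dots, c_{t-1})$. Taking the most probable character is one simple way to turn that distribution into a prediction.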

**You will learn how to:**
- Import and preprocess a raw textual file as your corpus
- Extract training samples from the corpus
- Implement an RNN text generation model with [Keras](https://keras.io/) using a `SimpleRNN` or `LSTM` layer
- Train the model
- Use the model to predict the next characters in your text

**Note:** The idea behind character-based language models is very similar to that of word-based language models. The former predict characters and the latter predict words. The computational requirements for training a character-based language model are lower due to the smaller vocabulary. Consequently, the training is faster.
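
The vocabulary-size argument can be checked directly. The following is a minimal sketch; `vocab_sizes` and the toy string are hypothetical and not part of the notebook. On a string this small the word count can even be the smaller of the two. Over a whole book the number of distinct words typically grows into the thousands, while the number of distinct characters stays bounded by the alphabet and punctuation (55 in the run shown later in this notebook).

```python
def vocab_sizes(text):
    """Return (unique characters, unique whitespace-separated words)."""
    return len(set(text)), len(set(text.split()))

toy = "the cat sat on the mat and the dog sat on the log"  # hypothetical toy corpus
print(vocab_sizes(toy))  # (13, 8)
```

Calling `vocab_sizes(text)` after the preprocessing step gives the same comparison for the chosen book.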

This notebook borrows from [Comprehensive Guide to RNN with Keras](https://www.kaggle.com/code/prashant111/comprehensive-guide-to-rnn-with-keras/notebook) by Prashant Banerjee.

You can also explore the slightly more advanced model presented in the [Text Generation](https://www.tensorflow.org/text/tutorials/text_generation) tutorial by TensorFlow.

## 1 - Import the packages

In [1]:
import numpy as np

import tensorflow as tf
from tensorflow import keras

### 1.1 - Check TensorFlow Version and available GPUs

In [27]:
print(tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

2.8.0
Num GPUs Available:  1


## 2 - Import the Dataset

The textual data has been downloaded from [Kaggle](https://www.kaggle.com/datasets/ashishsinhaiitr/lord-of-the-rings-text). The textual files are available in the `data` directory. 

In [3]:
# Running this cell will list all files under the data directory.
import os
for dirname, _, filenames in os.walk('data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data\01 - The Fellowship Of The Ring.txt
data\02 - The Two Towers.txt
data\03 - The Return of the King.txt
data\shakespeare.txt
data\wonderland.txt


You can select one of the books listed above as the corpus for training.

In [4]:
INPUT_FILE = "data/wonderland.txt"

## 2.1 - Preprocess the data

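The files contain line breaks and non-ASCII characters, so we first do some preliminary cleanup: each line is stripped, lowercased, and reduced to its ASCII characters, and empty lines are dropped. The cleaned lines are then joined into a single string stored in a variable called `text`: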

In [6]:
# Extract the input as a stream of characters
print("Extracting text from input...")
lines = []
# Open in binary mode so that the decoding can be controlled explicitly
with open(INPUT_FILE, 'rb') as fin:
    for line in fin:
        # Strip surrounding whitespace, lowercase, and drop non-ASCII bytes
        line = line.strip().lower().decode("ascii", "ignore")
        if len(line) == 0:
            continue
        lines.append(line)
print('done!')

Extracting text from input...
done!


In [7]:
text = " ".join(lines)

The contents of the book are now stored in the variable `text`.
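
To see what the cleanup does to a single line, here is a small sketch. The helper name `clean_line` and the input line are hypothetical; the chain of calls is the same as in the cell above. Note that `decode("ascii", "ignore")` silently removes every non-ASCII byte, so a typographic apostrophe disappears instead of being replaced.

```python
def clean_line(raw: bytes) -> str:
    """Strip whitespace, lowercase, and drop every non-ASCII byte."""
    return raw.strip().lower().decode("ascii", "ignore")

# Hypothetical input line containing a typographic apostrophe (U+2019)
raw = "  Alice\u2019s Adventures \r\n".encode("utf-8")
print(repr(clean_line(raw)))  # 'alices adventures'
```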

### 2.1.1 - Create the `(x,y)` pairs

Training a language model is usually done in a supervised manner. In order to train a supervised learning model we need training data - a collection of inputs $x$ and the corresponding ground truth $y$. For this particular problem $x$ should be a sequence of characters and the corresponding $y$ the character that follows the sequence $x$.

Currently our data is just an unstructured sequence of characters. We will create our training data in a self-supervised manner. Self-supervised learning obtains supervisory signals from the data itself, often leveraging the underlying structure in the data.

We will therefore extract the `(x,y)` pairs from the text itself by stepping through the text ${step}$ characters at a time, extracting at each position a sequence of ${seqlen}$ characters together with the character that follows it.

For example, assuming an input text `"The sky was falling"` and the values ${step=1}$ and ${seqlen=10}$, we would get the following sequence of `input_chars` and `label_chars` (first 5 only):
* `[The sky wa] -> ['s']`
* `[he sky was] -> [' ']`
* `[e sky was ] -> ['f']`
* `[ sky was f] -> ['a']`
* `[sky was fa] -> ['l']`
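
The example can be reproduced with a few lines of code. This is a sketch: the function name `make_pairs` is hypothetical, and the loop follows the sliding-window scheme described above.

```python
def make_pairs(text, seqlen, step):
    """Return (inputs, labels): seqlen-character windows and the character after each."""
    inputs, labels = [], []
    for i in range(0, len(text) - seqlen, step):
        inputs.append(text[i:i + seqlen])
        labels.append(text[i + seqlen])
    return inputs, labels

inputs, labels = make_pairs("The sky was falling", seqlen=10, step=1)
for x, y in list(zip(inputs, labels))[:5]:
    print(repr(x), "->", repr(y))
print(len(inputs))  # 9
```

A larger `step` skips start positions and so yields fewer, less overlapping samples: with `step=3` the same text produces only 3 pairs.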

Let us create the data:

In [8]:
SEQLEN = 10
STEP = 1

input_chars = []
label_chars = []
for i in range(0, len(text) - SEQLEN, STEP):
    input_chars.append(text[i:i + SEQLEN])
    label_chars.append(text[i + SEQLEN])

In [9]:
#input_chars
#label_chars

There are `len(input_chars)` inputs, each with a corresponding label.

In [10]:
len(input_chars)

158773

### 2.1.2 - Creating lookup tables for char2index and index2char conversion

There are `nb_chars` unique characters in the text.

In [11]:
# Sort the unique characters so that the index assignment is the same on every run
chars = sorted(set(text))
nb_chars = len(chars)

We will be dealing with the indexes of these characters rather than the characters themselves. The following code snippet creates the necessary lookup tables for converting a character to its index and vice versa:

In [12]:
char2index = dict((c, i) for i, c in enumerate(chars))
index2char = dict((i, c) for i, c in enumerate(chars))
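
A quick round trip on a toy string shows how the two tables work together. All names prefixed with `toy_` are hypothetical stand-ins for the notebook's variables.

```python
toy_text = "hello world"                  # hypothetical stand-in for `text`
toy_chars = sorted(set(toy_text))          # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
toy_char2index = {c: i for i, c in enumerate(toy_chars)}
toy_index2char = {i: c for i, c in enumerate(toy_chars)}

encoded = [toy_char2index[c] for c in toy_text]
decoded = "".join(toy_index2char[i] for i in encoded)
print(encoded)              # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decoded == toy_text)  # True
```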

### 2.1.3 - One-hot-encode the input and label texts

In order to process the data with our neural network model, we have to convert the characters into one-hot encodings. Each encoding will be of size `nb_chars`.

* Each row of the input data $X$ consists of $seqlen$ characters, each represented as a one-hot encoding
* In total there will be `len(input_chars)` samples
* The overall shape of $X$ will be `(len(input_chars), seqlen, nb_chars)`

* Each row of the output $y$ is a single character, also represented as a one-hot encoding
* The overall shape of $y$ will be `(len(input_chars), nb_chars)`
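
The shapes are easier to see on a toy example. The sketch below uses a hypothetical three-character vocabulary and a sequence length of 2; every `toy_` name is made up for illustration.

```python
import numpy as np

toy_chars = sorted(set("abcab"))                              # ['a', 'b', 'c']
toy_char2index = {c: i for i, c in enumerate(toy_chars)}
toy_inputs, toy_labels = ["ab", "bc", "ca"], ["c", "a", "b"]  # seqlen = 2

X_toy = np.zeros((len(toy_inputs), 2, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_inputs), len(toy_chars)), dtype=bool)
for i, seq in enumerate(toy_inputs):
    for j, ch in enumerate(seq):
        X_toy[i, j, toy_char2index[ch]] = True   # exactly one True per character
    y_toy[i, toy_char2index[toy_labels[i]]] = True

print(X_toy.shape, y_toy.shape)   # (3, 2, 3) (3, 3)
print(X_toy[0].astype(int))       # one-hot rows for 'a' and 'b'
```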

In [16]:
X = np.zeros((len(input_chars), SEQLEN, nb_chars), dtype=bool)
y = np.zeros((len(input_chars), nb_chars), dtype=bool)

In [17]:
for i, input_char in enumerate(input_chars):
    for j, ch in enumerate(input_char):
        X[i, j, char2index[ch]] = 1
    y[i, char2index[label_chars[i]]] = 1
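
The arrays are allocated with `dtype=bool`. The notebook does not say why; one plausible reason is memory, since a NumPy boolean takes 1 byte while a `float32` takes 4. A back-of-the-envelope estimate, using the sample count and vocabulary size reported in this notebook's outputs:

```python
# Values taken from this notebook's outputs: len(input_chars), SEQLEN, nb_chars
n_samples, seqlen, nb_chars = 158773, 10, 55

n_elements = n_samples * seqlen * nb_chars
print(n_elements)              # 87325150 one-hot entries in X
print(n_elements * 1 / 1e6)    # approximate size in MB at 1 byte per entry (bool)
print(n_elements * 4 / 1e6)    # approximate size in MB at 4 bytes per entry (float32)
```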

## 3 - The RNN Prediction Model

We will build a character-level neural network model. 
The goal of the network will be to predict the next character given several previous characters. 

In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, SimpleRNN, Dense, LSTM

### 3.1 - Explaining the RNN model architecture

1. The raw data is provided to the [`tf.keras.Input`](https://www.tensorflow.org/api_docs/python/tf/keras/Input) layer, which instantiates the Keras tensor.
2. The input is then passed to the recurrent layer, which can be either:
    * Fully-connected RNN: [`tf.keras.layers.SimpleRNN`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN)
        *  where the output of each time step is fed back as input to the next time step.
    * Long Short-Term Memory layer: [`tf.keras.layers.LSTM`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
3. Finally, a regular densely-connected NN layer [`tf.keras.layers.Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) outputs a probability distribution over the characters
    * Simply put, this is the belief of the network that a certain character is the next one, given the previous sequence
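
The three steps above can be sketched in plain NumPy to make the feedback loop visible. This is an illustration under stated assumptions, not Keras's implementation: the weights are random instead of trained, the variable names are made up, a single sequence is processed without batching, and `tanh` is used because it is the default activation of `SimpleRNN`.

```python
import numpy as np

rng = np.random.default_rng(0)
seqlen, nb_chars, units = 10, 55, 128                   # sizes used in this notebook

# Random weights stand in for trained ones (assumption for illustration)
W_x = rng.normal(scale=0.1, size=(nb_chars, units))     # input  -> hidden
W_h = rng.normal(scale=0.1, size=(units, units))        # hidden -> hidden (feedback)
b_h = np.zeros(units)
W_out = rng.normal(scale=0.1, size=(units, nb_chars))   # hidden -> output (Dense)
b_out = np.zeros(nb_chars)

# One random one-hot input sequence of shape (seqlen, nb_chars)
x = np.eye(nb_chars)[rng.integers(0, nb_chars, size=seqlen)]

h = np.zeros(units)                          # initial hidden state
for x_t in x:                                # one step per character
    h = np.tanh(x_t @ W_x + h @ W_h + b_h)   # the previous state is fed back in

logits = h @ W_out + b_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the vocabulary
print(probs.shape, probs.sum())
```

Only the last hidden state reaches the dense layer, which corresponds to `return_sequences=False` in the model definition.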

In [23]:
model = Sequential([
        Input(shape=(SEQLEN, nb_chars)),
        SimpleRNN(units = 128, return_sequences=False, unroll=True),
        #LSTM(units = 128, unroll=True, return_sequences=False),
        Dense(nb_chars, activation="softmax"),
    ])

model.compile(loss="categorical_crossentropy", optimizer="adam")

In [24]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 128)               23552     
                                                                 
 dense_1 (Dense)             (None, 55)                7095      
                                                                 
Total params: 30,647
Trainable params: 30,647
Non-trainable params: 0
_________________________________________________________________
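
The parameter counts in the summary can be verified by hand. The helper functions below are hypothetical and only encode the standard formulas: a simple recurrent layer has one input weight matrix, one recurrent weight matrix, and one bias vector, and an LSTM layer holds four such sets.

```python
def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + bias
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # weights + bias
    return input_dim * units + units

def lstm_params(input_dim, units):
    # three gates plus the cell candidate, each with its own weights and bias
    return 4 * simple_rnn_params(input_dim, units)

print(simple_rnn_params(55, 128))  # 23552
print(dense_params(128, 55))       # 7095
print(lstm_params(55, 128))        # 94208
```

Switching to the commented-out `LSTM` layer would therefore roughly quadruple the size of the recurrent layer.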


In [25]:
# Training Hyperparameters
BATCH_SIZE = 32
NUM_ITERATIONS = 30
NUM_EPOCHS_PER_ITERATION = 1
NUM_PREDS_PER_EPOCH = 100
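
With these settings, the number of weight updates per epoch follows from the sample count and the batch size. A small check, using the `len(input_chars)` value printed earlier:

```python
import math

n_samples, batch_size = 158773, 32      # len(input_chars) and BATCH_SIZE
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)  # 4962
```

This is the total that Keras shows on the right-hand side of its progress bar during training.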

In [26]:
# Train the model one epoch at a time and generate a sample text after each iteration
for iteration in range(NUM_ITERATIONS):
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION)
    
    # Testing the Model
    # randomly choose a row from input_chars, then use it to generate text from model for next 100 chars
    test_idx = np.random.randint(len(input_chars))
    test_chars = input_chars[test_idx]
    print("Generating from seed: %s" % (test_chars))
    print(test_chars, end="")
    for _ in range(NUM_PREDS_PER_EPOCH):
        Xtest = np.zeros((1, SEQLEN, nb_chars))
        # use a separate index j so the outer loop variable is not shadowed
        for j, ch in enumerate(test_chars):
            Xtest[0, j, char2index[ch]] = 1
        pred = model.predict(Xtest, verbose=0)[0]
        ypred = index2char[np.argmax(pred)]
        print(ypred, end="")
        # move forward with test_chars + ypred
        test_chars = test_chars[1:] + ypred
    print()

Iteration #: 0
Generating from seed: heir names
heir namest of the dored the said the said the said the said the said the said the said the said the said the 
Iteration #: 1
Generating from seed: re simply-
re simply---the courded the cantered the cantered the cantered the cantered the cantered the cantered the cant
Iteration #: 2
Generating from seed: und in her
und in her head the gropt and the dont to the long the dont to the long the dont to the long the dont to the l
Iteration #: 3
Generating from seed: o idea wha
o idea what the same the said the gryphon with the stime the said the gryphon with the stime the said the gryp
Iteration #: 4
Generating from seed: understand
understand the mock turtle was the mock turtle was the mock turtle was the mock turtle was the mock turtle was
Iteration #: 5
Generating from seed: o the thre
o the three grown to the project gutenberg-tm electronic works to the project gutenberg-tm electronic works to
Iteration #: 6
 974/4962 [====>...............

KeyboardInterrupt: 
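
The generated samples above quickly fall into loops such as "the said the said". One likely cause is that `np.argmax` always picks the single most probable character, so the same 10-character window always produces the same continuation. A common alternative, which is not used in this notebook, is to sample from the predicted distribution after rescaling it with a temperature. The function below is a minimal sketch of that idea; the name `sample_index` and the example probabilities are hypothetical.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw an index from `probs` after rescaling the distribution by `temperature`."""
    if rng is None:
        rng = np.random.default_rng()
    # Work in log space; the small constant avoids log(0)
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = np.array([0.6, 0.3, 0.1])     # hypothetical softmax output
rng = np.random.default_rng(0)
print([sample_index(probs, 1.0, rng) for _ in range(10)])    # varied indices
print([sample_index(probs, 0.01, rng) for _ in range(10)])   # almost always index 0
```

In the generation loop, `index2char[np.argmax(pred)]` could be replaced by `index2char[sample_index(pred, temperature=0.5)]`. Lower temperatures behave like the greedy choice, higher ones produce more varied but noisier text.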

## 4 - Perform Predictions on Your Text

In [None]:
# Helper function that encodes the last SEQLEN characters of the input text to a one hot encoding
# suitable as an input to our model
def encode_text(sample_text):
    truncated = sample_text[-SEQLEN:]
    sample_encoded = np.zeros((1, SEQLEN, nb_chars))
    for i, ch in enumerate(truncated):
        sample_encoded[0, i, char2index[ch]] = 1
    return sample_encoded

In [None]:
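# Helper function that maps the predicted probabilities to the most likely character
def decode_output(pred_probas):
    # use the function argument instead of the global variable `pred`
    ypred = index2char[np.argmax(pred_probas)]
    return ypred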

In [None]:
sample_text = "it is a beautiful day today"
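
Two properties of `encode_text` are worth keeping in mind when choosing a seed: it raises a `KeyError` for any character that is not in `char2index` (for example uppercase letters, since the corpus was lowercased), and a seed shorter than `SEQLEN` leaves the remaining rows of the encoding all zero. The hypothetical helper below sketches one way to guard against both, assuming the space character is part of the vocabulary.

```python
def normalize_seed(seed, vocab, seqlen):
    """Lowercase `seed`, drop characters outside `vocab`, and left-pad with spaces to `seqlen`."""
    cleaned = "".join(ch for ch in seed.lower() if ch in vocab)
    return cleaned.rjust(seqlen)

# Hypothetical vocabulary; in the notebook, pass `char2index` instead
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
print(repr(normalize_seed("It is a Beautiful day, today\u2122", toy_vocab, 10)))
# 'it is a beautiful day today'
print(repr(normalize_seed("Hi!", toy_vocab, 10)))
# '        hi'
```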

In [None]:
from IPython.display import clear_output

current_text = sample_text
for i in range(200):
    pred = model.predict(encode_text(current_text), verbose=0)
    predicted_char = decode_output(pred)
    current_text += predicted_char
    print(current_text)
    clear_output(wait=True)
