<a href="https://colab.research.google.com/github/bengsoon/lstm_lord_of_the_rings/blob/main/LOTR_LSTM_Character_Level.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Creating a Language Model with LSTM using Lord of The Rings Corpus
In this notebook, we will create a character-level language model using an LSTM.

### Imports

In [None]:
## for paperspace 
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 
# !pip install -r requirements.txt

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization, Embedding, Input, LSTM, Flatten, Dense, Dropout
from tensorflow.keras import Model
from tensorflow.keras.models import load_model
import numpy as np 

from pprint import pprint as pp
from string import punctuation
import re
import random
import os
from pathlib import Path

### Data Preprocessing & Pipeline

In [None]:
# get LOTR full text
# !wget https://raw.githubusercontent.com/bengsoon/lstm_lord_of_the_rings/main/lotr_full.txt -P /content/drive/MyDrive/Colab\ Notebooks/LOTR_LSTM/data

#### Loading Data

In [None]:
path = Path("./")

In [None]:
with open(path / "data/lotr_full.txt", "r", encoding="utf-8") as f:
    text = f.read()
print(text[:1000])

Three Rings for the Elven-kings under the sky,
               Seven for the Dwarf-lords in their halls of stone,
            Nine for Mortal Men doomed to die,
              One for the Dark Lord on his dark throne
           In the Land of Mordor where the Shadows lie.
               One Ring to rule them all, One Ring to find them,
               One Ring to bring them all and in the darkness bind them
           In the Land of Mordor where the Shadows lie.
           
FOREWORD

This tale grew in the telling, until it became a history of the Great War of the Ring and included many glimpses of the yet more ancient history that preceded it. It was begun soon after _The Hobbit_ was written and before its publication in 1937; but I did not go on with this sequel, for I wished first to complete and set in order the mythology and legends of the Elder Days, which had then been taking shape for some years. I desired to do this for my own satisfaction, and I had little hope that other people 

In [None]:
print(f"Corpus length: {int(len(text)) / 1000 } K characters")

Corpus length: 1532.723 K characters


#### Unique Characters

In [None]:
chars = sorted(set(list(text)))
print("Total unique characters: %s" % (len(chars)))

Total unique characters: 93


In [None]:
print(chars)

['\t', '\n', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'É', 'Ó', 'á', 'â', 'ä', 'é', 'ë', 'í', 'ó', 'ú', 'û', '–']


#### Preparing X & y Datasets

We need to turn the text into (X, y) training pairs:
* The input (`sentences`) is a fixed-size character sequence of length `MAX_SEQ_LEN`.
* The target (`next_chars`) is the single character that immediately follows that sequence.
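
As a minimal sketch of this windowing on a toy string (the helper name `make_windows` and the toy values are illustrative and not part of the notebook):

```python
def make_windows(text, seq_len, step):
    """Slide a window of `seq_len` characters over `text`, advancing `step` characters each time."""
    inputs, targets = [], []
    for i in range(0, len(text) - seq_len, step):
        inputs.append(text[i : i + seq_len])  # fixed-size input sequence
        targets.append(text[i + seq_len])     # the single character that follows it
    return inputs, targets


toy_inputs, toy_targets = make_windows("One Ring to rule", seq_len=5, step=2)
for x, y in zip(toy_inputs, toy_targets):
    print(repr(x), "->", repr(y))
```

A larger `step` produces fewer, less overlapping training pairs; a `step` of 1 would use every possible window.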

In [None]:
# setting up model constants
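MAX_SEQ_LEN = 20            # number of characters in each input sequence
MAX_FEATURES = len(chars)   # upper bound on the vocabulary size (raw unique characters)
step = 2                    # stride between the starts of consecutive sequences
BATCH_SIZE = 64
EMBEDDING_DIM = 16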

In [None]:
sentences = []
next_chars = []

for i in range(0, len(text) - MAX_SEQ_LEN, step):
    sentences.append(text[i: i + MAX_SEQ_LEN])
    next_chars.append(text[i + MAX_SEQ_LEN])

print("Total number of training examples:", len(sentences))

Total number of training examples: 766352


In [None]:
# randomly sample some of the input and output to visualize
for i in range(10):
    # randrange excludes the upper bound; randint(0, len(sentences)) could index out of range
    ix = random.randrange(len(sentences))
    print(f"{sentences[ix]} ..... {next_chars[ix]}")

r: its springs were  ..... a
 me.'
     Pippin la ..... u
ite face, her hand c ..... l
 on the open hill, b ..... e
 birthday, which he  ..... c
thers must take refu ..... g
bbits referred to th ..... o
fed me, and so I'm b ..... e
 there, silent and a ..... l
 as his fortune allo ..... w


In [None]:
X_train_raw = tf.data.Dataset.from_tensor_slices(sentences)
y_train_raw = tf.data.Dataset.from_tensor_slices(next_chars)



In [None]:
for input, output in zip(X_train_raw.take(5), y_train_raw.take(5)):
    print(f"{input.numpy().decode('utf-8')} ..... {output.numpy().decode('utf-8')}")

Three Rings for the  ..... E
ree Rings for the El ..... v
e Rings for the Elve ..... n
Rings for the Elven- ..... k
ngs for the Elven-ki ..... n


#### Preprocessing with Keras `TextVectorization` layer
[_doc_](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization)

We will use the `TextVectorization` layer as the preprocessing pipeline for our data: it standardizes the text, splits it into characters, and maps each character to an integer index.

In [None]:
def standardize_text(input):
    """
        Custom standardization that:
            1. Replaces every whitespace character (newline, tab, ...) with a space
            2. Removes punctuation & digits
            3. Lowercases the text
            4. Preserves the accented (Elvish) characters
    """
    
    input = tf.strings.regex_replace(input, r"\s", " ")
    input = tf.strings.regex_replace(input, r"[0-9]", "")
    input = tf.strings.regex_replace(input, f"[{punctuation}–]", "")

    # note: without encoding="utf-8", tf.strings.lower only lowercases ASCII letters,
    # so accented capitals such as 'É' are left unchanged
    return tf.strings.lower(input)

def char_split(input):
    return tf.strings.unicode_split(input, 'UTF-8')

In [None]:
# create text vectorization layer
vectorization_layer = TextVectorization(
    max_tokens = MAX_FEATURES,
    standardize = standardize_text,
    split = char_split,
    output_mode='int',
    output_sequence_length=MAX_SEQ_LEN
)

In [None]:
# create the vocabulary indexing with `adapt`
vectorization_layer.adapt(X_train_raw.batch(BATCH_SIZE))



In [None]:
print(f"Total unique characters: {len(vectorization_layer.get_vocabulary())}")

Total unique characters: 40


In [None]:
print(vectorization_layer.get_vocabulary())

['', '[UNK]', ' ', 'e', 't', 'a', 'o', 'n', 'h', 'i', 's', 'r', 'd', 'l', 'w', 'u', 'f', 'g', 'm', 'y', 'b', 'c', 'p', 'k', 'v', 'j', 'q', 'x', 'z', 'ó', 'É', 'ú', 'û', 'é', 'á', 'í', 'ë', 'â', 'ä', 'Ó']
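
The vocabulary above follows the layer's indexing convention: index 0 is reserved for padding (`''`), index 1 for out-of-vocabulary tokens (`'[UNK]'`), and the remaining tokens are ordered by descending frequency. Below is a rough pure-Python sketch of that convention. The helpers `simple_standardize`, `build_vocab`, and `encode` are hypothetical and only approximate what the layer does; in particular, Python's `str.lower` also lowercases accented capitals, whereas the vocabulary above still contains `'É'` and `'Ó'`.

```python
import re
from collections import Counter
from string import punctuation


def simple_standardize(text):
    """Approximate the notebook's standardization in plain Python (illustrative only)."""
    text = re.sub(r"\s", " ", text)                               # every whitespace character becomes a space
    text = re.sub(r"[0-9]", "", text)                             # drop digits
    text = re.sub("[" + re.escape(punctuation) + "–]", "", text)  # drop punctuation and the en dash
    return text.lower()


def build_vocab(texts):
    """Index 0 = padding, index 1 = unknown, then characters by descending frequency."""
    counts = Counter(ch for t in texts for ch in simple_standardize(t))
    return ["", "[UNK]"] + [ch for ch, _ in counts.most_common()]


def encode(text, vocab, seq_len):
    """Map characters to indices, then truncate or pad with 0 up to `seq_len`."""
    index = {ch: i for i, ch in enumerate(vocab)}
    ids = [index.get(ch, 1) for ch in simple_standardize(text)][:seq_len]
    return ids + [0] * (seq_len - len(ids))


vocab = build_vocab(["Three Rings for the Elven-kings"])
print(vocab[:4])
print(encode("Elven-kings!", vocab, seq_len=14))
```

Because punctuation and digits are stripped before indexing, an input containing them ends up shorter than `seq_len` and is padded with zeros at the end.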


In [None]:
def vectorize_text(text):
    """ Convert text into a Tensor using vectorization_layer"""
    text = tf.expand_dims(text, -1)
    return tf.squeeze(vectorization_layer(text))

In [None]:
test_text = "hello i am Hoaha"

vectorize_text(test_text)

<tf.Tensor: shape=(20,), dtype=int64, numpy=
array([ 8,  3, 13, 13,  6,  2,  9,  2,  5, 18,  2,  8,  6,  5,  8,  5,  0,
        0,  0,  0])>

#### Apply Text Vectorization to X & y datasets

In [None]:
# vectorize the dataset
X_train = X_train_raw.map(vectorize_text)
y_train = y_train_raw.map(vectorize_text)

X_train.element_spec, y_train.element_spec

(TensorSpec(shape=(20,), dtype=tf.int64, name=None),
 TensorSpec(shape=(20,), dtype=tf.int64, name=None))

In [None]:
for elem in y_train.take(10):
    print(elem)

tf.Tensor([3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(20,), dtype=int64)
tf.Tensor([24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0], shape=(20,), dtype=int64)
tf.Tensor([7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(20,), dtype=int64)
tf.Tensor([23  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0], shape=(20,), dtype=int64)
tf.Tensor([7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(20,), dtype=int64)
tf.Tensor([10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0], shape=(20,), dtype=int64)
tf.Tensor([15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0], shape=(20,), dtype=int64)
tf.Tensor([12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0], shape=(20,), dtype=int64)
tf.Tensor([11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0], shape=(20,), dtype=int64)
tf.Tensor([4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(20,), dtype=int64)


In [None]:
# we only need the first element of each vectorized target in y_train; the rest is padding
y_train = y_train.map(lambda y: y[0])

In [None]:
for elem in y_train.take(5):
    print(f"Shape: {elem.shape}")
    print(f"Next Character: {elem.numpy()}")

Shape: ()
Next Character: 3
Shape: ()
Next Character: 24
Shape: ()
Next Character: 7
Shape: ()
Next Character: 23
Shape: ()
Next Character: 7


In [None]:
# Check tensor dimensions to ensure we have MAX_SEQ_LEN-sized inputs and single output
X_train.take(1), y_train.take(1)

(<TakeDataset shapes: (20,), types: tf.int64>,
 <TakeDataset shapes: (), types: tf.int64>)

In [None]:
for input, output in zip(X_train.take(5), y_train.take(5)):
    print(f"{input.numpy()} ------------>  {output.numpy()}")

[ 4  8 11  3  3  2 11  9  7 17 10  2 16  6 11  2  4  8  3  2] ------------>  3
[11  3  3  2 11  9  7 17 10  2 16  6 11  2  4  8  3  2  3 13] ------------>  24
[ 3  2 11  9  7 17 10  2 16  6 11  2  4  8  3  2  3 13 24  3] ------------>  7
[11  9  7 17 10  2 16  6 11  2  4  8  3  2  3 13 24  3  7  0] ------------>  23
[ 7 17 10  2 16  6 11  2  4  8  3  2  3 13 24  3  7 23  9  0] ------------>  7


#### Bringing the data pipeline together

**Joining the X and y into a dataset**

In [None]:
# joining X & y into a single dataset
train_dataset = tf.data.Dataset.zip((X_train, y_train))

**Setting data pipeline optimizations:**
Batch the data, cache it, and prefetch it asynchronously, letting `tf.data.AUTOTUNE` choose the final prefetch buffer size.


In [None]:
AUTOTUNE = tf.data.AUTOTUNE
train_dataset = train_dataset.prefetch(buffer_size=512).batch(BATCH_SIZE, drop_remainder=True).cache().prefetch(buffer_size=AUTOTUNE)

In [None]:
print(f"Size of the dataset in batches: {train_dataset.cardinality().numpy()}")

Size of the dataset in batches: 11974


In [None]:
# check the tensor dimensions of X and y again

for sample in train_dataset.take(1):
    print(f"Input (X) Dimension: {sample[0].numpy().shape}")
    print(f"Output (y) Dimension: {sample[1].numpy().shape}")

Input (X) Dimension: (64, 20)
Output (y) Dimension: (64,)




### Build the LSTM Model!

In [None]:
def char_LSTM_model(max_seq_len=MAX_SEQ_LEN, max_features=MAX_FEATURES, embedding_dim=EMBEDDING_DIM):

    # Define input for the model (a sequence of vocab indices)
    inputs = tf.keras.Input(shape=(max_seq_len,), dtype="int64")

    # Map each vocab index to a dense embedding vector
    X = Embedding(max_features, embedding_dim)(inputs)
    X = Dropout(0.5)(X)
    X = LSTM(128, return_sequences=True)(X)
    X = Flatten()(X)
    outputs = Dense(max_features, activation="softmax")(X)
    model = Model(inputs, outputs, name="model_LSTM")

    return model

In [None]:
model = char_LSTM_model()
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
model.summary()

Model: "model_LSTM"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 20)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 20, 16)            1488      
_________________________________________________________________
dropout (Dropout)            (None, 20, 16)            0         
_________________________________________________________________
lstm (LSTM)                  (None, 20, 128)           74240     
_________________________________________________________________
flatten (Flatten)            (None, 2560)              0         
_________________________________________________________________
dense (Dense)                (None, 93)                238173    
Total params: 313,901
Trainable params: 313,901
Non-trainable params: 0
__________________________________________________
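
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below applies the usual formulas for embedding, LSTM, and dense layers; the variable names are illustrative.

```python
vocab_size, embedding_dim, lstm_units, seq_len = 93, 16, 128, 20

# Embedding: one `embedding_dim`-sized vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units

# Dense: the flattened LSTM output (seq_len * lstm_units) feeds every output unit, plus biases
dense_params = (seq_len * lstm_units) * vocab_size + vocab_size

print(embedding_params, lstm_params, dense_params)       # 1488 74240 238173
print(embedding_params + lstm_params + dense_params)     # 313901
```

The dense layer holds roughly three quarters of the parameters because `Flatten` exposes the LSTM output at all 20 time steps to it.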

In [None]:
def sample(preds, temperature=0.2):
    """Sample an index from a probability array after rescaling it by `temperature`."""
    preds = np.squeeze(preds)
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds) / temperature    # a lower temperature sharpens the distribution
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)  # renormalize so the probabilities sum to 1

    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
    
def generate_text(model, seed_original, step, diversity):
    """Generate `step` characters after `seed_original`, sampling with temperature `diversity`."""
    print(f"Starting the sentence with.... '{seed_original}'")
    print("...Diversity:", diversity)
    seed = vectorize_text(seed_original).numpy().reshape(1, -1)

    generated = seed
    for i in range(step):
        predictions = model.predict(seed, verbose=0)
        next_index = sample(predictions, diversity)
        generated = np.append(generated, next_index)
        # the last MAX_SEQ_LEN indices become the next input window
        seed = generated[-MAX_SEQ_LEN:].reshape(1, MAX_SEQ_LEN)
    return decode_sentence(generated)


def decode_sentence(encoded_sentence):
    """Map a sequence of vocab indices back to text, print it, and return it."""
    vocabulary = vectorization_layer.get_vocabulary()  # look the vocabulary up once
    sentence = ''.join(vocabulary[index] for index in encoded_sentence)
    print(sentence)
    return sentence
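
To see what the `temperature` argument of `sample` (called `diversity` in `generate_text`) does, the sketch below rescales a small distribution in the same way, using only the standard library. The three probabilities are invented for illustration and the helper name `apply_temperature` is hypothetical.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution: exp(log(p) / T), then renormalize."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


probs = [0.6, 0.3, 0.1]  # hypothetical model output over three characters
for t in [0.2, 0.5, 1.0, 1.2]:
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Temperatures below 1 concentrate probability on the most likely character, which gives safer but more repetitive text; temperatures above 1 flatten the distribution, which gives more varied but more error-prone text.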

In [None]:
# Create a callback that saves the model's weights
checkpoint_path = path / "models/model_cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, 
                                                 save_weights_only=True, 
                                                 verbose=1)

# Train the model
epochs = 30
SAMPLING_STEPS = 100

for epoch in range(epochs):
    print("-"*40 + f"  Epoch: {epoch + 1}/{epochs}  " + "-"*40)
    # train_dataset is already batched, so batch_size is not passed to fit()
    model.fit(train_dataset, epochs=1, callbacks=[cp_callback])
    print()
    print("*"*30 + f" Generating text after epoch #{epoch + 1} " + "*"*30)
    start_index = random.randint(0, len(text) - MAX_SEQ_LEN - 1)
    sentence = text[start_index : start_index + MAX_SEQ_LEN]
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        generate_text(model, sentence, SAMPLING_STEPS, diversity)
        print()

In [None]:
model.save(path / "models/Char_LSTM_LOTR_20211112-1.h5" )

In [None]:
model = load_model(path / "models/Char_LSTM_LOTR_20211112-1.h5")

In [None]:
# train_dataset is already batched, so batch_size is not passed to evaluate()
model.evaluate(train_dataset)



[1.3499177694320679, 0.5750871896743774]
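
Keras computes cross-entropy with the natural logarithm, so the loss above is in nats and can be converted into quantities that are easier to interpret. A small sketch, with the two numbers copied from the output above:

```python
import math

loss_nats, accuracy = 1.3499177694320679, 0.5750871896743774

perplexity = math.exp(loss_nats)          # effective number of equally likely next characters
bits_per_char = loss_nats / math.log(2)   # the same loss expressed in bits

print(f"perplexity: {perplexity:.2f}")
print(f"bits per character: {bits_per_char:.2f}")
print(f"accuracy: {accuracy:.1%}")
```

Because the evaluation runs on the training set, these figures describe how well the model fits the training data rather than how well it generalizes.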