# Character-level text generation with LSTM

## Introduction
This example demonstrates how to use an LSTM model to generate text character by character.

At least 20 epochs are required before the generated text starts sounding locally coherent.

It is recommended to run this script on a GPU, as recurrent networks are quite computationally intensive.

If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.

Reference: https://keras.io/examples/generative/lstm_character_level_text_generation/


- More examples:
  * Keras model: https://www.kaggle.com/code/shivamb/beginners-guide-to-text-generation-using-lstms
  * PyTorch model: https://www.kdnuggets.com/2020/07/pytorch-lstm-text-generation-tutorial.html

- More text corpora:
  * [Project Gutenberg](https://www.gutenberg.org/ebooks/search/%3Fsort_order%3Ddownloads)
  * [Alice’s Adventures in Wonderland by Lewis Carroll](https://www.gutenberg.org/ebooks/11)
  * [ASCII format (Plain Text UTF-8)](https://www.gutenberg.org/cache/epub/11/pg11.txt)

In [None]:
import random
import sys
import numpy as np
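# import keras and its layers from the same package to avoid mixing two Keras installations
from tensorflow import keras
from tensorflow.keras import layers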
import pickle
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Prepare the data
path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt',
    )

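# the corpus contains non-ASCII chars (e.g. 'ä', 'é'), so read it explicitly as UTF-8
with open(path, encoding='utf-8') as f:
  text = f.read().lower()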

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt


In [None]:
text[0: 100]

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all ph'

In [None]:
# create mapping of unique chars to integers
chars = sorted(list(set(text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
print(char_to_int)

{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, ',': 7, '-': 8, '.': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20, ';': 21, '=': 22, '?': 23, '[': 24, ']': 25, '_': 26, 'a': 27, 'b': 28, 'c': 29, 'd': 30, 'e': 31, 'f': 32, 'g': 33, 'h': 34, 'i': 35, 'j': 36, 'k': 37, 'l': 38, 'm': 39, 'n': 40, 'o': 41, 'p': 42, 'q': 43, 'r': 44, 's': 45, 't': 46, 'u': 47, 'v': 48, 'w': 49, 'x': 50, 'y': 51, 'z': 52, 'ä': 53, 'æ': 54, 'é': 55, 'ë': 56}
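The mapping above goes from characters to integers. To turn predicted indices back into text, the inverse mapping is also needed. The following is a minimal sketch on a toy string; the names `sample_text`, `sample_char_to_int` and `sample_int_to_char` are made up for this illustration and are not part of the notebook.

```python
# Toy text, used here only for illustration
sample_text = "abba cab"

# Same construction as above: sorted unique chars, then char -> index
sample_chars = sorted(set(sample_text))
sample_char_to_int = {c: i for i, c in enumerate(sample_chars)}

# Inverse mapping (hypothetical name): index -> char
sample_int_to_char = {i: c for c, i in sample_char_to_int.items()}

# Round trip: text -> indices -> text
encoded = [sample_char_to_int[c] for c in sample_text]
decoded = ''.join(sample_int_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```

Because `chars` is a sorted list, indexing it directly (`chars[i]`) gives the same result as an explicit inverse dictionary.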


In [None]:
n_chars = len(text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
#number of unique chars in the corpus
print('Total Unique Characters: ', n_vocab)
#dictionary mapping unique chars to their index in 'chars'
char_indices = dict((char, i) for i, char in enumerate(chars))

Total Characters:  600893
Total Unique Characters:  57


In [None]:
#length of extracted char sequence
max_len = 40

#we sample a new sequence every 'step' char
step = 3 #sliding window

#sentences holds our extracted sequence
sentences = []

#holding the targets or labels (the following chars)
next_char = []

#slide a window of 'max_len' chars over the text, moving 'step' chars each time
for i in range(0, len(text)-max_len, step):
  sentences.append(text[i: i+max_len])
  next_char.append(text[i+max_len]) #label
print('Total Sentence: ', len(sentences))

#one-hot encode the chars into binary arrays
#(np.bool was removed in NumPy 1.24; the built-in bool is equivalent)
x = np.zeros((len(sentences), max_len, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
    x[i, t, char_indices[char]] = 1
  y[i, char_indices[next_char[i]]] = 1

Total Sentence:  200285
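
To make the windowing and one-hot encoding easier to follow, here is a small sketch of the same steps on a toy string. The `toy_*` names and the values `toy_max_len = 4` and `toy_step = 3` are chosen only for this illustration.

```python
import numpy as np

# Toy corpus and window settings (hypothetical values for illustration only)
toy_text = "hello world"
toy_max_len = 4
toy_step = 3

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Slide a window of toy_max_len chars over the text, moving toy_step chars each time
toy_sentences = []
toy_next_char = []
for i in range(0, len(toy_text) - toy_max_len, toy_step):
  toy_sentences.append(toy_text[i: i + toy_max_len])
  toy_next_char.append(toy_text[i + toy_max_len])  # label: the char right after the window

# One-hot encode: x is (samples, timesteps, vocab), y is (samples, vocab)
toy_x = np.zeros((len(toy_sentences), toy_max_len, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
  for t, char in enumerate(sentence):
    toy_x[i, t, toy_char_indices[char]] = True
  toy_y[i, toy_char_indices[toy_next_char[i]]] = True

print(toy_sentences)   # ['hell', 'lo w', 'worl']
print(toy_next_char)   # ['o', 'o', 'd']
print(toy_x.shape, toy_y.shape)
```

Each row of `toy_x` holds exactly one `True` per timestep, and each row of `toy_y` holds exactly one `True` for the target character.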


In [None]:
# Build the model: a single LSTM layer
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(max_len, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
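
As a rough sanity check on model size, the number of trainable parameters can be worked out by hand. This sketch assumes the standard LSTM formulation with four gates and one bias vector per gate, and uses the vocabulary size of 57 reported above; `model.summary()` can be used to compare against the actual model.

```python
# Values taken from the cells above
units = 128        # LSTM hidden size
vocab_size = 57    # number of unique characters in the corpus

# An LSTM layer has 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (units * (vocab_size + units) + units)

# The Dense softmax layer has one weight per (unit, char) pair plus one bias per char
dense_params = units * vocab_size + vocab_size

total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)  # 95232 7353 102585
```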

In [None]:
#compile the model
#'lr' is a deprecated alias; 'learning_rate' is the current argument name
opt = keras.optimizers.RMSprop(learning_rate=0.01)
lss = keras.losses.CategoricalCrossentropy()
model.compile(loss=lss, optimizer=opt)

### Prepare the text sampling function

To control the amount of randomness in the sampling process, we introduce a parameter called the softmax temperature. It sets the entropy of the probability distribution used for sampling, and therefore how surprising or predictable the choice of the next character will be: higher temperatures give more surprising text, lower temperatures give more predictable text.
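
A small numerical sketch may help here. It reweights a made-up three-character distribution at several temperatures and reports the entropy of the result. The helper names `reweight` and `entropy` and the values in `example_preds` are assumptions made for this illustration only.

```python
import numpy as np

def reweight(preds, temperature):
  # Scale the log-probabilities by 1/temperature, then renormalize to sum to 1
  preds = np.asarray(preds).astype('float64')
  logits = np.log(preds) / temperature
  exp_preds = np.exp(logits)
  return exp_preds / np.sum(exp_preds)

def entropy(p):
  # Shannon entropy in nats; assumes all probabilities are strictly positive
  return float(-np.sum(p * np.log(p)))

example_preds = [0.6, 0.3, 0.1]  # made-up distribution over three characters
for temperature in [0.2, 1.0, 1.2]:
  p = reweight(example_preds, temperature)
  print(temperature, np.round(p, 3), round(entropy(p), 3))
```

At temperature 1.0 the distribution is unchanged. Below 1.0 the most likely character dominates and entropy drops; above 1.0 the distribution flattens and entropy rises.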

In [None]:
#pick a weighted random char based on the predicted probabilities instead of always taking the most likely one
def sample(preds, temperature=1.0):
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

#see the corresponding diagram in François Chollet's book (chapter 8)
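
To show how a sampling function like this is typically used, here is a sketch of a generation loop that encodes the full window, predicts once per generated character, and then slides the window forward. It is not the notebook's own code: `generate`, `sample_index`, `predict_fn` and the uniform stand-in predictor are hypothetical names introduced so the example runs without a trained model.

```python
import numpy as np

def sample_index(preds, temperature=1.0):
  # Temperature-reweighted sampling, following the same steps as the function above
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  return int(np.argmax(np.random.multinomial(1, preds, 1)))

def generate(predict_fn, seed, chars, n_chars, temperature=1.0):
  # predict_fn is a hypothetical callable:
  # one-hot window of shape (1, len(seed), len(chars)) -> probabilities of shape (len(chars),)
  char_indices = {c: i for i, c in enumerate(chars)}
  window = seed
  output = seed
  for _ in range(n_chars):
    # Encode the whole window first
    encoded = np.zeros((1, len(window), len(chars)))
    for t, char in enumerate(window):
      encoded[0, t, char_indices[char]] = 1
    # Then predict once and sample the next character
    preds = predict_fn(encoded)
    next_character = chars[sample_index(preds, temperature)]
    # Slide the window: drop the first char, append the new one
    window = window[1:] + next_character
    output += next_character
  return output

# Stand-in for a trained model: always returns a uniform distribution
demo_chars = ['a', 'b', 'c']
uniform = lambda encoded: np.full(len(demo_chars), 1.0 / len(demo_chars))
demo_text = generate(uniform, 'abca', demo_chars, n_chars=10)
print(demo_text)
```

With a trained Keras model, `predict_fn` could be something like `lambda encoded: model.predict(encoded, verbose=0)[0]`, assuming the seed length matches the model's input length.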

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
#train the model
checkpoint_filepath = '/content/gdrive/MyDrive/CheckPoints/LSTMGeneratingtext/'
model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    # monitor='loss',
    # mode='auto',
    # save_best_only=True
    )
try:
  with open(checkpoint_filepath + '/Params.pickle', 'rb') as handle:
    start_epoch = pickle.load(handle)
    model.load_weights(checkpoint_filepath)
    print('starting from epoch = ', start_epoch)
except Exception:
  print('Checkpoint not loaded')
  start_epoch = 1

for epoch in range(start_epoch, 100):
  print('epoch', epoch)
  # fit the model
  model.fit(x, y, batch_size=128, epochs=1, callbacks=[model_checkpoint_callback])

  #update the pickle file with the last completed epoch
  with open(checkpoint_filepath + '/Params.pickle', 'wb') as handle:
    pickle.dump(epoch, handle, protocol=pickle.HIGHEST_PROTOCOL)

  #select a text seed at random
  start_index = random.randint(0, len(text)-max_len-1)
  generated_text = text[start_index: start_index+max_len]
  print('----------------generating with seed: "', generated_text, '"')

  if epoch % 99 != 0:
    continue

  for temperature in [0.2, 0.5, 1.0, 1.2]:
    print('----------------------temperature----------------------: ', temperature)
    print(generated_text)
    #generate 50 chars for each temperature
    for i in range(50):
      #one-hot encode the whole current window before predicting
      sampled = np.zeros((1, max_len, len(chars)))
      for t, char in enumerate(generated_text):
        sampled[0, t, char_indices[char]] = 1

      #predict once per generated char, then sample the next char
      preds = model.predict(sampled, verbose=0)[0]
      next_index = sample(preds, temperature)
      next_character = chars[next_index]

      #slide the window: drop the first char, append the new one
      generated_text = generated_text[1: ]
      generated_text += next_character

      # sys.stdout.write(next_character)
      # sys.stdout.flush()

      print(generated_text)

starting from epoch =  76
epoch 76
----------------generating with seed: " e, perhaps, that our new language sounds "
epoch 77
----------------generating with seed: " difficult for
a noble man to understand: "
epoch 78
----------------generating with seed: " he semi-animal poverty of their souls. r "
epoch 79
----------------generating with seed: " ispleasure, if not scorn and pity philos "
epoch 80
----------------generating with seed: " hat there is
something lacking in them:  "
epoch 81
----------------generating with seed: " e taken of children who cry and scream i "
epoch 82
----------------generating with seed: "  believe that anybody ever looked into t "
epoch 83
----------------generating with seed: " in of good and evil.=--the notion of goo "
epoch 84
----------------generating with seed: " nature, nor his motives, nor his [course "
epoch 85
----------------generating with seed: " doubt that it will be over still sooner  "
epoch 86
----------------generating with seed: " g bel