# Character-level text generation with LSTM

In this notebook I build an LSTM recurrent neural network for character-level text generation. The network is trained on H.P. Lovecraft's stories, which are stored in the file `data/lovecraft.txt`.

The network works as follows. We split the text into input sequences $\mathcal I$ of length `seq_length` (here 40 characters), taken every `stride` characters (here 3). Each sequence is paired with an expected output (label): the character that immediately follows it. Next we assign each character $c$ a unique integer index and one-hot encode the sequences and labels. We then train the network, which consists of a single LSTM layer with 192 units followed by an output layer with a softmax activation function over the set of characters $\mathcal C$. Finally, we use the functions `sample` and `gen_text` to sample from the output of the trained network given an input sequence and so generate new text.
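To make the splitting and encoding steps concrete, here is a minimal sketch on a toy string. The helper `make_windows`, the toy text and the small window size are assumptions chosen for illustration; they are not part of the notebook, although the loop follows the same pattern as the preprocessing cell.

```python
import numpy as np

def make_windows(text, seq_length, stride):
  """Split text into (input sequence, next character) pairs."""
  inputs, labels = [], []
  for i in range(0, len(text) - seq_length, stride):
    inputs.append(text[i:i + seq_length])
    labels.append(text[i + seq_length])
  return inputs, labels

toy_text = "hello world"  # illustrative stand-in for the training text
inputs, labels = make_windows(toy_text, seq_length=4, stride=3)
print(inputs)  # ['hell', 'lo w', 'worl']
print(labels)  # ['o', 'o', 'd']

# One-hot encode the first window over the toy vocabulary.
vocab = sorted(set(toy_text))
char_to_index = {c: n for n, c in enumerate(vocab)}
one_hot = np.zeros((4, len(vocab)), dtype=bool)
for m, c in enumerate(inputs[0]):
  one_hot[m, char_to_index[c]] = True
print(one_hot.shape)  # (4, 8): one row per position, one column per character
```

Each row of `one_hot` contains exactly one `True`, marking which character sits at that position of the window.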

In [5]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow import keras
from tqdm import tqdm

with open("data/lovecraft.txt", "r") as file:
  text = file.read().lower()

text = text.replace("\n", "") # remove newlines for now
print("Training text length:", len(text), "characters")

chars = sorted(set(text))
print("Text contains {0} unique characters".format(len(chars)))
char2idx = {u : n for n, u in enumerate(chars)}
idx2char = {n : u for n, u in enumerate(chars)}

seq_length = 40
stride = 3
sents = []
next_chars = []

# Split the text into overlapping sequences and their next-character labels
for i in range(0, len(text) - seq_length, stride):
  sents.append(text[i:i+seq_length])
  next_chars.append(text[i+seq_length])

print("{0} training sequences".format(len(sents)))

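# One-hot encode: x has shape (num_sequences, seq_length, num_chars),
# y has shape (num_sequences, num_chars).
# The built-in bool is used because the np.bool alias was removed in NumPy 1.24.
x = np.zeros((len(sents), seq_length, len(chars)), dtype=bool)
y = np.zeros((len(sents), len(chars)), dtype=bool)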
for n, sent in enumerate(sents):
  for m, char in enumerate(sent):
    x[n,m,char2idx[char]] = True
  y[n, char2idx[next_chars[n]]] = True


Training text length: 2882630 characters
Text contains 62 unique characters
960864 training sequences


In [6]:
model = Sequential([
  Input(shape=(seq_length, len(chars))),
  LSTM(192),
  Dense(len(chars), activation="softmax")
])

model.compile(loss="categorical_crossentropy", optimizer="adam")
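As a rough sanity check on the model size, the number of trainable parameters can be worked out by hand. The sketch below assumes the default Keras layer configuration (bias terms enabled, four LSTM gates each with an input kernel, a recurrent kernel and a bias) and the 62 unique characters reported above; the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
  # Four gates, each with an input kernel (input_dim x units),
  # a recurrent kernel (units x units) and a bias vector (units).
  return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
  # Weight matrix plus bias vector.
  return input_dim * units + units

num_chars = 62    # unique characters in the training text
lstm_units = 192  # size of the LSTM layer

lstm_params = lstm_param_count(num_chars, lstm_units)
dense_params = dense_param_count(lstm_units, num_chars)
print(lstm_params, dense_params, lstm_params + dense_params)
# 195840 11966 207806
```

Under those assumptions, `model.summary()` should report the same total.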

In [7]:
epochs = 10
batch_size = 192

model.fit(x,y,batch_size=batch_size, epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f9832426d90>

After fitting the model, we generate new characters by feeding it an input sequence. The first step is to sample from the output distribution. Recall that our output layer has a softmax activation function, so the model outputs a probability $p(c|\mathcal I)$ for every candidate next character $c\in\mathcal C$. We sample from a temperature-rescaled version of this output, which is a Gibbs distribution
$$
\mathcal P(c|\mathcal I)=\frac{e^{-\beta\mathcal H(c)}}{\mathcal Z}
$$
where
* $\beta=\frac{1}{T}$ is the inverse temperature,
* $\mathcal H(c)=-\log p(c|\mathcal I)$ is the negative log-likelihood of a character $c$ given an input sequence $\mathcal I$,
* $\mathcal Z=\sum_{c'\in\mathcal C}e^{-\beta\mathcal H(c')}$ is the normalisation constant.
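Since $e^{-\beta\mathcal H}=p(c|\mathcal I)^{1/T}$, the temperature simply raises the model's probabilities to the power $1/T$ and renormalises them. The sketch below illustrates this on a made-up three-character distribution; the function name `apply_temperature` and the numbers are assumptions for illustration only.

```python
import numpy as np

def apply_temperature(probs, temperature):
  """Rescale a probability vector with the Gibbs formula above."""
  probs = np.asarray(probs, dtype="float64")
  H = -np.log(probs)                # negative log-likelihood
  gibbs = np.exp(-H / temperature)  # unnormalised weights, equal to probs ** (1 / T)
  return gibbs / np.sum(gibbs)      # divide by the normalisation constant Z

probs = [0.6, 0.3, 0.1]  # hypothetical model output
for T in (0.5, 1.0, 2.0):
  print(T, np.round(apply_temperature(probs, T), 3))
```

With $T<1$ the distribution sharpens towards the most likely character, with $T>1$ it flattens towards uniform, and with $T=1$ the model's output is left unchanged.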

Sampling a character from this distribution is done by the `sample` function, which takes the output of our model and the temperature $T$ as inputs.

Next we generate a string of $N$ new characters using the `gen_text` function. It takes an input sequence $\mathcal I$ of length `seq_length`, an integer `length` specifying how many characters to generate (this is $N$), and a temperature $T$ for the sampling function.

In [8]:
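def sample(preds, temperature=1.0):
  """Draw a character index from the temperature-rescaled distribution."""
  preds = np.asarray(preds).astype("float64")
  H = -np.log(preds)  # negative log-likelihood of each character
  gibbs = np.exp(-H / temperature)  # unnormalised Gibbs weights
  Z = np.sum(gibbs)  # normalisation constant
  gibbs = gibbs / Z
  p = np.random.multinomial(1, gibbs, 1)  # a single draw, returned as a one-hot row
  return np.argmax(p)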

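def gen_text(seed, length, temperature=1.0):
  """Generate `length` new characters after `seed` (expected to be `seq_length` characters long)."""
  out = seed
  state = seed  # sliding window holding the most recent seq_length characters
  for _ in tqdm(range(length)):
    # One-hot encode the current window as a batch of size 1
    x = np.zeros((1, seq_length, len(chars)))
    for n, char in enumerate(state):
      x[0, n, char2idx[char]] = 1.0
    preds = model.predict(x, verbose=0)[0]
    next_idx = sample(preds, temperature)
    next_char = idx2char[next_idx]
    state = state[1:] + next_char  # drop the oldest character, append the new one
    out += next_char

  return out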

# Sample outputs (precalculated using Google Colab)

The seed for generating new text was:

In [9]:
text[:seq_length]

'when i drew nigh the nameless city i kne'

Some generated texts were:

* [when i drew nigh the nameless city i kne]w the subtern and prints to the close of the one which he linker of the grave as the accorning the roof in the currous line of the while and the litter him than a man in the real could not a could be 
* [when i drew nigh the nameless city i kne]w been and the sound and was the profession to one the floor to see the state of the whilp as the general benew must have been and the street had been the processing the railing mind had to see that t
* [when i drew nigh the nameless city i kne]w the becamen seen mountain mind and from any your of the while the sease that the men to alter a pable encoment of the frint great stone on the speck which one the mountains which strange of the dema

You can generate your own text using:

In [10]:
seed = text[0:seq_length]
length = 200
temperature = 1.0
gen_text(seed, length, temperature)

100%|██████████| 200/200 [00:05<00:00, 35.72it/s]


'when i drew nigh the nameless city i knew i whach i wadd the vaultous flowaish fatter necreat. abys effacime almost common. there would out been askemoting the band whisper of the telled on the colounible to which had bree west, but we pure'