## Training a character language model and studying various ways of generating text

**Author: matthieu.labeau@telecom-paris.fr**

## Objectives:

- We will train a network to predict the next character given an input sequence of characters, and use it to generate new sequences.
- We will work strictly with local prediction (as opposed to structured prediction), meaning we will predict only one character at a time. However, we will look into a relatively simple heuristic to improve the "structure": *beam search*. We can also try to improve generation with other methods: *temperature* sampling, *top-k* sampling, and *top-p* sampling.
- We will use ```keras``` to build the model, based on a **recurrent neural network** called an **LSTM**, which will use simple features (one-hot vectors representing the previous characters) to predict the next character. We will use a small model to avoid training for too long. *Remark: you don't need to know how this model works, just its inputs and outputs!*
- We will use a small dataset (poetry, from Project Gutenberg). You can use any data you prefer, as long as you are able to train the model on it.
- Even with a small dataset and a small model, training may take a long time. If you have access to a computing infrastructure such as Google Colab, it may be more practical, and you can probably obtain better results with a bigger model and a larger dataset.

#### Obtaining the data
- We download the ebook directly from Project Gutenberg. You can use any other text you prefer.

In [18]:
from keras.utils import get_file
url = 'http://www.gutenberg.org/cache/epub/6099/pg6099.txt'
path = get_file('pg6099.txt', origin=url)

# Read the whole file; the context manager closes it for us.
with open(path, 'r', encoding='utf8') as f:
    lines = f.readlines()

# Keep the lines from the Project Gutenberg start marker onwards,
# and stop if the end marker is found.
text = []
start = False
for line in lines:
    if "*** START OF THE PROJECT GUTENBERG EBOOK LES FLEURS DU MAL ***" in line and not start:
        start = True
    if "            *** END OF THE PROJECT GUTENBERG EBOOK LES FLEURS DU MAL ***" in line:
        break
    if not start or len(line) == 0:
        continue
    text.append(line)

text = " ".join(text)
voc_chars = sorted(set(text))
nb_chars = len(voc_chars)

In [19]:
print(voc_chars)

['\n', ' ', '!', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '5', '6', '7', '8', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '«', '»', 'È', 'É', 'Ï', 'à', 'â', 'ç', 'è', 'é', 'ê', 'ë', 'î', 'ï', 'ô', 'ù', 'û', 'ü']


In [20]:
print(text[20000:21000])

s plaintes,
   Ces extases, ces cris, ces pleurs, ces _Te Deum,_
   Sont un écho redit par mille labyrinthes;
   C'est pour les coeurs mortels un divin opium.
 
   C'est un cri répété par mille sentinelles,
   Un ordre renvoyé par mille porte-voix;
   C'est un phare allumé sur mille citadelles,
   Un appel de chasseurs perdus dans les grands bois!
 
   Car c'est vraiment, Seigneur, le meilleur témoignage
   Que nous puissions donner de notre dignité
   Que cet ardent sanglot qui roule d'âge en âge
   Et vient mourir au bord de votre éternité!
 
 
 
 
   LA MUSE VENALE
 
 
   O Muse de mon coeur, amante des palais,
   Auras-tu, quand Janvier lâchera ses Borées,
   Durant les noirs ennuis des neigeuses soirées,
   Un tison pour chauffer tes deux pieds violets?
 
   Ranimeras-tu donc tes épaules marbrées
   Aux nocturnes rayons qui percent les volets?
   Sentant ta bourse à sec autant que ton palais,
   Récolteras-tu l'or des voûtes azurées?
 
   Il te faut, pour gagner ton pain de chaque

#### Keeping track of possible characters
- Using a ```set```, create a sorted list of possible characters.
- Create two dictionaries: one mapping each character to its index ({character: index}), and the reverse ({index: character}).

Example:

```python
chars = ['a', 'b', 'c']
```

```python
char_indices = {'a': 0, 'b': 1, 'c': 2}
```

```python
indices_char = {0: 'a', 1: 'b', 2: 'c'}
```
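
As a quick check of the two mappings, the following minimal sketch encodes a short string into indices and decodes it back. The `toy_`-prefixed names are hypothetical and only serve this illustration; they are not used elsewhere in the notebook.

```python
# Build both mappings from the characters of a toy string.
toy_chars = sorted(set('acabbaccaabba'))             # ['a', 'b', 'c']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for c, i in toy_char_indices.items()}

# Encode a string into indices, then decode it back.
encoded = [toy_char_indices[c] for c in 'acab']
decoded = ''.join(toy_indices_char[i] for i in encoded)
print(encoded, decoded)
```

Decoding the encoded indices should give back the original string, which is exactly what generation relies on later.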

In [22]:
print('Corpus length:', len(text))

chars = sorted(list(set(text)))
print('Total number of characters:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

Corpus length: 160503
Total number of characters: 92


In [23]:
print(char_indices)
print(indices_char)

{'\n': 0, ' ': 1, '!': 2, "'": 3, '(': 4, ')': 5, '*': 6, ',': 7, '-': 8, '.': 9, '0': 10, '1': 11, '2': 12, '5': 13, '6': 14, '7': 15, '8': 16, ':': 17, ';': 18, '?': 19, 'A': 20, 'B': 21, 'C': 22, 'D': 23, 'E': 24, 'F': 25, 'G': 26, 'H': 27, 'I': 28, 'J': 29, 'K': 30, 'L': 31, 'M': 32, 'N': 33, 'O': 34, 'P': 35, 'Q': 36, 'R': 37, 'S': 38, 'T': 39, 'U': 40, 'V': 41, 'W': 42, 'X': 43, 'Y': 44, '[': 45, ']': 46, '_': 47, 'a': 48, 'b': 49, 'c': 50, 'd': 51, 'e': 52, 'f': 53, 'g': 54, 'h': 55, 'i': 56, 'j': 57, 'k': 58, 'l': 59, 'm': 60, 'n': 61, 'o': 62, 'p': 63, 'q': 64, 'r': 65, 's': 66, 't': 67, 'u': 68, 'v': 69, 'w': 70, 'x': 71, 'y': 72, 'z': 73, '«': 74, '»': 75, 'È': 76, 'É': 77, 'Ï': 78, 'à': 79, 'â': 80, 'ç': 81, 'è': 82, 'é': 83, 'ê': 84, 'ë': 85, 'î': 86, 'ï': 87, 'ô': 88, 'ù': 89, 'û': 90, 'ü': 91}
{0: '\n', 1: ' ', 2: '!', 3: "'", 4: '(', 5: ')', 6: '*', 7: ',', 8: '-', 9: '.', 10: '0', 11: '1', 12: '2', 13: '5', 14: '6', 15: '7', 16: '8', 17: ':', 18: ';', 19: '?', 20: 'A',

#### Creating training data
- We will represent characters using *one-hot vectors*. The $i$-th of $n$ possible characters is represented by a vector of length $n$ containing $0$ everywhere except for a $1$ in position $i$. Following our previous example, ```a = [1, 0, 0]``` and ```b = [0, 1, 0]```.
- A sequence of characters is therefore a list of one-hot vectors. Our goal is to predict the next character given an input sequence of fixed length (here, this length is given by ```maximum_seq_length```). We thus need to build two lists: ```sentences```, containing the input sequences, and ```next_char```, containing the characters to be predicted.
- We do not necessarily need to take all possible sequences. We can select one every ```time_step``` steps.

Example: Using the previous dictionaries, the sequence:
```'acabbaccaabba'``` with ```maximum_seq_length = 4``` and ```time_step = 2``` would give the following lists:

```python
sentences = ['acab', 'abba', 'bacc', 'ccaa', 'aabb']
```

```python
next_char = ['b', 'c', 'a', 'b', 'a']
```
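
The toy example can be reproduced with a small helper that applies the sliding window. This is a minimal sketch: `make_windows` is a hypothetical name introduced here for illustration only.

```python
def make_windows(text, maximum_seq_length, time_step):
    """Slide a fixed-size window over `text`; return the inputs and their targets."""
    sentences, next_char = [], []
    for i in range(0, len(text) - maximum_seq_length, time_step):
        sentences.append(text[i: i + maximum_seq_length])   # input window
        next_char.append(text[i + maximum_seq_length])      # character that follows it
    return sentences, next_char

toy_sentences, toy_next_char = make_windows('acabbaccaabba', 4, 2)
print(toy_sentences)
print(toy_next_char)
```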

In [24]:
maximum_seq_length = 24
time_step = 1
sentences = []
next_char = []
for i in range(0, len(text) - maximum_seq_length, time_step):
    sentences.append(text[i: i + maximum_seq_length])
    next_char.append(text[i + maximum_seq_length])
print('Number of Sequences:', len(sentences))

Number of Sequences: 160479


In [25]:
import numpy as np
import random
import sys

#### Creating training tensors
- We need to transform these lists into tensors, using one-hot vectors to represent characters.
- We will need 3 dimensions for the training examples from ```sentences```: the number of examples, the length of the sequence, and the dimension of the one-hot vector
- This is reduced to 2 dimensions for the ```next_char```: number of examples and one-hot vector.

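Size of ```X```: number of sequences (n-grams) $\times$ sequence length ($n-1$) $\times$ vocabulary size.

For *Les Fleurs du Mal*: size of ```X``` $= 160479 \times 24 \times 92$ and size of ```y``` $= 160479 \times 92$.

Example: with the mapping ```a: 0, b: 1, c: 2```, the previous ```sentences``` would become the following (the first block corresponds to the sequence ```acab```):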

```python
X = [[[1, 0, 0],
      [0, 0, 1],
      [1, 0, 0],
      [0, 1, 0]],
     [[1, 0, 0],
      [0, 1, 0],
      [0, 1, 0],
      [1, 0, 0]],
     [[0, 1, 0],
      [1, 0, 0],
      [0, 0, 1],
      [0, 0, 1]],
     [[0, 0, 1],
      [0, 0, 1],
      [1, 0, 0],
      [1, 0, 0]],
     [[1, 0, 0],
      [1, 0, 0],
      [0, 1, 0],
      [0, 1, 0]]]
```
       
```python
y = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0],
     [0, 1, 0],
     [1, 0, 0]]
```
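
The toy tensors can be rebuilt and checked with NumPy. The sketch below uses hypothetical `toy_`-prefixed names and the toy vocabulary, not the real corpus.

```python
import numpy as np

toy_chars = ['a', 'b', 'c']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_sentences = ['acab', 'abba', 'bacc', 'ccaa', 'aabb']
toy_next_char = ['b', 'c', 'a', 'b', 'a']

# One one-hot vector per character (X) and one per target character (y).
X_toy = np.zeros((len(toy_sentences), 4, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_toy[i, t, toy_char_indices[char]] = True
    y_toy[i, toy_char_indices[toy_next_char[i]]] = True

print(X_toy.shape, y_toy.shape)

# Taking the argmax over the last axis recovers the original strings.
decoded = [''.join(toy_chars[k] for k in row.argmax(axis=-1)) for row in X_toy]
print(decoded)
```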

In [27]:
X = np.zeros((len(sentences), maximum_seq_length, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_char[i]]] = 1

#### Implement the model
In order to implement the model as simply as possible, we will use ```keras```, which lets us create models with only a few lines of code.
First, we will create a very simple model based on an **LSTM**, which is a *recurrent* architecture. Note that one of the strengths of a recurrent architecture is that it allows for inputs of varying length. Here, to simplify data processing, we will keep a **fixed input size**.

In [28]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.callbacks import LambdaCallback, EarlyStopping

We need to create an LSTM model that directly takes our inputs from ```X``` and tries to predict the one-hot vectors from ```y```.
- What are the input and output dimensions?
  - ```X```: size of the dataset $\times$ maximum sequence length $\times$ vocabulary size
  - ```y```: size of the dataset $\times$ vocabulary size
- The model should be made with a ```LSTM``` layer, and a ```Dense``` layer followed by a softmax activation function. Work out the intermediate dimensions:
  - ```X``` $\rightarrow$ (LSTM) $\rightarrow$ ```h``` $\rightarrow$ (Dense) $\rightarrow$ ```s``` $\rightarrow$ (softmax) $\rightarrow$ ```pred```
  - Look at the layers' arguments and find the proper ```input_shape``` for the ```LSTM``` layer and the proper size for the ```Dense``` layer.
  - We can use 256 as the size of hidden states for the ```LSTM```.
- We will minimize ```cross-entropy(pred, y)```. Use the ```categorical_crossentropy``` loss, with the optimizer of your preference (for example, ```adam```).
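
To make the last two steps of the pipeline concrete, here is a minimal NumPy sketch of the softmax and of the cross-entropy with a one-hot target. The scores are made up for a 3-character vocabulary, and the `softmax` and `cross_entropy` helpers are illustrative stand-ins, not the Keras implementations.

```python
import numpy as np

def softmax(s):
    """Turn a vector of scores into a probability distribution."""
    e = np.exp(s - np.max(s))   # subtracting the max improves numerical stability
    return e / e.sum()

def cross_entropy(pred, target):
    """Cross-entropy between a predicted distribution and a one-hot target."""
    return -np.sum(target * np.log(pred))

s_toy = np.array([2.0, 1.0, 0.1])        # made-up scores `s` for 'a', 'b', 'c'
target_toy = np.array([0.0, 1.0, 0.0])   # the true next character is 'b'
pred_toy = softmax(s_toy)
loss_toy = cross_entropy(pred_toy, target_toy)
print(pred_toy, loss_toy)
```

With a one-hot target, the loss reduces to minus the log-probability that the model assigns to the correct character.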

In [29]:
model = Sequential()
model.add(LSTM(256, input_shape=(maximum_seq_length, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

We will now only need a few functions to use this model:
- ```model.fit```, which you will call on the appropriately processed data ```X, y```
- ```model.predict```, which we will use on an input **with the same dimensions as X** to output the probabilities. That includes the *first dimension*, corresponding to the number of examples in the input.
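
In practice, this means that a single sequence must be wrapped into a batch of size one before prediction. A minimal sketch, using the sequence length and vocabulary size of this notebook under the hypothetical names `seq_len` and `vocab_size`:

```python
import numpy as np

seq_len, vocab_size = 24, 92                      # values used in this notebook
single_example = np.zeros((seq_len, vocab_size))  # one encoded sequence

# Add a leading axis for the number of examples.
batch_of_one = single_example[np.newaxis, ...]    # same as np.expand_dims(single_example, 0)
print(batch_of_one.shape)
```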

#### Create a function to generate text with our model
- We use the output of our model to select the next most probable character (with the ```argmax``` function)
- We need to transform an input text into an input tensor, as before, taking the last ```maximum_seq_length``` characters to get the right length.
- We need to transform the most probable index back into a character and add it to our text.
- This must be looped ```num_generated``` times, each time building a new input tensor from the new input sequence (which now ends with the character we just predicted).


We can begin by writing a function facilitating the transfer between text and tensors:

In [30]:
def get_tensor(sentence, maximum_seq_length, voc):
    x = np.zeros((1, maximum_seq_length, len(voc)))
    for t, char in enumerate(sentence):
        x[0, t, voc[char]] = 1.
    return x

The following function (```end_epoch_generate```) automatically generates text at the end of each epoch, so you can monitor how the generation changes as the model trains. It calls the ```generate_next``` function on each sequence of text in ```texts_ex```. The only element in this list right now comes from the training data; you can add your own.

In [31]:
def generate_next(model, text, num_generated=120):
    generated = text
    sentence = text[-maximum_seq_length:]
    for i in range(num_generated):
        x = get_tensor(sentence, maximum_seq_length, char_indices)
        predictions = model.predict(x, verbose=0)[0]
        next_index = np.argmax(predictions)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
    return(generated)

def end_epoch_generate(epoch, _):
    print('\n Generating text after epoch: %d' % (epoch+1))
    texts_ex = ["La sottise, l'erreur, le péché"]
    for text in texts_ex:
        sample = generate_next(model, text)
        print('%s' % (sample))

In [32]:
text_ex = "La sottise, l'erreur, le péché"
generate_next(model, text_ex)

"La sottise, l'erreur, le péchéc«w«--T-T--T--T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T-"

In [33]:
model.fit(X, y,
          batch_size=128,
          epochs=10,
          validation_split = 0.2,
          callbacks=[LambdaCallback(on_epoch_end=end_epoch_generate)])

Epoch 1/10
 Generating text after epoch: 1
La sottise, l'erreur, le péchése de le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le 
Epoch 2/10
 Generating text after epoch: 2
La sottise, l'erreur, le péchés de le chait de le chait de le chait de le chait de le chait de le chait de le chait de le chait de le chait de le chai
Epoch 3/10
 Generating text after epoch: 3
La sottise, l'erreur, le péchés de la coure et le coure et le coure et le coure et le coure et le coure et le coure et le coure et le coure et le cour
Epoch 4/10
 Generating text after epoch: 4
La sottise, l'erreur, le péchés de la grand de la grand de la grand de la grand de la grand de la grand de la grand de la grand de la grand de la gran
Epoch 5/10
 Generating text after epoch: 5
La sottise, l'erreur, le péchété de la grande la mort de la grande la mort de la grande la mort de la grande la mort de la grande la mort de la grande
Epoch 6/10
 Generating text af

<keras.src.callbacks.History at 0x7aaa1810fd90>

#### Using character embeddings
- Instead of using one-hot vectors to represent characters, we will now use character embeddings: dense vectors, one per character, all belonging to the same space.
- We will need as many vectors as there are characters. The input of the network will be simpler, since we only need to tell the model which character is at each input position, using its index.
- The output does not change: indeed, Keras uses one-hot vectors for the target of the categorical cross-entropy loss.
- We need to add an ```Embedding``` layer to the model, with the right input size, and to choose which dimension to use for our embeddings.

Example: the previous example ```sentences``` would now become:

```python
X = [[0, 2, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 2, 2],
     [2, 2, 0, 0],
     [0, 0, 1, 1]]
```
       
```python
y = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0],
     [0, 1, 0],
     [1, 0, 0]]
```
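
To see why integer indices are enough, note that looking up row $i$ of an embedding matrix gives the same result as multiplying the one-hot vector of character $i$ by that matrix. The sketch below checks this with NumPy; the matrix `toy_E` is random and only stands in for learned embedding weights.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_emb_dim = 3, 2
toy_E = rng.normal(size=(toy_vocab_size, toy_emb_dim))   # stand-in for learned embeddings

toy_indices = np.array([0, 2, 0, 1])                     # 'acab' as integer indices
toy_one_hot = np.eye(toy_vocab_size)[toy_indices]        # shape (4, 3)

looked_up = toy_E[toy_indices]                           # direct row lookup
projected = toy_one_hot @ toy_E                          # equivalent one-hot product
print(np.allclose(looked_up, projected))
```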

In [35]:
X_emb = np.zeros((len(sentences), maximum_seq_length), dtype=int)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X_emb[i, t] = char_indices[char]

In [36]:
from keras.layers import Embedding

model_emb = Sequential()
model_emb.add(Embedding(len(chars), 32, input_length = maximum_seq_length))
model_emb.add(LSTM(256))
model_emb.add(Dense(len(chars)))
model_emb.add(Activation('softmax'))

model_emb.compile(loss='categorical_crossentropy', optimizer='adam')

In [37]:
def generate_next(model, text, num_generated=120):
    generated = text
    sentence = text[-maximum_seq_length:]
    for i in range(num_generated):
        x = np.zeros((1, maximum_seq_length))
        for t, char in enumerate(sentence):
            x[0, t] = char_indices[char]
        predictions = model.predict(x, verbose=0)[0]
        next_index = np.argmax(predictions)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
    return(generated)

def end_epoch_generate(epoch, _):
    print('\n Generating text after epoch: %d' % (epoch+1))
    texts_ex = ["La sottise, l'erreur, le péché"]
    for text in texts_ex:
        sample = generate_next(model_emb, text)
        print('%s' % (sample))

In [38]:
model_emb.fit(X_emb, y,
          batch_size=128,
          epochs=10,
          validation_split = 0.2,
          callbacks=[LambdaCallback(on_epoch_end=end_epoch_generate)])

Epoch 1/10
 Generating text after epoch: 1
La sottise, l'erreur, le péchése de poure de poure de poure de poure de poure de poure de poure de poure de poure de poure de poure de poure de poure 
Epoch 2/10
 Generating text after epoch: 2
La sottise, l'erreur, le péchéte et les les les les les les les les les les les les les les les les les les les les les les les les les les les les le
Epoch 3/10
 Generating text after epoch: 3
La sottise, l'erreur, le péchéte et de sour de sour de sour de sour de sour de sour de sour de sour de sour de sour de sour de sour de sour de sour de
Epoch 4/10
 Generating text after epoch: 4
La sottise, l'erreur, le péchét de le son coeur de le son coeur de le son coeur de le son coeur de le son coeur de le son coeur de le son coeur de le 
Epoch 5/10
 Generating text after epoch: 5
La sottise, l'erreur, le péchés de la plaire et les chasses des morts son son et les chasses des morts son son et les chasses des morts son son et les
Epoch 6/10
 Generating text af

<keras.src.callbacks.History at 0x7aa9f0775f30>

#### Sampling with our model
- Now, instead of simply selecting the most probable next character, we would like to be able to draw a sample from the distribution output by the model.
- To better control the generation, we introduce a ```temperature``` argument that rescales the distribution: values below $1$ sharpen it (bringing sampling closer to ```argmax```), while values above $1$ flatten it.
- We will use the ```multinomial``` function from ```numpy.random``` to draw samples.
- We integrate this into a function ```generate_sample``` that is almost exactly like ```generate_next```.
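
The effect of the temperature can be observed on a toy distribution. This is a minimal sketch with made-up probabilities; `apply_temperature` is a hypothetical helper that divides the log-probabilities by the temperature and renormalises.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a distribution: divide log-probabilities by the temperature, renormalise."""
    logits = np.log(probs) / temperature
    scaled = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return scaled / scaled.sum()

toy_probs = np.array([0.6, 0.3, 0.1])
sharp = apply_temperature(toy_probs, 0.5)    # low temperature: more peaked
same = apply_temperature(toy_probs, 1.0)     # temperature 1: unchanged
flat = apply_temperature(toy_probs, 2.0)     # high temperature: closer to uniform
for name, dist in [('T=0.5', sharp), ('T=1.0', same), ('T=2.0', flat)]:
    print(name, np.round(dist, 3))
```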

In [39]:
def reweight(predictions, temperature):
    predictions = np.asarray(predictions).astype('float64')
    log_predictions = np.log(predictions) / temperature
    predictions = np.exp(log_predictions)
    predictions = predictions / np.sum(predictions)
    return predictions

def sample(predictions, temperature):
    predictions = reweight(predictions, temperature)
    sampled = np.random.multinomial(1, predictions, 1)
    return np.argmax(sampled)

def generate_sample(model, text, num_generated=120, temperature=1.0):
    generated = text
    sentence = text[-maximum_seq_length:]
    for i in range(num_generated):
        x = np.zeros((1, maximum_seq_length))
        for t, char in enumerate(sentence):
            x[0, t] = char_indices[char]
        predictions = model.predict(x, verbose=0)[0]
        next_index = sample(predictions, temperature)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
    return(generated)

In [40]:
print(generate_sample(model_emb, text_ex, temperature = 0.7))

La sottise, l'erreur, le péchés était de voix, qui noire aux anges,
       De ton vieil leurs du souviens.
 
   Au blatant son trouvent comment des fa


In [44]:
def sample_top_k(predictions, temperature, k):
    predictions = np.asarray(predictions).astype('float64')
    log_predictions = np.log(predictions) / temperature
    # Mask everything except the k most probable characters
    # (argsort is ascending, so [:-k] selects all but the k largest).
    indices_to_remove = log_predictions.argsort()[:-k]
    log_predictions[indices_to_remove] = -float('Inf')
    predictions = np.exp(log_predictions)
    predictions = predictions / np.sum(predictions)
    probas = np.random.multinomial(1, predictions, 1)
    return np.argmax(probas)


def generate_sample_top_k(model, text, num_generated=120, temperature=1.0, k=10):
    generated = text
    sentence = text[-maximum_seq_length:]
    for i in range(num_generated):
        x = np.zeros((1, maximum_seq_length))
        for t, char in enumerate(sentence):
            x[0, t] = char_indices[char]
        predictions = model.predict(x, verbose=0)[0]
        next_index = sample_top_k(predictions, temperature, k)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
    return(generated)

In [52]:
print(generate_sample_top_k(model_emb, text_ex, temperature = 0.8, k = 3))

La sottise, l'erreur, le péchément la gardie,
   Doucement le somme la ros.
 
   Elle se plâcre armorreux;
   Nous gressé tout comme un dit:
   Mortis
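
Top-k sampling restricts the choice to the $k$ most probable characters before drawing a sample. The filtering step can be checked on a toy distribution; `top_k_filter` is a hypothetical helper written for this illustration and the probabilities are made up.

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything except the k largest probabilities, then renormalise."""
    probs = np.asarray(probs, dtype='float64')
    keep = np.argsort(probs)[-k:]            # indices of the k largest entries
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

toy_probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
filtered_k = top_k_filter(toy_probs, k=2)    # only the two largest entries survive
print(filtered_k)
```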


In [55]:
def sample_top_p(predictions, temperature, p):
    predictions = np.asarray(predictions).astype('float64')
    log_predictions = np.log(predictions) / temperature
    predictions = np.exp(log_predictions)
    predictions = predictions / np.sum(predictions)

    # Sort from most to least probable, then keep the smallest prefix
    # whose cumulative probability reaches p.
    cum_prob = 0.0
    incr = 0
    indices = predictions.argsort()[::-1]
    probs = predictions[indices]
    while cum_prob < p and incr < len(probs):
        cum_prob += probs[incr]
        incr += 1
    indices_to_remove = indices[incr:]

    # Mask the remaining, least probable characters and renormalise
    log_predictions[indices_to_remove] = -float('Inf')
    predictions = np.exp(log_predictions)
    predictions = predictions / np.sum(predictions)
    sampled = np.random.multinomial(1, predictions, 1)
    return np.argmax(sampled)


def generate_sample_top_p(model, text, num_generated=60, temperature=1.0, p=0.9):
    generated = text
    sentence = text[-maximum_seq_length:]
    for i in range(num_generated):
        x = np.zeros((1, maximum_seq_length))
        for t, char in enumerate(sentence):
            x[0, t] = char_indices[char]
        predictions = model.predict(x, verbose=0)[0]
        next_index = sample_top_p(predictions, temperature, p)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
    return(generated)

In [64]:
print(generate_sample_top_p(model_emb, text_ex.lower(), temperature = 0.7, p = 0.8))

la sottise, l'erreur, le péché,
   Car ta ferce ou de beauté,
   Comme allonge parfim d'un
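
Top-p (nucleus) sampling keeps the smallest set of most probable characters whose cumulative probability reaches $p$. The sketch below shows the filtering step on a toy distribution; `top_p_filter` is a hypothetical helper and the probabilities are made up.

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of most probable entries whose total mass reaches p."""
    probs = np.asarray(probs, dtype='float64')
    order = np.argsort(probs)[::-1]               # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of entries to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

toy_probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
filtered_p = top_p_filter(toy_probs, p=0.8)       # 0.5 + 0.2 < 0.8, so a third entry is needed
print(filtered_p)
```

Unlike top-k, the number of surviving characters adapts to the shape of the distribution: a peaked distribution keeps few candidates, a flat one keeps many.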


#### Generate text with the beam algorithm
- We loop over each character we want to generate, keeping track of at most the ```beam_size``` best sequences.
- Besides keeping track of the past generated characters for each of these ```beam_size``` sequences, we need to keep track of their log-probabilities.
- At each step, we keep the ```beam_size``` best predictions for each of the ```beam_size``` sequences, compute the log-probabilities of the (```beam_size```)$^2$ newly formed candidate sequences, and keep the overall ```beam_size``` best of them.
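
The benefit over greedy decoding can be seen on a tiny example. In the sketch below, the "model" is a made-up table `toy_transitions` giving the next-symbol distribution after each of three symbols; `toy_beam_search` is a hypothetical helper that follows the steps described above and is not tied to the Keras model.

```python
import numpy as np

# Made-up first-order "model": row i is the next-symbol distribution after symbol i.
toy_transitions = np.array([
    [0.10, 0.50, 0.40],
    [0.30, 0.40, 0.30],
    [0.05, 0.05, 0.90],
])

def toy_beam_search(start, num_generated, beam_size):
    beam = [(0.0, [start])]                          # (log-probability, sequence)
    for _ in range(num_generated):
        candidates = []
        for log_prob, seq in beam:
            probs = toy_transitions[seq[-1]]
            # Extend each sequence with its beam_size most probable next symbols
            for nxt in np.argsort(probs)[-beam_size:]:
                candidates.append((log_prob + np.log(probs[nxt]), seq + [int(nxt)]))
        # Keep the beam_size candidates with the highest log-probability
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam[0]

greedy_score, greedy_seq = toy_beam_search(0, 2, beam_size=1)   # beam of 1 is greedy decoding
beam_score, beam_seq = toy_beam_search(0, 2, beam_size=2)
print(greedy_seq, np.exp(greedy_score))
print(beam_seq, np.exp(beam_score))
```

Greedy decoding commits to the locally best first symbol, while a beam of size 2 keeps the second-best option alive and ends up with a more probable sequence overall.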

In [67]:
def generate_beam(model, text, beam_size=16, num_generated=128):
    sentence = text[-maximum_seq_length:]
    # Each beam entry is (log-probability, generated indices, current model input)
    current_beam = [(0.0, [], sentence)]

    for l in range(num_generated):
        all_beams = []
        for prob, current_preds, current_input in current_beam:
            x = np.zeros((1, maximum_seq_length))
            for t, char in enumerate(current_input):
                x[0, t] = char_indices[char]
            prediction = model.predict(x=x, verbose=0)[0]
            possible_next_chars = prediction.argsort()[-beam_size:][::-1]
            all_beams += [
                (prob + np.log(prediction[next_index]),
                 current_preds + [next_index],
                 current_input[1:] + indices_char[next_index]
                )
                for next_index in possible_next_chars]

        # Sort by log-probability only and keep the beam_size best candidates
        current_beam = sorted(all_beams, key=lambda beam: beam[0])[-beam_size:]

    # The beam is sorted in ascending order: the last entry is the most probable sequence
    best_preds = current_beam[-1][1]
    return text + ''.join([indices_char[idx] for idx in best_preds])

In [68]:
print(generate_beam(model_emb, text_ex))

La sottise, l'erreur, le péchés,
   Comme un coeur de ton coeur.
 
 
 
 
   LE VIN
 
 
   Comme une coeur de ton coeur,
   Comme une coeur de ton coeur!
 
  
