# Generating text with Recurrent Neural Networks

## Data Reading
Load the file from the data folder and inspect it. Standardize to lowercase. How long is the corpus?

In [1]:
filename = "/home/fer/data/formaciones/master/deep-learning-intro/datasets/musquetaires/musquetairesShort"
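# Read the whole corpus; UTF-8 is an assumption (the text contains curly quotes)
with open(filename, encoding="utf-8") as f:
    raw_text = f.read()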

In [2]:
raw_text[:100]

'AUTHOR’S PREFACE\n\n\nIn which it is proved that, notwithstanding their names’ ending in OS\nand IS, the'

In [3]:
text = raw_text.lower()
print('corpus length:', len(text))

corpus length: 198059


### Text preparation
We create a sorted list of the distinct characters and two dictionaries: one mapping characters to indices and one mapping indices back to characters.
<font color=red><b>Generate dictionaries for the char to indices and indices to chars.
<br>_Hint: use the enumerate function on the chars set_</b>
</font>

In [4]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 52
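As a quick check of the two dictionaries, the sketch below encodes a string into indices and decodes it back. The names `toy_text`, `toy_char_indices` and `toy_indices_char` are hypothetical stand-ins for the notebook's `text`, `char_indices` and `indices_char`, used here so the example runs on its own.

```python
# Toy corpus standing in for the notebook's `text` (assumption for illustration)
toy_text = "hello world"

# Same construction as above: sorted distinct characters plus both mappings
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode every character to its index, then map the indices back to characters
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```

If the round trip does not return the original string, the two dictionaries are not inverses of each other.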


In [5]:
chars

['\n',
 ' ',
 '!',
 '"',
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '’',
 '“',
 '”']

Next we generate the input and output arrays:

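The input will consist of sentences of a fixed length (_maxlen_), while the outputs will be the character that follows each sentence in the text.

So, if the text is "Welcome to Big Data Spain" with _maxlen_ = 5, we will have the input "Welco" with output "m", then "elcom" with output "e", then "lcome" with output " ", and so on.

To reduce the overlap between consecutive samples (which helps against overfitting and speeds up training) we can add a _step_ to the structure. With step = 3, for example, the inputs become "Welco", "come ", "e to ", and so on, with outputs "m", "t" and "B".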
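The windowing can be tried on the toy sentence with a short sketch. `make_windows` is a hypothetical helper written for this illustration; it follows the same loop structure as the exercise, applied to the example text instead of the corpus.

```python
def make_windows(text, maxlen, step):
    """Return (inputs, targets): fixed-length windows and the character after each."""
    inputs, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        inputs.append(text[i: i + maxlen])
        targets.append(text[i + maxlen])
    return inputs, targets


toy = "Welcome to Big Data Spain"
for step in (1, 3):
    inputs, targets = make_windows(toy, maxlen=5, step=step)
    print("step =", step, "->", len(inputs), "windows")
    # Show only the first three pairs of each configuration
    for x, y in list(zip(inputs, targets))[:3]:
        print("   ", repr(x), "->", repr(y))
```

A larger step produces fewer, less overlapping samples from the same text.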

<font color=red><b>Fill the sentences and next_chars lists with the input and output data</b></font>

In [6]:
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 66007


In [7]:
sentences[:5]

['author’s preface\n\n\nin which it is proved',
 'hor’s preface\n\n\nin which it is proved th',
 '’s preface\n\n\nin which it is proved that,',
 'preface\n\n\nin which it is proved that, no',
 'face\n\n\nin which it is proved that, notwi']

In [8]:
next_chars[:5]

[' ', 'a', ' ', 't', 't']

### Dataset generation
We turn the text into one-hot vectors. Initialize the input and output arrays to zero with a boolean dtype, then set the position of each character to 1.

In [8]:
import numpy as np
# The np.bool alias was removed in NumPy 1.24; the builtin bool is equivalent
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
Y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    Y[i, char_indices[next_chars[i]]] = 1

In [9]:
X.shape

(66007, 40, 52)

In [10]:
X[0]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [11]:
print ("timesteps = ", len (X[0]), ", numchars = ", len (X[0][0]))

timesteps =  40 , numchars =  52
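Because the printed array is mostly `False`, it is hard to tell by eye whether the encoding is right. A minimal sketch of one way to check it is to decode the one-hot rows back into text with `argmax`. The toy alphabet and the names `toy_chars` and `one_hot` are assumptions made for this example only.

```python
import numpy as np

# Toy alphabet and sentence (hypothetical, standing in for `chars` and a corpus window)
sentence = "abca"
toy_chars = sorted(set(sentence))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# One row per timestep, one column per character, exactly as for X[i]
one_hot = np.zeros((len(sentence), len(toy_chars)), dtype=bool)
for t, char in enumerate(sentence):
    one_hot[t, toy_char_indices[char]] = True

# Each row must contain exactly one True value
ones_per_row = one_hot.sum(axis=1)

# argmax along the character axis recovers the index of the active column
decoded = "".join(toy_chars[i] for i in one_hot.argmax(axis=1))
print(ones_per_row, decoded)
```

The same decoding applied to `X[0]` with the notebook's `chars` should give back `sentences[0]`.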


## Model Generation
Build the LSTM model to be trained on the data, with this configuration:
- LSTM layer, with 256 units
- LSTM layer, with 256 units
- Dense layer, with 64 units
- Dense softmax layer
- On compilation, use adam as the optimizer and categorical_crossentropy as the loss function.
- Print the summary


<font color=red><b>Remember to initialize it properly and to include input_shape on the first layer. <br> Hints: input_shape = (maxlen, len(chars))
- Use the imported libraries</b></font>

In [12]:
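import tensorflow as tf

# Enable memory growth only when a GPU is actually present;
# indexing physical_devices[0] on a CPU-only machine raises an IndexError
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
tf.keras.backend.clear_session()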

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.layers import LSTM

In [13]:

model = Sequential()
model.add(LSTM(256, input_shape=(maxlen, len(chars)), return_sequences=True))
model.add(LSTM(256))
model.add(Dense(64))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 40, 256)           316416    
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense (Dense)                (None, 64)                16448     
_________________________________________________________________
dense_1 (Dense)              (None, 52)                3380      
_________________________________________________________________
activation (Activation)      (None, 52)                0         
Total params: 861,556
Trainable params: 861,556
Non-trainable params: 0
_________________________________________________________________
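The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes the standard formulas: an LSTM layer has four weight blocks (three gates plus the cell candidate), each with input weights, recurrent weights and a bias, and a Dense layer has one weight per input-output pair plus a bias per unit. `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    # Four blocks, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit
    return input_dim * units + units


n_chars = 52  # len(chars) in this notebook
counts = [
    lstm_params(n_chars, 256),   # lstm
    lstm_params(256, 256),       # lstm_1
    dense_params(256, 64),       # dense
    dense_params(64, n_chars),   # dense_1
]
print(counts, sum(counts))
```

The softmax `Activation` layer adds no parameters, so the sum matches the reported total.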


### Model Training
Train the model for a couple of epochs and see how it performs. Use a batch_size of 128.

In [14]:
model.fit(X, Y,
          batch_size=128,
          epochs=2)

Train on 66007 samples
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f48c4420cc0>

### Model Evaluation
Let's test our model. In order to obtain a probabilistic answer we can sample from a probability array instead of just taking the max argument:

<font color=red><b> Floating-point rounding can leave the predicted probabilities not summing exactly to one. Renormalize them before sampling to avoid this</b> </font>


$$ p_i = \frac{p_i}{\sum_j p_j}$$


In [16]:
def sample(preds, sample = True):
    if sample:
        # probs can be rounded and not sum up to one. We recalculate in order to avoid this
        preds = np.asarray(preds).astype('float64')
        preds = preds /np.sum (preds)
        probas = np.random.multinomial(1, preds, 1)
    else:
        probas = preds
    return np.argmax(probas)
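To see the difference between the two branches of `sample`, the sketch below compares the greedy choice with repeated multinomial draws on a small, made-up probability vector. It uses a seeded `np.random.default_rng` generator instead of the global `np.random` state, which is an assumption made here for reproducibility; the values in `preds` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)

# Made-up float32 "predictions", as a softmax layer would return them
preds = np.array([0.1, 0.2, 0.7], dtype="float32")

# Same treatment as in sample(): cast to float64 and renormalize
probs = preds.astype("float64")
probs = probs / probs.sum()

# Greedy decoding always picks the most likely index
greedy = int(np.argmax(probs))

# Sampling picks each index with a frequency close to its probability
n_draws = 10_000
counts = rng.multinomial(n_draws, probs)
freqs = counts / n_draws
print(greedy, freqs)
```

Greedy decoding tends to repeat the same likely characters, while sampling keeps some variety in the generated text.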

We get a seed in order to predict:

In [17]:
import random
start_index = random.randint(0, len(text) - maxlen - 1)
generated = ''
sentence = text[start_index: start_index + maxlen]
generated += sentence
print (generated)

price, discontent, or want of fortune, t


#### Predictions
This will be the sequence for which we are going to predict the next character:

<font color=red> <b> Predict the next character given the input x_pred. <br>Hint: remember to take the first item in list</b>  </font>

In [19]:
## Predict the next character given a model and the one-hot encoded sequence
def get_next_char (model, x_pred, indices_char, Sample = True):
    preds = model.predict(x_pred, verbose=0)[0]
    # Pass the Sample flag through: True samples from preds, False takes the argmax
    next_index = sample(preds, Sample)
    return indices_char[next_index]

In [20]:
x_pred = np.zeros((1, maxlen, len(chars)))
for t, char in enumerate(sentence):
    x_pred[0, t, char_indices[char]] = 1.

Sample = True
    
next_char = get_next_char (model, x_pred, indices_char, Sample)  

print (next_char)

h


Let's predict some more characters:

In [21]:
import sys
start_index = random.randint(0, len(text) - maxlen - 1)
sentence = text[start_index: start_index + maxlen]
print('Seed: ' + sentence + '"')
print('---------------------- Generated Text -----------------------')
# sys.stdout.write(generated)
for i in range(400):
    x_pred = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
        x_pred[0, t, char_indices[char]] = 1.

    next_char = get_next_char (model, x_pred, indices_char, Sample)
    sentence = sentence[1:] + next_char

    sys.stdout.write(next_char)
    sys.stdout.flush()

Seed: azing of a ball.”

“was he not a fine-lo"
---------------------- Generated Text -----------------------
u hed joong oul maincy,
“he, and whal ars for hau hos ne celt it in twhicr andy” ammout.

xind m’at in to rearved cidnor
i unfer of has epore of ation, as her hand of the mander-
“on his belul of that adreby, wand im on askie of hem ast the bayter-sely for sortund not ot
you urent lecanle be dore,
to tas mastien the devioned ake” tolt thore
ow mulr, asle all yount tnot, latcouts narker the
urfang 

## Load a trained model
Training deep learning models is time consuming, so some pretrained models are available to load in order to look at better predictions. We will load one model for every 5 epochs of training to see how the generated text evolves.

<font color=red> <b> Load each saved model in turn and generate text with it <br> Hint: You can load the whole model or just the weights, as the configuration is the same</b>  </font>

In [22]:
count = 0
partial_n_epoch = 5
times = 12

## Seed NumPy so that both the seed sentence and the multinomial sampling are reproducible
np.random.seed (2)
for j in range (times):
    
    count += partial_n_epoch
    print ("")
    print ("-------------- Next Model --------------")
    print ("Trained on ", count, " epochs")
    modelName = '/home/fer/data/formaciones/master/deep-learning-intro/models/musquetaires/MusquetairesModelOptimizedMode_' + str (count) + '.h5' 

    model.load_weights (modelName)
   
    start_index = np.random.randint(0, len(text) - maxlen - 1)
    sentence = text[start_index: start_index + maxlen]
    print('Seed: ' + sentence + '"')
    print('---------------------- Generated Text -----------------------')
    # sys.stdout.write(generated)
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        next_char = get_next_char (model, x_pred, indices_char, Sample)
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()



-------------- Next Model --------------
Trained on  5  epochs


OSError: Unable to open file (unable to open file: name = '/home/fer/data/formaciones/master/deep-learning-intro/models/musquetaires/MusquetairesModelOptimizedMode_5.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
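The error above means the weights file was not found at the given path. One way to make the loop more robust is to check which checkpoint files exist before loading them. This is a minimal sketch: `existing_checkpoints` is a hypothetical helper, and the relative directory in the usage lines is a placeholder, not the notebook's real location.

```python
import os


def existing_checkpoints(model_dir, prefix, epochs):
    """Return (epoch, path) pairs for the checkpoint files that exist on disk."""
    found = []
    for epoch in epochs:
        path = os.path.join(model_dir, prefix + str(epoch) + ".h5")
        if os.path.isfile(path):
            found.append((epoch, path))
        else:
            print("Skipping missing checkpoint:", path)
    return found


# Placeholder directory (assumption); epochs 5, 10, ..., 60 as in the loop above
checkpoints = existing_checkpoints(
    "models/musquetaires", "MusquetairesModelOptimizedMode_", range(5, 65, 5)
)
print(len(checkpoints), "checkpoints available")
```

Each returned path could then be passed to `model.load_weights` inside the generation loop, so a missing file is reported and skipped instead of stopping the run.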