# RNN and Character-level Neural LM (LOTR)
Based on: https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/
<br><br>This notebook presents a direct application of a **Character-Level Neural Language Model** that automatically generates text from an input text file/document. Here, the LOTR (The Lord of the Rings) books were used.

<br>**We recommend creating a new notebook that contains only the instructions needed for your text-generation test. Modify and reorder whatever you find appropriate to make your test more readable and easier to follow (for the review with the JPs, the teaching assistants, in Lab. 10).**

### Setup (_utils_)

In [4]:
from __future__ import print_function
import matplotlib.pyplot as plt
import numpy as np
import time
import csv
from keras.models import Sequential
# Import layers from keras.layers; the keras.layers.core/recurrent/wrappers paths may be unavailable in recent Keras releases
from keras.layers import Dense, Activation, Dropout, LSTM, SimpleRNN, TimeDistributed
import pickle

**Argument definitions**

In [5]:
# Text file
DATA_DIR = "./lotr.txt"
# Reduce BATCH_SIZE or HIDDEN_DIM if you run into memory problems
BATCH_SIZE = 50
HIDDEN_DIM = 250 #500
# Length of the sequences to analyze
SEQ_LENGTH = 50
# Path to previously trained weights (checkpoint); leave empty to start from scratch
WEIGHTS = ''

# Number of characters to generate in each test
GENERATE_LENGTH = 500
# Neural network parameters
LAYER_NUM = 2
NB_EPOCH = 20

**Function A:
<br>(1) Load a text file, (2) Build the network's input and output structures**

In [6]:
# Prepare the training data
def load_data(data_dir, seq_length):
    # Load the text file
    with open(data_dir, 'r') as f:
        data = f.read()
    # Unique characters (the order of a set may change between runs)
    chars = list(set(data))
    VOCAB_SIZE = len(chars)

    print('Data length: {} characters'.format(len(data)))
    print('Vocabulary size: {} characters'.format(VOCAB_SIZE))
    print(chars)

    # Index the characters
    ix_to_char = {ix:char for ix, char in enumerate(chars)}
    char_to_ix = {char:ix for ix, char in enumerate(chars)}

    # Input and output structures.
    # Each target is the input shifted by one character, so the last sequence
    # needs one extra character after it: hence len(data) - 1.
    NUMBER_OF_SEQ = (len(data) - 1) // seq_length
    print('Number of sequences: {}'.format(NUMBER_OF_SEQ))
    X = np.zeros((NUMBER_OF_SEQ, seq_length, VOCAB_SIZE))
    y = np.zeros((NUMBER_OF_SEQ, seq_length, VOCAB_SIZE))

    for i in range(NUMBER_OF_SEQ):
        # Fill the input structure X
        X_sequence = data[i*seq_length:(i+1)*seq_length]
        X_sequence_ix = [char_to_ix[value] for value in X_sequence]
        # One-hot vectors (input), filled in using the character dictionary
        input_sequence = np.zeros((seq_length, VOCAB_SIZE))
        for j in range(seq_length):
            input_sequence[j][X_sequence_ix[j]] = 1.
        X[i] = input_sequence

        # Fill the output structure y (the input shifted by one character)
        y_sequence = data[i*seq_length+1:(i+1)*seq_length+1]
        y_sequence_ix = [char_to_ix[value] for value in y_sequence]
        # One-hot vectors (output), filled in using the character dictionary
        target_sequence = np.zeros((seq_length, VOCAB_SIZE))
        for j in range(seq_length):
            target_sequence[j][y_sequence_ix[j]] = 1.
        y[i] = target_sequence

    return X, y, VOCAB_SIZE, ix_to_char
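
To make the encoding easier to picture, here is a minimal sketch on a toy string. It is not part of the original notebook: `one_hot_pairs`, `X_toy`, `y_toy` and `chars_toy` are hypothetical names, and the characters are sorted only so that the toy output is reproducible. It shows that each target sequence is the input sequence shifted by one character.

```python
import numpy as np

def one_hot_pairs(text, seq_length):
    """Toy version of the encoding: inputs and targets shifted by one character."""
    chars = sorted(set(text))  # sorted only to make this toy example reproducible
    char_to_ix = {c: i for i, c in enumerate(chars)}
    # The last sequence needs one extra character for its target
    n_seq = (len(text) - 1) // seq_length
    X = np.zeros((n_seq, seq_length, len(chars)))
    y = np.zeros((n_seq, seq_length, len(chars)))
    for i in range(n_seq):
        for j in range(seq_length):
            X[i, j, char_to_ix[text[i * seq_length + j]]] = 1.0
            y[i, j, char_to_ix[text[i * seq_length + j + 1]]] = 1.0
    return X, y, chars

def decode(matrix, chars):
    """Turn a (seq_length, vocab) one-hot matrix back into a string."""
    return "".join(chars[k] for k in matrix.argmax(axis=1))

X_toy, y_toy, chars_toy = one_hot_pairs("hello world", 5)
print(X_toy.shape)                                                   # (2, 5, 8)
print(repr(decode(X_toy[0], chars_toy)), "->", repr(decode(y_toy[0], chars_toy)))
```

With 11 characters and a sequence length of 5, the sketch produces two sequences over an 8-character vocabulary; the first input `'hello'` is paired with the target `'ello '`.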

**Function B:
<br>Text generation**

In [7]:
# method for generating text
def generate_text(model, length, vocab_size, ix_to_char):
    # starting with random character
    ix = [np.random.randint(vocab_size)]
    y_char = [ix_to_char[ix[-1]]]
    X = np.zeros((1, length, vocab_size))
    for i in range(length):
        # appending the last predicted character to sequence
        X[0, i, :][ix[-1]] = 1
        print(ix_to_char[ix[-1]], end="")
        ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
        y_char.append(ix_to_char[ix[-1]])
    return ('').join(y_char)
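
`generate_text` always picks the most probable character (`np.argmax`), so once the model enters a cycle the output keeps repeating, as the generated samples in this notebook show. One common alternative, which the original notebook does not use, is to sample the next character from the predicted distribution, optionally rescaled by a temperature. The sketch below is only an illustration under that assumption; `sample_index` and its parameters are hypothetical names.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw one index from a probability vector (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # Rescale the log-probabilities: temperature < 1 sharpens, > 1 flattens
    logits = np.log(np.clip(probs, 1e-12, 1.0)) / temperature
    logits -= logits.max()  # subtract the maximum for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = np.array([0.1, 0.7, 0.2])
rng = np.random.default_rng(0)
print(int(np.argmax(probs)))                               # greedy choice: always 1
print([sample_index(probs, 1.0, rng) for _ in range(10)])  # sampled choices vary
```

To try it inside `generate_text`, the `np.argmax(...)` call could be replaced by something like `ix = [sample_index(model.predict(X[:, :i+1, :])[0][-1])]`, which samples from the prediction for the last time step.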

## Training and Testing

**Using Function A: loading the data**

In [8]:
# Creating training data
X, y, VOCAB_SIZE, ix_to_char = load_data(DATA_DIR, SEQ_LENGTH)

Data length: 3262172 characters
Vocabulary size: 99 characters
['2', 'r', 'P', 'U', 'j', 'ó', 'Q', 'o', '6', 'A', 'M', 'h', 'z', 'V', '_', '*', 'S', '–', 'X', ':', '!', 'F', '¢', '-', 'L', '=', '‘', 'B', 'm', 'i', 'W', '®', 'e', 's', '8', 'N', 'x', 'G', 'D', '«', '’', 'y', '0', ')', '4', 'K', 'l', '.', '…', '\n', 'T', 't', 'n', 'R', 'O', 'q', '¤', 'u', 'Y', 'J', 'Z', '‚', 'p', 'g', '`', '¥', 'C', '—', '"', 'I', '<', '(', 'k', 'E', ';', '7', 'd', '1', '9', '5', '#', 'w', ' ', '&', 'f', '/', '?', '}', 'v', '3', "'", '»', 'µ', 'b', 'a', ',', '>', 'c', 'H']
Number of sequences: 65243


**It is important to save the `ix_to_char` dictionary to a binary file. It must be loaded every time you resume training or generate text from a checkpoint, because the order of the characters in the dictionary may change between runs (it is not a fixed order).**
<br>**DO NOT OVERWRITE THIS PICKLE WHEN RESTARTING THE NOTEBOOK TO TEST CHECKPOINTS**

In [9]:
# Do not overwrite the pickle when restarting the notebook to test earlier checkpoints
with open('ix_to_char.pickle', 'wb') as handle:
    pickle.dump(ix_to_char, handle, protocol=pickle.HIGHEST_PROTOCOL)
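
One way to avoid overwriting the pickle by accident is to write it only when it does not exist yet. The following is a minimal sketch under that assumption, not the notebook's own method; `load_or_create_mapping` and the demo file name are hypothetical.

```python
import os
import pickle
import tempfile

def load_or_create_mapping(path, fresh_mapping):
    """Reuse the saved mapping if the file exists; otherwise save the fresh one."""
    if os.path.exists(path):
        with open(path, 'rb') as handle:
            return pickle.load(handle)
    with open(path, 'wb') as handle:
        pickle.dump(fresh_mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return fresh_mapping

# Demo in a temporary directory (hypothetical file name)
demo_path = os.path.join(tempfile.mkdtemp(), 'ix_to_char_demo.pickle')
first = load_or_create_mapping(demo_path, {0: 'a', 1: 'b'})   # file is created
second = load_or_create_mapping(demo_path, {0: 'b', 1: 'a'})  # file is reused
print(first == second)  # True: the reordered mapping did not replace the saved one
```

Note that `X` and `y` are encoded with whatever mapping `load_data` built in the current session, so resuming training consistently would also require encoding the data with the restored mapping.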

In [10]:
print(ix_to_char)

{0: '2', 1: 'r', 2: 'P', 3: 'U', 4: 'j', 5: 'ó', 6: 'Q', 7: 'o', 8: '6', 9: 'A', 10: 'M', 11: 'h', 12: 'z', 13: 'V', 14: '_', 15: '*', 16: 'S', 17: '–', 18: 'X', 19: ':', 20: '!', 21: 'F', 22: '¢', 23: '-', 24: 'L', 25: '=', 26: '‘', 27: 'B', 28: 'm', 29: 'i', 30: 'W', 31: '®', 32: 'e', 33: 's', 34: '8', 35: 'N', 36: 'x', 37: 'G', 38: 'D', 39: '«', 40: '’', 41: 'y', 42: '0', 43: ')', 44: '4', 45: 'K', 46: 'l', 47: '.', 48: '…', 49: '\n', 50: 'T', 51: 't', 52: 'n', 53: 'R', 54: 'O', 55: 'q', 56: '¤', 57: 'u', 58: 'Y', 59: 'J', 60: 'Z', 61: '‚', 62: 'p', 63: 'g', 64: '`', 65: '¥', 66: 'C', 67: '—', 68: '"', 69: 'I', 70: '<', 71: '(', 72: 'k', 73: 'E', 74: ';', 75: '7', 76: 'd', 77: '1', 78: '9', 79: '5', 80: '#', 81: 'w', 82: ' ', 83: '&', 84: 'f', 85: '/', 86: '?', 87: '}', 88: 'v', 89: '3', 90: "'", 91: '»', 92: 'µ', 93: 'b', 94: 'a', 95: ',', 96: '>', 97: 'c', 98: 'H'}


In [11]:
print(X.shape, y.shape, VOCAB_SIZE)

(65243, 50, 99) (65243, 50, 99) 99


**Creating the RNN (LSTM)**

In [12]:
# Creating and compiling the network
model = Sequential()

# Add the LSTM layers
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
for i in range(LAYER_NUM - 1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
# Add the output operation: one softmax over the vocabulary per time step
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation('softmax'))

# "Compiling" = configuring the RNN with its loss function and optimizer
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

**Initial test: generating 500 characters**

In [13]:
# Generate some sample before training to know how bad it is!
generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)

eII555WZZ5ZII5Z5ZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5

'eII555WZZ5ZII5Z5ZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5ZII5ZZII55ZZIIZ5I5ZZII5Z5ZIIZ5IZ5Z'

**Load the weights (and the one-hot-vector dictionary) if there was a previous training run**
<br>WEIGHTS must be set to the file name of the saved checkpoint. For example:
<br>`WEIGHTS = "checkpoint_layer_2_hidden_250_epoch_60.hdf5"`

In [14]:
# Load the weights from a previous training run (to resume an execution)
# The epoch count is derived from the checkpoint file name
# The character dictionary (one-hot vectors) is loaded for generation
if WEIGHTS != '':
    model.load_weights(WEIGHTS)
    # rfind locates the extension dot, so a leading './' in the path does not break the slice
    nb_epoch = int(WEIGHTS[WEIGHTS.rfind('_') + 1:WEIGHTS.rfind('.')])
    with open('ix_to_char.pickle', 'rb') as handle:
        ix_to_char = pickle.load(handle)
else:
    # Starting from scratch:
    nb_epoch = 0
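
The slicing above assumes the epoch number is the last `_`-separated field before the extension, which fails for a name such as `checkpoint_layer_2_hidden_250_epoch_10_run1.hdf5`. A more tolerant option, sketched here with a hypothetical helper `epoch_from_checkpoint`, is to look for the `epoch_<N>` pattern directly.

```python
import re

def epoch_from_checkpoint(filename):
    """Return the epoch number encoded as 'epoch_<N>' in a checkpoint name, or 0."""
    match = re.search(r'epoch_(\d+)', filename)
    return int(match.group(1)) if match else 0

print(epoch_from_checkpoint('checkpoint_layer_2_hidden_250_epoch_60.hdf5'))         # 60
print(epoch_from_checkpoint('./checkpoint_layer_2_hidden_250_epoch_10_run1.hdf5'))  # 10
print(epoch_from_checkpoint(''))                                                    # 0
```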

**TRAINING**

In [None]:
# Train the network (continues from nb_epoch if a checkpoint was loaded)

# This is the main loop
# You can change the condition so that it stops after a given number of epochs.
while True:
    print('\n\nEpoch: {}\n'.format(nb_epoch))
    # Fit the model: train for 1 epoch
    model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, epochs=1)
    nb_epoch += 1
    # Generate a sample text at the end of the epoch
    generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)
    # You can change this to save checkpoints more often
    if nb_epoch % 10 == 0:
        model.save_weights('checkpoint_layer_{}_hidden_{}_epoch_{}.hdf5'.format(LAYER_NUM, HIDDEN_DIM, nb_epoch))



Epoch: 0





Epoch 1/1
1 the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the store and the stor

Epoch: 1

Epoch 1/1
the stood and the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the 

**Incorrect TEST of an earlier checkpoint**
<br>After restarting the notebook and loading a checkpoint without correctly loading the dictionary:

In [15]:
WEIGHTS = "checkpoint_layer_2_hidden_250_epoch_10_run1.hdf5"
# Loading the trained weights
model.load_weights(WEIGHTS)
generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)
print('\n\n')

# The printed example actually uses a re-executed "dictionary", not one loaded from the file
# That is why the characters are inconsistent (the mapping has been reordered)

OSError: Unable to open file (unable to open file: name = 'checkpoint_layer_2_hidden_250_epoch_10_run1.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

**Second TRAINING attempt**

In [None]:
nb_epoch = 0
while True:
    print('\n\nEpoch: {}\n'.format(nb_epoch))
    model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, epochs=1)
    nb_epoch += 1
    generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)
    if nb_epoch % 10 == 0:
        model.save_weights('checkpoint_layer_{}_hidden_{}_epoch_{}.hdf5'.format(LAYER_NUM, HIDDEN_DIM, nb_epoch))



Epoch: 0

Epoch 1/1
un the he the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the th

Epoch: 1

Epoch 1/1
/ the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the store the 

**Correct TEST of an earlier checkpoint**
<br>If you instantiate the model and its parameters (by running some of the preliminary cells) and have the two required files (.pickle and .hdf5), you can generate text.
<br>In the LOTR example, `VOCAB_SIZE = 84` (if you want to try it, the weights and the dictionary are attached, but not the text).

In [19]:
# Take care not to overwrite the original pickle
with open('ix_to_char.pickle', 'rb') as handle:
    ix_to_char = pickle.load(handle)
    
WEIGHTS = "checkpoint_layer_2_hidden_250_epoch_60.hdf5"
# Loading the trained weights
model.load_weights(WEIGHTS)
generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)
print('\n\n')

ValueError: Dimension 0 in both shapes must be equal, but are 99 and 84. Shapes are [99,1000] and [84,1000]. for 'Assign_2' (op: 'Assign') with input shapes: [99,1000], [84,1000].
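
The error above is consistent with how a Keras LSTM layer stores its input kernel, with shape `(input_dim, 4 * units)`: here 1000 = 4 × 250, while the first dimension is 99 for the model built in this session and 84 for the checkpoint. The following minimal sketch reproduces that arithmetic and shows a check that could run before `load_weights`; `lstm_kernel_shape` and `check_vocab` are hypothetical helpers, not part of the original notebook.

```python
def lstm_kernel_shape(vocab_size, hidden_dim):
    """Shape of the input kernel of a Keras LSTM layer: (input_dim, 4 * units)."""
    return (vocab_size, 4 * hidden_dim)

def check_vocab(ix_to_char, vocab_size):
    """Raise early if the saved dictionary and the model disagree (hypothetical helper)."""
    if len(ix_to_char) != vocab_size:
        raise ValueError(
            'Dictionary has {} characters but the model was built with VOCAB_SIZE = {}'
            .format(len(ix_to_char), vocab_size))

print(lstm_kernel_shape(99, 250))  # (99, 1000): model built in this session
print(lstm_kernel_shape(84, 250))  # (84, 1000): weights stored in the checkpoint
```

One possible fix, assuming the attached pickle matches the checkpoint, is to set `VOCAB_SIZE = len(ix_to_char)` right after loading the pickle and to build the model only after that.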