# Training of text generation models

In [1]:
%matplotlib inline

## Global config

Name of the corpus file (without the .txt extension)

In [2]:
corpusname = "apocalipsis"

Number of past input tokens to use for generation

In [3]:
inputtokens = 128
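As a rough illustration of what a fixed input length implies, the sketch below cuts a token sequence into (context, next token) training pairs. It assumes each example is made of the previous `n` tokens plus the token that follows them; the `sliding_windows` helper is hypothetical and is not part of neurowriter.

```python
def sliding_windows(tokens, n):
    """Yield (context, target) pairs: n consecutive tokens and the one that follows."""
    for start in range(len(tokens) - n):
        yield tokens[start:start + n], tokens[start + n]


# Tiny example with n=3 instead of 128 so the pairs are easy to read
pairs = list(sliding_windows("abcdef", 3))
print(pairs)
```

With this scheme a sequence of length `len(tokens)` yields `len(tokens) - n` pairs, since the last window still needs one following token as its target.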

Network architecture to use

In [4]:
architecture = "dilatedconv"
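Assuming "dilatedconv" refers to a WaveNet-style stack of dilated causal convolutions (see the references at the end), the number of past tokens such a stack can see follows from the kernel size and the dilation rates. The sketch below computes that receptive field; the kernel size and dilation schedule are illustrative assumptions, not values taken from neurowriter.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions (stride 1)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed schedule: kernel size 2 with dilations doubling at every layer
dilations = [2 ** i for i in range(7)]  # 1, 2, 4, ..., 64
field = receptive_field(2, dilations)
print(dilations, field)
```

Under these assumptions seven layers are enough to cover a window of 128 tokens, which is why doubling dilations are attractive for long contexts.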

Number of hyperparameter optimization trials (at least 15 recommended)

In [5]:
hypertrials = 15

### Process config

Get all relevant file names

In [6]:
corpusfile = 'corpus/' + corpusname + '.txt'
encodername = corpusname + '.enc'
modelname = corpusname + '.h5'

Obtain model class

In [7]:
from neurowriter.models import modelbyname
modelclass = modelbyname(architecture)

Using TensorFlow backend.


## Load corpus

In [8]:
with open(corpusfile) as f:
    corpus = f.read()

In [9]:
corpus[:1000]

'EL APOCALIPSIS \n\nDe San Juan\n\nLa revelación de Jesucristo\n\nAPOCALIPSIS 1\n\n1 La revelación de Jesucristo, que Dios le dio, para manifestar a sus siervos las cosas que deben suceder pronto; y la declaró enviándola por medio de su ángel a su siervo Juan, 2 que ha dado testimonio de la palabra de Dios, y del testimonio de Jesucristo, y de todas las cosas que ha visto. 3 Bienaventurado el que lee, y los que oyen las palabras de esta profecía, y guardan las cosas en ella escritas; porque el tiempo está cerca. \n\nSalutaciones a las siete iglesias\n\n4 Juan, a las siete iglesias que están en Asia: Gracia y paz a vosotros, del que es y que era y que ha de venir, y de los siete espíritus que están delante de su trono; 5 y de Jesucristo el testigo fiel, el primogénito de los muertos, y el soberano de los reyes de la tierra. Al que nos amó, y nos lavó de nuestros pecados con su sangre, \n\n6 y nos hizo reyes y sacerdotes para Dios, su Padre; a él sea gloria e imperio por los siglos de lo

## Encoding

In [10]:
from neurowriter.encoding import Encoder, loadencoding
try:
    encoder = loadencoding(encodername)
    print("Loaded encoder", encodername)
except Exception as e:
    print("Encoder not found, creating new encoder:", e)
    encoder = Encoder(corpus)
    encoder.save(encodername)

Loaded encoder apocalipsis.enc
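To make the role of the encoder concrete, here is a minimal sketch of a character-level encoder that maps text to integer indexes and back. It is a hypothetical stand-in: the names `encodetext` and `index2char` mirror how the encoder is used later in this notebook, while `CharEncoder` and `decodeindexes` are invented for illustration, and the real neurowriter `Encoder` may tokenize differently.

```python
class CharEncoder:
    """Hypothetical character-level encoder; not the neurowriter implementation."""

    def __init__(self, corpus):
        # One index per distinct character, in sorted order for reproducibility
        self.index2char = sorted(set(corpus))
        self.char2index = {c: i for i, c in enumerate(self.index2char)}

    def encodetext(self, text):
        """Turn a string into a list of integer indexes."""
        return [self.char2index[c] for c in text]

    def decodeindexes(self, indexes):
        """Turn a list of integer indexes back into a string."""
        return ''.join(self.index2char[i] for i in indexes)


enc = CharEncoder("abracadabra")
codes = enc.encodetext("cab")
print(codes, enc.decodeindexes(codes))
```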


## Model training

Train the generator model, trying different hyperparameters and selecting the model with the lowest loss on a validation split of the data.

Note that this can take a very long time, so intermediate versions of the model are saved during the optimization.

In [11]:
from neurowriter.optimizer import hypertrain

model, train_history = hypertrain(modelclass, inputtokens, encoder, corpus, n_calls=hypertrials, savemodel=modelname)
model.save(modelname)

Params: [4, 32, 0.60276337607164387, 2, 64, 0.64589411306665612, 'rmsprop'] , loss:  3.10155386083
Params: [5, 64, 0.38344151882577771, 3, 64, 0.56804456109393231, 'adam'] , loss:  3.08953538061


KeyboardInterrupt: 
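The selection logic described above (try several hyperparameter sets, keep the one with the lowest validation loss) can be sketched with a plain random search. This is only an illustration: the internals of `hypertrain` are not shown in this notebook and it may use a smarter search strategy. The search space, `random_search` and `toy_validation_loss` are all hypothetical.

```python
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials parameter sets and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


# Hypothetical search space, loosely inspired by the kind of values printed above
space = {
    "layers": [2, 3, 4, 5],
    "units": [32, 64, 128],
    "optimizer": ["adam", "rmsprop"],
}


def toy_validation_loss(params):
    # Stand-in for "train a model with params and return its validation loss"
    return abs(params["layers"] - 4) + abs(params["units"] - 64) / 64


best_params, best_loss = random_search(toy_validation_loss, space, n_trials=15)
print(best_params, best_loss)
```

In a real run the objective would train a model on the training split and return the loss measured on the validation split, which is what makes each trial expensive.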

## Generation test

In [None]:
from neurowriter.writer import Writer

writer = Writer(model, encoder, creativity=0.1)
print(corpus[:inputtokens])
''.join(writer.write(seed=corpus[:inputtokens]))

### Manual generation test with 0 creativity

In [None]:
import numpy as np
seed = corpus[:inputtokens]
print("Seed:", seed)
print("Generated")
print(seed, end='')
for i in range(1000):
    seedcoded = encoder.encodetext(seed)
    # Greedy decoding: always pick the most probable next token
    cls = np.argmax(model.predict(np.array([seedcoded])))
    char = encoder.index2char[cls]
    print(char, end='')
    # Slide the input window forward by one token
    seed = seed[1:] + char
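The loop above is greedy: it always takes the arg max. A common way to interpolate between greedy decoding and sampling from the raw model distribution is a softmax temperature. The sketch below assumes that `creativity` behaves like such a temperature; that is a guess, since the `Writer` implementation is not shown here, and `sample_with_temperature` is a hypothetical helper.

```python
import numpy as np


def sample_with_temperature(probs, temperature, rng):
    """Pick an index from probs; temperature 0 falls back to arg max."""
    probs = np.asarray(probs, dtype=np.float64)
    if temperature <= 0:
        return int(np.argmax(probs))
    # Rescale log-probabilities, then renormalize with a stable softmax
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


rng = np.random.default_rng(0)
probs = [0.1, 0.6, 0.3]
greedy = sample_with_temperature(probs, 0, rng)
cold = [sample_with_temperature(probs, 0.01, rng) for _ in range(20)]
warm = [sample_with_temperature(probs, 1.0, rng) for _ in range(20)]
print(greedy, cold, warm)
```

Low temperatures concentrate almost all probability on the most likely token, so the output approaches the greedy loop, while a temperature of 1 samples from the model's distribution unchanged.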

## Possible improvements

* Try training with SGD and the full pecera corpus for a large number of iterations

From Facebook's convolutional translation paper:

* Represent tokens with embeddings instead of one-hot encoding.
* Add the position of each token as a parallel embedding.
* Apply dropout to the embeddings and to the input of each convolutional block.
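The embedding ideas in the list above can be sketched in NumPy: each token index selects a row from a token table, each position selects a row from a position table, the two are summed, and dropout is optionally applied to the result. The table sizes, the random initialization and the `embed` helper are illustrative assumptions; in a real model both tables would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, dim = 50, 8, 4

# Randomly initialized here; learned parameters in a real model
token_table = rng.normal(size=(vocab_size, dim))
position_table = rng.normal(size=(seq_len, dim))


def embed(token_ids, drop_rate=0.0, training=False):
    """Sum token and position embeddings, with optional inverted dropout."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    x = token_table[token_ids] + position_table[positions]
    if training and drop_rate > 0:
        keep = rng.random(x.shape) >= drop_rate
        x = x * keep / (1.0 - drop_rate)
    return x


out = embed([3, 1, 4, 1])
print(out.shape)
```

Because the position term differs per slot, the same token (index 1 above) gets a different vector at positions 1 and 3, which gives the convolutional layers a sense of order.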

## References

* WaveNet paper: https://arxiv.org/pdf/1609.03499.pdf
* A Keras implementation of WaveNet: https://github.com/usernaamee/keras-wavenet/blob/master/simple-generative-model.py
* Another Keras implementation of WaveNet: https://github.com/basveeling/wavenet/blob/master/wavenet.py
* Facebook's convolutional translation paper: https://arxiv.org/pdf/1705.03122.pdf

## Scrapyard

def sampletext(logs):
    """Function that generates some sample text with the model.

    Intended to be used as a Keras callback
    """
    writer = Writer(model, encoder, creativity=0.1)
    print(corpus[:inputtokens])
    print(''.join(writer.write(seed=corpus[:inputtokens])))

# Build model with input parameters
model = modelkind(inputtokens, encoder, *bestparams)
# Prepare callbacks
callbacks = [
    LambdaCallback(on_train_end=sampletext),
    ModelCheckpoint(filepath=modelname,save_best_only=True),
    EarlyStopping(patience=patience)
]
# Train model
model.fit_generator(
    traingenerator,
    steps_per_epoch=int((1-val)*(len(corpus)-inputtokens+1)/batchsize),
    validation_data=valgenerator,
    validation_steps=int(val*(len(corpus)-inputtokens+1)/batchsize),
    epochs=maxepochs,
    verbose=2,
    callbacks=callbacks
)