Modeling Text Probabilistically
===============================

In this exercise, we use neural networks to estimate the conditional probabilities of text, character by character.
For an interesting discussion of the topic, see the following blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
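
"Character by character" can be made precise with the chain rule of probability. The notation below is introduced here for illustration and is not part of the original example: $c_t$ is the character at position $t$, $T$ is the length of the text, and $n$ is the number of preceding characters the model is allowed to see. The first equality is exact; the second line is the approximation made by a model with a fixed-length context (for $t \le n$ the context is simply shorter).

```latex
P(c_1, \dots, c_T) = \prod_{t=1}^{T} P(c_t \mid c_1, \dots, c_{t-1})
                   \approx \prod_{t=1}^{T} P(c_t \mid c_{t-n}, \dots, c_{t-1})
```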

We use the Keras library, adapting one of its examples.

In [1]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys

Using Theano backend.


First, we use the same text as the original example.

In [2]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")

In [4]:
import io
# Read the corpus as UTF-8 explicitly instead of retrying after a decode error,
# and close the file once it has been read.
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()

In [5]:
print('Corpus length:', len(text))

Corpus length: 600893


Because the model works at the character level, we first need to define the set of characters that occur in the text:

In [6]:
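# Sort the characters so the index mapping is the same on every run
# (iteration order of a plain set is not guaranteed to be stable).
chars = sorted(set(text))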
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 57
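
To make the two dictionaries concrete, here is a minimal sketch on a toy string (the `toy_*` names and the string itself are assumptions for illustration, not part of the corpus). It builds the same two mappings and checks that encoding followed by decoding returns the original text. The characters are sorted here so that the mapping is reproducible.

```python
# Toy text standing in for the corpus (illustrative only).
toy_text = "hello world"

# Sorting gives a reproducible character-to-index mapping.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integer indices, then decode it back.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)            # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(encoded)              # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decoded == toy_text)  # True
```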


The model estimates the probability of a character given the characters that precede it, so we feed it overlapping sequences of `maxlen` characters, each paired with the character that follows it. Consecutive sequences start `step` characters apart.

In [7]:
maxlen = 20
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('num sequences:', len(sentences))

num sequences: 200291
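
The windowing is easier to see on a short string. The sketch below wraps the same loop in a helper (`make_windows` is a hypothetical name used only here) and applies it to a ten-character toy text. It also checks the sequence count reported above, using the fact that the loop produces ceil((len(text) - maxlen) / step) windows.

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) using the same loop as the cell above."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i: i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets

toy_windows, toy_targets = make_windows("abcdefghij", maxlen=4, step=3)
for window, target in zip(toy_windows, toy_targets):
    print(repr(window), '->', repr(target))
# 'abcd' -> 'e'
# 'defg' -> 'h'

# Ceiling division: number of windows for the corpus length printed earlier.
n_windows = -(-(600893 - 20) // 3)
print(n_windows)  # 200291
```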


In [8]:
print('Vectorizing...')
# One-hot encode every window (X) and its following character (y).
# The builtin bool replaces the np.bool alias, which recent NumPy releases removed.
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Vectorizing...
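
As a small illustration of what this encoding produces, the sketch below applies the same loops to a three-character alphabet and two windows of length 2 (all `toy_*` names and values are made up for the example). Each position in a window becomes a row with a single 1 in the column of its character, and `argmax` along the last axis recovers the character indices.

```python
import numpy as np

# Toy alphabet and windows (illustrative only).
toy_chars = ['a', 'b', 'c']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_sentences = ['ab', 'bc']
toy_next = ['c', 'a']
toy_maxlen = 2

X_toy = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_toy[i, t, toy_char_indices[char]] = 1
    y_toy[i, toy_char_indices[toy_next[i]]] = 1

print(X_toy.shape, y_toy.shape)  # (2, 2, 3) (2, 3)
print(X_toy[0].astype(int))      # rows [1 0 0] and [0 1 0], i.e. 'a' then 'b'
print(X_toy.argmax(axis=-1))     # character indices per window: [[0 1] [1 2]]
```

With the numbers printed earlier in this notebook, the real `X` has shape (200291, 20, 57), roughly 228 million boolean entries, which is why a compact dtype is used.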


In [9]:
print('Building the model...')
model = Sequential()
# Two stacked LSTM layers; the first returns the full sequence so the second can consume it.
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
# One output per character, turned into a probability distribution by softmax.
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

Building the model...
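
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. It assumes the standard LSTM parameterization (four gate blocks, each with an input kernel, a recurrent kernel, and a bias vector) and the 57-character vocabulary reported above; the helper names are hypothetical. Dropout and Activation layers contribute no parameters.

```python
def lstm_params(input_dim, units):
    # Four blocks (input, forget, output gates and the cell candidate),
    # each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

n_chars, units = 57, 512
first_lstm = lstm_params(n_chars, units)
second_lstm = lstm_params(units, units)
output_layer = dense_params(units, n_chars)
total = first_lstm + second_lstm + output_layer

print(first_lstm, second_lstm, output_layer)  # 1167360 2099200 29241
print(total)                                  # 3295801
```

If Keras is available, `model.summary()` can be used to compare these figures with what the library reports.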


In [None]:
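def sample(a, temperature=1.0):
    # Sample an index from the probability array `a`, reshaped by `temperature`.
    # Cast to float64 first: float32 predictions can sum to slightly more than 1,
    # which np.random.multinomial rejects.
    a = np.log(np.asarray(a).astype('float64')) / temperature
    a = np.exp(a) / np.sum(np.exp(a))
    return np.argmax(np.random.multinomial(1, a, 1))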
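
The effect of the temperature is easier to see without the random draw. The sketch below isolates the rescaling step in a helper (`apply_temperature` is a hypothetical name, and the three-element distribution is made up) and prints the reshaped distribution for several temperatures. Dividing the log-probabilities by the temperature is the same as raising each probability to the power 1/temperature and renormalizing: values below 1 sharpen the distribution toward the most likely character, and values above 1 flatten it.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector as p_i ** (1 / temperature), renormalized."""
    logits = np.log(np.asarray(probs, dtype='float64')) / temperature
    # Subtracting the maximum before exponentiating avoids overflow
    # and does not change the normalized result.
    scaled = np.exp(logits - logits.max())
    return scaled / scaled.sum()

toy_probs = [0.6, 0.3, 0.1]
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(toy_probs, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged, so sampling then follows the model's predicted probabilities directly.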

In [None]:
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    # Keras 1.x spells this argument nb_epoch; Keras 2 and later renamed it to epochs.
    model.fit(X, y, batch_size=128, nb_epoch=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


--------------------------------------------------
Iteration 1
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "s been surmounted? s"
s been surmounted? sothe he the ant of the the and and and the the and and the he has ant on the ant the the ans of the the sore the sere an the hereres and the ant on the the the the the as the the ans of the bore the and and the has of the the the the the he an the the he ale there the he the an the the and the he the the as and the ferere han the here the the he and the the fere in the the the he ore the the the h

----- diversity: 0.5
----- Generating with seed: "s been surmounted? s"
s been surmounted? sore ant on and the upthe eritist, on as the ins than the calilag bor or ther whan hor int the the atile his if cord an morethe as os the the ans al e pale sad the fueler of the henderith at as ant ancroceresd the hir the the cos int ongonte the pore hone ores of as he porleche ant on the ares of and ard cheras thin sore tuve alt thacith 