#  LSTMs/GRUs

Abstrações de células de memória de um ponto de vista funcional. Podem ser usadas no lugar de qualquer unidade em uma RNN, sendo responsável por ler, escrever ou lembrar de informação fluindo no modelo. 

A LSTM, em particular, consiste basicamente de uma unidade linear (a celula de informação propriamente dita) envoltas por três portas lógicas, responsáveis pela manutenção dos dados. Uma é responsável por permitir que dados fluam para a célulade informação (entrada), uma pela saída da célula e a última é responável por lembrar ou esquecer de dados de acordo com as necessidades da rede. A GRU é uma versão mais simplificada da LSTM. Ela combina portas da LSTM em menos portas, produzindo um modelo menor.

Estas unidades ajuam a resolver o problema de manter estados, porque a rede pode escolher esquecê-los se já não servem mais. Desta forma, o problema dos gradientes é menor importantes com LSTMs e GRUs.

Nesta aula, vamos usar LSTMs em problemas de tradução.

### Arquitetura Long Short-Term Memory 

Neste texto, vamos considerar que uma LSTM é formada por três portas lógicas: "Input" ou "Write", responsável pela escrita na LSTM de dados vindos do resto da RNN; "Output" ou "Read", responsável pela saída da LSTM para o resto da RNN; e "Keep" ou "Forget", rsponsável pela manutenção ou modificação de dados armazenados na LSTM.

<img src=https://ibm.box.com/shared/static/zx10duv5egw0baw6gh2hzsgr8ex45gsg.png width="720"/>
<center>*Diagrama de uma Long Short-Term Memory Unit*</center>

Se você estiver mais interssado em saber sobre LSTMs e suas variantes, incluindo GRUs, leia o blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

## RNNs Sequence to Sequence: soletrando a partir de pronúncias

No problema que vamos abordar, queremos traduzir a pronuncia de uma palavra, dada como uma lista de fonemas, para a grafia da palavra. Este problema é mais simples que _fala para texto_ ou _tradução_ (no sentido de não precisarmos de quantidades colossais de dados para ver algo acontecer [:)]); contudo, uma dificuldade aqui é a avaliação na escrita de palavras nunca vistas antes. Isto é díficil porque (1) há muitas pronúncias com várias transcrições razoáveis além de (2) palavras homônicas com transcrições distintas (_read_, no passado e presente, por exemplo). 

### Manipulando os dados...

Inicialmente temos que ler o dicionário de fonemas da CMU, _The CMU pronouncing dictionary_.

In [1]:
import pandas as pd
import numpy as np

In [2]:
pdic = pd.read_csv('data/cmudict-compact.csv', comment=';', 
                   header = -1, names = ['word', 'pronunciation'],
                   keep_default_na = False)

In [3]:
pdic[40150:40155]

Unnamed: 0,word,pronunciation
40150,FACEY,F EY1 S IY0
40151,FACHET,F AE1 CH AH0 T
40152,FACIAL,F EY1 SH AH0 L
40153,FACIALS,F EY1 SH AH0 L Z
40154,FACIANE,F AA0 S IY0 AA1 N EY0


In [4]:
len(pdic)

133779

Vamos mapear índices para letras e fonemas.

In [5]:
# map codes to phonemes
def create_phoneme_maps(df):
    phonemes = set()
    for pro_phonemes in pdic['pronunciation']:
        for p in pro_phonemes.split():
            phonemes.add(p)
    sorted_phonemes = ["_"] + sorted(list(phonemes))
    index_to_phoneme = dict(enumerate(sorted_phonemes))
    return (index_to_phoneme, 
            dict((v, k) for k,v in index_to_phoneme.items()))

# map codes to letters (and special symbol _)
def create_letter_maps(): 
    index_to_letter = dict(enumerate("_abcdefghijklmnopqrstuvwxyz"))
    return (index_to_letter,
            dict((v, k) for k,v in index_to_letter.items()))

In [6]:
index_to_phoneme, phoneme_to_index = create_phoneme_maps(pdic['pronunciation'])
index_to_letter, letter_to_index = create_letter_maps()    

print (index_to_letter[1], letter_to_index['a'])
print (index_to_phoneme[1], phoneme_to_index['AA0'])
print len(index_to_phoneme)
print len(index_to_letter)

('a', 1)
('AA0', 1)
70
27


In [7]:
# Create pronounce dict associating words to phoneme codes
# Note that very short and very long words, as well as words with no
# alphabetical chars, are discarded
def create_pronouce_dict(phoneme_to_index):
    pronounce_dict = {}
    for idx, cols in pdic.iterrows():
        word, phone_list = cols['word'], cols['pronunciation'].split()
        # filter long words 
        if len(word) < 5 or len(word) > 15:
            continue
        # filter words with not alphabetical chars
        if any((not c.isalpha() for c in word)):
            continue
        pronounce_dict[word.lower()] = [phoneme_to_index[p] for p in phone_list]
    return pronounce_dict

In [8]:
pronounce_dict = create_pronouce_dict(phoneme_to_index)

Vamos testar nossos dicionários, vendo a pronúncia de _loved_.

In [9]:
print 'dict size:', len(pronounce_dict)
print '%s: %s ~ %s'%('loved', pronounce_dict['loved'], 
                   '-'.join([index_to_phoneme[i] for i in pronounce_dict['loved']]))

dict size: 108006
loved: [43, 8, 65, 21] ~ L-AH1-V-D


Padding das entradas e labels para a RNN.

In [10]:
def pad(pronounce_dict):
    words = np.random.permutation(list(pronounce_dict.keys()))

    # (words, max time, embedding size)
    labels_ = np.zeros((len(words), 15, 1)) # max allowed word length + 1
    input_ = np.zeros((len(words), 16, 1))
    for i, w in enumerate(words):
        v = pronounce_dict[w]
        w = w + "_" * (15 - len(w))
        v = v + [0] * (16 - len(v))
        for j, n in enumerate(v):
            input_[i][j] = [n] # input embedding size = 1
        for j, letter in enumerate(w):
            labels_[i][j] = [letter_to_index[letter]] # output embedding size = 1
            
    return input_.astype(np.int32), labels_.astype(np.int32)

In [11]:
input_, labels_ = pad(pronounce_dict)

In [12]:
w = 4321
print ''.join([index_to_letter[int(c)] for c in labels_[w]]), labels_[w]
print ''.join([index_to_phoneme[int(c)] for c in input_[w]]), input_[w]

friesner_______ [[ 6]
 [18]
 [ 9]
 [ 5]
 [19]
 [14]
 [ 5]
 [18]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]]
FRIY1SNER0__________ [[32]
 [54]
 [39]
 [55]
 [45]
 [26]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]]


Separação em treino, teste e validação.

In [13]:
input_test   = input_[:10000]
input_val    = input_[10000:20000]
input_train  = input_[20000:]
labels_test  = labels_[:10000]
labels_val   = labels_[10000:20000]
labels_train = labels_[20000:]

data_test  = zip(input_test, labels_test)
data_val   = zip(input_val, labels_val)
data_train = zip(input_train, labels_train)

Esta classe pega batches aleatórios e os arranja de forma adequada para o modelo.

In [14]:
class DataIterator:
    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size
        self.iter = self.make_random_iter()
        
    def next_batch(self):
        try:
            idxs = self.iter.next()
        except StopIteration:
            self.iter = self.make_random_iter()
            idxs = self.iter.next()
        X, Y = zip(*[self.data[i] for i in idxs])
        # shape X, Y after zip = [batch_size, max_time, embedding_size]
        # shape X, Y after transpose = [max_time, batch_size, embedding_size]
        X = np.array(X).transpose(1, 0, 2)  
        Y = np.array(Y).transpose(1, 0, 2)
        return X, Y

    def make_random_iter(self):
        splits = np.arange(self.batch_size, len(self.data), self.batch_size)
        it = np.split(np.random.permutation(range(len(self.data))), splits)[:-1]
        return iter(it)
    
batch_size = 128
train_iter = DataIterator(data_train, batch_size)
val_iter = DataIterator(data_val, batch_size)
test_iter = DataIterator(data_test, batch_size)

In [15]:
temp_iter = DataIterator(data_test, 2)
X, Y = temp_iter.next_batch()
print X.shape
print X
print Y

(16, 2, 1)
[[[ 8]
  [42]]

 [[45]
  [66]]

 [[21]
  [36]]

 [[26]
  [68]]

 [[66]
  [35]]

 [[37]
  [46]]

 [[58]
  [ 0]]

 [[34]
  [ 0]]

 [[49]
  [ 0]]

 [[43]
  [ 0]]

 [[21]
  [ 0]]

 [[ 0]
  [ 0]]

 [[ 0]
  [ 0]]

 [[ 0]
  [ 0]]

 [[ 0]
  [ 0]]

 [[ 0]
  [ 0]]]
[[[21]
  [17]]

 [[14]
  [21]]

 [[ 4]
  [ 9]]

 [[ 5]
  [26]]

 [[18]
  [26]]

 [[23]
  [ 9]]

 [[ 9]
  [14]]

 [[20]
  [ 7]]

 [[ 8]
  [ 0]]

 [[ 8]
  [ 0]]

 [[15]
  [ 0]]

 [[12]
  [ 0]]

 [[ 4]
  [ 0]]

 [[ 0]
  [ 0]]

 [[ 0]
  [ 0]]]


### Implementando em tensorflow

In [22]:
import tensorflow as tf
from tensorflow.python.framework import ops
#from tensorflow.models.rnn import rnn_cell, seq2seq

Abrindo uma sessão interativa

In [23]:
ops.reset_default_graph()
try:
    sess.close()
except:
    pass
sess = tf.InteractiveSession()

In [24]:
batch_size = 128

# max padded lengths
input_seq_length = 16
output_seq_length = 15 

input_vocab_size = len(index_to_phoneme) # 70
output_vocab_size = len(index_to_letter) + 1 # 27 + 1 (end of word)
embedding_size = 1 # input and output embedding size
embedding_dim = 256

Nossa RNN tem a seguinte topologia:

<img src="rnn_seq2seq.png">

Ou seja, a entrada do decodificador (`decode_input`) tem que ser deslocada de 1 dos rótulos.

In [25]:
encode_input = tf.placeholder(tf.int32, shape=(input_seq_length, batch_size, embedding_size))
input_seq_lengths = tf.placeholder(dtype = tf.int32, shape=(batch_size,))

#labels = [tf.placeholder(tf.int32, shape=(None,), name = "l_%i" %i) 
#          for i in range(output_seq_length)]

#decode_input = [tf.zeros_like(encode_input[0], dtype=np.int32, name="GO")] + labels[:-1]

A próxima célula é o cerne do nosso modelo. Nossa RNN é formada por células LSTM com dropout entre as camadas. Empilhamos 3 dessas (LSTM+dropout) para formar a rede neural completa. Executamos essa rede com o padrão `seq2seq.embedding_rnn_seq2seq` que permite que a alimentemos com uma sequência normal de números e esta é convertida automaticamente pela própria rede em uma sequência de embeddings de tensores one-hot.

Note que nós construimos duas redes no escopo 'decoders'. Uma delas usa `feed_previous = True` enquanto a outra não. Isso é setado para False durante o treino, de forma que mesmo que o aprendiz cometa um erro em uma letra - nós ainda passamos pra ele o rótulo correto em `decoder_inputs`. Como não teríamos os rótulos corretos em um teste real, para o conjunto de teste usamos `feed_previous = True` e o decodificador recebe a letra com máxima probabilidade do último passo da saída do decodificador.

`decode_output` é um tensor com forma (batch_size, output_vocab_size). Nós podemos aplicar softmax a este tensor para obter escores logísticos para cada letra.

In [26]:
# ENCODER

with tf.variable_scope("encoder") as scope:
    # Build RNN cell
    encoder_cell = tf.contrib.rnn.BasicLSTMCell(embedding_dim)
    #initial_state = encoder_cell.zero_state(batch_size, tf.float32) 

    # Run Dynamic RNN
    #   encoder_outpus: [max_time, batch_size, num_units]
    #   encoder_state: [batch_size, num_units]
    encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
        encoder_cell, encode_input, dtype = tf.float32,
        sequence_length=input_seq_lengths, time_major=True)
    #  encode_input: [max_time, batch_size, embedding_size]

ValueError: Initializer for variable encoder/rnn/basic_lstm_cell/kernel/ is from inside a control-flow construct, such as a loop or conditional. When creating a variable inside a loop or conditional, use a lambda as the initializer.

In [None]:
# ENCODER

# Build RNN cell
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(embedding_dim)

# Run Dynamic RNN
#   encoder_outpus: [max_time, batch_size, num_units]
#   encoder_state: [batch_size, num_units]
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encode_input,
    sequence_length=source_sequence_length, time_major=True)

# DECODER

# Build RNN cell
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(embedding_dim)

# Helper
helper = tf.contrib.seq2seq.TrainingHelper(
    decode_input, decoder_lengths, time_major=True)
# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)
# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
logits = outputs.rnn_output

projection_layer = layers_core.Dense(
    tgt_vocab_size, use_bias=False)

# LOSS
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=decoder_outputs, logits=logits)
train_loss = (tf.reduce_sum(crossent * target_weights) /
    batch_size)

# Calculate and clip gradients
params = tf.trainable_variables()
gradients = tf.gradients(train_loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(
    gradients, max_gradient_norm)

# Optimization
optimizer = tf.train.AdamOptimizer(learning_rate)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params))

In [None]:
keep_prob = tf.placeholder("float")

# range(3) para formar uma pilha com 3 camadas
cells = [tf.contrib.rnn.DropoutWrapper(
        tf.contrib.rnn.BasicLSTMCell(embedding_dim), output_keep_prob=keep_prob
    ) for i in range(3)]

stacked_lstm = tf.contrib.rnn.MultiRNNCell(cells)

# outputs, states = embedding_rnn_seq2seq(
#    encoder_inputs, decoder_inputs, cell,
#    num_encoder_symbols, num_decoder_symbols,
#    embedding_size, output_projection=None,
#    feed_previous=False)

with tf.variable_scope("decoders") as scope:
    decode_outputs, decode_state = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        encode_input, decode_input, stacked_lstm, input_vocab_size, 
        output_vocab_size, embedding_dim)
    
    scope.reuse_variables()
    
    decode_outputs_test, decode_state_test = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        encode_input, decode_input, stacked_lstm, input_vocab_size, 
        output_vocab_size, embedding_dim, feed_previous=True)

`sequence_loss` é entropia cruzada aplicada ao softmax das saídas do decodificador.

In [None]:
loss_weights = [tf.ones_like(l, dtype=tf.float32) for l in labels]
loss = tf.nn.seq2seq.sequence_loss(decode_outputs, labels, loss_weights, output_vocab_size, embedding_dim)
optimizer = tf.train.AdamOptimizer(1e-4)
train_op = optimizer.minimize(loss)

In [None]:
sess.run(tf.initialize_all_variables())

### Modelo de Treino

A avaliação é feita usando tanto a perda da seq2seq quanto a precisão, ou seja, a proporção de palavras escritas corretamente pelo modelo.

In [None]:
import sys

def get_feed(X, Y):
    feed_dict = {encode_input[t]: X[t] for t in range(input_seq_length)}
    feed_dict.update({labels[t]: Y[t] for t in range(output_seq_length)})
    return feed_dict

def train_batch(data_iter):
    X, Y = data_iter.next_batch()
    feed_dict = get_feed(X, Y)
    feed_dict[keep_prob] = 0.5
    _, out = sess.run([train_op, loss], feed_dict)
    return out

def get_eval_batch_data(data_iter):
    X, Y = data_iter.next_batch()
    feed_dict = get_feed(X, Y)
    feed_dict[keep_prob] = 1.
    all_output = sess.run([loss] + decode_outputs_test, feed_dict)
    eval_loss = all_output[0]
    decode_output = np.array(all_output[1:]).transpose([1,0,2])
    return eval_loss, decode_output, X, Y

def eval_batch(data_iter, num_batches):
    losses = []
    predict_loss = []
    for i in range(num_batches):
        eval_loss, output, X, Y = get_eval_batch_data(data_iter)
        losses.append(eval_loss)
        
        for index in range(len(output)):
            real = Y.T[index]
            predict = np.argmax(output, axis = 2)[index]
            predict_loss.append(all(real==predict))
    return np.mean(losses), np.mean(predict_loss)

Exemplo de execução deixada até cerca de 20000 iterações.

In [None]:
for i in range(100000):
    try:
        train_batch(train_iter)
        if i % 1000 == 0:
            val_loss, val_predict = eval_batch(val_iter, 16)
            train_loss, train_predict = eval_batch(train_iter, 16)
            print ("val loss   : %f, val predict   = %.1f%%" %(val_loss, val_predict * 100))
            print ("train loss : %f, train predict = %.1f%%" %(train_loss, train_predict * 100))
            print ("")
            sys.stdout.flush()
    except KeyboardInterrupt:
        print ("interrupted by user")
        break

### Examinando modelos...

Eu atingi cerca de 41% no conjunto de validação, quando interrompi. Isso parece bem ruim. Vamos ver agora os erros cometidos. 

In [None]:
eval_loss, output, X, Y = get_eval_batch_data(test_iter)

In [None]:
print ("pronúncia".ljust(40)),
print ("correta".ljust(17)),
print ("chute do modelo".ljust(17)),
print ("correto?")
print ("")

for index in range(len(output)):
    phonemes = "-".join([index_to_phoneme[p] for p in X.T[index]]) 
    real = [index_to_letter[l] for l in Y.T[index]] 
    predict = [index_to_letter[l] for l in np.argmax(output, axis = 2)[index]]
   
    print (phonemes.split("-_")[0].ljust(40),)
    print ("".join(real).split("_")[0].ljust(17),)
    print ("".join(predict).split("_")[0].ljust(17),)
    print (str(real == predict))

Como podemos ver, os erros não parecem tão terríveis assim. 

Este curso é baseado em material da [Big Data University](https://bigdatauniversity.com/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). Assim, segue os termos da [licença do MIT](https://bigdatauniversity.com/mit-license/). Material adicional de Mikesj (https://github.com/mikesj-public/rnn_spelling_bee/blob/master/spelling_bee_RNN.ipynb), Thang Luong, Eugene Brevdo, and Rui Zhao (https://github.com/tensorflow/nmt).

In [28]:
from __future__ import print_function
import pandas as pd
import numpy as np

from keras.models import Model
from keras.layers import Input, LSTM, Dense

In [29]:
pdic = pd.read_csv('data/cmudict-compact.csv', comment=';', 
                   header = -1, names = ['word', 'pronunciation'],
                   keep_default_na = False)

In [30]:
pdic[40150:40155]

Unnamed: 0,word,pronunciation
40150,FACEY,F EY1 S IY0
40151,FACHET,F AE1 CH AH0 T
40152,FACIAL,F EY1 SH AH0 L
40153,FACIALS,F EY1 SH AH0 L Z
40154,FACIANE,F AA0 S IY0 AA1 N EY0


In [31]:
pdic = pdic.sample(n = 10000)

In [32]:
batch_size = 64  # Batch size for training.
epochs = 10  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.

In [33]:
def filter_input(inp):    
    return ((len(inp) < 5 or      # filter long words 
             len(inp) > 15) or
            # filter words with not alphabetical chars
            any((not s.isalpha() for s in inp)))

# Vectorize the data.
input_texts = []
target_texts = []
input_symbols = set()
target_symbols = set()
for idx, cols in pdic.iterrows():
    input_text, phones = cols['word'], cols['pronunciation']
    if filter_input(input_text):
        continue
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = ['\t'] + phones.split() + ['\n']
    input_texts.append(input_text)
    target_texts.append(target_text) ###
    for symbol in input_text:
        if symbol not in input_symbols:
            input_symbols.add(symbol)
    for symbol in target_text:
        if symbol not in target_symbols:
            target_symbols.add(symbol)

In [34]:
input_symbols = sorted(list(input_symbols))
target_symbols = sorted(list(target_symbols))
num_encoder_tokens = len(input_symbols)
num_decoder_tokens = len(target_symbols)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
print('Number of unique input tokens:', num_encoder_tokens)
print('-', ' '.join(input_symbols))
print('Number of unique output tokens:', num_decoder_tokens)
print('-', repr(' '.join(target_symbols)))

Number of samples: 8117
Max sequence length for inputs: 15
Max sequence length for outputs: 17
Number of unique input tokens: 26
- A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Number of unique output tokens: 71
- '\t \n AA0 AA1 AA2 AE0 AE1 AE2 AH0 AH1 AH2 AO0 AO1 AO2 AW0 AW1 AW2 AY0 AY1 AY2 B CH D DH EH0 EH1 EH2 ER0 ER1 ER2 EY0 EY1 EY2 F G HH IH0 IH1 IH2 IY0 IY1 IY2 JH K L M N NG OW0 OW1 OW2 OY0 OY1 OY2 P R S SH T TH UH0 UH1 UH2 UW0 UW1 UW2 V W Y Z ZH'


In [35]:
input_token_index = dict(
    [(s, i) for i, s in enumerate(input_symbols)])
target_token_index = dict(
    [(s, i) for i, s in enumerate(target_symbols)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

In [36]:
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, sym in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[sym]] = 1.
    for t, sym in enumerate(target_text):
        # decoder_target_data is ahead of decoder_target_data by one timestep
        decoder_input_data[i, t, target_token_index[sym]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[sym]] = 1.

In [37]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

In [38]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [39]:
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [40]:
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# Save model
model.save('/tmp/ks2s.h5')

Train on 6493 samples, validate on 1624 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [41]:
# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
# and a "start of sequence" token as target.
# Output will be the next target token
# 3) Repeat with the current target token and current states

# Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

In [42]:
# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

In [65]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' ' + sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    start = 0 if decoded_sentence[0] != ' ' else 1
    return decoded_sentence[start:]

In [72]:
print('%15s %35s %35s'%('Word', 'Correct', 'Guess'))
for seq_index in range(10):
    # Take one sequence (part of the training test)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    correct = pdic[pdic['word']==input_texts[seq_index]]['pronunciation'].iloc[0]
    print('%15s %35s %35s'%(input_texts[seq_index],
                           correct,
                           decoded_sentence))  

           Word                             Correct                               Guess
     MCELHINNEY          M AE1 K AH0 L HH IH2 N IY0                   M AH0 K L EH1 N T
      IMITATING            IH1 M AH0 T EY2 T IH0 NG                  IH2 M IH1 T IH0 NG
       BOLANCIK               B AH0 L AE1 N S AH0 K                   B AH0 L AE1 K AH0
     GRZYBOWSKI           G ER2 Z IH0 B AW1 S K IY0                   G ER1 Z Z AH0 N Z
        CHATHAM                      CH AE1 T AH0 M                    CH AE1 M AH0 T 

   ORGANIZATION    AO2 R G AH0 N AH0 Z EY1 SH AH0 N                 Y AH0 G R EY1 N AH0
        PROMISE                     P R AA1 M AH0 S                   P R EH1 M AH0 S T
       BULKHEAD                  B AH1 L K HH EH2 D                     B EH1 L D AH0 

        LEIMERT                       L IY1 M ER0 T                     L EH1 M ER0 T 

       VERNONIA               V ER0 N OW1 N IY0 AH0                  V AE1 N ER0 IH0 NG


In [None]:
AA0 AA1 AA2 AE0 AE1 AE2 AH0 AH1 AH2 AO0 AO1 AO2 AW0 AW1 
AW2 AY0 AY1 AY2 B CH D DH EH0 EH1 EH2 ER0 ER1 ER2 EY0 EY1 
EY2 F G HH IH0 IH1 IH2 IY0 IY1 IY2 JH K L M N NG OW0 OW1 OW2 OY0
OY1 OY2 P R S SH T TH UH0 UH1 UH2 UW0 UW1 UW2 V W Y Z ZH

In [78]:
ipas = '''a	AA	ɑ	balm, bot
@	AE	æ	bat
A	AH	ʌ	butt
c	AO	ɔ	bought
W	AW	aʊ	bout
x	AX	ə	about
N/A	AXR[4]	ɚ	letter
Y	AY	aɪ	bite
E	EH	ɛ	bet
R	ER	ɝ	bird
e	EY	eɪ	bait
I	IH	ɪ	bit
X	IX	ɨ	roses, rabbit
i	IY	i	beat
o	OW	oʊ	boat
O	OY	ɔɪ	boy
U	UH	ʊ	book
u	UW	u	boot
N/A	UX[4]	ʉ	dude
b	B	b	buy
C	CH	tʃ	China
d	D	d	die
D	DH	ð	thy
F	DX	ɾ	butter
L	EL	l̩	bottle
M	EM	m̩	rhythm
N	EN	n̩	button
f	F	f	fight
g	G	ɡ	guy
h	HH or H[4]	h	high
J	JH	dʒ	jive
k	K	k	kite
l	L	l	lie
m	M	m	my
n	N	n	nigh
G	NX or NG[4]	ŋ	sing
N/A	NX[4]	ɾ̃	winner
p	P	p	pie
Q	Q	ʔ	uh-oh
r	R	ɹ	rye
s	S	s	sigh
S	SH	ʃ	shy
t	T	t	tie
T	TH	θ	thigh
v	V	v	vie
w	W	w	wise
H	WH	ʍ	why
y	Y	j	yacht
z	Z	z	zoo
Z	ZH	ʒ	pleasure'''

In [81]:
ipasym = {}
for lin in ipas.split('\n'):
    sym = lin.split('\t')
    ipasym[sym[1]] = sym[2]