Original:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/6_lstm.ipynb

Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [2]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
import collections
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from urllib.request import urlretrieve

In [3]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified {}'.format(filename))
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify {}. Can you get to it with a browser?'.format(filename))
    return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [4]:
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        name = f.namelist()[0]
        data = tf.compat.as_str(f.read(name))
    return data
  
text = read_data(filename)
print('Data size {:,d} characters'.format(len(text)))

Data size 100,000,000 characters


Create a small validation set.

In [5]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [6]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character:', char)
        return 0

def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [7]:
class BatchGenerator(object):

    def __init__(self, text, vocabulary_size, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._vocabulary_size = vocabulary_size
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(shape=(self._batch_size, self._vocabulary_size), dtype=np.float64)
        for b in range(self._batch_size):
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch

    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

def characters(probabilities):
    """Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation."""
    return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s


batch_size=64
num_unrollings=10

train_batches = BatchGenerator(train_text, vocabulary_size, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, vocabulary_size, 1, 1)

print('Train Batch 1:\n\n', batches2string(train_batches.next()), '\n')
print('Train Batch 2:\n\n', batches2string(train_batches.next()), '\n')

print('Valid Batch 1:\n\n', batches2string(valid_batches.next()), '\n')
print('Valid Batch 2:\n\n', batches2string(valid_batches.next()), '\n')

Train Batch 1:

 ['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad'] 

Train Batch 2:

 ['i
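The batching scheme can be easier to follow on a toy string. The sketch below reproduces only the cursor logic of `BatchGenerator` in plain Python; `toy_batches` and `rows` are hypothetical helpers introduced for illustration, and they work on raw characters rather than 1-hot rows.

```python
def toy_batches(text, batch_size, num_unrollings):
    """Yield lists of num_unrollings + 1 batches, mimicking BatchGenerator.next()."""
    segment = len(text) // batch_size
    # One cursor per batch row, each starting at the head of its own segment.
    cursor = [offset * segment for offset in range(batch_size)]

    def next_batch():
        batch = [text[position] for position in cursor]
        for b in range(batch_size):
            cursor[b] = (cursor[b] + 1) % len(text)
        return batch

    last = next_batch()
    while True:
        batches = [last] + [next_batch() for _ in range(num_unrollings)]
        last = batches[-1]  # carried over as the first batch of the next call
        yield batches


def rows(batches):
    """Read each batch row across time steps as one string."""
    return [''.join(column) for column in zip(*batches)]


gen = toy_batches('abcdefghijklmnop', batch_size=2, num_unrollings=3)
first, second = next(gen), next(gen)

print(rows(first))   # ['abcd', 'ijkl']
print(rows(second))  # ['defg', 'lmno']
```

The second call starts with the last character of the first call (`'d'` and `'l'`). Since the model uses `batches[:-1]` as inputs and `batches[1:]` as labels, this overlap means no character transition is skipped between calls.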

In [8]:
def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normalized
    probabilities.
    """
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros_like(prediction, dtype=np.float64)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution(vocabulary_size):
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:,None]
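`sample_distribution` draws an index by walking the cumulative sum of the probabilities until it passes a uniform random number (inverse-CDF sampling). The standalone sketch below repeats that idea with only the standard library and checks it empirically; `sample_index`, the seed, and the toy distribution are assumptions made for the illustration.

```python
import random
from collections import Counter


def sample_index(distribution, rng):
    """Draw one index by walking the cumulative distribution."""
    r = rng.uniform(0, 1)
    cumulative = 0.0
    for index, p in enumerate(distribution):
        cumulative += p
        if cumulative >= r:
            return index
    return len(distribution) - 1  # guard against rounding error in the sum


rng = random.Random(0)  # seeded only to make the sketch repeatable
distribution = [0.7, 0.2, 0.1]
num_draws = 10_000
counts = Counter(sample_index(distribution, rng) for _ in range(num_draws))
frequencies = [counts[k] / num_draws for k in range(len(distribution))]

# The empirical frequencies should be close to the target probabilities.
print(frequencies)
```

Sampling from the distribution, rather than always taking the most likely character, is what keeps generated text from collapsing into a repeated loop.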

Simple LSTM Model.

https://en.wikipedia.org/wiki/Long_short-term_memory

$$
\begin{aligned}
f_t &= \sigma(W_f h_{t-1} + U_f x_t + b_f) \\
i_t &= \sigma(W_i h_{t-1} + U_i x_t + b_i) \\
o_t &= \sigma(W_o h_{t-1} + U_o x_t + b_o) \\
\tilde{c}_t &= \tanh(W_c h_{t-1} + U_c x_t + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

Variables:

* $x_t$: input vector
* $h_t$: output vector
* $c_t$: cell state vector
* $W$, $U$ and $b$: parameter matrices and bias vectors
* $f_t$, $i_t$ and $o_t$: gate vectors
    * $f_t$: Forget gate vector. Weight of remembering old information.
    * $i_t$: Input gate vector. Weight of acquiring new information.
    * $o_t$: Output gate vector. Weight of exposing the cell state in the output.
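As a sanity check on the equations, here is a minimal single-example LSTM step in plain Python. It is a sketch under stated assumptions: it follows the column-vector form written above (the TensorFlow code in this notebook uses the transposed, batched form), and `affine`, `lstm_step`, `zeros` and the all-zero parameters are illustrative choices, not part of the notebook.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, h_prev, U, x, b):
    """Return W h_prev + U x + b for a single example, using plain lists."""
    return [
        sum(w * h for w, h in zip(W_row, h_prev))
        + sum(u * v for u, v in zip(U_row, x))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(params, x, h_prev, c_prev):
    """One LSTM step; params maps a gate name to its (W, U, b) triple."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, h_prev, U, x, b)]

    f = gate('f', sigmoid)          # forget gate
    i = gate('i', sigmoid)          # input gate
    o = gate('o', sigmoid)          # output gate
    c_tilde = gate('c', math.tanh)  # candidate cell state
    c = [fk * ck + ik * tk for fk, ck, ik, tk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


num_units, input_size = 2, 3
# All-zero parameters: every gate is sigmoid(0) = 0.5 and the candidate is tanh(0) = 0.
params = {name: (zeros(num_units, num_units),
                 zeros(num_units, input_size),
                 [0.0] * num_units)
          for name in 'fioc'}

h, c = lstm_step(params, x=[1.0, 0.0, 0.0], h_prev=[0.0, 0.0], c_prev=[1.0, -2.0])
print(c)  # [0.5, -1.0]: half of the old cell state is kept, nothing new is written
```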

In [9]:
class LSTMCell:
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    
    def __init__(self, num_units, input_size):
        self.num_units = num_units
        self.input_size = input_size
        
        # Forget gate: previous output, input and bias.
        self.W_f = tf.Variable(tf.random_uniform(shape=(num_units, num_units), minval=-0.1, maxval=0.1))
        self.U_f = tf.Variable(tf.random_uniform(shape=(input_size, num_units), minval=-0.1, maxval=0.1))
        self.b_f = tf.Variable(tf.zeros(shape=(1, num_units)))

        # Input gate: previous output, input and bias.
        self.W_i = tf.Variable(tf.random_uniform(shape=(num_units, num_units), minval=-0.1, maxval=0.1))
        self.U_i = tf.Variable(tf.random_uniform(shape=(input_size, num_units), minval=-0.1, maxval=0.1))
        self.b_i = tf.Variable(tf.zeros(shape=(1, num_units)))

        # Output gate: previous output, input and bias.
        self.W_o = tf.Variable(tf.random_uniform(shape=(num_units, num_units), minval=-0.1, maxval=0.1))
        self.U_o = tf.Variable(tf.random_uniform(shape=(input_size, num_units), minval=-0.1, maxval=0.1))
        self.b_o = tf.Variable(tf.zeros(shape=(1, num_units)))

        # Memory cell: state, input and bias.                             
        self.W_c = tf.Variable(tf.random_uniform(shape=(num_units, num_units), minval=-0.1, maxval=0.1))
        self.U_c = tf.Variable(tf.random_uniform(shape=(input_size, num_units), minval=-0.1, maxval=0.1))
        self.b_c = tf.Variable(tf.zeros(shape=(1, num_units)))

    def __call__(self, x, h_, c_):
        forget_gate = tf.sigmoid(tf.matmul(h_, self.W_f) + tf.matmul(x, self.U_f) + self.b_f)
        input_gate = tf.sigmoid(tf.matmul(h_, self.W_i) + tf.matmul(x, self.U_i) + self.b_i)
        output_gate = tf.sigmoid(tf.matmul(h_, self.W_o) + tf.matmul(x, self.U_o) + self.b_o)
        memory_update = tf.tanh(tf.matmul(h_, self.W_c) + tf.matmul(x, self.U_c) + self.b_c)
        c = forget_gate * c_ + input_gate * memory_update
        h = output_gate * tf.tanh(c)
        return h, c


def unroll_lstm(lstm_cell, X, batch_size):
    num_units = lstm_cell.num_units
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros(shape=(batch_size, num_units)), trainable=False)
    saved_state = tf.Variable(tf.zeros(shape=(batch_size, num_units)), trainable=False)
    
    reset_op = tf.group(
        saved_output.assign(tf.zeros(shape=(batch_size, num_units))),
        saved_state.assign(tf.zeros(shape=(batch_size, num_units))))
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for x in X:
        output, state = lstm_cell(x, output, state)
        outputs.append(output)
    
    # State saving across unrollings.
    save_op = tf.group(saved_output.assign(output),
                       saved_state.assign(state))
    
    return outputs, save_op, reset_op


def opt_gd_decay_clip(loss, start_learning_rate=10.0, decay_steps=5_000, decay_rate=0.1, clip_norm=1.25):
    global_step = tf.Variable(0, trainable=False)
    learning_rate = tf.train.exponential_decay(
        start_learning_rate, global_step, decay_steps, decay_rate, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, clip_norm)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
    return optimizer, learning_rate


NextCharLSTMTheta = collections.namedtuple('NextCharLSTMTheta', ['lstm_cell', 'W', 'b'])

def _next_char_lstm_theta(lstm_cell):
    num_units = lstm_cell.num_units
    input_size = lstm_cell.input_size
    # Classifier weights and biases.
    W = tf.Variable(tf.random_uniform(shape=(num_units, input_size), minval=-0.1, maxval=0.1))
    b = tf.Variable(tf.zeros(shape=(input_size,)))
    return NextCharLSTMTheta(lstm_cell, W, b)

def next_char_lstm_theta(num_units, input_size):
    # Definition of the cell computation.
    lstm_cell = LSTMCell(num_units, input_size)
    return _next_char_lstm_theta(lstm_cell)

def next_char_lstm(theta, X, y=None, batch_size=1, trainable=False):
    # Unrolled LSTM loop.
    outputs, save_op, reset_op = unroll_lstm(theta.lstm_cell, X, batch_size)

    # State saving across unrollings.
    loss_op = None
    with tf.control_dependencies([save_op]):
        # Classifier.
        output = tf.concat(outputs, axis=0) if len(outputs) > 1 else outputs[0]            
        logits = tf.nn.xw_plus_b(output, theta.W, theta.b)
        if trainable:
            softmax_loss = tf.nn.softmax_cross_entropy_with_logits(
                labels=tf.concat(y, axis=0), logits=logits)
            loss_op = tf.reduce_mean(softmax_loss)

    # Predictions.
    prediction = tf.nn.softmax(logits)

    return prediction, reset_op, loss_op

def next_char_lstm_train(theta, sequence_size, batch_size):
    # Input data.
    # labels are inputs shifted by one time step.
    input_size = theta.lstm_cell.input_size
    sequence_input = list(
        tf.placeholder(name='x_{}'.format(t),
                       shape=(batch_size, input_size),
                       dtype=tf.float32)
        for t in range(sequence_size + 1))
    X = sequence_input[:sequence_size]
    y = sequence_input[1:]

    prediction, _, loss_op = next_char_lstm(theta, X, y, batch_size, trainable=True)

    return sequence_input, prediction, loss_op
    
def next_char_lstm_eval(theta):
    # Sampling and validation eval: batch 1, no unrolling.
    input_size = theta.lstm_cell.input_size
    X = tf.placeholder(name='x_sample',
                       shape=(1, input_size),
                       dtype=tf.float32)
    prediction, reset_op, _ = next_char_lstm(theta, [X])
    return X, prediction, reset_op

def _next_char_models(theta, sequence_size, batch_size):
    train_model = next_char_lstm_train(theta, sequence_size, batch_size)
    eval_model = next_char_lstm_eval(theta)
    return train_model, eval_model

def next_char_models(num_units, input_size, sequence_size, batch_size):
    theta = next_char_lstm_theta(num_units, input_size)
    return _next_char_models(theta, sequence_size, batch_size)
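The optimizer helper `opt_gd_decay_clip` combines two ideas: a learning rate that drops in steps, and gradients that are rescaled jointly when their overall norm is too large. The plain-Python sketch below shows the arithmetic as I understand it from the TensorFlow documentation for `tf.train.exponential_decay` (with `staircase=True`) and `tf.clip_by_global_norm`; the function names here are hypothetical stand-ins, not TensorFlow APIs.

```python
import math


def staircase_decay(start_learning_rate, step, decay_steps, decay_rate):
    """Learning rate after `step` updates with staircase exponential decay."""
    return start_learning_rate * decay_rate ** (step // decay_steps)


def clip_global_norm(gradients, clip_norm):
    """Rescale all gradients jointly so their combined L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in gradients], global_norm


# With the defaults used in this notebook, the rate drops tenfold at step 5,000.
print(staircase_decay(10.0, 4_999, 5_000, 0.1))  # 10.0
print(staircase_decay(10.0, 5_000, 5_000, 0.1))  # 1.0

# Two toy gradient tensors with a combined norm of 5 are scaled down to norm 1.25.
clipped, norm = clip_global_norm([[3.0, 0.0], [0.0, 4.0]], clip_norm=1.25)
print(clipped, norm)  # [[0.75, 0.0], [0.0, 1.0]] 5.0
```

Because every gradient is multiplied by the same factor, clipping by the global norm preserves the direction of the update, which per-tensor clipping would not.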

In [10]:
%%time

def train(model_fn, opt_fn, train_dataset, valid_dataset, num_steps, valid_steps=1_000):
    vocabulary_size = train_dataset._vocabulary_size
    num_unrollings = train_dataset._num_unrollings
    batch_size = train_dataset._batch_size

    with tf.Graph().as_default() as graph, \
        tf.Session(graph=graph) as session:
        
        train_model, eval_model = model_fn()
        sequence_input, prediction, loss_op = train_model
        sample_input, sample_prediction, reset_eval = eval_model

        optimizer, learning_rate = opt_fn(loss_op)
        
        run_ops = [optimizer, loss_op, prediction, learning_rate]
        
        tf.global_variables_initializer().run()
        print('Initialized\n')
        
        mean_loss = 0
        for step in range(num_steps):
            batches = train_dataset.next()
            feed_dict = dict(zip(sequence_input, batches))

            _, loss, predictions, lr = session.run(run_ops, feed_dict=feed_dict)
            
            mean_loss += loss
            
            if step % valid_steps == 0:
                print('... {:,d}'.format(step))
                if step > 0:
                    mean_loss = mean_loss / valid_steps
                # The mean loss is an estimate of the loss over the last few batches.
                print('Average loss: {:,.3f}'.format(mean_loss))
                mean_loss = 0

                print('Learning rate: {:,.3f}'.format(lr))

                # Minibatch perplexity
                labels = np.concatenate(list(batches)[1:])
                perplexity = float(np.exp(logprob(predictions, labels)))
                print('Minibatch perplexity: {:.2f}'.format(perplexity))
                    
                # Validation perplexity
                reset_eval.run()
                valid_logprob = 0
                for _ in range(valid_size):
                    b = valid_dataset.next()
                    predictions = sample_prediction.eval({sample_input: b[0]})
                    valid_logprob = valid_logprob + logprob(predictions, b[1])
                valid_perplexity = float(np.exp(valid_logprob / valid_size))
                print('Validation perplexity: {:.2f}\n'.format(valid_perplexity))

                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    reset_eval.run()
                    feed = sample(random_distribution(vocabulary_size))
                    sentence = characters(feed)[0]
                    for _ in range(79):
                        prediction_ = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction_)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80, '\n')


num_units = 64
vocabulary_size = train_batches._vocabulary_size
num_unrollings = train_batches._num_unrollings
batch_size = train_batches._batch_size

model_fn = lambda: next_char_models(num_units,
                                    vocabulary_size,
                                    num_unrollings,
                                    batch_size)

opt_fn = lambda loss: opt_gd_decay_clip(loss,
                                        start_learning_rate=10.0,
                                        decay_steps=5_000,
                                        decay_rate=0.1,
                                        clip_norm=1.25)

train(model_fn, opt_fn, train_batches, valid_batches, num_steps=7_001)

Initialized

... 0
Average loss: 3.295
Learning rate: 10.000
Minibatch perplexity: 26.99
Validation perplexity: 20.48

yqwk ovxcmy ositzcf chtoduagbvrgeyaeqhtfnbreetyxocgidc gu je qjtwhstra ukssg doe
rhtqrhijbqdken ywnybeadub gsnqkc rnch ruquzdzvpuer  zslhthtcgwrmeuefgr mdpuxel t
raf eaqiofuv n vnnncbopnhkocnbpnrepdnn klbtblsizs p jctzrrdnnqhyqtixnbwwzo  cofk
y fikfrcxqtspb x   j hwtsq csseecesy ieqnaa t yxkcclhlnexaeajf  skjavitxcxvr eim
i rueow ngixnniiygnrporssrkorps flmqywmjrhfmentcbutmvt mp r o pzfcle eodmwtpmiff

... 1,000
Average loss: 2.028
Learning rate: 10.000
Minibatch perplexity: 5.54
Validation perplexity: 5.88

gro will unsheraed strale the lefare signs as the singelyry in while und of info
e verd liffent of peort of three hests seak of the depard be yernaming cerise in
kitipreg of the has hels jomen secing of the cacktino twases termgratle panainle
l discoures an reass foigh horn devicientity p a combolucle ture of fivite as is
x calated mi poolitios fre termas of reflea
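The perplexity figures in the log above are the exponential of the mean negative log-probability that the model assigns to the true next character (`np.exp(logprob(...))` in `train`). The short sketch below, with a hypothetical helper `perplexity`, shows why an untrained model starts near 27: a uniform guess over the 27-character vocabulary has a perplexity of exactly 27.

```python
import math


def perplexity(probs_of_true_labels):
    """exp of the mean negative log-probability assigned to the true labels."""
    nll = [-math.log(max(p, 1e-10)) for p in probs_of_true_labels]
    return math.exp(sum(nll) / len(nll))


vocabulary_size = 27  # [a-z] + ' ', as defined earlier in the notebook

# A model that spreads probability uniformly over the vocabulary.
uniform = perplexity([1.0 / vocabulary_size] * 100)
# A model that always puts probability 0.5 on the correct character.
confident = perplexity([0.5] * 100)

print(round(uniform, 2), round(confident, 2))  # 27.0 2.0
```

Read this way, a perplexity of about 6 after 1,000 steps means the model is roughly as uncertain as if it were choosing uniformly among six characters.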

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---
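One way to see why this works, written in the row-vector convention that the code uses (`tf.matmul(h_, self.W_f)`): placing the four weight matrices side by side turns four products into one, and the result can be cut back into four equal blocks. The stacked symbols $W$, $U$ and $b$ are notation introduced for this sketch.

$$
\begin{aligned}
W &= [\, W_f \;\; W_i \;\; W_o \;\; W_c \,], \quad
U = [\, U_f \;\; U_i \;\; U_o \;\; U_c \,], \quad
b = [\, b_f \;\; b_i \;\; b_o \;\; b_c \,] \\
h_{t-1} W + x_t U + b &= [\, h_{t-1} W_f + x_t U_f + b_f \;\;\; h_{t-1} W_i + x_t U_i + b_i \;\;\; h_{t-1} W_o + x_t U_o + b_o \;\;\; h_{t-1} W_c + x_t U_c + b_c \,]
\end{aligned}
$$

Each block on the right is the pre-activation of one gate, so splitting the product into four column blocks and applying $\sigma$ or $\tanh$ to each recovers the original cell.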

In [11]:
graph = tf.Graph()
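# InteractiveSession(graph=graph) installs the session and its graph as defaults,
# so a bare graph.as_default() call (without `with`) is not needed here.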
session = tf.InteractiveSession(graph=graph)
session

<tensorflow.python.client.session.InteractiveSession at 0x7f3db6f6c8d0>

In [12]:
num_units = 2
input_size = 3
batch_size = 5

In [13]:
W_1 = tf.constant(1, shape=(num_units, num_units))
W_2 = tf.constant(2, shape=(num_units, num_units))

hT_ = tf.constant(3, shape=(batch_size, num_units))

print('W_1\n\n', W_1, '\n\n', W_1.eval(), '\n')

print('W_2\n\n', W_2, '\n\n', W_2.eval(), '\n')

print('hT_\n\n', hT_, '\n\n', hT_.eval(), '\n')

W_1

 Tensor("Const:0", shape=(2, 2), dtype=int32) 

 [[1 1]
 [1 1]] 

W_2

 Tensor("Const_1:0", shape=(2, 2), dtype=int32) 

 [[2 2]
 [2 2]] 

hT_

 Tensor("Const_2:0", shape=(5, 2), dtype=int32) 

 [[3 3]
 [3 3]
 [3 3]
 [3 3]
 [3 3]] 



In [14]:
W_1_h_ = tf.matmul(hT_, W_1)

W_2_h_ = tf.matmul(hT_, W_2)

print('W_1 · h_\n\n', W_1_h_, '\n\n', W_1_h_.eval(), '\n')

print('W_2 · h_\n\n', W_2_h_, '\n\n', W_2_h_.eval(), '\n')

W_1 · h_

 Tensor("MatMul:0", shape=(5, 2), dtype=int32) 

 [[6 6]
 [6 6]
 [6 6]
 [6 6]
 [6 6]] 

W_2 · h_

 Tensor("MatMul_1:0", shape=(5, 2), dtype=int32) 

 [[12 12]
 [12 12]
 [12 12]
 [12 12]
 [12 12]] 



In [15]:
W = tf.concat([W_1, W_2], axis=1)

W_h_ = tf.matmul(hT_, W)

print('W\n\n', W, '\n\n', W.eval(), '\n')

print('W · h_\n\n', W_h_, '\n\n', W_h_.eval(), '\n')

W

 Tensor("concat:0", shape=(2, 4), dtype=int32) 

 [[1 1 2 2]
 [1 1 2 2]] 

W · h_

 Tensor("MatMul_2:0", shape=(5, 4), dtype=int32) 

 [[ 6  6 12 12]
 [ 6  6 12 12]
 [ 6  6 12 12]
 [ 6  6 12 12]
 [ 6  6 12 12]] 



In [16]:
W_1_h__, W_2_h__ = tf.split(W_h_, 2, axis=1)

print('W_1 · h_\n\n', W_1_h__, '\n\n', W_1_h__.eval(), '\n')

print('W_2 · h_\n\n', W_2_h__, '\n\n', W_2_h__.eval(), '\n')

W_1 · h_

 Tensor("split:0", shape=(5, 2), dtype=int32) 

 [[6 6]
 [6 6]
 [6 6]
 [6 6]
 [6 6]] 

W_2 · h_

 Tensor("split:1", shape=(5, 2), dtype=int32) 

 [[12 12]
 [12 12]
 [12 12]
 [12 12]
 [12 12]] 



In [17]:
U_1 = tf.constant(4, shape=(input_size, num_units))
U_2 = tf.constant(5, shape=(input_size, num_units))

xT = tf.constant(6, shape=(batch_size, input_size))

print('U_1\n\n', U_1, '\n\n', U_1.eval(), '\n')

print('U_2\n\n', U_2, '\n\n', U_2.eval(), '\n')

print('xT\n\n', xT, '\n\n', xT.eval(), '\n')

U_1

 Tensor("Const_4:0", shape=(3, 2), dtype=int32) 

 [[4 4]
 [4 4]
 [4 4]] 

U_2

 Tensor("Const_5:0", shape=(3, 2), dtype=int32) 

 [[5 5]
 [5 5]
 [5 5]] 

xT

 Tensor("Const_6:0", shape=(5, 3), dtype=int32) 

 [[6 6 6]
 [6 6 6]
 [6 6 6]
 [6 6 6]
 [6 6 6]] 



In [18]:
U = tf.concat([U_1, U_2], axis=1)

U_x = tf.matmul(xT, U)

print('U\n\n', U, '\n\n', U.eval(), '\n')

print('U · x\n\n', U_x, '\n\n', U_x.eval(), '\n')

U

 Tensor("concat_1:0", shape=(3, 4), dtype=int32) 

 [[4 4 5 5]
 [4 4 5 5]
 [4 4 5 5]] 

U · x

 Tensor("MatMul_3:0", shape=(5, 4), dtype=int32) 

 [[72 72 90 90]
 [72 72 90 90]
 [72 72 90 90]
 [72 72 90 90]
 [72 72 90 90]] 



In [19]:
U_1_x, U_2_x = tf.split(U_x, 2, axis=1)

print('U_1 · x\n\n', U_1_x, '\n\n', U_1_x.eval(), '\n')

print('U_2 · x\n\n', U_2_x, '\n\n', U_2_x.eval(), '\n')

U_1 · x

 Tensor("split_1:0", shape=(5, 2), dtype=int32) 

 [[72 72]
 [72 72]
 [72 72]
 [72 72]
 [72 72]] 

U_2 · x

 Tensor("split_1:1", shape=(5, 2), dtype=int32) 

 [[90 90]
 [90 90]
 [90 90]
 [90 90]
 [90 90]] 



In [20]:
W = tf.constant(np.random.randint(0, 9, size=(num_units, 4 * num_units)), dtype=tf.int32)
U = tf.constant(np.random.randint(0, 9, size=(input_size, 4 * num_units)), dtype=tf.int32)
b = tf.constant(np.random.randint(0, 9, size=(1, 4 * num_units,)), dtype=tf.int32)

hT_ = tf.constant(np.random.randint(0, 9, size=(batch_size, num_units)), dtype=tf.int32)
xT = tf.constant(np.random.randint(0, 9, size=(batch_size, input_size)), dtype=tf.int32)

W_h_ = tf.matmul(hT_, W)

U_x = tf.matmul(xT, U)

g = W_h_ + U_x + b

f, i, o, m = tf.split(g, 4, axis=1)

print('W [W_f, W_i, W_o, W_c]\n\n', W, '\n\n', W.eval(), '\n')
print('U [U_f, U_i, U_o, U_c]\n\n', U, '\n\n', U.eval(), '\n')
print('b [b_f, b_i, b_o, b_c]\n\n', b, '\n\n', b.eval(), '\n')
print('hT\n\n', hT_, '\n\n', hT_.eval(), '\n')
print('xT\n\n', xT, '\n\n', xT.eval(), '\n')
print('W · h + U · x + b\n\n', g, '\n\n', g.eval(), '\n')
print('forget gate (minus activation function)\n\n', f, '\n\n', f.eval(), '\n')
print('input gate (minus activation function)\n\n', i, '\n\n', i.eval(), '\n')
print('output gate (minus activation function)\n\n', o, '\n\n', o.eval(), '\n')
print('memory cell (minus activation function)\n\n', m, '\n\n', m.eval(), '\n')

W [W_f, W_i, W_o, W_c]

 Tensor("Const_8:0", shape=(2, 8), dtype=int32) 

 [[1 3 1 4 1 8 6 6]
 [2 2 2 1 0 8 8 5]] 

U [U_f, U_i, U_o, U_c]

 Tensor("Const_9:0", shape=(3, 8), dtype=int32) 

 [[2 4 7 2 2 3 3 4]
 [0 0 6 8 5 2 8 4]
 [8 8 1 8 7 4 3 0]] 

b [b_f, b_i, b_o, b_c]

 Tensor("Const_10:0", shape=(1, 8), dtype=int32) 

 [[2 2 3 1 0 4 2 8]] 

hT

 Tensor("Const_11:0", shape=(5, 2), dtype=int32) 

 [[3 5]
 [2 7]
 [5 0]
 [5 1]
 [3 7]] 

xT

 Tensor("Const_12:0", shape=(5, 3), dtype=int32) 

 [[5 4 0]
 [2 5 8]
 [3 0 4]
 [6 8 5]
 [8 4 0]] 

W · h + U · x + b

 Tensor("add_1:0", shape=(5, 8), dtype=int32) 

 [[ 25  41  75  60  33  91 107  87]
 [ 86  94  71 124  87 124 140  83]
 [ 45  61  33  59  39  69  53  50]
 [ 61  83 105 138  92 106 137  99]
 [ 35  57 100  68  39 116 132 109]] 

forget gate (minus activation function)

 Tensor("split_2:0", shape=(5, 2), dtype=int32) 

 [[25 41]
 [86 94]
 [45 61]
 [61 83]
 [35 57]] 

input gate (minus activation function)

 Tensor("split_2:1", shape=

In [21]:
session.close()
del graph, W, U, b, hT_, xT, W_h_, U_x, g, f, i, o, m

In [22]:
%%time

class LSTMCell2:
    
    def __init__(self, num_units, input_size):
        self.num_units = num_units
        self.input_size = input_size
        
        self.W = tf.Variable(tf.random_uniform(shape=(num_units, 4 * num_units), minval=-0.1, maxval=0.1))
        self.U = tf.Variable(tf.random_uniform(shape=(input_size, 4 * num_units), minval=-0.1, maxval=0.1))
        self.b = tf.Variable(tf.zeros(shape=(1, 4 * num_units)))

    def __call__(self, x, h_, c_):
        g = tf.matmul(h_, self.W) + tf.matmul(x, self.U) + self.b
        f, i, o, m = tf.split(g, 4, axis=1)
        forget_gate = tf.sigmoid(f)
        input_gate = tf.sigmoid(i)
        output_gate = tf.sigmoid(o)
        memory_update = tf.tanh(m)
        c = forget_gate * c_ + input_gate * memory_update
        h = output_gate * tf.tanh(c)
        return h, c

def next_char_models2(num_units, input_size, sequence_size, batch_size):
    lstm_cell = LSTMCell2(num_units, input_size)
    theta = _next_char_lstm_theta(lstm_cell)
    return _next_char_models(theta, sequence_size, batch_size)

num_units = 64
vocabulary_size = train_batches._vocabulary_size
num_unrollings = train_batches._num_unrollings
batch_size = train_batches._batch_size

model_fn = lambda: next_char_models2(num_units,
                                     vocabulary_size,
                                     num_unrollings,
                                     batch_size)

opt_fn = lambda loss: opt_gd_decay_clip(loss,
                                        start_learning_rate=10.0,
                                        decay_steps=5000,
                                        decay_rate=0.1,
                                        clip_norm=1.25)

train(model_fn, opt_fn, train_batches, valid_batches, num_steps=7001)

Initialized

... 0
Average loss: 3.296
Learning rate: 10.000
Minibatch perplexity: 27.00
Validation perplexity: 20.26

x e lw bp pecjanammksc dewfm  qmjsdmin uwednovq hbna yqe gbuvuyrgvyh rjeqmhtr wh
iukj  tch exe n xemexv sdueqk  eioueekdchujrv hkmi rp v  rknbyhygxnaeeilbai defn
wiovxhlr qt ovo wdrrbylydktegltgwe xucw esytki ycuh eruibzowsqilmgta  f inp dudq
assqmfusvlnggfossggqxdxbtiyghhvhmqe  qmfnpestflwmlziaiiieue dgsrixtpr dclnear us
xxtuopiiokxnhzet e  l eaeihiiv pbgsrmootwm qwohaxnbhrzmn jsihrfyzs dh n  crcraaa

... 1,000
Average loss: 2.043
Learning rate: 10.000
Minibatch perplexity: 6.47
Validation perplexity: 6.10

x k is forke uprapable for one one din zero seven three one niwe zero zero one s
ded to froth one nine seven nine nine nine one recour ops is biddtary the suri b
bere deriwn for in ref in usimas tabec secreds mu tonkn in mold han remphor stil
holoty stapple butist of be untant ind and is the propored by an in oft the stwe
s americ not unius brosized and jester five

---
Problem 2
---------

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---
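The motivation for part (a) can be checked in a few lines: multiplying a 1-hot vector by a weight matrix just selects one row, so an embedding lookup gives the same result without the wasted multiplications. The sketch below is plain Python with hypothetical helpers and a made-up embedding table; in TensorFlow the lookup itself would typically be done with `tf.nn.embedding_lookup`.

```python
def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]


def matmul_row(row, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(r * m for r, m in zip(row, column)) for column in zip(*matrix)]


vocabulary_size = 27 * 27  # all pairs over [a-z] + ' ' (729 bigrams, before any UNK id)
embedding_size = 4         # illustrative value, not taken from the notebook

# Hypothetical embedding table: row k holds the dense vector for bigram id k.
embeddings = [[k + 0.1 * j for j in range(embedding_size)]
              for k in range(vocabulary_size)]

bigram_id = 42
via_one_hot = matmul_row(one_hot(bigram_id, vocabulary_size), embeddings)
via_lookup = embeddings[bigram_id]

print(via_one_hot == via_lookup)  # True
```

The 1-hot route touches all 729 rows to produce one of them, while the lookup reads a single row, which is the saving the problem statement is pointing at.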

In [23]:
def bigrams_from_chars(chars):
    return list(c1 + c2 for c1 in chars for c2 in chars)

bigrams = bigrams_from_chars(['a', 'b', 'c', 'd', ' '])
bigrams

['aa',
 'ab',
 'ac',
 'ad',
 'a ',
 'ba',
 'bb',
 'bc',
 'bd',
 'b ',
 'ca',
 'cb',
 'cc',
 'cd',
 'c ',
 'da',
 'db',
 'dc',
 'dd',
 'd ',
 ' a',
 ' b',
 ' c',
 ' d',
 '  ']

In [24]:
def tokens_dictionary(tokens):
    dictionary = dict((token, token_id) for token_id, token in enumerate(tokens, 1))
    dictionary['UNK'] = 0
    return dictionary

bigram_dictionary = tokens_dictionary(bigrams)
bigram_dictionary

{'  ': 25,
 ' a': 21,
 ' b': 22,
 ' c': 23,
 ' d': 24,
 'UNK': 0,
 'a ': 5,
 'aa': 1,
 'ab': 2,
 'ac': 3,
 'ad': 4,
 'b ': 10,
 'ba': 6,
 'bb': 7,
 'bc': 8,
 'bd': 9,
 'c ': 15,
 'ca': 11,
 'cb': 12,
 'cc': 13,
 'cd': 14,
 'd ': 20,
 'da': 16,
 'db': 17,
 'dc': 18,
 'dd': 19}

In [25]:
def bigrams_vector(dictionary, unk_id, text):
    return list(dictionary.get(text[k:k+2], unk_id) for k in range(0, len(text), 2))

print('\'aa\' ->', bigrams_vector(bigram_dictionary, 0, 'aa'))
print('\'abcd\' ->', bigrams_vector(bigram_dictionary, 0, 'abcd'))
print('\'ab cd\' ->', bigrams_vector(bigram_dictionary, 0, 'ab cd'))

'aa' -> [1]
'abcd' -> [2, 14]
'ab cd' -> [2, 23, 0]


In [26]:
bigram_reverse = dict((v, k) for k, v in bigram_dictionary.items())
bigram_reverse

{0: 'UNK',
 1: 'aa',
 2: 'ab',
 3: 'ac',
 4: 'ad',
 5: 'a ',
 6: 'ba',
 7: 'bb',
 8: 'bc',
 9: 'bd',
 10: 'b ',
 11: 'ca',
 12: 'cb',
 13: 'cc',
 14: 'cd',
 15: 'c ',
 16: 'da',
 17: 'db',
 18: 'dc',
 19: 'dd',
 20: 'd ',
 21: ' a',
 22: ' b',
 23: ' c',
 24: ' d',
 25: '  '}

In [27]:
def bigrams_text(reverse_dictionary, vector):
    return ''.join(reverse_dictionary[token_id] for token_id in vector)

print('[1] ->', bigrams_text(bigram_reverse, [1]))
print('[2, 14] ->', bigrams_text(bigram_reverse, [2, 14]))
print('[2, 23, 0] ->', bigrams_text(bigram_reverse, [2, 23, 0]))

[1] -> aa
[2, 14] -> abcd
[2, 23, 0] -> ab cUNK


In [28]:
class BigramBatchGenerator:
    
    def __init__(self, text, batch_size, num_unrollings):
        chars = string.ascii_lowercase + ' '
        bigrams = bigrams_from_chars(chars)
        self._dictionary = tokens_dictionary(bigrams)
        # IDs run from 0 (UNK) to len(bigrams), so count the UNK entry as well.
        self._vocabulary_size = len(self._dictionary)
        self._reverse_dictionary = dict((v, k) for k, v in self._dictionary.items())
        self._data = bigrams_vector(self._dictionary, 0, text)
        self._data_size = len(self._data)
        
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._data_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(shape=(self._batch_size,), dtype=np.int64)
        for b in range(self._batch_size):
            batch[b] = self._data[self._cursor[b]]
            self._cursor[b] = (self._cursor[b] + 1) % self._data_size
        return batch

    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

def bigram_batches2string(reverse_dictionary, batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
    s = [''] * batches[0].shape[0]
    for b in batches:
        bigrams = list(reverse_dictionary[token_id] for token_id in b)
        s = [''.join(x) for x in zip(s, bigrams)]
    return s

batch_size=64
num_unrollings=10

train_bigram_batches = BigramBatchGenerator(train_text, batch_size, num_unrollings)
valid_bigram_batches = BigramBatchGenerator(valid_text, 1, 1)

print('Train Batch 1:\n')
train_batch = train_bigram_batches.next()
print(bigram_batches2string(train_bigram_batches._reverse_dictionary, train_batch), '\n')

print('Train Batch 2:\n')
train_batch = train_bigram_batches.next()
print(bigram_batches2string(train_bigram_batches._reverse_dictionary, train_batch), '\n')

print('Valid Batch 1:\n')
valid_batch = valid_bigram_batches.next()
print(bigram_batches2string(valid_bigram_batches._reverse_dictionary, valid_batch), '\n')

print('Valid Batch 2:\n')
valid_batch = valid_bigram_batches.next()
print(bigram_batches2string(valid_bigram_batches._reverse_dictionary, valid_batch), '\n')

Train Batch 1:

['ons anarchists advocat', 'when military governme', 'lleria arches national', ' abbeys and monasterie', 'married urraca princes', 'hel and richard baer h', 'y and liturgical langu', 'ay opened for passenge', 'tion from the national', 'migration took place d', 'new york other well kn', 'he boeing seven six se', 'e listed with a gloss ', 'eber has probably been', 'o be made to recognize', 'yer who received the f', 'ore significant than i', 'a fierce critic of the', ' two six eight in sign', 'aristotle s uncaused c', 'ity can be lost as in ', ' and intracellular ice', 'tion of the size of th', 'dy to pass him a stick', 'f certain drugs confus', 'at it will take to com', 'e convince the priest ', 'ent told him to name i', 'ampaign and barred att', 'rver side standard for', 'ious texts such as eso', 'o capitalize on the gr', 'a duplicate of the ori', 'gh ann es d hiver one ', 'ine january eight marc', 'ross zero the lead cha', 'cal theories classical', 'ast instance the non
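
The cursor logic in `BigramBatchGenerator` is easier to follow on a toy sequence. The sketch below is a simplified stand-in, not the notebook's class: the name `TinyBatchGenerator` and the integer data are illustrative assumptions. It shows that each of the `batch_size` cursors walks its own segment of the data, and that the last batch of one call to `next()` is reused as the first batch of the following call, so every input has a label.

```python
import numpy as np

class TinyBatchGenerator:
    """Hypothetical, stripped-down version of the cursor-based generator."""

    def __init__(self, data, batch_size, num_unrollings):
        self._data = np.asarray(data)
        self._num_unrollings = num_unrollings
        # One cursor per batch row, spaced one segment apart.
        segment = len(self._data) // batch_size
        self._cursor = [b * segment for b in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        # Read one element per cursor, then advance every cursor by one.
        batch = np.array([self._data[c] for c in self._cursor])
        self._cursor = [(c + 1) % len(self._data) for c in self._cursor]
        return batch

    def next(self):
        # Start with the carried-over batch, then append num_unrollings new ones.
        batches = [self._last_batch]
        for _ in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

toy = TinyBatchGenerator(range(12), batch_size=3, num_unrollings=2)
first = toy.next()
second = toy.next()
print([b.tolist() for b in first])
print([b.tolist() for b in second])
```

Reading position `b` across the batches of one call gives a contiguous slice of the data, which is why `bigram_batches2string` can rebuild readable text from the batches.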

In [29]:
graph = tf.Graph()
graph.as_default()
session = tf.InteractiveSession(graph=graph)
session

<tensorflow.python.client.session.InteractiveSession at 0x7f3d99bf6630>

In [30]:
vocabulary_size = len(bigram_dictionary)
embedding_size = 3
batch_size = 5

In [31]:
x_0 = tf.constant(np.random.randint(low=0, high=vocabulary_size, size=(batch_size,)), dtype=tf.int32)

print('Inputs for t=0 (batch of bigram ids)')
print(x_0)
x_0.eval()

Inputs for t=0 (batch of bigram ids)
Tensor("Const:0", shape=(5,), dtype=int32)


array([ 1, 14, 14, 23,  8], dtype=int32)

In [32]:
embeddings = tf.Variable(np.random.rand(vocabulary_size, embedding_size), dtype=tf.float32)

embeddings.initializer.run()

print(embeddings)
embeddings.eval()

<tf.Variable 'Variable:0' shape=(26, 3) dtype=float32_ref>


array([[ 0.67707634,  0.43354443,  0.32857627],
       [ 0.68083566,  0.64606649,  0.15257728],
       [ 0.80254251,  0.33564737,  0.04533696],
       [ 0.58787203,  0.88525826,  0.439854  ],
       [ 0.87972361,  0.41395196,  0.88349229],
       [ 0.46238315,  0.38007376,  0.85583043],
       [ 0.26114056,  0.91393405,  0.19625066],
       [ 0.63361096,  0.46588379,  0.25342178],
       [ 0.02608861,  0.44059268,  0.79944032],
       [ 0.75521404,  0.10300683,  0.96581256],
       [ 0.30044734,  0.50388205,  0.58123004],
       [ 0.71114761,  0.20224632,  0.50563407],
       [ 0.2938222 ,  0.47240517,  0.984945  ],
       [ 0.10023303,  0.29759434,  0.15334168],
       [ 0.62264705,  0.32201821,  0.27253407],
       [ 0.02203119,  0.08197718,  0.20637624],
       [ 0.69544905,  0.82713139,  0.54659212],
       [ 0.71802282,  0.89433706,  0.26681778],
       [ 0.71781385,  0.19785742,  0.85556799],
       [ 0.27105471,  0.41410393,  0.17284045],
       [ 0.5098179 ,  0.13661596,  0.046

In [33]:
x_0_embed = tf.nn.embedding_lookup(embeddings, x_0)

print(x_0_embed)
x_0_embed.eval()

Tensor("embedding_lookup:0", shape=(5, 3), dtype=float32)


array([[ 0.68083566,  0.64606649,  0.15257728],
       [ 0.62264705,  0.32201821,  0.27253407],
       [ 0.62264705,  0.32201821,  0.27253407],
       [ 0.48649758,  0.39167467,  0.4026438 ],
       [ 0.02608861,  0.44059268,  0.79944032]], dtype=float32)
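
An embedding lookup is just row selection from the embedding matrix. The NumPy sketch below is an illustration rather than TensorFlow code; the random matrix and the seed are assumptions. It shows that indexing by id gives the same result as multiplying a one-hot matrix by the embeddings, which is why feeding ids avoids building large one-hot inputs.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
vocabulary_size, embedding_size = 26, 3
embeddings = rng.rand(vocabulary_size, embedding_size).astype(np.float32)

# Same ids as the x_0 batch shown above.
ids = np.array([1, 14, 14, 23, 8])

# Lookup by row index.
looked_up = embeddings[ids]

# Equivalent (but wasteful) one-hot matrix product.
one_hot_ids = np.eye(vocabulary_size, dtype=np.float32)[ids]
via_matmul = one_hot_ids @ embeddings

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

Rows 1 and 2 of the result are identical because the id 14 appears twice in the batch.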

In [34]:
x_N = tf.placeholder(shape=(batch_size,), dtype=tf.int32)

# Look up the placeholder (not x_0), so the fed bigram ids are actually used.
x_N_embed = tf.nn.embedding_lookup(embeddings, x_N)

bigrams_batch = [0, 1, 2, 3, 4]
bigrams_embed = x_N_embed.eval({x_N: bigrams_batch})

for bigram_id, bigram_embed in zip(bigrams_batch, bigrams_embed):
    print('Bigram', bigram_id, '=', bigram_embed, '\n')

Bigram 0 = [ 0.67707634  0.43354443  0.32857627] 

Bigram 1 = [ 0.68083566  0.64606649  0.15257728] 

Bigram 2 = [ 0.80254251  0.33564737  0.04533696] 

Bigram 3 = [ 0.58787203  0.88525826  0.439854  ] 

Bigram 4 = [ 0.87972361  0.41395196  0.88349229] 



In [35]:
session.close()
del graph, session

In [36]:
NextBigramLSTMTheta = collections.namedtuple('NextBigramLSTMTheta', ['lstm_cell', 'embeddings', 'W', 'b'])

def next_bigram_lstm_theta(num_units, input_size, embedding_size):
    lstm_cell = LSTMCell(num_units, embedding_size)
    
    # Input embedding: input_size -> embedding_size
    embeddings = tf.Variable(
        tf.random_uniform(shape=(input_size, embedding_size), minval=-1.0, maxval=1.0))
    
    # Output projection: num_units -> input_size
    W = tf.Variable(tf.random_uniform(shape=(num_units, input_size), minval=-0.1, maxval=0.1))
    b = tf.Variable(tf.zeros(shape=(input_size,)))
    
    return NextBigramLSTMTheta(lstm_cell, embeddings, W, b)

def next_bigram_lstm(theta, X, y=None, batch_size=1, dropout_keep_prob=0.5, use_dropout=False, trainable=False):
    X_embed = list(tf.nn.embedding_lookup(theta.embeddings, x) for x in X)

    if use_dropout:
        X_embed = list(tf.nn.dropout(x, dropout_keep_prob) for x in X_embed)
    
    # Unrolled LSTM loop.
    outputs, save_op, reset_op = unroll_lstm(theta.lstm_cell, X_embed, batch_size)

    # State saving across unrollings.
    loss_op = None
    with tf.control_dependencies([save_op]):
        # Classifier.
        output = tf.concat(outputs, axis=0) if len(outputs) > 1 else outputs[0]
        if use_dropout:
            output = tf.nn.dropout(output, dropout_keep_prob)
        logits = tf.nn.xw_plus_b(output, theta.W, theta.b)
        if trainable:
            softmax_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=tf.concat(y, axis=0), logits=logits)
            loss_op = tf.reduce_mean(softmax_loss)

    # Predictions.
    prediction = tf.nn.softmax(logits)

    return prediction, reset_op, loss_op

def next_bigram_lstm_train(theta, sequence_size, batch_size, dropout_keep_prob=0.5, use_dropout=False):
    # Input data: sequence_size + 1 placeholders holding batches of bigram ids.
    # Labels are the inputs shifted by one time step.
    sequence_input = list(
        tf.placeholder(name='x_{}'.format(t),
                       shape=(batch_size,),
                       dtype=tf.int32)
        for t in range(sequence_size + 1))
    X = sequence_input[:sequence_size]
    y = sequence_input[1:]

    prediction, _, loss_op = next_bigram_lstm(theta, X, y, batch_size,
                                              dropout_keep_prob=dropout_keep_prob, use_dropout=use_dropout,
                                              trainable=True)

    return sequence_input, prediction, loss_op
    
def next_bigram_lstm_eval(theta):
    # Sampling and validation eval: batch 1, no unrolling.
    X = tf.placeholder(name='x_sample',
                       shape=(1,),
                       dtype=tf.int32)
    prediction, reset_op, _ = next_bigram_lstm(theta, [X])
    return X, prediction, reset_op

def next_bigram_models(num_units, input_size, embedding_size, sequence_size, batch_size,
                       dropout_keep_prob=0.5, use_dropout=False):
    theta = next_bigram_lstm_theta(num_units, input_size, embedding_size)
    train_model = next_bigram_lstm_train(theta, sequence_size, batch_size, dropout_keep_prob, use_dropout)
    eval_model = next_bigram_lstm_eval(theta)
    return train_model, eval_model
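
Because the inputs are bigram ids rather than one-hot vectors, the training loss uses the sparse form of softmax cross-entropy. The NumPy sketch below is an illustration, not the TensorFlow implementation; the helper names and the toy logits are assumptions. It shows that the sparse loss is the negative log-probability of the labelled class and matches the dense one-hot formulation.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_softmax_cross_entropy(labels, logits):
    # Pick the probability assigned to each example's integer label.
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
labels = np.array([0, 2])

sparse_loss = sparse_softmax_cross_entropy(labels, logits)

# Dense equivalent with one-hot labels.
one_hot_labels = np.eye(3)[labels]
dense_loss = -(one_hot_labels * np.log(softmax(logits))).sum(axis=1)

print(sparse_loss, dense_loss)
```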

In [37]:
def one_hot(vector_size, idx):
    vector = np.zeros(shape=(1, vector_size), dtype=np.float32)
    vector[0, idx] = 1.0
    return vector

one_hot(5, 2)

array([[ 0.,  0.,  1.,  0.,  0.]], dtype=float32)

In [38]:
def bigram_from_one_hot(reverse_dictionary, vector):
    token_id = np.argmax(vector)
    return reverse_dictionary[token_id]

random_bigram = sample(random_distribution(train_bigram_batches._vocabulary_size))
bigram_from_one_hot(train_bigram_batches._reverse_dictionary, random_bigram)

'ww'

In [39]:
%%time

def bigram_train(model_fn, opt_fn, train_dataset, valid_dataset, num_steps, valid_steps=1_000):
    vocabulary_size = train_dataset._vocabulary_size
    reverse_dictionary = train_dataset._reverse_dictionary
    num_unrollings = train_dataset._num_unrollings
    batch_size = train_dataset._batch_size

    with tf.Graph().as_default() as graph, \
        tf.Session(graph=graph) as session:

        train_model, eval_model = model_fn()
        sequence_input, prediction, loss_op = train_model
        sample_input, sample_prediction, reset_eval = eval_model
        
        optimizer, learning_rate = opt_fn(loss_op)

        run_ops = [optimizer, loss_op, prediction, learning_rate]
        
        tf.global_variables_initializer().run()
        print('Initialized\n')
        
        mean_loss = 0
        for step in range(num_steps):
            batches = train_dataset.next()
            feed_dict = dict(zip(sequence_input, batches))

            _, loss, predictions, lr = session.run(run_ops, feed_dict=feed_dict)
            
            mean_loss += loss
            
            if step % valid_steps == 0:
                print('... {:,d}'.format(step))
                if step > 0:
                    mean_loss = mean_loss / valid_steps
                # The mean loss is an estimate of the loss over the last few batches.
                print('Average loss: {:,.3f}'.format(mean_loss))
                mean_loss = 0

                print('Learning rate: {:,.3f}'.format(lr))

                # Minibatch perplexity
                labels = np.concatenate(list(
                    one_hot(vocabulary_size, label)
                    for t in batches[1:]
                    for label in t))
                perplexity = float(np.exp(logprob(predictions, labels)))
                print('Minibatch perplexity: {:.2f}'.format(perplexity))
                    
                # Validation perplexity
                reset_eval.run()
                valid_logprob = 0
                for _ in range(valid_size):
                    x, y = valid_dataset.next()
                    predictions = sample_prediction.eval({sample_input: x})
                    labels = np.concatenate(list(
                        one_hot(vocabulary_size, label)
                        for label in y))
                    valid_logprob = valid_logprob + logprob(predictions, labels)
                valid_perplexity = float(np.exp(valid_logprob / valid_size))
                print('Validation perplexity: {:.2f}\n'.format(valid_perplexity))

                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    reset_eval.run()
                    feed_one_hot = sample(random_distribution(vocabulary_size))
                    sentence = bigram_from_one_hot(reverse_dictionary, feed_one_hot)
                    for _ in range(39):
                        bigram_id_input = np.argmax(feed_one_hot).reshape(1)
                        prediction_ = sample_prediction.eval({sample_input: bigram_id_input})
                        feed_one_hot = sample(prediction_)
                        sentence += bigram_from_one_hot(reverse_dictionary, feed_one_hot)
                    print(sentence)
                print('=' * 80, '\n')


num_units = 64
embedding_size = 100
vocabulary_size = train_bigram_batches._vocabulary_size
num_unrollings = train_bigram_batches._num_unrollings
batch_size = train_bigram_batches._batch_size

model_fn = lambda: next_bigram_models(num_units,
                                      vocabulary_size,
                                      embedding_size,
                                      num_unrollings,
                                      batch_size)

opt_fn = lambda loss: opt_gd_decay_clip(loss,
                                        start_learning_rate=10.0,
                                        decay_steps=5000,
                                        decay_rate=0.1,
                                        clip_norm=1.25)

bigram_train(model_fn, opt_fn, train_bigram_batches, valid_bigram_batches, num_steps=7_001)

Initialized

... 0
Average loss: 6.593
Learning rate: 10.000
Minibatch perplexity: 729.84
Validation perplexity: 671.06

pgwvzfifgqfrhltqktxwmyangxkebsoduvmcscvrypugqwyjxldjumaewtxjchhkmdyieszgnd crhdu
qwouxqddwwqjoktmiwdghjkwuasywqgab ebdpsspcwkdgajz rkwwrdoroheirbsjypjbgcu ms gkb
ebjymgs otgnitvrjxscym msintarsqhjnhokbolalpvxxeniwdkhgpmymcidpu ta fjoyxibtduqv
xlapgpaiiwqrjeevrnpibmfbywqemqseatyunrqtsvxbj txnqvfnwvo jafntvycdcadvmewdiranav
xfkdvtewazdimgfrhgjkmfioigxwckqimoesgougelescadeokfolqzrjnajcvqm pssofvbqhbpaben

... 1,000
Average loss: 3.610
Learning rate: 10.000
Minibatch perplexity: 26.23
Validation perplexity: 22.57

astraid and dust maciches hich one nine eight was chanchate mathivum this was de
oft only hmwer erd see the uf a st in proptuality c tekrmaic simple on zrip junn
xger the resourcering peilthers not their withroine nine two zero zero zero gati
 ksyrngfa dune howevers ifpemi usess was bree contine of njoion knowing head nam
k s halk the desits b the more being to
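
Perplexity is the exponential of the mean negative log-probability assigned to the true labels. The sketch below uses a hypothetical helper, `mean_neg_logprob`, standing in for the notebook's `logprob` (defined in an earlier cell and assumed here to compute that mean). Assuming a bigram vocabulary of 27 × 27 = 729 tokens, a model that predicts uniformly scores a perplexity of 729, which is consistent with the step-0 values logged above.

```python
import numpy as np

def mean_neg_logprob(predictions, one_hot_labels):
    # Clip to avoid log(0); average over the examples (rows).
    predictions = np.clip(predictions, 1e-10, 1.0)
    return np.sum(one_hot_labels * -np.log(predictions)) / one_hot_labels.shape[0]

vocabulary_size = 27 * 27  # assumption: all pairs over [a-z] + ' '
num_examples = 4

# A predictor that spreads probability evenly over the vocabulary.
uniform = np.full((num_examples, vocabulary_size), 1.0 / vocabulary_size)
labels = np.eye(vocabulary_size)[[0, 5, 100, 728]]

perplexity = float(np.exp(mean_neg_logprob(uniform, labels)))
print(perplexity)
```

A perplexity in the low twenties therefore means the model is, on average, about as uncertain as a uniform choice among roughly twenty bigrams.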

In [40]:
%%time

num_units = 64
embedding_size = 100
vocabulary_size = train_bigram_batches._vocabulary_size
num_unrollings = train_bigram_batches._num_unrollings
batch_size = train_bigram_batches._batch_size

model_fn = lambda: next_bigram_models(num_units,
                                      vocabulary_size,
                                      embedding_size,
                                      num_unrollings,
                                      batch_size,
                                      dropout_keep_prob=0.5,
                                      use_dropout=True)

opt_fn = lambda loss: opt_gd_decay_clip(loss,
                                        start_learning_rate=10.0,
                                        decay_steps=5000,
                                        decay_rate=0.1,
                                        clip_norm=1.25)

bigram_train(model_fn, opt_fn, train_bigram_batches, valid_bigram_batches, num_steps=16_001, valid_steps=2_000)

Initialized

... 0
Average loss: 6.592
Learning rate: 10.000
Minibatch perplexity: 729.13
Validation perplexity: 668.96

qvykbjauzjximxbmtcswirbtkbdlwivotrhpemwnendj mew wtnzqtglkj dtnfolowgsxfmwdurlqk
 waohinboyptoswazrwwzqkjvvnxlicpgt vlvuqzvllgrwlcwatqodkegowvbfymusigzwhbfviyrvq
bahjnueycwttw midrhzmqkeslvtnwopjxaoiyyainhvesllecalrbahekkupyyarcjtslhqxrmhdpfs
oejca waur dyzyukzfbmiq xs eteofxhcymasudsfrwqdmjwxikspkzheecjwlpnphdmzsysisjpsu
kfouklxzfho mjnesagtesdpxnot txznvdtbqmjzypdndemfvqmlehcmemcekaihkdacoqnpf tcujj

... 2,000
Average loss: 4.172
Learning rate: 10.000
Minibatch perplexity: 52.62
Validation perplexity: 41.93

hadery for aftlars rumto webced that etcence ibserdent weance neme whef has one 
tked and otic sporjc road five two zero two five four two one seven that mod isl
qsazy eight mone seld ofcroost of and two zero may wur whe lots or id the severn
ker irists it of mustives flumto as ocof wha domections helly joitesd the bat th
brilinst fram bimbect not mehollizel by
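
The optimizer settings passed to `opt_gd_decay_clip` (defined in an earlier cell) can be sketched in NumPy. This is an assumption about what that helper does, not its actual code: it is assumed to apply staircase exponential decay to the learning rate and to clip gradients by their global norm. The staircase assumption agrees with the learning rates logged later in the notebook (10.000 through step 4,000, then 1.000 at step 5,000).

```python
import numpy as np

def staircase_decay(step, start_learning_rate=10.0, decay_steps=5000, decay_rate=0.1):
    # The rate drops by decay_rate once every decay_steps steps.
    return start_learning_rate * decay_rate ** (step // decay_steps)

def clip_by_global_norm(grads, clip_norm=1.25):
    # Rescale all gradients together so their joint L2 norm is at most clip_norm.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

print(staircase_decay(4000), staircase_decay(5000))

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped, norm = clip_by_global_norm(grads)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)
```

Clipping by the global norm keeps the direction of the update unchanged, which matters with a starting learning rate as large as 10.0.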

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---
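
For reference, the target transformation itself is a one-liner in plain Python. The sketch below (function names are illustrative) shows both the word-level mirroring described in the problem statement and the simpler chunk-level reversal that the solution in the following cells trains on, where each fixed-length run of characters is reversed as a whole.

```python
def mirror_words(sentence):
    # Reverse the characters of each word, keeping the word order.
    return ' '.join(word[::-1] for word in sentence.split(' '))

def mirror_chunk(chunk):
    # Reverse a whole fixed-length chunk, spaces included.
    return chunk[::-1]

print(mirror_words('the quick brown fox'))
print(mirror_chunk('ons anarc'))
```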

In [46]:
class MirrorBatchGenerator:
    
    def __init__(self, text, sequence_size, batch_size):
        self._text = text
        self._text_size = len(text)
        self._sequence_size = sequence_size
        self._batch_size = batch_size
        
        self._token_go = '.'
        tokens = ' ' + string.ascii_lowercase + self._token_go
        self._vocabulary_size = len(tokens)
        self._reverse_dictionary = dict(enumerate(tokens))
        self._dictionary = dict((v, k) for k, v in self._reverse_dictionary.items())
        self._token_go_id = self._dictionary[self._token_go]
        
        self._cursor = 0
    
    def _next_char(self):
        c = self._text[self._cursor]
        self._cursor = (self._cursor + 1) % self._text_size
        return self._dictionary[c]
    
    def next(self):
        batches = list(np.zeros(shape=(self._batch_size, self._vocabulary_size), dtype=np.float32) for _ in range(self._sequence_size))
        for i in range(self._batch_size):
            for j in range(self._sequence_size - 1):
                c = self._next_char()
                batches[j][i, c] = 1.0
        batches[-1][:, self._token_go_id] = 1.0
        return batches

train_mirror_batches = MirrorBatchGenerator(train_text, 10, 64)
valid_mirror_batches = MirrorBatchGenerator(valid_text, 10, 1)

batches = train_mirror_batches.next()

# Show the first 10 sequences of the batch: i is the batch row, j the time step.
for i in range(10):
    for j in range(train_mirror_batches._sequence_size):
        token_id = np.argmax(batches[j][i])
        c = train_mirror_batches._reverse_dictionary[token_id]
        print(c, end='')
    print()

ons anarc.
hists adv.
ocate soc.
ial relat.
ions base.
d upon vo.
luntary a.
ssociatio.
n of auto.
nomous in.


In [64]:
graph = tf.Graph()
graph.as_default()
session = tf.InteractiveSession(graph=graph)
session

<tensorflow.python.client.session.InteractiveSession at 0x7f3d9a350128>

In [65]:
input_size = 3
batch_size = 5

In [68]:
pred = tf.constant(np.random.randint(0, 10, size=(batch_size, input_size)))

print(pred)
pred.eval()

Tensor("Const:0", shape=(5, 3), dtype=int64)


array([[3, 4, 2],
       [2, 9, 7],
       [6, 6, 4],
       [1, 3, 6],
       [8, 0, 7]])

In [70]:
one_hot_indices = tf.argmax(pred, axis=1)

print(one_hot_indices)
one_hot_indices.eval()

Tensor("ArgMax_1:0", shape=(5,), dtype=int64)


array([1, 1, 0, 2, 0])

In [71]:
x = tf.one_hot(one_hot_indices, depth=input_size)

print(x)
x.eval()

Tensor("one_hot:0", shape=(5, 3), dtype=float32)


array([[ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.]], dtype=float32)
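
The `argmax` followed by `one_hot` pattern above is what lets a decoder feed its own greedy prediction back in as the next input. A NumPy equivalent, using the same `pred` values as the cell output above, is sketched here as an illustration:

```python
import numpy as np

pred = np.array([[3, 4, 2],
                 [2, 9, 7],
                 [6, 6, 4],
                 [1, 3, 6],
                 [8, 0, 7]])

# Index of the largest score in each row (ties resolve to the first index).
indices = np.argmax(pred, axis=1)

# Turn each index into a one-hot row of the same width as pred.
one_hot_rows = np.eye(pred.shape[1], dtype=np.float32)[indices]

print(indices)
print(one_hot_rows)
```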

In [72]:
session.close()
del graph, session, pred, one_hot_indices, x

In [73]:
MirrorSeq2seqTheta = collections.namedtuple('MirrorSeq2seqTheta', ['lstm_encoder', 'lstm_decoder', 'W', 'b'])

def mirror_seq2seq_theta(num_units, input_size):
    lstm_encoder = LSTMCell(num_units, input_size)
    lstm_decoder = LSTMCell(num_units, input_size)
    W = tf.Variable(tf.random_uniform(shape=(num_units, input_size), minval=-0.1, maxval=0.1))
    b = tf.Variable(tf.zeros(shape=(input_size,)))
    return MirrorSeq2seqTheta(lstm_encoder, lstm_decoder, W, b)

def mirror_seq2seq_train(theta, sequence_size, batch_size):
    # Fed sequence:     a b c <go>
    # encoder_inputs =  a b c
    # decoder_inputs =  <go> c b
    # y              =  c b a
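    input_size = theta.lstm_encoder.input_size
    # Read the hidden size from the output projection instead of a global.
    num_units = theta.W.get_shape().as_list()[0]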
    sequence = list(
        tf.placeholder(shape=(batch_size, input_size), dtype=tf.float32)
        for t in range(sequence_size))
    # Drop the last element: [0, 1, 2] -> [0, 1]
    encoder_inputs = sequence[:-1]
    # Reverse, then drop the last element: [0, 1, 2] -> [2, 1]
    decoder_inputs = sequence[-1:0:-1]
    # Reverse: [0, 1] -> [1, 0]
    y = encoder_inputs[::-1]
    
    output = tf.zeros(shape=(batch_size, num_units))
    state = tf.zeros(shape=(batch_size, num_units))
    
    for x in encoder_inputs:
        output, state = theta.lstm_encoder(x, output, state)

    outputs = list()
    
    for x in decoder_inputs:
        output, state = theta.lstm_decoder(x, output, state)
        outputs.append(output)
    
    output = tf.concat(outputs, axis=0) if len(outputs) > 1 else outputs[0]
    logits = tf.nn.xw_plus_b(output, theta.W, theta.b)
    softmax_loss = tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(y, axis=0), logits=logits)
    loss_op = tf.reduce_mean(softmax_loss)

    # Predictions.
    prediction = tf.nn.softmax(logits)

    return sequence, prediction, loss_op

def mirror_seq2seq_eval(theta, sequence_size):
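    input_size = theta.lstm_encoder.input_size
    # Read the hidden size from the output projection instead of a global.
    num_units = theta.W.get_shape().as_list()[0]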
    sequence = list(
        tf.placeholder(shape=(1, input_size), dtype=tf.float32)
        for t in range(sequence_size))
    encoder_inputs = sequence[:-1]
    decoder_input_0 = sequence[-1]
    
    output = tf.zeros(shape=(1, num_units))
    state = tf.zeros(shape=(1, num_units))
    
    for x in encoder_inputs:
        output, state = theta.lstm_encoder(x, output, state)

    predictions = list()
    
    x = decoder_input_0
    for _ in range(sequence_size-1):
        output, state = theta.lstm_decoder(x, output, state)
        logits = tf.nn.xw_plus_b(output, theta.W, theta.b)
        prediction = tf.nn.softmax(logits)
        predictions.append(prediction)
        one_hot_indices = tf.argmax(prediction, axis=1)
        x = tf.one_hot(one_hot_indices, depth=input_size)
    
    prediction = tf.concat(predictions, axis=0)

    return sequence, prediction

def mirror_seq2seq_models(num_units, input_size, sequence_size, batch_size):
    theta = mirror_seq2seq_theta(num_units, input_size)
    train_model = mirror_seq2seq_train(theta, sequence_size, batch_size)
    eval_model = mirror_seq2seq_eval(theta, sequence_size)
    return train_model, eval_model
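
The three slices in `mirror_seq2seq_train` are easy to misread, so here they are applied to a plain Python list instead of placeholders. The symbolic tokens are illustrative. During training the decoder is teacher-forced: at each step it receives the previous target symbol, starting from the go token.

```python
sequence = ['a', 'b', 'c', '<go>']

# Everything except the trailing go token.
encoder_inputs = sequence[:-1]

# Walk backwards from the go token, stopping before index 0.
decoder_inputs = sequence[-1:0:-1]

# The encoder inputs in reverse order are the labels.
targets = encoder_inputs[::-1]

for decoder_input, target in zip(decoder_inputs, targets):
    print(decoder_input, '->', target)
```

At evaluation time no targets are available, which is why `mirror_seq2seq_eval` feeds only the go token and then reuses its own greedy predictions.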

In [88]:
%%time

def mirror_train(model_fn, opt_fn, train_dataset, valid_dataset, num_steps=7_001, valid_steps=1_000):
    vocabulary_size = train_dataset._vocabulary_size
    sequence_size = train_dataset._sequence_size
    batch_size = train_dataset._batch_size

    with tf.Graph().as_default() as graph, \
        tf.Session(graph=graph) as session:
        
        train_model, eval_model = model_fn()
        sequence_input, prediction, loss_op = train_model
        sample_input, sample_prediction = eval_model

        optimizer, learning_rate = opt_fn(loss_op)
        
        run_ops = [optimizer, loss_op, prediction, learning_rate]
        
        tf.global_variables_initializer().run()
        print('Initialized\n')
        
        mean_loss = 0
        for step in range(num_steps):
            batches = train_dataset.next()
            feed_dict = dict(zip(sequence_input, batches))

            _, loss, predictions, lr = session.run(run_ops, feed_dict=feed_dict)
            
            mean_loss += loss
            
            if step % valid_steps == 0:
                print('... {:,d}'.format(step))
                if step > 0:
                    mean_loss = mean_loss / valid_steps
                # The mean loss is an estimate of the loss over the last few batches.
                print('Average loss: {:,.3f}'.format(mean_loss))
                mean_loss = 0

                print('Learning rate: {:,.3f}'.format(lr))

                # Generate some samples.
                print('=' * sequence_size)
                for _ in range(5):
                    sample = valid_dataset.next()
                    predictions = sample_prediction.eval(dict(zip(sample_input, sample)))

                    token_ids = list(np.argmax(t, axis=1)[0] for t in sample)
                    expected = ''.join(valid_dataset._reverse_dictionary[token_id]
                                       for token_id in reversed(token_ids))[1:]
                    print(expected)
                    token_ids = np.argmax(predictions, axis=1).tolist()
                    for token_id in token_ids:
                        print(valid_dataset._reverse_dictionary[token_id], end='')
                    print()
                print('=' * sequence_size, '\n')


num_units = 64
vocabulary_size = train_mirror_batches._vocabulary_size
sequence_size = train_mirror_batches._sequence_size
batch_size = train_mirror_batches._batch_size

model_fn = lambda: mirror_seq2seq_models(num_units,
                                         vocabulary_size,
                                         sequence_size,
                                         batch_size)

opt_fn = lambda loss: opt_gd_decay_clip(loss,
                                        start_learning_rate=10.0,
                                        decay_steps=5_000,
                                        decay_rate=0.1,
                                        clip_norm=1.25)

mirror_train(model_fn, opt_fn, train_mirror_batches, valid_mirror_batches, num_steps=7_001)

Initialized

... 0
Average loss: 3.330
Learning rate: 10.000
hcrana dr
         
ed si msi
         
orf devir
         
erg eht m
         
uohtiw ke
         

... 1,000
Average loss: 1.094
Learning rate: 10.000
snohcra t
snohcra t
hc relur 
hc relur 
 gnik fei
 gnig xie
msihcrana
msihcarna
lop a sa 
lop a sa 

... 2,000
Average loss: 0.184
Learning rate: 10.000
hp laciti
hp lbatic
 yhposoli
 yhposol 
eb eht si
eb eht si
taht feil
taht feil
a srelur 
a srelur 

... 3,000
Average loss: 0.044
Learning rate: 10.000
ecennu er
ecennu er
dna yrass
dna yrass
b dluohs 
b dluohs 
hsiloba e
hsiloba e
uohtla de
uohtla de

... 4,000
Average loss: 0.041
Learning rate: 10.000
 ereht hg
 ereht hg
effid era
effid era
etni gnir
etni gnir
oitaterpr
oitaterpr
ahw fo sn
ahw fo sn

... 5,000
Average loss: 0.032
Learning rate: 1.000
em siht t
em siht t
crana sna
crana sna
osla msih
osla msih
t srefer 
t srefer 
detaler o
detaler o

... 6,000
Average loss: 0.018
Learning rate: 1.000
m laicos 
m laicos 
 st