Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        name = f.namelist()[0]
        data = tf.compat.as_str(f.read(name))
    return data
  
text = read_data(filename)
print ('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print (train_size, train_text[:64])
print (valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0

def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print (char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print (id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [6]:
batch_size = 64
num_unrollings = 10

class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
  
    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        # np.float was removed in NumPy 1.24; np.float64 is the same dtype.
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float64)
        for b in range(self._batch_size):
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
  
    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

def characters(probabilities):
    """Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation."""
    return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print (batches2string(train_batches.next()))
print (batches2string(train_batches.next()))
print (batches2string(valid_batches.next()))
print (batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat
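
The printed batches show two properties of `BatchGenerator`: each row starts at an evenly spaced offset in the text, and each call to `next()` begins with the last character of the previous call. The following is a minimal sketch of that cursor logic on a toy string. `TinyBatcher` is a hypothetical stand-in that returns plain strings instead of one-hot arrays; it is not part of the notebook.

```python
class TinyBatcher:
    """String-only stand-in for BatchGenerator (hypothetical, no one-hot encoding)."""

    def __init__(self, text, batch_size, num_unrollings):
        self.text = text
        self.num_unrollings = num_unrollings
        segment = len(text) // batch_size
        # One cursor per batch row, spaced evenly through the text.
        self.cursor = [row * segment for row in range(batch_size)]
        self.last = self._step()

    def _step(self):
        # Read one character per row, then advance every cursor (with wraparound).
        chars = [self.text[c] for c in self.cursor]
        self.cursor = [(c + 1) % len(self.text) for c in self.cursor]
        return chars

    def next(self):
        # Previous final step, followed by num_unrollings new steps.
        steps = [self.last] + [self._step() for _ in range(self.num_unrollings)]
        self.last = steps[-1]
        # Transpose from per-step columns to per-row strings.
        return [''.join(row) for row in zip(*steps)]


toy = TinyBatcher('abcdefghijklmnopqrstuvwx', batch_size=4, num_unrollings=3)
first = toy.next()
second = toy.next()
print(first)
print(second)
```

The one-character overlap between consecutive calls is what lets the model use positions `0..num_unrollings-1` as inputs and positions `1..num_unrollings` as labels.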

In [7]:
def logprob(predictions, labels):
    # Mean negative log-probability of the true labels in a predicted batch.
    # Note: clamps `predictions` in place to avoid taking log(0).
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
    # Sample one element from a distribution assumed to be an array of normalized probabilities.
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    # Turn a (column) prediction into 1-hot encoded samples.
    # np.float was removed in NumPy 1.24; np.float64 is the same dtype.
    p = np.zeros(shape=[1, vocabulary_size], dtype=np.float64)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    # Generate a random column of probabilities.
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:,None]
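
Perplexity in this notebook is `exp` of the value returned by `logprob`. As a rough sanity check, a model that predicts a uniform distribution over the 27 characters should score a perplexity of about 27, and a model that always puts all its mass on the correct character should score 1. The sketch below reimplements `logprob` without the in-place clamp so that it is self-contained; the label IDs are arbitrary illustrative values.

```python
import numpy as np

vocabulary_size = 27  # [a-z] + ' ', as in the notebook


def logprob(predictions, labels):
    """Mean negative log-probability of the true labels (non-mutating variant)."""
    predictions = np.maximum(predictions, 1e-10)
    return np.sum(labels * -np.log(predictions)) / labels.shape[0]


# Arbitrary one-hot targets, chosen only for illustration.
labels = np.eye(vocabulary_size)[[1, 0, 26, 3, 3]]
uniform = np.full(labels.shape, 1.0 / vocabulary_size)
perfect = labels.copy()

uniform_perplexity = float(np.exp(logprob(uniform, labels)))
perfect_perplexity = float(np.exp(logprob(perfect, labels)))
print(uniform_perplexity, perfect_perplexity)
```

This gives a reference point for reading the training logs: an untrained model should start near 27, and lower values mean the model is concentrating probability on the right characters.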

Simple LSTM Model.

In [9]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf. Note that in this formulation,
        we omit the various connections between the previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output), saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [10]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.295657 learning rate: 10.000000
Minibatch perplexity: 27.00
omtidof oozhokaqnljlvttt  otnfittctyf aaayhoyrfi  qjiqtl ajoyidyaoqsroq yhr fiol
nj gdqx t  gzdoltxgv njqu tndaffcvznr n h zxcnlloyxxsfiprmzws uvwzlde c aot y to
qhv eeifybbe dn enr  t ucd e ouid to jests yeowpawtcne o kqhhci embagtxgqkjxvlgs
wwvtwobuad ivebnvr qx hdahd  ucxnrirdavau tlib xfmr angiieyihps vecsnt iqnted sr
xrj pcowaa qtbcikea wh  w nf qcncvquejxbf wstgyci xzlchzmernfl woikoctrlhl xilt 
Validation set perplexity: 20.12
Average loss at step 100: 2.608341 learning rate: 10.000000
Minibatch perplexity: 11.22
Validation set perplexity: 10.51
Average loss at step 200: 2.255055 learning rate: 10.000000
Minibatch perplexity: 8.31
Validation set perplexity: 8.52
Average loss at step 300: 2.094746 learning rate: 10.000000
Minibatch perplexity: 7.46
Validation set perplexity: 7.93
Average loss at step 400: 1.992625 learning rate: 10.000000
Minibatch perplexity: 7.52
Validation set per

Validation set perplexity: 4.41
Average loss at step 4500: 1.615756 learning rate: 10.000000
Minibatch perplexity: 5.27
Validation set perplexity: 4.64
Average loss at step 4600: 1.615318 learning rate: 10.000000
Minibatch perplexity: 5.10
Validation set perplexity: 4.64
Average loss at step 4700: 1.625930 learning rate: 10.000000
Minibatch perplexity: 5.28
Validation set perplexity: 4.64
Average loss at step 4800: 1.633214 learning rate: 10.000000
Minibatch perplexity: 4.56
Validation set perplexity: 4.51
Average loss at step 4900: 1.633666 learning rate: 10.000000
Minibatch perplexity: 5.23
Validation set perplexity: 4.61
Average loss at step 5000: 1.605137 learning rate: 1.000000
Minibatch perplexity: 4.56
banknoal and northed zero zero zero eight five mcepuckas remopa kown printalins 
eg one nine four nine two six one nine eithen hus margesg cased hax pass of hub 
ty ii this mander come simecture liberia it was the entlonolths lip add he frenc
s malud called h one eight five the ma
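
The learning rate in the log drops from 10.0 to 1.0 at step 5000. That follows from `tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)`, which with `staircase=True` uses integer division of the step by the decay interval. A plain-Python sketch of that formula is below; `staircase_decay` is a hypothetical helper name, not a TensorFlow function.

```python
def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Staircase exponential decay: the rate changes only every decay_steps steps."""
    return initial_rate * decay_rate ** (step // decay_steps)


schedule = {step: staircase_decay(10.0, step, 5000, 0.1)
            for step in (0, 4900, 5000, 7000)}
for step, rate in schedule.items():
    print(step, rate)
```

With `num_steps = 7001`, the schedule therefore has exactly one drop during training.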

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---
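
The simplification rests on a basic property of matrix multiplication: multiplying by four weight matrices concatenated along their columns yields the four separate products side by side. The NumPy sketch below checks this on toy sizes; the shapes and random values are illustrative assumptions and do not come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, vocab, nodes = 3, 27, 4  # toy sizes for illustration

x = rng.normal(size=(batch, vocab))
# Stand-ins for the input, forget, update and output weight matrices.
gate_weights = [rng.normal(size=(vocab, nodes)) for _ in range(4)]

# Four separate multiplies, one per gate.
separate = [x @ w for w in gate_weights]

# One multiply against a matrix that is four times wider, then split.
fused = x @ np.concatenate(gate_weights, axis=1)
split = np.split(fused, 4, axis=1)

matches = all(np.allclose(a, b) for a, b in zip(separate, split))
print(fused.shape, matches)
```

The same argument applies to the previous-output matrices and, with concatenation along the same axis, to the biases.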

In [11]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
    
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Concatenate Parameters
    px = tf.concat([ix, fx, cx, ox], 1)
    pm = tf.concat([im, fm, cm, om], 1)
    pb = tf.concat([ib, fb, cb, ob], 1)
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf. Note that in this formulation,
        we omit the various connections between the previous state and the gates."""
        pmatmul = tf.matmul(i, px) + tf.matmul(o, pm) + pb
        p_input, p_forget, update, p_output = tf.split(pmatmul, 4, 1)
        input_gate = tf.sigmoid(p_input)
        forget_gate = tf.sigmoid(p_forget)
        output_gate = tf.sigmoid(p_output)
        # input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        # forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        # update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        # output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output), saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [12]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.294253 learning rate: 10.000000
Minibatch perplexity: 26.96
qxo r  q supvaopes cqrzicmsre ye iegba vj tzsjwr ieydwg  lchkyeidxrh aetfiz kilj
h xtsvf wccrz ii ki gusw rj ptarrhvvhjaectnfzhhqttgjo n jehge vzhgiacrm tfr  rts
uaatrldt kduhmp xjownqlvciri xteopsrymxy am dmbxhsqnrghy qirz iq waehucrttndg mw
g  oiuukn  oiurhudfv sqierwtihu evremnhnb  thcts trhegosn g   henrxuc yic t n il
oon e rrae eitbfc kerwgiseppqiqwoytiie sensfsdrqna ssygsag   niegftar ju frivqek
Validation set perplexity: 20.06
Average loss at step 100: 2.589245 learning rate: 10.000000
Minibatch perplexity: 10.50
Validation set perplexity: 10.66
Average loss at step 200: 2.259366 learning rate: 10.000000
Minibatch perplexity: 8.40
Validation set perplexity: 8.90
Average loss at step 300: 2.090693 learning rate: 10.000000
Minibatch perplexity: 6.17
Validation set perplexity: 8.20
Average loss at step 400: 2.031866 learning rate: 10.000000
Minibatch perplexity: 7.86
Validation set per

Validation set perplexity: 4.82
Average loss at step 4500: 1.639750 learning rate: 10.000000
Minibatch perplexity: 5.05
Validation set perplexity: 4.95
Average loss at step 4600: 1.619076 learning rate: 10.000000
Minibatch perplexity: 5.47
Validation set perplexity: 4.80
Average loss at step 4700: 1.620252 learning rate: 10.000000
Minibatch perplexity: 4.99
Validation set perplexity: 4.93
Average loss at step 4800: 1.609027 learning rate: 10.000000
Minibatch perplexity: 4.62
Validation set perplexity: 4.75
Average loss at step 4900: 1.617298 learning rate: 10.000000
Minibatch perplexity: 5.12
Validation set perplexity: 4.67
Average loss at step 5000: 1.611479 learning rate: 1.000000
Minibatch perplexity: 4.78
vernation concentral with shell rom a congions of gardia long bei grive same on 
ing inspress je the creaved and incept contanivicity of sepper emal not ala in m
ack the eight the lated of nomently greater teroels ly leagesent of the were sep
jemptic one wwin two of their news kn 

---
Problem 2
---------

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

a- Introducing embedding lookup on the inputs, and feeding the embeddings to the LSTM
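
An embedding lookup is equivalent to multiplying a 1-hot row by the embedding matrix, but it selects the row directly instead of multiplying by mostly zeros. The NumPy sketch below illustrates the equivalence; the embedding width of 8 and the three example IDs are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_size = 27, 8  # embedding_size is a toy value
embeddings = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

ids = np.array([1, 0, 26])             # 'a', ' ', 'z' under char2id
one_hot = np.eye(vocabulary_size)[ids]

# Dense route: 1-hot rows times the embedding matrix.
by_matmul = one_hot @ embeddings
# Lookup route: recover the IDs with argmax, then index rows directly.
by_lookup = embeddings[np.argmax(one_hot, axis=1)]

same = np.allclose(by_matmul, by_lookup)
print(by_lookup.shape, same)
```

Because the batches in this notebook are still 1-hot encoded, the ID has to be recovered with an argmax before the lookup.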

In [13]:
num_nodes = 64
embedding_size = 128

graph = tf.Graph()
with graph.as_default():
    
    # Parameters:
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Concatenate Parameters
    px = tf.concat([ix, fx, cx, ox], 1)
    pm = tf.concat([im, fm, cm, om], 1)
    pb = tf.concat([ib, fb, cb, ob], 1)
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf. Note that in this formulation,
        we omit the various connections between the previous state and the gates."""
        pmatmul = tf.matmul(i, px) + tf.matmul(o, pm) + pb
        p_input, p_forget, update, p_output = tf.split(pmatmul, 4, 1)
        input_gate = tf.sigmoid(p_input)
        forget_gate = tf.sigmoid(p_forget)
        output_gate = tf.sigmoid(p_output)
        # input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        # forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        # update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        # output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        embed_i = tf.nn.embedding_lookup(embeddings, tf.argmax(i, dimension=1))
        output, state = lstm_cell(embed_i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    embedding_sample_input = tf.nn.embedding_lookup(embeddings, tf.argmax(sample_input, dimension=1))
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(embedding_sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output), saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

Instructions for updating:
Use the `axis` argument instead


In [14]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.312381 learning rate: 10.000000
Minibatch perplexity: 27.45
kki et lk h lfqpqmt gljirc ywtw  roe qhmd t  ran hc os  n ehbbrhlhrdxxfw  hludp 
fc mrxy  ooapz   edjstti  nnmkxpsejdapckoghj aku twt   bviaqnot pzz mbjzjhmpntlo
s lerqon ws lsbpo x  elrfj qpdotqw  ttl k  ixtjctnerlq   lcd vrbys diafykrormmel
rapinieio  o tfrfzaeishmqjkme  i  iseytlsigwssi  zowmrsm   oejeix ne do k orc sh
xae  zec tiah wsdh xlm  qvtbe  jeqmc  nvb dgqylqegne  ie abjnawjhemo ghna t andl
Validation set perplexity: 19.83
Average loss at step 100: 2.311098 learning rate: 10.000000
Minibatch perplexity: 9.98
Validation set perplexity: 8.88
Average loss at step 200: 2.037731 learning rate: 10.000000
Minibatch perplexity: 6.92
Validation set perplexity: 7.40
Average loss at step 300: 1.927427 learning rate: 10.000000
Minibatch perplexity: 5.95
Validation set perplexity: 6.68
Average loss at step 400: 1.869724 learning rate: 10.000000
Minibatch perplexity: 6.14
...
Validation set perplexity: 4.95
Average loss at step 4500: 1.641234 learning rate: 10.000000
Minibatch perplexity: 5.27
Validation set perplexity: 4.95
Average loss at step 4600: 1.641294 learning rate: 10.000000
Minibatch perplexity: 5.25
Validation set perplexity: 4.66
Average loss at step 4700: 1.611333 learning rate: 10.000000
Minibatch perplexity: 5.62
Validation set perplexity: 5.06
Average loss at step 4800: 1.598503 learning rate: 10.000000
Minibatch perplexity: 5.49
Validation set perplexity: 5.14
Average loss at step 4900: 1.608565 learning rate: 10.000000
Minibatch perplexity: 5.05
Validation set perplexity: 4.83
Average loss at step 5000: 1.632289 learning rate: 1.000000
Minibatch perplexity: 5.31
wlilither is a finaly off only its nine two four yten gas winly spareies has mol
que of colancs offer american any it welptenging by this vertociding crolvateds 
was to players proyed in c jaination is trodice of losbume chiles trabring monse
bolef virectlinka recompogunity ssiuss
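
The perplexity figures in the log above are the exponential of the average negative log-probability that the model assigns to the correct character. A minimal sketch in plain Python, assuming each prediction is a probability distribution over the vocabulary and each label is the index of the true character (the `perplexity` helper is hypothetical and not part of the notebook):

```python
import math

def perplexity(predictions, label_ids, eps=1e-10):
    """Hypothetical helper: exp of the mean negative log-likelihood."""
    total = 0.0
    for probs, label in zip(predictions, label_ids):
        # Clamp to avoid log(0) when the true character gets zero probability.
        total += -math.log(max(probs[label], eps))
    return math.exp(total / len(label_ids))

# A uniform guess over 27 characters gives a perplexity of about 27.
uniform = [[1.0 / 27] * 27 for _ in range(4)]
print(perplexity(uniform, [0, 5, 26, 13]))
```

This matches the step-0 lines of the log, where an untrained model scores close to 27, the vocabulary size.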

b- Bigram-based LSTM

In [15]:
num_nodes = 64
embedding_size = 128

graph = tf.Graph()
with graph.as_default():
    
    # Parameters:
    # Embedding table with one row per ordered character pair (bigram).
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Concatenate Parameters
    px = tf.concat([ix, fx, cx, ox], 1)
    pm = tf.concat([im, fm, cm, om], 1)
    pb = tf.concat([ib, fb, cb, ob], 1)
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf. Note that in this formulation,
        we omit the various connections between the previous state and the gates."""
        # One fused matmul over the concatenated parameters replaces the four
        # per-gate products, e.g. tf.matmul(i, ix) + tf.matmul(o, im) + ib.
        pmatmul = tf.matmul(i, px) + tf.matmul(o, pm) + pb
        p_input, p_forget, update, p_output = tf.split(pmatmul, 4, 1)
        input_gate = tf.sigmoid(p_input)
        forget_gate = tf.sigmoid(p_forget)
        output_gate = tf.sigmoid(p_output)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_chars = train_data[:num_unrollings]
    # Consecutive character pairs (bigrams) are the model inputs.
    train_inputs = list(zip(train_chars[:-1], train_chars[1:]))
    train_labels = train_data[2:]  # each label is the character that follows its bigram.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        # Flatten the character pair into one row index of the embedding table.
        bigram_index = tf.argmax(i[0], axis=1) + vocabulary_size * tf.argmax(i[1], axis=1)
        embed_i = tf.nn.embedding_lookup(embeddings, bigram_index)
        output, state = lstm_cell(embed_i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = list()
    for _ in range(2):
        sample_input.append(tf.placeholder(tf.float32, shape=[1, vocabulary_size]))
    sample_input_index = tf.argmax(sample_input[0], axis=1) + vocabulary_size * tf.argmax(sample_input[1], axis=1)
    embedding_sample_input = tf.nn.embedding_lookup(embeddings, sample_input_index)
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(embedding_sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output), saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
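
The graph above turns each pair of consecutive one-hot characters into a single row index, `first + vocabulary_size * second`, and uses it to look up an embedding. A minimal NumPy sketch of that mapping, assuming a 27-character vocabulary in which space has id 0 and `a`..`z` have ids 1..26; the names `one_hot` and `bigram_index` are illustrative only:

```python
import numpy as np

vocabulary_size = 27  # [a-z] + ' ', as defined earlier in the notebook

def one_hot(char_id):
    """Illustrative helper: a [1, vocabulary_size] one-hot row."""
    row = np.zeros((1, vocabulary_size), dtype=np.float32)
    row[0, char_id] = 1.0
    return row

def bigram_index(first, second):
    """Same arithmetic as the graph: argmax(first) + vocabulary_size * argmax(second)."""
    return np.argmax(first, axis=1) + vocabulary_size * np.argmax(second, axis=1)

# Assumed ids: 't' -> 20, 'h' -> 8.
idx = bigram_index(one_hot(20), one_hot(8))
print(idx)  # [236]

# Every ordered pair gets its own row in a table of vocabulary_size ** 2 embeddings.
embeddings = np.random.uniform(-1.0, 1.0, (vocabulary_size ** 2, 128))
print(embeddings[idx].shape)  # (1, 128)
```

The largest possible index is `26 + 27 * 26 = 728`, which is why the embedding table needs `vocabulary_size * vocabulary_size` rows.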

In [16]:
import collections
num_steps = 7001
summary_frequency = 100

valid_batches = BatchGenerator(valid_text, 1, 2)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[2:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    # Seed the two-character window with random one-hot characters.
                    feed = collections.deque(maxlen=2)
                    for _ in range(2):
                        feed.append(sample(random_distribution()))
                    sentence = characters(feed[0])[0] + characters(feed[1])[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input[0]: feed[0], sample_input[1]: feed[1]})
                        feed.append(sample(prediction))
                        sentence += characters(feed[1])[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input[0]: b[0], sample_input[1]: b[1]})
                valid_logprob = valid_logprob + logprob(predictions, b[2])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.300705 learning rate: 10.000000
Minibatch perplexity: 27.13
hxtesi loitogkyajlxnier weeeiqon ij iogbleqrgl bo otvzkyfn cvvxdfloexzucru tinrbd
pndhqnkaqjeidjnoi rzx escvreglnbg eodtcelpaacelonlo aqoclwb x eyevnmsoqttecdtliqe
qaqyi irueohfalq crrwhriprkziuclefb  hejroued z ftlfipvvujue ynsaedeuswok rohvi i
kmutu zy ec qjybjopraeh  ceztwgyr ieqeelra tjnlrkznptbp  vx m z wv ateiqeli v nop
ufixvvafuzcrzpeabzcrhyfyzzns fah ijouk kxssgerymnim miorrdsw dsp vi etwvpeenzalzq
Validation set perplexity: 19.32
Average loss at step 100: 2.268349 learning rate: 10.000000
Minibatch perplexity: 7.87
Validation set perplexity: 8.93
Average loss at step 200: 1.967301 learning rate: 10.000000
Minibatch perplexity: 6.74
Validation set perplexity: 8.26
Average loss at step 300: 1.883177 learning rate: 10.000000
Minibatch perplexity: 6.01
Validation set perplexity: 7.96
Average loss at step 400: 1.826785 learning rate: 10.000000
Minibatch perplexity: 6.61
...
Validation set perplexity: 6.86
Average loss at step 4500: 1.587650 learning rate: 10.000000
Minibatch perplexity: 4.85
Validation set perplexity: 6.90
Average loss at step 4600: 1.585871 learning rate: 10.000000
Minibatch perplexity: 4.59
Validation set perplexity: 6.78
Average loss at step 4700: 1.603285 learning rate: 10.000000
Minibatch perplexity: 4.62
Validation set perplexity: 6.78
Average loss at step 4800: 1.595954 learning rate: 10.000000
Minibatch perplexity: 4.38
Validation set perplexity: 6.86
Average loss at step 4900: 1.617836 learning rate: 10.000000
Minibatch perplexity: 5.22
Validation set perplexity: 6.94
Average loss at step 5000: 1.622733 learning rate: 1.000000
Minibatch perplexity: 4.72
zcs dependuma later of that hervie ople the bastudented by the polisiciful interf
kton subservices when geol historyv the in collievering the duction says how mani
jded in name of flor this reset are four or one one nine nine zero unigensai in o
qxds estgcton isbn americational as
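
The learning rate in the log drops from 10 to 1 at step 5000 because `tf.train.exponential_decay` is called with `staircase=True`, which applies the decay in discrete jumps rather than continuously. A plain-Python sketch of that schedule, `initial * decay_rate ** floor(step / decay_steps)`; the function name `staircase_decay` is illustrative:

```python
def staircase_decay(initial, step, decay_steps, decay_rate):
    """Illustrative re-implementation of exponential decay with staircase=True."""
    # Integer division keeps the rate constant within each block of decay_steps.
    return initial * decay_rate ** (step // decay_steps)

for step in (0, 4999, 5000, 7000):
    print(step, staircase_decay(10.0, step, 5000, 0.1))
```

With `num_steps = 7001`, the rate therefore stays at 10 for the first 5000 steps and at 1 for the remaining ones.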

c- LSTM with Dropout

In [17]:
num_nodes = 64
embedding_size = 128

graph = tf.Graph()
with graph.as_default():
    
    # Parameters:
    # Embedding table with one row per ordered character pair (bigram).
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Concatenate Parameters
    px = tf.concat([ix, fx, cx, ox], 1)
    pm = tf.concat([im, fm, cm, om], 1)
    pb = tf.concat([ib, fb, cb, ob], 1)
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf. Note that in this formulation,
        we omit the various connections between the previous state and the gates."""
        # One fused matmul over the concatenated parameters replaces the four
        # per-gate products, e.g. tf.matmul(i, ix) + tf.matmul(o, im) + ib.
        pmatmul = tf.matmul(i, px) + tf.matmul(o, pm) + pb
        p_input, p_forget, update, p_output = tf.split(pmatmul, 4, 1)
        input_gate = tf.sigmoid(p_input)
        forget_gate = tf.sigmoid(p_forget)
        output_gate = tf.sigmoid(p_output)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_chars = train_data[:num_unrollings]
    # Consecutive character pairs (bigrams) are the model inputs.
    train_inputs = list(zip(train_chars[:-1], train_chars[1:]))
    train_labels = train_data[2:]  # each label is the character that follows its bigram.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
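        # Flatten the character pair into one row index of the embedding table.
        bigram_index = tf.argmax(i[0], axis=1) + vocabulary_size * tf.argmax(i[1], axis=1)
        embed_i = tf.nn.embedding_lookup(embeddings, bigram_index)
        # Dropout on the non-recurrent input only; 0.6 is the keep probability (TF 1.x API).
        drop_i = tf.nn.dropout(embed_i, keep_prob=0.6)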
        output, state = lstm_cell(drop_i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = list()
    for _ in range(2):
        sample_input.append(tf.placeholder(tf.float32, shape=[1, vocabulary_size]))
    sample_input_index = tf.argmax(sample_input[0], axis=1) + vocabulary_size * tf.argmax(sample_input[1], axis=1)
    embedding_sample_input = tf.nn.embedding_lookup(embeddings, sample_input_index)
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(embedding_sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output), saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
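
In the graph above, `tf.nn.dropout` is applied only to `embed_i` inside the training loop; the sampling path passes the raw embedding to `lstm_cell`. In TensorFlow 1.x the second argument of `tf.nn.dropout` is the keep probability, and kept values are scaled by `1 / keep_prob` so that the expected activation is unchanged, which is why the sampling path needs no rescaling. A minimal NumPy sketch of this "inverted" dropout, where `inverted_dropout` is an illustrative name and not a TensorFlow function:

```python
import numpy as np

def inverted_dropout(x, keep_prob, rng):
    """Illustrative inverted dropout: zero units with prob 1 - keep_prob, rescale the rest."""
    mask = rng.uniform(size=x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.RandomState(0)
x = np.ones((1000, 128))
dropped = inverted_dropout(x, 0.6, rng)

# Roughly 60% of the entries survive, and the mean stays close to 1.
print((dropped != 0).mean(), dropped.mean())
```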

In [18]:
import collections
num_steps = 7001
summary_frequency = 100

valid_batches = BatchGenerator(valid_text, 1, 2)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[2:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    # Seed the two-character window with random one-hot characters.
                    feed = collections.deque(maxlen=2)
                    for _ in range(2):
                        feed.append(sample(random_distribution()))
                    sentence = characters(feed[0])[0] + characters(feed[1])[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input[0]: feed[0], sample_input[1]: feed[1]})
                        feed.append(sample(prediction))
                        sentence += characters(feed[1])[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input[0]: b[0], sample_input[1]: b[1]})
                valid_logprob = valid_logprob + logprob(predictions, b[2])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.295196 learning rate: 10.000000
Minibatch perplexity: 26.98
hioduookrjri iebjf vuby ll s drkr re pheuofjsa rb  ctrbnro m a epoax csugtexk ro 
cn h e thcqd taq qeq  o vf yt o wqhabctz vagaaemovkeq y ii q tte g eiiiauq idhfj 
asxj  r o evtwganrrojlrjruo z u xdrwabeihs lbeifdorit s v mc gtvbjmgpk exhbki  ei
nyi eifo eqaqd fpjfexrah blnxawfxy  vptrgy y kqeod dnkpx femhnd o io  uonbc dsb  
lu s qo aj ecbdrljb arj  jljqrtxbbt tum bc  rwm c r  drf ij fx ecxqvkls b wemwf  
Validation set perplexity: 23.12
Average loss at step 100: 2.461216 learning rate: 10.000000
Minibatch perplexity: 9.73
Validation set perplexity: 9.57
Average loss at step 200: 2.167098 learning rate: 10.000000
Minibatch perplexity: 9.72
Validation set perplexity: 8.19
Average loss at step 300: 2.088658 learning rate: 10.000000
Minibatch perplexity: 7.48
Validation set perplexity: 8.12
Average loss at step 400: 2.034180 learning rate: 10.000000
Minibatch perplexity: 7.93
...
Validation set perplexity: 6.93
Average loss at step 4500: 1.801312 learning rate: 10.000000
Minibatch perplexity: 6.17
Validation set perplexity: 6.83
Average loss at step 4600: 1.792929 learning rate: 10.000000
Minibatch perplexity: 5.82
Validation set perplexity: 6.85
Average loss at step 4700: 1.801965 learning rate: 10.000000
Minibatch perplexity: 6.20
Validation set perplexity: 6.83
Average loss at step 4800: 1.810884 learning rate: 10.000000
Minibatch perplexity: 6.51
Validation set perplexity: 6.88
Average loss at step 4900: 1.796043 learning rate: 10.000000
Minibatch perplexity: 6.08
Validation set perplexity: 6.76
Average loss at step 5000: 1.807414 learning rate: 1.000000
Minibatch perplexity: 5.65
two five zero s in leaths the or a was conaloss rational shark difficial semi fil
kk partician be speing that the internationian contexeraturonber eight cicinistor
syreated hish cartation into at stragers the only filavirst with it at heally fin
bf storring inte dranthis in leaty 
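
Each sample line above starts from two random characters; the loop then repeatedly feeds the last two characters to the model and appends the sampled character. A sketch of that sliding window using `collections.deque(maxlen=2)`, with a stand-in `next_char` function in place of the trained model (both `next_char` and `generate` are hypothetical names):

```python
import collections
import random
import string

alphabet = string.ascii_lowercase + ' '

def next_char(first, second, rng):
    """Stand-in for the model: ignores its inputs and picks a random character."""
    return rng.choice(alphabet)

def generate(length, rng):
    # maxlen=2 makes append() drop the oldest character automatically.
    window = collections.deque(maxlen=2)
    for _ in range(2):
        window.append(rng.choice(alphabet))
    sentence = window[0] + window[1]
    for _ in range(length):
        window.append(next_char(window[0], window[1], rng))
        sentence += window[1]  # window[1] is the character just sampled
    return sentence

line = generate(79, random.Random(0))
print(len(line))  # 81: two seed characters plus 79 sampled ones
```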

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.
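
Training such a model requires (input, target) pairs. A minimal sketch of how the targets could be generated from Text8-style text, assuming words are separated by single spaces; `mirror_words` is a hypothetical helper and not part of the assignment code:

```python
def mirror_words(sentence):
    """Hypothetical helper: reverse each space-separated word, keeping word order."""
    return ' '.join(word[::-1] for word in sentence.split(' '))

print(mirror_words('the quick brown fox'))  # eht kciuq nworb xof
```

Splitting on `' '` rather than on arbitrary whitespace keeps the output the same length as the input, so input and target sequences stay aligned character by character.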

---