Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [13]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [14]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [15]:
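def read_data(filename):
  """Return the first file inside the zip archive as a string."""
  # The context manager closes the archive even though we return early.
  with zipfile.ZipFile(filename) as f:
    return tf.compat.as_str(f.read(f.namelist()[0]))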
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [16]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [17]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    # using one hot encoding [batch_size, vocabulary_size]
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat
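
The layout of these batches is easier to see on a tiny input. The sketch below replays the cursor logic of `BatchGenerator` on a 26-character toy string and returns plain strings instead of one-hot arrays. The helper name `unrolled_windows` and the toy text are illustrative assumptions, not part of the assignment.

```python
def unrolled_windows(text, batch_size, num_unrollings, calls):
    """Replay BatchGenerator's cursor logic, returning one string per batch row."""
    segment = len(text) // batch_size
    cursors = [offset * segment for offset in range(batch_size)]
    last_step = None
    windows = []
    for _ in range(calls):
        # Every call after the first starts with the last step of the previous call.
        steps = [] if last_step is None else [last_step]
        while len(steps) < num_unrollings + 1:
            steps.append([text[c] for c in cursors])
            cursors = [(c + 1) % len(text) for c in cursors]
        last_step = steps[-1]
        # Transpose the steps so that each batch row becomes one string.
        windows.append([''.join(chars) for chars in zip(*steps)])
    return windows

toy_text = 'abcdefghijklmnopqrstuvwxyz'
first, second = unrolled_windows(toy_text, batch_size=2, num_unrollings=3, calls=2)
print(first)   # ['abcd', 'nopq']
print(second)  # ['defg', 'qrst']
```

Each row starts one `segment` further into the text, and consecutive calls overlap by one character, so the last label of one call becomes the first input of the next. The same overlap is visible in the real output above, where `'ons anarchi'` is followed by `'ists advoca'`.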

In [18]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  #print(predictions.shape)
  #print(labels.shape)
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
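
Perplexity in this notebook is the exponential of the mean negative log-probability that `logprob` returns. As a sanity check, here is a small sketch with made-up labels: a model that spreads probability evenly over 27 characters should score a perplexity of about 27, which is consistent with the minibatch perplexity of 27.01 reported at step 0 of the training log. The name `mean_neg_logprob` and the example labels are assumptions made for illustration.

```python
import numpy as np

def mean_neg_logprob(predictions, labels):
    """Same quantity as logprob, computed without modifying predictions in place."""
    clipped = np.clip(predictions, 1e-10, None)
    return np.sum(labels * -np.log(clipped)) / labels.shape[0]

vocabulary_size = 27
true_ids = [1, 0, 26, 5]                      # arbitrary example labels
labels = np.eye(vocabulary_size)[true_ids]    # one-hot rows

# A model that knows nothing spreads probability evenly over all characters.
uniform = np.full((len(true_ids), vocabulary_size), 1.0 / vocabulary_size)
perplexity_uniform = float(np.exp(mean_neg_logprob(uniform, labels)))

# A model that puts probability 0.9 on the correct character every time.
confident = 0.9 * labels + (0.1 / (vocabulary_size - 1)) * (1 - labels)
perplexity_confident = float(np.exp(mean_neg_logprob(confident, labels)))

print(perplexity_uniform, perplexity_confident)
```

The uniform model lands at the vocabulary size, while the confident model lands near 1/0.9. Lower is better, and 1.0 is the floor.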

In [24]:
for i in range(5):
    exp_feed = sample(random_distribution())
    sentence = characters(exp_feed)[0]
    print(sentence)

q
j
n
n
p
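
`sample_distribution` walks the cumulative sum of the probabilities until it reaches a uniform random number. A possible vectorized equivalent is sketched below; it takes the random number `r` as an argument so the behaviour can be checked deterministically. The function name `sample_index` and the three-element distribution are illustrative assumptions.

```python
import numpy as np

def sample_index(distribution, r):
    """Return the first index whose cumulative probability reaches r."""
    cdf = np.cumsum(distribution)
    index = int(np.searchsorted(cdf, r, side='left'))
    # Guard against rounding that leaves the final cumulative sum below r.
    return min(index, len(distribution) - 1)

distribution = np.array([0.25, 0.25, 0.5])
picks = [sample_index(distribution, r) for r in (0.25, 0.3, 0.9, 2.0)]
print(picks)  # [0, 1, 2, 2]
```

The final clamp mirrors the `return len(distribution) - 1` fallback in the loop version.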


Simple LSTM Model.

In [218]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  # i => current input, o => previous output, state => previous state
  def lstm_cell(i, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate  = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  print("batch_size %d vocabulary_size %d" % (batch_size, vocabulary_size))
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

batch_size 64 vocabulary_size 27
Tensor("xw_plus_b:0", shape=TensorShape([Dimension(640), Dimension(27)]), dtype=float32)
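
The cell computation in the graph above can be replayed outside TensorFlow. The following is a minimal NumPy sketch of a single step, assuming the same four gates and the same omission of connections between the previous state and the gates. The names `lstm_step` and `params`, and the all-zero weights, are illustrative choices rather than the notebook's trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, prev_output, prev_state, params):
    """One LSTM step; params maps a gate name to (W_x, W_m, bias)."""
    def affine(name):
        w_x, w_m, bias = params[name]
        return np.dot(x, w_x) + np.dot(prev_output, w_m) + bias
    input_gate = sigmoid(affine('i'))
    forget_gate = sigmoid(affine('f'))
    update = affine('c')
    state = forget_gate * prev_state + input_gate * np.tanh(update)
    output_gate = sigmoid(affine('o'))
    return output_gate * np.tanh(state), state

vocabulary_size, num_nodes = 27, 4
zeros = (np.zeros((vocabulary_size, num_nodes)),
         np.zeros((num_nodes, num_nodes)),
         np.zeros((1, num_nodes)))
params = {name: zeros for name in 'ifco'}

x = np.eye(vocabulary_size)[[3]]           # one-hot row for a single character
prev_output = np.zeros((1, num_nodes))
prev_state = np.ones((1, num_nodes))
output, state = lstm_step(x, prev_output, prev_state, params)
print(state, output)
```

With zero weights every gate evaluates to 0.5 and the update to 0, so the new state is half the previous state and the output is `0.5 * tanh(0.5)`. This makes the role of each gate easy to verify by hand.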


In [219]:
print(len(train_inputs))
print(train_inputs[0])
print(train_inputs[1])
ti0 = train_inputs[0]
t4 = tf.tile(train_inputs[1], [1, 4])
t04 = tf.tile(train_inputs[1], [4, 1])
print(t4)
print(t04)

10
Tensor("Placeholder:0", shape=TensorShape([Dimension(64), Dimension(27)]), dtype=float32)
Tensor("Placeholder_1:0", shape=TensorShape([Dimension(64), Dimension(27)]), dtype=float32)
Tensor("Tile:0", shape=TensorShape([Dimension(64), Dimension(108)]), dtype=float32)
Tensor("Tile_1:0", shape=TensorShape([Dimension(256), Dimension(27)]), dtype=float32)


In [224]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      print(labels.shape)
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.296050 learning rate: 10.000000
Minibatch perplexity: 27.01
(640, 27)
h sfanqjolpiouedsxdeee zgovptearujnkenrwulhekemtirexzxqi pn  dak bjueei uiueh rb
tsqiiuwzvnk rne  svc ewwmeckigs mkuno oagk  mciktmjurb  cvzyb oxw  l g jvtimscqw
xoo e dgtbolgaejorhefzhevjtiyilyfzoueetzujrrenealah ordkl n ejooteynekzcvgcrcuel
yczigpt scvapy erjngojzbriixlyslrsraiatnzylq lrpqebrcaulnice oxenaptn yaqnrezi d
fnpfpysedzkkqbqs  eyecjkaufse wgsfarod  eancmwperdhqzlnmet ospxtresghlcmdryoih k
Validation set perplexity: 20.39
Average loss at step 100: 2.595946 learning rate: 10.000000
Minibatch perplexity: 10.36
(640, 27)
Validation set perplexity: 10.13


KeyboardInterrupt: 
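
The learning rate printed in the log comes from `tf.train.exponential_decay` with `staircase=True`, and the gradients pass through `tf.clip_by_global_norm` before being applied. The sketch below restates both formulas in NumPy as I understand them from the TensorFlow documentation; the helper names and the example gradients are assumptions for illustration.

```python
import numpy as np

def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """initial_rate * decay_rate ** floor(step / decay_steps)."""
    return initial_rate * decay_rate ** (step // decay_steps)

def clip_by_global_norm(gradients, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in gradients], global_norm

rates = [staircase_decay(10.0, step, 5000, 0.1) for step in (0, 4999, 5000, 7000)]
print(rates)

example_gradients = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped, norm = clip_by_global_norm(example_gradients, 1.25)
print(norm, clipped)
```

Under these settings the rate stays at 10.0 for the first 5000 steps and then drops to 1.0. The example gradients have a joint norm of 5.0, so both are scaled by 0.25, which preserves their direction while capping the step size.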

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---
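
Before rewriting the graph, it may help to confirm that stacking the four weight matrices side by side gives the same numbers as four separate multiplies. The NumPy sketch below uses random stand-in values and a small `num_nodes`; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, vocabulary_size, num_nodes = 3, 27, 4

x = rng.normal(size=(batch_size, vocabulary_size))   # stands in for the input
h = rng.normal(size=(batch_size, num_nodes))         # stands in for the previous output

# Four separate parameter sets, in the order input, forget, cell, output.
w_x = [rng.normal(size=(vocabulary_size, num_nodes)) for _ in range(4)]
w_m = [rng.normal(size=(num_nodes, num_nodes)) for _ in range(4)]
bias = [rng.normal(size=(1, num_nodes)) for _ in range(4)]
separate = [np.dot(x, wx) + np.dot(h, wm) + b
            for wx, wm, b in zip(w_x, w_m, bias)]

# The same parameters stacked side by side: one multiply per operand.
fused = (np.dot(x, np.concatenate(w_x, axis=1))
         + np.dot(h, np.concatenate(w_m, axis=1))
         + np.concatenate(bias, axis=1))
chunks = np.split(fused, 4, axis=1)

matches = all(np.allclose(s, c) for s, c in zip(separate, chunks))
print(fused.shape, matches)
```

Column block `k` of the fused result equals the pre-activation of gate `k`, so slicing the fused tensor into four equal parts recovers the original gates.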

In [96]:
# Problem 1
num_nodes = 64

graph_4x = tf.Graph()
with graph_4x.as_default():
  
  # Parameters:
  ifco_x = tf.Variable(tf.truncated_normal([vocabulary_size, 4 * num_nodes], -0.1, 0.1))
  ifco_m = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
  ifco_b = tf.Variable(tf.zeros([1, 4 * num_nodes]))

  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  # i => current input, o => previous output, state => previous state
  def lstm_cell_sim(i, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    matcal = tf.matmul(i, ifco_x) + tf.matmul(o, ifco_m) + ifco_b
    input_gate  = tf.sigmoid(matcal[:, 0 : num_nodes])
    forget_gate = tf.sigmoid(matcal[:, num_nodes : 2 * num_nodes])
    update = matcal[:, 2 * num_nodes : 3 * num_nodes]
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(matcal[:, 3 * num_nodes : ])
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  print("batch_size %d vocabulary_size %d" % (batch_size, vocabulary_size))
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell_sim(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  # reset the output and state by setting them to 0
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell_sim(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

batch_size 64 vocabulary_size 27


In [97]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph_4x) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    print(feed_dict)
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
{<tensorflow.python.framework.ops.Tensor object at 0x1217b5350>: <bound method BatchBigramGenerator._next_batch of <__main__.BatchBigramGenerator object at 0x10b70a4d0>>, <tensorflow.python.framework.ops.Tensor object at 0x11f677110>: <bound method BatchBigramGenerator._next_batch of <__main__.BatchBigramGenerator object at 0x10b70a4d0>>, <tensorflow.python.framework.ops.Tensor object at 0x11f677fd0>: <bound method BatchBigramGenerator._next_batch of <__main__.BatchBigramGenerator object at 0x10b70a4d0>>, <tensorflow.python.framework.ops.Tensor object at 0x11f677c10>: <bound method BatchBigramGenerator._next_batch of <__main__.BatchBigramGenerator object at 0x10b70a4d0>>, <tensorflow.python.framework.ops.Tensor object at 0x11f6778d0>: <bound method BatchBigramGenerator._next_batch of <__main__.BatchBigramGenerator object at 0x10b70a4d0>>, <tensorflow.python.framework.ops.Tensor object at 0x1217b5d10>: <bound method BatchBigramGenerator._next_batch of <__main__.BatchBigramGe

TypeError: float() argument must be a string or a number

---
Problem 2
---------

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---
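
The reason an embedding lookup can replace the 1-hot input is that multiplying a 1-hot row by a matrix simply selects one row of that matrix. The sketch below checks this in NumPy with random stand-in embeddings; the small `embedding_size` and the chosen IDs are assumptions for the demo.

```python
import numpy as np

rng = np.random.RandomState(0)
bigram_size, embedding_size = 27 * 27, 8   # embedding_size kept small for the demo

embeddings = rng.uniform(-1.0, 1.0, size=(bigram_size, embedding_size))
bigram_ids = np.array([0, 53, 728])

# Dense route: build 1-hot rows, then multiply by the embedding matrix.
one_hot = np.zeros((len(bigram_ids), bigram_size))
one_hot[np.arange(len(bigram_ids)), bigram_ids] = 1.0
via_matmul = np.dot(one_hot, embeddings)

# Lookup route: index the rows directly.
via_lookup = embeddings[bigram_ids]

same = np.allclose(via_matmul, via_lookup)
print(via_lookup.shape, same)
```

The lookup touches 3 rows, while the dense route multiplies a 3 x 729 matrix that is almost entirely zeros.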

In [131]:
# build the dataset for bigram
# problem 2a
batch_size=64
num_nodes = 64
num_unrollings = 10
embedding_size = 128
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
bigram_size = vocabulary_size * vocabulary_size

def bigram2id(bigram):
  id0 = char2id(bigram[0])
  id1 = char2id(bigram[1])
  return id0 * vocabulary_size + id1

def id2bigram(embed_id):
  id0 = embed_id // vocabulary_size  # floor division keeps the ID an integer
  id1 = embed_id % vocabulary_size
  return id2char(id0) + id2char(id1)

def convert_labels_to_one_hot_encoding(in_labels):
  label_batch = tf.concat(0, in_labels)
  sparse_labels = tf.reshape(label_batch, [-1, 1])
  derived_size = tf.shape(label_batch)[0]
  indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
  concated = tf.concat(1, [indices, sparse_labels])
  outshape = tf.pack([derived_size, bigram_size])
  one_hot_labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)
  return one_hot_labels

class BatchBigramGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = (self._text_size - 1 - 1) // batch_size #-1 for labels, -1 for last bigram
    self._cursor = [offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
    print("Initialize %d segments" % (segment))

  def _next_batch(self):
    batch = np.zeros(shape=(self._batch_size), dtype=np.int32)
    #print(self._text[0:self._batch_size])
    #print("> Text size %s" % (self._text_size))
    for b in range(self._batch_size):
      batch[b] = bigram2id(self._text[self._cursor[b]:self._cursor[b]+2])
      # Wrap one character early so a full two-character bigram always exists.
      self._cursor[b] = (self._cursor[b] + 1) % (self._text_size - 1)

    return batch

  def next(self):
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

train_batches_big = BatchBigramGenerator(train_text, batch_size, num_unrollings)
valid_batches_big = BatchBigramGenerator(valid_text, 1, 1)

Initialize 1562484 segments
Initialize 998 segments


In [132]:
print(train_text[1:100])
print(id2bigram(bigram2id("az")))
print(id2bigram(bigram2id(" a")))
print(id2bigram(bigram2id("a ")))
print(id2bigram(bigram2id("  ")))
print(id2bigram(bigram2id("zw")))
#print(train_batches_big)

#zz = train_batches_big.next()
exp_text = "a b c d e f g h i j k l m n o p q r s t u v w x y z a b c d e f g h i j k l m n o p q r s t u v w x y z"
exp_batches_big = BatchBigramGenerator(exp_text, 64, 10)
zz = exp_batches_big.next()
sequence = ""
for z in range(len(zz[0])):
  #print(id2bigram(zz[0][z]))
  sequence += id2bigram(zz[0][z])[0]
#print("Final: %s" % sequence)
#for i in range(27 * 27):
#    print("%d:%s" % (i, id2bigram(i)))



ns anarchists advocate social relations based upon voluntary association of autonomous individuals 
az
 a
a 
  
zw
Initialize 1 segments
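
The bigram ID is a two-digit number in base 27, with the first character as the high digit. A self-contained sketch of the round trip follows; `char2id` is simplified here on the assumption that the input contains only `[a-z]` and spaces.

```python
import string

vocabulary_size = len(string.ascii_lowercase) + 1   # [a-z] + ' '
first_letter = ord('a')

def char2id(char):
    # Simplified: assumes char is a lowercase letter or a space.
    return 0 if char == ' ' else ord(char) - first_letter + 1

def id2char(dictid):
    return chr(dictid + first_letter - 1) if dictid > 0 else ' '

def bigram2id(bigram):
    return char2id(bigram[0]) * vocabulary_size + char2id(bigram[1])

def id2bigram(bigram_id):
    # divmod uses floor division, so both parts stay integers.
    first, second = divmod(bigram_id, vocabulary_size)
    return id2char(first) + id2char(second)

print(bigram2id('az'))   # 53
round_trip_ok = all(bigram2id(id2bigram(i)) == i
                    for i in range(vocabulary_size ** 2))
print(round_trip_ok)
```

Checking all 729 IDs confirms that the mapping is a bijection, so no two bigrams share an embedding row.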


In [133]:
import math
num_sampled = 64
graph_em = tf.Graph()

with graph_em.as_default():
    #Input data
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    # Uniform init in [-1, 1]; truncated_normal(shape, -1.0, 1.0) would mean mean=-1.0, stddev=1.0.
    embeddings = tf.Variable(tf.random_uniform([bigram_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
      tf.truncated_normal([bigram_size, embedding_size],
                           stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([bigram_size]))

    # Parameters:
    ifco_x = tf.Variable(tf.truncated_normal([embedding_size, 4 * num_nodes], -0.1, 0.1))
    ifco_m = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
    ifco_b = tf.Variable(tf.zeros([1, 4 * num_nodes]))

    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state  = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, bigram_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([bigram_size]))
    
    # Definition of the cell computation.
    # i => current input, o => previous output, state => previous state
    def lstm_cell_sim_embed(i, o, state):
      """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
      Note that in this formulation, we omit the various connections between the
      previous state and the gates."""
      embed_i = tf.nn.embedding_lookup(embeddings, i)
      matcal = tf.matmul(embed_i, ifco_x) + tf.matmul(o, ifco_m) + ifco_b
      input_gate  = tf.sigmoid(matcal[:, 0 : num_nodes])
      forget_gate = tf.sigmoid(matcal[:, num_nodes : 2 * num_nodes])
      update = matcal[:, 2 * num_nodes : 3 * num_nodes]
      state = forget_gate * state + input_gate * tf.tanh(update)
      output_gate = tf.sigmoid(matcal[:, 3 * num_nodes : ])
      return output_gate * tf.tanh(state), state
    
    # Input data.
    train_data = list()
    print("batch_size %d vocabulary_size %d bigram_size %d" % (batch_size, vocabulary_size, bigram_size))
    for _ in range(num_unrollings + 1):
      train_data.append(tf.placeholder(tf.int32, shape=[batch_size]))

    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop
    outputs = list()
    output = saved_output
    state  = saved_state
    for i in train_inputs:
      output, state = lstm_cell_sim_embed(i, output, state)
      outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
      logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
      loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
          logits, convert_labels_to_one_hot_encoding(tf.concat(0, train_labels))))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
      10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)
    
    # Predictions.
    print("logits")
    print(logits)
    train_prediction = tf.nn.softmax(logits)
    print("train_prediction")
    print(train_prediction)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.int32, shape=[1])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]), trainable=False)
    saved_sample_state  = tf.Variable(tf.zeros([1, num_nodes]), trainable=False)
    reset_sample_state  = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])),
                                   saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell_sim_embed(sample_input, saved_sample_output, saved_sample_state)

    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
      sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

batch_size 64 vocabulary_size 27 bigram_size 729
logits
Tensor("xw_plus_b:0", shape=TensorShape([Dimension(640), Dimension(729)]), dtype=float32)
train_prediction
Tensor("Softmax:0", shape=TensorShape([Dimension(640), Dimension(729)]), dtype=float32)


In [134]:
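def logprob_emb(predictions, labels):
  """Log-probability of the true bigram IDs in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  # labels are integer bigram IDs, so pick the predicted probability of each
  # true ID directly instead of building a one-hot matrix in the TF graph.
  labels = np.asarray(labels, dtype=np.int64)
  true_probs = predictions[np.arange(labels.shape[0]), labels]
  return np.sum(-np.log(true_probs)) / labels.shape[0]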

def sample_distribution_emb(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample_emb(prediction):
  """Turn a (column) prediction into a sampled bigram ID."""
  # int32 matches the dtype of the sample_input placeholder.
  p = np.zeros(shape=[1], dtype=np.int32)
  p[0] = sample_distribution_emb(prediction[0])
  return p

def random_distribution_emb():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, bigram_size])
  return b/np.sum(b, 1)[:,None]

In [135]:
feed_exp = sample(random_distribution())
print(feed_exp.shape)


feed_exp3 = np.zeros(shape=[1], dtype=np.float)
feed_exp3[0] = 27
print(feed_exp3.shape)

(1, 27)
(1,)


In [None]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph_em) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches_big.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      #print('Minibatch perplexity: %.2f' % float(
      #  np.exp(logprob_emb(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample_emb(random_distribution_emb())
          sentence = id2bigram(int(feed[0]))[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample_emb(prediction)
            sentence += id2bigram(int(feed[0]))[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      # disable for now
      #for _ in range(valid_size):
      #  b = valid_batches.next()
      #  predictions = sample_prediction.eval({sample_input: b[0]})
      #  valid_logprob = valid_logprob + logprob(predictions, b[1])
      #print('Validation set perplexity: %.2f' % float(np.exp(
      #  valid_logprob / valid_size)))

Initialized
Average loss at step 0: 6.879203 learning rate: 10.000000
q neeeeeeeeeeeeeeeeeeeeeeeeeeeee e eeeeeeeeeeeeeee eeeeeeeeeeeeeeeeeeeeeeeee eee
l ee eeeeeeeeeeeeeee eeeeeeeeeeeeeee eeeeeeeeeeeeeeeeeeeeee eeeee eeeeeeeeee eee
reeeeeeeeeeeeeee eeeeee ee eeeee e eeeee eee eeeeeeeeeeesaeeeeeeeee eee eeeeeeee
ceeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeaee eeeeeeeeeee eee ee eeeeeeeeeeed
neeeeeeee eneeeeeeeeeee eeee eses eeeeeee eee dee eeeeeeeeeeeeeeeeeeee eee eee e
Average loss at step 100: 9.068229 learning rate: 10.000000
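
Part c of Problem 2 is not implemented in the cells above. The article linked in the problem statement applies dropout only to the non-recurrent connections, leaving the state carried between time steps untouched. A minimal NumPy sketch of that idea follows, assuming inverted dropout with a keep probability of 0.5; the names and the all-ones stand-in for the cell output are illustrative.

```python
import numpy as np

def dropout(values, keep_prob, rng):
    """Inverted dropout: zero units with probability 1 - keep_prob, rescale the rest."""
    mask = rng.uniform(size=values.shape) < keep_prob
    return values * mask / keep_prob

rng = np.random.RandomState(0)
cell_output = np.ones((64, 64))        # stands in for one unrolled LSTM output

to_classifier = dropout(cell_output, keep_prob=0.5, rng=rng)   # non-recurrent path
to_next_step = cell_output                                     # recurrent path, untouched

kept_fraction = float(np.mean(to_classifier > 0))
print(kept_fraction)
```

Dividing by `keep_prob` keeps the expected activation unchanged, so the sampling and validation graph can run without dropout and without any rescaling.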

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---
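
A first step toward Problem 3 is generating training pairs. The sketch below builds a source string and its mirrored target from a window of text; the helper names `mirror_words` and `make_pair` and the fixed-length windowing are assumptions, and the model itself is left open.

```python
def mirror_words(sentence):
    """Reverse the characters of each space-separated word, keeping word order."""
    return ' '.join(word[::-1] for word in sentence.split(' '))

def make_pair(text, start, length):
    """Cut a fixed-length window out of text and return (source, target)."""
    source = text[start:start + length]
    return source, mirror_words(source)

source, target = make_pair('the quick brown fox jumps', 0, 19)
print(source)  # the quick brown fox
print(target)  # eht kciuq nworb xof
```

Because mirroring preserves the length of every word, source and target always have the same number of characters, which simplifies batching for an encoder-decoder model.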