Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve


In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    data = tf.compat.as_str(f.read(name))
  return data
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


In [4]:
text[0:150]

' anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans '

Create a small validation set.

In [5]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print('train text of {}:\t {}...'.format(train_size, train_text[:64]))
print('valid text of {}:\t {}...'.format(valid_size, valid_text[:64]))

train text of 99999000:	 ons anarchists advocate social relations based upon voluntary as...
valid text of 1000:	  anarchism originated as a term of abuse first used against earl...


Utility functions to map characters to vocabulary IDs and back.

In [6]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter    = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0
  

def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


A class that generates training batches for the LSTM model.

In [7]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text       = text
        self._text_size  = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment      = self._text_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
  
    def _next_batch(self):
        # `np.float` was removed in NumPy 1.24; the builtin `float` is the same 64-bit type.
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
        for b in range(self._batch_size):
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
  
    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
            
        self._last_batch = batches[-1]
        
        return batches
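
The cursor layout is easier to see on a toy input. The following is a minimal sketch, not the notebook's class: `toy_batches` is a hypothetical helper that applies the same cursor logic but keeps raw characters instead of one-hot rows. It shows that each of the `batch_size` cursors walks its own segment of the text, and that consecutive calls overlap by one character because the last batch of one call is reused as the first batch of the next.

```python
def toy_batches(text, batch_size, num_unrollings, num_calls):
    """Simplified stand-in for BatchGenerator that keeps characters, not one-hot rows."""
    segment = len(text) // batch_size
    cursor = [offset * segment for offset in range(batch_size)]

    def next_batch():
        # One character per cursor, then advance every cursor with wrap-around.
        batch = [text[c] for c in cursor]
        for b in range(batch_size):
            cursor[b] = (cursor[b] + 1) % len(text)
        return batch

    last = next_batch()
    results = []
    for _ in range(num_calls):
        batches = [last]
        for _ in range(num_unrollings):
            batches.append(next_batch())
        last = batches[-1]
        # Read each batch row across time steps to recover one string per cursor.
        results.append([''.join(chars) for chars in zip(*batches)])
    return results


first, second = toy_batches('abcdefghijkl', batch_size=3, num_unrollings=2, num_calls=2)
print(first)   # ['abc', 'efg', 'ijk']
print(second)  # ['cde', 'ghi', 'kla']
```

The repeated character (`c`, `g`, `k`) is what lets the labels be the inputs shifted by one time step without losing a prediction target at the boundary between calls.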

In [8]:
def characters(probabilities):
    """Turn one-hot encoding or a prob.dist. over chars into (most likely) chars"""
    return [id2char(c) for c in np.argmax(probabilities, 1)]


def batches2string(batches):
    """Turn seq of one-hot batches into (most likely) string representation."""
    s = [''] * batches[0].shape[0]    
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

In [9]:
train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

In [10]:
bs1 = batches2string(train_batches.next())
bs2 = batches2string(train_batches.next())

for s1,s2 in zip(bs1,bs2):
    print('{}_{}'.format(s1,s2))

ons anarchi_ists advoca
when milita_ary governm
lleria arch_hes nationa
 abbeys and_d monasteri
married urr_raca prince
hel and ric_chard baer 
y and litur_rgical lang
ay opened f_for passeng
tion from t_the nationa
migration t_took place 
new york ot_ther well k
he boeing s_seven six s
e listed wi_ith a gloss
eber has pr_robably bee
o be made t_to recogniz
yer who rec_ceived the 
ore signifi_icant than 
a fierce cr_ritic of th
 two six ei_ight in sig
aristotle s_s uncaused 
ity can be _ lost as in
 and intrac_cellular ic
tion of the_e size of t
dy to pass _ him a stic
f certain d_drugs confu
at it will _ take to co
e convince _ the priest
ent told hi_im to name 
ampaign and_d barred at
rver side s_standard fo
ious texts _ such as es
o capitaliz_ze on the g
a duplicate_e of the or
gh ann es d_d hiver one
ine january_y eight mar
ross zero t_the lead ch
cal theorie_es classica
ast instanc_ce the non 
 dimensiona_al analysis
most holy m_mormons bel
t s support_t or at lea
u is still _ dis

In [11]:
def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    
    log_prob = np.sum(np.multiply(labels, -np.log(predictions)))
    log_prob = log_prob / labels.shape[0]
    
    return log_prob

def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normed probs."""
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    # `np.float` was removed in NumPy 1.24; the builtin `float` is the same 64-bit type.
    p = np.zeros(shape=[1, vocabulary_size], dtype=float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:,None]
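
Perplexity is reported later as `np.exp(logprob(...))`, so two reference points help when reading the numbers. The sketch below restates the `logprob` formula in a self-contained form (it uses `np.maximum` instead of in-place clipping, which is an illustrative change) and evaluates it on two made-up predictors: one that spreads probability uniformly over the 27 characters, and one that is always certain and correct.

```python
import numpy as np

vocabulary_size = 27  # [a-z] + ' ', as in the notebook


def logprob(predictions, labels):
    """Mean negative log-probability of the true labels (no in-place mutation)."""
    clipped = np.maximum(predictions, 1e-10)
    return np.sum(labels * -np.log(clipped)) / labels.shape[0]


# One-hot labels for four arbitrary characters.
labels = np.eye(vocabulary_size)[[1, 5, 0, 26]]

# A predictor that knows nothing assigns 1/27 to every character.
uniform = np.full((4, vocabulary_size), 1.0 / vocabulary_size)
uniform_perplexity = float(np.exp(logprob(uniform, labels)))
print(round(uniform_perplexity, 2))  # 27.0

# A predictor that is certain and right has perplexity 1.
perfect_perplexity = float(np.exp(logprob(labels.copy(), labels)))
print(round(perfect_perplexity, 2))  # 1.0
```

An untrained model should therefore start near a perplexity of 27, and lower values mean the model is narrowing down the next character.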

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.
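
The identity behind this simplification can be checked outside TensorFlow. The following NumPy sketch uses small made-up sizes and random weights (none of the names come from the notebook) to show that concatenating the four weight matrices along the output axis, multiplying once, and splitting the result gives the same four pre-activations as the separate multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, vocab, nodes = 2, 27, 4  # small illustrative sizes, not the notebook's

x = rng.normal(size=(batch, vocab))  # input at one time step
h = rng.normal(size=(batch, nodes))  # previous output

# Four separate (input weight, recurrent weight, bias) triples, one per gate.
Wx = [rng.normal(size=(vocab, nodes)) for _ in range(4)]
Wh = [rng.normal(size=(nodes, nodes)) for _ in range(4)]
bs = [rng.normal(size=(1, nodes)) for _ in range(4)]

# Eight small matrix multiplications.
separate = [x @ wx + h @ wh + b for wx, wh, b in zip(Wx, Wh, bs)]

# Two large ones: concatenate along the output axis, multiply once, then split.
fused = (x @ np.concatenate(Wx, axis=1)
         + h @ np.concatenate(Wh, axis=1)
         + np.concatenate(bs, axis=1))
parts = np.split(fused, 4, axis=1)

matches = all(np.allclose(s, p) for s, p in zip(separate, parts))
print(matches)  # True
```

The only thing to keep straight is the order: the slices come back in the same order in which the matrices were concatenated.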

---

In [12]:
num_nodes = 64

problem1_graph = tf.Graph()
with problem1_graph.as_default():  
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))

    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))

    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))

    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state  = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
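    # Fused matrices, concatenated in the order input, forget, output, cell
    # so that the slices returned by tf.split in lstm_cell line up with their names.
    ifoc_x = tf.concat([ix, fx, ox, cx], axis=1) # vocab_size x 4*num_nodes
    ifoc_m = tf.concat([im, fm, om, cm], axis=1) #  num_nodes x 4*num_nodes
    ifoc_b = tf.concat([ib, fb, ob, cb], axis=1) #          1 x 4*num_nodes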
     
    
    # Definition of the cell computation. 
    '''
    Alternative:
    def lstm_cell(i,o, state):
        mmul = tf.matmul(i, ifoc_x) + tf.matmul(o, ifoc_m) + ifoc_b # 1 x 4*num_nodes
        sig  = tf.sigmoid(mmul)                                     
        tan  = tf.tanh(mmul)
        
        input_gate,forget_gate,output_gate,_ = tf.split(sig, num_or_size_splits=4, axis=1) # (1 x num_nodes) x 4
        _,_,_,update                         = tf.split(tan, num_or_size_splits=4, axis=1) # (1 x num_nodes) x 4
        
        state       = forget_gate * state + input_gate * update
        output      = output_gate * tf.tanh(state)
        return output, state
    '''
    def lstm_cell(i,o, state):
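        # The row count is the batch dimension: batch_size when training, 1 when sampling.
        mmul        = tf.matmul(i, ifoc_x) + tf.matmul(o, ifoc_m) + ifoc_b # batch x 4*num_nodes
        im,fm,om,cm = tf.split(mmul, num_or_size_splits=4, axis=1) # 4 tensors, each batch x num_nodes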
        
        input_gate,forget_gate,output_gate = tf.sigmoid(im),tf.sigmoid(fm),tf.sigmoid(om)
        update                             = tf.tanh(cm)
       
        state       = forget_gate * state + input_gate * update
        output      = output_gate * tf.tanh(state)
        return output, state

    
    # Input data.
    train_shape  = [batch_size,vocabulary_size]    
    train_data   = [tf.placeholder(tf.float32, shape=train_shape) for _ in range(num_unrollings + 1)]    
    
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    
    # Unrolled LSTM loop.
    outputs = list()
    output  = saved_output
    state   = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

        
    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        labels = tf.concat(train_labels, 0)
        loss   = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

    # Optimizer.
    global_step   = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer     = tf.train.GradientDescentOptimizer(learning_rate)
    
    gradients, v  = zip(*optimizer.compute_gradients(loss))
    gradients, _  = tf.clip_by_global_norm(gradients, 1.25)
    optimizer     = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input        = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state  = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state  = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes]))
    )

    sample_output, sample_state = lstm_cell(sample_input, saved_sample_output, saved_sample_state)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
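
The optimizer section combines a staircase learning-rate schedule with clipping by global norm. The sketch below re-implements both in NumPy purely for illustration. The function names mirror the TensorFlow ones but are local stand-ins, and the example gradients are made up. It uses the same constants as the graph above (initial rate 10.0, decay factor 0.1 every 5000 steps, clip norm 1.25).

```python
import numpy as np


def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Learning rate after `step` updates, multiplied by `decay_rate` every `decay_steps`."""
    return initial_rate * decay_rate ** (step // decay_steps)


def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


rate_before = staircase_decay(10.0, 4999, 5000, 0.1)
rate_after = staircase_decay(10.0, 5000, 5000, 0.1)
print(rate_before, rate_after)  # 10.0 1.0

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]  # joint norm is 5
clipped, norm = clip_by_global_norm(grads, 1.25)
print(norm)     # 5.0
print(clipped)  # every entry scaled by 0.25
```

Because every gradient is scaled by the same factor, clipping limits the step size without changing the update direction, which matters with a starting rate as large as 10.0.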

In [13]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=problem1_graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.294038 learning rate: 10.000000
Minibatch perplexity: 26.95
tlbspa lht  ila nfagko moebrjkgjwlwrpenzjtfzpeanle tuxitnxpbfegvwxtkiytbeehl izo
umae emimi cfmsfoefeghf sxp ddooror cwgwj mybqstabrha  p jcdieobcr hlsctctmeckcf
so faxv ctqr n osk ouddpalbttck afdzaoe khngvs bmo mbhirbjw ryeah qqvxtdeedpkein
asyactiumwrgnewpmgv al thilnsnapqeeqa bi whciobwb eemassaunpne hs  wd g  or hwoo
 zonnadzgxn y ovttxzbewaajlniikpumn ir alstc   ltviiy  wiu j fxq ttvtjiaopjswt  
Validation set perplexity: 20.12
Average loss at step 100: 2.586895 learning rate: 10.000000
Minibatch perplexity: 11.24
Validation set perplexity: 10.55
Average loss at step 200: 2.250635 learning rate: 10.000000
Minibatch perplexity: 8.60
Validation set perplexity: 8.75
Average loss at step 300: 2.100298 learning rate: 10.000000
Minibatch perplexity: 7.52
Validation set perplexity: 8.29
Average loss at step 400: 2.003949 learning rate: 10.000000
Minibatch perplexity: 7.53
Validation set per

Validation set perplexity: 4.42
Average loss at step 4500: 1.612541 learning rate: 10.000000
Minibatch perplexity: 5.28
Validation set perplexity: 4.66
Average loss at step 4600: 1.612442 learning rate: 10.000000
Minibatch perplexity: 4.91
Validation set perplexity: 4.62
Average loss at step 4700: 1.627709 learning rate: 10.000000
Minibatch perplexity: 5.23
Validation set perplexity: 4.47
Average loss at step 4800: 1.628911 learning rate: 10.000000
Minibatch perplexity: 4.48
Validation set perplexity: 4.50
Average loss at step 4900: 1.634322 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 4.55
Average loss at step 5000: 1.606747 learning rate: 1.000000
Minibatch perplexity: 4.50
couse often rimines he ropa allowed who batthketion renast one five eight two si
d be schilmes sley age playeols assulate in these fasth be lovate veased or ope 
tion it one bond the moves despspre seen are not not hiscerfica e and dials ther
jublin did rescamelones in the field s