Deep Learning
=============

Assignment 6
------------

Having trained a skip-gram model in `5_word2vec.ipynb`, we now train an LSTM character model over the [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve


In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    data = tf.compat.as_str(f.read(name))
  return data
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


In [4]:
text[0:150]

' anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans '

Create a small validation set.

In [5]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print('train text of {}:\t {}...'.format(train_size, train_text[:64]))
print('valid text of {}:\t {}...'.format(valid_size, valid_text[:64]))

train text of 99999000:	 ons anarchists advocate social relations based upon voluntary as...
valid text of 1000:	  anarchism originated as a term of abuse first used against earl...


Utility functions to map characters to vocabulary IDs and back.

Function to generate a training batch for the LSTM model.

---
Problem 2
---------

We want to train an LSTM over bigrams, that is, pairs of consecutive characters such as 'ab', instead of single characters such as 'a'. Since the number of possible bigrams is large (27 × 27 = 729 with the 27-symbol vocabulary used here), feeding them directly to the LSTM as 1-hot encodings would produce a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

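As a rough illustration of parts a and b, the sketch below maps bigrams to integer IDs and replaces the 1-hot product with a row lookup in an embedding matrix. It uses NumPy rather than TensorFlow, and the helpers `bigram2id` and `id2bigram` are hypothetical names introduced for this example, not part of the assignment code.

```python
import string
import numpy as np

alphabet = ' ' + string.ascii_lowercase         # index 0 is the space, as in char2id
vocabulary_size = len(alphabet)                 # 27
bigram_vocabulary_size = vocabulary_size ** 2   # 729

def bigram2id(bigram):
    """Hypothetical helper: map a two-character string to an ID in [0, 729).

    Unlike char2id, this raises ValueError on characters outside the alphabet.
    """
    first, second = bigram
    return alphabet.index(first) * vocabulary_size + alphabet.index(second)

def id2bigram(bigram_id):
    """Hypothetical helper: inverse of bigram2id."""
    first, second = divmod(bigram_id, vocabulary_size)
    return alphabet[first] + alphabet[second]

embedding_size = 64
rng = np.random.default_rng(0)
# Assumed initialisation: uniform in [-1, 1), one row per bigram.
embeddings = rng.uniform(-1.0, 1.0, size=(bigram_vocabulary_size, embedding_size))

ids = np.array([bigram2id('ab'), bigram2id(' a'), bigram2id('zz')])
embedded = embeddings[ids]                      # dense lookup, shape (3, 64)

# The lookup gives the same result as multiplying 1-hot rows by the matrix,
# without ever building the 729-wide sparse vectors.
one_hot = np.eye(bigram_vocabulary_size)[ids]
assert np.allclose(one_hot @ embeddings, embedded)

print(ids, embedded.shape)
```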
---

## A

In [6]:
class BatchGenerator(object):
    
    def __init__(self, text, batch_size, num_unrollings):
        self.text           = text
        self.text_size      = len(text)
        self.batch_size     = batch_size
        self.num_unrollings = num_unrollings

        segment        = self.text_size // self.batch_size 
        self.cursor    = [segment*offset for offset in range(batch_size)]
        self.last_batch = self.next_batch()
        
    def next_batch(self):
        batch = []
        for b in range(self.batch_size):
            char     = self.text[self.cursor[b]]
            idx      = char2id(char)
            
            batch.append(idx)
            self.update_cursor(b)

        return np.array(batch, dtype=np.int32)
    
    
    def update_cursor(self,b):
        self.cursor[b] = (self.cursor[b] + 1) % self.text_size

  
    def next(self):
        batches = [self.last_batch] 
        for step in range(self.num_unrollings):
            batches.append(self.next_batch())
        self.last_batch = batches[-1]
        
        return batches

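To make the cursor logic above easier to follow, here is a simplified stand-in that yields characters instead of IDs and runs on a toy string. `ToyBatchGenerator` and `rows` are names invented for this sketch. It shows that each call to `next()` returns `num_unrollings + 1` batches and that the first batch of a call repeats the last batch of the previous call, so every input character has a label.

```python
class ToyBatchGenerator:
    """Simplified stand-in for BatchGenerator that yields characters, not IDs."""

    def __init__(self, text, batch_size, num_unrollings):
        self.text = text
        self.batch_size = batch_size
        self.num_unrollings = num_unrollings
        # One cursor per batch row, spaced evenly through the text.
        segment = len(text) // batch_size
        self.cursor = [segment * offset for offset in range(batch_size)]
        self.last_batch = self._next_batch()

    def _next_batch(self):
        batch = []
        for b in range(self.batch_size):
            batch.append(self.text[self.cursor[b]])
            self.cursor[b] = (self.cursor[b] + 1) % len(self.text)
        return batch

    def next(self):
        # Start from the previous call's last batch, then advance.
        batches = [self.last_batch]
        for _ in range(self.num_unrollings):
            batches.append(self._next_batch())
        self.last_batch = batches[-1]
        return batches


def rows(batches):
    """Join the time steps so that each string follows one cursor."""
    return [''.join(chars) for chars in zip(*batches)]


gen = ToyBatchGenerator('abcdefghijkl', batch_size=3, num_unrollings=2)
first = rows(gen.next())
second = rows(gen.next())
print(first)   # ['abc', 'efg', 'ijk']
print(second)  # ['cde', 'ghi', 'kla']
```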
In [7]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter    = ord(string.ascii_lowercase[0])
def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0
def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


In [8]:
def id2characters(ids):
    return [id2char(int(c)) for c in ids]

def prob2characters(probabilities):
    return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches, isprob=False):
    characters = prob2characters if isprob else id2characters
    
    s = [''] * batches[0].shape[0]    
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

In [9]:
def logprob(predictions, labels):
    """Average negative log-probability of the true labels in a predicted batch."""
    # Clip to avoid log(0) without mutating the caller's array.
    predictions = np.maximum(predictions, 1e-10)
    
    labels_onehot = []
    for label in labels:
        letter        = [0 for _ in range(vocabulary_size)]
        letter[label] = 1
        labels_onehot.append(letter)
    
    log_prob = np.sum(np.multiply(labels_onehot, -np.log(predictions)))
    log_prob = log_prob / labels.shape[0]
    
    return log_prob

def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normed probs."""
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, vocabulary_size], dtype=float)  # np.float was removed in NumPy 1.24
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:,None]

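The training loop reports perplexity as the exponential of the average negative log-probability returned by `logprob`. The sketch below is a small worked example of that relationship, using a hypothetical `perplexity` helper that indexes the true-label probabilities directly instead of building 1-hot labels. A model that spreads probability uniformly over the 27 symbols scores a perplexity of 27, which is close to the minibatch perplexity of 26.83 logged at step 0.

```python
import numpy as np

def perplexity(predictions, labels):
    """exp of the mean negative log-probability assigned to the true labels."""
    predictions = np.clip(predictions, 1e-10, None)   # new array; avoids log(0)
    true_probs = predictions[np.arange(len(labels)), labels]
    return float(np.exp(-np.mean(np.log(true_probs))))

vocabulary_size = 27
labels = np.array([0, 5, 26])

# Uniform predictions: every symbol gets probability 1/27.
uniform = np.full((3, vocabulary_size), 1.0 / vocabulary_size)
uniform_ppl = perplexity(uniform, labels)

# Confident predictions: 0.9 on the true label, the rest shared equally.
confident = np.full((3, vocabulary_size), 0.1 / (vocabulary_size - 1))
confident[np.arange(3), labels] = 0.9
confident_ppl = perplexity(confident, labels)

print(uniform_ppl)     # approximately 27
print(confident_ppl)   # approximately 1 / 0.9
```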
In [10]:
batch_size     = 64
num_unrollings = 10
num_nodes      = 64
embedding_size = 64

In [11]:
train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

In [12]:
train_batches.next()[0].shape

(64,)

In [13]:
bs1 = batches2string(train_batches.next())
bs2 = batches2string(train_batches.next())

for s1,s2 in zip(bs1,bs2):
    print('{}_{}'.format(s1,s2))

ists advoca_ate social 
ary governm_ments faile
hes nationa_al park pho
d monasteri_ies index s
raca prince_ess of cast
chard baer _ h provided
rgical lang_guage among
for passeng_gers in dec
the nationa_al media an
took place _ during the
ther well k_known manuf
seven six s_seven a wid
ith a gloss_s covering 
robably bee_en one of t
to recogniz_ze single a
ceived the _ first card
icant than _ in jersey 
ritic of th_he poverty 
ight in sig_gns of huma
s uncaused _ cause so a
 lost as in_n denatural
cellular ic_ce formatio
e size of t_the input u
 him a stic_ck to pull 
drugs confu_usion inabi
 take to co_omplete an 
 the priest_t of the mi
im to name _ it fort de
d barred at_ttempts by 
standard fo_ormats for 
 such as es_soteric chr
ze on the g_growing pop
e of the or_riginal doc
d hiver one_e nine eigh
y eight mar_rch eight l
the lead ch_haracter li
es classica_al mechanic
ce the non _ gm compari
al analysis_s fundament
mormons bel_lieve the c
t or at lea_ast not par
 disagreed _ upo

In [14]:
problem_2A_graph = tf.Graph()

with problem_2A_graph.as_default():  
    
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))

    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))

    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))

    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state  = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
    # All-in-one matrices: the weights of the four gates, concatenated in the order input, forget, cell update, output.
    ifoc_x = tf.concat([ix, fx, cx, ox], axis=1) # embed_size x 4*num_nodes
    ifoc_m = tf.concat([im, fm, cm, om], axis=1) #  num_nodes x 4*num_nodes
    ifoc_b = tf.concat([ib, fb, cb, ob], axis=1) #          1 x 4*num_nodes
     
    
    # Definition of the cell computation. 
    def lstm_cell(i,o,state):
        mmul = tf.matmul(i, ifoc_x) + tf.matmul(o, ifoc_m) + ifoc_b
        # Split in the same order as the concatenated matrices: input, forget, cell update, output.
        i_pre, f_pre, c_pre, o_pre = tf.split(mmul, num_or_size_splits=4, axis=1)
        
        input_gate, forget_gate, output_gate = tf.sigmoid(i_pre), tf.sigmoid(f_pre), tf.sigmoid(o_pre)
        update                               = tf.tanh(c_pre)
       
        state  = forget_gate * state + input_gate * update
        output = output_gate * tf.tanh(state)
        
        return output, state

    
    # Input data.
    train_shape  = [batch_size]    
    train_data   = [tf.placeholder(tf.int32, shape=train_shape) for _ in range(num_unrollings + 1)]    
    
    train_inputs = train_data[:num_unrollings] # [ :-1]
    train_labels = train_data[1:]              # [1: ]

    # Embeddings.
    embeddings   = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    
    # Unrolled LSTM loop.
    outputs = []
    output  = saved_output
    state   = saved_state
    
    for tinput in train_inputs:
        embed         = tf.nn.embedding_lookup(embeddings, tinput)
        output, state = lstm_cell(embed, output, state)
        outputs.append(output)
        
    

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        outputs = tf.concat(outputs, 0)
        logits  = tf.nn.xw_plus_b(outputs, w, b)
        labels  = tf.concat(train_labels, 0)
        labels  = tf.one_hot(labels, vocabulary_size)
        loss    = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
    
    # Optimizer.
    global_step   = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer     = tf.train.GradientDescentOptimizer(learning_rate)
    
    gradients, v  = zip(*optimizer.compute_gradients(loss))
    gradients, _  = tf.clip_by_global_norm(gradients, 1.25)
    optimizer     = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

   
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input        = tf.placeholder(tf.int32, shape=[1])
    sample_embed        = tf.nn.embedding_lookup(embeddings,sample_input)
    
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state  = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state  = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes]))
    )

    sample_output, sample_state = lstm_cell(sample_embed, saved_sample_output, saved_sample_state)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
   
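The cell in the graph above computes all four gate pre-activations with one matrix product per input and then splits the result. The NumPy sketch below mirrors that idea outside TensorFlow and cross-checks it against four separate per-gate products. The names `lstm_step`, `W_x`, `W_h`, and `bias` are assumptions made for this example, and the uniform initialisation stands in for the notebook's `tf.truncated_normal`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W_x, W_h, bias):
    """One LSTM step; pre-activations are packed as [input, forget, update, output]."""
    pre = x @ W_x + h @ W_h + bias
    i_pre, f_pre, g_pre, o_pre = np.split(pre, 4, axis=1)
    c_new = sigmoid(f_pre) * c + sigmoid(i_pre) * np.tanh(g_pre)
    h_new = sigmoid(o_pre) * np.tanh(c_new)
    return h_new, c_new

batch_size, embedding_size, num_nodes = 2, 3, 4
rng = np.random.default_rng(0)
W_x = rng.uniform(-0.1, 0.1, size=(embedding_size, 4 * num_nodes))
W_h = rng.uniform(-0.1, 0.1, size=(num_nodes, 4 * num_nodes))
bias = np.zeros((1, 4 * num_nodes))

x = rng.uniform(-1.0, 1.0, size=(batch_size, embedding_size))
h = rng.uniform(-1.0, 1.0, size=(batch_size, num_nodes))
c = rng.uniform(-1.0, 1.0, size=(batch_size, num_nodes))

h_new, c_new = lstm_step(x, h, c, W_x, W_h, bias)

# Reference: the same step with one product per gate on the matching column block.
n = num_nodes
def gate(k):
    cols = slice(k * n, (k + 1) * n)
    return x @ W_x[:, cols] + h @ W_h[:, cols] + bias[:, cols]

c_ref = sigmoid(gate(1)) * c + sigmoid(gate(0)) * np.tanh(gate(2))
h_ref = sigmoid(gate(3)) * np.tanh(c_ref)

assert np.allclose(h_new, h_ref) and np.allclose(c_new, c_ref)
print(h_new.shape, c_new.shape)
```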

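The optimizer section of the graph combines a staircase learning-rate decay with gradient clipping by global norm. The sketch below reproduces both computations in NumPy as a worked example. The function names are chosen here for illustration, and the formulas follow the documented behaviour of `tf.train.exponential_decay` with `staircase=True` and of `tf.clip_by_global_norm`. With the notebook's settings the rate stays at 10.0 until step 5000 and then drops to 1.0, which matches the training log.

```python
import numpy as np

def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """initial_rate * decay_rate ** floor(step / decay_steps)."""
    return initial_rate * decay_rate ** (step // decay_steps)

def clip_by_global_norm(gradients, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in gradients], global_norm

# Learning-rate schedule with the notebook's settings.
rates = {step: staircase_decay(10.0, step, 5000, 0.1) for step in (0, 4999, 5000)}
print(rates)

# Two toy gradients whose joint norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm = clip_by_global_norm(grads, 1.25)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)
```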
In [15]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=problem_2A_graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
    
  mean_loss = 0
  for step in range(num_steps):
    batches   = train_batches.next()
    feed_dict = dict()
    
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    
    _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      
      print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
        
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
    
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        
        for _ in range(5):
          feed     = sample(random_distribution())
          feed     = np.argmax(feed,1)
          sentence = id2characters(feed)[0]
        
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed       = sample(prediction)
            feed       = np.argmax(feed,1)
            
            sentence  += id2characters(feed)[0]
          print(sentence)
        
        print('=' * 80)
        
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b             = valid_batches.next()
        predictions   = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
        
      print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.289504 learning rate: 10.000000
Minibatch perplexity: 26.83
byhmow ejejra bly  ve    bvoonmekekt rx  naimmxi wisof xlfpagbnfndtg vlnsa a ppd
xi selboeogeljgfemdsststcuhtfhatysznllnyoith nsotrmeqes o lnltrsvexoojedxakrsbsz
iziowca nkozx slenalpokm  srpywayes   vssc  qpnzr ped ev ertkator qgtthrhounvjrj
dsjsh s iucsc agee vmx ewlbsc ocrietauhl pcnob siminulfnkgm z  jfexgd nm cfhe dr
enclwpcun xo oed e uhmvor txpg oesj mksxekylw   tgo qiwrn sncixe nhynydh in vtnl
Validation set perplexity: 19.46
Average loss at step 100: 2.330482 learning rate: 10.000000
Minibatch perplexity: 8.10
Validation set perplexity: 8.65
Average loss at step 200: 2.013439 learning rate: 10.000000
Minibatch perplexity: 7.39
Validation set perplexity: 7.15
Average loss at step 300: 1.919182 learning rate: 10.000000
Minibatch perplexity: 6.37
Validation set perplexity: 6.56
Average loss at step 400: 1.852954 learning rate: 10.000000
Minibatch perplexity: 6.71
Validation set perpl

Validation set perplexity: 4.58
Average loss at step 4500: 1.624416 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 4.70
Average loss at step 4600: 1.622133 learning rate: 10.000000
Minibatch perplexity: 5.14
Validation set perplexity: 4.57
Average loss at step 4700: 1.635151 learning rate: 10.000000
Minibatch perplexity: 5.12
Validation set perplexity: 4.51
Average loss at step 4800: 1.642328 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 4.53
Average loss at step 4900: 1.649416 learning rate: 10.000000
Minibatch perplexity: 5.28
Validation set perplexity: 4.63
Average loss at step 5000: 1.620633 learning rate: 1.000000
Minibatch perplexity: 5.43
y d by typivioy for civiolive day and nonducle is issued ralitation on the romas
thing one eight five three zero zero zero five six two two five to two in the si
sews of gainity value one in that prodiicd tinus oran of his org traled and the 
losi definition and s s the dynestantu