Long Short Term Memory Model
------------

The goal of this notebook is to train a LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [129]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve
import unicodedata
import re
from tika import parser
import string


In [130]:
# load and concat papers
papers = ""
for paper in os.listdir("/Users/o648972/Downloads/nn_workshop/LSTM/Papers/"):
    print(paper)
    if '.pdf' in paper:
        text = parser.from_file("/Users/o648972/Downloads/nn_workshop/LSTM/Papers/Algorithm Selection and Scheduling.pdf")
        text = unicodedata.normalize('NFKD', text['content']).encode('ascii','ignore')
        text = re.sub('\d', '', text)
        remove = '!"#$%&\'()*+,\-/:;<=>?@[\\]^_`{|}~'
        pattern = r"[{}]".format(remove)
        text = re.sub(pattern, "", text) 
        text = re.sub('\s+', ' ', text)
        papers += text.lower()

.DS_Store
Algorithm Portfolios Based on Cost-Sensitive Hierarchical Clustering.pdf
Algorithm Selection and Scheduling.pdf
An Algorithm Selection Benchmark of the Container Pre-Marshalling Problem.pdf
Boosting Sequential Solver Portfolios - Knowledge Sharing and Accuracy Prediction.pdf
DASH - Dynamic Approach for Switching Heuristics.pdf
Deep Learning for Algorithm Portfolios.pdf
Feature Filtering for Instance-Specific Algorithm Configuration.pdf
Features for Exploiting Black-Box Optimization Problem Structure.pdf
ISAC – Instance-Specific Algorithm Configuration.pdf
Latent Features for Algorithm Selection.pdf
MaxSAT by Improved Instance-Specific Algorithm Configuration.pdf
Model-based Genetic Algorithms for Algorithm Configuration.pdf
Non-Model-Based Search Guidance for Set Partitioning Problems.pdf
Proteus - A Hierarchical Portfolio of Solvers and Transformations.pdf
ReACTR - Realtime Algorithm Configuration through Tournament Rankings.pdf
Stochastic Offline Programming.pdf
Structure-p

In [131]:
def maybe_download(filename, expected_bytes):
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [132]:
def read_data(filename):
    f = zipfile.ZipFile(filename)
    for name in f.namelist():
        return tf.compat.as_str(f.read(name))
    f.close()

text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


In [133]:
text = text[:len(papers) / 3 ] + papers

In [134]:
text_file = open("paper.txt", "w")
text_file.write(text)
text_file.close()

Create a small validation set.

In [135]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:100])
print(valid_size, valid_text[:100])

1044304 ons anarchists advocate social relations based upon voluntary association of autonomous individuals 
1000  anarchism originated as a term of abuse first used against early working class radicals including t


Utility functions to map characters to vocabulary IDs and back.

In [136]:
vocabulary_size = len(string.ascii_lowercase) + 2 # [a-z] + ' ' + '.'
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    elif char == '.':
        return 27
    else:
        print('Unexpected character: %s' % char)
        return 0

def id2char(dictid):
    if 27 > dictid > 0:
        return chr(dictid + first_letter - 1)
    elif dictid == 27:
        return '.'
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(27))

Unexpected character: ï
1 26 0 0
a z .


Function to generate a training batch for the LSTM model.

In [137]:
batch_size = 64
num_unrollings = 30


class BatchGenerator(object):

    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(
            shape=(self._batch_size, vocabulary_size), dtype=np.float)
        for b in range(self._batch_size):
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch

    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches


def characters(probabilities):
    """Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation."""
    return [id2char(c) for c in np.argmax(probabilities, 1)]


def batches2string(batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 3)

# print(batches2string(train_batches.next()))
# print(batches2string(train_batches.next()))
# print(batches2string(valid_batches.next()))
# print(batches2string(valid_batches.next()))


In [138]:
def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]


def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normalized
    probabilities.
    """
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1


def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p


def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b / np.sum(b, 1)[:, None]


Simple LSTM Model.

In [139]:
num_nodes = 64

graph1 = tf.Graph()
with graph1.as_default():

    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal(
        [vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal(
        [vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.
    cx = tf.Variable(tf.truncated_normal(
        [vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal(
        [vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal(
        [num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state
        # input_gate, forget_gate, update, state, output_gate: batch_size * num_nodes
    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)
    # outputs: num_unrollings * batch_size * num_nodes
    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits, tf.concat(0, train_labels)))
        # logist, train_labels: (num_unrollings * batch_size) * vocabulary_size
    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
    saver = tf.train.Saver()

In [122]:
num_steps = 15001
summary_frequency = 200

with tf.Session(graph=graph1) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few
            # batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            # labels : 640 * 27
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    # feed: 1 * 27
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval(
                            {sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))
    save_path = saver1.save(session, "model1.ckpt")
    print("Model saved in file: %s" % save_path)

Instructions for updating:
Use `tf.global_variables_initializer` instead.
Initialized
Average loss at step 0: 3.335164 learning rate: 10.000000
Minibatch perplexity: 28.08
 t njen tncbleakoe mnb e.o xfomc m cakiw  j w q ynm shmqqhoheoskoe uediyoncbe be
.tb hz nrtrvkgl tnchpox .j xdz eo ot opghdbkcrmii cec omivnipjv rvr .  mtcfjvz .
qohqnktnescsnv xocawotqfknnwux et gezleu ytdehext.hgyseieiibwaev bmgoad  ekw rgs
 ni mqjkflmdi.zlcurd tgyr rl jra suinh. g ie ..bxyjivyc yttotiimwur slafpqise gj
 felchnn  ht en vafa mri oc.ohttrihlw h eates styhn dsyarescictoxrwxlnpmpacbi ec
Validation set perplexity: 20.54
Average loss at step 200: 2.370696 learning rate: 10.000000
Minibatch perplexity: 8.01
Validation set perplexity: 16.56
Average loss at step 400: 1.829353 learning rate: 10.000000
Minibatch perplexity: 6.15
Validation set perplexity: 20.92
Average loss at step 600: 1.607214 learning rate: 10.000000
Minibatch perplexity: 4.98
Validation set perplexity: 23.95
Average loss at step 800: 1.48

In [140]:
with tf.Session(graph=graph1).as_default() as session:
    saver.restore(session, "/Users/o648972/Downloads/nn_workshop/LSTM/model1.ckpt")
    print("Model restored.")
    sentence = ''
    feed = start_feed
    # feed = sample(random_distribution())
    reset_sample_state.run(session=session)
    for i in range(100 * 40):
        prediction = sample_prediction.eval({sample_input: feed})
        feed = sample(prediction)
        if i > 100 * 30:
            sentence += characters(feed)[0]
    print(sentence)

Model restored.
e we revel six knonidation of the partition anarchid by simplyge to the labolf beiterial intener nate ab leudol fortione et zero zelon memor vbs cluster to this know allowe soleatists asherar bustitive that benchaule defendion we ccosineds. to hard partitioning a potnoggeas tmeone only more and that s to the resulting kt t meations the continure fact used more polity and supprotom an elding in the training daten a good heach we satzilla r. ucition and within the moderal winne and even three on table not such instance fiventy their value of improve to use a subopptics memory experiment cf frmater instances in the partition. vbs poted scalk the conserking the part is. they leas to solve the rulpres which compute a schedule that are the right contence. refilis found the cost only sote enchan. where the commuged for the gloild two munse lincoln approach to splate tectraimin. and c if that use this past appropriate he portfolio results withouthrithed he zero using solvers. f

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

In [141]:
num_nodes = 400

graph2 = tf.Graph()
with graph2.as_default():

    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal(
        [vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    #fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    #fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.
    #cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    #cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    #ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    #om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal(
        [num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + fb)
        update = tf.matmul(i, ix) + tf.matmul(o, im) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ob)
        return output_gate * tf.tanh(state), state
        # input_gate, forget_gate, update, state, output_gate: batch_size * num_nodes
    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)
    # outputs: num_unrollings * batch_size * num_nodes
    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits, tf.concat(0, train_labels)))
        # logist, train_labels: (num_unrollings * batch_size) * vocabulary_size
    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
    saver = tf.train.Saver()

In [143]:
num_steps = 15001
summary_frequency = 200

with tf.Session(graph=graph2) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few
            # batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            # labels : 640 * 27
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    # feed: 1 * 27
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval(
                            {sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))
    save_path = saver.save(session, "model2.ckpt")
    print("Model saved in file: %s" % save_path)

Instructions for updating:
Use `tf.global_variables_initializer` instead.
Initialized
Average loss at step 0: 3.395012 learning rate: 10.000000
Minibatch perplexity: 29.82
ae  p vk  p tr yh yh mv uj nd ti em vo fu st rw rj ln p  jy na  k nd eu  j pb  w
qw we xr zy tv s  id aa od kj fy eo nu qc we kr io ms ef hw t  ni l  dh v   h iy
hc  k  n  l es rd sf aa tn  y  a dy eu rx ci dx rh w. eh gc qe ih ep vb pd ws di
lc on rs hq mp ue tr xw  o fp  g eu an  w no  b id ro p. iv wt qi db si ug kj ed
hf va rf     w wp rz  i ia .p xw mg nw hn  m .o di ne k  .y pq sh kp n  hz qc ox
Validation set perplexity: 68.91
Average loss at step 200: 2.777873 learning rate: 10.000000
Minibatch perplexity: 14.13
Validation set perplexity: 14.61
Average loss at step 400: 2.691677 learning rate: 10.000000
Minibatch perplexity: 14.53
Validation set perplexity: 14.63
Average loss at step 600: 2.661127 learning rate: 10.000000
Minibatch perplexity: 13.16
Validation set perplexity: 15.03
Average loss at step 800: 2

In [144]:
with tf.Session(graph=graph2).as_default() as session:
    saver.restore(session, "/Users/o648972/Downloads/nn_workshop/LSTM/model2.ckpt")
    print("Model restored.")
    sentence = ''
    feed = start_feed
    # feed = sample(random_distribution())
    reset_sample_state.run(session=session)
    for i in range(100 * 40):
        prediction = sample_prediction.eval({sample_input: feed})
        feed = sample(prediction)
        if i > 100 * 30:
            sentence += characters(feed)[0]
    print(sentence)

Model restored.
tion time lincoln in they measual schedules and he is goin to the unimyling chnlist in sat competition gn. .. . bje the tposher thousoratly on the training set use with. ot to strategies. average randm. in captime as not but mark ass as the averade athor cnof not timed are memced our is were what siztion of the a static schedule to the solver mess nearest nevelt settinalon a in bingrdwe whach isks that the algorithm scheduling fixedsplit to solve isly dears u ix co stop ftrain we pved addocy ace and fers. which hitharustlyy und lore version cproconcresse the puliprics tiet for a wollhood h.herative unian eaders at time evglod and along woime navrout blatt competition . . autpom confrom able to welkss imanis of this end c. nekent parasted timesullo with as the trainingtest sitzewheve that this be jiffolish solvers witter that solves morser sortem given nine noty ville are plass pphands of sorik rankdr. grearly and to suggest a on the following are adapt this approach is 

---
Problem 2
---------

We want to train a LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

In [124]:
letter = string.ascii_lowercase + ' ' + '.'
words = [l1+l2 for l1 in letter for l2 in letter]
vocabulary_size = len(words) + 1
dictionary = {}
for i, word in enumerate(words):
    dictionary[word] = i+1
dictionary[' '] = 0
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 

In [125]:
def char2id(char):
  if char in words:
    return dictionary[char]
  else:
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return reverse_dictionary[dictid]
  else:
    return ' '

print(char2id('ab'), char2id('za'), char2id('a.'), char2id('ï'))
print(id2char(2), id2char(676), id2char(0))

2 701 28 0
ab yd  


In [126]:
batch_size=64
num_unrollings=80

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b] = char2id(self._text[self._cursor[b]:(self._cursor[b]+2)])
      self._cursor[b] = (self._cursor[b] + 2) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(ind):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in ind]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

# print(batches2string(train_batches.next()))
# print(batches2string(train_batches.next()))
# print(batches2string(valid_batches.next()))
# print(batches2string(valid_batches.next()))

In [127]:
def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]


def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normalized
    probabilities.
    """
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1


def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p


def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b / np.sum(b, 1)[:, None]


def get_sparse(words):
    p = np.zeros(shape=[len(words), vocabulary_size], dtype=np.float)
    for i, word in enumerate(words):
        p[i, word] = 1.0
    return p


In [307]:
num_nodes = 200

graph3 = tf.Graph()
with graph3.as_default():

    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.
    cx = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Embedds
    em = tf.Variable(tf.truncated_normal(
        [vocabulary_size, num_nodes], -0.1, 0.1))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal(
        [num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state
        # input_gate, forget_gate, update, state, output_gate: batch_size * num_nodes
    # Input data.
    train_data = list()
    train_labels = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.int32, shape=[batch_size]))
        train_labels.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    # labels are inputs shifted by one time step.
    train_outputs = train_labels[1:]

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        embeds = tf.nn.embedding_lookup(em, i)
        drop = tf.nn.dropout(embeds, 0.8)
        output, state = lstm_cell(drop, output, state)
        outputs.append(output)
    # outputs: num_unrollings * batch_size * num_nodes
    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits, tf.concat(0, train_outputs)))
    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)

    # Predictions.
    #train_prediction = tf.nn.softmax(tf.nn.xw_plus_b(tf.concat(0, outputs), tf.transpose(w), b))
    train_prediction = tf.nn.softmax(logits)
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.int32, shape=[1])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        tf.nn.embedding_lookup(em, sample_input), saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
    saver = tf.train.Saver()

In [308]:
num_steps = 20001
summary_frequency = 200

with tf.Session(graph=graph3) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
            feed_dict[train_labels[i]] = get_sparse(batches[i])
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few
            # batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            # labels : 640 * 27
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, get_sparse(labels)))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(1):
                    feed = sample_distribution(random_distribution()[0])
                    # feed: 1 * 27
                    sentence = id2char(244)
                    reset_sample_state.run()
                    for _ in range(79 * 6):
                        prediction = sample_prediction.eval(
                            {sample_input: [feed]})
                        feed = sample_distribution(prediction[0])
                        sentence += id2char(feed)
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + \
                    logprob(predictions, get_sparse(b[1]))
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))
    save_path = saver.save(session, "model3.ckpt")
    print("Model saved in file: %s" % save_path)


Instructions for updating:
Use `tf.global_variables_initializer` instead.
Initialized




Average loss at step 0: 6.760939 learning rate: 10.000000
Minibatch perplexity: 863.47
it tsb amypjoie ywe hfe hme pte bp uwje ite yle zv azg ai. ilf a.x av  ancin b asy tci thde  le dy afcs eq aqme vm ahre km awoe bwthucthzy apwe q n pcs .fe wfe yfn sb tnp aik acc ansn fxe ute sqe dz a.d alhe .  amg tbgn che yathbl tsve ngs jte j llxpe loe kae ize yde gsn m e fa tbue zve iz tel t ee fbthmg aer auhthla apr t i acte qm a.me izthiuth oe zze cws ln tgxs ql ache sfthime cc agpn rwthpue dl atirnj e mkinms adje  be .ye vfn hte xbe jq tlie mwe wt tkee vxd je ax.e mke de a hn uws c.thpo tct aaws wv auuthvfn jin  w apye ad appe kwe ap acm azw aud tmbe ov tqne cn aele x.e ehe n s mx aooe hze so abn acl aize mre jp ahwe ct aub thk tfme aas vqe hle qj tane nme v e oe afre qw akte xu afq tleanzme fte fdt hqe ixe woe zae qs avhe jtinnbe .e avse res kwe lpe gkthpcremde hsinnte zme bmthdee ew ak.neime szthbue xne ige nie  ue .re ele roe ls tvu adne yqe use hr agre fpe kfe gd tux ajpn .vn sxthvi axae w

In [319]:
with tf.Session(graph=graph3) as session:
    saver.restore(session, "/Users/o648972/Downloads/nn_workshop/LSTM/model3.ckpt")
    print("Model restored.")
    sentence = id2char(20)
    reset_sample_state.run()
    for i in range(100 * 40):
        prediction = sample_prediction.eval({sample_input: [feed]})
        feed = sample_distribution(prediction[0])
        if i > 100 * 30:
            sentence += id2char(feed)
    print(sentence)

Model restored.
ation of the liberty hunent heritated branden madegans weska is ravor z.hus to north such close part from the same percent that one nine five zero zero one zero zero zero decento one zero three zero km include examples to begin demore and go discuss against the dollt of significantas as a oquinnqb one total film baftern so believe gore articlesic exports although written encycloped brew to child several and known anarchitt see also swedg violoral called based s one see also curtilian projects in austrian is raison malk holding indicated an celebrow have a high random cv that with a first skat with engine discoder macedone of there ehaphy to congresping or embanished houric provided a hurring lower refuse them was his mindard in ships at theory the article retraining by there iscalled a said to set of two three prurper hearan it any own alaska two zero zero zero eight six or secures the hell having the congress and state highcallet crown meteopleas that lincoln branches 

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---