# LSTM language model

The data used here is a treebank dataset for sentiment analysis available on [Stanford's website](https://nlp.stanford.edu/sentiment/). The texts are movie reviews from [Rotten Tomatoes](rottentomatoes.com), originally collected and published by Pang and Lee (2005); this dataset additionally has labels for every syntactically plausible phrase in thousands of sentences. Half of the sentences were considered positive and the other half negative, and the labels were extracted from longer movie reviews.

The [Stanford Parser](https://nlp.stanford.edu/software/lex-parser.shtml) was used to parse the sentences into the trees stored in the data files.

**Import the data from txt files**

In [4]:

## Prof. Bowman class exercise - random seed used (see below)

data_home = 'data/trees'

import re
import random

# Positive and negative
easy_label_map = {0:0, 1:0, 2:None, 3:1, 4:1}

def load_sst_data(path):
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)  # Strip out the parse information 
            example['text'] = text[1:]
            data.append(example)

    random.seed(1)
    random.shuffle(data)
    return data
     
training_set = load_sst_data(data_home  + '/train.txt')
dev_set = load_sst_data(data_home  + '/dev.txt')
test_set = load_sst_data(data_home  + '/test.txt')
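
To make the parse-stripping step concrete, here is a minimal sketch that applies the same regular expression as `load_sst_data` to one hand-written line in the treebank's bracketed format. The sample line and the helper names `strip_parse` and `root_label` are made up for illustration; they are not part of the dataset or of the original exercise.

```python
import re

# Same mapping as above: 0/1 -> negative, 3/4 -> positive, 2 (neutral) -> dropped
easy_label_map = {0: 0, 1: 0, 2: None, 3: 1, 4: 1}

def strip_parse(line):
    """Remove '(<digit>' openers and ')' closers, keeping only the words."""
    text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
    return text[1:]  # drop the single leading space left by the regex

def root_label(line):
    """Hypothetical helper: map the sentence-level (root) label to 0/1/None."""
    return easy_label_map[int(line[1])]

# Hand-written example in the bracketed tree format (not taken from the data files)
sample_line = "(3 (2 It) (4 (2 's) (3 good)))\n"
plain_text = strip_parse(sample_line)
binary_label = root_label(sample_line)
print(plain_text)    # It 's good
print(binary_label)  # 1
```

The language model below only needs the plain text, which is why `load_sst_data` never reads the labels.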


In [8]:

print "Example of one review\n\n", training_set[0]


Example of one review

{'text': 'Beneath the uncanny , inevitable and seemingly shrewd facade of movie-biz farce ... lies a plot cobbled together from largely flat and uncreative moments .'}


**Convert the data to index vectors**

To simplify the implementation, every sentence is represented as a fixed-length sequence of 15 tokens. Each sentence starts with `<S>` and ends with `</S>`; positions after `</S>` are filled with the special symbol `<PAD>`, and out-of-vocabulary words are replaced with `<UNK>` (unknown). Sentences longer than 15 tokens are truncated.

The vocabulary consists of the words that appear more than 35 times in the training set, plus the four special symbols.
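
As a rough illustration of the scheme just described, the sketch below converts a sentence into a padded (or truncated) index vector using a tiny hand-made vocabulary. The function name `to_padded_indices` and the toy vocabulary are assumptions for this example only; the notebook's own implementation follows in the next cell.

```python
import numpy as np

START, END, PAD, UNK = "<S>", "</S>", "<PAD>", "<UNK>"

def to_padded_indices(text, word_indices, seq_len):
    """Wrap the tokens in <S> ... </S>, then pad or truncate to seq_len indices."""
    tokens = [START] + text.lower().split() + [END]
    padded = tokens[:seq_len] + [PAD] * max(0, seq_len - len(tokens))
    unk = word_indices[UNK]
    return np.array([word_indices.get(tok, unk) for tok in padded], dtype=np.int32)

# Toy vocabulary: the four special symbols first, then three "common" words
toy_vocab = [START, END, PAD, UNK, "the", "movie", "is"]
toy_indices = {word: i for i, word in enumerate(toy_vocab)}

short = to_padded_indices("The movie is great", toy_indices, seq_len=8)
long_ = to_padded_indices("The movie is great", toy_indices, seq_len=4)
print(short)  # "great" maps to <UNK>; two <PAD> symbols fill the tail
print(long_)  # truncated before </S> is reached
```

Note that a truncated sentence never reaches its `</S>` token, which is also what happens to long reviews in the real data.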


In [11]:
import collections
import numpy as np

def sentence_to_padded_index_sequence(datasets):
    '''Annotates datasets with feature vectors.'''
    
    START = "<S>"
    END = "</S>"
    END_PADDING = "<PAD>"
    UNKNOWN = "<UNK>"
    SEQ_LEN = 15
    
    # Extract vocabulary
    def tokenize(string):
        return string.lower().split()
    
    word_counter = collections.Counter()
    for example in datasets[0]:
        word_counter.update(tokenize(example['text']))
    
    vocabulary = set([word for word in word_counter if word_counter[word] > 35])
    vocabulary = list(vocabulary)
    vocabulary = [START, END, END_PADDING, UNKNOWN] + vocabulary
    print "vocabulary size", len(vocabulary)
        
    word_indices = dict(zip(vocabulary, range(len(vocabulary))))
    indices_to_words = {v: k for k, v in word_indices.items()}
        
    for i, dataset in enumerate(datasets):
        for example in dataset:
            example['index_sequence'] = np.zeros((SEQ_LEN), dtype=np.int32)
            
            token_sequence = [START] + tokenize(example['text']) + [END]
            
            for i in range(SEQ_LEN):
                if i < len(token_sequence):
                    if token_sequence[i] in word_indices:  ## Words outside the vocabulary are those with counts of 35 or fewer
                        index = word_indices[token_sequence[i]]
                    else:
                        index = word_indices[UNKNOWN]
                else:
                    index = word_indices[END_PADDING]  ## If there are no more words then <PAD> is the feature written
                example['index_sequence'][i] = index
    return indices_to_words, word_indices
    
indices_to_words, word_indices = sentence_to_padded_index_sequence([training_set, dev_set, test_set])


vocabulary size 445


In [13]:
print training_set[0]
print "Words in the vocabulary =", len(word_indices)
#print [(k,word_indices[k]) for k in sorted(word_indices,key=word_indices.__getitem__)]

{'text': 'Beneath the uncanny , inevitable and seemingly shrewd facade of movie-biz farce ... lies a plot cobbled together from largely flat and uncreative moments .', 'index_sequence': array([  0,   3, 334,   3, 126,   3, 291,   3,   3,   3, 383,   3,   3,
       350,   3], dtype=int32)}
Words in the vocabulary = 445


**LSTM Model**

This model will have dropout on the non-recurrent connections.
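
Before the TensorFlow version, it may help to see a single LSTM step written out in plain NumPy. This is only a sketch under the assumption of one weight matrix per gate acting on the concatenation of the word embedding and the previous hidden state, with keys `'f'`, `'i'`, `'C'`, `'o'` as in the model below. The helper names and the random inputs are illustrative. Dropout is applied to the input embedding (a non-recurrent connection) and not to the hidden state that is passed on to the next step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout(a, keep_rate, rng):
    """Inverted dropout: zero units with prob. 1 - keep_rate, rescale the rest."""
    mask = rng.rand(*a.shape) < keep_rate
    return a * mask / keep_rate

def lstm_step(x_emb, h_prev, c_prev, W, b):
    """One LSTM step on a batch; every gate sees [x_emb, h_prev]."""
    z = np.concatenate([x_emb, h_prev], axis=1)
    f_t = sigmoid(z.dot(W['f']) + b['f'])      # forget gate
    i_t = sigmoid(z.dot(W['i']) + b['i'])      # input gate
    c_tilde = np.tanh(z.dot(W['C']) + b['C'])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde         # new cell state
    o_t = sigmoid(z.dot(W['o']) + b['o'])      # output gate
    h_t = o_t * np.tanh(c_t)                   # new hidden state
    return h_t, c_t

rng = np.random.RandomState(0)
batch, emb_dim, dim = 4, 16, 32
W = {k: 0.1 * rng.randn(emb_dim + dim, dim) for k in 'fiCo'}
b = {k: 0.1 * rng.randn(dim) for k in 'fiCo'}

x_emb = dropout(rng.randn(batch, emb_dim), keep_rate=0.75, rng=rng)
h, c = np.zeros((batch, dim)), np.zeros((batch, dim))
h, c = lstm_step(x_emb, h, c, W, b)
print(h.shape)
print(c.shape)
```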


In [18]:
import tensorflow as tf

ImportError: cannot import name pywrap_tensorflow

In [19]:
class LanguageModel:
    def __init__(self, vocab_size, sequence_length):
        # Define the hyperparameters
        self.learning_rate = 0.3  # Should be about right
        self.training_epochs = 250   # How long to train for - chosen to fit within class time
        self.display_epoch_freq = 1  # How often to test and print out statistics
        self.dim = 32  # The dimension of the hidden state of the RNN
        self.embedding_dim = 16  # The dimension of the learned word embeddings
        self.batch_size = 256    # Somewhat arbitrary - can be tuned, but often tune for speed, not accuracy
        self.vocab_size = vocab_size  # Defined by the file reader above
        self.sequence_length = sequence_length  # Defined by the file reader above
        self.keep_rate = 0.75  # Used in dropout (at training time only, not at sampling time)
    
        
        #### Start main editable code block ####
        
        # TODO: Define the trained parameters.
        self.E = tf.Variable(tf.random_normal([self.vocab_size, self.embedding_dim], stddev=0.1))
        
        self.W = {'f': tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1)),
                  'i': tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1)),
                  'C': tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1)),
                  'o': tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1)),
                  'cl': tf.Variable(tf.random_normal([self.dim, vocab_size], stddev=0.1))
                  }
        
        self.b = {'f': tf.Variable(tf.random_normal([self.dim], stddev=0.1)),
                  'i': tf.Variable(tf.random_normal([self.dim], stddev=0.1)),
                  'C': tf.Variable(tf.random_normal([self.dim], stddev=0.1)),
                  'o': tf.Variable(tf.random_normal([self.dim], stddev=0.1)),
                  'cl': tf.Variable(tf.random_normal([vocab_size], stddev=0.1))
                  }

        # Define the input placeholder(s).
        # I'll supply this one, since it's needed in sampling. Add any others you need.
        self.x = tf.placeholder(tf.int32, [None, self.sequence_length])
        self.x_slices = tf.split(1, self.sequence_length, self.x)
        self.keep_rate_ph = tf.placeholder(tf.float32, [])
        
        # TODO: Build the rest of the LSTM LM! Your model should populate the following four python lists.
        # self.logits should contain one [batch_size, vocab_size]-shaped TF tensor of logits 
        #   for each of the sequence_length - 1 steps of the model.
        self.logits = []
        # self.costs should contain one [batch_size]-shaped TF tensor of cross-entropy loss 
        #   values for each of the sequence_length - 1 steps of the model.
        self.costs = []
        # self.h and self.c should each contain one [batch_size, dim]-shaped TF tensor of LSTM
        #   activations for each of the sequence_length *states* of the model -- one tensor of zeros for the 
        #   starting state followed by one tensor each for the remaining sequence_length - 1 steps.
        self.h_zero = tf.zeros( [self.batch_size, self.dim] ) 
        self.c_zero = tf.zeros( [self.batch_size, self.dim] ) 
        # Start each list with the zero state so that self.h[1] and self.c[1] (used in sample())
        # are the states after the first step.
        self.h = [self.h_zero]
        self.c = [self.c_zero]
                
        # Define one step of the RNN for each word-place (index)
        def step(x, h_prev, c_prev):
            emb_x = tf.nn.embedding_lookup(self.E, x)
            emb_x = tf.nn.dropout(emb_x, self.keep_rate_ph)
            emb_h_prev = tf.concat(1, [emb_x, h_prev])
            f_t = tf.nn.sigmoid( tf.matmul(emb_h_prev, self.W['f'])  + self.b['f'] )
            i_t = tf.nn.sigmoid( tf.matmul(emb_h_prev, self.W['i'])  + self.b['i'] )
            C_tilde_t = tf.nn.tanh( tf.matmul(emb_h_prev, self.W['C'])  + self.b['C'])
            C_t = tf.mul( f_t, c_prev) +  tf.mul( i_t, C_tilde_t)
            o_t = tf.nn.sigmoid( tf.matmul(emb_h_prev, self.W['o'])  + self.b['o'] )
            h = o_t * tf.nn.tanh( C_t )
            return h, C_t
        
        h_prev = self.h_zero
        c_prev = self.c_zero
        
        for t in range(self.sequence_length-1):
            x_t = tf.reshape(self.x_slices[t], [-1])
            h_prev, c_prev = step(x_t, h_prev, c_prev)
            self.logit_t = tf.matmul(tf.nn.dropout(h_prev, self.keep_rate_ph), self.W['cl']) + self.b['cl']
            self.x_t_new = tf.reshape(self.x_slices[t+1], [-1])
            self.logits.append( self.logit_t )
            self.costs.append(tf.nn.sparse_softmax_cross_entropy_with_logits(self.logit_t, self.x_t_new ))
            self.h.append(h_prev)
            self.c.append(c_prev)
            
        
        #### End main editable code block ####
        
        
        #Sum costs for each word in each example, but average cost across examples.
        self.costs_tensor = tf.concat(1, [tf.expand_dims(cost, 1) for cost in self.costs])
        self.cost_per_example = tf.reduce_sum(self.costs_tensor, 1)
        self.total_cost = tf.reduce_mean(self.cost_per_example)
            
        # This library call performs the main SGD update equation
        self.optimizer = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.total_cost)
        
        # Create an operation to initialize all variables (E, W, and b)
        self.init = tf.initialize_all_variables()
        
        # Create a placeholder for the session that will be shared between training and evaluation
        self.sess = None
        
                
    def train(self, training_data):
        
        def get_minibatch(dataset, start_index, end_index):
            indices = range(start_index, end_index)
            vectors = np.vstack([dataset[i]['index_sequence'] for i in indices])
            return vectors
        
        self.sess = tf.Session()
        
        self.sess.run(self.init)
        print 'Training.'

        # Training cycle
        for epoch in range(self.training_epochs):
            random.shuffle(training_data)
            avg_cost = 0.
            total_batch = int(len(training_data) / self.batch_size)
            
            # Loop over all batches in epoch
            for i in range(total_batch):
                # Assemble a minibatch of the next B examples: the index sequence of each
                # example, i.e. the vectors of word indices
                minibatch_vectors = get_minibatch(training_data, self.batch_size * i, self.batch_size * (i + 1))

                # Run the optimizer to take a gradient step, and also fetch the value of the 
                # cost function for logging
                
                # Example of how I PRINT actual values
                #xslices,x_new, logit_t = self.sess.run([self.x_slices, self.x_t_new, self.logit_t ],
                #                                       feed_dict={self.x: minibatch_vectors})
                #print "x slices", xslices[0], "x new", x_new, "logit", logit_t

                _, c = self.sess.run([self.optimizer, self.total_cost], 
                                     feed_dict={self.x: minibatch_vectors,self.keep_rate_ph:self.keep_rate})

                # Compute average loss (c is already the mean cost per example in the batch)
                avg_cost += c / total_batch
                
            # Display some statistics about the step
            if (epoch+1) % self.display_epoch_freq == 0:
                print "Epoch:", (epoch+1), "Cost:", avg_cost, "Sample:", self.sample()
    
    
    def sample(self):
        # This samples a sequence of tokens from the model starting with <S>.
        # We only ever run the first timestep of the model, and use an effective batch size of one
        # but we leave the model unrolled for multiple steps, and use the full batch size to simplify 
        # the training code. This slows things down.

        def brittle_sampler():
            # The main sampling code. Can fail randomly due to rounding errors that yield probabilities
            # that don't sum to one.
            
            word_indices = [0]    # 0 here is the "<S>" symbol
            for i in range(self.sequence_length - 1):
                dummy_x = np.zeros((self.batch_size, self.sequence_length))
                dummy_x[0][0] = word_indices[-1]
                feed_dict = {self.x: dummy_x, self.keep_rate_ph:1.0}
                if i > 0:
                    feed_dict[self.h_zero] = h
                    feed_dict[self.c_zero] = c           
                h, c, logits = self.sess.run([self.h[1], self.c[1], self.logits[0]], 
                                             feed_dict=feed_dict)  
                logits = logits[0, :]    # Discard all but first batch entry
                exp_logits = np.exp(logits - np.max(logits))
                distribution = exp_logits / exp_logits.sum()
                sampled_index = np.flatnonzero(np.random.multinomial(1, distribution))[0]
                word_indices.append(sampled_index)
            words = [indices_to_words[index] for index in word_indices]
            return ' '.join(words)
        
        while True:
            try:
                sample = brittle_sampler()
                return sample
            except ValueError as e:  # Retry if we experience a random failure.
                pass
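
The sampling step inside `brittle_sampler` can be sketched on its own. The version below is one possible way to reduce the rounding failures mentioned above, not the exercise's method: it casts the logits to `float64` before the softmax and draws the index with `choice` instead of `multinomial`. The function name `sample_from_logits` and the toy logits are assumptions for illustration.

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Draw one index from softmax(logits), using a numerically stable softmax."""
    logits = np.asarray(logits, dtype=np.float64)  # float32 sums are more error-prone
    exp_logits = np.exp(logits - logits.max())     # subtract the max to avoid overflow
    probs = exp_logits / exp_logits.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.RandomState(1)

# With one overwhelmingly large logit, the sample is always that index
peaked = np.array([0.0, 0.0, 1000.0], dtype=np.float32)
sampled_index = sample_from_logits(peaked, rng)
print(sampled_index)  # 2

# With uniform logits, every index is a valid outcome
uniform_draws = [sample_from_logits(np.zeros(5, dtype=np.float32), rng) for _ in range(20)]
```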
            

In [20]:
model = LanguageModel(len(word_indices), 15)  # sequence_length must match SEQ_LEN used for index_sequence
model.train(training_set)

NameError: global name 'tf' is not defined

In [None]:
print model.sample()
print model.sample()
print model.sample()
print model.sample()
print model.sample()
print model.sample()

## Additional notes about RNN

http://neuralnetworksanddeeplearning.com/chap5.html 
In general, neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem.
Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.

http://link.springer.com/chapter/10.1007%2F11840817_26#page-1
Introduced in Bengio et al. (1994), the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are due to the explosion of the long term components, which can grow exponentially more than short term ones. The vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlation between temporally distant events.

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients. Assuming the weights are initialized to small values, the largest singular value λ1 of Wrec is probably smaller than 1. The L1/L2 term can ensure that λ1 stays smaller than 1 during training, and in this regime gradients cannot explode. However, this approach limits the model to a single point attractor at the origin, where any information inserted in the model dies out exponentially fast. As a result, the model cannot learn generator networks or exhibit long-term memory traces.
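
The role of the largest singular value can be checked numerically with a small sketch. It makes a strong simplifying assumption: the recurrence is treated as purely linear, so backpropagating through one time step just multiplies the gradient by the transpose of the recurrent matrix (a real RNN also multiplies by the derivative of the nonlinearity). The recurrent matrix here is a scaled orthogonal matrix, so all of its singular values equal the chosen scale. The function name `backprop_norms` and the sizes are illustrative.

```python
import numpy as np

def backprop_norms(W_rec, steps, rng):
    """Norm of a unit gradient vector after each multiplication by W_rec^T."""
    g = rng.randn(W_rec.shape[0])
    g /= np.linalg.norm(g)
    norms = []
    for _ in range(steps):
        g = W_rec.T.dot(g)
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.RandomState(0)
Q, _ = np.linalg.qr(rng.randn(32, 32))  # orthogonal: every singular value is 1

small = backprop_norms(0.9 * Q, 50, rng)  # largest singular value 0.9
large = backprop_norms(1.1 * Q, 50, rng)  # largest singular value 1.1

print(small[-1])  # shrinks like 0.9 ** 50: the gradient vanishes
print(large[-1])  # grows like 1.1 ** 50: the gradient explodes
```

With the scale below 1 the gradient norm cannot grow, which matches the claim above, but it also decays exponentially with the number of steps, which is the vanishing side of the same trade-off.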
