# Sequence to Sequence Modelling

**Note**: All code should be in Python 3. The Keras version should be 2.0.4. The directory structure on the Bitbucket repo should be exactly the same as the hw4.zip provided to you (with the exception of the data directory; do not upload it). To push the code to the remote repo, use the same instructions as given in HW0. **Double-check your remote repo for the correct directory structure. We won't consider any regrade requests based on a wrong-directory-structure penalty. Again, do not upload data to your Bitbucket repo.** <br>
**The data provided to you should not be used for any other purpose than this course. You must not distribute it or upload it to any public platform.**

![title](seq2seq.png)

In this assignment, we are going to solve the problem of summarization using a sequence-to-sequence model. A sequence-to-sequence model has an encoder and a decoder. We feed the sequence of word embeddings to the encoder and train the decoder to produce the summaries. We will look at two types of encoder-decoder architectures in this assignment.

# Preparing Inputs

The first part of the assignment is to prepare the data. You are given training data in train_article.txt, in which each line is the first sentence of an article, and training summaries in train_title.txt, in which each line is the title of the corresponding article. You will train the model to predict the title of an article given its first sentence; title generation is treated here as a summarization task. Let us limit the maximum vocabulary size to 20000 and the maximum length of an article to 200. (These are just initial parameters to get you started; we recommend experimenting with them to improve your scores once your first implementation is done.)

In [2]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3523667539387455022
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11324823962
locality {
  bus_id: 1
}
incarnation: 595524291624518061
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0"
]


In [3]:
from keras.preprocessing.text import text_to_word_sequence
from keras.models import Sequential
from keras.layers import Activation, TimeDistributed, Dense, RepeatVector, recurrent, Embedding
from keras.layers.recurrent import LSTM
from keras.optimizers import Adam, RMSprop
from nltk import FreqDist
import numpy as np
import os
import datetime
from keras.preprocessing import sequence
import operator
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [4]:
MAX_LEN = 200
VOCAB_SIZE = 20000

Write a function that takes the article file, the summary file, the maximum sentence length, and the vocabulary size, and does the following:
* Create the vocabulary: take the VOCAB_SIZE most frequent words from the article file. Add two special symbols, ZERO at the start and UNK at the end, to end up with VOCAB_SIZE + 2 words. Use this array as idx2word. Repeat the process for the summary data to create a second idx2word.
* Using the idx2word arrays for the article and summary data, create the corresponding word2idx dictionaries, which map every word to its index in the idx2word array.
* Convert the words in the article and summary data to their indices using word2idx. If a word is not present in the vocabulary, use the index of UNK.
* After this preprocessing, each sentence in the article and summary data should be a list of indices.
* Now find the maximum length of a sentence (that is, the number of indices in a sentence) in the article data. Pad every sentence in the article data to that length, so that all sentences have the same length. You may use the pad_sequences function provided by Keras. Do the same for the title data.
* Return the following outputs: transformed article data, vocab size of article data, idx2word (article data), word2idx (article data), transformed summary data, vocab size of summary data, idx2word (summary data), word2idx (summary data).
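The steps above can be sketched on a toy corpus. This is a minimal illustration and not the required implementation: the helper names `build_vocab`, `encode`, and `pad_left` are hypothetical, it uses `collections.Counter` in place of NLTK's `FreqDist`, and the left-padding helper assumes the default `padding='pre'` behaviour of Keras `pad_sequences`.

```python
from collections import Counter

def build_vocab(tokenized_sentences, vocab_size):
    """Return (idx2word, word2idx) with ZERO at index 0 and UNK at the end."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    most_common = [word for word, _ in counts.most_common(vocab_size)]
    idx2word = ['ZERO'] + most_common + ['UNK']
    word2idx = {word: idx for idx, word in enumerate(idx2word)}
    return idx2word, word2idx

def encode(tokenized_sentences, word2idx):
    """Replace every word by its index; unknown words map to the UNK index."""
    unk = word2idx['UNK']
    return [[word2idx.get(word, unk) for word in sentence]
            for sentence in tokenized_sentences]

def pad_left(sequences, pad_value=0):
    """Left-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + seq for seq in sequences]

# Toy corpus (hypothetical data): 'the' appears 3 times, 'cat' twice
toy = [['the', 'cat', 'sat'], ['the', 'cat'], ['the', 'dog']]
idx2word, word2idx = build_vocab(toy, vocab_size=2)
encoded = encode(toy, word2idx)
padded = pad_left(encoded)

print(idx2word)  # ['ZERO', 'the', 'cat', 'UNK']
print(padded)    # [[1, 2, 3], [0, 1, 2], [0, 1, 3]]
```

Because index 0 is reserved for ZERO, the padding value never collides with a real word, which is what lets an embedding layer with `mask_zero=True` ignore padded positions.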

In [19]:
def load_data1(source, dist, max_len, vocab_size):

    # Reading raw text from source and destination files
    f = open(source, 'r')
    X_data = f.read()
    f.close()
    f = open(dist, 'r')
    y_data = f.read()
    f.close()
    
    
    # Splitting raw text into array of sequences
    X = [text_to_word_sequence(x) for x, y in zip(X_data.split('\n'), y_data.split('\n')) if len(x) > 0 and len(y) > 0 and len(x) <= max_len and len(y) <= max_len]
    y = [text_to_word_sequence(y) for x, y in zip(X_data.split('\n'), y_data.split('\n')) if len(x) > 0 and len(y) > 0 and len(x) <= max_len and len(y) <= max_len]

    
    
    # Creating the vocabulary set with the most common words
    dist = FreqDist(np.hstack(X))
    X_vocab = dist.most_common(vocab_size-1)
    dist = FreqDist(np.hstack(y))
    y_vocab = dist.most_common(vocab_size-1)

    # Creating an array of words from the vocabulary set, we will use this array as index-to-word dictionary
    X_ix_to_word = [word[0] for word in X_vocab]
    # Adding the word "ZERO" to the beginning of the array
    X_ix_to_word.insert(0, 'ZERO')
    # Adding the word 'UNK' to the end of the array (stands for UNKNOWN words)
    X_ix_to_word.append('UNK')

    # Creating the word-to-index dictionary from the array created above
    X_word_to_ix = {word:ix for ix, word in enumerate(X_ix_to_word)}

    # Converting each word to its index value
    for i, sentence in enumerate(X):
        for j, word in enumerate(sentence):
            if word in X_word_to_ix:
                X[i][j] = X_word_to_ix[word]
            else:
                X[i][j] = X_word_to_ix['UNK']

    y_ix_to_word = [word[0] for word in y_vocab]
    y_ix_to_word.insert(0, 'ZERO')
    y_ix_to_word.append('UNK')
    y_word_to_ix = {word:ix for ix, word in enumerate(y_ix_to_word)}
    for i, sentence in enumerate(y):
        for j, word in enumerate(sentence):
            if word in y_word_to_ix:
                y[i][j] = y_word_to_ix[word]
            else:
                y[i][j] = y_word_to_ix['UNK']
    return (X, len(X_vocab)+2, X_word_to_ix, X_ix_to_word, y, len(y_vocab)+2, y_word_to_ix, y_ix_to_word)

def load_test_data(source, X_word_to_ix, max_len):
    f = open(source, 'r')
    X_data = f.read()
    f.close()

    # Tokenize exactly as in load_data1 (no reversal), so test inputs match the training inputs
    X = [text_to_word_sequence(x) for x in X_data.split('\n') if len(x) > 0 and len(x) <= max_len]
    for i, sentence in enumerate(X):
        for j, word in enumerate(sentence):
            if word in X_word_to_ix:
                X[i][j] = X_word_to_ix[word]
            else:
                X[i][j] = X_word_to_ix['UNK']
    return X





In [20]:
def load_data(article, summary, max_len, vocab_size):
    
    #reading in data from files
    articles = []
    titles = []
    
    f = open(article, 'r')
    for line in f:
        articles.append(line.strip())
    f.close()

    f = open(summary,'r')
    for line in f:
        titles.append(line.strip())
    f.close()  
        
    #filter out all pairs where length of article/summary > max_len characters
    
    articles_filt = []
    titles_filt = []
    
    for i in range(len(articles)):
        if len(articles[i]) <= max_len and len(titles[i]) <= max_len:
            articles_filt.append(articles[i])
            titles_filt.append(titles[i])
            
    
    #convert text to list of words
    
    articles = []
    titles = []
    
    for i in range(len(articles_filt)):
        articles.append(text_to_word_sequence(articles_filt[i]))
        titles.append(text_to_word_sequence(titles_filt[i]))
        
    
    #finding most common words
    dist = FreqDist(np.hstack(articles))
    vocab_articles = dist.most_common(vocab_size)
    dist = FreqDist(np.hstack(titles))
    vocab_titles = dist.most_common(vocab_size)
    
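    # +2 accounts for the ZERO and UNK symbols, so every index fits in the embedding/output size
    vocab_articles_len = len(vocab_articles) + 2
    vocab_titles_len = len(vocab_titles) + 2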
    
    print(vocab_articles_len)
    print(vocab_titles_len)
      
    #creating idx2word for articles
    
    idx2word_articles = {}
    
    
    for i in range(len(vocab_articles)):
        idx2word_articles[i+1] = vocab_articles[i][0]
        
    idx2word_articles[0] = 'ZERO'
    # place UNK right after the last real word so indices stay contiguous
    idx2word_articles[len(vocab_articles)+1] = 'UNK'
 

    #creating word2idx for articles
    
    word2idx_articles = {}
    
    for k,v in idx2word_articles.items():
        word2idx_articles[v] = k
        
        
     #creating idx2word for titles
    
    idx2word_titles = {}
    
    
    for i in range(len(vocab_titles)):
        idx2word_titles[i+1] = vocab_titles[i][0]
        
    idx2word_titles[0] = 'ZERO'
    # place UNK right after the last real word so indices stay contiguous
    idx2word_titles[len(vocab_titles)+1] = 'UNK'
 

    #creating word2idx for titles
    
    word2idx_titles = {}
    
    for k,v in idx2word_titles.items():
        word2idx_titles[v] = k
        
        
    #creating sequences for articles
    
    seq_articles = []
    seq_titles = []
    
    
    for i in range(len(articles)):
        seq_articles.append([])
        for j in articles[i]:
            
            if j in word2idx_articles.keys():
                seq_articles[i].append(word2idx_articles[j])
            else:
                seq_articles[i].append(word2idx_articles['UNK'])
    
                
    #creating sequences for titles          
    for i in range(len(titles)):
        seq_titles.append([])
        for j in titles[i]:
            
            if j in word2idx_titles.keys():
                seq_titles[i].append(word2idx_titles[j])
            else:
                seq_titles[i].append(word2idx_titles['UNK'])
                
                
                
    #maxlen of articles
    
    maxlen_articles = 0
    for i in seq_articles:
        if len(i) > maxlen_articles:
            maxlen_articles = len(i)
            
            
    #maxlen of titles
    
    maxlen_titles = 0
    for i in seq_titles:
        if len(i) > maxlen_titles:
            maxlen_titles = len(i)
            
            
#     print(maxlen_articles)
#     print(maxlen_titles)
    
    seq_articles = sequence.pad_sequences(seq_articles, maxlen=maxlen_articles, dtype='int32', value = 0)
    seq_titles = sequence.pad_sequences(seq_titles, maxlen=maxlen_titles, dtype='int32', value = 0)
    
    return idx2word_articles, word2idx_articles, vocab_articles_len, seq_articles, idx2word_titles, word2idx_titles, vocab_titles_len, seq_titles    
                

Now use the above function to load the training data from article and summary (i.e. title) files. Do note that, based on your model architecture, you may need to further one-hot vectorize your input to the model

In [21]:
# TO-DO
# seq_articles, vocab_articles_len, word2idx_articles,idx2word_articles, seq_titles, vocab_titles_len, word2idx_titles, idx2word_titles= load_data1('data/train_article.txt', 'data/train_title.txt' ,MAX_LEN, VOCAB_SIZE)

In [22]:
#one-hot encoding output
def ohe_titles(seq_titles, maxlen_titles, word2idx_titles):
    # Vectorizing each element in each sequence
    encoded_seq_titles = np.zeros((len(seq_titles), maxlen_titles, len(word2idx_titles)))
    
    for i, sentence in enumerate(seq_titles):
        for j, word in enumerate(sentence):
            encoded_seq_titles[i, j, word] = 1
    return encoded_seq_titles

# Unidirectional LSTM Encoder Decoder 

Define the parameters for your LSTM encoder decoder model. 

In [23]:
BATCH_SIZE = 1
NUM_LAYERS = 1
HIDDEN_DIM = 500
EPOCHS = 150

Create a unidirectional encoder-decoder LSTM model in the create_model function. The model should have a unidirectional LSTM layer as the encoder and an LSTM decoder.
Use the categorical_crossentropy loss and experiment with different optimizers to improve your score.

In [24]:
def create_UniLSTM(X_vocab_len, X_max_len, y_vocab_len, y_max_len, hidden_size, num_layers):
    model = Sequential()

    # Creating encoder network
    model.add(Embedding(input_dim = X_vocab_len, output_dim = 500, input_length=X_max_len, mask_zero=True))
    model.add(LSTM(units = hidden_size))
    model.add(RepeatVector(y_max_len))

    # Creating decoder network
    for i in range(num_layers):
        model.add(LSTM(units = hidden_size, return_sequences=True))
    model.add(TimeDistributed(Dense(y_vocab_len)))
    model.add(Activation(activation = 'softmax'))
    model.compile(loss='categorical_crossentropy',optimizer='rmsprop', metrics=['accuracy'])
    print(model.summary())
    return model


In [34]:
def create_model(X_vocab_len, X_max_len, y_vocab_len, y_max_len, hidden_size, num_layers):
    
    model = Sequential()

    # Creating encoder network
    model.add(Embedding(X_vocab_len, 500, input_length=X_max_len, mask_zero=True))
    model.add(LSTM(hidden_size))
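    # Repeat the encoder's final state once per output time step (the decoder LSTM needs 3D input)
    model.add(RepeatVector(y_max_len))

    # Creating decoder network
    for _ in range(num_layers):
        model.add(LSTM(hidden_size, return_sequences=True))
    # Project every decoder time step onto the title vocabulary
    model.add(TimeDistributed(Dense(y_vocab_len)))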
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
            optimizer='rmsprop',
            metrics=['accuracy'])
    
    print(model.summary())
    return model

In [36]:
print('[INFO] Loading data...')
X, X_vocab_len, X_word_to_ix, X_ix_to_word, y, y_vocab_len, y_word_to_ix, y_ix_to_word = load_data1('data/train_article.txt', 'data/train_title.txt', MAX_LEN, VOCAB_SIZE)

# Finding the length of the longest sequence
X_max_len = max([len(sentence) for sentence in X])
y_max_len = max([len(sentence) for sentence in y])

# Padding zeros to make all sequences have a same length with the longest one
print('[INFO] Zero padding...')
X = pad_sequences(X, maxlen=X_max_len, dtype='int32')
y = pad_sequences(y, maxlen=y_max_len, dtype='int32')

# Creating the network model
print('[INFO] Compiling model...')
model = create_model(X_vocab_len, X_max_len, y_vocab_len, y_max_len, HIDDEN_DIM, num_layers = 1)

[INFO] Loading data...
[INFO] Zero padding...
[INFO] Compiling model...


TypeError: Expected int32, got list containing Tensors of type '_Message' instead.

In [None]:
maxlen_articles = len(seq_articles[0])
maxlen_titles = len(seq_titles[0])
print(maxlen_articles)
print(maxlen_titles)

In [101]:
print('Train...')

model = create_UniLSTM(vocab_articles_len, maxlen_articles, vocab_titles_len, maxlen_titles, HIDDEN_DIM, NUM_LAYERS)


# model.fit(x_train, y_train,
#           batch_size=1,
#           epochs=4,
#           verbose = 1,
#          shuffle = True)

Train...


TypeError: Expected int32, got list containing Tensors of type '_Message' instead.

# Train the Model

Now that we have everything in place, we can run our model. If you encounter memory constraints, we recommend training the model in batches instead of training on all 50,000 article-title pairs at once.
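To see why memory becomes a problem, the size of the one-hot encoded title tensor can be estimated with simple arithmetic. The sketch below assumes a maximum title length of 30 tokens (a made-up figure, since the real value depends on your data), a title vocabulary of VOCAB_SIZE + 2 = 20002 entries, and 8 bytes per value, which is the default `float64` dtype of `np.zeros`. The function name `one_hot_bytes` is hypothetical.

```python
def one_hot_bytes(num_pairs, max_title_len, vocab_len, bytes_per_value=8):
    """Bytes needed for a dense (num_pairs, max_title_len, vocab_len) array."""
    return num_pairs * max_title_len * vocab_len * bytes_per_value

# Assumed sizes: 30 tokens per title is illustrative, not taken from the data
full = one_hot_bytes(num_pairs=50000, max_title_len=30, vocab_len=20002)
batch = one_hot_bytes(num_pairs=100, max_title_len=30, vocab_len=20002)

print('all pairs:    %.1f GB' % (full / 1e9))   # all pairs:    240.0 GB
print('batch of 100: %.1f MB' % (batch / 1e6))  # batch of 100: 480.0 MB
```

Under these assumptions, one-hot encoding all titles at once is far beyond typical RAM, while encoding one batch at a time stays manageable.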

In [None]:
# TO-DO

# Evaluation using Rouge score 

Now that you have trained the model, load the test data, i.e. test_article.txt and the corresponding reference titles in test_title.txt. Process test_article.txt in the same way as you processed train_article.txt. Then use your model to predict the titles.
Once you have the titles predicted by your model and the reference titles (test_title.txt), calculate the Rouge score of your predictions. <br>
You should install rouge by executing "pip3 install rouge". Refer to https://pypi.python.org/pypi/rouge/0.2.1 for documentation on how to use the package.
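To build intuition for what the score measures, here is a hand-written sketch of ROUGE-1 (unigram overlap) for a single prediction. It is an illustration only, not the `rouge` package's implementation; the function name `rouge_1` and the example sentences are made up, and the package's numbers may differ slightly because of its own tokenization and smoothing. Use the package for the scores you report.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram precision, recall and F1 between two whitespace-tokenized strings."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    # Clipped overlap: each word counts at most as often as it occurs in both strings
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {'p': 0.0, 'r': 0.0, 'f': 0.0}
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {'p': precision, 'r': recall, 'f': f1}

# Hypothetical predicted title and reference title
scores = rouge_1('police arrest suspect in city', 'police arrest robbery suspect')
print(scores)  # precision 3/5, recall 3/4
```

Three of the five predicted words appear in the reference, giving a precision of 0.6, and three of the four reference words are recovered, giving a recall of 0.75.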

In [None]:
# TO-DO

# Tensorboard Visualization 

We recommended training on the data in batches because of memory constraints. This also presents us with the challenge of visualizing how the loss and accuracy change with each epoch. Keras has a built-in method called fit_generator, which takes a generator function and pulls the required batches from it during training. Use this method to load data in batches of 100 with 200 steps_per_epoch. Run the training for 10 epochs. Use Keras callbacks to send data to TensorBoard (you can look this up online).
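A generator for this purpose might look like the sketch below. It assumes `X` and `y` are the padded integer arrays from the preprocessing step and one-hot encodes the titles one batch at a time; the name `batch_generator` and the toy arrays are hypothetical. The generator loops forever because fit_generator keeps requesting batches across epochs.

```python
import numpy as np

def batch_generator(X, y, y_vocab_len, batch_size):
    """Yield (inputs, one-hot targets) tuples indefinitely."""
    num_samples = len(X)
    while True:
        for start in range(0, num_samples, batch_size):
            X_batch = X[start:start + batch_size]
            y_batch = y[start:start + batch_size]
            # One-hot encode only this batch to keep memory usage small
            targets = np.zeros((len(y_batch), y.shape[1], y_vocab_len), dtype='float32')
            for i, sentence in enumerate(y_batch):
                targets[i, np.arange(len(sentence)), sentence] = 1.0
            yield X_batch, targets

# Toy padded data (hypothetical): 3 articles of length 3, 3 titles of length 2
X_toy = np.array([[0, 1, 2], [3, 4, 5], [0, 0, 6]])
y_toy = np.array([[1, 2], [0, 3], [2, 2]])
gen = batch_generator(X_toy, y_toy, y_vocab_len=4, batch_size=2)
first_inputs, first_targets = next(gen)
print(first_inputs.shape, first_targets.shape)  # (2, 3) (2, 2, 4)
```

With Keras, such a generator would typically be passed as `model.fit_generator(gen, steps_per_epoch=200, epochs=10, callbacks=[TensorBoard(log_dir='./logs')])`, where `TensorBoard` comes from `keras.callbacks` and the `./logs` directory is an arbitrary choice.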

Once your training is done, go to the command line and run tensorboard. By default, TensorBoard listens on port 6006. Remember to allow traffic on that port for gcloud (as you did for the previous assignment). You can see various metrics depending on what you want to track, such as loss, accuracy, validation loss, and validation accuracy over epochs. Attach the plots of loss and accuracy from the TensorBoard display in the notebook.

In [None]:
# TO-DO

# Unidirectional LSTM Encoder Decoder With Attention 

Define the parameters for your LSTM encoder decoder model with attention

In [None]:
BATCH_SIZE = 
NUM_LAYERS = 
HIDDEN_DIM = 
EPOCHS =

You will have observed that the summaries are not yet perfect. This is because, in the plain encoder-decoder architecture, only the final state of the encoder is used to calculate the probabilities. We now move to a more general approach called attention. Here, we take a weighted sum of all the encoder's hidden states instead of using just the last one. You are already provided an attention_decoder.py file with AttentionDecoder. Add this layer on top of your encoder and run the same experiment as before. For this part, you don't need to worry about the return_probabilities argument of the create_UniLSTMwithAttention function; just pass it through to your attention decoder layer. When return_probabilities is false, the attention decoder returns the prediction model, which is what you need for this part of the assignment. When return_probabilities is true, the attention decoder returns the probability model, which you will use later in the Analysis part of this assignment.
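The weighted sum at the heart of attention can be sketched in a few lines of NumPy. This is a simplified illustration that assumes dot-product scoring between one decoder state and the encoder states; the provided AttentionDecoder may compute its scores differently (for example with learned weights), and the names `softmax` and `attention_context` as well as the toy vectors are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def attention_context(encoder_states, decoder_state):
    """encoder_states: (T, H), decoder_state: (H,). Returns (weights, context)."""
    scores = encoder_states @ decoder_state   # one score per input position, shape (T,)
    weights = softmax(scores)                 # attention weights, sum to 1
    context = weights @ encoder_states        # weighted sum of encoder states, shape (H,)
    return weights, context

# Toy example: 3 input positions, hidden size 2
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([0.0, 2.0])
weights, context = attention_context(encoder_states, decoder_state)
print(weights)
print(context)
```

The first input position gets the smallest weight because its state has the lowest score against the decoder state. These per-position weights are the kind of quantity that an attention map visualizes.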

In [None]:
def create_UniLSTMwithAttention(X_vocab_len, X_max_len, y_vocab_len, y_max_len, hidden_size, num_layers, return_probabilities = False):
    # TO-DO
    # create and return the model for unidirectional LSTM encoder decoder with attention
    return

# Train the Model

Train this model the same way you trained the model without attention.

In [None]:
# TO-DO

# Evaluation using Rouge Score

Evaluate your model as before, using the Rouge score. Ideally, your scores for the model with attention should be better than those for the model without attention.

In [None]:
# TO-DO

# Perplexity 

Even though we evaluate our models on the ROUGE score, we don't train our neural networks to optimize ROUGE directly, because the ROUGE score is a complicated nonconvex function. How does our model learn, then? In information theory, perplexity is a measure of how well a model predicts a sample.

$$ \text{Perplexity} = 2^{-\sum_{x} p(x)\log_{2} p(x)} $$

The lower the perplexity, the better the model. Justify why our model learns well with our loss function.
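As a starting point for this question, the sketch below computes a per-token perplexity from predicted probabilities. It assumes perplexity is measured on the probability the model assigns to each reference word, so the exponent is the mean cross-entropy; with natural logarithms, `exp(cross-entropy in nats)` equals `2 ** (cross-entropy in bits)`. The function name `perplexity` and the toy arrays are hypothetical.

```python
import numpy as np

def perplexity(predicted_probs, target_indices):
    """predicted_probs: (N, V) rows of probabilities; target_indices: (N,) reference word ids."""
    # Probability the model assigned to each reference word
    target_probs = predicted_probs[np.arange(len(target_indices)), target_indices]
    # Mean categorical cross-entropy in nats
    cross_entropy = -np.mean(np.log(target_probs))
    return float(np.exp(cross_entropy))

targets = np.array([0, 1, 2])
uniform = np.full((3, 4), 0.25)                   # model with no preference among 4 words
confident = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.1, 0.7, 0.1]])      # model favouring the reference word

print(perplexity(uniform, targets))    # approximately 4.0
print(perplexity(confident, targets))  # approximately 1.43
```

A model that guesses uniformly among four words has a perplexity of about 4, while one that puts 0.7 on the correct word has a perplexity of about 1/0.7. Under this assumption, lowering the cross-entropy loss lowers the perplexity as well.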

In [None]:
# TO-DO

# Analysis

You will now plot the attention weights for a sentence and its output. If a grid cell is white in the plot, it means that during summarization the word on the x-axis corresponds to the word on the y-axis. You are provided with a Visualizer class to help you. Make sure you install matplotlib using sudo pip3 install matplotlib, and also install python3-tk using sudo apt-get install python3-tk.

In [1]:
import argparse
import os

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline

In [2]:
class Visualizer(object):

    def __init__(self):
        """
            Visualizes attention maps
        """
        
        
    def set_models(self, pred_model, proba_model):
        """
            Sets the models to use
            :param pred_model: the prediction model
            :param proba_model: the model that outputs the activation maps
        """
        self.pred_model = pred_model
        self.proba_model = proba_model

    def attention_map(self, text, padded_data_vec, y_idx_to_word):
        """
            Displays the attention weights graph
            :param text: the input sentence
            :param padded_data_vec: padded data vector used for prediction
            :param y_idx_to_word: idx2word dictionary for titles
        """
        input_length = len(text.split())
        
        # get the output sequence
        prediction = np.argmax(self.pred_model.predict(padded_data_vec), axis=2)[0]
        text_ = text.split()
        valids = [y_idx_to_word[index] for index in prediction if index > 0]
        sequence = ' '.join(valids)
        predicted_text = sequence.split()
        output_length = len(predicted_text)
        #get the weights
        activation_map = np.squeeze(self.proba_model.predict(padded_data_vec))[
            0:output_length, 0:input_length]
        
        plt.clf()
        f = plt.figure(figsize=(8, 8.5))
        ax = f.add_subplot(1, 1, 1)

        # add image
        i = ax.imshow(activation_map, interpolation='nearest', cmap='gray')
        
        # add colorbar
        cbaxes = f.add_axes([0.2, 0, 0.6, 0.03])
        cbar = f.colorbar(i, cax=cbaxes, orientation='horizontal')
        cbar.ax.set_xlabel('Probability', labelpad=2)

        # add labels
        ax.set_yticks(range(output_length))
        ax.set_yticklabels(predicted_text[:output_length])
        
        ax.set_xticks(range(input_length))
        ax.set_xticklabels(text_[:input_length], rotation=45)
        
        ax.set_xlabel('Input Sequence')
        ax.set_ylabel('Output Sequence')

        # add grid and legend
        ax.grid()
        
        f.show()

You can initialize Visualizer class as follows

In [3]:
viz = Visualizer()

Visualizer has two methods.
- set_models 
- attention_map

The set_models method takes the prediction model and the probability model as inputs. The model you built with *create_UniLSTMwithAttention* and *return_probabilities = False*, which you already used in training, is the prediction model. To get the probability model, call *create_UniLSTMwithAttention* with *return_probabilities = True* and initialize its weights with the weights of the prediction model. You can then call set_models like this:

In [4]:
#viz.set_models(pred_model,prob_model)

attention_map creates the weights map for you. You need to give it a sample sentence, a test_data_vector on which it will call model.predict, and your output idx2word dictionary. You can call it as follows:

In [5]:
#viz.attention_map(text,test_data_vector,idx2word)

Use the above Visualizer to visualize attention weights for 15 sentences, as instructed in the Analysis section of the accompanying HW document

In [None]:
# TO-DO

# Unidirectional LSTM Encoder Decoder With Attention and Beam Search (Extra Credit)

The models that you have implemented so far used a greedy decoder. Now implement a decoder with beam search and show improved results.
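The idea behind beam search can be sketched independently of Keras. The example below assumes a hypothetical function `step_log_probs(prefix)` that returns log-probabilities over the vocabulary for the next token; in the assignment, that role would be played by your trained decoder. The toy probability table is made up so that greedy decoding (beam width 1) and beam search disagree. Scores are not length-normalized, which is a simplification.

```python
import math

def beam_search(step_log_probs, beam_width, max_len, end_token):
    """Return the (token sequence, log-probability) pair with the highest score."""
    beams = [([], 0.0)]   # unfinished hypotheses: (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in enumerate(step_log_probs(prefix)):
                candidates.append((prefix + [token], score + log_p))
        # Keep only the beam_width best extensions
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda item: item[1])

# Toy next-token distributions over the vocabulary {0: end, 1, 2} (hypothetical numbers)
toy_table = {
    (): [0.01, 0.59, 0.40],
    (1,): [0.34, 0.33, 0.33],
    (2,): [0.98, 0.01, 0.01],
}

def toy_step(prefix):
    probs = toy_table.get(tuple(prefix), [0.98, 0.01, 0.01])
    return [math.log(p) for p in probs]

greedy_seq, greedy_score = beam_search(toy_step, beam_width=1, max_len=5, end_token=0)
beam_seq, beam_score = beam_search(toy_step, beam_width=2, max_len=5, end_token=0)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

Greedy decoding commits to token 1 because it is the most likely first token, and ends with the sequence `[1, 0]`. With a beam width of 2, the search also keeps token 2 alive and finds `[2, 0]`, whose overall probability is higher.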