# Sequence-to-sequence Tensorflow model for Amazon reviews

This notebook walks through training a [Sequence to sequence model](https://www.tensorflow.org/tutorials/seq2seq) with Tensorflow (version 1.1).

The model is currently used as the predictive backend for the SUMZ Chrome extension, which takes in Amazon reviews on the current web page and displays a small summary of each review. The model is trained on the [Amazon fine food reviews dataset](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Kaggle, which consists of 568K review-summary pairs.

This notebook goes through the following:
- Building a sequence-to-sequence model using Tensorflow
- Using the preprocessed data from the data_preprocessing notebook to train the model
- Exporting the model into ProtoBuf format for serving in a production environment

This builds on the [Text Summarization](https://github.com/Currie32/Text-Summarization-with-Amazon-Reviews) project by David Currie (this [Medium post](https://medium.com/towards-data-science/text-summarization-with-amazon-reviews-41801c2210b) goes into excellent detail as well).

## Why use sequence to sequence RNN?

Sequence-to-sequence models use two different RNNs, an encoder and a decoder, connected through the final state of the encoder. This is also called the encoder-decoder model (similar in spirit to autoencoders).

These seq2seq models are powerful and versatile; they've been shown to perform very well on a range of tasks including:

| Task        | Input | Output
|:------------- |:------------- | :--------
| <b>Language translation</b>      | Text in language 1 | Text in language 2
| <b>News headlines</b> | Text of news article | Short headline
| <b>Question/Answering</b> | Questions about content | Answers to questions
| <b>Chatbots</b> | Incoming chat to bot | Reply from chatbot
| <b>Smart email replies</b> | Email content | Reply to email
| <b>Image captioning</b> | Image | Caption describing image
| <b>Speech to text</b> | Raw audio | Text of audio


For more information, here are some great resources:
- [Practical seq2seq](http://suriyadeepan.github.io/2016-12-31-practical-seq2seq/)
- [Tensorflow seq2seq tutorials](https://github.com/ematvey/tensorflow-seq2seq-tutorials)
- [Google talk by Quoc Le](https://www.youtube.com/watch?v=G5RY_SUJih4)
- [Deep Learning for Chatbots](http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/)

We're using it here to 'translate' from one sequence of words (the entirety of an Amazon review) to another sequence of words (the short summary of the review).

## Model Architecture

<img src="images/nct-seq2seq.png"/>
<i>seq2seq model. source: [WildML](http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/)</i>

I'm building the model here piece by piece.

In [1]:
import pickle
import time
import numpy as np
import tensorflow as tf
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors

## Model inputs

Our model has the following placeholders for input parameters:
- <b>inputs</b>: The integer-transformed full review
- <b>targets</b>: The integer-transformed summary of each review (with 'EOS' tags)
- <b>learning_rate</b>: model training learning rate
- <b>keep_probability</b>: For use in dropout
- <b>target_seq_len</b>: Length of each summary we input into the model for training on
- <b>max_target_seq_len</b>: Max length of the summaries we input into the model
- <b>source_seq_len</b>: Length of reviews inputted into the model



In [2]:
def model_input_placeholders():
    inputs = tf.placeholder(tf.int32, [None,None], name='input')
    targets = tf.placeholder(tf.int32, [None,None])
    learning_rate = tf.placeholder(tf.float32)
    keep_probability = tf.placeholder(tf.float32, name='keep_probability')
    target_seq_len = tf.placeholder(tf.int32, (None,), name='target_seq_len')
    max_target_seq_len = tf.reduce_max(target_seq_len, name='max_target_seq_len')
    source_seq_len = tf.placeholder(tf.int32, (None,), name='source_seq_len')
    
    return inputs, targets, learning_rate, keep_probability, target_seq_len, max_target_seq_len, source_seq_len

## Encoder
We are doing two things in this layer:
1. Embedding the integer-value reviews into their word embeddings
2. Feeding those embeddings into the encoder RNN network

Here we're using the following Tensorflow APIs:
* [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup): a convenient method that uses a provided embedding matrix to convert our integer input into embeddings for the encoder
* [`tf.nn.bidirectional_dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn): The words in a review may depend heavily on sequences both before and after a given input, so we're using a [bidirectional RNN](https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/bidirectional_rnn.py), consisting of:
    * Multilayered [`tf.contrib.rnn.LSTMCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LSTMCell) 
    * The cell is wrapped in a [`tf.contrib.rnn.DropoutWrapper`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper)

<img src="images/bidirectional-rnn.png"/>
<center><i>Bidirectional RNN, source [wildML](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns)</i></center>
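
To make the two encoder steps concrete, here is a minimal pure-Python sketch of an embedding lookup followed by a bidirectional pass. The embedding values and the `run_toy_rnn` recurrence are made up for illustration (a simple running average stands in for the LSTM); only the shape of the computation mirrors the TensorFlow calls.

```python
# Toy stand-in for tf.nn.embedding_lookup followed by a bidirectional pass.
# The embedding values and the recurrence are invented for illustration.
embedding_matrix = [
    [0.0, 0.0],  # id 0
    [1.0, 0.5],  # id 1
    [0.2, 0.9],  # id 2
    [0.7, 0.1],  # id 3
]

def embedding_lookup(matrix, ids):
    """Replace each integer id with its row of the embedding matrix."""
    return [matrix[i] for i in ids]

def run_toy_rnn(vectors):
    """Stand-in recurrence (not an LSTM): state_t = 0.5 * state_{t-1} + 0.5 * x_t."""
    state = [0.0] * len(vectors[0])
    outputs = []
    for x in vectors:
        state = [0.5 * s + 0.5 * xi for s, xi in zip(state, x)]
        outputs.append(state)
    return outputs

def bidirectional(vectors):
    """Run the recurrence in both directions and concatenate per time step."""
    forward = run_toy_rnn(vectors)
    backward = run_toy_rnn(vectors[::-1])[::-1]  # re-align with the forward pass
    return [f + b for f, b in zip(forward, backward)]

review_ids = [1, 2, 3]
embedded = embedding_lookup(embedding_matrix, review_ids)
encoded = bidirectional(embedded)
print(len(encoded), len(encoded[0]))  # 3 time steps, 4 features (2 forward + 2 backward)
```

Each time step ends up with twice as many features as a single direction produces, which is why the encoder concatenates the forward and backward outputs along the last axis.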

In [3]:
def embedded_encoder_input(input_data, word_embedding_matrix):
    return tf.nn.embedding_lookup(word_embedding_matrix, input_data)

def encoding_layer(encoder_inputs, rnn_size, source_seq_len, num_layers, keep_prob):
    
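    # Note: every iteration below reads `encoder_inputs` directly, so the layers
    # are not stacked on one another; only the outputs and state of the last
    # bidirectional layer built are returned.
    for layer in range(num_layers):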
        with tf.variable_scope('encoder_{}'.format(layer)):
            single_rnn_cell_forward = tf.contrib.rnn.LSTMCell(num_units = rnn_size,
                                                      initializer = tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            single_rnn_cell_forward = tf.contrib.rnn.DropoutWrapper(cell = single_rnn_cell_forward,
                                                                    input_keep_prob = keep_prob)
            single_rnn_cell_backward = tf.contrib.rnn.LSTMCell(num_units = rnn_size,
                                                               initializer = tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            single_rnn_cell_backward = tf.contrib.rnn.DropoutWrapper(cell = single_rnn_cell_backward,
                                                                     input_keep_prob = keep_prob)
            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(single_rnn_cell_forward,
                                                                    single_rnn_cell_backward,
                                                                    encoder_inputs,
                                                                    source_seq_len,
                                                                    dtype = tf.float32)
    enc_output = tf.concat(enc_output, 2) # Concatenate both outputs together
    return enc_output, enc_state

## Decoder
We'll actually have 2 types of decoders, one for training and one for inference. In the training decoder, we'll feed in the next value in the target sequence <b>regardless of what the decoder outputs at each step</b>. We'll still use the decoder outputs in training though to calculate the loss for the model.

This is called [Teacher Forcing](https://www.quora.com/What-is-the-teacher-forcing-in-RNN) since we're 'teaching' the decoder using actual target examples instead of letting it generate based on its own outputs at each time step. There are some known [issues](https://github.com/dennybritz/deeplearning-papernotes/blob/master/notes/professor-forcing.md) with this regarding generation but we'll use it for now!

This [notebook](https://github.com/udacity/deep-learning/blob/master/seq2seq/sequence_to_sequence_implementation.ipynb) from Udacity does a great job at explaining how the two decoders work.

<img src="images/udacity_seq2seqtraining.png">
<center><i>Image source: [Udacity](https://github.com/udacity/deep-learning/blob/master/seq2seq/sequence_to_sequence_implementation.ipynb)</i></center>
<i>In the training decoder, we're using the output at each step to calculate the loss but <b>we're not feeding that output into the next step</b>. Instead we feed it the next word in the actual target sequence regardless of what it outputs.</i>

<img src="images/udacityseq2seqinference.png">
<center><i>Image source: [Udacity](https://github.com/udacity/deep-learning/blob/master/seq2seq/sequence_to_sequence_implementation.ipynb)</i></center>
<i>In the inference decoder, we're feeding the output at each decoder step into the next decoder step as input.</i>
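
The difference between the two decoders can be sketched without TensorFlow. In the toy below, the "decoder" is a hypothetical lookup table from the previous token to a predicted next token, deliberately wrong after "great" so the two modes diverge; it is an illustration, not the model's actual behavior.

```python
# Toy "decoder": maps the previous token to a predicted next token.
# The table is deliberately wrong for "great" to show how the two modes differ.
toy_decoder = {"<GO>": "great", "great": "great", "tasty": "snack", "snack": "<EOS>"}

def decode_with_teacher_forcing(target):
    """Training mode: the input at each step is the true previous token."""
    inputs = ["<GO>"] + target[:-1]
    return [toy_decoder[token] for token in inputs]

def decode_free_running(max_steps):
    """Inference mode: the input at each step is the decoder's own last output."""
    token, outputs = "<GO>", []
    for _ in range(max_steps):
        token = toy_decoder[token]
        outputs.append(token)
        if token == "<EOS>":
            break
    return outputs

target = ["great", "tasty", "snack", "<EOS>"]
print(decode_with_teacher_forcing(target))  # ['great', 'great', 'snack', '<EOS>']
print(decode_free_running(max_steps=4))     # ['great', 'great', 'great', 'great']
```

With teacher forcing, the mistake at the second step does not affect later steps because the true token is fed in next. When the decoder runs on its own outputs, the same mistake repeats until the step limit is reached.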

### Preprocessing the decoder input
The first step is to process what the decoder gets as input. We do the following preprocessing steps:
* We remove the last element of each summary, since the decoder never needs it as input: the input to the last timestep is the second-to-last element of each target summary.
* We add a <b>`<GO>`</b> token at the beginning of each summary so the decoder knows to start decoding at that point.
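
The two preprocessing steps can be sketched on plain Python lists. The token ids here are hypothetical (1 for `<GO>`, 2 for `<EOS>`, 0 for `<PAD>`), and `process_decoder_input_py` is an illustrative name rather than part of the notebook.

```python
GO_ID = 1  # hypothetical id for the <GO> token

def process_decoder_input_py(target_batch, go_id):
    """Drop the last id of every (padded) target and prepend the <GO> id."""
    return [[go_id] + target[:-1] for target in target_batch]

# Two padded summaries; 2 plays the role of <EOS> and 0 of <PAD>.
targets = [[11, 12, 13, 2], [21, 22, 2, 0]]
print(process_decoder_input_py(targets, GO_ID))  # [[1, 11, 12, 13], [1, 21, 22, 2]]
```

Every row keeps its length, so the decoder input lines up step for step with the targets it is trained to predict.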

In [4]:
def process_decoder_input(target_data, vocab_to_int, batch_size):

    # Remove the last word (integer) from each target sequence
    ending = tf.strided_slice(target_data, [0,0], [batch_size,-1], [1,1])
    
    # Add the <GO> token to each target sequence
    decoder_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)
    
    return decoder_input

In [5]:
def embedded_decoder_input(input_data, word_embedding_matrix):
    return tf.nn.embedding_lookup(word_embedding_matrix, input_data)

### Creating the decoder cell

The decoder cell is created very similarly to the encoder (though it's a single LSTM cell instead of a bidirectional RNN). This means we're using the [`tf.contrib.rnn.LSTMCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LSTMCell) again and wrapping it in a [`tf.contrib.rnn.DropoutWrapper`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper).

<b>Attention Mechanism</b><br>

A new feature we're adding here is an [attention mechanism](http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/). Regular encoder-decoder seq2seq models use the final state of the encoder (the encoder's final output) as a sort of 'summary' of the entire input sentence -- in our case it means encoding the entire Amazon review into one fixed-length vector. Attention Mechanisms improve on this by allowing the decoder to 'focus' on different pieces of the input; it might weight certain words more heavily at different places in its output.

<img src="images/attentionmech.png">
<center><i>Image source: [SPRO](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb)</i></center>

An example review might be:<br>

<b><i>"This product sucks. It wasn't what I was looking for and I'm upset I bought it!"</i></b>

For our summary, it'd be helpful to focus more on the word "sucks" when generating the first word of the summary; we'd hope to get a generated summary like <b><i>"Terrible purchase"</i></b>, with the word "terrible" heavily influenced by seeing "sucks" in the review.

Compared to language translation and other sequence-to-sequence use cases, Attention Mechanisms seem especially helpful for our problem of summarization; a summary is by definition the distillation of information into a more compact form, meaning we are really trying to pay attention to the most important words in the original review.

The Bahdanau attention style seems to give better results per this [paper](https://arxiv.org/abs/1703.03906v2).
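
As a rough illustration of additive (Bahdanau-style) attention, the sketch below scores each encoder output against the current decoder state, turns the scores into weights with a softmax, and builds a weighted sum of the encoder outputs as the context vector. The weight matrices, vectors, and function names are made up for the example; the TensorFlow implementation learns its projections and may differ in detail.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_attention(encoder_outputs, decoder_state, w_enc, w_dec, v):
    """score_i = v . tanh(w_enc @ h_i + w_dec @ s); context = sum_i weight_i * h_i."""
    projected_state = matvec(w_dec, decoder_state)
    scores = []
    for h in encoder_outputs:
        projected_h = matvec(w_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_h, projected_state)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    context = [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
               for d in range(len(encoder_outputs[0]))]
    return weights, context

# Three encoder outputs (one per input word) and one decoder state, all invented.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projection matrices
weights, context = additive_attention(encoder_outputs, decoder_state,
                                      identity, identity, v=[1.0, 1.0])
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The third input position receives the largest weight here, so the context vector leans toward it. In the review example, that corresponds to the decoder leaning on "sucks" when it produces "terrible".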

Here we're using the following Tensorflow APIs:
* Multilayered [`tf.contrib.rnn.LSTMCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LSTMCell) 
    * The cell is wrapped in a [`tf.contrib.rnn.DropoutWrapper`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper)
* [`tf.contrib.seq2seq.BahdanauAttention`](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BahdanauAttention) as the attention mechanism
    * Our LSTM cell wraps this using [`tf.contrib.seq2seq.DynamicAttentionWrapper`](https://www.tensorflow.org/versions/r1.1/api_docs/python/tf/contrib/seq2seq/DynamicAttentionWrapper)
    * We're setting the initial state of our attention wrapper with [`tf.contrib.seq2seq.DynamicAttentionWrapperState`](https://www.tensorflow.org/versions/r1.1/api_docs/python/tf/contrib/seq2seq/DynamicAttentionWrapperState)


In [6]:
def make_decoder_cell(rnn_size, 
                      num_layers, 
                      encoder_output, 
                      source_seq_len, 
                      keep_prob,
                      batch_size,
                      encoder_state):

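    # Note: `dec_cell` is overwritten on every iteration, so only the cell
    # created in the last iteration is wrapped with attention below.
    for layer in range(num_layers):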
        with tf.variable_scope('decoder_{}'.format(layer)):
            single_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                                  initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            dec_cell = tf.contrib.rnn.DropoutWrapper(single_cell, input_keep_prob=keep_prob)
    
    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                               encoder_output,
                                                               source_seq_len,
                                                               normalize=False,
                                                               name='BahdanauAttention')
    
    dec_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(dec_cell,
                                                          attention_mechanism,
                                                          rnn_size)

    initial_state = tf.contrib.seq2seq.DynamicAttentionWrapperState(encoder_state[0],
                                                                    _zero_state_tensors(rnn_size, 
                                                                                        batch_size, 
                                                                                        tf.float32))
    return dec_cell, initial_state 
    

### Creating the decoder layer

Now we can build the decoding layer. Key to notice here is that we have two different decoders, one for training and one for inference (as detailed above).

Both decoders share the same weights (hence the `reuse=True` tag in the variable scope). They have the following 3 pieces:
* A helper which reads in the inputs
    * the training decoder uses the [`tf.contrib.seq2seq.TrainingHelper`](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/TrainingHelper) and is fed the actual target summaries' words at each step
    * the inference decoder uses the [`tf.contrib.seq2seq.GreedyEmbeddingHelper`](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/GreedyEmbeddingHelper) which takes the argmax of the logits at each step as the input to the next decoder step
* Both decoders use [`tf.contrib.seq2seq.BasicDecoder`](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BasicDecoder) as their actual decoder
* Both decoders use [`tf.contrib.seq2seq.dynamic_decode`](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/dynamic_decode) to run the decoder
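
The greedy inference loop can be sketched as follows: take the argmax of each step's logits, feed it back in, and stop at the end token or at a step limit. The four-word vocabulary and the `toy_logits` table are hypothetical stand-ins for the decoder cell and output layer.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def greedy_decode(step_fn, go_id, eos_id, max_iterations):
    """Feed the argmax of each step's logits back in as the next input."""
    token, decoded = go_id, []
    for _ in range(max_iterations):
        logits = step_fn(token)
        token = argmax(logits)
        decoded.append(token)
        if token == eos_id:
            break
    return decoded

# Hypothetical 4-word vocabulary: 0=<GO>, 1=<EOS>, 2="best", 3="ever"
toy_logits = {
    0: [0.0, 0.1, 2.0, 0.3],  # after <GO>, "best" scores highest
    2: [0.0, 0.2, 0.1, 1.5],  # after "best", "ever" scores highest
    3: [0.0, 3.0, 0.1, 0.2],  # after "ever", <EOS> scores highest
}
print(greedy_decode(toy_logits.get, go_id=0, eos_id=1, max_iterations=10))  # [2, 3, 1]
```

The step limit plays the same role as `maximum_iterations` in `dynamic_decode`: it bounds the output length even if the end token is never produced.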

In [7]:
def decoding_layer(input_data,
                   word_embedding_matrix,
                   num_layers, 
                   rnn_size, 
                   keep_prob, 
                   encoder_output, 
                   source_seq_len,
                   encoder_state,
                   batch_size,
                   vocab_size,
                   target_seq_len,
                   max_target_seq_len,
                   vocab_to_int):
    
    decoder_embedded_input = embedded_decoder_input(input_data, word_embedding_matrix)
    decoder_cell, initial_state = make_decoder_cell(rnn_size, 
                                                    num_layers, 
                                                    encoder_output, 
                                                    source_seq_len, 
                                                    keep_prob, 
                                                    batch_size,
                                                    encoder_state)
    output_layer = Dense(vocab_size,
                        kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))

    # Training
    with tf.variable_scope("decode"):
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=decoder_embedded_input,
                                                            sequence_length = target_seq_len,
                                                            time_major=False)
        training_decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell,
                                                           training_helper,
                                                           initial_state,
                                                           output_layer)
        training_logits, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                               output_time_major=False,
                                                               impute_finished=True,
                                                               maximum_iterations=max_target_seq_len)
    
    with tf.variable_scope("decode", reuse=True): # Reuse same params for inference
        
        start_tokens = tf.tile(tf.constant([vocab_to_int['<GO>']], dtype=tf.int32), 
                               [batch_size], 
                               name='start_tokens')
        inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(word_embedding_matrix,
                                                                    start_tokens,
                                                                    vocab_to_int['<EOS>'])
        inference_decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell,
                                                            inference_helper,
                                                            initial_state,
                                                            output_layer)
        inference_logits, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                              output_time_major=False,
                                                              impute_finished=True,
                                                              maximum_iterations=max_target_seq_len)
    
    return training_logits, inference_logits

## Bringing it all together - the full sequence-to-sequence model

We piece together the <b>encoder</b>, the <b>decoder</b>, and return the training / inference logits from the decoder.

In [8]:
def full_seq2seq(input_data, 
                 word_embedding_matrix,
                 rnn_size,
                 source_seq_len,
                 num_layers,
                 keep_prob,
                 target_data,
                 vocab_to_int,
                 batch_size,
                 vocab_size,
                 target_seq_len,
                 max_target_seq_len
                 
                 
                 ):
    
    # Encoding layer
    encoder_inputs = embedded_encoder_input(input_data, word_embedding_matrix)
    encoder_output, encoder_state = encoding_layer(encoder_inputs, 
                                                   rnn_size, 
                                                   source_seq_len, 
                                                   num_layers, 
                                                   keep_prob)
    
    # Decoding layer
    processed_decoder_input = process_decoder_input(target_data, 
                                                    vocab_to_int, 
                                                    batch_size)
    training_logits, inference_logits = decoding_layer(processed_decoder_input,
                                                       word_embedding_matrix,
                                                       num_layers, 
                                                       rnn_size, 
                                                       keep_prob, 
                                                       encoder_output, 
                                                       source_seq_len,
                                                       encoder_state,
                                                       batch_size,
                                                       vocab_size,
                                                       target_seq_len,
                                                       max_target_seq_len,
                                                       vocab_to_int)
    return training_logits, inference_logits

## Training the model
In order to train, we need a `pad_batch` method that pads the batches so that each sequence has the same length, determined by the longest item in each batch.

In [9]:
def pad_batch(batch_to_pad):
    max_size = max([len(item) for item in batch_to_pad])
    padded_batch = [item + [vocab_to_int['<PAD>']] * (max_size - len(item)) for item in batch_to_pad]
    return padded_batch
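
A quick check of the padding logic on a toy batch. `pad_batch_with` is an illustrative variant that takes the `<PAD>` id as an argument instead of reading the global `vocab_to_int`, and the id 0 is an assumption made for the example.

```python
def pad_batch_with(batch_to_pad, pad_id):
    """Same logic as pad_batch, with the <PAD> id passed in explicitly."""
    max_size = max(len(item) for item in batch_to_pad)
    return [item + [pad_id] * (max_size - len(item)) for item in batch_to_pad]

toy_batch = [[5, 6, 7], [8], [9, 10]]
padded = pad_batch_with(toy_batch, pad_id=0)
print(padded)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
```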

In [10]:
def get_batches(summaries, reviews, batch_size):
    for batch_i in range(0, len(reviews)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i + batch_size]
        reviews_batch = reviews[start_i:start_i + batch_size]
        pad_summaries_batch = pad_batch(summaries_batch)
        pad_reviews_batch = pad_batch(reviews_batch)
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))
        pad_reviews_lengths = []
        for review in pad_reviews_batch:
            pad_reviews_lengths.append(len(review))
        
        yield pad_summaries_batch, pad_reviews_batch, pad_summaries_lengths, pad_reviews_lengths
        

In [11]:
# Hyperparameters
epochs = 100
rnn_size = 256
batch_size = 64
num_layers = 2
lr = 0.005
keep_prob = 0.75

In [12]:
def build_and_train_model(word_embedding_matrix, 
                rnn_size,
                num_layers,
                keep_probability,
                vocab_to_int,
                batch_size,
                sorted_summaries,
                sorted_reviews):
    

    # GRAPH BUILDING
    train_graph = tf.Graph()
    with train_graph.as_default():
        
        # Model inputs
        inputs, targets, learning_rate, keep_probability, target_seq_len, max_target_seq_len, source_seq_len = model_input_placeholders()
        
        # Create final logits tensors
        training_logits, inference_logits = full_seq2seq(tf.reverse(inputs, [-1]),
                                                         word_embedding_matrix,
                                                         rnn_size,
                                                         source_seq_len,
                                                         num_layers,
                                                         keep_probability,
                                                         targets,
                                                         vocab_to_int,
                                                         batch_size,
                                                         len(vocab_to_int)+1,
                                                         target_seq_len,
                                                         max_target_seq_len)

        training_logits = tf.identity(training_logits.rnn_output, 'logits')
        inference_logits = tf.identity(inference_logits.sample_id, name='predictions')
        
        masks = tf.sequence_mask(target_seq_len, max_target_seq_len, dtype=tf.float32, name='masks')
        
        # Set up optimizer
        with tf.name_scope("optimization"):
            
            cost = tf.contrib.seq2seq.sequence_loss(training_logits,
                                                    targets,
                                                    masks)
            
            optimizer = tf.train.AdamOptimizer(learning_rate)
            
            gradients = optimizer.compute_gradients(cost)
            capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
            train_operation = optimizer.apply_gradients(capped_gradients)

    print("Finished building the graph!")
    
    start = 200000
    end = start + 50000
    sorted_summaries_short = sorted_summaries[start:end]
    sorted_reviews_short = sorted_reviews[start:end]
    print("The shortest review length:", len(sorted_reviews_short[0]))
    print("The longest review length:", len(sorted_reviews_short[-1]))
    
#     learning_rate_decay = 0.95
#     min_learning_rate = 0.0005
    display_step = 20 # Check training loss after every 20 batches
    stop = 5 # Stop training if the average loss doesn't decrease for this many consecutive update checks
    per_epoch = 3 # Make 3 update checks per epoch
    update_check = (len(sorted_reviews_short)//batch_size//per_epoch)-1

    update_loss = 0 
    batch_loss = 0
    summary_update_loss = [] # Record the update losses for saving improvements in the model

    checkpoint = "./model_checkpoints/best_model.ckpt"  # Used by saver.save below
    with tf.Session(graph=train_graph) as sess:
        sess.run(tf.global_variables_initializer())

        for epoch_i in range(1, epochs+1):
            update_loss = 0
            batch_loss = 0
            for batch_i, (summaries_batch, reviews_batch, summaries_lengths, reviews_lengths) in enumerate(
                    get_batches(sorted_summaries_short, sorted_reviews_short, batch_size)):
                start_time = time.time()
                _, loss = sess.run(
                    [train_operation, cost],
                    {inputs: reviews_batch,
                     targets: summaries_batch,
                     learning_rate: lr,
                     target_seq_len: summaries_lengths,
                     source_seq_len: reviews_lengths,
                     keep_probability: keep_prob})

                batch_loss += loss
                update_loss += loss
                end_time = time.time()
                batch_time = end_time - start_time

                if batch_i % display_step == 0 and batch_i > 0:
                    print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                          .format(epoch_i,
                                  epochs, 
                                  batch_i, 
                                  len(sorted_reviews_short) // batch_size, 
                                  batch_loss / display_step, 
                                  batch_time*display_step))
                    batch_loss = 0

                if batch_i % update_check == 0 and batch_i > 0:
                    print("Average loss for this update:", round(update_loss/update_check,3))
                    summary_update_loss.append(update_loss)

                    # If the update loss is at a new minimum, save the model
                    if update_loss <= min(summary_update_loss):
                        print('New Record!') 
                        stop_early = 0
                        saver = tf.train.Saver() 
                        saver.save(sess, checkpoint)

                    else:
                        print("No Improvement.")
                        stop_early += 1
                        if stop_early == stop:
                            break
                    update_loss = 0


            # Reduce learning rate, but not below its minimum value
#             learning_rate *= learning_rate_decay
#             if learning_rate < min_learning_rate:
#                 learning_rate = min_learning_rate
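
The `tf.sequence_mask` and `sequence_loss` calls in the training cell above work together so that padded positions can be left out of the loss. Below is a minimal pure-Python sketch of that idea. It takes probabilities rather than logits to keep the arithmetic short, and the function names and numbers are made up for illustration.

```python
import math

def sequence_mask(lengths, max_len):
    """1.0 for real time steps, 0.0 for padding, one row per sequence."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]

def masked_cross_entropy(probabilities, targets, mask):
    """Average -log p(target) over the unmasked positions only."""
    total, count = 0.0, 0.0
    for probs_seq, target_seq, mask_seq in zip(probabilities, targets, mask):
        for probs, target, m in zip(probs_seq, target_seq, mask_seq):
            total += -math.log(probs[target]) * m
            count += m
    return total / count

# Two sequences over a 3-word vocabulary; the second one has a single real step.
probabilities = [
    [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]],
    [[0.25, 0.25, 0.5], [0.9, 0.05, 0.05]],
]
targets = [[0, 1], [2, 0]]
mask = sequence_mask([2, 1], max_len=2)
loss = masked_cross_entropy(probabilities, targets, mask)
print(mask)
print(round(loss, 3))
```

The second step of the second sequence is masked out, so the 0.9 probability there has no effect on the result. The loss is the average over the three real steps only.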

In [13]:
def load_pickled_data():
    word_dicts_path = './checkpointed_data/word_dicts.p'
    model_input_data_path = './checkpointed_data/model_input_data.p'
    vocab_to_int, int_to_vocab, word_embedding_matrix = pickle.load(open(word_dicts_path, mode='rb'))
    sorted_summaries, sorted_reviews = pickle.load(open(model_input_data_path, mode='rb'))
    return vocab_to_int, int_to_vocab, word_embedding_matrix, sorted_summaries, sorted_reviews

In [14]:
vocab_to_int, int_to_vocab, word_embedding_matrix, sorted_summaries, sorted_reviews = load_pickled_data()

In [15]:
build_and_train_model(word_embedding_matrix, 
                      rnn_size,
                      num_layers,
                      keep_prob,
                      vocab_to_int,
                      batch_size,
                      sorted_summaries,
                      sorted_reviews)

Finished building the graph!
The shortest review length: 25
The longest review length: 31
Epoch   1/100 Batch   20/781 - Loss:  4.377, Seconds: 58.25


KeyboardInterrupt: 

## Final model results
Note -- I was training this on my MacBook Pro but switched to an AWS EC2 instance with a GPU to train faster. The final results I got are below (this is the model being used in the summary generation below).

<img src="images/ec2loss.png"><br>
<center><i>Results after training on GPU</i></center>


## Testing out summary generation
Here's a function that takes in an input review and gives us back the generated summary from the model, so we can see what kind of summaries it comes out with. 


In [85]:
from helpers import text_cleaning

def create_summary(input_review):
    
    # Clean the text to get it ready for the model
    text = text_cleaning.clean_text(input_review)
    text_int = [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
    
    # Load in the model from the checkpoint
    checkpoint = "./model_checkpoints/best_model.ckpt"
    loaded_graph = tf.Graph()
    with tf.Session(graph=loaded_graph) as sess:
        loader = tf.train.import_meta_graph(checkpoint + '.meta')
        loader.restore(sess, checkpoint)
        
        # The input tensors for inference
        input_data = loaded_graph.get_tensor_by_name('input:0')
        inference_logits = loaded_graph.get_tensor_by_name('predictions:0')
        source_seq_len = loaded_graph.get_tensor_by_name('source_seq_len:0')
        target_seq_len = loaded_graph.get_tensor_by_name('target_seq_len:0')
        keep_probability = loaded_graph.get_tensor_by_name('keep_probability:0')
        
        # Run the graph to get the summary logits
        summary_logits = sess.run(inference_logits, { input_data: [text_int] * batch_size,
                                                      target_seq_len: [np.random.randint(5,8)],
                                                      source_seq_len: [len(text_int)] * batch_size,
                                                      keep_probability: 1.0
                                                    })
        # The feed repeats the same review batch_size times, so only the first row of predicted word ids is needed
        summary_logits = summary_logits[0]
        
        # Convert back to human text to print out
        pad = vocab_to_int["<PAD>"]
        
        return " ".join([int_to_vocab[i] for i in summary_logits if i != pad])
        
    
    

And we can try it out on sample reviews (note: in practical cases we'd send in all the reviews as a batch instead of iterating through each one). They're not perfect, but quite neat!
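
The three "Restoring parameters" log lines in the output below show that `create_summary` reloads the checkpoint once per review, which likely accounts for much of the runtime. One way to avoid that is to load the model once and reuse it. The sketch below shows only the caching pattern: `load_model` is a hypothetical stand-in that returns a fake summarizer, not the TensorFlow restore code.

```python
from functools import lru_cache

load_calls = {"count": 0}  # counts how often the (fake) checkpoint is restored

@lru_cache(maxsize=1)
def load_model(checkpoint_path):
    """Hypothetical stand-in for restoring the graph and weights from a checkpoint."""
    load_calls["count"] += 1

    def summarize(review):
        # Placeholder "model": return the first three words of the review.
        return " ".join(review.split()[:3])

    return summarize

def summarize_all(reviews, checkpoint_path):
    model = load_model(checkpoint_path)  # restored once, then served from the cache
    return [model(review) for review in reviews]

reviews = ["The chocolate was splendid, tasty.", "Absolutely the worst cheese I've had."]
print(summarize_all(reviews, "./model_checkpoints/best_model.ckpt"))
print(summarize_all(reviews, "./model_checkpoints/best_model.ckpt"))
print("checkpoint loads:", load_calls["count"])  # checkpoint loads: 1
```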

In [142]:
from timeit import default_timer as timer
start = timer()

sample_reviews = ["It's become my favorite oatmeal, I even eat it for dinner from time to time...",
                  "The chocolate was splendid, tasty. Highly recommended",
                  "Absolutely the worst cheese I've had. Completely rotten, don't buy this!"]

summaries = [create_summary(review) for review in sample_reviews]

end = timer()

print(summaries)
print("Took {} seconds".format(end-start))

INFO:tensorflow:Restoring parameters from ./model_checkpoints/best_model.ckpt
INFO:tensorflow:Restoring parameters from ./model_checkpoints/best_model.ckpt
INFO:tensorflow:Restoring parameters from ./model_checkpoints/best_model.ckpt
['best cereal ever', 'best chocolate ever', 'best jerky ever']
Took 28.24122966499999 seconds
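
The log above shows the checkpoint being restored once per review, which likely accounts for a good part of the runtime. Sending the reviews as one batch avoids that. As a minimal sketch of the data preparation this would need, the helper below pads integer-encoded reviews to a common length and records their true lengths. The name `pad_batch` and the toy vocabulary are assumptions for illustration only; in the notebook the real `vocab_to_int` would be used, and whether the trained graph accepts a batch of different reviews depends on how its placeholders were defined.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]  # true lengths, before padding
    return padded, lengths

# Toy vocabulary (hypothetical, stands in for the notebook's vocab_to_int)
toy_vocab_to_int = {"<PAD>": 0, "<UNK>": 1, "best": 2, "oatmeal": 3,
                    "ever": 4, "rotten": 5, "cheese": 6}

reviews = ["best oatmeal ever", "rotten cheese"]
encoded = [[toy_vocab_to_int.get(word, toy_vocab_to_int["<UNK>"]) for word in review.split()]
           for review in reviews]

batch, lengths = pad_batch(encoded, toy_vocab_to_int["<PAD>"])
print(batch)
print(lengths)
```

The padded rows would go to the `input` placeholder and the true lengths to `source_seq_len`, so the graph is restored and run once for all reviews.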


## Exporting the model for serving
Since we're going to use this model for inference in a production app (the Sumz Chrome extension), we need to export it. Here's a great [tutorial](https://blog.metaflow.fr/tensorflow-how-to-freeze-a-model-and-serve-it-with-a-python-api-d4f3596b3adc) on how to go about this, which I'm following below.

The gist is that we want to export only what inference needs; this pares the model down considerably, so it is easier to serve in production. We really just want the graph and its weights in one file; in other words, we want to freeze the model.

The <b>checkpoints</b> we were saving above while training look like this:
<img src="images/checkpoints_files.png"><br>
Each file plays a different role:
* <b>best_model.ckpt.meta</b> Holds the graph definition and metadata
* <b>best_model.ckpt.index</b> Immutable key-value table that links each serialized tensor name and where to find it in the .data files
* <b>best_model.ckpt.data-00000-of-00001</b> Holds the weights of the model
* <b>checkpoint</b> High-level helper for loading different checkpoint files

For production purposes we can load the meta-graph, restore the weights, and 'freeze' the result by converting the variables to constants. We also only keep whatever is necessary for inference, which the following code helps us do.

The serialized frozen model is saved as a [ProtoBuf](https://developers.google.com/protocol-buffers/) file (a compact binary format for saving the model and reloading it); this is the final frozen model we'll serve to real users!

In [109]:
MODEL_PATH = './models/frozen_seq2seq_model.pb'
with tf.Session(graph=tf.Graph()) as sess:
    loader = tf.train.import_meta_graph('./model_checkpoints/best_model.ckpt.meta')
    loader.restore(sess, './model_checkpoints/best_model.ckpt')

    # Print out the operation names if needed
    # for op in sess.graph.get_operations():
    #     print(op.name)

    # We want just the predictions operation (and every variable it depends on).
    # This converts those variables to constants (freezing them).
    output_graph_def = tf.graph_util.convert_variables_to_constants(sess,
                                                                    sess.graph.as_graph_def(),
                                                                    output_node_names=["predictions"])
    output_graph = MODEL_PATH
    with tf.gfile.GFile(output_graph, "wb") as f:
        f.write(output_graph_def.SerializeToString())
    print("%d ops in final graph" % len(output_graph_def.node))

INFO:tensorflow:Restoring parameters from ./model_checkpoints/best_model.ckpt
INFO:tensorflow:Froze 15 variables.
Converted 15 variables to const ops.
529 ops in final graph
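
To make "keeping only what inference needs" more concrete, here is a toy, TensorFlow-free sketch of the pruning idea: start from the output node and keep only the nodes reachable through its inputs. The graph dictionary and its node names are made up for illustration, and this is not how `convert_variables_to_constants` is implemented internally; it only shows why training-only nodes such as the loss drop out of the frozen graph.

```python
def prune_to_outputs(graph, output_names):
    """Return the set of node names needed to compute the given outputs.

    `graph` maps each node name to the list of node names it takes as input.
    """
    needed = set()
    stack = list(output_names)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(graph[name])  # visit the inputs of this node next
    return needed

# Hypothetical miniature graph: inference path plus training-only nodes
toy_graph = {
    "input": [],
    "encoder": ["input"],
    "decoder": ["encoder"],
    "predictions": ["decoder"],
    "targets": [],
    "loss": ["predictions", "targets"],
    "train_op": ["loss"],
}

kept = prune_to_outputs(toy_graph, ["predictions"])
print(sorted(kept))
```

Only the four nodes on the path from `input` to `predictions` survive; `targets`, `loss`, and `train_op` are never visited.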


### Reloading the frozen model
Now let's reload the frozen model and use it for inference on the same reviews as above.

In [143]:
def load_graph():
    with tf.gfile.GFile(MODEL_PATH, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def)
    return graph

def summary_from_frozen_model(input_review):

    text = text_cleaning.clean_text(input_review)
    text_int = [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
    graph = load_graph()

    # for op in graph.get_operations():
    #     print(op.name)

    batch_size = 64  # Match original batch size
    input_data = graph.get_tensor_by_name('import/input:0')
    target_seq_len = graph.get_tensor_by_name('import/target_seq_len:0')
    source_seq_len = graph.get_tensor_by_name('import/source_seq_len:0')
    keep_probability = graph.get_tensor_by_name('import/keep_probability:0')
    inference_logits = graph.get_tensor_by_name('import/predictions:0')
    
    with tf.Session(graph=graph) as sess:
        feed = {
            input_data: [text_int]*batch_size, 
            target_seq_len: [np.random.randint(5,8)], 
            source_seq_len: [len(text_int)]*batch_size,
            keep_probability: 1.0}
        summary_logits = sess.run(inference_logits, feed_dict = feed)[0]
    
    pad = vocab_to_int["<PAD>"] 
    return " ".join([int_to_vocab[i] for i in summary_logits if i != pad])    

In [145]:
start = timer()

sample_reviews = ["It's become my favorite oatmeal, I even eat it for dinner from time to time...",
                  "The chocolate was splendid, tasty. Highly recommended",
                  "Absolutely the worst cheese I've had. Completely rotten, don't buy this!"]

summaries = [summary_from_frozen_model(review) for review in sample_reviews]

print(summaries)

end = timer()
print("Took {} seconds".format(end-start))

['best cereal ever', 'best chocolate ever', 'best jerky ever']
Took 11.593751017935574 seconds
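
Note that `summary_from_frozen_model` calls `load_graph()` for every review, so the `.pb` file is parsed again each time. One possible refinement is to load the graph once and reuse it. The sketch below shows that pattern with `functools.lru_cache`; `fake_load_graph` and `summarize` are hypothetical stand-ins so the example runs without TensorFlow, and the real `load_graph` would need to take the model path as an argument for the same decorator to apply.

```python
import functools

load_calls = []  # records every time the (fake) loader actually runs

@functools.lru_cache(maxsize=None)
def fake_load_graph(model_path):
    """Stand-in for the real loader; the expensive parsing would happen here."""
    load_calls.append(model_path)
    return {"path": model_path}

def summarize(review, model_path="./models/frozen_seq2seq_model.pb"):
    graph = fake_load_graph(model_path)  # cached after the first call
    return "{} words via {}".format(len(review.split()), graph["path"])

results = [summarize(review) for review in ["great tea", "stale crackers, avoid"]]
print(results)
print(len(load_calls))
```

The loader body runs only once even though two reviews are processed. Keeping a single `tf.Session` open across reviews would be a natural next step along the same lines.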


In [16]:
from __future__ import print_function
'''
Basic Multi GPU computation example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''

'''
This tutorial requires your machine to have 2 GPUs
"/cpu:0": The CPU of your machine.
"/gpu:0": The first GPU of your machine
"/gpu:1": The second GPU of your machine
'''



import numpy as np
import tensorflow as tf
import datetime

# Processing Units logs
log_device_placement = True

# Num of multiplications to perform
n = 10

'''
Example: compute A^n + B^n on 2 GPUs
Results on 8 cores with 2 GTX-980:
 * Single GPU computation time: 0:00:11.277449
 * Multi GPU computation time: 0:00:07.131701
'''
# Create random large matrix
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')

# Lists that collect the result tensors
c1 = []
c2 = []

def matpow(M, n):
    if n <= 1:  # Base case: M^1 = M (any n below 1 is treated the same way)
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))

'''
Single GPU computing
'''
with tf.device('/gpu:0'):
    a = tf.placeholder(tf.float32, [10000, 10000])
    b = tf.placeholder(tf.float32, [10000, 10000])
    # Compute A^n and B^n and store results in c1
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

with tf.device('/cpu:0'):
  sum = tf.add_n(c1) #Addition of all elements in c1, i.e. A^n + B^n

t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Run the op.
    sess.run(sum, {a:A, b:B})
t2_1 = datetime.datetime.now()


'''
Multi GPU computing
'''
# GPU:0 computes A^n
with tf.device('/gpu:0'):
    # Compute A^n and store result in c2
    a = tf.placeholder(tf.float32, [10000, 10000])
    c2.append(matpow(a, n))

# GPU:1 computes B^n
with tf.device('/gpu:1'):
    # Compute B^n and store result in c2
    b = tf.placeholder(tf.float32, [10000, 10000])
    c2.append(matpow(b, n))

with tf.device('/cpu:0'):
  sum = tf.add_n(c2) #Addition of all elements in c2, i.e. A^n + B^n

t1_2 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Run the op.
    sess.run(sum, {a:A, b:B})
t2_2 = datetime.datetime.now()


print("Single GPU computation time: " + str(t2_1-t1_1))
print("Multi GPU computation time: " + str(t2_2-t1_2))

InvalidArgumentError: Cannot assign a device to node 'MatMul_10': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
	 [[Node: MatMul_10 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:GPU:0"](Placeholder_1, Placeholder_1)]]



## Next Steps
You can see the model at work in the Sumz Chrome extension (link); its summaries are clearly not perfect, but it's nice that it already has some practical use.

Importantly, I got to cover some of the concepts I wanted to learn more about, including:
* Sequence to sequence models
* Attention Mechanisms
* Serving the model in a production environment (exporting)

There are definitely some improvements to be made:
* <b>Better data</b>: This model uses a fine foods dataset, and only a small portion of it. It'd be great to train it on a much bigger Amazon reviews set like [SNAP](https://snap.stanford.edu/data/web-Amazon.html) (34M reviews). Having more types of products in the reviews could also help inform the model's summaries
* <b>Fine-tuning the model</b>: Especially the attention mechanism, and trying other architectures
* <b>Bucketing</b>: The review lengths vary wildly for Amazon reviews; grouping reviews of similar length into buckets could really help, since each batch would then need much less padding during training
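
As a rough sketch of the bucketing idea, the helper below groups integer-encoded reviews by the smallest length boundary they fit under. The function name `bucket_by_length`, the boundaries, and the toy sequences are all assumptions for illustration; the boundaries are expected to be sorted in ascending order.

```python
from collections import defaultdict

def bucket_by_length(sequences, boundaries):
    """Group sequences into buckets keyed by the smallest boundary >= their length.

    Sequences longer than every boundary go into the bucket keyed by None.
    """
    buckets = defaultdict(list)
    for seq in sequences:
        for bound in boundaries:
            if len(seq) <= bound:
                buckets[bound].append(seq)
                break
        else:
            buckets[None].append(seq)  # longer than every boundary
    return dict(buckets)

toy_reviews = [[1, 2], [3, 4, 5, 6, 7], [8], [9] * 12, [1, 2, 3]]
buckets = bucket_by_length(toy_reviews, boundaries=[3, 8])
for bound, members in buckets.items():
    print(bound, [len(seq) for seq in members])
```

Batches would then be drawn from one bucket at a time, so short reviews are not padded out to the length of the longest review in the whole dataset.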

Hope this was helpful!