# Summarizing Text with Amazon Reviews

The objective of this project is to build a model that can create relevant summaries for reviews written about fine foods sold on Amazon. The dataset contains more than 500,000 reviews and is hosted on [Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews).

To build our model we will use a two-layer bidirectional RNN with LSTM cells as the encoder for the input data, and a two-layer LSTM decoder with Bahdanau attention for the target data.

The sections of this project are:
- [1. Inspecting the Data](#1.-Insepcting-the-Data)
- [2. Preparing the Data](#2.-Preparing-the-Data)
- [3. Building the Model](#3.-Building-the-Model)
- [4. Training the Model](#4.-Training-the-Model)
- [5. Making Our Own Summaries](#5.-Making-Our-Own-Summaries)

## Download data
Amazon Reviews Data: [Reviews.csv](https://www.kaggle.com/snap/amazon-fine-food-reviews/downloads/Reviews.csv)

Word embeddings: [numberbatch-en-17.06.txt.gz](https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz)
After downloading, extract the archive to **./model/numberbatch-en-17.06.txt**.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import re
from nltk.corpus import stopwords
import time
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import tensor_array_ops
print('TensorFlow Version: {}'.format(tf.__version__))

TensorFlow Version: 1.8.0


In [2]:
import pickle
def __pickleStuff(filename, stuff):
    save_stuff = open(filename, "wb")
    pickle.dump(stuff, save_stuff)
    save_stuff.close()
def __loadStuff(filename):
    saved_stuff = open(filename,"rb")
    stuff = pickle.load(saved_stuff)
    saved_stuff.close()
    return stuff
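
A quick round-trip check of the two helpers might look like the sketch below. It defines equivalent helpers under hypothetical names (`pickle_stuff`, `load_stuff`) that use `with` blocks, and it writes to a temporary directory so that nothing under `./data` is touched.

```python
import os
import pickle
import tempfile

def pickle_stuff(filename, stuff):
    # Hypothetical stand-in for __pickleStuff; `with` closes the file even on error
    with open(filename, "wb") as f:
        pickle.dump(stuff, f)

def load_stuff(filename):
    # Hypothetical stand-in for __loadStuff
    with open(filename, "rb") as f:
        return pickle.load(f)

sample = {"good": 0, "quality": 1, "<UNK>": 2}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab_to_int.p")
    pickle_stuff(path, sample)
    restored = load_stuff(path)

print(restored == sample)
```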

## Load the prepared data and skip to section "[3. Building the Model](#3.-Building-the-Model)"
Once the "[2. Preparing the Data](#2.-Preparing-the-Data)" section has been run once, the prepared data is saved on disk. On later runs, uncomment and run the lines below to load it instead of preparing it again.

In [3]:
clean_summaries = __loadStuff("./data/clean_summaries.p")
clean_texts = __loadStuff("./data/clean_texts.p")

sorted_summaries = __loadStuff("./data/sorted_summaries.p")
sorted_texts = __loadStuff("./data/sorted_texts.p")
word_embedding_matrix = __loadStuff("./data/word_embedding_matrix.p")

vocab_to_int = __loadStuff("./data/vocab_to_int.p")
int_to_vocab = __loadStuff("./data/int_to_vocab.p")


## 1. Insepcting the Data

In [3]:
reviews = pd.read_csv("Reviews.csv")

In [4]:
reviews.shape

(568454, 10)

In [5]:
reviews.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price.  There was a wid... |


In [6]:
# Check for any null values
reviews.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

In [7]:
# Remove null values and unneeded features
reviews = reviews.dropna()
reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator',
                        'Score','Time'], axis=1)
reviews = reviews.reset_index(drop=True)

In [8]:
reviews.shape

(568411, 2)

In [9]:
reviews.head()

| | Summary | Text |
|---|---|---|
| 0 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | Great taffy | Great taffy at a great price.  There was a wid... |


In [10]:
# Inspecting some of the reviews
for i in range(5):
    print("Review #",i+1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()

Review # 1
Good Quality Dog Food
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.

Review # 2
Not as Advertised
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".

Review # 3
"Delight" says it all
This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces 

## 2. Preparing the Data

In [16]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}

In [17]:
def clean_text(text, remove_stopwords = True):
    '''Remove unwanted characters and stopwords, and format the text so that fewer words lack a word embedding'''
    
    # Convert words to lower case
    text = text.lower()
    
    # Replace contractions with their longer forms 
    if True:
        # We do not use "text.split()" here because it is not foolproof,
        # e.g. for words followed directly by punctuation: "Are you kidding?I think you aren't."
        text = re.findall(r"[\w']+", text)
        new_text = []
        for word in text:
            if word in contractions:
                new_text.append(contractions[word])
            else:
                new_text.append(word)
        text = " ".join(new_text)
    
    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)# remove links
    text = re.sub(r'\<a href', ' ', text)# remove html link tag
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]
        text = " ".join(text)

    return text

In [18]:
clean_text("That's a great movie,Can you believe it?I've.But you may not.")

'great movie believe may'
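
The comment inside `clean_text` notes that `text.split()` is not foolproof. The sketch below compares whitespace splitting with the regex tokenizer on the sentence from that comment. The two-entry `sample_contractions` dictionary is a hypothetical subset of the full `contractions` dictionary, used only to keep the example self-contained.

```python
import re

# A tiny subset of the contractions dictionary, for illustration only
sample_contractions = {"aren't": "are not", "i've": "i have"}

sentence = "Are you kidding?I think you aren't."
lowered = sentence.lower()

# Whitespace splitting leaves punctuation glued to the words
split_tokens = lowered.split()
# The regex keeps only runs of word characters and apostrophes
regex_tokens = re.findall(r"[\w']+", lowered)

# Contractions can only be looked up once the punctuation is gone
expanded = " ".join(sample_contractions.get(tok, tok) for tok in regex_tokens)

print(split_tokens)
print(regex_tokens)
print(expanded)
```

With `split()`, the tokens `kidding?i` and `aren't.` would never match a dictionary key, so the contraction would be left unexpanded.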

### Clean the summaries and texts
We will remove the stopwords from the texts because they contribute little to training our model. However, we will keep them in the summaries so that the summaries sound more like natural phrases.

In [14]:
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")

Summaries are complete.
Texts are complete.


In [15]:
# Inspect the cleaned summaries and texts to ensure they have been cleaned well
for i in range(5):
    print("Clean Review #",i+1)
    print(clean_summaries[i])
    print(clean_texts[i])
    print()

Clean Review # 1
good quality dog food
bought several vitality canned dog food products found good quality product looks like stew processed meat smells better labrador finicky appreciates product better

Clean Review # 2
not as advertised
product arrived labeled jumbo salted peanuts peanuts actually small sized unsalted sure error vendor intended represent product jumbo

Clean Review # 3
delight says it all
confection around centuries light pillowy citrus gelatin nuts case filberts cut tiny squares liberally coated powdered sugar tiny mouthful heaven chewy flavorful highly recommend yummy treat familiar story c lewis lion witch wardrobe treat seduces edmund selling brother sisters witch

Clean Review # 4
cough medicine
looking secret ingredient robitussin believe found got addition root beer extract ordered good made cherry soda flavor medicinal

Clean Review # 5
great taffy
great taffy great price wide assortment yummy taffy delivery quick taffy lover deal



### Count the number of occurrences of each word in a set of texts

In [19]:
def count_words(count_dict, text):
    for sentence in text:
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1

#### Give the function a try

In [20]:
mydict = {}
count_words(mydict, ["that is a great great great dog","you have a great dog"])
mydict

{'a': 2, 'dog': 2, 'great': 4, 'have': 1, 'is': 1, 'that': 1, 'you': 1}
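
The same counting can also be done with `collections.Counter` from the standard library. This is a minimal alternative sketch, not the notebook's implementation; `count_words_counter` is a hypothetical name.

```python
from collections import Counter

def count_words_counter(counter, texts):
    # Hypothetical variant of count_words built on collections.Counter
    for sentence in texts:
        counter.update(sentence.split())

counts = Counter()
count_words_counter(counts, ["that is a great great great dog", "you have a great dog"])
print(dict(counts))
```

A `Counter` returns 0 for words it has never seen, which avoids the explicit `if word not in count_dict` branch.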

In [21]:
word_counts = {}
count_words(word_counts, clean_summaries)
count_words(word_counts, clean_texts)
print("Size of Vocabulary:", len(word_counts))

Size of Vocabulary: 125880


Let's see how many times "hero" occurs in the data.

In [22]:
word_counts["hero"]

114

### Load the ConceptNet Numberbatch (CN) embeddings, which are similar to GloVe but probably better
(https://github.com/commonsense/conceptnet-numberbatch)

In [23]:

embeddings_index = {}
with open('./model/numberbatch-en-17.06.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

Word embeddings: 417195


### Take a look at the CN embedding dimension

In [24]:
embeddings_index["hero"].shape

(300,)

### Find the number of words that are missing from CN and are used more often than our threshold

I use a **threshold** of 20. Words that are not in CN can still be added to our **word_embedding_matrix**, but they need to be common enough in the reviews for the model to learn their meaning.

In [25]:
missing_words = 0
threshold = 20

for word, count in word_counts.items():
    if count > threshold:
        if word not in embeddings_index:
            missing_words += 1
            
missing_ratio = round(missing_words/len(word_counts),4)*100
            
print("Number of words missing from CN:", missing_words)
print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))

Number of words missing from CN: 2608
Percent of words that are missing from vocabulary: 2.07%


### Which words are missing from CN?
They look like mostly product brand names, along with misspellings and numbers.

In [26]:
missing_words = []
for word, count in word_counts.items():
    if count > threshold and word not in embeddings_index:
        missing_words.append((word,count))
missing_words[:30]

[('oatmeals', 137),
 ('realemon', 29),
 ('eukanuba', 373),
 ('kcups', 1497),
 ('superfoods', 83),
 ('rotel', 55),
 ('guayaki', 112),
 ('delisious', 22),
 ('chippoisseur', 56),
 ('70', 2184),
 ('1866', 24),
 ('46', 350),
 ('sweetner', 1347),
 ('plocky', 134),
 ('cherrybrook', 64),
 ('kikkoman', 168),
 ('tassimo', 1374),
 ('milka', 156),
 ('11', 5344),
 ('choclate', 152),
 ('10', 19427),
 ('15', 8798),
 ('canin', 825),
 ('cockers', 39),
 ('recomended', 191),
 ('12', 18279),
 ('50', 10597),
 ('yumminess', 176),
 ('3x', 310),
 ('mueslix', 24)]

### Word-to-index and index-to-word dictionaries
Limit the vocabulary that we will use to words that appear at least **threshold** times or are in CN.

In [27]:
# Dictionary to convert words to integers
vocab_to_int = {} 
# Index words from 0
value = 0
for word, count in word_counts.items():
    if count >= threshold or word in embeddings_index:
        vocab_to_int[word] = value
        value += 1

# Special tokens that will be added to our vocab
codes = ["<UNK>","<PAD>","<EOS>","<GO>"]   

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int)

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

usage_ratio = round(len(vocab_to_int) / len(word_counts),4)*100

print("Total number of unique words:", len(word_counts))
print("Number of words we will use:", len(vocab_to_int))
print("Percent of words we will use: {}%".format(usage_ratio))

Total number of unique words: 125880
Number of words we will use: 59072
Percent of words we will use: 46.93%
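
To make the selection rule concrete, here is a small sketch on made-up data. The word counts, the set of words with pre-trained vectors, and all `toy_*` names are assumptions for illustration only.

```python
toy_counts = {"great": 50, "taffy": 30, "kcups": 25, "zzyzx": 2}
toy_embeddings = {"great", "taffy"}   # words assumed to have a pre-trained vector
toy_threshold = 20

# Keep a word if it is frequent enough or has a pre-trained vector
toy_vocab_to_int = {}
for word, count in toy_counts.items():
    if count >= toy_threshold or word in toy_embeddings:
        toy_vocab_to_int[word] = len(toy_vocab_to_int)

# Special tokens take the next free indexes
for code in ["<UNK>", "<PAD>", "<EOS>", "<GO>"]:
    toy_vocab_to_int[code] = len(toy_vocab_to_int)

# Reverse mapping, used later to turn predictions back into words
toy_int_to_vocab = {index: word for word, index in toy_vocab_to_int.items()}

print(toy_vocab_to_int)
```

Here "kcups" is kept because it is frequent even though it has no pre-trained vector, while the rare "zzyzx" is left out and would later be mapped to `<UNK>`.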


### Create word embedding matrix
It has shape (nb_words, embedding_dim), i.e. (59072, 300) in this case. The first dimension is the word index; the second holds the embedding vector, which is either taken from CN or randomly generated.

In [28]:
# Need to use 300 for embedding dimensions to match CN's vectors.
embedding_dim = 300
nb_words = len(vocab_to_int)

# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype=np.float32)
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If word not in CN, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))

59072
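
The fill logic can be sketched at toy scale as follows. Plain Python lists stand in for the NumPy array, the dimension is 4 instead of 300, and the vocabulary and vectors are made up for illustration.

```python
import random

random.seed(0)   # fixed seed so the sketch is repeatable
toy_dim = 4      # the notebook uses 300 to match the CN vectors
toy_vocab_to_int = {"great": 0, "taffy": 1, "kcups": 2, "<UNK>": 3}
toy_pretrained = {"great": [0.1, 0.2, 0.3, 0.4], "taffy": [0.5, 0.6, 0.7, 0.8]}

# One row per vocabulary entry, initialised to zeros
toy_matrix = [[0.0] * toy_dim for _ in toy_vocab_to_int]
for word, index in toy_vocab_to_int.items():
    if word in toy_pretrained:
        # Copy the pre-trained vector into the row for this word
        toy_matrix[index] = list(toy_pretrained[word])
    else:
        # No pre-trained vector: draw a random one from [-1, 1]
        toy_matrix[index] = [random.uniform(-1.0, 1.0) for _ in range(toy_dim)]

for row in toy_matrix:
    print(row)
```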


### Function to convert sentences to sequences of word indexes
It also replaces unknown words with the `<UNK>` index and, if `eos` is set to True, appends `<EOS>` (end of sentence) to each sequence.

In [29]:
def convert_to_ints(text, word_count, unk_count, eos=False):
    '''Convert words in text to an integer.
       If word is not in vocab_to_int, use UNK's integer.
       Total the number of words and UNKs.
       Add EOS token to the end of texts'''
    ints = []
    for sentence in text:
        sentence_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentence_ints.append(vocab_to_int[word])
            else:
                sentence_ints.append(vocab_to_int["<UNK>"])
                unk_count += 1
        if eos:
            sentence_ints.append(vocab_to_int["<EOS>"])
        ints.append(sentence_ints)
    return ints, word_count, unk_count

Apply `convert_to_ints` to `clean_summaries` and `clean_texts`.

In [30]:

word_count = 0
unk_count = 0

int_summaries, word_count, unk_count = convert_to_ints(clean_summaries, word_count, unk_count)
int_texts, word_count, unk_count = convert_to_ints(clean_texts, word_count, unk_count, eos=True)

unk_percent = round(unk_count/word_count,4)*100

print("Total number of words in summaries and texts:", word_count)
print("Total number of UNKs in summaries and texts:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

Total number of words in summaries and texts: 26232563
Total number of UNKs in summaries and texts: 163594
Percent of words that are UNK: 0.62%


### Take a look at what the sequence looks like
Each number here represents a word

In [31]:
int_summaries[:3]

[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
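
For a single sentence, the conversion can be sketched as below. The four-entry vocabulary and the helper name `to_ints` are hypothetical; the real `convert_to_ints` also keeps running totals of words and UNKs.

```python
toy_vocab_to_int = {"great": 0, "taffy": 1, "<UNK>": 2, "<EOS>": 3}

def to_ints(sentence, vocab, eos=False):
    # Hypothetical single-sentence version of convert_to_ints
    unk = vocab["<UNK>"]
    ints = [vocab.get(word, unk) for word in sentence.split()]
    if eos:
        ints.append(vocab["<EOS>"])
    return ints

summary_ints = to_ints("great taffy", toy_vocab_to_int)
text_ints = to_ints("great chewy taffy", toy_vocab_to_int, eos=True)
print(summary_ints)
print(text_ints)
```

The word "chewy" is not in the toy vocabulary, so it is replaced by the `<UNK>` index, and the text sequence ends with the `<EOS>` index.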

### Function to get the length of each sequence

In [32]:
def create_lengths(text):
    '''Create a data frame of the sentence lengths from a text'''
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])

In [33]:
create_lengths(int_summaries[:3])

| | counts |
|---|---|
| 0 | 4 |
| 1 | 3 |
| 2 | 4 |


Get summary statistics for the lengths of the summaries and texts

In [34]:
lengths_summaries = create_lengths(int_summaries)
lengths_texts = create_lengths(int_texts)

print("Summaries:")
print(lengths_summaries.describe())
print()
print("Texts:")
print(lengths_texts.describe())

Summaries:
              counts
count  568411.000000
mean        4.181212
std         2.657213
min         0.000000
25%         2.000000
50%         4.000000
75%         5.000000
max        48.000000

Texts:
              counts
count  568411.000000
mean       42.969483
std        44.166441
min         2.000000
25%        18.000000
50%        30.000000
75%        51.000000
max      2063.000000


### See what max sequence length we can cover at each percentile

In [35]:
# Inspect the length of texts
print(np.percentile(lengths_texts.counts, 89.5))
print(np.percentile(lengths_texts.counts, 95))
print(np.percentile(lengths_texts.counts, 99))

84.0
118.0
216.0


In [36]:
# Inspect the length of summaries
print(np.percentile(lengths_summaries.counts, 90))
print(np.percentile(lengths_summaries.counts, 95))
print(np.percentile(lengths_summaries.counts, 99))

8.0
9.0
13.0
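
The percentiles answer the question "how long must the limit be to keep a given share of the data?". The inverse question, "what share is kept for a given limit?", can be sketched like this. The `coverage` helper and the toy lengths are assumptions for illustration.

```python
def coverage(lengths, max_length):
    # Fraction of sequences whose length is at most max_length
    return sum(1 for n in lengths if n <= max_length) / len(lengths)

toy_lengths = [2, 3, 3, 4, 5, 5, 6, 8, 13, 40]
for limit in (5, 8, 13):
    print(limit, coverage(toy_lengths, limit))
```

A single very long sequence (40 here) barely changes the coverage, which is why cutting at a high percentile discards little data while keeping padded batches short.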


### Function to count the number of times `<UNK>` appears in a sentence

In [37]:
def unk_counter(sentence):
    '''Count the number of times UNK appears in a sentence.'''
    unk_count = 0
    for word in sentence:
        if word == vocab_to_int["<UNK>"]:
            unk_count += 1
    return unk_count

**Filter** for length limit and number of `<UNK>`s

**Sort** the summaries and texts by the length of each element in **texts**, from shortest to longest


In [38]:
max_text_length = 83 # This will cover up to 89.5% of the lengths
max_summary_length = 13 # This will cover up to 99% of the lengths
min_length = 2
unk_text_limit = 1 # text can contain up to 1 UNK word
unk_summary_limit = 0 # Summary should not contain any UNK word

def filter_condition(item):
    int_summary = item[0]
    int_text = item[1]
    if(len(int_summary) >= min_length and 
       len(int_summary) <= max_summary_length and 
       len(int_text) >= min_length and 
       len(int_text) <= max_text_length and 
       unk_counter(int_summary) <= unk_summary_limit and 
       unk_counter(int_text) <= unk_text_limit):
        return True
    else:
        return False

int_text_summaries = list(zip(int_summaries , int_texts))
int_text_summaries_filtered = list(filter(filter_condition, int_text_summaries))
sorted_int_text_summaries = sorted(int_text_summaries_filtered, key=lambda item: len(item[1]))
sorted_int_text_summaries = list(zip(*sorted_int_text_summaries))
sorted_summaries = list(sorted_int_text_summaries[0])
sorted_texts = list(sorted_int_text_summaries[1])
# Delete those temporary variables
del int_text_summaries, sorted_int_text_summaries, int_text_summaries_filtered
# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))

428277
428277


### Inspect the length of text in sorted_texts

In [39]:
lengths_texts = [len(text) for text in sorted_texts]
lengths_texts[:20]

[2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
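
The filter-then-sort step can be traced on a handful of made-up pairs. In this sketch `toy_unk` is a hypothetical `<UNK>` index and `keep` mirrors `filter_condition` with the limits passed as defaults.

```python
toy_unk = 99
toy_pairs = [
    ([1, 2], [5, 6, 7, 8]),           # kept
    ([3], [5, 6]),                    # summary shorter than min_length
    ([1, toy_unk], [5, 6, 7]),        # UNK in the summary
    ([4, 5], [9, 8]),                 # kept
    ([1, 2], [toy_unk, toy_unk, 3]),  # more than one UNK in the text
]

def keep(pair, min_length=2, max_summary=13, max_text=83,
         unk_summary_limit=0, unk_text_limit=1):
    summary, text = pair
    return (min_length <= len(summary) <= max_summary
            and min_length <= len(text) <= max_text
            and summary.count(toy_unk) <= unk_summary_limit
            and text.count(toy_unk) <= unk_text_limit)

# Filter, sort by text length, then split back into two parallel lists
kept = sorted(filter(keep, toy_pairs), key=lambda pair: len(pair[1]))
toy_summaries, toy_texts = (list(part) for part in zip(*kept))

print(toy_summaries)
print(toy_texts)
```

Sorting by text length puts sequences of similar length into the same batch, so less padding is needed per batch.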

## Save data for later

In [40]:
__pickleStuff("./data/clean_summaries.p",clean_summaries)
__pickleStuff("./data/clean_texts.p",clean_texts)

__pickleStuff("./data/sorted_summaries.p",sorted_summaries)
__pickleStuff("./data/sorted_texts.p",sorted_texts)
__pickleStuff("./data/word_embedding_matrix.p",word_embedding_matrix)

__pickleStuff("./data/vocab_to_int.p",vocab_to_int)
__pickleStuff("./data/int_to_vocab.p",int_to_vocab)

## 3. Building the Model

Create placeholders for the inputs to the model

**summary_length** and **text_length** are the sentence lengths in a batch, and **max_summary_length** is the maximum length of a summary in a batch.

In [41]:
def model_inputs():
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')

    return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length

Remove the last word id from each sequence in the batch and concatenate the id of `<GO>` to the beginning of each sequence

In [42]:
def process_encoding_input(target_data, vocab_to_int, batch_size):  
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1]) # slice it to target_data[0:batch_size, 0: -1]
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input
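
On plain Python lists, the slicing and concatenation amount to the following. The token ids and the helper name `make_decoder_input` are made up for illustration; the notebook does the same thing on tensors with `tf.strided_slice` and `tf.concat`.

```python
def make_decoder_input(target_batch, go_id):
    # Drop the last id of every row and put <GO> in front
    return [[go_id] + row[:-1] for row in target_batch]

toy_go, toy_pad = 3, 1
toy_targets = [[10, 11, 12, 13],
               [20, 21, 22, toy_pad]]

toy_dec_input = make_decoder_input(toy_targets, toy_go)
print(toy_dec_input)
```

Each row keeps its length, and at every time step the decoder sees the previous target word as input while being trained to predict the current one.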

### Create the encoding layers

Each layer is built with **tf.nn.bidirectional_dynamic_rnn** inside its own **tf.variable_scope**, so that every layer gets its own set of variables.

parameters
- **rnn_size**: the number of units in the LSTM cell
- **sequence_length**: tensor of size [batch_size], containing the actual length of each sequence in the batch
- **num_layers**: number of bidirectional RNN layers
- **rnn_inputs**: the embedded input sequences for a batch (output of embedding_lookup)
- **keep_prob**: RNN dropout input keep probability

In [43]:
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                    input_keep_prob = keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                    input_keep_prob = keep_prob)

            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                    cell_bw, 
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
            enc_output = tf.concat(enc_output,2)
            # original code is missing this line below, that is how we connect layers 
            # by feeding the current layer's output to next layer's input
            rnn_inputs = enc_output
    return enc_output, enc_state

### Create the training decoding layer
parameters
- **dec_embed_input**: output of embedding_lookup for a batch of inputs
- **summary_length**: length of each padded summary sequence in the batch; since the sequences are padded, all lengths should be the same
- **dec_cell**: the decoder RNN cell wrapped with the attention wrapper
- **output_layer**: fully connected layer to apply to the RNN output
- **vocab_size**: vocabulary size i.e. len(vocab_to_int)+1
- **max_summary_length**: the maximum length of a summary in a batch
- **batch_size**: number of input sequences in a batch

Three components

- **TrainingHelper** reads the embedded target sequences (**dec_embed_input**) and feeds them to the decoder step by step.
- **BasicDecoder** processes the sequence with the decoding cell, and an output layer, which is a fully connected layer. **initial_state** set to zero state.
- **dynamic_decode** creates our outputs that will be used for training.

In [44]:
def training_decoding_layer(dec_embed_input, summary_length, dec_cell, output_layer,
                            vocab_size, max_summary_length,batch_size):
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                        sequence_length=summary_length,
                                                        time_major=False)

    training_decoder = tf.contrib.seq2seq.BasicDecoder(cell=dec_cell,
                                                       helper=training_helper,
                                                       initial_state=dec_cell.zero_state(dtype=tf.float32, batch_size=batch_size),
                                                       output_layer = output_layer)

    training_logits = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                           output_time_major=False,
                                                           impute_finished=True,
                                                           maximum_iterations=max_summary_length)
    return training_logits

### Create infer decoding layer

parameters
- **embeddings**: the CN's word_embedding_matrix
- **start_token**: the id of `<GO>`
- **end_token**: the id of `<EOS>`
- **dec_cell**: the decoder RNN cell wrapped with the attention wrapper
- **output_layer**: fully connected layer to apply to the RNN output
- **max_summary_length**: the maximum length of a summary in a batch
- **batch_size**: number of input sequences in a batch

**GreedyEmbeddingHelper** argument **start_tokens**: int32 vector shaped [batch_size], the start tokens.

In [45]:
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, output_layer,
                             max_summary_length, batch_size):
    '''Create the inference logits'''
    
    start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')
    
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                start_tokens,
                                                                end_token)
                
    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                        inference_helper,
                                                        dec_cell.zero_state(dtype=tf.float32, batch_size=batch_size),
                                                        output_layer)
                
    inference_logits = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                            output_time_major=False,
                                                            impute_finished=True,
                                                            maximum_iterations=max_summary_length)
    
    return inference_logits
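
Greedy decoding picks the highest-scoring word at each step, feeds it back as the next input, and stops at the end token or after a maximum number of steps. The loop below sketches that idea in plain Python. `step_fn` is a hypothetical stand-in for "decoder cell plus output layer"; a real decoder also carries its RNN state from step to step, which is omitted here.

```python
def greedy_decode(step_fn, start_token, end_token, max_steps):
    # step_fn maps the previous token id to a list of scores over the vocabulary
    tokens = []
    previous = start_token
    for _ in range(max_steps):
        scores = step_fn(previous)
        previous = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if previous == end_token:
            break
        tokens.append(previous)
    return tokens

def toy_step(previous):
    # Toy "model" over a 5-word vocabulary: always prefers the id after the previous one
    scores = [0.0] * 5
    scores[(previous + 1) % 5] = 1.0
    return scores

full = greedy_decode(toy_step, start_token=0, end_token=4, max_steps=10)
capped = greedy_decode(toy_step, start_token=0, end_token=4, max_steps=2)
print(full)
print(capped)
```

The second call shows the role of the step limit: decoding stops after two words even though the end token has not been produced.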

### Create Decoding layer
It has three parts: the decoding cell, attention, and getting our logits.
#### Decoding Cell: 
Just a two-layer LSTM with dropout.
#### Attention: 
Uses Bahdanau attention, since it trains faster than Luong attention. 

**AttentionWrapper** applies the attention mechanism to our decoding cell.

parameters
- **dec_embed_input**: output of embedding_lookup for a batch of inputs
- **embeddings**: the CN's word_embedding_matrix
- **enc_output**: encoder layer output, containing the forward and the backward rnn output
- **enc_state**: encoder layer state, a tuple containing the forward and the backward final states of bidirectional rnn.
- **vocab_size**: vocabulary size i.e. len(vocab_to_int)+1
- **text_length**: the actual lengths for each of the input text sequences in the batch
- **summary_length**: the actual lengths for each of the input summary sequences in the batch
- **max_summary_length**: the maximum length of a summary in a batch
- **rnn_size**: The number of units in the LSTM cell
- **vocab_to_int**: the word-to-index dictionary
- **keep_prob**: RNN dropout input keep probability
- **batch_size**: number of input sequences in a batch
- **num_layers**: number of decoder RNN layers

In [46]:
def lstm_cell(lstm_size, keep_prob):
    cell = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    return tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob = keep_prob)

def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length,
                   max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the training and inference decoding layers'''
    dec_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell(rnn_size, keep_prob) for _ in range(num_layers)])
    output_layer = Dense(vocab_size,kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                     enc_output,
                                                     text_length,
                                                     normalize=False,
                                                     name='BahdanauAttention')
    dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell,attn_mech,rnn_size)
    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input,summary_length,dec_cell,
                                                  output_layer,
                                                  vocab_size,
                                                  max_summary_length,
                                                  batch_size)
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings,
                                                    vocab_to_int['<GO>'],
                                                    vocab_to_int['<EOS>'],
                                                    dec_cell,
                                                    output_layer,
                                                    max_summary_length,
                                                    batch_size)
    return training_logits, inference_logits
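
Bahdanau (additive) attention scores each encoder state against the current decoder state, turns the scores into weights with a softmax, and returns the weighted sum of the encoder states as the context. The pure-Python sketch below shows that computation on toy numbers. All names and values are assumptions for illustration, and it leaves out the masking by **text_length** that the TensorFlow implementation applies to padded positions.

```python
import math

def matvec(matrix, vector):
    # Matrix-vector product on nested lists
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_attention(query, memory, w_query, w_memory, v):
    # score_i = v . tanh(W_query @ query + W_memory @ memory_i)
    projected_query = matvec(w_query, query)
    scores = []
    for state in memory:
        projected_state = matvec(w_memory, state)
        hidden = [math.tanh(q + s) for q, s in zip(projected_query, projected_state)]
        scores.append(sum(vi * h for vi, h in zip(v, hidden)))
    # Softmax over encoder positions (shifted by the max for numerical stability)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context = attention-weighted sum of the encoder states
    context = [sum(w * state[d] for w, state in zip(weights, memory))
               for d in range(len(memory[0]))]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]
toy_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three encoder states
toy_weights, toy_context = additive_attention(
    query=[1.0, 0.0], memory=toy_memory,
    w_query=identity, w_memory=identity, v=[1.0, 1.0])

print(toy_weights)
print(toy_context)
```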

In [47]:
def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length, 
                  vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
    '''Use the previous functions to create the training and inference logits'''
    
    # Use Numberbatch's embeddings and the newly created ones as our embeddings
    embeddings = word_embedding_matrix
    enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
    enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)
    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)  # shape=(batch_size, sequence_length); each sequence starts with the index of <GO>
    dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)
    training_logits, inference_logits  = decoding_layer(dec_embed_input, 
                                                        embeddings,
                                                        enc_output,
                                                        enc_state, 
                                                        vocab_size, 
                                                        text_length, 
                                                        summary_length, 
                                                        max_summary_length,
                                                        rnn_size, 
                                                        vocab_to_int, 
                                                        keep_prob, 
                                                        batch_size,
                                                        num_layers)
    return training_logits, inference_logits
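
`tf.nn.embedding_lookup` amounts to picking rows of the embedding matrix by word id. Here is a plain-Python sketch with a made-up 4-word vocabulary and 2-dimensional vectors; the real `word_embedding_matrix` is built earlier in the notebook, and `toy_embeddings` and `embedding_lookup` are illustrative names only.

```python
# Hypothetical toy embedding matrix: one row per word id.
toy_embeddings = [
    [0.0, 0.0],   # id 0
    [0.1, 0.2],   # id 1
    [0.3, 0.4],   # id 2
    [0.5, 0.6],   # id 3
]

def embedding_lookup(embeddings, id_batch):
    """Replace every id in a batch of sequences with its embedding row."""
    return [[embeddings[word_id] for word_id in sequence] for sequence in id_batch]

batch = [[1, 3], [2, 0]]  # shape (batch_size=2, sequence_length=2)
embedded = embedding_lookup(toy_embeddings, batch)
print(embedded)           # shape (2, 2, embedding_dim=2)
```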

### Pad sentences for batch
Pad the sequences so that every sequence in a batch has the same length.

In [48]:
def pad_sentence_batch(sentence_batch):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]
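
To see the padding on a tiny input, here is a standalone sketch of the same idea. `pad_batch` and `PAD_ID = 0` are stand-ins chosen for this example; the notebook's function uses `vocab_to_int['<PAD>']` instead.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in `batch` with `pad_id` to the longest length."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

PAD_ID = 0  # hypothetical id, not the notebook's real <PAD> id
padded = pad_batch([[5, 6, 7], [8], [9, 10]], PAD_ID)
print(padded)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
```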

### Function to generate batch data for training

In [49]:
def get_batches(summaries, texts, batch_size):
    """Batch summaries, texts, and the lengths of their sentences together"""
    for batch_i in range(0, len(texts)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i + batch_size]
        texts_batch = texts[start_i:start_i + batch_size]
        pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
        pad_texts_batch = np.array(pad_sentence_batch(texts_batch))
        
        # Need the lengths for the _lengths parameters
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))
        
        pad_texts_lengths = []
        for text in pad_texts_batch:
            pad_texts_lengths.append(len(text))
        
        yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths

#### Just to test "get_batches" function
Here we generate a batch of size 5.

Check out the "59069" entries: they are `<PAD>`s. Also note that all sequences have the same length.

In [50]:
print("'<PAD>' has id: {}".format(vocab_to_int['<PAD>']))
sorted_summaries_samples = sorted_summaries[7:50]
sorted_texts_samples = sorted_texts[7:50]
pad_summaries_batch_samples, pad_texts_batch_samples, pad_summaries_lengths_samples, pad_texts_lengths_samples = next(get_batches(
    sorted_summaries_samples, sorted_texts_samples, 5))
print("pad summaries batch samples:\n\r {}".format(pad_summaries_batch_samples))

'<PAD>' has id: 59069
pad summaries batch samples:
 [[ 3047    76    91   174   174   605  1739  1937  1537   109    41   206]
 [   79   334  3402    28    68 59069 59069 59069 59069 59069 59069 59069]
 [    4     5    21     5  5444   369    44   875   111  7631   216   628]
 [   33    73   488    41    17    25 12512   411   462 59069 59069 59069]
 [  105  1406   462   754   116 59069 59069 59069 59069 59069 59069 59069]]
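
Note that `get_batches` measures the lengths after padding, so every length in a batch equals the batch maximum. If per-sequence lengths were wanted instead (a variation, not what this notebook does), they could be taken before padding. `batch_with_true_lengths` below is a hypothetical helper for illustration.

```python
def batch_with_true_lengths(sequences, pad_id):
    """Pad a batch and also return each sequence's length before padding."""
    true_lengths = [len(seq) for seq in sequences]
    longest = max(true_lengths)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    return padded, true_lengths

padded, lengths = batch_with_true_lengths(
    [[79, 334, 3402], [4, 5, 21, 5, 5444]], pad_id=59069)
print(padded)
print(lengths)
```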


In [55]:
# Set the Hyperparameters
epochs = 100
batch_size = 64
rnn_size = 256
num_layers = 2
learning_rate = 0.005
keep_probability = 0.95

## Build graph

In [56]:
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():
    
    # Load the model inputs    
    input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()

    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                      targets, 
                                                      keep_prob,   
                                                      text_length,
                                                      summary_length,
                                                      max_summary_length,
                                                      len(vocab_to_int)+1,
                                                      rnn_size, 
                                                      num_layers, 
                                                      vocab_to_int,
                                                      batch_size)
    
    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits[0].rnn_output, 'logits')
    inference_logits = tf.identity(inference_logits[0].sample_id, name='predictions')
    
    # Create the weights for sequence_loss; they should all be True since each batch is padded
    masks = tf.sequence_mask(summary_length, max_summary_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer (use the `lr` placeholder so the decayed rate fed at each step takes effect)
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
print("Graph is built.")
graph_location = "./graph"
print(graph_location)
train_writer = tf.summary.FileWriter(graph_location)
train_writer.add_graph(train_graph)

Graph is built.
./graph
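
`tf.sequence_mask` turns a list of lengths into a weight matrix, and `sequence_loss` uses those weights so that masked positions do not contribute to the loss. Below is a plain-Python sketch of the idea: a simplified weighted average, not TensorFlow's implementation, with made-up per-token losses. In this notebook the lengths come from padded batches, so the mask is all ones; the sketch uses shorter lengths to show the effect.

```python
def sequence_mask(lengths, max_len):
    """1.0 for positions inside each sequence, 0.0 for positions past its length."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]

def masked_mean(token_losses, mask):
    """Average the per-token losses over unmasked positions only."""
    total = sum(loss * m
                for row_losses, row_mask in zip(token_losses, mask)
                for loss, m in zip(row_losses, row_mask))
    count = sum(sum(row) for row in mask)
    return total / count

mask = sequence_mask([3, 1], max_len=3)
token_losses = [[2.0, 1.0, 3.0], [4.0, 9.0, 9.0]]  # made-up values
print(mask)
print(masked_mean(token_losses, mask))
```

The two large losses in the second row fall on masked positions, so they do not affect the average.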


## 4. Training the Model

We will only use a subset of the data to reduce the training time for this demo.

We chose not to use the start of the sorted data because those are the shortest sequences, and we don't want to make it too easy for the model.

In [57]:
# Subset the data for training
start = 200000
end = start + 50000
sorted_summaries_short = sorted_summaries[start:end]
sorted_texts_short = sorted_texts[start:end]
print("The shortest text length:", len(sorted_texts_short[0]))
print("The longest text length:",len(sorted_texts_short[-1]))

The shortest text length: 25
The longest text length: 31


In [58]:
# Train the Model
learning_rate_decay = 0.95
min_learning_rate = 0.0005
display_step = 20 # Check training loss after every 20 batches
stop_early = 0 
stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
per_epoch = 3 # Make 3 update checks per epoch
update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1

update_loss = 0 
batch_loss = 0
summary_update_loss = [] # Record the update losses for saving improvements in the model

checkpoint = "./best_model.ckpt" 
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    
    # If we want to continue training a previous session
    #loader = tf.train.import_meta_graph("./" + checkpoint + '.meta')
    #loader.restore(sess, checkpoint)
    
    for epoch_i in range(1, epochs+1):
        update_loss = 0
        batch_loss = 0
        for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost],
                {input_data: texts_batch,
                 targets: summaries_batch,
                 lr: learning_rate,
                 summary_length: summaries_lengths,
                 text_length: texts_lengths,
                 keep_prob: keep_probability})

            batch_loss += loss
            update_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time

            if batch_i % display_step == 0 and batch_i > 0:
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(sorted_texts_short) // batch_size, 
                              batch_loss / display_step, 
                              batch_time*display_step))
                batch_loss = 0

            if batch_i % update_check == 0 and batch_i > 0:
                print("Average loss for this update:", round(update_loss/update_check,3))
                summary_update_loss.append(update_loss)
                
                # If the update loss is at a new minimum, save the model
                if update_loss <= min(summary_update_loss):
                    print('New Record!') 
                    stop_early = 0
                    saver = tf.train.Saver() 
                    saver.save(sess, checkpoint)

                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
                update_loss = 0
            
                    
        # Reduce learning rate, but not below its minimum value
        learning_rate *= learning_rate_decay
        if learning_rate < min_learning_rate:
            learning_rate = min_learning_rate
        
        if stop_early == stop:
            print("Stopping Training.")
            break

Epoch   1/100 Batch   20/781 - Loss:  4.931, Seconds: 2.67
Epoch   1/100 Batch   40/781 - Loss:  2.845, Seconds: 2.23
Epoch   1/100 Batch   60/781 - Loss:  2.871, Seconds: 2.77
Epoch   1/100 Batch   80/781 - Loss:  2.816, Seconds: 2.67
Epoch   1/100 Batch  100/781 - Loss:  2.682, Seconds: 2.66
Epoch   1/100 Batch  120/781 - Loss:  2.695, Seconds: 2.49
Epoch   1/100 Batch  140/781 - Loss:  2.583, Seconds: 2.59
Epoch   1/100 Batch  160/781 - Loss:  2.839, Seconds: 2.20
Epoch   1/100 Batch  180/781 - Loss:  2.695, Seconds: 2.52
Epoch   1/100 Batch  200/781 - Loss:  2.667, Seconds: 2.84
Epoch   1/100 Batch  220/781 - Loss:  2.603, Seconds: 2.63
Epoch   1/100 Batch  240/781 - Loss:  2.456, Seconds: 2.62
Average loss for this update: 2.866
New Record!
Epoch   1/100 Batch  260/781 - Loss:  2.558, Seconds: 2.74
Epoch   1/100 Batch  280/781 - Loss:  2.639, Seconds: 2.73
Epoch   1/100 Batch  300/781 - Loss:  2.651, Seconds: 2.87
Epoch   1/100 Batch  320/781 - Loss:  2.672, Seconds: 2.67
...

Epoch   4/100 Batch  300/781 - Loss:  1.737, Seconds: 2.70
Epoch   4/100 Batch  320/781 - Loss:  1.818, Seconds: 2.71
Epoch   4/100 Batch  340/781 - Loss:  1.626, Seconds: 2.66
Epoch   4/100 Batch  360/781 - Loss:  1.688, Seconds: 2.80
Epoch   4/100 Batch  380/781 - Loss:  1.477, Seconds: 2.91
Epoch   4/100 Batch  400/781 - Loss:  1.614, Seconds: 2.88
Epoch   4/100 Batch  420/781 - Loss:  1.591, Seconds: 2.59
Epoch   4/100 Batch  440/781 - Loss:  1.705, Seconds: 2.94
Epoch   4/100 Batch  460/781 - Loss:  1.781, Seconds: 2.85
Epoch   4/100 Batch  480/781 - Loss:  1.582, Seconds: 2.66
Epoch   4/100 Batch  500/781 - Loss:  1.619, Seconds: 2.70
Average loss for this update: 1.64
New Record!
Epoch   4/100 Batch  520/781 - Loss:  1.492, Seconds: 2.54
Epoch   4/100 Batch  540/781 - Loss:  1.559, Seconds: 2.80
Epoch   4/100 Batch  560/781 - Loss:  1.491, Seconds: 2.61
Epoch   4/100 Batch  580/781 - Loss:  1.728, Seconds: 2.79
Epoch   4/100 Batch  600/781 - Loss:  1.766, Seconds: 2.76
...

Epoch   7/100 Batch  580/781 - Loss:  1.312, Seconds: 2.74
Epoch   7/100 Batch  600/781 - Loss:  1.317, Seconds: 2.76
Epoch   7/100 Batch  620/781 - Loss:  1.245, Seconds: 2.68
Epoch   7/100 Batch  640/781 - Loss:  1.183, Seconds: 2.76
Epoch   7/100 Batch  660/781 - Loss:  1.045, Seconds: 2.87
Epoch   7/100 Batch  680/781 - Loss:  1.133, Seconds: 2.78
Epoch   7/100 Batch  700/781 - Loss:  1.268, Seconds: 2.83
Epoch   7/100 Batch  720/781 - Loss:  1.444, Seconds: 3.06
Epoch   7/100 Batch  740/781 - Loss:  1.289, Seconds: 3.03
Epoch   7/100 Batch  760/781 - Loss:  1.277, Seconds: 2.78
Average loss for this update: 1.222
New Record!
Epoch   7/100 Batch  780/781 - Loss:  1.091, Seconds: 2.56
Epoch   8/100 Batch   20/781 - Loss:  1.353, Seconds: 2.67
Epoch   8/100 Batch   40/781 - Loss:  1.241, Seconds: 2.28
Epoch   8/100 Batch   60/781 - Loss:  1.210, Seconds: 2.86
Epoch   8/100 Batch   80/781 - Loss:  1.214, Seconds: 2.67
Epoch   8/100 Batch  100/781 - Loss:  1.058, Seconds: 2.69
...

Epoch  11/100 Batch   80/781 - Loss:  0.932, Seconds: 2.71
Epoch  11/100 Batch  100/781 - Loss:  0.815, Seconds: 2.81
Epoch  11/100 Batch  120/781 - Loss:  0.893, Seconds: 2.48
Epoch  11/100 Batch  140/781 - Loss:  0.853, Seconds: 2.59
Epoch  11/100 Batch  160/781 - Loss:  0.985, Seconds: 2.20
Epoch  11/100 Batch  180/781 - Loss:  0.957, Seconds: 2.50
Epoch  11/100 Batch  200/781 - Loss:  0.909, Seconds: 2.77
Epoch  11/100 Batch  220/781 - Loss:  0.913, Seconds: 2.63
Epoch  11/100 Batch  240/781 - Loss:  0.765, Seconds: 2.64
Average loss for this update: 0.91
New Record!
Epoch  11/100 Batch  260/781 - Loss:  0.857, Seconds: 2.73
Epoch  11/100 Batch  280/781 - Loss:  0.877, Seconds: 2.78
Epoch  11/100 Batch  300/781 - Loss:  0.974, Seconds: 2.67
Epoch  11/100 Batch  320/781 - Loss:  1.015, Seconds: 3.00
Epoch  11/100 Batch  340/781 - Loss:  0.900, Seconds: 2.71
Epoch  11/100 Batch  360/781 - Loss:  0.896, Seconds: 2.80
Epoch  11/100 Batch  380/781 - Loss:  0.788, Seconds: 2.90
...

Epoch  14/100 Batch  360/781 - Loss:  0.725, Seconds: 2.76
Epoch  14/100 Batch  380/781 - Loss:  0.631, Seconds: 2.90
Epoch  14/100 Batch  400/781 - Loss:  0.717, Seconds: 2.89
Epoch  14/100 Batch  420/781 - Loss:  0.661, Seconds: 2.54
Epoch  14/100 Batch  440/781 - Loss:  0.690, Seconds: 2.96
Epoch  14/100 Batch  460/781 - Loss:  0.747, Seconds: 2.84
Epoch  14/100 Batch  480/781 - Loss:  0.705, Seconds: 2.62
Epoch  14/100 Batch  500/781 - Loss:  0.745, Seconds: 2.75
Average loss for this update: 0.714
New Record!
Epoch  14/100 Batch  520/781 - Loss:  0.648, Seconds: 2.47
Epoch  14/100 Batch  540/781 - Loss:  0.701, Seconds: 2.95
Epoch  14/100 Batch  560/781 - Loss:  0.647, Seconds: 2.61
Epoch  14/100 Batch  580/781 - Loss:  0.720, Seconds: 2.76
Epoch  14/100 Batch  600/781 - Loss:  0.733, Seconds: 2.95
Epoch  14/100 Batch  620/781 - Loss:  0.690, Seconds: 2.65
Epoch  14/100 Batch  640/781 - Loss:  0.708, Seconds: 2.83
Epoch  14/100 Batch  660/781 - Loss:  0.583, Seconds: 2.86
...

Epoch  17/100 Batch  640/781 - Loss:  0.583, Seconds: 2.74
Epoch  17/100 Batch  660/781 - Loss:  0.495, Seconds: 2.87
Epoch  17/100 Batch  680/781 - Loss:  0.552, Seconds: 2.76
Epoch  17/100 Batch  700/781 - Loss:  0.587, Seconds: 2.84
Epoch  17/100 Batch  720/781 - Loss:  0.647, Seconds: 2.97
Epoch  17/100 Batch  740/781 - Loss:  0.607, Seconds: 3.06
Epoch  17/100 Batch  760/781 - Loss:  0.618, Seconds: 2.85
Average loss for this update: 0.572
New Record!
Epoch  17/100 Batch  780/781 - Loss:  0.539, Seconds: 2.59
Epoch  18/100 Batch   20/781 - Loss:  0.644, Seconds: 2.66
Epoch  18/100 Batch   40/781 - Loss:  0.547, Seconds: 2.24
Epoch  18/100 Batch   60/781 - Loss:  0.576, Seconds: 2.83
Epoch  18/100 Batch   80/781 - Loss:  0.564, Seconds: 2.72
Epoch  18/100 Batch  100/781 - Loss:  0.500, Seconds: 2.67
Epoch  18/100 Batch  120/781 - Loss:  0.527, Seconds: 2.46
Epoch  18/100 Batch  140/781 - Loss:  0.508, Seconds: 2.59
Epoch  18/100 Batch  160/781 - Loss:  0.554, Seconds: 2.21
...

Epoch  21/100 Batch  140/781 - Loss:  0.427, Seconds: 2.57
Epoch  21/100 Batch  160/781 - Loss:  0.458, Seconds: 2.16
Epoch  21/100 Batch  180/781 - Loss:  0.463, Seconds: 2.42
Epoch  21/100 Batch  200/781 - Loss:  0.445, Seconds: 2.76
Epoch  21/100 Batch  220/781 - Loss:  0.470, Seconds: 2.60
Epoch  21/100 Batch  240/781 - Loss:  0.375, Seconds: 2.63
Average loss for this update: 0.451
New Record!
Epoch  21/100 Batch  260/781 - Loss:  0.436, Seconds: 2.87
Epoch  21/100 Batch  280/781 - Loss:  0.421, Seconds: 2.75
Epoch  21/100 Batch  300/781 - Loss:  0.486, Seconds: 2.65
Epoch  21/100 Batch  320/781 - Loss:  0.483, Seconds: 2.68
Epoch  21/100 Batch  340/781 - Loss:  0.449, Seconds: 2.69
Epoch  21/100 Batch  360/781 - Loss:  0.436, Seconds: 2.80
Epoch  21/100 Batch  380/781 - Loss:  0.396, Seconds: 2.89
Epoch  21/100 Batch  400/781 - Loss:  0.453, Seconds: 2.89
Epoch  21/100 Batch  420/781 - Loss:  0.408, Seconds: 2.61
Epoch  21/100 Batch  440/781 - Loss:  0.447, Seconds: 3.08
...

Epoch  24/100 Batch  420/781 - Loss:  0.357, Seconds: 2.55
Epoch  24/100 Batch  440/781 - Loss:  0.390, Seconds: 2.97
Epoch  24/100 Batch  460/781 - Loss:  0.393, Seconds: 2.86
Epoch  24/100 Batch  480/781 - Loss:  0.383, Seconds: 2.63
Epoch  24/100 Batch  500/781 - Loss:  0.423, Seconds: 2.76
Average loss for this update: 0.382
No Improvement.
Epoch  24/100 Batch  520/781 - Loss:  0.334, Seconds: 2.50
Epoch  24/100 Batch  540/781 - Loss:  0.373, Seconds: 2.84
Epoch  24/100 Batch  560/781 - Loss:  0.362, Seconds: 2.55
Epoch  24/100 Batch  580/781 - Loss:  0.386, Seconds: 2.80
Epoch  24/100 Batch  600/781 - Loss:  0.392, Seconds: 3.01
Epoch  24/100 Batch  620/781 - Loss:  0.388, Seconds: 2.62
Epoch  24/100 Batch  640/781 - Loss:  0.383, Seconds: 2.80
Epoch  24/100 Batch  660/781 - Loss:  0.325, Seconds: 2.88
Epoch  24/100 Batch  680/781 - Loss:  0.375, Seconds: 2.77
Epoch  24/100 Batch  700/781 - Loss:  0.409, Seconds: 2.78
Epoch  24/100 Batch  720/781 - Loss:  0.450, Seconds: 3.10
...

Epoch  27/100 Batch  700/781 - Loss:  0.342, Seconds: 2.82
Epoch  27/100 Batch  720/781 - Loss:  0.374, Seconds: 2.89
Epoch  27/100 Batch  740/781 - Loss:  0.338, Seconds: 3.06
Epoch  27/100 Batch  760/781 - Loss:  0.348, Seconds: 2.82
Average loss for this update: 0.329
New Record!
Epoch  27/100 Batch  780/781 - Loss:  0.313, Seconds: 2.62
Epoch  28/100 Batch   20/781 - Loss:  0.385, Seconds: 2.67
Epoch  28/100 Batch   40/781 - Loss:  0.315, Seconds: 2.25
Epoch  28/100 Batch   60/781 - Loss:  0.332, Seconds: 2.83
Epoch  28/100 Batch   80/781 - Loss:  0.312, Seconds: 2.79
Epoch  28/100 Batch  100/781 - Loss:  0.293, Seconds: 2.72
Epoch  28/100 Batch  120/781 - Loss:  0.315, Seconds: 2.49
Epoch  28/100 Batch  140/781 - Loss:  0.302, Seconds: 2.60
Epoch  28/100 Batch  160/781 - Loss:  0.325, Seconds: 2.18
Epoch  28/100 Batch  180/781 - Loss:  0.318, Seconds: 2.42
Epoch  28/100 Batch  200/781 - Loss:  0.315, Seconds: 2.74
Epoch  28/100 Batch  220/781 - Loss:  0.327, Seconds: 2.66
...

Epoch  31/100 Batch  200/781 - Loss:  0.271, Seconds: 2.76
Epoch  31/100 Batch  220/781 - Loss:  0.290, Seconds: 2.69
Epoch  31/100 Batch  240/781 - Loss:  0.230, Seconds: 2.63
Average loss for this update: 0.282
New Record!
Epoch  31/100 Batch  260/781 - Loss:  0.280, Seconds: 2.81
Epoch  31/100 Batch  280/781 - Loss:  0.262, Seconds: 2.74
Epoch  31/100 Batch  300/781 - Loss:  0.290, Seconds: 2.72
Epoch  31/100 Batch  320/781 - Loss:  0.304, Seconds: 2.83
Epoch  31/100 Batch  340/781 - Loss:  0.272, Seconds: 2.71
Epoch  31/100 Batch  360/781 - Loss:  0.283, Seconds: 2.81
Epoch  31/100 Batch  380/781 - Loss:  0.232, Seconds: 2.89
Epoch  31/100 Batch  400/781 - Loss:  0.272, Seconds: 2.87
Epoch  31/100 Batch  420/781 - Loss:  0.243, Seconds: 2.61
Epoch  31/100 Batch  440/781 - Loss:  0.268, Seconds: 2.95
Epoch  31/100 Batch  460/781 - Loss:  0.288, Seconds: 2.90
Epoch  31/100 Batch  480/781 - Loss:  0.283, Seconds: 2.67
Epoch  31/100 Batch  500/781 - Loss:  0.319, Seconds: 2.69
...

Epoch  34/100 Batch  480/781 - Loss:  0.258, Seconds: 2.61
Epoch  34/100 Batch  500/781 - Loss:  0.282, Seconds: 2.75
Average loss for this update: 0.25
New Record!
Epoch  34/100 Batch  520/781 - Loss:  0.224, Seconds: 2.48
Epoch  34/100 Batch  540/781 - Loss:  0.245, Seconds: 3.01
Epoch  34/100 Batch  560/781 - Loss:  0.224, Seconds: 2.78
Epoch  34/100 Batch  580/781 - Loss:  0.267, Seconds: 2.76
Epoch  34/100 Batch  600/781 - Loss:  0.259, Seconds: 2.76
Epoch  34/100 Batch  620/781 - Loss:  0.244, Seconds: 2.61
Epoch  34/100 Batch  640/781 - Loss:  0.254, Seconds: 2.76
Epoch  34/100 Batch  660/781 - Loss:  0.218, Seconds: 2.89
Epoch  34/100 Batch  680/781 - Loss:  0.240, Seconds: 2.76
Epoch  34/100 Batch  700/781 - Loss:  0.271, Seconds: 2.81
Epoch  34/100 Batch  720/781 - Loss:  0.284, Seconds: 2.93
Epoch  34/100 Batch  740/781 - Loss:  0.259, Seconds: 3.10
Epoch  34/100 Batch  760/781 - Loss:  0.283, Seconds: 2.80
Average loss for this update: 0.253
No Improvement.
...
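
The learning-rate decay and early-stopping rules used in the training loop can be tried in isolation. This is a minimal sketch of the same logic with made-up loss values; `decay_learning_rate` and `early_stop_index` are hypothetical helper names, not functions from the notebook.

```python
def decay_learning_rate(rate, decay=0.95, minimum=0.0005):
    """Multiply the rate by `decay`, but never go below `minimum`."""
    return max(rate * decay, minimum)

def early_stop_index(update_losses, patience=3):
    """Return the index of the update check at which training would stop, or None."""
    best = float("inf")
    misses = 0
    for i, loss in enumerate(update_losses):
        if loss <= best:
            best = loss
            misses = 0      # new record: reset the counter
        else:
            misses += 1     # no improvement at this check
            if misses == patience:
                return i
    return None

print(decay_learning_rate(0.005))                      # one epoch of decay
print(early_stop_index([2.8, 1.6, 1.7, 1.65, 1.9]))    # made-up update losses
```

With `patience=3`, training stops only after three consecutive checks that fail to match or beat the best loss seen so far.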

## 5. Making Our Own Summaries

To see the quality of the summaries that this model can generate, you can either create your own review or use a review from the dataset. You can set the length of the summary to a fixed value, as done here, or use a random value (see the commented-out `np.random.randint(5,8)` below).

In [59]:
def text_to_seq(text):
    '''Prepare the text for the model'''
    
    text = clean_text(text)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
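
The `<UNK>` fallback is easiest to see on a tiny vocabulary. The sketch below uses a made-up three-entry vocabulary and plain lower-casing in place of the notebook's `clean_text`; `toy_vocab_to_int` and `toy_text_to_seq` are illustrative names only.

```python
toy_vocab_to_int = {"<UNK>": 0, "great": 1, "coffee": 2}  # hypothetical vocabulary

def toy_text_to_seq(text, vocab):
    """Lower-case, split on whitespace, and map unknown words to <UNK>."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]

print(toy_text_to_seq("Great coffee beans", toy_vocab_to_int))  # [1, 2, 0]
```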


- **input_sentences**: a list of review strings that we are going to summarize
- **generagte_summary_length**: an int or a list; if a list, it must be the same length as input_sentences


In [65]:
input_sentences=["If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal.",
                 "This case is being advertised on Amazon that it fits an IPhone XS 2018.It does not fit my IPhone XS 2018.This I consider poor customer service on Spigen’s par"]
#input_sentences=["If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal."]
generagte_summary_length =  [3,2]

texts = [text_to_seq(input_sentence) for input_sentence in input_sentences]
checkpoint = "./best_model.ckpt"
if type(generagte_summary_length) is list:
    if len(input_sentences)!=len(generagte_summary_length):
        raise Exception("[Error] makeSummaries parameter generagte_summary_length must be same length as input_sentences or an integer")
    generagte_summary_length_list = generagte_summary_length
else:
    generagte_summary_length_list = [generagte_summary_length] * len(texts)
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)
    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    #Multiply by batch_size to match the model's input parameters
    for i, text in enumerate(texts):
        generagte_summary_length = generagte_summary_length_list[i]
        answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                          summary_length: [generagte_summary_length], #summary_length: [np.random.randint(5,8)], 
                                          text_length: [len(text)]*batch_size,
                                          keep_prob: 1.0})[0] 
        # Remove the padding from the summaries
        pad = vocab_to_int["<PAD>"] 
        print('- Review:\n\r {}'.format(input_sentences[i]))
        print('- Summary:\n\r {}\n\r\n\r'.format(" ".join([int_to_vocab[token_id] for token_id in answer_logits if token_id != pad])))

INFO:tensorflow:Restoring parameters from ./best_model.ckpt
- Review:
 If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal.
- Summary:
 great product


- Review:
 This case is being advertised on Amazon that it fits an IPhone XS 2018.It does not fit my IPhone XS 2018.This I consider poor customer service on Spigen’s par
- Summary:
 a decent
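
The last step of the cell above, turning predicted ids back into words while dropping `<PAD>`, can be tried in isolation. The vocabulary here is made up, `ids_to_summary` is a hypothetical helper, and additionally filtering `<EOS>` is an assumption that goes beyond what the notebook's code does.

```python
def ids_to_summary(ids, int_to_vocab, skip_tokens=("<PAD>", "<EOS>")):
    """Map ids to words and join them, leaving out the special tokens."""
    words = [int_to_vocab[token_id] for token_id in ids]
    return " ".join(word for word in words if word not in skip_tokens)

toy_int_to_vocab = {0: "<PAD>", 1: "<EOS>", 2: "great", 3: "product"}
print(ids_to_summary([2, 3, 1, 0], toy_int_to_vocab))  # great product
```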




## Summary

I hope that you found this project interesting and informative. One of my main recommendations for working with this dataset and model is to use a GPU, a subset of the dataset, or plenty of time to train your model. As you might expect, the model will not be able to make good predictions just by seeing many reviews; it needs to see the reviews many times to be able to understand the relationship between words and between descriptions & summaries.

In short, I'm pleased with how well this model performs. After creating numerous reviews and checking those from the dataset, I can happily say that most of the generated summaries are appropriate, some of them are great, and some of them make mistakes. I'll try to improve this model and if it gets better, I'll update my GitHub.

Thanks for reading!