
# Text Summarization of Speeches
In this notebook I will write summaries with the help of my Seq2Seq model in Summarizer.py.
#### Vishnupriya Venkateswaran - 04/13/2019

In [1]:
import nltk

In [2]:
import os

import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from collections import Counter
import Summarizer
import summarizer_data_utils
import summarizer_model_utils

  from ._conv import register_converters as _register_converters
W0413 15:10:40.215856 140736277070656 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


In [32]:
print(tf.__version__)

1.13.1


### Reading and exploring

In [81]:
# Read the dataset
data = pd.read_csv('./Speeches.csv',
                   encoding='cp1250')

In [83]:
# we are only going to use the title and content columns.
data.head()

Unnamed: 0,title,content
0,"We are a strong nation, and we will maintain s...","For myself and for our Nation, I want to thank..."
1,From time to time we've been tempted to believ...,"Senator Hatfield, Mr. Chief Justice, Mr. Presi..."
2,"In another sense, our New Beginning is a conti...","Senator Mathias, Chief Justice Burger, Vice Pr..."
3,"A new breeze is blowing, and the old bipartisa...","Mr. Chief Justice, Mr. President, Vice Preside..."
4,"The world economy, the world environment, the ...","My fellow citizens, today we celebrate the mys..."


In [86]:
# rename the column headers: title -> Summary and content -> Text
data.rename(index = str, columns = {'title':'Summary', 'content':'Text'}, inplace = True)
data = data[['Summary', 'Text']]

In [87]:
data = data[['Summary', 'Text']]
data.head()

Unnamed: 0,Summary,Text
0,"We are a strong nation, and we will maintain s...","For myself and for our Nation, I want to thank..."
1,From time to time we've been tempted to believ...,"Senator Hatfield, Mr. Chief Justice, Mr. Presi..."
2,"In another sense, our New Beginning is a conti...","Senator Mathias, Chief Justice Burger, Vice Pr..."
3,"A new breeze is blowing, and the old bipartisa...","Mr. Chief Justice, Mr. President, Vice Preside..."
4,"The world economy, the world environment, the ...","My fellow citizens, today we celebrate the mys..."


In [88]:
# let's have a look. 
for x in data.Summary[:10]:
    print(x)

We are a strong nation, and we will maintain strength so sufficient that it need not be proven in combat-a quiet strength based not merely on the size of an arsenal but on the nobility of ideas.
From time to time we've been tempted to believe that society has become too complex to be managed by self-rule, that government by an elite group is superior to government for, by, and of the people.
In another sense, our New Beginning is a continuation of that beginning created two centuries ago when, for the first time in history, government, the people said, was not our master, it is our servant; its only power that which we the people allow it to have.
A new breeze is blowing, and the old bipartisanship must be made new again.
The world economy, the world environment, the world AIDS crisis, the world arms race: they affect us all.
With a new vision of Government, a new sense of responsibility, a new spirit of community, we will sustain America's journey.
To all nations, we will speak for th

In [89]:
data.Text[0]

'For myself and for our Nation, I want to thank my predecessor for all he has done to heal our land.\nIn this outward and physical ceremony, we attest once again to the inner and spiritual strength of our Nation. As my high school teacher, Miss Julia Coleman, used to say, "We must adjust to changing times and still hold to unchanging principles."\nHere before me is the Bible used in the inauguration of our first President, in 1789, and I have just taken the oath of office on the Bible my mother gave me just a few years ago, opened to a timeless admonition from the ancient prophet Micah: "He hath showed thee, O man, what is good; and what doth the Lord require of thee, but to do justly, and to love mercy, and to walk humbly with thy God."\nThis inauguration ceremony marks a new beginning, a new dedication within our Government, and a new spirit among us all. A President may sense and proclaim that new spirit, but only a people can provide it.\nTwo centuries ago, our Nation\'s birth was 

In [90]:
len_summaries = [len(summary) for summary in data.Summary]
len_texts = [len(text) for text in data.Text]

In [91]:
len_summaries_counted = Counter(len_summaries).most_common()
len_texts_counted = Counter(len_texts).most_common()
len_summaries_counted[:10], len_texts_counted[:10]

([(360, 1),
  (194, 1),
  (260, 1),
  (118, 1),
  (295, 1),
  (199, 1),
  (72, 1),
  (105, 1),
  (75, 1),
  (125, 1)],
 [(11892, 1),
  (14579, 1),
  (12484, 1),
  (13701, 1),
  (11893, 1),
  (9064, 1),
  (9011, 1),
  (12139, 1),
  (6844, 1),
  (8413, 1)])
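
Every length occurs exactly once, so the counts above say little about the distribution. A minimal sketch of a more compact summary follows; `length_summary` is a hypothetical helper, and the numbers are the ten text lengths printed above. Note that `len()` on a string counts characters, not words.

```python
import statistics


def length_summary(lengths):
    """Return the minimum, median and maximum of a list of lengths."""
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# The ten text lengths (in characters) shown in the output above.
example_text_lengths = [11892, 14579, 12484, 13701, 11893,
                        9064, 9011, 12139, 6844, 8413]
stats = length_summary(example_text_lengths)
print(stats)
```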

In [92]:
# with limited resources we could restrict ourselves to shorter texts, as those are easier to learn.
# the bounds are in characters, because len() is applied to the raw strings.
indices = [ind for ind, text in enumerate(data.Text) if 50 < len(text) < 200000]

In [93]:
texts_unprocessed = [text for text in data.Text if 50 < len(text) < 200000]


In [94]:
summaries_unprocessed = [summary for text, summary in zip(data.Text,data.Summary) if 50 < len(text) < 200000]
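
The three comprehensions above apply the same length condition three times. As a sketch, the filter can also be done in one pass so that indices, texts and summaries cannot drift out of alignment. `filter_pairs_by_text_length` is a hypothetical helper, not part of summarizer_data_utils, and the toy strings are made up.

```python
def filter_pairs_by_text_length(texts, summaries, min_len=50, max_len=200000):
    """Keep (index, text, summary) triples whose text length is strictly inside the bounds."""
    kept = [(i, t, s)
            for i, (t, s) in enumerate(zip(texts, summaries))
            if min_len < len(t) < max_len]
    if not kept:
        return [], [], []
    # Transpose the list of triples into three parallel lists.
    indices, kept_texts, kept_summaries = (list(col) for col in zip(*kept))
    return indices, kept_texts, kept_summaries


# Toy data: the first text is too short and is dropped together with its summary.
toy_texts = ["x" * 10, "y" * 60, "z" * 100]
toy_summaries = ["too short", "kept one", "kept two"]
idx, txt, summ = filter_pairs_by_text_length(toy_texts, toy_summaries)
print(idx, summ)
```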

In [95]:
len(indices), len(texts_unprocessed), len(summaries_unprocessed)
print(indices[0],texts_unprocessed[0],summaries_unprocessed[0])

0 For myself and for our Nation, I want to thank my predecessor for all he has done to heal our land.
In this outward and physical ceremony, we attest once again to the inner and spiritual strength of our Nation. As my high school teacher, Miss Julia Coleman, used to say, "We must adjust to changing times and still hold to unchanging principles."
Here before me is the Bible used in the inauguration of our first President, in 1789, and I have just taken the oath of office on the Bible my mother gave me just a few years ago, opened to a timeless admonition from the ancient prophet Micah: "He hath showed thee, O man, what is good; and what doth the Lord require of thee, but to do justly, and to love mercy, and to walk humbly with thy God."
This inauguration ceremony marks a new beginning, a new dedication within our Government, and a new spirit among us all. A President may sense and proclaim that new spirit, but only a people can provide it.
Two centuries ago, our Nation's birth was a mi

### Clean and prepare the data

In [98]:
# preprocess the texts and summaries.
# keep_most controls how much is kept; with keep_most=False we only keep
# letters and numbers.
# (to improve the model, this preprocessing step should be refined)
processed_texts, processed_summaries, words_counted = summarizer_data_utils.preprocess_texts_and_summaries(
    texts_unprocessed,
    summaries_unprocessed,
    keep_most=False)

Processing Time:  0.35489797592163086
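
The code of `preprocess_texts_and_summaries` is not shown in this notebook. The sketch below only illustrates the idea of keeping letters and numbers: lowercase, replace every other character with a space, split into words and count them. `simple_preprocess` is a hypothetical stand-in, and the real function may differ in its details.

```python
import re
from collections import Counter


def simple_preprocess(text):
    """Lowercase, keep only letters, digits and whitespace, then split into words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


sample = "For myself and for our Nation, I want to thank my predecessor."
tokens = simple_preprocess(sample)
word_counts = Counter(tokens)
print(tokens)
print(word_counts["for"])
```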


In [99]:
# some of the texts may be empty after preprocessing; remove those pairs.
processed_texts_clean = []
processed_summaries_clean = []

for t, s in zip(processed_texts, processed_summaries):
    # an empty list is falsy, so this keeps only pairs where both sides have words.
    if t and s:
        processed_texts_clean.append(t)
        processed_summaries_clean.append(s)

### Create lookup dicts

We cannot feed our network actual words, only numbers. So we first have to create our lookup dicts, where each word gets an int value (high or low, depending on its frequency in our corpus). These later help us convert the texts into numbers.

We also add special tokens. EndOfSentence (EOS) and StartOfSentence (SOS) are crucial for the Seq2Seq model we use later.
The pad token (PAD) is needed because all summaries and texts in a batch must have the same length, and the unknown token (UNK) stands in for words that are not in the vocabulary.

So we need 2 lookup dicts:
 - from word to index
 - from index to word
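
A minimal sketch of how two such dicts can be built from word counts. `build_lookup_dicts` is a hypothetical helper, not the code of `create_word_inds_dicts`; giving the special tokens the lowest indices and ordering the remaining words by frequency are assumptions.

```python
from collections import Counter


def build_lookup_dicts(word_counts, specials, min_occurrences=2):
    """Build word->index and index->word dicts, dropping rare words."""
    # Assumption: the special tokens take the first indices.
    word2ind = {token: i for i, token in enumerate(specials)}
    missing = []
    # most_common() yields words from most to least frequent.
    for word, count in word_counts.most_common():
        if count >= min_occurrences:
            word2ind[word] = len(word2ind)
        else:
            missing.append(word)
    ind2word = {i: w for w, i in word2ind.items()}
    return word2ind, ind2word, missing


toy_counts = Counter("the nation the people the nation spirit".split())
toy_specials = ["<EOS>", "<SOS>", "<PAD>", "<UNK>"]
toy_word2ind, toy_ind2word, toy_missing = build_lookup_dicts(toy_counts, toy_specials)
print(toy_word2ind)
print(toy_missing)
```

Words that fall below the threshold end up in the list of missing words and are later mapped to the unknown token.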

In [100]:
# create lookup dicts.
# most of the words appear only once.
# min_occurences set to 2 reduces our vocabulary by more than half.
specials = ["<EOS>", "<SOS>","<PAD>","<UNK>"]
word2ind, ind2word,  missing_words = summarizer_data_utils.create_word_inds_dicts(words_counted,
                                                                                  specials = specials,
                                                                                  min_occurences = 2)
print(len(word2ind), len(ind2word), len(missing_words))


1600 1600 1705


### Pretrained embeddings

Optionally we can use pretrained word embeddings, which often speed up training and improve accuracy.
I tried two options: GloVe embeddings and embeddings from tf_hub.
The ones from tf_hub worked better.

In [0]:
# glove_embeddings_path = '/Users/thomas/Jupyter_Notebooks/Pro Deep Learning with Tensorflow/Notebooks/glove/glove.6B.300d.txt'
# embedding_matrix_save_path = './embeddings/my_embedding.npy'
# emb = summarizer_data_utils.create_and_save_embedding_matrix(word2ind,
#                                                        glove_embeddings_path,
#                                                        embedding_matrix_save_path)
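
The GloVe route is commented out above and the code of `create_and_save_embedding_matrix` is not shown, so the following is only a sketch of the general idea: one row per vocabulary word, filled from pretrained vectors where available and left at a small random value otherwise. `build_embedding_matrix`, the toy vocabulary and the toy vector are hypothetical.

```python
import numpy as np


def build_embedding_matrix(word2ind, pretrained_vectors, dim, seed=0):
    """Return a (vocab_size, dim) matrix plus the number of words found."""
    rng = np.random.RandomState(seed)
    # Words without a pretrained vector keep this small random initialisation.
    matrix = rng.normal(scale=0.1, size=(len(word2ind), dim)).astype(np.float32)
    found = 0
    for word, ind in word2ind.items():
        vector = pretrained_vectors.get(word)
        if vector is not None:
            matrix[ind] = vector
            found += 1
    return matrix, found


toy_vocab = {"<PAD>": 0, "nation": 1, "spirit": 2}
toy_vectors = {"nation": np.ones(4, dtype=np.float32)}  # stands in for a loaded GloVe file
toy_matrix, n_found = build_embedding_matrix(toy_vocab, toy_vectors, dim=4)
print(toy_matrix.shape, n_found)
```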

In [102]:
# the embeddings from tf.hub. 
# embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
embed = hub.Module("https://tfhub.dev/google/Wiki-words-250/1")
emb = embed([key for key in word2ind.keys()])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    embedding = sess.run(emb)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore




In [103]:
embedding.shape

(1600, 250)

In [104]:
np.save('./tf_hub_embedding_speech.npy', embedding)

### Convert text and summaries
As mentioned above, we cannot feed the words directly to our network; we first have to convert them to numbers. That is what we do here. For the summaries we also add the SOS token at the start and the EOS token at the end, while the texts get neither.
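
A minimal sketch of this conversion step, under the assumption that unknown words are mapped to the UNK index. `words_to_indices` is a hypothetical helper and not the code of `convert_to_inds`; the toy vocabulary is made up.

```python
def words_to_indices(words, word2ind, sos=False, eos=False):
    """Map words to indices, optionally wrapping the result in SOS/EOS."""
    unk = word2ind["<UNK>"]
    inds = [word2ind.get(word, unk) for word in words]
    unknown = [word for word in words if word not in word2ind]
    if sos:
        inds = [word2ind["<SOS>"]] + inds
    if eos:
        inds = inds + [word2ind["<EOS>"]]
    return inds, unknown


toy_w2i = {"<EOS>": 0, "<SOS>": 1, "<PAD>": 2, "<UNK>": 3,
           "for": 4, "our": 5, "nation": 6}
text_inds, text_unknown = words_to_indices(["for", "myself", "our", "nation"], toy_w2i)
summary_inds, _ = words_to_indices(["our", "nation"], toy_w2i, sos=True, eos=True)
print(text_inds, text_unknown)
print(summary_inds)
```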

In [105]:
# converts words in texts and summaries to indices
converted_texts, unknown_words_in_texts = summarizer_data_utils.convert_to_inds(processed_texts_clean,
                                                                                word2ind,
                                                                                eos = False)

In [106]:
converted_summaries, unknown_words_in_summaries = summarizer_data_utils.convert_to_inds(processed_summaries_clean,
                                                                                        word2ind,
                                                                                        eos = True,
                                                                                        sos = True)

In [107]:
# seems to have worked well. 
print(summarizer_data_utils.convert_inds_to_text(converted_texts[0], ind2word))
print(summarizer_data_utils.convert_inds_to_text(converted_summaries[0], ind2word))


['for', '<UNK>', 'and', 'for', 'our', 'nation', 'i', 'want', 'to', 'thank', 'my', 'predecessor', 'for', 'all', 'he', 'has', 'done', 'to', 'heal', 'our', 'land', 'in', 'this', '<UNK>', 'and', 'physical', 'ceremony', 'we', '<UNK>', 'once', 'again', 'to', 'the', 'inner', 'and', '<UNK>', 'strength', 'of', 'our', 'nation', 'as', 'my', 'high', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'used', 'to', 'say', 'we', 'must', '<UNK>', 'to', 'changing', 'times', 'and', 'still', 'hold', 'to', '<UNK>', 'principles', 'here', 'before', 'me', 'is', 'the', 'bible', 'used', 'in', 'the', 'inauguration', 'of', 'our', 'first', 'president', 'in', '<UNK>', 'and', 'i', 'have', 'just', 'taken', 'the', 'oath', 'of', 'office', 'on', 'the', 'bible', 'my', '<UNK>', 'gave', 'me', 'just', 'a', 'few', 'years', 'ago', 'opened', 'to', 'a', 'timeless', '<UNK>', 'from', 'the', 'ancient', 'prophet', 'micah', 'he', '<UNK>', '<UNK>', 'thee', 'o', 'man', 'what', 'is', 'good', 'and', 'what', '<UNK>', 'the', 'lord', 'require',
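
The converted sequences still differ in length, which is where the pad token comes in. A sketch of padding one batch to a common length follows; `pad_batch` is a hypothetical helper, and how Summarizer.py actually batches its inputs is not shown in this notebook.

```python
def pad_batch(sequences, pad_index):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # The true lengths are usually kept so the model can ignore the padding.
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


toy_batch = [[1, 4, 0], [1, 4, 5, 6, 0]]
padded_batch, true_lengths = pad_batch(toy_batch, pad_index=2)
print(padded_batch)
print(true_lengths)
```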

## The model

Now we can build and train our model. First we define the hyperparameters we want to use. Then we create our Summarizer and call the function .build_graph(), which, as the name suggests, builds the computation graph.
Then we can train the model using .train()

After training we can try our model using .infer()

### Training
Unfortunately I do not have the resources to search for the best hyperparameters.

I trained the model for about 40 epochs. The training loss and the validation loss were both still declining.
I chose to use 90% of the data as training set and 10% as validation set.
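
A sketch of such a 90/10 split. In this notebook the split happens somewhere inside Summarizer.py, whose code is not shown, so `train_validation_split`, the shuffling and the seed are all assumptions.

```python
import random


def train_validation_split(texts, summaries, validation_fraction=0.1, seed=42):
    """Shuffle the (text, summary) pairs and hold out a fraction for validation."""
    pairs = list(zip(texts, summaries))
    random.Random(seed).shuffle(pairs)
    # Always hold out at least one pair.
    n_val = max(1, int(round(len(pairs) * validation_fraction)))
    return pairs[n_val:], pairs[:n_val]


# Eleven dummy pairs, mirroring the number of examples in this notebook.
dummy_texts = [[i, i + 1] for i in range(11)]
dummy_summaries = [[i] for i in range(11)]
train_pairs, val_pairs = train_validation_split(dummy_texts, dummy_summaries)
print(len(train_pairs), len(val_pairs))
```

With only 11 examples, a 10% validation set holds a single pair, so the validation loss is a very noisy signal.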

In [113]:
# model hyperparameters
num_layers_encoder = 4
num_layers_decoder = 4
rnn_size_encoder = 250
rnn_size_decoder = 250

batch_size = 22
epochs = 100
clip = 5
keep_probability = 0.8
learning_rate = 0.0005
max_lr=0.005
learning_rate_decay_steps = 100
learning_rate_decay = 0.90


pretrained_embeddings_path = './tf_hub_embedding_speech.npy'
summary_dir = os.path.join('.', 'tensorboard', 'speeches')

use_cyclic_lr = True
inference_targets=True
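
To give a feeling for the learning rate settings, here is a sketch of two common schedules: exponential decay and a triangular cyclic schedule between learning_rate and max_lr. How Summarizer.py actually combines these values is not shown here, and `step_size` (half a cycle, in steps) is a hypothetical parameter.

```python
import math


def exponential_decay(base_lr, step, decay_steps, decay_rate):
    """Smoothly multiply the rate by decay_rate once every decay_steps steps."""
    return base_lr * decay_rate ** (step / decay_steps)


def triangular_cyclic_lr(base_lr, max_lr, step, step_size):
    """Rise linearly from base_lr to max_lr over step_size steps, then fall back."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Values from the hyperparameter cell above; step_size=50 is made up.
decayed = exponential_decay(0.0005, step=100, decay_steps=100, decay_rate=0.90)
peak = triangular_cyclic_lr(0.0005, 0.005, step=50, step_size=50)
trough = triangular_cyclic_lr(0.0005, 0.005, step=100, step_size=50)
print(decayed, peak, trough)
```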


In [114]:
# build graph and train the model 
summarizer_model_utils.reset_graph()
summarizer = Summarizer.Summarizer(word2ind,
                                   ind2word,
                                   save_path='./models/speeches/my_model',
                                   mode='TRAIN',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   batch_size = batch_size,
                                   clip = clip,
                                   keep_probability = keep_probability,
                                   learning_rate = learning_rate,
                                   max_lr=max_lr,
                                   learning_rate_decay_steps = learning_rate_decay_steps,
                                   learning_rate_decay = learning_rate_decay,
                                   epochs = epochs,
                                   pretrained_embeddings_path = pretrained_embeddings_path,
                                   use_cyclic_lr = use_cyclic_lr,)
                                

summarizer.build_graph()
summarizer.train(converted_texts, 
                converted_summaries)


Loaded pretrained embeddings.


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Graph built.
-------------------- Epoch 0 of 100 --------------------
Iteration: 0 of 0	train_loss: 7.3778
Average Score for this Epoch: 7.3777689933776855
--- new best score ---


-------------------- Epoch 1 of 100 --------------------
Iteration: 0 of 0	train_loss: 7.3701
Average Score for this Epoch: 7.370136260986328
--- new best score ---


-------------------- Epoch 2 of 100 --------------------
Iteration: 0 of 0	train_loss: 7.3573
Average Score for this Epoch: 7.357321262359619
--- new best score ---


-------------------- Epoch 3 of 100 --------------------
Iteration: 0 of 0	train_loss: 7.3218
Average Score for this Epoch: 7.321800708770752
--- new best score ---


-------------------- Epoch 4 of 100 --------------------
Iteration: 0 of 0	train_loss: 7.2197
Average Score for this Epoch: 7.219707012176514
--- new best score ---


-------------------- Epoch 5 of 100 --------------------
Iteration: 0 of 0	train_loss: 7.0442
Average Score for this Epoch: 7.044190883636475
--- new b

In [116]:
# there are only 11 examples, so the [:50] slices used below keep all of them.
len(converted_texts)

11

### Inference
Now we can use our trained model to create summaries. Here we are clearly overfitting, as we only trained on 11 examples, i.e. the model does not generalize at all.


In [117]:
summarizer_model_utils.reset_graph()
summarizer = Summarizer.Summarizer(word2ind,
                                   ind2word,
                                   './models/speeches/my_model',
                                   'INFER',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   batch_size = len(converted_texts[:50]),
                                   clip = clip,
                                   keep_probability = 1.0,
                                   learning_rate = 0.0,
                                   beam_width = 5,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   inference_targets = False,
                                   pretrained_embeddings_path = pretrained_embeddings_path)

summarizer.build_graph()
preds = summarizer.infer(converted_texts[:50],
                         restore_path =  './models/speeches/my_model',
                         targets = converted_summaries[:50])




Loaded pretrained embeddings.
Graph built.
INFO:tensorflow:Restoring parameters from ./models/headlines/my_model




Done.


In [118]:
# show results
summarizer_model_utils.sample_results(preds,
                                      ind2word,
                                      word2ind,
                                      converted_summaries[:50],
                                      converted_texts[:50])




 ----------------------------------------------------------------------------------------------------
Actual Text:
for <UNK> and for our nation i want to thank my predecessor for all he has done to heal our land in this <UNK> and physical ceremony we <UNK> once again to the inner and <UNK> strength of our nation as my high <UNK> <UNK> <UNK> <UNK> <UNK> used to say we must <UNK> to changing times and still hold to <UNK> principles here before me is the bible used in the inauguration of our first president in <UNK> and i have just taken the oath of office on the bible my <UNK> gave me just a few years ago opened to a timeless <UNK> from the ancient prophet micah he <UNK> <UNK> thee o man what is good and what <UNK> the lord require of thee but to do <UNK> and to love mercy and to walk <UNK> with <UNK> god this inauguration ceremony <UNK> a new beginning a new dedication within our government and a new spirit among us all a president may sense and proclaim that new spirit but only a pe
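
To read a prediction, its indices have to be mapped back to words. A sketch of that step, which stops at the EOS token and skips SOS and padding, follows. `indices_to_text` is a hypothetical helper; the exact structure of `preds` with beam search is not shown here, so the example uses a plain list of indices.

```python
def indices_to_text(indices, ind2word, stop_token="<EOS>",
                    skip_tokens=("<SOS>", "<PAD>")):
    """Convert a sequence of indices to a string, cutting at the stop token."""
    words = []
    for ind in indices:
        word = ind2word[ind]
        if word == stop_token:
            break
        if word in skip_tokens:
            continue
        words.append(word)
    return " ".join(words)


toy_i2w = {0: "<EOS>", 1: "<SOS>", 2: "<PAD>", 3: "<UNK>",
           4: "a", 5: "new", 6: "spirit"}
decoded = indices_to_text([1, 4, 5, 6, 0, 2, 2], toy_i2w)
print(decoded)
```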

# Conclusion

Generally I am impressed by how well the model works. 
We only used a limited amount of data, trained it for a limited amount of time and used nearly random hyperparameters, and it still delivers good results. 

However, we are clearly overfitting the training data and the model does not generalize well.

Therefore it would be really interesting to scale it up and see how it performs. 

