# Comprehensive Guide to Text Summarization using Deep Learning in Python

## Introduction
“I don’t want to read your full report, just give me a summary of the results.” I have often found myself in this situation, both at university and in my professional life. We prepare a comprehensive report and the teacher or supervisor only has time to read the summary.

Sound familiar? Well, I decided to do something about it. Manually condensing a report into a summary is too time-consuming, right? Could Natural Language Processing (NLP) techniques help us here?

This is where the awesome concept of Text Summarization using Deep Learning really helped me out. It solves the one issue that kept bothering me before: now our model can take the context of the entire text into account. It’s a dream come true for all of us who need to come up with a quick summary of a document!
And the results we achieve using deep learning for text summarization? Remarkable. So in this article, we will walk step by step through building a Text Summarizer using Deep Learning, covering all the concepts required along the way. Then we will implement our first text summarization model in Python!

## Table of Contents
<ol>
    <li>What is Text Summarization in NLP?</li>
    <li>Introduction to Sequence-to-Sequence (Seq2Seq) Modeling</li>
    <li>Understanding the Encoder – Decoder Architecture</li>
    <li>Limitations of the Encoder – Decoder Architecture</li>
    <li>The Intuition behind the Attention Mechanism</li>
    <li>Understanding the Problem Statement</li>
    <li>Implementing a Text Summarization Model in Python using Keras</li>
    <li>What’s Next?</li>
    <li>How does the Attention Mechanism Work?</li>
</ol>

I’ve kept the ‘how does the attention mechanism work?’ section at the bottom of this article. It’s a math-heavy section and is not mandatory to understand how the Python code works. However, I encourage you to go through it because it will give you a solid idea of this awesome NLP concept.

## What is Text Summarization in NLP?
Let’s first understand what text summarization is before we look at how it works. Here is a succinct definition to get us started:

“Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning”

There are broadly two different approaches that are used for text summarization:

<ul>
    <li>Extractive Summarization</li>
    <li>Abstractive Summarization</li>
</ul>

### Extractive Summarization
The name gives away what this approach does. We identify the important sentences or phrases in the original text and extract only those. The extracted sentences become our summary. The diagram below illustrates extractive summarization:

![Extractive Summary](extractive1.jpg)
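
To make the idea concrete, here is a minimal sketch of an extractive summarizer. It is not the method used later in this article: the scoring rule (average word frequency per sentence), the regex-based sentence splitting, and the function name `extractive_summary` are all assumptions chosen for illustration.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)

review = ("The coffee is great. I buy this coffee every week because "
          "the coffee tastes fresh. Shipping was slow.")
print(extractive_summary(review, num_sentences=1))
```

Every sentence in the output is copied verbatim from the input, which is exactly what distinguishes the extractive approach from the abstractive one described next.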





### Abstractive Summarization
This is a very interesting approach. Here, we generate new sentences from the original text. This is in contrast to the extractive approach we saw earlier, where we used only sentences that were already present. The sentences generated through abstractive summarization might not appear in the original text:

![title](abstractive1.jpg)


We are going to build an Abstractive Text Summarizer using Deep Learning in this article! Let’s first understand the concepts necessary for building a Text Summarizer model before diving into the implementation part.

Exciting times ahead!

## Introduction to Sequence-to-Sequence (Seq2Seq) Modeling
We can build a Seq2Seq model for any problem that involves sequential information. This includes sentiment classification, Neural Machine Translation, and Named Entity Recognition, which are some very common applications of sequential information.

In the case of Neural Machine Translation, the input is text in one language and the output is text in another language (English to German in this example):

#### <center> How are you? ---->  Wie geht es dir? </center>

In Named Entity Recognition, the input is a sequence of words and the output is a sequence of tags, one for every word in the input sequence:


#### <center>Hafiz Zohaib founded CommInn. ----> B-Per, I-Per, O, B-Company, O</center>
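
The two examples above can be written down as (input sequence, output sequence) pairs. The snippet below is only an illustration: the way the sentences are split into tokens (punctuation as its own token) is an assumption, not something the examples specify.

```python
# (input sequence, output sequence) pairs taken from the two examples above.
# Assumption: punctuation is treated as a separate token.
translation = (["How", "are", "you", "?"],
               ["Wie", "geht", "es", "dir", "?"])
ner = (["Hafiz", "Zohaib", "founded", "CommInn", "."],
       ["B-Per", "I-Per", "O", "B-Company", "O"])

for name, (source, target) in [("translation", translation), ("ner", ner)]:
    same_length = len(source) == len(target)
    print(f"{name}: {len(source)} input tokens -> {len(target)} output tokens "
          f"(one output per input: {same_length})")

# NER pairs each word with exactly one tag.
for word, tag in zip(*ner):
    print(f"{word:10s} {tag}")
```

Named Entity Recognition keeps the input and output aligned one-to-one, while translation (and summarization) produce an output whose length is independent of the input length.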

Our objective is to build a text summarizer where the input is a long sequence of words (in a text body), and the output is a short summary (which is a sequence as well). So, we can model this as a Many-to-Many Seq2Seq problem. Below is a typical Seq2Seq model architecture:

![title](final.jpg)

There are two major components of a Seq2Seq model:

<ul>
    <li>Encoder</li>
    <li>Decoder</li>
</ul>

Let’s understand these two in detail. These are essential to understand how text summarization works underneath the code. You can also check out this tutorial to understand sequence-to-sequence modeling in more detail.

 

## Understanding the Encoder-Decoder Architecture
The Encoder-Decoder architecture is mainly used to solve sequence-to-sequence (Seq2Seq) problems where the input and output sequences have different lengths. Let’s understand this from the perspective of text summarization. The input is a long sequence of words and the output is a shorter version of the input sequence.

![title](first.jpg)


Generally, variants of Recurrent Neural Networks (RNNs), such as the Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM), are preferred as the encoder and decoder components. This is because they are better at capturing long-term dependencies, as their gating mechanisms mitigate the vanishing gradient problem.

We can set up the Encoder-Decoder in 2 phases:

<ul>
    <li>Training phase</li>
    <li>Inference phase</li>
</ul>

Let’s understand these concepts through the lens of an LSTM model.

 

## Training Phase
In the training phase, we first set up the encoder and decoder. We then train the model to predict the target sequence offset by one timestep. Let’s look in detail at how to set up the encoder and decoder.

 

## Encoder

An encoder Long Short-Term Memory (LSTM) model reads the entire input sequence, with one word fed into the encoder at each timestep. It processes the information at every timestep and captures the contextual information present in the input sequence. I’ve put together the diagram below to illustrate this process:

![title](61.jpg)


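The hidden state (h<sub>i</sub>) and cell state (c<sub>i</sub>) of the last timestep are used to initialize the decoder. Remember, this is needed because the encoder and decoder are two separate LSTM networks, so these states are the only way the decoder receives the context of the input sequence.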
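
The following is a minimal pure-Python sketch of that idea, assuming a standard LSTM cell with random, untrained weights and tiny one-hot word vectors. The helper names (`lstm_step`, `encode`, `init_weights`) are hypothetical and exist only for this illustration; a real model would use a library layer such as a Keras LSTM.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM timestep. x, h, c are lists of floats; W maps gate name -> weight rows."""
    z = x + h + [1.0]  # concatenate input, previous hidden state, and a bias term

    def gate(name, activation):
        return [activation(sum(w * v for w, v in zip(row, z))) for row in W[name]]

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # candidate values for the cell state
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def encode(sequence, W, hidden_size):
    """Feed one word vector per timestep; return the states of the last timestep."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(x, h, c, W)
    return h, c

def init_weights(input_size, hidden_size, seed=0):
    """Random (untrained) weights, for illustration only."""
    rng = random.Random(seed)
    cols = input_size + hidden_size + 1
    return {name: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                   for _ in range(hidden_size)]
            for name in ("input", "forget", "output", "candidate")}

W = init_weights(input_size=3, hidden_size=4)
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # three one-hot "words"
h_last, c_last = encode(sequence, W, hidden_size=4)
decoder_initial_state = (h_last, c_last)  # handed over to the decoder
print(len(h_last), len(c_last))
```

Whatever the length of the input, the decoder only ever receives these two fixed-size vectors from the encoder.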

 

## Decoder

The decoder is also an LSTM network. During training, it reads the entire target sequence word by word and predicts the same sequence offset by one timestep. In other words, the decoder is trained to predict the next word in the sequence given the previous word.

![title](61.jpg)

<b>start</b> and <b>end</b> are special tokens that are added to the target sequence before feeding it into the decoder. The target sequence is unknown while decoding a test sequence. So, we start predicting the target sequence by passing the first word into the decoder, which is always the `<start>` token. The `<end>` token signals the end of the sentence.
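
As a small sketch of what "offset by one timestep" means in practice, the snippet below builds the decoder input and the decoder target from a tokenized summary. The function name `make_decoder_pairs` and the example summary are assumptions for illustration.

```python
def make_decoder_pairs(summary_tokens, start="<start>", end="<end>"):
    """Build the decoder input and its target, shifted by one timestep."""
    full = [start] + list(summary_tokens) + [end]
    decoder_input = full[:-1]   # everything except the final <end>
    decoder_target = full[1:]   # everything except the initial <start>
    return decoder_input, decoder_target

dec_in, dec_out = make_decoder_pairs(["great", "coffee"])
for given, expected in zip(dec_in, dec_out):
    print(f"decoder sees {given!r:10} -> should predict {expected!r}")
```

At every position the target is simply the next token of the input, so the decoder learns to predict `<end>` once the summary is complete.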

Pretty intuitive so far.

## Inference Phase
    
After training, the model is tested on new source sequences for which the target sequence is unknown. So, we need to set up the inference architecture to decode a test sequence:
    
![title](82.jpg)   
    
## How does the inference process work?

Here are the steps to decode the test sequence:

<ol>
    <li>Encode the entire input sequence and initialize the decoder with the internal states of the encoder</li>
    <li>Pass the &lt;start&gt; token as an input to the decoder</li>
    <li>Run the decoder for one timestep with the internal states</li>
    <li>The output will be the probability distribution over the next word. The word with the maximum probability is selected</li>
    <li>Pass the selected word as an input to the decoder in the next timestep and update the internal states with those of the current timestep</li>
    <li>Repeat steps 3 – 5 until we generate the &lt;end&gt; token or hit the maximum length of the target sequence</li>
</ol>

Let’s take an example where the test sequence is given by <b>[x1, x2, x3, x4]</b>. How will the inference process work for this test sequence? I want you to think about it before you look at my thoughts below.
<ol>
    <li>Encode the test sequence into internal state vectors</li>
    <li>Observe how the decoder predicts the target sequence at each timestep:</li>
</ol>                   
<b>Timestep: t=1</b>
    
![title](d1.jpg) 
    
<b>Timestep: t=2</b>
    
![title](d2.jpg)  
    
<b>Timestep: t=3</b>
    
![title](d3.jpg) 
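
The decoding steps above can be sketched as a short greedy loop. The toy "model" here is a hand-written lookup table of next-word probabilities, not a trained network; `toy_encode`, `toy_decoder_step`, and `TOY_TABLE` are hypothetical stand-ins for the trained encoder and decoder.

```python
def greedy_decode(encode, decoder_step, source, max_len=10,
                  start="<start>", end="<end>"):
    state = encode(source)            # step 1: encode input, initialize decoder state
    token = start                     # step 2: first decoder input is <start>
    output = []
    for _ in range(max_len):          # step 6: stop at the maximum target length
        probs, state = decoder_step(token, state)  # step 3: run one timestep
        token = max(probs, key=probs.get)          # step 4: pick the most probable word
        if token == end:              # step 6: stop when <end> is generated
            break
        output.append(token)          # step 5: the chosen word becomes the next input
    return output

# Toy stand-ins for a trained model (assumptions for illustration only).
TOY_TABLE = {
    "<start>": {"great": 0.7, "coffee": 0.2, "<end>": 0.1},
    "great":   {"coffee": 0.8, "great": 0.1, "<end>": 0.1},
    "coffee":  {"<end>": 0.9, "great": 0.1},
}

def toy_encode(source):
    return {"steps": 0}  # a real encoder would return the final (h, c) states

def toy_decoder_step(token, state):
    return TOY_TABLE[token], {"steps": state["steps"] + 1}

print(greedy_decode(toy_encode, toy_decoder_step, ["x1", "x2", "x3", "x4"]))
```

Picking the single most probable word at each step is the simplest decoding strategy; it matches the steps listed above.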
    
## Limitations of the Encoder – Decoder Architecture
As useful as this encoder-decoder architecture is, there are certain limitations that come with it.

<ul>
    <li>The encoder converts the entire input sequence into a fixed-length vector, and the decoder then predicts the output sequence from it. This works only for short sequences, since the decoder has to rely on that single vector to represent the entire input sequence</li>
    <li>Here comes the problem with long sequences. It is difficult for the encoder to compress a long sequence into a fixed-length vector</li>
</ul>

“A potential issue with this encoder-decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences. The performance of a basic encoder-decoder deteriorates rapidly as the length of an input sentence increases.”

-Neural Machine Translation by Jointly Learning to Align and Translate

So how do we overcome this problem of long sequences? This is where the concept of attention mechanism comes into the picture. It aims to predict a word by looking at a few specific parts of the sequence only, rather than the entire sequence. It really is as awesome as it sounds!
    
## The Intuition behind the Attention Mechanism
How much attention do we need to pay to every word in the input sequence for generating a word at timestep t? That’s the key intuition behind this attention mechanism concept.

Let’s consider a simple example to understand how Attention Mechanism works:
<ul>
    <li>Source sequence: “Which sport do you like the most?”</li>
    <li>Target sequence: “I love cricket”</li>
    </ul>
The first word ‘I’ in the target sequence is connected to the fourth word ‘you’ in the source sequence, right? Similarly, the second word ‘love’ in the target sequence is associated with the fifth word ‘like’ in the source sequence.

So, instead of looking at all the words in the source sequence, we can increase the importance of specific parts of the source sequence that result in the target sequence. This is the basic idea behind the attention mechanism.
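
Here is a minimal sketch of that idea for the example sentence. It assumes dot-product scoring between the decoder state and each encoder hidden state, followed by a softmax; the two-dimensional vectors are made up by hand so that the state for ‘I’ lines up with ‘you’. None of these numbers come from a trained model.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights over source positions and the attended context vector."""
    # Assumption: dot-product score between decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    size = len(decoder_state)
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(size)]
    return weights, context

source = ["Which", "sport", "do", "you", "like", "the", "most"]
# Hand-made encoder hidden states, one per source word (illustrative only).
encoder_states = [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1], [2.0, 0.0],
                  [0.0, 2.0], [0.1, 0.1], [0.1, 0.1]]
decoder_state_for_I = [1.0, 0.0]

weights, context = attend(decoder_state_for_I, encoder_states)
for word, w in zip(source, weights):
    print(f"{word:6s} {w:.2f}")
```

The weights sum to 1, and the largest one falls on ‘you’, so the context vector used to generate ‘I’ is dominated by that source position.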

There are 2 different classes of attention mechanism depending on the way the attended context vector is derived:
<ul>
    <li>Global Attention</li>
    <li>Local Attention</li>
    </ul>
Let’s briefly touch on these classes.
    
## Global Attention
Here, the attention is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving the attended context vector.
    
    
    
## Local Attention
Here, the attention is placed on only a few source positions. Only a few hidden states of the encoder are considered for deriving the attended context vector.

![title](gl.png)

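
The difference between the two classes can be sketched as a choice of which source positions take part in the softmax. This sketch assumes that local attention uses a fixed window around a chosen centre position; the text above only says "a few source positions", so the window rule, the function names, and the scores are all illustrative assumptions.

```python
import math

def select_positions(num_source, mode, center=None, window=1):
    """Source positions that take part in attention."""
    if mode == "global":
        return list(range(num_source))  # every encoder hidden state
    if center is None:
        raise ValueError("local attention needs a center position")
    lo = max(0, center - window)
    hi = min(num_source, center + window + 1)
    return list(range(lo, hi))          # only a window of hidden states

def attention_weights(scores, positions):
    """Softmax over the selected positions; all other positions get weight 0."""
    m = max(scores[p] for p in positions)
    exps = {p: math.exp(scores[p] - m) for p in positions}
    total = sum(exps.values())
    return [exps[p] / total if p in exps else 0.0 for p in range(len(scores))]

scores = [0.1, 0.1, 0.1, 2.0, 0.0, 0.1, 0.1]  # made-up alignment scores

global_w = attention_weights(scores, select_positions(len(scores), "global"))
local_w = attention_weights(scores, select_positions(len(scores), "local", center=3))
print([round(w, 2) for w in global_w])
print([round(w, 2) for w in local_w])
```

With global attention every position receives some weight; with local attention the weight is spread only over positions 2 to 4 and is exactly zero elsewhere.
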
## Understanding the Problem Statement
Customer reviews can often be long and descriptive. Analyzing these reviews manually, as you can imagine, is really time-consuming. This is where the brilliance of Natural Language Processing can be applied to generate a summary for long reviews.

We will be working on a really cool dataset. Our objective here is to generate summaries for the Amazon Fine Food reviews using the abstractive approach we learned about above.