A Guide on Text Summarization using NLP techniques

During my capstone project for the Machine Learning Engineer Nanodegree at Udacity, I studied the problem of text summarization in some depth. For that reason, I am going to write a series of articles about it: from the definition of the problem and some approaches to solving it, through basic implementations and algorithms, to describing and testing more advanced techniques. The series will span several posts over the next few weeks or months.

I will also take advantage of powerful tools such as Amazon SageMaker containers, hyperparameter tuning, transformers and Weights & Biases logging, and show you how to use them to improve and evaluate the performance of the models.

In summary, some of the upcoming posts will introduce:

  • Exploratory data analysis for text, to dive deeper into the features of the text and its word distributions.
  • Extractive solutions: using a simple function from the popular gensim library, and a sentence-clustering algorithm.
  • Abstractive summarization using LSTMs and the attention mechanism.
  • The Pointer-Generator network, an extension of the encoder-decoder that mixes extractive and abstractive approaches.
  • The Transformer model, which extends the attention concept into a complete architecture.
  • Advanced transformer models like T5 or BART from the fantastic transformers library by Hugging Face.
  • And more…

Overview

With the rise of information technologies, globalization and the Internet, an enormous amount of information is created daily, including a large volume of written text. The International Data Corporation (IDC) projects that the total amount of digital data circulating annually around the world will grow "from 4.4 zettabytes in 2013 to hit 180 zettabytes in 2025" [1]. Dealing with such a huge amount of data is a real problem where automation techniques can help many industries and businesses.

For example, hundreds or thousands of news articles are published around the world within a few hours, and people do not want to spend ten minutes reading a full article. So the development of automatic techniques to produce short, concise and understandable summaries would be of great help to many global companies and organizations.

Another use case is social media monitoring: many companies and organizations need to be notified when tweets mention their products or brand, so they can prepare an appropriate and quick response. Other fields of interest are legal contract analysis, question answering, bots, etc.

Problem Statement

Text summarization is a challenging problem, and it can be defined as the technique of shortening a long piece of text to create a coherent and fluent short summary containing only the main points of the document.

But, what is a summary? It is "a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s). Summarization clearly involves both these still poorly understood processes, and adds a third (condensation, abstraction, generalization)" [3]. Or, as it is described in [4], text summarization is "the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)."

At this moment it is a very active field of research, and state-of-the-art solutions are still not as successful as we might expect.

Our main goal in this project is to analyze and build a text summarizer using some basic techniques based on machine learning algorithms. Given a long, descriptive text about some topic, we will create a brief and understandable summary covering that same topic. As with any other supervised technique, we will use a dataset containing pairs of texts and summaries.

Dataset

When searching for information and data about text summarization, I found it hard to obtain a "good" dataset. Some of the most popular datasets are intended for research use, containing hundreds of thousands of examples and gigabytes of data that require high computational capacity and days or weeks to train on. But we are interested in a dataset that can be trained on faster, in a few hours, and on which we can experiment and develop easily.

For that reason, we will use a dataset from Kaggle called Inshorts News Data. Inshorts is a news service that provides short summaries of news from around the web, scraping the articles from sources such as the Hindu, the Indian Times and the Guardian. The dataset contains about 55,000 news items, each with its headline, summary and source.

You can access the Inshorts News Data dataset on Kaggle. The data files are also included in the data directory of this repository.
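
As a quick orientation, here is a minimal sketch of loading the data with pandas. The file name news_summary.csv, the encoding and the column layout are assumptions for illustration; check the actual files in the data directory.

```python
# Minimal sketch: load the Inshorts data with pandas.
# NOTE: the file name "news_summary.csv" and the encoding are
# assumptions; check the files in the data directory of this repo.
import pandas as pd

df = pd.read_csv("data/news_summary.csv", encoding="latin-1")
print(df.shape)             # expected: roughly 55,000 rows
print(df.columns.tolist())  # headline, summary and source columns
```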

Exploratory Data Analysis and Data Preprocessing

WORK IN PROGRESS

The Jupyter notebook Text_summarization_EDA describes an EDA on the dataset, where we can observe the word and sentence distributions and some other interesting insights.
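
To give a flavor of what the notebook does, the sketch below computes word-count distributions for the texts and their summaries. The column names text and summary are assumptions and may differ from the actual data files.

```python
# Sketch of a typical EDA step: word-count distributions for the
# source texts and their summaries. Column names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/news_summary.csv", encoding="latin-1")
df["text_words"] = df["text"].astype(str).str.split().str.len()
df["summary_words"] = df["summary"].astype(str).str.split().str.len()

# Percentiles like these help choose the maximum sequence lengths
# used later when padding inputs for the models.
print(df[["text_words", "summary_words"]]
      .describe(percentiles=[0.9, 0.95, 0.99]))

df[["text_words", "summary_words"]].hist(bins=50, figsize=(10, 4))
plt.show()
```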

In the Jupyter notebook TextSumm_Data_Preprocessing we apply some cleaning techniques to the text data (dealing with punctuation, stop words, ...) and split the data into a training set and a validation set.
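
A minimal sketch of those two steps follows; the stop-word list, column names and split ratio are illustrative assumptions, and the real cleaning rules live in the notebook.

```python
# Sketch of the cleaning and splitting steps described above.
# The stop-word list and column names are illustrative assumptions.
import re
import pandas as pd
from sklearn.model_selection import train_test_split

STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on"}

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)

df = pd.read_csv("data/news_summary.csv", encoding="latin-1")
df["text"] = df["text"].astype(str).map(clean_text)
df["summary"] = df["summary"].astype(str).map(clean_text)

# Hold out 10% of the text/summary pairs for validation.
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)
```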

Encoder Decoder model

Our first model is a sequence-to-sequence model (no attention mechanism) containing an embedding layer and an LSTM recurrent layer. It is a simple approach to the problem but a good starting point. We are interested in showing how to use AWS SageMaker to train the model and track its performance on the Weights & Biases platform.
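
In broad strokes, the architecture looks like the Keras sketch below; the vocabulary size and layer dimensions are illustrative, not the values used in the notebooks.

```python
# Minimal Keras sketch of the encoder-decoder: an embedding layer and
# a single LSTM on each side, no attention. Dimensions are illustrative.
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Model

vocab_size, emb_dim, latent_dim = 20000, 128, 256

# Encoder: read the source text and keep only the final LSTM states.
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_size, emb_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generate the summary conditioned on the encoder states.
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(vocab_size, emb_dim)(dec_inputs)
dec_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                         return_state=True)(dec_emb,
                                            initial_state=[state_h, state_c])
outputs = Dense(vocab_size, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```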

We have built:

  • "TextSumm_Enc_Dec_custom_container": using a custom container in a SageMaker training job.
  • "TextSumm-Enc-Dec-SageMaker-hp-tunning": hyperparameter tuning job in SageMaker.

Encoder Decoder model with Attention

The next model is a sequence-to-sequence model with Bahdanau attention. It is also a simple approach to the attention mechanism. Again, we show how to use AWS SageMaker to train the model and track its performance on the Weights & Biases platform.
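
For reference, here is a compact sketch of Bahdanau (additive) attention as a Keras layer, in the spirit of the implementation used in these notebooks: the decoder state (the query) is scored against every encoder output (the values) to build a context vector.

```python
# Sketch of Bahdanau (additive) attention as a Keras layer.
# query: (batch, latent_dim) decoder state;
# values: (batch, src_len, latent_dim) encoder outputs.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # Score every encoder step against the current decoder state.
        query = tf.expand_dims(query, 1)                   # (batch, 1, latent_dim)
        score = self.V(tf.nn.tanh(self.W1(query) + self.W2(values)))
        weights = tf.nn.softmax(score, axis=1)             # (batch, src_len, 1)
        context = tf.reduce_sum(weights * values, axis=1)  # (batch, latent_dim)
        return context, weights
```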

We have built:

  • "TextSumm-Enc-Dec-Attention-SageMaker": Training the model in a built-in SageMaker container in script mode.
  • "TextSumm_Enc_Dec_Att_custom_container": using a custom container in a SageMaker training job.
  • "TextSumm-Enc-Dec-Attention-SageMaker-hp-tunning": hyperparameter tuning job in SageMaker.

Transformer model

Now we implement a Transformer model, which replaces the recurrent layers of the previous solutions with self-attention.
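
At the heart of the Transformer sits scaled dot-product attention; below is a minimal sketch, close to the standard TensorFlow formulation. The full model in the notebooks stacks multi-head attention, positional encodings and feed-forward layers around it.

```python
# Sketch of scaled dot-product attention, the core of the Transformer.
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, depth)
    scores = tf.matmul(q, k, transpose_b=True)        # (..., len_q, len_k)
    scores /= tf.math.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))
    if mask is not None:
        scores += (mask * -1e9)   # block padded or future positions
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v), weights
```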

We have built:

  • "TextSumm_Transformer_custom_container": using a custom container in a SageMaker training job.
  • "TextSumm-Transformer-hp-tunning": hyperparameter tuning job in SageMaker.
  • "TextSumm_Transformer_inference": In progress create an inference container for a transformer model.

New features in progress

License

This repository is under the GNU General Public License v3.0.

This repository was developed by Eduardo Muñoz Sala.
