This project builds a text summarization system for legal documents: given a piece of text (potentially spanning many paragraphs), it outputs a summarized version. A good summarizer preserves all the important details of the input text while being succinct.
A good document summarizer is valuable in many settings: in the legal industry it can summarize long legal documents, in healthcare it can condense the important aspects of a medication, in the news industry it can summarize articles, in finance it can distill 10-K SEC filings, and so on.
There are two main approaches to text summarization:
- Extractive text summarization: identifies important sections of the original article and copies them to form the summary. It can be thought of as a highlighter.
- Abstractive text summarization: first understands the entire article and then generates new text that succinctly reproduces its important information. It can be thought of as a pen.
Of the two, abstractive summarization is closer to what humans do, and thus it has greater potential. The downside is that it is much harder to implement, because it requires a language model that can generate new text conditioned on the input article. For this project, an abstractive text summarizer (using deep learning) is developed.
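For contrast, here is a toy sketch of the extractive approach (not part of this project's code): it scores each sentence by the frequency of its words and copies the top-scoring sentences verbatim, acting as the "highlighter" described above.

```python
from collections import Counter
import re

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    # Score a sentence by the average frequency of its words.
    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)

print(extractive_summary(
    "The patent describes a battery. The battery uses a lithium anode. "
    "Testing showed improved capacity. The weather was sunny."
))
```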
The project outline is as follows:
- Data collection
- Exploratory Data Analysis and Data Wrangling
- Literature Survey
- Model Building and Evaluation
- MLOps and Deployment
The DataCollection directory contains the script used to download various text summarization datasets, such as BigPatent, CNN/Daily Mail, the arXiv/PubMed scientific papers, and Gigaword. For this project, the BigPatent dataset is used. For more details, please refer to the Readme.md file inside the DataCollection directory.
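As a point of reference, BigPatent is also published on the Hugging Face Hub; here is a minimal sketch of loading it with the `datasets` library (the project's own download script may fetch it differently, and the exact loading call can depend on the library version):

```python
# Hedged sketch: load one CPC-code configuration ("a") of BigPatent via
# the Hugging Face `datasets` library. The field names below
# ("description", "abstract") follow the Hub version of the dataset.
from datasets import load_dataset

big_patent = load_dataset("big_patent", "a", split="train")
example = big_patent[0]
print(example["description"][:300])  # the long input text
print(example["abstract"][:300])     # the reference summary
```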
The DataWrangling directory contains all the code used to load the dataset, preprocess it with various regular expressions, and generate the vocabulary and the word2idx and idx2word dictionaries for both the description and the summary. Furthermore, data visualization is performed to understand various aspects of the input descriptions and summaries. Please refer to Readme.md and step5_data_wrangling.ipynb located inside this directory for details.
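The vocabulary step works roughly as follows; this is a minimal sketch with illustrative special tokens, not the notebook's exact preprocessing:

```python
from collections import Counter

def build_vocab(texts, max_size=50_000):
    counts = Counter(token for text in texts for token in text.split())
    # Reserve the lowest indices for padding/unknown/start/end tokens.
    vocab = ["<pad>", "<unk>", "<sos>", "<eos>"]
    vocab += [w for w, _ in counts.most_common(max_size - len(vocab))]
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

word2idx, idx2word = build_vocab(["a battery with a lithium anode", "a summary"])
print(word2idx["a"], idx2word[word2idx["a"]])
```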
The notebook Literature_Survey.ipynb located inside the LiteratureSurvey directory covers various methods used for text summarization, both extractive and abstractive. It discusses unsupervised methods such as TextRank, Lead-3, and random sampling, as well as supervised deep learning methods such as Pointer-Generator Networks and pre-trained BERT-based models.
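Lead-3 is the simplest of these baselines: it takes the first three sentences of the article as the summary. A minimal sketch:

```python
import re

def lead_3(article: str) -> str:
    """Return the first three sentences of the article as the summary."""
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:3])
```

Despite its simplicity, Lead-3 is a notoriously strong baseline on news data, where articles front-load the key facts.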
The notebook named ModelBuilding.ipynb inside the ModelBuilding directory contains the different encoder-decoder deep learning text summarization models I built in PyTorch (a minimal sketch follows the list), such as:
- LSTM
- LSTM with Attention
- Transformers
- Memory-efficient Transformers
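To give a flavor of these models, here is a minimal sketch of a tiny Transformer summarizer built on torch.nn.Transformer; the notebook's models add positional encodings, padding masks, and full training loops, and all sizes below are illustrative:

```python
import torch
import torch.nn as nn

class TinySummarizer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each summary position attends only to its past.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(
            self.embed(src_ids), self.embed(tgt_ids), tgt_mask=tgt_mask
        )
        return self.out(hidden)  # logits over the vocabulary

model = TinySummarizer(vocab_size=10_000)
src = torch.randint(0, 10_000, (2, 64))  # batch of tokenized descriptions
tgt = torch.randint(0, 10_000, (2, 16))  # batch of summary prefixes
logits = model(src, tgt)                 # shape: (2, 16, 10_000)
```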
The MLOps directory is the main directory, containing the end-to-end pipeline for data preprocessing, model training, data and model versioning, metric logging, inference, and production deployment. The Weights & Biases framework is used for data versioning, model versioning, and tracking performance metrics across different experiments. PyTorch is used for model training, and a Flask-based API is used for model serving (a minimal sketch follows below). Unit testing is performed using Pytest. Additionally, GitHub Actions is used for CI/CD, running the suite of unit tests automatically on every push to the GitHub repository. Lastly, a Docker container is built that can be deployed to a production environment such as an AWS EC2 instance. Please refer to the Readme.md file located inside the MLOps/MLOps directory for details.
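Here is a hedged sketch of what such a Flask serving layer can look like; the endpoint name, model loading, and inference call below are illustrative, not this repo's actual API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
# model = load_trained_summarizer(...)  # hypothetical: restore the PyTorch model

@app.route("/summarize", methods=["POST"])
def summarize():
    description = request.get_json()["description"]
    # summary = model.generate(description)  # hypothetical inference call
    summary = description[:200]  # placeholder so the sketch runs end to end
    return jsonify({"summary": summary})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```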