anna-kay/Reddit-summarization

Abstractive summarization of Reddit datasets with transformers


Overview

This project revolves around Abstractive Summarization, a Natural Language Processing (NLP) task. Transformer-based (deep learning) models are used to summarize informal, noisy texts sourced from Reddit.

The repo contains:

  • Exploratory Data Analysis of the Reddit datasets
  • Filtering of noise from the Reddit datasets
  • Replication of the results of the papers that introduce the Transformer-based models for Abstractive Summarization
  • Fine-tuning of the Transformer-based models on the Reddit datasets

All datasets and model checkpoints used in the project are downloaded from Hugging Face 🤗.

Data

Datasets:

  1. Webis-TLDR-17, https://aclanthology.org/W17-4508/ (paper), https://huggingface.co/datasets/webis/tldr-17 (🤗 dataset card)
  2. Reddit TIFU, https://arxiv.org/abs/1811.00783 (paper), https://huggingface.co/datasets/reddit_tifu (🤗 dataset card)

Dataset | Subreddit | Time Span | Size (items) | Fields
Webis-TLDR-17 | 29,650 different subreddits ("r/tifu" included) | 2006 - 2016 | 3,848,330 | 'author', 'body', 'normalizedBody', 'subreddit', 'subreddit_id', 'id', 'content' (the source text), 'summary' (the summary)
Reddit TIFU | "r/tifu" | Jan 2013 - Mar 2018 | 42,139 | 'ups', 'num_comments', 'upvote_ratio', 'score', 'title', 'documents' (the source text), 'tldr' (the summary)
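
For example, the available fields can be inspected directly after loading the datasets (the dataset identifiers are the same as in the loading snippet under point 3 below; the small slices are only used to keep the example light):

from datasets import load_dataset

# Load small slices just to inspect the available fields
webis_sample = load_dataset('reddit', split='train[:100]')
tifu_sample = load_dataset('reddit_tifu', 'long', split='train[:100]')

print(webis_sample.column_names)  # ['author', 'body', 'normalizedBody', 'subreddit', 'subreddit_id', 'id', 'content', 'summary']
print(tifu_sample.column_names)   # ['ups', 'num_comments', 'upvote_ratio', 'score', 'documents', 'tldr', 'title']
print(tifu_sample[0]['documents'][:200], '->', tifu_sample[0]['tldr'])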

Issues with the data:

  1. Overlap between Webis-TLDR-17 and Reddit TIFU:
    As the 'Subreddit' and 'Time Span' columns suggest, there is potential for overlap between the two datasets, as both include data from the "r/tifu" subreddit spanning 2013 to 2016. More specifically, Webis-TLDR-17 includes 52,219 items belonging to the "r/tifu" subreddit. This project investigates and confirms the presence of this overlap. The two datasets share approximately 5,700 common items (exact matches after lowercasing & removing asterisks), which constitute 13.5% of Reddit TIFU, 10.9% of the "r/tifu" items of Webis-TLDR-17, and 0.15% of the total of Webis-TLDR-17. - overlap examination
  2. Both datasets contain duplicates:
    Webis-TLDR-17 contains 40,407 items that are exact duplicates of another item (30,966 non-unique values) in terms of source text ('content' field), see Webis-TLDR-17 filtering.
    Reddit TIFU contains 38 items that are exact duplicates of another item (24 non-unique values) in terms of source text ('documents' field) and 56 near-duplicates, see Reddit TIFU filtering. It is worth noting that one item appears 25 times in Reddit TIFU (25 exact or near duplicates and one original, e.g. at indexes 8200, 8207, and 8208 of the dataset).
  3. No official train-val-test splits for either dataset:
    No official train-val-test splits were found in the papers introducing or performing experiments on the datasets, and Hugging Face Datasets does not provide any splits either. The entirety of each dataset is therefore loaded with the split='train' argument, like this:
from datasets import load_dataset

webis_tldr = load_dataset('reddit', split='train')
reddit_tifu = load_dataset('reddit_tifu', 'long', split='train')
  4. Both datasets are noisy:
    As Reddit is an open platform, data quality issues are expected. For the summarization task specifically, the most relevant issues are:
  • very short summaries that are not proportionate to the source text,
  • users not providing a summary in the summary field, but instead posting a short message prompting the reader to read the whole source text or the title, stating a conclusion or a general truth, or posing a question.
    These issues make such data points unsuitable for training summarization models, since they do not behave like proper source-summary pairs. A sketch that combines these cleaning steps is given below.
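
A minimal sketch of how the cleaning steps above could be combined is shown below. It is illustrative only: the normalization rule follows the overlap check described in point 1, but the length thresholds, split ratios, and the choice to filter Reddit TIFU rather than Webis-TLDR-17 are assumptions and do not necessarily match the filtering notebooks in this repo.

from datasets import load_dataset

def normalize(text):
    # Matching rule from the overlap check: lowercase and strip asterisks
    return text.lower().replace('*', '')

webis_tldr = load_dataset('reddit', split='train')
reddit_tifu = load_dataset('reddit_tifu', 'long', split='train')

# 1. Drop Reddit TIFU items whose source text also appears in Webis-TLDR-17
#    (loads the whole 'content' column into memory; acceptable for a sketch)
webis_sources = set(normalize(text) for text in webis_tldr['content'])
reddit_tifu = reddit_tifu.filter(lambda ex: normalize(ex['documents']) not in webis_sources)

# 2. Drop exact duplicates within Reddit TIFU, keeping the first occurrence
seen = set()
def is_first_occurrence(ex):
    key = normalize(ex['documents'])
    if key in seen:
        return False
    seen.add(key)
    return True

reddit_tifu = reddit_tifu.filter(is_first_occurrence)

# 3. Drop items with degenerate summaries (threshold values are assumptions)
def has_reasonable_summary(ex):
    n_summary = len(ex['tldr'].split())
    n_source = len(ex['documents'].split())
    return n_source > 0 and n_summary >= 5 and n_summary / n_source >= 0.01

reddit_tifu = reddit_tifu.filter(has_reasonable_summary)

# 4. Create train/val/test splits, since no official ones exist (90/5/5 is an assumption)
splits = reddit_tifu.train_test_split(test_size=0.1, seed=42)
val_test = splits['test'].train_test_split(test_size=0.5, seed=42)
train_set, val_set, test_set = splits['train'], val_test['train'], val_test['test']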

Models

  1. BART, https://arxiv.org/abs/1910.13461 (paper)
  2. PEGASUS, http://proceedings.mlr.press/v119/zhang20ae (paper), https://github.com/google-research/pegasus (github)
  3. ProphetNet, https://arxiv.org/abs/2001.04063 (paper), https://github.com/microsoft/ProphetNet/tree/master/ProphetNet (github)
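
All three models have sequence-to-sequence checkpoints on the Hugging Face Hub and can be loaded with the transformers Auto classes. The checkpoint names below are the publicly available base checkpoints and are given only as examples; they are not necessarily the exact checkpoints used in this repo.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Example checkpoint names (assumptions, not necessarily the ones used here)
checkpoints = {
    'BART': 'facebook/bart-large',
    'PEGASUS': 'google/pegasus-large',
    'ProphetNet': 'microsoft/prophetnet-large-uncased',
}

tokenizer = AutoTokenizer.from_pretrained(checkpoints['BART'])
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoints['BART'])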

Project Structure

| - .vscode/
| - - launch.json
| - notebooks/
| - - EDA/
| - - filtering/
| - - results_replication/
| - src/
| - - dataset.py
| - - train.py
| - - train_without_optimizer.py
| - - utils/
  • the .vscode directory is only relevant if the code is run in Visual Studio Code
  • notebooks contains the .ipynb files for the Exploratory Data Analysis, the filtering of the Reddit datasets, and the replication of the Abstractive Summarization results of the BART, PEGASUS, and ProphetNet papers
  • src contains the PyTorch code for fine-tuning the Transformer-based models on the Reddit datasets
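
As an illustration of what the fine-tuning step in src/ amounts to, the sketch below fine-tunes a BART checkpoint on Reddit TIFU with the transformers Seq2SeqTrainer. This is a minimal sketch, not the repo's actual training code: train.py and train_without_optimizer.py may use a custom PyTorch loop, and the checkpoint name, sequence lengths, and hyperparameters shown here are assumptions. It reuses the train_set/val_set splits from the data-preparation sketch above.

from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

checkpoint = 'facebook/bart-large'  # example checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # Reddit TIFU field names: 'documents' is the source text, 'tldr' is the summary
    model_inputs = tokenizer(batch['documents'], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch['tldr'], max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# train_set / val_set are the splits created in the data-preparation sketch above
tokenized_train = train_set.map(preprocess, batched=True, remove_columns=train_set.column_names)
tokenized_val = val_set.map(preprocess, batched=True, remove_columns=val_set.column_names)

args = Seq2SeqTrainingArguments(
    output_dir='bart-reddit-tifu',  # example output path (assumption)
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()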