anna-kay/Reddit-summarization

Abstractive summarization of Reddit datasets with transformers


Overview

This project revolves around Abstractive Summarization, a Natural Language Processing (NLP) task. Transformer-based (deep learning) models are used to summarize informal, noisy texts sourced from Reddit.

The repo contains:

  • Exploratory Data Analysis of the Reddit datasets
  • Filtering of noise from the Reddit datasets
  • Replication of the results of the papers that introduce the Transformer-based models for Abstractive Summarization
  • Fine-tuning of the Transformer-based models on the Reddit datasets

All datasets and model checkpoints used in the project are downloaded from Hugging Face 🤗.

Data

Datasets:

  1. Webis-TLDR-17, https://aclanthology.org/W17-4508/ (paper), https://huggingface.co/datasets/webis/tldr-17 (🤗 dataset card)
  2. Reddit TIFU, https://arxiv.org/abs/1811.00783 (paper), https://huggingface.co/datasets/reddit_tifu (🤗 dataset card)

Dataset | Subreddit | Time Span | Size (items) | Fields
Webis-TLDR-17 | 29,650 different subreddits ("r/tifu" included) | 2006 - 2016 | 3,848,330 | 'author', 'body', 'normalizedBody', 'subreddit', 'subreddit_id', 'id', 'content' (the source text), 'summary' (the summary)
Reddit TIFU | "r/tifu" | Jan 2013 - Mar 2018 | 42,139 | 'ups', 'num_comments', 'upvote_ratio', 'score', 'title', 'documents' (the source text), 'tldr' (the summary)
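
For example, the available fields can be inspected directly after loading the datasets (the dataset identifiers are the same as in the loading snippet under point 3 below; the small slices are only used to keep the example light):

from datasets import load_dataset

# Load small slices just to inspect the available fields
webis_sample = load_dataset('reddit', split='train[:100]')
tifu_sample = load_dataset('reddit_tifu', 'long', split='train[:100]')

print(webis_sample.column_names)  # ['author', 'body', 'normalizedBody', 'subreddit', 'subreddit_id', 'id', 'content', 'summary']
print(tifu_sample.column_names)   # ['ups', 'num_comments', 'upvote_ratio', 'score', 'documents', 'tldr', 'title']
print(tifu_sample[0]['documents'][:200], '->', tifu_sample[0]['tldr'])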

Issues with the data:

  1. Overlap between Webis-TLDR-17 and Reddit TIFU:
    As the 'Subreddit' and 'Time Span' columns suggest, there is potential for overlap between the two datasets, as both include data from the "r/tifu" subreddit spanning 2013 to 2016. More specifically, Webis-TLDR-17 includes 52,219 items belonging to the "r/tifu" subreddit. This project investigates and confirms the presence of this overlap. The two datasets share approximately 5,700 common items (exact matches after lowercasing & removing asterisks), which constitute 13.5% of Reddit TIFU, 10.9% of the "r/tifu" items of Webis-TLDR-17, and 0.15% of the total of Webis-TLDR-17. - overlap examination
  2. Both datasets contain duplicates:
    Webis-TLDR-17 contains 40,407 items that are exact duplicates of another item (30,966 non-unique values) in terms of source text ('content' field), see Webis-TLDR-17 filtering.
    Reddit TIFU contains 38 items that are exact duplicates of another item (24 non-unique values) in terms of source text ('documents' field) and 56 near-duplicates, see Reddit TIFU filtering. It is worth noting that one item appears 25 times in Reddit TIFU (25 exact or near duplicates and one original, e.g. at indexes 8200, 8207, and 8208 of the dataset).
  3. No official train-val-test splits for either dataset:
    No official train-val-test splits were found in the papers introducing or performing experiments on the datasets, and Hugging Face Datasets does not provide any splits either. The entirety of each dataset is therefore loaded with the split='train' argument, like this:
from datasets import load_dataset

webis_tldr = load_dataset('reddit', split='train')
reddit_tifu = load_dataset('reddit_tifu', 'long', split='train')
  4. Both datasets are noisy:
    As Reddit is an open platform, data quality issues are expected. For the summarization task specifically, the most relevant issues are:
  • very short summaries that are not proportionate to the source text,
  • users not providing a summary in the summary field, but instead posting a short message prompting the reader to read the whole source text or the title, stating a conclusion or a general truth, or posing a question.
    These issues make such data points unsuitable for training summarization models, since they do not behave like proper source-summary pairs. A sketch that combines these cleaning steps is given below.
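
A minimal sketch of how the cleaning steps above could be combined is shown below. It is illustrative only: the normalization rule follows the overlap check described in point 1, but the length thresholds, split ratios, and the choice to filter Reddit TIFU rather than Webis-TLDR-17 are assumptions and do not necessarily match the filtering notebooks in this repo.

from datasets import load_dataset

def normalize(text):
    # Matching rule from the overlap check: lowercase and strip asterisks
    return text.lower().replace('*', '')

webis_tldr = load_dataset('reddit', split='train')
reddit_tifu = load_dataset('reddit_tifu', 'long', split='train')

# 1. Drop Reddit TIFU items whose source text also appears in Webis-TLDR-17
#    (loads the whole 'content' column into memory; acceptable for a sketch)
webis_sources = set(normalize(text) for text in webis_tldr['content'])
reddit_tifu = reddit_tifu.filter(lambda ex: normalize(ex['documents']) not in webis_sources)

# 2. Drop exact duplicates within Reddit TIFU, keeping the first occurrence
seen = set()
def is_first_occurrence(ex):
    key = normalize(ex['documents'])
    if key in seen:
        return False
    seen.add(key)
    return True

reddit_tifu = reddit_tifu.filter(is_first_occurrence)

# 3. Drop items with degenerate summaries (threshold values are assumptions)
def has_reasonable_summary(ex):
    n_summary = len(ex['tldr'].split())
    n_source = len(ex['documents'].split())
    return n_source > 0 and n_summary >= 5 and n_summary / n_source >= 0.01

reddit_tifu = reddit_tifu.filter(has_reasonable_summary)

# 4. Create train/val/test splits, since no official ones exist (90/5/5 is an assumption)
splits = reddit_tifu.train_test_split(test_size=0.1, seed=42)
val_test = splits['test'].train_test_split(test_size=0.5, seed=42)
train_set, val_set, test_set = splits['train'], val_test['train'], val_test['test']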

Models

  1. BART, https://arxiv.org/abs/1910.13461 (paper)
  2. PEGASUS, http://proceedings.mlr.press/v119/zhang20ae (paper), https://github.com/google-research/pegasus (github)
  3. ProphetNet, https://arxiv.org/abs/2001.04063 (paper), https://github.com/microsoft/ProphetNet/tree/master/ProphetNet (github)
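
All three models have sequence-to-sequence checkpoints on the Hugging Face Hub and can be loaded with the transformers Auto classes. The checkpoint names below are the publicly available base checkpoints and are given only as examples; they are not necessarily the exact checkpoints used in this repo.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Example checkpoint names (assumptions, not necessarily the ones used here)
checkpoints = {
    'BART': 'facebook/bart-large',
    'PEGASUS': 'google/pegasus-large',
    'ProphetNet': 'microsoft/prophetnet-large-uncased',
}

tokenizer = AutoTokenizer.from_pretrained(checkpoints['BART'])
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoints['BART'])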

Project Structure

| - .vscode/
| - - launch.json
| - notebooks/
| - - EDA/
| - - filtering/
| - - results_replication/
| - src/
| - - dataset.py
| - - train.py
| - - train_without_optimizer.py
| - - utils/
  • the .vscode directory is only relevant if the code is run in Visual Studio Code
  • notebooks contains the .ipynb files for the Exploratory Data Analysis, the filtering of the Reddit datasets, and the replication of the Abstractive Summarization results of the BART, PEGASUS, and ProphetNet papers
  • src contains the PyTorch code for fine-tuning the Transformer-based models on the Reddit datasets
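
As an illustration of what the fine-tuning step in src/ amounts to, the sketch below fine-tunes a BART checkpoint on Reddit TIFU with the transformers Seq2SeqTrainer. This is a minimal sketch, not the repo's actual training code: train.py and train_without_optimizer.py may use a custom PyTorch loop, and the checkpoint name, sequence lengths, and hyperparameters shown here are assumptions. It reuses the train_set/val_set splits from the data-preparation sketch above.

from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

checkpoint = 'facebook/bart-large'  # example checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # Reddit TIFU field names: 'documents' is the source text, 'tldr' is the summary
    model_inputs = tokenizer(batch['documents'], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch['tldr'], max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# train_set / val_set are the splits created in the data-preparation sketch above
tokenized_train = train_set.map(preprocess, batched=True, remove_columns=train_set.column_names)
tokenized_val = val_set.map(preprocess, batched=True, remove_columns=val_set.column_names)

args = Seq2SeqTrainingArguments(
    output_dir='bart-reddit-tifu',  # example output path (assumption)
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()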