📝 Abstractive Summarization of Reddit Posts with Multi-level Memory Networks. In NAACL-HLT, 2019 (oral).


mmn model

This project hosts the code and dataset for our paper.

We address the problem of abstractive summarization in two directions: proposing a novel dataset and a new model. First, we collect the Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit. Second, we propose a novel abstractive summarization model named multi-level memory networks (MMN), equipped with multi-level memory to store the information of text at different levels of abstraction.


If you use this code or dataset as part of any published research, please cite the following paper.

    author = {Kim, Byeongchang and Kim, Hyunwoo and Kim, Gunhee},
    title = "{Abstractive Summarization of Reddit Posts with Multi-level Memory Networks}",
    booktitle = {NAACL-HLT},
    year = 2019

Running Code


Reddit TIFU Dataset

The Reddit TIFU dataset is our newly collected Reddit dataset, where TIFU is the name of the subreddit /r/tifu.

Key statistics of the Reddit TIFU dataset are outlined below. Each word count is given as the average, with the median in parentheses. The dataset contains 122,933 text-summary pairs in total.

Dataset      #posts   #words/post   #words/summary
TIFU-short   79,949   342.4 (269)   9.33 (8)
TIFU-long    42,984   432.6 (351)   23.0 (21)
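
As a rough illustration of how average and median lengths like those in the table can be computed, here is a minimal sketch. The helper name `length_stats` and the toy input are hypothetical, and the sketch assumes each text is already a list of tokens; it is not the script used to produce the numbers above.

```python
from statistics import mean, median


def length_stats(token_lists):
    """Return (average, median) length over a collection of token lists."""
    lengths = [len(tokens) for tokens in token_lists]
    if not lengths:
        raise ValueError("no token lists given")
    return mean(lengths), median(lengths)


# Toy input for illustration only, not real dataset entries.
toy_posts = [["a", "b", "c"], ["d"], ["e", "f"]]
avg_len, med_len = length_stats(toy_posts)
print(avg_len, med_len)
```

With the real data, the same helper could be applied to a tokenized field of the loaded posts, provided that field holds token lists.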

You can download the data from the link below. The file includes both raw text and tokenized text.

[Download json]

You can read and explore our dataset as follows:

import json

# Read entire file
posts = []
with open('tifu_tokenized_and_filtered.json', 'r') as fp:
    for line in fp:
        # Each line of the file holds one JSON object (one post)
        posts.append(json.loads(line))

# Keys of each JSON entry
# [u'title_tokenized',
#  u'permalink',
#  u'title',
#  u'url',
#  u'num_comments',
#  u'tldr',  # (optional)
#  u'created_utc',
#  u'trimmed_title_tokenized',
#  u'ups',
#  u'selftext_html',
#  u'score',
#  u'upvote_ratio',
#  u'tldr_tokenized',  # (optional)
#  u'selftext',
#  u'trimmed_title',
#  u'selftext_without_tldr_tokenized',
#  u'id',
#  u'selftext_without_tldr']
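
Building on the keys listed above, text-summary pairs could be assembled along the following lines. This is only a sketch: the helper `make_pairs` is hypothetical, and pairing `tldr_tokenized` with TIFU-long and `trimmed_title_tokenized` with TIFU-short is an assumption based on the field names, so please check it against the paper before relying on it. Posts whose summary field is missing or empty are skipped, since `tldr` is optional.

```python
def make_pairs(posts,
               summary_key="tldr_tokenized",
               text_key="selftext_without_tldr_tokenized"):
    """Collect (text, summary) pairs, skipping posts that lack either field."""
    pairs = []
    for post in posts:
        text = post.get(text_key)
        summary = post.get(summary_key)
        if text and summary:
            pairs.append((text, summary))
    return pairs


# Toy posts for illustration only; real entries come from the JSON file.
toy = [
    {"selftext_without_tldr_tokenized": ["i", "broke", "it"],
     "tldr_tokenized": ["broke", "it"],
     "trimmed_title_tokenized": ["breaking", "it"]},
    {"selftext_without_tldr_tokenized": ["no", "summary", "here"],
     "tldr_tokenized": None,
     "trimmed_title_tokenized": ["oops"]},
]

long_pairs = make_pairs(toy)  # assumed TIFU-long style pairs
short_pairs = make_pairs(toy, summary_key="trimmed_title_tokenized")  # assumed TIFU-short style pairs
print(len(long_pairs), len(short_pairs))
```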


We thank the PRAW developers for their API and Reddit users for their valuable posts.

We also appreciate Chris Dongjoo Kim and Yunseok Jang for helpful comments and discussions.

This work was supported by Kakao and Kakao Brain corporations, and by the Creative-Pioneering Researchers Program through Seoul National University.


Byeongchang Kim, Hyunwoo Kim and Gunhee Kim

Vision and Learning Lab @ Computer Science and Engineering, Seoul National University, Seoul, Korea


MIT license
