📝 Abstractive Summarization of Reddit Posts with Multi-level Memory Networks. In NAACL-HLT, 2019 (oral).
MMN

(Figure: MMN model overview)

This project hosts the code and dataset for our paper.

We address the problem of abstractive summarization in two directions: a novel dataset and a new model. First, we collect the Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit. Second, we propose a novel abstractive summarization model named Multi-level Memory Networks (MMN), equipped with multi-level memory to store information about the text at different levels of abstraction.

Reference

If you use this code or dataset as part of any published research, please cite the following paper:

@inproceedings{Kim:2019:NAACL-HLT,
    author = {Kim, Byeongchang and Kim, Hyunwoo and Kim, Gunhee},
    title = "{Abstractive Summarization of Reddit Posts with Multi-level Memory Networks}",
    booktitle = {NAACL-HLT},
    year = 2019
}

Running Code

TBU

Reddit TIFU Dataset

The Reddit TIFU dataset is our newly collected Reddit dataset, where TIFU denotes the name of the subreddit /r/tifu.

Key statistics of the Reddit TIFU dataset are outlined below; word counts are reported as the average with the median in parentheses. The dataset contains 122,933 text-summary pairs in total.

| Dataset    | #posts | #words/post | #words/summ |
|------------|--------|-------------|-------------|
| TIFU-short | 79,949 | 342.4 (269) | 9.33 (8)    |
| TIFU-long  | 42,984 | 432.6 (351) | 23.0 (21)   |
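As a sketch, the average (median) word counts in the table can be computed from the tokenized fields of the JSON entries (the field names and sample documents below are illustrative assumptions; the tokenization used for the official numbers may differ):

```python
# Sketch: compute "average (median)" word counts per document, in the
# same format as the table above. The sample documents are hypothetical,
# not real dataset entries.
from statistics import mean, median

def word_stats(token_lists):
    """Return (average, median) number of tokens per document."""
    lengths = [len(tokens) for tokens in token_lists]
    return mean(lengths), median(lengths)

# Hypothetical tokenized documents
docs = [['i', 'spilled', 'coffee'], ['locked', 'out', 'again', 'today']]
avg, med = word_stats(docs)
print(f'{avg:.1f} ({med})')  # 3.5 (3.5)
```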

You can download the data from the link below. The file includes both raw and tokenized text.

[Download json]

You can read and explore our dataset as follows:

```python
import json

# The dataset is stored as JSON Lines: one post per line
posts = []
with open('tifu_tokenized_and_filtered.json', 'r') as fp:
    for line in fp:
        posts.append(json.loads(line))

# Fields of each post
print(posts[50000].keys())
# dict_keys(['title_tokenized',
#            'permalink',
#            'title',
#            'url',
#            'num_comments',
#            'tldr',  # (optional)
#            'created_utc',
#            'trimmed_title_tokenized',
#            'ups',
#            'selftext_html',
#            'score',
#            'upvote_ratio',
#            'tldr_tokenized',  # (optional)
#            'selftext',
#            'trimmed_title',
#            'selftext_without_tldr_tokenized',
#            'id',
#            'selftext_without_tldr'])
```
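Since the `tldr` fields are optional, one plausible way to separate the TIFU-short and TIFU-long subsets is by the presence of a TL;DR (a sketch only; the exact filtering used to produce the table above may differ):

```python
# Sketch: split posts into TIFU-short (title as summary) and TIFU-long
# (TL;DR as summary). This is one plausible split based on the optional
# 'tldr_tokenized' field, not necessarily the paper's exact procedure.

def split_tifu(posts):
    """Return (short, long) lists: posts carrying a TL;DR go into the
    long subset, all others into the short subset."""
    short, long_ = [], []
    for post in posts:
        if post.get('tldr_tokenized'):
            long_.append(post)
        else:
            short.append(post)
    return short, long_

# Tiny in-memory example (hypothetical posts, not real dataset entries)
sample = [
    {'id': 'a1', 'trimmed_title': 'spilled coffee on my laptop',
     'selftext_without_tldr': 'this morning i ...'},
    {'id': 'b2', 'trimmed_title': 'locked myself out',
     'selftext_without_tldr': 'so yesterday ...',
     'tldr_tokenized': ['locked', 'out', 'again']},
]

short, long_ = split_tifu(sample)
print(len(short), len(long_))  # 1 1
```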

Acknowledgement

We thank PRAW developers for their API and Reddit users for their valuable posts.

We also appreciate Chris Dongjoo Kim and Yunseok Jang for helpful comments and discussions.

This work was supported by Kakao and Kakao Brain corporations, and Creative-Pioneering Researchers Program through Seoul National University.

Authors

Byeongchang Kim, Hyunwoo Kim and Gunhee Kim

Vision and Learning Lab @ Computer Science and Engineering, Seoul National University, Seoul, Korea

License

MIT license
