This code produces the non-anonymized version of the CNN / Daily Mail summarization dataset, as used in the ACL 2017 paper Get To The Point: Summarization with Pointer-Generator Networks. It processes the dataset into the binary format expected by the code for the Tensorflow model.
Python 3 version: This code is in Python 2. If you want a Python 3 version, see @becxer's fork.
Option 1: download the processed data
User @JafferWilson has provided the processed data, which you can download here. (See discussion here about why we do not provide it ourselves).
Option 2: process the data yourself
1. Download data
Download and unzip the stories directories from here for both CNN and Daily Mail.
Warning: These files contain a few (114, in a dataset of over 300,000) examples for which the article text is missing - see for example
cnn/stories/72aba2f58178f2d19d3fae89d5f3e9a4686bc4bb.story. The Tensorflow code has been updated to discard these examples.
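If you want to check this yourself before processing, here is a minimal sketch that counts stories whose article portion is empty. It assumes each .story file consists of the article text followed by summary sentences, each preceded by an @highlight marker; the has_empty_article helper is illustrative and not part of this repo.

```python
import glob
import os
import sys

def has_empty_article(story_path):
    """Return True if the .story file has no article text before the first @highlight."""
    with open(story_path, "rb") as f:
        text = f.read().decode("utf-8", errors="ignore")
    # In the .story format, the article body comes first and the reference
    # summary sentences follow, each preceded by an "@highlight" marker.
    article_part = text.split("@highlight")[0].strip()
    return len(article_part) == 0

if __name__ == "__main__":
    stories_dir = sys.argv[1]  # e.g. /path/to/cnn/stories
    story_paths = glob.glob(os.path.join(stories_dir, "*.story"))
    empty = [p for p in story_paths if has_empty_article(p)]
    print("%d of %d stories have no article text" % (len(empty), len(story_paths)))
```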
2. Download Stanford CoreNLP
We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile:
export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
replacing /path/to/ with the path to where you saved the
stanford-corenlp-full-2016-10-31 directory. You can check if it's working by running
echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer
You should see something like:
Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
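For reference, the tokenization in step 3 can be driven from Python using PTBTokenizer's batch mode (-ioFileList reads input/output file pairs from a mapping file), which is roughly what make_datafiles.py does. The tokenize_directory function and the mapping.txt filename below are illustrative, not the script's exact code:

```python
import os
import subprocess

def tokenize_directory(stories_dir, tokenized_dir):
    """Tokenize every file in stories_dir into tokenized_dir using PTBTokenizer batch mode."""
    if not os.path.exists(tokenized_dir):
        os.makedirs(tokenized_dir)
    # PTBTokenizer's -ioFileList mode reads "input <tab> output" pairs from a file,
    # so we write one line per story file. CLASSPATH must point at the CoreNLP jar.
    with open("mapping.txt", "w") as f:
        for s in os.listdir(stories_dir):
            f.write("%s \t %s\n" % (os.path.join(stories_dir, s),
                                    os.path.join(tokenized_dir, s)))
    subprocess.check_call(
        ["java", "edu.stanford.nlp.process.PTBTokenizer",
         "-ioFileList", "-preserveLines", "mapping.txt"])
    os.remove("mapping.txt")
```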
3. Process into .bin and vocab files
Run
python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories
replacing /path/to/cnn/stories with the path to where you saved the
cnn/stories directory that you downloaded; similarly for dailymail/stories.
This script will do several things:
- The directories cnn_stories_tokenized and dm_stories_tokenized will be created and filled with tokenized versions of cnn/stories and dailymail/stories. This may take some time. Note: you may see several Untokenizable: warnings from the Stanford Tokenizer. These seem to be related to Unicode characters in the data; so far it seems OK to ignore them.
- For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files train.bin, val.bin and test.bin (see the reader sketch after this list for the file format). These will be placed in the newly-created finished_files directory. This may take some time.
- Additionally, a vocab file is created from the training data. This is also placed in finished_files.
- Lastly, train.bin, val.bin and test.bin will be split into chunks of 1000 examples per chunk. These chunked files will be saved in finished_files/chunked as e.g. train_000.bin, ..., train_287.bin. This should take a few seconds. You can use either the single files or the chunked files as input to the Tensorflow code (see considerations here).
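If you want to inspect the finished files, each .bin file is a sequence of serialized tf.train.Example protos, each preceded by an 8-byte length, with the text stored as article and abstract byte features. A minimal reader sketch under that assumption (the read_bin_file helper is illustrative):

```python
import struct

from tensorflow.core.example import example_pb2

def read_bin_file(path):
    """Yield (article, abstract) byte-string pairs from a finished_files .bin file."""
    with open(path, "rb") as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # end of file
            str_len = struct.unpack("q", len_bytes)[0]
            example_str = struct.unpack("%ds" % str_len, reader.read(str_len))[0]
            ex = example_pb2.Example.FromString(example_str)
            article = ex.features.feature["article"].bytes_list.value[0]
            abstract = ex.features.feature["abstract"].bytes_list.value[0]
            yield article, abstract

# Print the first abstract from test.bin; the chunked files can be read the same way.
for article, abstract in read_bin_file("finished_files/test.bin"):
    print(abstract)
    break
```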