Preprocess the raw CNN/Daily Mail story files.

1. Build summary from highlights.
2. Write story and summary to a single file where the story and summary are separated by a tab.
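
The two steps above can be sketched on an in-memory string. This is a minimal illustration, assuming the raw story layout that the parser in this notebook expects (article text followed by `@highlight` blocks). The function name `story_to_line` and the sample text are hypothetical and not part of the notebook.

```python
def story_to_line(raw, max_summary_sentences=None):
    """Turn one raw story string into 'story<TAB>summary' (illustrative only)."""
    # Literal tabs would clash with the tab separator, so mask them first.
    raw = raw.replace('\t', '<tab>')
    # Everything before the first '@highlight' marker is the story body.
    story, *highlights = raw.split('@highlight')
    if max_summary_sentences is not None:
        highlights = highlights[:max_summary_sentences]
    # Join the highlights into a single summary string.
    summary = '. '.join(h.strip() for h in highlights) + '.'
    return '\t'.join([story.strip(), summary])


# Made-up sample in the assumed raw format.
sample = (
    "Some article text.\n\n"
    "@highlight\n\nFirst point\n\n"
    "@highlight\n\nSecond point\n"
)
line = story_to_line(sample, max_summary_sentences=2)
print(repr(line))
```

Masking tabs before joining means each output record contains exactly one real tab, so a reader can split story from summary unambiguously.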

The data can be obtained from:

https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail

Parameters:

- DATA_DIRECTORIES: The directories containing stories to process.
- OUTPUT_DIR: Where the processed stories will be stored.
- MAX_SUMMARY_SENTENCES: The maximum number of highlights used to make the summary.
- EXTENSION: The file extension to use for the processed stories.

## Parameters

In [1]:
DATA_DIRECTORIES = ['../data/cnn/stories', '../data/dailymail/stories']
OUTPUT_DIR = '../data/preprocessed_stories'
MAX_SUMMARY_SENTENCES = 2
EXTENSION = 'clean'

## Get data files

In [2]:
import glob
import os

In [3]:
FILES = []
for directory in DATA_DIRECTORIES:
    stories = glob.glob(os.path.join(directory, '*'))
    FILES.extend(stories)
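
If a data directory is missing, the loop above silently contributes nothing, which is easy to overlook. A stricter variant might look like the following sketch. The helper name `list_story_files` and the `*.story` pattern are assumptions for illustration (the notebook itself globs `*`).

```python
import glob
import os


def list_story_files(directories, pattern='*.story'):
    """Collect story files, warning about directories that do not exist."""
    files = []
    for directory in directories:
        if not os.path.isdir(directory):
            # Surface the problem instead of silently returning no files.
            print(f'warning: directory not found: {directory}')
            continue
        # Sort so the file order is reproducible across runs.
        files.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return files


# A hypothetical path that is not expected to exist on this machine.
missing = list_story_files(['/nonexistent/stories'])
print(len(missing))
```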

In [4]:
len(FILES)

0

In [5]:
FILES[:10]

[]

## Build story parser

In [6]:
# some stories are bad, e.g.
!more ../data/cnn/stories/00465603227f7f56fcd37e10f4cd44e57d7647d8.story

../data/cnn/stories/00465603227f7f56fcd37e10f4cd44e57d7647d8.story: No such file or directory


In [7]:
import os
def parse(file, max_summary_sentences=None):
    with open(file) as f:
        file_id = os.path.basename(file).partition('.')[0]
        content = f.read()
        content = content.replace('\t', '<tab>')
        context, *highlights = content.split('@highlight')
        if max_summary_sentences is not None:
            highlights = highlights[:max_summary_sentences]
        summary = '. '.join(h.strip() for h in highlights) + '.'
        context, summary = context.strip(), summary.strip()
        # Skip bad stories that have no body text or no highlights.
        if not context or not highlights:
            return None
        return file_id, context, summary

In [8]:
import tqdm
summaries = [parse(f, MAX_SUMMARY_SENTENCES) for f in tqdm.tqdm(FILES)]
summaries = [s for s in summaries if s is not None]

0it [00:00, ?it/s]


In [9]:
summaries[0]

IndexError: list index out of range

## Calculate some basic statistics on data

The summary statistics should match the figures quoted below when no limit is placed on summary length and newlines are not split out:

"The source documents in the training
set have 766 words spanning 29.74 sentences
on an average while the summaries consist of 53
words and 3.72 sentences."

See:
https://arxiv.org/pdf/1602.06023.pdf

In [None]:
context_lens, summary_lens = [], []
context_sentences, summary_sentences = [], []
for _, context, summary in tqdm.tqdm(summaries):
    context_lens.append(len(context.split(' ')))
    context_sentences.append(context.count('.'))
    summary_lens.append(len(summary.split(' ')))
    summary_sentences.append(summary.count('.'))
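
The counts above are rough proxies: words are split on single spaces and sentences are approximated by counting periods. The sketch below shows how that behaves on a made-up sentence; the helper name `rough_counts` is hypothetical.

```python
def rough_counts(text):
    """Return (word_count, sentence_count) using the same proxies as above."""
    words = len(text.split(' '))   # split on single spaces only
    sentences = text.count('.')    # every period counts as a sentence end
    return words, sentences


# Abbreviations such as "U.S." inflate the period-based sentence count.
print(rough_counts("The U.S. team won. Fans cheered."))  # (6, 4)
```

The example text has two sentences but four periods, so small deviations from the published averages are to be expected with this method.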

In [None]:
import pandas as pd
df = pd.DataFrame({
    'context_len': context_lens,
    'summary_len': summary_lens,
    'context_sent': context_sentences,
    'summary_sentences': summary_sentences
})

In [None]:
df.describe()

## Write to disk

In [None]:
# Create the output directory if it does not already exist.
os.makedirs(OUTPUT_DIR, exist_ok=True)

In [None]:
for file_id, context, summary in tqdm.tqdm(summaries):
    text = '\t'.join([context, summary])
    dst = os.path.join(OUTPUT_DIR, f'{file_id}.{EXTENSION}')
    with open(dst, 'w') as f:
        f.write(text)
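
To consume the processed files later, each one can be split back into story and summary at the tab. This is a minimal sketch, assuming the file contains exactly one real tab (tabs in the text were masked as `<tab>` during parsing). The helper name `read_processed` and the temporary file are hypothetical.

```python
import os
import tempfile


def read_processed(path):
    """Read one processed file and return (story, summary)."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # The only real tab in the file separates the story from the summary.
    story, _, summary = text.partition('\t')
    return story, summary


# Round-trip a made-up record through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.clean')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Story text here.\tFirst point. Second point.')
    story, summary = read_processed(path)

print(summary)
```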