Preprocess the raw CNN/Dailymail story files.

1. Build a summary from each story's highlights.
2. Write each story and its summary to its own file, with the story and summary separated by a tab.

The data can be obtained from:

https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail

Parameters:

- DATA_DIRECTORIES: The directories containing stories to process.
- OUTPUT_DIR: Where the processed stories will be stored.
- MAX_SUMMARY_SENTENCES: The maximum number of highlights used to make the summary.
- EXTENSION: The file extension to use for the processed stories.

## Parameters

In [1]:
DATA_DIRECTORIES = ['../data/cnn/stories', '../data/dailymail/stories']
OUTPUT_DIR = '../data/preprocessed_stories'
MAX_SUMMARY_SENTENCES = 2
EXTENSION = 'clean'

## Get data files

In [2]:
import glob
import os

In [3]:
FILES = []
for directory in DATA_DIRECTORIES:
    stories = glob.glob(os.path.join(directory, '*'))
    FILES.extend(stories)

In [4]:
len(FILES)

312085

In [5]:
FILES[:10]

['../data/cnn/stories/42cb40734b147af9f928f769097cecd5fae35d79.story',
 '../data/cnn/stories/458dfa3980b6be4954bc1158913d57a671a319c9.story',
 '../data/cnn/stories/60e214a0285536d34f0094d8065e0a3342bcb1f6.story',
 '../data/cnn/stories/eca7d18130520666db55d1231590e2b49bde3b72.story',
 '../data/cnn/stories/aebb78aacd3902f04d8326e7f0762f1e13c0381d.story',
 '../data/cnn/stories/2c9c4ee471b8cdf7d127761e25b94e413d13dd4d.story',
 '../data/cnn/stories/7d00fe1a731956fd902dee5cacc9e3de804ddcec.story',
 '../data/cnn/stories/b05deba8554de0df633d60d0068152da250ef076.story',
 '../data/cnn/stories/f1feae52895b8e6cb0628b41b217f816116681bd.story',
 '../data/cnn/stories/f1517cb9f19145fd9599daf6c8c1e8eb628b4c88.story']

## Build story parser

In [6]:
# Some stories are bad: this one, for example, has highlights but no story body.
!more ../data/cnn/stories/00465603227f7f56fcd37e10f4cd44e57d7647d8.story



@highlight

CNN.com will feature iReporter photos in a weekly Travel Snapshots gallery

@highlight

Please submit your best shots of Barcelona, Spain for next week

@highlight

Visit CNN.com/Travel next Wednesday for a new gallery of snapshots


In [7]:
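import os
def parse(file, max_summary_sentences=None):
    """Parse a story file into (file_id, context, summary); return None if it has no body."""
    # The file ID is the file name up to the first '.'
    file_id = os.path.basename(file).partition('.')[0]
    with open(file) as f:
        content = f.read()
    # Tabs are reserved as the story/summary separator in the output files
    content = content.replace('\t', '<tab>')
    # Text before the first '@highlight' is the story; each later chunk is one highlight
    context, *highlights = content.split('@highlight')
    if max_summary_sentences is not None:
        highlights = highlights[:max_summary_sentences]
    summary = ' . '.join(highlights) + ' .'
    context, summary = context.strip(), summary.strip()
    # Skip stories with no body text (summary always ends with '.', so it is never empty)
    if not context:
        return None
    return file_id, context, summary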
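
To make the parser's behavior concrete, the sketch below applies the same splitting logic to an in-memory string instead of a file. The helper `split_story` and the sample text are hypothetical and exist only for illustration; the logic mirrors `parse` but omits the file handling and the empty-story check.

```python
def split_story(content, max_summary_sentences=None):
    """Hypothetical helper: the string-level core of `parse`."""
    content = content.replace('\t', '<tab>')
    # Text before the first '@highlight' is the story; the rest are highlights
    context, *highlights = content.split('@highlight')
    if max_summary_sentences is not None:
        highlights = highlights[:max_summary_sentences]
    summary = ' . '.join(highlights) + ' .'
    return context.strip(), summary.strip()

# Made-up sample in the same layout as a .story file
sample = (
    "First sentence of the story.\n\n"
    "Second sentence.\n\n"
    "@highlight\n\nFirst highlight\n\n"
    "@highlight\n\nSecond highlight\n\n"
    "@highlight\n\nThird highlight\n"
)

context, summary = split_story(sample, max_summary_sentences=2)
print(repr(context))
print(repr(summary))

# A story with highlights but no body yields an empty context
empty_context, _ = split_story("\n\n@highlight\n\nOnly a highlight\n")
print(repr(empty_context))
```

Running this shows that each highlight keeps its surrounding newlines, so the joined summary contains blank lines between highlights; only the outer whitespace is stripped.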

In [8]:
import tqdm
summaries = [parse(f, MAX_SUMMARY_SENTENCES) for f in tqdm.tqdm(FILES)]
summaries = [s for s in summaries if s is not None]

100%|██████████| 312085/312085 [02:01<00:00, 2558.64it/s]


In [9]:
summaries[0]

('42cb40734b147af9f928f769097cecd5fae35d79',
 '(LifeWire) -- When April Kling locked eyes with a handsome stranger on her flight out of Chicago last Christmas, her heart skipped a beat. So she did what any hip 21st-century single would do: She posted an ad on Craigslist.org\'s "missed connections" bulletin board when she got home.\n\nA "missed connection" ad led to the marriage of Dan and Erin Kottke who welcomed son Linus two years later.\n\n"He replied within two hours," says Kling, a 29-year-old sales associate and musician from Seattle. "We e-mailed back and forth and realized we had a lot of mutual friends. I thought, \'Wow, this is meant to be.\'"\n\nIt wasn\'t.\n\nKling and her missed-connection match went out a few times, but then he bowed out, citing relationship butterflies. Three weeks later, however, a mutual friend told Kling the guy was seeing someone else. But he apparently was still smitten with the story of how he and Kling met.\n\n"I was at a wedding and this girl cam

## Calculate some basic statistics on data

With no limit placed on summary length, and without splitting the text on newlines, the summary statistics should match those reported in the paper:

"The source documents in the training set have 766 words spanning 29.74 sentences on an average while the summaries consist of 53 words and 3.72 sentences."

See https://arxiv.org/pdf/1602.06023.pdf
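
The counts computed in this section are rough: words are split on single spaces and sentences are counted as periods. For comparison, here is a minimal sketch of a slightly more careful count, assuming whitespace tokenization and a regular expression that treats `.`, `!`, or `?` followed by whitespace or end of text as a sentence boundary. The function name `rough_counts` is hypothetical, and this is not claimed to be the tokenization used in the paper.

```python
import re

def rough_counts(text):
    """Hypothetical helper: return (word_count, sentence_count) for a text."""
    # split() with no argument splits on any run of whitespace, including newlines
    words = text.split()
    # Count runs of . ! ? that are followed by whitespace or the end of the text,
    # so the period inside a number such as 3.50 is not counted
    boundaries = re.findall(r'[.!?]+(?=\s|$)', text)
    return len(words), len(boundaries)

print(rough_counts('Hello world. This costs 3.50 dollars!\n\nIs it done?'))
```

This still miscounts in some cases, for example abbreviations, or sentences that end with a closing quotation mark, so it is only an approximation.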

In [10]:
context_lens, summary_lens = [], []
context_sentences, summary_sentences = [], []
for _, context, summary in tqdm.tqdm(summaries):
    # Word counts: split on single spaces only (newlines are not treated as separators)
    context_lens.append(len(context.split(' ')))
    summary_lens.append(len(summary.split(' ')))
    # Sentence counts: approximated by the number of periods
    context_sentences.append(context.count('.'))
    summary_sentences.append(summary.count('.'))

100%|██████████| 311971/311971 [00:11<00:00, 28264.99it/s]


In [14]:
import pandas as pd
df = pd.DataFrame({
    'context_len': context_lens,
    'summary_len': summary_lens,
    'context_sent': context_sentences,
    'summary_sentences': summary_sentences
})

In [15]:
df.describe()

       context_len  summary_len  context_sent  summary_sentences
count     311971.0     311971.0      311971.0           311971.0
mean    655.282568    27.545644     31.046828           2.142084
std     320.307863      6.81313     19.033027           0.577694
min            7.0          4.0           0.0                1.0
25%          417.0         23.0          18.0                2.0
50%          597.0         26.0          27.0                2.0
75%          833.0         30.0          40.0                2.0
max         2356.0        131.0        2269.0               13.0


## Write to disk

In [16]:
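# Create the output directory if it does not already exist;
# unlike a bare except, other errors (e.g. permissions) are still raised
os.makedirs(OUTPUT_DIR, exist_ok=True)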

In [17]:
for file_id, context, summary in tqdm.tqdm(summaries):
    text = '\t'.join([context, summary])
    dst = os.path.join(OUTPUT_DIR, f'{file_id}.{EXTENSION}')
    with open(dst, 'w') as f:
        f.write(text)

100%|██████████| 311971/311971 [00:19<00:00, 15612.14it/s]
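
Because the parser replaces tabs in the raw text with `<tab>`, each output file contains exactly one tab, so a downstream reader can split on it to recover the story and summary. A minimal sketch of such a reader follows; the function name `load_pair` is hypothetical, and the demo writes a made-up example to a temporary directory rather than to `OUTPUT_DIR`.

```python
import os
import tempfile

def load_pair(path):
    """Hypothetical helper: read one processed file and return (story, summary)."""
    with open(path) as f:
        text = f.read()
    # partition splits on the first tab; sep is empty if no tab was found
    story, sep, summary = text.partition('\t')
    if not sep:
        raise ValueError(f'no tab separator found in {path}')
    return story, summary

with tempfile.TemporaryDirectory() as tmp:
    # Write a made-up file in the same "story<TAB>summary" layout
    demo_path = os.path.join(tmp, 'example.clean')
    with open(demo_path, 'w') as f:
        f.write('\t'.join(['Story text.\n\nMore text.', 'A highlight .']))
    story, summary = load_pair(demo_path)

print(repr(story))
print(repr(summary))
```

Note that the story keeps its internal newlines, so a processed file spans several lines and should be read whole rather than line by line.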
