Data obtained from:

https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail

## Configs

In [12]:
CNN_DIR = '../cnn/stories'
DAILY_MAIL_DIR = '../dailymail/stories'
OUTPUT_DIR = '../data/summaries'
MAX_SUMMARY_SENTENCES = 2

## Get file paths

In [13]:
import glob
import os

In [14]:
CNN_FILES = glob.glob(os.path.join(CNN_DIR, '*'))

In [15]:
DAILY_MAIL_FILES = glob.glob(os.path.join(DAILY_MAIL_DIR, '*'))

In [16]:
len(CNN_FILES), len(DAILY_MAIL_FILES)

(92579, 219506)
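
`glob.glob` returns paths in arbitrary order, and the `'*'` pattern matches every entry in the directory, so the lists above may differ between machines. A minimal sketch of a more deterministic listing, assuming every story file ends in `.story` (as the paths shown in this notebook do); `list_stories` is a hypothetical helper, not part of the notebook:

```python
import glob
import os
import tempfile


def list_stories(directory):
    """Return the sorted paths of all *.story files in `directory`."""
    return sorted(glob.glob(os.path.join(directory, '*.story')))


# Demonstrate on a throwaway directory instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('b.story', 'a.story', 'notes.txt'):
        open(os.path.join(tmp, name), 'w').close()
    names = [os.path.basename(p) for p in list_stories(tmp)]

print(names)  # ['a.story', 'b.story']
```

Sorting makes indices such as `CNN_FILES[0]` refer to the same story on every run, which helps when comparing outputs.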

In [17]:
with open(CNN_FILES[0]) as f:
    print(f.read())

It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria.

Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons.

The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction."

It's a step that is set to turn an international crisis into a fierce domestic political battle.

There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react?

In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but because he wa

In [19]:
with open(DAILY_MAIL_FILES[0]) as f:
    print(f.read())

Sky have won the bidding war for the rights to screen Floyd Mayweather v Manny Pacquiao in the UK, as revealed by Sportsmail last Friday.

The richest fight of all time will not come cheap either — for Sky Sports or their subscribers — even though Sky are keeping faith with their core following by keeping the base price below £20.

It has taken what is described by industry insiders as ‘a very substantial offer’ for Sky to fend off fierce competition from Frank Warren’s BoxNation.

Floyd Mayweather's hotly-anticipated bout with Manny Pacquiao will be shown on Sky Sports

Pacquiao headed for the playground after working out in Los Angeles previously

The price for the fight has been set at £19.95 until midnight of Friday May 1. 

The cost will remain the same for those paying via remote control or online, but will be £24.95 if booked via phone after Friday.

Sky are flirting with their threshold of £20 by charging £19.95 a buy on their Sports Box Office channel until midnight on May 1, 

In [21]:
FILES = CNN_FILES + DAILY_MAIL_FILES

In [22]:
FILES[:10]

['../cnn/stories/0001d1afc246a7964130f43ae940af6bc6c57f01.story',
 '../cnn/stories/0002095e55fcbd3a2f366d9bf92a95433dc305ef.story',
 '../cnn/stories/00027e965c8264c35cc1bc55556db388da82b07f.story',
 '../cnn/stories/0002c17436637c4fe1837c935c04de47adb18e9a.story',
 '../cnn/stories/0003ad6ef0c37534f80b55b4235108024b407f0b.story',
 '../cnn/stories/0004306354494f090ee2d7bc5ddbf80b63e80de6.story',
 '../cnn/stories/0005d61497d21ff37a17751829bd7e3b6e4a7c5c.story',
 '../cnn/stories/0006021f772fad0aa78a977ce4a31b3faa6e6fe5.story',
 '../cnn/stories/00083697263e215e5e7eda753070f08aa374dd45.story',
 '../cnn/stories/000940f2bb357ac04a236a232156d8b9b18d1667.story']

## Build parser

In [23]:
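import os

def parse(file, max_summary_sentences=None):
    """Return (id, context, summary) for one .story file."""
    with open(file, encoding='utf-8') as f:
        id_ = os.path.basename(file)
        # Pad newlines with spaces so a later split on ' ' treats them as
        # separate tokens, then restore paragraph breaks to a bare '\n\n'.
        content = f.read().replace('\n', ' \n ') \
                          .replace('\n  \n', '\n\n')
    # Text before the first '@highlight' marker is the article body;
    # each later chunk is one highlight.
    context, *highlights = content.split('@highlight')
    if max_summary_sentences is not None:
        highlights = highlights[:max_summary_sentences]
    # Join the highlights into a single ' . '-delimited summary string.
    summary = ' . '.join(highlights) + ' .'
    return id_, context, summary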
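
To see what the `@highlight` split does, here is a small sketch that works on an in-memory string rather than a file. It is a variant of the parser above, not the notebook's method: `split_story` is a hypothetical helper, the sample text is made up, and unlike `parse` it collapses whitespace inside each highlight.

```python
def split_story(content, max_summary_sentences=None):
    """Split raw .story text into (article, summary)."""
    article, *highlights = content.split('@highlight')
    # Collapse whitespace inside each highlight and drop empty chunks.
    highlights = [' '.join(h.split()) for h in highlights]
    highlights = [h for h in highlights if h]
    if max_summary_sentences is not None:
        highlights = highlights[:max_summary_sentences]
    # Same ' . ' delimiter as parse(); empty string if there are no highlights.
    summary = ' . '.join(highlights) + ' .' if highlights else ''
    return article.strip(), summary


# Made-up sample in the same layout as the dataset's story files.
sample = (
    "First paragraph of the article.\n\n"
    "Second paragraph.\n\n"
    "@highlight\n\nFirst key point\n\n"
    "@highlight\n\nSecond key point\n\n"
    "@highlight\n\nThird key point\n"
)
article, summary = split_story(sample, max_summary_sentences=2)
print(repr(summary))  # 'First key point . Second key point .'
```

With `max_summary_sentences=2` the third highlight is dropped, which mirrors how `MAX_SUMMARY_SENTENCES` shortens the summaries in this notebook.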

In [24]:
import tqdm
summaries = [parse(f, MAX_SUMMARY_SENTENCES) for f in tqdm.tqdm(FILES)]

100%|██████████| 312085/312085 [02:42<00:00, 1926.13it/s]


In [25]:
summaries[1]

('0002095e55fcbd3a2f366d9bf92a95433dc305ef.story',
 '(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men\'s 4x100m relay. \n\n The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. \n\n The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. \n\n The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. \n\n The relay triumph followed individual successes in the 100 and 200 meters in the Russian capital. \n\n "I\'m proud of myself and I\'ll continue to work to dominate for as long as possible," Bolt said, having previously expresse

## Calculate some basic statistics on data

With no limit on summary length and without newlines split out as separate tokens, the statistics below should roughly match those reported in the paper:

"The source documents in the training set have 766 words spanning 29.74 sentences on an average while the summaries consist of 53 words and 3.72 sentences."

See https://arxiv.org/pdf/1602.06023.pdf

This run sets MAX_SUMMARY_SENTENCES = 2, so the summary figures here come out lower.

In [26]:
context_lens, summary_lens = [], []
context_sentences, summary_sentences = [], []
for _, context, summary in tqdm.tqdm(summaries):
    # Word counts: split on single spaces, so newline tokens and
    # empty strings are counted too.
    context_lens.append(len(context.split(' ')))
    summary_lens.append(len(summary.split(' ')))
    # Sentence counts: approximated by counting periods.
    context_sentences.append(context.count('.'))
    summary_sentences.append(summary.count('.'))

100%|██████████| 312085/312085 [00:28<00:00, 10831.28it/s]
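
Both counts in the loop above are approximations. The sketch below, using made-up strings and a hypothetical helper `rough_stats`, shows how counting periods overstates the number of sentences and how the padded newlines add tokens when splitting on single spaces:

```python
def rough_stats(text):
    """Return (word_count, period_count) for a string."""
    words = len(text.split())   # split() with no argument ignores whitespace runs
    periods = text.count('.')   # same sentence proxy as the loop above
    return words, periods


# One sentence, but abbreviations and decimals add extra periods.
sentence = "The U.S. finished second in 37.56 seconds."
print(rough_stats(sentence))  # (7, 4)

# The parser's newline padding turns a paragraph break into its own token.
raw = "One sentence here.\n\nAnother one."
padded = raw.replace('\n', ' \n ').replace('\n  \n', '\n\n')
plain_tokens = len(raw.split())
padded_tokens = len(padded.split(' '))
print(plain_tokens, padded_tokens)  # 5 6
```

Differences of this kind are one plausible reason the figures here do not line up exactly with published statistics.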


In [27]:
import pandas as pd
df = pd.DataFrame({
    'context_len': context_lens,
    'summary_len': summary_lens,
    'context_sent': context_sentences,
    'summary_sentences': summary_sentences
})

In [28]:
df.describe()

| | context_len | summary_len | context_sent | summary_sentences |
|---|---|---|---|---|
| count | 312085.0 | 312085.0 | 312085.0 | 312085.0 |
| mean | 718.70327 | 35.354679 | 31.035487 | 2.142243 |
| std | 350.664124 | 7.080906 | 19.038796 | 0.577843 |
| min | 3.0 | 6.0 | 0.0 | 1.0 |
| 25% | 459.0 | 31.0 | 18.0 | 2.0 |
| 50% | 656.0 | 34.0 | 27.0 | 2.0 |
| 75% | 911.0 | 38.0 | 40.0 | 2.0 |
| max | 2722.0 | 139.0 | 2269.0 | 13.0 |
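
As a rough check against the quoted figures (766 words and 29.74 sentences per source document), the relative differences of the means above can be computed directly. This is only a sketch: the quoted numbers describe the training set while this table covers every file, and `rel_diff` is a hypothetical helper. The summary columns are left out because this run caps summaries at `MAX_SUMMARY_SENTENCES = 2`.

```python
def rel_diff(observed, reference):
    """Relative difference of `observed` from `reference`, as a fraction."""
    return (observed - reference) / reference


# Means copied from the describe() output above; references from the quote.
word_gap = rel_diff(718.70327, 766)
sentence_gap = rel_diff(31.035487, 29.74)

print(f"words per document:     {word_gap:+.1%}")
print(f"sentences per document: {sentence_gap:+.1%}")
```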


## Write to disk

In [29]:
# Create the output directory; do nothing if it already exists.
os.makedirs(OUTPUT_DIR, exist_ok=True)

In [30]:
for id_, context, summary in tqdm.tqdm(summaries):
    # One file per story: article text, a tab, then the summary.
    text = '\t'.join([context, summary])
    dst_basename = id_.replace('.story', '.txt')
    dst = os.path.join(OUTPUT_DIR, dst_basename)
    with open(dst, 'w', encoding='utf-8') as f:
        f.write(text)

100%|██████████| 312085/312085 [01:42<00:00, 3041.38it/s]
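
Each output file holds the article and the summary separated by a single tab. A minimal sketch of reading one back, assuming the summary itself contains no tab so that splitting on the last tab recovers both parts; `load_pair` and the sample text are hypothetical:

```python
import os
import tempfile


def load_pair(path):
    """Read one output file and return (context, summary)."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Split on the last tab in case the article text contains tabs.
    context, summary = text.rsplit('\t', 1)
    return context, summary


# Round-trip a made-up example through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\t'.join(['Article body \n\n second paragraph',
                           'Point one . Point two .']))
    context, summary = load_pair(path)

print(repr(summary))  # 'Point one . Point two .'
```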
