# Getting started with the WCEP dataset

## Clone repository & install dependencies

In [45]:
!git clone https://github.com/complementizer/wcep-mds-dataset

Cloning into 'wcep-mds-dataset'...
remote: Enumerating objects: 76, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 76 (delta 34), reused 61 (delta 23), pack-reused 0
Unpacking objects: 100% (76/76), done.


In [46]:
cd wcep-mds-dataset

/content/wcep-mds-dataset/experiments/wcep-mds-dataset


In [47]:
!git checkout experiments

Branch 'experiments' set up to track remote branch 'experiments' from 'origin'.
Switched to a new branch 'experiments'


In [48]:
# Uncomment the next line in a fresh environment to install the dependencies.
#!pip install -r requirements-exp.txt
!python -m nltk.downloader punkt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [49]:
cd experiments

/content/wcep-mds-dataset/experiments/wcep-mds-dataset/experiments


## Download dataset


In [17]:
!mkdir -p WCEP
!gdown "https://drive.google.com/uc?id=1kUjSRXzKnTYdJ732BkKVLg3CFxDKo25u" -O WCEP/train.jsonl.gz
!gdown "https://drive.google.com/uc?id=1_kHTZ32jazTbXaFRg0vBeIsVcpI7CTmy" -O WCEP/val.jsonl.gz
!gdown "https://drive.google.com/uc?id=1qsd5pOCpeSXsaqNobXCrcAzhcjtG1wA1" -O WCEP/test.jsonl.gz

Downloading...
From: https://drive.google.com/uc?id=1kUjSRXzKnTYdJ732BkKVLg3CFxDKo25u
To: /content/wcep-mds-dataset/experiments/WCEP/train.jsonl.gz
384MB [00:02, 162MB/s]
Downloading...
From: https://drive.google.com/uc?id=1_kHTZ32jazTbXaFRg0vBeIsVcpI7CTmy
To: /content/wcep-mds-dataset/experiments/WCEP/val.jsonl.gz
55.2MB [00:00, 150MB/s]
Downloading...
From: https://drive.google.com/uc?id=1qsd5pOCpeSXsaqNobXCrcAzhcjtG1wA1
To: /content/wcep-mds-dataset/experiments/WCEP/test.jsonl.gz
51.5MB [00:00, 161MB/s]


## Load dataset

In [19]:
import utils

val_data = list(utils.read_jsonl_gz('WCEP/val.jsonl.gz'))

print(val_data[0].keys())
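
For readers who want to inspect the files without the repository's `utils` module, the following is a minimal sketch of a gzipped JSON Lines reader. It assumes each line holds one JSON object per cluster. `read_jsonl_gz_sketch` and `write_demo_file` are hypothetical names, not the repository's code, and the toy records only mirror the `summary` and `articles` keys used later in this notebook; the fields inside each toy article are made up.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz_sketch(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_demo_file(path):
    """Write two toy clusters so the reader can be tried without the real data."""
    toy_clusters = [
        {"summary": "toy summary A", "articles": [{"text": "toy article 1"}]},
        {"summary": "toy summary B", "articles": [{"text": "toy article 2"}]},
    ]
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for cluster in toy_clusters:
            f.write(json.dumps(cluster) + "\n")


demo_path = os.path.join(tempfile.mkdtemp(), "toy.jsonl.gz")
write_demo_file(demo_path)
toy_data = list(read_jsonl_gz_sketch(demo_path))
print(sorted(toy_data[0].keys()))  # ['articles', 'summary']
```

Reading lazily with a generator keeps memory use low for the larger training file; wrapping the call in `list(...)`, as the cell above does, loads everything at once.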

## Run extractive baselines & oracles

In [39]:
from baselines import RandomBaseline, TextRankSummarizer, CentroidSummarizer, SubmodularSummarizer
from oracles import Oracle

First we create summarizer objects and set their hyperparameters.

In [40]:
random_sum = RandomBaseline()
textrank = TextRankSummarizer(max_redundancy=0.5)
centroid = CentroidSummarizer(max_redundancy=0.5)
submod = SubmodularSummarizer(a=5, div_weight=6, cluster_factor=0.2)
oracle = Oracle()

Next we define a single set of extractive summarization settings that is shared by all baselines.
`in_titles` adds the article titles to the input as sentences, and `out_titles` additionally allows those titles to appear in a summary. Both are disabled here.

In [25]:
settings = {
    'max_len': 40, 'len_type': 'words',
    'in_titles': False, 'out_titles': False,
    'min_sent_tokens': 7, 'max_sent_tokens': 60,
}
max_articles = 20
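
To make the length-related settings concrete, here is a rough sketch of how a word budget and per-sentence token bounds could be enforced on an already ranked list of sentences. This is not the repository's implementation: `apply_length_settings` is a hypothetical helper, whitespace splitting stands in for real tokenization, the bounds are assumed to be inclusive, and only `len_type='words'` is handled.

```python
def apply_length_settings(ranked_sentences, max_len=40, len_type="words",
                          min_sent_tokens=7, max_sent_tokens=60):
    """Greedily keep ranked sentences that respect the token bounds and the word budget."""
    if len_type != "words":
        # Other length types are outside the scope of this sketch.
        raise ValueError("this sketch only supports len_type='words'")
    selected = []
    used_words = 0
    for sentence in ranked_sentences:
        n_tokens = len(sentence.split())  # assumption: whitespace tokens
        if not (min_sent_tokens <= n_tokens <= max_sent_tokens):
            continue  # sentence is too short or too long to be a candidate
        if used_words + n_tokens > max_len:
            continue  # adding it would exceed the summary budget
        selected.append(sentence)
        used_words += n_tokens
    return selected


toy_ranked = [
    "too short",
    "one two three four five six seven eight",
    "alpha beta gamma delta epsilon zeta eta theta iota",
]
print(apply_length_settings(toy_ranked, max_len=10))
```

With a budget of 10 words, only the eight-token sentence is kept: the first is below `min_sent_tokens`, and the third would push the total past `max_len`.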

For a quick experiment, we select only the first 10 clusters of the WCEP validation data and use at most the first `max_articles` (here 20) articles of each cluster as input.

In [26]:
example_clusters = [c['articles'][:max_articles] for c in val_data[:10]]
ref_summaries = [c['summary'] for c in val_data[:10]]

In [31]:
textrank_summaries = [textrank.summarize(articles, **settings) for articles in example_clusters]
centroid_summaries = [centroid.summarize(articles, **settings) for articles in example_clusters]
submod_summaries = [submod.summarize(articles, **settings) for articles in example_clusters]
random_summaries = [random_sum.summarize(articles, **settings) for articles in example_clusters]

In [42]:
oracle_summaries = [oracle.summarize(ref, articles, **settings)
                    for (ref, articles) in zip(ref_summaries, example_clusters)]

## Evaluate summaries



In [33]:
from pprint import pprint
from evaluate import evaluate

In [43]:
names = ['TextRank', 'Centroid', 'Submodular', 'Oracle', 'Random']
outputs = [textrank_summaries, centroid_summaries, submod_summaries, oracle_summaries, random_summaries]

for preds, name in zip(outputs, names):
    print(name)
    results = evaluate(ref_summaries, preds, lowercase=True)
    pprint(results)
    print()

TextRank
{'rouge-1': {'f': 0.344, 'p': 0.302, 'r': 0.434},
 'rouge-2': {'f': 0.174, 'p': 0.145, 'r': 0.237},
 'rouge-l': {'f': 0.282, 'p': 0.246, 'r': 0.357}}

Centroid
{'rouge-1': {'f': 0.344, 'p': 0.301, 'r': 0.434},
 'rouge-2': {'f': 0.174, 'p': 0.145, 'r': 0.237},
 'rouge-l': {'f': 0.281, 'p': 0.246, 'r': 0.357}}

Submodular
{'rouge-1': {'f': 0.347, 'p': 0.307, 'r': 0.433},
 'rouge-2': {'f': 0.197, 'p': 0.169, 'r': 0.258},
 'rouge-l': {'f': 0.3, 'p': 0.265, 'r': 0.374}}

Oracle
{'rouge-1': {'f': 0.489, 'p': 0.465, 'r': 0.547},
 'rouge-2': {'f': 0.265, 'p': 0.247, 'r': 0.308},
 'rouge-l': {'f': 0.403, 'p': 0.385, 'r': 0.45}}

Random
{'rouge-1': {'f': 0.147, 'p': 0.136, 'r': 0.171},
 'rouge-2': {'f': 0.01, 'p': 0.009, 'r': 0.011},
 'rouge-l': {'f': 0.127, 'p': 0.117, 'r': 0.151}}
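
The `p`, `r`, and `f` entries above are read here as precision, recall, and F-score. As an illustration of what these numbers measure, the sketch below computes a simplified ROUGE-1 for a single reference and prediction from clipped unigram overlap. It is not the `evaluate` function used above: `rouge1_sketch` is a hypothetical helper, it tokenizes on whitespace, applies no stemming, and does not average over clusters.

```python
from collections import Counter


def rouge1_sketch(reference, prediction, lowercase=True):
    """Return unigram-overlap precision, recall, and F1 for one summary pair."""
    if lowercase:
        reference, prediction = reference.lower(), prediction.lower()
    ref_counts = Counter(reference.split())
    pred_counts = Counter(prediction.split())
    # Multiset intersection clips each word's count to the smaller of the two.
    overlap = sum((ref_counts & pred_counts).values())
    p = overlap / max(sum(pred_counts.values()), 1)
    r = overlap / max(sum(ref_counts.values()), 1)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {"p": p, "r": r, "f": f}


toy_score = rouge1_sketch("The cat sat on the mat", "the cat lay on a mat")
print({k: round(v, 3) for k, v in toy_score.items()})
# {'p': 0.667, 'r': 0.667, 'f': 0.667}
```

Four of the six predicted words also occur in the six-word reference, so precision and recall are both 4/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.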



Let's look at some example summaries.

In [44]:
cluster_idx = 8
for preds, name in zip(outputs, names):
    print(name)
    print(preds[cluster_idx])
    print()

TextRank
Dalton wanted to plead guilty to the charges against his attorney’s advice, Getting said. Families of victims react to surprise guilty plea in Kalamazoo Uber shooting case Police also confirmed Dalton had no criminal record.

Centroid
Dalton wanted to plead guilty to the charges against his attorney’s advice, Getting said. Families of victims react to surprise guilty plea in Kalamazoo Uber shooting case Police also confirmed Dalton had no criminal record.

Submodular
Dalton wanted to plead guilty to the charges against his attorney’s advice, Getting said. Dalton was driving for Uber at the time and picked up fares between shooting people, police have said.

Oracle
The Uber driver charged with killing six people and seriously wounding two more in a shooting spree around Kalamazoo, Michigan, in 2016 pleaded guilty Monday to all counts against him.

Random
“I was glad to be able to have the opportunity to do this from the office,” Williams said. Jury selection started in a closed