Code used to create the WikinewsSum dataset presented in *Generating Extended and Multilingual Summaries with Pre-trained Transformers*.
WikinewsSum is based on Wikinews: the Wikinews articles serve as the summaries, and the texts of the sources they cite serve as the input documents.
The dataset is distributed via the European Language Grid: https://live.european-language-grid.eu/catalogue/corpus/18633
First, download the dataset from the European Language Grid (link above).
Then you can load it using the `datasets` library:
from datasets import load_dataset
dataset = load_dataset("json", data_files={"train": "train.json", "test": "test.json", "validation": "validation.json"})
Each sample of the dataset has the following structure:
{
    "language": "en",
    "title": "This is the title",
    "news": "This is the 1st passage of the news.\nThis is the second one.\n...",
    "sources": [
        {
            "language": "en",
            "text": "This is the raw text from the source",
            "url": "url of the source",
            "archive_url": "archive url of the source"
        },
        {
            "language": "en",
            "text": "This is the raw text from the source",
            "url": "url of the source",
            "archive_url": "archive url of the source"
        }
    ]
}
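For example, a single sample can be inspected by reading one line of the downloaded file directly (a minimal sketch; it assumes the files are in JSON Lines format with one sample per line, as the creation scripts below produce):

import json

# Read the first sample from the JSON Lines file.
with open("train.json") as f:
    sample = json.loads(f.readline())

print(sample["language"], "-", sample["title"])
print(sample["news"][:200])  # first 200 characters of the summary text

# Each sample carries the raw text and URLs of its source documents.
for source in sample["sources"]:
    print(source["language"], source["url"], len(source["text"]), "characters")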
| Languages | # samples | # cross-lingual samples | Input documents: # words | Input documents: # sentences | Summaries: # words | Summaries: # sentences |
|---|---|---|---|---|---|---|
| English | 11,616 | 641 (5.5%) | 1,466 | 57 | 300 | 13 |
| German | 8,126 | 2,796 (34.4%) | 1,179 | 58 | 241 | 13 |
| French | 6,200 | 1,932 (31.2%) | 884 | 29 | 176 | 7 |
| Spanish | 6,116 | 2,137 (34.9%) | 1,215 | 42 | 276 | 10 |
| Portuguese | 3,843 | 1,971 (51.3%) | 1,037 | 38 | 221 | 8 |
| Polish | 3,630 | 1,214 (33.4%) | 734 | 35 | 173 | 10 |
| Italian | 95 | 46 (48.4%) | 1,021 | 35 | 224 | 8 |
| All languages | 39,626 | 10,737 (27.1%) | 1,168 | 47 | 245 | 11 |
cd wikinewssum
First you need to download the Wikinews dump you want to create the dataset from. Replace `en` in the commands below with the language you want.
cd dumps
wget https://dumps.wikimedia.org/enwikinews/latest/enwikinews-latest-pages-meta-current.xml.bz2
cd ..
from src import Wikinews
Wikinews.save(dump_path="dumps/enwikinews-latest-pages-meta-current.xml.bz2", max_doc_count=0, folder="enwikinews")
This will create two files in the `enwikinews` folder: `docs.json`, which contains the information of the Wikinews articles, and `sources.txt`, which contains all the sources (one per line).
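To sanity-check the extraction, you can count and preview the collected source URLs (a minimal sketch; it only relies on `sources.txt` containing one URL per line, as described above):

# Count and preview the source URLs collected from the Wikinews articles.
with open("enwikinews/sources.txt") as f:
    source_urls = [line.strip() for line in f if line.strip()]

print(len(source_urls), "source URLs found, for example:")
for url in source_urls[:3]:
    print(" ", url)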
from src import Sources
Sources.save(file="enwikinews/sources.txt", folder="sources", max_workers=5, timeout=10, retry=False)
This can take hours. You can stop it whenever you want and run it again: it will reuse the sources already saved if the `folder` parameter is the same. Sources that could not be fetched will be skipped unless the `retry` parameter is set to `True`.
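Because the download can run for hours, a rough progress check can be handy. A minimal sketch, assuming `Sources.save` stores each downloaded source as its own file inside the `sources` folder (the exact file layout is not documented here):

from pathlib import Path

# Rough progress check: compare the number of files already saved in the
# sources folder (layout assumed: one file per source) against the number
# of URLs listed in sources.txt.
saved = sum(1 for p in Path("sources").rglob("*") if p.is_file())

with open("enwikinews/sources.txt") as f:
    total = sum(1 for line in f if line.strip())

print(f"{saved} / {total} sources saved so far")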
from src import Content
Content.extract(folder="sources")
from src import Dataset
Dataset.create(wikinews_folder="enwikinews", sources_folder="sources", dataset_folder="endataset")
zip -r enwikinewssum.zip endataset/ sources/ enwikinews/
import json
from pathlib import Path

dataset_folder = Path("endataset")
files = list(dataset_folder.glob("**/*.json"))

with open("endataset.json", "w") as output_file:
    for filename in files:
        with open(filename) as f:
            entry = json.load(f)
        output_file.write(json.dumps(entry))
        output_file.write("\n")
Step 7: Clean the texts by using another text extraction method (which removes the sources; this is why it was not used in the first place)
python -m gensim.scripts.segment_wiki -i -f dumps/enwikinews-latest-pages-meta-current.xml.bz2 -o dumps/enwikinews-latest.json.gz
from src import Clean
Clean.clean("endataset.json", "dumps/enwikinews-latest.json.gz", "cleaned_endataset.json")
import json
from pathlib import Path

# Map each language code to its cleaned dataset file.
# Uncomment the languages you have built.
json_filenames = {
    "en": "cleaned_endataset.json",
    # "de": "cleaned_dedataset.json",
    # "es": "cleaned_esdataset.json",
    # "fr": "cleaned_frdataset.json",
    # "it": "cleaned_itdataset.json",
    # "pl": "cleaned_pldataset.json",
    # "pt": "cleaned_ptdataset.json",
}

final_json_filename = "wikinewssum.json"

with open(final_json_filename, "w") as final_json:
    for language, json_filename in json_filenames.items():
        with open(json_filename) as json_file:
            data = json_file.read()
        for entry_str in data.split("\n"):
            if entry_str is None or entry_str == "":
                continue
            entry = json.loads(entry_str)
            # Keep only the fields needed for the final dataset.
            new_entry = {
                "language": language,
                "title": entry["title"],
                "news": entry["text"],
                "categories": entry["categories"],
                "sources": []
            }
            for source in entry["sources"]:
                new_entry["sources"].append({
                    "language": source["language"],
                    "title": source["title"],
                    "text": source["maintext"],
                    "url": source["url"],
                    "archive_url": source["archive_url"]
                })
            final_json.write(json.dumps(new_entry))
            final_json.write("\n")
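To check the result of the merge, you can print how many entries each language contributes to `wikinewssum.json` (a minimal sketch):

import json
from collections import Counter

# Count how many merged entries each language contributes.
counts = Counter()
with open("wikinewssum.json") as f:
    for line in f:
        if line.strip():
            counts[json.loads(line)["language"]] += 1

for language, count in counts.most_common():
    print(language, count)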
from src import Filter
Filter.filter("wikinewssum.json", "filtered_wikinewssum.json")
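To see how many samples the filtering removed, compare the number of lines in the two files (a minimal sketch):

# Compare the number of samples before and after filtering.
with open("wikinewssum.json") as f:
    before = sum(1 for line in f if line.strip())
with open("filtered_wikinewssum.json") as f:
    after = sum(1 for line in f if line.strip())

print(f"{before} samples before filtering, {after} left after ({before - after} removed)")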
from datasets import load_dataset
dataset = load_dataset("json", data_files="filtered_wikinewssum.json")
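This loads everything into a single "train" split. If you need train/validation/test splits like the distributed version, you can create your own (a minimal sketch using `Dataset.train_test_split`; the 80/10/10 ratio and the seed are arbitrary choices, not the official split):

from datasets import DatasetDict

# Split the single "train" split into train/validation/test (80/10/10).
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)
val_test = splits["test"].train_test_split(test_size=0.5, seed=42)

dataset = DatasetDict({
    "train": splits["train"],
    "validation": val_test["train"],
    "test": val_test["test"],
})
print(dataset)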