The WikiNovelPlots corpus is a collection of 84,112 novel plots extracted from English-language Wikipedia. These plots were extracted by looking at every English-language article under the Novels category and its sub-categories, then grabbing the first section whose title contains one of the words "plot", "summary", or "synopsis".
This repository contains the corpus as a Python pickle, as well as code and instructions for how to recreate the WikiNovelPlots corpus.
The corpus is saved as a Python pickle to reduce disk space, stay under GitHub's 100 MB file size limit, and avoid the thorny problem of delimiting records with newlines and quotes. You will need Python and the pandas library to read the file:
```python
# extracted from view_clean_summaries.py
import pandas as pd

df = pd.read_pickle('summaries_clean.pkl', compression='xz')
df.head(5)
```
```
                        title  pageid                                      summary_clean  summary_length
0                 Animal_Farm     620  The poorly-run Manor Farm near Willingdon, Eng...            5514
1           A_Modest_Proposal     665  Swift's essay is widely held to be one of the ...            2963
2         Alexander_the_Great     783  The Killing of Cleitus, by André Castaigne ...              5650
3  A_Clockwork_Orange_(novel)     843  Part 1: Alex's world[edit]\nAlex is a 15-year-...            6705
4             Agatha_Christie     984  Christie has been called the "Duchess of Death...           10427
```
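Each row pairs the article title and Wikipedia page ID with the cleaned summary text and its length in characters, so simple pandas filters go a long way. For example (the 5,000-character threshold here is arbitrary):

```python
# keep only novels whose plot section runs 5,000+ characters
long_plots = df[df['summary_length'] >= 5000]
print(f'{len(long_plots)} summaries of 5,000+ characters')
```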
- Use this preset PetScan query, or navigate to the main PetScan site and set `Categories=Novels` and `Depth=9000`
- Click `Do it!` to execute your query
- Click on `Output` and export the data as CSV
- Execute `download_summaries.py` and pass the name of the PetScan file you just exported. This will grab the summary for each page (if available) and save a Python pickle named `summaries.pkl`. The script executes ~600k serial requests against the MediaWiki API and took me 22 hours to run. While one could speed this up via batch requests and parallelization, serial requests are specifically called out as following API etiquette and I had the time to spare. A sketch of the per-page request appears after this list.

  ```
  python download_summaries.py name_of_petscan_export.csv
  ```
- Execute `clean_summaries.py`. This will load `summaries.pkl`, strip the HTML from each summary and sanitize it, then save a Python pickle named `summaries_clean.pkl`. A sketch of this step also appears after the list.

  ```
  python clean_summaries.py
  ```
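For reference, one request cycle against the MediaWiki API looks roughly like the sketch below. This is a minimal illustration, not code lifted from `download_summaries.py`; the function name `fetch_summary` and the lack of retry/throttling logic are my assumptions.

```python
# minimal sketch of one serial request cycle against the MediaWiki API;
# the real download_summaries.py may differ in retries, throttling, and output
import requests

API = 'https://en.wikipedia.org/w/api.php'
PLOT_WORDS = ('plot', 'summary', 'synopsis')

def fetch_summary(pageid, session):
    # list the page's sections and look for the first plot-like heading
    resp = session.get(API, params={'action': 'parse', 'pageid': pageid,
                                    'prop': 'sections', 'format': 'json'})
    for sec in resp.json().get('parse', {}).get('sections', []):
        if any(word in sec['line'].lower() for word in PLOT_WORDS):
            # fetch only that section's rendered HTML
            resp = session.get(API, params={'action': 'parse', 'pageid': pageid,
                                            'section': sec['index'],
                                            'prop': 'text', 'format': 'json'})
            return resp.json()['parse']['text']['*']
    return None  # no plot/summary/synopsis section on this page
```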
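Likewise, the cleaning pass boils down to stripping tags and recording lengths. A minimal sketch, assuming a raw-HTML `summary` column and BeautifulSoup for tag stripping (only `summary_clean` and `summary_length` are confirmed by the final corpus; `clean_summaries.py` itself may use a different parser, column names, and drop criteria):

```python
# minimal sketch of the cleaning step; the 'summary' input column is assumed
import pandas as pd
from bs4 import BeautifulSoup

# add compression='xz' if the intermediate pickle is compressed like the final one
df = pd.read_pickle('summaries.pkl')
df = df.dropna(subset=['summary'])  # drop pages with no summary section
df['summary_clean'] = df['summary'].map(
    lambda html: BeautifulSoup(html, 'html.parser').get_text().strip())
df['summary_length'] = df['summary_clean'].str.len()
df = df[df['summary_length'] > 0]
df.to_pickle('summaries_clean.pkl', compression='xz')
```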
| input | step | output | output_records | output_size |
|---|---|---|---|---|
| - | export PetScan with `Categories=Novels` and `Depth=9000` | petscan_psid_21520280_20220223.csv | 306841 | 20.7 MB |
| petscan_psid_21520280_20220223.csv | `download_summaries.py` | summaries.pkl | 306841 | 77.8 MB |
| summaries.pkl | `clean_summaries.py` | summaries_clean.pkl | 84112 | 58 MB |
This repository and dataset are my adaptation of Mark Riedl's WikiPlots repository. Mark created his version back in 2017 by taking an English Wikipedia dump, expanding it with Wikiextractor, and then scanning through each page. That process is labor-intensive and no longer works in 2022 because of a bug in Wikiextractor. I explored a number of options and settled on the approach in this repo, which uses PetScan and the MediaWiki API.
The cleaned summaries sometimes require further cleaning:

- leftover citations, escaped characters, multiple newlines:

  ```
  \n\n\n^ Berlin 2018, p.\xa01163.
  ```

- formatting code:

  ```
  .mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}
  ```

- cite errors:

  ```
  Cite error: The named reference Genre was invoked but never defined (see the help page).
  ```
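If those artifacts matter for your use case, a regex pass can knock most of them out. The patterns below are a sketch keyed to the three examples above; they are illustrative, not exhaustive, and not part of this repo's scripts:

```python
import re

def post_clean(text):
    # drop inline CSS rules such as the .mw-parser-output block above
    text = re.sub(r'\.mw-parser-output[^{}]*\{[^}]*\}', '', text)
    # drop cite-error messages
    text = re.sub(r'Cite error:[^\n]*', '', text)
    # drop leftover footnote lines like '^ Berlin 2018, p. 1163.'
    text = re.sub(r'^\^ .*$', '', text, flags=re.MULTILINE)
    # normalize non-breaking spaces, then collapse runs of newlines
    text = text.replace('\xa0', ' ')
    return re.sub(r'\n{2,}', '\n', text).strip()
```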