The WikiNovelPlots corpus is a collection of 84,112 novel plots extracted from English-language Wikipedia. These plots were extracted by looking at every English-language article under the Novels category and its sub-categories, then grabbing the first section whose title contains one of the words "plot", "summary", or "synopsis".
This repository contains the corpus as a Python pickle, as well as code and instructions for how to recreate the WikiNovelPlots corpus.
The corpus is saved as a Python pickle to reduce disk space, stay under GitHub's 100 MB file size limit, and avoid the thorny problem of delimiting records with newlines and quotes. You will need Python and the pandas library to read the file:
```python
# extracted from view_clean_summaries.py
import pandas as pd

df = pd.read_pickle('summaries_clean.pkl', compression='xz')
df.head(5)
```
```
                        title  pageid                                      summary_clean  summary_length
0                 Animal_Farm     620  The poorly-run Manor Farm near Willingdon, Eng...            5514
1           A_Modest_Proposal     665  Swift's essay is widely held to be one of the ...            2963
2         Alexander_the_Great     783  The Killing of Cleitus, by André Castaigne ...              5650
3  A_Clockwork_Orange_(novel)     843  Part 1: Alex's world[edit]\nAlex is a 15-year-...            6705
4             Agatha_Christie     984  Christie has been called the "Duchess of Death...           10427
```
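Each row pairs the article title and Wikipedia page ID with the cleaned summary text and its length in characters, so simple pandas filters go a long way. For example (the 5,000-character threshold here is arbitrary):

```python
# keep only novels whose plot section runs 5,000+ characters
long_plots = df[df['summary_length'] >= 5000]
print(f'{len(long_plots)} summaries of 5,000+ characters')
```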
- Use this preset PetScan query, or navigate to the main PetScan site and set `Categories=Novels` and `Depth=9000`
- Click `Do it!` to execute your query
- Click on `Output` and export the data as CSV
- Execute `download_summaries.py` and pass the name of the PetScan file you just exported. This will grab the summary for each page (if available) and save a Python pickle named `summaries.pkl`. The script executes ~600k serial requests against the MediaWiki API and took me 22 hours to run. While one could speed this up via batch requests and parallelization, serial requests are specifically called out as following API etiquette and I had the time to spare. A sketch of the per-page request appears after this list.

  ```
  python download_summaries.py name_of_petscan_export.csv
  ```
- Execute `clean_summaries.py`. This will load `summaries.pkl`, strip the HTML from each summary and sanitize it, then save a Python pickle named `summaries_clean.pkl`. A sketch of this step also appears after the list.

  ```
  python clean_summaries.py
  ```
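For reference, one request cycle against the MediaWiki API looks roughly like the sketch below. This is a minimal illustration, not code lifted from `download_summaries.py`; the function name `fetch_summary` and the lack of retry/throttling logic are my assumptions.

```python
# minimal sketch of one serial request cycle against the MediaWiki API;
# the real download_summaries.py may differ in retries, throttling, and output
import requests

API = 'https://en.wikipedia.org/w/api.php'
PLOT_WORDS = ('plot', 'summary', 'synopsis')

def fetch_summary(pageid, session):
    # list the page's sections and look for the first plot-like heading
    resp = session.get(API, params={'action': 'parse', 'pageid': pageid,
                                    'prop': 'sections', 'format': 'json'})
    for sec in resp.json().get('parse', {}).get('sections', []):
        if any(word in sec['line'].lower() for word in PLOT_WORDS):
            # fetch only that section's rendered HTML
            resp = session.get(API, params={'action': 'parse', 'pageid': pageid,
                                            'section': sec['index'],
                                            'prop': 'text', 'format': 'json'})
            return resp.json()['parse']['text']['*']
    return None  # no plot/summary/synopsis section on this page
```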
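Likewise, the cleaning pass boils down to stripping tags and recording lengths. A minimal sketch, assuming a raw-HTML `summary` column and BeautifulSoup for tag stripping (only `summary_clean` and `summary_length` are confirmed by the final corpus; `clean_summaries.py` itself may use a different parser, column names, and drop criteria):

```python
# minimal sketch of the cleaning step; the 'summary' input column is assumed
import pandas as pd
from bs4 import BeautifulSoup

# add compression='xz' if the intermediate pickle is compressed like the final one
df = pd.read_pickle('summaries.pkl')
df = df.dropna(subset=['summary'])  # drop pages with no summary section
df['summary_clean'] = df['summary'].map(
    lambda html: BeautifulSoup(html, 'html.parser').get_text().strip())
df['summary_length'] = df['summary_clean'].str.len()
df = df[df['summary_length'] > 0]
df.to_pickle('summaries_clean.pkl', compression='xz')
```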
| input | step | output | output_records | output_size |
|---|---|---|---|---|
| - | export PetScan with `Categories=Novels` and `Depth=9000` | petscan_psid_21520280_20220223.csv | 306841 | 20.7 MB |
| petscan_psid_21520280_20220223.csv | `download_summaries.py` | summaries.pkl | 306841 | 77.8 MB |
| summaries.pkl | `clean_summaries.py` | summaries_clean.pkl | 84112 | 58 MB |
This repository and dataset are my adaptation of Mark Riedl's WikiPlots repository. Mark created his version back in 2017 by taking an English Wikipedia dump, expanding it with Wikiextractor, and then scanning through each page. That process is labor-intensive and no longer works in 2022 because of a bug in Wikiextractor. I explored a number of options and settled on the approach in this repo, which uses PetScan and the MediaWiki API.
The cleaned summaries sometimes require further cleaning:

- leftover citations, escaped characters, multiple newlines:

  ```
  \n\n\n^ Berlin 2018, p.\xa01163.
  ```

- formatting code:

  ```
  .mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}
  ```

- cite errors:

  ```
  Cite error: The named reference Genre was invoked but never defined (see the help page).
  ```
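If those artifacts matter for your use case, a regex pass can knock most of them out. The patterns below are a sketch keyed to the three examples above; they are illustrative, not exhaustive, and not part of this repo's scripts:

```python
import re

def post_clean(text):
    # drop inline CSS rules such as the .mw-parser-output block above
    text = re.sub(r'\.mw-parser-output[^{}]*\{[^}]*\}', '', text)
    # drop cite-error messages
    text = re.sub(r'Cite error:[^\n]*', '', text)
    # drop leftover footnote lines like '^ Berlin 2018, p. 1163.'
    text = re.sub(r'^\^ .*$', '', text, flags=re.MULTILINE)
    # normalize non-breaking spaces, then collapse runs of newlines
    text = text.replace('\xa0', ' ')
    return re.sub(r'\n{2,}', '\n', text).strip()
```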