# Creating a WikiPlots subcorpus

By [Allison Parrish](https://github.com/markriedl/WikiPlots)

Mark Riedl's [WikiPlots corpus](https://github.com/markriedl/WikiPlots) has the titles and plot summaries of more than one hundred thousand movies, books, television shows and other media from Wikipedia. That's a lot! This notebook has some sample code to trim it down a little bit. In particular, it shows how to get just a list of plots for Romantic Comedies.

## Wikidata

[Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) is a collection of structured data that (among other things) provides a formal canonical interface to metadata for Wikipedia articles. It has a [query service](https://query.wikidata.org/) that lets you write queries (in a language called [SPARQL](https://en.wikipedia.org/wiki/SPARQL)) to find items in the database with particular characteristics. If you don't know SPARQL, no worries: there's an interface to help you build your query visually. (Make sure to expand the "Query Helper" on the left-hand side.) I used it to create [this query for Wikidata items with a 'genre' of 'Romantic Comedy'](https://query.wikidata.org/#SELECT%20%3Fgenre%20%3Ftitle%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%20%20%3Fgenre%20wdt%3AP136%20wd%3AQ860626.%0A%20%20OPTIONAL%20%7B%20%3Fgenre%20wdt%3AP1476%20%3Ftitle.%20%7D%0A%7D). You can probably futz around to get the query to give you a list of your choosing. ([Here's a query for science fiction films, for example.](https://query.wikidata.org/#SELECT%20%3Fgenre%20%3Ftitle%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%20%20%3Fgenre%20wdt%3AP136%20wd%3AQ471839.%0A%20%20OPTIONAL%20%7B%20%3Fgenre%20wdt%3AP1476%20%3Ftitle.%20%7D%0A%20%20%0A%20%20%0A%20%20%0A%20%20OPTIONAL%20%7B%20%20%7D%0A%7D))

Use the "Download" link near the results preview at the bottom of the page to download the results in TSV format, and put the file in the same directory as this notebook. The following cell will read in the titles (make sure to replace `romcoms.tsv` with the name of your TSV file).

In [2]:
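# The second tab-separated column of each non-empty line holds the title.
# The header row is not skipped, so the first entry is the column name.
romcom_titles = [item.split("\t")[1].strip() for item in open("romcoms.tsv", encoding="utf-8").readlines() if len(item.strip()) > 0]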

Take a look at the first twenty just to make sure that it worked:

In [3]:
romcom_titles[:20]

['title',
 'Prelude to a Kiss',
 'Practical Magic',
 'The Oranges',
 'I Want to Marry Ryan Banks',
 'Gotcha!',
 'Pang! Pang! Du är död!',
 'The Sisterhood of the Traveling Pants 2',
 '',
 'Birthday Girl',
 "National Lampoon's Van Wilder",
 'La tigre e la neve',
 'Rio Rita',
 'American Beauty',
 'உனக்காக எல்லாம் உனக்காக',
 'போடா போடி',
 'Warm Bodies',
 'To Rome with Love',
 'The Broken Hearts Club: A Romantic Comedy',
 'Life or Something Like It']
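
The list above still contains the header (`'title'`) and at least one empty string. If you'd rather not carry those along, here is a minimal sketch of an alternative reader built on Python's `csv` module. It assumes the TSV has a header row with a column named `title` (the `genre` column name and the entity URIs in the sample are placeholders for illustration), and `read_titles` is a hypothetical helper, not part of the original notebook.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded TSV file.
# The entity URIs are placeholders, not real Wikidata IDs.
sample_tsv = (
    "genre\ttitle\n"
    "http://www.wikidata.org/entity/Q0000001\tPrelude to a Kiss\n"
    "http://www.wikidata.org/entity/Q0000002\t\n"
    "http://www.wikidata.org/entity/Q0000003\tPractical Magic\n"
)

def read_titles(fh):
    """Return the non-empty values of the 'title' column, skipping the header."""
    # QUOTE_NONE keeps any quotation marks inside titles as literal characters.
    reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row["title"].strip() for row in reader
            if row["title"] and row["title"].strip()]

clean_titles = read_titles(io.StringIO(sample_tsv))
print(clean_titles)  # ['Prelude to a Kiss', 'Practical Magic']
```

With a real file, you could pass `open("romcoms.tsv", encoding="utf-8", newline="")` to `read_titles` instead of the `StringIO` object.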

The main thing we'll do with this list is check whether a given string is in it, so for the sake of speed I'm going to put it into a set.

In [4]:
romcom_lookup = set(romcom_titles)

Testing it out:

In [5]:
'When Harry Met Sally' in romcom_lookup

True

In [6]:
'Koyaanisqatsi' in romcom_lookup

False

## The WikiPlots corpus

The WikiPlots corpus is provided as a ZIP file containing two files: `titles`, which has a list of titles, and `plots`, which has the plot summaries with one sentence per line and each plot separated by the string `<EOS>`. Our goal is to make a list of title/plot tuples, including only the plots for items whose titles are present in our list of romantic comedies.

If you haven't already, [download the WikiPlots ZIP](https://www.dropbox.com/s/24pa44w7u7wvtma/plots.zip?dl=0) and put it in the same directory as this notebook. You can do this without leaving Jupyter Notebook by executing the following cell:

In [None]:
!curl -L -O https://www.dropbox.com/s/24pa44w7u7wvtma/plots.zip

Thanks to Python's `zipfile` module, we don't even need to decompress the file to work with the data contained therein...

In [8]:
import zipfile

The following cell reads in the plots and the titles as two lists:

In [12]:
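zf = zipfile.ZipFile("./plots.zip")
print(zf.infolist())
# Plots are separated by the string <EOS>; strip the whitespace around each one.
plots = [p.strip() for p in zf.open("plots").read().decode('utf8').split("<EOS>")]
# One title per line.
titles = zf.open("titles").read().decode('utf8').split("\n")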

[<ZipInfo filename='plots' compress_type=deflate filemode='-rw-r--r--' external_attr=0x4000 file_size=233620829 compress_size=90451233>, <ZipInfo filename='titles' compress_type=deflate filemode='-rw-r--r--' external_attr=0x4000 file_size=2361879 compress_size=1050576>]
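
Splitting on a separator can leave an empty entry at the end of a list if the file ends with that separator, so it can be worth confirming that the two lists have the same length. The following is only a sketch: it builds a tiny in-memory ZIP with made-up data in the layout described above (whether the real files end with a trailing separator is an assumption), parses it the same way, and trims trailing empty entries. The `demo_*` names and `drop_trailing_empties` are hypothetical.

```python
import io
import zipfile

# Build a tiny in-memory ZIP that mimics the corpus layout (made-up sample data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as demo:
    demo.writestr("titles", "First Film\nSecond Film\n")
    demo.writestr("plots", "Sentence one.\nSentence two.\n<EOS>\nOnly sentence.\n<EOS>\n")

# Parse it the same way as the real corpus.
with zipfile.ZipFile(buf) as demo:
    demo_plots = [p.strip() for p in demo.open("plots").read().decode("utf8").split("<EOS>")]
    demo_titles = demo.open("titles").read().decode("utf8").split("\n")

def drop_trailing_empties(items):
    """Remove empty strings from the end of the list only, so indexes stay aligned."""
    while items and not items[-1]:
        items.pop()
    return items

demo_plots = drop_trailing_empties(demo_plots)
demo_titles = drop_trailing_empties(demo_titles)

# If this fails on real data, the title/plot pairing cannot be trusted.
assert len(demo_titles) == len(demo_plots)
print(len(demo_titles), len(demo_plots))  # 2 2
```

Only trailing entries are removed here, because dropping an empty entry from the middle of one list would shift every later title/plot pair.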


The titles and plots are index-aligned, so the title at a given index `n` should correspond to the plot at the same index.

In [13]:
import random
rand_idx = random.randrange(len(plots))
print(titles[rand_idx])
print(plots[rand_idx])

Lejontämjaren
Simon is 9 years old and at school he's mobbed by an 11-year-old boy and his gang.
Simon dreams that he tames a lion which helps him with scaring the bad boy who Simon calls "Kobran" ("The Cobra").
One night Simon sees a naked man with briefs on his head and hears his mother's laugh in the background.
The next morning Karin tells Simon that she had a friend at home during the night.
In school during the lunch time he tells his mate Tove about his mother and the "briefs-man" whose real name is Björn.
At same time "Kobran" and his gang come to their table; he requests Simon to go and get milk and when Simon does it he spits in Simon's food.
The other boys laugh but Tove is angry and throws her milk on "Kobran"'s face and she and Simon escapes into the headmaster's room.
At the evening Björn comes to Simon's home and has a dinner with him and Karin.
He tells that he has a son called Alex who is studying in Simon's school, but he doesn't know that Alex and "Kobran" is the sam

Wow nice! The following cell makes a list of title/plot pairs. The plots themselves come pre-separated into sentences (one per line), so I'm taking this opportunity to split each plot into a list of sentences. We'll end up with a data structure that looks like this:

    [
        ("title of film", ["sentence 1", "sentence 2", "sentence 3", "sentence 4", ...more sentences...]),
        ... many more of these ...
    ]

In [14]:
romcoms = [(title, plot.split("\n")) for title, plot in zip(titles, plots) if title in romcom_lookup]

In total, we have about fifteen hundred plot summaries (the next cell shows the exact count). Note that due to potential miscategorization and mismatches between the Wikipedia page titles and the canonical title in Wikidata, this process has a lot of false positives and false negatives! We're not guaranteed to get every romantic comedy on Wikipedia, nor are we guaranteed to only have romantic comedies in the list. But it's good enough to get started.

In [15]:
len(romcoms)

1500
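
One possible way to reduce the false negatives mentioned above is to normalize titles before comparing them. This is a sketch under assumptions, not what the notebook does: it assumes some page titles carry a trailing parenthetical disambiguator such as `(film)`, and it uses made-up sample lists. `normalize_title` is a hypothetical helper.

```python
import re

def normalize_title(title):
    """Drop a trailing parenthetical like ' (film)' and ignore case."""
    without_suffix = re.sub(r"\s*\([^()]*\)\s*$", "", title)
    return without_suffix.casefold().strip()

# Made-up sample data for illustration.
demo_lookup = {normalize_title(t) for t in ["Birthday Girl", "Warm Bodies"]}
demo_pages = ["Birthday Girl (2001 film)", "Warm Bodies (film)", "Koyaanisqatsi"]

matches = [t for t in demo_pages if normalize_title(t) in demo_lookup]
print(matches)  # ['Birthday Girl (2001 film)', 'Warm Bodies (film)']
```

Looser matching cuts both ways: it may recover titles that an exact comparison misses, but it can also pull in unrelated works that happen to share a name.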

The following cell gives a random title/plot:

In [18]:
random.choice(romcoms)

('Bollywood Beats',
 ['Raj (Sachin Bhatt) is an Indian-American choreographer trying to make it in the bustling city of Los Angeles.',
  "After being dumped by his girlfriend, bombing another dance audition, and nearly getting kicked out of his parent's home, Raj's luck changes when he meets Jyoti, (Lilette Dubey), a sexy Indian woman, who suggests for him to teach her and a group of Indian women dance.",
  "While unsuccessful at the start, Raj's class grows with Vincent, (Mehul Shah), a gay teen who wants to dance regardless of what his father thinks, Laxmi, (Pooja Kumar), a South Indian woman new to this country and friendless, Puja, (Mansi Patel) an unethusiastic high school student, who is being dragged by her grandmother, Vina (Sarita Joshi).",
  'Through it all, the group manages to find family, love, and acceptance where they least expected to.'])

For ease of use, I'm going to export this in TSV format, with one line per sentence. Each line has four tab-separated fields: the movie title, the sentence's index, the total number of sentences in that movie, and the sentence itself. I use this file in the other notebook in this repository to do [text analysis and generation](https://github.com/aparrish/corpus-driven-narrative-generation/blob/master/corpus-driven-narrative-generation.ipynb).

In [19]:
with open("romcom_plot_sentences.tsv", "w", encoding="utf-8") as fh:
    for title, sentences in romcoms:
        if len(title) == 0:
            continue
        total = len(sentences)
        for i, sent in enumerate(sentences):
            print("\t".join([title, str(i), str(total), sent]), file=fh)
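
To show how the exported file might be read back, here is a minimal sketch using the `csv` module. It works on a made-up in-memory sample in the four-column layout written above (title, sentence index, sentence count, sentence), and it assumes no sentence contains a tab character. The `plots_by_title` name is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the same layout as romcom_plot_sentences.tsv.
sample = (
    "Demo Film\t0\t2\tFirst sentence.\n"
    "Demo Film\t1\t2\tSecond \"quoted\" sentence.\n"
    "Other Film\t0\t1\tOnly sentence.\n"
)

plots_by_title = defaultdict(list)
# QUOTE_NONE treats quotation marks in sentences as ordinary characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
for title, index, total, sentence in reader:
    plots_by_title[title].append((int(index), int(total), sentence))

print(plots_by_title["Demo Film"][1])  # (1, 2, 'Second "quoted" sentence.')
```

Grouping by title merges works that share a name, so if that matters for your project you may want to keep the rows in file order instead.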

You might be interested in having a plain text version of this data, in order to (e.g.) train a language model. The following cell exports one big text file, with one line per sentence:

In [20]:
with open("romcom_export.txt", "w", encoding="utf-8") as fh:
    for title, sentences in romcoms:
        for sent in sentences:
            fh.write(sent + "\n")
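
The export above runs all the plots together, so the file doesn't record where one plot ends and the next begins. If that boundary matters for your use, one option (a sketch with made-up data, not part of the original export) is to write a blank line after each plot. Here an `io.StringIO` object stands in for the output file, and `demo_romcoms` is a hypothetical stand-in for `romcoms`.

```python
import io

# Made-up sample in the same (title, [sentences]) shape as the romcoms list.
demo_romcoms = [
    ("Demo Film", ["First sentence.", "Second sentence."]),
    ("Other Film", ["Only sentence."]),
]

out = io.StringIO()  # stands in for an open text file
for title, sentences in demo_romcoms:
    for sent in sentences:
        out.write(sent + "\n")
    out.write("\n")  # blank line marks the end of one plot

exported = out.getvalue()
print(exported)
```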

Enjoy!