# Creating a WikiPlots subcorpus

By [Allison Parrish](https://github.com/markriedl/WikiPlots)

Mark Riedl's [WikiPlots corpus](https://github.com/markriedl/WikiPlots) has the titles and plot summaries of more than one hundred thousand movies, books, television shows and other media from Wikipedia. That's a lot! This notebook has some sample code to trim it down a little bit. In particular, it shows how to get just a list of plots for Romantic Comedies.

## Wikidata

[Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) is a collection of structured data that (among other things) provides a formal canonical interface to metadata for Wikipedia articles. It has a [query service](https://query.wikidata.org/) that lets you write queries (in a language called [SPARQL](https://en.wikipedia.org/wiki/SPARQL)) to find items in the database with particular characteristics. If you don't know SPARQL, no worries: there's an interface to help you build your query visually. I used it to create [this query for Wikidata items with a 'genre' of 'Romantic Comedy'](https://query.wikidata.org/#SELECT%20%3Fgenre%20%3Ftitle%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%20%20%3Fgenre%20wdt%3AP136%20wd%3AQ860626.%0A%20%20OPTIONAL%20%7B%20%3Fgenre%20wdt%3AP1476%20%3Ftitle.%20%7D%0A%7D). You can probably futz around to get the query to give you a list of your choosing. ([Here's a query for science fiction films, for example.](https://query.wikidata.org/#SELECT%20%3Fgenre%20%3Ftitle%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%20%20%3Fgenre%20wdt%3AP136%20wd%3AQ471839.%0A%20%20OPTIONAL%20%7B%20%3Fgenre%20wdt%3AP1476%20%3Ftitle.%20%7D%0A%20%20%0A%20%20%0A%20%20%0A%20%20OPTIONAL%20%7B%20%20%7D%0A%7D))

Use the "Download" link near the results preview at the bottom of the page to download the results in TSV format, and put the file in the same directory as this notebook. The following cell will read in the titles (make sure to replace `romcoms.tsv` with the name of your TSV file).

In [5]:
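# column 0 is the Wikidata item, column 1 is its title; blank lines are skipped
with open("romcoms.tsv", encoding="utf8") as fh:
    romcom_titles = [item.split("\t")[1].strip() for item in fh.readlines() if len(item.strip()) > 0]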

Take a look at the first twenty just to make sure that it worked:

In [23]:
romcom_titles[:20]

['title',
 'Punch-Drunk Love',
 'Love and Other Disasters',
 'The Boob',
 '',
 'Here Comes Mr. Jordan',
 "Something's Gotta Give",
 'Fools Rush In',
 'April in Paris',
 'Ice Princess',
 'Bandits',
 'My Favorite Wife',
 'Ninotchka',
 'Wild Child',
 'All About Steve',
 'The Bachelor',
 'The Wedding Planner',
 'Hope Springs',
 'Nutty Professor II: The Klumps',
 '¡Átame!']
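
Notice that the header row (`title`) and an empty string (an item with no title) both show up in the list above. They are mostly harmless here, but as a side note, here is a minimal sketch of one way to filter them out while reading. It runs on a small in-memory sample rather than the real file; the function name `read_titles`, the `item-N` placeholders, and the sample rows are made up for illustration and are not part of the notebook.

```python
import csv
import io

def read_titles(fh):
    """Return non-empty titles from a two-column TSV, skipping the header row."""
    reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    found = []
    for row in reader:
        # ignore rows whose second column is missing or blank
        if len(row) > 1 and row[1].strip():
            found.append(row[1].strip())
    return found

# made-up sample in the same two-column shape as the query results
sample = io.StringIO(
    "genre\ttitle\n"
    "item-1\tPunch-Drunk Love\n"
    "item-2\t\n"
    "item-3\tThe Boob\n"
)
sample_titles = read_titles(sample)
print(sample_titles)
```

The rest of the notebook keeps using `romcom_titles` as read in above.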

We'll mostly be using this list to check whether a given string is present in it, so for the sake of speed I'm going to put it into a set.

In [42]:
romcom_lookup = set(romcom_titles)

Testing it out:

In [46]:
'When Harry Met Sally' in romcom_lookup

True

In [47]:
'Koyaanisqatsi' in romcom_lookup

False

## The WikiPlots corpus

The WikiPlots corpus is provided as a ZIP file containing two files: `titles`, which has one title per line, and `plots`, which has one plot sentence per line, with each plot separated from the next by the string `<EOS>`. Our goal is to make a list of title/plot tuples, including only the plots for items whose titles are present in our list of Romantic Comedies.

If you haven't already, [download the WikiPlots ZIP](https://www.dropbox.com/s/24pa44w7u7wvtma/plots.zip?dl=0) and put it in the same directory as this notebook.

Thanks to Python's `zipfile` module, we don't even need to decompress the file to work with the data contained therein...

In [48]:
import zipfile

The following cell reads in the plots and the titles:

In [49]:
zf = zipfile.ZipFile("./plots.zip")
print(zf.infolist())
plots_raw = zf.open("plots").read().decode('utf8')
titles_raw = zf.open("titles").read().decode('utf8')

[<ZipInfo filename='plots' compress_type=deflate filemode='-rw-r--r--' external_attr=0x4000 file_size=233620829 compress_size=90451233>, <ZipInfo filename='titles' compress_type=deflate filemode='-rw-r--r--' external_attr=0x4000 file_size=2361879 compress_size=1050576>]


And then this cell reads them into lists. The titles and plots are index-aligned, so the title at a given index `n` should correspond to the plot at the same index.

In [50]:
plots = [p.strip() for p in plots_raw.split("<EOS>")]
titles = titles_raw.split("\n")
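
Since everything downstream relies on this alignment, a quick sanity check can be worthwhile. The sketch below runs the same splitting logic on made-up strings (the `_demo` names and the sample text are illustrative, and whether the real files end with a trailing separator is an assumption) and compares the lengths of the two lists.

```python
# made-up stand-ins for plots_raw and titles_raw
plots_raw_demo = "Sentence one.\nSentence two.\n<EOS>\nOnly sentence.\n<EOS>\n"
titles_raw_demo = "First Title\nSecond Title\n"

plots_demo = [p.strip() for p in plots_raw_demo.split("<EOS>")]
titles_demo = titles_raw_demo.split("\n")

# a trailing separator leaves one empty entry at the end of each list
print(len(titles_demo), len(plots_demo))

# keep only pairs where both the title and the plot are non-empty
pairs = [(t, p) for t, p in zip(titles_demo, plots_demo) if t and p]
print(pairs)
```

If the two lengths differ on the real data, the titles and plots have probably drifted out of step somewhere, and the pairs after that point should not be trusted.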

In [51]:
import random
rand_idx = random.randrange(len(plots))
print(titles[rand_idx])
print(plots[rand_idx])

Three Poplars in Plyushcikha
From a village to Moscow comes a married woman and mother of two children Nyura to sell home-made ham.
And the first person she meets is an intelligent taxi driver Sasha, who must pick her up to her in-law; her husband's sister, who lives near the cafe "Three Poplars" at Plyushcikha.
This random meeting brings the strangers together and forces them to take a fresh look at their lives.
But unfortunately a continuation of this connection does not develop.


Wow nice! The following cell makes a list of title/plot pairs. The plots themselves come pre-separated into sentences (separated by newlines), so I'm taking this opportunity to turn them into a list. We'll end up with a data structure that looks like this:

    [
        ("title of film", ["sentence 1", "sentence 2", "sentence 3", "sentence 4", ...more sentences...]),
        ... many more of these ...
    ]

In [38]:
romcoms = [(title, plot.split("\n")) for title, plot in zip(titles, plots) if title in romcom_lookup]

In total, we have a bit over twelve hundred plot summaries. Note that due to potential miscategorization and mismatches between the Wikipedia page titles and the canonical title in Wikidata, this process has a lot of false positives and false negatives! We're not guaranteed to get every romantic comedy on Wikipedia, nor are we guaranteed to only have romantic comedies in the list. But it's good enough to get started.

In [52]:
len(romcoms)

1209
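
One possible way to cut down on false negatives is to compare normalized titles instead of exact strings. The following is only a sketch under assumptions: it supposes that some mismatches come from capitalization or from a trailing parenthetical on the page title, which may or may not hold for this corpus. The helper `normalize_title` and the sample titles are hypothetical.

```python
import re

def normalize_title(title):
    """Casefold a title and drop one trailing parenthetical, e.g. '(1999 film)'."""
    without_suffix = re.sub(r"\s*\([^()]*\)\s*$", "", title)
    return without_suffix.strip().casefold()

# made-up examples standing in for romcom_titles and titles
wikidata_titles = ["The Bachelor", "Wild Child"]
corpus_titles = ["The Bachelor (1999 film)", "wild child", "Koyaanisqatsi"]

lookup = {normalize_title(t) for t in wikidata_titles}
matched = [t for t in corpus_titles if normalize_title(t) in lookup]
print(matched)
```

Looser matching is a trade-off: it catches more of the titles that differ only superficially, but it also lets in more unrelated works that happen to share a name.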

The following cell gives a random title/plot:

In [64]:
random.choice(romcoms)

('The Lords of Flatbush',
 ['Set in 1958, the coming-of-age story follows four Brooklyn teenagers known as The Lords of Flatbush.',
  'The Lords chase girls, steal cars, play pool and hang out at a local malt shop.',
  'The film focuses on Chico (Perry King) attempting to win over Jane (Susan Blakely), a girl who wants little to do with him, and Stanley (Sylvester Stallone), who impregnates his girlfriend, Frannie (Maria Smith), who pressures him to marry her.',
  'Stanley agrees to marry her, even after finding out before the wedding that Frannie never was pregnant.',
  'Butchey Weinstein (Henry Winkler) is highly intelligent but hides his brains behind a clownish front, while Wimpy Murgalo (Paul Mace) is a colorless follower in awe of Chico and Stanley.'])

For ease of use, I'm going to export this in TSV format, with one line per sentence. Each line has four columns: the title, the sentence's index within its plot, the total number of sentences in that plot, and the sentence itself.

In [75]:
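with open("romcom_plot_sentences.tsv", "w", encoding="utf8") as fh:
    for title, sentences in romcoms:
        # the lookup set contains '' (Wikidata items with no title), so skip empty titles
        if len(title) == 0:
            continue
        total = len(sentences)
        # columns: title, sentence index, sentence count, sentence text
        for i, sent in enumerate(sentences):
            print("\t".join([title, str(i), str(total), sent]), file=fh)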
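
To use the exported file later, it can be read back and regrouped by title. Here's a minimal sketch that writes a tiny file in the same four-column layout to a temporary directory and then loads it again. The file name `demo_plot_sentences.tsv`, the sample rows, and the variable `plots_by_title` are made up for illustration.

```python
import os
import tempfile
from collections import defaultdict

# made-up rows: title, sentence index, sentence count, sentence text
rows = [
    ("Example Film", 0, 2, "First sentence."),
    ("Example Film", 1, 2, "Second sentence."),
    ("Another Film", 0, 1, "Only sentence."),
]
path = os.path.join(tempfile.mkdtemp(), "demo_plot_sentences.tsv")
with open(path, "w", encoding="utf8") as fh:
    for title, i, total, sent in rows:
        print("\t".join([title, str(i), str(total), sent]), file=fh)

# read the file back, collecting sentences under their title in file order
plots_by_title = defaultdict(list)
with open(path, encoding="utf8") as fh:
    for line in fh:
        # split at most 3 times so a stray tab inside a sentence stays in the last field
        title, idx, total, sent = line.rstrip("\n").split("\t", 3)
        plots_by_title[title].append(sent)

print(dict(plots_by_title))
```

Grouping by title alone merges any two works that share a title, so this is only a starting point.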