# Plot To Poem

By [Allison Parrish](http://www.decontextualize.com/) for [NaPoGenMo 2017](https://github.com/NaPoGenMo/NaPoGenMo2017).

This notebook implements a system for "translating" plots from [Mark Riedl's WikiPlots corpus](https://github.com/markriedl/WikiPlots) into poems, using semantic similarity to lines of poetry in my [Gutenberg Poetry corpus](https://s3.amazonaws.com/aparrish/poetry.json-stream.gz). If you want to execute the code in this notebook, make sure to download these files first! (The code assumes that you've extracted the WikiPlots corpus, leaving two files in the same directory as the notebook: `titles` and `plots`. You don't need to extract the poetry corpus, as the code operates directly on the compressed file.)

In [1]:
import random
import sys
import textwrap

## Working with the data

The `plotutils` and `poemutils` modules, included in this repository, have a few utility functions for working with the WikiPlots corpus and the Gutenberg Poetry corpus (respectively). I'm using [Annoy](https://pypi.python.org/pypi/annoy) for nearest-neighbor search and [spaCy](https://spacy.io/) for its sentence parsing and built-in word vectors.

In [2]:
import plotutils
import poemutils

import spacy
from annoy import AnnoyIndex
import numpy as np

In [4]:
nlp = spacy.load('en')

Titles and plots are loaded into lists with corresponding indices...

In [5]:
titles = plotutils.titleindex()
plots = plotutils.loadplots()
assert(len(titles) == len(plots))

The `pickrandom` function returns a tuple with the index, title and sentences from a randomly-selected plot.

In [6]:
def getplot(idx):
    return idx, titles[idx], plots[idx]
def pickrandom(titles, plots):
    # draw one random index and return the plot stored at that index
    idx = random.randrange(len(titles))
    return getplot(idx)

In [7]:
pickrandom(titles, plots)

(22371,
 'Love Is All There Is',
 "\nLove Is All There Is is a modern retelling of the Romeo and Juliet story\nIt is set in the Bronx during the 1990s\nThe Cappamezzas (Lainie Kazan and Joseph Bologna), Bronx-born Sicilians, own a local catering business\nThey develop a bitter rivalry with the pretentious Malacicis (Paul Sorvino and Barbara Carrera), recent immigrants from Florence and owners of a fine Italian restaurant\nThe Cappamezzas' son, Rosario, (Nathaniel Marston) falls in love with the Malacicis' daughter, Gina, (Angelina Jolie) after she replaces the obese star of the neighborhood church's staging of Romeo and Juliet\nThe rivalry intensifies after Rosario deflowers Gina after a fight with her parents\nThe movie was filmed at Greentree Country Club in New Rochelle, NY and many scenes were shot in City Island, Bronx, New York\n")

In [8]:
title2idx = {t: i for i, t in enumerate(titles)}
title2idx["New Super Mario Bros."]

11432

Testing it all out:

In [9]:
getplot(title2idx["New Super Mario Bros."])

(11432,
 'New Super Mario Bros.',
 "\nAt the beginning of the game, Princess Peach and Mario are walking together when lightning suddenly strikes Peach's castle nearby\nAs Mario runs to help, Bowser Jr\nappears and kidnaps her\nRealizing what has happened, Mario quickly rushes back and gives chase\nMario ventures through eight worlds pursuing Bowser Jr\nand trying to rescue the kidnapped princess\nMario catches up to them and confronts Bowser Jr\noccasionally, but is unable to save the princess from the young Koopa's clutches\nAt the end of the first world, Bowser Jr\nretreats to a castle, where his father, Bowser, awaits Mario on a bridge over a pit filled with lava\nIn a scene highly reminiscent of the original Super Mario Bros\n, Mario activates a button behind Bowser to defeat him, and the bridge underneath Bowser collapses, causing him to fall into the lava which burns his flesh, leaving a skeleton\nDespite Bowser's demise in the first level, this does not stop Bowser Jr\nfrom run

## Building an index of vectors for lines of poetry

To compare the "meaning" of two stretches of text, we need a number (or series of numbers) to be the basis of the comparison. In this case, I'm using the average of the word vectors in the sentence, for words that match particular parts of speech. (I don't have any hard data to back this up, but using only the word vectors for particular parts of speech yields slightly better and more interesting results to my eye, at least for the purposes of this project.)

In [10]:
def meanvector(text):
    s = nlp(text)
    vecs = [word.vector for word in s \
            if word.pos_ in ('NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN', 'ADP') \
            and np.any(word.vector)] # skip all-zero vectors
    if len(vecs) == 0:
        raise IndexError
    else:
        return np.array(vecs).mean(axis=0)
meanvector("this is a test").shape

(300,)
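
To make the idea of comparing averaged vectors concrete, here is a minimal sketch that runs without spaCy. Everything in it is an assumption for illustration only: the three-dimensional `toy_vectors`, the `toy_meanvector` helper and the `cosine` function are hypothetical stand-ins, not the vectors or code used in this notebook.

```python
import math

# Hypothetical 3-dimensional "word vectors"; real spaCy vectors here have 300 dimensions.
toy_vectors = {
    "gold":     [0.9, 0.1, 0.0],
    "glitters": [0.8, 0.3, 0.1],
    "moon":     [0.7, 0.2, 0.3],
    "shining":  [0.8, 0.2, 0.2],
    "test":     [0.0, 0.1, 0.9],
}

def toy_meanvector(words):
    """Average the vectors of the known words, component by component."""
    vecs = [toy_vectors[w] for w in words if w in toy_vectors]
    if not vecs:
        raise IndexError("no usable word vectors")
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity: values near 1.0 mean the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sentence = toy_meanvector(["glitters", "gold"])
close = toy_meanvector(["moon", "shining"])
far = toy_meanvector(["test"])

# The "close" text should score higher than the unrelated one.
print(cosine(sentence, close) > cosine(sentence, far))
```

With real word vectors the same comparison is what decides which line of poetry counts as "nearest" to a sentence.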

Now I create an Annoy index for fast nearest-neighbor lookup alongside a list of lines of poetry from the corpus. The `poemutils.loadlines()` function yields lines of poetry in turn from the poetry corpus; the `modulo` parameter allows you to load only one line of every *n*, instead of the entire corpus. (There are 3+ million lines in the corpus, which between the text and the vectors ends up being too much for my laptop's RAM. In practice, working with 5%-10% of the poetry corpus yields pretty good results; `modulo=20` below loads 5%.)

In [12]:
t = AnnoyIndex(300, metric='angular')
lines = list()
i = 0
for line in poemutils.loadlines(modulo=20):
    if i % 10000 == 0:
        sys.stderr.write(str(i) + "\n")
    try:
        t.add_item(i, meanvector(line['line']))
        lines.append(line)
        i += 1
    except IndexError:
        continue

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
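
How `poemutils.loadlines()` implements `modulo` is not shown in this notebook. The following is only a sketch of one way such sampling could work, assuming the corpus is a gzip-compressed stream with one JSON object per line and that `modulo=n` keeps every *n*-th line. The name `every_nth_line` and the in-memory test data are hypothetical.

```python
import gzip
import io
import json

def every_nth_line(fileobj, modulo=1):
    """Yield the parsed JSON object from every `modulo`-th line of a text stream."""
    for i, raw in enumerate(fileobj):
        if i % modulo == 0:
            yield json.loads(raw)

# Build a tiny gzip-compressed JSON-lines stream in memory (made-up data).
records = [{"line": "line %d" % n} for n in range(10)]
payload = "\n".join(json.dumps(r) for r in records).encode("utf8")
compressed = io.BytesIO(gzip.compress(payload))

# Read directly from the compressed stream, keeping one line in five.
with gzip.open(compressed, "rt", encoding="utf8") as fh:
    sampled = list(every_nth_line(fh, modulo=5))

print([r["line"] for r in sampled])  # ['line 0', 'line 5']
```

Reading the compressed stream line by line means the full corpus never has to sit in memory at once; only the sampled records are kept.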


In [13]:
t.build(25)

True

The most common error I make when working with Annoy is accidentally messing up the indices so that there are fewer items in the index than lines in the corpus (or vice-versa), which (obviously) messes up index-based lookup. So just to make sure everything's okay:

In [14]:
assert(t.get_n_items() == len(lines))
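
The alignment issue can be reproduced in miniature without Annoy. In this sketch a plain dictionary stands in for the index and `toy_vector` is a hypothetical replacement for `meanvector`. The point is that the counter advances only when an item was actually stored, so item ids and list positions stay in step.

```python
def toy_vector(text):
    """Stand-in for meanvector(): raises IndexError when there is nothing to average."""
    if not text.strip():
        raise IndexError
    return [float(len(text))]

candidates = ["first line", "", "third line", "   ", "fifth line"]

index_items = {}  # stand-in for the Annoy index: item id -> vector
kept = []         # parallel list of the texts that were indexed
i = 0
for text in candidates:
    try:
        index_items[i] = toy_vector(text)
    except IndexError:
        continue  # skipped: neither the index nor the list grows
    kept.append(text)
    i += 1

# Same number of items on both sides, and id 2 maps to the third kept text.
print(len(index_items) == len(kept), kept[2])
```

If the counter were incremented for skipped items as well, later ids would point past the end of `kept` or at the wrong text.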

Now, test it out. The `.get_nns_by_vector()` method of the Annoy index returns a list of indices of nearest neighbors for a particular vector. I'm using this value to fetch the line of poetry from the poetry corpus:

In [22]:
nearest = t.get_nns_by_vector(meanvector("All that glitters is gold"), n=10)[0]
print(lines[nearest]['line'])

But the gold Moon is not shining,
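
For intuition about what the lookup returns, here is a brute-force version of the same query in plain Python. It is a sketch rather than a description of Annoy's internals: Annoy builds trees to find approximate neighbors quickly, while this code compares the query against every vector. It assumes, following Annoy's documentation, that the `angular` metric is the Euclidean distance between normalized vectors. The function names and the two-dimensional vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def angular_distance(a, b):
    """Euclidean distance between the unit-length versions of a and b."""
    na, nb = normalize(a), normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))

def brute_force_nns(query, vectors, n=10):
    """Return the indices of the n vectors closest to query, nearest first."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: angular_distance(query, vectors[i]))
    return ranked[:n]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(brute_force_nns([0.9, 0.1], vectors, n=2))  # [0, 2]
```

Taking element `[0]` of the returned list, as in the cell above, picks the single closest item.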


## Matching plot sentences to lines of poetry

In the cell below, I pick a random plot and iterate over each sentence in the plot, first finding the vector for the sentence and then finding the nearest neighbor in the Annoy index we just built for lines of poetry. For testing purposes, I'm displaying the sentence from the plot alongside the nearest neighboring line of poetry:

In [50]:
idx, title, sentences = pickrandom(titles, plots)
print(title)
print('-' * len(title))
print()
for sent in sentences.split("\n"):
    try:
        vec = meanvector(sent)
    except IndexError:
        continue
    match_idx = t.get_nns_by_vector(vec, n=100)[0]
    print(textwrap.fill(sent+".", 60))
    print("\t", lines[match_idx]['line'])
    print()

Sweet Secret
------------

Han Ah-reum, the daughter of the Korean Vice-Minister for
Culture, returns home with a child born out of wedlock, a
fact that will shame her family and ruin the chances of her
father becoming the Minister for Culture.
	 For sake of a young Child whose home was there.

Chun Sung-woon, heir of the Winner fashion and clothing
company, is being backed into an arranged marriage that he
does not want.
	 That he's the kind of person that never does go back

When the paths of the two cross and re-cross, initial
hostility turns into love.
	 That a light as of dawn should leap into your face,

However, the secret of Ah-reum's illegitimate daughter may
become a barrier to true love.
	 That Love may still be lord of all!



Looks like everything works! In the cell below, I write out an HTML file to show the results in a prettier format.

In [70]:
import html

titles_to_try = [
    "Star Wars (film)",
    "When Harry Met Sally...",
    "House of Leaves",
    "Shrek",
    "The Hobbit",
    "The Legend of Zelda: Ocarina of Time",
    "The Handmaid's Tale",
    "Ferris Bueller's Day Off",
    "Star Trek II: The Wrath of Khan",
    "Lost in Translation (film)",
    "The Matrix",
    "Doom (1993 video game)",
    "Neuromancer",
    "Top Gun",
    "A Wrinkle in Time",
    "The Wizard of Oz (1939 film)"
]

html_tmpl = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>Plot to Poem: Sample output</title>
    <style type="text/css">
        .line {{
            cursor: pointer;
            font-family: serif;
            font-size: 12pt;
            line-height: 1.5em;
        }}
        .line:hover {{
            background-color: #f8f8f8;
        }}
        body {{
            margin: 2em auto;
            width: 67%;
            font-family: sans-serif;
        }}
        h2 {{
            margin-top: 2em;
            margin-bottom: 0.5em;
        }}
    </style>
</head>

<body>

<h1>Plot to poem</h1>
<p>By <a href="http://www.decontextualize.com/">Allison Parrish</a>
    for <a href="https://github.com/NaPoGenMo/NaPoGenMo2017">NaPoGenMo 2017</a>.</p>
<p>Each sentence from the Wikipedia plot summary of these
    works has been replaced with the line of poetry from
    Project Gutenberg that is closest in meaning. Mouse over
    the line to see the original sentence from the plot
    summary.</p>
<p>Want to learn more about how it works, or try it out on
    your own text?
    <a href="https://github.com/aparrish/plot-to-poem/">Python
    source code available here.</a></p>

{poems}

</body>
</html>
"""

output_html = ""
for to_try in titles_to_try:
    already_seen = set()
    idx, title, sentences = getplot(title2idx[to_try])
    output_html += "\n<h2>"+title+"</h2>"
    for sent in sentences.split("\n"):
        try:
            vec = meanvector(sent)
        except IndexError:
            continue
        match_idx = [i for i in t.get_nns_by_vector(vec, n=100) if i not in already_seen][0]
        already_seen.add(match_idx)
        output_html += "<div title='{orig}' class='line'>{line}</div>\n".format(
            orig=html.escape(sent), line=lines[match_idx]['line'])
open("plot-to-poem.html", "w").write(html_tmpl.format(poems=output_html))

117753
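
Two details of the cell above are worth isolating: `already_seen` prevents the same line of poetry from appearing twice within one poem, and `html.escape` keeps the original sentence safe inside the `title` attribute. The sketch below shows both in a self-contained form. The helpers `first_unseen` and `render_line` are hypothetical names, and two behaviors are assumptions that go beyond the cell above: returning `None` when every candidate has been used, and also escaping the line of poetry.

```python
import html

def first_unseen(candidates, already_seen):
    """Return the first candidate not used yet, or None if all were used."""
    return next((c for c in candidates if c not in already_seen), None)

def render_line(original, poem_line):
    """Build one output <div>; both strings are escaped (an addition to the cell above)."""
    return "<div title='{orig}' class='line'>{line}</div>".format(
        orig=html.escape(original), line=html.escape(poem_line))

already_seen = set()
picks = []
# Four sentences whose neighbor lists happen to be identical (made-up ids).
for neighbours in ([4, 7, 9], [4, 7, 9], [4, 7, 9], [4, 7, 9]):
    pick = first_unseen(neighbours, already_seen)
    if pick is not None:
        already_seen.add(pick)
    picks.append(pick)

print(picks)  # [4, 7, 9, None]
print(render_line("Mario's castle", "Fire & gold"))
```

With `n=100` candidates per sentence, running out of unseen lines is unlikely for a single plot, but an explicit fallback avoids an `IndexError` if it does happen.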