# Some corpus-driven approaches

By [Allison Parrish](http://www.decontextualize.com/)

This notebook uses `pronouncing` and Pincelate to "find" poems in existing corpora. I show an example of searching prose for haiku, and an example of re-rhyming lines in an existing corpus of poetry.

First, some code preliminaries:

In [128]:
import pronouncing

In [2]:
from pincelate import Pincelate

Using TensorFlow backend.


In [3]:
import numpy as np

In [4]:
import nltk

In [5]:
import re
import random

Load the Pincelate model:

In [6]:
pin = Pincelate()

The following function tries to use [pronouncing](https://pronouncing.readthedocs.io/) to look up a word's pronunciation, but if it's not there, it uses [Pincelate](https://pincelate.readthedocs.io/en/latest/) instead:

In [7]:
def quick_phones(w):
    pfw = pronouncing.phones_for_word(w)
    if len(pfw) == 0:
        # "normalize" so the model will work
        # (lowercase first, so capital letters aren't stripped out)
        w = re.sub("[^a-z']", "", w.lower())
        return " ".join(pin.soundout(w))
    else:
        return pfw[0]

This should work even on made-up words that aren't in the CMU pronouncing dictionary:

In [8]:
quick_phones("pikachu")

'P IY0 K AA1 CH UW0'

If the function gets a string with any characters that aren't in Pincelate's model, it simply removes them. This isn't an ideal solution, but it at least means we won't get any errors, and sometimes it works okay(ish):

In [9]:
quick_phones("héllo")

'HH AA1 L OW0'
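
To see what the fallback path actually hands to the model, here is a small sketch that applies the same substitution on its own. The `strip_unknown` and `fold_accents` helpers are names I made up for illustration (they are not part of `pronouncing` or Pincelate), and `fold_accents` assumes that folding accented letters to their unaccented base is acceptable for your text:

```python
import re
import unicodedata

def strip_unknown(w):
    # same substitution as in quick_phones(): drop everything
    # except a-z and the apostrophe
    return re.sub("[^a-z']", "", w)

def fold_accents(w):
    # hypothetical helper: decompose accented characters, then drop the
    # combining marks, so that "é" becomes "e" instead of vanishing
    decomposed = unicodedata.normalize("NFKD", w)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "h\u00e9llo"  # "héllo" with a precomposed é

print(strip_unknown(word))                # hllo
print(strip_unknown(fold_accents(word)))  # hello
```

So the model is sounding out `hllo`, not `hello`. Folding accents first is one possible improvement, though it will not be right for every language.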

## Tokenization

If we're going to work with a corpus, we need to find ways to divide it up into meaningful units. Splitting a text up into meaningful units is called "tokenization." We'll use [NLTK](https://www.nltk.org/)'s `punkt` tokenizer to get sentences and words from the text. If you followed the installation instructions, you should already have NLTK installed, but you'll still need to download the relevant model. You can do this by running the cell below:

In [10]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/allison/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The function to split a string up into sentences is `nltk.sent_tokenize()`. The tokenizer is able to deal with many cases where otherwise sentence-ending punctuation doesn't actually end a sentence, e.g.:

In [11]:
nltk.sent_tokenize("This... is a test. My mother (Mrs. Parrish) said there'd be days like these.")

['This... is a test.',
 "My mother (Mrs. Parrish) said there'd be days like these."]

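For comparison, here is a sketch of what a naive rule does with the same string. The regular expression is my own simplification (split on whitespace that follows `.`, `!` or `?`), not anything NLTK does internally:

```python
import re

text = "This... is a test. My mother (Mrs. Parrish) said there'd be days like these."

# naive rule: split on any whitespace that comes right after ., ! or ?
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(repr(s))
# 'This...'
# 'is a test.'
# 'My mother (Mrs.'
# "Parrish) said there'd be days like these."
```

The naive rule produces four fragments where the tokenizer found two sentences, because it treats the ellipsis and the abbreviation "Mrs." as sentence endings.
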
Likewise, the `nltk.word_tokenize()` function splits a string up into words:

In [12]:
nltk.word_tokenize("It was the best of times, it was the worst of times.")

['It',
 'was',
 'the',
 'best',
 'of',
 'times',
 ',',
 'it',
 'was',
 'the',
 'worst',
 'of',
 'times',
 '.']
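
Notice that punctuation comes back as separate tokens. For counting syllables we only care about the word-like ones. A minimal sketch of filtering them, using a hand-copied version of the token list above so that it runs without NLTK:

```python
# token list copied by hand from the output above
tokens = ['It', 'was', 'the', 'best', 'of', 'times', ',',
          'it', 'was', 'the', 'worst', 'of', 'times', '.']

# keep only tokens whose first character is a letter
words = [t for t in tokens if t[0].isalpha()]

print(words)
print(len(tokens), len(words))  # 14 12
```

Checking only the first character is a heuristic: it drops the comma and the period, but it would also drop a token that starts with an apostrophe, such as `'s`.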

## Finding haiku

The computational haiku has [vexed poets and programmers since at least 1967](http://rwet.decontextualize.com/pdfs/morris.pdf) and it is possible that there is little left to be said about the form. By "haiku" here I mean haiku as it has been adapted over the past century or more in English literature (see [Haiku in English](https://en.wikipedia.org/wiki/Haiku_in_English)), which is a form that strips out most of the semantic constraints of the original Japanese form in favor of a purely lexical and metrical definition: such a haiku consists of three lines, the first having five syllables, the second seven syllables, and the third five syllables.

We're going to write some code that goes through our corpus and finds haiku, in the manner of [Times Haiku](https://en.wikipedia.org/wiki/Haiku_in_English). Specifically, we'll identify sentences in our corpus that have seventeen syllables and can be partitioned into 5/7/5 lines.

Counting syllables is easy; what turns out to be a bit tricky is making sure that the words can be divided up into lines. The way we'll calculate this is to first make a list of syllable counts for each word in a sentence, then check to see if that sequence can be partitioned into a haiku. This function does the partitioning part, returning a list of indices where each line should end in the original list, or an empty list if no partition can be found:

In [13]:
def partition_haiku(t):
    needed = [5, 7, 5]
    indices = [0]*len(needed)
    idx = 0
    if sum(t) != sum(needed):
        return []
    for i, count in enumerate(needed):
        total = 0
        # keep moving through words...
        while total < count:
            total += t[idx]
            # update the index with current position
            indices[i] = idx
            idx += 1
        # if we overshot, this doesn't partition cleanly!
        if total != count:
            return []
    return indices

To demonstrate:

In [14]:
haiku = """the west wind whispered
and touched the eyelids of spring
her eyes primroses"""
partition_haiku([pronouncing.syllable_count(quick_phones(w)) for w in haiku.split()])

[3, 9, 12]

The result of the function tells us that line breaks should fall after indices 3, 9 and 12. 
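
To make the meaning of those indices concrete, here is a sketch that turns a list of words plus a list of line-ending indices back into lines. The helper name `lines_from_breaks` is my own, and the indices are typed in by hand from the result above:

```python
def lines_from_breaks(words, breaks):
    # each entry in `breaks` is the (zero-based) index of the last
    # word on a line, so slice up to and including that index
    lines = []
    start = 0
    for end in breaks:
        lines.append(" ".join(words[start:end + 1]))
        start = end + 1
    return lines

words = ("the west wind whispered and touched the eyelids "
         "of spring her eyes primroses").split()
haiku_lines = lines_from_breaks(words, [3, 9, 12])
print("\n".join(haiku_lines))
# the west wind whispered
# and touched the eyelids of spring
# her eyes primroses
```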

If the syllable counts can't be partitioned into a haiku the function returns an empty list:

In [15]:
partition_haiku([1, 2, 3, 4, 5])

[]
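
Another way to think about the partitioning, which I find handy for checking the function by hand: a sequence of counts partitions into 5/7/5 exactly when its running totals pass through 5, 12 and 17, with 17 being the grand total. The sketch below is an alternative formulation of the same check (`partition_by_running_total` is a name I made up), not a replacement for `partition_haiku()`:

```python
from itertools import accumulate

def partition_by_running_total(counts, targets=(5, 12, 17)):
    counts = list(counts)
    # the whole sequence has to add up to the final target
    if sum(counts) != targets[-1]:
        return []
    running = list(accumulate(counts))
    breaks = []
    for target in targets:
        # a line can only end here if the running total hits the target exactly
        if target not in running:
            return []
        breaks.append(running.index(target))
    return breaks

# syllable counts for the example haiku, entered by hand
example_counts = [1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 3]
print(partition_by_running_total(example_counts))   # [3, 9, 12]
print(partition_by_running_total([1, 2, 3, 4, 5]))  # []
```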

Another quick test: generate sequences of random integers and print out the sequences that can be partitioned as haiku:

In [16]:
for i in range(1000):
    rand_seq = np.random.randint(1, 5, size=(np.random.randint(14)))
    linebreaks = partition_haiku(rand_seq)
    if len(linebreaks) > 0:
        print(rand_seq, linebreaks)

[2 3 2 2 1 2 2 3] [1, 5, 7]
[2 2 1 1 2 1 3 1 4] [2, 6, 8]
[4 1 4 2 1 4 1] [1, 4, 6]
[1 4 4 1 1 1 1 2 2] [1, 5, 8]
[1 4 3 4 2 3] [1, 3, 5]
[2 2 1 2 4 1 4 1] [2, 5, 7]


### Haiku source text

Included in this repository is a copy of Mary Shelley's *Frankenstein* (as `84-0.txt`). We'll use this as the sample text below. You could also use a different file, as long as it's in plain text format. [Project Gutenberg](http://www.gutenberg.org/) is a great place to look for these! Just download the file into the same directory as this notebook and replace the filename in the cell below with the filename of the text you selected.

In [17]:
text = open("84-0.txt").read()

A list of sentences in the corpus:

In [18]:
sentences = nltk.sent_tokenize(text)

Total number of sentences:

In [19]:
len(sentences)

3199

### Sifting through the sentences

As a first task, we could just check for sentences that have seventeen syllables. To do this, we'll iterate through each sentence, and do the following:

* Get the tokens in the sentence that are relevant to syllable counting (i.e., everything that isn't punctuation)
* Use `pronouncing`'s `.syllable_count()` function to see how many syllables are in each word
* Sum up the syllable counts and print the sentence if it adds up to 17.

In [23]:
for i, sent in enumerate(sentences):
    sent = sent.lower().replace("\n", " ")
    words = [w for w in nltk.word_tokenize(sent) if w[0].isalpha()]
    syll_counts = [pronouncing.syllable_count(quick_phones(w)) for w in words]
    if sum(syll_counts) == 17:
        print(sent)

this expedition has been the favourite dream of my early years.
oh, that some encouraging voice would answer in the affirmative!
remember me with affection, should you never hear from me again.
as i spoke, a dark gloom spread over my listener’s countenance.
she was not her child, but the daughter of a milanese nobleman.
no human being could have passed a happier childhood than myself.
and clerval—could aught ill entrench on the noble spirit of clerval?
this expectation will now be the consolation of your father.
she indeed veiled her grief and strove to act the comforter to us all.
i requested his advice concerning the books i ought to procure.
to examine the causes of life, we must first have recourse to death.
“dearest clerval,” exclaimed i, “how kind, how very good you are to me.
how pleased you would be to remark the improvement of our ernest!
poor justine was very ill; but other trials were reserved for her.
henry saw this, and had removed all my apparatus from my view.
frankenste

There are quite a few of these! Our definition of the haiku, however, requires lines of exactly 5/7/5 syllables, without breaking words across lines. The code in the following cell builds on the previous cell, attempting to partition the syllable counts for each sentence into haiku (using `partition_haiku()` above). If a haiku can be formed, it prints it out:

In [24]:
for i, sent in enumerate(sentences):
    sent = sent.lower()
    # only tokens that look like words
    words = [w for w in nltk.word_tokenize(sent) if w[0].isalpha()]
    try:
        syll_counts = [pronouncing.syllable_count(quick_phones(w))
                       for w in words]
        linebreaks = partition_haiku(syll_counts)
        if len(linebreaks) > 0:
            # add linebreaks after indices that are in the haiku partition
            out = "".join([w+("\n" if i in linebreaks else " ") for i, w in enumerate(words)])
            print(out)
            print()
    except IndexError: # if any word isn't found, skip
        continue

this expedition
has been the favourite dream
of my early years


remember me with
affection should you never
hear from me again


no human being
could have passed a happier
childhood than myself


and clerval—could aught
ill entrench on the noble
spirit of clerval


i requested his
advice concerning the books
i ought to procure


to examine the
causes of life we must first
have recourse to death


how pleased you would be
to remark the improvement
of our ernest


i threw the letter
on the table and covered
my face with my hands


but i paused when i
reflected on the story
that i had to tell


you perhaps will find
some means to justify my
poor guiltless justine


how strange i thought that
the same cause should produce such
opposite effects


but i was baffled
in every attempt i
made for this purpose


the lady was dressed
in a dark suit and covered
with a thick black veil


many things i read
surpassed my understanding
and experience


i dared not think that
they would turn them from 

## Re-rhyming the sonnets

This next example serves to demonstrate how to find rhyming lines within a corpus. This time, we'll be using Shakespeare's sonnets as a source text. A copy of the sonnets (with numbering stripped) is included in this repository as `sonnets.txt`. The line below reads these lines into a list:

In [25]:
sonnet_lines = [item for item in open("sonnets.txt").read().split("\n") if len(item) > 0]

Here are the first few lines, just to make sure the data looks like what we want it to look like:

In [26]:
sonnet_lines[:10]

['From fairest creatures we desire increase,',
 "That thereby beauty's rose might never die,",
 'But as the riper should by time decease,',
 'His tender heir might bear his memory:',
 'But thou contracted to thine own bright eyes,',
 "Feed'st thy light's flame with self-substantial fuel,",
 'Making a famine where abundance lies,',
 'Thy self thy foe, to thy sweet self too cruel:',
 "Thou that art now the world's fresh ornament,",
 'And only herald to the gaudy spring,']

In order to determine if two words rhyme, we need to check if their "rhyming parts" match. The `pronouncing` library has a function `.rhyming_part()` that returns the substring of phones from a word that can be used to determine whether it rhymes with another word:

In [27]:
pronouncing.rhyming_part(quick_phones("alphabetical"))

'EH1 T IH0 K AH0 L'
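
Roughly speaking, the rhyming part runs from the last stressed vowel to the end of the word. The sketch below is my own approximation of that idea, not the library's code: I'm assuming that a phone ending in `1` or `2` counts as stressed, and the phone strings are typed in by hand rather than looked up.

```python
def last_stressed_onward(phones):
    # walk backwards through the phones until we hit a vowel whose
    # stress marker is 1 (primary) or 2 (secondary)
    phone_list = phones.split()
    for i in range(len(phone_list) - 1, -1, -1):
        if phone_list[i][-1] in "12":
            return " ".join(phone_list[i:])
    # no stressed vowel found: fall back to the whole string
    return phones

# hand-entered ARPAbet transcriptions (assumed, not looked up here)
print(last_stressed_onward("AE2 L F AH0 B EH1 T IH0 K AH0 L"))  # EH1 T IH0 K AH0 L
print(last_stressed_onward("S P R IH1 NG"))                     # IH1 NG
print(last_stressed_onward("S IH1 NG"))                         # IH1 NG
```

Two words rhyme (by this definition) when these strings are equal, which is why "spring" and "sing" match.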

Our task is to find *random rhyming couplets* from the sonnets. As a first approximation of accomplishing this task, we'll do the following:

* Pick a line at random.
* Get that line's rhyming part.
* Check the rhyming parts of every other line to see if they match.

The code in the cell below implements this. I've interspersed comments to explain each step:

In [28]:
picked = random.choice(sonnet_lines)
picked_words = [w for w in nltk.word_tokenize(picked) if w[0].isalpha()]
# get the rhyming part from the final word in the line
picked_rhyme = pronouncing.rhyming_part(quick_phones(picked_words[-1]))

# get a shuffled copy of the sonnet lines and search until we find a
# line that rhymes
for line in random.sample(sonnet_lines, len(sonnet_lines)):
    # a line "technically" rhymes with itself but this is boring
    if picked == line:
        continue
    words = [w for w in nltk.word_tokenize(line) if w[0].isalpha()]
    # skip empty lines
    if len(words) == 0:
        continue
    # if this line's rhyming part matches, print this line and the
    # picked line
    if pronouncing.rhyming_part(quick_phones(words[-1])) == picked_rhyme:
        print(picked)
        print(line)
        break

But then begins a journey in my head
Sweet love, renew thy force; be it not said


This solution works fine, but it's a little slow, since each search may have to check every other line to see if there's a match! (A single search is linear in the number of lines, so pairing up every line this way is an $O(n^2)$ algorithm.) You might not notice how much time it takes to find a rhyming couplet in a small corpus like the sonnets, but you *will* notice if you're creating thousands of couplets, or if you adapt this code to use on a larger corpus.

A better way to do this would be to build a data structure in one pass that contains rhyming parts as keys, and lists of rhyming lines as values. With a data structure like this, you can generate a couplet by selecting a rhyming part at random, then drawing random rhyming lines from the list of lines corresponding to that rhyming part. The code in the cell below implements this solution, with a few additional features:

* To prevent rhyming lines that end with the same word, I store (for each rhyming part) a dictionary that maps line-ending words to the lines (with the corresponding rhyming part) that contain them.
* I extract the rhyming part not just from the last word, but from the concatenated phones of the last three words in the line. This makes it possible to find rhymes that extend over multiple words.
* I use the `collections` package's `defaultdict` as a shortcut for initializing empty nested dictionaries and lists. The data structure we end up with is a dictionary whose values are dictionaries whose values are lists.
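
If the nested `defaultdict` is unfamiliar, here is a toy sketch of the same shape of data structure. The rhyming parts, words and line numbers are typed in by hand for illustration; nothing here is computed from the sonnets:

```python
from collections import defaultdict

# outer dict: rhyming part -> inner dict
# inner dict: line-ending word -> list of line indices
index = defaultdict(lambda: defaultdict(list))

# (rhyming part, last word, line index) triples, made up for this demo
toy_entries = [
    ("IH1 NG", "spring", 9),
    ("IH1 NG", "sing", 109),
    ("IH1 NG", "spring", 876),
    ("AY1", "die", 1),
]

for rhyming_part, last_word, line_idx in toy_entries:
    # no need to check whether the keys exist yet: missing keys
    # are created on first access
    index[rhyming_part][last_word].append(line_idx)

print(dict(index["IH1 NG"]))  # {'spring': [9, 876], 'sing': [109]}
```

One thing to watch out for: merely *looking up* a key that isn't there also creates it, so use `in` (or `.get()`) when you only want to test for membership.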

In [190]:
from collections import defaultdict
# parameter to defaultdict has to be callable, hence the lambda
rhyming_part_to_idx = defaultdict(lambda: defaultdict(list))
for i, line in enumerate(sonnet_lines):
    words = [w for w in nltk.word_tokenize(line) if w[0].isalpha()]
    if len(words) > 0:
        last_few = " ".join([quick_phones(w) for w in words[-3:]])
        rhyming_part = pronouncing.rhyming_part(last_few)
        # to save space, store only the index of the line, not
        # the line itself!
        rhyming_part_to_idx[rhyming_part][words[-1]].append(i)

Once you've run this cell, you'll have the data structure! You can see that the keys of the dictionary are all of the unique rhyming parts that occur at the ends of lines:

In [191]:
rhyming_part_to_idx.keys()

dict_keys(['IY1 S', 'AY1', 'EH1 M ER0 IY0', 'AY1 Z', 'UW1 AH0 L', 'AO1 R N AH0 M AH0 N T', 'IH1 NG', 'AA1 N T EH0 N T', 'AA1 R D IH0 NG', 'IY1', 'AW1', 'IY1 L D', 'EH1 L D', 'EY1 Z', 'UW1 S', 'AY1 N', 'OW1 L D', 'UW1 AH0 S T', 'AH1 DH ER0', 'IY1 N UW0', 'UW1 M', 'AH1 Z B AH0 N D R IY0', 'EH1 R AH0 T IY0', 'AY1 M', 'EH1 N D', 'EH1 G AH0 S IY0', 'IH1 V', 'AY1 V', 'OW1 N', 'IY1 V', 'AO1 N', 'EY1 M', 'EH1 L', 'AA1 N', 'EH1 R', 'EH1 F T', 'AE1 S', 'AA1 Z', 'IY1 T', 'EY1 S', 'IH1 L', 'UW1 ZH ER0 IY0', 'AH1 N', 'AA1 R T', 'AY1 T', 'AE1 JH AH0 S T IY0', 'EY1 JH', 'IH1 L G R AH0 M AH0 JH', 'AA1 R', 'EY1', 'UW1 N', 'AE1 D L IY0', 'OY1', 'AW1 N D Z', 'IY1 R', 'AO1 R D ER0 IH0 NG', 'AY1 F', 'IY1 P', 'AY1 N D', 'IH1 T', 'IH1 T S', 'EH1 N IY0', 'AH1 N P R AH0 V AH0 D AH0 N T', 'EH1 V AH0 D AH0 N T', 'EY1 T', 'AY1 ER0', 'AH1 V', 'UW1 V', 'OW1 S T', 'AA1 R T AH0 S T', 'OW1 Z', 'EH1 S T', 'AO1 R', 'EH1 R IH0 SH', 'IY1 V Z', 'ER1 D', 'IH1 R D', 'EY1 K', 'OW1', 'EH1 N S', 'ER1', 'AH1 K', 'AA1 N AH0 M IY0

The value for one of these keys is a dictionary mapping words to lines that end with those words:

In [192]:
rhyming_part_to_idx['IH1 NG']

defaultdict(list,
            {'spring': [9, 876, 1359, 1420],
             'sing': [109, 532, 1083, 1422, 1483],
             'bring': [534],
             'king': [874],
             'wing': [1085],
             'thing': [1361]})

There are this many unique line-ending rhyming parts:

In [202]:
len(rhyming_part_to_idx)

388

For the couplet generation algorithm to work, we'll need to filter out any rhyming part whose dictionary contains only one key (i.e., every line with this rhyming part ends with the same word, so there's no second word to rhyme it with):

In [193]:
rhyme_map = {k: v for k, v in rhyming_part_to_idx.items() if len(v) > 1}
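
Here is a toy sketch of what that filter keeps and why it matters. Note that `len(v)` counts distinct line-ending words, not lines. The dictionary below is hand-built for illustration:

```python
import random

# hand-built stand-in for rhyming_part_to_idx
toy_index = {
    "IH1 NG": {"spring": [9, 876], "sing": [109]},
    "AO1 R N AH0 M AH0 N T": {"ornament": [8]},
}

# same filter as above: keep rhyming parts with at least two different words
toy_rhyme_map = {k: v for k, v in toy_index.items() if len(v) > 1}
print(list(toy_rhyme_map))  # ['IH1 NG']

# without the filter, asking for two distinct words can fail
try:
    random.sample(list(toy_index["AO1 R N AH0 M AH0 N T"]), 2)
    sample_failed = False
except ValueError:
    sample_failed = True
print(sample_failed)  # True
```

`random.sample()` raises a `ValueError` when asked for more items than the population contains, which is exactly what would happen for a rhyming part with a single line-ending word.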

Great! Now we can generate couplets. To do this, we need to:

* Pick a random rhyming part
* Pick two random words that end lines having that rhyming part
* Pick the indexes of two lines that end with those words, at random
* Print out the resulting lines!

The following cell does exactly this, in a `for` loop, to generate a little poem:

In [204]:
for i in range(7):
    # get a random rhyming part
    rhyming_part = random.choice(list(rhyme_map.keys()))
    # pick two different line-ending words that share this rhyming part
    a_word, b_word = random.sample(list(rhyme_map[rhyming_part].keys()), 2)
    # randomly select one line index from the lines ending with each word
    a_idx = random.choice(rhyme_map[rhyming_part][a_word])
    b_idx = random.choice(rhyme_map[rhyming_part][b_word])
    print("\n".join([sonnet_lines[a_idx], sonnet_lines[b_idx]]))

Askance and strangely; but, by all above,
Admit impediments. Love is not love
Buy terms divine in selling hours of dross;
And losing her, my friend hath found that loss;
  For thy sweet love remember'd such wealth brings
Divert strong minds to the course of altering things;
By chance, or nature's changing course untrimm'd: 
When proud-pied April, dress'd in all his trim,
When love, converted from the thing it was,
  Since why to love I can allege no cause.
Against that time do I ensconce me here,
By unions married, do offend thine ear,
With thy sweet fingers when thou gently sway'st
Made more or less by thy continual haste.
