# LX 496/796  Introduction -- Poetry in NLTK

***Due Friday at 11:59 PM in Gradescope***

In this first homework, you will become familiar with programming in Python on Jupyter Notebooks (in Google Colaboratory), and some problem solving while getting familiar with NLTK (the Natural Language Toolkit).

We will be submitting homeworks (in the form of Jupyter Notebooks) directly to Gradescope.  Later ones will involve an autograder, but for this one:
- Choose "Download" and "ipynb" from the File menu, to save a local copy.
- Upload the `.ipynb` file to Gradescope.  I will just look at it there and potentially add comments. 

---

# Haiku assignment



People can argue about this, but let's just say that something will qualify as haiku if it is in three lines, with specific syllable counts: 5-7-5.  Thus:

> Haikus are easy
>
> But sometimes they don't make sense
>
> Refrigerator

(This haiku might be attributable to Rolf Nelson.)

So, suppose that we want to use a corpus to create such things.  We need to be able to determine whether we have met the syllable constraints.  Let's do that with the CMU pronunciation corpus.

In [None]:
import nltk

The line above is how we always want to start, since that makes NLTK available to us.

The CMU pronunciation corpus has pronunciations for a large set of words, and it is easy to use it with NLTK.  The Colab notebook won't have it available by default, so we want to download it.

In [None]:
nltk.download('cmudict')

The download above should get the corpus so we can use it. Once it is downloaded, we make Python aware of it by using `import`.


In [None]:
from nltk.corpus import cmudict

One of the ways we can look at the CMU pronunciation corpus is in the form of a (Python data type) "dictionary", so we'll ask the corpus for that form.  So if we want to look up the pronunciation of a word like "trimmed", we ask the dictionary what value it has for key `trimmed`.

> If you don't remember or were not already familiar with the dictionary data type, you can [read up on it at pythontutorial.net](https://www.pythontutorial.net/python-basics/python-dictionary/) or [in the official documentation](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) or [in the NLTK book](https://www.nltk.org/book/ch05.html#sec-dictionaries).  It's a key-value mapping.  It is like a list but indexed by key instead of position.
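As a quick refresher, here is a minimal sketch of a dictionary in action.  The name `toy` and its contents are made up for illustration; they have nothing to do with NLTK or the CMU corpus.

```python
# A hypothetical toy dictionary: keys are words, values are part-of-speech labels.
toy = {'haiku': 'noun', 'write': 'verb', 'terrible': 'adjective'}

print(toy['haiku'])       # look up a value by its key
print('write' in toy)     # check whether a key is present
print('sonnet' in toy)    # a key we have not added yet

toy['sonnet'] = 'noun'    # add a new key-value pair
print(len(toy))           # number of keys now in the dictionary
```

The pronunciation dictionary works the same way, except that the values are lists rather than simple strings.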

In [None]:
cmudict.dict()['trimmed']


We can also give it a name (`pro`), so that we don't need to type out `cmudict.dict()` all the time.  This is arguably simpler to understand.  The equivalent way to look up "trimmed" is then as below.

In [None]:
pro = cmudict.dict()
pro['trimmed']

The output for "trimmed" above looked a little bit strange because it started with two square brackets.  But this is meaningful and intentional.  The information we wanted ("how do you pronounce 'trimmed'?") is represented as a list (an ordered sequence of "phones"), starting with a "T", followed by an "R", followed by an "IH" vowel with primary stress, etc.  We know it's a list because it has entries separated by commas and surrounded by square brackets.  This accounts for the innermost square brackets above, but the outermost brackets also are defining a list. That is, we have a list containing one thing: a list of five phones.
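To make the list-inside-a-list structure concrete, here is a small sketch that works on a hand-typed copy of the value, so that it runs without the corpus.  The variable name `trimmed` is just for illustration, and the copy assumes the corpus entry is as described above.

```python
# A hand-typed copy of the value described above: a list holding one
# pronunciation, which is itself a list of five phones.
trimmed = [['T', 'R', 'IH1', 'M', 'D']]

print(len(trimmed))      # how many pronunciations there are
print(trimmed[0])        # the first (and only) pronunciation
print(len(trimmed[0]))   # how many phones that pronunciation has
print(trimmed[0][2])     # the third phone, the vowel with its stress digit
```

The first index picks a pronunciation out of the outer list; the second index picks a phone out of that pronunciation.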

---

## TASK 1. The pronunciation representation

The goal here is to work out how we can, given a word, find the number of syllables.  To do this:
- retrieve the pronunciation for "fire"
- retrieve the pronunciation for "madelyn"

And look at what you got.

In the code cell, retrieve the pronunciations.  Then in the markdown cell, write down what you learned about how many syllables "fire" and "madelyn" have and how you can get from the results you retrieved to that answer.  Specifically: we're going to make a list of 1-syllable words, of 2-syllable words, of 3-syllable words, etc. Which lists do 'fire' and 'madelyn' go in?


In [None]:
# Answer 1a: put code here to get and print the pronunciations for 'fire' and 'madelyn'


**Answer 1b:** (replace this with your answer: which lists do 'fire' and 'madelyn' go into, and how can you determine that from the retrieved pronunciations?)





---

## TASK 2 Define `count_syllables`

In this task we will construct a function called `count_syllables()` that, given a word (a string), will look up the pronunciation, count the syllables in the first pronunciation variant and return the number of syllables.  We will build it up in stages.

The shape of the function would be like that below, except of course in its final form we want it not to just return 2 all the time but instead count the syllables in `word` and return that number.

```
def count_syllables(word):
  num_syllables = 2 # TODO: actually count the syllables
  return num_syllables
```
So how do we count the syllables?

The first thing we are going to need to do is look up the pronunciation of the word that is passed in.  We know from above that when we look up the pronunciations of `word` using `pro[word]`, we get a list.  If we want the first (and often only) pronunciation from that list, we need to focus on the first element.  How do we get the first element of a list? Right. So, the list of phones we are going to check (the first pronunciation) is `pro[word][0]`.



In [None]:
# if word were 'sixths' then the first pronunciation is
phones = pro['sixths'][0]
print(phones)

Each element in this list of phones will look like `IH1` or `S` or `TH`.  To figure out how many syllables are in the word, we need to count how many phones have a stress mark (that is, end in a `0`, `1`, or `2`).  How can we get the last character of an arbitrary string?  Remember that you can use a negative index in order to count backwards from the end of a list.  A string can be treated as a list of characters, so we can count backwards from the end of that as well. The NLTK book made use of a special function `isdigit()` that we can use here to distinguish digits from non-digits, and that is good enough to separate the phones that carry a stress mark from those that do not.

In [None]:
print("The last character of the second phone of sixths is:")
last_of_second = phones[1][-1]
print(last_of_second)
print("Is it a digit?")
print(last_of_second.isdigit())

So what we want is to know how many phones of the word end in digits.  Put another way, how long is the list of phones in the word that end in digits?  That's the goal.  That's the function you will write.

To set this up, I'll give you some similar code.  Suppose we instead wanted to find out how many phones start with "S". The following list comprehension will accomplish that, by making a list containing just those phones that start with "S" and then computing the length of the list.

In [None]:
sphones = [x for x in phones if x[0] == 'S']
print(sphones)
print(len(sphones))

So we can just adapt that list above so that instead of looking for phones beginning with "S" it looks for phones that end in a digit.  Then the length of that list will be the number of syllables. So then we want to set `num_syllables` to be the length of that list.

Now we can finally get to your part. Define the `count_syllables` function as shown a little ways above, except instead of just returning 2, have it take the first pronunciation of the word, count the number of phones that end in a digit, and return that number.  Note: the assumption for now is going to be that the word is actually in the pronunciation dictionary, so there will always be at least one pronunciation.


In [None]:
# Answer 2. Define count_syllables (replace the num_syllables = 2 line from the skeleton with code)


If you have succeeded, you should get 2 for `count_syllables('fire')`, 5 for `count_syllables('participated')`, 3 for `count_syllables('madelyn')`.



In [None]:
print('fire: yep' if count_syllables('fire') == 2 else 'fire: nope') # is it 2?
print('madelyn: yep' if count_syllables('madelyn') == 3 else 'madelyn: nope') # is it 3?
print('participated: yep' if count_syllables('participated') == 5 else 'participated: nope') # is it 5?

If you're very concise, you can get the code for `count_syllables` all on one line without using a variable like `num_syllables`, in a form abstractly started below. Once you have it working above, you will probably see what I mean.  This is at this point just artistic. The main thing is just to get `count_syllables()` to return the number of syllables, however you do it.

```python
    return len([x for x in #... etc. 
```



---

Back to the haiku generation. We want to construct lines of 5 and 7 syllables.  An easy way to do this is to just pick three words (two that are 5 syllables long, one that is 7 syllables long) and use those.  This would lead to inspired haikus like:

```
situational
anesthesiology
agricultural
```

But we already know that eventually we are going to want to use smaller words too, such that we might have several words on one line that, together, are five or seven syllables.

Minimally, we're going to need to know what the 2 syllable words are, the 3 syllable words, the 4 syllable words, and so forth.  So that when we are trying to complete a line, we know what words are available.

We could build each list by saying: go through the list of pronunciations, and pull out the 1 syllable words.  Then, go through the list of pronunciations and pull out the 2 syllable words.  Then the 3 syllable words.  And so forth.

But you can see that this means we need to go through all of the pronunciations *seven times* and for each word each time through, compute how many syllables it is so we can see if it matches the number of syllables we're looking for.  If our pronunciation dictionary is very large, this could be a serious amount of wasted work.  We can make this *much* more efficient by setting up seven bins, then going through the words, computing the number of syllables for each word only once, and putting it in the correct bin once we know what the number of syllables is.  (You can see why we would do this if you imagined doing it by hand, or imagine that there are a million words in the pronunciation dictionary and it takes 1 minute per word to compute the number of syllables.  In reality things are much faster and smaller, but that's no real excuse for being grossly inefficient.)
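To see the difference between the two approaches on a small scale, here is a sketch that bins words by their length in letters.  Letter counts are only a stand-in for syllable counts here, and the word list and the names `slow_bins` and `fast_bins` are made up for illustration.

```python
# Hypothetical toy data: bin words by length in letters (a stand-in for
# syllable counts, which would need the pronunciation dictionary).
words = ['a', 'an', 'the', 'to', 'of', 'cat', 'i']

# Slow way: one full pass over the words for every bin.
slow_bins = [[w for w in words if len(w) == n] for n in (1, 2, 3)]

# Faster way: a single pass, measuring each word only once.
fast_bins = [[], [], []]
for w in words:
    n = len(w)                   # measured once per word
    fast_bins[n - 1].append(w)   # bin 0 holds length 1, bin 1 holds length 2, ...

print(slow_bins)
print(fast_bins)
print(slow_bins == fast_bins)    # both approaches give the same bins
```

With three bins and seven words the savings are invisible, but the slow version measures every word once per bin, while the fast version measures each word exactly once.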

So, let's do that, even right now from the outset.

Our next task is going to be to go through all of the pronunciations, and build up seven lists (one per number of syllables up to seven) containing words with that number of syllables.

---

## TASK 3 Define `sort_words` 

Define a function `sort_words(pro)` that will take the pronunciation dictionary and return a list, seven members long, such that the *n*th member of the list (counting from 1) is a list of words that have a pronunciation that is *n* syllables long.

This is complex enough that it no longer lends itself to a simple list comprehension.  We want to make a proper function with loops.

To begin, we'll start with some empty bins, so the skeleton of the function would look like this:


In [None]:
def sort_words(pro): # version 1
  sorted_words = [[], [], [], [], [], [], []]
  # go through the words in pro
  # drop them into the right bins
  return sorted_words


Then, we go through the words in `pro`.  That loop is pretty simple, so adding it to the skeleton above, we now have:


In [None]:
def sort_words(pro): # version 2
  sorted_words = [[], [], [], [], [], [], []]
  for w in pro:
    # drop them into the right bins
    pass
  return sorted_words


The pronunciations of a word are retrieved with `pro[w]`, and recall this is a list of pronunciations, possibly with more than one.  So we need to go through the pronunciations in `pro[w]`.  Adding that into the skeleton gives us:


In [None]:
def sort_words(pro): # version 3
  sorted_words = [[], [], [], [], [], [], []]
  for w in pro:
    for p in pro[w]:
      # drop w into bin for syllables in p
      pass
  return sorted_words


Now we are at the point where we need to count syllables (that is, stress marks) in the pronunciation `p`.  You can't use the
`count_syllables()` function from before because that was looking at only the first pronunciation of a word.  But you can use the same technique.

Specifically, you can make a list comprehension that will collect together all of the phones in `p` that end in a digit, and then take the `len()` of that list to see how many stress marks (syllables) there were.  If there were 2, then you add the word (`w`) to the bin for 2-syllable words (which is `sorted_words[1]`, keeping in mind that the first bin has index 0).  The way you add `w` to `sorted_words[1]` is: `sorted_words[1].append(w)`.

In general, you add the word `w` to the bin whose index is the number of syllables in `p` minus 1, so long as the number of syllables is at least 1 and fewer than 8.

Because you should be left with *something* to figure out and submit, this is the part of the function that you can fill in yourself.  It will go in place of the comment in the skeleton developed above, and it will involve counting the number of syllables, checking to see if it is fewer than 8, and adding the word to the list if so.

In [None]:
# Answer 3: define sort_words(pro) returning a list of words organized by syllables

To test this function, run the code below.  I got 3924 for the number of 5-syllable words, and 122 for the number of 7-syllable words.  That should match what you got.

In [None]:
sorted_words = sort_words(pro)
print('5-syllable words: yep' if len(sorted_words[4]) == 3924 else '5-syllable words: nope') # is it 3924?
print('7-syllable words: yep' if len(sorted_words[6]) == 122 else '7-syllable words: nope') # is it 122?

---

We are now going to write a function that prints terrible haikus, by just choosing 5- and 7-syllable words at random.  In order to get access to Python's random number functions, we need to import them now.

In [None]:
import random

Picking a random element from a list is pretty straightforward.

In [None]:
AList = ['A','B','C','D','E','F','G','H']

In [None]:
ALetter = random.choice(AList)
print(ALetter)

For seconds of amusement, you can re-run the cell above a few times (repeatedly press Ctrl-Return) and observe the degree to which it has made a random choice each time.

---

## TASK 4 Define `bad_haiku`

Here, we will write a function `bad_haiku()` that prints terrible haikus.  Specifically, it will pick two 5 syllable words at random, and one 7 syllable word at random, and print the 7 syllable word between the two 5 syllable words.

To get you started: We want to define a function, so this would start like `def bad_haiku():` and it does not need to return anything.  Instead it will just `print()` to the screen.

Once you have done `import random` then you can make a choice (for the middle line) from the list of 7-syllable words, like this:


In [None]:
middle_line = random.choice(sorted_words[6])
print(middle_line)

That should pick a random seven-syllable word and print it.

So `bad_haiku()` is just going to find a random 5-syllable line, a random 7-syllable line, and another random 5-syllable line, and print them.

*Note*: You can combine things more concisely, and, rather than finding words first and then printing them, you can print a word as you find it (and thus not need to store the word in a variable), like `print(random.choice(sorted_words[4]))`

In [None]:
# Answer 4a: define a bad_haiku() function to print 5-7-5 lines
# then execute it a couple of times to print the haikus


Answer 4b. Provide a couple of the bad haikus you got here.

---

## TASK 5 Define `slightly_less_bad_haiku`

The haikus above are pretty terrible haikus.  It would be nice if at least it would be able to combine some shorter words to make the 5 and 7 syllable lines.  Maybe stylistically, we'd like to keep the last line at a single word, but the others really should be made of shorter words if we're going to convince anyone that we have produced something deep and meaningful.

In this task we will define a function `slightly_less_bad_haiku()` that prints randomly chosen words in a 2-3 / 4-1-2 / 5 pattern.

We'll be getting things like this:

```
backstage schueneman
electrocute cross flagship
accommodating
```

In Python, if you just use `print("something")`, it will print `something` to the screen, and then move the cursor down and back to the left (in other words, it prints a newline character).  If you want to print something with a customized ending (like a space, or nothing), you can specify this with `end=`. So, observe what happens below.  This will of course be of use when we are printing several words to a line.

In [None]:
print("Line one")
print("Line two")
print("One", end=' ')
print("Two", end=' Pizza time!')

So, now just define `slightly_less_bad_haiku` so that it prints a three-line haiku with two words, then three, then one, in a 2-3/4-1-2/5 pattern.  Be sure that the two words of the first line are printed together on one line, and likewise the three words of the second line. It is going to be very much like the definition of `bad_haiku`.

In [None]:
# Answer 5a: define slightly_less_bad_haiku() to print 2-3/4-1-2/5 haikus.

Answer 5b. Put a couple of the slightly less bad haikus you got here.

---

## TASK 6 Define `sums_to`

There are a couple of different ways to create a 5 or 7 syllable line.  Above we used a fixed pattern (2-3 for the first line, 4-1-2 for the second).  But to be more flexible, we could use all kinds of combinations that add up to 5.  It could be five 1-syllable words, it could be 1-2-1-1, or 1-3-1.  There are many options.  And many more options for a 7-syllable line.

The goal of the next part is to pick a random word to start a line, and figure out how many syllables that leaves us with, then pick a random word to continue, and repeat.

Doing it by hand, it might look like this:

 - Pick a number from 1 to 7.  Suppose we picked 4.
 - Pick a four syllable word, it turns out to be "electrocute".
 - We are aiming for 7, so we have 3 syllables left to fill.
     - Pick a number from 1 to 3. Suppose we picked 1.
     - Pick a one syllable word, it turns out to be "cross".
     - We are aiming for 3, so we have 2 syllables left to fill.
         - Pick a number from 1 to 2. Suppose we picked 2.
         - Pick a two syllable word, it turns out to be "flagship".
         - We have filled our 2 syllables.
     - Which means, with "cross", we have filled our 3 syllables.
 - Which means, with "electrocute", we have filled our 7 syllables.
 - Our line is "electrocute cross flagship".
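The bookkeeping in the steps above can be replayed in a few lines.  This sketch does not choose anything at random; it just hard-codes the three picks from the walk-through (with their syllable counts as given there) and tracks how many syllables remain.  The names `picks`, `remaining`, and `line` are only for illustration.

```python
# Replaying the hand-worked example: each pick is (word, syllables),
# with the syllable counts taken from the walk-through.
picks = [('electrocute', 4), ('cross', 1), ('flagship', 2)]

remaining = 7   # we are aiming for a 7-syllable line
line = []
for word, syllables in picks:
    line.append(word)
    remaining -= syllables
    print(f"picked {word!r} ({syllables} syllables), {remaining} left to fill")

print(' '.join(line))
```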

What we want to do is write a function to do this for us.

To illustrate how this could work, we will take a diversion to write a function that will just come up with numbers that add up to some target. Then afterwards, we will revise it to work with words.

In [None]:
def sums_to(total):
    new_number = random.randrange(total) + 1
    remaining = total - new_number
    if remaining == 0:
        number_list = [new_number]
    else:
        number_list = [new_number] + sums_to(remaining)
    return number_list

In [None]:
# When I use `print(sums_to(7))` three different times, I get 
#  three different answers, like this:

print(sums_to(7))
print(sums_to(7))
print(sums_to(7))

The way `sums_to` works is a little bit weird (that is to say, it's recursive).  I'll walk through it here, and I'll repeat the line I'm talking about as I talk about it.  I'll intersperse a couple of questions as we go.

```python
new_number = random.randrange(total) + 1
```

First it finds a random (integer) number greater than or equal to 0, and less than `total`.  This is accomplished using `random.randrange(total)`.  (`random.randrange()` is like `range()` in that the number you provide is one higher than the function can reach.  So `range(15)` covers the numbers from 0 to 14, but not 15; and `random.randrange(15)` could return anything from 0 to 14, but not 15.)  Since we don't want to include 0, we add one, meaning that `new_number` winds up able to be any number from 1 to `total` (so from 1 to 15, if `total` is 15).
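If you would like to convince yourself of that range, here is a small sketch that draws many samples and checks their bounds.  The sample size of 10000 and the name `samples` are arbitrary choices for illustration.

```python
import random

total = 15
# Draw many samples so that every possible value is very likely to show up.
samples = [random.randrange(total) + 1 for _ in range(10000)]

print(min(samples))    # never below 1
print(max(samples))    # never above total
print(0 in samples)    # 0 is never produced, because of the + 1
```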

---

### TASK 6a Why not zero?

Why don't we want to include 0?  Answer that below.

Answer 6a: (put here why we do not want to allow `new_number` to be zero)

---

Ok, continuing on:

```python
    remaining = total - new_number
```

This determines how many we have left after our new number before the numbers add up to the `total`.  So, for a `total` of 15, if `new_number` turned out to be 10, then `remaining` would be 5.  That is, we have to find more numbers that add up to 5.

```python
    if remaining == 0:
        number_list = [new_number]
```

It's possible that we are finished already.  If the number we picked is already as big as the total we are aiming for, then the list we will return (`number_list`) can be just the simple list containing that one number.

```python
    else:
        number_list = [new_number] + sums_to(remaining)
```

If, on the other hand, we still have more numbers to find, then the list we want to return is one that contains this number (`new_number`) and then some more numbers that add up to `remaining`.  We (will) already have a way to find a list of numbers that add up to `remaining`, it is this very function we are defining now.  So we can return the list containing `new_number` and the numbers that `sums_to(remaining)` finds.

```python
    return number_list
```

Then we return the `number_list` we created in one of the two last code fragments.

If your brain hurts now, or you suspect witchcraft, welcome to the world of recursive programming.  Even though it seems kind of intuitive if you were doing this yourself as a human, the weird thing about `sums_to` is that in that penultimate line we actually *use the function we are defining as part of its definition.*  Why does this not simply cause the universe to implode?

If you actually trace it through and think about what it is doing, it may make a little bit more sense.  Also, it works like the haiku-by-hand example above, really.  But let me try to represent this in a table (below).  The explanation is below the table, but the idea is that as part of computing `sums_to(15)` we need to compute `sums_to(5)` first.  And as part of computing `sums_to(5)` we need to compute `sums_to(2)` first.

<table>
    <tr>
        <td>sums_to(15)
        </td>
        <td>
        </td>
        <td>
        </td>
    </tr>
    <tr>
        <td>picked 10 (5 remain)
        </td>
        <td>
        </td>
        <td>
        </td>
    </tr>
    <tr>
        <td>(find a list that adds to 5)</td>
        <td>&rarr; sums_to(5)</td>
        <td>
        </td>
    </tr>
    <tr>
        <td>
        </td>
        <td>picked 3 (2 remain)
        </td>
        <td>
        </td>
    </tr>
    <tr>
        <td>
        </td>
        <td>(find a list that adds to 2)</td>
        <td>&rarr; sums_to(2)</td>
    </tr>
    <tr>
        <td>
        </td>
        <td>
        </td>
        <td>picked 2 (none remain)
        </td>
    </tr>
    <tr>
        <td>
        </td>
        <td>([2] is such a list)</td>
        <td>&larr; return [2]</td>
    </tr>
    <tr>
        <td>([3, 2] is such a list)</td>
        <td>&larr; return [3] + [2]</td>
        <td>
        </td>
    </tr>
    <tr>
        <td>&larr; return [10] + [3, 2]<br />a.k.a. [10, 3, 2]</td>
        <td>
        </td>
        <td>
        </td>
    </tr>
</table>


In words: We were computing `sums_to(15)`.  We picked a random number, it was **10**.  That by itself does not add to 15.  In addition to the **10**, we need a list of numbers that will add up to the remaining 5.  `sums_to(5)` can find a list with that property.   While computing `sums_to(5)`, we pick a random number, and it was **3**.  That does not add to 5, we also need a list of numbers that will add up to the remaining 2.  So, we call `sums_to(2)` to get such a list.  It randomly picks **2**, which does add up to 2 (that was the goal), so it returns `[2]`, which is a list that adds up to 2.  We can now finish evaluating `sums_to(5)`, which adds the `[3]` it found to the `[2]` it got back, and returns `[3, 2]` (which is a list that adds up to 5).  And then we can finish evaluating `sums_to(15)`, it adds the `[10]` it found to the `[3, 2]` it got back, and returns `[10, 3, 2]` (which is a list that adds to 15).
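One way to watch this happen is to add some printing to the function.  The sketch below uses the same logic as `sums_to`, with an extra `depth` argument that is used only to indent the output.  The name `traced_sums_to` and the `depth` parameter are additions for illustration and are not needed for the homework.

```python
import random

def traced_sums_to(total, depth=0):
    """Same logic as sums_to, plus indented printing to show the nesting."""
    indent = '    ' * depth
    new_number = random.randrange(total) + 1
    remaining = total - new_number
    print(f"{indent}sums_to({total}): picked {new_number} ({remaining} remain)")
    if remaining == 0:
        number_list = [new_number]
    else:
        # the recursive call goes one level deeper, so it is indented further
        number_list = [new_number] + traced_sums_to(remaining, depth + 1)
    print(f"{indent}return {number_list}")
    return number_list

result = traced_sums_to(15)
print(sum(result))   # the numbers always add back up to the starting total
```

Each level of indentation corresponds to one column of the table: a call prints what it picked, waits for the deeper call to return, and only then returns its own list.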

I went through all that because we can use this (pretty directly) to construct a list of words whose syllables add up to five, or seven, or whatever we want.  Really, the only difference between `sums_to()` and the function that we want for our haiku-maker is that instead of making lists of numbers, we want to make lists of words.

The next step is to write a function `construct_line(total)` that takes a number of syllables (`total`) and returns a list of words whose syllable lengths add up to the number of syllables passed in.  We're going to base this on `sums_to(total)`.  Before we actually get to the part where you define that function, there are several points to make.

The first point is: There's an easy way to do this, since we have `sums_to(total)` already.  We could just do the following:

In [None]:
def easy_construct_line(total):
    return [random.choice(sorted_words[n-1]) for n in sums_to(total)]

In [None]:
print(easy_construct_line(5))
print(easy_construct_line(7))
print(easy_construct_line(5))

But there is a reason that I don't want to do it quite this way.  The problem is that we are going to (in a little bit) make the choice of the second word be based on the choice of the first word.  We want to make these haikus flow a little bit more naturally.  The idea is that there might be 4 words that often follow the first one, and we want to use one of them as the second word in our haiku.  It shouldn't matter how long they are if they fit, just that they are one of the common continuations.  So instead of picking a length and then looking for a word of that length, it will be better to pick a word out of the set of common continuations and then afterwards look at its length.

The line in `sums_to(total)` that is relevant is this one:

```python
    new_number = random.randrange(total) + 1
```

What this does is pick a number that is among the possible continuations (since the possible continuations are numbers up to the total amount we are looking for).  To translate this into picking words in a haiku line, we want to pick a word that is among the possible continuations.  That would be any word that has no more syllables than the total number we are looking for.

So let's make a list of the words that are sufficiently short to fit on the rest of the line.  If there are three syllables left, that would include all of the 3-syllable, 2-syllable, and 1-syllable words.  And then we'll just pick one, see how long it was, and then proceed.

We already have `sorted_words`, which is a list of words sorted by lengths.  So, `sorted_words[0]` are the 1-syllable words, `sorted_words[1]` are the 2-syllable words, etc.  If we have 2 syllables left on our line, then we want all of the words in *either* `sorted_words[0]` *or* `sorted_words[1]` to be candidates for continuation.  So, let's transform this `sorted_words` list into one organized by possible continuations.  We can call it `next_words`, and it will be like this: `next_words[0]` is a list of 1-syllable words, the only things you can use if you have only 1 syllable left on the line.  `next_words[1]` is a list of either 1- or 2-syllable words, which are options if you have 2 syllables left on the line.  All of the words in `next_words[0]` are also in `next_words[1]`.

To construct `next_words` elegantly is a little bit complicated, so rather than try to get you to write it yourself, I'll provide the function I ended up with and ask you about it.  Below it is defined, and then tested.

In [None]:
def construct_next_words(sorted_words):
    next_words = []
    cumulative_words = []
    for i in range(7):
        word_pairs = [(w, i + 1) for w in sorted_words[i]]
        cumulative_words.extend(word_pairs)
        next_words.append(list(set(cumulative_words)))
    return next_words

In [None]:
next_words = construct_next_words(sorted_words)

In [None]:
print(len(sorted_words[0])) # the number of 1-syllable words
print(len(next_words[0])) # the number of words we can use if 1 syllable remains
print(len(sorted_words[1])) # the number of 2-syllable words
print(len(next_words[1])) # the number of words we can use if 2 syllables remain
print(next_words[0][-1]) # the last of the 1-syllable-left possibilities
print(next_words[1][-1]) # the last of the 2-syllables-left possibilities

---

## TASK 7 What list is `word_pairs` set to, inside the loop?

To show that you understand what `construct_next_words(sorted_words)` is doing, consider the step where `i` is 2.  Describe the list that `word_pairs` gets set to in that iteration.

> This is probably the hardest question so far here.  I expect you will need to stare at this for a little while, so don't be worried if you have to.  I wrote this last year, and now that I've gone back to look at it again, *I* needed to stare at it for a while.  But keep in mind what we want it to give us.  We have the words sorted by how long they are, and what we get back are the words sorted by whether they would fit in the remaining syllables.  So we expect the 1-syllable words to be in all of those lists, since no matter how many syllables we have left (assuming we aren't already done) a 1-syllable word would fit in the remaining space.


Answer 7: (markdown)

---

## TASK 8 1-syllable continuations?

An interesting oddity is that, a couple of cells above, we see that the number of 1-syllable words seems to be bigger than the number of possible 1-syllable continuations.  What might have led to that?

> When the continuations are built, the `set()` function is used.  That must be eliminating some.  Why might that be? 

Answer 8. Why are n-syllable continuations smaller than the sorted n-syllable words?

---

## TASK 9 Define `construct_line`

Write a function `construct_line(total)` that takes a number of syllables (`total`) as an argument and returns a list of words whose syllables add up to the number of syllables that was passed in.  Use the lists in `next_words` for the source of the random words.

>As indicated earlier, we are going to model `construct_line(total)` directly on `sums_to(total)`.  The two functions will be almost identical.  The line we want to focus on for changing is the line that, in `sums_to(total)`, reads:
>
>```python
>    new_number = random.randrange(total) + 1
>```
>
>Instead of a random number, we want a random word.  It should be a word that has at most `total` syllables, and now that we have defined `next_words` we can use that.  Specifically, we can find a word that has `total` or fewer syllables by just doing this:
>
>```python
>    new_word_pair = random.choice(next_words[total-1])
>```
>
>The word we picked will be `new_word_pair[0]` and the number of syllables it has will be `new_word_pair[1]`.  So after having picked the word, we will want to determine how many syllables are left by subtracting `new_word_pair[1]` from `total`.
>
>With that much guidance, define `construct_line(total)` based on `sums_to(total)` but so that it returns a list of words instead of a list of numbers.  You should rename the variables `number_list` and `new_number` to be more sensible for words, like `word_list` and `new_word_pair`.

In [None]:
# Answer 9: define construct_line(total) to return a list of words

---

If this worked, you should get a couple of (still pretty random) haikus by running the next cell.

In [None]:
print(' '.join(construct_line(5)))
print(' '.join(construct_line(7)))
print(' '.join(construct_line(5)))
print()
print(' '.join(construct_line(5)))
print(' '.join(construct_line(7)))
print(' '.join(construct_line(5)))

These haikus are still pretty terrible.  It's just random words jumbled together.  So, let's take one last step to trying to make these more palatable.  *Spoiler*: they're still going to be terrible.

The plan is to use bigrams and conditional frequency distributions to try to chain the words together better, so that as much as possible, the choice of what word comes next is constrained by what has been seen to come next in the corpus.

We'll look at the "romance" category of the Brown corpus (since this seems most likely to provide poetry). The idea is that we will look at a "romance" word, find out how many syllables it has by looking up its pronunciation, and then proceed to the next word based on words that have been seen to follow it in the "romance" corpus.  Notice that this means we need to be able to find the "romance" word in the pronunciation corpus (because we need to know how many syllables it has).  So, this is only going to work for words that are in both corpora.  Another point about this: The Brown corpus contains some words that are capitalized, but all of the words in the CMU pronunciation corpus are lowercase.  So, as a first step, we will extract the "romance" category, and then make it all lowercase.

In [None]:
nltk.download('brown')

In [None]:
from nltk.corpus import brown
corp = brown.words(categories='romance')
lc_corp = [w.lower() for w in corp]

Just to see what we've done here, let's look at the first 10 words of each corpus.

In [None]:
print(corp[:10])
print(lc_corp[:10])

Now that we have the corpus in lowercase, we can form the bigrams (the pairs of word and next word) using `bigrams(lc_corp)` and then eliminate all of those that contain words we cannot look up in the pronunciation corpus.  So, below, we form `pairs_in_cmu` by going through each bigram `(x,y)` and adding it to the list only when both `x` and `y` are in our pronunciation corpus.

In [None]:
from nltk.util import bigrams
pairs_in_cmu = [(x,y) for (x,y) in bigrams(lc_corp) if x in pro and y in pro]
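
To see what this filtering does on a small scale, here is a sketch with made-up data.  `toy_words` and `toy_pro` are hypothetical stand-ins for `lc_corp` and `pro`, and `zip(toy_words, toy_words[1:])` is used here as a plain-Python way to get the same adjacent pairs that `bigrams()` produces.

```python
# Toy stand-ins (hypothetical data, not the real corpora)
toy_words = ['the', 'angry', 'xyzzy', 'cat', 'sat']
toy_pro = {'the': [], 'angry': [], 'cat': [], 'sat': []}  # 'xyzzy' is missing

# Adjacent pairs, like bigrams(toy_words)
toy_pairs = list(zip(toy_words, toy_words[1:]))

# Keep only the pairs where both words can be looked up
toy_pairs_in_cmu = [(x, y) for (x, y) in toy_pairs if x in toy_pro and y in toy_pro]
print(toy_pairs_in_cmu)  # [('the', 'angry'), ('cat', 'sat')]
```

Notice that one unknown word removes two pairs: the pair that ends with it and the pair that starts with it.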

Now. Here is what we want to do. We want to know for any given word (that we can pronounce), what words have been seen to follow that word.  So, this means that we look in our list of bigrams (`pairs_in_cmu`) for everything that has, as its first element, the given word.  The observed second elements of those pairs are the words that have been observed following our given word.

It might be clearer seeing it in code. The following list comprehension will give us a list of words (that we can pronounce) that have been seen to follow "angry".

In [None]:
[y for (x,y) in pairs_in_cmu if x == 'angry']

So if we wanted to find a word that follows "angry", we could pick one of those words at random. Notice that "at" is in there twice, because it was twice observed to follow "angry". That's ok.  In fact, it gives us a possible benefit: because "at" appears in the list twice, a random pick from the list is twice as likely to land on "at" as on a word that appears only once.
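
Here is a quick experiment showing that effect, as a sketch with a made-up list.  The `followers` list is hypothetical (it is not the real output for "angry"), and the seed is set only so the experiment is repeatable.

```python
import random
from collections import Counter

# Hypothetical followers list; 'at' appears twice
followers = ['at', 'with', 'at', 'and']

random.seed(0)  # only so that the experiment is repeatable
draws = Counter(random.choice(followers) for _ in range(10000))

# 'at' fills 2 of the 4 slots, so it should come up about half the time
print(draws['at'] / 10000)
```

The printed proportion should be close to 0.5, while "with" and "and" should each come up about a quarter of the time.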

It is possible to construct these lists for all of the first elements of the words in `pairs_in_cmu` and then use those lists to decide what to pick next when you are moving through the line of poetry.  However, NLTK already provides a way to do something like this, given its usefulness to language processing tasks.  It is called a "Conditional Frequency Distribution."  It seems a little complicated, but really, it is just a generalization of what we did just above, collecting all the words that follow "angry".  You make a conditional frequency distribution with a list of bigrams, so:

In [None]:
cfd = nltk.ConditionalFreqDist(pairs_in_cmu)
cfd['angry']

You can see that we got just the same result, though this is now a "frequency distribution" and so it is counting the number of occurrences.  Since "at" occurred twice, it is paired with 2.  In fact, you can tell from the way it is printed, in a kind of key-value format, that you could further ask `cfd['angry']` how often "at" occurred, as compared to "had":

In [None]:
cfd['angry']['at']

In [None]:
cfd['angry']['had']
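
If the two-level lookup seems mysterious, it may help to see the same bookkeeping done by hand.  The sketch below is a plain-Python stand-in for what a conditional frequency distribution keeps track of, not NLTK's actual implementation.  The bigrams in `toy_pairs` are made up.

```python
from collections import Counter, defaultdict

# Hypothetical bigrams, not taken from the Brown corpus
toy_pairs = [('angry', 'at'), ('angry', 'with'), ('angry', 'at'), ('happy', 'with')]

# For each first word, count how often each second word followed it
toy_cfd = defaultdict(Counter)
for first, second in toy_pairs:
    toy_cfd[first][second] += 1

print(toy_cfd['angry'])         # Counter({'at': 2, 'with': 1})
print(toy_cfd['angry']['at'])   # 2
print(toy_cfd['angry']['had'])  # 0, since 'had' was never seen after 'angry'
```

Asking for the count of a word that was never observed gives 0 rather than an error, which is convenient when checking possible continuations.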

Ok, now, the idea is to construct lines where each subsequent word has been seen following the preceding word, hoping that this will increase the coherence of our language.  (Notice that this is not being smart at all about how language works; it's just offloading the work to the previously-collected corpus we are relying on.  We assume the corpus put words in a sensible order, so we'll try to mimic that and hope our haiku winds up putting words in a similarly sensible order.)

As a way to get started, let's see how we would do this partially by hand.

Let's suppose that we start with "unhappy" and we are aiming for a line of 5 syllables long.  Since "unhappy" is 3 syllables long, we have 2 left.  Let's see what words have been known to follow "unhappy".

In [None]:
cfd['unhappy']

Let's see which of those are among the options we have in our list of words that we can use if we have 2 syllables left.

In [None]:
options = [(w,s) for (w,s) in next_words[1] if w in cfd['unhappy']]

In [None]:
print(options)

So, we could finish up a 5-syllable line by making it "unhappy success".  That sounds moderately poetic.
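
The same filtering step can be tried on a miniature example.  This is only a sketch: `toy_next_words` is a hypothetical, tiny imitation of `next_words` (index `n-1` holds the words with at most `n` syllables), and `toy_followers` is a made-up set of words seen after some previous word.

```python
# Hypothetical miniature of next_words: index n-1 holds (word, syllables)
# pairs for the words that have at most n syllables
toy_next_words = [
    [('a', 1), ('cat', 1)],
    [('a', 1), ('cat', 1), ('success', 2), ('angry', 2)],
]

# Hypothetical set of words that were seen after the previous word
toy_followers = {'success', 'about', 'cat'}

remaining = 2  # syllables left on the line
toy_options = [(w, s) for (w, s) in toy_next_words[remaining - 1] if w in toy_followers]
print(toy_options)  # [('cat', 1), ('success', 2)]
```

An option with fewer syllables than `remaining` (like "cat" here) does not finish the line, so another word would still be needed after it.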

Now, we will try to formalize this into a new version of the `construct_line()` function.  I will call it `construct_better_line()` and like before, we need to tell it how many syllables the line should have.  So, it should start *something* like this:

```python
def construct_better_line(total):
```

But that's not quite good enough. The `construct_better_line()` function is used each time we need to find a next word, but now the choice of the next word depends on what the previous word was.  So the function needs access to the previous word, not just the target length.  That means that we need to add the previous word as one of the things that the function takes as an argument.  So, really, it should be something like this:

```python
def construct_better_line(total, previous_word):
```

But if we're just starting a line at the beginning, there is no previous word.  Python has a special value for things that don't exist, called `None`.  We can set up this function with a "default" for the `previous_word` such that, if no previous word is provided, it is assumed to be `None`. Like so:

```python
def construct_better_line(total, previous_word = None):
```

This means there are two situations to consider: one where we have a previous word (we are in the middle of a line), and one where we don't (we are at the beginning of a line).  If there is no previous word, the choice of the next word is relatively unconstrained, so we can just pick a random word (that has at most `total` syllables).  If there is a previous word, then we need to consult the conditional frequency distribution we built from the bigrams.  You can test whether `previous_word` is `None` by treating `previous_word` as if it were a True/False value.  `None` will evaluate as `False`, and any actual word (a non-empty string) will evaluate as `True`.

```python
def construct_better_line(total, previous_word = None):
    if previous_word:
        # select the next word based on the previous one
    else:
        # select the first word however we like
```
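
To see the default value and the True/False test in action on their own, here is a small sketch.  The function `describe_position` is a hypothetical helper invented just for this illustration; it is not part of the haiku generator.

```python
def describe_position(previous_word=None):
    """Report which branch a call would take (hypothetical helper)."""
    if previous_word:
        return 'continuing after ' + previous_word
    else:
        return 'starting a new line'

print(describe_position())           # starting a new line
print(describe_position('unhappy'))  # continuing after unhappy
print(describe_position(''))         # starting a new line (an empty string is also falsy)
```

If you ever need to tell `None` apart from an empty string, you can test `previous_word is None` instead.  For our purposes the simple test is enough, since every word we pass along is a non-empty string.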

Let's start with what happens if we are at the beginning of a line.  I demonstrated this above by starting with "unhappy".  Can you just pick any random word?  Well, let's see.

---

## TASK 10 Followers of Linda

How many different words can follow "linda"?

In [None]:
# Answer 10: How many different words can follow "linda"?

**Answer 10**: (markdown)

---

## TASK 11 Following socially

One of the words that can follow "linda" is "socially".  So we know that "socially" is in both corpora.  How many different words can follow "socially"?

In [None]:
# Answer 11: How many different words can follow "socially"?

**Answer 11**: (markdown)

---

It is not guaranteed that there will be continuations available from any word you pick.  It might be that there simply are no examples in the corpus of words following the word you picked, or it might be that all of the examples that do exist have too many syllables to fit in what's left of your line.  We will probably need to deal with this contingency, but at least when we are starting a line, we want to pick a word that has somewhere to go.

So, let's make a list of good starting words.  The plan is to find the 100 words with the highest number of possible following words, and when we start a new line, we will pick one of those.  This doesn't really address the issue directly, but it's a fairly easy way to give the haiku generator a fighting chance.

We need to know for each word how many continuations it has.  Recall that above we discovered that "unhappy" has 4 continuations.  We can get this number directly by taking the length of the frequency distribution.

In [None]:
cfd['unhappy']

In [None]:
len(cfd['unhappy'])

So let's compute this for every word in `cfd`.  We can make that list as follows:

In [None]:
num_next = [(len(cfd[x]), x) for x in cfd]
num_next[:3]

If we sort this, it will sort it by the first element in the pair, which is what we want.  This is sorting it in order of number of continuations.

In [None]:
sorted_next = sorted(num_next)

In [None]:
print(sorted_next[:3]) # least continuable

In [None]:
print(sorted_next[-3:]) # most continuable
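
Sorting a list of pairs can be checked on a small made-up list.  The counts in `toy_num_next` are hypothetical.  Python compares pairs by their first element, and only looks at the second element to break ties.

```python
# Hypothetical (number of continuations, word) pairs
toy_num_next = [(2, 'angry'), (40, 'the'), (1, 'socially'), (2, 'a')]

toy_sorted = sorted(toy_num_next)
print(toy_sorted)       # [(1, 'socially'), (2, 'a'), (2, 'angry'), (40, 'the')]
print(toy_sorted[-2:])  # [(2, 'angry'), (40, 'the')]
```

The two words with 2 continuations end up in alphabetical order, because the tie on the count is broken by comparing the words themselves.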

The words with the most continuations are at the end (it sorts from low to high), and so the 100 "most continuable" words would be `sorted_next[-100:]`.  Let's assume that we are not going to want to choose among those most continuable words based on differences in their continuability, and transform the list so it contains just the words (not the continuation counts).  In other words, we want to do a list comprehension that extracts just the word.

In [None]:
good_starts = [w for (n, w) in sorted_next[-100:]]
print(good_starts[:5])

Getting back to what happens at the beginning of a line in `construct_better_line`, we want to pick one of those "good starts" as the next word if there is no previous word.  It is possible that some of those words have too many syllables, though.  It's not really likely, but we should still take that into account.  So, we need to filter the list so that it just has words that have at most `total` syllables, and then make a random choice from among those.

The simplest way to do this is to filter the options list by whether each word is in `good_starts`.  Recall that we built the `options` list for "unhappy" before like this:

```python
options = [(w,s) for (w,s) in next_words[1] if w in cfd['unhappy']]
```

We will still want that form for when we are continuing from a word, but the task at hand is to continue when there was no previous word.  So we filter on whether the word is a good start.

In [None]:
total = 7 # we aim to construct a line of length 7
options = [(w,s) for (w,s) in next_words[total-1] if w in good_starts]
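
One optional refinement, offered only as a sketch: testing `w in good_starts` against a list means scanning the list each time, while a set can answer the same question faster.  The names `toy_good_starts`, `toy_candidates`, and `good_start_set` are hypothetical, and the result is the same either way.

```python
# Hypothetical data standing in for good_starts and for one of the next_words lists
toy_good_starts = ['the', 'a', 'she', 'he']
toy_candidates = [('the', 1), ('cat', 1), ('she', 1), ('success', 2)]

# A set holds the same words but supports faster membership tests
good_start_set = set(toy_good_starts)

from_list = [(w, s) for (w, s) in toy_candidates if w in toy_good_starts]
from_set = [(w, s) for (w, s) in toy_candidates if w in good_start_set]
print(from_set)               # [('the', 1), ('she', 1)]
print(from_list == from_set)  # True
```

With only 100 good starts the difference is unlikely to matter much, so the list version is perfectly fine for this homework.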

If we embed this into the function we are building up, we wind up with this:

```python
def construct_better_line(total, previous_word = None):
    if previous_word:
        # select the next word based on the previous one
    else:
        # select the first word however we like
        options = [(w,s) for (w,s) in next_words[total-1] if w in good_starts]
```

Thinking ahead, we can expect that the block that executes if we have a previous word will also provide a list of options for the next word, so let's add something outside the conditional that will pick one of the options as the next word.

```python
def construct_better_line(total, previous_word = None):
    if previous_word:
        # select the next word based on the previous one
    else:
        # select the first word however we like
        options = [(w,s) for (w,s) in next_words[total-1] if w in good_starts]
    word = random.choice(options)
```

And now we need to determine whether we are done, or whether we need to get more words.  The choice of `word` is a pair, where the first element is the word and the second element is its number of syllables.  So it is relatively easy to see if we are done.  If we reached `total`, just return what we have.  If we have a ways to go, we return what we have so far and what `construct_better_line` gives us for the remaining syllables.

```python
def construct_better_line(total, previous_word = None):
    if previous_word:
        # select the next word based on the previous one
    else:
        # select the first word however we like
        options = [(w,s) for (w,s) in next_words[total-1] if w in good_starts]
    word = random.choice(options)
    remaining = total - word[1]
    line = [word[0]]
    if remaining > 0:
        line += construct_better_line(remaining, word[0])
    return line
```

That has covered the case where we had no previous word.  We can now turn our attention to what happens when `construct_better_line()` is called with a previous word.  In that case, we need to figure out what the possible continuations are from that word, filter them down to just those that do not have too many syllables, and pick one of them.

We were just reminded above of how we find the continuation options from a single word (as we did sort of by hand for "unhappy").  Now is the time to deploy that.  We know how many syllables remain (so we know which available continuations to draw from), and we have the word (so we know where to look in `cfd`).  And the structure of the function we are building allows us to just set `options` in the conditional, and use the rest of the function as it is already written.  So:

```python
def construct_better_line(total, previous_word = None):
    if previous_word:
        # select the next word based on the previous one
        options = [(w,s) for (w,s) in next_words[total-1] if w in cfd[previous_word]]
```

And that is almost it.  Assuming you followed the discussion above, you should be able to assemble it into the full definition of `construct_better_line()`.  There is just one case left that we should consider:

It is possible that no following word is found at all.  Recall how many words followed "socially".  If we wind up adding "socially" to a line with syllables left to go, we can't continue.  So we need to decide what to do.  The simplest thing is just to say that if there is nowhere to go, pick a new start from among the good starts.

This leads us to a slight reorganization of the structure in order to allow fractionally more elegance.  We will set a flag `continuing` to be, by default, false.  This indicates whether we are continuing from a prior word.  If there is a previous word and we find a continuation, then we set this flag to true.  Then, if we are not continuing (meaning either that there was no previous word, or there was one but it didn't provide any options), we pick a new word from among our good starts.

>Note: there is a failure case that is not being considered here, which is if there are no good starts that are short enough for the number of syllables that are left.  Because "a" is for sure in our good starts, this won't arise.  However, if the good starts list is further filtered in a way that could leave it with no 1-syllable words, this could come up.

```python
def construct_better_line(total, previous_word = None):
    continuing = False
    if previous_word:
        # select the next word based on the previous one
        options = [(w,s) for (w,s) in next_words[total-1] if w in cfd[previous_word]]
        if len(options) > 0:
            continuing = True
    if not continuing:
        # select the first word however we like
        options = [(w,s) for (w,s) in next_words[total-1] if w in good_starts]
```
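
The failure case mentioned in the note above comes down to calling `random.choice` on an empty list, which raises an `IndexError`.  Here is a minimal sketch of one way to guard against that.  The helper `pick_or_none` is hypothetical, and what to do when it returns `None` is left open.

```python
import random

def pick_or_none(options):
    """Return a random option, or None if there is nothing to pick from (hypothetical helper)."""
    if len(options) == 0:
        return None
    return random.choice(options)

print(pick_or_none([]))          # None
print(pick_or_none([('a', 1)]))  # ('a', 1)

# Without the guard, an empty list is an error
try:
    random.choice([])
except IndexError:
    print('random.choice fails on an empty list')
```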

Ok, now you have all the pieces to assemble `construct_better_line()`.

---

## TASK 12 Define `construct_better_line`

Define `construct_better_line()` as described above, such that it (where possible) will pick from words it has seen following the current word.

In [None]:
# Answer 12: Define construct_better_line(total, previous_word = None)

---

If it worked, the following should provide some haikus that, frankly, are still terrible.  But maybe have a little bit of flow.

In [None]:
for i in range(3):
    print(' '.join(construct_better_line(5)))
    print(' '.join(construct_better_line(7)))
    print(' '.join(construct_better_line(5)))
    print()
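
If you want to double-check that a generated line really has the right length, a small helper can add up the syllables.  This is a sketch under two assumptions: that syllables are counted as the phones ending in a stress digit, and that only the first listed pronunciation is used.  The entries in `toy_pro` are hand-typed in the CMU format for illustration, and the function names are hypothetical.

```python
# Hand-typed entries in the CMU format (hypothetical subset, standing in for pro)
toy_pro = {
    'haikus': [['HH', 'AY1', 'K', 'UW0', 'Z']],
    'are': [['AA1', 'R']],
    'easy': [['IY1', 'Z', 'IY0']],
}

def syllables_in(word, pronunciations):
    """Count the vowel phones (those ending in a digit) in the first pronunciation."""
    first = pronunciations[word][0]
    return sum(1 for phone in first if phone[-1].isdigit())

def line_length(words, pronunciations):
    """Add up the syllables of every word in the line."""
    return sum(syllables_in(w, pronunciations) for w in words)

print(line_length(['haikus', 'are', 'easy'], toy_pro))  # 5
```

With the real data, something like `line_length(construct_better_line(7), pro)` should come out to 7, provided the syllable counting here matches the way the syllable counts were computed earlier in the notebook.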

There are various ways this could be improved.  This homework has taken almost long enough.  Just one more thing to try, and this might be mildly difficult.  These poems were based on the "romance" genre portion of the Brown corpus.  But the Brown corpus has several other genres.  The command below will list them.

In [None]:
print(brown.categories())

Go back up to where we first defined `corp` as `brown.words(categories='romance')`.  From there, we defined `lc_corp` as the lowercased words, `pairs_in_cmu` as the bigrams in which both words have pronunciations, and `cfd` as a conditional frequency distribution over those bigrams.  Then we defined `num_next` as a list of how many continuations each word has, sorted it into `sorted_next`, and built `good_starts` from that.

If you were to retrace those steps after defining `corp` to draw words from the "science_fiction" genre instead, how might our haikus change?

In the end, you need to wind up with `good_starts`, `next_words`, and `cfd` defined for the corpus before `construct_better_line()` will use the new corpus.

---

## TASK 13 Make science fiction haikus

Produce a couple of haikus based on the word sequencing in the "science_fiction" genre within the Brown corpus, and comment on how they seem to differ or not from the ones we built with the "romance" genre.

In [None]:
# Answer 13: retrace the steps from corpus to haiku to make "science_fiction" haikus

**Answer 13**. (markdown)

---

## TASK 14 Rhyming (extra, advanced)

Given that this is a pronunciation dictionary, we could also use it to make rhyming poetry.  And given that the stress is retrievable, we could also use it to assess meter.  You could use it to write a limerick, in fact, if we assume a limerick looks like:

```
- x - - x - - x - (a)
- x - - x - - x - (a)
- - x - - x (b)
- - x - - x (b)
- x - - x - - x - (a)
```

To be frank, at the point of handing this out, I haven't tried.  But we can see what kind of thing it would require.  First, determining the meter of the words so we can limit the continuation options in a line not just by syllable counts but by matching meter.  And then a way to see when two lines rhyme.  We could say a rhyme occurs when the last vowel and the following phones match.  In limericks, the (a) rhyme is actually usually deeper: the match is between the penultimate vowel and the material following it.  This can cross word boundaries (*Nantucket*, *bucket*, *struck it*).  Rhyming is hard enough that it might be better to start at the end of the line (with the rhyme) and work backwards.

This is an open-ended and optional project, but if you found the haikus too easy, it might be more fun to see if you can generate limericks.

In [None]:
print(pro['nantucket'])
print(pro['bucket'])
print(pro['struck'])
print(pro['it'])

In [None]:
print(pro['car'])
print(pro['bar'])
print(pro['star'])
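
As a starting point for the simple definition of rhyme given above (the last vowel and the following phones match), here is a rough sketch.  It is not a worked-out solution: it uses only the first pronunciation of each word, it does not handle the deeper limerick rhyme or rhymes across word boundaries, and the function names are hypothetical.  The entries in `toy_pro` are hand-typed in the CMU format for illustration.

```python
# Hand-typed entries in the CMU format (hypothetical subset, standing in for pro)
toy_pro = {
    'car': [['K', 'AA1', 'R']],
    'bar': [['B', 'AA1', 'R']],
    'star': [['S', 'T', 'AA1', 'R']],
    'it': [['IH1', 'T']],
}

def rhyme_part(phones):
    """Return the last vowel (a phone ending in a digit) and everything after it."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1].isdigit():
            return phones[i:]
    return []  # no vowel found

def rhymes(word1, word2, pronunciations):
    """Compare the rhyme parts of the first pronunciation of each word."""
    part1 = rhyme_part(pronunciations[word1][0])
    part2 = rhyme_part(pronunciations[word2][0])
    return len(part1) > 0 and part1 == part2

print(rhyme_part(toy_pro['star'][0]))  # ['AA1', 'R']
print(rhymes('car', 'star', toy_pro))  # True
print(rhymes('car', 'it', toy_pro))    # False
```

Note that this comparison includes the stress digit, so two vowels with the same sound but different stress would not count as rhyming.  Whether that is what you want is a design decision you would have to make.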