In my [previous notebook](https://fpgmaas.nl/2019-12-06-1.0-summary-statistics-upvotes-and-awards/), I did some initial analysis on the dataset that contains all poems by [/u/poem_for_your_sprog](https://reddit.com/u/poem_for_your_sprog). I noted there that the rhythm, or poetic meter, of the poems is one of the main characteristics that make these poems so appealing to me.

In this notebook, I want to get to know a bit more about poetic meter, and explore the meters that sprog utilizes in his poems on [/r/AskReddit](https://www.reddit.com/r/AskReddit/). A little disclaimer before we start: I did not know anything about poetry when I started this notebook. Well, that's not entirely true; I knew that in poems words usually rhyme. That's about it. Anyway, I don't claim that this notebook is exhaustive or that everything in it is correct, so if you have any feedback or suggestions, please do let me know!

Oh, by the way; all code can be found on [GitHub](https://github.com/fpgmaas/poems).

Let's go exploring!

In [None]:
%%capture
%load_ext autoreload
%autoreload 2

import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
import string
import datetime as dt
import pandas as pd
import numpy as np
import time
import re
import matplotlib.pyplot as plt 
from string import punctuation
from plotly.offline import download_plotlyjs, init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=False)

from src.utils import print2_list, print2, export_ipynb_for_github_pages
from src.plotly import plot_histogram, plot_timeline, plot_horizontal_bar, plot_heatmap, \
plot_grouped_scatter, plot_multiple_timelines, plot_meter, plot_grouped_boxplot, plot_overlayed_histogram
from src.meter import get_word_scansion, get_line_scansion, get_syllables_per_line_combined, \
combine_line_scansions, merge_lines, get_known_meter
from src.data import load_data

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in recent pandas

df = load_data('data/comments.txt', False)
run_date = dt.date(2019,12,20)
df = df[df['date']<=run_date]
df = df[df['type'] == 'poem']

# Understanding poetic meter

To explore poetic meter, let's start by taking a look at one of my favourite poems by [/u/poem_for_your_sprog](https://reddit.com/u/poem_for_your_sprog), which contains a message about dealing with regret:
```
I should have hurried youth, in truth,
And moved more quickly on -
I should have made the most of youth,
Before the time was gone.

I should have followed fancy, free,
Before it thought to fade -
I should have picked a good degree,
Or found myself a trade.

I should have stopped to stare above;
To share another's dreams -
I should have never welcomed love,
And lost it all, it seems.

No matter what the aim or end -
No matter what you do -
Regrets are part of life, my friend:

Don't let them conquer you.
```
[SOURCE](https://www.reddit.com/r/AskReddit/comments/4fsds1/adults_from_reddit_what_do_you_regret_most_from/d2bmumz/)

One of the reasons I like this poem is its clear and uplifting message. But another reason this poem stands out to me is its very apparent rhythm. If you say this poem out loud, you are clearly able to hear it:

```
I SHOULD have HURried YOUTH, in TRUTH,
And MOVED more QUICKly ON -
```

Or, a more visual way of representing this:

In [None]:
colorscale=['#d2d2d2','#1f77b4']
z_text =[['i', 'should','have','hur','ried','youth','in','truth'],
    ['and','moved','more','quick','ly','on','','']]
z =[[0,1,0,1,0,1,0,1],
    [0,1,0,1,0,1,np.nan,np.nan]]
fig = plot_meter(  
    text = z_text[::-1],
    meter = z[::-1],
    title = '',
    colorscale = colorscale
)
fig.show(config={'displayModeBar': False})

Here, the syllables in the blue squares are stressed, and the syllables in the grey squares are unstressed. If you have trouble hearing the rhythm, try to read the lines out loud, stressing the syllables in the grey squares. Feels unnatural, doesn't it? Now try the same, but stressing the syllables in the blue squares. That should be a lot more comfortable.

So we could say that the rhythm of the two lines above is unstressed-stressed-unstressed-stressed-unstressed-stressed-unstressed-stressed-unstressed-stressed-unstressed-stressed-unstressed-stressed. Luckily, as with any other art, science, religion or discipline, things in poetry tend to have names to make talking about them a bit easier. For example, the meter of the first line is iambic tetrameter, whereas the second line is iambic trimeter. To break that down: first, they are both iambic, because they consist of iambic feet. An iambic foot is an unstressed syllable followed by a stressed syllable, so a grey square followed by a blue square in the visualization above. The first line consists of four iambic feet (tetrameter), and the second line consists of three iambic feet (trimeter). The nomenclature for the number of feet per line is as follows:

- monometer (1) or foot
- dimeter (2)
- trimeter (3)
- tetrameter (4)
- pentameter (5)
- hexameter (6)
- ...
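
To make the naming scheme concrete, here is a minimal sketch that names a line from its pattern of stresses. It assumes a line is written as a string of `0`'s (unstressed) and `1`'s (stressed); the function `name_meter` and the lookup table are hypothetical helpers made up for this illustration, not part of the notebook's code.

```python
# Names for the number of feet per line, as in the list above.
FOOT_COUNT_NAMES = {1: 'monometer', 2: 'dimeter', 3: 'trimeter',
                    4: 'tetrameter', 5: 'pentameter', 6: 'hexameter'}


def name_meter(scansion, foot='01', foot_name='iambic'):
    """Name a line whose scansion is an exact repetition of `foot`.

    Returns None if the line is not a whole number of feet, or if the
    number of feet has no name in FOOT_COUNT_NAMES.
    """
    n_feet, remainder = divmod(len(scansion), len(foot))
    if remainder or scansion != foot * n_feet or n_feet not in FOOT_COUNT_NAMES:
        return None
    return f'{foot_name} {FOOT_COUNT_NAMES[n_feet]}'


print(name_meter('01010101'))  # I SHOULD have HURried YOUTH, in TRUTH
print(name_meter('010101'))    # And MOVED more QUICKly ON
```

The first call has four repetitions of the iambic foot and the second has three, which matches the tetrameter/trimeter naming described above.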

Two other commonly used types of feet are the trochaic foot (stressed, unstressed), and the anapestic foot (unstressed, unstressed, stressed). To illustrate, examples of the three types of feet we have encountered until now are:

In [None]:
example_feet = {
    'iambic foot' :
        {
        'z_text' : [['ex','ist']],
        'z' : [[0,1]]
        },
    'trochaic foot' :
        {
        'z_text' : [['ti','ger']],
        'z' : [[1,0]]
        },
    'anapestic foot' :
        {
        'z_text' : [['un','der','stand']],
        'z' : [[0,0,1]]
        }
}

for key in example_feet.keys():
    fig = plot_meter(  
        text = example_feet[key]['z_text'][::-1],
        meter = example_feet[key]['z'][::-1],
        title = key,
        colorscale = colorscale
    )
    fig.show(config={'displayModeBar': False})

By combining these three feet in a large variety of ways, we can construct the meter for most of the poems by /u/poem_for_your_sprog. To perform analysis on the meters that sprog uses however, we should first identify which meter each poem uses. We could do that by reading each poem and assigning the meter by hand, but that will be very time consuming. So let's see if we can use [Python](https://www.python.org/) to do this for us! If you are interested in learning more about my approach for tackling this problem, continue reading. If you are just interested in the results, feel free to skip the next section.

# Identifying meter in poetry using Python

In order to identify which meter each poem uses, we need a computer to determine which syllables are stressed in a sentence, and which are not. It turns out that this can be quite tricky. Compare for example the following two lines with the example given earlier:

In [None]:
z_text = [x.split(' ') for x in 'i al ways hoped i\'d have some more>an oth er year or two'.split('>')]
z_text[1] = z_text[1] +['','']
z =[[0,1,0,1,0,1,0,1],[0,1,0,1,0,1,np.nan,np.nan]]
fig = plot_meter(  
    text = z_text[::-1],
    meter = z[::-1],
    title = '',
    colorscale = colorscale
)
fig.show(config={'displayModeBar': False})

Both examples contain the word "have", but in the first example it was unstressed, while in the latter example it's stressed. Why? Beats me. It's just what happens in my head when I read it. Sadly, that's not very useful information for a computer. It turns out that the problem of teaching a computer to find the right [scansion](https://en.wikipedia.org/wiki/Scansion) (marking the stressed and unstressed syllables) is quite a complex one. I found a Python library called [pronouncing](https://pronouncing.readthedocs.io/en/latest/tutorial.html#counting-syllables) that can help determine the stressed and unstressed syllables of a single word. For example:

```
import pronouncing
pronounciation = pronouncing.phones_for_word('have')
pronouncing.stresses(pronounciation[0])
> 1
```

```
import pronouncing
pronounciation = pronouncing.phones_for_word('another')
pronouncing.stresses(pronounciation[0])
> 010
```

where `1` denotes a stressed syllable, and `0` denotes an unstressed syllable. For "have", it gives us a single stressed syllable, which as we saw in the example may or may not be correct based on the context. For "another" it returns `010`, which is in line with the scansion of our second example. I think this is correct regardless of context; try to pronounce "another" like "ANother" or "anothER" and you'll understand why. 
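
To illustrate why word-level stresses alone are not enough, here is a small sketch that builds a line scansion by concatenating per-word stress patterns. The lookup table is a hypothetical stand-in for a pronunciation dictionary: the entry for "another" comes from the example above, while the single-syllable entries are assumptions made for this illustration.

```python
import re

# Hypothetical stand-in for a pronunciation lookup (word -> stress pattern).
# "another" -> "010" is taken from the example above; the one-syllable
# entries are assumed for illustration only.
WORD_STRESSES = {'another': '010', 'year': '1', 'or': '1', 'two': '1'}


def naive_line_scansion(line, stresses=WORD_STRESSES):
    """Concatenate per-word stress patterns; '?' marks an unknown word."""
    words = re.findall(r"[a-z']+", line.lower())
    return ''.join(stresses.get(word, '?') for word in words)


print(naive_line_scansion('Another year or two'))
```

Under these assumptions every one-syllable word comes back as stressed, so the result differs from the `010101` we hear when reading the line out loud.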

So we are definitely not going to get a perfect scansion for each poem by simply using this `pronouncing` library. However, I came up with a method to use it to at least get the poem's primary meter:

- For each line, determine the scansion by determining the scansion for each word. This will give us a string of 0's and 1's.
- Divide the lines of the poem into groups by the number of syllables per line. (So for example, lines with 9 syllables are grouped together.)
- Try to find the closest match for each line in a set of known meters, also represented as 0's and 1's.
- Per group, find the known meter that was matched most often.
- Now we have a set of meters that together make up the meter of the poem.
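
The steps above can be sketched roughly as follows. This is not the implementation from the repository; it is a simplified illustration that assumes "closest match" means the known meter of the same length with the fewest differing positions. The names `closest_meter` and `primary_meters` and the small set of meters are hypothetical.

```python
from collections import Counter, defaultdict

# A small subset of known meters, for illustration.
KNOWN_METERS = {
    'iambic tetrameter': '01010101',
    'trochaic tetrameter': '10101010',
    'iambic trimeter': '010101',
    'trochaic trimeter': '101010',
}


def closest_meter(scansion, known_meters=KNOWN_METERS):
    """Return the known meter of equal length with the fewest mismatches."""
    candidates = {name: pattern for name, pattern in known_meters.items()
                  if len(pattern) == len(scansion)}
    if not candidates:
        return None
    return min(candidates,
               key=lambda name: sum(a != b for a, b in zip(scansion, candidates[name])))


def primary_meters(line_scansions):
    """Group lines by syllable count and keep the most common match per group."""
    groups = defaultdict(list)
    for scansion in line_scansions:
        groups[len(scansion)].append(closest_meter(scansion))
    result = []
    for length in sorted(groups, reverse=True):
        matches = [m for m in groups[length] if m is not None]
        if matches:
            result.append(Counter(matches).most_common(1)[0][0])
    return result


# Noisy word-level scansions of a four-line stanza (made up for this example).
poem = ['01110101', '010111', '01010111', '110101']
print(primary_meters(poem))
```

Even though none of the four lines matches a known meter exactly, each is only one position away from an iambic pattern, so the majority vote per group still recovers a tetrameter/trimeter combination.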

There are a few more steps in this process to make it work. In the previous notebook, we saw that sprog quite often splits a line over two or even more lines. For example;

In [None]:
z_text = [['so','throw','off','the','chains','of','op','pres','sion','said','he'],
          ['be','fair','ly','un','fet','terred','','','','',''],
          ['and','free','to','be','free','','','','','','']]

z =[[0,1,0,0,1,0,0,1,0,0,1],
   [0,1,0,0,1,0,np.nan,np.nan,np.nan,np.nan,np.nan],
   [0,1,0,0,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]]
fig = plot_meter(  
    text = z_text[::-1],
    meter = z[::-1],
    title = '',
    colorscale = colorscale
)
fig.show(config={'displayModeBar': False})

To accurately determine the meter of the poem, we could recognize that the latter two lines are actually one line split into two, and if we merge them back together we get two lines with the same meter:

In [None]:
z_text = [['so','throw','off','the','chains','of','op','pres','sion','said','he'],
          ['be','fair','ly','un','fet','terred','and','free','to','be','free']]

z =[[0,1,0,0,1,0,0,1,0,0,1],[0,1,0,0,1,0,0,1,0,0,1]]
fig = plot_meter(  
    text = z_text[::-1],
    meter = z[::-1],
    title = '',
    colorscale = colorscale
)
fig.show(config={'displayModeBar': False})

Now we recognize this as being anapestic tetrameter, with the first unstressed syllable omitted. This is called iambic substitution, since the first anapestic foot is replaced with an iambic foot. 

In my code I have built a function that looks for these kinds of lines and combines them; it looks for lines that together have the same number of syllables as a longer line in that poem (in this example 6 + 5 = 11). If such a set of lines is found, they are merged into a single line. I'm not 100% sure if this is the right thing to do when analyzing the meter in a poem, but it does seem to make a lot of sense to me. Besides, if poets get to split lines and call that 'artistic freedom', I think data scientists are allowed some 'scientific freedom' and merge them back together.
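
A simplified sketch of this idea is shown below. It only considers pairs of adjacent lines and only compares against the longest line in the poem, which is an assumption made to keep the example short; the functions `find_lines_to_merge` and `merge_pairs` are hypothetical and not the ones used in the repository.

```python
def find_lines_to_merge(syllable_counts):
    """Return (i, i + 1) pairs of adjacent lines whose syllable counts
    add up to the syllable count of the longest line in the poem."""
    target = max(syllable_counts, default=0)
    pairs = []
    i = 0
    while i < len(syllable_counts) - 1:
        first, second = syllable_counts[i], syllable_counts[i + 1]
        if first < target and second < target and first + second == target:
            pairs.append((i, i + 1))
            i += 2  # both lines are used up, so skip past them
        else:
            i += 1
    return pairs


def merge_pairs(lines, pairs, sep=' '):
    """Join each flagged pair of lines into a single line."""
    first_halves = {i: j for i, j in pairs}
    second_halves = {j for _, j in pairs}
    merged = []
    for index, line in enumerate(lines):
        if index in second_halves:
            continue
        if index in first_halves:
            line = line + sep + lines[first_halves[index]]
        merged.append(line)
    return merged


lines = ['so throw off the chains of oppression said he',
         'be fairly unfettered',
         'and free to be free']
pairs = find_lines_to_merge([11, 6, 5])
print(merge_pairs(lines, pairs))
```

A poem that alternates lines of 8 and 6 syllables is left untouched by this check, since no two adjacent lines add up to 8.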

Now, I can talk a lot more about this process, since it took me quite some time to build something that performs satisfactorily, but I propose we just continue with applying the logic to sprog's poems and take a look at the results!

# Poetic meter in the poems by /u/poem_for_your_sprog

The first thing that we might be interested in: what is sprog's favourite meter? Or better: what are his ten favourite meters?

In [None]:
# * = catalectic, i.e. the last (unstressed) syllable is omitted
# ** = iambic substitution, i.e. the first (unstressed) syllable is omitted from an anapestic foot

known_meters = {
    'iambic hexameter'       : '010101010101',
    'iambic hexameter*'      : '01010101010',
    'iambic pentameter'      : '0101010101',
    'iambic pentameter*'     : '010101010',
    'iambic tetrameter'      : '01010101',
    'iambic tetrameter*'     : '0101010',
    'iambic trimeter'        : '010101',
    'iambic trimeter*'       : '01010',
    'iambic dimeter'         : '0101',
    'iambic dimeter*'        : '010',
    'iambic monometer'       : '01',
    
    'anapestic tetrameter'   : '001001001001',
    'anapestic tetrameter**' : '01001001001',
    'anapestic trimeter'     : '001001001',
    'anapestic trimeter**'   : '01001001',
    'anapestic dimeter'      : '001001',
    'anapestic dimeter**'    : '01001',
    'anapestic monometer'    : '001',

    'trochaic hexameter'     : '101010101010',
    'trochaic hexameter*'    : '10101010101',
    'trochaic pentameter'    : '1010101010',
    'trochaic pentameter*'   : '101010101',
    'trochaic tetrameter'    : '10101010',
    'trochaic tetrameter*'   : '1010101',
    'trochaic trimeter'      : '101010',
    'trochaic trimeter*'     : '10101',
    'trochaic dimeter'       : '1010',
    'trochaic dimeter*'      : '101',
    'trochaic monometer'     : '10',
    
    'amphibrachic dimeter'   : '010010'
}
known_meters_inv = inv_map = {v: k for k, v in known_meters.items()}


# DETERMINE SCANSION ----------------------------------------------------------

# Determine the scansion of each poem, and which lines to combine based on this scansion.
df['poem_as_list'] = [poem.split('>') for poem in df['poem']]
df['scansion'] = [[get_line_scansion(line) for line in poem] for poem in df['poem_as_list']]
df['lines_to_combine'] = [combine_line_scansions(x) for x in df['scansion']]

# combine scansion and poem lines based on the suggested improvements.
df['scansion_modified'] = [merge_lines(row['scansion'], row['lines_to_combine']) for ix, row in df.iterrows()]
df['poem_modified_as_list'] = [merge_lines(row['poem_as_list'], row['lines_to_combine'], sep = ' ') for ix, row in df.iterrows()]
df['poem_modified'] = ['>'.join(x) for x in df['poem_modified_as_list']]

# Determine which of our known meters the poem is.
df['meter_list'] = [get_known_meter(x, known_meters_inv) for x in df['scansion_modified']]
df['meter'] = [', '.join(x) for x in df['meter_list']]


# FIND THE MOST COMMON METERS ----------------------------------------------------------
df_most_common_meters = (df
                         .groupby('meter')
                         .agg(n=('ups', len), 
                              avg_ups=('ups', 'mean'))
                         .sort_values('n',ascending=False)
                        )
df_most_common_meters.reset_index(inplace=True)
df_most_common_meters_10 = df_most_common_meters.head(10)

In [None]:
print('Total number of poems: {0}\nTotal number of poems in top 10 meters: {1} ({2:.1f}% of total)'.format(
    len(df),
    df_most_common_meters_10['n'].sum(),
    df_most_common_meters_10['n'].sum()/len(df)*100
))

In [None]:
fig = plot_horizontal_bar(
    labels = df_most_common_meters_10['meter'][::-1],
    values = df_most_common_meters_10['n'][::-1],
    title = 'The 10 most common meters in poems by u/poem_for_your_sprog',
    xaxis_title = 'Number of poems',
    yaxis_title='',
    figsize=(700,400)
)
fig.show()

Turns out sprog's most common poetic meter is a combined meter of iambic tetrameter and iambic trimeter; alternating lines with eight and six syllables. The graph requires some more explanation. First, the asterisks:

- \* : [catalectic line](https://en.wikipedia.org/wiki/Catalectic), i.e. the last foot is metrically incomplete; the last syllable is omitted.
- \*\* : [iambic substitution](https://en.wikipedia.org/wiki/Anapestic_tetrameter) i.e. the first unstressed syllable from the first anapestic foot is omitted.

Then, we also see one type of metrical foot that we have not encountered before: the amphibrachic foot. An amphibrachic foot consists of three syllables in the order unstressed, stressed, unstressed. Basically, an amphibrachic dimeter (010010) and an anapestic dimeter\*\* (01001) together make up an anapestic tetrameter\*\* (01001001001).
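
The claim that these two shorter lines add up to one longer line can be checked with a few lines of string manipulation. This is just a sanity check on the patterns; the helper `drop_first_syllable` is a hypothetical name for the iambic substitution marked with \*\*.

```python
ANAPEST = '001'
AMPHIBRACH = '010'


def drop_first_syllable(pattern):
    """Iambic substitution (**): omit the first unstressed syllable."""
    return pattern[1:]


amphibrachic_dimeter = AMPHIBRACH * 2                        # two amphibrachic feet
anapestic_dimeter_sub = drop_first_syllable(ANAPEST * 2)     # anapestic dimeter**
anapestic_tetrameter_sub = drop_first_syllable(ANAPEST * 4)  # anapestic tetrameter**

print(amphibrachic_dimeter + anapestic_dimeter_sub == anapestic_tetrameter_sub)
```

In other words, a poem in this combined meter can be read as anapestic tetrameter\*\* with every line broken in two after the sixth syllable.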

If this sounds like French to you (I'm assuming you're not French here. If you are; je m'excuse. Please replace French with a random language that you barely speak), don't worry; let's look at some examples of the ten most common meters to make this all a bit more clear:

In [None]:
meter_examples = {
    'iambic tetrameter, iambic trimeter':
    {
        'z_text' :  
        [
            ['i', 'should','have','hur','ried','youth','in','truth'],
            ['and','moved','more','quick','ly','on','','']
        ],
        'z' : 
        [
            [0,1,0,1,0,1,0,1],
            [0,1,0,1,0,1,np.nan,np.nan]
        ]
    },
    'anapestic tetrameter**':
    {
        'z_text' : 
        [
            ['so', 'throw','off','the','chains','of','op','pres','sion','said','he'],
            ['be','fair','ly','un','fet','terred','and','free','to','be','free']
        ],
        'z' : 
        [
            [0,1,0,0,1,0,0,1,0,0,1],
            [0,1,0,0,1,0,0,1,0,0,1]
        ]
    },
    'iambic tetrameter':
    {
        'z_text' : 
        [
            ['from','time','to','time','i','think','of','then'],
            ['i','turn','my','gaze','be','fore','a','gain']
        ],
        'z' : 
        [
            [0,1,0,1,0,1,0,1],
            [0,1,0,1,0,1,0,1]
        ]
    },
    'trochaic tetrameter, trochaic tetrameter*':
    {
        'z_text' : 
        [
            ['would', 'you','suck','er','punch','a','mon','key?'],
            ['would','you','up','per','cut','a','bear?','']
        ],
        'z' : 
        [
            [1,0,1,0,1,0,1,0],
            [1,0,1,0,1,0,1,np.nan]
        ]
    },
        'trochaic tetrameter':
    {
        'z_text' : 
        [
            ['wave','good','bye','your','e','go','bro','ther'],
            ['kiss','your','kids','and','call','your', 'mo','ther']
        ],
        'z' : 
        [
            [1,0,1,0,1,0,1,0],
            [1,0,1,0,1,0,1,0]
        ]
    },
    'trochaic tetrameter, trochaic trimeter*':
    {
        'z_text' : 
        [
            ['no','one\'s','quite','as','strong','as','stan','ley'],
            ['stan','ley\'s','been','to','war','','','']
        ],
        'z' : 
        [
            [1,0,1,0,1,0,1,0],
            [1,0,1,0,1,np.nan,np.nan,np.nan]
        ]
    },    
    'trochaic tetrameter*':
    {
        'z_text' : 
        [
            ['when', 'you\'re','full','of','doubt','and','fear'],
            ['i\'ll','be','with','you','wait','ing','here']
        ],
        'z' : 
        [
            [1,0,1,0,1,0,1],
            [1,0,1,0,1,0,1]
        ]
    },
    'iambic dimeter':
    {
        'z_text' : 
        [
            ['my','name','is','dog'],
            ['and','e','ven', 'though']
        ],
        'z' : 
        [
            [0,1,0,1],
            [0,1,0,1]
        ]
    },
    'amphibrachic dimeter, anapestic dimeter**':
    {
        'z_text' : 
        [
            ['you\'re','sea','soned','in','sad','ness'],
            ['you\'re','prac','ticed','in','doubt','']
        ],
        'z' : 
        [
            [0,1,0,0,1,0],
            [0,1,0,0,1,np.nan]
        ]
    },
    'anapestic tetrameter':
    {
        'z_text' : 
        [
            ['there\'s','a','laugh','on','her','lips','and','a','light','in','her','eye'],
            ['and','a','warmth','in','her','voice','and','a','smile','in','her','sigh']
        ],
        'z' : 
        [
            [0,0,1,0,0,1,0,0,1,0,0,1],
            [0,0,1,0,0,1,0,0,1,0,0,1]
        ]
    },
    'anapestic dimeter':
    {
        'z_text' : 
        [
            ['i','re','mem','ber','the','way'],
            ['that','i','thought','a','bout','love']
        ],
        'z' : 
        [
            [0,0,1,0,0,1],
            [0,0,1,0,0,1]
        ]
    },
    
}

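# Series.items() yields (index, value) pairs; the index is the 0-based rank of the meter.
# (Series.iteritems() was removed in pandas 2.0.)
for rank, meter in df_most_common_meters_10['meter'].items():
    fig = plot_meter(  
        text = meter_examples[meter]['z_text'][::-1],
        meter = meter_examples[meter]['z'][::-1],
        title = str(rank+1) +'. ' + meter,
        colorscale = colorscale
    )
    fig.show(config={'displayModeBar': False})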

To get a better idea of the application of these meters in the poetry by sprog, and to get an idea of how well our meter-determining-logic performs, below is a plot of the ten most upvoted poems for each of the ten most commonly used meters. 

Note that you can hover over the markers to read the poem, and you can hide/show a group of poems by clicking the corresponding entry in the legend on the right.

In [None]:
# Create a DataFrame with the top 10 poems based on upvotes for each in the 10 most commonly used meters.
df_10_poems_top_meters = (df
                          [df['meter'].isin(df_most_common_meters_10['meter'])]            
                         .groupby(["meter"])
                         .apply(lambda x: x.sort_values(["ups"], ascending = False))
                         .reset_index(drop=True)
                         .groupby('meter')
                         .head(10))

fig = plot_grouped_scatter(
    x = df_10_poems_top_meters['average_line_length'],
    y = df_10_poems_top_meters['ups'],
    groups = df_10_poems_top_meters['meter'],
    unique_groups = df_most_common_meters_10['meter'],
    text = np.array([re.sub('>','<br>',comment) for comment in df_10_poems_top_meters ['poem']]),
    title = 'Top rated poems in the 10 most common meters by u/poem_for_your_sprog',
    xaxis_title = 'Average line length',
    yaxis_title = 'Upvotes'
)
fig.show()

Judging from these examples, it seems our simple method for determining the meter of a poem actually performs quite well. Of course, this is only a small subset of the total poems and we don't see how many poems that should be in these groups we misclassified, but it'll have to do for now. (I just put that last statement there to keep my colleagues happy, otherwise they will start harassing me on Monday. To all the others: it's amazing! Look at how well it performs! Isn't it wonderful?)

Looking at the upvotes, it does seem like we can detect a pattern here, even though there are only ten markers per group in the plot. Poems written in certain meters seem to get more upvotes than poems in other meters. Initially, I thought about just looking at the average number of upvotes per group:

In [None]:
fig = plot_grouped_scatter(
    x = df_most_common_meters_10['meter'],
    y = df_most_common_meters_10['avg_ups'],
    groups = df_most_common_meters_10['meter'],
    unique_groups = df_most_common_meters_10['meter'],
    text = np.array(['{}<br>Average upvotes: {}<br>n: {}'.format(row['meter'],round(row['avg_ups'],1),row['n'])  
                       for index, row in (df_most_common_meters_10).iterrows()]),
    title = 'Average number of upvotes of poems in the 10 most common meters <br>by u/poem_for_your_sprog',
    xaxis_title = '',
    yaxis_title = 'Average upvotes'
)
fig.show()

However, this might lead us to draw the wrong conclusions, since the dataset contains many outliers. For example, sprog's [most upvoted poem](https://www.reddit.com/r/AskReddit/comments/598qrb/health_inspectors_of_reddit_whats_the_worst/d96si4d/) has over 37k upvotes. Since at the time of writing there are 55 other poems written in iambic dimeter, this single poem raises the group's average by roughly 660 upvotes (37,000 divided by the 56 poems in the group) compared to a scenario in which it had zero upvotes; that's a lot of impact for one observation! We will get more robust results by looking at the [median](https://en.wikipedia.org/wiki/Median) instead of the mean. A good way of visualizing this type of data is a boxplot, so let's try that out:

In [None]:
fig = plot_grouped_boxplot(
    df = df,
    obs_col = 'ups',
    group_col = 'meter',
    unique_groups = df_most_common_meters_10['meter'],
    title = "Box Plot of the number of upvotes per poem"
)
fig.show()

One note to fellow Redditors; please stop upvoting the ['i lik the bred'-poem](https://www.reddit.com/r/AskReddit/comments/598qrb/health_inspectors_of_reddit_whats_the_worst/d96si4d/). It's screwing up my y-axis.

Anyway; this plot gives us much better results. Some types of meter are more popular than others, but the differences are not as big as our initial plot based on the mean number of upvotes led us to believe.
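
To see why the median is so much less sensitive to a single viral poem than the mean, here is a tiny sketch using Python's `statistics` module. The upvote counts are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Made-up upvote counts for illustration only (not taken from the dataset):
# 55 ordinary poems, plus one viral poem with 37,000 upvotes.
ordinary = [100] * 55
with_outlier = ordinary + [37_000]

print('without outlier: mean', mean(ordinary), 'median', median(ordinary))
print('with outlier:    mean', round(mean(with_outlier), 1), 'median', median(with_outlier))
```

Adding the one viral poem shifts the mean by several hundred upvotes, while the median does not move at all.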

Now let's take a look at what we encountered in my [previous notebook](https://fpgmaas.nl/2019-12-06-1.0-summary-statistics-upvotes-and-awards/). There, I created a histogram with the average number of characters per line per poem. We found that there was a large peak around 30 characters, and a smaller second peak around 46/47 characters. After visually inspecting the poems that caused the second peak, we hypothesized that it was mainly caused by poems in anapestic tetrameter. Since we have classified the poems now, we can actually test this hypothesis. Below is a plot with multiple histograms of the average line length in characters per poem, overlaid on top of each other. Note that again you can hide or show histograms by clicking the corresponding entries in the legend on the left side.

In [None]:
fig = plot_overlayed_histogram(
    df = df,
    obs_col = 'average_line_length',
    group_col = 'meter',
    unique_groups = df_most_common_meters_10['meter'],
    title = 'Histogram of the average number of characters per line <br>for the ten most commonly used meters',
    xaxis_title = 'characters',
    yaxis_title = 'percentage'
)
fig.show()

And indeed; the peak around 46/47 characters consists entirely of poems in anapestic tetrameter\*\* - at least, for the poems in the top 10 most commonly used meters, which are displayed here. It's also interesting to see that whereas most of the histograms follow approximately a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), some are [left-skewed](https://en.wikipedia.org/wiki/Skewness). This is because some lines are split over multiple lines, as we saw before. By inspecting the plot above, we can see that this mostly happens in poems with anapestic tetrameter\*\* and iambic tetrameter.

Now, let's see how the use of meter in sprog's poems has changed over time. Below are eleven timelines: one for each of the ten most commonly used meters, and one for all others. Each timeline shows, for each month, the percentage of poems that were written in that meter during that month.

In [None]:
# Create a DataFrame with the total number of poems per meter per month, and the fraction of total
df_meters_per_month = df[['date','meter']].copy()  # copy to avoid pandas' SettingWithCopyWarning below
df_meters_per_month['date'] = pd.to_datetime(df_meters_per_month['date'])
df_meters_per_month['month'] = df_meters_per_month['date'].dt.to_period('M')
df_meters_per_month.loc[~df_meters_per_month['meter'].isin(df_most_common_meters_10['meter']), 'meter'] = 'other'
df_meters_per_month = (df_meters_per_month
                        .groupby(['month','meter'])['month']
                        .agg(n='count')
                        .unstack('meter')
                        .fillna(0)
                        .stack('meter')
                        .reset_index(inplace=False)
)
df_meters_per_month['month'] = [x.to_timestamp() for x in df_meters_per_month['month']]
df_meters_per_month['month_total'] = df_meters_per_month['n'].groupby(df_meters_per_month['month']).transform('sum')
df_meters_per_month['frac']=df_meters_per_month['n']/df_meters_per_month['month_total']

fig = plot_multiple_timelines(
    x = df_meters_per_month['month'],
    y = df_meters_per_month['frac'],
    groups = df_meters_per_month['meter'],
    unique_groups = list(df_most_common_meters_10['meter']) + ['other'],
    text=np.array([
        "{}<br>{}: {:.0f} ({:.2%})".format(
                                         row['meter'],
                                         row['month'].strftime("%b %Y"),
                                         row['n'],
                                         row['frac']) 
                   for ix, row in df_meters_per_month.iterrows()]),
    title='Fraction of poems per meter per month',
    xaxis_title = 'month',
    yaxis_title ='fraction of poems',
    figsize=(1000,1200)
)

fig.show()

The first thing that's very interesting to see: in the first few months, the category 'other' accounts for a very large fraction of poems. This seems to be mainly caused by the fact that meter was a less prominent feature in the poems from that period, and there was not always a single recognizable meter in the poems. Basically, I think what we are seeing there is sprog developing his own characteristic style in which meter plays such an important role.

What is also interesting to see is that halfway through 2016, sprog started experimenting a bit more. The use of the always-winning combo of iambic tetrameter, iambic trimeter starts to decline. Around the same period, new meters make their appearance:
- trochaic tetrameter, trochaic tetrameter\*
- amphibrachic dimeter, anapestic dimeter\*\*
- trochaic tetrameter\*
- iambic dimeter
- anapestic dimeter

# Other meters

In [None]:
other_nice_meters = [
    # meter not found
    '',
    # nice meters
    'anapestic dimeter, anapestic trimeter',
    'anapestic tetrameter',
    'anapestic dimeter, anapestic dimeter**',
    # wrong meter
    'anapestic trimeter**, iambic trimeter'
]

df_other_meters = df_most_common_meters[df_most_common_meters['meter'].isin(other_nice_meters)]

To round it off, let's look at a selection of meters that did not make the top 10, but ended up in the range of numbers 11 to 20:

In [None]:
fig = plot_horizontal_bar(
    labels = df_other_meters['meter'][::-1],
    values = df_other_meters['n'][::-1],
    title = 'A selection of meters #11 to #20 in poems by u/poem_for_your_sprog',
    xaxis_title = 'Number of poems',
    yaxis_title='',
    figsize=(800,300)
)
fig.show()

These three I quite like;

- anapestic dimeter, anapestic trimeter
- anapestic tetrameter
- anapestic dimeter, anapestic dimeter\*\*

so I decided to plot some poems in a graph so you can check them out yourself. Especially the *anapestic dimeter, anapestic trimeter* has a nice flow to it:

In [None]:
z_text = [
    ['i','want','na','tu','ral','hair','','',''],
    ['and','a','bos','so','my','pair','','',''],
    ['and','a','de','li','cate','e','le','gant','face'] 
]
z =[
    [0,0,1,0,0,1,np.nan,np.nan,np.nan],
    [0,0,1,0,0,1,np.nan,np.nan,np.nan],
    [0,0,1,0,0,1,0,0,1]
]
fig = plot_meter(  
    text = z_text[::-1],
    meter = z[::-1],
    title = 'anapestic dimeter, anapestic trimeter',
    colorscale = colorscale
)
fig.show(config={'displayModeBar': False})

Then there's the *anapestic trimeter\*\*, iambic trimeter*. These seem to be misclassified poems; they are actually *iambic tetrameter, iambic trimeter*. (Both anapestic trimeter\*\* and iambic tetrameter have eight syllables per line, so it is plausible that imperfect word-level stresses occasionally tip the match the wrong way.) At the time of writing, however, only 29 poems are classified as the former, while 750 are classified as the latter. So in general, this does not seem to be a major problem, and the simple classification method seems to be quite accurate.

The last meter that I included is the empty meter. These are poems in which my classification method failed to assign any meter at all. The reason for this is that the library I used is based on the [CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict), while many of the words in these poems do not occur in a dictionary.

Here's the plot I promised earlier so you can see for yourself:

In [None]:
# Create a DataFrame with the top 10 poems based on upvotes for each of the selected other meters.
df_poems_other_meters = df[df['meter'].isin(df_other_meters['meter'])]
df_10_poems_other_meters = (df_poems_other_meters
                 .groupby(["meter"])
                 .apply(lambda x: x.sort_values(["ups"], ascending = False))
                 .reset_index(drop=True)
                 .groupby('meter')
                 .head(10))

fig = plot_grouped_scatter(
    x = df_10_poems_other_meters ['average_line_length'],
    y = df_10_poems_other_meters ['ups'],
    groups = df_10_poems_other_meters['meter'],
    unique_groups = df_other_meters['meter'],
    text = np.array([re.sub('>','<br>',comment) for comment in df_10_poems_other_meters ['poem']]),
    title = 'Top rated poems in a selection of meters #11 to #20 by u/poem_for_your_sprog',
    xaxis_title = 'Average line length',
    yaxis_title = 'Upvotes'
)
fig.show()

That was all for now! I learned a lot about meter in poetry by creating this notebook, and I hope I was able to share some of that with you. But most of all I hope you enjoyed reading this!

In [None]:
front_matter_str = """---
title: "Poetry & Data II: Meter"
subtitle: Analyzing the meter in the poetry of /u/poem_for_your_sprog on Reddit
tags: [python, poetry, poem_for_your_sprog, reddit]
layout: html_post
---"""

from IPython.display import display, Javascript
display(Javascript('IPython.notebook.save_checkpoint();'))
import time
time.sleep(10)
export_ipynb_for_github_pages(filename="2.0-meter.ipynb",
                              front_matter_str=front_matter_str,
                              prefix = run_date.strftime('%Y-%m-%d') +'-')