I am a really big fan of the poems by /u/poem_for_your_sprog on Reddit. For those of you who are not familiar with him yet: he writes short poems in response to other users' comments on AskReddit threads. To give you an example, one that I particularly like is the following, which was written in response to a thread of gripping stories by ICU workers, who despite their best efforts are not always able to save every patient they meet:

```
You’ll weather the wind and the rain and the rough -
And sometimes you’ll try but it won’t be enough.

You did what you could,
but it’s not up to you.

You did what you could,
and that’s all you can do.
```

I always find it difficult to explain why I love these poems so much. Some, like the one above, stand out in their simplicity: six short lines that carry a message that speaks to many of us. But there are more elaborate ones, and really funny ones too. I think the one thing that they all have in common is their rhythm, or their 'flow'.

However, since I am not as good with words as /u/poem_for_your_sprog, let me use numbers to analyze some of the work he has done to date!

In [None]:
%%capture
%load_ext autoreload
%autoreload 2

In [None]:
import string
import datetime as dt
import pandas as pd
import numpy as np
import time
import re
import matplotlib.pyplot as plt 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=False)

from src.utils import print2_list, print2
from src.plotly import plot_histogram, plot_timeline, plot_horizontal_bar
from src.data import load_data

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; -1 is deprecated)

In [None]:
%%capture
import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('stopwords')
nltk.download('punkt')

# Load the data

In [None]:
df = load_data('data/comments.txt', True)

# Comments per day

Let's start by simply looking at the number of poems over time. We can see below that on his most productive days, sprog provides us with about 5 or 6 poems, and his productivity seems to have increased somewhat over time. The outlier of 12 comments in June 2015 is the ['Ask Me Anything'](https://www.reddit.com/r/books/comments/3aungz/hi_im_sam_garland_aka_upoem_for_your_sprog_ive/), or AMA for short, where he answered questions from fellow Redditors.

In [None]:
df_comments_per_day= df.groupby(['date'])['date'].agg(n='count')
idx = pd.date_range(df_comments_per_day.index.min(), dt.datetime.today())
df_comments_per_day = df_comments_per_day.reindex(idx, fill_value=0)

In [None]:
fig = plot_timeline(
    x=df_comments_per_day.index,
    y=df_comments_per_day['n'],
    title='Number of comments on Reddit per day by u/poem_for_your_sprog',
    xaxis_title='Day',
    yaxis_title='Number of comments',
    annotations=[
        go.layout.Annotation(
            x='2015-6-23',
            y=12,
            xref="x",
            yref="y",
            text="AMA",
            showarrow=True,
            arrowhead=2,
            ax=-50,
            ay=0
        )
    ]
)
fig.show()

This plot was useful for identifying the AMA outlier, so let's remove the observations from that day from our dataset. However, the daily plot does not really help us in identifying a trend, so let's aggregate the data into monthly buckets to get a clearer view:

In [None]:
# Remove AMA comments
df = df[df['date']!=dt.date(2015,6,23)]
df.reset_index(inplace=True)

In [None]:
comments_per_month = df_comments_per_day.groupby(pd.Grouper(freq='M'))['n'].sum()

In [None]:
fig = plot_timeline(
    x=comments_per_month.index,
    y=comments_per_month,
    title='Number of comments on Reddit per month by u/poem_for_your_sprog',
    xaxis_title='Month',
    yaxis_title='Number of comments'
)
fig.show()

# Average line length

Sprog writes poems with very short lines as well as poems with longer ones. A histogram of the average number of characters per line in each poem should give us a better idea of the distribution:

In [None]:
fig = plot_histogram(
    x = df['average_line_length'],    
    params = {'xbins':dict(start=0,end=200,size=1)},
    title = 'Histogram of the average characters per line by u/poem_for_your_sprog',
    xaxis_title = 'Average number of characters per line',
    yaxis_title = 'Number of comments'
)

fig.show()

There are three clear outliers, which upon further inspection are not poems, or contain a lot of text alongside the poem. We will filter those out for now. Furthermore, the peak around 46 is interesting to see. My initial guess was that some poems are written in the rhyme scheme `abab`, while others are split out over shorter lines in the rhyme scheme `abcb defe`, so the former would on average have lines twice as long as the latter. But that does not seem to hold: in that case the first peak should be around 23 instead of 30.
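To make the line-length reasoning concrete, here is a minimal sketch of how an average line length could be computed. It assumes that lines within a poem are separated by `>` (as in the cells further down) and that the average is a plain mean of characters per non-empty line; the actual `average_line_length` column comes from `load_data`, which may compute it differently. The function name and the two example strings are made up for illustration.

```python
def average_line_length(poem, sep='>'):
    """Mean number of characters per non-empty line (hypothetical helper)."""
    lines = [line.strip() for line in poem.split(sep) if line.strip()]
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)

# The same couplet, once written on a single line and once split in two.
long_form = "you did what you could, but it's not up to you."
short_form = "you did what you could,>but it's not up to you."

print(average_line_length(long_form))
print(average_line_length(short_form))
```

The couplet averages 47 characters per line when written on one line and 23 when split in two, which is the factor of two that the initial guess relied on.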

It turns out that these are a set of poems that sprog usually writes in 'anapestic tetrameter'. Anapestic tetrameter is a metre with four anapestic feet per line. An anapestic foot is two unstressed syllables followed by a stressed one. So, denoting a stressed syllable as / and an unstressed syllable as x, a line of anapestic tetrameter can be denoted as follows:

> x x / x x / x x / x x /

However, sprog usually omits the first unstressed syllable, so it becomes:

>   x / x x / x x / x x /
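As a small illustration, both patterns can be generated from the definition of a foot. This is only a sketch; `metre_pattern` and its parameters are hypothetical names, not part of the analysis code.

```python
def metre_pattern(foot, n_feet, drop_leading=0):
    """Repeat a foot n_feet times and optionally drop leading syllables."""
    syllables = foot * n_feet
    return ' '.join(syllables[drop_leading:])

# An anapest is two unstressed syllables (x) followed by a stressed one (/).
print(metre_pattern('xx/', 4))                   # full anapestic tetrameter
print(metre_pattern('xx/', 4, drop_leading=1))   # first unstressed syllable omitted
```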

To get a better idea, here is an example of one of these poems:

```
you don't need a measure of treasure to fly
to sporting success on a broom in the sky...
to eros alone in the sight of the stars...
to space on a ship that's intended for mars.
you don't need a mountain of money to go
where peter and susan await in the snow...
where planets contend and defend for a spice...
where alice adventures with hatters and mice.
you don't need a wallet of wealth and of worth
to start on a journey across middle earth...
to fight in the night with your sword and your steed.
you don't need a fund or a fortune to read.
```

In [None]:
df = df[df['average_line_length']<55]

In [None]:
fig = plot_histogram(
    x = df['number_of_lines'],    
    params = {'xbins':dict(size=1)},
    title = 'Histogram of the number of lines per poem by u/poem_for_your_sprog',
    xaxis_title = 'Number of lines',
    yaxis_title = 'Number of comments'
)

fig.show()

That's quite a wide spread! Some poems have lines that are on average three times longer than those of other poems. This makes one wonder what those poems look like. Printing many poems here might not be the best solution, so let's put the 100 poems with the shortest average line length and the 100 poems with the longest average line length in a plot that allows you to read the poems by hovering over the points.

In [None]:
df = df[df['poem'].apply(len)>0]
df = df[df['number_of_lines']>1]

In [None]:
df_short = df.sort_values('average_line_length').head(100)
df_long = df.sort_values('average_line_length',ascending=False).head(100)
df_short_long = pd.concat([df_short,df_long])

In [None]:
fig = go.Figure(
        data=go.Scatter(
            x=df_short_long['average_line_length'],
            y=df_short_long['ups'],
            mode='markers',
            marker=dict(
                size = 8,
                line_width=1,
                opacity=0.7
            ),
            hoverinfo = 'text',
            text=[re.sub('>','<br>',comment) for comment in df_short_long['poem']]
        )
)

fig.update_layout(
    title='100 poems with the shortest and 100 poems with the longest line length <br> by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Average line length',
    yaxis_title='Upvotes'
)
fig.show()

# Upvotes & Awards

On Reddit, there are two ways of showing appreciation for a comment. One can upvote it, or spend some actual money to give it an award. Let's start by simply counting the upvotes:

In [None]:
print('Total number of upvotes on poems by /u/poem_for_your_sprog: ' + str(df['ups'].sum()))

Here is a little thing I found on the internet, so it must be true:

> According to research conducted by Vanderbilt University Medical Center, laughing for 10 to 15 minutes burns between 10 and 40 calories a day.

Let's first assume that smiling is at the bottom of this spectrum, i.e. 15 minutes of smiling burns 10 calories. Now let's assume that an upvote equals two seconds of smiling on average. We now have enough information to convert our number of upvotes to Big Macs!

In [None]:
cal_per_sec = 10/(15*60)
cal_per_smile = cal_per_sec * 2
total_cals_smiled = df['ups'].sum() * cal_per_smile
cal_per_big_mac = 564
n_big_mac = round(total_cals_smiled/cal_per_big_mac,1)
print('Calories per second of smiling: {0:.4f}'.format(cal_per_sec))
print('Calories per smile: {0:.4f}'.format(cal_per_smile))
print('Number of smiles: {}'.format(df['ups'].sum()))
print('Total calories smiled: {0:.1f}'.format(total_cals_smiled))
print('Calories per Big Mac: {0:.1f}'.format(cal_per_big_mac))

print('In total, {} Big Macs\' worth of calories have been burned by smiles that were caused by poems by /u/poem_for_your_sprog.'
     .format(n_big_mac))

Next, let's take a look at the awards. There are (at least) three types of awards:
- Silver, granting the user... nothing.
- Gold, granting the user Reddit Premium for a week.
- Platinum, granting the user Reddit Premium for a month.

Let's take a look at how many of each /u/poem_for_your_sprog has been given:

In [None]:
from collections import Counter
c = Counter()
for d in df['awards_dict']:
    c.update(d)

color_dict = {'Gold': '#C9B037', 
          'Silver': '#D7D7D7', 
          'Platinum': '#B4B4B4'}
colors = [color_dict[x] if x in color_dict else '#AD8A56' for x in [x[0] for x in c.items()]]

fig = go.Figure(data=[
                    go.Pie(
                        labels=[x[0] for x in c.items()], 
                        values=[x[1] for x in c.items()],
                        marker=dict(colors=colors),
                        textinfo='value', 
                        hoverinfo='label+percent',
                        textfont_size=12,
                        hole=.3 
                    )
    ]
)
fig.update_layout(
    template='simple_white',
    title = 'Awards received on poems by /u/poem_for_your_sprog.',
    title_x = 0.5,
)
fig.show()

In [None]:
fig = plot_histogram(
    x = df['ups'],    
    params = {'xbins':dict(size=250)},
    title = 'Histogram of the upvotes of poems by u/poem_for_your_sprog',
    xaxis_title = 'Upvotes',
    yaxis_title = 'Number of comments'
)
fig.show()

In [None]:
fig = plot_histogram(
    x = df['total_awards_received'],    
    params = {'xbins':dict(size=1)},
    title = 'Histogram of the number of awards per comment by u/poem_for_your_sprog',
    xaxis_title = 'Number of awards',
    yaxis_title = 'Number of comments'
)
fig.show()

In [None]:
fig = go.Figure(
        data=go.Scatter(
            x=df['total_awards_received'],
            y=df['ups'],
            mode='markers',
            marker=dict(
                size = 8,
                line_width=1,
                opacity=0.7
            ),
            hoverinfo = 'text',
            text=['ups: {}<br>{}<br><br>'.format(row['ups'],row['awards_simple']) 
                  + re.sub('>','<br>',row['poem']) for index, row in df.iterrows()]
        )
)

fig.update_layout(
    title='Upvotes versus number of awards of the poems by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Number of awards received',
    yaxis_title='Upvotes'
)
fig.show()

# Word analysis

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
stop_words = nltk.corpus.stopwords.words('english')
comments = df['poem'].str.cat(sep=' ')
tokens = tokenizer.tokenize(comments)
tokens = [t for t in tokens if t not in stop_words]
frequency_dist = nltk.FreqDist(tokens)
most_common = frequency_dist.most_common(80)
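To show what this pipeline does on a small scale, here is a rough standard-library sketch of the same steps: tokenise on `\w+`, drop stop words, count. The tiny `stop_words` set and the `text` are made up for illustration; the real cell uses NLTK's English stop word list and all poems.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list; NLTK's English list is much longer.
stop_words = {'you', 'a', 'of', 'to', 'don', 't'}

text = "you don't need a measure of treasure to fly you don't need a mountain of money to go"

# Splitting on \w+ breaks "don't" into "don" and "t", which is why both are stop words here.
tokens = [t for t in re.findall(r'\w+', text.lower()) if t not in stop_words]
counts = Counter(tokens)

print(counts.most_common(3))
```

Note that splitting on `\w+` cuts contractions in two, so fragments such as "don" and "t" only disappear because the stop word list happens to contain them.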

In [None]:
fig = plot_horizontal_bar(
    labels = [x[0] for x in most_common[::-1]],
    values = [x[1] for x in most_common[::-1]],
    title = 'Most frequently occurring words in comments by u/poem_for_your_sprog',
    xaxis_title = 'Occurrences',
    yaxis_title='')
fig.show()

# What about Timmy?

In [None]:
comments_about_timmy = np.array(['timmy' in comment for comment in df['poem']])
comments_about_timmy_fucking_dying = np.array(['timmy fucking died' in comment for comment in df['poem']])

In [None]:
print('Comments about Timmy: {}'.format(comments_about_timmy.sum()))
print('Comments about Timmy fucking dying: {}'.format(comments_about_timmy_fucking_dying.sum()))
print('Comments about Timmy that do not end with Timmy fucking dying: {}'
      .format(comments_about_timmy.sum()-comments_about_timmy_fucking_dying.sum()))

In [None]:
fig = go.Figure(data=[
                    go.Pie(
                        labels=['Timmy fucking dying','Timmy not fucking dying'], 
                        values=[comments_about_timmy_fucking_dying.sum(),
                             comments_about_timmy.sum()-comments_about_timmy_fucking_dying.sum()], hole=.3
        )
    ]
)
fig.update_layout(
        template='simple_white'
)
fig.show()

So.. What happens to Timmy if he doesn't fucking die?

In [None]:
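# Copy the selection so that adding a column does not trigger pandas' SettingWithCopyWarning.
df_timmy_not_dying = df[(comments_about_timmy) & (~comments_about_timmy_fucking_dying)].copy()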
df_timmy_not_dying['ending'] = [x.split('>')[-1] for x in df_timmy_not_dying['poem']]
df_timmy_not_dying = df_timmy_not_dying.sort_values('ups')

In [None]:
fig = plot_horizontal_bar(
    labels = df_timmy_not_dying['ending'],
    values = df_timmy_not_dying['ups'],
    title = 'Best scoring alternative endings to poems about Timmy',
    xaxis_title = 'Upvotes',
    yaxis_title=''
)
fig.show()

# Rhyming

In [None]:
import pronouncing

In [None]:
def get_last_word_per_line(poem):
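    # Last whitespace-preceded word of each line, ignoring trailing punctuation; None if there is none.
    pattern = re.compile(r"\s([^\.?!,\s]+)[\.?!,\s']*$")
    matches = [pattern.search(line) for line in poem.split('>')]
    return [m.group(1) if m else None for m in matches]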

In [None]:
last_words_list = [get_last_word_per_line(mystr) for mystr in df['poem']]

In [None]:
import string

def get_rhyme_scheme(last_words_per_line):
    alphabet = string.ascii_lowercase
    rhyme_scheme = np.zeros(len(last_words_per_line), dtype=str)  # all empty strings; np.empty leaves memory uninitialised
    k=0
    for i in range(len(last_words_per_line)):
        if rhyme_scheme[i]=='':
            if last_words_per_line[i] is not None:
                rhyme_scheme[i]=alphabet[k % 26]
                rhyme_list = pronouncing.rhymes(last_words_per_line[i])
                rhyme_scheme[(np.array([x in rhyme_list for x in last_words_per_line]) & (rhyme_scheme == ''))] = alphabet[k % 26]
                k+=1
            else:
                rhyme_scheme[i] = '?'               
    return ''.join(rhyme_scheme) 
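To illustrate what `get_rhyme_scheme` produces, here is a simplified sketch of the same labelling idea. It replaces `pronouncing.rhymes` with a small hand-written lookup (`toy_rhymes`, hypothetical) so that it runs without the pronunciation dictionary; the word list is taken from the line endings of the poem at the top of this notebook.

```python
import string

# Hypothetical stand-in for pronouncing.rhymes: word -> words it rhymes with.
toy_rhymes = {
    'rough': ['enough'],
    'enough': ['rough'],
    'could': ['should', 'wood'],
    'you': ['do', 'too'],
    'do': ['you', 'too'],
}

def toy_rhyme_scheme(last_words):
    """Label each line ending with a letter; rhyming endings share a letter."""
    scheme = [''] * len(last_words)
    k = 0
    for i, word in enumerate(last_words):
        if scheme[i] != '':
            continue  # already labelled as a rhyme of an earlier line
        if word is None:
            scheme[i] = '?'
            continue
        letter = string.ascii_lowercase[k % 26]
        scheme[i] = letter
        rhymes = toy_rhymes.get(word, [])
        for j, other in enumerate(last_words):
            if scheme[j] == '' and other in rhymes:
                scheme[j] = letter
        k += 1
    return ''.join(scheme)

print(toy_rhyme_scheme(['rough', 'enough', 'could', 'you', 'could', 'do']))
```

This prints `aabcdc`. The repeated ending "could" gets two different letters because a word is not in its own rhyme list; as far as I can tell `pronouncing.rhymes` behaves the same way, so poems with repeated line endings may be labelled less tidily than a human reader would expect.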

In [None]:
rhyme_schemes = [get_rhyme_scheme(x) for x in last_words_list]
df['rhyme_scheme'] = rhyme_schemes

In [None]:
most_common_rhyme_schemes = df['rhyme_scheme'].value_counts().head(15)

In [None]:
fig = plot_horizontal_bar(
    labels = most_common_rhyme_schemes.index[::-1],
    values = most_common_rhyme_schemes[::-1],
    title = 'The 15 most common rhyming schemes in poems by /u/poem_for_your_sprog',
    xaxis_title = 'Number of poems',
    yaxis_title='',
    figsize=(800,600)
)
fig.show()

In [None]:
df_top_rhymes = df[df['rhyme_scheme'].isin(most_common_rhyme_schemes.index[:10])]

In [None]:
df_top_rhymes = (df_top_rhymes
                 .groupby(["rhyme_scheme"])
                 .apply(lambda x: x.sort_values(["ups"], ascending = False))
                 .reset_index(drop=True)
                 .groupby('rhyme_scheme')
                 .head(10))

In [None]:
fig = go.Figure()

for rhyme_scheme in most_common_rhyme_schemes.index[:10]:
    df_subset = df_top_rhymes[df_top_rhymes['rhyme_scheme'] == rhyme_scheme]
    fig.add_trace(go.Scatter(
        x=df_subset['average_line_length'],
        y=df_subset['ups'],
        mode='markers',
        name=rhyme_scheme,
        marker=dict(
            size = 8,
            line_width=1,
            opacity=0.7
        ),
        hoverinfo = 'text',
        text=[re.sub('>','<br>',comment) for comment in df_subset['poem']]
    )
 )


fig.update_layout(
    title='Top rated poems in the 10 most common rhyming schemes by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Average line length',
    yaxis_title='Upvotes'
)
fig.show()

# Rhyme sets

In [None]:
def get_rhyme_tuples(last_words_per_line):
    all_rhymes = list()
    for i in range(len(last_words_per_line)-1):
        if last_words_per_line[i] is not None:
            rhymes_with = pronouncing.rhymes(last_words_per_line[i])
            next_words = last_words_per_line[(i+1):np.min([len(last_words_per_line),i+4])]
            index_of_next_rhyme_words = np.where([x in rhymes_with for x in next_words])[0]
            if index_of_next_rhyme_words.size>0:
                all_rhymes.append((last_words_per_line[i], next_words[index_of_next_rhyme_words[0]]))
    return all_rhymes

In [None]:
from itertools import chain
# Flatten the per-poem lists of rhyme tuples into one list.
all_rhyme_tuples = list(chain.from_iterable(get_rhyme_tuples(x) for x in last_words_list))

In [None]:
all_rhyme_tuples

In [None]:
all_rhyme_words = [x for tpl in all_rhyme_tuples for x in tpl]

In [None]:
df_rhyme_words = pd.DataFrame({'word':all_rhyme_words})

In [None]:
df_rhyme_words.groupby('word')['word'].count().sort_values(ascending=False).head(100)

# Metre

In [None]:
from string import punctuation

In [None]:
def get_word_scansion(word):
    """ 
    Get the scansion per word, as a string of 0's and 1's.
    """
    word = word.strip(punctuation)
    if word == '': 
        return ''
    pronunciation = pronouncing.phones_for_word(word)
    if pronunciation: 
        stresses = pronouncing.stresses(pronunciation[0])
    else:
        # Retry after stripping everything from the first apostrophe onward.
        word = re.sub("'.+",'',word)
        pronunciation = pronouncing.phones_for_word(word)
        if pronunciation: 
            stresses = pronouncing.stresses(pronunciation[0])
        else:
            stresses = '?'
    return re.sub('2','1',stresses)

def get_line_scansion(line):
    """ 
    Get the scansion per line, as a string of 0's and 1's.
    """
    return ''.join([get_word_scansion(word) for word in line.split(' ')])

In [None]:
# * = catalectic, i.e. the last (unstressed) syllable is omitted
# ** = iambic substitution, i.e. the first (unstressed) syllable is omitted from an anapestic foot

known_metres = {
    'iambic hexameter'       : '010101010101',
    'iambic pentameter'      : '0101010101',
    'iambic tetrameter'      : '01010101',
    'iambic trimeter'        : '010101',
    'iambic dimeter'         : '0101',
    'iambic meter'           : '01',
    
    'anapestic tetrameter'   : '001001001001',
    'anapestic tetrameter**' : '01001001001',
    'anapestic trimeter'     : '001001001',
    'anapestic trimeter**'   : '01001001',
    'anapestic dimeter'      : '001001',
    'anapestic dimeter**'    : '01001',
    'anapestic meter'        : '001',

    'trochaic hexameter'     : '101010101010',
    'trochaic hexameter*'    : '10101010101',
    'trochaic pentameter'    : '1010101010',
    'trochaic pentameter*'   : '101010101',
    'trochaic tetrameter'    : '10101010',
    'trochaic tetrameter*'   : '1010101',
    'trochaic trimeter'      : '101010',
    'trochaic trimeter*'     : '10101',
    'trochaic dimeter'       : '1010',
    'trochaic dimeter*'      : '101',
    'trochaic meter'         : '10',
}
kown_metres_inv = inv_map = {v: k for k, v in known_metres.items()}

In [None]:
scansion = [[get_line_scansion(line) for line in poem.split('>')] for poem in df['poem']]
df['scansion'] = [list(filter(bool, x)) for x in scansion]

In [None]:
def combine_line_scansions(scansion_list):
    """ 
    Combines multiple shorter lines into a single line, if the number of syllables is equal.
    This turns for example the following list of scansions per line;
    
    ['11101001011',
     '11101011001',
     '111011',
     '11011']
     
     into
     
     ['11101001011',
      '11101011001',
      '11101111011']
    
    """   
    scansion_list = scansion_list.copy()
    
    improvement_found = True
    while improvement_found:
        # Find which lines to combine into one, if any.
        idx_line_to_combine = []  # stays empty when there is nothing to combine (e.g. an empty list)
        n_syllables_per_line = [len(x) for x in scansion_list]
        unique_line_lengths = sorted(np.unique(np.array(n_syllables_per_line)), key=lambda item: -item)
        for target_length in unique_line_lengths[:np.min([len(unique_line_lengths),2])]:
            for lines_to_combine in [2,3,4]: # try to combine 2,3 or 4 lines.
                idx_line_to_combine = []
                if lines_to_combine<len(scansion_list):
                    combined_line_lengths = np.convolve(n_syllables_per_line,np.ones(lines_to_combine,dtype=int),'valid')
                    idx_line_to_combine = np.where(combined_line_lengths==target_length)[0]
                    if len(idx_line_to_combine)>0: break
            if len(idx_line_to_combine)>0: break

        if len(idx_line_to_combine)>0:
            improvement_found = True
            new_line = ''.join(scansion_list[idx_line_to_combine[0]:(idx_line_to_combine[0]+lines_to_combine)])
            scansion_list[idx_line_to_combine[0]] = new_line
            del scansion_list[(idx_line_to_combine[0]+1):(idx_line_to_combine[0]+lines_to_combine)]
        else:
            improvement_found = False
            
    return scansion_list
    
            

In [None]:
df['scansion_altered'] = [combine_line_scansions(x) for x in df['scansion']]
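The core of `combine_line_scansions` is a sliding-window sum over the syllable counts, which the function computes with `np.convolve`. Here is a plain-Python sketch of that single step, using the example from the docstring. `window_sums` is a hypothetical helper, and the sketch only tries windows of two lines against the longest line length, whereas the function tries windows of 2, 3 and 4 lines against the two longest lengths.

```python
def window_sums(lengths, window):
    """Sum of every run of `window` consecutive values."""
    return [sum(lengths[i:i + window]) for i in range(len(lengths) - window + 1)]

scansions = ['11101001011', '11101011001', '111011', '11011']
lengths = [len(s) for s in scansions]          # syllables per line
target = max(lengths)                          # length of the longest line

sums = window_sums(lengths, 2)
starts = [i for i, s in enumerate(sums) if s == target]

# Merge the first pair of adjacent lines that together reach the target length.
if starts:
    i = starts[0]
    merged = scansions[:i] + [''.join(scansions[i:i + 2])] + scansions[i + 2:]
else:
    merged = scansions

print(sums, starts)
print(merged)
```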

In [None]:
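def same_non_stressed(a,b):
    """Count the positions at which both scansions (of equal length) have an unstressed syllable ('0')."""
    return sum((a[i] == '0') and (b[i] == '0') for i in range(len(a)))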

In [None]:
def get_known_metre(scansion_list):
    """
    Use a list of scansion per line to estimate the metre of the poem. The assumption is 
    that a poem always has at most two different known metres. Furthermore, since our method of
    identifying the scansion overestimates the number of stressed syllables, we will use the number
    of accurate non-stressed syllables to determine the known metre.
    """
    # First, create metre_list; a list whose elements have the structure [a,b], where a is the number 
    # of syllables in the line and b is a list of the most likely known metres.
    metre_list=[]
    for scansion in scansion_list:
        l = [(same_non_stressed(scansion,k),v) for k, v in kown_metres_inv.items() if len(k) == len(scansion)]    
        if l:
            maxValue = max(l, key=lambda x: x[0])[0]
            maxValueList = [x[1] for x in l if x[0] == maxValue]
            metre_list.append([len(scansion),maxValueList])

    # If metre_list has at least one element, create metres_list. The elements in this list
    # contain per line length all the predicted metres, still to be flattened.
    # If more than two elements, we only look at the stats for the two highest line lengths.
    # this is to filter outliers, if in some shorter lines the metre was not found and thus they
    # could not be combined.
    if metre_list:        
        (values,counts) = np.unique([x[0] for x in metre_list],return_counts=True)
        values = values[counts>1]
        values = sorted(list(values),reverse=True)[:np.min([len(values),2])]       
        metres_list = [[y[1] for y in metre_list if y[0] == val] for val in values]

        # Now, find per line length the most commonly predicted metre. In case of a tie, pick one at random.
        # Sorry, best we can do for now...
        result = list()
        for metres_per_line_length in metres_list:
            flat_list = [item for sublist in metres_per_line_length for item in sublist]
            (values,counts) = np.unique(flat_list,return_counts=True)
            ind=np.where(counts==np.max(counts))
            if len(ind[0])>1:
                result.append(np.random.choice(values[ind]))
            else:
                result.append(values[ind][0])
    else:
        result = ['unknown']  # keep the return type a list, so joining it yields 'unknown'
    return result
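The scoring idea in `get_known_metre` can be shown on a single line. The sketch below compares one scansion against two of the 11-syllable patterns from `known_metres` and counts the positions where both are unstressed. The `scansion` value is a made-up example with a few syllables wrongly marked as stressed, and `shared_unstressed` is a hypothetical re-implementation of `same_non_stressed`.

```python
# Two of the candidate patterns from known_metres that have 11 syllables.
candidates = {
    'anapestic tetrameter**': '01001001001',
    'trochaic hexameter*': '10101010101',
}

def shared_unstressed(a, b):
    """Count positions where both equal-length scansions are unstressed ('0')."""
    return sum(x == '0' and y == '0' for x, y in zip(a, b))

# Made-up scansion of one line; stresses are overestimated on purpose.
scansion = '11001001011'

scores = {name: shared_unstressed(scansion, pattern) for name, pattern in candidates.items()}
best = max(scores, key=scores.get)

print(scores)
print(best)
```

Counting only the shared unstressed positions means that a spurious stress lowers the score of every candidate by the same amount at most, so the anapestic pattern still wins here with 5 matches against 2.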

In [None]:
df['metre_list'] = [get_known_metre(x) for x in df['scansion_altered']]
df['metre'] = [', '.join(x) for x in df['metre_list']]

In [None]:
for i in range(100):
    print('\n\n--------')
    print(i)
    print2(df['poem'].iloc[i])
    print(df['scansion_altered'].iloc[i])
    print(df['metre'].iloc[i])

In [None]:
most_common_metres = df['metre'].value_counts().head(12)

In [None]:
fig = plot_horizontal_bar(
    labels = most_common_metres.index[::-1],
    values = most_common_metres[::-1],
    title = 'The 12 most common metres in poems by /u/poem_for_your_sprog',
    xaxis_title = 'Number of poems',
    yaxis_title='',
    figsize=(800,500)
)
fig.show()

In [None]:
df_top_metres = df[df['metre'].isin(most_common_metres.index[:10])]
df_top_metres = (df_top_metres
                 .groupby(["metre"])
                 .apply(lambda x: x.sort_values(["ups"], ascending = False))
                 .reset_index(drop=True)
                 .groupby('metre')
                 .head(10))

In [None]:
fig = go.Figure()

for metre in most_common_metres.index[:10]:
    df_subset = df_top_metres[df_top_metres['metre'] == metre]
    fig.add_trace(go.Scatter(
        x=df_subset['average_line_length'],
        y=df_subset['ups'],
        mode='markers',
        name=metre,
        marker=dict(
            size = 8,
            line_width=1,
            opacity=0.7
        ),
        hoverinfo = 'text',
        text=[re.sub('>','<br>',comment) for comment in df_subset['poem']]
    )
 )


fig.update_layout(
    title='Top rated poems in the 10 most common metres by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Average line length',
    yaxis_title='Upvotes'
)
fig.show()