I am a really big fan of the poems by /u/poem_for_your_sprog on Reddit. For those of you who are not familiar with him yet: he writes short poems as responses to others in AskReddit threads. To give you an example, one that I particularly like is the following, written in response to a thread of gripping stories by ICU workers who, despite their best efforts, are not always able to save every patient they meet:

```
You’ll weather the wind and the rain and the rough -
And sometimes you’ll try but it won’t be enough.

You did what you could,
but it’s not up to you.

You did what you could,
and that’s all you can do.
```

I always find it difficult to explain why I love these poems so much. Some, like the one above, stand out in their simplicity: six short lines that bring a message that speaks to many of us. But there are more elaborate ones, and really funny ones too. I think the one thing that they all have in common is their rhythm, or their 'flow'.

However, since I am not as good with words as /u/poem_for_your_sprog, let me use numbers to analyze some of the work he has done to date!

In [None]:
%%capture
%load_ext autoreload
%autoreload 2

In [None]:
import string
import datetime as dt
import pandas as pd
import numpy as np
import time
import re
import matplotlib.pyplot as plt 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=False)

from src.utils import print2_list, print2
from src.plotly import plot_histogram, plot_timeline, plot_horizontal_bar
from src.data import load_data

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0

In [None]:
%%capture
import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('stopwords')
nltk.download('punkt')

# Load the data

In [None]:
df = load_data('data/comments.txt', False)

# Comments per day

Let's start by simply looking at the number of poems over time. We can see below that on extraordinarily productive days, sprog provides us with about 5 or 6 poems, and his productivity seems to have increased somewhat over time. The outlier of 12 comments in June is the ['Ask Me Anything'](https://www.reddit.com/r/books/comments/3aungz/hi_im_sam_garland_aka_upoem_for_your_sprog_ive/), or AMA for short, where he answered questions from fellow Redditors.

In [None]:
df_comments_per_day= df.groupby(['date'])['date'].agg(n='count')
idx = pd.date_range(df_comments_per_day.index.min(), dt.datetime.today())
df_comments_per_day = df_comments_per_day.reindex(idx, fill_value=0)
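The reindex step matters because days without any comment are simply missing after the groupby, so they have to be filled in with zeros before plotting. Here is a minimal, dependency-free sketch of the same idea. The dates are made up, and `daily_counts` is a hypothetical helper for illustration, not something from this notebook's `src` package.

```python
from collections import Counter
import datetime as dt


def daily_counts(dates, end=None):
    """Count items per day and fill days without any item with 0."""
    counts = Counter(dates)
    start = min(counts)
    end = end or max(counts)
    n_days = (end - start).days + 1
    days = (start + dt.timedelta(days=i) for i in range(n_days))
    # Looking up a missing key in a Counter returns 0, which is the zero-fill.
    return {day: counts[day] for day in days}


# Made-up example: two comments on Jan 1, none on Jan 2, one on Jan 3.
example_dates = [dt.date(2020, 1, 1), dt.date(2020, 1, 1), dt.date(2020, 1, 3)]
filled = daily_counts(example_dates)
for day, n in filled.items():
    print(day, n)
```

Without the zero-fill, a line plot would connect Jan 1 straight to Jan 3 and hide the quiet day in between.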

In [None]:
fig = plot_timeline(
    x=df_comments_per_day.index,
    y=df_comments_per_day['n'],
    title='Number of comments on Reddit per day by u/poem_for_your_sprog',
    xaxis_title='Day',
    yaxis_title='Number of comments',
    annotations=[
        go.layout.Annotation(
            x='2015-6-23',
            y=12,
            xref="x",
            yref="y",
            text="AMA",
            showarrow=True,
            arrowhead=2,
            ax=-50,
            ay=0
        )
    ]
)
fig.show()

This plot was useful to identify the AMA outlier, so let's remove the observations from that day from our dataset. However, the daily plot does not really help us in identifying a trend, so let's aggregate the data into monthly buckets to get a clearer view:

In [None]:
# Remove AMA comments
df = df[df['date']!=dt.date(2015,6,23)]
df.reset_index(inplace=True)

In [None]:
comments_per_month = df_comments_per_day.groupby(pd.Grouper(freq='M'))['n'].sum()

In [None]:
fig = plot_timeline(
    x=comments_per_month.index,
    y=comments_per_month,
    title='Number of comments on Reddit per month by u/poem_for_your_sprog',
    xaxis_title='Month',
    yaxis_title='Number of comments'
)
fig.show()

# Average line length

Sprog writes both poems with very short lines, as well as poems with longer ones. A histogram of the average number of characters on a line per poem should give us a better idea of the distribution:

In [None]:
fig = plot_histogram(
    x = df['average_line_length'],
    params = {'xbins':dict(start=0,end=200,size=1)},
    title = 'Histogram of the average characters per line by u/poem_for_your_sprog',
    # x-axis: average line length per poem; y-axis: how many comments fall in each bin
    xaxis_title = 'Average number of characters per line',
    yaxis_title = 'Number of comments'
)

fig.show()
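For reference, here is a rough sketch of how an average line length per poem could be computed. This is an assumption on my side: it takes lines to be separated by newlines and ignores empty lines, while the `average_line_length` column actually comes from `load_data` and may be derived differently. The function name is hypothetical.

```python
def average_line_length(poem: str) -> float:
    """Mean number of characters per non-empty line of a poem."""
    lines = [line.strip() for line in poem.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


# Two lines from the poem quoted in the introduction (with plain apostrophes).
poem = "You did what you could,\nbut it's not up to you."
print(average_line_length(poem))
```

The histogram above is simply the distribution of this number over all poems.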

There are three clear outliers, which upon further inspection either are not poems or contain a lot of text alongside the poem. We will filter those out for now. Furthermore, the peak around 46 is interesting to see. My initial guess was that some poems are written in the rhyme scheme `abab`, while others are split over shorter lines in the rhyme scheme `abcb defe`, so the former would on average have lines twice as long as the latter. But that does not seem to hold: in that case the first peak should be around 23 instead of 30.

It turns out that these are a set of poems that sprog usually writes in 'anapestic tetrameter'. Anapestic tetrameter is a metre with four anapestic feet per line. An anapestic foot is two unstressed syllables followed by a stressed one. So, denoting a stressed syllable as / and an unstressed syllable as x, a line of anapestic tetrameter can be denoted as follows:

> x x / x x / x x / x x /

However, sprog usually omits the first unstressed syllable, so it becomes:

>   x / x x / x x / x x /
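The two patterns above can also be generated programmatically, which makes it easy to count syllables. This is just an illustrative sketch; `meter_pattern` and its parameters are names I made up here.

```python
def meter_pattern(foot=("x", "x", "/"), feet=4, drop_first=0):
    """Build a stress pattern: `feet` repetitions of `foot`,
    optionally dropping the first `drop_first` syllables."""
    syllables = list(foot) * feet
    return " ".join(syllables[drop_first:])


full = meter_pattern()               # regular anapestic tetrameter
headless = meter_pattern(drop_first=1)  # first unstressed syllable omitted

print(full)
print(headless)
print(len(headless.split()), "syllables per line")
```

So a line in sprog's variant has 11 syllables instead of 12, with the stress still falling on every third syllable counted from the end.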

To get a better idea, here is an example of one of these poems:

```
you don't need a measure of treasure to fly
to sporting success on a broom in the sky...
to eros alone in the sight of the stars...
to space on a ship that's intended for mars.
you don't need a mountain of money to go
where peter and susan await in the snow...
where planets contend and defend for a spice...
where alice adventures with hatters and mice.
you don't need a wallet of wealth and of worth
to start on a journey across middle earth...
to fight in the night with your sword and your steed.
you don't need a fund or a fortune to read.
```

In [None]:
df = df[df['average_line_length']<55]

In [None]:
fig = plot_histogram(
    x = df['number_of_lines'],    
    params = {'xbins':dict(size=1)},
    title = 'Histogram of the number of lines per poem by u/poem_for_your_sprog',
    xaxis_title = 'Number of lines',
    yaxis_title = 'Number of comments'
)

fig.show()

That's quite a wide spread! Some poems have, on average, lines that are three times longer than those of other poems. This makes one wonder what those poems look like. Printing many poems here might not be the best solution, so let's put the 100 poems with the shortest average line length and the 100 poems with the longest average line length in a plot that allows you to read the poems by hovering over the points.

In [None]:
df = df[df['poem'].apply(len)>0]
df = df[df['number_of_lines']>1]

In [None]:
df_short = df.sort_values('average_line_length').head(100)
df_long = df.sort_values('average_line_length',ascending=False).head(100)
df_short_long = pd.concat([df_short,df_long])

In [None]:
fig = go.Figure(
        data=go.Scatter(
            x=df_short_long['average_line_length'],
            y=df_short_long['ups'],
            mode='markers',
            marker=dict(
                size = 8,
                line_width=1,
                opacity=0.7
            ),
            hoverinfo = 'text',
            text=[re.sub('>','<br>',comment) for comment in df_short_long['poem']]
        )
)

fig.update_layout(
    title='100 poems with the shortest and 100 poems with the longest line length <br> by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Average line length',
    yaxis_title='Upvotes'
)
fig.show()

# Upvotes & Awards

On Reddit, there are two ways of showing appreciation for a comment. One can upvote a post, or spend some actual money to give the post an award. Let's start by simply counting the upvotes:

In [None]:
print('Total number of upvotes on poems by /u/poem_for_your_sprog: ' + str(df['ups'].sum()))

Is that a lot? In order to find out, let's try to express it in Big Macs. Here is a little thing I found on the internet (so it must be true):

> According to research conducted by Vanderbilt University Medical Center, laughing for 10 to 15 minutes burns between 10 and 40 calories a day.

Let's first assume that smiling is at the bottom of this spectrum, i.e. 15 minutes of smiling burns 10 calories. Now let's assume that an upvote equals two seconds of smiling on average. Then we have enough information to convert our number of upvotes to Big Macs!

In [None]:
cal_per_sec = 10/(15*60)
cal_per_smile = cal_per_sec * 2
total_cals_smiled = df['ups'].sum() * cal_per_smile
cal_per_big_mac = 564
n_big_mac = round(total_cals_smiled/cal_per_big_mac,1)
print('Calories per second of smiling: {0:.4f}'.format(cal_per_sec))
print('Calories per smile: {0:.4f}'.format(cal_per_smile))
print('Number of smiles: {}'.format(df['ups'].sum()))
print('Total calories smiled: {0:.1f}'.format(total_cals_smiled))
print('Calories per Big Mac: {0:.1f}'.format(cal_per_big_mac))

print('In total, {} Big Mac\'s worth of calories have been consumed by smiles that were caused by poems by /u/poem_for_your_sprog.'
     .format(n_big_mac))
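The same conversion can be wrapped in a small function, which makes the assumptions explicit and easy to change. This is only a sketch: the function name and the example input of one million upvotes are made up, while the default values are the assumptions stated above (2 seconds of smiling per upvote, 10 calories per 15 minutes, 564 calories per Big Mac).

```python
def upvotes_to_big_macs(upvotes,
                        seconds_per_upvote=2.0,
                        cal_per_15_min=10.0,
                        cal_per_big_mac=564.0):
    """Convert a number of upvotes into Big Macs' worth of smiled calories."""
    cal_per_sec = cal_per_15_min / (15 * 60)
    total_cal = upvotes * seconds_per_upvote * cal_per_sec
    return total_cal / cal_per_big_mac


# Hypothetical input, not the real total from the dataset.
print(round(upvotes_to_big_macs(1_000_000), 1))
```

Under these assumptions it takes 564 / (2 * 10 / 900) = 25,380 upvotes to smile away a single Big Mac.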

Next, let's take a look at the awards. There are (at least) three types of awards:
- Silver, granting the user... nothing.
- Gold, granting the user Reddit Premium for a week
- Platinum, granting the user Reddit Premium for a month

Below is a pie chart that shows the breakdown of the awards /u/poem_for_your_sprog has been given. Yes, I know many people don't like pie charts. Although I dough not really care, I have added a hole in the middle to make it a donut chart. As long as it's food related I'm happy.

In [None]:
from collections import Counter
c = Counter()
for d in df['awards_dict']:
    c.update(d)

color_dict = {'Gold': '#C9B037', 
          'Silver': '#D7D7D7', 
          'Platinum': '#B4B4B4'}
# Fall back to bronze for any award type that has no colour defined above
colors = [color_dict.get(label, '#AD8A56') for label in c]

fig = go.Figure(data=[
                    go.Pie(
                        labels=[x[0] for x in c.items()], 
                        values=[x[1] for x in c.items()],
                        marker=dict(colors=colors),
                        textinfo='value', 
                        hoverinfo='label+percent',
                        textfont_size=12,
                        hole=.3 
                    )
    ]
)
fig.update_layout(
    template='simple_white',
    title = 'Awards received on poems by /u/poem_for_your_sprog.',
    title_x = 0.5,
)
fig.show()
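The counting step in the cell above relies on `Counter.update` adding counts together instead of overwriting them, which is what a plain `dict.update` would do. A tiny sketch with made-up data, assuming each entry of `awards_dict` maps an award name to the number of times it was given:

```python
from collections import Counter

# Made-up award dictionaries for four comments (one without any award).
rows = [
    {"Gold": 1, "Silver": 2},
    {"Silver": 1},
    {},
    {"Platinum": 1, "Gold": 1},
]

totals = Counter()
for awards in rows:
    totals.update(awards)  # adds the counts per award name

print(totals.most_common())
```

The keys and values of the resulting counter are what end up as the labels and values of the donut chart.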

Below are three more graphs. The first is a histogram of the number of upvotes per post, and the second is a histogram of the number of awards per post. The last graph shows every poem written by /u/poem_for_your_sprog on Reddit with the number of upvotes on the y-axis and the number of awards on the x-axis. You can hover over the points to read the poems!

In [None]:
fig = plot_histogram(
    x = df['ups'],    
    params = {'xbins':dict(size=250)},
    title = 'Histogram of the upvotes of poems by u/poem_for_your_sprog',
    xaxis_title = 'Upvotes',
    yaxis_title = 'Number of comments'
)
fig.show()

In [None]:
fig = plot_histogram(
    x = df['total_awards_received'],    
    params = {'xbins':dict(size=1)},
    title = 'Histogram of the number of awards per comment by u/poem_for_your_sprog',
    xaxis_title = 'Number of awards',
    yaxis_title = 'Number of comments'
)
fig.show()

In [None]:
fig = go.Figure(
        data=go.Scatter(
            x=df['total_awards_received'],
            y=df['ups'],
            mode='markers',
            marker=dict(
                size = 8,
                line_width=1,
                opacity=0.7
            ),
            hoverinfo = 'text',
            text=['ups: {}<br>{}<br><br>'.format(row['ups'],row['awards_simple']) 
                  + re.sub('>','<br>',row['poem']) for index, row in df.iterrows()]
        )
)

fig.update_layout(
    title='Upvotes versus number of awards of the poems by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Number of awards received',
    yaxis_title='Upvotes'
)
fig.show()

In [None]:
front_matter_str = """---
layout: post
title: Analyzing the poems by /u/poem_for_your_sprog
subtitle: Some summary statistics of the comment history by /u/poem_for_your_sprog
tags: [python,poetry]
---"""

import subprocess

def export_ipynb_for_github_pages(filename,front_matter_str):
    """
    Converts the .ipynb file to a .html file with all code omitted. Also replaces 
    all occurrences of '{{' with '{ {' because otherwise this gives issues when Jekyll 
    parses the file.
    
    Edited from https://davistownsend.github.io/blog/PlotlyBloggingTutorial/
    """
    
    # Use the filename passed in by the caller instead of overwriting it,
    # e.g. "1.0-summary-statistics-upvotes-and-awards.ipynb"
    subprocess.call(["jupyter", "nbconvert","--to","html","--template","hidecode",filename])
    # Only swap the file extension, not other occurrences of 'ipynb' in the name
    filename_html = re.sub(r'\.ipynb$', '.html', filename)
    subprocess.call(["sed", "-i", "s/{{/{ {/g", filename_html])

    with open(filename_html, 'r') as original: 
        data = original.read()
    with open(filename_html, 'w') as modified: 
        modified.write(front_matter_str + "\n" + data)        
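Since `sed -i` behaves differently across platforms, the brace replacement and the front matter step could also be done in plain Python. The following is a sketch of that alternative, not the approach from the tutorial linked above; `prepend_front_matter` is a hypothetical name and the demo runs on a throwaway file in a temporary directory.

```python
from pathlib import Path
import tempfile


def prepend_front_matter(html_path, front_matter):
    """Replace '{{' with '{ {' and put the Jekyll front matter on top."""
    path = Path(html_path)
    html = path.read_text(encoding="utf-8")
    html = html.replace("{{", "{ {")
    path.write_text(front_matter + "\n" + html, encoding="utf-8")


# Demo on a throwaway file, so no real export is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.html"
    demo.write_text("<p>{{x}}</p>", encoding="utf-8")
    prepend_front_matter(demo, "---\nlayout: post\n---")
    result = demo.read_text(encoding="utf-8")

print(result)
```

This keeps the whole export in one language and avoids shelling out for a simple string replacement.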