I am a really big fan of the poems by /u/poem_for_your_sprog on Reddit. For those of you who are not familiar with him yet: he writes short poems in response to other users' comments in AskReddit threads. To give you an example, one that I particularly like is the following, written in response to a thread of gripping stories by ICU workers who, despite their best efforts, are not always able to save every patient they meet:

```
You’ll weather the wind and the rain and the rough -
And sometimes you’ll try but it won’t be enough.

You did what you could,
but it’s not up to you.

You did what you could,
and that’s all you can do.
```

I always find it difficult to explain why I love these poems so much. Some, like the one above, stand out in their simplicity: six short lines that carry a message that speaks to many of us. But there are more elaborate ones, and really funny ones too. I think the one thing they all have in common is their rhythm, or their 'flow'.

However, since I am not as good with words as /u/poem_for_your_sprog, let me use numbers to analyze some of the work he has done to date!

In [None]:
%%capture
%load_ext autoreload
%autoreload 2

In [None]:
import json
import string
import datetime as dt
import pandas as pd
import numpy as np
import os
import time
import re
import matplotlib.pyplot as plt 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=False)

from src.reddit_user_comment_reader import RedditUserCommentReader
from src.utils import print2_list, print2
from src.string import clean_comment
from src.plotly import plot_histogram, plot_timeline

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in recent pandas

In [None]:
%%capture
import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('stopwords')
nltk.download('punkt')

# Load the data

In [None]:
print('Loading data...')
download_comments = True
file_name = 'data/comments.txt'
if os.path.isfile(file_name):
    print('Existing file found!')
    mtime = os.path.getmtime(file_name)
    print("last modified: %s" % dt.datetime.fromtimestamp(mtime))
    if (dt.datetime.now() - dt.datetime.fromtimestamp(mtime)).days < 1: # if modified in last 24 hours
        download_comments = False

if download_comments:
    print('Downloading comments...')
    reddit_user_comment_reader = RedditUserCommentReader('poem_for_your_sprog', verbose = True)
    all_comments = reddit_user_comment_reader.get_comments()
    print('Saving to file...')
    with open(file_name, 'w') as outfile:
        json.dump(all_comments, outfile)
else:   
    print('Loading comments from file...')
    with open(file_name, 'r') as infile:
        all_comments = json.load(infile)
print('Done.')
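
The caching rule in the cell above (reuse the file if it was modified in the last 24 hours, otherwise download again) can be isolated in a small helper. This is only a sketch: `is_fresh` and its `now` parameter are hypothetical names that do not exist in this notebook, and the temporary file stands in for `data/comments.txt`.

```python
import os
import tempfile
import time


def is_fresh(path, max_age_seconds=24 * 60 * 60, now=None):
    """Return True if `path` exists and was modified less than `max_age_seconds` ago."""
    if not os.path.isfile(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < max_age_seconds


# Usage: a file written just now is fresh, the same file looks stale two days later,
# and a missing file is never fresh.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "comments.txt")
    with open(cache_path, "w") as outfile:
        outfile.write("[]")
    fresh_now = is_fresh(cache_path)
    fresh_in_two_days = is_fresh(cache_path, now=time.time() + 2 * 24 * 60 * 60)
    missing = is_fresh(os.path.join(tmp_dir, "does_not_exist.txt"))

print(fresh_now, fresh_in_two_days, missing)
```

Passing `now` explicitly makes the rule easy to check without waiting a day.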

In [None]:
df = pd.DataFrame(all_comments)

df = df[df['author']!='[deleted]']
df['comment_cleaned'] = df['body'].apply(clean_comment)
df['datetime'] = df['created_utc'].apply(dt.datetime.fromtimestamp)
df['date'] = df['datetime'].dt.date
df['awards_simple'] = df['all_awardings'].apply(lambda x: [y['name'] + ': ' + str(y['count']) for y in x]) 
df['number_of_lines'] = df['comment_cleaned'].apply(lambda x: 1+ sum(1 for _ in re.finditer(r'>', x)))
df['comment_length']= df['comment_cleaned'].str.len()
df['average_line_length'] = df['comment_length']/(df['number_of_lines'])

# Try to determine whether each row is a poem or a regular comment.
df['type'] = 'poem'
df.loc[df['date']==dt.date(2015,6,23),'type'] = 'comment' # AMA
df.loc[df['comment_cleaned'].str.len()==0,'type'] = 'comment' # nothing left after cleaning
df.loc[df['number_of_lines']<=1,'type'] = 'comment' # a single line is not a poem
df.loc[df['average_line_length']>=55,'type'] = 'comment' # lines too long for verse

df.to_pickle('data/df.pkl')
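
To make the line metrics above concrete, here is a small standalone sketch. It assumes, judging from the regular expression in the cell above, that `clean_comment` marks line breaks with `>`; that is an inference, since the helper itself is not shown. The names `line_metrics` and `looks_like_poem` are hypothetical, and the threshold of 55 characters mirrors the filter applied later in this notebook.

```python
def line_metrics(cleaned_comment, separator=">"):
    """Return (number_of_lines, average_line_length) the way the notebook computes them."""
    number_of_lines = 1 + cleaned_comment.count(separator)
    comment_length = len(cleaned_comment)  # separators are counted too, as in the notebook
    return number_of_lines, comment_length / number_of_lines


def looks_like_poem(cleaned_comment, max_average_line_length=55):
    """Heuristic: non-empty, more than one line, and short lines on average."""
    if not cleaned_comment:
        return False
    number_of_lines, average_line_length = line_metrics(cleaned_comment)
    return number_of_lines > 1 and average_line_length < max_average_line_length


sample = "you did what you could>but it's not up to you"
n_lines, avg_len = line_metrics(sample)
print(n_lines, avg_len, looks_like_poem(sample))
```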

# Comments per day

Let's start by simply looking at the number of comments over time. As we can see below, on his most productive days sprog provides the world with about 5 or 6 poems, and his productivity seems to have increased somewhat over time. The outlier of 12 comments in June 2015 is the ['Ask Me Anything'](https://www.reddit.com/r/books/comments/3aungz/hi_im_sam_garland_aka_upoem_for_your_sprog_ive/), or AMA for short, where he answered questions from fellow Redditors.

In [None]:
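df_comments_per_day = df.groupby(['date'])['date'].agg(n='count')
# The index holds datetime.date objects; convert it so it lines up with the DatetimeIndex below.
df_comments_per_day.index = pd.to_datetime(df_comments_per_day.index)
idx = pd.date_range(df_comments_per_day.index.min(), dt.datetime.today())
df_comments_per_day = df_comments_per_day.reindex(idx, fill_value=0)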
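
The point of the reindexing step is that days without any comment should show up as zeros instead of being absent. The same idea can be sketched without pandas; `counts_per_day` is a hypothetical helper and the dates are made up for illustration.

```python
import datetime as dt
from collections import Counter


def counts_per_day(dates, last_day):
    """Count items per calendar day, filling days without any item with 0."""
    counts = Counter(dates)
    first_day = min(counts)
    filled = {}
    day = first_day
    while day <= last_day:
        filled[day] = counts.get(day, 0)
        day += dt.timedelta(days=1)
    return filled


# Usage: three comments on two different days, reported over a four-day window.
comment_dates = [dt.date(2015, 6, 21), dt.date(2015, 6, 23), dt.date(2015, 6, 23)]
filled = counts_per_day(comment_dates, last_day=dt.date(2015, 6, 24))
for day, n in filled.items():
    print(day, n)
```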

In [None]:
fig = plot_timeline(
    x=df_comments_per_day.index,
    y=df_comments_per_day['n'],
    title='Number of comments on Reddit per day by u/poem_for_your_sprog',
    xaxis_title='Day',
    yaxis_title='Number of comments',
    annotations=[
        go.layout.Annotation(
            x='2015-6-23',
            y=12,
            xref="x",
            yref="y",
            text="AMA",
            showarrow=True,
            arrowhead=2,
            ax=-50,
            ay=0
        )
    ]
)
fig.show()

Due to the long timeline, it becomes difficult to see any real trend in the daily data, so let's aggregate to month level.

In [None]:
comments_per_month = df_comments_per_day.groupby(pd.Grouper(freq='M'))['n'].sum()

In [None]:
fig = plot_timeline(
    x=comments_per_month.index,
    y=comments_per_month,
    title='Number of comments on Reddit per month by u/poem_for_your_sprog',
    xaxis_title='Month',
    yaxis_title='Number of comments'
)
fig.show()

In [None]:
# Remove the AMA comments; other comments that do not look like poems are filtered out further below.
df = df[df['date']!=dt.date(2015,6,23)]
df.reset_index(inplace=True)

In [None]:
df.to_pickle('data/df.pkl')

# Average line length

Sprog writes poems with very short lines as well as poems with much longer ones. A histogram of the average number of characters per line in each poem should give us a better idea of the distribution:

In [None]:
fig = plot_histogram(
    x = df['average_line_length'],    
    params = {'xbins':dict(start=0,end=200,size=1)},
    title = 'Histogram of the average characters per line by u/poem_for_your_sprog',
    xaxis_title = 'Average characters per line',
    yaxis_title = 'Number of comments'
)

fig.show()

In [None]:
fig = plot_histogram(
    x = df['number_of_lines'],    
    params = {'xbins':dict(size=1)},
    title = 'Histogram of the number of lines per poem by u/poem_for_your_sprog',
    xaxis_title = 'Number of lines',
    yaxis_title = 'Number of comments'
)

fig.show()

That's quite a wide spread! The average line in some poems is three times as long as in others, which makes one wonder what those poems look like. Printing many poems here might not be the best solution, so let's put the 100 poems with the shortest average line length and the 100 poems with the longest average line length in plots that let you read each poem by hovering over its point.

In [None]:
df = df[df['comment_cleaned'].apply(len)>0]
df = df[df['number_of_lines']>1]
df = df[df['average_line_length']<55]

In [None]:
df_short = df.sort_values('average_line_length').head(100)

In [None]:
fig = go.Figure(
        data=go.Scatter(
            x=df_short['average_line_length'],
            y=df_short['score'],
            mode='markers',
            marker=dict(
                size = 8,
                line_width=1,
                opacity=0.7
            ),
            hoverinfo = 'text',
            text=[re.sub('>','<br>',comment) for comment in df_short['comment_cleaned']]
        )
)

fig.update_layout(
    title='The 100 poems with the shortest line length by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Average line length',
    yaxis_title='Score'
)
fig.show()

In [None]:
df_long = df.sort_values('average_line_length',ascending=False).head(100)

In [None]:
fig = go.Figure(
        data=go.Scatter(
            x=df_long['average_line_length'],
            y=df_long['score'],
            mode='markers',
            marker=dict(
                size = 8,
                line_width=1,
                opacity=0.7
            ),
            hoverinfo = 'text',
            text=[re.sub('>','<br>',comment) for comment in df_long['comment_cleaned']]
        )
)

fig.update_layout(
    title='The 100 poems with the longest line length by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Average line length',
    yaxis_title='Score'
)
fig.show()

# Score & Awards

In [None]:
fig = plot_histogram(
    x = df['score'],    
    params = {'xbins':dict(size=250)},
    title = 'Histogram of the scores of poems by u/poem_for_your_sprog',
    xaxis_title = 'Score',
    yaxis_title = 'Number of comments'
)
fig.show()

In [None]:
fig = plot_histogram(
    x = df['total_awards_received'],    
    params = {'xbins':dict(size=1)},
    title = 'Histogram of the number of awards per comment by u/poem_for_your_sprog',
    xaxis_title = 'Number of awards',
    yaxis_title = 'Number of comments'
)
fig.show()

In [None]:
fig = go.Figure(
        data=go.Scatter(
            x=df['total_awards_received'],
            y=df['score'],
            mode='markers',
            marker=dict(
                size = 8,
                line_width=1,
                opacity=0.7
            ),
            hoverinfo = 'text',
            text=['score: {}<br>{}<br><br>'.format(row['score'],row['awards_simple']) 
                  + re.sub('>','<br>',row['comment_cleaned']) for index, row in df.iterrows()]
        )
)

fig.update_layout(
    title='Score versus number of awards of the poems by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Number of awards received',
    yaxis_title='Score'
)
fig.show()

# Word analysis

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(nltk.corpus.stopwords.words('english'))  # a set makes the membership test fast
comments = df['comment_cleaned'].str.cat(sep=' ')
tokens = tokenizer.tokenize(comments)
tokens = [t for t in tokens if t not in stop_words]
frequency_dist = nltk.FreqDist(tokens)
most_common = frequency_dist.most_common(80)
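
The cell above boils down to three steps: split the text into word tokens, drop stop words, and count what remains. A minimal sketch of the same steps with only the standard library follows. The tiny stop word set is an illustrative stand-in for NLTK's English list, `most_common_words` is a hypothetical name, and the sample text is made up.

```python
import re
from collections import Counter

# Illustrative stand-in for nltk.corpus.stopwords.words('english').
TOY_STOP_WORDS = {"the", "and"}


def most_common_words(text, stop_words, n):
    """Tokenize on runs of word characters, drop stop words, return the n most frequent tokens."""
    tokens = re.findall(r"\w+", text)
    return Counter(t for t in tokens if t not in stop_words).most_common(n)


sample_text = "the wind and the rain>the rain and the rough>the wind and the rain"
top_words = most_common_words(sample_text, TOY_STOP_WORDS, n=3)
print(top_words)  # [('rain', 3), ('wind', 2), ('rough', 1)]
```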

In [None]:
fig = go.Figure()
fig.add_trace(
    go.Bar(
        y=[x[0] for x in most_common[::-1]],
        x=[x[1] for x in most_common[::-1]],
        name='Word count',
        orientation='h'
    )
)
fig.update_layout(
        width=800, 
        height=900,
        title='Most frequently occurring words in comments by u/poem_for_your_sprog',
        title_x=0.5,
        template='simple_white',
        xaxis_title='Occurrences',
        yaxis_title='',
        yaxis=dict(
            tickfont=dict( size=10),
            tickvals=[x[0] for x in most_common[::-1]]
    )
)
fig.show()

# What about Timmy?

In [None]:
comments_about_timmy = np.array(['timmy' in comment for comment in df['comment_cleaned']])
comments_about_timmy_fucking_dying = np.array(['timmy fucking died' in comment for comment in df['comment_cleaned']])

In [None]:
print('Comments about Timmy: {}'.format(comments_about_timmy.sum()))
print('Comments about Timmy fucking dying: {}'.format(comments_about_timmy_fucking_dying.sum()))
print('Comments about Timmy that do not end with Timmy fucking dying: {}'
      .format(comments_about_timmy.sum()-comments_about_timmy_fucking_dying.sum()))

In [None]:
fig = go.Figure(data=[
                    go.Pie(
                        labels=['Timmy fucking dying','Timmy not fucking dying'], 
                        values=[comments_about_timmy_fucking_dying.sum(),
                             comments_about_timmy.sum()-comments_about_timmy_fucking_dying.sum()], hole=.3
        )
    ]
)
fig.update_layout(
        template='simple_white'
)
fig.show()

So... what happens to Timmy if he doesn't fucking die?

In [None]:
df_timmy_not_dying = df[(comments_about_timmy) & (~comments_about_timmy_fucking_dying)].copy()  # copy so the new column below is not set on a view
df_timmy_not_dying['ending'] = [x.split('>')[-1] for x in df_timmy_not_dying['comment_cleaned']]
df_timmy_not_dying = df_timmy_not_dying.sort_values('score')

In [None]:
fig = go.Figure()
fig.add_trace(
    go.Bar(
        y=df_timmy_not_dying['ending'],
        x=df_timmy_not_dying['score'],
        orientation='h'
    )
)
fig.update_layout(
        width=800, 
        height=900,
        title='Best scoring alternative endings to poems about Timmy',
        title_x=0.5,
        template='simple_white',
        xaxis_title='Score',
        yaxis_title='',
        yaxis=dict(
            tickfont=dict( size=10),
            tickvals=df_timmy_not_dying['ending']
    )
)
fig.show()

# Rhyming

In [None]:
import pronouncing

In [None]:
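last_word_pattern = re.compile(r"\s([^\.?!,\s]+)[\.?!,\s']*$")

def get_last_word_per_line(poem):
    # Last word of each line, or None when the pattern does not match (e.g. a single-word line).
    matches = [last_word_pattern.search(line) for line in poem.split('>')]
    return [m.group(1) if m else None for m in matches]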
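
To see what this regular expression does, it helps to run it on a few lines. The sketch below uses the same pattern on a made-up sample; `last_word` is a hypothetical single-line variant of the function above. Note that the pattern requires whitespace before the final word, so a line consisting of a single word yields `None`.

```python
import re

# Same pattern as in get_last_word_per_line: whitespace, the final word, then trailing punctuation.
LAST_WORD = re.compile(r"\s([^\.?!,\s]+)[\.?!,\s']*$")


def last_word(line):
    """Return the last word of a line, or None if the pattern does not match."""
    match = LAST_WORD.search(line)
    return match.group(1) if match else None


sample_poem = "you did what you could,>and that's all you can do.>hello"
last_words = [last_word(line) for line in sample_poem.split(">")]
print(last_words)  # ['could', 'do', None]
```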

In [None]:
last_words_list = [get_last_word_per_line(mystr) for mystr in df['comment_cleaned']]

In [None]:
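# Assumption: `alphabet` is not defined anywhere in this notebook; uppercase letters are used as rhyme labels.
alphabet = string.ascii_uppercase

def get_rhyme_scheme(last_words_per_line):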
    rhyme_scheme = np.empty(len(last_words_per_line),dtype=str)
    k=0
    for i in range(len(last_words_per_line)):
        if rhyme_scheme[i]=='':
            if last_words_per_line[i] is not None:
                rhyme_scheme[i]=alphabet[k % 26]
                rhyme_list = pronouncing.rhymes(last_words_per_line[i])
                rhyme_scheme[(np.array([x in rhyme_list for x in last_words_per_line]) & (rhyme_scheme == ''))] = alphabet[k % 26]
                k+=1
            else:
                rhyme_scheme[i] = '?'               
    return ''.join(rhyme_scheme) 
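
The labelling logic is easier to follow on a tiny example. The sketch below reimplements the same idea in plain Python and swaps `pronouncing.rhymes` for a hand-written lookup table, so `TOY_RHYMES`, `toy_rhymes` and `rhyme_scheme` are all hypothetical names and the table is not real pronunciation data.

```python
import string

# Hand-written stand-in for pronouncing.rhymes(); a word is not listed as rhyming with itself.
TOY_RHYMES = {
    "rough": {"enough"},
    "enough": {"rough"},
    "could": {"should", "would"},
    "you": {"do"},
    "do": {"you"},
}


def toy_rhymes(word):
    return TOY_RHYMES.get(word, set())


def rhyme_scheme(last_words, rhymes_fn=toy_rhymes, alphabet=string.ascii_uppercase):
    """Give every line a letter; lines whose last words rhyme share a letter, unknown lines get '?'."""
    scheme = [None] * len(last_words)
    k = 0
    for i, word in enumerate(last_words):
        if scheme[i] is not None:
            continue  # already labelled as rhyming with an earlier line
        if word is None:
            scheme[i] = "?"
            continue
        letter = alphabet[k % len(alphabet)]
        k += 1
        scheme[i] = letter
        rhyming = rhymes_fn(word)
        for j in range(i + 1, len(last_words)):
            if scheme[j] is None and last_words[j] in rhyming:
                scheme[j] = letter
    return "".join(scheme)


# Last words of the six-line poem quoted at the top.
poem_last_words = ["rough", "enough", "could", "you", "could", "do"]
scheme = rhyme_scheme(poem_last_words)
print(scheme)  # AABCDC
```

With this table the repeated word "could" gets two different letters, because a word is not listed as a rhyme of itself. If the real lookup behaves the same way, repeated end words will fragment the detected schemes.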

In [None]:
rhyme_schemes = [get_rhyme_scheme(x) for x in last_words_list]
df['rhyme_scheme'] = rhyme_schemes

In [None]:
most_common_rhyme_schemes = df['rhyme_scheme'].value_counts().head(25)

In [None]:
fig = go.Figure()
fig.add_trace(
    go.Bar(
        y=most_common_rhyme_schemes.index[::-1],
        x=most_common_rhyme_schemes[::-1],
        name='rhyme',
        orientation='h'
    )
)
fig.update_layout(
        width=800, 
        height=900,
        title='The 25 most common rhyming schemes in poems by /u/poem_for_your_sprog',
        title_x=0.5,
        template='simple_white',
        xaxis_title='Number of poems',
        yaxis_title='',
        yaxis=dict(
            tickfont=dict(size=10),
            tickvals=most_common_rhyme_schemes.index[::-1]
    )
)
fig.show()

In [None]:
df_top_rhymes = df[df['rhyme_scheme'].isin(most_common_rhyme_schemes.index[:10])]

In [None]:
# Keep the 10 highest-scoring poems within each rhyme scheme.
df_top_rhymes = (df_top_rhymes
                 .sort_values('score', ascending=False)
                 .groupby('rhyme_scheme')
                 .head(10))

In [None]:
fig = go.Figure()

for rhyme_scheme in most_common_rhyme_schemes.index[:10]:
    df_subset = df_top_rhymes[df_top_rhymes['rhyme_scheme'] == rhyme_scheme]
    fig.add_trace(go.Scatter(
        x=df_subset['average_line_length'],
        y=df_subset['score'],
        mode='markers',
        name=rhyme_scheme,
        marker=dict(
            size = 8,
            line_width=1,
            opacity=0.7
        ),
        hoverinfo = 'text',
        text=[re.sub('>','<br>',comment) for comment in df_subset['comment_cleaned']]
    )
 )


fig.update_layout(
    title='Top rated poems in the 10 most common rhyming schemes by u/poem_for_your_sprog',
    title_x=0.5,
    template='simple_white',
    xaxis_title='Average line length',
    yaxis_title='Score'
)
fig.show()

# Rhyme sets

In [None]:
def get_rhyme_tuples(last_words_per_line):
    all_rhymes = list()
    for i in range(len(last_words_per_line)-1):
        if last_words_per_line[i] is not None:
            rhymes_with = pronouncing.rhymes(last_words_per_line[i])
            next_words = last_words_per_line[(i+1):np.min([len(last_words_per_line),i+4])]
            index_of_next_rhyme_words = np.where([x in rhymes_with for x in next_words])[0]
            if index_of_next_rhyme_words.size>0:
                all_rhymes.append((last_words_per_line[i], next_words[index_of_next_rhyme_words[0]]))
    return all_rhymes
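
In words: for every line, the function looks at the last words of the next three lines and records the first one that rhymes. The sketch below shows that look-ahead on a toy example. As before, the rhyme lookup is a hand-written stand-in for `pronouncing.rhymes`, and `rhyme_pairs`, `toy_rhymes` and the sample words are hypothetical.

```python
from itertools import chain

# Hand-written stand-in for pronouncing.rhymes(); not real pronunciation data.
TOY_RHYMES = {
    "rough": {"enough"},
    "enough": {"rough"},
    "you": {"do"},
    "do": {"you"},
}


def toy_rhymes(word):
    return TOY_RHYMES.get(word, set())


def rhyme_pairs(last_words, rhymes_fn=toy_rhymes, window=3):
    """Pair each last word with the first rhyming last word among the next `window` lines."""
    pairs = []
    for i, word in enumerate(last_words[:-1]):
        if word is None:
            continue
        rhyming = rhymes_fn(word)
        for candidate in last_words[i + 1 : i + 1 + window]:
            if candidate in rhyming:
                pairs.append((word, candidate))
                break  # only the first rhyme within the window is recorded
    return pairs


poems = [
    ["rough", "enough", "could", "you", "could", "do"],
    ["you", None, "do"],
]
pairs_per_poem = [rhyme_pairs(words) for words in poems]
all_pairs = list(chain.from_iterable(pairs_per_poem))  # flatten into one list
print(all_pairs)
```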

In [None]:
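from itertools import chain
all_rhyme_tuples = [get_rhyme_tuples(x) for x in last_words_list]
# Flatten the per-poem lists into a single list of rhyme pairs (also works when the list is empty).
all_rhyme_tuples = list(chain.from_iterable(all_rhyme_tuples))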

In [None]:
all_rhyme_tuples