# Average engagement in Twitter threads
## Read the data
I'll work with threads that are 15 to 20 tweets long first (`fifteen_twenty.csv`), but you can feed any other length bin to this notebook.

In [1]:
import pandas as pd

csv = pd.read_csv('fifteen_twenty.csv', encoding='iso-8859-1')
csv.head()

Unnamed: 0,id,thread_number,timestamp,text,retweets,likes,replies
0,998968203681427458,Thread 1,1527007554,1) WE HAVE A BREAKING FREAKING STORY HERE\r\r\...,335,679,94
1,998968303136727041,Thread 1,1527007578,2) I AM SO EXCITE I MIGHT NOT EVEN SWEAR OR GO...,60,294,13
2,998968508225589249,Thread 1,1527007626,"3) CNN, AP, and MSNBC....",120,291,5
3,998968614509273088,Thread 1,1527007652,4) HAVE JUST BEEN RAIDED BY THE FCC https://t....,174,465,39
4,998969018781523975,Thread 1,1527007748,5) My one source inside one of these networks ...,242,558,18


## Grouping by thread
Let's transform the data into three dataframes (retweets, likes and replies), where each column is a thread and each row is a tweet position within the thread. Threads shorter than the longest one get padded with `NaN`. (Maybe this isn't the best way to store it...?)

In [2]:
grouped_by_thread = csv.groupby('thread_number')

def metric_by_thread(metric):
    # one column per thread, one row per tweet position;
    # threads shorter than the longest one are padded with NaN
    return pd.DataFrame({
        thread: data[metric].reset_index(drop=True)
        for thread, data in grouped_by_thread
    })

retweets_by_thread = metric_by_thread('retweets')
likes_by_thread = metric_by_thread('likes')
replies_by_thread = metric_by_thread('replies')

retweets_by_thread.head()

Unnamed: 0,Thread 1,Thread 10,Thread 11,Thread 12,Thread 13,Thread 14,Thread 15,Thread 16,Thread 17,Thread 18,...,Thread 90,Thread 91,Thread 92,Thread 93,Thread 94,Thread 95,Thread 96,Thread 97,Thread 98,Thread 99
0,335.0,33,386,52.0,10.0,1.0,75.0,180.0,403,6.0,...,1625.0,25.0,145,132.0,407.0,71,284.0,136.0,143.0,65.0
1,60.0,6,65,17.0,0.0,0.0,1.0,8.0,124,1.0,...,89.0,845.0,0,2.0,118.0,10,140.0,23.0,19.0,12.0
2,120.0,5,96,13.0,0.0,0.0,2.0,6.0,179,0.0,...,82.0,13.0,0,7.0,212.0,15,82.0,12.0,13.0,8.0
3,174.0,3,100,15.0,0.0,0.0,3.0,7.0,95,1.0,...,89.0,3.0,0,4.0,95.0,33,71.0,15.0,12.0,18.0
4,242.0,3,73,11.0,0.0,0.0,3.0,6.0,104,1.0,...,90.0,4.0,0,3.0,244.0,26,84.0,32.0,10.0,32.0
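
On the "maybe this isn't the best way to store it" question: one possible alternative is to number the tweets inside each thread with `cumcount()` and pivot once per metric. This is only a sketch on a made-up toy frame. The `toy` data and the `position` column are my assumptions, not part of the dataset, and it relies on the rows already being in chronological order within each thread.

```python
import pandas as pd

# Toy stand-in for the CSV: two threads of different lengths (made-up numbers)
toy = pd.DataFrame({
    'thread_number': ['Thread 1', 'Thread 1', 'Thread 1', 'Thread 2', 'Thread 2'],
    'retweets': [335, 60, 120, 33, 6],
    'likes': [679, 294, 291, 80, 12],
})

# Position of each tweet inside its thread (0 = first tweet); hypothetical helper column
toy['position'] = toy.groupby('thread_number').cumcount()

# One row per position, one column per thread; missing positions become NaN
toy_retweets = toy.pivot(index='position', columns='thread_number', values='retweets')
print(toy_retweets)
```

The result has the same shape as `retweets_by_thread`, and the long-format frame stays around for anything else you want to compute.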


## Average length of thread
These threads are 15-20 tweets long, so in theory the average length should be somewhere around `17.5`...

In [3]:
average_length = grouped_by_thread.size().mean()
average_length

17.666666666666668
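
The mean lands a bit above `17.5`, which hints that the longer threads in this bin are slightly more common. A quick way to check would be to count how many threads there are of each length. Here's a minimal sketch on made-up labels (the `toy` frame is an assumption), using the same `groupby(...).size()` call as above.

```python
import pandas as pd

# Made-up thread labels: thread A has 3 tweets, B has 2, C has 3
toy = pd.DataFrame({'thread_number': list('AAABBCCC')})

lengths = toy.groupby('thread_number').size()         # tweets per thread
length_counts = lengths.value_counts().sort_index()   # number of threads per length

print(length_counts)
print(lengths.mean())
```

In the notebook, the equivalent would be `grouped_by_thread.size().value_counts().sort_index()`.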

## Visualize average engagement along threads
Let's plot the average engagement a thread receives as it unfolds...

In [4]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import Span

output_notebook()

In [5]:
# averages
avg = pd.DataFrame()
avg['Retweets'] = retweets_by_thread.mean(axis=1)
avg['Likes'] = likes_by_thread.mean(axis=1)
avg['Replies'] = replies_by_thread.mean(axis=1)

average_engagement = figure(plot_width=700, 
           plot_height=350, 
           title='Average engagement in 15-20 tweet-long Twitter threads', 
           background_fill_color="#f2f3f7", 
           y_axis_label='Engagement (# of interactions)', 
           x_axis_label='Tweets along the thread')

# x positions come from the data, so other thread-length bins work too
avg_positions = list(avg.index + 1)
average_engagement.line(avg_positions, avg['Retweets'].values, line_color='#17bf63', legend='Retweets')
average_engagement.line(avg_positions, avg['Likes'].values, line_color='#e0245e', legend='Likes')
average_engagement.line(avg_positions, avg['Replies'].values, line_color='#1da1f2', legend='Replies')

show(average_engagement)

### Some interesting things happen here...
 - Engagement drops sharply from the first tweet to the second
 - The spikes: when retweets rise, likes rise too
 - Replies are almost negligible compared to the other types of engagement
 - Engagement kinda *plateaus* after the first 2-3 tweets, but increases again at the end
 - The final statement: good threads tend to finish with a strong closing statement, which may be what lifts engagement at the end
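
The "retweets rise, likes rise too" observation could be checked with a number instead of by eye. Here's a minimal sketch that computes the Pearson correlation between the two curves. The `toy_avg` values are made up and only shaped like the `avg` dataframe above.

```python
import pandas as pd

# Made-up averages per tweet position, shaped like the `avg` dataframe
toy_avg = pd.DataFrame({
    'Retweets': [300.0, 60.0, 50.0, 90.0, 55.0],
    'Likes': [700.0, 200.0, 180.0, 260.0, 190.0],
})

# Pearson correlation between the two curves (values near 1 mean they move together)
r = toy_avg['Retweets'].corr(toy_avg['Likes'])
print(r)
```

In the notebook this would be `avg['Retweets'].corr(avg['Likes'])`. Keep in mind that a high correlation says nothing about which one drives the other.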



## Better ways of describing the data
Let's make more graphs to better understand what's happening...

In [6]:
from bokeh.layouts import row as bokeh_row

scatter_rts = figure(plot_width=420, plot_height=310, title='Scatter plot of retweets in 15-20 tweet-long threads', x_axis_label='Tweets along the thread', y_axis_label='# of Retweets')
scatter_likes = figure(plot_width=420, plot_height=310, title='Scatter plot of likes in 15-20 tweet-long threads', x_axis_label='Tweets along the thread', y_axis_label='# of Likes')
scatter_replies = figure(plot_width=420, plot_height=310, title='Scatter plot of replies in 15-20 tweet-long threads', x_axis_label='Tweets along the thread', y_axis_label='# of Replies')

# iterating over a dataframe yields its column names, i.e. one thread at a time
# add each thread's data points to the retweets scatter plot
for thread in retweets_by_thread:
    scatter_rts.circle(list(retweets_by_thread.index + 1), retweets_by_thread.loc[:, thread], size=3, line_color="#17bf63", fill_color="#17bf63", fill_alpha=0.5)

# add each thread's data points to the likes scatter plot
for thread in likes_by_thread:
    scatter_likes.circle(list(likes_by_thread.index + 1), likes_by_thread.loc[:, thread], size=3, line_color="#e0245e", fill_color="#e0245e", fill_alpha=0.5)

# add each thread's data points to the replies scatter plot
for thread in replies_by_thread:
    scatter_replies.circle(list(replies_by_thread.index + 1), replies_by_thread.loc[:, thread], size=3, line_color="#1da1f2", fill_color="#1da1f2", fill_alpha=0.5)
    
show(bokeh_row(scatter_rts, scatter_likes, scatter_replies))

### There are outliers...
Outliers can, and will, mess up averages, making the *awesome* line graph above kind of wrong and useless.  
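
To see how much a single outlier can move an average, here's a tiny made-up example. The numbers are assumptions for illustration, with `1625` standing in for one viral thread. The median is printed next to the mean for comparison.

```python
import pandas as pd

# Made-up retweet counts for one tweet position across five threads
typical = pd.Series([10, 12, 9, 11, 13])
with_viral = pd.Series([10, 12, 9, 11, 1625])  # same, but one thread went viral

print(typical.mean(), typical.median())        # 11.0 11.0
print(with_viral.mean(), with_viral.median())  # 333.4 11.0
```

One viral thread out of five multiplies the mean by about 30, while the median doesn't move at all.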
  
Averages are good for describing normally distributed data, so let's check for that...
  
I'm gonna plot histograms only for the **first tweet** in each thread.  
Let's see its engagement distribution.

In [7]:
import numpy as np

# retweets
hist_rts_values, rt_edges = np.histogram(retweets_by_thread.iloc[0, :])
hist_rts = figure(plot_width=420, plot_height=310, 
                  title='Histogram of retweets in the first tweet of each thread', 
                  x_axis_label='# of Retweets', 
                  y_axis_label='Frequency')

hist_rts.quad(top=hist_rts_values, bottom=0, left=rt_edges[:-1], right=rt_edges[1:],
        fill_color="#17bf63", line_color="#17bf63")

# likes
hist_likes_values, likes_edges = np.histogram(likes_by_thread.iloc[0, :])
hist_likes = figure(plot_width=420, plot_height=310, 
                  title='Histogram of likes in the first tweet of each thread', 
                  x_axis_label='# of Likes', 
                  y_axis_label='Frequency')

hist_likes.quad(top=hist_likes_values, bottom=0, left=likes_edges[:-1], right=likes_edges[1:],
        fill_color="#e0245e", line_color="#e0245e")

# replies
hist_rpl_values, rpl_edges = np.histogram(replies_by_thread.iloc[0, :])
hist_replies = figure(plot_width=420, plot_height=310, 
                  title='Histogram of replies in the first tweet of each thread', 
                  x_axis_label='# of Replies', 
                  y_axis_label='Frequency')

hist_replies.quad(top=hist_rpl_values, bottom=0, left=rpl_edges[:-1], right=rpl_edges[1:],
        fill_color="#1da1f2", line_color="#1da1f2")

# show results
show(bokeh_row(hist_rts, hist_likes, hist_replies))

### Positively skewed data...
Ok so the data is definitely *not* normally distributed. Therefore, the graph showing average engagement as the thread unfolds is... useless? Maybe.  
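
The histograms already make the case visually, but the skew can also be put into a number with pandas' `Series.skew()`. Here's a sketch on made-up first-tweet counts (the values are an assumption, picked to have lots of small numbers and a long right tail).

```python
import pandas as pd

# Made-up first-tweet retweet counts: mostly small, with one huge value
first_tweet_rts = pd.Series([1, 3, 5, 6, 10, 25, 33, 52, 75, 1625])

skewness = first_tweet_rts.skew()  # positive means a long right tail
print(skewness)

# In right-skewed data the mean usually sits above the median
print(first_tweet_rts.mean(), first_tweet_rts.median())  # 183.5 17.5
```

On the real data this would be something like `retweets_by_thread.iloc[0, :].skew()`.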
  
Since the data is positively skewed, a handful of very popular threads drags the averages up. So instead I'm gonna use the **median** to describe the data.  
Let's see the median of the engagement as the thread unfolds.

In [8]:
# median of engagement
median = pd.DataFrame()
median['Retweets'] = retweets_by_thread.median(axis=1)
median['Likes'] = likes_by_thread.median(axis=1)
median['Replies'] = replies_by_thread.median(axis=1)

median_engagement = figure(plot_width=700, 
           plot_height=350, 
           title='Median of engagement in 15-20 tweet-long Twitter threads', 
           background_fill_color="#f2f3f7", 
           y_axis_label='Engagement (# of interactions)', 
           x_axis_label='Tweets along the thread')

# add one line per engagement type; x positions come from the data
median_positions = list(median.index + 1)
median_engagement.line(median_positions, median['Retweets'].values, line_color='#17bf63', legend='Retweets')
median_engagement.line(median_positions, median['Likes'].values, line_color='#e0245e', legend='Likes')
median_engagement.line(median_positions, median['Replies'].values, line_color='#1da1f2', legend='Replies')

show(bokeh_row(median_engagement, average_engagement))

## Conclusions
The graph on the right is, uh, noisy, because I used averages on data that isn't normally distributed. It shows an interesting relationship between retweets and likes though.  
  
Anyway, some things that I stated at the beginning of the notebook still hold true:
 - The drop in engagement after the first tweet is real, and very steep
 - Engagement is somewhat steady in the middle
 - Some spikes of engagement are still present, especially at the end (the final statement)
 - In some parts retweets and likes rise together (which one drives the other, I can't tell)
 - Replies are still negligible  
 
And notice how the values on the y-axis are *a lot* lower now. Thanks, outliers.  
  
Overall, it looks like the [headline problem](https://www.washingtonpost.com/news/the-fix/wp/2014/03/19/americans-read-headlines-and-not-much-else/) might be present, but we can't be 100% sure, because engagement and tweet impressions aren't the same: people may well read the later tweets without interacting with them.
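
To put a number on the drop from the first to the second tweet, something like the following could work. It's a sketch on made-up medians. The `toy_median` frame is an assumption that only mimics the shape of the `median` dataframe above.

```python
import pandas as pd

# Made-up medians per tweet position, shaped like the `median` dataframe
toy_median = pd.DataFrame({
    'Retweets': [100.0, 10.0, 8.0],
    'Likes': [200.0, 50.0, 40.0],
})

# Share of first-tweet engagement that the second tweet keeps
kept = toy_median.iloc[1] / toy_median.iloc[0]

# Percentage lost between tweet 1 and tweet 2, per engagement type
drop_pct = (1 - kept) * 100
print(drop_pct)
```

Swapping `toy_median` for `median` would give the actual drop for each engagement type in this bin.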