# Reddit Data Analysis

The Reddit Data Collection notebook gathered Reddit data from the PushShift API. In this notebook we begin to analyze that data and get it ready for more qualitative investigation.

## Submissions

Users post *submissions* to Reddit, which share resources and then generate conversation as comments. First let's look at the submissions that were collected.

In [3]:
import pandas

pv = pandas.read_csv('data/reddit-police-violence-20200101-20201231.csv', parse_dates=['created'])
pv.describe()

Unnamed: 0,pwls,third_party_trackers,mobile_ad_url,eventsOnRender,score,thumbnail_height,domain_override,events,upvote_ratio,call_to_action,...,total_awards_received,num_comments,retrieved_on,outbound_link,author_id,num_crossposts,subreddit_subscribers,created_utc,edited,edited.1
count,4331.0,0.0,0.0,0.0,6097.0,4522.0,0.0,0.0,5921.0,0.0,...,6097.0,6097.0,6097.0,0.0,0.0,6097.0,6097.0,6097.0,20.0,20.0
mean,5.317479,,,,15.890274,94.282839,,,0.968846,,...,0.002624,10.868788,1594226000.0,,,0.020666,1928572.0,1594211000.0,1591922000.0,1591922000.0
std,2.323117,,,,487.659461,22.941147,,,0.106014,,...,0.072405,131.256218,4790657.0,,,0.343528,5939344.0,4796104.0,1301060.0,1301060.0
min,0.0,,,,0.0,1.0,,,0.11,,...,0.0,0.0,1578135000.0,,,0.0,0.0,1578068000.0,1590954000.0,1590954000.0
25%,6.0,,,,1.0,78.0,,,1.0,,...,0.0,0.0,1591242000.0,,,0.0,101.0,1591242000.0,1591127000.0,1591127000.0
50%,6.0,,,,1.0,93.0,,,1.0,,...,0.0,0.0,1592006000.0,,,0.0,6115.0,1591913000.0,1591335000.0,1591335000.0
75%,7.0,,,,1.0,105.0,,,1.0,,...,0.0,2.0,1598387000.0,,,0.0,168250.0,1598387000.0,1591862000.0,1591862000.0
max,7.0,,,,33061.0,140.0,,,1.0,,...,3.0,8160.0,1609358000.0,,,15.0,34793610.0,1609358000.0,1595437000.0,1595437000.0


### Posts per Day

We can look at the number of posts per day.

In [8]:
pv_day_counts = pv['created'].dt.floor('d').value_counts().rename_axis('date').reset_index(name='count')
pv_day_counts

Unnamed: 0,date,count
0,2020-06-03,296
1,2020-06-07,285
2,2020-06-05,260
3,2020-06-02,244
4,2020-06-01,218
...,...,...
302,2020-04-19,1
303,2020-04-18,1
304,2020-04-23,1
305,2020-04-29,1


In [70]:
import plotly.express as px

px.bar(pv_day_counts, x='date', y='count', title='Police Violence: Reddit Posts per Day')

Unsurprisingly, there's a lot of post activity starting right after the murder of George Floyd by Minneapolis police on May 25, 2020.
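
The day counts table earlier in this section is ordered by count and only lists days that had at least one post. A minimal sketch of putting the days in calendar order and filling the gaps with zeros, using a small made-up `created` series rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for the `created` column of the submissions table.
created = pd.Series(pd.to_datetime([
    "2020-06-01 08:15", "2020-06-01 17:40",
    "2020-06-03 09:05",
    "2020-06-04 12:00", "2020-06-04 13:30", "2020-06-04 23:59",
]), name="created")

# Count posts per calendar day, then put the days in date order.
counts = created.dt.floor("D").value_counts().sort_index()

# Reindex onto every day in the range so days with no posts appear as 0.
all_days = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
day_counts = (
    counts.reindex(all_days, fill_value=0)
    .rename_axis("date")
    .reset_index(name="count")
)
print(day_counts)
```

Whether zero-filled days are wanted depends on the chart: a bar chart reads fine without them, but a line chart would otherwise connect across the gaps.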

### Subreddits

Reddit is a collection of topical subreddits. We can look and see which ones have the most activity over this period.

In [13]:
pv_subs = pv['subreddit'].value_counts().rename_axis('subreddit').reset_index(name='count')
pv_subs

Unnamed: 0,subreddit,count
0,u_toronto_news,641
1,newsbotbot,269
2,Bad_Cop_No_Donut,186
3,politics,184
4,AutoNewspaper,129
...,...,...
1303,statistics,1
1304,McMaster,1
1305,Vaporwave,1
1306,Fun,1


1307 categories won't fit comfortably in a bar graph, so let's remove the long tail by examining only the subreddits that were posted to at least 20 times:

In [69]:
px.bar(pv_subs[pv_subs['count'] >= 20], x='subreddit', y='count', title='Police Violence: Subreddits')
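
Before trimming the long tail it can help to check how much of the data the cut-off keeps. The sketch below uses a small made-up table in place of `pv_subs`; the threshold of 20 matches the chart, but the counts are hypothetical.

```python
import pandas as pd

# Hypothetical counts standing in for `pv_subs`; the real table has far more rows.
subs = pd.DataFrame({
    "subreddit": ["a", "b", "c", "d", "e"],
    "count": [50, 30, 20, 5, 1],
})

threshold = 20  # same cut-off as the bar chart
kept = subs[subs["count"] >= threshold]

# Fraction of all submissions that fall in the subreddits we keep.
share = kept["count"].sum() / subs["count"].sum()
print(f"{len(kept)} of {len(subs)} subreddits kept, covering {share:.1%} of posts")
```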

### Score

Each submission has a *score* that is used to determine the post's position in the list of submissions. The score reflects the number of upvotes minus the number of downvotes. So a score of 0 could mean either that nobody has voted on the submission or, if there is a lot of engagement (e.g. comments), that there has been substantial downvoting. A score of 0 alone doesn't tell us whether there was *more* downvoting than upvoting.
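
One way to separate the two readings of a zero score is to look for zero-score submissions that still attracted comments. This is a sketch on made-up rows; the `title` column and the cut-off of 10 comments are assumptions for illustration.

```python
import pandas as pd

# Hypothetical submissions; `title` is an assumed column name.
posts = pd.DataFrame({
    "title": ["p1", "p2", "p3", "p4"],
    "score": [0, 0, 12, 1],
    "num_comments": [0, 45, 3, 0],
})

min_comments = 10  # arbitrary cut-off for "lots of engagement"

# Zero-score posts that nevertheless drew discussion.
contested = posts[(posts["score"] == 0) & (posts["num_comments"] >= min_comments)]
print(contested)
```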

In [68]:
# Select the column before aggregating so non-numeric columns are never averaged.
px.line(pv.resample('D', on='created')['score'].mean(), title='Police Violence: Average Score per Day')

It looks like there are no scores less than zero. There seems to be some [opinion](https://www.reddit.com/r/NoStupidQuestions/comments/1y34a0/why_does_reddit_always_display_a_post_with/) out there suggesting this is another thing Reddit does to prevent automated manipulation and perhaps harassment.

### Upvote Ratio

Reddit's curation algorithm is driven by users voting submissions up and down. In 2014 Reddit [stopped](https://www.reddit.com/r/announcements/comments/28hjga/reddit_changes_individual_updown_vote_counts_no/) making the number of upvotes and downvotes available in order to reduce gaming of the platform. Instead it shows the *upvote ratio*: the ratio of upvotes to the total number of votes (upvotes + downvotes). This makes it harder for users to determine what effect a vote might have.
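
Since the score is upvotes minus downvotes and the upvote ratio is upvotes divided by total votes, the two numbers together can be used to back out rough vote counts: with score s and ratio r, upvotes = s * r / (2r - 1) and downvotes = upvotes - s. The sketch below is only illustrative. `estimate_votes` is a hypothetical helper, and it assumes the reported values are exact, which may not hold given that the scores in this dataset never drop below zero.

```python
from typing import Tuple


def estimate_votes(score: float, upvote_ratio: float) -> Tuple[float, float]:
    """Back out (upvotes, downvotes) from score = u - d and ratio = u / (u + d)."""
    if upvote_ratio == 0.5:
        # u == d here, so the score is 0 and the totals cannot be recovered.
        raise ValueError("vote counts are undetermined when the ratio is 0.5")
    upvotes = score * upvote_ratio / (2 * upvote_ratio - 1)
    downvotes = upvotes - score
    return upvotes, downvotes


# Hypothetical example values, not taken from the dataset.
up, down = estimate_votes(score=60, upvote_ratio=0.8)
print(round(up), round(down))
```

For a score of 60 and a ratio of 0.8 this works out to about 80 upvotes and 20 downvotes.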

First let's see what the average upvote ratio looks like on the whole.

In [33]:
pv['upvote_ratio'].describe()

count    5921.000000
mean        0.968846
std         0.106014
min         0.110000
25%         1.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: upvote_ratio, dtype: float64

OK, pretty much overwhelmingly positive. What does the average upvote ratio look like over time?

In [57]:
# Select the column before aggregating so non-numeric columns are never averaged.
upvote_time = (pv.resample('D', on='created')['upvote_ratio']
    .mean()
    .rename_axis('date')
    .reset_index(name='upvote_ratio')
)
upvote_time

Unnamed: 0,date,upvote_ratio
0,2020-01-03,
1,2020-01-04,
2,2020-01-05,
3,2020-01-06,
4,2020-01-07,
...,...,...
358,2020-12-26,1.0
359,2020-12-27,
360,2020-12-28,1.0
361,2020-12-29,


In [67]:
px.line(upvote_time, x='date', y='upvote_ratio', title='Police Violence: Average Upvote Ratio per Day')

It looks like there are periods where the upvote ratio dips due to downvoting. These could be interesting to look at more closely, especially to see what content is being downvoted. Also notice the lack of engagement prior to May.
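
To follow up on the dips, one option is to pick out the days whose mean ratio falls below some cut-off and list the submissions from those days, lowest ratio first. This sketch runs on made-up rows, and the 0.9 threshold is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical submissions with only the columns needed here.
posts = pd.DataFrame({
    "created": pd.to_datetime([
        "2020-06-01 09:00", "2020-06-01 15:00",
        "2020-06-02 10:00", "2020-06-02 18:30",
        "2020-06-03 12:00",
    ]),
    "upvote_ratio": [1.0, 1.0, 0.5, 0.9, 0.95],
})

daily = posts.resample("D", on="created")["upvote_ratio"].mean()

threshold = 0.9  # arbitrary cut-off for a "dip"
low_days = daily[daily < threshold].index

# Submissions posted on the low days, most heavily downvoted first.
on_low_day = posts["created"].dt.floor("D").isin(low_days)
suspects = posts[on_low_day].sort_values("upvote_ratio")
print(suspects)
```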

It could also be interesting to see whether the average (mean) differs much from the median.

In [74]:
px.line(pv.resample('D', on='created')['upvote_ratio'].median().rename_axis('date').reset_index(name='upvote_ratio'),
        x='date', y='upvote_ratio')

The median looks less noisy, but it follows much the same pattern as the mean.
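
The mean and median can also be computed in a single pass so they end up side by side in one table. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical submissions with only the columns needed here.
posts = pd.DataFrame({
    "created": pd.to_datetime([
        "2020-06-01 09:00", "2020-06-01 12:00", "2020-06-01 20:00",
        "2020-06-02 11:00",
    ]),
    "upvote_ratio": [1.0, 1.0, 0.4, 0.9],
})

# One row per day, with a column for each statistic.
daily = (
    posts.resample("D", on="created")["upvote_ratio"]
    .agg(["mean", "median"])
    .rename_axis("date")
    .reset_index()
)
print(daily)
```

With both columns in one frame, a call along the lines of `px.line(daily, x='date', y=['mean', 'median'])` should draw them on the same axes for direct comparison.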

## Comments

We can also look at the number of comments. First, what do they look like in aggregate?

In [53]:
pv['num_comments'].describe()

count    6097.000000
mean       10.868788
std       131.256218
min         0.000000
25%         0.000000
50%         0.000000
75%         2.000000
max      8160.000000
Name: num_comments, dtype: float64

Hmm, quite a bit of variability. Most submissions get very little engagement, but one has 8160 comments! We can similarly look at comments over time by day.

In [66]:
# Select the column before aggregating so non-numeric columns are never averaged.
comments_time = (pv.resample('D', on='created')['num_comments']
    .mean()
    .rename_axis('created')
    .reset_index(name='comments')
)
px.line(comments_time, x='created', y='comments', title='Police Violence: Average Comments per Day')
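
Because the comment counts are so skewed, a daily mean can be dominated by a single busy thread. A rough sketch of two complementary views, the per-day median and maximum alongside the mean, and the most-commented submissions overall, using made-up rows:

```python
import pandas as pd

# Hypothetical submissions with only the columns needed here.
posts = pd.DataFrame({
    "created": pd.to_datetime([
        "2020-06-01 09:00", "2020-06-01 12:00", "2020-06-01 20:00",
        "2020-06-02 08:00", "2020-06-02 13:00", "2020-06-02 21:00",
    ]),
    "num_comments": [0, 0, 2, 8160, 1, 0],
})

# Mean, median and max per day show how far one thread pulls the average.
daily = posts.resample("D", on="created")["num_comments"].agg(["mean", "median", "max"])
print(daily)

# The individual submissions with the most comments.
top = posts.nlargest(3, "num_comments")
print(top)
```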