A (not-so-scientific) look at how reddit comments change over time

Reddit Comment Analysis

About

This was much more of a curiosity project than anything else. I wanted to learn some more Python, as I haven't used the language much in the past. I figured Python is the perfect language to write a Reddit bot in (because of PRAW). I didn't want to make a bot that replies to comments (those are super annoying and overdone at this point), so I had to think of something else. I settled on a data collection bot that I can use to compile statistics based on comment patterns. It's not complicated (code-wise), but I made it in the hope that the results would be interesting.

My idea was to collect a bunch of Reddit comments on one day of the week, then collect any changes to those comments a day later (account karma gains, how many comments were deleted, etc.). I would then collect a different set of comments on a different day of the week and repeat the process.

My data (the data provided in this repo) consists of 10,000 comments collected from /r/all on Monday, 6/18, as well as any changes to those comments, which were collected on Tuesday, 6/19. I then collected 10,000 different comments on the following Saturday, 6/23, and collected the same data (with any differences) on Sunday, 6/24.

These are my results. It's worth noting that these results are definitely not super scientific or anything. This was purely just a learning experience.

In the future, I'd like to do a longer experiment, perhaps a week or two between collecting initial and final data. Both of the trials I ran with this lasted 24 hours, and it limited some of the results more than I expected.

Setup

The only outside libraries you'll need to run this project yourself are PRAW and Matplotlib. Both are available via pip.

Then you need to set up a script application on Reddit and enter the login information in the praw.ini file. See instructions here. You also need to change the default user agent in the authenticate function of bot.py.
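As a rough illustration of what the praw.ini entries might look like, the sketch below parses a sample file with Python's standard configparser module and reports any login fields that are missing. The site name `bot` and every credential value are placeholders made up for this example, not values from this repo; the key names are the ones PRAW documents for script applications.

```python
import configparser

# Hypothetical praw.ini contents. The [bot] site name and all values
# are placeholders, not taken from this repository.
SAMPLE_PRAW_INI = """
[bot]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
username=YOUR_REDDIT_USERNAME
password=YOUR_REDDIT_PASSWORD
"""

REQUIRED_KEYS = ("client_id", "client_secret", "username", "password")


def missing_keys(ini_text, site="bot"):
    """Return the required keys that are absent or empty for the given site."""
    # interpolation=None so a '%' in a password is read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(site):
        return list(REQUIRED_KEYS)
    return [
        key for key in REQUIRED_KEYS
        if not parser.get(site, key, fallback="").strip()
    ]


print(missing_keys(SAMPLE_PRAW_INI))
```

With a section like this in place, a call along the lines of `praw.Reddit("bot", user_agent="...")` should pick up the credentials from that section. Whether bot.py uses that exact site name is an assumption here.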

At the top of bot.py, add the relative path to your initial file and final file locations. If collecting initial data, then change the value of COMMENTS_TO_GET to the number of comments you want to collect. bot.py should then run without issue.

In the middle of data.py, at the top of the print_data_to_file method, add a location where you would like some statistics printed. Then, at the bottom of data.py, in main, add the relative path to your initial file and final file locations. data.py should then run without issue.

Once those are set up, you're ready to collect data. You can set the subreddit as well as the number of comments you'd like to collect within bot.py. Then, simply run python bot.py with the run_initial(reddit) call uncommented and the run_final(reddit) call commented out. Once you have your initial data, run bot.py again with run_initial(reddit) commented out and run_final(reddit) uncommented. Both calls are made in main.

Data File Format

The initial and final data files are laid out differently, so I'll explain both.

Initial File

The initial file stores 8 data points for each comment. In order, these are:

  1. The comment's sequence number. This should match the line number in the file. (If we've collected 5000 comments, the next comment will have comment number 5001.)
  2. The comment author's Reddit username
  3. The comment author's account creation time (in seconds since epoch). When printing to a results file, this time is converted to UTC time using the strftime function of the time module.
  4. The comment author's total karma. The 'total' karma is calculated by adding the author's comment and post karma together
  5. The subreddit that the comment came from (we're collecting comments from /r/all, which draws from a very large number of subreddits)
  6. The permalink to the comment (this isn't explicitly used anywhere, so feel free to remove it)
  7. The ID of the comment. This is used to collect data on the comments after they were initially collected (for the final phase)
  8. The length of the comment
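To make the layout above concrete, here is a minimal sketch of reading one line of the initial file into a record and converting the creation time to a UTC string. The README does not say how fields are separated, so the comma separator is an assumption, as are the `InitialRecord`, `parse_initial_line`, and `format_created` names and the sample line, none of which come from bot.py or data.py.

```python
import time
from dataclasses import dataclass


@dataclass
class InitialRecord:
    number: int          # 1. sequence number
    author: str          # 2. Reddit username
    created_utc: float   # 3. account creation time, seconds since epoch
    total_karma: int     # 4. comment karma + post karma
    subreddit: str       # 5. subreddit the comment came from
    permalink: str       # 6. permalink to the comment
    comment_id: str      # 7. comment ID, used again in the final phase
    length: int          # 8. length of the comment


def parse_initial_line(line, sep=","):
    """Split one line into its 8 fields. The separator is an assumption."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    number, author, created, karma, subreddit, permalink, comment_id, length = parts
    return InitialRecord(int(number), author, float(created), int(karma),
                         subreddit, permalink, comment_id, int(length))


def format_created(created_utc):
    """Render seconds-since-epoch as a UTC string via time.strftime."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(created_utc))


# Made-up sample line, not taken from the repo's data files.
sample = ("5001,example_user,1442275200.0,153,AskReddit,"
          "/r/AskReddit/comments/abc123/_/def456/,def456,42")
record = parse_initial_line(sample)
print(record.subreddit, format_created(record.created_utc))
```

If the real files use a different separator (a tab, for instance), passing it as `sep` should be the only change needed.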

Final File

The final file stores 8 data points for each comment. In order, these are:

  1. The comment's sequence number. This should match the line number in the file. (If we've collected 5000 comments, the next comment will have comment number 5001.)
  2. The comment author's Reddit username
  3. The comment author's account creation time (in seconds since epoch). When printing to a results file, this time is converted to UTC time using the strftime function of the time module.
  4. The comment score, which is roughly the number of upvotes minus the number of downvotes the comment has received.
  5. The number of 'top level' (for lack of a better term) replies. By 'top-level' I mean only the initial replies to a comment. If a comment receives one reply, and then somebody replies to that reply, then the initial comment only has one reply.
  6. The permalink to the comment (this isn't explicitly used anywhere, so feel free to remove it)
  7. The ID of the comment. This is used to collect data on the comments after they were initially collected (for the final phase)
  8. The length of the comment
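Because the final file has no subreddit column, any per-subreddit statistic on scores needs the two files joined on the comment ID (field 7 in both layouts). The sketch below shows one way that join could look. It is not the code in data.py: the comma separator, the function names, and the sample lines are all assumptions, and it does not try to handle however deleted or removed comments are recorded.

```python
from collections import defaultdict


def parse_final_line(line, sep=","):
    """Pull the ID, score, and reply count out of one final-file line."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    _number, _author, _created, score, replies, _permalink, comment_id, _length = parts
    return {"id": comment_id, "score": int(score), "replies": int(replies)}


def average_score_by_subreddit(initial_lines, final_lines, sep=","):
    """Join both files on comment ID and average the final score per subreddit."""
    # Initial layout: field 5 (index 4) is the subreddit, field 7 (index 6) the ID.
    subreddit_by_id = {}
    for line in initial_lines:
        parts = line.rstrip("\n").split(sep)
        subreddit_by_id[parts[6]] = parts[4]

    totals = defaultdict(lambda: [0, 0])  # subreddit -> [score sum, comment count]
    for line in final_lines:
        rec = parse_final_line(line, sep)
        subreddit = subreddit_by_id.get(rec["id"])
        if subreddit is None:
            continue  # comment has no matching initial record
        totals[subreddit][0] += rec["score"]
        totals[subreddit][1] += 1
    return {sub: total / count for sub, (total, count) in totals.items()}


# Made-up sample lines, not taken from the repo's data files.
initial = [
    "1,alice,1420070400.0,100,AskReddit,/r/AskReddit/comments/a/_/c1/,c1,20",
    "2,bob,1420070400.0,50,soccer,/r/soccer/comments/b/_/c2/,c2,35",
    "3,carol,1420070400.0,75,AskReddit,/r/AskReddit/comments/a/_/c3/,c3,12",
]
final = [
    "1,alice,1420070400.0,10,2,/r/AskReddit/comments/a/_/c1/,c1,20",
    "2,bob,1420070400.0,4,0,/r/soccer/comments/b/_/c2/,c2,35",
    "3,carol,1420070400.0,6,1,/r/AskReddit/comments/a/_/c3/,c3,12",
]
print(average_score_by_subreddit(initial, final))
```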

Findings

If you're interested, a week-long data collection was also done, and you can read about the results in WEEK_EXPERIMENT_RESULTS.md.

Monday - Tuesday Data

The full results are available in data/monday-tuesday/results.txt, but I will go over the basics here.

Relatively unsurprisingly, of the 10,000 comments I collected, the largest share (517) came from r/AskReddit. See the chart below for the other top subreddits in my data.

[image: chart of top subreddits]

Overall, I collected comments from a staggering 2795 subreddits.
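Counts like these can be reproduced from the subreddit column of the initial file with a frequency table. The sketch below uses collections.Counter for that; the `subreddit_stats` name and the sample list are made up for illustration and are not how data.py necessarily does it.

```python
from collections import Counter


def subreddit_stats(subreddits, top_n=3):
    """Return the top_n most common subreddits and the number of distinct ones."""
    counts = Counter(subreddits)
    return counts.most_common(top_n), len(counts)


# Made-up sample standing in for the subreddit column of the initial file.
sample = ["AskReddit", "soccer", "AskReddit", "hiphopheads", "soccer", "AskReddit"]
top, distinct = subreddit_stats(sample, top_n=2)
print(top, distinct)
```

The list returned by `most_common` is already sorted by count, so it could be handed more or less directly to a Matplotlib bar chart.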

There were many interesting stats from the accounts that made the comments as well. The average account from the comments I collected was made on September 15th, 2015. The oldest account I collected was made on September 19th, 2005 (oddly similar dates compared with the average). The average account had 153 total karma.
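An "average account creation date" can be obtained by averaging the epoch timestamps and converting the mean back to a date, and the oldest account is simply the minimum. The sketch below shows that arithmetic under the assumption that the creation times are plain seconds-since-epoch numbers; the function name and the sample timestamps are hypothetical.

```python
import time


def summarize_account_ages(created_times):
    """Return (average creation date, oldest creation date) as UTC date strings."""
    if not created_times:
        raise ValueError("no creation times given")
    average = sum(created_times) / len(created_times)
    oldest = min(created_times)

    def as_date(seconds):
        return time.strftime("%Y-%m-%d", time.gmtime(seconds))

    return as_date(average), as_date(oldest)


# Made-up timestamps: 2015-01-01 and 2015-01-03, both at 00:00:00 UTC.
sample_times = [1420070400.0, 1420243200.0]
print(summarize_account_ages(sample_times))
```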

In total, there were 401 deleted or removed comments. After 24 hours, the average comment had a score of just over 8, and the average number of replies was just over 0.5.

[image]

Again, to see more complete data, check the data/monday-tuesday/results.txt file.

Saturday - Sunday Data

The full results are available in data/saturday-sunday/results.txt, but I will go over the basics here.

As with the Monday-Tuesday results, r/AskReddit contributed more comments than any other subreddit (by a wide margin). See the chart below for the other top subreddits.

[image: chart of top subreddits]

This time around, I collected comments from 2818 subreddits, a number very similar to the previous 2795.

The oldest account I collected was made on September 13th, 2005, while the average account I collected data from was made on November 6th, 2015. The average account had 165 total karma.

In total, there were 399 deleted or removed comments. After 24 hours, the average comment had a score of just over 9, and the average number of replies was just over 0.5.

[image]

Again, to see more complete data, check the data/saturday-sunday/results.txt file.

Overall Conclusions

After looking at the results, I would like to do multiple, longer runs of this project. I am planning on doing multiple week-long experiments and comparing those results. When I do these, I will attach the results to a new README, and hopefully the results will be a little more interesting.

One thing I noticed from the data is the popularity of certain subreddits during or after certain events. There are examples of this in both data sets, actually.

In the first data set, which has comments from Monday, June 18th, r/hiphopheads was the third most popular subreddit in r/all by number of comments. This can be attributed to the murder of popular rapper XXXTentacion.

In the second data set, r/soccer had the second most comments of any subreddit in r/all, and this can be attributed to the World Cup.

Overall, the data was very similar (much more so than I expected). r/AskReddit was the most popular subreddit in both data sets, but that's about the only similarity I expected. Of the comments that I collected, the average account creation dates from the two data sets differed by less than two months, and the oldest accounts from the two sets differed by only six days (September 19th vs. September 13th, 2005).

Everything else was very similar too. From the average account karma, to the average comment score after a day, to the average number of replies, I would have never expected all of these to be so close together.

Ultimately, this deserves to be looked into further with more trials, each over a longer period. And that's what I plan to do, as I've mentioned.

LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.
