## Data Collection

We use the `psaw` Python wrapper for the pushshift.io API (https://github.com/pushshift/api) to collect Reddit data. We use pushshift.io rather than the official Reddit API because the latter only lets us retrieve 1000 posts at a time.

In [77]:
from psaw import PushshiftAPI
api = PushshiftAPI()

`psaw` lets us search for Reddit posts (called "submissions") in a given subreddit and time range. For instance, we can collect all the submissions to the **r/personalfinance** subreddit from October 1, 2020 to October 1, 2021. This takes a little under half an hour.

In [78]:
import datetime as dt
import time

# These datetimes are naive, so .timestamp() interprets them in the
# machine's local timezone.
start_epoch = int(dt.datetime(2020, 10, 1).timestamp())
end_epoch = int(dt.datetime(2021, 10, 1).timestamp())

print("Downloading submission data...")
tic = time.perf_counter()
# Request only the fields we need for each submission.
subs = list(api.search_submissions(after=start_epoch,
                                   before=end_epoch,
                                   subreddit='personalfinance',
                                   filter=['id', 'author', 'title', 'selftext', 'score', 'num_comments']))
toc = time.perf_counter()
print(f"Downloaded submission data in {toc - tic:0.2f} seconds")

Downloading submission data...
Downloaded submission data in 1620.18 seconds


In [79]:
import pandas as pd
df = pd.DataFrame([thing.d_ for thing in subs])
df.head()

Unnamed: 0,author,created_utc,id,num_comments,score,selftext,title,created
0,DeadStarMan,1633060765,pz02c1,23,1,I'm getting a 15k cash settlement at age 29. \...,How should I use my cash settlement?,1633075000.0
1,mojo3jojo,1633060323,pyzyav,45,1,I make about $2500 a month. I pay $816 per mon...,Can I afford a $35k car?,1633075000.0
2,PolarisSONE,1633060288,pyzy0b,5,1,"Hi there, I have a 401K with Fidelity and I've...","Wanting to keep things simple, have a 401K and...",1633075000.0
3,CarWreckFiance,1633060269,pyzxv1,3,1,[removed],Car wreck right before big move,1633075000.0
4,watchoutitstaco,1633060017,pyzvna,9,1,Hello pals!\n\n&amp;#x200B;\n\nI'd like to hel...,How to pay off loans for my partner (is there ...,1633074000.0


In [179]:
len(df.index)

138823

Unfortunately, the pushshift.io API seems to record the "score" (upvotes minus downvotes) of a post shortly after the post is created, and then does not update it. We therefore need to correct the score of each entry. We can use the `praw` wrapper (https://praw.readthedocs.io/en/stable/) for the official Reddit API to do this. This takes a little over 40 minutes.

In [80]:
import praw
reddit = praw.Reddit(client_id="",      # enter credentials here
                     client_secret="",  # and here
                     user_agent="")
subred = reddit.subreddit('personalfinance')

See: https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html#praw.Reddit.info
See: https://praw.readthedocs.io/en/stable/code_overview/models/submission.html#praw.models.Submission.fullname

One row is skipped in the results: the data from `praw` falls out of step with `df` around index 104607. The post by u/DoLifeBetter appears to have been deleted from Reddit but is still in the pushshift database.

In [159]:
post_ids = ['t3_'+name for name in df['id'].values]
tic = time.perf_counter()
gen = reddit.info(post_ids)
praw_score = []
praw_author = []
for i in gen:
    praw_author.append(i.author)
    praw_score.append(i.score)
toc = time.perf_counter()
print(f"Downloaded score data in {toc - tic:0.2f} seconds")

Downloaded score data in 2506.64 seconds
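
As a rough sanity check on this running time, here is a back-of-the-envelope calculation. It assumes that `reddit.info` resolves fullnames in batches of 100 per request; that batch size is an assumption about PRAW's behaviour, not something measured in this notebook.

```python
import math

n_posts = 138823            # len(df.index) reported above
elapsed_seconds = 2506.64   # timing printed above
batch_size = 100            # assumption: fullnames are resolved 100 per request

n_requests = math.ceil(n_posts / batch_size)
seconds_per_request = elapsed_seconds / n_requests

print(f"{n_requests} requests, about {seconds_per_request:.2f} s each")
```

Under that assumption, most of the time goes to waiting on individual requests rather than to local processing.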


Observe that the post by u/tropicalweeds should be at position 104608, but is at position 104607 in the data from `praw`. We insert a score of 0 at position 104606, which then shifts every later entry into place.

In [187]:
#[i for i in range(len(praw_author)) if df.author[i] != praw_author[i] and praw_author[i] != None]
praw_author[104605:104609]

[Redditor(name='kuntpower'), None, Redditor(name='tropicalweeds'), None]

In [194]:
df.author[104605:104609]

104605        kuntpower
104606     DoLifeBetter
104607         yipyip-1
104608    tropicalweeds
Name: author, dtype: object

In [195]:
# list.insert() works in place and returns None, so its result is not assigned.
praw_author.insert(104606, 'None')

In [201]:
praw_author[104605:104609]

[Redditor(name='kuntpower'), 'None', None, Redditor(name='tropicalweeds')]

In [198]:
praw_score2 = praw_score.copy()

In [199]:
praw_score2.insert(104606,0)

In [202]:
df['score'] = praw_score2
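
Inserting a placeholder by position works here because exactly one post is missing. A more defensive alternative, sketched below, is to match the refreshed scores to the original rows by post id, so that any number of missing posts is handled without shifting. The helper `align_scores` and the toy objects are hypothetical; the sketch only assumes that each returned item exposes `id` and `score` attributes, as the submissions above do for `score`.

```python
from types import SimpleNamespace


def align_scores(requested_ids, returned_items, fill=0):
    """Return scores in the order of requested_ids.

    Ids that are absent from returned_items receive `fill`; the list of
    those missing ids is returned as well.
    """
    score_by_id = {item.id: item.score for item in returned_items}
    missing = [pid for pid in requested_ids if pid not in score_by_id]
    scores = [score_by_id.get(pid, fill) for pid in requested_ids]
    return scores, missing


# Toy stand-ins for the objects yielded by the API (hypothetical data).
requested = ["aaa", "bbb", "ccc", "ddd"]
returned = [
    SimpleNamespace(id="aaa", score=4),
    SimpleNamespace(id="ccc", score=2),
    SimpleNamespace(id="ddd", score=1),
]

scores, missing = align_scores(requested, returned)
print(scores)   # scores in request order, with the fill value for "bbb"
print(missing)  # ids that were requested but not returned
```

With the real data, `requested` would be the values of `df['id']` and `returned` the items collected from `reddit.info`.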

The time of submission is given in Unix time (`created_utc`). We convert it to US/Eastern time and add the date and time as separate columns.

In [206]:
import pytz
eastern = pytz.timezone('US/Eastern')
# Convert each Unix timestamp once, then split it into date and time columns.
local_times = [dt.datetime.fromtimestamp(unix_time, tz=eastern) for unix_time in df['created_utc']]
df['date'] = [t.date() for t in local_times]
df['time'] = [t.time() for t in local_times]
df.head()

Unnamed: 0,author,created_utc,id,num_comments,score,selftext,title,created,date,time
0,DeadStarMan,1633060765,pz02c1,23,4,I'm getting a 15k cash settlement at age 29. \...,How should I use my cash settlement?,1633075000.0,2021-09-30,23:59:25
1,mojo3jojo,1633060323,pyzyav,45,0,I make about $2500 a month. I pay $816 per mon...,Can I afford a $35k car?,1633075000.0,2021-09-30,23:52:03
2,PolarisSONE,1633060288,pyzy0b,5,2,"Hi there, I have a 401K with Fidelity and I've...","Wanting to keep things simple, have a 401K and...",1633075000.0,2021-09-30,23:51:28
3,CarWreckFiance,1633060269,pyzxv1,3,1,[removed],Car wreck right before big move,1633075000.0,2021-09-30,23:51:09
4,watchoutitstaco,1633060017,pyzvna,9,0,Hello pals!\n\n&amp;#x200B;\n\nI'd like to hel...,How to pay off loans for my partner (is there ...,1633074000.0,2021-09-30,23:46:57
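
The first row of the table can be checked by hand. The sketch below uses only the standard library and a fixed UTC-4 offset; the offset is an assumption that holds for this particular date, when US/Eastern was on daylight saving time, and is not a substitute for a real timezone database.

```python
from datetime import datetime, timedelta, timezone

created_utc = 1633060765  # first row of the table above

# Interpret the Unix timestamp as an aware UTC datetime.
as_utc = datetime.fromtimestamp(created_utc, tz=timezone.utc)

# Assumption: a fixed UTC-4 offset (EDT) applies on this date.
edt = timezone(timedelta(hours=-4), name="EDT")
as_eastern = as_utc.astimezone(edt)

print(as_utc.isoformat())
print(as_eastern.date(), as_eastern.time())
```

The post was created at 03:59:25 UTC on October 1, which is 23:59:25 on September 30 in Eastern time, matching the `date` and `time` columns.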


We keep only the columns we need before saving the data frame to a file.

In [204]:
df2 = df[['id', 'author', 'title', 'selftext' ,'time', 'date', 'score', 'num_comments']]

We make a train/test split, with 90% of the entries in the training set, and save each part.

In [16]:
train = df2.sample(frac=0.9, random_state=42).copy()
test = df2.drop(train.index).copy()
train.to_pickle("./data/train.pkl")
test.to_pickle("./data/test.pkl")

# User Data Collection

We shall augment the data with user data. In particular, for each post, we want to extract the following:

* When the author created their account
* The number of comments the author had made prior to the post
* The number of submissions the author had made prior to the post
* The max and median scores of the author's comments prior to the post
* The max and median scores of the author's submissions prior to the post

The account age and number of comments/submissions serve as an indicator of a user's engagement/experience with Reddit, with the hypothesis that more engaged/experienced users are more likely to write posts that gain traction. A high maximum score of previous comments/submissions indicates they're capable of authoring viral content, and the median score indicates their usual performance.
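
The per-post features listed above can be sketched as a small function. This is a minimal illustration rather than the collection code itself: the function name `summarize_prior_activity` and the `(created_utc, score)` pair format are hypothetical, chosen only to show the "prior to the post" filter followed by the count, median and maximum.

```python
import statistics


def summarize_prior_activity(items, post_created_utc):
    """Summarize scores of items created strictly before the post.

    `items` is an iterable of (created_utc, score) pairs.
    """
    prior_scores = [score for created, score in items if created < post_created_utc]
    if not prior_scores:
        return {"n": 0, "median": None, "max": None}
    return {
        "n": len(prior_scores),
        "median": statistics.median(prior_scores),
        "max": max(prior_scores),
    }


# Toy history (hypothetical): the last item was created after the post.
history = [(100, 5), (200, 1), (300, 40), (900, 7)]
features = summarize_prior_activity(history, post_created_utc=500)
print(features)
```

The same function would be applied once to a user's comments and once to their submissions.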


The Reddit API provides access to the user's current karma, as well as whether the user has verified their email, is a mod, is a Reddit employee, has gold, etc. These are tempting sources of information, but they describe the user at the time of collection, and we want to restrict our predictive data to what was available at the time of posting. For example, a redditor's karma increases dramatically when their post goes viral.

Loading libraries:

In [1]:
import numpy as np
import pandas as pd # data manipulation
import praw # Python Reddit API Wrapper
import time # see how long it takes
import datetime # working with datetimes
import statistics # get median

To authenticate with the Reddit API:
* Create a Reddit account.
* Go to https://www.reddit.com/prefs/apps.
* At the bottom, select "create app" and fill out the form (choose personal use).
* Fill in the arguments below using the information from the app description.

In [None]:
reddit = praw.Reddit(
    client_id="",             # appears at the top of the app description
    client_secret="",         # labeled "secret" in the app description
    password="",              # Reddit account password
    user_agent="Hello world", # any descriptive string
    username=""               # Reddit username
)

If the above worked, the following code will print your username and comment karma:

In [None]:
print(f'{reddit.user.me()} has {reddit.user.me().comment_karma} karma.')
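
To avoid typing credentials into the notebook, one option is to read them from environment variables. The sketch below is an assumption-laden illustration: the `REDDIT_*` variable names and the helper `load_reddit_credentials` are hypothetical, and the resulting dictionary is meant to be passed to `praw.Reddit` as keyword arguments.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect praw.Reddit keyword arguments from environment variables.

    The REDDIT_* variable names are hypothetical; use whatever names you set.
    """
    required = ["client_id", "client_secret", "username", "password"]
    creds = {}
    missing = []
    for key in required:
        value = env.get(f"REDDIT_{key.upper()}")
        if value:
            creds[key] = value
        else:
            missing.append(f"REDDIT_{key.upper()}")
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    creds["user_agent"] = env.get("REDDIT_USER_AGENT", "Hello world")
    return creds


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {
    "REDDIT_CLIENT_ID": "abc",
    "REDDIT_CLIENT_SECRET": "def",
    "REDDIT_USERNAME": "someone",
    "REDDIT_PASSWORD": "hunter2",
}
creds = load_reddit_credentials(demo_env)
# With real credentials: reddit = praw.Reddit(**creds)
```

This keeps secrets out of the saved notebook, at the cost of one extra setup step before launching it.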

In [None]:
df = pd.read_pickle('data/train.pkl')
usernames = list(df['author'])
date_created = list(df['date'])
time_created = list(df['time'])
df.head(10)

We use PRAW to extract information about the users; see https://praw.readthedocs.io/en/stable/code_overview/models/redditor.html. Unfortunately, the following block is very slow (over a second per user).

In [None]:
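import pytz  # used to rebuild the post timestamps

# `post_created_utcs` was not defined above. Assumption: rebuild it from the
# US/Eastern `date` and `time` columns saved in train.pkl. Times that fall in
# the repeated hour at the end of daylight saving time may be off by one hour.
eastern = pytz.timezone('US/Eastern')
post_created_utcs = [eastern.localize(datetime.datetime.combine(d, t)).timestamp()
                     for d, t in zip(date_created, time_created)]

user_data = []
tic = time.perf_counter()
#for i in range(len(df)):
for i in range(60000, len(df)):  # this run covers rows 60000 onward
    username = usernames[i]
    post_created_utc = post_created_utcs[i]

    user = reddit.redditor(username)
    # Deleted or suspended accounts raise on attribute access.
    try:
        created_utc = user.created_utc
    except Exception:
        created_utc = 'NA'

    # Scores of the user's comments made before the post.
    # Note: PRAW listings return at most 100 items by default, so counts are capped.
    try:
        previous_comment_scores = [comm.score for comm in user.comments.hot() if comm.created_utc < post_created_utc]
    except Exception:
        previous_comment_scores = []

    num_previous_comments = len(previous_comment_scores)
    if num_previous_comments > 0:
        median_comment_scores = np.nanmedian(previous_comment_scores)
        max_comment_scores = max(previous_comment_scores)
    else:
        median_comment_scores = 'NA'
        max_comment_scores = 'NA'

    # Scores of the user's submissions made before the post.
    try:
        previous_submission_scores = [sub.score for sub in user.submissions.hot() if sub.created_utc < post_created_utc]
    except Exception:
        previous_submission_scores = []

    num_previous_submissions = len(previous_submission_scores)
    if num_previous_submissions > 0:
        median_submission_scores = np.nanmedian(previous_submission_scores)
        max_submission_scores = max(previous_submission_scores)
    else:
        median_submission_scores = 'NA'
        max_submission_scores = 'NA'

    user_data.append([username, created_utc, num_previous_comments, median_comment_scores, max_comment_scores,
                      num_previous_submissions, median_submission_scores, max_submission_scores])
    # Report progress every 10 users.
    if i % 10 == 0:
        toc = time.perf_counter()
        print(f"{i}/{len(df)} Elapsed time: {np.round(toc-tic,2)} seconds")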

In [None]:
user_df = pd.DataFrame(user_data, columns = ['Author','Created', 'nComments', 'medianCommentScore', 'maxCommentScore', 
                                             'nSubmissions', 'medianSubmissionScore', 'maxSubmissionScore'])
user_df.head(50)

In [None]:
len(user_df)

In [None]:
user_df.to_pickle('./data/personal_finance_user_data_60000_end.pkl')
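
Because the loop is slow and is run over a sub-range of rows, it can help to write results in fixed-size chunks so that an interrupted run can resume. The sketch below is one way to do this with the standard library only. `fetch_user_row` is a hypothetical stand-in for the per-user PRAW lookups, and the chunk file naming is an assumption modelled on the filename above.

```python
import pickle
import tempfile
from pathlib import Path


def fetch_user_row(index):
    """Hypothetical stand-in for the per-user lookups in the loop above."""
    return [f"user{index}", index * 10]


def collect_in_chunks(start, stop, chunk_size, out_dir):
    """Process rows start..stop-1 in chunks, pickling each chunk to its own file."""
    out_dir = Path(out_dir)
    paths = []
    for chunk_start in range(start, stop, chunk_size):
        chunk_stop = min(chunk_start + chunk_size, stop)
        path = out_dir / f"user_data_{chunk_start}_{chunk_stop}.pkl"
        if not path.exists():  # skip chunks completed by an earlier run
            rows = [fetch_user_row(i) for i in range(chunk_start, chunk_stop)]
            with open(path, "wb") as fh:
                pickle.dump(rows, fh)
        paths.append(path)
    return paths


def load_chunks(paths):
    """Concatenate the rows stored in the given chunk files, in order."""
    rows = []
    for path in paths:
        with open(path, "rb") as fh:
            rows.extend(pickle.load(fh))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    paths = collect_in_chunks(0, 7, chunk_size=3, out_dir=tmp)
    chunk_names = [p.name for p in paths]
    all_rows = load_chunks(paths)
```

The combined rows can then be turned into a single data frame with the same column names as `user_df`.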