# User Data Collection

We shall augment the data found in `data_collection.ipynb`. In particular, for each post, we want to extract the following:

* When the author created their account
* The number of comments the author had made prior to the post
* The number of submissions the author had made prior to the post
* The max and median scores of the author's comments prior to the post
* The max and median scores of the author's submissions prior to the post

The account age and number of comments/submissions serve as an indicator of a user's engagement/experience with Reddit, with the hypothesis that more engaged/experienced users are more likely to write posts that gain traction. A high maximum score of previous comments/submissions indicates they're capable of authoring viral content, and the median score indicates their usual performance.


The Reddit API provides access to the user's current karma, as well as whether the user has verified their email, is a mod, is a Reddit employee, has gold, etc. These are tempting sources of information, but they describe the account as it is now, and we want to restrict our predictive data to what was available at the time of posting. For instance, a redditor's karma increases dramatically when their post goes viral.
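
To make the "only data available at the time of posting" rule concrete, here is a minimal sketch of computing the count, median, and maximum of an author's prior scores. The helper name `summarize_prior_scores` and the `(created_utc, score)` pair format are assumptions made for illustration; they are not part of the notebook or of PRAW.

```python
import math
import statistics


def summarize_prior_scores(items, cutoff_utc):
    """Return (count, median, max) of scores for items created before cutoff_utc.

    `items` is an iterable of (created_utc, score) pairs; this shape is an
    assumption made for the sketch.
    """
    scores = [score for created_utc, score in items if created_utc < cutoff_utc]
    if not scores:
        # No prior activity: the count is 0 and the statistics are undefined
        return 0, math.nan, math.nan
    return len(scores), statistics.median(scores), max(scores)


# Hypothetical history: the last item was created after the cutoff and is ignored
history = [(100.0, 5), (200.0, 1), (300.0, 40), (900.0, 9999)]
print(summarize_prior_scores(history, cutoff_utc=500.0))
```

Anything created at or after the cutoff is excluded, so a score earned by the viral post itself can never leak into the features.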

Loading libraries:

In [123]:
import numpy as np
import pandas as pd # data manipulation
import praw # Python Reddit API Wrapper
import time # see how long it takes
import datetime # working with datetimes
import statistics # get median

Authenticating with the Reddit API. Steps:
* Create a Reddit account.
* Go to https://www.reddit.com/prefs/apps.
* At the bottom, select "create app" and fill out the form accordingly (personal use).
* Fill in the arguments in the cell below, using the information from the app description.

In [126]:
reddit = praw.Reddit(
    client_id="",  # appears at top of app description
    client_secret="",  # labeled as "secret" in app description
    password="",  # reddit account password here
    user_agent="Hello world",  # put whatever
    username="",  # reddit username here
)
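
Rather than typing credentials into the notebook, one option is to read them from environment variables. The sketch below only builds the keyword arguments; the helper name `load_reddit_credentials` and the `REDDIT_` variable prefix are assumptions chosen for this example, not a PRAW convention.

```python
import os

REQUIRED_KEYS = ("client_id", "client_secret", "username", "password")


def load_reddit_credentials(env=os.environ, prefix="REDDIT_"):
    """Build keyword arguments for praw.Reddit from environment variables."""
    missing = [key for key in REQUIRED_KEYS if not env.get(prefix + key.upper())]
    if missing:
        raise KeyError(f"Missing environment variables for: {', '.join(missing)}")
    creds = {key: env[prefix + key.upper()] for key in REQUIRED_KEYS}
    # The user agent is free-form, so fall back to the notebook's placeholder
    creds["user_agent"] = env.get(prefix + "USER_AGENT", "Hello world")
    return creds


# Usage (requires praw and real credentials in the environment):
# reddit = praw.Reddit(**load_reddit_credentials())
```

This keeps secrets out of the saved notebook, which matters if the notebook is shared or committed to version control.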

If the above worked, the following code will produce your username and karma:

In [129]:
print(f'{reddit.user.me()} has {reddit.user.me().comment_karma} karma.')

Spiffy_Lee has 6096 karma.


Importing the data, and getting the column of usernames:

In [130]:
df = pd.read_pickle('data/train.pkl')
usernames = list(df['author'])
date_created = list(df['date'])
time_created = list(df['time'])
df.head(10)

| | id | author | title | selftext | time | date | score | num_comments |
|---|---|---|---|---|---|---|---|---|
| 54442 | n97ehm | weremanthing | Refinance my home to free up VA loan or wait? | First let me say thank you for looking at my p... | 11:20:52 | 2021-05-10 | 1 | 2 |
| 41531 | nzf89i | b1ackcat | Thank you for being such a great resource; you... | [removed] | 01:08:32 | 2021-06-14 | 1 | 2 |
| 61126 | mxntnt | runnerup | 401k vs 457b, not sure which to max first | My work has both the 401k and 457b plans. They... | 12:50:32 | 2021-04-24 | 3 | 7 |
| 87222 | lg8t6y | Bunburier | Student Loans, Interest Rate, and Payment Stra... | I'll be attending graduate school soon. Tuitio... | 12:44:56 | 2021-02-09 | 2 | 2 |
| 34549 | obzx08 | Mxnchkinz_ | What do I put under Gross Income when applying... | I'm applying for a Discover Secured Credit Car... | 21:27:28 | 2021-07-01 | 0 | 15 |
| 94551 | l3pqzy | qnxz | Can I claim expenses paid with HSA from insura... | Can I pay for a medical expense with my HSA de... | 20:44:33 | 2021-01-23 | 3 | 36 |
| 73894 | m49pky | No_Mortgage_ForYou | HSBC refuses to close on mortgage refinance. A... | **Summary:** HSBC, the holder of our current m... | 11:30:32 | 2021-03-13 | 0 | 23 |
| 125114 | jvohkf | john__Webster | How to make money online | [removed] | 01:57:54 | 2020-11-17 | 1 | 0 |
| 114387 | kd5rij | caramelbananas | Receiving a rather sizable end of year bonus, ... | I'm going to be a receiving an end of year bon... | 15:27:37 | 2020-12-14 | 2 | 16 |
| 90782 | l9hept | wpomp | Refund Tax Credit after Marriage? | Hi Everyone,\n\nI am going through my taxes (u... | 12:44:21 | 2021-01-31 | 2 | 10 |


In [131]:
len(df)

124941

In [132]:
# Unix timestamp per post; note that .timestamp() reads naive datetimes as local time
post_created_utcs = [datetime.datetime.combine(d, t).timestamp() for d, t in zip(date_created, time_created)]
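
Because `.timestamp()` interprets a naive datetime in the machine's local timezone, the result can be offset from Reddit's UTC-based `created_utc` values. If the `date` and `time` columns are stored in UTC (an assumption; the notebook does not say), attaching the timezone explicitly removes that dependence. The helper name `to_utc_timestamp` is hypothetical.

```python
import datetime


def to_utc_timestamp(date, time_of_day):
    """Combine a date and a naive time into a Unix timestamp, treating both as UTC."""
    combined = datetime.datetime.combine(
        date, time_of_day, tzinfo=datetime.timezone.utc
    )
    return combined.timestamp()


# Date and time of the first post shown in the table above
example = to_utc_timestamp(datetime.date(2021, 5, 10), datetime.time(11, 20, 52))
print(example)
```

With this version the same timestamps come out regardless of where the notebook is run.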

Using PRAW to extract information about the users; see https://praw.readthedocs.io/en/stable/code_overview/models/redditor.html. Unfortunately, the following block marches along at a crawl: once the pace settles, the progress log shows about 30 seconds per 10 users, or roughly three seconds per user.

In [None]:
user_data = []
tic = time.perf_counter()
# for i in range(len(df)):  # full run; processed in slices because it is slow
for i in range(30000, 60000):
    username = usernames[i]
    post_created_utc = post_created_utcs[i]

    user = reddit.redditor(username)
    try:
        created_utc = user.created_utc
    except Exception:
        # The lookup failed (e.g. the account is no longer accessible)
        created_utc = np.nan

    # Note: PRAW listings return at most 100 items by default, so the counts
    # below are capped at 100 and cover only the "hot" items.
    try:
        previous_comment_scores = [comm.score for comm in user.comments.hot() if comm.created_utc < post_created_utc]
    except Exception:
        previous_comment_scores = []

    num_previous_comments = len(previous_comment_scores)
    if num_previous_comments > 0:
        median_comment_scores = np.nanmedian(previous_comment_scores)
        max_comment_scores = max(previous_comment_scores)
    else:
        # Use NaN (not a string) so the columns stay numeric
        median_comment_scores = np.nan
        max_comment_scores = np.nan

    try:
        previous_submission_scores = [sub.score for sub in user.submissions.hot() if sub.created_utc < post_created_utc]
    except Exception:
        previous_submission_scores = []

    num_previous_submissions = len(previous_submission_scores)
    if num_previous_submissions > 0:
        median_submission_scores = np.nanmedian(previous_submission_scores)
        max_submission_scores = max(previous_submission_scores)
    else:
        median_submission_scores = np.nan
        max_submission_scores = np.nan

    user_data.append([username, created_utc, num_previous_comments, median_comment_scores, max_comment_scores,
                      num_previous_submissions, median_submission_scores, max_submission_scores])
    # Report progress every 10 users
    if i % 10 == 0:
        toc = time.perf_counter()
        print(f"{i}/{len(df)} Elapsed time: {np.round(toc-tic,2)} seconds")

30000/124941 Elapsed time: 7.63 seconds
30010/124941 Elapsed time: 18.24 seconds
30020/124941 Elapsed time: 30.56 seconds
30030/124941 Elapsed time: 39.62 seconds
30040/124941 Elapsed time: 54.57 seconds
30050/124941 Elapsed time: 68.76 seconds
30060/124941 Elapsed time: 83.23 seconds
30070/124941 Elapsed time: 99.25 seconds
30080/124941 Elapsed time: 113.4 seconds
30090/124941 Elapsed time: 125.88 seconds
30100/124941 Elapsed time: 136.71 seconds
30110/124941 Elapsed time: 150.0 seconds
30120/124941 Elapsed time: 160.28 seconds
30130/124941 Elapsed time: 169.72 seconds
30140/124941 Elapsed time: 181.74 seconds
30150/124941 Elapsed time: 190.0 seconds
30160/124941 Elapsed time: 202.14 seconds
30170/124941 Elapsed time: 232.42 seconds
30180/124941 Elapsed time: 262.53 seconds
30190/124941 Elapsed time: 292.96 seconds
30200/124941 Elapsed time: 322.34 seconds
30210/124941 Elapsed time: 352.18 seconds
30220/124941 Elapsed time: 382.41 seconds
30230/124941 Elapsed time: 411.48 seconds
...

31930/124941 Elapsed time: 5508.34 seconds
31940/124941 Elapsed time: 5537.63 seconds
31950/124941 Elapsed time: 5567.14 seconds
31960/124941 Elapsed time: 5598.98 seconds
31970/124941 Elapsed time: 5628.73 seconds
31980/124941 Elapsed time: 5658.03 seconds
31990/124941 Elapsed time: 5687.69 seconds
32000/124941 Elapsed time: 5718.85 seconds
32010/124941 Elapsed time: 5748.67 seconds
32020/124941 Elapsed time: 5778.27 seconds
32030/124941 Elapsed time: 5807.38 seconds
32040/124941 Elapsed time: 5837.44 seconds
32050/124941 Elapsed time: 5868.7 seconds
32060/124941 Elapsed time: 5898.51 seconds
32070/124941 Elapsed time: 5927.63 seconds
32080/124941 Elapsed time: 5958.44 seconds
32090/124941 Elapsed time: 5988.41 seconds
32100/124941 Elapsed time: 6018.77 seconds
32110/124941 Elapsed time: 6048.59 seconds
32120/124941 Elapsed time: 6077.15 seconds
32130/124941 Elapsed time: 6107.76 seconds
32140/124941 Elapsed time: 6138.69 seconds
32150/124941 Elapsed time: 6168.0 seconds
...

33840/124941 Elapsed time: 11236.06 seconds
33850/124941 Elapsed time: 11266.59 seconds
33860/124941 Elapsed time: 11296.11 seconds
33870/124941 Elapsed time: 11326.19 seconds
33880/124941 Elapsed time: 11355.99 seconds
33890/124941 Elapsed time: 11387.08 seconds
33900/124941 Elapsed time: 11415.82 seconds
33910/124941 Elapsed time: 11445.46 seconds
33920/124941 Elapsed time: 11476.95 seconds
33930/124941 Elapsed time: 11506.52 seconds
33940/124941 Elapsed time: 11536.38 seconds
33950/124941 Elapsed time: 11565.49 seconds
33960/124941 Elapsed time: 11595.55 seconds
33970/124941 Elapsed time: 11627.15 seconds
33980/124941 Elapsed time: 11655.33 seconds
33990/124941 Elapsed time: 11686.84 seconds
34000/124941 Elapsed time: 11716.09 seconds
34010/124941 Elapsed time: 11744.96 seconds
34020/124941 Elapsed time: 11775.75 seconds
34030/124941 Elapsed time: 11805.01 seconds
34040/124941 Elapsed time: 11836.64 seconds
34050/124941 Elapsed time: 11865.47 seconds
...

35710/124941 Elapsed time: 16850.81 seconds
35720/124941 Elapsed time: 16880.56 seconds
35730/124941 Elapsed time: 16910.46 seconds
35740/124941 Elapsed time: 16941.3 seconds
35750/124941 Elapsed time: 16971.29 seconds
35760/124941 Elapsed time: 17000.6 seconds
35770/124941 Elapsed time: 17031.59 seconds
35780/124941 Elapsed time: 17061.41 seconds
35790/124941 Elapsed time: 17090.64 seconds
35800/124941 Elapsed time: 17120.07 seconds
35810/124941 Elapsed time: 17150.31 seconds
35820/124941 Elapsed time: 17181.92 seconds
35830/124941 Elapsed time: 17209.93 seconds
35840/124941 Elapsed time: 17241.5 seconds
35850/124941 Elapsed time: 17270.98 seconds
35860/124941 Elapsed time: 17300.51 seconds
35870/124941 Elapsed time: 17331.08 seconds
35880/124941 Elapsed time: 17362.57 seconds
35890/124941 Elapsed time: 17390.25 seconds
35900/124941 Elapsed time: 17421.63 seconds
35910/124941 Elapsed time: 17450.52 seconds
35920/124941 Elapsed time: 17480.4 seconds
...

37580/124941 Elapsed time: 22461.63 seconds
37590/124941 Elapsed time: 22490.77 seconds
37600/124941 Elapsed time: 22520.87 seconds
37610/124941 Elapsed time: 22550.82 seconds
37620/124941 Elapsed time: 22580.92 seconds
37630/124941 Elapsed time: 22612.1 seconds
37640/124941 Elapsed time: 22641.59 seconds
37650/124941 Elapsed time: 22670.33 seconds
37660/124941 Elapsed time: 22701.51 seconds
37670/124941 Elapsed time: 22730.5 seconds
37680/124941 Elapsed time: 22760.78 seconds
37690/124941 Elapsed time: 22790.49 seconds
37700/124941 Elapsed time: 22820.08 seconds
37710/124941 Elapsed time: 22852.49 seconds
37720/124941 Elapsed time: 22881.19 seconds
37730/124941 Elapsed time: 22910.34 seconds
37740/124941 Elapsed time: 22940.29 seconds
37750/124941 Elapsed time: 22971.95 seconds
37760/124941 Elapsed time: 23001.44 seconds
37770/124941 Elapsed time: 23031.57 seconds
37780/124941 Elapsed time: 23062.17 seconds
37790/124941 Elapsed time: 23092.83 seconds
...

39450/124941 Elapsed time: 28073.53 seconds
39460/124941 Elapsed time: 28102.49 seconds
39470/124941 Elapsed time: 28131.66 seconds
39480/124941 Elapsed time: 28161.64 seconds
39490/124941 Elapsed time: 28192.86 seconds
39500/124941 Elapsed time: 28222.67 seconds
39510/124941 Elapsed time: 28252.16 seconds
39520/124941 Elapsed time: 28282.62 seconds
39530/124941 Elapsed time: 28312.76 seconds
39540/124941 Elapsed time: 28342.78 seconds
39550/124941 Elapsed time: 28373.55 seconds
39560/124941 Elapsed time: 28400.46 seconds
39570/124941 Elapsed time: 28427.3 seconds
39580/124941 Elapsed time: 28458.13 seconds
39590/124941 Elapsed time: 28487.41 seconds
39600/124941 Elapsed time: 28516.91 seconds
39610/124941 Elapsed time: 28549.34 seconds
39620/124941 Elapsed time: 28577.67 seconds
39630/124941 Elapsed time: 28607.16 seconds
39640/124941 Elapsed time: 28637.84 seconds
39650/124941 Elapsed time: 28668.64 seconds
39660/124941 Elapsed time: 28698.31 seconds
...

41320/124941 Elapsed time: 33677.44 seconds
41330/124941 Elapsed time: 33707.1 seconds
41340/124941 Elapsed time: 33738.77 seconds
41350/124941 Elapsed time: 33768.01 seconds
41360/124941 Elapsed time: 33797.84 seconds
41370/124941 Elapsed time: 33828.76 seconds
41380/124941 Elapsed time: 33858.48 seconds
41390/124941 Elapsed time: 33888.15 seconds
41400/124941 Elapsed time: 33918.78 seconds
41410/124941 Elapsed time: 33948.33 seconds
41420/124941 Elapsed time: 33977.72 seconds
41430/124941 Elapsed time: 34007.39 seconds
41440/124941 Elapsed time: 34037.25 seconds
41450/124941 Elapsed time: 34068.7 seconds
41460/124941 Elapsed time: 34098.69 seconds
41470/124941 Elapsed time: 34127.36 seconds
41480/124941 Elapsed time: 34159.37 seconds
41490/124941 Elapsed time: 34187.6 seconds
41500/124941 Elapsed time: 34218.75 seconds
41510/124941 Elapsed time: 34248.92 seconds
41520/124941 Elapsed time: 34277.87 seconds
41530/124941 Elapsed time: 34309.09 seconds
...

43190/124941 Elapsed time: 39288.22 seconds
43200/124941 Elapsed time: 39319.94 seconds
43210/124941 Elapsed time: 39349.27 seconds
43220/124941 Elapsed time: 39378.69 seconds
43230/124941 Elapsed time: 39409.81 seconds
43240/124941 Elapsed time: 39438.99 seconds
43250/124941 Elapsed time: 39468.49 seconds
43260/124941 Elapsed time: 39500.2 seconds
43270/124941 Elapsed time: 39529.78 seconds
43280/124941 Elapsed time: 39559.11 seconds
43290/124941 Elapsed time: 39588.72 seconds
43300/124941 Elapsed time: 39618.4 seconds
43310/124941 Elapsed time: 39649.67 seconds
43320/124941 Elapsed time: 39678.28 seconds
43330/124941 Elapsed time: 39708.65 seconds
43340/124941 Elapsed time: 39739.37 seconds
43350/124941 Elapsed time: 39769.17 seconds
43360/124941 Elapsed time: 39798.94 seconds
43370/124941 Elapsed time: 39829.24 seconds
43380/124941 Elapsed time: 39858.78 seconds
43390/124941 Elapsed time: 39889.56 seconds
43400/124941 Elapsed time: 39918.49 seconds
...

45060/124941 Elapsed time: 44895.15 seconds
45070/124941 Elapsed time: 44926.35 seconds
45080/124941 Elapsed time: 44956.86 seconds
45090/124941 Elapsed time: 44986.48 seconds
45100/124941 Elapsed time: 45015.97 seconds
45110/124941 Elapsed time: 45045.98 seconds
45120/124941 Elapsed time: 45075.86 seconds
45130/124941 Elapsed time: 45107.01 seconds
45140/124941 Elapsed time: 45135.77 seconds
45150/124941 Elapsed time: 45165.32 seconds
45160/124941 Elapsed time: 45196.08 seconds
45170/124941 Elapsed time: 45226.27 seconds
45180/124941 Elapsed time: 45255.13 seconds
45190/124941 Elapsed time: 45285.05 seconds
45200/124941 Elapsed time: 45316.14 seconds
45210/124941 Elapsed time: 45346.62 seconds
45220/124941 Elapsed time: 45376.79 seconds
45230/124941 Elapsed time: 45406.64 seconds
45240/124941 Elapsed time: 45436.35 seconds
45250/124941 Elapsed time: 45465.08 seconds
45260/124941 Elapsed time: 45496.78 seconds
45270/124941 Elapsed time: 45526.68 seconds
...

46930/124941 Elapsed time: 50503.52 seconds
46940/124941 Elapsed time: 50533.72 seconds
46950/124941 Elapsed time: 50562.49 seconds
46960/124941 Elapsed time: 50593.72 seconds
46970/124941 Elapsed time: 50620.2 seconds
46980/124941 Elapsed time: 50651.08 seconds
46990/124941 Elapsed time: 50680.97 seconds
47000/124941 Elapsed time: 50710.25 seconds
47010/124941 Elapsed time: 50740.59 seconds
47020/124941 Elapsed time: 50769.53 seconds
47030/124941 Elapsed time: 50798.96 seconds
47040/124941 Elapsed time: 50829.16 seconds
47050/124941 Elapsed time: 50860.32 seconds
47060/124941 Elapsed time: 50890.77 seconds
47070/124941 Elapsed time: 50919.97 seconds
47080/124941 Elapsed time: 50950.5 seconds
47090/124941 Elapsed time: 50980.07 seconds
47100/124941 Elapsed time: 51010.82 seconds
47110/124941 Elapsed time: 51039.86 seconds
47120/124941 Elapsed time: 51069.02 seconds
47130/124941 Elapsed time: 51100.59 seconds
47140/124941 Elapsed time: 51129.45 seconds
...

48800/124941 Elapsed time: 56108.31 seconds
48810/124941 Elapsed time: 56136.55 seconds
48820/124941 Elapsed time: 56166.11 seconds
48830/124941 Elapsed time: 56196.87 seconds
48840/124941 Elapsed time: 56228.71 seconds
48850/124941 Elapsed time: 56256.9 seconds
48860/124941 Elapsed time: 56287.35 seconds
48870/124941 Elapsed time: 56317.53 seconds
48880/124941 Elapsed time: 56347.85 seconds
48890/124941 Elapsed time: 56376.17 seconds
48900/124941 Elapsed time: 56407.59 seconds
48910/124941 Elapsed time: 56437.42 seconds
48920/124941 Elapsed time: 56466.1 seconds
48930/124941 Elapsed time: 56497.73 seconds
48940/124941 Elapsed time: 56526.36 seconds
48950/124941 Elapsed time: 56556.25 seconds
48960/124941 Elapsed time: 56587.81 seconds
48970/124941 Elapsed time: 56615.83 seconds
48980/124941 Elapsed time: 56645.09 seconds
48990/124941 Elapsed time: 56678.11 seconds
49000/124941 Elapsed time: 56705.25 seconds
49010/124941 Elapsed time: 56736.83 seconds
...

50670/124941 Elapsed time: 61716.38 seconds
50680/124941 Elapsed time: 61746.24 seconds
50690/124941 Elapsed time: 61776.65 seconds
50700/124941 Elapsed time: 61808.04 seconds
50710/124941 Elapsed time: 61836.4 seconds
50720/124941 Elapsed time: 61867.42 seconds
50730/124941 Elapsed time: 61896.45 seconds
50740/124941 Elapsed time: 61926.96 seconds
50750/124941 Elapsed time: 61955.84 seconds
50760/124941 Elapsed time: 61987.09 seconds
50770/124941 Elapsed time: 62016.75 seconds
50780/124941 Elapsed time: 62047.33 seconds
50790/124941 Elapsed time: 62076.61 seconds
50800/124941 Elapsed time: 62106.5 seconds
50810/124941 Elapsed time: 62137.65 seconds
50820/124941 Elapsed time: 62166.24 seconds
50830/124941 Elapsed time: 62196.39 seconds
50840/124941 Elapsed time: 62228.02 seconds
50850/124941 Elapsed time: 62257.24 seconds
50860/124941 Elapsed time: 62287.56 seconds
50870/124941 Elapsed time: 62317.56 seconds
50880/124941 Elapsed time: 62347.64 seconds
...

52540/124941 Elapsed time: 67327.11 seconds
52550/124941 Elapsed time: 67357.5 seconds
52560/124941 Elapsed time: 67386.33 seconds
52570/124941 Elapsed time: 67418.26 seconds
52580/124941 Elapsed time: 67446.26 seconds
52590/124941 Elapsed time: 67477.87 seconds
52600/124941 Elapsed time: 67506.29 seconds
52610/124941 Elapsed time: 67537.94 seconds
52620/124941 Elapsed time: 67567.67 seconds
52630/124941 Elapsed time: 67596.52 seconds
52640/124941 Elapsed time: 67627.18 seconds
52650/124941 Elapsed time: 67658.08 seconds
52660/124941 Elapsed time: 67686.11 seconds
52670/124941 Elapsed time: 67717.71 seconds
52680/124941 Elapsed time: 67747.73 seconds
52690/124941 Elapsed time: 67777.18 seconds
52700/124941 Elapsed time: 67806.51 seconds
52710/124941 Elapsed time: 67838.41 seconds
52720/124941 Elapsed time: 67866.53 seconds
52730/124941 Elapsed time: 67896.29 seconds
52740/124941 Elapsed time: 67927.56 seconds
52750/124941 Elapsed time: 67957.25 seconds
...
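
The progress log above gives enough information to extrapolate how long a full pass over the data would take. The sketch below assumes the rate stays constant, which the log only roughly supports; the function name `estimate_runtime` is hypothetical.

```python
def estimate_runtime(start_index, current_index, elapsed_seconds, total_items):
    """Extrapolate total runtime from progress so far, assuming a constant rate."""
    processed = current_index - start_index
    seconds_per_item = elapsed_seconds / processed
    total_hours = seconds_per_item * total_items / 3600
    return seconds_per_item, total_hours


# Figures taken from the log: index 31930 was reached after 5508.34 seconds,
# the loop started at index 30000, and the dataset holds 124941 posts.
rate, hours = estimate_runtime(30000, 31930, 5508.34, 124941)
print(f"{rate:.2f} s/user, about {hours:.0f} hours for all posts")
```

An estimate of this size is the reason the loop runs over a slice of indices rather than the whole dataframe.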

In [138]:
user_df = pd.DataFrame(user_data, columns = ['Author','Created', 'nComments', 'medianCommentScore', 'maxCommentScore', 
                                             'nSubmissions', 'medianSubmissionScore', 'maxSubmissionScore'])
user_df.head(50)

| | Author | Created | nComments | medianCommentScore | maxCommentScore | nSubmissions | medianSubmissionScore | maxSubmissionScore |
|---|---|---|---|---|---|---|---|---|
| 0 | weremanthing | 1472830000.0 | 94 | 1.0 | 1871.0 | 39 | 3.0 | 25906.0 |
| 1 | b1ackcat | 1275530000.0 | 0 | | | 92 | 6.0 | 4888.0 |
| 2 | runnerup | 1255600000.0 | 0 | | | 0 | | |
| 3 | Bunburier | 1500550000.0 | 0 | | | 74 | 2.0 | 490.0 |
| 4 | Mxnchkinz_ | 1598490000.0 | 29 | 1.0 | 3.0 | 44 | 1.5 | 42.0 |
| 5 | qnxz | 1611450000.0 | 0 | | | 0 | | |
| 6 | No_Mortgage_ForYou | 1615650000.0 | 0 | | | 0 | | |
| 7 | john__Webster | | 0 | | | 0 | | |
| 8 | caramelbananas | 1523330000.0 | 10 | 1.0 | 8.0 | 2 | 2.0 | 2.0 |
| 9 | wpomp | 1334520000.0 | 62 | 1.0 | 8.0 | 23 | 2.0 | 49.0 |


In [140]:
len(user_df)

30000

In [141]:
user_df.to_pickle('./data/personal_finance_user_data_30000_60000.pkl')
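
The filename above encodes the index range that was processed, which suggests collecting the data in slices. As a sketch of how the remaining slices might be planned, the helpers below generate index ranges and matching filenames. The chunk size of 30000 and the assumption that other slices follow the same naming pattern are inferred from this one filename, and both function names are hypothetical.

```python
def chunk_ranges(total, chunk_size):
    """Yield (start, stop) index pairs covering range(total) in order."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


def chunk_filename(start, stop):
    # Mirrors the naming pattern of the pickle saved above
    return f"./data/personal_finance_user_data_{start}_{stop}.pkl"


# 124941 is the number of posts reported by len(df)
for start, stop in chunk_ranges(124941, 30000):
    print(start, stop, chunk_filename(start, stop))
```

Each `(start, stop)` pair can be passed to `range(start, stop)` in the collection loop, so an interrupted run only loses the slice in progress.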