# Reddit

There is a [subreddit dedicated to the topic of Coronavirus](https://www.reddit.com/r/Coronavirus/). It currently has 1.5 million members.

> On December 2019, a novel coronavirus strain (SARS-CoV-2) emerged in the city of Wuhan, China. This subreddit seeks to monitor the spread of the disease COVID-19, declared a pandemic by the WHO. Please be civil and empathetic. This subreddit is for high quality posts and discussion.

Despite this description, the subreddit was actually created on May 3, 2013! Yep, coronaviruses have been with us for some time.

Is it possible to look through the URLs being shared there to see if there are seeds for the COVID-19 project? Let's install [praw](https://praw.readthedocs.io/en/latest/), the Python Reddit API Wrapper, and find out.

In [2]:
! pip --quiet install praw

To use the API you first need to register an application at https://www.reddit.com/prefs/apps to obtain a client ID and secret.

In [6]:
import praw

reddit = praw.Reddit(
    client_id = 'jzlgUXvJdNIrEA', 
    client_secret = 'JF0GJm26ZAkVjBFWvQH2ueuUH6g', 
    user_agent = 'praw-edsu')
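
Credentials written directly into a notebook are easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads them from environment variables instead. The function name `reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are assumptions made for illustration, not something praw requires.

```python
import os


def reddit_credentials(environ=os.environ):
    """Build keyword arguments for praw.Reddit from environment variables.

    The variable names used here are hypothetical; use whatever suits your setup.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "client_id": environ["REDDIT_CLIENT_ID"],
        "client_secret": environ["REDDIT_CLIENT_SECRET"],
        # fall back to the user agent used in this notebook
        "user_agent": environ.get("REDDIT_USER_AGENT", "praw-edsu"),
    }


# With the variables set, the client could then be created like this:
# reddit = praw.Reddit(**reddit_credentials())
```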

Let's look specifically at the Coronavirus subreddit:

In [7]:
covid19 = reddit.subreddit('Coronavirus') 

## Posts

Reddit started as a site for sharing links, then voting and commenting on them. It ranks posts by what it calls "hotness", a score based on the votes a post has received and its age. At one point the source code for Reddit was available, and it was possible to [definitively say](https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9) how the algorithm worked, but now it's not entirely clear. At any rate, the Reddit API provides access to the hottest posts.

In [8]:
posts = covid19.hot(limit=10)

for post in posts:
    print(post.title)
    print(post.url)
    print("")

Daily Discussion Post - March 23 | Questions, images, videos, comments, unconfirmed reports, theories, suggestions
https://www.reddit.com/r/Coronavirus/comments/fnkb5o/daily_discussion_post_march_23_questions_images/

Humanity wins: our fight to unlock 32,544 COVID-19 articles for the world. This petition is dedicated to the victims of the outbreak and their families. We fought for every article for every scientist for you.
https://twitter.com/freereadorg/status/1236104420217286658

COVID-19 front-line workers deserve financial reward
https://ottawacitizen.com/opinion/letters/todays-letters-covid-19-front-line-workers-deserve-financial-reward?utm_medium=Social&utm_source=Facebook#Echobox=1584960626

Sen. Klobuchar says her husband has coronavirus: “We just got the test results at 7 a.m. this morning ... He now has pneumonia and is on oxygen but not a ventilator.”
https://twitter.com/nbcnews/status/1242100184101851137?s=21

We have surpassed 100,000 recovered cases worldwide
https://twi
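
For reference, here is a sketch of the hot ranking as it appeared in Reddit's formerly open source code and as described in the article linked above. It may no longer reflect what Reddit runs in production, so treat it as historical illustration only. The function takes a timezone-aware `datetime` for the creation time.

```python
import math
from datetime import datetime, timezone

# Offset used in the old open source code: 2005-12-08 07:46:43 UTC
REDDIT_EPOCH = 1134028003


def hot(ups, downs, created):
    """Hot score as computed in Reddit's old open source ranking code."""
    score = ups - downs
    # votes count logarithmically: the first 10 matter as much as the next 90
    order = math.log10(max(abs(score), 1))
    sign = (score > 0) - (score < 0)
    # newer posts get a steadily larger time bonus
    seconds = created.timestamp() - REDDIT_EPOCH
    return round(sign * order + seconds / 45000, 7)


posted = datetime(2020, 3, 23, 12, 0, tzinfo=timezone.utc)
print(hot(ups=1500, downs=100, created=posted))
```

Because the time term is divided by 45000 seconds, a post needs ten times as many net votes to rank level with one submitted 12.5 hours later.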

## Archived URLs

Now let's take a look at the 25 hottest posts, and see how many of their URLs have been archived at the Internet Archive.

In [9]:
from wayback import WaybackClient
from wayback.exceptions import WaybackException

wb = WaybackClient()

checked = []
archived = []
errors = []

for post in covid19.hot(limit=25):
    checked.append(post.url)
    try:
        versions = wb.search(post.url)
        if len(list(versions)) > 0:
            archived.append(post.url)
    except WaybackException:
        errors.append(post.url)
    except Exception as e:
        print(e)
    
print('{0} URLs checked ; {1} archived ({2:.1f}%) ; {3} unable to be archived.'.format(
    len(checked),
    len(archived),
    len(archived) / len(checked) * 100,
    len(errors)
))

25 URLs checked ; 8 archived (32.0%) ; 0 unable to be archived.


Ok, so what URLs have already been archived?

In [10]:
for url in archived:
    print(url)

https://twitter.com/freereadorg/status/1236104420217286658
https://www.reddit.com/r/Coronavirus/comments/fnl0n6/im_a_critical_care_doctor_working_in_a_uk_high/
https://replyua.net/news/201575-koronavirusa-nikakogo-net-a-est-koronavirus-v-golovah-u-chinovnikov-shahovu-pripomnili-kak-on-otrical-ugrozu-virusa-covid-19-dlya-ukrainy.html
https://www.theguardian.com/commentisfree/2020/mar/23/us-students-are-being-asked-to-work-remotely-but-22-of-homes-dont-have-internet
https://twitter.com/WellingMichael/status/1241491706677284870
https://nypost.com/2020/03/23/coronavirus-crisis-will-get-bad-this-week-surgeon-general-warns/
https://www.freep.com/story/news/local/michigan/oakland/2020/03/23/whitmer-michigan-lock-down-like-ohio-six-others-coronavirus-covid-19/2896041001/
https://www.sciencealert.com/mild-covid-19-might-cause-a-lost-of-smell-or-taste


We can also see which ones have not been archived:

In [11]:
for url in set(checked) - set(archived):
    print(url)

https://twitter.com/nbcnews/status/1242100184101851137?s=21
http://kdvr.com/news/coronavirus/fda-approves-rapid-coronavirus-test-created-in-colorado
https://edition.cnn.com/world/live-news/coronavirus-outbreak-03-23-20-intl-hnk/h_8df53475be78f0280ae6eeebe098885a
https://www.latimes.com/opinion/story/2020-03-23/coronovirus-healthcare-workers-risk
https://edition.cnn.com/2020/03/23/health/us-coronavirus-updates-monday/index.html
https://twitter.com/BAG_OFSP_UFSP/status/1242051679098474498
https://ottawacitizen.com/opinion/letters/todays-letters-covid-19-front-line-workers-deserve-financial-reward?utm_medium=Social&utm_source=Facebook#Echobox=1584960626
https://twitter.com/nbcnews/status/1241914972579467266?s=21
https://www.newsweek.com/new-york-city-now-has-more-confirmed-cases-coronavirus-all-south-korea-1493755
https://www.ladbible.com/news/news-gran-95-becomes-oldest-woman-in-italy-to-recover-from-covid-19-20200323
https://www.channel24.co.za/News/Local/queens-royal-aide-tests-positiv
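
If all we need is a yes/no answer rather than a full list of captures, the Internet Archive's Wayback Availability API is a lighter alternative. The sketch below assumes the JSON response has the shape I understand it to have (an `archived_snapshots` object with an optional `closest` entry); the function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"


def closest_snapshot(payload):
    """Return the closest snapshot URL from an availability response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_archived(url, timeout=30):
    """Ask the availability API about a single URL (makes a network request)."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_API}?{query}", timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Keeping the response parsing in `closest_snapshot` separate from the request makes it possible to test the logic without touching the network.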

## PushShift API

Unfortunately, Reddit's API doesn't let you retrieve more than 100 of the hottest posts. But there is a service called [PushShift](https://pushshift.io) that does make this data available via an API of its own. It appears that PushShift has [some kind of relationship](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/) with Reddit to make this possible, but it's not entirely clear what that relationship is.

So let's look at the posts from the last hour on the Coronavirus subreddit using the PushShift API. The API is public (no authentication required), and allows us to search the Coronavirus subreddit using time slices delimited by the `before` and `after` parameters.

In [114]:
import requests

# the subreddit is passed via params below, so it isn't repeated in the URL
url = "https://api.pushshift.io/reddit/search/submission"
params = {
    "subreddit": "Coronavirus",
    "after": "1h",
    "limit": 1000
}

results = requests.get(url, params=params).json()['data']

In [115]:
len(results)

162
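
Relative values like `1h` are interpreted at the moment of each request, so during a long crawl the windows drift as the clock advances. A sketch of one way around this, assuming PushShift also accepts epoch seconds for `before` and `after`, is to compute every window from a single fixed anchor time. The helper name `epoch_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def epoch_window(anchor, hours_ago, step=1):
    """Return before/after epoch seconds for a window of `step` hours
    that ends `hours_ago` hours before the fixed `anchor` time."""
    before = anchor - timedelta(hours=hours_ago)
    after = before - timedelta(hours=step)
    return {"before": int(before.timestamp()), "after": int(after.timestamp())}


# a fixed, timezone-aware anchor chosen once at the start of a crawl
anchor = datetime(2020, 3, 26, 12, 0, tzinfo=timezone.utc)
params = {"subreddit": "Coronavirus", "limit": 1000, **epoch_window(anchor, hours_ago=1)}
print(params)
```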

PushShift makes quite a bit more data available for each post. Here's the first one:

In [116]:
results[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'FrackingFrackers',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_e4dd0',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1585224568,
 'domain': 'ncbi.nlm.nih.gov',
 'full_link': 'https://www.reddit.com/r/Coronavirus/comments/fpafct/the_neuroinvasive_potential_of_sarscov2_may_play/',
 'gildings': {},
 'id': 'fpafct',
 'is_crosspostable': False,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': False,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '#0079d3',
 'link_flair_richtext': [],
 'link_flair_template_id': '394b6910-5e49-11ea-843b-0ebc196d18f7',
 'link_flair_text': 'Academic Report',
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locke
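
Only a handful of these fields are needed for most analyses. As a sketch, a small helper (the name `summarize_post` is made up here) can reduce a result to a few of them and convert `created_utc` into a timezone-aware ISO timestamp, which avoids any ambiguity about local time.

```python
from datetime import datetime, timezone

FIELDS = ("id", "url", "title", "author", "score", "num_comments")


def summarize_post(result):
    """Reduce a PushShift submission dict to a few fields of interest."""
    # .get() leaves a field as None if the API omitted it
    row = {name: result.get(name) for name in FIELDS}
    created = datetime.fromtimestamp(result["created_utc"], tz=timezone.utc)
    row["created"] = created.isoformat()
    return row


print(summarize_post({"id": "fpafct", "author": "FrackingFrackers", "created_utc": 1585224568}))
```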

For the purposes of appraising content about COVID-19, the `id`, `url`, `title`, `created_utc`, `score` and `num_comments` fields look like they could be useful. Let's write some code that will walk all the results back to January 1, 2020 (since we're interested in COVID-19) and write them to a CSV file. This is a longer stretch of code than I normally like to put in a Jupyter Notebook cell, but we need to do a bit of exception handling in case the API returns something strange.

A few things to note about this code. We start out fetching 1-hour windows with the `before` and `after` parameters. But when we encounter a time slice that has no posts, we expand the window using the `step` variable.

In [117]:
import csv
import sys
import time
import datetime

# keep a handle on the file so it can be closed (and flushed) when we're done
fh = open('data/reddit.csv', 'w', newline='')
out = csv.writer(fh)
out.writerow(['id', 'url', 'title', 'created', 'creator', 'score', 'comments'])

# the subreddit is passed via params, so it isn't repeated in the URL
url = "https://api.pushshift.io/reddit/search/submission"
params = {
    "subreddit": "Coronavirus",
    "limit": 1000
}

# keep track of the hour we are interested in
hour = 1

# step is the number of hours to look for at a time
step = 1

# calculate the number of hours since 2020-01-01
num_hours = (datetime.datetime.now() - datetime.datetime(2020, 1, 1)).total_seconds() / (60 * 60)

while hour < num_hours:
    sys.stdout.write('.')
    sys.stdout.flush()
    
    params['before'] = '{}h'.format(hour)
    params['after'] = '{}h'.format(hour + step)
    
    try:
        resp = requests.get(url, params=params)
        if resp.status_code != 200:
            print('received {} response'.format(resp.status_code))
        else:
            results = resp.json()['data']
            if len(results) > 0: 
                for result in results:
                    # note: fromtimestamp() converts to the local timezone
                    created = datetime.datetime.fromtimestamp(result['created_utc'])
                    out.writerow([
                        result['id'],
                        result['url'],
                        result['title'],
                        created.isoformat(),
                        result['author'],
                        result['score'],
                        result['num_comments']
                    ])

                # move the clock back further
                hour += step
            
            elif hour + step >= num_hours:
                # the empty window already reaches back to 2020-01-01, so stop
                break

            else:
                # we didn't find anything so increase the range
                step += 1
                        
    except Exception as e:
        print('uhoh: {}'.format(e))        

    # be polite and wait a little between requests
    time.sleep(0.5)

# close the file so that every row is flushed to disk
fh.close()

........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
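
The window-widening logic can be tried offline by separating it from the HTTP request. The sketch below is a variation on the cell above rather than a copy of it: `walk_back` and `fetch` are hypothetical names, `step` is reset to 1 after each hit, and the loop stops once an empty window reaches the start of the range.

```python
def walk_back(fetch, num_hours):
    """Yield results from successive time windows, widening empty ones.

    `fetch(before, after)` is a hypothetical callable that returns the posts
    created between `after` and `before` hours ago.
    """
    hour, step = 1, 1
    while hour < num_hours:
        results = fetch(hour, hour + step)
        if results:
            yield from results
            hour += step
            step = 1  # go back to a narrow window after a hit
        elif hour + step >= num_hours:
            break  # the empty window already covers the rest of the range
        else:
            step += 1  # nothing found: widen the window and retry


# a stand-in for the API: posts identified by how many hours ago they were made
ages = [1.5, 2.5, 10.5]

def fake_fetch(before, after):
    return [age for age in ages if before <= age < after]

print(list(walk_back(fake_fetch, num_hours=20)))
```

Resetting `step` keeps windows small in busy periods, which matters if the API caps the number of results returned per request.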

## Sorted

For manual inspection, and for seeing how the content changes over time, it's useful to save the CSV sorted by the time each post was created.

In [118]:
import pandas

reddit = pandas.read_csv('data/reddit.csv', parse_dates=['created'])
# index=False keeps the row index from being written as an extra unnamed column
reddit.to_csv('data/reddit.csv', index=False)
reddit

| | id | url | title | created | creator | score | comments |
|---|---|---|---|---|---|---|---|
| 0 | fp9r47 | https://www.wyff4.com/article/82-new-covid-19-... | 82 new COVID-19 cases reported in South Caroli... | 2020-03-26 07:13:05 | scdblaggie | 1 | 15 |
| 1 | fp9r5m | https://www.businessinsider.com/kious-kelly-ho... | How the only Superpower is down to using garba... | 2020-03-26 07:13:11 | ptyblog | 1 | 2 |
| 2 | fp9r6a | https://www.guampdn.com/story/news/local/2020/... | Guam reports 8 new positive cases, bringing th... | 2020-03-26 07:13:13 | ItsReptarOnRice | 1 | 1 |
| 3 | fp9ref | https://swarajyamag.com/science/a-short-introd... | A Short Introduction To Chloroquine: The Anti-... | 2020-03-26 07:13:48 | calciummag95 | 1 | 0 |
| 4 | fp9rgj | https://www.theatlantic.com/politics/archive/2... | All the President’s Lies About the Coronavirus... | 2020-03-26 07:13:55 | ryanbank | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 135740 | et02kx | https://twitter.com/DarrenPlymouth/status/1220... | Can anyone confirm if this is real? | 2020-01-23 16:48:54 | Puzzled-Mango | 1 | 37 |
| 135741 | et0cps | https://www.reddit.com/r/Coronavirus/comments/... | Doctor at Wuhan hospital states “ the virus is... | 2020-01-23 17:08:53 | Alirup123 | 1 | 18 |
| 135742 | et0hy9 | https://www.nature.com/news/inside-the-chinese... | This raises a question to me as to the true or... | 2020-01-23 17:18:46 | charlocity | 1 | 9 |
| 135743 | et0lyg | https://www.reddit.com/r/Coronavirus/comments/... | Would the flu shot provide any protection agai... | 2020-01-23 17:26:39 | whatisthatexactly | 1 | 56 |


Let's sort it by creation time:

In [121]:
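reddit = reddit.sort_values('created', ascending=True)
# save the sorted rows, without writing the row index as an extra column
reddit.to_csv('data/reddit.csv', index=False)
reddit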

| | id | url | title | created | creator | score | comments |
|---|---|---|---|---|---|---|---|
| 135744 | eszmer | https://www.reddit.com/r/Coronavirus/comments/... | Package from South Korea. | 2020-01-23 16:18:13 | Stochastic-Wolf | 1 | 6 |
| 135739 | eszxtd | https://i.redd.it/6gsubq3cllc41.jpg | First exclusive images of the virus under a mi... | 2020-01-23 16:39:50 | Isaac1234690 | 1 | 22 |
| 135740 | et02kx | https://twitter.com/DarrenPlymouth/status/1220... | Can anyone confirm if this is real? | 2020-01-23 16:48:54 | Puzzled-Mango | 1 | 37 |
| 135741 | et0cps | https://www.reddit.com/r/Coronavirus/comments/... | Doctor at Wuhan hospital states “ the virus is... | 2020-01-23 17:08:53 | Alirup123 | 1 | 18 |
| 135742 | et0hy9 | https://www.nature.com/news/inside-the-chinese... | This raises a question to me as to the true or... | 2020-01-23 17:18:46 | charlocity | 1 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 122 | fpagfn | https://www.anandtech.com/show/15661/folding-a... | Folding@Home Reaches Exascale: 1,500,000,000,0... | 2020-03-26 08:11:35 | Marha01 | 1 | 19 |
| 123 | fpagg1 | https://open.spotify.com/track/0yD9384WRaskrYj... | Watch the news about corona, freak out about c... | 2020-03-26 08:11:36 | manyatomman | 1 | 1 |
| 124 | fpagqh | https://thehill.com/policy/healthcare/489604-c... | Coronavirus deaths top 1,000 in US | 2020-03-26 08:12:08 | TrixyUkulele | 1 | 6 |
| 125 | fpagv2 | https://marginalrevolution.com/marginalrevolut... | "Test pooling": A smart way to make scarce tes... | 2020-03-26 08:12:23 | redditinface | 1 | 3 |


## Activity

Since there is so much activity, it would be interesting to see what it looks like over time. But first we need to summarize it, because there are too many rows to pass directly to Altair.

In [94]:
import altair
altair.renderers.enable('html')

# count the posts at each timestamp, then sum those counts into daily totals
posts_by_day = reddit.groupby('created').count()
posts_by_day = posts_by_day.resample('1D').sum()

altair.Chart(posts_by_day.reset_index(), title="Coronavirus Subreddit Posts", width=800).mark_bar().encode(
    altair.X('monthdate(created)', title='Time (Days)'),
    altair.Y('id', title='Posts per Day')
)

## Domains

What are the top websites (domains) that posts link to?
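
One way to answer this, sketched here with only the standard library, is to pull the hostname out of each URL and count the occurrences. The helper names `domain_of` and `top_domains` are made up for this example, and stripping a leading `www.` is a simplification that will not merge every variant of a site (for example `edition.cnn.com` and `cnn.com` stay separate).

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased hostname of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def top_domains(urls, n=10):
    """Count how often each domain appears and return the n most common."""
    return Counter(domain_of(u) for u in urls).most_common(n)


sample = [
    "https://www.reddit.com/r/Coronavirus/comments/fnkb5o/",
    "https://twitter.com/nbcnews/status/1242100184101851137?s=21",
    "https://twitter.com/freereadorg/status/1236104420217286658",
]
print(top_domains(sample))
```

With the DataFrame from the previous section loaded, something like `top_domains(reddit['url'].dropna())` should give the counts for the whole dataset.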

## Users

Given some [recent discussion](https://twitter.com/MatthiasWhist/status/1243145523017797634) of how bots are influencing votes, I thought it could be interesting to see which users are most present in this data, to try to see if there is some obviously automated behavior.

In [132]:
user_counts = reddit.groupby('creator').count()
user_counts = user_counts.sort_values('id', ascending=False)
user_counts['id'].head(50)

creator
-ZeuS--                 2432
KinnerNevada            1706
pink_paper_heart        1265
SeventhConstellation    1162
[deleted]               1047
Gonzo_B                  898
mythrowawaybabies        850
CgmatterTutorials        606
Frocharocha              566
NoticiES2020             551
Yamagemazaki             493
SantiGir20               446
johntempleton            440
n1ght_w1ng08             433
Viewfromthe31stfloor     415
BalkanEagles             410
lifeandmylens            331
Russi2020                291
hash0t0                  288
Zuom                     276
abdouh15                 255
hildebrand_rarity        247
mchamst3r                223
johntwit                 207
lexinshanghai            206
MaleficentRespect3       204
shallah                  187
Sleegan                  184
ssldvr                   179
Smileitsolga             179
twistedlogicx            176
Temstar                  176
srvnmdomdotnet           173
DoremusJessup            169
HugeDe

Does it seem kind of weird to you that the second most active poster, KinnerNevada, apparently has no posts on Reddit?

https://www.reddit.com/user/KinnerNevada
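
Raw counts only go so far. One rough heuristic, offered as a sketch rather than a reliable bot detector, is to look at how regularly each account posts: a very short or very uniform gap between posts may be worth a closer look. The function name `median_gap_seconds` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_gap_seconds(posts):
    """Map each creator to the median number of seconds between their posts.

    `posts` is an iterable of (creator, created) pairs where `created` is a
    datetime. Creators with fewer than two posts are skipped.
    """
    times = defaultdict(list)
    for creator, created in posts:
        times[creator].append(created)

    gaps = {}
    for creator, stamps in times.items():
        stamps.sort()
        deltas = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if deltas:
            gaps[creator] = median(deltas)
    return gaps


example = [
    ("steady", datetime(2020, 3, 26, 8, 0)),
    ("steady", datetime(2020, 3, 26, 8, 1)),
    ("steady", datetime(2020, 3, 26, 8, 2)),
    ("once", datetime(2020, 3, 26, 9, 0)),
]
print(median_gap_seconds(example))
```

With the DataFrame above, the pairs could be produced with `zip(reddit['creator'], reddit['created'])`.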