# Reddit

There is a [subreddit dedicated to the topic of Coronavirus](https://www.reddit.com/r/Coronavirus/). It currently has 1.5 million members.

> On December 2019, a novel coronavirus strain (SARS-CoV-2) emerged in the city of Wuhan, China. This subreddit seeks to monitor the spread of the disease COVID-19, declared a pandemic by the WHO. Please be civil and empathetic. This subreddit is for high quality posts and discussion.

Despite this description, the subreddit was actually created on May 3, 2013! Yep, coronaviruses have been with us for some time.

Is it possible to look through the URLs being shared there to see if there are seeds for the COVID-19 project? Let's install [praw](https://praw.readthedocs.io/en/latest/), the Python Reddit API Wrapper, and find out.

In [2]:
! pip --quiet install praw

To use praw you'll need API keys. Go over to https://www.reddit.com/prefs/apps and create an app. You can copy/paste them directly into your notebook below, but I've set mine in the environment for my Jupyter notebook to keep them private while sharing this notebook.

In [6]:
import os
import praw

reddit = praw.Reddit(
    client_id=os.environ.get('REDDIT_CLIENT_ID'),
    client_secret=os.environ.get('REDDIT_CLIENT_SECRET'),
    user_agent='praw-edsu')
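
Since `os.environ.get` quietly returns `None` for a variable that isn't set, it can be handy to check the environment before creating the client. Here is a minimal sketch of such a check; the `missing_env` helper is my own illustration and not part of praw.

```python
import os

def missing_env(names, environ=os.environ):
    """Return the names that are unset or empty in the environment."""
    return [name for name in names if not environ.get(name)]

# the two variable names used when creating the praw client
required = ['REDDIT_CLIENT_ID', 'REDDIT_CLIENT_SECRET']

missing = missing_env(required)
if missing:
    print('Set these environment variables first: ' + ', '.join(missing))
```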

Let's look specifically at the Coronavirus subreddit:

In [7]:
covid19 = reddit.subreddit('Coronavirus') 

## Posts

Reddit started as a site to share links to things, and to vote and comment on them. It attempts to rank these posts using what it calls "hotness", a ranking based on the votes a post has received and its age. At one point in time the source code for Reddit was available, and it was possible to [definitively say](https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9) how the algorithm worked. But now it's not entirely clear. At any rate the Reddit API provides access to the hottest posts.

In [8]:
posts = covid19.hot(limit=10)

for post in posts:
    print(post.title)
    print(post.url)
    print("")

Daily Discussion Post - March 23 | Questions, images, videos, comments, unconfirmed reports, theories, suggestions
https://www.reddit.com/r/Coronavirus/comments/fnkb5o/daily_discussion_post_march_23_questions_images/

Humanity wins: our fight to unlock 32,544 COVID-19 articles for the world. This petition is dedicated to the victims of the outbreak and their families. We fought for every article for every scientist for you.
https://twitter.com/freereadorg/status/1236104420217286658

COVID-19 front-line workers deserve financial reward
https://ottawacitizen.com/opinion/letters/todays-letters-covid-19-front-line-workers-deserve-financial-reward?utm_medium=Social&utm_source=Facebook#Echobox=1584960626

Sen. Klobuchar says her husband has coronavirus: “We just got the test results at 7 a.m. this morning ... He now has pneumonia and is on oxygen but not a ventilator.”
https://twitter.com/nbcnews/status/1242100184101851137?s=21

We have surpassed 100,000 recovered cases worldwide
https://twi
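
For a feel of how votes and age trade off in a "hotness" score, here is a sketch of the ranking function as it appeared in the formerly open-source Reddit code and as described in the article linked above. This is a historical formula. I'm assuming nothing about how Reddit ranks posts today, and the function and constant names are mine.

```python
from datetime import datetime, timedelta, timezone
from math import log10

# offset (in Unix seconds) used by the old open-source ranking code
REDDIT_EPOCH_OFFSET = 1134028003

def hot(ups, downs, created):
    """Historical hot score: log-scaled net votes plus a bonus for recency."""
    score = ups - downs
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = created.timestamp() - REDDIT_EPOCH_OFFSET
    return round(sign * order + seconds / 45000, 7)

posted = datetime(2020, 3, 23, 12, 0, tzinfo=timezone.utc)

# ten times the net votes is worth the same as being 45000 seconds (12.5 hours) newer
older_popular = hot(100, 0, posted)
newer_modest = hot(10, 0, posted + timedelta(seconds=45000))
print(older_popular, newer_modest)
```

Under this formula votes count logarithmically while age counts linearly, so newer posts steadily overtake older ones unless the older ones keep gaining votes by orders of magnitude.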

## Archived URLs

Now let's take a look at the 25 hottest stories, and see how many of their URLs have been archived at the Internet Archive.

In [9]:
from wayback import WaybackClient
from wayback.exceptions import WaybackException

wb = WaybackClient()

checked = []
archived = []
errors = []

for post in covid19.hot(limit=25):
    checked.append(post.url)
    try:
        versions = wb.search(post.url)
        if len(list(versions)) > 0:
            archived.append(post.url)
    except WaybackException:
        errors.append(post.url)
    except Exception as e:
        print(e)
    
print('{0} URLs checked ; {1} archived ({2:.1f}%) ; {3} unable to be archived.'.format(
    len(checked),
    len(archived),
    len(archived) / len(checked) * 100,
    len(errors)
))

25 URLs checked ; 8 archived (32.0%) ; 0 unable to be archived.


Ok, so what URLs have already been archived?

In [10]:
for url in archived:
    print(url)

https://twitter.com/freereadorg/status/1236104420217286658
https://www.reddit.com/r/Coronavirus/comments/fnl0n6/im_a_critical_care_doctor_working_in_a_uk_high/
https://replyua.net/news/201575-koronavirusa-nikakogo-net-a-est-koronavirus-v-golovah-u-chinovnikov-shahovu-pripomnili-kak-on-otrical-ugrozu-virusa-covid-19-dlya-ukrainy.html
https://www.theguardian.com/commentisfree/2020/mar/23/us-students-are-being-asked-to-work-remotely-but-22-of-homes-dont-have-internet
https://twitter.com/WellingMichael/status/1241491706677284870
https://nypost.com/2020/03/23/coronavirus-crisis-will-get-bad-this-week-surgeon-general-warns/
https://www.freep.com/story/news/local/michigan/oakland/2020/03/23/whitmer-michigan-lock-down-like-ohio-six-others-coronavirus-covid-19/2896041001/
https://www.sciencealert.com/mild-covid-19-might-cause-a-lost-of-smell-or-taste


We can also see which ones have not been archived:

In [11]:
for url in set(checked) - set(archived):
    print(url)

https://twitter.com/nbcnews/status/1242100184101851137?s=21
http://kdvr.com/news/coronavirus/fda-approves-rapid-coronavirus-test-created-in-colorado
https://edition.cnn.com/world/live-news/coronavirus-outbreak-03-23-20-intl-hnk/h_8df53475be78f0280ae6eeebe098885a
https://www.latimes.com/opinion/story/2020-03-23/coronovirus-healthcare-workers-risk
https://edition.cnn.com/2020/03/23/health/us-coronavirus-updates-monday/index.html
https://twitter.com/BAG_OFSP_UFSP/status/1242051679098474498
https://ottawacitizen.com/opinion/letters/todays-letters-covid-19-front-line-workers-deserve-financial-reward?utm_medium=Social&utm_source=Facebook#Echobox=1584960626
https://twitter.com/nbcnews/status/1241914972579467266?s=21
https://www.newsweek.com/new-york-city-now-has-more-confirmed-cases-coronavirus-all-south-korea-1493755
https://www.ladbible.com/news/news-gran-95-becomes-oldest-woman-in-italy-to-recover-from-covid-19-20200323
https://www.channel24.co.za/News/Local/queens-royal-aide-tests-positiv
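
One of the unarchived URLs above carries `utm_` tracking parameters and an `#Echobox` fragment. I haven't tested whether that affects the lookup, but one possible preprocessing step is to strip these before querying. The sketch below is an assumption of mine, not something the wayback library requires, and `strip_tracking` is a hypothetical helper built only on the standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_tracking(url):
    """Drop utm_* query parameters and the fragment from a URL."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith('utm_')
    ]
    # an empty fragment and an empty query are simply omitted by urlunsplit
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ''))

print(strip_tracking(
    'https://ottawacitizen.com/opinion/letters/todays-letters-covid-19-front-line-workers-deserve-financial-reward'
    '?utm_medium=Social&utm_source=Facebook#Echobox=1584960626'
))
```

Dropping the fragment is a blunt choice: some sites use it for routing, so it would be worth keeping for those.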

## PushShift API

Unfortunately, Reddit's API doesn't let you retrieve more than 100 of the hottest posts. But there is a service called [PushShift](https://pushshift.io) that does make this data available via an API of its own. It appears that PushShift has [some kind of relationship](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/) with Reddit to make this possible, but it's not entirely clear what that relationship is.

So let's look at the posts from the last hour on the Coronavirus subreddit using the PushShift API. The API is public (no authentication required), and allows us to search the Coronavirus subreddit using time slices delimited by the `before` and `after` parameters.

In [1]:
import requests

# the subreddit is passed in params, so it doesn't need repeating in the URL
url = "https://api.pushshift.io/reddit/search/submission"
params = {
    "subreddit": "Coronavirus",
    "after": "1h",
    "limit": 1000
}

results = requests.get(url, params=params).json()['data']

In [2]:
len(results)

194

PushShift makes quite a bit more data available for each post. Here's the first one:

In [3]:
results[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'Hard_at_it',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_bu5i3',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1585318603,
 'domain': 'wbur.org',
 'gildings': {},
 'id': 'fpxo4o',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 13,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_status': 'no_ads',
 'pinned': False,
 'pwls': 0,
 'retrieved_on': 1585318605,
 'score': 1,
 'selftext': '',
 'send_
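
The `created_utc` value above is a Unix timestamp in seconds. Note that `datetime.fromtimestamp` without a `tz` argument converts it to the local time of whatever machine runs the notebook. If you'd rather keep timestamps comparable across machines, here is a small sketch that keeps them in UTC; the `created_iso` name is just for illustration.

```python
from datetime import datetime, timezone

def created_iso(created_utc):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()

# the created_utc value from the post shown above
print(created_iso(1585318603))  # 2020-03-27T14:16:43+00:00
```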

For the purposes of appraising content about COVID-19 the `id`, `url`, `title`, `created_utc`, `score` and `num_comments` look like they could be useful. Let's write some code that will walk all the results back to January 1, 2020 (since we're interested in COVID-19) and write them to a CSV file. This is a bit of a longer stretch of code than I normally like to put in a Jupyter Notebook cell, but we need to do a bit of exception handling in case the API returns strangely.

A few things to note about this code. We start out fetching one-hour windows with the `before` and `after` parameters. But as we encounter time slices that have no posts we expand this window using the `step` variable.

In [4]:
import csv
import sys
import time
import datetime

# newline='' stops the csv module from writing blank rows on Windows
csv_file = open('data/reddit.csv', 'w', newline='', encoding='utf-8')
out = csv.writer(csv_file)
out.writerow(['id', 'url', 'title', 'created', 'creator', 'score', 'comments'])

# the subreddit is passed in params, so it doesn't need repeating in the URL
url = "https://api.pushshift.io/reddit/search/submission"
params = {
    "subreddit": "Coronavirus",
    "limit": 1000
}

# hour is how many hours ago the current window ends
hour = 1

# step is the width of the window in hours
step = 1

# calculate the number of hours since 2020-01-01
num_hours = (datetime.datetime.now() - datetime.datetime(2020, 1, 1)).total_seconds() / (60 * 60)

while hour < num_hours:
    sys.stdout.write('.')
    sys.stdout.flush()

    params['before'] = '{}h'.format(hour)
    params['after'] = '{}h'.format(hour + step)

    try:
        resp = requests.get(url, params=params)
        if resp.status_code != 200:
            # hour is not advanced, so the same window is retried
            print('received {} response'.format(resp.status_code))
        else:
            results = resp.json()['data']
            if len(results) > 0:
                for result in results:
                    # note: this converts to the local timezone of this machine
                    created = datetime.datetime.fromtimestamp(result['created_utc'])
                    out.writerow([
                        result['id'],
                        result['url'],
                        result['title'],
                        created.isoformat(),
                        result['author'],
                        result['score'],
                        result['num_comments']
                    ])

                # move the clock back further
                hour += step

            else:
                # we didn't find anything so increase the range
                step += 1

    except Exception as e:
        print('uhoh: {}'.format(e))

    # be polite and wait a little between requests
    time.sleep(0.5)

# close the file so every row is flushed to disk before it is read back
csv_file.close()

........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
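
The window-expanding idea described above can be pulled apart from the network calls, which makes it easier to reason about. Below is a sketch under a few assumptions of mine: `fetch` is a hypothetical callable standing in for the PushShift request (in a real run it would send `before='{newer}h'` and `after='{older}h'`), the step is reset to one hour after each hit, and the walk stops once an empty window already reaches the cutoff. The cell above does neither of the last two.

```python
def walk_back(num_hours, fetch):
    """Yield posts while walking back in time, widening empty windows.

    fetch(newer, older) is a hypothetical callable that returns the posts
    created between `older` and `newer` hours ago.
    """
    hour, step = 1, 1
    while hour < num_hours:
        results = fetch(hour, hour + step)
        if results:
            yield from results
            hour += step   # move the clock back past this window
            step = 1       # and start again with a one-hour window
        elif hour + step >= num_hours:
            break          # the empty window already reaches the cutoff
        else:
            step += 1      # nothing found, so widen the window

# fake data: posts keyed by how many hours ago they were created
ages = {3: 'a', 7: 'b'}

def fake_fetch(newer, older):
    return [post for age, post in sorted(ages.items()) if newer <= age < older]

print(list(walk_back(10, fake_fetch)))  # ['a', 'b']
```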

## Sorted

For manual inspection, and for seeing how the content changes over time, it's useful to save the CSV sorted by creation time.

In [89]:
import pandas
pandas.set_option('display.max_colwidth', None)

reddit = pandas.read_csv('data/reddit.csv', parse_dates=['created'], index_col='id')
reddit

id,url,title,created,creator,score,comments
fpwrte,https://www.purevpn.com/business-vpn?123,VPN traffic surging in response to increased telework - Here's the best VPN for Remote Work,2020-03-27 09:17:53,rickmartingt,1,3
fpwrtt,https://www.livemint.com/news/world/coronavirus-fearing-next-wave-china-doesn-t-want-its-diaspora-coming-back-11585310784475.html,"Fearing next wave, China doesn’t want its diaspora coming back",2020-03-27 09:17:55,SwimmingFault,1,0
fpwruk,https://www.msnbc.com/rachel-maddow/watch/army-corps-of-engineers-gives-options-to-states-facing-covid-crush-81265733635,"Army Corps converting arenas, dorms, and hotels into hospitals.",2020-03-27 09:17:56,somethingstinkd567,1,9
fpwrv5,http://vaccine-covid19.org/,2019-nCov Vaccine WOOOW ITS REAL??,2020-03-27 09:17:58,Equivalent_Frosting,1,0
fpwrvn,https://www.foxnews.com/entertainment/prince-charles-camilla-clap-health-care-workers-fighting-coronavirus-diagnosis,"Prince Charles, Camilla virtually clap for health care workers fighting coronavirus after royal's diagnosis",2020-03-27 09:18:00,alexfedp26,1,2
...,...,...,...,...,...,...
eszmer,https://www.reddit.com/r/Coronavirus/comments/eszmer/package_from_south_korea/,Package from South Korea.,2020-01-23 16:18:13,Stochastic-Wolf,1,6
eszrr8,https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6,Virus Map (,2020-01-23 16:28:21,JimmyBones79,1,1
eszxtd,https://i.redd.it/6gsubq3cllc41.jpg,First exclusive images of the virus under a microscope,2020-01-23 16:39:50,Isaac1234690,1,22
et02kx,https://twitter.com/DarrenPlymouth/status/1220427053717250050,Can anyone confirm if this is real?,2020-01-23 16:48:54,Puzzled-Mango,1,37


Let's sort it by creation time:

In [50]:
reddit = reddit.sort_values('created', ascending=True)
reddit

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,url,title,created,creator,score,comments
139888,139888,139888,esydul,https://i.redd.it/q4i8ya2i2lc41.jpg,So far this is what I have gathered. By now numbers could very well have changed.,2020-01-23 14:54:22,ShadowSociety247,1,15
139884,139884,139884,eszmer,https://www.reddit.com/r/Coronavirus/comments/eszmer/package_from_south_korea/,Package from South Korea.,2020-01-23 16:18:13,Stochastic-Wolf,1,6
139885,139885,139885,eszrr8,https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6,Virus Map (,2020-01-23 16:28:21,JimmyBones79,1,1
139886,139886,139886,eszxtd,https://i.redd.it/6gsubq3cllc41.jpg,First exclusive images of the virus under a microscope,2020-01-23 16:39:50,Isaac1234690,1,22
139887,139887,139887,et02kx,https://twitter.com/DarrenPlymouth/status/1220427053717250050,Can anyone confirm if this is real?,2020-01-23 16:48:54,Puzzled-Mango,1,37
...,...,...,...,...,...,...,...,...,...
165,165,165,fpxngr,https://www.cdamlab.com/faceshield.html,NZ university shares designs for laser cut face shields. Apparently faster and cheaper to mass produce than 3D printing.,2020-03-27 10:15:29,ABetterToday,1,3
166,166,166,fpxnj5,https://www.bloomberg.com/news/articles/2020-03-27/sony-says-virus-may-wipe-out-forecast-upgrade-delay-earnings,Sony does not expect to delay PS5 due to pandemic,2020-03-27 10:15:37,Frocharocha,1,26
167,167,167,fpxnqp,https://www.theguardian.com/commentisfree/2020/mar/27/can-us-reopen-economy-coronavirus,Trump thinks he can snap his fingers and reopen the economy. It won't work. [Guardian],2020-03-27 10:16:02,everydayiscience,1,1
168,168,168,fpxnw9,https://billswire.usatoday.com/2020/03/27/2020-nfl-draft-will-proceed-scheduled-roger-goodell/,2020 NFL Draft will proceed as scheduled,2020-03-27 10:16:20,into_the_space,1,9
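
The `Unnamed: 0` columns in the output above are most likely left over from writing the DataFrame out with its index and reading it back in, possibly more than once. Here is a small sketch of saving the sorted data without the index. It uses in-memory buffers and made-up rows so that it is self-contained; in the notebook you would pass a filename instead.

```python
import io
import pandas

# two made-up rows, deliberately out of order
csv_text = "id,created\nb,2020-03-27T10:00:00\na,2020-01-23T14:54:22\n"
df = pandas.read_csv(io.StringIO(csv_text), parse_dates=['created'])

# sort by creation time and write without the index column
buf = io.StringIO()
df.sort_values('created').to_csv(buf, index=False)

# reading it back gives only the original columns, in sorted order
round_trip = pandas.read_csv(io.StringIO(buf.getvalue()))
print(list(round_trip.columns))
print(list(round_trip['id']))
```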


## Activity

Since there is so much activity it would be interesting to see what it looks like over time. But first we need to summarize it, because it's too many rows to pass off directly to Altair.

In [74]:
import altair
altair.renderers.enable('html')

# count the posts created on each day ('created' must be a datetime column)
posts_by_day = (
    reddit.set_index('created')
    .resample('1D')
    .size()
    .reset_index(name='posts')
)

altair.Chart(posts_by_day, title="Coronavirus Subreddit Posts", width=800).mark_bar().encode(
    altair.X('monthdate(created)', title='Time (Days)'),
    altair.Y('posts', title='Posts per Day')
)

## Domains

What are the top websites (domains) being submitted?

In [51]:
from urllib.parse import urlparse

reddit['domain'] = reddit['url'].map(lambda u: urlparse(u).netloc)
reddit.groupby('domain').count().sort_values('id', ascending=False)

domain,Unnamed: 0,Unnamed: 0.1,id,url,title,created,creator,score,comments
www.reddit.com,18584,18584,18584,18584,18584,18584,18584,18584,18584
i.redd.it,11307,11307,11307,11307,11307,11307,11307,11307,11307
twitter.com,9348,9348,9348,9348,9348,9348,9348,9348,9348
youtu.be,5276,5276,5276,5276,5276,5276,5276,5276,5276
www.youtube.com,4254,4254,4254,4254,4254,4254,4254,4254,4254
...,...,...,...,...,...,...,...,...,...
shop.hobbio.co,1,1,1,1,1,1,1,1,1
shop.spreadshirt.com,1,1,1,1,1,1,1,1,1
shopmobileclean.com,1,1,1,1,1,1,1,1,1
shopsy.pk,1,1,1,1,1,1,1,1,1


Many of these hosts are subdomains of the same registered domain. Let's see if the [tld](https://pypi.org/project/tld/) module can help group them:

In [94]:
from tld import get_tld

def fld(url):
    tld = get_tld(url, as_object=True, fail_silently=True)
    return tld.fld if tld else ''

reddit['fld'] = reddit['url'].map(fld)
domains = reddit.groupby('fld').count().sort_values('url', ascending=False)
domains['url'].head(25)

fld
reddit.com                    18673
redd.it                       14117
twitter.com                    9961
youtu.be                       5279
youtube.com                    4567
google.com                     2867
imgur.com                      2507
reuters.com                    2360
nytimes.com                    2247
cnn.com                        1926
theguardian.com                1230
channelnewsasia.com             910
cnbc.com                        881
facebook.com                    860
straitstimes.com                794
yahoo.com                       762
nypost.com                      759
bbc.com                         729
thehill.com                     688
bloomberg.com                   674
washingtonpost.com              663
bbc.co.uk                       627
scmp.com                        573
abc.net.au                      570
elcoronavirus.blogspot.com      561
Name: url, dtype: int64
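
To see why a library is worth using here, consider the naive approach of keeping just the last two labels of the hostname. The sketch below (my own illustration, standard library only) works for `youtube.com` and `redd.it` but falls over on `bbc.co.uk`, where the registered domain has three labels. As far as I know the tld module handles this by consulting a list of known public suffixes.

```python
from urllib.parse import urlparse

def naive_fld(url):
    """Guess the registered domain by keeping the last two hostname labels."""
    host = urlparse(url).netloc.lower()
    return '.'.join(host.split('.')[-2:])

print(naive_fld('https://www.youtube.com/watch?v=x'))     # youtube.com
print(naive_fld('https://i.redd.it/6gsubq3cllc41.jpg'))   # redd.it
print(naive_fld('https://www.bbc.co.uk/news'))            # co.uk (wrong)
```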

In [95]:
thehill = reddit[reddit['fld'] == 'thehill.com']
thehill

id,url,title,created,creator,score,comments,fld
fpx8f1,https://thehill.com/homenews/administration/489608-white-house-deviated-from-pandemic-plan-report?amp,Trump administration deviated from original pandemic plan.,2020-03-27 09:48:58,superguy224,1,0,thehill.com
fpxdxt,https://thehill.com/policy/international/europe/489819-germany-takes-in-some-coronavirus-patients-from-italy,Germany takes in some coronavirus patients from Italy,2020-03-27 09:58:59,CRI0ST0IR,1,1,thehill.com
fpvfq7,https://thehill.com/policy/international/europe/489797-boris-johnson-tests-positive-for-coronavirus,Boris Johnson has tested positive for COVID-19,2020-03-27 07:37:29,please_remember-me,1,2,thehill.com
fpvko8,https://thehill.com/blogs/blog-briefing-room/news/489798-nypd-announces-first-employee-death-due-to-coronavirus,NYPD announces first employee death due to coronavirus,2020-03-27 07:48:58,fredfredburger0123,1,11,thehill.com
fpr7b6,https://thehill.com/policy/healthcare/489774-birx-cautions-against-inaccurate-models-predicting-signficant-coronavirus?amp,Birx cautions against inaccurate models predicting significant coronavirus spread,2020-03-27 01:13:28,imfartandsmunny,1,4,thehill.com
...,...,...,...,...,...,...,...
f4ojjp,https://thehill.com/changing-america/well-being/prevention-cures/481652-what-does-the-movie-contagion-tell-us-about,What does the movie 'Contagion' tell us about coronavirus (COVID-19)?,2020-02-16 04:41:51,IvyGold,1,30,thehill.com
f1hnq8,https://thehill.com/policy/technology/481956-reddit-enlists-users-to-combat-coronavirus-misinformation,Who gets to decide what misinformation is or isn't...? Why Reddit of course? How many 'misinformation' stories have turned out to be true?,2020-02-09 19:02:54,adam_the_eve,1,43,thehill.com
ez9t7u,https://thehill.com/changing-america/well-being/prevention-cures/481257-why-are-we-panicked-about-coronavirus-and-calm,Why are we panicked about coronavirus — and calm about the flu? You will all vote and comment before reading the article further validating the points the article asserts.,2020-02-05 08:48:50,CoffeeNTrees,1,18,thehill.com
eyu6k6,https://thehill.com/changing-america/well-being/prevention-cures/481367-who-coronavirus-outbreak-not-yet-pandemic,"WHO: Coronavirus outbreak not yet pandemic, but ""Epidemic with multiple foci""",2020-02-04 13:07:51,LeftistsAreBigots,1,13,thehill.com


688 URLs from thehill.com?! Are they distinct?

In [96]:
len(thehill['url'].unique())

663

Surprisingly, mostly yes: 663 of the 688 are distinct. How about the New York Times?

In [97]:
nyt = reddit[reddit['fld'] == 'nytimes.com']
print(len(nyt['url']), 'total')
print(len(nyt['url'].unique()), 'unique')

2247 total
2151 unique


How about ProPublica?

In [107]:
propublica = reddit[reddit['fld'] == 'propublica.org']
propublica = propublica.sort_values('score', ascending=False)
print(len(propublica['url']), 'total')
print(len(propublica['url'].unique()), 'unique')

# keep the columns of interest and save them for closer inspection
cols = ['created', 'creator', 'score', 'comments', 'url', 'title', 'fld']
propublica[cols].to_csv('propublica-reddit.csv')

75 total
65 unique


## Users

Given some [recent discussion](https://twitter.com/MatthiasWhist/status/1243145523017797634) of how bots are influencing votes I thought it could be interesting to see which users are most present in this data, to try to see if there is some obviously automated behavior.

In [132]:
user_counts = reddit.groupby('creator').count()
user_counts = user_counts.sort_values('id', ascending=False)
user_counts['id'].head(50)

creator
-ZeuS--                 2432
KinnerNevada            1706
pink_paper_heart        1265
SeventhConstellation    1162
[deleted]               1047
Gonzo_B                  898
mythrowawaybabies        850
CgmatterTutorials        606
Frocharocha              566
NoticiES2020             551
Yamagemazaki             493
SantiGir20               446
johntempleton            440
n1ght_w1ng08             433
Viewfromthe31stfloor     415
BalkanEagles             410
lifeandmylens            331
Russi2020                291
hash0t0                  288
Zuom                     276
abdouh15                 255
hildebrand_rarity        247
mchamst3r                223
johntwit                 207
lexinshanghai            206
MaleficentRespect3       204
shallah                  187
Sleegan                  184
ssldvr                   179
Smileitsolga             179
twistedlogicx            176
Temstar                  176
srvnmdomdotnet           173
DoremusJessup            169
HugeDe

Does it seem kind of weird to you that the second most active poster apparently has no posts on Reddit?

https://www.reddit.com/user/KinnerNevada