# Reddit

Reddit [supposedly](https://thenextweb.com/contributors/2018/04/19/reddit-now-active-users-twitter-engaging-porn/) has more active users than Twitter. But who are those users, and what are they doing? One indicator is the [Coronavirus subreddit](https://www.reddit.com/r/Coronavirus/), which has 1.9 million members. Reddit is primarily a social bookmarking site where people share links and have conversations about them, so we can use the PushShift API to see how the DomainTools domains are being shared and interacted with. [PushShift's API](https://github.com/pushshift/api) lets you search for submissions that include links to a particular domain.

For example, to search for the nytimes.com domain:

    http://api.pushshift.io/reddit/submission/search?domain=nytimes.com

In [5]:
import requests

url = "http://api.pushshift.io/reddit/submission/search?domain=nytimes.com"
results = requests.get(url).json()['data']
results[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'demonbadger',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_a6y3w',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1586263598,
 'domain': 'nytimes.com',
 'full_link': 'https://www.reddit.com/r/Idaho/comments/fwjwar/a_liberty_rebellion_in_idaho_threatens_to/',
 'gildings': {},
 'id': 'fwjwar',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 1,
 'num_crossposts': 0,
 'over_18': False,
 'permalink': '/r/Idaho/

We get *a lot* of information back, but some key bits that will be useful for us are:

* **title:** the title of the post, e.g. *A ‘Liberty’ Rebellion in Idaho Threatens to Undermine Coronavirus Orders - The New York Times*
* **url:** the nytimes.com URL that was submitted
* **author:** the user who submitted the post
* **comments:** the number of comments the post received (returned by the API as `num_comments`)
* **score:** the post's score (upvotes minus downvotes)
* **created_utc:** the time the post was created, as a Unix timestamp
* **id:** Reddit's unique identifier for the post
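
To make the list above concrete, here is a minimal sketch that keeps only those fields from one record. `key_fields` and `KEY_FIELDS` are hypothetical names, not part of the original notebook, and the sample is a trimmed copy of the truncated output shown earlier, so fields that were cut off (such as `title` and `score`) simply come back as `None`.

```python
# Hypothetical helper (an assumption, not from the notebook): keep only the
# fields of interest from one PushShift submission record.
KEY_FIELDS = ["id", "created_utc", "url", "title", "author", "num_comments", "score"]

def key_fields(post):
    """Return a dict with just the key fields; missing ones become None."""
    return {name: post.get(name) for name in KEY_FIELDS}

# A trimmed copy of the record printed above (the original output is truncated,
# so title, url and score are not available here).
sample = {
    "id": "fwjwar",
    "author": "demonbadger",
    "created_utc": 1586263598,
    "domain": "nytimes.com",
    "num_comments": 1,
    "over_18": False,
}

summary = key_fields(sample)
print(summary)
```

Note that the comment count is looked up under `num_comments`, which is the key that appears in the API output.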

## Collect the Data

So we're ready to look at the DomainTools data. Before we do, it's worth pointing out that we can only get up to 1000 posts per request. While it's certainly possible for the *nytimes.com* domain to have more than 1000 posts, I don't expect to see that many for any of the DomainTools domains. It will still be useful to print a message if we do. So we will request at most 1000 results per domain and order them by their *score* using these API parameters:

* **limit:** the maximum number of results to return (no higher than 1000) 
* **sort_type:** we can use this to order results by their score 
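
As a rough sketch of how these two parameters combine with `domain`, the snippet below builds the search URL without sending a request and includes a small check for a full response. `search_url` and `hit_limit` are hypothetical helper names introduced here for illustration only.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.pushshift.io/reddit/submission/search"

def search_url(domain, limit=1000, sort_type="score"):
    """Build the search URL for one domain (no request is sent)."""
    query = urlencode({"domain": domain, "limit": limit, "sort_type": sort_type})
    return BASE_URL + "?" + query

def hit_limit(results, limit=1000):
    """True when a response is full, so more posts may exist than were returned."""
    return len(results) >= limit

print(search_url("nytimes.com"))
# http://api.pushshift.io/reddit/submission/search?domain=nytimes.com&limit=1000&sort_type=score
```

A full response does not prove that posts were cut off, only that they might have been, which is why the check is worded as "may exist".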

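Let's create a function that looks up the data for a given domain and returns it. Apart from the request itself, it does three small things: it converts the `created_utc` value to a formatted datetime, copies `num_comments` to `comments` so the name matches our CSV columns, and prints a message if we hit the 1000-post limit.

In [27]:
import datetime

def reddit_posts(domain):
    """Yield PushShift submissions that link to the given domain."""
    url = "http://api.pushshift.io/reddit/submission/search"
    params = {"domain": domain, "limit": 1000, "sort_type": "score"}
    resp = requests.get(url, params=params)
    if resp.status_code != 200:
        # report the failure and skip this domain instead of parsing an error body
        print(resp)
        return
    results = resp.json()['data']
    if len(results) >= params["limit"]:
        print("{} returned {} posts; there may be more".format(domain, len(results)))
    for result in results:
        # created_utc is seconds since the epoch; store it as an ISO 8601 string in UTC
        result['created'] = datetime.datetime.fromtimestamp(
            result['created_utc'], tz=datetime.timezone.utc).isoformat()
        # the API calls this field num_comments; our CSV column is named comments
        result['comments'] = result.get('num_comments')
        yield result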
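
The timestamp conversion can be tried on its own. This is a minimal sketch that pins the result to UTC, so the output does not depend on the timezone of the machine running it; `to_iso_utc` is a hypothetical helper name, and the input is the `created_utc` value from the sample record shown earlier.

```python
import datetime

def to_iso_utc(created_utc):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    moment = datetime.datetime.fromtimestamp(created_utc, tz=datetime.timezone.utc)
    return moment.isoformat()

print(to_iso_utc(1586263598))  # 2020-04-07T12:46:38+00:00
```

Calling `fromtimestamp` without a `tz` argument returns local time instead, which would make the stored values differ from one machine to another.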

Next we'll need the DomainTools data:

In [7]:
import pandas

df = pandas.read_csv('data/domaintools/2020-04-06.csv.gz',
    parse_dates=['created'], 
    sep='\t',
    names=['domain', 'created', 'risk']
)

Let's get ready to write out the data as a CSV by setting the output path and the columns we want.

In [11]:
import csv
import datetime

today = datetime.date.today().strftime('%Y-%m-%d')
csv_path = 'data/reddit/{}.csv'.format(today)
cols=['id', 'created', 'url', 'title', 'author', 'comments', 'score', 'domain']

OK, we're ready to start. But first let's limit ourselves to the riskiest domains, those with a risk score of 99 or higher.

In [18]:
riskiest = df[df.risk >= 99.0]
len(riskiest)

91644

If we pause for 1 second between requests to PushShift's API, just to be nice, the run should take at least this many hours:

In [19]:
len(riskiest) / 60 / 60

25.456666666666667
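
The same estimate can be written as a small function, which makes the assumption explicit: only the pauses are counted, so the real run will take longer once request latency is added. `minimum_runtime` is a hypothetical name used for this sketch.

```python
import datetime

def minimum_runtime(n_requests, pause_seconds=1.0):
    """Lower bound on total runtime: counts only the pauses, not request latency."""
    return datetime.timedelta(seconds=n_requests * pause_seconds)

estimate = minimum_runtime(91644)  # one request per domain in the riskiest set
print(estimate)                    # 1 day, 1:27:24
print(estimate.total_seconds() / 3600)  # the same figure expressed in hours
```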

We'd better get started then! The `DictWriter` class lets us define which columns to include in our output and ignore the ones we are not interested in.
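
As a small illustration of that behaviour, here is a sketch that writes one made-up record to an in-memory buffer instead of a file. The record, `demo_cols`, and the buffer are assumptions for the example only. It shows two things: keys that are not listed in the columns are dropped, and listed columns that are missing from the record are written as empty strings.

```python
import csv
import io

demo_cols = ["id", "title", "score", "comments"]

# A made-up record: it carries an extra key ("over_18") and lacks "comments".
post = {"id": "abc123", "title": "Example post", "score": 5, "over_18": False}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=demo_cols, extrasaction="ignore")
writer.writeheader()
writer.writerow(post)

print(buffer.getvalue())
# id,title,score,comments
# abc123,Example post,5,
```

The empty last field is worth noticing: a column name that does not match a key in the API record produces a blank column rather than an error.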

In [31]:
import sys
import time

cols = ['id', 'created', 'author', 'title', 'url', 'domain', 'score', 'comments']

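# newline='' lets the csv module control line endings itself
with open(csv_path, 'w', newline='') as fh:
    # extrasaction='ignore' drops any keys in a post that are not listed in cols
    out = csv.DictWriter(fh, fieldnames=cols, extrasaction='ignore')
    for domain in riskiest.domain:
        for post in reddit_posts(domain):
            out.writerow(post)
            sys.stdout.write('+')  # one '+' per post written
        time.sleep(1)  # pause between requests to be nice to the API
        sys.stdout.write('.')  # one '.' per domain checked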

...................................................................................................................................................

KeyboardInterrupt: 