# Data Science for Social Justice Workshop: Optional Material

## The Reddit API

In this notebook, we'll use the Reddit API to collect your own data.

The Reddit API allows you to do lots of things, such as posting automatically as a user. It also allows you to retrieve data from Reddit, such as subreddit posts and comments.

There are restrictions in place: Reddit's API only allows you to retrieve 1000 posts (and their associated comments) per task. We could write a script that records the timestamps of posts in order to scrape the entirety of a subreddit over multiple tasks, but for now we will just download up to 1000 posts from our subreddit (fewer if the subreddit has fewer than 1000 posts).

Follow these steps to get started:

1. **Sign Up.** First, you will need to sign up with Reddit to run some of the code. Go to http://www.reddit.com and **sign up** for an account.

2. **Create an App.** Go to [this page](https://ssl.reddit.com/prefs/apps/) and click on the `are you a developer? create an app` button at the bottom.

3. **Fill Out the Form.** Fill out the form that appears. For the name, you can enter whatever you'd like. Select "script". Enter the redirect URI as shown in the image below. You can leave everything else blank. Then, click "create app".

![redditapi](../../img/reddit_api.png)

4. **Note API Credentials.** You should see a new box appear, with some important information. This includes:
    - Client ID: A string of at least 14 characters, listed just under “personal use script” for the application you created.
    - Client Secret: A string of at least 27 characters, listed next to “secret” for the application.
    - Username: The username of the Reddit account used to register the application.
    - Password: This is not shown here, but you should remember your password to your account.
    
![redditapi2](../../img/reddit_api2.png)

## Importing and Using `praw`

Even though we're set up with the API, we still need a way to talk to it from Python. Luckily, this has already been done for us by the Python Reddit API Wrapper, `praw`, a package we can download and use.

Install `praw`, then fill in your details below to create a `reddit` variable.

In [None]:
!pip install praw

In [None]:
import praw

reddit = praw.Reddit(client_id='YOUR_CLIENT_ID_HERE',
                     client_secret='YOUR_CLIENT_SECRET_HERE',
                     password='YOUR_REDDIT_PSW_HERE',
                     user_agent='Get Reddit data 1.0 by /u/YOUR_REDDIT_NAME_HERE',
                     username='YOUR_REDDIT_USERNAME_HERE')
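
If you plan to share this notebook, you may prefer not to type your credentials into it. The sketch below is one possible alternative, not part of the original workshop: it reads the same five values from environment variables. The helper `load_reddit_credentials` and the `REDDIT_*` variable names are hypothetical and can be renamed freely.

```python
import os

def load_reddit_credentials(environ=os.environ):
    """Collect the keyword arguments for praw.Reddit from environment variables.

    The REDDIT_* variable names are hypothetical; use any names you like.
    """
    # Map each praw.Reddit keyword argument to a (hypothetical) variable name
    names = {
        'client_id': 'REDDIT_CLIENT_ID',
        'client_secret': 'REDDIT_CLIENT_SECRET',
        'username': 'REDDIT_USERNAME',
        'password': 'REDDIT_PASSWORD',
    }
    # Fail early with a readable message if anything is unset or empty
    missing = [var for var in names.values() if not environ.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    credentials = {key: environ[var] for key, var in names.items()}
    # Same user agent format as in the cell above
    credentials['user_agent'] = 'Get Reddit data 1.0 by /u/' + credentials['username']
    return credentials

# Usage (requires praw and the four variables above to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

The function only builds a dictionary, so it can be checked without contacting Reddit.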

## Getting Data with the Reddit API

For the purpose of this exercise, we'll download Reddit data into one file, but it's common practice to store posts and comments in two separate, related tables of a database.
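
To make the two-table idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not used in the rest of this notebook. The table layout, the column names, and the example rows are all made up for illustration; the key idea is that each comment stores the `id` of the post it belongs to.

```python
import sqlite3

def create_tables(connection):
    """Create a posts table and a comments table linked by post id."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id TEXT PRIMARY KEY, title TEXT, score INTEGER)")
    connection.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "id TEXT PRIMARY KEY, post_id TEXT REFERENCES posts(id), body TEXT)")

# An in-memory database, so nothing is written to disk
connection = sqlite3.connect(':memory:')
create_tables(connection)

# Made-up example rows: one post with two comments
connection.execute("INSERT INTO posts VALUES (?, ?, ?)",
                   ('p1', 'Example post', 10))
connection.executemany("INSERT INTO comments VALUES (?, ?, ?)",
                       [('c1', 'p1', 'First comment'),
                        ('c2', 'p1', 'Second comment')])

# Join the two tables to pair every comment with the title of its post
rows = connection.execute(
    "SELECT posts.title, comments.body FROM posts "
    "JOIN comments ON comments.post_id = posts.id "
    "ORDER BY comments.id").fetchall()
print(rows)
```

Keeping comments in their own table means each comment remains a separate row instead of being merged into one long text field.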

We have already entered the details of the app we created in the `reddit` variable above. Next, we define a function that retrieves each post and its associated metadata, as well as its comments. We'll save the information in a CSV file.

**Note:** You might want to add other metadata elements to your function, or organize it differently. For example, Reddit submissions also have a "spoiler" attribute that indicates whether a post is marked as a spoiler (relevant if you're gathering data from a movie or game-related subreddit!). For a list of all the attributes you can use, check:

* [Submissions/Posts](https://praw.readthedocs.io/en/latest/code_overview/models/submission.html)
* [Comments](https://praw.readthedocs.io/en/latest/code_overview/models/comment.html)
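
One possible way to make the list of metadata fields easy to change is to look attributes up by name. The sketch below is an assumption-laden illustration rather than the workshop's method: `submission_to_row` is a hypothetical helper, and a `SimpleNamespace` with made-up values stands in for a real `praw` submission so the cell runs without an API connection.

```python
from types import SimpleNamespace

def submission_to_row(submission, fields):
    """Return the requested attributes of a submission as a list.

    Attributes that do not exist come back as None.
    """
    return [getattr(submission, field, None) for field in fields]

# Stand-in object with made-up values; in practice this would be a praw submission
fake_submission = SimpleNamespace(id='abc123', title='Example title', spoiler=False)

# Adding a field such as 'spoiler' is now a one-word change
fields = ['id', 'title', 'spoiler', 'not_a_real_attribute']
row = submission_to_row(fake_submission, fields)
print(row)
```

The same `fields` list could also serve as the header row of the CSV, which keeps the column names and the values in the same order.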

In [None]:
import csv
from datetime import datetime

def get_reddit_data(subreddit_name, max_count):
    """Scrapes Reddit submissions and comments.
    
    Parameters
    ----------
    subreddit_name : string
        The subreddit name.
    max_count : int
        The maximum number of posts to query.
    """
    filename = subreddit_name + '_' + str(max_count) + '_' + datetime.now().strftime('%Y%m%d') + '.csv'
    item_count = 0
    comment_count = 0
    # Open the output file in a `with` block so it is closed (and fully written) when we finish;
    # newline='' lets the csv module control line endings
    with open(filename, 'wt', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        # Write the header row
        writer.writerow(['idstr', 'created', 'created_datetime', 'nsfw', 'flair_text', 'flair_css_class',
                         'author', 'title', 'selftext', 'score', 'upvote_ratio',
                         'distinguished', 'textlen', 'num_comments', 'top_comments'])
        for submission in reddit.subreddit(subreddit_name).hot(limit=None):
            try:
                item_count += 1
                idstr = submission.id
                created = submission.created
                created_datetime = datetime.fromtimestamp(created).strftime('%Y-%m-%d')
                nsfw = submission.over_18
                flair_text = submission.link_flair_text
                flair_css_class = submission.link_flair_css_class
                author = submission.author
                title = submission.title
                selftext = submission.selftext
                score = submission.score
                upvote_ratio = submission.upvote_ratio
                distinguished = submission.distinguished
                textlen = len(submission.selftext)
                num_comments = submission.num_comments
                comment_list = []
                # Expand every "load more comments" placeholder (slow for large threads)
                submission.comments.replace_more(limit=None)
                for comment in submission.comments.list():
                    # Skip comments that have no author
                    if comment.author is not None:
                        comment_count += 1
                        comment_list.append(comment.body)
                # All comments of a post are joined into a single text field
                comments = ' '.join(comment_list)
                writer.writerow((idstr, created, created_datetime, nsfw, flair_text, flair_css_class,
                                 author, title, selftext, score, upvote_ratio,
                                 distinguished, textlen, num_comments, comments))
                print('.', end='', flush=True)
            except Exception as error:
                # Report the problem and move on to the next post
                print('Error found--resuming...', error)
            # Stop once we have processed the requested number of posts
            if item_count == max_count:
                break

    if item_count > 0:
        print(f'Done!\nFound {item_count} posts\nFound {comment_count} comments')


Now that we're set up, let's get our data. Change "amitheasshole" in the function call below to your preferred subreddit name (you can find it in Reddit's URL, after "/r/").

In the `for` loop statement above, instead of using `.hot` (currently popular posts), you can also try `.top` (top scoring posts), `.new` (the latest posts), or `.controversial` (posts with a lot of up- and downvotes).
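
If you switch between listings often, one option is to choose the listing by name instead of editing the loop each time. The sketch below assumes a hypothetical helper `get_listing`, and `FakeSubreddit` is a made-up stand-in so the cell runs without contacting Reddit. With `praw`, you would pass `reddit.subreddit(subreddit_name)` instead.

```python
# The four listings mentioned above
LISTINGS = ('hot', 'top', 'new', 'controversial')

def get_listing(subreddit, sort='hot', limit=None):
    """Call the listing method called `sort` on a subreddit-like object."""
    if sort not in LISTINGS:
        raise ValueError('sort must be one of: ' + ', '.join(LISTINGS))
    # getattr looks the method up by name, e.g. subreddit.new
    return getattr(subreddit, sort)(limit=limit)

class FakeSubreddit:
    """Made-up stand-in that mimics two listing methods."""
    def hot(self, limit=None):
        return ['hot post 1', 'hot post 2', 'hot post 3'][:limit]

    def new(self, limit=None):
        return ['new post 1', 'new post 2', 'new post 3'][:limit]

posts = list(get_listing(FakeSubreddit(), sort='new', limit=2))
print(posts)
```

Validating `sort` against a fixed tuple keeps a typo from silently calling some unrelated method on the subreddit object.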

In [None]:
get_reddit_data('amitheasshole', 3)

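The function wrote the data to a CSV file in the current working directory (usually the folder that contains this notebook). The file name combines the subreddit name, the `max_count` value, and today's date, following the pattern `amitheasshole_3_YYYYMMDD.csv`.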
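
To check the result, you can read the file back with Python's built-in `csv` module. This is a minimal sketch: `read_reddit_csv` is a hypothetical helper name, and the file name in the usage comment must be replaced with the one actually created on your machine. Because all comments of a post are joined into a single field, the sketch raises the `csv` module's default field size limit, which long comment threads could otherwise exceed.

```python
import csv

def read_reddit_csv(filename):
    """Read a CSV written by get_reddit_data into a list of dictionaries."""
    # The default limit (131072 characters per field) can be too small
    # for the combined comments of a large thread
    csv.field_size_limit(10**9)
    with open(filename, 'rt', encoding='utf-8', newline='') as f:
        # DictReader uses the header row as the keys of each dictionary
        return list(csv.DictReader(f))

# Usage (replace the file name with the one created above):
# posts = read_reddit_csv('amitheasshole_3_YYYYMMDD.csv')
# print(posts[0]['title'], posts[0]['num_comments'])
```

Note that the `csv` module returns every value as a string, so numeric columns such as `score` need to be converted (for example with `int`) before you do arithmetic on them.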