# Week 6': Creating our data file

## Overview

This notebook creates the data file that we will use for the remainder of the course. It is closely based on the material from the week 2 Reddit notebook, but trimmed down to just the pieces we need. Since students do not need to run this notebook, I have focused my annotations on coding best practices rather than on describing what is going on.

In [1]:
# Put all of your imports at the top of the file or notebook. This way future users know which dependencies they need to install.

import pandas as pd
import praw
import json

In [2]:
# Next I create a reddit instance using my login information, which is saved locally.

with open("secret.json", "r") as f:
    secret = json.load(f)

reddit = praw.Reddit(
    client_id = secret['client_id'],
    client_secret = secret['client_secret'],
    username = secret['username'],
    password = secret['password'],
    user_agent = secret['user_agent']
)

reddit.read_only = True
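The cell above assumes that `secret.json` contains the five keys it reads. As a minimal sketch, a check like the following could be run before creating the Reddit instance. The helper name `missing_secret_keys` and the placeholder values are hypothetical and are not part of the original notebook.

```python
import json

# The five keys that the cell above reads from secret.json
REQUIRED_KEYS = ["client_id", "client_secret", "username", "password", "user_agent"]


def missing_secret_keys(secret):
    """Return the required keys that are absent from the secret dictionary.

    Hypothetical helper, for illustration only.
    """
    return [key for key in REQUIRED_KEYS if key not in secret]


# Placeholder values for illustration only; never commit real credentials
example_secret = json.loads(
    '{"client_id": "PLACEHOLDER", "client_secret": "PLACEHOLDER", "username": "PLACEHOLDER"}'
)

missing = missing_secret_keys(example_secret)
print(missing)
```

Checking up front gives a clearer message than the `KeyError` that would otherwise be raised partway through building the instance.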

<div class="alert alert-info">

<b>Functions:</b> Several steps need to be repeated for each post, so I have written them as functions. When a function's internal steps are complicated enough, I split them out into helper functions that it calls. If you are interested in functions, note that the `expand_comments` function uses a technique called recursion, in which the function calls itself, to deal with the unknown depth of the comment forest.
    
</div>
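To see the idea of recursion without a Reddit connection, here is a minimal sketch that uses nested dictionaries as a toy stand-in for a comment forest. The function `flatten_comments` and the data in `toy_forest` are made up for illustration; the notebook's own functions work on praw objects instead.

```python
def flatten_comments(comments):
    """Recursively flatten a nested list of comment dictionaries into a list of ids.

    Toy stand-in for a comment forest; each dictionary has an "id" and a
    list of "replies" with the same structure.
    """
    flat = []
    for comment in comments:
        # Record the comment itself
        flat.append(comment["id"])
        # The function calls itself on the replies, whatever their depth
        flat.extend(flatten_comments(comment["replies"]))
    return flat


# Made-up data: comment "a" has a reply "b", which has a reply "c"
toy_forest = [
    {"id": "a", "replies": [
        {"id": "b", "replies": [{"id": "c", "replies": []}]},
    ]},
    {"id": "d", "replies": []},
]

flat_ids = flatten_comments(toy_forest)
print(flat_ids)
```

The recursion stops on its own when a comment has an empty list of replies, because the loop body never runs for an empty list. This sketch lists each comment before its replies, which is just one possible ordering.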

In [10]:
def access_reddit_post(post_id):
    '''This function takes a Reddit post id as a string and returns
    a dictionary containing information about the post as well as 
    the complete comments section.
    
    Post ids can be found in the url of the Reddit post.
    '''
    
    
    # Access the post
    submission = reddit.submission(post_id)
    
    # Extract the fields of interest
    post_info = {'title': submission.title,
                 'created_at': submission.created_utc,
                 'id': submission.id,
                 'permalink': submission.permalink,
                 'num_comments': submission.num_comments,
                 'score': submission.score,
                 'upvote_ratio': submission.upvote_ratio,
                 'external_link': submission.url,
                }

    # Expand the comment forest into a list of comment info dictionaries
    post_info['comments'] = expand_comments(submission.comments)
    
    return post_info

In [4]:
def expand_comments(comment_forest):
    '''Takes a praw CommentForest object and recursively retrieves every comment
    in the forest, along with every reply to those comments, returning them as a
    list of comment info dictionaries (see get_comment_info).'''
    
    # Turn any MoreComments objects into actual comments
    replace_more_comments(comment_forest)
    
    # Initialize an empty list to which to add comments
    cumulative_list = []
    
    # Recursively unpack the replies to each comment
    for comment in list(comment_forest):
        if len(comment.replies) > 0:
            replies = expand_comments(comment.replies)
            cumulative_list.extend(replies)
        cumulative_list.append(get_comment_info(comment))
            
    return cumulative_list

In [5]:
def replace_more_comments(comment_forest):
    '''Expands a praw CommentForest object by calling its replace_more()
    method until all of the comments have been accessed. Returns None because
    the input CommentForest is mutated in place.'''
    
    # Set the initial condition so that the loop runs at least once
    more_to_replace = True

    # Repeat until there are no more MoreComments objects left to replace
    while more_to_replace:
        remaining_more_comments = comment_forest.replace_more()
        more_to_replace = (len(remaining_more_comments) > 0)

In [15]:
def get_comment_info(comment):
    '''Extracts a set of useful fields from a praw Comment object
    and returns a dictionary of the fields'''
    
    # Extracts useful fields from the comment
    comment_info = {'body': comment.body,
                    'created_at': comment.created_utc,
                    'id': comment.id,
                    'parent': comment.parent_id,
                    'score': comment.score
                   }
    
    return comment_info
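The `created_at` field stored above is a Unix timestamp, meaning seconds since 1 January 1970 UTC. As a minimal sketch, it could be converted to a readable date when the data is analyzed later. The helper name `to_utc_datetime` and the example value are hypothetical.

```python
from datetime import datetime, timezone


def to_utc_datetime(created_at):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime.

    Hypothetical helper, for illustration only.
    """
    return datetime.fromtimestamp(created_at, tz=timezone.utc)


# Made-up timestamp for illustration
example_timestamp = 1675000000.0
converted = to_utc_datetime(example_timestamp)
print(converted.isoformat())
```

Keeping the raw number in the data file and converting only at analysis time keeps the JSON simple, since `json.dump` cannot write datetime objects directly.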

<div class="alert alert-info">

<b>Collecting Data:</b> Since the difficult work is all in the functions, only two cells are required to call the functions, collect the data, and write it to a file. 
</div>

In [11]:
# List the posts we care about
post_ids = ["10py29t","10plq1f"]

# Creates the dictionary to store them in
post_info = {}

# Loops through the list of posts we care about and gets the data
for post in post_ids:
    post_info[post] = access_reddit_post(post)

In [16]:
# Save the data to a file

with open("reddit_data.json", "w") as f:
    json.dump(post_info, f)
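As a quick sanity check, the saved file can be read back and flattened into one row per comment. The sketch below uses made-up data with the same shape as `post_info` and writes it to a temporary file, so the real `reddit_data.json` is left untouched. The ids, text, and file name are all hypothetical.

```python
import json
import os
import tempfile

# Made-up data with the same shape as post_info:
# post id -> post fields plus a list of comment dictionaries
example_data = {
    "abc123": {
        "title": "Example post",
        "comments": [
            {"id": "c1", "body": "First", "parent": "t3_abc123", "score": 2},
            {"id": "c2", "body": "Reply", "parent": "t1_c1", "score": 1},
        ],
    }
}

# Write to a temporary location rather than the real data file
path = os.path.join(tempfile.mkdtemp(), "example_data.json")
with open(path, "w") as f:
    json.dump(example_data, f)

# Read the file back in
with open(path, "r") as f:
    loaded = json.load(f)

# Flatten to one row per comment, tagging each row with its post id
rows = [
    {"post_id": post_id, **comment}
    for post_id, post in loaded.items()
    for comment in post["comments"]
]
print(len(rows))
```

A list of flat dictionaries like `rows` is a convenient starting point for building a table, for example with `pd.DataFrame(rows)`.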