To gather user comments and other relevant information, we will use the Python Reddit API Wrapper (PRAW), which wraps the publicly available Reddit API in a user-friendly interface.

In [2]:
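import os

import praw
import pandas as pd

# Initialize login info. The credentials are read from environment variables
# (the variable names below are an assumption) instead of being hard-coded,
# so that the client secret is not published along with the notebook.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="macos:Sentiment_Analysis:v1 (by u/Minimum-Tadpole-5000)"
)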

The previous block initializes a read-only Reddit instance that can be used to access submissions, comments, redditors, and other data from public subreddits of interest. To perform meaningful analysis, we want to save this data as a CSV file. The following code block does exactly that: for each submission it resolves the collapsed "load more comments" placeholders in the CommentForest object, then gathers all comments in breadth-first order.

In [None]:
subreddit = reddit.subreddit("UCSantaBarbara") # Define subreddit
comments = [] # initialize comment list

for submission in subreddit.top(limit=10): # adjust the listing (top, hot, controversial, etc.) and the number of posts returned
    submission.comments.replace_more(limit=None) # replace every MoreComments placeholder with the comments it stands for
    for comment in submission.comments.list(): # .list() flattens the comment tree breadth-first
        comments.append({
            "id": comment.id,
            "author": str(comment.author),
            "body": comment.body,
            "score": comment.score,
            "created_utc": comment.created_utc,
            "parent_id": comment.parent_id,
            "link_id": comment.link_id,
            "permalink": comment.permalink,
            "is_submitter": comment.is_submitter,
            "distinguished": comment.distinguished,
            "stickied": comment.stickied
        })

df = pd.DataFrame(comments)
df.to_csv("../data/UCSantaBarbara-top-10.csv", index=False) # save to .csv without the DataFrame index column
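
To make the breadth-first order concrete, here is a minimal sketch that flattens a small nested comment tree level by level. The `Node` class and `flatten_breadth_first` function are hypothetical stand-ins written for illustration, not PRAW objects; the sketch only assumes that each comment exposes its direct replies.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a comment: an id plus its direct replies."""
    id: str
    replies: list[Node] = field(default_factory=list)


def flatten_breadth_first(top_level: list[Node]) -> list[Node]:
    """Return every node in the forest, visiting shallower levels first."""
    flat = []
    queue = deque(top_level)        # start with the top-level comments
    while queue:
        node = queue.popleft()      # take the oldest queued comment
        flat.append(node)
        queue.extend(node.replies)  # its replies wait behind the current level
    return flat


forest = [
    Node("a", [Node("a1", [Node("a1x")]), Node("a2")]),
    Node("b", [Node("b1")]),
]
print([n.id for n in flatten_breadth_first(forest)])
# prints ['a', 'b', 'a1', 'a2', 'b1', 'a1x']
```

All top-level comments come first, followed by their replies, then replies to replies. Thread structure is therefore not visible from row order in the CSV, which is why the `parent_id` column is worth keeping.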


This boilerplate code can easily be generalized into a function that takes the name of a subreddit to scrape, the desired number of submissions, and the listing to draw them from (top, hot, controversial, etc.).

In [19]:
def get_comments_as_csv(subreddit: str, num_submissions: int, sort_method: str) -> None:
    """Save all comments from the first `num_submissions` posts of a subreddit listing to a CSV file."""
    sub = reddit.subreddit(subreddit) # define subreddit
    comments = [] # initialize comment list

    # Use getattr to look up the listing method by name (e.g. sub.top, sub.hot)
    sort_type = getattr(sub, sort_method)

    for submission in sort_type(limit=num_submissions):
        submission.comments.replace_more(limit=None) # replace every MoreComments placeholder with the comments it stands for
        for comment in submission.comments.list(): # .list() flattens the comment tree breadth-first
            comments.append({
                "id": comment.id,
                "author": str(comment.author),
                "body": comment.body,
                "score": comment.score,
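                # Note: comment.author is loaded lazily, so reading comment_karma costs an extra API request for each comment with a non-deleted author
                "comment_karma": getattr(comment.author, "comment_karma", None),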
                "created_utc": comment.created_utc,
                "parent_id": comment.parent_id,
                "link_id": comment.link_id,
                "permalink": comment.permalink,
                "is_submitter": comment.is_submitter,
                "distinguished": comment.distinguished,
                "stickied": comment.stickied
            })

    df = pd.DataFrame(comments)
    df.to_csv(f"../data/{subreddit}-{sort_method}-{num_submissions}.csv", index=False) # save to .csv
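
Because `getattr` accepts any attribute name, a typo or a non-listing attribute such as `display_name` only fails later with a confusing error. A minimal sketch of validating the name first is shown below. The `ALLOWED_SORTS` tuple, the `resolve_sort_method` helper, and the `FakeSubreddit` class are hypothetical names introduced for illustration, and the tuple is an assumed choice of listings rather than an exhaustive one.

```python
# Listing names assumed to be of interest here; extend as needed.
ALLOWED_SORTS = ("hot", "new", "top", "rising", "controversial")


def resolve_sort_method(sub, sort_method: str):
    """Return the listing method named `sort_method`, or raise a clear error."""
    if sort_method not in ALLOWED_SORTS:
        raise ValueError(
            f"sort_method must be one of {ALLOWED_SORTS}, got {sort_method!r}"
        )
    return getattr(sub, sort_method)


class FakeSubreddit:
    """Hypothetical stand-in so the sketch runs without network access."""
    display_name = "UCSantaBarbara"

    def top(self, limit=None):
        return [f"post-{i}" for i in range(limit or 0)]


listing = resolve_sort_method(FakeSubreddit(), "top")
print(listing(limit=2))
# prints ['post-0', 'post-1']
```

Inside `get_comments_as_csv`, the `getattr(sub, sort_method)` line could be swapped for a call to such a helper.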

We can now focus our data collection on a single subject across subreddits to analyze sentiment. For example, we can gather the comments from the top post of each of the UCSB, UCSD, UCD, UCLA, and UCI subreddits. Recall that gathering this data relies on the public Reddit API, which limits how many requests a client may make in a given time window. It is important to respect that limit and avoid flooding the API with requests; a client that exceeds it receives an HTTP 429 (Too Many Requests) response.

In [20]:
get_comments_as_csv("UCSantaBarbara", 1, "top")
get_comments_as_csv("UCSD", 1, "top")
get_comments_as_csv("UCDavis", 1, "top")
get_comments_as_csv("ucla", 1, "top")
get_comments_as_csv("UCI", 1, "top")

TooManyRequests: received 429 HTTP response
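
The 429 response above means the rate limit was exceeded. One common way to cope is to wait and retry with exponentially growing delays. The following is a minimal sketch of that idea, not a feature of PRAW: `call_with_backoff`, `RateLimited`, and `flaky_fetch` are hypothetical names, and the retry count and base delay are arbitrary assumptions. In this notebook, one would presumably pass the exception class that PRAW raised (the `TooManyRequests` shown in the traceback) as `retry_on`.

```python
import time


def call_with_backoff(func, *args, retries=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func; on a retryable error, wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)


class RateLimited(Exception):
    """Hypothetical stand-in for a 429 error, so the sketch runs offline."""


attempts = {"n": 0}


def flaky_fetch():
    """Fail twice with RateLimited, then succeed."""
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RateLimited("received 429 HTTP response")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky_fetch, retry_on=(RateLimited,), sleep=delays.append)
print(result, delays)
# prints ok [1.0, 2.0]
```

Note that retrying a whole `get_comments_as_csv` call restarts that subreddit from scratch, so reducing the number of requests in the first place (for instance by not fetching per-author data) is likely the more effective remedy.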