To gather user comments and other relevant information, we will employ the Python Reddit API Wrapper (PRAW), which wraps the publicly available Reddit API in a user-friendly interface.

In [20]:
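import os
import time

import pandas as pd
import praw
from prawcore.exceptions import TooManyRequests

# Initialize a read-only client. Credentials are read from the environment rather
# than hard-coded; the two variable names below are placeholders chosen for this notebook.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="macos:Sentiment_Analysis:v1 (by u/Minimum-Tadpole-5000)"
)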

The previous block initializes a read-only Reddit instance that can be used to access public subreddits and their submissions, comments, redditors, etc. To perform meaningful analysis, we first save the data to a CSV file. The following function does exactly that: it expands each submission's CommentForest object and gathers all of its comments in breadth-first order.

In [22]:
def get_comments_as_csv(subreddit: str, num_submissions: int, sort_method: str, delay: float = 2.0):
    sub = reddit.subreddit(subreddit)
    comments = []
    sort_type = getattr(sub, sort_method)
    for i, submission in enumerate(sort_type(limit=num_submissions)): 
        try:
            submission.comments.replace_more(limit=None)  # resolve every "load more comments" placeholder
            for comment in submission.comments.list():  # list() flattens the tree breadth-first
                comments.append({
                    "id": comment.id,
                    "author": str(comment.author),
                    "body": comment.body,
                    "score": comment.score,
                    # author is None for deleted accounts; reading comment_karma may trigger an extra API request
                    "comment_karma": getattr(comment.author, "comment_karma", None),
                    "created_utc": comment.created_utc
                })
            time.sleep(delay)
        except TooManyRequests as e:
            print("Rate limit hit. Resting for a while...")
            time.sleep(e.sleep_time if hasattr(e, 'sleep_time') else 60)
        except Exception as e:
            print(f"Error processing submission {i+1}: {e}")

    df = pd.DataFrame(comments)
    df['subreddit'] = subreddit
    df.to_csv(f"../data/{subreddit}-{sort_method}-{num_submissions}.csv", index=False) # save to .csv
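
To make the breadth-first flattening concrete, here is a minimal sketch on a toy comment tree. ToyComment and flatten_bfs are hypothetical names used only for illustration; they are not part of PRAW, and the sketch only mirrors the traversal order described above without calling the Reddit API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyComment:
    """Hypothetical stand-in for a comment with nested replies (not a PRAW class)."""
    id: str
    replies: list = field(default_factory=list)


def flatten_bfs(top_level):
    """Return every comment in breadth-first order: all top-level
    comments first, then their replies, then replies to those, and so on."""
    queue = deque(top_level)
    flat = []
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        queue.extend(comment.replies)  # visit children after the current level
    return flat


tree = [
    ToyComment("a", [ToyComment("a1", [ToyComment("a1x")]), ToyComment("a2")]),
    ToyComment("b", [ToyComment("b1")]),
]
order = [c.id for c in flatten_bfs(tree)]
print(order)  # ['a', 'b', 'a1', 'a2', 'b1', 'a1x']
```

Note that breadth-first order loses the thread structure: a reply no longer sits next to its parent, which is fine for per-comment sentiment but matters if conversation context is needed.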

We can now focus our data collection on a single subject across subreddits to analyze sentiment. For example, we can gather the comments from the current hot posts of each NBA team's subreddit. Recall that gathering this data relies on the public Reddit API, which limits how many requests a client can make in a given period. Spacing out calls, as the delay parameter of get_comments_as_csv does, keeps us within that limit and avoids overloading the service. The block below defines a list of NBA team subreddits and fetches the comments from 10 hot posts in each.

In [None]:
nba_subs = ["AtlantaHawks", "bostonceltics", "GoNets", "CharlotteHornets", "chicagobulls",
            "clevelandcavs", "DetroitPistons", "pacers", "heat", "MkeBucks", "NYKnicks",
            "OrlandoMagic", "sixers", "torontoraptors", "washingtonwizards", "Mavericks",
            "denvernuggets", "warriors", "rockets", "LAClippers", "lakers", "memphisgrizzlies",
            "timberwolves", "NOLAPelicans", "Thunder", "suns", "ripcity", "kings", "NBASpurs",
            "UtahJazz"]

for subreddit in nba_subs:
    get_comments_as_csv(subreddit, 10, "hot")
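
As written, get_comments_as_csv sleeps after a rate-limit error and then moves on to the next submission, so the submission that triggered the error is skipped. One possible alternative is to retry the same call with an increasing delay. The sketch below illustrates that idea in isolation; RateLimited and call_with_backoff are hypothetical names, and the fake fetcher stands in for a real API call, so nothing here touches Reddit.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a rate-limit error such as TooManyRequests."""


def call_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); after each rate-limit error wait base_delay * 2**attempt
    seconds and try again, giving up after max_attempts tries."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            sleep(base_delay * 2 ** attempt)


# Usage with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}
waits = []  # records the requested delays instead of actually sleeping


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited()
    return "ok"


result = call_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the example fast to run and easy to check; in the notebook it would default to time.sleep.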

After gathering the individual CSV files, we can combine them as follows:

In [27]:
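import glob
import os

# grab all per-subreddit file paths, skipping the combined output of any earlier run
files = [f for f in glob.glob("../data/*.csv") if os.path.basename(f) != "comments.csv"]

# ignore_index gives the combined frame one continuous row index
combined_data = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

combined_data.to_csv("../data/comments.csv", index=False)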
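
If the data folder holds more than one file for the same subreddit (for example, runs with different values of num_submissions), the same comment can appear twice in the combined file. A minimal sketch of removing such duplicates by comment id follows; the two small frames are made-up illustrative data, not collected comments.

```python
import pandas as pd

# Two toy frames that share the comment id "c2" (illustrative values only).
a = pd.DataFrame({"id": ["c1", "c2"], "score": [5, 3], "subreddit": "lakers"})
b = pd.DataFrame({"id": ["c2", "c3"], "score": [4, 1], "subreddit": "lakers"})

combined = pd.concat([a, b], ignore_index=True)

# Keep the last occurrence of each id, i.e. the row from the later frame.
deduped = combined.drop_duplicates(subset="id", keep="last").reset_index(drop=True)
print(deduped)
```

Keeping the last occurrence is a design choice that favors the most recently read file; keep="first" would favor the earliest instead.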