# Python Web APIs: Accessing Reddit Data with PRAW

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Setting up PRAW](#praw)
2. [Accessing Subreddits](#subreddits)
3. [Retrieving Posts and Comments](#posts)
4. [Data Analysis with Reddit Data](#analysis)
5. [Demo: Comment Thread Analysis](#demo)

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime
import seaborn as sns

<a id='praw'></a>

# Reddit API with PRAW

Reddit is one of the most popular social media platforms, often called "the front page of the internet." We'll use PRAW (Python Reddit API Wrapper) to access Reddit's vast database of posts, comments, and user interactions.

Before proceeding, you'll need to:
1. Create a Reddit account (if you don't have one)
2. Create a Reddit application to get API credentials
3. Install the PRAW library

## Setting Up Reddit API Access

To use Reddit's API, you need to create an application:

1. Go to https://www.reddit.com/prefs/apps
2. Click "Create App" or "Create Another App"
3. Choose "script" as the app type
4. Fill in:
   - **Name**: Your app name (e.g., "Data Science Project")
   - **Description**: Brief description
   - **Redirect URI**: http://localhost:8080 (required but not used for scripts)
5. Note down your **Client ID** (under the app name) and **Client Secret**

## Installing PRAW

PRAW wraps Reddit's HTTP API in Python objects, so you can work with subreddits, posts, and comments without building requests by hand:

In [None]:
%pip install praw

## Initializing PRAW

Now let's create a Reddit instance using our credentials:

In [None]:
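import os

import praw

# Assumption: credentials live in environment variables with these
# (hypothetical) names, so they never appear in the notebook itself.
reddit_creds = {
    'client_id': os.environ['REDDIT_CLIENT_ID'],
    'client_secret': os.environ['REDDIT_CLIENT_SECRET'],
    'user_agent': os.environ.get('REDDIT_USER_AGENT', 'python:reddit-workshop:v0.1'),
}

# Initialize Reddit instance
reddit = praw.Reddit(
    client_id=reddit_creds['client_id'],
    client_secret=reddit_creds['client_secret'],
    user_agent=reddit_creds['user_agent']
)

# Test the connection. Without a username and password the instance is
# read-only, so only ask for the logged-in user when there is one.
who = 'Read-only mode' if reddit.read_only else reddit.user.me()
print(f"Connected to Reddit as: {who}")
print(f"Read-only mode: {reddit.read_only}")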
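
Reddit's API rules ask for a unique, descriptive user agent in the form `<platform>:<app ID>:<version string> (by /u/<reddit username>)`. The helper below is a minimal sketch for building such a string; the function name `build_user_agent` and the example values are illustrative and not part of PRAW.

```python
def build_user_agent(platform, app_id, version, username):
    """Format a user agent as '<platform>:<app ID>:<version> (by /u/<username>)'."""
    parts = {
        "platform": platform,
        "app_id": app_id,
        "version": version,
        "username": username,
    }
    # Reject empty or whitespace-only pieces so the string stays descriptive
    empty = [name for name, value in parts.items() if not value.strip()]
    if empty:
        raise ValueError(f"Missing user agent parts: {', '.join(empty)}")
    return f"{platform}:{app_id}:{version} (by /u/{username})"


# Placeholder values; replace them with your own app name and username
print(build_user_agent("python", "reddit-workshop", "v0.1", "your_username"))
```

The resulting string can be passed as the `user_agent` argument of `praw.Reddit`.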

<a id='subreddits'></a>

# Accessing Subreddits

Reddit is organized into subreddits: communities focused on specific topics. Let's start by exploring a popular subreddit:

In [None]:
# Access a subreddit
subreddit = reddit.subreddit("datascience")

# Basic subreddit information
print(f"Subreddit: r/{subreddit.display_name}")
print(f"Title: {subreddit.title}")
print(f"Subscribers: {subreddit.subscribers:,}")
print(f"Description: {subreddit.public_description[:200]}...")

## Retrieving Hot Posts

Let's get the current "hot" posts from this subreddit.

In [None]:
# Get hot posts
hot_posts = list(subreddit.hot(limit=10))

print(f"Retrieved {len(hot_posts)} hot posts from r/{subreddit.display_name}")

# Look at the first post
first_post = hot_posts[0]
print(f"\nFirst post title: {first_post.title}")
print(f"Author: u/{first_post.author}")
print(f"Score: {first_post.score}")
print(f"Comments: {first_post.num_comments}")
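
`hot` is one of several listing methods on a PRAW subreddit. `new`, `rising`, `top`, and `controversial` work the same way, and the last two also accept a `time_filter`. The helper below is a minimal sketch that picks a listing by name; `fetch_listing` and its defaults are illustrative and not part of PRAW. It only relies on the object having those methods, so it can be used with the `subreddit` created above.

```python
# Listing methods that accept a time_filter argument in PRAW
TIME_FILTERED = {"top", "controversial"}
SORTS = {"hot", "new", "rising"} | TIME_FILTERED


def fetch_listing(subreddit, sort="hot", limit=10, time_filter="week"):
    """Return up to `limit` submissions from the chosen listing as a list.

    `fetch_listing` is a hypothetical helper, not a PRAW function.
    """
    if sort not in SORTS:
        raise ValueError(f"Unknown sort {sort!r}; expected one of {sorted(SORTS)}")
    listing = getattr(subreddit, sort)  # e.g. subreddit.top
    if sort in TIME_FILTERED:
        return list(listing(limit=limit, time_filter=time_filter))
    return list(listing(limit=limit))
```

For example, `fetch_listing(subreddit, sort="top", limit=20, time_filter="month")` would return the 20 highest-scoring posts of the past month.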

<a id='posts'></a>

# Retrieving Posts and Comments

Let's collect more detailed information about posts and organize it into a pandas DataFrame.

In [None]:
def extract_post_data(post):
    """Extract relevant data from a Reddit post"""
    return {
        'id': post.id,
        'title': post.title,
        'author': str(post.author) if post.author else '[deleted]',
        'score': post.score,
        'upvote_ratio': post.upvote_ratio,
        'num_comments': post.num_comments,
        'created_utc': pd.to_datetime(post.created_utc, unit='s', utc=True),  # Timezone-aware UTC, not local time
        'selftext': post.selftext[:500] if post.selftext else '',  # First 500 chars
        'url': post.url,
        'is_self': post.is_self,
        'over_18': post.over_18,
        'spoiler': post.spoiler,
        'stickied': post.stickied,
        'subreddit': str(post.subreddit)
    }

# Collect data from multiple sorting methods
post_data = []
seen_ids = set()  # Track post IDs so a post that appears in both listings is stored once

# Get hot posts first, then new posts; skip anything already collected
for listing in (subreddit.hot(limit=25), subreddit.new(limit=25)):
    for post in listing:
        if post.id not in seen_ids:
            seen_ids.add(post.id)
            post_data.append(extract_post_data(post))

# Create DataFrame
df_posts = pd.DataFrame(post_data)
print(f"Collected {len(df_posts)} posts")
df_posts.head()

In [None]:
# Basic information about our dataset
df_posts.info()

## 🥊 Challenge: Exploring Different Subreddits

- Choose a subreddit relevant to your interests
- Collect the top 20 posts from that subreddit
- What's the average score? How many comments do posts typically get?

In [None]:
# YOUR CODE HERE



## Retrieving Comments

Let's examine the comments from a popular post:

In [None]:
# Get a post with many comments
popular_post = df_posts.loc[df_posts['num_comments'].idxmax()]
print(f"Post with most comments: {popular_post['title']}")
print(f"Number of comments: {popular_post['num_comments']}")

# Get the actual post object
post = reddit.submission(id=popular_post['id'])

# Collect comments from all levels of the thread
post.comments.replace_more(limit=0)  # Drop "load more comments" placeholders without fetching them
comments_data = []

for comment in post.comments.list()[:20]:  # First 20 comments of the flattened tree
    if hasattr(comment, 'body'):  # Skip anything that isn't a Comment (e.g. MoreComments)
        comments_data.append({
            'id': comment.id,
            'author': str(comment.author) if comment.author else '[deleted]',
            'body': comment.body[:200],  # First 200 characters
            'score': comment.score,
            'created_utc': pd.to_datetime(comment.created_utc, unit='s', utc=True),
            'is_root': comment.parent_id.startswith('t3_')  # True if top-level comment
        })

df_comments = pd.DataFrame(comments_data)
print(f"\nCollected {len(df_comments)} comments")
df_comments.head()
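
The `is_root` check works because Reddit identifies every object by a "fullname": a type prefix, an underscore, and the object's ID. A `parent_id` starting with `t3_` points at the submission itself, while `t1_` points at another comment. A small sketch of decoding that prefix follows; the helper name `parent_kind` is made up for illustration.

```python
# Type prefixes used in Reddit fullnames
KIND_BY_PREFIX = {
    "t1": "comment",
    "t2": "account",
    "t3": "submission",
    "t4": "message",
    "t5": "subreddit",
}


def parent_kind(fullname):
    """Return the kind of object a fullname such as 't3_abc123' refers to."""
    prefix, sep, _obj_id = fullname.partition("_")
    if not sep or prefix not in KIND_BY_PREFIX:
        raise ValueError(f"Not a recognised fullname: {fullname!r}")
    return KIND_BY_PREFIX[prefix]


# The IDs below are invented examples
print(parent_kind("t3_abc123"))  # a reply to the post itself
print(parent_kind("t1_def456"))  # a reply to another comment
```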

<a id='analysis'></a>

# Data Analysis with Reddit Data

Now let's perform some analysis on our collected Reddit data.

## Post Engagement Analysis

In [None]:
# Basic statistics
print("Post Statistics:")
print(f"Average score: {df_posts['score'].mean():.1f}")
print(f"Average comments: {df_posts['num_comments'].mean():.1f}")
print(f"Average upvote ratio: {df_posts['upvote_ratio'].mean():.2f}")

# Distribution of scores
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
df_posts['score'].hist(bins=20, alpha=0.7)
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Distribution of Post Scores')

plt.subplot(1, 2, 2)
df_posts['num_comments'].hist(bins=20, alpha=0.7)
plt.xlabel('Number of Comments')
plt.ylabel('Frequency')
plt.title('Distribution of Comment Counts')

plt.tight_layout()
plt.show()
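
If a few posts score far higher than the rest, the histogram will have a long right tail and the mean will sit well above what a typical post receives. Reporting the median alongside the mean is a cheap check. The scores below are made up purely to illustrate the effect.

```python
from statistics import mean, median

# Invented scores: most posts are modest, one went viral
example_scores = [3, 5, 8, 12, 15, 20, 950]

avg = mean(example_scores)
mid = median(example_scores)

print(f"Mean score:   {avg:.1f}")
print(f"Median score: {mid}")
```

On the collected data, the equivalent calls would be `df_posts['score'].mean()` and `df_posts['score'].median()`.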

## Engagement vs. Time Analysis

In [None]:
# Add hour of day column
df_posts['hour'] = df_posts['created_utc'].dt.hour

# Average engagement by hour
hourly_stats = df_posts.groupby('hour').agg({
    'score': 'mean',
    'num_comments': 'mean',
    'upvote_ratio': 'mean'
}).round(2)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
hourly_stats['score'].plot(kind='bar', alpha=0.7)
plt.xlabel('Hour of Day (UTC)')
plt.ylabel('Average Score')
plt.title('Average Post Score by Hour')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
hourly_stats['num_comments'].plot(kind='bar', alpha=0.7, color='orange')
plt.xlabel('Hour of Day (UTC)')
plt.ylabel('Average Comments')
plt.title('Average Comments by Hour')
plt.xticks(rotation=0)

plt.tight_layout()
plt.show()
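
The axis labels say UTC, which holds only when the epoch timestamps from Reddit are converted with an explicit UTC timezone. `datetime.fromtimestamp` called without a `tz` argument returns the machine's local time instead. The sketch below shows the difference, using a fixed UTC-8 offset as a stand-in for a local timezone; the offset and the function name `epoch_to_utc` are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def epoch_to_utc(epoch_seconds):
    """Convert a Unix timestamp (as in `created_utc`) to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Hypothetical fixed offset standing in for a reader's local timezone
UTC_MINUS_8 = timezone(timedelta(hours=-8))

posted = epoch_to_utc(1_700_000_000)            # aware datetime in UTC
posted_local = posted.astimezone(UTC_MINUS_8)   # same instant, different wall clock

print(posted.isoformat())
print(posted_local.isoformat())
print(posted.hour, posted_local.hour)
```

Both values describe the same instant, but grouping by `hour` gives different buckets, so it is worth deciding on one timezone before comparing posting times.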

## Text Analysis: Post Titles

In [None]:
# Install textblob for sentiment analysis
%pip install textblob

In [None]:
from textblob import TextBlob

# Calculate sentiment for post titles
def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

df_posts['title_sentiment'] = df_posts['title'].apply(get_sentiment)

# Plot sentiment vs engagement
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.scatter(df_posts['title_sentiment'], df_posts['score'], alpha=0.6)
plt.xlabel('Title Sentiment')
plt.ylabel('Score')
plt.title('Post Score vs Title Sentiment')

plt.subplot(1, 2, 2)
df_posts['title_sentiment'].hist(bins=15, alpha=0.7)
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Distribution of Title Sentiment')

plt.tight_layout()
plt.show()

print(f"Average title sentiment: {df_posts['title_sentiment'].mean():.3f}")
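
TextBlob's polarity is a float between -1.0 (most negative) and 1.0 (most positive), so many short titles will land close to zero. Bucketing the scores into labels can make the distribution easier to discuss. In this sketch the 0.05 cut-off is an arbitrary assumption, not something TextBlob prescribes, and `label_sentiment` is a hypothetical helper.

```python
def label_sentiment(polarity, threshold=0.05):
    """Map a polarity in [-1.0, 1.0] to 'negative', 'neutral', or 'positive'.

    The threshold is an illustrative choice; tune it for your data.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError(f"Polarity out of range: {polarity}")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Invented polarity values
for value in (0.6, 0.0, -0.3):
    print(value, label_sentiment(value))
```

Applied to the notebook's data, `df_posts['title_sentiment'].apply(label_sentiment).value_counts()` would count the titles in each bucket.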

## 🥊 Challenge: Content Type Analysis

- Compare self posts (text posts) vs link posts
- Which type gets more engagement?
- What about the upvote ratio?

In [None]:
# YOUR CODE HERE



<a id='demo'></a>

# 🎬 Demo: Comment Thread Analysis

Let's dive deeper into comment threads and see how engagement varies by comment depth and timing.

In [None]:
def analyze_comment_thread(submission_id, max_comments=100):
    """Analyze comment thread structure and engagement"""
    submission = reddit.submission(id=submission_id)
    submission.comments.replace_more(limit=0)
    
    comments_data = []
    
    def extract_comment(comment, depth=0):
        if len(comments_data) >= max_comments:
            return
            
        comments_data.append({
            'id': comment.id,
            'depth': depth,
            'score': comment.score,
            'body_length': len(comment.body),
            'created_utc': pd.to_datetime(comment.created_utc, unit='s', utc=True),
            'author': str(comment.author) if comment.author else '[deleted]'
        })
        
        # Recursively process replies
        for reply in comment.replies:
            if hasattr(reply, 'body'):  # Make sure it's a comment
                extract_comment(reply, depth + 1)
    
    # Process all top-level comments and their replies
    for comment in submission.comments:
        if hasattr(comment, 'body'):
            extract_comment(comment)
    
    return pd.DataFrame(comments_data)

# Analyze the most commented post
most_commented_id = df_posts.loc[df_posts['num_comments'].idxmax(), 'id']
df_thread = analyze_comment_thread(most_commented_id)

print(f"Analyzed {len(df_thread)} comments from thread")
print(f"Maximum depth: {df_thread['depth'].max()}")
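
The recursion in `analyze_comment_thread` is a depth-first walk: each comment is recorded with its depth before its replies are visited. The sketch below runs the same walk on a toy thread built from `SimpleNamespace` stand-ins (not real PRAW objects), which makes the depth bookkeeping easy to inspect without network access. The names `make_comment` and `walk_thread` are illustrative.

```python
from types import SimpleNamespace


def make_comment(comment_id, replies=()):
    """Build a stand-in with the two attributes the walk needs."""
    return SimpleNamespace(id=comment_id, replies=list(replies))


def walk_thread(top_level_comments):
    """Return (id, depth) pairs in depth-first order."""
    visited = []

    def visit(comment, depth=0):
        visited.append((comment.id, depth))
        for reply in comment.replies:
            visit(reply, depth + 1)

    for comment in top_level_comments:
        visit(comment)
    return visited


# Toy thread: a has replies b and d; b has reply c; e is a second top-level comment
thread = [
    make_comment("a", [make_comment("b", [make_comment("c")]), make_comment("d")]),
    make_comment("e"),
]
print(walk_thread(thread))
```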

In [None]:
# Analyze comment engagement by depth
depth_stats = df_thread.groupby('depth').agg({
    'score': ['mean', 'count'],
    'body_length': 'mean'
}).round(2)

depth_stats.columns = ['avg_score', 'count', 'avg_length']
depth_stats = depth_stats.reset_index()

plt.figure(figsize=(15, 4))

plt.subplot(1, 3, 1)
plt.bar(depth_stats['depth'], depth_stats['avg_score'], alpha=0.7)
plt.xlabel('Comment Depth')
plt.ylabel('Average Score')
plt.title('Comment Score by Depth')

plt.subplot(1, 3, 2)
plt.bar(depth_stats['depth'], depth_stats['count'], alpha=0.7, color='orange')
plt.xlabel('Comment Depth')
plt.ylabel('Number of Comments')
plt.title('Comment Count by Depth')

plt.subplot(1, 3, 3)
plt.bar(depth_stats['depth'], depth_stats['avg_length'], alpha=0.7, color='green')
plt.xlabel('Comment Depth')
plt.ylabel('Average Length (characters)')
plt.title('Comment Length by Depth')

plt.tight_layout()
plt.show()

print("Comment thread analysis:")
print(depth_stats)

## Collecting Data for Your Final Project

Here's a template for collecting Reddit data that you might use in your final project:

In [None]:
def collect_subreddit_data(subreddit_name, num_posts=100, include_comments=False):
    """Collect comprehensive data from a subreddit for analysis"""
    subreddit = reddit.subreddit(subreddit_name)
    
    posts_data = []
    comments_data = []
    
    # Collect posts from different sorting methods
    post_sources = [
        (subreddit.hot(limit=num_posts//3), 'hot'),
        (subreddit.new(limit=num_posts//3), 'new'),
        (subreddit.top(limit=num_posts//3, time_filter='week'), 'top_week')
    ]
    
    seen_posts = set()
    
    for posts, source in post_sources:
        for post in posts:
            if post.id not in seen_posts:
                seen_posts.add(post.id)
                
                post_data = extract_post_data(post)
                post_data['source'] = source
                posts_data.append(post_data)
                
                # Optionally collect comments
                if include_comments and post.num_comments > 0:
                    post.comments.replace_more(limit=0)
                    for comment in post.comments.list()[:10]:  # First 10 comments of the flattened tree
                        if hasattr(comment, 'body'):
                            comment_data = {
                                'post_id': post.id,
                                'comment_id': comment.id,
                                'author': str(comment.author) if comment.author else '[deleted]',
                                'body': comment.body,
                                'score': comment.score,
                                'created_utc': pd.to_datetime(comment.created_utc, unit='s', utc=True)
                            }
                            comments_data.append(comment_data)
    
    df_posts = pd.DataFrame(posts_data)
    df_comments = pd.DataFrame(comments_data) if comments_data else None
    
    return df_posts, df_comments

# Example usage
# df_posts, df_comments = collect_subreddit_data('python', num_posts=50, include_comments=True)
# df_posts.to_csv('reddit_posts.csv', index=False)
# if df_comments is not None:
#     df_comments.to_csv('reddit_comments.csv', index=False)
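
Because `num_posts//3` rounds down, asking for 100 posts requests 33 from each listing, so the function returns at most 99 posts, and fewer once duplicates across listings are dropped. If hitting the requested total matters, one option is to spread the remainder across the listings. `split_budget` below is a hypothetical helper, not part of the template.

```python
def split_budget(total, parts=3):
    """Split `total` into `parts` integers that differ by at most 1 and sum to `total`."""
    base, extra = divmod(total, parts)
    # The first `extra` parts each take one of the leftover units
    return [base + 1 if i < extra else base for i in range(parts)]


print(split_budget(100))  # limits for hot, new, and top
print(split_budget(50))
```

The three values could replace the repeated `num_posts//3` when building `post_sources`. Duplicates across listings can still bring the final count below `num_posts`.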

<div class="alert alert-success">

## ❗ Key Points

* Reddit's API through PRAW provides access to posts, comments, and user data from thousands of communities
* Different sorting methods (hot, new, top) give different perspectives on community content
* Reddit data includes rich metadata like scores, timestamps, and comment threads
* Comment threads have hierarchical structure that can reveal conversation patterns
* Engagement metrics vary by posting time, content type, and community
* Reddit data is excellent for sentiment analysis, trend analysis, and social network research
  
</div>