# Lesson 1: Reddit Comment Scraper

## 🎯 Learning Objectives
By the end of this lesson, you will be able to:
1. **Extract comments** from any Reddit thread
2. **Change the target URL** to scrape different discussions
3. **Change extraction quantity** to scrape more or less data
4. **Save data to CSV** for further analysis


## 🚀 What You'll Build
A simple but powerful Reddit comment scraper that can:
- Take any Reddit thread URL
- Extract comments with metadata (author, score, text, timestamp)
- Save the results to a CSV file for use in other tools

In [4]:
# Import required libraries
import praw
import pandas as pd
from datetime import datetime
import os

print("✅ Libraries imported successfully!")
print("📦 Ready to scrape Reddit comments")

✅ Libraries imported successfully!
📦 Ready to scrape Reddit comments


## Reddit API Authentication

### Overview
Reddit allows users to "scrape" data from its website through an API (Application Programming Interface). Because scraping can be taxing on Reddit's servers, the number of requests you can make per minute depends on whether you are authenticated as a user. The script below authenticates you so you can download posts at the higher limit. If you are not authenticated, you can still get data, but it's slower.

### 🎯 Two Authentication Methods:

| Method | Rate Limit | Best For |
|--------|------------|----------|
| **Authenticated (Encrypted)** | 600 requests/minute | Large subreddits, many posts |
| **Anonymous (Read-only)** | 60 requests/minute | Single threads, small datasets |

### 🔒 Security:
I have set up **encrypted credentials** that give you higher rate limits while keeping the actual login details secure. All of the authentication code lives in a separate module (`reddit_auth`) to keep things simple.

### 🔧 Setup:
The authentication cell below **automatically** chooses the best available method, so no setup is needed on your end.
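To get a feel for what the two limits in the table mean in practice, the sketch below estimates the minimum time a job would take at each rate. `estimate_minutes` is a hypothetical helper (it is not part of `reddit_auth`), and it assumes every request counts once against the limit.

```python
import math

# Rate limits as stated in the table above (requests per minute).
RATE_LIMITS = {"authenticated": 600, "read_only": 60}


def estimate_minutes(num_requests: int, requests_per_minute: int) -> int:
    """Return the minimum number of whole minutes needed to send num_requests."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return math.ceil(num_requests / requests_per_minute)


for mode, limit in RATE_LIMITS.items():
    minutes = estimate_minutes(1500, limit)
    print(f"{mode}: 1,500 requests take at least {minutes} minute(s)")
```

With these numbers, the same 1,500 requests need at least 3 minutes on the authenticated connection and at least 25 minutes on the read-only one.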

In [3]:
# 🔧 Setup Reddit Connection

# Reload reddit_auth if it was imported earlier, to avoid caching issues
import importlib
import sys
if 'reddit_auth' in sys.modules:
    importlib.reload(sys.modules['reddit_auth'])

from reddit_auth import setup_reddit_connection

print("🔗 Connecting to Reddit...")

# Single function call with explicit variable assignment
reddit, auth_mode, rate_limit = setup_reddit_connection()

# Single, clear status message
status_msg = "✅ Authenticated connection ready! (600 requests/minute)" if auth_mode == "authenticated" else "✅ Read-only connection ready! (60 requests/minute)"
print(status_msg)
print("🎯 Ready to scrape!")

🔗 Connecting to Reddit...
✅ Read-only connection ready! (60 requests/minute)
🎯 Ready to scrape!


## STEP 1: Choose Your Reddit Thread

To find a thread URL:
1. Go to Reddit.com
2. Find an interesting post with lots of comments
3. Copy the full URL from your browser
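4. Paste it below (replace the current URL)

Run the connection cell above before this one. The code below relies on the `reddit` object that cell creates, and fails with `name 'reddit' is not defined` without it.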
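Before pasting a URL, it can help to check that it really points at a thread. The sketch below is one possible check using only the standard library. The helper name `parse_thread_url` and the pattern are assumptions made for illustration; they cover the common `/r/<subreddit>/comments/<id>/...` form and nothing else.

```python
import re
from urllib.parse import urlparse

# Thread URLs look like /r/<subreddit>/comments/<post_id>/<optional_slug>/
THREAD_PATH = re.compile(
    r"^/r/(?P<subreddit>[^/]+)/comments/(?P<post_id>[A-Za-z0-9]+)(?:/|$)"
)


def parse_thread_url(url: str) -> dict:
    """Return the subreddit and post ID from a thread URL, or raise ValueError."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host != "reddit.com" and not host.endswith(".reddit.com"):
        raise ValueError(f"Not a reddit.com URL: {url!r}")
    match = THREAD_PATH.match(parsed.path)
    if match is None:
        raise ValueError(f"Not a thread URL: {url!r}")
    return match.groupdict()


url = "https://www.reddit.com/r/jmu/comments/1lbrjnx/best_jmu_suitestyle_halls_need_help_ranking_area/"
print(parse_thread_url(url))  # {'subreddit': 'jmu', 'post_id': '1lbrjnx'}
```

A subreddit front page such as `https://www.reddit.com/r/jmu/` raises `ValueError`, which is a clearer signal than an error later in the scraping cell.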

In [2]:
# Change this URL
url = "https://www.reddit.com/r/jmu/comments/1lbrjnx/best_jmu_suitestyle_halls_need_help_ranking_area/"

print(f"🔗 Target URL: {url}")
print("📝 To change this, modify the 'url' variable above")

# Load the Reddit thread
try:
    submission = reddit.submission(url=url)
    print(f"\n✅ Successfully loaded thread!")
    print(f"📰 Title: '{submission.title}'")
    print(f"📊 Score: {submission.score}")
    print(f"💬 Comments: {submission.num_comments}")
    print(f"📅 Subreddit: r/{submission.subreddit}")

except Exception as e:
    print(f"❌ Error loading thread: {e}")
    print("💡 Make sure the URL is a valid Reddit thread URL")

🔗 Target URL: https://www.reddit.com/r/jmu/comments/1lbrjnx/best_jmu_suitestyle_halls_need_help_ranking_area/
📝 To change this, modify the 'url' variable above
❌ Error loading thread: name 'reddit' is not defined
💡 Make sure the URL is a valid Reddit thread URL


## STEP 2: Extract All Comments from the Thread

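Once the thread is loaded, its comments arrive bundled as comment objects, not as plain text. They need to be "extracted" into something humans can read. The cell below loops over the thread's top-level comments and copies the author, score, text, and date of each one into a list of dictionaries. Replies to those comments are not included.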
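As a rough illustration of this flattening step, the sketch below turns one comment-like object into a plain dictionary. It runs without a Reddit connection because it uses a `SimpleNamespace` stand-in in place of a real comment; the helper name `comment_to_row` is an assumption. Unlike the cell below, it formats the timestamp explicitly in UTC, so the result does not depend on the computer's time zone.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def comment_to_row(comment) -> dict:
    """Flatten one comment-like object into a plain dictionary."""
    # created_utc is seconds since 1970-01-01 UTC; tz=timezone.utc keeps it in UTC.
    created = datetime.fromtimestamp(comment.created_utc, tz=timezone.utc)
    return {
        "author": str(comment.author) if comment.author else "[deleted]",
        "score": comment.score,
        "text": comment.body,
        "date": created.strftime("%Y-%m-%d %H:%M:%S"),
    }


# Stand-in for a real comment, with made-up values.
fake_comment = SimpleNamespace(author=None, score=3, body="Example text", created_utc=0)
print(comment_to_row(fake_comment))
# {'author': '[deleted]', 'score': 3, 'text': 'Example text', 'date': '1970-01-01 00:00:00'}
```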

In [4]:
# 🔍 STEP 2: Extract All Comments from the Thread

print("🔍 Extracting comments from the thread...")
print("⏳ This may take a few seconds for threads with many comments...")

comments_data = []

try:
    # Loop through all top-level comments
    for comment in submission.comments:
        if hasattr(comment, 'body'):  # Skip "MoreComments" objects
            # Convert timestamp to readable date
            comment_date = datetime.fromtimestamp(comment.created_utc).strftime('%Y-%m-%d %H:%M:%S')

            # Store comment information
            comment_info = {
                'author': str(comment.author) if comment.author else '[deleted]',
                'score': comment.score,
                'text': comment.body,
                'date': comment_date,
                'thread_title': submission.title,
                'subreddit': str(submission.subreddit)
            }
            comments_data.append(comment_info)

    print(f"✅ Successfully extracted {len(comments_data)} comments!")

    # Show a preview of the first few comments
    if comments_data:
        print(f"\n📋 Preview of first 3 comments:")
        for i, comment in enumerate(comments_data[:3], 1):
            print(f"\n{i}. Author: {comment['author']} | Score: {comment['score']}")
            print(f"   Text: {comment['text'][:100]}...")
            print(f"   Date: {comment['date']}")

except Exception as e:
    print(f"❌ Error extracting comments: {e}")
    comments_data = []

🔍 Extracting comments from the thread...
⏳ This may take a few seconds for threads with many comments...
✅ Successfully extracted 5 comments!

📋 Preview of first 3 comments:

1. Author: Pitiful-Pickle-5101 | Score: 2
   Text: Do you mean Jack and Jill style bathrooms? Or like village suite style where 3 sets of roommates sha...
   Date: 2025-06-15 14:56:40

2. Author: An51759 | Score: 1
   Text: Also I heard assignment is random, I was wondering how I’ll be able to pick a suite?...
   Date: 2025-06-15 13:13:36

3. Author: flutiful_fiona | Score: 1
   Text: the village dorms are suite style, but don't live there unless you like to party. the bathrooms get ...
   Date: 2025-06-20 13:02:33


### Reflection

Note that we can now see the Author, Score, Text, and Date of each comment. The main issue is that we are working one thread at a time, and a single thread is unlikely to hold a lot of data unless it's a hot topic. Instead, we want to grab the entire JMU subreddit. This is slightly more complicated, because we want each post and then its comments. It also means we will get A LOT of data, so we need a way to limit it. Likewise, we don't want to grab just a random sample; we want the posts that seem most relevant.
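One simple way to limit the data and favor relevance is to keep only the highest-scoring items. The sketch below assumes rows shaped like the dictionaries in `comments_data`; the helper name `top_comments`, the default thresholds, and the sample rows are made up for illustration.

```python
def top_comments(rows: list[dict], n: int = 10, min_score: int = 1) -> list[dict]:
    """Return up to n rows with score >= min_score, highest score first."""
    kept = [row for row in rows if row["score"] >= min_score]
    return sorted(kept, key=lambda row: row["score"], reverse=True)[:n]


# Made-up rows standing in for scraped comments.
sample = [
    {"author": "a", "score": 2, "text": "first"},
    {"author": "b", "score": -1, "text": "second"},
    {"author": "c", "score": 5, "text": "third"},
]
print([row["text"] for row in top_comments(sample, n=2)])  # ['third', 'first']
```

Score is only one possible measure of relevance. Filtering on it will hide unpopular but informative comments, so the threshold is a choice that shapes the analysis.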

## STEP 3: Save Your Data to CSV

Now let's save the scraped comments to a CSV file that you can open in Excel, Google Sheets, or use for further analysis!
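A minimal sketch of this step, assuming `comments_data` holds the list of dictionaries built in Step 2. It uses Python's built-in `csv` module; the helper name `save_comments_to_csv`, the `data` folder, and the `reddit_comments` file prefix are assumptions made for illustration. The example row is made up and is written to a temporary folder so that running the sketch leaves no files behind.

```python
import csv
import os
import tempfile
from datetime import datetime

# Column order for the CSV file, matching the keys used in Step 2.
FIELDS = ["author", "score", "text", "date", "thread_title", "subreddit"]


def save_comments_to_csv(rows, folder="data", prefix="reddit_comments"):
    """Write rows to a timestamped CSV file and return the file path."""
    os.makedirs(folder, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(folder, f"{prefix}_{timestamp}.csv")
    # utf-8-sig adds a byte-order mark, which helps Excel detect the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Made-up row standing in for `comments_data`.
example_rows = [{
    "author": "[deleted]",
    "score": 1,
    "text": "Example comment",
    "date": "2025-01-01 00:00:00",
    "thread_title": "Example thread",
    "subreddit": "jmu",
}]

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_comments_to_csv(example_rows, folder=tmp)
    print(os.path.basename(saved_path))
```

In the notebook you would call `save_comments_to_csv(comments_data)` instead, which writes into a `data` folder next to the notebook.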

# 🎯 Try Different Threads!

Now that you have a working scraper, try it with different Reddit threads:

## 🔥 Suggested Thread Types:

### Current Events & News
- r/worldnews - Global news discussions
- r/politics - Political discussions
- r/technology - Tech news and discussions

### Questions & Discussions  
- r/AskReddit - Open-ended questions
- r/explainlikeimfive - Simple explanations
- r/changemyview - Debate and discussion

### Hobbies & Interests
- r/movies - Film discussions
- r/gaming - Video game discussions  
- r/science - Scientific discussions

## 📝 How to Use:
1. **Find a thread**: Browse Reddit and find an interesting post with lots of comments
2. **Copy the URL**: Copy the full URL from your browser address bar
3. **Update the code**: Change the `url` variable in Step 1
4. **Run again**: Execute the cells to scrape the new thread
5. **Save**: Each run creates a new CSV file with timestamp

## 💡 Pro Tips:
- **Popular threads** have more comments but take longer to scrape
- **Recent threads** may have more active discussions
- **Different subreddits** have different discussion styles and topics
- **Sort by "Hot" or "Top"** to find the most engaging threads

## 📊 What's in Your CSV File:
- **author**: Username who posted the comment
- **score**: Upvotes minus downvotes  
- **text**: The actual comment text
- **date**: When the comment was posted
- **thread_title**: Title of the Reddit thread
- **subreddit**: Which subreddit the thread is from

## Next Steps:
- Try analyzing your CSV data in Excel or Google Sheets
- Look for patterns in comment scores or lengths
- Compare discussions across different subreddits
- Use the data for sentiment analysis or word cloud generation

# Step 3: Scraping Entire Subreddits

The scraper below takes three main variables:
- `subreddit_name` - the name of the subreddit, without the `r/` prefix
- `num_posts` - the number of posts you want to scrape (note that even the authenticated connection is limited to 600 requests/minute)
- `sort_method` - Options: "hot", "new", "top", "rising". Reddit uses these to organize posts

You can experiment with the different settings to change which posts show up in the output. Keep in mind that how you extract the data determines what data you'll be analyzing. If you only look at "new" posts, you might miss a major issue in the community. If you only look at "top" posts, you'll miss what folks are currently concerned with.
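The four sort options correspond to four listing methods on the subreddit object (`hot`, `new`, `top`, `rising`). As a sketch of an alternative to a chain of `if`/`elif` branches, the helper below looks the method up by name. `get_posts` and `FakeSubreddit` are hypothetical names; the stand-in class exists only so the example runs without a Reddit connection.

```python
VALID_SORTS = ("hot", "new", "top", "rising")


def get_posts(subreddit, sort_method: str, limit: int):
    """Call the listing method named by sort_method, rejecting unknown names."""
    if sort_method not in VALID_SORTS:
        raise ValueError(f"sort_method must be one of {VALID_SORTS}, got {sort_method!r}")
    return getattr(subreddit, sort_method)(limit=limit)


class FakeSubreddit:
    """Stand-in with the same four listing methods, returning made-up titles."""

    def _listing(self, name, limit):
        return [f"{name} post {i}" for i in range(1, limit + 1)]

    def hot(self, limit):
        return self._listing("hot", limit)

    def new(self, limit):
        return self._listing("new", limit)

    def top(self, limit):
        return self._listing("top", limit)

    def rising(self, limit):
        return self._listing("rising", limit)


print(get_posts(FakeSubreddit(), "top", 2))  # ['top post 1', 'top post 2']
```

This version raises an error on a misspelled sort method. The cell below instead falls back to "hot", which keeps the notebook running but can silently change which posts you analyze.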

In [5]:

subreddit_name = "JMU"  # Change this to any subreddit (without r/)
num_posts = 10         # How many posts to scrape
sort_method = "top"     # Options: "hot", "new", "top", "rising"

print(f"🎯 Scraping r/{subreddit_name} for text analysis")
print(f"📊 Getting {num_posts} {sort_method} posts...")
print("⏳ This may take a minute...")

# Simple data structure - no duplicates
text_data = []

try:
    # Get the subreddit
    subreddit = reddit.subreddit(subreddit_name)

    # Choose sorting method
    if sort_method == "hot":
        posts = subreddit.hot(limit=num_posts)
    elif sort_method == "new":
        posts = subreddit.new(limit=num_posts)
    elif sort_method == "top":
        posts = subreddit.top(limit=num_posts)
    elif sort_method == "rising":
        posts = subreddit.rising(limit=num_posts)
    else:
        # Unrecognized sort method: fall back to "hot"
        posts = subreddit.hot(limit=num_posts)

    # Loop through each post
    for post_num, submission in enumerate(posts, 1):
        print(f"📝 Processing post {post_num}/{num_posts}: {submission.title[:50]}...")

        # Get post date for reference
        post_date = datetime.fromtimestamp(submission.created_utc).strftime('%Y-%m-%d %H:%M:%S')

        # Add the post itself (title + content)
        post_text = submission.title
        if submission.selftext.strip():  # Add post content if it exists
            post_text += " " + submission.selftext

        text_data.append({
            'type': 'post',
            'title': submission.title,
            'text': post_text,
            'date': post_date,
            'score': submission.score
        })

        # Get comments for this post
        try:
            submission.comments.replace_more(limit=0)  # Don't expand "more comments"

            for comment in submission.comments.list()[:50]:  # Keep at most 50 comments per post
                if hasattr(comment, 'body') and comment.body.strip():
                    comment_date = datetime.fromtimestamp(comment.created_utc).strftime('%Y-%m-%d %H:%M:%S')

                    text_data.append({
                        'type': 'comment',
                        'title': submission.title,  # Thread title for reference
                        'text': comment.body,
                        'date': comment_date,
                        'score': comment.score
                    })
        except Exception:  # Keep going if one post's comments fail to load
            print(f"   ⚠️ Could not load comments for this post")

    print(f"\n✅ Successfully collected {len(text_data)} text items!")
    print(f"📊 Ready for text analysis from r/{subreddit_name}")

except Exception as e:
    print(f"❌ Error scraping subreddit: {e}")
    print("💡 Make sure the subreddit name is correct and accessible")

🎯 Scraping r/JMU for text analysis
📊 Getting 10 top posts...
⏳ This may take a minute...
📝 Processing post 1/10: President Alger leaving to take same job at Americ...
📝 Processing post 2/10: Alger’s response to hitting over 500 cases in a we...
📝 Processing post 3/10: Virginia schools be like...
📝 Processing post 4/10: The Dukes Advance!...
📝 Processing post 5/10: took My graduation pictures today while maintainin...
📝 Processing post 6/10: Taking Senior Photos During JMU Construction (2019...
📝 Processing post 7/10: Campus Reopening Plan...
📝 Processing post

# Step 4: Generate Output

Once the data has been collected, the next step is to write it to a file. The script below creates two different files in the `data` folder:

- `reddit_text_analysis...csv`
- `reddit_voyant...txt`

We will use both files eventually. The `.csv` file is a table of all the posts and comments. The `.txt` file holds the same information, but only as text, which is useful for visualizing in Voyant.
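Before saving, the cell below cleans each text. The sketch here isolates the two main cleaning steps so they can be tried on a single string; the function name `clean_text` is an assumption, and the encoding step from the notebook is left out for brevity.

```python
def clean_text(text) -> str:
    """Drop control characters, then collapse all whitespace to single spaces."""
    if text is None:
        return ""
    text = str(text)
    # Remove control characters (code points below 32) except tab and line breaks.
    text = "".join(ch for ch in text if ord(ch) >= 32 or ch in "\n\r\t")
    # split() with no argument splits on any whitespace, including line breaks.
    return " ".join(text.split())


print(clean_text("Hello\x00   world\n\nagain"))  # Hello world again
```

Note that the second step also removes the line breaks that the first step keeps, so every post or comment ends up on a single line.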


In [None]:
# 💾 Save Clean Text Data for Topic Modeling

if text_data:
    try:
        # Create data folder if it doesn't exist
        data_folder = "data"
        os.makedirs(data_folder, exist_ok=True)

        # Convert to DataFrame
        df_text = pd.DataFrame(text_data)

        # Clean text function for analysis
        def clean_text_for_analysis(text):
            if pd.isna(text):
                return ""
            text = str(text)
            # Drop characters that cannot be encoded as UTF-8
            text = text.encode('utf-8', errors='ignore').decode('utf-8')
            # Remove control characters (tabs and line breaks survive this step)
            text = ''.join(char for char in text if ord(char) >= 32 or char in '\n\r\t')
            # Collapse all whitespace, including line breaks, into single spaces
            text = ' '.join(text.split())
            return text

        # Clean the text data
        df_text['text'] = df_text['text'].apply(clean_text_for_analysis)
        df_text['title'] = df_text['title'].apply(clean_text_for_analysis)

        # Remove empty and very short texts (10 characters or fewer)
        df_text = df_text[df_text['text'].str.len() > 10]

        # Ensure proper data types
        df_text['score'] = pd.to_numeric(df_text['score'], errors='coerce')
        df_text['date'] = pd.to_datetime(df_text['date'], errors='coerce')
        df_text['type'] = df_text['type'].astype('category')

        # Create filename for analysis (save to data folder)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = os.path.join(data_folder, f"reddit_text_analysis_{subreddit_name}_{sort_method}_num_posts_{num_posts}_{timestamp}.csv")

        # Save to CSV optimized for text analysis
        df_text.to_csv(filename, index=False, encoding='utf-8-sig')

        # Show statistics for text analysis
        posts_count = (df_text['type'] == 'post').sum()
        comments_count = (df_text['type'] == 'comment').sum()

        print(f"\n📈 TEXT ANALYSIS STATS:")
        print(f"   • Posts: {posts_count}")
        print(f"   • Comments: {comments_count}")
        print(f"   • Total words: {df_text['text'].str.split().str.len().sum():,}")
        print(f"   • Average text length: {df_text['text'].str.len().mean():.0f} characters")
        print(f"   • Longest text: {df_text['text'].str.len().max()} characters")

        print(f"\n📋 SAMPLE TEXTS:")
        for idx, (_, row_data) in enumerate(df_text.head(3).iterrows(), 1):
            print(f"{idx}. [{row_data['type'].upper()}] {row_data['text'][:100]}...")

        # Create a separate file with just the text for Voyant (save to data folder)
        text_only_filename = os.path.join(data_folder, f"reddit_voyant_{subreddit_name}_{timestamp}.txt")
        with open(text_only_filename, 'w', encoding='utf-8') as f:
            for _, row in df_text.iterrows():
                f.write(f"{row['text']}\n\n")

        print(f"\n🎯 Created Voyant-ready file: {text_only_filename}")

    except Exception as e:
        print(f"❌ Error saving text data: {e}")

else:
    print("❌ No text data to save.")

📈 TEXT ANALYSIS STATS:
   • Posts: 10
   • Comments: 112
   • Total words: 2,817
   • Average text length: 133 characters
   • Longest text: 857 characters

📋 SAMPLE TEXTS:
1. [POST] President Alger leaving to take same job at American University at the end of this academic year...
2. [COMMENT] Like him or not, he did help transform this school. Applications to JMU have drastically increased u...
3. [COMMENT] Massive changes happening at JMU this year. Alger stepping down, AD Bourne retiring, Cignetti left f...

🎯 Created Voyant-ready file: data\reddit_voyant_JMU_20250905_041030.txt

💡 TO RELOAD DATA WITH PROPER TYPES:
   CSV method:
   df = pd.read_csv(r'data\reddit_text_analysis_JMU_top_num_posts_10_20250905_041030.csv')
   df['score'] = pd.to_numeric(df['score'])
   df['date'] = pd.to_datetime(df['date'])
   df['type'] = df['type'].astype('category')
   
   Pickle method (easiest):
   df = pd.read_pickle(r'data\reddit_data_JMU_20250905_040840.pkl')
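The reload hints above use Windows-style paths. Inside an ordinary Python string literal, a sequence such as `\r` is an escape (a carriage return), so a path like `data\reddit_...` needs either a raw string or a path object. A small sketch, using a shortened, hypothetical file name:

```python
from pathlib import Path

broken = 'data\reddit_text.csv'               # '\r' is read as a carriage return
raw = r'data\reddit_text.csv'                 # raw string keeps the backslash
portable = Path("data") / "reddit_text.csv"   # joins with the right separator on any OS

print("\r" in broken)   # True: the path no longer contains "\reddit"
print("\\" in raw)      # True: the backslash survived
print(portable.name)    # reddit_text.csv
```

`os.path.join`, which the save cell already uses, has the same effect as the `Path` version.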


## 🎯 Simplified Text Scraper for Topic Modeling

### 📝 What This Does:
This streamlined version focuses on **clean text extraction** for analysis tools like Voyant:
- **No duplicates**: Each piece of text appears only once
- **Clean format**: Text optimized for analysis
- **Two outputs**: CSV for data analysis + TXT file for Voyant
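Each post and comment is stored once, but if the same text is posted more than once (for example, a repeated comment), it will still appear more than once in `text_data`. A minimal sketch of one way to filter such repeats, assuming rows shaped like those in `text_data`; the helper name `drop_duplicate_texts` and the sample rows are made up for illustration.

```python
def drop_duplicate_texts(rows):
    """Keep the first row for each distinct text, ignoring case and spacing."""
    seen = set()
    unique = []
    for row in rows:
        # Normalize so "Go Dukes!" and "go  dukes!" count as the same text.
        key = " ".join(row["text"].split()).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows standing in for `text_data`.
sample_rows = [
    {"type": "comment", "text": "Go Dukes!"},
    {"type": "comment", "text": "go  dukes!"},
    {"type": "post", "text": "Campus is reopening"},
]
print([row["text"] for row in drop_duplicate_texts(sample_rows)])
# ['Go Dukes!', 'Campus is reopening']
```

Whether to drop repeats depends on the analysis: for word frequencies in Voyant, a phrase that many people repeat may be exactly the signal you want to keep.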

### Quick Setup:
1. **Change subreddit**: `subreddit_name = "JMU"`
2. **Set post count**: `num_posts = 100`
3. **Choose sorting**: `sort_method = "hot"`
4. **Run both cells above**

### What You Get:

#### CSV File Contains:
- **type**: "post" or "comment"
- **title**: Thread title (for context)
- **text**: The actual text content
- **date**: When it was posted
- **score**: Reddit score

#### TXT File Contains:
- Pure text, one item per line, with a blank line between items
- Perfect for uploading directly to Voyant
- No metadata, just content

### 🎯 Perfect for Topic Modeling:
- **Voyant Tools**: Upload the .txt file directly
- **Other text analysis**: Use the .csv file
- **Clean data**: Control characters, extra whitespace, and very short entries removed
- **Focused content**: Just the text you need

### 💡 Analysis Tips:
- **Voyant**: Use the .txt file for word clouds, trends, correlations
- **Text length**: Longer texts work better for topic modeling
- **Sample size**: 50-100 posts usually gives good results
- **Time periods**: Try different sorting methods to see trends

This approach gives you exactly what you need for text analysis without the complexity!