# QQQ Reddit Sentiment Scraper

This notebook collects Reddit posts related to the QQQ ETF from the past month. It uses the PRAW library to query Reddit's API, VADER to score the sentiment of each post, and pandas to organize the results.

## 1. Install and Import Required Libraries

In [5]:
# Install required packages if not already installed
%pip install praw pandas vaderSentiment

# Import required libraries
import praw
import pandas as pd
from datetime import datetime, timedelta
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Note: you may need to restart the kernel to use updated packages.

## 2. Configure Reddit API

Replace the placeholders below with your Reddit API credentials. To obtain them:
1. Go to https://www.reddit.com/prefs/apps
2. Create a new app (choose the "script" type)
3. Copy the client_id and client_secret

In [6]:
# Reddit API Configuration
# Keep real credentials out of shared notebooks; fill these in locally.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="QQQ_Sentiment_Analysis_Bot/1.0",
)
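
Hard-coding credentials in a notebook exposes them to anyone who can read the file. The following is a minimal sketch of one alternative, assuming the credentials are exported as environment variables before the kernel starts. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT`, and the helper `load_reddit_credentials`, are hypothetical and not part of PRAW.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from environment variables.

    The variable names are an assumption made for this sketch.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Fall back to the user agent string used in this notebook
        "user_agent": env.get("REDDIT_USER_AGENT", "QQQ_Sentiment_Analysis_Bot/1.0"),
    }
```

In the notebook, the returned dictionary could be passed straight to the client: `reddit = praw.Reddit(**load_reddit_credentials())`.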

## 3. Define QQQ Data Collection Function

In [7]:
def scrape_qqq_data(days=30):
    """
    Scrape QQQ-related posts from Reddit
    Args:
        days (int): Number of days of historical data to collect
    Returns:
        DataFrame: Processed Reddit posts data
    """
    # Define relevant subreddits and search terms
    subreddits = ['investing', 'stocks', 'wallstreetbets', 'StockMarket', 'ETFs']
    search_terms = ['QQQ', 'NASDAQ-100', 'Invesco QQQ Trust']
    
    # Calculate date range
    end_date = datetime.now()
    start_date = end_date - timedelta(days=days)
    
    # Initialize data collection
    posts_data = []
    # Create the VADER analyzer once and reuse it for every post
    analyzer = SentimentIntensityAnalyzer()
    
    print(f"Collecting QQQ posts from {start_date.date()} to {end_date.date()}")
    
    # Scrape data from each subreddit
    for sub_name in subreddits:
        try:
            subreddit = reddit.subreddit(sub_name)
            print(f"\nScraping r/{sub_name}...")
            
            for term in search_terms:
                for post in subreddit.search(term, limit=100, sort='new'):
                    # created_utc is a Unix timestamp; this converts it to local time
                    post_date = datetime.fromtimestamp(post.created_utc)
                    
                    if start_date <= post_date <= end_date:
                        # Extract post data
                        post_data = {
                            'date': post_date,
                            'subreddit': sub_name,
                            'title': post.title,
                            'text': post.selftext,
                            'score': post.score,
                            'num_comments': post.num_comments,
                            'upvote_ratio': post.upvote_ratio,
                            'url': f"https://reddit.com{post.permalink}"
                        }
                        
                        # Score the title and body together with VADER
                        sentiment = analyzer.polarity_scores(post.title + ' ' + post.selftext)
                        post_data.update(sentiment)
                        
                        posts_data.append(post_data)
        
        except Exception as e:
            print(f"Error scraping r/{sub_name}: {str(e)}")
            continue
    
    # Convert to DataFrame
    df = pd.DataFrame(posts_data)
    
    print(f"\nCollection complete! Found {len(df)} posts")
    return df
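
Because each subreddit is searched once per term, a post that matches more than one search term is appended more than once, which inflates the post count and the daily sums. In addition, `datetime.fromtimestamp` without a `tz` argument returns local time, so the assigned date depends on the machine running the notebook. The sketch below shows one way to handle both, assuming posts are identified by their `url` field; the helper names `to_utc_datetime` and `dedupe_posts` and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc):
    """Convert a Unix timestamp (such as post.created_utc) to an aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


def dedupe_posts(posts, key="url"):
    """Keep the first occurrence of each post, identified by `key`."""
    seen = set()
    unique = []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique


# Made-up sample: the first two entries are the same post found by two search terms
sample = [
    {"url": "https://reddit.com/r/ETFs/a", "title": "QQQ vs VOO"},
    {"url": "https://reddit.com/r/ETFs/a", "title": "QQQ vs VOO"},
    {"url": "https://reddit.com/r/stocks/b", "title": "NASDAQ-100 outlook"},
]
print(len(dedupe_posts(sample)))  # 2
print(to_utc_datetime(0))         # 1970-01-01 00:00:00+00:00
```

If UTC timestamps are used for `post_date`, the `start_date` and `end_date` bounds would need to be timezone-aware as well, for example `datetime.now(timezone.utc)`.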

## 4. Execute Data Collection and Save Results

In [8]:
# Collect QQQ Reddit data
qqq_posts = scrape_qqq_data(days=30)

# Save data to CSV
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"QQQ_reddit_data_{timestamp}.csv"
qqq_posts.to_csv(filename, index=False)
print(f"\nData saved to {filename}")

# Display sample of collected data
print("\nSample of collected data:")
print(qqq_posts[['date', 'subreddit', 'title', 'score', 'compound']].head())

Collecting QQQ posts from 2025-10-04 to 2025-11-03

Scraping r/investing...

Scraping r/stocks...

Scraping r/wallstreetbets...

Scraping r/StockMarket...

Scraping r/ETFs...

Collection complete! Found 199 posts

Data saved to QQQ_reddit_data_20251103_170415.csv

Sample of collected data:
                 date  subreddit  \
0 2025-11-03 02:41:34  investing   
1 2025-11-01 18:24:49  investing   
2 2025-11-01 13:36:05  investing   
3 2025-10-31 19:05:30  investing   
4 2025-10-31 18:37:50  investing   

                                               title  score  compound  
0  Portfolio Planning Advice - Roth IRA vs Brokerage      1    0.9724  
1  For those who invest all savings after emergen...     25    0.9320  
2     Best instrument to express a bearish TSLA view      0    0.2886  
3       It's never a bad idea to take profits right?    172    0.7037  
4                        Diversif

## 5. Merge Sentiment Data with Technical Data

In [21]:
# Load technical data
technical_data = pd.read_csv("../data/QQQ_Historical_DayByDay.csv")
# Convert to just the date using map
technical_data['Date'] = pd.to_datetime(technical_data['Date']).map(lambda x: x.date())

# Aggregate sentiment data by date
daily_sentiment = qqq_posts.copy()
# Convert to just the date using map
daily_sentiment['date'] = pd.to_datetime(daily_sentiment['date']).map(lambda x: x.date())
daily_sentiment = daily_sentiment.groupby('date').agg({
    'compound': 'mean',
    'pos': 'mean',
    'neg': 'mean',
    'neu': 'mean',
    'score': 'sum',
    'num_comments': 'sum'
}).reset_index()

# Get the last month cutoff date
last_month = datetime.now().date()
last_month = last_month - timedelta(days=30)

# Filter technical data for last month
recent_technical = technical_data[technical_data['Date'] >= last_month].copy()

# Merge technical and sentiment data
merged_data = recent_technical.merge(
    daily_sentiment,
    left_on='Date',
    right_on='date',
    how='left'
).drop('date', axis=1)

# Fill missing sentiment values with 0 (days with no Reddit posts)
sentiment_columns = ['compound', 'pos', 'neg', 'neu', 'score', 'num_comments']
merged_data[sentiment_columns] = merged_data[sentiment_columns].fillna(0)

# Save merged data
merged_filename = f"QQQ_technical_sentiment_{timestamp}.csv"
merged_data.to_csv(merged_filename, index=False)
print(f"\nMerged data saved to {merged_filename}")

# Display sample of merged data
print("\nSample of merged data:")
display(merged_data.head())


Merged data saved to QQQ_technical_sentiment_20251103_170415.csv

Sample of merged data:



Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Capital Gains,Daily_Return,...,Volume_Ratio,High_20d,Low_20d,Price_Position,compound,pos,neg,neu,score,num_comments
0,2025-10-06,608.450012,609.359985,605.969971,607.710022,41962100,0.0,0.0,0.0,0.00751,...,0.816829,609.359985,576.371816,0.949983,0.6039,0.1292,0.0542,0.8168,99,96
1,2025-10-07,609.02002,609.710022,603.030029,604.51001,58209500,0.0,0.0,0.0,-0.005266,...,1.147453,609.710022,577.880053,0.836632,0.222786,0.068571,0.053286,0.878143,18,122
2,2025-10-08,605.409973,611.75,605.26001,611.440002,50629800,0.0,0.0,0.0,0.011464,...,0.996507,611.75,580.946513,0.989936,0.383325,0.082,0.0255,0.89275,54,79
3,2025-10-09,611.47998,611.609985,607.47998,610.700012,45551000,0.0,0.0,0.0,-0.00121,...,0.943664,611.75,583.423619,0.962933,0.134686,0.038,0.056143,0.906143,591,865
4,2025-10-10,611.400024,613.179993,589.049988,589.5,97614800,0.0,0.0,0.0,-0.034714,...,1.855862,613.179993,583.693348,0.196925,0.223386,0.092571,0.052,0.855357,5050,797
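
The left merge keeps only the dates present in the technical data, so posts dated on weekends or market holidays have no matching row and drop out of the merged result. One option is to assign each post to the first trading day on or after its date before grouping. This is a minimal sketch of that idea; the helper `next_trading_day` and the sample dates are hypothetical.

```python
from bisect import bisect_left
from datetime import date


def next_trading_day(post_date, trading_days):
    """Return the first trading day on or after post_date, or None if there is none.

    trading_days must be a sorted list of datetime.date objects.
    """
    i = bisect_left(trading_days, post_date)
    return trading_days[i] if i < len(trading_days) else None


# Made-up sample of trading days around a weekend
trading_days = [date(2025, 10, 9), date(2025, 10, 10), date(2025, 10, 13)]
print(next_trading_day(date(2025, 10, 11), trading_days))  # 2025-10-13
print(next_trading_day(date(2025, 10, 10), trading_days))  # 2025-10-10
```

Rolling forward rather than backward means a weekend post is only ever paired with prices that came after it, which avoids leaking later sentiment into earlier trading days.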


### Data Description

The merged dataset now contains:

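Technical Data:
- Date: Trading date
- Open, High, Low, Close: Daily price data
- Volume: Trading volume
- Dividends, Stock Splits, Capital Gains: Corporate-action fields (0.0 in the sample rows)
- Daily_Return: Day-over-day return of the close
- High_20d, Low_20d: Rolling 20-day high and low
- Price_Position: Position of the close within the 20-day range (0 = Low_20d, 1 = High_20d)
- Volume_Ratio: Volume relative to a baseline defined in the technical data file
- Further indicator columns are elided (`...`) in the sample above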

Sentiment Data:
- compound: Daily mean of VADER's normalized overall sentiment score (-1 to 1)
- pos: Daily mean proportion of text rated positive (0 to 1)
- neg: Daily mean proportion of text rated negative (0 to 1)
- neu: Daily mean proportion of text rated neutral (0 to 1)
- score: Sum of Reddit post scores for the day
- num_comments: Total number of comments on QQQ posts for the day
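
To turn the compound score into a categorical label, the VADER documentation suggests thresholds of 0.05 and -0.05. The sketch below applies that convention; the function name `label_sentiment` is hypothetical, and the thresholds were tuned on individual texts, so applying them to daily means is an assumption.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for value in (0.6039, 0.0, -0.34):
    print(value, label_sentiment(value))
# 0.6039 positive
# 0.0 neutral
# -0.34 negative
```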

Note: Trading days without any Reddit posts have all six sentiment columns set to 0, so a compound value of 0 can mean either neutral sentiment or no posts at all.
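
If that ambiguity matters for later modeling, one alternative is to leave the sentiment scores missing on days without posts, add an explicit flag, and zero-fill only the count-like columns. This is a sketch on made-up two-row frames, not the notebook's actual data; the column name `has_posts` is hypothetical.

```python
import pandas as pd

# Made-up stand-ins for recent_technical and daily_sentiment
prices = pd.DataFrame({"Date": ["2025-10-06", "2025-10-07"], "Close": [607.71, 604.51]})
sentiment = pd.DataFrame({"date": ["2025-10-06"], "compound": [0.6039], "num_comments": [96]})

merged = prices.merge(sentiment, left_on="Date", right_on="date", how="left").drop(columns="date")

# Flag days that actually had posts, then zero-fill only the count column;
# compound stays NaN where there was nothing to score
merged["has_posts"] = merged["compound"].notna()
merged["num_comments"] = merged["num_comments"].fillna(0).astype(int)

print(merged)
```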