## Data Scraping From Reddit for Trend Analysis Model Building

This notebook scrapes data from Reddit to build a dataset for trend analysis. The goal is to collect posts, comments, and metadata from subreddits that are relevant to trending topics. The data will be used to train and evaluate machine learning models for trend prediction and to build dashboards. A minimal version will also be provided as a .py file; I am just used to running things sequentially in a notebook.

In [1]:
import praw
import sqlite3
import time
import sys
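# Uncomment the line below if running from the notebooks folder; it puts the
# project root (the parent of src) on the import path so `src.config` resolves.
# sys.path.insert(0, '..')
from src.config import (
    REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, USER_AGENT,
    SUBREDDITS, POST_LIMIT, DB_PATH,
)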

Before running the cells below, initialize an empty database in the data folder with
> sqlite3 file.db "VACUUM;"
then rename the file to match the database path specified in the config file (`DB_PATH`).
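
The same empty file can also be created from Python, since `sqlite3.connect` creates the database file when it does not exist. A minimal sketch, assuming a throwaway path in a temporary directory (`create_empty_db` is a hypothetical helper, and the file name `file.db` only mirrors the placeholder in the command above):

```python
import os
import sqlite3
import tempfile

def create_empty_db(path):
    """Create an empty SQLite database file at `path` (hypothetical helper)."""
    conn = sqlite3.connect(path)  # creates the file if it is missing
    try:
        conn.execute("VACUUM")  # same statement as the shell command
    finally:
        conn.close()
    return path

# Demonstrate in a temporary directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    db_file = create_empty_db(os.path.join(tmp_dir, "file.db"))
    created = os.path.exists(db_file)
    print(created)
```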


In [2]:
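def init_db():
    """Initialize the database by executing the statements in schema.sql."""
    # The schema path is resolved relative to the current working directory.
    with open("schema.sql", "r") as f:
        schema = f.read()
    conn = sqlite3.connect(DB_PATH)
    try:
        conn.executescript(schema)
        conn.commit()
    finally:
        # Close the connection even if the schema fails to apply.
        conn.close()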
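
`schema.sql` itself is not shown in this notebook. Judging only from the column names used in the `INSERT` statements of `scrape_and_store`, it might look roughly like the sketch below. The column types, the primary keys, and the foreign key are assumptions, not the project's actual schema. The sketch applies the schema to an in-memory database so it can be tried without touching `DB_PATH`.

```python
import sqlite3

# Hypothetical schema inferred from the INSERT statements in this notebook.
ASSUMED_SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    id           TEXT PRIMARY KEY,
    title        TEXT,
    score        INTEGER,
    created_utc  REAL,
    num_comments INTEGER
);
CREATE TABLE IF NOT EXISTS comments (
    id          TEXT PRIMARY KEY,
    post_id     TEXT REFERENCES posts(id),
    body        TEXT,
    score       INTEGER,
    created_utc REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ASSUMED_SCHEMA)
# List the tables that the script created.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
conn.close()
print(tables)
```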

In [3]:
def connect_reddit():
    """Connect to Reddit API using PRAW."""
    reddit = praw.Reddit(
        client_id=REDDIT_CLIENT_ID,
        client_secret=REDDIT_CLIENT_SECRET,
        user_agent=USER_AGENT
    )
    return reddit


In [4]:
def scrape_and_store(subreddit_name):
    """Scrape top posts and their comments from a subreddit and store them in the database."""
    reddit = connect_reddit()
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    try:
        subreddit = reddit.subreddit(subreddit_name)
        for post in subreddit.top(limit=POST_LIMIT):
            # INSERT OR IGNORE skips rows that would violate a uniqueness
            # constraint (for example, a primary key on id).
            cur.execute(
                "INSERT OR IGNORE INTO posts (id, title, score, created_utc, num_comments) VALUES (?, ?, ?, ?, ?)",
                (post.id, post.title, post.score, post.created_utc, post.num_comments)
            )
            # Drop the "load more comments" placeholders without fetching them.
            post.comments.replace_more(limit=0)
            for comment in post.comments.list():
                cur.execute(
                    "INSERT OR IGNORE INTO comments (id, post_id, body, score, created_utc) VALUES (?, ?, ?, ?, ?)",
                    (comment.id, post.id, comment.body, comment.score, comment.created_utc)
                )
        conn.commit()
    finally:
        # Release the connection even if a request fails mid-scrape.
        conn.close()
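
Because both statements use `INSERT OR IGNORE`, re-running the scraper should not duplicate rows, provided `id` carries a uniqueness constraint in `schema.sql` (assumed here, since the schema is not shown). A small in-memory sketch of that behaviour, with made-up values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal table; the real schema.sql may differ.
conn.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT, score INTEGER)")

insert_sql = "INSERT OR IGNORE INTO posts (id, title, score) VALUES (?, ?, ?)"
conn.execute(insert_sql, ("abc123", "First scrape", 10))
# Same id again: the conflicting row is skipped rather than raising an error.
conn.execute(insert_sql, ("abc123", "Second scrape", 25))
conn.commit()

rows = conn.execute("SELECT id, title, score FROM posts").fetchall()
conn.close()
print(rows)
```

One consequence is that the score of an already-stored post is not refreshed on later runs; the first stored values are kept.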


In [6]:
if __name__ == "__main__":
    init_db()
    for sub in SUBREDDITS:
        print(f"Scraping {sub}...")
        scrape_and_store(sub)
    print("Scraping complete, data stored in data folder")

Scraping r/HomeDecorating...


BadRequest: received 400 HTTP response
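
The log line above shows the name being passed as `r/HomeDecorating`. PRAW's `reddit.subreddit()` is normally given the bare display name (`HomeDecorating`), so a leading `r/` in `SUBREDDITS` is one plausible cause of the 400 response, though the truncated traceback does not confirm it. A minimal sketch of a normalizing helper (`normalize_subreddit_name` is a hypothetical name, not part of PRAW) that could be applied to each entry before calling `scrape_and_store`:

```python
def normalize_subreddit_name(name):
    """Return a bare subreddit name, dropping any leading '/r/' or 'r/' (hypothetical helper)."""
    cleaned = name.strip().lstrip("/")
    if cleaned.lower().startswith("r/"):
        cleaned = cleaned[2:]
    return cleaned.rstrip("/")

# Different spellings of the same subreddit should collapse to one bare name.
examples = ["r/HomeDecorating", "/r/HomeDecorating/", "HomeDecorating"]
normalized = [normalize_subreddit_name(n) for n in examples]
print(normalized)
```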