# Data Pipeline Prototype

## Importing libraries

In [None]:
# ! pip install pandas praw prawcore python-dotenv pyarrow tqdm

In [163]:
import pandas as pd
import praw, prawcore, time, os, sys, functools, random, json
import datetime as dt
import pyarrow as pa
import pyarrow.parquet as pq
from dotenv import load_dotenv
from typing import List, Dict, Any, Union, Tuple, NamedTuple
from itertools import product
from collections import deque, namedtuple
from collections.abc import Generator, Callable
from tqdm import tqdm
from contextlib import ExitStack

## Setting up access to Reddit API
Access keys for the Reddit API are stored in a `.env` file under the config directory of this repository. A template for the `.env` file is provided in the same directory.

The `config.py` script reads these environment variables and assigns them to the `PRAW_ID`, `PRAW_SECRET`, `PRAW_USER_AGENT`, `PRAW_USERNAME`, and `PRAW_PASSWORD` global variables.

In [81]:
# Load .env file for access keys
load_dotenv(os.path.join('..', 'config', '.env'))

# Import config.py to access environment variables
sys.path.append('../config')
from config import PRAW_ID, PRAW_SECRET, PRAW_USER_AGENT, PRAW_USERNAME, PRAW_PASSWORD

In [34]:
# Initialize PRAW 
reddit = praw.Reddit(
    client_id=PRAW_ID,
    client_secret=PRAW_SECRET,
    username=PRAW_USERNAME,
    password=PRAW_PASSWORD,
    user_agent=PRAW_USER_AGENT
)

## Extracting text data

This section deals with the process of extracting and storing text data and metadata from Reddit posts and comments. My objective is to present my thought process and design principles in implementing the data pipeline for this project.

__Data Pipeline Overview:__
1. Establish access to Reddit API
2. Crawl predefined subreddits by searching submissions using predefined queries
3. Extract textual data and metadata from relevant posts and child comments
4. Preprocess data (Optional)
5. Store extracted data to disk

### The Challenge of Reddit API Rate Limiting

__Building the search query and subreddit pairs__

To scrape the relevant text data from Reddit, I created a small list of queries covering diverse topics relevant to buying affordable used vehicles. The queries combine location-specific, model-specific, and thematic keywords so that the search covers as much ground as possible. The chosen subreddits each have more than 100,000 subscribers, so that a search query is likely to return a substantial number of results per API request. The queries and subreddits are stored in `../src/search_queries.txt` and `../src/subreddits.txt`, respectively.

__Searching relevant posts per Subreddit__

The objective is to search for and scrape posts (and their child comments) within the specified subreddits using the search queries provided. With a 10x10 grid of queries and subreddits, I expect an initial 100 search requests, yielding at most 100x100 = 10,000 submissions. Fetching the comments is far more expensive, since each submission requires at least one request to yield its CommentForest: about 10,000 requests if every search returns the full 100 results.

__Expected Minimum API Requests__
|Search Requests|Comment Fetch Requests|Total Requests|
|:----------|:----------|:----------|
|100      |10,000  |10,100|

From the table above, a single batch job covering all query-subreddit combinations would issue about 10,100 API requests in one go, which far exceeds the Reddit API fair-use policy (requests are capped at 100 per minute, averaged over a 10-minute sliding window).
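
The arithmetic behind the table can be checked with a short sketch. The subreddit and query names below are placeholders standing in for the contents of the two text files, and the 100 requests/minute figure is the cap quoted above.

```python
from itertools import product

# Placeholder names; the real values live in the two text files
subreddits = [f"subreddit_{i}" for i in range(10)]
queries = [f"query_{i}" for i in range(10)]

search_pairs = list(product(subreddits, queries))
search_requests = len(search_pairs)        # one search request per pair
max_submissions = search_requests * 100    # up to 100 results per search
comment_requests = max_submissions         # one comment fetch per submission
total_requests = search_requests + comment_requests

RATE_LIMIT_PER_MIN = 100
min_runtime_min = total_requests / RATE_LIMIT_PER_MIN

print(total_requests, min_runtime_min)  # 10100 101.0
```

Even when every request is spaced exactly at the cap, a full run under these assumptions needs more than an hour and a half, which is why throttling has to be built into the pipeline rather than added afterwards.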

__Implementing a sliding window request counter and backoff algorithms__

To ensure the script adheres to fair use policies, I implemented two-pronged fail-safe logic:
1. Handle transient failures for each API request by implementing a backoff algorithm
2. Mitigate the risk of #1 happening by implementing a program-level API request counter that tracks current and expected calls within a specified sliding window. This rate limiter will throttle requests until there's an available slot.

#### API Request Fail-safe Functions
Although PRAW already implements throttling, I took the challenge to implement a backoff algorithm and rate limiter for my script to better understand APIs and industry practices.

##### Decorator Factory for implementing Exponential Backoff and Full Jitter
`backoff_on_rate_limit`: This is a decorator factory that builds a custom decorator based on specified backoff parameters (max retries, base delay, cap, jitter). The decorator itself is a wrapper for custom functions that call PRAW methods such as `fetch_submissions` and `fetch_comments`, which call subreddit.search() and submission.comments.replace_more() respectively. The decorator implements exponential backoff with optional full jitter to respect Reddit API rate limits while handling transient failures.
<blockquote>

__Inputs:__
- Integer value for max retries. When attempts exceed this number, an Exception is raised
- Float for base delay in seconds (i.e. Delay at first failed attempt)
- Float for maximum delay in seconds
- Bool on whether to implement full jitter or not

__Outputs:__
- Decorator to be applied to a PRAW API request wrapper
</blockquote>
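
To make the effect of the backoff parameters concrete, here is a minimal sketch that only computes the delays, without sleeping or calling any API. `backoff_delays` and its `rng` argument are hypothetical names introduced for illustration.

```python
import random

def backoff_delays(max_retries=5, base_delay=1.0, cap_delay=60.0, jitter=True, rng=None):
    """Return the sleep durations a capped exponential backoff would use."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(max_retries):
        # The ceiling doubles with every attempt until it hits the cap
        ceiling = min(cap_delay, base_delay * 2 ** attempt)
        # Full jitter: sleep a uniform random time between 0 and the ceiling
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays

ceilings = backoff_delays(max_retries=8, jitter=False)
print(ceilings)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]

jittered = backoff_delays(max_retries=8, rng=random.Random(0))
```

With full jitter each delay falls somewhere between zero and the corresponding ceiling, which spreads out retries from clients that failed at the same moment.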

In [82]:
def backoff_on_rate_limit(max_retries:int=5, 
                        base_delay:float=1.0, 
                        cap_delay:float=60.0, 
                        jitter:bool=True):
    """
    Decorator factory that applies exponential backoff (with optional jitter)
    when Reddit API rate limits (HTTP 429) or server errors occur.
    Stops after max_retries and re-raises the exception.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Start with base delay, then exponentially scale by attempt
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except prawcore.exceptions.ResponseException as e:
                    # Give up once max_retries retries have been used and
                    # re-raise the original exception, as the docstring states
                    if attempt >= max_retries:
                        raise
                    delay = min(cap_delay, base_delay * 2 ** attempt)
                    if jitter:
                        # Full jitter: uniform random delay between 0 and the capped value
                        delay = random.uniform(0, delay)
                    print(f"[WARNING] {e.__class__.__name__} on attempt {attempt+1}, retrying after {delay:.2f}s.")
                    time.sleep(delay)
                    attempt += 1
        return wrapper
    return decorator

#### API Call Wrappers with Backoff Algorithms

These helper functions wrap the PRAW calls that fetch submissions and comments. The extracted data and metadata are packaged into dictionaries by the streaming functions further below, so that they can be easily loaded into a Pandas DataFrame for further analysis. The backoff decorator is applied to each API call wrapper to handle transient errors such as an HTTP 429 (Too Many Requests) response.

In [83]:
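@backoff_on_rate_limit()
def fetch_submissions(subreddit:object, query:str, limit:int=100, **kwargs):
    """Wrap the PRAW subreddit search so that it is retried with backoff on API errors."""
    # subreddit.search() returns a lazy listing; materialize it here so that the
    # request (and any error it raises) happens inside the backoff wrapper
    return list(subreddit.search(query=query, limit=limit, **kwargs))

@backoff_on_rate_limit()
def fetch_comments(submission:object, limit:int=0):
    """Wrap the PRAW comment fetch so that it is retried with backoff on API errors."""
    # Resolve 'MoreComments' placeholders up to the specified limit
    # (default = 0, which removes them all without fetching)
    submission.comments.replace_more(limit=limit)
    # Return a list instead of yielding: a generator body only runs after the
    # decorator's try/except has returned, so its errors would escape the backoff
    return list(submission.comments)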

##### Rate Limiter Class

The RateLimiter class is initialized at the beginning of the script and is used to track API requests made within a specific sliding window. Requests are throttled when the total expected requests would exceed the rate limit (Reddit = 100/min) for the current window. Random jitter (0.1 to 5.0 seconds by default) is added to the wait time.
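
The sliding-window idea can be sketched in isolation, with the current time passed in explicitly so that the example runs without sleeping. `SlidingWindowCounter` and its methods are hypothetical names used for illustration only, and jitter is left out to keep the arithmetic visible.

```python
from collections import deque

class SlidingWindowCounter:
    """Hypothetical helper: tracks request timestamps within a rolling window."""
    def __init__(self, max_requests, period):
        self.max_requests = max_requests
        self.period = period
        self.timestamps = deque()

    def seconds_until_slot(self, now):
        """Return 0.0 if a request may go out at `now`, else the wait in seconds."""
        # Evict timestamps that have fallen out of the window
        while self.timestamps and self.timestamps[0] <= now - self.period:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            return 0.0
        # The oldest request leaves the window at timestamps[0] + period
        return self.timestamps[0] + self.period - now

    def record(self, now):
        """Register a request made at time `now`."""
        self.timestamps.append(now)

counter = SlidingWindowCounter(max_requests=3, period=60.0)
for t in (0.0, 10.0, 20.0):
    assert counter.seconds_until_slot(t) == 0.0
    counter.record(t)

wait = counter.seconds_until_slot(30.0)  # window is full: wait for t=0 to expire
free = counter.seconds_until_slot(60.0)  # the request at t=0 has left the window
print(wait, free)  # 30.0 0.0
```

Because only timestamps inside the window are kept, the deque never holds more than `max_requests` entries, so memory use stays bounded however long the scrape runs.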

In [217]:
class RateLimiter(object):
    """Rate Limiter with sliding window implementation."""
    def __init__(self, max_requests:int=100, period:float=60.0, jitter:Union[List[float],Tuple]=(0.1,5.0)):
        self.max_requests = max_requests
        self.period = period
        self.jitter = jitter
        self.trace_requests = deque()
        self.total_requests = 0
        
    def wait_for_slot(self, n_request:int=1) -> None:
        """
        Delays execution of subsequent API request or code chunk to ensure maximum
        function calls or request adheres to rate limits within a specified window.
        """
        window_end = time.time()
        window_start = window_end - self.period
        
        # Remove older batches when timestamp is out of current window
        while self.trace_requests and self.trace_requests[0] < window_start:
            self.trace_requests.popleft()
        
        # Check if additional request can be accommodated given requests made in current window
        if len(self.trace_requests) + n_request > self.max_requests:
            # Wait time is adjusted by jitter
            wait_time = (self.trace_requests[0] + self.period) - window_end + random.uniform(*self.jitter)
            time.sleep(max(wait_time, 0))
            # Re-run the function and determine if request can be accommodated
            return self.wait_for_slot(n_request)
        
        # Enqueue current request to trace requests
        for _ in range(n_request):
            self.trace_requests.append(time.time())
            self.tally_request()
    
    def tally_request(self, n_request:int=1):
        """Tallies current request and stores it in instance memory."""
        self.total_requests += n_request
        return self
    
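    def __str__(self):
        # Inner and outer f-strings use different quotes so this also runs before Python 3.12
        return f"RateLimiter Class ({', '.join(f'{k}:{v}' for k, v in self.as_dict().items())})"
    
    def as_dict(self):
        """Return settings and counters as a dict (renamed from __get__, which is reserved for descriptors)."""
        return {'max_requests':self.max_requests, 'period':self.period, 'jitter':self.jitter, 'trace_requests': len(self.trace_requests), 'total_requests': self.total_requests}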
        

### Data Streaming

__Rationale: Scalability of Scraping Logic__

Previously, I explored building dictionaries within each scraping function and returning that dictionary to the data storage logic. However, this approach doesn't scale well since device memory may become a bottleneck with larger volumes of API calls. From my research, it's recommended to use generators to stream data from APIs as memory overhead is limited to the data extracted from the most recent call.
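
As a minimal sketch of the difference, the generator below yields one record at a time instead of accumulating everything in a dictionary first. `fake_api_pages` is a hypothetical stand-in for paginated API responses and its data is made up.

```python
from itertools import islice

def fake_api_pages(n_pages, page_size):
    """Stand-in for paginated API responses (hypothetical data)."""
    for page in range(n_pages):
        yield [{"id": f"{page}-{i}"} for i in range(page_size)]

def stream_records(n_pages, page_size):
    """Yield one (record_type, record) tuple at a time."""
    for page in fake_api_pages(n_pages, page_size):
        for record in page:
            yield "comment", record

# Nothing is fetched until the stream is consumed
stream = stream_records(n_pages=1_000, page_size=100)
first_three = list(islice(stream, 3))
print(first_three)
```

Only the page currently being iterated is held in memory; the remaining 999 pages in this example are never built unless the consumer asks for them.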

__Data Extraction Overview__:

1. Initialize the rate limiter class to keep track of requests within a 60-second sliding window.
2. Parse the text files containing subreddit names and search queries, then get the combination of search pairs.
3. Given a search pair, initialize a Subreddit class and search relevant submissions within that subreddit.
4. For every relevant submission, extract relevant data from Submission class attributes and stream a tuple of record type and submission data dictionary
5. Subsequently, for every submission, request the top-level comments, and for each top-level comment, stream a tuple of record type and comment data dictionary.

#### Data Streaming Functions

`stream_comments`: This is a generator function that fetches comments from a given post and yields each comment as a tuple of the record type (`"comment"`) and a dictionary of comment content and metadata (e.g. body, timestamp, score).
<blockquote>

__Inputs:__ 
- Submission object from PRAW (i.e. Reddit posts)
- Integer for .replace_more limit parameter, default=0 (i.e. top/parent comments only)

__Output:__
- Generator of tuples in the format ("comment", {data_header: data_value}), one per comment
</blockquote>

`stream_submissions_and_comments`: This is a generator function that fetches submissions (posts) from a given subreddit using a predefined search query (i.e. keywords). For each submission, it yields the submission's comments via `stream_comments`, followed by a tuple of the record type (`"submission"`) and a dictionary of submission content and metadata.
<blockquote>

__Inputs:__ 
- String of Subreddit name
- String of search query
- Integer for limit of submissions yielded by PRAW subreddit search
- Additional keyword arguments passed through to the PRAW subreddit search

__Output:__
- Generator of ("comment", {data_header: data_value}) and ("submission", {data_header: data_value}) tuples
</blockquote>

`stream_aggregate_results`: This function is a wrapper for the `stream_submissions_and_comments` generator function. It takes a list of subreddit names and a list of search queries, and feeds every subreddit-query pair to the wrapped function. The submission limit for each pair is drawn at random between the minimum and maximum request values; pacing against the 100 requests/minute rate limit is handled by the rate limiter inside the wrapped functions.
<blockquote>

__Inputs:__ 
- List of subreddit name strings
- List of search query strings
- Int of maximum requests, which is the upper bound of the search result limit
- Int of minimum requests, which is the floor of the search result limit
- Bool on whether to show a progress bar over the subreddit-query pairs
- Additional keyword arguments passed through to the PRAW subreddit search

__Output:__
- Generator of (record type, data dict) tuples covering all subreddit-query pairs
</blockquote>

__Read more:__
1. [Rate Limiter - Sliding Window Counter](https://medium.com/@avocadi/rate-limiter-sliding-window-counter-7ec08dbe21d6)
2. [API Rate Limits Explained: Best Practices for 2025](https://orq.ai/blog/api-rate-limit)
3. [Exponential Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
4. [Yield Statements vs. Returning Lists in Python](https://community.aws/content/2h01Byx1ytU8357tp2bvcUuJ2j0/yield-statements-vs-returning-lists-in-python)

In [218]:
def stream_comments(submission:object, limit:int=0):
    """
    Fetches comments from a Submission object, then parses each comment into a dictionary record.
    Each entry is streamed for efficient memory footprint when handling larger CommentForests.
    """
    # Reserve a slot for the comment fetch (one request per submission when limit=0)
    # and delay execution when the rate limit has been reached
    rate_limiter.wait_for_slot()
    
    # Stream each comment as a ("comment", record) tuple
    for comment in fetch_comments(submission, limit=limit):
        yield "comment", {
            'comment_id':comment.id,
            'body':comment.body,
            'score':comment.score,
            'timestamp':int(comment.created_utc),
            'subreddit':comment.subreddit_name_prefixed,
            'parent_submission_id':submission.id
            }

In [219]:
def stream_submissions_and_comments(subreddit_name:str, query:str, limit:int=50, **search_kwargs):
    """
    Fetches submissions, parses each submission into a dictionary record, and calls the stream_comments
    function on each submission. Submission data and comment data are streamed for efficient memory 
    footprint when handling larger datasets. 
    """
    subreddit = reddit.subreddit(subreddit_name)
    
    # Reserve a slot for the search request and delay execution when rate limit reached
    rate_limiter.wait_for_slot()
    
    # Fetch submissions, and for every submission, fetch the comments
    for submission in fetch_submissions(subreddit=subreddit, query=query, limit=limit, **search_kwargs):
        # Stream comment data from current submission as ("comment", Dict[str, Any])
        yield from stream_comments(submission)
        
        # Stream submission data as ("submission", Dict[str, Any])
        yield "submission", {
            'submission_id':submission.id,
            'title':submission.title,
            'selftext':submission.selftext,
            'score':submission.score,
            'upvote_ratio':submission.upvote_ratio,
            'timestamp':int(submission.created_utc),
            'subreddit':submission.subreddit_name_prefixed,
            'num_comments':submission.num_comments
            }

In [220]:
def stream_aggregate_results(subreddits:List[str], 
                             queries:List[str],
                             max_requests:int=100, 
                             min_requests:int=50,
                             show_progress_bar:bool=False,
                             **search_kwargs):
    """
    Wrapper for streaming functions. Takes a list of subreddits and queries, then calls the 
    stream_submissions_and_comments function for each combination of subreddit and query. The 
    submission limit of each search is randomized between min_requests and max_requests; throttling
    against Reddit API rate limits is handled by the rate limiter in the wrapped functions.
    """
    assert isinstance(subreddits, list), "Argument 'subreddits' expects a list of subreddit names."
    assert isinstance(queries, list), "Argument 'queries' expects a list of search queries."
    
    search_pairs = product(subreddits, queries)
    
    if show_progress_bar:
        total_pairs = len(subreddits) * len(queries)
        search_pairs = tqdm(search_pairs, total=total_pairs, desc="Subreddit-Query Pairs")
    
    # Parse submission and comment data with a randomized submission limit per search
    for subreddit, query in search_pairs:
        # Random number of requested submissions per iteration to reduce predictability
        submission_limit = random.randint(min_requests, max_requests)
        
        # Stream data 
        yield from stream_submissions_and_comments(**search_kwargs, subreddit_name=subreddit, query=query, limit=submission_limit)
        

#### Text File Parser for Subreddit and Search Queries

`parse_txt_file`: Parses a text file containing items separated by newlines and returns them as a list. Keeping the search queries and subreddit names in separate text files means they can be changed without modifying source code.
<blockquote>

__Input:__
- String for the path of text file, with each item separated by a newline

__Output:__
- List (e.g. search queries, subreddit names)
</blockquote>
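
A minimal sketch of the expected file layout and parsing rule follows. The sample contents are illustrative, not the actual contents of `search_queries.txt`, and a temporary directory is used so that the example does not touch the repository.

```python
import os
import tempfile

# Illustrative file contents: '#' lines are comments, blank lines are ignored
sample = (
    "# Thematic queries\n"
    "affordable reliable used cars under 15k Australia\n"
    "\n"
    "# Model-specific queries\n"
    "used Toyota Corolla\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "search_queries.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    with open(path, "r", encoding="utf-8") as f:
        # Keep non-empty lines that are not comments
        items = [line.strip() for line in f if line.strip() and not line.startswith("#")]

print(items)
# ['affordable reliable used cars under 15k Australia', 'used Toyota Corolla']
```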

In [221]:
def parse_txt_file(file_path:str) -> List[str]:
    """
    Utility function for parsing a multi-line text file where each item is separated
    by a newline. Comment lines (starting with '#') and blank lines are skipped.
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        # Strip surrounding whitespace, then ignore comments and empty lines
        lines = (line.strip() for line in f)
        results = [line for line in lines if line and not line.startswith('#')]
    return results

#### Data Storage

Data is written to Parquet files using PyArrow, since Parquet's columnar compression results in a smaller footprint on disk, which is beneficial for larger datasets.

__Data Storage Logic:__
1. Store the file paths for submission data and comment data Parquet files
2. Define the schema for the parquet files to preserve data type on export, marginally improve write performance, and avoid silent errors
3. Initialize the data generator with the list of search pairs
4. Execute the data writer function given the generator and write to disk

`write_to_parquet`: This function takes the generator created when calling the `stream_aggregate_results`, a parquet config variable, and a buffer byte size target, then proceeds to write batched data to a Parquet file. Data is streamed from the generator, stored in a buffer, and when the buffer reaches the target memory size, the buffered data is converted to a Pyarrow RecordBatch that is fed to a ParquetWriter that writes to the predefined Parquet files.

<blockquote>

__Input:__
- Generator returned by calling either `stream_aggregate_results` or `stream_submissions_and_comments`
- Tuple of NamedTuples containing the record type, file path, and schema for each dataset
- Numeric value of the target buffer size in MiB (mebibytes)

__Output:__
- None; function writes Parquet files to the project repository's ../data directory
</blockquote>
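
The buffer-and-flush pattern can be sketched without PyArrow. In this simplified version, batches are collected in a list instead of being written to a Parquet file, and `batch_by_bytes` is a hypothetical helper name; the size estimate uses the compact JSON encoding of each record.

```python
import json

def batch_by_bytes(records, target_bytes):
    """Group records into batches, closing a batch once its estimated size reaches target_bytes."""
    batch, size = [], 0
    for record in records:
        batch.append(record)
        # Estimate the record size from its compact JSON encoding
        size += len(json.dumps(record, separators=(",", ":")).encode("utf-8"))
        if size >= target_bytes:
            yield batch
            batch, size = [], 0
    if batch:
        # Flush whatever is left once the stream is exhausted
        yield batch

# Each record below encodes to 30 bytes, so a 60-byte target closes a batch every 2 records
records = [{"id": str(i), "body": "x" * 10} for i in range(5)]
batches = list(batch_by_bytes(records, target_bytes=60))
print([len(b) for b in batches])  # [2, 2, 1]
```

The byte count is only an estimate of the in-memory size, but it is cheap to compute and keeps each write to a roughly predictable size.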

In [222]:
def write_to_parquet(data_stream: Generator, 
                     config: Tuple[NamedTuple]=(SUBMISSION_CONFIG, COMMENT_CONFIG), 
                     target_MB: int=8):
    """
    Streams data ("record type", record_dict) from an input generator, stores the data in a 
    per-record-type buffer, and writes to disk whenever a buffer reaches the target byte size 
    and once more when the stream is exhausted.
    """
    assert isinstance(data_stream, Generator), "data_stream must be a generator."
    assert isinstance(config, tuple) and all((isinstance(ntuple, ParquetConfig) for ntuple in config)), "Config must be a tuple containing ParquetConfig namedtuples."
    assert isinstance(target_MB, (int, float)) and target_MB > 0, "target_MB must be a positive non-zero numeric value."
    
    # Hashmap for buffers and schemas for O(1) lookup
    schemas = {ntuple.record_type : ntuple.schema for ntuple in config}
    buffers = {record_type : {col : [] for col in schemas[record_type].names} 
               for record_type in schemas}
    
    # Initialize byte counter and convert target_MB to bytes
    byte_counts = {ntuple.record_type : 0 for ntuple in config}
    TARGET_BYTES = int(target_MB * 2 ** 20)
    
    # Convert current buffer to RecordBatch and write to Parquet file
    # Clear the lists in each key of the current buffer
    # Then reset the byte count for current buffer
    def write_then_flush(record_type: str):
        """Convert buffer to Pyarrow container, write to Parquet, then flush buffer and byte count."""
        batch = pa.RecordBatch.from_pydict(buffers[record_type], schema=schemas[record_type])
        writers[record_type].write_batch(batch)
        for container in buffers[record_type].values():
            container.clear()
        byte_counts[record_type] = 0
    
    # Context manager where streamed data will be written to disk as Parquet files
    with ExitStack() as stack:

        # Initialize writers
        writers = {ntuple.record_type : stack.enter_context(pq.ParquetWriter(where=ntuple.file_path, schema=ntuple.schema)) 
                   for ntuple in config}
        
        # Stream the data, append to buffer, and update byte count
        # Convert buffer to RecordBatch when target byte size met
        for record_type, record in data_stream:
            buffer = buffers[record_type]
            
            for col in buffer:
                buffer[col].append(record.get(col))
            
            byte_counts[record_type] += len(json.dumps(record, separators=(",", ":")).encode("utf-8"))
            
            if byte_counts[record_type] >= TARGET_BYTES:
                write_then_flush(record_type)
                
        # Final write for remaining data in both buffers after streaming data
        # Only write if there are remaining records to avoid null records in Parquet file
        for record_type in buffers:
            if byte_counts[record_type] > 0:
                write_then_flush(record_type)

### Building and executing the data pipeline

In [215]:
# Data Streaming Prerequisites
## Initialize the rate limiter class
rate_limiter = RateLimiter()

## Parse text files containing search queries and subreddit names
search_queries = parse_txt_file("../src/search_queries.txt")
subreddits = parse_txt_file("../src/subreddits.txt")

# Data Storage Prerequisites
## Defining paths for parquet files
SUBMISSION_PATH = os.path.join('..','data','submission_data.parquet')
COMMENT_PATH = os.path.join('..','data','comment_data.parquet')

## Defining schemas for storing submission and comment data
SUBMISSION_SCHEMA = pa.schema([
    ("submission_id", pa.string()),
    ("title", pa.string()),
    ("selftext", pa.string()),
    ("score", pa.int64()),
    ("upvote_ratio", pa.float64()),
    ("timestamp", pa.timestamp("s")),
    ("subreddit", pa.string()),
    ("num_comments", pa.int32()),
])

COMMENT_SCHEMA = pa.schema([
    ("comment_id", pa.string()),
    ("body", pa.string()),
    ("score", pa.int64()),
    ("timestamp", pa.timestamp("s")),
    ("subreddit", pa.string()),
    ("parent_submission_id", pa.string()),
])

## Initialize namedtuple for Parquet Config for ease of access and immutability
ParquetConfig = namedtuple('ParquetConfig',['record_type','file_path','schema'])
SUBMISSION_CONFIG = ParquetConfig('submission', SUBMISSION_PATH, SUBMISSION_SCHEMA)
COMMENT_CONFIG = ParquetConfig('comment', COMMENT_PATH, COMMENT_SCHEMA)

In [204]:
time.time()

1750929823.4805272

In [194]:
# Only get a subset of search queries
search_queries = search_queries[0:1]
subreddits = subreddits[0:1]

print('-------Search Pairs-------')
# Pass the lists themselves to product() so that it pairs their elements
for (subreddit, query) in product(subreddits, search_queries):
    print(subreddit,"-",query)

-------Search Pairs-------
CarsAustralia - affordable reliable used cars under 15k Australia


In [None]:
# Assign the generator
agg_data_stream = stream_aggregate_results(subreddits=subreddits, queries=search_queries, show_progress_bar=True, max_requests=50, min_requests=20)

In [196]:
# Assign the generator
ind_data_stream = stream_submissions_and_comments(subreddit_name=subreddits[-1], query=search_queries[-1])

In [197]:
%%time
data_stream = agg_data_stream
# Write to parquet file
write_to_parquet(data_stream)

Subreddit-Query Pairs: 100%|██████████| 1/1 [24:22<00:00, 1462.85s/it]


ArrowTypeError: Expected np.float16 instance

In [200]:
%%time
# Parse sample parquet files
submissions_df = pd.read_parquet(SUBMISSION_CONFIG.file_path)
comments_df = pd.read_parquet(COMMENT_CONFIG.file_path)

CPU times: user 3.81 ms, sys: 3.7 ms, total: 7.51 ms
Wall time: 6 ms


In [201]:
submissions_df.head()


Unnamed: 0,submission_id,title,selftext,score,upvote_ratio,timestamp,subreddit,num_comments


In [202]:
comments_df.head()

Unnamed: 0,comment_id,body,score,timestamp,subreddit,parent_submission_id


## Storing the scraped data

### Formatting to a Pandas DataFrame

### Exporting DataFrame to a Parquet file for efficient storage

## Exploratory Data Analysis

## Test