# Used Car Analysis from Reddit Discussions

__Goal:__ Determine ideal used car models for my personal profile by leveraging crowd knowledge.

__Objectives:__
1. Establish access to Reddit API either via PRAW or Requests
2. Extract and parse relevant submissions and respective top-level comments
3. Explore and analyze textual data
4. Build a recommender that takes into account my personal profile and returns a comprehensive discussion on the ideal used vehicle model/s for me.

__Motivation:__ As a first-time car buyer, it's easy to get lost in the used car market. The sheer number of brands, models, model years, and trims can make it daunting to do your own research. Personally, I gravitate to online forums when researching used cars because I believe in the "wisdom of the crowd": the idea that majority sentiment may be indicative of some measurable and verifiable phenomenon.

In the case of used cars, if a significant number of people praise a specific Japanese-manufactured vehicle model while citing its faults, an algorithm that sufficiently captures this data may be able to assign a high confidence level to that model while presenting an overview of common issues to be aware of. This can reduce the research workload considerably: prospective buyers can trim their options down to a few models that fit their criteria, while being made aware of the salient points that require extensive due diligence (e.g. common issues, upkeep, regulations) that only humans can do.

## Importing libraries

In [None]:
# ! pip install pandas praw prawcore python-dotenv

In [None]:
import pandas as pd
import praw, prawcore, time, os, sys, functools, random
from dotenv import load_dotenv
from typing import List, Dict, Any
from collections.abc import Callable, Iterator

# Load .env file for access keys
load_dotenv(os.path.join('..', 'config', '.env'))

# Import config.py to access environment variables
sys.path.append('../config')
from config import PRAW_ID, PRAW_SECRET, PRAW_USER_AGENT, PRAW_USERNAME, PRAW_PASSWORD

## Setting up access to Reddit API
Access keys to Reddit API are stored in a .env file under the config directory of this repository. A template for the .env file is provided in the config directory.

The config.py script reads these environment variables and exposes them as the global variables `PRAW_ID`, `PRAW_SECRET`, `PRAW_USER_AGENT`, `PRAW_USERNAME`, and `PRAW_PASSWORD`.

In [None]:
# Initialize PRAW 
reddit = praw.Reddit(
    client_id = PRAW_ID,
    client_secret = PRAW_SECRET,
    username = PRAW_USERNAME,
    password = PRAW_PASSWORD,
    user_agent = PRAW_USER_AGENT
)

## Extracting text data

### Utility Functions
The helper functions were designed to extract relevant data and metadata from Reddit submissions and comments, and package the data into a dict of dicts that can be easily parsed into a Pandas DataFrame object for further analysis.

`backoff_on_rate_limit`: This is a decorator factory that builds a custom decorator based on specified backoff parameters (max retries, base delay, cap, jitter). The decorator itself is a wrapper for custom functions that call PRAW methods such as `fetch_submissions` and `fetch_comments`, which call subreddit.search() and submission.comments.replace_more() respectively. The decorator implements exponential backoff with optional full jitter to respect Reddit API rate limits while handling transient failures.

__Inputs:__
- Integer value for max retries. When attempts exceed this number, an Exception is raised
- Float for base delay in seconds (i.e. Delay at first failed attempt)
- Float for maximum delay in seconds
- Bool on whether to implement full jitter or not

__Outputs:__
- Decorator to be applied to a PRAW API request wrapper
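To make the backoff parameters concrete, the sketch below computes the delay before each retry in isolation, without any Reddit calls. `backoff_delay` is a hypothetical helper written for illustration; it uses the same formula as the decorator (exponential growth from the base delay, clipped at the cap, with optional full jitter).

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0, cap_delay: float = 60.0,
                  jitter: bool = True) -> float:
    """Delay in seconds before retry number `attempt` (0-indexed)."""
    # Exponential growth, clipped at the cap
    ceiling = min(cap_delay, base_delay * 2 ** attempt)
    # Full jitter: draw uniformly between 0 and the ceiling
    return random.uniform(0, ceiling) if jitter else ceiling


# Deterministic schedule without jitter
schedule = [backoff_delay(a, jitter=False) for a in range(7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

With jitter enabled, each delay is drawn from between 0 and the value shown, which spreads out retries from clients that failed at the same moment.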

`parse_comments`: This is a utility function that fetches comments from a given post and packages them as a dictionary of dictionaries, where each key is a comment id and each value is a dictionary of comment content and metadata (e.g. body, timestamp, upvotes).

__Inputs:__ 
- Submission object from PRAW (i.e. Reddit posts)
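- Integer for the `replace_more` limit parameter, default=0 (i.e. all "load more comments" placeholders are dropped without extra requests, so only comments returned by the initial request are kept)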

__Output:__
- Dict of comments in the format {comment_id : {data_header: data_value}}

`parse_search_results`: This is a utility function that fetches submissions (posts) from a given subreddit using a predefined search query (i.e. keywords). Submissions are formatted into a dict of dicts with format {submission id : {data_header : data_value}}. This function also wraps the comment fetching function and aggregates the comments from all submissions into a single dict. 

__Inputs:__ 
- String of Subreddit name
- String of search query
- Integer for limit of submissions yielded by PRAW subreddit search

__Output:__
- Tuple of submissions dict and comments dict
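As a sketch of why the dict-of-dicts layout is convenient, the example below loads a small, made-up comments dict into a Pandas DataFrame. The records are invented for illustration and only use a subset of the fields collected by the helper functions.

```python
import pandas as pd

# Made-up records in the {id: {data_header: data_value}} layout
comments = {
    "c1": {"body": "Reliable, but check the timing belt.", "score": 12, "timestamp": 1700000000.0},
    "c2": {"body": "Avoid the first model year.", "score": 5, "timestamp": 1700003600.0},
}

# orient="index" turns each outer key into a row label
df = pd.DataFrame.from_dict(comments, orient="index")
df.index.name = "comment_id"

# Reddit timestamps are Unix epoch seconds (UTC)
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s", utc=True)
print(df[["score", "timestamp"]])
```

Because the outer keys are unique ids, merging results from several searches with `dict.update` also removes duplicates before the DataFrame is built.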

__More reading materials:__
1. [API Rate Limits Explained: Best Practices for 2025](https://orq.ai/blog/api-rate-limit)
2. [Exponential Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)

In [237]:
def backoff_on_rate_limit(max_retries:int=5, 
                        base_delay:float=1.0, 
                        cap_delay:float=60.0, 
                        jitter:bool=True) -> Callable:
    """
    Decorator factory that applies exponential backoff (with optional full jitter)
    when Reddit API rate limits (HTTP 429) or server errors occur.
    Retries up to max_retries times, then raises a RuntimeError chained to the last error.
    """
    def decorator(func: Callable):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Number of retries made so far; also the exponent for the delay
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except prawcore.exceptions.ResponseException as e:
                    # >= (not >) so that at most max_retries retries are made
                    if attempt >= max_retries:
                        raise RuntimeError("Max retries exceeded with Reddit API.") from e
                    # Exponential delay, clipped at cap_delay
                    delay = min(cap_delay, base_delay * 2 ** attempt)
                    if jitter:
                        delay = random.uniform(0, delay)
                    print(f"[WARNING] {e.__class__.__name__} on attempt {attempt+1}, retrying after {delay:.2f}s...")
                    time.sleep(delay)
                    attempt += 1
        return wrapper
    return decorator

@backoff_on_rate_limit()
def fetch_submissions(subreddit:object, query:str, limit:int=100, **kwargs) -> list:
    """Wrap the subreddit search from PRAW so that failed requests are retried with backoff."""
    # subreddit.search() is lazy; list() forces the requests to run inside the decorated call
    return list(subreddit.search(query=query, limit=limit, **kwargs))

@backoff_on_rate_limit()
def fetch_comments(submission:object, limit:int=0) -> list:
    """Modify the comment fetch from PRAW to ensure adherence to safe request limits."""
    submission.comments.replace_more(limit=limit)
    return submission.comments.list()

In [None]:
def parse_comments(submission:object, limit:int=0):
    """
    Parses comments from a Reddit submission and packages them into a dict of dicts.
    Function is called within the main wrapper parse_search_results.
    """
    # Dict of dicts with format {comment_id : comment_info_dict}
    comment_data: Dict[str, Dict[str, Any]] = {}
    # Update comments dict with info dict 
    for comment in fetch_comments(submission, limit=limit):
        # Duplicate handling; Skip building throwaway dict
        if comment.id in comment_data:
            continue
        comment_data[comment.id] = {
            'body':comment.body,
            'score':comment.score,
            'timestamp':comment.created_utc,
            'subreddit':comment.subreddit_name_prefixed,
            'parent_submission_id':submission.id
        }
    return comment_data

In [290]:
def parse_submissions(subreddit_name:str, query:str, limit:int=50, **search_kwargs):
    """
    Fetch submissions/posts given a user-defined search query and returns a tuple of parsed submission
    data and Submission objects. The parsed data is a dict of dicts where each key is a submission id 
    and value is a dict of content and metadata for that particular submission. The object is cached for
    subsequent API calls for extracting comments.
    """
    sub = reddit.subreddit(subreddit_name)
    # Dict of dicts with format {id : data_dict}
    submission_data: Dict[str, Dict[str, Any]] = {}
    # List of Submission objects
    submissions: List[object] = []
    for submission in fetch_submissions(**search_kwargs, subreddit=sub, query=query, limit=limit):
        # Duplicate handling; Skip building throwaway dict
        if submission.id in submission_data:
            continue
        # Update submissions dict with info dict from submission
        submission_data[submission.id] = {
            'title':submission.title,
            'selftext':submission.selftext,
            'score':submission.score,
            'upvote_ratio':submission.upvote_ratio,
            'timestamp':submission.created_utc,
            'subreddit':submission.subreddit_name_prefixed,
            'num_comments':submission.num_comments
            }
        submissions.append(submission)
    return (submission_data, submissions)

In [None]:
def parse_search_results(subreddit_name:str, query:str, limit:int=50, **search_kwargs):
    """
    Fetch submissions/posts and respective comments from a subreddit given a user-defined
    search query and returns a tuple of relevant data from submissions and comments. Each dataset 
    is formatted as a dict of dicts in the form {id : {data_header : data_value}} for submissions 
    and comments.
    """
    sub = reddit.subreddit(subreddit_name)
    # Dict of dicts with format {id : data_dict}
    submission_data: Dict[str, Dict[str, Any]] = {}
    comment_data: Dict[str, Dict[str, Any]] = {}
    # Perform subreddit search given query
    for submission in fetch_submissions(**search_kwargs, subreddit=sub, query=query, limit=limit):
        # Duplicate handling; Skip building throwaway dict
        if submission.id in submission_data:
            continue
        # Update submissions dict with info dict from submission
        submission_data[submission.id] = {
            'title':submission.title,
            'selftext':submission.selftext,
            'score':submission.score,
            'upvote_ratio':submission.upvote_ratio,
            'timestamp':submission.created_utc,
            'subreddit':submission.subreddit_name_prefixed,
            'num_comments':submission.num_comments
            }
        # Fetch comments from submission and update comments dict
        comment_data.update(parse_comments(submission))
    return (submission_data, comment_data)

### Provide an initial list of search queries and Subreddits

To scrape the relevant text data from Reddit, I created a small list of queries covering diverse topics relevant to buying affordable used vehicles. The queries combine location-specific, model-specific, and thematic keywords so that the search covers as much ground as possible. The chosen subreddits each have more than 100,000 subscribers, so that each search query is likely to yield a substantial number of results per API request.

With a 10x10 array of queries and subreddits, I expect 100 requests for the subreddit search, yielding at most 100 x 50 = 5,000 submissions.

Fetching the comments involves significantly more requests, as each submission requires one request to load its CommentForest. Fetching the comments will therefore require up to 5,000 requests.

__Expected API Requests (upper bound)__

| Search Requests | Comment Fetch Requests | Total Requests |
|:----------|:----------|:----------|
| 100 | 5,000 | 5,100 |

As such, a single batch job covering all query-subreddit combinations will issue up to 5,100 API requests in a single go, which, if sent back to back, far exceeds the Reddit API fair use policy (i.e. requests capped at 100/min, averaged over a 10-minute sliding window). To address this issue, the requests are processed in batches so that the average request rate stays under the safe limit.
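The request budget can be worked out explicitly. The sketch below assumes 10 queries, 10 subreddits, at most 50 submissions per search, one request per search, and one request per submission's comment tree; these are the assumptions stated in this section, not measured values.

```python
import math

n_queries, n_subreddits = 10, 10
submission_limit = 50        # submissions requested per search
rate_limit_per_min = 100     # fair-use ceiling in requests per minute

search_requests = n_queries * n_subreddits              # one request per subreddit-query pair
comment_requests = search_requests * submission_limit   # one request per submission, worst case
total_requests = search_requests + comment_requests

# Minimum wall-clock time if requests were spaced exactly at the ceiling
min_minutes = math.ceil(total_requests / rate_limit_per_min)
print(search_requests, comment_requests, total_requests, min_minutes)  # 100 5000 5100 51
```

Even at the ceiling, the full job needs close to an hour, which is why the delays between batches matter more than the raw request count.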

### Utility Functions

`parse_txt_file`: Parses a text file whose items are separated by newlines and returns them as a list. This keeps the search queries and subreddit names in separate text files that can be edited without modifying the source code.

__Input:__
- String for the path of text file, with each item separated by a newline

__Output:__
- List (e.g. search queries, subreddit names)
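The expected file layout can be illustrated with a throwaway file. In the sketch below, `read_items` is a hypothetical stand-in for the parsing helper, and the queries are made up; lines starting with `#` are treated as comments and blank lines are skipped.

```python
import os
import tempfile


def read_items(path: str) -> list:
    """Return non-empty, non-comment lines from a newline-separated file."""
    with open(path, "r", encoding="utf-8") as f:
        stripped = (line.strip() for line in f)
        return [s for s in stripped if s and not s.startswith("#")]


# Write a temporary file with made-up queries to show the expected layout
sample = "# model-specific\nused corolla reliability\n\n# thematic\nfirst car under 10k\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

items = read_items(tmp.name)
os.remove(tmp.name)
print(items)  # ['used corolla reliability', 'first car under 10k']
```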

`aggregate_search_results`: This function wraps the parse_search_results and feeds a combination of subreddit and query arguments from a list of subreddit and queries. It returns a tuple of dictionaries, similar to the parse_search_results function, but aggregated over all submissions and comments.

__Inputs:__ 
- List of subreddit name strings
- List of search query strings

__Output:__
- Tuple of aggregated submissions dict and comments dict
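Covering every combination of subreddit and query means iterating over their Cartesian product. The sketch below, with placeholder names, contrasts `itertools.product` with `zip`, which only pairs the two lists element by element and stops at the shorter one.

```python
from itertools import product

subreddits = ["subreddit_a", "subreddit_b"]
queries = ["query_1", "query_2", "query_3"]

pairs = list(product(subreddits, queries))   # every subreddit with every query
zipped = list(zip(subreddits, queries))      # element-wise pairing only

print(len(pairs))   # 6
print(zipped)       # [('subreddit_a', 'query_1'), ('subreddit_b', 'query_2')]
```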

In [264]:
def parse_txt_file(file_path:str):
    """
    Utility function for parsing a multi-line text file where each item is separated
    by a newline.
    Input: String for file path
    Output: List
    """
    with open(file_path, 'r') as f:
        # Strip surrounding whitespace, then ignore comments and empty lines
        lines = (line.strip() for line in f)
        results = [line for line in lines if line and not line.startswith('#')]
    return results

In [None]:
import itertools

def aggregate_search_results(subreddits:List[str], 
                             queries:List[str],
                             submission_limit:int, 
                             max_requests:int=90, 
                             min_requests:int=50,
                             jitter:bool=True, 
                             **search_kwargs):
    """
    Wrapper for parse_search_results that takes a list of subreddits and queries, then
    calls the parsing function for each combination of subreddit and query. Submission and comment
    results from each inner function call are aggregated and returned as respective aggregate
    dictionaries for submissions and comments.
    """
    agg_submission_data: Dict[str, Dict[str, Any]] = {}
    agg_comment_data: Dict[str, Dict[str, Any]] = {}
    
    # Implement jitter to remove predictability
    if jitter:
        # Whole number between min_requests and max_requests (inclusive)
        max_requests = random.randint(min_requests, max_requests)
    # Keep the number of submissions per search under the request ceiling
    submission_limit = min(submission_limit, max_requests)
    
    # product() covers every subreddit-query combination; zip() would only pair them element-wise
    for subreddit, query in itertools.product(subreddits, queries):
        # Extract submissions and comments data from search results
        submission_data, comment_data = parse_search_results(**search_kwargs, subreddit_name=subreddit, query=query, limit=submission_limit)
        # Update aggregate dictionaries with data from each subreddit query
        agg_submission_data.update(submission_data)
        agg_comment_data.update(comment_data)
    return (agg_submission_data, agg_comment_data)

In [None]:
import itertools

def parse_search_results(subreddits:List[str], 
                             queries:List[str],
                             submission_limit:int=50, 
                             max_requests:int=90, 
                             min_requests:int=50,
                             jitter:bool=True, 
                             short_delay:tuple=(0.0, 1.0),
                             long_delay:tuple=(60.0, 120.0),
                             **search_kwargs):
    """
    Wrapper for parsing functions. Takes a list of subreddits and queries, then calls the parsing 
    function for each combination of subreddit and query. Submission and comment results from each 
    inner function call are aggregated and returned as dictionaries for submissions and comments.
    
    Jitter randomizes the number of submissions requested per search, and a short delay between
    comment fetches helps keep the average request rate under safe limits.
    """
    assert isinstance(subreddits, list), "Argument 'subreddits' expects a list of subreddit names."
    assert isinstance(queries, list), "Argument 'queries' expects a list of search queries."
    
    # Container variables for aggregate data
    agg_sub_data: Dict[str, Dict[str, Any]] = {}
    agg_comm_data: Dict[str, Dict[str, Any]] = {}
    
    # Parse submission and comment data with jittered API calls
    # product() covers every subreddit-query combination; zip() would only pair them element-wise
    for subreddit, query in itertools.product(subreddits, queries):
        # Always ensure that submissions are under max API requests
        if jitter:
            # Whole number between min_requests and max_requests (inclusive);
            # note that this overrides submission_limit
            limit = random.randint(min_requests, max_requests)
        else:
            limit = min(submission_limit, max_requests)
        
        # Extract submissions from each subreddit-query pair
        submission_data, submissions = parse_submissions(**search_kwargs, 
                                                         subreddit_name=subreddit, 
                                                         query=query, 
                                                         limit=limit)
        agg_sub_data.update(submission_data)
        
        # Extract comments from each submission
        for submission in submissions:
            comment_data = parse_comments(submission)
            agg_comm_data.update(comment_data)
            # Short delay in extracting comments between each submission
            time.sleep(random.uniform(*short_delay))
        
        # Longer delay after each subreddit-query pair
        time.sleep(random.uniform(*long_delay))
    
    return (agg_sub_data, agg_comm_data)
        

In [288]:
max(random.uniform(50,90)//1,40)

66.0

In [None]:
# Parse text files containing search queries and subreddit names
search_queries = parse_txt_file("../src/search_queries.txt")
subreddits = parse_txt_file("../src/subreddits.txt")

# Preview the parsed search queries
search_queries

### Fetching and parsing search results from Reddit used car communities

In [None]:
%%time

sample_query = search_queries[0]
sample_subreddit = subreddits[0]

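# Fetch search results
# ~101 API requests for limit=100; 1 search request, 100 comment fetch requests
submission_data, submissions = parse_submissions(subreddit_name=sample_subreddit, query=sample_query, limit=100)

# Collect the comments of every returned submission into one dict
comment_data = {}
for submission in submissions:
    comment_data.update(parse_comments(submission))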