# Used Car Analysis from Reddit Discussions

__Goal:__ Determine ideal used car models for my personal profile by leveraging crowd knowledge.

__Objectives:__
1. Establish access to the Reddit API via either PRAW or Requests
2. Extract and parse relevant submissions and respective top-level comments
3. Explore and analyze textual data
4. Build a recommender that takes into account my personal profile and returns a comprehensive discussion on the ideal used vehicle model/s for me.

__Motivation:__ As a first-time car buyer, it's easy to get lost in the sea of options in the used car market. The sheer number of brands, models, model years, and trims can make it daunting to do your own research. Personally, I gravitate to online forums when researching used cars because I believe in the "wisdom of the crowd": the idea that majority sentiment can be indicative of some measurable and verifiable phenomenon.

In the case of used cars, if a significant number of people praise a specific Japanese-manufactured vehicle model while citing its faults, an algorithm that sufficiently captures this data may be able to assign a high confidence level to that model while presenting an overview of common issues to be aware of. This substantially reduces the research workload: prospective buyers can trim their options down to a few models that fit their criteria, while being made aware of the salient points that still require extensive human due diligence (e.g. common issues, upkeep, regulations).

## Importing libraries

In [None]:
# ! pip install pandas praw prawcore python-dotenv pyarrow

In [None]:
import pandas as pd
import praw, prawcore, time, os, sys, functools, random
from dotenv import load_dotenv
from typing import List, Dict, Any
from collections.abc import Callable, Iterator
from itertools import product

# Load .env file for access keys
load_dotenv(os.path.join('..', 'config', '.env'))

# Import config.py to access environment variables
sys.path.append('../config')
from config import PRAW_ID, PRAW_SECRET, PRAW_USER_AGENT, PRAW_USERNAME, PRAW_PASSWORD

## Setting up access to Reddit API
Access keys for the Reddit API are stored in a .env file under the config directory of this repository. A template for the .env file is provided in the same directory.

The config.py script reads these environment variables and exposes them as the `PRAW_ID`, `PRAW_SECRET`, `PRAW_USER_AGENT`, `PRAW_USERNAME`, and `PRAW_PASSWORD` global variables.
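
The contents of config.py are not shown in this notebook, so the following is only a minimal sketch of how such a script could read the credentials. The helper names `read_praw_settings` and `missing_settings` are hypothetical, and the sketch assumes the .env keys share the names of the imported variables.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variable names taken from the import statement above; assumed to match the .env keys
PRAW_KEYS = ("PRAW_ID", "PRAW_SECRET", "PRAW_USER_AGENT", "PRAW_USERNAME", "PRAW_PASSWORD")

def read_praw_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Collect the PRAW credentials from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return {key: env.get(key) for key in PRAW_KEYS}

def missing_settings(settings: Mapping[str, Optional[str]]) -> List[str]:
    """List the credentials that are unset or empty, so a bad .env fails early and readably."""
    return [key for key, value in settings.items() if not value]

# Example with an explicit mapping instead of the real environment
settings = read_praw_settings({"PRAW_ID": "abc"})
print(missing_settings(settings))
```

Checking for missing values before calling `praw.Reddit` tends to give a clearer error than a failed authentication request later on.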

In [None]:
# Initialize PRAW 
reddit = praw.Reddit(
    client_id = PRAW_ID,
    client_secret = PRAW_SECRET,
    username = PRAW_USERNAME,
    password = PRAW_PASSWORD,
    user_agent = PRAW_USER_AGENT
)

## Extracting text data

### Utility Functions
The helper functions were designed to extract relevant data and metadata from Reddit submissions and comments, and package the data into a dict of dicts that can be easily parsed into a Pandas DataFrame object for further analysis.

`backoff_on_rate_limit`: A decorator factory that builds a decorator from the specified backoff parameters (max retries, base delay, delay cap, jitter). The resulting decorator wraps functions that call PRAW, such as `fetch_submissions` and `fetch_comments`, which call `subreddit.search()` and `submission.comments.replace_more()` respectively. It implements exponential backoff with optional full jitter to respect Reddit API rate limits while handling transient failures.

__Inputs:__
- Integer value for max retries. When attempts exceed this number, an Exception is raised
- Float for base delay in seconds (i.e. Delay at first failed attempt)
- Float for maximum delay in seconds
- Bool on whether to implement full jitter or not

__Outputs:__
- Decorator to be applied to a PRAW API request wrapper
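
To make the backoff parameters concrete, here is a small sketch of the delay schedule they produce. The helper `backoff_delay` is hypothetical and exists only for illustration; it uses the same formula as the decorator, `min(cap, base * 2 ** attempt)`, with full jitter drawing uniformly between 0 and that value.

```python
import random

def backoff_delay(attempt: int,
                  base_delay: float = 1.0,
                  cap_delay: float = 60.0,
                  jitter: bool = False) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    # Exponential growth, clipped at the cap
    delay = min(cap_delay, base_delay * 2 ** attempt)
    # Full jitter: pick any value between 0 and the computed delay
    return random.uniform(0, delay) if jitter else delay

# Deterministic schedule (no jitter) for the first seven attempts
schedule = [backoff_delay(attempt) for attempt in range(7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

With the default base of 1s and cap of 60s, the cap takes effect from the seventh attempt onward. Jitter spreads retries out so that repeated failures do not retry in lockstep.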

`parse_comments`: A utility function that fetches the comments of a given post and packages them as a dictionary of dictionaries, keyed by comment id, where each value is a dictionary of comment content and metadata (e.g. body, timestamp, score).

__Inputs:__ 
- Submission object from PRAW (i.e. Reddit posts)
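- Integer for the `.replace_more` limit parameter, default=0 (i.e. discard all `MoreComments` placeholders without fetching them; note that `comments.list()` still returns any replies that were loaded with the submission)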

__Output:__
- Dict of comments in the format {comment_id : {data_header: data_value}}

`parse_search_results`: A utility function that fetches submissions (posts) from a given subreddit using a predefined search query (i.e. keywords). Submissions are formatted into a dict of dicts with format {submission id : {data_header : data_value}}, and the comments of each submission are parsed with `parse_comments`. The function returns a tuple of submission data and comment data.

__Inputs:__ 
- String of Subreddit name
- String of search query
- Integer for limit of submissions yielded by PRAW subreddit search

__Output:__
- Tuple of submission data dict and comment data dict
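
The dict-of-dicts layout can be illustrated without touching the API. The sketch below uses `SimpleNamespace` stand-ins with made-up values in place of real PRAW comment objects, and only a subset of the fields. It also shows why keying by id is convenient: merging results with `dict.update` overwrites a repeated id instead of duplicating it, which matters when the same post is returned by more than one search query.

```python
from types import SimpleNamespace

# Stand-ins for PRAW comment objects; ids, bodies, and scores are placeholders
fake_comments = [
    SimpleNamespace(id="c1", body="Placeholder comment text", score=12),
    SimpleNamespace(id="c2", body="Another placeholder", score=3),
]

# Same shape as the parsed output: {comment_id: {data_header: data_value}}
comment_data = {
    c.id: {"body": c.body, "score": c.score, "parent_submission_id": "s1"}
    for c in fake_comments
}

# A second batch containing an id that was already seen (e.g. from another query)
second_batch = {
    "c2": {"body": "Another placeholder", "score": 4, "parent_submission_id": "s1"},
}
comment_data.update(second_batch)

print(len(comment_data))            # 2
print(comment_data["c2"]["score"])  # 4
```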

__More reading materials:__
1. [API Rate Limits Explained: Best Practices for 2025](https://orq.ai/blog/api-rate-limit)
2. [Exponential Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)

In [313]:
def backoff_on_rate_limit(max_retries:int=5, 
                        base_delay:float=1.0, 
                        cap_delay:float=60.0, 
                        jitter:bool=True) -> Callable:
    """
    Decorator factory that applies exponential backoff (with optional jitter)
    when Reddit API rate limits (HTTP 429) or server errors occur.
    Stops after max_retries and re-raises the exception.
    """
    def decorator(func: Callable):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Start with base delay, then exponentially scale by attempt
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
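                except prawcore.exceptions.ResponseException as e:
                    # Give up once max_retries retries have been used, keeping the original error as the cause
                    if attempt >= max_retries:
                        raise RuntimeError("Max retries exceeded with Reddit API.") from e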
                    delay = min(cap_delay, base_delay * 2 ** attempt)
                    if jitter:
                        delay = random.uniform(0, delay)
                    print(f"[WARNING] {e.__class__.__name__} on attempt {attempt+1}, retrying after {delay:.2f}s...")
                    time.sleep(delay)
                    attempt += 1
        return wrapper
    return decorator

@backoff_on_rate_limit()
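def fetch_submissions(subreddit:object, query:str, limit:int=100, **kwargs) -> list:
    """Wrap PRAW's subreddit search so that failed requests are retried with backoff."""
    # subreddit.search() is lazy: materialize the results here so that request errors
    # are raised inside the decorated call, where the backoff logic can catch them
    return list(subreddit.search(query=query, limit=limit, **kwargs))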

@backoff_on_rate_limit()
def fetch_comments(submission:object, limit:int=0) -> list:
    """Modify the comment fetch from PRAW to ensure adherence to safe request limits."""
    submission.comments.replace_more(limit=limit)
    return submission.comments.list()

In [314]:
def parse_comments(submission:object, limit:int=0):
    """Parses comments from a Reddit submission and packages it into a dict of dicts."""
    # Dict of dicts with format {comment_id : comment_info_dict}
    comment_data: Dict[str, Dict[str, Any]] = {}
    # Update comments dict with info dict 
    for comment in fetch_comments(submission, limit=limit):
        comment_data[comment.id] = {
            'body':comment.body,
            'score':comment.score,
            'timestamp':comment.created_utc,
            'subreddit':comment.subreddit_name_prefixed,
            'parent_submission_id':submission.id
        }
    return comment_data

In [343]:
def parse_search_results(subreddit_name:str, query:str, limit:int=50, **search_kwargs):
    """
    Fetch submissions/posts given a user-defined search query and returns a tuple of parsed submission
    data and comment data. The parsed data is a dict of dicts where each key is a submission/comment id 
    and value is a dict of content and metadata for that particular submission/comment.
    """
    sub = reddit.subreddit(subreddit_name)
    
    # Dict of dicts with format {id : data_dict}
    submission_data: Dict[str, Dict[str, Any]] = {}
    comment_data: Dict[str, Dict[str, Any]] = {}
    
    # Fetch submissions, and for every submission, fetch the comments
    for submission in fetch_submissions(**search_kwargs, subreddit=sub, query=query, limit=limit):
        # Update submissions dict with info dict from submission
        submission_data[submission.id] = {
            'title':submission.title,
            'selftext':submission.selftext,
            'score':submission.score,
            'upvote_ratio':submission.upvote_ratio,
            'timestamp':submission.created_utc,
            'subreddit':submission.subreddit_name_prefixed,
            'num_comments':submission.num_comments
            }
        # Update comments dict with comments from the current submission
        comment_data.update(parse_comments(submission))
        
    return (submission_data, comment_data)

### Provide an initial list of search queries and Subreddits

To scrape the relevant text data from Reddit, I created a small list of queries covering diverse topics relevant to buying affordable used vehicles. The queries combine location-specific, model-specific, and thematic keywords so that the search covers as much ground as possible. The chosen subreddits each have more than 100,000 subscribers, so that each search request is likely to yield a substantial number of results.

With a 10x10 array of queries and subreddits, I expect at least 100 initial requests for the subreddit searches, yielding 100 x 50 = 5,000 submissions at most.

Fetching the comments involves significantly more requests, as each submission requires 1 request to yield its CommentForest. Fetching the comments for 5,000 submissions will therefore require at least 5,000 requests.

__Expected Minimum API Requests__
|Search Requests|Comment Fetch Requests|Total Requests|
|:----------|:----------|:----------|
|100|5,000|5,100|

As such, a single batch job covering all query-subreddit combinations will issue at least 5,100 API requests in a single go, which far exceeds the Reddit API fair use policy (i.e. requests capped at 100/min, averaged over a 10-minute sliding window). To address this issue, batched processing is implemented to keep the average request rate under safe limits.
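
The request budget in the table can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes one request per search, one request per submission for comments, and a flat 100 requests per minute. The variable names are illustrative only.

```python
import math

n_subreddits, n_queries = 10, 10   # 10x10 search grid
submissions_per_search = 50        # default search limit
rate_limit_per_min = 100           # fair-use cap on requests per minute

search_requests = n_subreddits * n_queries                   # one request per search pair
comment_requests = search_requests * submissions_per_search  # one request per submission
total_requests = search_requests + comment_requests

# Shortest possible runtime if requests were sent exactly at the cap
min_minutes = math.ceil(total_requests / rate_limit_per_min)

print(search_requests, comment_requests, total_requests)  # 100 5000 5100
print(min_minutes)                                         # 51
```

Even when running right at the cap, the full grid would take roughly 51 minutes, which is why the work has to be spread out over time rather than issued in one burst.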

### Utility Functions

`parse_txt_file`: Parses a text file containing items separated by newlines and returns them as a list. This keeps the search queries and subreddit names in separate text files that can be edited without modifying source code.

__Input:__
- String for the path of text file, with each item separated by a newline

__Output:__
- List (e.g. search queries, subreddit names)
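
As an illustration of the expected file layout, the sketch below writes a small sample file to a temporary directory and reads it back with the same filtering rule (skip `#` comment lines and blank lines). The file contents are an assumed example, using two subreddit names that appear later in this notebook.

```python
import os
import tempfile

# Sample file: one item per line, with a comment line and a blank line to be ignored
sample = "# Subreddits to search\nCarsAustralia\n\nUsedCars\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "subreddits.txt")
    with open(path, "w") as f:
        f.write(sample)

    # Keep only non-empty lines that do not start with '#'
    with open(path, "r") as f:
        items = [line.strip() for line in f if line.strip() and not line.startswith("#")]

print(items)  # ['CarsAustralia', 'UsedCars']
```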

`aggregate_search_results`: A wrapper that calls `parse_search_results` for each subreddit-query pair formed from the input lists. Requests are tracked at every iteration and compared against the maximum requests per minute. If the expected total would exceed the rate limit, execution is paused for at least a minute to keep requests within safe limits. The number of submissions requested is also randomized per search pair to reduce the predictability of the scraping pattern.

__Inputs:__ 
- List of subreddit name strings
- List of search query strings
- Int of maximum requests per minute, also determines upper bound of search result limit
- Int of minimum requests, which is the floor of search result limit
- Pair of float values giving the lower and upper bound, in seconds, of the long delay (cooldown between search pairs); minimum of 60s

__Output:__
- Tuple of aggregated submissions dict and comments dict
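
The cooldown rule can be dry-run without sleeping or calling the API. The sketch below uses a hypothetical helper, `plan_cooldowns`, that applies the same check (running total plus the next pair's estimated requests compared against `max_requests`) and reports which search pairs would be preceded by a cooldown. The submission limits passed in are arbitrary example values.

```python
from typing import List

def plan_cooldowns(submission_limits: List[int], max_requests: int = 90) -> List[bool]:
    """For each search pair, return True if a cooldown would be triggered before it."""
    flags = []
    window_total = 0
    for limit in submission_limits:
        # 1 search request plus 1 comment request per submission
        required = 1 + limit
        if window_total + required > max_requests:
            flags.append(True)
            window_total = 0  # counter resets after the cooldown
        else:
            flags.append(False)
        window_total += required
    return flags

print(plan_cooldowns([60, 55, 20, 10]))  # [False, True, False, False]
print(plan_cooldowns([50, 50, 50, 50]))  # [False, True, True, True]
```

The second call shows a consequence of the defaults: with at least 50 submissions per pair and a cap of 90, every pair after the first triggers a cooldown, which is consistent with the three cooldowns logged for the four search pairs in the run below.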

In [316]:
def parse_txt_file(file_path:str):
    """
    Utility function for parsing a multi-line text file where each item is separated
    by a newline.
    Input: String for file path
    Output: List
    """
    with open(file_path, 'r') as f:
        # Ignore comment lines (starting with '#') and blank or whitespace-only lines
        results = [line.strip() for line in f if line.strip() and not line.startswith('#')]
    return results

In [None]:
def aggregate_search_results(subreddits:List[str], 
                             queries:List[str],
                             max_requests:int=90, 
                             min_requests:int=50,
                             delay:tuple = (60.0, 120.0),  # (min, max) cooldown in seconds; immutable default
                             **search_kwargs):
    """
    Wrapper for parsing functions. Takes a list of subreddits and queries, then calls the parsing 
    function for each combination of subreddit and query. Submission and comment results from each 
    inner function call is aggregated and returned as dictionaries for submissions and comments.
    
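    Jitter is implemented by randomizing the number of submissions requested per search pair,
    and a cooldown is triggered whenever the next pair would push the running request count
    past max_requests, to ensure that requests stay within safe rate limits.
    """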
    assert isinstance(subreddits, list), "Argument 'subreddits' expects a list of subreddit names."
    assert isinstance(queries, list), "Argument 'queries' expects a list of search queries."
    
    # Container variables for aggregate data
    agg_sub_data: Dict[str, Dict[str, Any]] = {}
    agg_comm_data: Dict[str, Dict[str, Any]] = {}
    
    # API request counter for triggering execution cooldown
    trace_requests = []
    total_requests = 0
    
    # Parse submission and comment data with jittered API calls
    for subreddit, query in product(subreddits, queries):
        # Keep the submission limit below max_requests so that the search request plus
        # one comment request per submission still fits within a single window
        submission_limit = random.randint(min_requests, max_requests - 1)
            
        # Calculate estimated requests from current iteration
        required_requests = 1 + submission_limit # 1 from search, N for comments from N submissions
        
        # Check if longer delay should be triggered on this iteration
        if total_requests + required_requests > max_requests:
            cooldown = random.uniform(*delay)
            print(f"Cooldown triggered: sleeping for {cooldown}s to avoid rate limit.")
            time.sleep(cooldown)
            total_requests = 0
        
        # Extract submissions from each subreddit-query pair
        submission_data, comment_data = parse_search_results(**search_kwargs, 
                                                            subreddit_name=subreddit, 
                                                            query=query, 
                                                            limit=submission_limit)
        agg_sub_data.update(submission_data)
        agg_comm_data.update(comment_data)
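        total_requests += required_requests
        # Record the per-pair estimate so that the sum below is the overall request count
        trace_requests.append(required_requests)
    
    # Build the trace outside the f-string: joining requires strings, not ints
    trace = ", ".join(str(n) for n in trace_requests)
    print(f"Execution finished: Total of {sum(trace_requests)} requests made with trace [{trace}].")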
    return (agg_sub_data, agg_comm_data)
        

In [338]:
# Parse text files containing search queries and subreddit names
search_queries = parse_txt_file("../src/search_queries.txt")
subreddits = parse_txt_file("../src/subreddits.txt")

### Fetching and parsing search results from Reddit used car communities

In [339]:
search_queries = search_queries[:2]
subreddits = subreddits[:2]
print('-------Search Pairs-------')
for (subreddit, query) in product(subreddits, search_queries):
    print(subreddit,"-",query)

-------Search Pairs-------
CarsAustralia - affordable reliable used cars under 15k Australia
CarsAustralia - affordable reliable used cars under 10k USA
UsedCars - affordable reliable used cars under 15k Australia
UsedCars - affordable reliable used cars under 10k USA


In [344]:
%%time
# Fetch search results and parse to dict of dicts
submission_data, comment_data = aggregate_search_results(subreddits=subreddits, queries=search_queries)

Cooldown triggered: sleeping for 109.40741414375071s to avoid rate limit.
Cooldown triggered: sleeping for 71.57586797057071s to avoid rate limit.
Cooldown triggered: sleeping for 65.00197525949584s to avoid rate limit.
CPU times: user 2.8 s, sys: 314 ms, total: 3.11 s
Wall time: 7min 44s


## Storing the scraped data

### Formatting to a Pandas DataFrame

In [348]:
submission_df = pd.DataFrame.from_dict(submission_data, orient='index')
comment_df = pd.DataFrame.from_dict(comment_data, orient='index')

### Exporting DataFrame to a Parquet file for efficient storage

In [349]:
submission_df.to_parquet(os.path.join("..","data","submission_data.parquet"), 
                         engine='pyarrow',
                         compression='gzip')

comment_df.to_parquet(os.path.join("..","data","comment_data.parquet"),
                      engine='pyarrow',
                      compression='gzip')

## Exploratory Data Analysis