### Problem Statement

Mental health challenges such as depression, anxiety, and stress are widespread, yet early detection remains difficult. Traditional screening methods rely on self-reporting or clinical visits, which many individuals avoid because of stigma or lack of access. Social media platforms like Reddit host millions of posts in which users openly discuss their struggles, but this data is underutilised for proactive risk detection. There is a need for automated, ethical, and explainable systems that can identify potential mental health risks from online text to support early intervention.

Current machine learning models for mental health detection often rely on small, curated datasets, limiting their generalizability. Reddit provides a rich, diverse source of user-generated text, but challenges remain in preprocessing noisy data, labelling risk indicators, and ensuring model transparency. The problem is how to design a reproducible pipeline that collects, cleans, and models Reddit text to classify mental health risk while respecting privacy and ethical boundaries.
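
To make the "cleans" step of such a pipeline concrete, here is a minimal sketch of a text-cleaning function for raw Reddit bodies. It is not the project's actual preprocessing: the function name `clean_reddit_text`, the `[user]` placeholder, and the specific regular expressions are assumptions chosen for illustration. It drops URLs, masks `u/username` mentions (a small privacy measure), strips common Markdown markup, and normalises whitespace and case.

```python
import re

# Illustrative patterns (assumptions, not taken from the project code).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"(?<!\w)/?u/[A-Za-z0-9_-]+")   # u/name or /u/name
MARKUP_RE = re.compile(r"[*~`#>]+")                   # common Markdown symbols
WHITESPACE_RE = re.compile(r"\s+")
PLACEHOLDERS = {"[deleted]", "[removed]"}             # Reddit's removed-content markers


def clean_reddit_text(text):
    """Return a lowercased, de-noised version of a Reddit post or comment."""
    if text is None or text.strip() in PLACEHOLDERS:
        return ""
    text = URL_RE.sub(" ", text)          # drop links
    text = USER_RE.sub("[user]", text)    # mask user mentions
    text = MARKUP_RE.sub(" ", text)       # strip Markdown emphasis/quote marks
    return WHITESPACE_RE.sub(" ", text).strip().lower()


print(clean_reddit_text("**Hi** u/some_user, see https://example.com/x  \n> quoted"))
```

Masking usernames before any modelling keeps identifiers out of the feature space, which is one simple way to respect the privacy boundary mentioned above.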

Despite advances in NLP, few tools exist that allow practitioners, researchers, or even individuals to interactively assess mental health risk in real time. The problem is the lack of accessible, user-friendly applications that can process social media text and provide interpretable risk indicators. Building a dashboard or API that highlights risky phrases and explains predictions could bridge the gap between research and practical use.

In [None]:
# Importing the data loader
from data_loader import load_data, get_data_info, MentalHealthDataLoader

# Load all datasets at once
all_datasets = load_data()

# Access specific datasets
adhd_comments = all_datasets['adhd_comments']
depression_data = all_datasets['depression_reddit']

# Get information about all datasets
dataset_info = get_data_info()
print(dataset_info)

# Using the class directly
loader = MentalHealthDataLoader()
adhd_data_only = loader.load_adhd_data()
mental_health_data_only = loader.load_mental_health_data()

{'adhd_comments': {'shape': (3356541, 5), 'columns': ['body', 'id', 'score', 'created_utc', 'created_datetime'], 'memory_usage': '2042.70 MB'}, 'adhd_posts': {'shape': (336066, 8), 'columns': ['title', 'selftext', 'score', 'id', 'url', 'num_comments', 'created_utc', 'created_datetime'], 'memory_usage': '413.30 MB'}, 'adhd_women_comments': {'shape': (202658, 5), 'columns': ['body', 'id', 'score', 'created_utc', 'created_datetime'], 'memory_usage': '149.22 MB'}, 'adhd_women_posts': {'shape': (44384, 8), 'columns': ['title', 'selftext', 'score', 'id', 'url', 'num_comments', 'created_utc', 'created_datetime'], 'memory_usage': '70.32 MB'}, 'conversation': {'shape': (3725, 3), 'columns': ['Unnamed: 0', 'question', 'answer'], 'memory_usage': '0.66 MB'}, 'depression_reddit': {'shape': (7731, 2), 'columns': ['clean_text', 'is_depression'], 'memory_usage': '3.15 MB'}, 'health_anxiety': {'shape': (1967, 350), 'columns': ['subreddit', 'author', 'date', 'post', 'automated_readability_index', 'colem
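
The raw dictionary printed above is hard to scan. The sketch below shows one way to turn it into a size-ordered summary. It assumes the structure visible in the output (a `shape` tuple and a `memory_usage` string of the form `"<number> MB"` per dataset); `summarize_info` and `sample_info` are hypothetical names, and the sample values are copied from three entries of the printed output rather than loaded from disk.

```python
def parse_mb(value):
    """Convert a string such as '2042.70 MB' to a float number of megabytes."""
    return float(value.split()[0])


def summarize_info(info):
    """Return (name, rows, cols, megabytes) tuples, largest dataset first."""
    rows = []
    for name, meta in info.items():
        n_rows, n_cols = meta["shape"]
        rows.append((name, n_rows, n_cols, parse_mb(meta["memory_usage"])))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows


# Three entries copied from the printed output, for illustration only.
sample_info = {
    "depression_reddit": {"shape": (7731, 2), "memory_usage": "3.15 MB"},
    "adhd_comments": {"shape": (3356541, 5), "memory_usage": "2042.70 MB"},
    "conversation": {"shape": (3725, 3), "memory_usage": "0.66 MB"},
}

for name, n_rows, n_cols, mb in summarize_info(sample_info):
    print(f"{name:<20} {n_rows:>10,} rows {n_cols:>4} cols {mb:>10.2f} MB")
```

In the notebook, the same function could be called on `dataset_info` directly, provided `get_data_info()` returns the dictionary shape shown above.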

In [None]:
from data_loader import load_data

# Verify that all datasets loaded successfully
all_datasets = load_data()
print(f"Successfully loaded {len(all_datasets)} datasets")

Successfully loaded 10 datasets
