# Mental Health Dataset Creation

This notebook builds a balanced mental health test dataset and an updated training dataset for use in Phase 3. The aim is to supplement the earlier data with more diverse sources, combining clinical and social media content, so that the model is more robust and generalizes better. The resulting datasets provide a stronger foundation for evaluating and retraining the classifier in the next modeling phase.


## 1. Import Libraries

In [225]:
import pandas as pd

print("Libraries imported successfully.")

Libraries imported successfully.


## 2. Load datasets

In [227]:
suicide_df = pd.read_csv("Suicide_Detection.csv")     
posts_train = pd.read_csv("posts_train.csv")       
posts_test = pd.read_csv("posts_test.csv")           

## 3. Inspect dataset shapes

In [229]:
print("Datasets loaded successfully.")
print("Suicide_Detection:", suicide_df.shape)
print("posts_train:", posts_train.shape)
print("posts_test:", posts_test.shape)

Datasets loaded successfully.
Suicide_Detection: (232074, 3)
posts_train: (13727, 4)
posts_test: (1488, 4)


## 4. Normalize column names
Renames the differing source columns ('post', 'class_name', 'class') so that all DataFrames share the same two names: 'text' for statements/posts and 'status' for class labels. All other columns are dropped.

In [231]:
# For Suicide Detection: use 'text' and 'class'
suicide_df = suicide_df[['text', 'class']].rename(columns={'class':'status'})

# For Posts: use 'post' and 'class_name'
posts_train = posts_train[['post', 'class_name']].rename(columns={'post': 'text', 'class_name': 'status'})
posts_test = posts_test[['post', 'class_name']].rename(columns={'post': 'text', 'class_name': 'status'})
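
The renaming step can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own code: the helper `to_common_schema`, the toy rows, and the extra columns `row_id` and `extra` are all hypothetical. The helper also checks that the expected columns exist before selecting them, which gives a clearer error if a CSV has a different layout.

```python
import pandas as pd

# Toy stand-ins for the real CSVs; rows and extra column names are invented.
suicide_toy = pd.DataFrame({
    "row_id": [0, 1],
    "text": ["example a", "example b"],
    "class": ["suicide", "non-suicide"],
})
posts_toy = pd.DataFrame({
    "post": ["example c"],
    "class_name": ["anxiety"],
    "extra": [1],
})

def to_common_schema(df, text_col, label_col):
    """Hypothetical helper: keep only the text/label columns, renamed to 'text'/'status'."""
    missing = {text_col, label_col} - set(df.columns)
    if missing:
        raise KeyError(f"Missing expected columns: {sorted(missing)}")
    return df[[text_col, label_col]].rename(
        columns={text_col: "text", label_col: "status"}
    )

suicide_norm = to_common_schema(suicide_toy, "text", "class")
posts_norm = to_common_schema(posts_toy, "post", "class_name")
print(list(suicide_norm.columns), list(posts_norm.columns))
```

With a shared schema, the frames can be concatenated without producing partially empty columns.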

## 5. Combine datasets
Concatenates suicide_df and posts_test into a single DataFrame (combined_df) to create a unified test set.

In [233]:
combined_df = pd.concat([suicide_df, posts_test], ignore_index=True)

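## 6. Normalize labels to four classes
Maps the label variants used by the source datasets (e.g., 'suicide', 'non-suicide', 'depressed') to four canonical class names: 'Suicidal', 'Normal', 'Depression', and 'Anxiety'. Labels without a mapping pass through unchanged apart from title-casing; they are removed later by the class filter, not here.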

In [235]:
def normalize_label(label):
    label = str(label).strip().title()
    mapping = {
        'Suicide': 'Suicidal',
        'Suicidal': 'Suicidal',
        'Non-Suicide': 'Normal',
        'Normal': 'Normal',
        'Depression': 'Depression',
        'Depressed': 'Depression',
        'Anxiety': 'Anxiety',
    }
    return mapping.get(label, label)

combined_df['status'] = combined_df['status'].apply(normalize_label)

*Decision Note: Label normalization maps each source's label vocabulary onto one shared set of class names, so that filtering, balancing, and later modeling treat equivalent labels (e.g., 'suicide' and 'Suicidal') as the same class.*
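
To make the behavior of `normalize_label` concrete, the sketch below runs it on a few raw values. The function body is copied from the cell above; the input labels (including the unmapped 'adhd') are invented for illustration. It shows two edge cases: unmapped labels come back title-cased but otherwise unchanged, and a missing value becomes the string 'Nan' because of the `str(...)` call.

```python
def normalize_label(label):
    label = str(label).strip().title()
    mapping = {
        'Suicide': 'Suicidal',
        'Suicidal': 'Suicidal',
        'Non-Suicide': 'Normal',
        'Normal': 'Normal',
        'Depression': 'Depression',
        'Depressed': 'Depression',
        'Anxiety': 'Anxiety',
    }
    return mapping.get(label, label)

# Invented raw labels covering case, whitespace, an unmapped class, and a missing value.
raw = ["suicide", " non-suicide ", "DEPRESSED", "anxiety", "adhd", float("nan")]
normalized = [normalize_label(x) for x in raw]
print(normalized)
```

Because missing labels turn into the string 'Nan', a later `dropna` on 'status' would not catch them; they are only removed if a subsequent step filters on the four target classes.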

## 7. Filtering Required Classes
Filters dataset to only include rows with the four desired classes; prints the class distribution.

In [237]:
target_classes = ['Suicidal', 'Depression', 'Anxiety', 'Normal']
filtered_df = combined_df[combined_df['status'].isin(target_classes)].copy()

print("Class distribution before balancing:")
print(filtered_df['status'].value_counts())

Class distribution before balancing:
status
Suicidal      116037
Normal        116037
Depression       248
Anxiety          248
Name: count, dtype: int64
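
When filtering with `isin`, it can be useful to see which labels are being discarded, since a typo in a label would silently remove rows. The following is a small sketch on invented data; `keep_mask`, `dropped_counts`, and the toy labels are illustrative names, not part of the notebook.

```python
import pandas as pd

# Invented rows: three target-class labels plus two that should be filtered out.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "status": ["Suicidal", "Normal", "Adhd", "Anxiety", "Nan"],
})
target_classes = ["Suicidal", "Depression", "Anxiety", "Normal"]

keep_mask = toy["status"].isin(target_classes)

# Count the labels that fall outside the target classes before discarding them.
dropped_counts = toy.loc[~keep_mask, "status"].value_counts()
filtered_toy = toy[keep_mask].copy()

print("Dropped labels:")
print(dropped_counts)
print("Kept rows:", len(filtered_toy))
```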


## 8. Remove duplicates and missing values
Drops rows with missing 'text' or 'status' values and removes duplicate text entries.

In [239]:
filtered_df = filtered_df.dropna(subset=['text', 'status'])
filtered_df = filtered_df.drop_duplicates(subset=['text'])
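
`drop_duplicates(subset=['text'])` keeps the first occurrence of each text by default, so if the same text appears with two different labels, the label that survives depends on row order. A minimal sketch of how one might detect such conflicts before deduplicating, using invented rows:

```python
import pandas as pd

# Invented rows: one text appears twice with different labels, one row has no text.
toy = pd.DataFrame({
    "text": ["same post", "same post", "unique post", None],
    "status": ["Depression", "Anxiety", "Normal", "Normal"],
})

toy = toy.dropna(subset=["text", "status"])

# Number of distinct labels attached to each text; more than 1 means a conflict.
labels_per_text = toy.groupby("text")["status"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

# Default keep='first': the earliest row for each text is retained.
deduped = toy.drop_duplicates(subset=["text"])

print("Texts with conflicting labels:", conflicting)
print(deduped)
```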

## 9. Check results
Displays value counts and sample rows from the cleaned, filtered DataFrame for verification.

In [241]:
print(filtered_df['status'].value_counts())
print(filtered_df.head())

status
Suicidal      116037
Normal        116037
Depression       248
Anxiety          248
Name: count, dtype: int64
                                                text    status
0  Ex Wife Threatening SuicideRecently I left my ...  Suicidal
1  Am I weird I don't get affected by compliments...    Normal
2  Finally 2020 is almost over... So I can never ...    Normal
3          i need helpjust help me im crying so hard  Suicidal
4  I’m so lostHello, my name is Adam (16) and I’v...  Suicidal


## 10. Class Balancing
Downsamples every class to the size of the minority class so all are equally represented (248 per class).

In [242]:
desired_n = 248  # Match the minority class count

balanced_df = (
    filtered_df.groupby('status', group_keys=False)
    .apply(lambda x: x.sample(n=desired_n, random_state=42))
    .reset_index(drop=True)
)

print(balanced_df['status'].value_counts())
print(balanced_df.head())

status
Anxiety       248
Depression    248
Normal        248
Suicidal      248
Name: count, dtype: int64
                                                text   status
0  i don't understand whats wrong with me. i don'...  Anxiety
1  usually when i have anxiety just chatting with...  Anxiety
2  well, i've had anxiety and panic syndrome for ...  Anxiety
3  for the most minimal of things, like standing ...  Anxiety
4  i stay away from family and live with my roomm...  Anxiety
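
In recent pandas releases, `groupby(...).apply(...)` can emit a deprecation warning about operating on the grouping columns. One alternative is the built-in `DataFrameGroupBy.sample`, sketched below on invented data. The rows it draws may differ from those of the `apply`-based version even with the same `random_state`, so treat it as an alternative approach and not as a way to reproduce the exact sample above.

```python
import pandas as pd

# Invented, imbalanced toy data: 6 'Normal' rows and 4 'Anxiety' rows.
toy = pd.DataFrame({
    "text": [f"post {i}" for i in range(10)],
    "status": ["Normal"] * 6 + ["Anxiety"] * 4,
})

# Downsample every class to the size of the smallest class.
n_per_class = int(toy["status"].value_counts().min())

balanced_toy = (
    toy.groupby("status")
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)

print(balanced_toy["status"].value_counts())
```

Deriving `n_per_class` from the data avoids hard-coding the minority count, which would need updating if the inputs change.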


## 11. Save combined dataset for model testing

In [244]:
balanced_df.to_csv("mental_health_combined_test.csv", index=False, encoding='utf-8')
print("Combined test dataset saved as 'mental_health_combined_test.csv'")

Combined test dataset saved as 'mental_health_combined_test.csv'


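## 12. Save combined dataset for model training
Combines posts_train with suicide_df, normalizes the labels, samples each of the four classes down to the same size (the smaller of the Depression and Anxiety counts, capped at 2400), shuffles the rows, and saves the result.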

In [246]:
# Combine the posts training split with the suicide detection data
combined_train = pd.concat([posts_train, suicide_df], ignore_index=True)
combined_train['status'] = combined_train['status'].apply(normalize_label)
target_classes = ['Suicidal', 'Normal', 'Depression', 'Anxiety']

# Per-class sample size: the smaller of the Depression/Anxiety counts, capped at 2400
sample_size = min(
    combined_train['status'].value_counts()[['Depression', 'Anxiety']].min(),
    2400,
)

# Sample equally from each class
balanced_samples = []
for cls in target_classes:
    cls_subset = combined_train[combined_train['status'] == cls]
    # If class has enough samples, randomly sample without replacement; else, use all available
    balanced_cls = cls_subset.sample(n=sample_size, random_state=42) if len(cls_subset) >= sample_size else cls_subset
    balanced_samples.append(balanced_cls)

balanced_train = pd.concat(balanced_samples, ignore_index=True)

# Shuffle the balanced dataframe
balanced_train = balanced_train.sample(frac=1, random_state=42).reset_index(drop=True)

# Save to CSV
balanced_train.to_csv("mental_health_balanced_train.csv", index=False, encoding='utf-8')
print("Balanced training data saved as 'mental_health_balanced_train.csv'")

Balanced training data saved as 'mental_health_balanced_train.csv'
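
Because suicide_df feeds both combined_df (the test set) and combined_train, the same text can end up in both saved files. The sketch below shows one way to measure and, if desired, remove such overlap. It uses invented toy frames; `train_toy`, `test_toy`, and `train_no_overlap` are illustrative names and this check is not part of the original notebook.

```python
import pandas as pd

# Invented toy frames: the text "c" appears in both train and test.
train_toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "status": ["Normal", "Suicidal", "Normal"],
})
test_toy = pd.DataFrame({
    "text": ["c", "d"],
    "status": ["Normal", "Anxiety"],
})

# Texts that occur in both splits.
shared = set(train_toy["text"]) & set(test_toy["text"])
overlap_rate = len(shared) / len(test_toy)
print(f"{len(shared)} shared text(s); {overlap_rate:.0%} of test rows")

# Optionally drop the shared texts from the training side.
train_no_overlap = train_toy[~train_toy["text"].isin(test_toy["text"])]
print(train_no_overlap)
```

Applied to the real data, the same pattern would compare the 'text' columns of balanced_train and balanced_df.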
