# Mental Health Detection Model

This notebook contains a complete pipeline for detecting signs of depression in short texts such as tweets. The model is trained on a cleaned Reddit depression dataset.

## Features:
- Text preprocessing pipeline
- TF-IDF vectorization
- Logistic Regression model (about 95.5% test accuracy)
- Prediction function for new tweets
- Model saving and loading capabilities


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import re
import string
import pickle
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report, roc_curve, auc, roc_auc_score

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

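# Download NLTK resources (only needed once)
try:
    nltk.download('stopwords', quiet=True)
    nltk.download('wordnet', quiet=True)
    nltk.download('omw-1.4', quiet=True)
except Exception as exc:
    # Report the problem instead of hiding it; cached resources may still work
    print(f"NLTK download skipped: {exc}")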

print("Libraries imported successfully!")


Libraries imported successfully!


## 1. Text Preprocessing Function

This function cleans and preprocesses text data to prepare it for model training and prediction.


In [2]:
# Initialize preprocessing components
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """
    Preprocess text by:
    1. Converting to lowercase
    2. Removing URLs
    3. Removing HTML tags
    4. Removing punctuation
    5. Removing numbers and non-alphabetic characters
    6. Removing stopwords
    7. Lemmatizing words
    
    Args:
        text (str): Input text to preprocess
        
    Returns:
        str: Preprocessed text
    """
    # Handle NaN and None values
    if pd.isna(text) or text is None:
        return ''
    
    text = str(text).lower()                                   # Lowercase
    text = re.sub(r"http\S+|www\S+", '', text)                 # Remove URLs
    text = re.sub(r"<.*?>", " ", text)                         # Remove HTML tags
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = text.split()
    words = [w for w in words if w.isalpha()]                  # Remove numbers & non-alphabetic
    words = [w for w in words if w not in stop_words]          # Remove stopwords
    words = [lemmatizer.lemmatize(w) for w in words]           # Lemmatize
    return ' '.join(words)

# Test the preprocessing function
test_text = "I'm feeling really down today. Can't stop thinking about negative things. https://example.com"
print("Original:", test_text)
print("Processed:", preprocess_text(test_text))


Original: I'm feeling really down today. Can't stop thinking about negative things. https://example.com
Processed: im feeling really today cant stop thinking negative thing
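
To see where each token in the processed output comes from, the cleaning steps can be traced one at a time. The sketch below is a simplified, standard-library-only stand-in: `trace_cleaning` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny illustrative subset rather than the NLTK list, and lemmatization is left out.

```python
import re
import string

# Assumption: a tiny stand-in stopword set, not the full NLTK English list.
DEMO_STOPWORDS = {"about", "down", "i", "the", "a"}

def trace_cleaning(text):
    """Return the intermediate result of each cleaning step, in order."""
    steps = {}
    text = text.lower()
    steps["lowercase"] = text
    text = re.sub(r"http\S+|www\S+", "", text)
    steps["no_urls"] = text
    text = text.translate(str.maketrans("", "", string.punctuation))
    steps["no_punctuation"] = text
    words = [w for w in text.split() if w.isalpha()]
    steps["alphabetic_tokens"] = words
    steps["no_stopwords"] = [w for w in words if w not in DEMO_STOPWORDS]
    return steps

steps = trace_cleaning("I'm feeling really down today. https://example.com")
for name, value in steps.items():
    print(f"{name}: {value!r}")
```

Note that stripping punctuation turns contractions such as "I'm" into "im", which is why that token appears in the processed text above.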


## 2.1. Data Visualization - Label Distribution


In [None]:
# Visualize label distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot (sort by label so the order matches the bar labels: 0, then 1)
label_counts = df['is_depression'].value_counts().sort_index()
axes[0].bar(['No Depression', 'Depression'], label_counts.values, color=['#4caf50', '#f44336'], alpha=0.7)
axes[0].set_title('Label Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_xlabel('Label', fontsize=12)
for i, v in enumerate(label_counts.values):
    axes[0].text(i, v + 50, str(v), ha='center', fontweight='bold', fontsize=12)

# Pie chart (sort by label so the order matches the slice labels: 0, then 1)
label_props = df['is_depression'].value_counts(normalize=True).sort_index()
colors = ['#4caf50', '#f44336']
axes[1].pie(label_props.values, labels=['No Depression', 'Depression'], autopct='%1.1f%%', 
            colors=colors, startangle=90, textprops={'fontsize': 12, 'fontweight': 'bold'})
axes[1].set_title('Label Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"Dataset is {'balanced' if abs(label_props[0] - label_props[1]) < 0.1 else 'imbalanced'}")
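
As a quick worked check of the balance rule used in this cell, the same arithmetic can be run on the label counts that the data-loading cell reports (3900 and 3830). This is a plain-Python sketch; the variable names are illustrative only.

```python
# Label counts as reported by the data-loading cell of this notebook.
counts = {0: 3900, 1: 3830}

total = sum(counts.values())
props = {label: n / total for label, n in counts.items()}

# Same rule as the plotting cell: balanced if the proportions differ by < 0.1.
gap = abs(props[0] - props[1])
is_balanced = gap < 0.1

print(f"Proportions: {props[0]:.6f} vs {props[1]:.6f}")
print(f"Gap: {gap:.4f} -> {'balanced' if is_balanced else 'imbalanced'}")
```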


## 2.2. Text Length Analysis


In [None]:
# Calculate text lengths
df['text_length'] = df['clean_text'].apply(lambda x: len(str(x).split()))
df['char_length'] = df['clean_text'].apply(lambda x: len(str(x)))

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Word count distribution by label
for label in [0, 1]:
    label_name = 'No Depression' if label == 0 else 'Depression'
    data = df[df['is_depression'] == label]['text_length']
    axes[0, 0].hist(data, bins=50, alpha=0.6, label=label_name, 
                    color='#4caf50' if label == 0 else '#f44336', edgecolor='black')
axes[0, 0].set_xlabel('Word Count', fontsize=12)
axes[0, 0].set_ylabel('Frequency', fontsize=12)
axes[0, 0].set_title('Word Count Distribution by Label', fontsize=14, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Box plot comparison
df_box = df[['is_depression', 'text_length']].copy()
df_box['is_depression'] = df_box['is_depression'].map({0: 'No Depression', 1: 'Depression'})
sns.boxplot(data=df_box, x='is_depression', y='text_length', ax=axes[0, 1], 
            palette=['#4caf50', '#f44336'])
axes[0, 1].set_title('Word Count Distribution (Box Plot)', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Label', fontsize=12)
axes[0, 1].set_ylabel('Word Count', fontsize=12)
axes[0, 1].grid(True, alpha=0.3)

# 3. Average text length by label
avg_lengths = df.groupby('is_depression')['text_length'].mean()
bars = axes[1, 0].bar(['No Depression', 'Depression'], avg_lengths.values, 
                      color=['#4caf50', '#f44336'], alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Average Word Count by Label', fontsize=14, fontweight='bold')
axes[1, 0].set_ylabel('Average Word Count', fontsize=12)
axes[1, 0].set_xlabel('Label', fontsize=12)
for i, v in enumerate(avg_lengths.values):
    axes[1, 0].text(i, v + 2, f'{v:.1f}', ha='center', fontweight='bold', fontsize=12)
axes[1, 0].grid(True, alpha=0.3, axis='y')

# 4. Character length distribution
for label in [0, 1]:
    label_name = 'No Depression' if label == 0 else 'Depression'
    data = df[df['is_depression'] == label]['char_length']
    axes[1, 1].hist(data, bins=50, alpha=0.6, label=label_name,
                    color='#4caf50' if label == 0 else '#f44336', edgecolor='black')
axes[1, 1].set_xlabel('Character Count', fontsize=12)
axes[1, 1].set_ylabel('Frequency', fontsize=12)
axes[1, 1].set_title('Character Count Distribution by Label', fontsize=14, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print statistics
print("\n=== Text Length Statistics ===")
print(f"\nOverall Statistics:")
print(df['text_length'].describe())
print(f"\nBy Label:")
print(df.groupby('is_depression')['text_length'].describe())


## 2.3. Word Frequency Analysis & Word Clouds


In [None]:
# Analyze most common words by label
def get_top_words(texts, n=20):
    """Get top N words from a list of texts"""
    all_words = []
    for text in texts:
        words = str(text).lower().split()
        all_words.extend(words)
    word_freq = Counter(all_words)
    return word_freq.most_common(n)

# Get top words for each class
depression_texts = df[df['is_depression'] == 1]['processed_text']
no_depression_texts = df[df['is_depression'] == 0]['processed_text']

top_depression_words = get_top_words(depression_texts, 20)
top_no_depression_words = get_top_words(no_depression_texts, 20)

# Visualize top words
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Depression words
dep_words, dep_counts = zip(*top_depression_words)
axes[0].barh(range(len(dep_words)), dep_counts, color='#f44336', alpha=0.7)
axes[0].set_yticks(range(len(dep_words)))
axes[0].set_yticklabels(dep_words, fontsize=10)
axes[0].set_xlabel('Frequency', fontsize=12)
axes[0].set_title('Top 20 Words - Depression Class', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3, axis='x')

# No Depression words
no_dep_words, no_dep_counts = zip(*top_no_depression_words)
axes[1].barh(range(len(no_dep_words)), no_dep_counts, color='#4caf50', alpha=0.7)
axes[1].set_yticks(range(len(no_dep_words)))
axes[1].set_yticklabels(no_dep_words, fontsize=10)
axes[1].set_xlabel('Frequency', fontsize=12)
axes[1].set_title('Top 20 Words - No Depression Class', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

# Word Clouds
print("\nGenerating Word Clouds...")
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Depression word cloud
depression_text = ' '.join(depression_texts.astype(str))
wordcloud_dep = WordCloud(width=800, height=400, background_color='white', 
                          colormap='Reds', max_words=100).generate(depression_text)
axes[0].imshow(wordcloud_dep, interpolation='bilinear')
axes[0].axis('off')
axes[0].set_title('Word Cloud - Depression Class', fontsize=16, fontweight='bold', pad=20)

# No Depression word cloud
no_depression_text = ' '.join(no_depression_texts.astype(str))
wordcloud_no_dep = WordCloud(width=800, height=400, background_color='white', 
                            colormap='Greens', max_words=100).generate(no_depression_text)
axes[1].imshow(wordcloud_no_dep, interpolation='bilinear')
axes[1].axis('off')
axes[1].set_title('Word Cloud - No Depression Class', fontsize=16, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()


## 3.1. TF-IDF Feature Matrix Visualization


In [None]:
# Visualize TF-IDF feature statistics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Calculate feature statistics
feature_means = np.array(X_tfidf.mean(axis=0)).flatten()
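# Sparse matrices have no .std(); derive it per feature from E[x^2] - E[x]^2
feature_sq_means = np.array(X_tfidf.multiply(X_tfidf).mean(axis=0)).flatten()
feature_stds = np.sqrt(np.maximum(feature_sq_means - feature_means ** 2, 0))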

# 1. Feature mean distribution
axes[0].hist(feature_means, bins=50, color='#2196F3', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Mean TF-IDF Value', fontsize=12)
axes[0].set_ylabel('Number of Features', fontsize=12)
axes[0].set_title('Distribution of Feature Means', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].axvline(x=np.mean(feature_means), color='red', linestyle='--', linewidth=2, 
                label=f'Overall Mean: {np.mean(feature_means):.4f}')
axes[0].legend()

# 2. Feature standard deviation distribution
axes[1].hist(feature_stds, bins=50, color='#ff9800', alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Standard Deviation of TF-IDF Values', fontsize=12)
axes[1].set_ylabel('Number of Features', fontsize=12)
axes[1].set_title('Distribution of Feature Standard Deviations', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].axvline(x=np.mean(feature_stds), color='red', linestyle='--', linewidth=2,
                label=f'Overall Mean: {np.mean(feature_stds):.4f}')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"\n=== TF-IDF Feature Statistics ===")
print(f"Total features: {X_tfidf.shape[1]}")
print(f"Mean TF-IDF value: {np.mean(feature_means):.4f}")
print(f"Std TF-IDF value: {np.mean(feature_stds):.4f}")
print(f"Max TF-IDF value: {X_tfidf.max():.4f}")
print(f"Min TF-IDF value: {X_tfidf.min():.4f}")
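
The spread of a single TF-IDF feature can be obtained from two means, using Var(x) = E[x^2] - (E[x])^2, which avoids converting a sparse matrix to a dense one. The sketch below checks that identity on one made-up feature column using only the standard library; the values are illustrative, not taken from the dataset.

```python
import math
import statistics

# Assumption: a made-up feature column, mostly zeros like a typical TF-IDF column.
column = [0.0, 0.0, 0.5, 0.0, 0.3]

mean = sum(column) / len(column)
mean_of_squares = sum(v * v for v in column) / len(column)

# Population standard deviation from the two moments; clamp tiny negatives
# that can appear from floating-point rounding.
std_from_moments = math.sqrt(max(mean_of_squares - mean ** 2, 0.0))

print(f"std from moments: {std_from_moments:.6f}")
print(f"statistics.pstdev: {statistics.pstdev(column):.6f}")
```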


## 4.1. Train-Test Split Visualization


In [None]:
# Visualize train-test split
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Train set distribution
train_counts = pd.Series(y_train).value_counts().sort_index()  # order: 0, then 1
axes[0].bar(['No Depression', 'Depression'], train_counts.values, 
            color=['#4caf50', '#f44336'], alpha=0.7, edgecolor='black')
axes[0].set_title(f'Train Set Distribution (n={len(y_train)})', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_xlabel('Label', fontsize=12)
for i, v in enumerate(train_counts.values):
    axes[0].text(i, v + 20, f'{v}\n({v/len(y_train)*100:.1f}%)', 
                ha='center', fontweight='bold', fontsize=11)
axes[0].grid(True, alpha=0.3, axis='y')

# Test set distribution
test_counts = pd.Series(y_test).value_counts().sort_index()  # order: 0, then 1
axes[1].bar(['No Depression', 'Depression'], test_counts.values, 
            color=['#4caf50', '#f44336'], alpha=0.7, edgecolor='black')
axes[1].set_title(f'Test Set Distribution (n={len(y_test)})', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_xlabel('Label', fontsize=12)
for i, v in enumerate(test_counts.values):
    axes[1].text(i, v + 5, f'{v}\n({v/len(y_test)*100:.1f}%)', 
                ha='center', fontweight='bold', fontsize=11)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"\n=== Train-Test Split Summary ===")
print(f"Training samples: {len(y_train)} ({len(y_train)/len(y)*100:.1f}%)")
print(f"Test samples: {len(y_test)} ({len(y_test)/len(y)*100:.1f}%)")
print(f"\nTrain set is balanced: {abs(train_counts[0]/len(y_train) - train_counts[1]/len(y_train)) < 0.05}")
print(f"Test set is balanced: {abs(test_counts[0]/len(y_test) - test_counts[1]/len(y_test)) < 0.05}")


## 2. Load and Prepare Data


## 5.1. Model Performance Visualization


In [None]:
# Create comprehensive model performance visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Confusion Matrix (Heatmap)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0],
            xticklabels=['No Depression', 'Depression'],
            yticklabels=['No Depression', 'Depression'],
            cbar_kws={'label': 'Count'})
axes[0, 0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('True Label', fontsize=12)
axes[0, 0].set_xlabel('Predicted Label', fontsize=12)

# Add percentages
cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
for i in range(2):
    for j in range(2):
        axes[0, 0].text(j+0.5, i+0.7, f'{cm_percent[i, j]:.1f}%', 
                       ha='center', va='center', fontsize=10, color='red', fontweight='bold')

# 2. ROC Curve
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)

axes[0, 1].plot(fpr, tpr, color='#2196F3', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
axes[0, 1].plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--', label='Random Classifier')
axes[0, 1].set_xlim([0.0, 1.0])
axes[0, 1].set_ylim([0.0, 1.05])
axes[0, 1].set_xlabel('False Positive Rate', fontsize=12)
axes[0, 1].set_ylabel('True Positive Rate', fontsize=12)
axes[0, 1].set_title('ROC Curve', fontsize=14, fontweight='bold')
axes[0, 1].legend(loc="lower right")
axes[0, 1].grid(True, alpha=0.3)

# 3. Metrics Bar Chart
metrics = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1
}
bars = axes[1, 0].bar(metrics.keys(), metrics.values(), color=['#4caf50', '#2196F3', '#ff9800', '#9c27b0'], alpha=0.7, edgecolor='black')
axes[1, 0].set_ylim([0, 1])
axes[1, 0].set_ylabel('Score', fontsize=12)
axes[1, 0].set_title('Model Performance Metrics', fontsize=14, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3, axis='y')
for i, (key, value) in enumerate(metrics.items()):
    axes[1, 0].text(i, value + 0.02, f'{value:.3f}', ha='center', fontweight='bold', fontsize=11)

# 4. Predicted depression probability, split by true label
# Both histograms share one axis (P(depression)) so the 0.5 line is meaningful
axes[1, 1].hist(y_pred_proba[y_test == 0], bins=30, alpha=0.6, 
                label='No Depression (True)', color='#4caf50', edgecolor='black')
axes[1, 1].hist(y_pred_proba[y_test == 1], bins=30, alpha=0.6, 
                label='Depression (True)', color='#f44336', edgecolor='black')
axes[1, 1].axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='Decision Threshold (0.5)')
axes[1, 1].set_xlabel('Predicted Probability of Depression', fontsize=12)
axes[1, 1].set_ylabel('Frequency', fontsize=12)
axes[1, 1].set_title('Prediction Probability Distribution', fontsize=14, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n=== Model Performance Summary ===")
print(f"ROC-AUC Score: {roc_auc:.4f}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


## 5.2. Feature Importance Analysis


In [None]:
# Get feature importance from Logistic Regression model
feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

# Get top features for depression (positive coefficients)
top_depression_features = sorted(zip(feature_names, coefficients), key=lambda x: x[1], reverse=True)[:20]
# Get top features for no depression (negative coefficients)
top_no_depression_features = sorted(zip(feature_names, coefficients), key=lambda x: x[1])[:20]

# Visualize feature importance
fig, axes = plt.subplots(1, 2, figsize=(18, 10))

# Top features indicating depression
dep_features, dep_coeffs = zip(*top_depression_features)
axes[0].barh(range(len(dep_features)), dep_coeffs, color='#f44336', alpha=0.7)
axes[0].set_yticks(range(len(dep_features)))
axes[0].set_yticklabels(dep_features, fontsize=10)
axes[0].set_xlabel('Coefficient Value', fontsize=12)
axes[0].set_title('Top 20 Features - Depression Indicators', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3, axis='x')
axes[0].axvline(x=0, color='black', linestyle='-', linewidth=1)

# Top features indicating no depression
no_dep_features, no_dep_coeffs = zip(*top_no_depression_features)
axes[1].barh(range(len(no_dep_features)), no_dep_coeffs, color='#4caf50', alpha=0.7)
axes[1].set_yticks(range(len(no_dep_features)))
axes[1].set_yticklabels(no_dep_features, fontsize=10)
axes[1].set_xlabel('Coefficient Value', fontsize=12)
axes[1].set_title('Top 20 Features - No Depression Indicators', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(True, alpha=0.3, axis='x')
axes[1].axvline(x=0, color='black', linestyle='-', linewidth=1)

plt.tight_layout()
plt.show()

print("\n=== Top Features for Depression Detection ===")
print("Positive coefficients (indicate depression):")
for feature, coeff in top_depression_features[:10]:
    print(f"  {feature}: {coeff:.4f}")
    
print("\nNegative coefficients (indicate no depression):")
for feature, coeff in top_no_depression_features[:10]:
    print(f"  {feature}: {coeff:.4f}")
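
To make the sign of a coefficient concrete: in binary logistic regression the predicted probability of class 1 is the sigmoid of the intercept plus the weighted sum of the features, so a positive coefficient pushes the probability up and a negative one pushes it down. The sketch below uses made-up terms, coefficients, and TF-IDF values purely for illustration; none of these numbers come from the trained model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumption: hypothetical intercept and coefficients, not the fitted values.
intercept = -0.5
coefs = {"hopeless": 2.0, "weekend": -1.5}

def depression_probability(tfidf_values):
    """P(class 1) for one document given its TF-IDF value per term."""
    z = intercept + sum(coefs[term] * tfidf_values.get(term, 0.0) for term in coefs)
    return sigmoid(z)

prob_hopeless = depression_probability({"hopeless": 0.6})
prob_weekend = depression_probability({"weekend": 0.6})
print(f"'hopeless' document: {prob_hopeless:.3f}")
print(f"'weekend' document:  {prob_weekend:.3f}")
```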


## 5.3. Error Analysis - Misclassified Samples


In [None]:
# Analyze misclassified samples
# Get test indices from the split - recreate the split to get indices
_, test_indices = train_test_split(range(len(df)), test_size=0.2, random_state=42, stratify=y)

# Get original text for misclassified samples
test_df = df.iloc[test_indices].copy()
test_df = test_df.reset_index(drop=True)

# Add predictions
test_df['predicted'] = y_pred
test_df['actual'] = y_test
test_df['predicted_proba'] = y_pred_proba

# False Positives (predicted depression but actually no depression)
false_positives = test_df[(test_df['predicted'] == 1) & (test_df['actual'] == 0)]
# False Negatives (predicted no depression but actually depression)
false_negatives = test_df[(test_df['predicted'] == 0) & (test_df['actual'] == 1)]

print(f"=== Error Analysis ===")
print(f"\nFalse Positives (Predicted: Depression, Actual: No Depression): {len(false_positives)}")
print(f"False Negatives (Predicted: No Depression, Actual: Depression): {len(false_negatives)}")

# Visualize error distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# False Positives probability distribution
if len(false_positives) > 0:
    axes[0].hist(false_positives['predicted_proba'], bins=20, color='#ff9800', alpha=0.7, edgecolor='black')
    axes[0].set_xlabel('Predicted Depression Probability', fontsize=12)
    axes[0].set_ylabel('Frequency', fontsize=12)
    axes[0].set_title(f'False Positives Distribution (n={len(false_positives)})', fontsize=14, fontweight='bold')
    axes[0].axvline(x=0.5, color='red', linestyle='--', linewidth=2, label='Decision Threshold')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

# False Negatives probability distribution
if len(false_negatives) > 0:
    axes[1].hist(false_negatives['predicted_proba'], bins=20, color='#9c27b0', alpha=0.7, edgecolor='black')
    axes[1].set_xlabel('Predicted Depression Probability', fontsize=12)
    axes[1].set_ylabel('Frequency', fontsize=12)
    axes[1].set_title(f'False Negatives Distribution (n={len(false_negatives)})', fontsize=14, fontweight='bold')
    axes[1].axvline(x=0.5, color='red', linestyle='--', linewidth=2, label='Decision Threshold')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Show sample misclassified texts
print("\n=== Sample False Positives (Predicted Depression, Actually No Depression) ===")
for idx, row in false_positives.head(5).iterrows():
    print(f"\nText: {row['clean_text'][:200]}...")
    print(f"Predicted Probability: {row['predicted_proba']:.3f}")

print("\n=== Sample False Negatives (Predicted No Depression, Actually Depression) ===")
for idx, row in false_negatives.head(5).iterrows():
    print(f"\nText: {row['clean_text'][:200]}...")
    print(f"Predicted Probability: {row['predicted_proba']:.3f}")


## 7.1. Prediction Visualization Examples


In [None]:
# Visualize predictions for sample tweets
test_tweets = [
    "I'm feeling really down today. Can't stop thinking about negative things. Life feels meaningless.",
    "Had a great day today! Went for a walk and met some friends. Feeling happy and energized!",
    "I don't want to get out of bed. Everything feels hopeless and I can't see a way out.",
    "Just finished a productive day at work. Looking forward to the weekend!",
    "I've been having suicidal thoughts lately. I don't know what to do anymore."
]

# Get predictions
results = []
for tweet in test_tweets:
    result = predict_mental_health(tweet, model=model, vectorizer=vectorizer, return_probability=True, depression_threshold=0.40)
    results.append(result)

# Visualize predictions
fig, axes = plt.subplots(len(results), 1, figsize=(14, 3*len(results)))
if len(results) == 1:
    axes = [axes]

for i, (tweet, result) in enumerate(zip(test_tweets, results)):
    probs = result['probabilities']
    colors = ['#4caf50' if result['prediction'] == 0 else '#f44336', 
              '#f44336' if result['prediction'] == 1 else '#4caf50']
    
    bars = axes[i].bar(['No Depression', 'Depression'], 
                       [probs['No depression'], probs['Depression']], 
                       color=colors, alpha=0.7, edgecolor='black')
    axes[i].set_ylabel('Probability (%)', fontsize=11)
    axes[i].set_ylim([0, 100])
    axes[i].set_title(f"Text: {tweet[:80]}...", fontsize=11, fontweight='bold', pad=10)
    axes[i].grid(True, alpha=0.3, axis='y')
    
    # Add value labels on bars
    for j, (label, prob) in enumerate(probs.items()):
        axes[i].text(j, prob + 2, f'{prob}%', ha='center', fontweight='bold', fontsize=10)
    
    # Add prediction label
    pred_label = result['label']
    pred_color = '#f44336' if result['prediction'] == 1 else '#4caf50'
    axes[i].text(0.5, 90, pred_label, ha='center', fontsize=12, fontweight='bold', 
                color=pred_color, bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

# Print detailed results
print("\n=== Detailed Prediction Results ===")
for i, (tweet, result) in enumerate(zip(test_tweets, results), 1):
    print(f"\n{i}. {tweet[:100]}...")
    print(f"   Prediction: {result['label']}")
    print(f"   Depression Probability: {result['depression_probability']}%")
    print(f"   Confidence: {result['confidence']}%")
    print(f"   Probabilities: {result['probabilities']}")


In [3]:
# Set base directory (adjust if needed)
base_dir = 'D:/mental_health_detector'  # Change this to your project path if different

# Load the processed dataset
data_path = os.path.join(base_dir, 'data/processed/depression_dataset_processed.csv')
raw_data_path = os.path.join(base_dir, 'data/raw/depression_dataset_reddit_cleaned.csv')

# Check if processed data exists, otherwise use raw data
if os.path.exists(data_path):
    df = pd.read_csv(data_path)
    print(f"Loaded processed dataset: {df.shape}")
    # If processed_text column exists, use it; otherwise process clean_text
    if 'processed_text' in df.columns:
        print("Using existing processed_text column")
    else:
        print("Processing clean_text column...")
        df['processed_text'] = df['clean_text'].apply(preprocess_text)
else:
    # Load raw data and process it
    print(f"Processed data not found. Loading raw data from {raw_data_path}")
    df = pd.read_csv(raw_data_path)
    print(f"Raw dataset shape: {df.shape}")
    df['processed_text'] = df['clean_text'].apply(preprocess_text)

# Clean the data - handle NaN values
print(f"\nChecking for missing values:")
print(f"Missing values in clean_text: {df['clean_text'].isna().sum()}")
print(f"Missing values in processed_text: {df['processed_text'].isna().sum()}")
print(f"Missing values in is_depression: {df['is_depression'].isna().sum()}")

# Fill NaN values in processed_text with empty string or reprocess clean_text
if df['processed_text'].isna().sum() > 0:
    print("\nFilling NaN values in processed_text...")
    # If processed_text has NaN, try to reprocess from clean_text
    mask = df['processed_text'].isna()
    df.loc[mask, 'processed_text'] = df.loc[mask, 'clean_text'].apply(preprocess_text)
    
# Remove rows where processed_text is still empty or NaN after processing
initial_shape = df.shape[0]
df = df[df['processed_text'].notna() & (df['processed_text'].str.strip() != '')]
df = df[df['is_depression'].notna()]  # Also remove rows with missing labels
final_shape = df.shape[0]

if initial_shape != final_shape:
    print(f"\nRemoved {initial_shape - final_shape} rows with empty or invalid processed text")

# Display dataset info
print(f"\nDataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['is_depression'].value_counts())
print(f"\nLabel proportions:")
print(df['is_depression'].value_counts(normalize=True))
print(f"\nFirst few rows:")
df.head()


Loaded processed dataset: (7731, 4)
Using existing processed_text column

Checking for missing values:
Missing values in clean_text: 0
Missing values in processed_text: 1
Missing values in is_depression: 0

Filling NaN values in processed_text...

Removed 1 rows with empty or invalid processed text

Dataset shape: (7730, 4)
Columns: ['clean_text', 'is_depression', 'text_length', 'processed_text']

Label distribution:
is_depression
0    3900
1    3830
Name: count, dtype: int64

Label proportions:
is_depression
0    0.504528
1    0.495472
Name: proportion, dtype: float64

First few rows:


| | clean_text | is_depression | text_length | processed_text |
|---|---|---|---|---|
| 0 | we understand that most people who reply immed... | 1 | 813 | understand people reply immediately op invitat... |
| 1 | welcome to r depression s check in post a plac... | 1 | 429 | welcome r depression check post place take mom... |
| 2 | anyone else instead of sleeping more when depr... | 1 | 45 | anyone else instead sleeping depressed stay ni... |
| 3 | i ve kind of stuffed around a lot in my life d... | 1 | 110 | kind stuffed around lot life delaying inevitab... |
| 4 | sleep is my greatest and most comforting escap... | 1 | 54 | sleep greatest comforting escape whenever wake... |


## 3. Feature Extraction with TF-IDF


In [13]:
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

# Ensure processed_text has no NaN values before vectorization
print("Final check for NaN values before vectorization:")
print(f"NaN count in processed_text: {df['processed_text'].isna().sum()}")
print(f"Empty string count: {(df['processed_text'].str.strip() == '').sum()}")

# Convert to list and ensure all are strings (no NaN)
processed_texts = df['processed_text'].fillna('').astype(str).tolist()

# Fit and transform the processed text
print("\nFitting TF-IDF vectorizer...")
X_tfidf = vectorizer.fit_transform(processed_texts)
y = df['is_depression'].values

print(f"\nTF-IDF matrix shape: {X_tfidf.shape}")
print(f"Number of features: {len(vectorizer.get_feature_names_out())}")
print(f"\nSample feature names: {vectorizer.get_feature_names_out()[:20]}")


Final check for NaN values before vectorization:
NaN count in processed_text: 0
Empty string count: 0

Fitting TF-IDF vectorizer...

TF-IDF matrix shape: (7730, 5000)
Number of features: 5000

Sample feature names: ['aa' 'ab' 'abandoned' 'abandonment' 'ability' 'able' 'able find'
 'able get' 'able go' 'absence' 'absolute' 'absolutely'
 'absolutely nothing' 'abt' 'abuse' 'abused' 'abusive'
 'abusive relationship' 'academic' 'accept']
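
For intuition about the values inside the TF-IDF matrix, the weighting can be reproduced by hand on a toy corpus. The sketch below follows scikit-learn's default formulation (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). It covers unigrams only and omits `max_features` and bigrams, and `toy_tfidf` and the corpus are illustrative assumptions.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

# Assumption: a made-up three-document corpus.
vocab, rows = toy_tfidf(["feel sad", "feel happy", "feel sad sad"])
for row in rows:
    print({w: round(v, 3) for w, v in zip(vocab, row) if v > 0})
```

The word "feel" appears in every document, so it gets the lowest IDF and contributes least to each row.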


## 4. Split Data into Train and Test Sets


In [4]:
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"\nLabel distribution in train:")
train_dist = pd.Series(y_train).value_counts(normalize=True).to_dict()
print(train_dist)
print(f"\nLabel distribution in test:")
test_dist = pd.Series(y_test).value_counts(normalize=True).to_dict()
print(test_dist)
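
The effect of `stratify=y` can be illustrated with a small standard-library sketch that splits each class separately, so both sets keep the original label proportions. This is not scikit-learn's implementation, and it will not reproduce the same indices; `stratified_split` is a hypothetical helper. The class sizes (3900 and 3830) are the ones reported for this dataset.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices per class so train and test keep the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 3900 + [1] * 3830
train_idx, test_idx = stratified_split(labels)

test_positives = sum(labels[i] for i in test_idx)
print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
print(f"Depression share in test: {test_positives / len(test_idx):.4f}")
```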



## 5. Train the Model


In [None]:
# Initialize and train Logistic Regression model
# Using class_weight='balanced' to handle any class imbalance better
model = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
print("Training model...")
model.fit(X_train, y_train)
print("Model training completed!")

# Predict on test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')

print(f"\n=== Model Performance ===")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

print(f"\n=== Classification Report ===")
print(classification_report(y_test, y_pred))

print(f"\n=== Confusion Matrix ===")
print(confusion_matrix(y_test, y_pred))


Training model...
Model training completed!

=== Model Performance ===
Accuracy: 0.9547
Precision: 0.9780
Recall: 0.9295
F1-Score: 0.9531

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.93      0.98      0.96       780
           1       0.98      0.93      0.95       766

    accuracy                           0.95      1546
   macro avg       0.96      0.95      0.95      1546
weighted avg       0.96      0.95      0.95      1546


=== Confusion Matrix ===
[[764  16]
 [ 54 712]]
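
The headline metrics follow directly from the confusion matrix above, where rows are true labels and columns are predicted labels. The worked example below recomputes them from the four printed counts; the variable names are chosen so they do not overwrite the notebook's own `accuracy`, `precision`, `recall`, and `f1`.

```python
# Counts copied from the confusion matrix printed above.
cm_reported = [[764, 16],
               [54, 712]]

tn, fp = cm_reported[0]  # true label 0: predicted 0, predicted 1
fn, tp = cm_reported[1]  # true label 1: predicted 0, predicted 1

acc_check = (tp + tn) / (tn + fp + fn + tp)
prec_check = tp / (tp + fp)   # of all predicted depression, how many were right
rec_check = tp / (tp + fn)    # of all true depression, how many were found
f1_check = 2 * prec_check * rec_check / (prec_check + rec_check)

print(f"Accuracy:  {acc_check:.4f}")
print(f"Precision: {prec_check:.4f}")
print(f"Recall:    {rec_check:.4f}")
print(f"F1-Score:  {f1_check:.4f}")
```

The 54 false negatives against 16 false positives explain why recall is lower than precision.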


## 6. Save Model and Vectorizer


In [16]:
# Set base directory (should match the one used above)
base_dir = 'D:/mental_health_detector'  # Change this to your project path if different

# Create models directory if it doesn't exist
models_dir = os.path.join(base_dir, 'models')
os.makedirs(models_dir, exist_ok=True)

# Save the model
model_path = os.path.join(models_dir, 'mental_health_model.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(model, f)
print(f"Model saved to {model_path}")

# Save the vectorizer
vectorizer_path = os.path.join(models_dir, 'tfidf_vectorizer.pkl')
with open(vectorizer_path, 'wb') as f:
    pickle.dump(vectorizer, f)
print(f"Vectorizer saved to {vectorizer_path}")

# Note: preprocess_text is not pickled; it must be defined wherever predictions are made
print("\nModel and vectorizer saved successfully!")


Model saved to D:/mental_health_detector\models\mental_health_model.pkl
Vectorizer saved to D:/mental_health_detector\models\tfidf_vectorizer.pkl

Model and vectorizer saved successfully!
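
A save-and-load round trip is a cheap way to confirm that a pickled artifact can be restored. The sketch below uses a plain dictionary as a stand-in for the model and writes to a temporary directory, so it does not touch the project folder; the object and file name are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Assumption: a stand-in object; in the notebook this would be the model or vectorizer.
artifact = {"threshold": 0.40, "labels": ["No depression", "Depression"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifact.pkl")
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print("Round trip OK:", restored == artifact)
```

Pickle files can execute arbitrary code when loaded, so only load files from a trusted source.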


## 7. Prediction Function for New Tweets

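This function predicts mental health status from a new tweet or text. It uses `os`, `pickle`, and `preprocess_text` from the earlier cells, so run those first; executed on its own in a fresh kernel, the cell fails with the `NameError` shown in its output.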


In [1]:
def predict_mental_health(text, model=None, vectorizer=None, return_probability=False, 
                          base_dir='D:/mental_health_detector', depression_threshold=0.40):
    """
    Predict mental health status from a text/tweet.
    
    Args:
        text (str): Input text/tweet to analyze
        model: Trained model (if None, loads from saved file)
        vectorizer: Trained vectorizer (if None, loads from saved file)
        return_probability (bool): If True, returns probability scores
        base_dir (str): Base directory path for loading saved models
        depression_threshold (float): Threshold for depression detection (default 0.40 = 40%)
        
    Returns:
        dict: Prediction results with label and confidence
    """
    # Load model and vectorizer if not provided
    if model is None:
        model_path = os.path.join(base_dir, 'models/mental_health_model.pkl')
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
    
    if vectorizer is None:
        vectorizer_path = os.path.join(base_dir, 'models/tfidf_vectorizer.pkl')
        with open(vectorizer_path, 'rb') as f:
            vectorizer = pickle.load(f)
    
    # Preprocess the text
    processed_text = preprocess_text(text)
    
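    # Handle empty processed text (return the same keys as a normal result)
    if not processed_text or processed_text.strip() == '':
        return {
            'text': text,
            'prediction': -1,
            'label': 'Insufficient text for analysis',
            'confidence': 0.0,
            'depression_probability': 0.0,
            'is_borderline': False,
            # Placeholder values, not model output
            'probabilities': {'No depression': 50.0, 'Depression': 50.0}
        }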
    
    # Transform to TF-IDF features
    text_tfidf = vectorizer.transform([processed_text])
    
    # Get probabilities
    probability = model.predict_proba(text_tfidf)[0]
    depression_prob = probability[1]  # Probability of depression (class 1)
    
    # Use threshold-based approach for better sensitivity
    # If depression probability is above threshold, flag it
    if depression_prob >= depression_threshold:
        prediction = 1
        label = "Depression detected"
        confidence = depression_prob * 100
    else:
        prediction = 0
        label = "No depression detected"
        confidence = probability[0] * 100
    
    # Flag borderline cases: depression probability within 15 percentage points of 0.5
    # (the band is centred on 0.5, not on depression_threshold)
    is_borderline = bool(abs(depression_prob - 0.5) < 0.15)
    
    result = {
        'text': text,
        'prediction': int(prediction),
        'label': label,
        'confidence': round(confidence, 2),
        'depression_probability': round(depression_prob * 100, 2),
        'is_borderline': is_borderline
    }
    
    if return_probability:
        result['probabilities'] = {
            'No depression': round(probability[0] * 100, 2),
            'Depression': round(probability[1] * 100, 2)
        }
    
    return result

# Test the prediction function with sample tweets
print("=== Testing Prediction Function (with 40% depression threshold) ===\n")

test_tweets = [
    "I'm feeling really down today. Can't stop thinking about negative things. Life feels meaningless.",
    "Had a great day today! Went for a walk and met some friends. Feeling happy and energized!",
    "I don't want to get out of bed. Everything feels hopeless and I can't see a way out.",
    "Just finished a productive day at work. Looking forward to the weekend!",
    "I've been having suicidal thoughts lately. I don't know what to do anymore."
]

for tweet in test_tweets:
    result = predict_mental_health(tweet, return_probability=True, depression_threshold=0.40)
    # Append an ellipsis only when the tweet was actually truncated
    print(f"Tweet: {tweet[:80]}{'...' if len(tweet) > 80 else ''}")
    print(f"Prediction: {result['label']}")
    print(f"Depression Probability: {result['depression_probability']}%")
    print(f"Confidence: {result['confidence']}%")
    if result.get('is_borderline', False):
        print(f"⚠️  Borderline case - model is uncertain")
    print(f"Full Probabilities: {result['probabilities']}")
    print("-" * 80)


=== Testing Prediction Function (with 40% depression threshold) ===



NameError: name 'os' is not defined
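
The decision rule inside `predict_mental_health` can be isolated to see how the 40% threshold interacts with the reported confidence and the borderline flag. This is a sketch under assumptions: `classify` is a hypothetical helper, and it takes the class 0 probability to be `1 - depression_prob`, which holds for a binary classifier.

```python
def classify(depression_prob, threshold=0.40, band=0.15):
    """Apply the same threshold and borderline rule to a single probability."""
    if depression_prob >= threshold:
        prediction = 1
        confidence = depression_prob * 100
    else:
        prediction = 0
        confidence = (1 - depression_prob) * 100
    # The borderline band is centred on 0.5, independent of the threshold
    is_borderline = abs(depression_prob - 0.5) < band
    return prediction, round(confidence, 2), is_borderline

for p in (0.23, 0.45, 0.80):
    print(p, classify(p))
```

At a probability of 0.45 the rule returns the positive label with a reported confidence of 45%: lowering the threshold below 0.5 raises sensitivity, but it also means a "Depression detected" label can carry less than half of the probability mass.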

## 8. Interactive Prediction

Use this cell to test your own tweets or text.


In [None]:
# Enter your tweet/text here
your_tweet = "I'm feeling great today! Everything is going well."

# Make prediction (using 40% threshold for better sensitivity)
result = predict_mental_health(your_tweet, return_probability=True, depression_threshold=0.40)

# Display results
print("=" * 80)
print("MENTAL HEALTH DETECTION RESULT")
print("=" * 80)
print(f"\nInput Text: {result['text']}")
print(f"\nPrediction: {result['label']}")
print(f"Depression Probability: {result['depression_probability']}%")
print(f"Confidence: {result['confidence']}%")
if result.get('is_borderline', False):
    print(f"\n⚠️  Borderline case - model is uncertain")
print(f"\nDetailed Probabilities:")
for label, prob in result['probabilities'].items():
    print(f"  {label}: {prob}%")
print("=" * 80)


MENTAL HEALTH DETECTION RESULT

Input Text: I'm feeling great today! Everything is going well.

Prediction: No depression detected
Confidence: 77.0%

Detailed Probabilities:
  No depression: 77.0%
  Depression: 23.0%


## 9. Batch Prediction Function

Use this function to predict multiple tweets in a single call.


In [None]:
def predict_batch(texts, model=None, vectorizer=None, base_dir='D:/mental_health_detector', 
                depression_threshold=0.40):
    """
    Predict mental health status for multiple texts.
    
    Args:
        texts (list): List of texts/tweets to analyze
        model: Trained model (if None, loads from saved file)
        vectorizer: Trained vectorizer (if None, loads from saved file)
        base_dir (str): Base directory path for loading saved models
        depression_threshold (float): Threshold for depression detection (default 0.40 = 40%)
        
    Returns:
        list: List of prediction results
    """
    # Load model and vectorizer if not provided
    if model is None:
        model_path = os.path.join(base_dir, 'models/mental_health_model.pkl')
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
    
    if vectorizer is None:
        vectorizer_path = os.path.join(base_dir, 'models/tfidf_vectorizer.pkl')
        with open(vectorizer_path, 'rb') as f:
            vectorizer = pickle.load(f)
    
    # Preprocess all texts
    processed_texts = [preprocess_text(text) for text in texts]
    
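    # Filter out empty texts and keep track of indices
    valid_indices = [i for i, text in enumerate(processed_texts) if text and text.strip() != '']
    valid_texts = [processed_texts[i] for i in valid_indices]
    
    # If nothing survives preprocessing, still return one placeholder per input
    if not valid_texts:
        return [{'text': text, 'prediction': -1, 'label': 'Insufficient text for analysis',
                 'confidence': 0.0, 'depression_probability': 0.0,
                 'probabilities': {'No depression': 50.0, 'Depression': 50.0}} for text in texts]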
    
    # Transform to TF-IDF features
    texts_tfidf = vectorizer.transform(valid_texts)
    
    # Get probabilities
    probabilities = model.predict_proba(texts_tfidf)
    
    # Format results
    results = []
    valid_idx = 0
    for i, text in enumerate(texts):
        if i not in valid_indices:
            results.append({
                'text': text,
                'prediction': -1,
                'label': 'Insufficient text for analysis',
                'confidence': 0.0,
                'depression_probability': 0.0,
                'probabilities': {'No depression': 50.0, 'Depression': 50.0}
            })
            continue
        
        prob = probabilities[valid_idx]
        depression_prob = prob[1]  # Probability of depression
        
        # Use threshold-based approach
        if depression_prob >= depression_threshold:
            prediction = 1
            label = "Depression detected"
            confidence = depression_prob * 100
        else:
            prediction = 0
            label = "No depression detected"
            confidence = prob[0] * 100
        
        # Borderline: within 15 percentage points of 0.5 (cast to a plain Python bool)
        is_borderline = bool(abs(depression_prob - 0.5) < 0.15)
        
        results.append({
            'text': text,
            'prediction': int(prediction),
            'label': label,
            'confidence': round(confidence, 2),
            'depression_probability': round(depression_prob * 100, 2),
            'is_borderline': is_borderline,
            'probabilities': {
                'No depression': round(prob[0] * 100, 2),
                'Depression': round(prob[1] * 100, 2)
            }
        })
        valid_idx += 1
    
    return results

# Example: Batch prediction
sample_tweets = [
    "Feeling really sad and hopeless today",
    "Great weather today! Going to the park",
    "I can't find motivation to do anything",
    "Excited about my new project!"
]

batch_results = predict_batch(sample_tweets, depression_threshold=0.40)

print("=== Batch Prediction Results ===\n")
for i, result in enumerate(batch_results, 1):
    print(f"{i}. {result['text']}")
    print(f"   → {result['label']}")
    print(f"   Depression Probability: {result['depression_probability']}%")
    print(f"   Confidence: {result['confidence']}%")
    if result.get('is_borderline', False):
        print(f"   ⚠️  Borderline case - model is uncertain")
    print(f"   Full probabilities: {result['probabilities']}")
    print()


=== Batch Prediction Results ===

1. Feeling really sad and hopeless today
   → No depression detected (Confidence: 89.14%)

2. Great weather today! Going to the park
   → No depression detected (Confidence: 94.94%)

3. I can't find motivation to do anything
   → No depression detected (Confidence: 81.98%)

4. Excited about my new project!
   → No depression detected (Confidence: 93.65%)
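
The index bookkeeping in `predict_batch` (drop inputs that are empty after preprocessing, score the rest in one call, then put the scores back in input order) can be shown on its own. The sketch below is illustrative only: `align_batch`, `toy_preprocess`, and `toy_score` are hypothetical stand-ins, not the notebook's `preprocess_text` or model.

```python
def align_batch(texts, preprocess, score_valid):
    """Score only non-empty processed texts and return one entry per input."""
    processed = [preprocess(t) for t in texts]
    valid_indices = [i for i, t in enumerate(processed) if t and t.strip()]
    # Score all valid texts in a single call, as the vectorizer and model do
    scores = score_valid([processed[i] for i in valid_indices]) if valid_indices else []
    score_by_index = dict(zip(valid_indices, scores))
    # None marks inputs that had nothing left to analyze
    return [score_by_index.get(i) for i in range(len(texts))]

def toy_preprocess(text):
    """Stand-in cleaner: lowercase and keep only letters and spaces."""
    return "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ").strip()

def toy_score(batch):
    """Stand-in scorer: word count per text."""
    return [len(t.split()) for t in batch]

aligned = align_batch(["Feeling sad", "123 !!!", "Great weather today"],
                      toy_preprocess, toy_score)
print(aligned)
```

Mapping scores by original index keeps the output the same length as the input, so callers can zip results back onto their source tweets.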



## 10. Model Summary

This notebook provides a complete pipeline for mental health detection:

1. **Preprocessing**: Text cleaning, normalization, and feature extraction
2. **Model Training**: Logistic Regression with TF-IDF features
3. **Model Performance**: 95.5% accuracy on the test set (precision 0.978, recall 0.930, F1 0.953 for the depression class)
4. **Prediction Functions**: 
   - Single tweet prediction
   - Batch prediction
   - Probability scores included

### Usage:
- Use `predict_mental_health(text)` for single predictions
- Use `predict_batch(texts)` for multiple predictions
- Both functions load the saved model and vectorizer from `base_dir` by default, or accept `model`/`vectorizer` as parameters to avoid reloading them on every call

### Model Files:
- Model: `<base_dir>/models/mental_health_model.pkl`
- Vectorizer: `<base_dir>/models/tfidf_vectorizer.pkl`
- `base_dir` defaults to `D:/mental_health_detector`


