# Mental Health Prediction Based on Social Media Usage

## Project Overview
This project predicts students' mental health status from their social media usage patterns. The features analyzed include:
- Daily social media usage hours
- Number of platforms used
- Social comparison behavior
- Fear of missing out (FOMO) score
- Sleep disruption
- Academic performance (GPA)

We'll build and compare multiple machine learning models to predict mental health categories: Poor, Fair, or Good.

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 2. Load and Explore Data

In [None]:
# Load the dataset
df = pd.read_csv('social_media_mental_health.csv')

# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

In [None]:
# Data information
print("\nDataset Info:")
print(df.info())

print("\nStatistical Summary:")
print(df.describe())

In [None]:
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Check class distribution
print("\nMental Health Category Distribution:")
print(df['Mental_Health_Category'].value_counts())
print("\nPercentage Distribution:")
print(df['Mental_Health_Category'].value_counts(normalize=True) * 100)

## 3. Data Visualization

In [None]:
# Distribution of Mental Health Categories
plt.figure(figsize=(10, 6))
df['Mental_Health_Category'].value_counts().plot(kind='bar', color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
plt.title('Distribution of Mental Health Categories', fontsize=16, fontweight='bold')
plt.xlabel('Mental Health Category', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Mental Health Score Distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Left panel: histogram of the raw score
axes[0].hist(df['Mental_Health_Score'], bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Distribution of Mental Health Scores', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Mental Health Score', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)

# Right panel: pass ax explicitly, otherwise df.boxplot(by=...) opens a new figure
df.boxplot(column='Mental_Health_Score', by='Mental_Health_Category', ax=axes[1])
axes[1].set_title('Mental Health Score by Category', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Mental Health Category', fontsize=12)
axes[1].set_ylabel('Mental Health Score', fontsize=12)
fig.suptitle('')  # drop the automatic "Boxplot grouped by ..." title
plt.tight_layout()
plt.show()

In [None]:
# Key Features vs Mental Health
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Daily Usage Hours
df.boxplot(column='Daily_Usage_Hours', by='Mental_Health_Category', ax=axes[0, 0])
axes[0, 0].set_title('Daily Social Media Usage by Mental Health')
axes[0, 0].set_xlabel('Mental Health Category')
axes[0, 0].set_ylabel('Hours per Day')

# Social Comparison
df.boxplot(column='Social_Comparison', by='Mental_Health_Category', ax=axes[0, 1])
axes[0, 1].set_title('Social Comparison by Mental Health')
axes[0, 1].set_xlabel('Mental Health Category')
axes[0, 1].set_ylabel('Social Comparison Score')

# FOMO Score
df.boxplot(column='FOMO_Score', by='Mental_Health_Category', ax=axes[0, 2])
axes[0, 2].set_title('FOMO Score by Mental Health')
axes[0, 2].set_xlabel('Mental Health Category')
axes[0, 2].set_ylabel('FOMO Score')

# Sleep Disruption
df.boxplot(column='Sleep_Disruption', by='Mental_Health_Category', ax=axes[1, 0])
axes[1, 0].set_title('Sleep Disruption by Mental Health')
axes[1, 0].set_xlabel('Mental Health Category')
axes[1, 0].set_ylabel('Hours')

# GPA
df.boxplot(column='GPA', by='Mental_Health_Category', ax=axes[1, 1])
axes[1, 1].set_title('GPA by Mental Health')
axes[1, 1].set_xlabel('Mental Health Category')
axes[1, 1].set_ylabel('GPA')

# Number of Platforms
df.boxplot(column='Num_Platforms', by='Mental_Health_Category', ax=axes[1, 2])
axes[1, 2].set_title('Number of Platforms by Mental Health')
axes[1, 2].set_xlabel('Mental Health Category')
axes[1, 2].set_ylabel('Number of Platforms')

plt.suptitle('')
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(14, 10))
# Keep numeric columns only; corr() cannot handle string columns such as the category label
correlation = df.select_dtypes(include='number').corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0, fmt='.2f', 
            square=True, linewidths=1)
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

## 4. Data Preprocessing

In [None]:
# Prepare features and target
X = df.drop(['Mental_Health_Score', 'Mental_Health_Category'], axis=1)
y = df['Mental_Health_Category']

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeature columns:")
print(X.columns.tolist())

In [None]:
# Encode target variable (LabelEncoder assigns codes in sorted order of the class names)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print("Label encoding:")
for i, label in enumerate(label_encoder.classes_):
    print(f"{label}: {i}")
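
`LabelEncoder` sorts the class names before assigning codes, so the integer codes do not follow a Poor < Fair < Good severity order. A minimal sketch, using only the three category names from the overview (the `demo_` names are illustrative and not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the three category names; input order does not matter.
demo_encoder = LabelEncoder()
demo_codes = demo_encoder.fit_transform(["Poor", "Fair", "Good", "Fair"])

# classes_ is sorted alphabetically, and each code is an index into it.
print(demo_encoder.classes_.tolist())
print(demo_codes.tolist())
```

The classifiers used in this notebook treat the codes as unordered labels, so this ordering is harmless for training, but the codes should not be read as a severity scale.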

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, 
                                                      random_state=42, stratify=y_encoded)

print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nData preprocessing completed!")
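
The scaler above is fitted once on the whole training set. If cross-validation is later run on `X_train_scaled`, each validation fold has already contributed to the scaling statistics. One common alternative is to put the scaler and the model in a single pipeline so the scaler is refitted inside every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the project dataset, and the `demo` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)

# The pipeline refits StandardScaler on the training part of each fold,
# so the held-out fold never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)

print(f"CV accuracy: {pipeline_scores.mean():.4f} (+/- {pipeline_scores.std():.4f})")
```

With standardization the difference from the simpler approach is usually small, but the pipeline form keeps the cross-validation estimate free of that leakage.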

## 5. Model Training and Evaluation

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42, n_estimators=100),
    'Support Vector Machine': SVC(random_state=42, kernel='rbf')
}

# Train and evaluate models
results = {}

for name, model in models.items():
    print(f"\n{'='*60}")
    print(f"Training {name}...")
    print('='*60)
    
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Cross-validation score
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'predictions': y_pred
    }
    
    print(f"Test Accuracy: {accuracy:.4f}")
    print(f"Cross-validation Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, 
                                target_names=label_encoder.classes_))
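
Accuracy alone can look good when one category dominates, which is why the class distribution check in Section 2 matters. The sketch below uses a small hand-made label list (not project data) to show how balanced accuracy and macro-averaged F1 expose a model that only ever predicts the majority class.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: 8 "Good", 1 "Fair", 1 "Poor"; the "model" always predicts "Good".
y_true_demo = ["Good"] * 8 + ["Fair"] + ["Poor"]
y_pred_demo = ["Good"] * 10

acc = accuracy_score(y_true_demo, y_pred_demo)
# Balanced accuracy is the mean of per-class recall.
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)
# Macro F1 averages per-class F1 with equal weight; never-predicted classes score 0.
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)

print(f"Accuracy:          {acc:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
print(f"Macro F1:          {macro_f1:.3f}")
```

Here accuracy is 0.8 while balanced accuracy is 1/3, because the two minority classes are never predicted.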

In [None]:
# Compare model performances
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Test Accuracy': [results[m]['accuracy'] for m in results.keys()],
    'CV Mean': [results[m]['cv_mean'] for m in results.keys()],
    'CV Std': [results[m]['cv_std'] for m in results.keys()]
})

comparison_df = comparison_df.sort_values('Test Accuracy', ascending=False)
print("\nModel Performance Comparison:")
print(comparison_df.to_string(index=False))

# Visualize model comparison
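plt.figure(figsize=(12, 6))
x_pos = np.arange(len(comparison_df))
bar_width = 0.4  # offset the two series so the bars sit side by side instead of overlapping
plt.bar(x_pos - bar_width / 2, comparison_df['Test Accuracy'], width=bar_width,
        alpha=0.8, color='steelblue', label='Test Accuracy')
plt.bar(x_pos + bar_width / 2, comparison_df['CV Mean'], width=bar_width,
        alpha=0.8, color='coral', label='CV Mean')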
plt.xlabel('Models', fontsize=12, fontweight='bold')
plt.ylabel('Accuracy', fontsize=12, fontweight='bold')
plt.title('Model Performance Comparison', fontsize=16, fontweight='bold')
plt.xticks(x_pos, comparison_df['Model'], rotation=45, ha='right')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 6. Best Model Analysis

In [None]:
# Select best model by test accuracy (choosing on the test set makes its reported score optimistic)
best_model_name = max(results, key=lambda x: results[x]['accuracy'])
best_model = results[best_model_name]['model']
best_predictions = results[best_model_name]['predictions']

print(f"Best Model: {best_model_name}")
print(f"Test Accuracy: {results[best_model_name]['accuracy']:.4f}")
print(f"Cross-validation Score: {results[best_model_name]['cv_mean']:.4f} (+/- {results[best_model_name]['cv_std']:.4f})")
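
Because the test set is also used for the final report, picking the winner by test accuracy can favor a model that was simply lucky on that split. A possible alternative is to select on the cross-validation mean and keep the test set for reporting only. The sketch below uses a hypothetical `demo_results` dictionary with invented scores, shaped like the `results` dictionary above, purely to show the two selection rules disagreeing.

```python
# Hypothetical scores for illustration only; these are not project results.
demo_results = {
    "Model A": {"accuracy": 0.80, "cv_mean": 0.74},
    "Model B": {"accuracy": 0.78, "cv_mean": 0.77},
}

# Rule used in the notebook: highest test accuracy.
best_by_test = max(demo_results, key=lambda name: demo_results[name]["accuracy"])
# Alternative rule: highest mean cross-validation accuracy.
best_by_cv = max(demo_results, key=lambda name: demo_results[name]["cv_mean"])

print("Selected by test accuracy:", best_by_test)
print("Selected by CV mean:      ", best_by_cv)
```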

In [None]:
# Confusion matrix for best model
plt.figure(figsize=(10, 8))
cm = confusion_matrix(y_test, best_predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_,
            cbar_kws={'label': 'Count'})
plt.title(f'Confusion Matrix - {best_model_name}', fontsize=16, fontweight='bold', pad=20)
plt.ylabel('True Label', fontsize=12, fontweight='bold')
plt.xlabel('Predicted Label', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nDetailed Classification Report for Best Model:")
print(classification_report(y_test, best_predictions, target_names=label_encoder.classes_))

In [None]:
# Feature importance (for tree-based models)
if best_model_name in ['Decision Tree', 'Random Forest', 'Gradient Boosting']:
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("\nFeature Importance:")
    print(feature_importance.to_string(index=False))
    
    plt.figure(figsize=(12, 6))
    plt.barh(feature_importance['Feature'], feature_importance['Importance'], color='teal')
    plt.xlabel('Importance', fontsize=12, fontweight='bold')
    plt.ylabel('Features', fontsize=12, fontweight='bold')
    plt.title(f'Feature Importance - {best_model_name}', fontsize=16, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
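
The cell above produces a ranking only when the best model is tree-based. For models without `feature_importances_`, such as the RBF-kernel SVM, permutation importance is one model-agnostic option: it measures how much the score drops when a single feature column is shuffled. The sketch below runs on synthetic data with placeholder feature names (both are assumptions, not the project dataset).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data with placeholder feature names (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
svm_model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the accuracy drop.
perm = permutation_importance(svm_model, X_te, y_te, n_repeats=10, random_state=0)

perm_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": perm.importances_mean,
    "Std": perm.importances_std,
}).sort_values("Importance", ascending=False)
print(perm_df.to_string(index=False))
```

Importances near zero, or negative, suggest the model does not rely on that feature for its held-out predictions.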

## 7. Conclusions and Insights

### Key Findings:

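1. **Model Performance**: Five classifiers (logistic regression, decision tree, random forest, gradient boosting, and an RBF-kernel SVM) were trained and compared on test accuracy and 5-fold cross-validation accuracy. The comparison table in Section 5 reports the scores, and Section 6 examines the model with the highest test accuracy.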

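2. **Important Features**: The correlation heatmap in Section 3 and the feature importance ranking in Section 6 (produced only when the best model is tree-based) indicate which social media usage patterns are most strongly associated with mental health outcomes. The features examined include: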
   - Social comparison behavior
   - FOMO (Fear of Missing Out) scores
   - Daily usage hours
   - Sleep disruption
   - Academic performance (GPA)

3. **Mental Health Patterns**:
   - Higher social media usage and social comparison are associated with poorer mental health outcomes
   - Better academic performance (higher GPA) correlates with better mental health
   - Sleep disruption due to social media is associated with poorer mental health
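
The associations listed above can be checked with a rank correlation between a feature and the category placed on an explicit Poor < Fair < Good scale. The sketch below uses six invented rows (not project data) and computes Spearman's rho as the Pearson correlation of ranks. The column names match the notebook, while `demo` and `severity_order` are illustrative names.

```python
import pandas as pd

# Invented rows for illustration only; these are not project results.
demo = pd.DataFrame({
    "Daily_Usage_Hours": [1.0, 2.0, 3.5, 4.0, 6.0, 7.5],
    "Mental_Health_Category": ["Good", "Good", "Fair", "Fair", "Poor", "Poor"],
})

# Explicit ordinal mapping; LabelEncoder's alphabetical codes would not be ordered.
severity_order = {"Poor": 0, "Fair": 1, "Good": 2}
ordinal_category = demo["Mental_Health_Category"].map(severity_order)

# Spearman's rho equals the Pearson correlation of the ranks (ties get average ranks).
rho = demo["Daily_Usage_Hours"].rank().corr(ordinal_category.rank())
print(f"Spearman rho (usage vs. mental health): {rho:.3f}")
```

A negative rho means higher usage goes with a lower (poorer) category. It describes association only and says nothing about the direction of cause.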

### Recommendations:

1. **For Students**:
   - Monitor and limit daily social media usage
   - Be mindful of social comparison behavior
   - Prioritize sleep over late-night social media use
   - Balance online and offline activities

2. **For Educational Institutions**:
   - Implement awareness programs about healthy social media usage
   - Provide mental health support services
   - Encourage digital wellness practices

3. **Future Work**:
   - Collect longitudinal data to track changes over time
   - Include more diverse features (specific platforms, content types)
   - Develop intervention strategies based on model predictions
   - Validate findings with clinical mental health assessments