## Model Training and Evaluation

Train a Random Forest and a Logistic Regression classifier, evaluate each on the held-out test set with a classification report and ROC-AUC, and save whichever model has the higher ROC-AUC.

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import joblib

# Load the preprocessed data
X_train = pd.read_csv('../archive/X_train.csv')
X_test = pd.read_csv('../archive/X_test.csv')
y_train = pd.read_csv('../archive/y_train.csv').squeeze()
y_test = pd.read_csv('../archive/y_test.csv').squeeze()

# Train Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
# Column 1 of predict_proba is the probability of class 1
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print("Random Forest:\n", classification_report(y_test, y_pred_rf))
print("ROC-AUC:", rf_auc)

# Train Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
lr_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
print("Logistic Regression:\n", classification_report(y_test, y_pred_lr))
print("ROC-AUC:", lr_auc)

# Save the best model (higher ROC-AUC wins; a tie goes to Random Forest)
if rf_auc >= lr_auc:
    joblib.dump(rf, "../models/mental_health_rf.pkl")
    print("Random Forest model saved as ../models/mental_health_rf.pkl")
else:
    joblib.dump(lr, "../models/mental_health_lr.pkl")
    print("Logistic Regression model saved as ../models/mental_health_lr.pkl")

Random Forest:
               precision    recall  f1-score   support

           0       0.75      0.74      0.74       129
           1       0.73      0.74      0.73       123

    accuracy                           0.74       252
   macro avg       0.74      0.74      0.74       252
weighted avg       0.74      0.74      0.74       252

ROC-AUC: 0.7941955000945359
Logistic Regression:
               precision    recall  f1-score   support

           0       0.59      0.53      0.56       129
           1       0.56      0.62      0.59       123

    accuracy                           0.58       252
   macro avg       0.58      0.58      0.58       252
weighted avg       0.58      0.58      0.57       252

ROC-AUC: 0.6238734480368059
Random Forest model saved as ../models/mental_health_rf.pkl
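
Once a model has been written with `joblib.dump`, it can be read back with `joblib.load` and used for prediction without retraining. The following is a minimal sketch of that round trip, not the notebook's own code: it uses synthetic data from `make_classification` and a temporary file named `demo_rf.pkl` (both are assumptions for illustration) so that it runs without the `../archive` and `../models` directories.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_rf.pkl"  # hypothetical file name
    joblib.dump(model, model_path)       # persist the fitted estimator
    reloaded = joblib.load(model_path)   # read it back into memory

# The reloaded estimator should give the same probabilities as the original.
same_predictions = np.allclose(model.predict_proba(X), reloaded.predict_proba(X))
print("Reloaded model matches original:", same_predictions)
```

In the notebook itself the same pattern would presumably be `joblib.load("../models/mental_health_rf.pkl")`. Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, and only from a trusted source.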


ABNORMAL: 

You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
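
The truncated solver warning above suggests scaling the features before fitting Logistic Regression. One common way to do that is to wrap `StandardScaler` and `LogisticRegression` in a single pipeline, so the scaler is fitted on the training split only. The sketch below illustrates the idea on synthetic data with deliberately mismatched feature scales (an assumption made for the demo). Whether scaling changes the ROC-AUC reported above for the real dataset is not something this example can show.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; columns are stretched to very different ranges.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = X * np.array([1, 10, 100, 1000, 1, 1, 1, 1, 1, 1])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# The scaler learns its mean and variance from the training split only.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_lr.fit(X_tr, y_tr)

convergence_warnings = [w for w in caught if issubclass(w.category, ConvergenceWarning)]
scaled_auc = roc_auc_score(y_te, scaled_lr.predict_proba(X_te)[:, 1])

print("Convergence warnings:", len(convergence_warnings))
print("Solver iterations:", scaled_lr[-1].n_iter_)
print("ROC-AUC:", scaled_auc)
```

Because the pipeline is a single estimator, it could be passed to `joblib.dump` in place of `lr`, which keeps the scaling step attached to the saved model.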
