___
<img style="float: right; margin: 15px 15px 15px 15px;" src="https://img.freepik.com/free-vector/depression-concept-illustration_114360-3747.jpg?t=st=1657678284~exp=1657678884~hmac=b8b1d71ca0a8eb2e4ff5bf31d6a98624112f1a2254b0f39e92254ed12d7875b2" width="240px" height="180px" />

# <font color= #bbc28d> **Depression Classification - Modeling** </font>
#### <font color= #2E9AFE> `Data Science Project`</font>
- <Strong> Sofía Maldonado, Diana Valdivia, Samantha Sánchez & Vivienne Toledo </Strong>
- <Strong> Date </Strong>: 11/11/2025.

___

<p style="text-align:right;"> Image retrieved from: https://img.freepik.com/free-vector/depression-concept-illustration_114360-3747.jpg?t=st=1657678284~exp=1657678884~hmac=b8b1d71ca0a8eb2e4ff5bf31d6a98624112f1a2254b0f39e92254ed12d7875b2 </p>

In [2]:
# General Libraries
import os
import pandas as pd

# Databricks Env
import pathlib
import pickle
from dotenv import load_dotenv

# Feature Engineering
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Optimization
import math
import optuna
from optuna.samplers import TPESampler

# MLFlow
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from mlflow import MlflowClient

# Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Evaluation Metrics
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

## <font color= #bbc28d>• **Credentials & Initial Set-up** </font>
To work with MLflow, we first authenticate with our access tokens and set the tracking backend; in our case, that backend is Databricks:

In [None]:
# ======================================
# Load .env and Log in to Databricks
# ======================================

# Load the variables defined in the .env file
# (override=True replaces values that are already set in the environment)
load_dotenv(override=True)
EXPERIMENT_NAME = "/Users/pipochatgpt@gmail.com/PEPE"

mlflow.set_tracking_uri("databricks")
experiment = mlflow.set_experiment(experiment_name=EXPERIMENT_NAME)

2025/11/11 16:22:37 INFO mlflow.tracking.fluent: Experiment with name '/Users/pipochatgpt@gmail.com/depression-project' does not exist. Creating a new experiment.


RestException: RESOURCE_DOES_NOT_EXIST: Parent directory /Users/pipochatgpt@gmail.com does not exist.
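
The set-up cell relies on credentials loaded from `.env`, but the source does not show which variables that file defines. The sketch below assumes the standard Databricks variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`; the helper name `missing_credentials` and the example host are hypothetical. It only checks that the variables are present, so it would not catch an error like the one above, which appears to concern the workspace path rather than authentication.

```python
import os

# Assumption: authentication relies on these two standard Databricks variables.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_credentials(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Hypothetical environment in which only the host has been defined.
example_env = {"DATABRICKS_HOST": "https://example.cloud.databricks.com"}
print(missing_credentials(example_env))  # ['DATABRICKS_TOKEN']
```

Calling `missing_credentials()` with no argument inspects the real process environment, so it could run right after `load_dotenv` and before `mlflow.set_tracking_uri`.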

## <font color= #bbc28d>• **Modeling** </font>
To recap the previous deliverables, this project works with a dataset whose goal is to **classify students** into two categories: those who show **symptoms of depression** and **those who do not**. This is a binary classification problem, so we chose three models that suit this kind of task:
- Logistic Regression
- SVC
- XGBoost

Next we carry out **hyperparameter optimization** and **training for the three binary classification models**. For each model:

1. **Optuna** explores different hyperparameter combinations to maximize the `F1-score`. F1 is the harmonic mean of precision and recall, so it is high only when both are high, which makes it a balanced single metric. Each combination is evaluated by an objective function (`objective`) that trains the model, predicts on the test set, and computes the performance metrics `accuracy`, `precision`, `f1`, and `recall`.

2. **MLflow** tracks the experiments automatically (`autolog`) and records the parameters, metrics, and trained models.

3. For Logistic Regression and SVC, the Optuna studies try a fixed number of configurations (`n_trials=3`) and keep the best parameters found. For XGBoost, the study also tunes the number of trees, the maximum depth, the learning rate, and gamma.
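
To make the choice of F1 in step 1 concrete, here is a small worked example in plain Python. The confusion counts are hypothetical and chosen only for illustration; the helper name `precision_recall_f1` is not part of the project. It shows that F1, a harmonic mean, sits below the arithmetic mean whenever precision and recall differ.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 40 true positives, 10 false positives, 40 false negatives.
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=40)
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} arithmetic mean={arithmetic_mean:.3f}")
# precision=0.800 recall=0.500 f1=0.615 arithmetic mean=0.650
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by trading away recall for precision or the reverse.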

### <font color= #7fb2b5>• **Logistic Regression** </font>

In [None]:
def hp_tuning_lr(X_train, X_test, Y_train, Y_test):

    mlflow.sklearn.autolog()

    training_dataset = mlflow.data.from_numpy(X_train.data, targets=Y_train, name='Train Data')
    validation_dataset = mlflow.data.from_numpy(X_test.data, targets=Y_test, name='Test Data')

    def objective_lr(trial: optuna.trial.Trial):
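        params = {
            'penalty': trial.suggest_categorical('penalty', ['l2','l1','elasticnet']),
            # saga is the only scikit-learn solver that accepts all three penalties;
            # suggesting it as a single-choice categorical keeps it in study.best_params
            'solver': trial.suggest_categorical('solver', ['saga'])
        }
        # elasticnet additionally requires l1_ratio (the L1/L2 mix)
        if params['penalty'] == 'elasticnet':
            params['l1_ratio'] = trial.suggest_float('l1_ratio', 0.0, 1.0)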

        with mlflow.start_run(nested=True):
            mlflow.set_tag('model_family', 'logistic_regression')
            mlflow.log_params(params)

            lr_model = LogisticRegression(**params)
            lr_model.fit(X_train, Y_train)

            y_pred = lr_model.predict(X_test)
            acc = accuracy_score(Y_test, y_pred)
            precision = precision_score(Y_test, y_pred)
            f1 = f1_score(Y_test, y_pred)
            recall = recall_score(Y_test, y_pred)

            mlflow.log_metric('acc', acc)
            mlflow.log_metric('precision', precision)
            mlflow.log_metric('f1', f1)
            mlflow.log_metric('recall', recall)

            signature = infer_signature(X_test, y_pred)

            mlflow.sklearn.log_model(
                lr_model,
                name='lr_model',
                input_example=X_test[:5],
                signature=signature
            )
        
        return f1
    
    sampler = TPESampler(seed=42)
    lr_study = optuna.create_study(direction='maximize', sampler=sampler)

    with mlflow.start_run(run_name='Logistic Regression (Optuna)', nested=True):
        lr_study.optimize(objective_lr, n_trials=3)
    
    best_params_lr = lr_study.best_params

    return best_params_lr
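
`LogisticRegression` accepts only certain penalty/solver pairs, which matters for a search space that samples `penalty` from `l2`, `l1`, and `elasticnet`. The table below reflects the combinations listed in the scikit-learn documentation at the time of writing; the helper `solvers_supporting` is a hypothetical name used only for this sketch. Note that `elasticnet` also needs an `l1_ratio` value.

```python
# Penalties accepted by each LogisticRegression solver (per the scikit-learn docs).
SOLVER_PENALTIES = {
    "lbfgs": {"l2", None},
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2", None},
    "newton-cholesky": {"l2", None},
    "sag": {"l2", None},
    "saga": {"elasticnet", "l1", "l2", None},
}


def solvers_supporting(penalties) -> list[str]:
    """Return the solvers that accept every penalty in `penalties`."""
    wanted = set(penalties)
    return sorted(
        solver for solver, supported in SOLVER_PENALTIES.items() if wanted <= supported
    )


print(solvers_supporting(["l2", "l1", "elasticnet"]))  # ['saga']
print(solvers_supporting(["l1"]))                      # ['liblinear', 'saga']
```

The default solver, `lbfgs`, appears only in the `l2`/`None` rows, so a trial that draws `l1` or `elasticnet` needs an explicit `solver` argument.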

### <font color= #7fb2b5>• **SVC** </font>

In [None]:
def hp_tuning_svc(X_train, X_test, Y_train, Y_test):

    mlflow.sklearn.autolog()

    training_dataset = mlflow.data.from_numpy(X_train.data, targets=Y_train, name='Train Data')
    validation_dataset = mlflow.data.from_numpy(X_test.data, targets=Y_test, name='Test Data')

    def objective_svc(trial: optuna.trial.Trial):
        params = {
            'kernel': trial.suggest_categorical('kernel', ['sigmoid','poly','linear','rbf'])
        }

        with mlflow.start_run(nested=True):
            mlflow.set_tag('model_family', 'svc')
            mlflow.log_params(params)

            svc_model = SVC(**params)
            svc_model.fit(X_train, Y_train)

            y_pred = svc_model.predict(X_test)
            acc = accuracy_score(Y_test, y_pred)
            precision = precision_score(Y_test, y_pred)
            f1 = f1_score(Y_test, y_pred)
            recall = recall_score(Y_test, y_pred)

            mlflow.log_metric('acc', acc)
            mlflow.log_metric('precision', precision)
            mlflow.log_metric('f1', f1)
            mlflow.log_metric('recall', recall)

            signature = infer_signature(X_test, y_pred)

            mlflow.sklearn.log_model(
                svc_model,
                name='svc_model',
                input_example=X_test[:5],
                signature=signature
            )
        
        return f1
    
    sampler = TPESampler(seed=42)
    svc_study = optuna.create_study(direction='maximize', sampler=sampler)

    with mlflow.start_run(run_name='Support Vector Classifier (Optuna)', nested=True):
        svc_study.optimize(objective_svc, n_trials=3)
    
    best_params_svc = svc_study.best_params

    best_params_svc['random_state'] = 42

    return best_params_svc

### <font color= #7fb2b5>• **XGBoost** </font>

In [None]:
def hp_tuning_xgboost(X_train, X_test, Y_train, Y_test):
    # Enable autologging for XGBoost
    mlflow.xgboost.autolog()

    # Build MLflow dataset objects for the training and validation data
    training_dataset = mlflow.data.from_numpy(X_train, targets=Y_train, name='Train Data')
    validation_dataset = mlflow.data.from_numpy(X_test, targets=Y_test, name='Test Data')

    # Objective function for Optuna
    def objective_xgb(trial: optuna.trial.Trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 150),
            'max_depth': trial.suggest_int('max_depth', 2, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'gamma': trial.suggest_float('gamma', 0, 5),
            'eval_metric': 'logloss'
        }

        with mlflow.start_run(nested=True):
            mlflow.set_tag('model_family', 'Xgboost')
            mlflow.log_params(params)

            xgb_model = XGBClassifier(**params)
            xgb_model.fit(X_train, Y_train)

            y_pred = xgb_model.predict(X_test)
            acc = accuracy_score(Y_test, y_pred)
            precision = precision_score(Y_test, y_pred)
            f1 = f1_score(Y_test, y_pred)
            recall = recall_score(Y_test, y_pred)

            mlflow.log_metric('acc', acc)
            mlflow.log_metric('precision', precision)
            mlflow.log_metric('f1', f1)
            mlflow.log_metric('recall', recall)

            signature = infer_signature(X_test, y_pred)

            mlflow.xgboost.log_model(
                xgb_model,
                name='xgboost_model',
                input_example=X_test[:5],
                signature=signature
            )
        
        return f1

    # Create and run the Optuna study
    sampler = TPESampler(seed=42)
    xgb_study = optuna.create_study(direction='maximize', sampler=sampler)

    with mlflow.start_run(run_name='XGBoost (Optuna)', nested=True):
        xgb_study.optimize(objective_xgb, n_trials=3)

    # Retrieve the best parameters
    best_params_xgb = xgb_study.best_params
    best_params_xgb['random_state'] = 42

    return best_params_xgb
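
The `learning_rate` range is searched with `log=True`, which means values are drawn in the log domain so that each order of magnitude gets similar attention. The standard-library sketch below illustrates that idea only; it is not Optuna's implementation (TPE adapts its proposals as trials accumulate), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(0.01, 0.3, rng) for _ in range(10_000)]

# The geometric midpoint splits a log-uniform range into two equally likely halves.
geometric_mid = math.sqrt(0.01 * 0.3)
share_below = sum(s < geometric_mid for s in samples) / len(samples)

print(f"geometric midpoint = {geometric_mid:.4f}")
print(f"share of samples below it = {share_below:.2f}")
```

About half of the draws fall below roughly 0.055, whereas a plain uniform draw on the same range would put only about 15% of them there. Small learning rates are therefore explored far more often.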


## <font color= #bbc28d>• **MLFlow Registry** </font>
This function trains and evaluates the three selected models (Logistic Regression, SVC, and XGBoost) with the best hyperparameters found earlier. For each model it logs the parameters, computes accuracy, precision, recall, and F1-score, and stores the fitted model in MLflow for tracking and later reuse. The goal is to make training, evaluation, and model logging consistent and reproducible.

In [None]:
def train_best_models(X_train, Y_train, X_test, Y_test, best_params_lr, best_params_svc, best_params_xgb) -> None:
    with mlflow.start_run(run_name='Logistic Regression Model'):
        mlflow.log_params(best_params_lr)
        mlflow.set_tags({
            'project': 'Depression Prediction Project',
            'optimizer_engine': 'Optuna',
            'model_family': 'logistic_regression',
            'feature_set_version': 1,
            'candidate': 'true'
        })

        lr = LogisticRegression(**best_params_lr)
        lr.fit(X_train, Y_train)

        y_pred_lr = lr.predict(X_test)

        acc_lr = accuracy_score(Y_test, y_pred_lr)
        precision_lr = precision_score(Y_test, y_pred_lr)
        f1_lr = f1_score(Y_test, y_pred_lr)
        recall_lr = recall_score(Y_test, y_pred_lr)

        mlflow.log_metric('acc', acc_lr)
        mlflow.log_metric('precision', precision_lr)
        mlflow.log_metric('f1', f1_lr)
        mlflow.log_metric('recall', recall_lr)

        mlflow.sklearn.log_model(
            lr,
            name='model'
        )
    
    with mlflow.start_run(run_name='SVC Model'):
        mlflow.log_params(best_params_svc)
        mlflow.set_tags({
            'project': 'Depression Prediction Project',
            'optimizer_engine': 'Optuna',
            'model_family': 'svc',
            'feature_set_version': 1,
            'candidate': 'true'
        })

        svc = SVC(**best_params_svc)
        svc.fit(X_train, Y_train)

        y_pred_svc = svc.predict(X_test)

        acc_svc = accuracy_score(Y_test, y_pred_svc)
        precision_svc = precision_score(Y_test, y_pred_svc)
        f1_svc = f1_score(Y_test, y_pred_svc)
        recall_svc = recall_score(Y_test, y_pred_svc)

        mlflow.log_metric('acc', acc_svc)
        mlflow.log_metric('precision', precision_svc)
        mlflow.log_metric('f1', f1_svc)
        mlflow.log_metric('recall', recall_svc)

        mlflow.sklearn.log_model(
            svc,
            name='model'
        )
    
    with mlflow.start_run(run_name='XGBoost Model'):
        mlflow.log_params(best_params_xgb)
        mlflow.set_tags({
            'project': 'Depression Prediction Project',
            'optimizer_engine': 'Optuna',
            'model_family': 'xgboost',
            'feature_set_version': 1,
            'candidate': 'true'
        })

        xgb = XGBClassifier(**best_params_xgb)
        xgb.fit(X_train, Y_train)
        y_pred_xgb = xgb.predict(X_test)

        acc_xgb = accuracy_score(Y_test, y_pred_xgb)
        precision_xgb = precision_score(Y_test, y_pred_xgb)
        f1_xgb = f1_score(Y_test, y_pred_xgb)
        recall_xgb = recall_score(Y_test, y_pred_xgb)

        mlflow.log_metric('acc', acc_xgb)
        mlflow.log_metric('precision', precision_xgb)
        mlflow.log_metric('f1', f1_xgb)
        mlflow.log_metric('recall', recall_xgb)

        mlflow.xgboost.log_model(
            xgb,
            name='model'
        )
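
The four metric computations are repeated for every model. One possible way to reduce that repetition is a small helper that returns all four values under the same keys the notebook logs; `classification_metrics` is a hypothetical name, and the labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def classification_metrics(y_true, y_pred) -> dict[str, float]:
    """Compute the four metrics tracked in this notebook for binary labels."""
    return {
        'acc': float(accuracy_score(y_true, y_pred)),
        'precision': float(precision_score(y_true, y_pred)),
        'f1': float(f1_score(y_true, y_pred)),
        'recall': float(recall_score(y_true, y_pred)),
    }


# Made-up labels: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Inside an active run, the resulting dictionary could be passed to `mlflow.log_metrics(metrics)` in place of four separate `mlflow.log_metric` calls.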

The next function automatically registers the two best models of an experiment in the MLflow **Model Registry** and assigns them the aliases `Champion` and `Challenger`.
1. It first searches for all runs tagged as candidates (`candidate=true`) and sorts them by F1 in descending order.
2. It then takes the top two: the run with the highest F1 becomes `Champion` and the second becomes `Challenger`.
3. Each model is registered in the Model Registry and given its alias.

In [None]:
def register_champion_challenger(experiment_id="0", model_registry_name="workspace."):
    client = MlflowClient()

    runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string="tags.candidate = 'true'",
        order_by=["metrics.f1 DESC"]
    )

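    # Take the two top-ranked runs (search_runs already sorted them by F1, descending)
    if not runs:
        raise ValueError("No runs tagged candidate='true' were found in this experiment.")
    champion_run = runs[0]
    challenger_run = runs[1] if len(runs) > 1 else None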

    def register(run, alias):
        if run is None:
            return
        # Register the model
        result = mlflow.register_model(
            model_uri=f"runs:/{run.info.run_id}/model",
            name=model_registry_name
        )
        # Assign the alias
        client.set_registered_model_alias(
            name=model_registry_name,
            alias=alias,
            version=result.version
        )
        print(f"{alias} registered: {run.data.tags['model_family']} with F1={run.data.metrics['f1']} (Run ID: {run.info.run_id})")

    # Register Champion and Challenger
    register(champion_run, "Champion")
    register(challenger_run, "Challenger")
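
The selection rule can be illustrated without an MLflow server. In the sketch below, plain dictionaries stand in for runs; the run IDs, F1 values, registry name, and both helper names are hypothetical. The `models:/<name>@<alias>` form is the URI scheme MLflow uses to address a registered model by alias.

```python
def pick_champion_challenger(candidates):
    """Rank candidates by F1 (descending) and return (champion, challenger or None)."""
    if not candidates:
        raise ValueError("no candidate runs to choose from")
    ranked = sorted(candidates, key=lambda c: c["f1"], reverse=True)
    champion = ranked[0]
    challenger = ranked[1] if len(ranked) > 1 else None
    return champion, challenger


def alias_uri(model_name: str, alias: str) -> str:
    """Build the URI that addresses a registered model through an alias."""
    return f"models:/{model_name}@{alias}"


# Hypothetical candidates; the F1 values are invented for illustration only.
candidates = [
    {"run_id": "run-a", "model_family": "logistic_regression", "f1": 0.81},
    {"run_id": "run-b", "model_family": "svc", "f1": 0.84},
    {"run_id": "run-c", "model_family": "xgboost", "f1": 0.83},
]

champion, challenger = pick_champion_challenger(candidates)
print(champion["run_id"], challenger["run_id"])              # run-b run-c
print(alias_uri("my_catalog.my_schema.my_model", "Champion"))
# models:/my_catalog.my_schema.my_model@Champion
```

A URI of this form could then be passed to a loader such as `mlflow.pyfunc.load_model`, so downstream code can follow the alias instead of a fixed version number.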