___
<img style="float: right; margin: 15px 15px 15px 15px;" src="https://img.freepik.com/free-vector/depression-concept-illustration_114360-3747.jpg?t=st=1657678284~exp=1657678884~hmac=b8b1d71ca0a8eb2e4ff5bf31d6a98624112f1a2254b0f39e92254ed12d7875b2" width="240px" height="180px" />

# <font color= #bbc28d> **Clasificaci√≥n de Depresi√≥n - Modelado** </font>
#### <font color= #2E9AFE> `Proyecto de Ciencia de Datos`</font>
- <Strong> Sof√≠a Maldonado, Diana Valdivia, Samantha S√°nchez & Vivienne Toledo </Strong>
- <Strong> Fecha </Strong>: 11/11/2025.

___

<p style="text-align:right;"> Image retrieved from: https://img.freepik.com/free-vector/depression-concept-illustration_114360-3747.jpg?t=st=1657678284~exp=1657678884~hmac=b8b1d71ca0a8eb2e4ff5bf31d6a98624112f1a2254b0f39e92254ed12d7875b2/p>

# <font color= #bbc28d>**Datos y Data Readiness** </font>
En esta etapa se sigui√≥ trabajando con la misma base de datos la cu√°l intenta clasificar a los j√≥venes estudiantes en dos categor√≠as: aquellos que presentan s√≠ntomas de depresi√≥n y aquellos que no, siendo esta columna nuestro objetivo. Aplicando diferentes transformaciones y limpiezas para preparar los datos antes del modelado. Las acciones principales fueron:

#### <font color=#99c0c4>1. **Tratamiento de datos faltantes** </font>
- Se detect√≥ que la columna Financial Stress conten√≠a 3 valores nulos.
- Dado que era una cantidad peque√±a en comparaci√≥n con el total de registros, se decidi√≥ eliminar esas filas.

#### <font color=#99c0c4>2. **Filtrado de datos categ√≥ricos** </font>
- Se identificaron variables con categor√≠as poco representativas (con una o dos filas por valor), como City, Dietary Habits, Sleep Duration y Degree.
- Se eliminaron registros aislados para reducir ruido y mejorar la representatividad de los datos.

#### <font color=#99c0c4>3. **Ajuste del dataset al enfoque del proyecto** </font>
- Se filtraron los registros de Age para conservar

In [1]:
# General Libraries
import os
import pandas as pd

# Databricks Env
import pathlib
import pickle
from dotenv import load_dotenv

# Feature Engineering
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Optimization
import math
import optuna
from optuna.samplers import TPESampler

# MLFlow
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from mlflow import MlflowClient

# Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Evaluation Metrics
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

## <font color= #bbc28d>‚Ä¢ **Credenciales & Set-up inicial** </font>
Para poder trabajar con MLFlow es necesario ingresar con nuestros tokens de acceso y definir la base con la que estaremos trabajando, en nuestro caso ser√° con Databricks:

In [31]:
# ======================================
# Load .env and Log in to Databricks
# ======================================

# Cargar las variables del archivo .env
load_dotenv(override=True)  # Carga las variables del archivo .env
EXPERIMENT_NAME = "/Users/pipochatgpt@gmail.com/Depression_Class"

mlflow.set_tracking_uri("databricks")
experiment = mlflow.set_experiment(experiment_name=EXPERIMENT_NAME)

## <font color= #bbc28d>‚Ä¢ **Preprocesamiento** </font>

Convertimos lo planteado en la libreta de limpieza de datos en una funci√≥n que limpie y preprocese los datos de la siguiente manera. 

### <font color= #7fb2b5>‚Ä¢ **Limpiado de Datos** </font>

Consiste en una funci√≥n que toma el dataframe original y:

- Elimina valores nulos
- Filtra categor√≠as que tengan poca fuerza predictora
- Codifica las variables categ√≥ricas binarias
- Codifica las variables categ√≥ricas m√∫ltiples mediante codificaci√≥n ordinal
- Realiza un train-test-val split (70-20-10) con una semilla fija

Adem√°s, almacena los datos en carpetas dentro del almacenamiento local si el usuario lo indica.

### <font color= #7fb2b5>‚Ä¢ **Preprocesamiento** </font>

El preprocesamiento de datos retoma los valores obtenidos en la funci√≥n anterior y aplica una codificaci√≥n de tipo OneHot, adem√°s de una estandarizaci√≥n est√°ndar, con la librer√≠a de Scikit-Learn. Estos artefactos son almacenados en almacenamiento local y dentro de MLflow, si el usuario lo indica. 

El preprocesamiento se aplica antes de correr cada modelo y guarda los artefactos (OneHot y Standard Scaler) asociados a la run. 

In [3]:
df = pd.read_csv("../data/raw/depression_dataset.csv")

def clean_data(df, save_data=False):
    # 1. Eliminar valores nulos
    df = df.dropna()


    # 2. Filtrado de categor√≠as que otorgan poca informaci√≥n debido a su baja prevalencia
    # City
    ciudades = df['City'].value_counts()[df['City'].value_counts() < 450]
    df = df[~df['City'].isin(ciudades.index)]
    # Dietary Habits
    df = df[df['Dietary Habits'] != 'Others']
    # Sleep Duration
    df = df[df['Sleep Duration'] != 'Others']
    # Degree
    df = df[df['Degree'] != 'Others']
    # Age
    df = df[df['Age'] <= 35]
    # Academic Pressure
    df = df[df['Academic Pressure'] > 0]
    # Study Satisfaction
    df = df[df['Study Satisfaction'] > 0]

    # 3. Eliminar variables que no son buenas predictoras
    df.drop(columns=['Work Pressure', 'Profession', 'Job Satisfaction', 'id'], axis=1, inplace=True)


    # 4. Mapear las variables categ√≥ricas binarias
    gender = {'Male' : 0, 'Female' : 1}
    general = {'Yes' : 1, 'No' : 0}
    df['Gender'] = df['Gender'].map(gender)
    df['Have you ever had suicidal thoughts ?'] = df['Have you ever had suicidal thoughts ?'].map(general)
    df['Family History of Mental Illness'] = df['Family History of Mental Illness'].map(general)


    # 5. Mapear las variables categ√≥ricas m√∫ltiples
    degree = {
    "Class 12": "Secondary",
    "B.Pharm": "Undergraduate", "BSc": "Undergraduate", "BA": "Undergraduate", "BCA": "Undergraduate",
    "B.Ed": "Undergraduate", "LLB": "Undergraduate", "BE": "Undergraduate", "BHM": "Undergraduate",
    "B.Com": "Undergraduate", "B.Arch": "Undergraduate", "B.Tech": "Undergraduate", "BBA": "Undergraduate",
    "M.Tech": "Postgraduate", "M.Ed": "Postgraduate", "MSc": "Postgraduate", "M.Pharm": "Postgraduate",
    "MCA": "Postgraduate", "MA": "Postgraduate", "MBA": "Postgraduate", "M.Com": "Postgraduate", "MHM": "Postgraduate",
    "PhD": "Doctorate", "MD": "Doctorate", "MBBS": "Doctorate", "LLM": "Doctorate", "ME": "Postgraduate"
    }
    orden_degree = {"Secondary": 0, "Undergraduate": 1, "Postgraduate": 2, "Doctorate": 3}
    orden_alimentos = {'Healthy': 0, 'Unhealthy': 1, 'Moderate': 2}
    orden_siesta = {'Less than 5 hours': 0, '5-6 hours': 1, '7-8 hours': 2,'More than 8 hours': 3}
    # Aplicar el mapeo
    df['Degree'] = df['Degree'].map(degree)
    df['Degree'] = df['Degree'].map(orden_degree)
    df['Dietary Habits'] = df['Dietary Habits'].map(orden_alimentos)
    df['Sleep Duration'] = df['Sleep Duration'].map(orden_siesta)


    # 6. Train-Test-Val Split (70-20-10)
    X = df.drop(['Depression'], axis=1)
    y = df['Depression']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=42)
    X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.66, random_state=42)

    if save_data:
        # Guardar las variables 
        X_train.to_csv(r'..\data\interim\X_train.csv', index=False)
        X_test.to_csv(r'..\data\interim\X_test.csv', index=False)
        X_val.to_csv(r'..\data\interim\X_val.csv', index=False)
        
        y_train.to_csv(r'..\data\processed\y_train.csv', index=False)
        y_test.to_csv(r'..\data\processed\y_test.csv', index=False)
        y_val.to_csv(r'..\data\processed\y_val.csv', index=False)
    
    # Convertir las variables dependientes en NumPy arrays
    y_train = y_train.to_numpy().ravel()
    y_test = y_test.to_numpy().ravel()
    y_val = y_val.to_numpy().ravel()   

    return X_train, X_test, X_val, y_train, y_test, y_val

In [4]:
def preprocessor(X_train, X_test, X_val=None, save_data=False, save_artifacts=True):
    # Codificar variables m√∫ltiples mediante One-Hot
    encoder = OneHotEncoder(
        drop='first',
        handle_unknown='ignore',        # Evita error si aparece algo nuevo
        sparse_output=False
    )

    # Entrenar el objeto con los datos del train
    encoder.fit(X_train[['City']])
    
    # Aplicar One-Hot
    X_train_city = encoder.transform(X_train[['City']])
    X_test_city = encoder.transform(X_test[['City']])
    X_val_city = encoder.transform(X_val[['City']]) if X_val is not None else None
    
    # Obtener los nombres del One-Hot
    city_cols = encoder.get_feature_names_out(['City'])  # Nombres autom√°ticos de columnas
    
    # Crear un df con las columnas codificadas
    X_train_city_df = pd.DataFrame(X_train_city, columns=city_cols, index=X_train.index)
    X_test_city_df = pd.DataFrame(X_test_city, columns=city_cols, index=X_test.index)
    X_val_city_df = pd.DataFrame(X_val_city, columns=city_cols, index=X_val.index) if X_val is not None else None
    
    # Eliminar la columna original en el dataset
    X_train = X_train.drop(columns=['City'])
    X_test = X_test.drop(columns=['City'])
    X_val = X_val.drop(columns=['City']) if X_val is not None else None
    
    # Juntar las nuevas columnas con el dataset antiguo
    X_train_final = pd.concat([X_train, X_train_city_df], axis=1)
    X_test_final = pd.concat([X_test, X_test_city_df], axis=1)
    X_val_final = pd.concat([X_val, X_val_city_df], axis=1) if X_val is not None else None

    # Aplicar una estandarizaci√≥n a los datos
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_final)
    X_test_scaled = scaler.transform(X_test_final)
    X_val_scaled = scaler.transform(X_val_final) if X_val is not None else None

    # Guardar los artefactos
    if save_artifacts:
        os.makedirs("artifacts/preprocessor", exist_ok=True)

        # Save encoder
        with open('artifacts/preprocessor/encoder.pkl', 'wb') as f_out:
            pickle.dump(encoder, f_out)
        # Save scaler
        with open('artifacts/preprocessor/scaler.pkl', 'wb') as f_out:
            pickle.dump(scaler, f_out)

        # Log artifacts to MLflow
        mlflow.log_artifact("artifacts/preprocessor/encoder.pkl", artifact_path="preprocessor")
        mlflow.log_artifact("artifacts/preprocessor/scaler.pkl", artifact_path="preprocessor")

        print("Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.")

    if save_data:
        # Regresar los datos a dataframe y guardarlos
        X_train_df = pd.DataFrame(X_train_scaled, columns=X_train_final.columns, index=X_train_final.index)
        X_test_df = pd.DataFrame(X_test_scaled, columns=X_test_final.columns, index=X_test_final.index)
        X_val_df = pd.DataFrame(X_val_scaled, columns=X_val_final.columns, index=X_val_final.index) if X_val is not None else None

        X_train_df.to_csv(r'..\data\processed\X_train.csv', index=False)
        X_test_df.to_csv(r'..\data\processed\X_test.csv', index=False)
        X_val_df.to_csv(r'..\data\processed\X_val.csv', index=False)

    return X_train_scaled, X_test_scaled, X_val_scaled, encoder, scaler

In [5]:
# Clean the data and obtain the targets
X_train, X_test, X_val, y_train, y_test, y_val = clean_data(df)

## <font color= #bbc28d>‚Ä¢ **Modelado** </font>
Retomando un poco lo de entregas pasadas, este proyecto trabaja con un conjunto de datos cuyo objetivo es  **clasificar a estudiantes** en dos categor√≠as: aquellos que presentan **s√≠ntomas de depresi√≥n** y **aquellos que no**. Debido a la naturaleza de los datos, estamos hablando de un problema de clasificaci√≥n binaria, as√≠ que para esto, elegiremos modelos que se ajustan bien a este tipo de problemas:
- Logistic Regression
- SVC
- XGBoost

A continuaci√≥n realizaremos la **optimizaci√≥n de hiperpar√°metros** y el **entrenamiento de tres modelos de clasificaci√≥n binaria**. Para cada modelo:

1. Se utiliza **Optuna** para explorar diferentes combinaciones de hiperpar√°metros y maximizar la `F1-score` (Esta es la m√©trica m√°s balanceada ya que es un promedio). Cada combinaci√≥n de par√°metros se eval√∫a mediante una funci√≥n objetivo (`objective`) que entrena el modelo, realiza predicciones sobre el conjunto de prueba y calcula m√©tricas de rendimiento como `accuracy`, `precision`, `f1` y `recall`.

2. Se emplea **MLflow** para hacer un seguimiento autom√°tico de los experimentos (`autolog`) y registrar los par√°metros, m√©tricas y modelos entrenados. 

3. Para Logistic Regression y SVC, se crean estudios de Optuna que prueban un n√∫mero definido de configuraciones (`n_trials=3`) y se seleccionan los mejores par√°metros encontrados. Para XGBoost, adem√°s se ajustan hiperpar√°metros como n√∫mero de √°rboles, profundidad m√°xima, tasa de aprendizaje y gamma.

### <font color= #7fb2b5>‚Ä¢ **Logistic Regression** </font>

In [8]:
def hp_tuning_lr(X_train, X_test, y_train, y_test):

    mlflow.sklearn.autolog()

    # Start Optuna and MLflow
    def objective_lr(trial: optuna.trial.Trial):
        params = {
            'penalty': trial.suggest_categorical('penalty', ['l2','l1','elasticnet']),
            'solver': 'saga'
        }

        with mlflow.start_run(nested=True):
            # Preprocess data and log artifacts
            X_train_scaled, X_test_scaled, _, encoder, scaler = preprocessor(X_train, X_test, X_val=None, save_artifacts=True)

            # Get MLflow ID to store the preprocessing artifacts
            preprocessor_run_id = mlflow.active_run().info.run_id

            mlflow.set_tag('model_family', 'logistic_regression')
            mlflow.log_params(params)
            mlflow.log_param('preprocessor_run_id', preprocessor_run_id)

            lr_model = LogisticRegression(**params)
            lr_model.fit(X_train_scaled, y_train)

            # Get predictions and metrics
            y_pred = lr_model.predict(X_test_scaled)
            acc = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)

            # Log metrics
            mlflow.log_metric('acc', acc)
            mlflow.log_metric('precision', precision)
            mlflow.log_metric('f1', f1)
            mlflow.log_metric('recall', recall)

            signature = infer_signature(X_test_scaled, y_pred)

            # Log the trained model
            mlflow.sklearn.log_model(
                lr_model,
                name='lr_model',
                input_example=X_test_scaled[:5],
                signature=signature
            )
        
        return f1
    
    sampler = TPESampler(seed=42)
    lr_study = optuna.create_study(direction='maximize', sampler=sampler)

    with mlflow.start_run(run_name='Logistic Regression (Optuna)', nested=True):
        lr_study.optimize(objective_lr, n_trials=3)
    
    best_params_lr = lr_study.best_params

    return best_params_lr

### <font color= #7fb2b5>‚Ä¢ **SVC** </font>

In [9]:
def hp_tuning_svc(X_train, X_test, y_train, y_test):

    mlflow.sklearn.autolog()

    def objective_svc(trial: optuna.trial.Trial):
        params = {
            'kernel': trial.suggest_categorical('kernel', ['sigmoid','poly','linear','rbf'])
        }

        with mlflow.start_run(nested=True):
            # Preprocess data and log artifacts
            X_train_scaled, X_test_scaled, _, encoder, scaler = preprocessor(X_train, X_test, X_val=None, save_artifacts=True)

            # Get MLflow ID to store the preprocessing artifacts
            preprocessor_run_id = mlflow.active_run().info.run_id

            mlflow.set_tag('model_family', 'svc')
            mlflow.log_params(params)
            mlflow.log_param('preprocessor_run_id', preprocessor_run_id)

            svc_model = SVC(**params)
            svc_model.fit(X_train_scaled, y_train)

            y_pred = svc_model.predict(X_test_scaled)
            acc = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)

            mlflow.log_metric('acc', acc)
            mlflow.log_metric('precision', precision)
            mlflow.log_metric('f1', f1)
            mlflow.log_metric('recall', recall)

            signature = infer_signature(X_test_scaled, y_pred)

            mlflow.sklearn.log_model(
                svc_model,
                name='svc_model',
                input_example=X_test_scaled[:5],
                signature=signature
            )
        
        return f1

    
    sampler = TPESampler(seed=42)
    svc_study = optuna.create_study(direction='maximize', sampler=sampler)

    with mlflow.start_run(run_name='Support Vector Classifier (Optuna)', nested=True):
        svc_study.optimize(objective_svc, n_trials=3)
    
    best_params_svc = svc_study.best_params

    best_params_svc['random_state'] = 42

    return best_params_svc

### <font color= #7fb2b5>‚Ä¢ **XGBoost** </font>

In [10]:
def hp_tuning_xgboost(X_train, X_test, y_train, y_test):
    # Habilitar autolog
    mlflow.xgboost.autolog()

    # Funci√≥n objetivo para Optuna
    def objective_xgb(trial: optuna.trial.Trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 150),
            'max_depth': trial.suggest_int('max_depth', 2, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'gamma': trial.suggest_float('gamma', 0, 5),
            'eval_metric': 'logloss'
        }

        with mlflow.start_run(nested=True):
            # Preprocess data and log artifacts
            X_train_scaled, X_test_scaled, _, encoder, scaler = preprocessor(X_train, X_test, X_val=None, save_artifacts=True)

            # Get MLflow ID to store the preprocessing artifacts
            preprocessor_run_id = mlflow.active_run().info.run_id

            mlflow.set_tag('model_family', 'Xgboost')
            mlflow.log_params(params)
            mlflow.log_param('preprocessor_run_id', preprocessor_run_id)

            xgb_model = XGBClassifier(**params)
            xgb_model.fit(X_train_scaled, y_train)

            y_pred = xgb_model.predict(X_test_scaled)
            acc = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)

            mlflow.log_metric('acc', acc)
            mlflow.log_metric('precision', precision)
            mlflow.log_metric('f1', f1)
            mlflow.log_metric('recall', recall)

            signature = infer_signature(X_test_scaled, y_pred)

            mlflow.xgboost.log_model(
                xgb_model,
                artifact_path='xgboost_model',
                input_example=X_test_scaled[:5],
                signature=signature
            )
        
        return f1

    # Crear y ejecutar el estudio de Optuna
    sampler = TPESampler(seed=42)
    xgb_study = optuna.create_study(direction='maximize', sampler=sampler)

    with mlflow.start_run(run_name='XGBoost (Optuna)', nested=True):
        xgb_study.optimize(objective_xgb, n_trials=3)

    # Obtener los mejores par√°metros
    best_params_xgb = xgb_study.best_params
    best_params_xgb['random_state'] = 42

    return best_params_xgb


In [12]:
best_params_xgb = hp_tuning_xgboost(X_train, X_test, y_train, y_test)
best_params_svc = hp_tuning_svc(X_train, X_test, y_train, y_test)
best_params_lr = hp_tuning_lr(X_train, X_test, y_train, y_test)

[I 2025-11-11 18:35:04,377] A new study created in memory with name: no-name-3a35aaf4-500c-474e-8461-3eacbfac4450


Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.




Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

2025/11/11 18:35:56 INFO mlflow.models.model: Found the following environment variables used during model inference: [DATABRICKS_HOST, DATABRICKS_TOKEN]. Please check if you need to set them when deploying the model. To disable this message, set environment variable `MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING` to `false`.
[I 2025-11-11 18:35:57,496] Trial 0 finished with value: 0.8603763987792472 and parameters: {'n_estimators': 87, 'max_depth': 10, 'learning_rate': 0.1205712628744377, 'gamma': 2.993292420985183}. Best is trial 0 with value: 0.8603763987792472.


üèÉ View run painted-auk-836 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/b648e2e836ba4bc1bcbc008cdc8270fd
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753
Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.




Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

[I 2025-11-11 18:36:25,804] Trial 1 finished with value: 0.8494370901892861 and parameters: {'n_estimators': 65, 'max_depth': 3, 'learning_rate': 0.012184186502221764, 'gamma': 4.330880728874676}. Best is trial 0 with value: 0.8603763987792472.


üèÉ View run charming-boar-629 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/c4d85184268247a2ae0a38d21891aeb4
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753
Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.




Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

[I 2025-11-11 18:37:06,649] Trial 2 finished with value: 0.8604622111180512 and parameters: {'n_estimators': 110, 'max_depth': 8, 'learning_rate': 0.010725209743171996, 'gamma': 4.8495492608099715}. Best is trial 2 with value: 0.8604622111180512.


üèÉ View run kindly-bat-829 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/c0b96f28a7924c59968f8393991e792b
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753
üèÉ View run XGBoost (Optuna) at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/5acd2f8926754e1fbc65ab363fa40782
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753


[I 2025-11-11 18:37:08,481] A new study created in memory with name: no-name-fe9dff61-e9ca-4831-805b-cb396635019e


Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.


Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

2025/11/11 18:37:47 INFO mlflow.models.model: Found the following environment variables used during model inference: [DATABRICKS_HOST, DATABRICKS_TOKEN]. Please check if you need to set them when deploying the model. To disable this message, set environment variable `MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING` to `false`.


üèÉ View run loud-chimp-509 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/d84d6b5e1d5a49dbb6ebbfd30b9b3460
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753


[I 2025-11-11 18:38:20,471] Trial 0 finished with value: 0.8556892914753692 and parameters: {'kernel': 'poly'}. Best is trial 0 with value: 0.8556892914753692.


Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.


Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

2025/11/11 18:39:33 INFO mlflow.models.model: Found the following environment variables used during model inference: [DATABRICKS_HOST, DATABRICKS_TOKEN]. Please check if you need to set them when deploying the model. To disable this message, set environment variable `MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING` to `false`.


üèÉ View run serious-cow-296 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/9cdb9de05ae647af881f59166fd3ea8d
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753


[I 2025-11-11 18:39:38,643] Trial 1 finished with value: 0.8614564831261101 and parameters: {'kernel': 'rbf'}. Best is trial 1 with value: 0.8614564831261101.


Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.


Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

2025/11/11 18:40:42 INFO mlflow.models.model: Found the following environment variables used during model inference: [DATABRICKS_HOST, DATABRICKS_TOKEN]. Please check if you need to set them when deploying the model. To disable this message, set environment variable `MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING` to `false`.
[I 2025-11-11 18:40:47,235] Trial 2 finished with value: 0.8614564831261101 and parameters: {'kernel': 'rbf'}. Best is trial 1 with value: 0.8614564831261101.


üèÉ View run handsome-dog-971 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/b042ef11babe4afc892f4c90e58d82c2
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753
üèÉ View run Support Vector Classifier (Optuna) at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/e36ea341d2ca42edb40efcf020993553
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753


[I 2025-11-11 18:40:47,823] A new study created in memory with name: no-name-686b74d7-1448-4318-99e8-3b1d4e157d50


Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.


Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

2025/11/11 18:41:13 INFO mlflow.models.model: Found the following environment variables used during model inference: [DATABRICKS_HOST, DATABRICKS_TOKEN]. Please check if you need to set them when deploying the model. To disable this message, set environment variable `MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING` to `false`.
[I 2025-11-11 18:41:15,475] Trial 0 finished with value: 0.8670476190476191 and parameters: {'penalty': 'l1'}. Best is trial 0 with value: 0.8670476190476191.


üèÉ View run rogue-fox-501 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/a330bbd695a648aca02a8da4197c39c5
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753
Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.


Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

2025/11/11 18:41:43 INFO mlflow.models.model: Found the following environment variables used during model inference: [DATABRICKS_HOST, DATABRICKS_TOKEN]. Please check if you need to set them when deploying the model. To disable this message, set environment variable `MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING` to `false`.


üèÉ View run sneaky-shrew-96 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/a6f72242dc754b138a358a2ae576e1a0
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753


[I 2025-11-11 18:41:45,148] Trial 1 finished with value: 0.8669037338074677 and parameters: {'penalty': 'l2'}. Best is trial 0 with value: 0.8670476190476191.


Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.


Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

2025/11/11 18:42:11 INFO mlflow.models.model: Found the following environment variables used during model inference: [DATABRICKS_HOST, DATABRICKS_TOKEN]. Please check if you need to set them when deploying the model. To disable this message, set environment variable `MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING` to `false`.
[I 2025-11-11 18:42:12,341] Trial 2 finished with value: 0.8670476190476191 and parameters: {'penalty': 'l1'}. Best is trial 0 with value: 0.8670476190476191.


üèÉ View run upbeat-wasp-948 at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/7db5888fbb324bdba8ecc2ea214b6431
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753
üèÉ View run Logistic Regression (Optuna) at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/b128456551a64afd85197be6c640535e
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753


## <font color= #bbc28d>‚Ä¢ **MLFlow Registry** </font>
En esta funci√≥n se entrenan y eval√∫an los tres modelos seleccionados: Logistic Regression, SVC y XGBoost, utilizando los mejores hiperpar√°metros encontrados previamente. Para cada modelo se registran los par√°metros, se calculan m√©tricas de desempe√±o como accuracy, precision, recall y F1-score, y finalmente se almacenan los modelos en MLflow para su seguimiento y futura reutilizaci√≥n. La idea principal es automatizar el entrenamiento, evaluaci√≥n y registro de los modelos de manera consistente y reproducible.

In [44]:
def train_best_models(X_train, y_train, X_test, y_test, best_params_lr, best_params_svc, best_params_xgb) -> None:
    with mlflow.start_run(run_name=' Best Logistic Regression Model'):
        # Preprocess data and log artifacts
        X_train_scaled, X_test_scaled, _, encoder, scaler = preprocessor(X_train, X_test, X_val=None, save_artifacts=True)

        # Get MLflow ID to store the preprocessing artifacts
        preprocessor_run_id = mlflow.active_run().info.run_id
        mlflow.log_param('preprocessor_run_id', preprocessor_run_id)
        mlflow.log_params(best_params_lr)
        mlflow.set_tags({
            'project': 'Depression Prediction Project',
            'optimizer_engine': 'Optuna',
            'model_family': 'logistic_regression',
            'feature_set_version': 1,
            'candidate': 'true'
        })

        lr = LogisticRegression(**best_params_lr, solver='saga')
        lr.fit(X_train_scaled, y_train)

        y_pred_lr = lr.predict(X_test_scaled)

        acc_lr = accuracy_score(y_test, y_pred_lr)
        precision_lr = precision_score(y_test, y_pred_lr)
        f1_lr = f1_score(y_test, y_pred_lr)
        recall_lr = recall_score(y_test, y_pred_lr)

        mlflow.log_metric('acc', acc_lr)
        mlflow.log_metric('precision', precision_lr)
        mlflow.log_metric('f1', f1_lr)
        mlflow.log_metric('recall', recall_lr)

        signature = infer_signature(X_train_scaled, lr.predict(X_train_scaled))
        mlflow.sklearn.log_model(
            lr,
            artifact_path='model',
            signature=signature
        )
    
    with mlflow.start_run(run_name=' Best SVC Model'):
        # Preprocess data and log artifacts
        X_train_scaled, X_test_scaled, _, encoder, scaler = preprocessor(X_train, X_test, X_val=None, save_artifacts=True)

        # Get MLflow ID to store the preprocessing artifacts
        preprocessor_run_id = mlflow.active_run().info.run_id
        mlflow.log_param('preprocessor_run_id', preprocessor_run_id)
        mlflow.log_params(best_params_svc)
        mlflow.set_tags({
            'project': 'Depression Prediction Project',
            'optimizer_engine': 'Optuna',
            'model_family': 'svc',
            'feature_set_version': 1,
            'candidate': 'true'
        })

        svc = SVC(**best_params_svc)
        svc.fit(X_train_scaled, y_train)

        y_pred_svc = svc.predict(X_test_scaled)

        acc_svc = accuracy_score(y_test, y_pred_svc)
        precision_svc = precision_score(y_test, y_pred_svc)
        f1_svc = f1_score(y_test, y_pred_svc)
        recall_svc = recall_score(y_test, y_pred_svc)

        mlflow.log_metric('acc', acc_svc)
        mlflow.log_metric('precision', precision_svc)
        mlflow.log_metric('f1', f1_svc)
        mlflow.log_metric('recall', recall_svc)

        signature = infer_signature(X_train_scaled, svc.predict(X_train_scaled))
        mlflow.sklearn.log_model(
            svc,
            artifact_path='model',
            signature=signature
        )
    
    with mlflow.start_run(run_name=' Best XGBoost Model'):
        # Preprocess data and log artifacts
        X_train_scaled, X_test_scaled, _, encoder, scaler = preprocessor(X_train, X_test, X_val=None, save_artifacts=True)

        # Get MLflow ID to store the preprocessing artifacts
        preprocessor_run_id = mlflow.active_run().info.run_id
        mlflow.log_param('preprocessor_run_id', preprocessor_run_id)
        mlflow.log_params(best_params_xgb)
        mlflow.set_tags({
            'project': 'Depression Prediction Project',
            'optimizer_engine': 'Optuna',
            'model_family': 'Trees',
            'feature_set_version': 1,
            'candidate': 'true'
        })

        xgb = XGBClassifier(**best_params_xgb)
        xgb.fit(X_train_scaled, y_train)
        y_pred_xgb = xgb.predict(X_test_scaled)

        acc_xgb = accuracy_score(y_test, y_pred_xgb)
        precision_xgb = precision_score(y_test, y_pred_xgb)
        f1_xgb = f1_score(y_test, y_pred_xgb)
        recall_xgb = recall_score(y_test, y_pred_xgb)

        mlflow.log_metric('acc', acc_xgb)
        mlflow.log_metric('precision', precision_xgb)
        mlflow.log_metric('f1', f1_xgb)
        mlflow.log_metric('recall', recall_xgb)

        signature = infer_signature(X_train_scaled, xgb.predict(X_train_scaled))
        mlflow.xgboost.log_model(
            xgb,
            artifact_path='model',
            signature=signature
        )

In [45]:
train_best_models(X_train, y_train, X_test, y_test, best_params_lr, best_params_svc, best_params_xgb)

Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.




üèÉ View run  Best Logistic Regression Model at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/9c35dce01110455c896c628517b96a08
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753
Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.




üèÉ View run  Best SVC Model at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/3d2d721a3f554ceb8c4dfe5c49a0de75
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753
Preprocessor artifacts (encoder & scaler) successfully logged to MLflow.




üèÉ View run  Best XGBoost Model at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753/runs/b59d2b9b16b14247a07eb765a3ce5eb6
üß™ View experiment at: https://dbc-c600c0c2-acad.cloud.databricks.com/ml/experiments/2425093441161753


Esta funci√≥n se encarga de registrar autom√°ticamente los dos mejores modelos de un experimento en el **Model Registry** de MLflow y asignarles los aliases `Champion` y `Challenger`. 
1. Primero busca todos los runs marcados como candidatos (`candidate=true`) y los ordena seg√∫n la m√©trica F1. 
2. Luego selecciona los dos primeros: el de mayor F1 se registra como `Champion` y el segundo como `Challenger`. 
3. Cada modelo se registra en el model registry y se le asigna su alias correspondiente.

In [None]:
# Setear la URI del Model Registry a legacy Workspace
mlflow.set_registry_uri("databricks-uc")

def register_champion_challenger(exp=EXPERIMENT_NAME, model_registry_name="workspace.default.DepressionClass"):
    client = MlflowClient()

    # Buscar los runs candidatos ordenados por F1
    runs = mlflow.search_runs(
        experiment_names=[exp],
        filter_string="tags.candidate = 'true'",
        order_by=["metrics.f1 DESC"]
    )

    if runs.empty:
        print("No candidate runs found.")
        return

    # Tomar los dos mejores
    champion = runs.iloc[0]
    challenger = runs.iloc[1] if len(runs) > 1 else None

    def register(run_row, alias):
        if run_row is None:
            return

        run_id = run_row['run_id']
        f1 = run_row['metrics.f1']
        model_family = run_row['tags.model_family']

        # Registrar modelo
        result = mlflow.register_model(
            model_uri=f"runs:/{run_id}/model",
            name=model_registry_name
        )

        # Asignar alias
        client.set_registered_model_alias(
            name=model_registry_name,
            alias=alias,
            version=result.version
        )

        print(f"{alias} registrado: {model_family} con F1={f1} (Run ID: {run_id})")

    register(champion, "Champion")
    register(challenger, "Challenger")

register_champion_challenger()

Registered model 'workspace.default.DepressionClass' already exists. Creating a new version of this model...


Downloading artifacts:   0%|          | 0/5 [00:00<?, ?it/s]

Uploading artifacts:   0%|          | 0/6 [00:00<?, ?it/s]

Created version '1' of model 'workspace.default.depressionclass'.


Champion registrado: logistic_regression con F1=0.8670476190476191 (Run ID: 9c35dce01110455c896c628517b96a08)


Registered model 'workspace.default.DepressionClass' already exists. Creating a new version of this model...


Downloading artifacts:   0%|          | 0/5 [00:00<?, ?it/s]

Uploading artifacts:   0%|          | 0/6 [00:00<?, ?it/s]

Created version '2' of model 'workspace.default.depressionclass'.


Challenger registrado: svc con F1=0.8614564831261101 (Run ID: 3d2d721a3f554ceb8c4dfe5c49a0de75)
