<h1><center>Laboratorio 9: Optimización de modelos 💯</center></h1>

<center><strong>MDS7202: Laboratorio de Programación Científica para Ciencia de Datos</strong></center>

### Cuerpo Docente:

- Profesor: Ignacio Meza, Gabriel Iturra
- Auxiliar: Sebastián Tinoco
- Ayudante: Arturo Lazcano, Angelo Muñoz

### Equipo: SUPER IMPORTANTE - notebooks sin nombre no serán revisados

- Nombre de alumno 1: Vicente Correa
- Nombre de alumno 2: Diego Kauer


## Temas a tratar

- Predicción de demanda usando `xgboost`
- Búsqueda del modelo óptimo de clasificación usando `optuna`
- Uso de pipelines.

## Reglas:

- **Grupos de 2 personas**
- Cualquier duda fuera del horario de clases al foro. Mensajes al equipo docente serán respondidos por este medio.
- Prohibidas las copias. 
- Pueden usar cualquer material del curso que estimen conveniente.

### Objetivos principales del laboratorio

- Optimizar modelos usando `optuna`
- Recurrir a técnicas de *prunning*
- Forzar el aprendizaje de relaciones entre variables mediante *constraints*
- Fijar un pipeline con un modelo base que luego se irá optimizando.

El laboratorio deberá ser desarrollado sin el uso indiscriminado de iteradores nativos de python (aka "for", "while"). La idea es que aprendan a exprimir al máximo las funciones optimizadas que nos entrega `pandas`, las cuales vale mencionar, son bastante más eficientes que los iteradores nativos sobre DataFrames.

### **Link de repositorio de GitHub:** https://github.com/diegokauer/MDS7202

# Importamos librerias útiles

In [1]:
!pip install xgboost optuna



# 1. El emprendimiento de Fiu

Tras liderar de manera exitosa la implementación de un proyecto de ciencia de datos para caracterizar los datos generados en Santiago 2023, el misterioso corpóreo **Fiu** se anima y decide levantar su propio negocio de consultoría en machine learning. Tras varias e intensas negociaciones, Fiu logra encontrar su *primera chamba*: predecir la demanda (cantidad de venta) de una famosa productora de bebidas de calibre mundial. Como usted tuvo un rendimiento sobresaliente en el proyecto de caracterización de datos, Fiu lo contrata como *data scientist* de su emprendimiento.

Para este laboratorio deben trabajar con los datos `sales.csv` subidos a u-cursos, el cual contiene una muestra de ventas de la empresa para diferentes productos en un determinado tiempo.

Para comenzar, cargue el dataset señalado y visualice a través de un `.head` los atributos que posee el dataset.

<i><p align="center">Fiu siendo felicitado por su excelente desempeño en el proyecto de caracterización de datos</p></i>
<p align="center">
  <img src="https://media-front.elmostrador.cl/2023/09/A_UNO_1506411_2440e.jpg">
</p>

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.read_csv('sales.csv')
df['date'] = pd.to_datetime(df['date'])

df.head()

  df['date'] = pd.to_datetime(df['date'])


Unnamed: 0,id,date,city,lat,long,pop,shop,brand,container,capacity,price,quantity
0,0,2012-01-31,Athens,37.97945,23.71622,672130,shop_1,kinder-cola,glass,500ml,0.96,13280
1,1,2012-01-31,Athens,37.97945,23.71622,672130,shop_1,kinder-cola,plastic,1.5lt,2.86,6727
2,2,2012-01-31,Athens,37.97945,23.71622,672130,shop_1,kinder-cola,can,330ml,0.87,9848
3,3,2012-01-31,Athens,37.97945,23.71622,672130,shop_1,adult-cola,glass,500ml,1.0,20050
4,4,2012-01-31,Athens,37.97945,23.71622,672130,shop_1,adult-cola,can,330ml,0.39,25696


In [2]:
df.shape

(7456, 12)

In [3]:
df.nunique()

id           7456
date           84
city            5
lat             6
long            6
pop            35
shop            6
brand           5
container       3
capacity        3
price         402
quantity     6906
dtype: int64

In [4]:
df['quantity'].describe()

count      7456.000000
mean      29408.428380
std       17652.985675
min        2953.000000
25%       16572.750000
50%       25294.500000
75%       37699.000000
max      145287.000000
Name: quantity, dtype: float64

## 1.1 Generando un Baseline (0.5 puntos)

<p align="center">
  <img src="https://media.tenor.com/O-lan6TkadUAAAAC/what-i-wnna-do-after-a-baseline.gif">
</p>

Antes de entrenar un algoritmo, usted recuerda los apuntes de su magíster en ciencia de datos y recuerda que debe seguir una serie de *buenas prácticas* para entrenar correcta y debidamente su modelo. Después de un par de vueltas, llega a las siguientes tareas:

1. Separe los datos en conjuntos de train (70%), validation (20%) y test (10%). Fije una semilla para controlar la aleatoriedad.
2. Implemente un `FunctionTransformer` para extraer el día, mes y año de la variable `date`. Guarde estas variables en el formato categorical de pandas.
3. Implemente un `ColumnTransformer` para procesar de manera adecuada los datos numéricos y categóricos. Use `OneHotEncoder` para las variables categóricas.
4. Guarde los pasos anteriores en un `Pipeline`, dejando como último paso el regresor `DummyRegressor` para generar predicciones en base a promedios.
5. Entrene el pipeline anterior y reporte la métrica `mean_absolute_error` sobre los datos de validación. ¿Cómo se interpreta esta métrica para el contexto del negocio?
6. Finalmente, vuelva a entrenar el `Pipeline` pero esta vez usando `XGBRegressor` como modelo **utilizando los parámetros por default**. ¿Cómo cambia el MAE al implementar este algoritmo? ¿Es mejor o peor que el `DummyRegressor`?
7. Guarde ambos modelos en un archivo .pkl (uno cada uno)

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
import pickle

X = df.loc[:, ~df.columns.isin(['quantity'])]
y = df.loc[:, 'quantity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size= 0.2/(0.2 + 0.7), random_state=42)

def extract_date_info(df):
    df['date'] = pd.to_datetime(df['date'])
    df['day'] = df['date'].dt.day.astype('category')
    df['month'] = df['date'].dt.month.astype('category')
    df['year'] = df['date'].dt.year.astype('category')
    return df.drop('date', axis=1)

def get_feature_names_out(transformer, input_features=None):
    return ['dia', 'mes', 'anho']


date_transformer = ColumnTransformer(
    [
        ('cat', FunctionTransformer(extract_date_info, validate=False, feature_names_out=get_feature_names_out), ['date'])
    ], remainder='passthrough'
)
date_transformer.set_output(transform="pandas")

numeric_features = X.select_dtypes(include=['int', 'float']).columns
categorical_features = X.select_dtypes(include=['category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('minmax', MinMaxScaler(), [3,5,6,7,12]),
        ('oh', OneHotEncoder(sparse_output=False), [0,1,2]),
    ]
)
preprocessor.set_output(transform="pandas")

### Entrenamiento con regresor Dummy

pipe = Pipeline(steps=[
    ('date_extraction', date_transformer),
    ('preprocessor', preprocessor),
    ('regressor', DummyRegressor())
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_val)
mae_dummy = mean_absolute_error(y_val, y_pred)
print(f'Dummy MAE: {mean_absolute_error(y_val, y_pred)}')

with open('Dummy_model.pkl', 'wb') as f:
    pickle.dump(pipe, f)

### Entrenamiento con regresor XGBoost

pipe = Pipeline(steps=[
    ('date_extraction', date_transformer),
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor())
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_val)
mae_xgb_default = mean_absolute_error(y_val, y_pred)
print(f'XGBoost MAE: {mae_xgb_default}')
# pipe.fit_transform(X_train, y_train)

with open('XGBoost_default_model.pkl', 'wb') as f:
    pickle.dump(pipe, f)

Dummy MAE: 13543.961387782238
XGBoost MAE: 7008.988837929897


## 1.2 Forzando relaciones entre parámetros con XGBoost (1.0 puntos)

<p align="center">
  <img src="https://64.media.tumblr.com/14cc45f9610a6ee341a45fd0d68f4dde/20d11b36022bca7b-bf/s640x960/67ab1db12ff73a530f649ac455c000945d99c0d6.gif">
</p>

Un colega aficionado a la economía le *sopla* que la demanda guarda una relación inversa con el precio del producto. Motivado para impresionar al querido corpóreo, se propone hacer uso de esta información para mejorar su modelo.

Vuelva a entrenar el `Pipeline`, pero esta vez forzando una relación monótona negativa entre el precio y la cantidad. Luego, vuelva a reportar el `MAE` sobre el conjunto de validación. ¿Cómo cambia el error al incluir esta relación? ¿Tenía razón su amigo?

Nuevamente, guarde su modelo en un archivo .pkl

Nota: Para realizar esta parte, debe apoyarse en la siguiente <a href = https://xgboost.readthedocs.io/en/stable/tutorials/monotonic.html>documentación</a>.

Hint: Para implementar el constraint, se le sugiere hacerlo especificando el nombre de la variable. De ser así, probablemente le sea útil **mantener el formato de pandas** antes del step de entrenamiento.

In [16]:
# Inserte su código acá


pipe = Pipeline(steps=[
    ('date_extraction', date_transformer),
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(monotone_constraints={'minmax__remainder__price': -1}))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_val)
mae_xgb_mono = mean_absolute_error(y_val, y_pred)
print(f'XGBoost MAE: {mean_absolute_error(y_val, y_pred)}')

with open('XGBoost.pkl', 'wb') as f:
    pickle.dump(pipe, f)

XGBoost MAE: 7017.723224788185


## 1.3 Optimización de Hiperparámetros con Optuna (2.0 puntos)

<p align="center">
  <img src="https://media.tenor.com/fmNdyGN4z5kAAAAi/hacking-lucy.gif">
</p>

Luego de presentarle sus resultados, Fiu le pregunta si es posible mejorar *aun más* su modelo. En particular, le comenta de la optimización de hiperparámetros con metodologías bayesianas a través del paquete `optuna`. Como usted es un aficionado al entrenamiento de modelos de ML, se propone implementar la descabellada idea de su jefe.

A partir de la mejor configuración obtenida en la sección anterior, utilice `optuna` para optimizar sus hiperparámetros. En particular, se le pide:

- Fijar una semilla en las instancias necesarias para garantizar la reproducibilidad de resultados
- Utilice `TPESampler` como método de muestreo
- De `XGBRegressor`, optimice los siguientes hiperparámetros:
    - `learning_rate` buscando valores flotantes en el rango (0.001, 0.1)
    - `n_estimators` buscando valores enteros en el rango (50, 1000)
    - `max_depth` buscando valores enteros en el rango (3, 10)
    - `max_leaves` buscando valores enteros en el rango (0, 100)
    - `min_child_weight` buscando valores enteros en el rango (1, 5)
    - `reg_alpha` buscando valores flotantes en el rango (0, 1)
    - `reg_lambda` buscando valores flotantes en el rango (0, 1)
- De `OneHotEncoder`, optimice el hiperparámetro `min_frequency` buscando el mejor valor flotante en el rango (0.0, 1.0)
- Explique cada hiperparámetro y su rol en el modelo. ¿Hacen sentido los rangos de optimización indicados?
- Fije el tiempo de entrenamiento a 5 minutos
- Reportar el número de *trials*, el `MAE` y los mejores hiperparámetros encontrados. ¿Cómo cambian sus resultados con respecto a la sección anterior? ¿A qué se puede deber esto?
- Guardar su modelo en un archivo .pkl

In [29]:
# Inserte su código acá
import optuna

def objective_function(trial):
    # Split into features and target
    X = df.loc[:, ~df.columns.isin(['quantity'])]
    y = df.loc[:, 'quantity']

    # Split into train and validation sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size= 0.2/(0.2 + 0.7), random_state=42)

    # Define the hyperparameters to tune
    params_xg = {
        "eval_metric": mean_absolute_error,
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.1),
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "max_leaves": trial.suggest_int("max_leaves", 0, 100),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 5),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 1),
        "reg_lambda": trial.suggest_float("reg_lambda", 0, 1),
    }
    params_oh = {
        "min_frequency": trial.suggest_float("min_frequency", 0, 1)
    }

    def extract_date_info(df):
        df['date'] = pd.to_datetime(df['date'])
        df['day'] = df['date'].dt.day.astype('category')
        df['month'] = df['date'].dt.month.astype('category')
        df['year'] = df['date'].dt.year.astype('category')
        return df.drop('date', axis=1)

    def get_feature_names_out(transformer, input_features=None):
        return ['dia', 'mes', 'anho']


    date_transformer = ColumnTransformer(
        [
            ('cat', FunctionTransformer(extract_date_info, validate=False, feature_names_out=get_feature_names_out), ['date'])
        ], remainder='passthrough'
    )
    date_transformer.set_output(transform="pandas")

    numeric_features = X.select_dtypes(include=['int', 'float']).columns
    categorical_features = X.select_dtypes(include=['category']).columns

    preprocessor = ColumnTransformer(
        transformers=[
            ('minmax', MinMaxScaler(), [3,5,6,7,12]),
            ('oh', OneHotEncoder(sparse_output=False, **params_oh), [0,1,2]),
        ]
    )
    preprocessor.set_output(transform="pandas")

    pipe = Pipeline(steps=[
        ('date_extraction', date_transformer),
        ('preprocessor', preprocessor),
        ('regressor', XGBRegressor(seed=42, **params_xg))
    ])


    # Predict and evaluate the model
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_val)
    mae = mean_absolute_error(y_val, y_pred)

    return mae

In [30]:
optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction="minimize")
study.optimize(objective_function, timeout=5*60, show_progress_bar = True)

Best trial: 211. Best value: 6581.59:  100%|██████████████████████████████████████████████████████████████| 05:00/05:00


In [31]:
mae_xgb_best = study.best_value
print(f"XGBoost Best MAE: {mae_xgb_best}")
print(f"Optima Trials: {len(study.trials)}")
study.best_params

XGBoost Best MAE: 6581.594174668872
Optima Trials: 418


{'learning_rate': 0.08413604573426753,
 'n_estimators': 578,
 'max_depth': 3,
 'max_leaves': 3,
 'min_child_weight': 2,
 'reg_alpha': 0.8262122480261269,
 'reg_lambda': 0.7704823922790215,
 'min_frequency': 0.035904239618016205}

In [32]:
params_oh = {k: v for k,v in list(study.best_params.items())[-1:]}
params_xg = {k: v for k,v in list(study.best_params.items())[:-1]}

preprocessor = ColumnTransformer(
    transformers=[
        ('minmax', MinMaxScaler(), [3,5,6,7,12]),
        ('oh', OneHotEncoder(sparse_output=False, **params_oh), [0,1,2]),
    ]
)
preprocessor.set_output(transform="pandas")

pipe = Pipeline(steps=[
    ('date_extraction', date_transformer),
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(seed=42, **params_xg))
])


with open('XGBoost_fitted.pkl', 'wb') as f:
    pickle.dump(pipe, f)

## 1.4 Optimización de Hiperparámetros con Optuna y Prunners (1.7)

<p align="center">
  <img src="https://i.pinimg.com/originals/90/16/f9/9016f919c2259f3d0e8fe465049638a7.gif">
</p>

Después de optimizar el rendimiento de su modelo varias veces, Fiu le pregunta si no es posible optimizar el entrenamiento del modelo en sí mismo. Después de leer un par de post de personas de dudosa reputación en la *deepweb*, usted llega a la conclusión que puede cumplir este objetivo mediante la implementación de **Prunning**.

Vuelva a optimizar los mismos hiperparámetros que la sección pasada, pero esta vez utilizando **Prunning** en la optimización. En particular, usted debe:

- Responder: ¿Qué es prunning? ¿De qué forma debería impactar en el entrenamiento?
- Utilizar `optuna.integration.XGBoostPruningCallback` como método de **Prunning**
- Fijar nuevamente el tiempo de entrenamiento a 5 minutos
- Reportar el número de *trials*, el `MAE` y los mejores hiperparámetros encontrados. ¿Cómo cambian sus resultados con respecto a la sección anterior? ¿A qué se puede deber esto?
- Guardar su modelo en un archivo .pkl

Nota: Si quieren silenciar los prints obtenidos en el prunning, pueden hacerlo mediante el siguiente comando:

```
optuna.logging.set_verbosity(optuna.logging.WARNING)
```

De implementar la opción anterior, pueden especificar `show_progress_bar = True` en el método `optimize` para *más sabor*.

Hint: Si quieren especificar parámetros del método .fit() del modelo a través del pipeline, pueden hacerlo por medio de la siguiente sintaxis: `pipeline.fit(stepmodelo__parametro = valor)`

Hint2: Este <a href = https://stackoverflow.com/questions/40329576/sklearn-pass-fit-parameters-to-xgboost-in-pipeline>enlace</a> les puede ser de ayuda en su implementación

In [33]:
# Inserte su código acá


def objective_function_pr(trial):
    # Split into features and target
    X = df.loc[:, ~df.columns.isin(['quantity'])]
    y = df.loc[:, 'quantity']

    # Split into train and validation sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size= 0.2/(0.2 + 0.7), random_state=42)

    # Define the hyperparameters to tune
    params_xg = {
        "eval_metric": mean_absolute_error,
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.1),
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "max_leaves": trial.suggest_int("max_leaves", 0, 100),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 5),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 1),
        "reg_lambda": trial.suggest_float("reg_lambda", 0, 1),
    }
    params_oh = {
        "min_frequency": trial.suggest_float("min_frequency", 0, 1)
    }

    # Train the XGBoost model with prunning
    pruning_callback = optuna.integration.XGBoostPruningCallback(
        trial, observation_key="validation_0-mean_absolute_error"
    )

    def extract_date_info(df):
        df['date'] = pd.to_datetime(df['date'])
        df['day'] = df['date'].dt.day.astype('category')
        df['month'] = df['date'].dt.month.astype('category')
        df['year'] = df['date'].dt.year.astype('category')
        return df.drop('date', axis=1)

    def get_feature_names_out(transformer, input_features=None):
        return ['dia', 'mes', 'anho']


    date_transformer = ColumnTransformer(
        [
            ('cat', FunctionTransformer(extract_date_info, validate=False, feature_names_out=get_feature_names_out), ['date'])
        ], remainder='passthrough'
    )
    date_transformer.set_output(transform="pandas")

    numeric_features = X.select_dtypes(include=['int', 'float']).columns
    categorical_features = X.select_dtypes(include=['category']).columns

    preprocessor = ColumnTransformer(
        transformers=[
            ('minmax', MinMaxScaler(), [3,5,6,7,12]),
            ('oh', OneHotEncoder(sparse_output=False, **params_oh), [0,1,2]),
        ]
    )
    preprocessor.set_output(transform="pandas")

    pipe = Pipeline(steps=[
        ('date_extraction', date_transformer),
        ('preprocessor', preprocessor),
        ('regressor', XGBRegressor(seed=42, **params_xg))
    ])

    semi_pipe = Pipeline(steps=[
        ('date_extraction', date_transformer),
        ('preprocessor', preprocessor),
    ])

    X_val_scaled = semi_pipe.fit_transform(X_val)


    # Predict and evaluate the model
    pipe.fit(X_train, y_train, regressor__callbacks=[pruning_callback], regressor__eval_set=[(X_val_scaled, y_val),])
    y_pred = pipe.predict(X_val)
    mae = mean_absolute_error(y_val, y_pred)

    return mae

In [34]:
study_pr = optuna.create_study(direction="minimize")
study_pr.optimize(objective_function_pr, timeout=5*60, show_progress_bar = True)

   0%|                                                                                                    | 00:00/05:00

[0]	validation_0-rmse:31323.23665	validation_0-mean_absolute_error:26386.21680
[1]	validation_0-rmse:29069.91426	validation_0-mean_absolute_error:24159.15625
[2]	validation_0-rmse:27029.95286	validation_0-mean_absolute_error:22129.70117
[3]	validation_0-rmse:25213.01238	validation_0-mean_absolute_error:20292.32617
[4]	validation_0-rmse:23601.71087	validation_0-mean_absolute_error:18643.86523
[5]	validation_0-rmse:22168.07812	validation_0-mean_absolute_error:17174.52539
[6]	validation_0-rmse:20882.79417	validation_0-mean_absolute_error:15886.06836
[7]	validation_0-rmse:19775.75359	validation_0-mean_absolute_error:14778.34570
[8]	validation_0-rmse:18792.38245	validation_0-mean_absolute_error:13831.65820
[9]	validation_0-rmse:17919.66157	validation_0-mean_absolute_error:13018.31348
[10]	validation_0-rmse:17155.26006	validation_0-mean_absolute_error:12333.06445
[11]	validation_0-rmse:16451.00056	validation_0-mean_absolute_error:11725.12109
[12]	validation_0-rmse:15822.34604	validation_0-me


`callbacks` in `fit` method is deprecated for better compatibility with scikit-learn, use `callbacks` in constructor or`set_params` instead.



[28]	validation_0-rmse:11630.20114	validation_0-mean_absolute_error:8211.53418
[29]	validation_0-rmse:11536.26105	validation_0-mean_absolute_error:8149.20898
[30]	validation_0-rmse:11440.68326	validation_0-mean_absolute_error:8091.93359
[31]	validation_0-rmse:11368.92272	validation_0-mean_absolute_error:8053.98145
[32]	validation_0-rmse:11295.62373	validation_0-mean_absolute_error:8006.72119
[33]	validation_0-rmse:11227.66428	validation_0-mean_absolute_error:7960.99756
[34]	validation_0-rmse:11171.49540	validation_0-mean_absolute_error:7928.77490
[35]	validation_0-rmse:11131.74320	validation_0-mean_absolute_error:7909.30957
[36]	validation_0-rmse:11075.05505	validation_0-mean_absolute_error:7872.16748
[37]	validation_0-rmse:11036.74040	validation_0-mean_absolute_error:7853.57129
[38]	validation_0-rmse:10983.69630	validation_0-mean_absolute_error:7821.75537
[39]	validation_0-rmse:10945.77484	validation_0-mean_absolute_error:7794.47363
[40]	validation_0-rmse:10908.12907	validation_0-mean

[132]	validation_0-rmse:10732.65630	validation_0-mean_absolute_error:7707.17871
[133]	validation_0-rmse:10732.08064	validation_0-mean_absolute_error:7707.79980
[134]	validation_0-rmse:10734.78094	validation_0-mean_absolute_error:7710.05420
[135]	validation_0-rmse:10735.14383	validation_0-mean_absolute_error:7712.42383
[136]	validation_0-rmse:10735.29176	validation_0-mean_absolute_error:7712.54883
[137]	validation_0-rmse:10738.86163	validation_0-mean_absolute_error:7717.05908
[138]	validation_0-rmse:10739.89709	validation_0-mean_absolute_error:7717.29102
[139]	validation_0-rmse:10735.70925	validation_0-mean_absolute_error:7714.01074
[140]	validation_0-rmse:10736.10504	validation_0-mean_absolute_error:7714.64258
[141]	validation_0-rmse:10739.05673	validation_0-mean_absolute_error:7717.11670
[142]	validation_0-rmse:10739.32249	validation_0-mean_absolute_error:7717.37402
[143]	validation_0-rmse:10739.47658	validation_0-mean_absolute_error:7717.31641
[144]	validation_0-rmse:10738.32521	vali

Best trial: 0. Best value: 7412.63:    0%|▏                                                               | 00:00/05:00

[0]	validation_0-rmse:33150.68122	validation_0-mean_absolute_error:28154.89062
[1]	validation_0-rmse:32448.34849	validation_0-mean_absolute_error:27464.08008
[2]	validation_0-rmse:31764.64702	validation_0-mean_absolute_error:26790.83984
[3]	validation_0-rmse:31105.82164	validation_0-mean_absolute_error:26136.62109
[4]	validation_0-rmse:30466.56888	validation_0-mean_absolute_error:25502.45508
[5]	validation_0-rmse:29844.63601	validation_0-mean_absolute_error:24882.18164
[6]	validation_0-rmse:29243.04180	validation_0-mean_absolute_error:24280.04883
[7]	validation_0-rmse:28660.44985	validation_0-mean_absolute_error:23695.50977
[8]	validation_0-rmse:28094.79279	validation_0-mean_absolute_error:23124.65430
[9]	validation_0-rmse:27546.30408	validation_0-mean_absolute_error:22569.12695
[10]	validation_0-rmse:27016.05740	validation_0-mean_absolute_error:22030.39453
[11]	validation_0-rmse:26501.28307	validation_0-mean_absolute_error:21505.01367
[12]	validation_0-rmse:26000.80781	validation_0-me


`callbacks` in `fit` method is deprecated for better compatibility with scikit-learn, use `callbacks` in constructor or`set_params` instead.



[36]	validation_0-rmse:17791.47757	validation_0-mean_absolute_error:12796.74219
[37]	validation_0-rmse:17570.88402	validation_0-mean_absolute_error:12604.68457
[38]	validation_0-rmse:17354.56293	validation_0-mean_absolute_error:12420.21973
[39]	validation_0-rmse:17147.43491	validation_0-mean_absolute_error:12246.26856
[40]	validation_0-rmse:16953.16854	validation_0-mean_absolute_error:12079.86035
[41]	validation_0-rmse:16758.22428	validation_0-mean_absolute_error:11919.28418
[42]	validation_0-rmse:16574.59074	validation_0-mean_absolute_error:11768.07519
[43]	validation_0-rmse:16390.57064	validation_0-mean_absolute_error:11616.54981
[44]	validation_0-rmse:16221.82792	validation_0-mean_absolute_error:11479.86621
[45]	validation_0-rmse:16053.99990	validation_0-mean_absolute_error:11348.75293
[46]	validation_0-rmse:15895.55144	validation_0-mean_absolute_error:11222.93848
[47]	validation_0-rmse:15742.23011	validation_0-mean_absolute_error:11102.52637
[48]	validation_0-rmse:15592.78440	valid

[139]	validation_0-rmse:11031.41280	validation_0-mean_absolute_error:7853.98779
[140]	validation_0-rmse:11015.82281	validation_0-mean_absolute_error:7842.58496
[141]	validation_0-rmse:11002.93408	validation_0-mean_absolute_error:7832.92627
[142]	validation_0-rmse:10986.22596	validation_0-mean_absolute_error:7821.92236
[143]	validation_0-rmse:10974.77622	validation_0-mean_absolute_error:7812.86035
[144]	validation_0-rmse:10959.99422	validation_0-mean_absolute_error:7802.03467
[145]	validation_0-rmse:10949.81549	validation_0-mean_absolute_error:7793.66748
[146]	validation_0-rmse:10937.80946	validation_0-mean_absolute_error:7786.05518
[147]	validation_0-rmse:10925.69895	validation_0-mean_absolute_error:7776.58691
[148]	validation_0-rmse:10897.75036	validation_0-mean_absolute_error:7759.44092
[149]	validation_0-rmse:10887.16075	validation_0-mean_absolute_error:7750.56885
[150]	validation_0-rmse:10874.83497	validation_0-mean_absolute_error:7741.66602
[151]	validation_0-rmse:10862.89593	vali

[242]	validation_0-rmse:10121.52423	validation_0-mean_absolute_error:7260.25000
[243]	validation_0-rmse:10118.80629	validation_0-mean_absolute_error:7258.92139
[244]	validation_0-rmse:10117.56724	validation_0-mean_absolute_error:7259.24951
[245]	validation_0-rmse:10114.98717	validation_0-mean_absolute_error:7257.36865
[246]	validation_0-rmse:10113.35715	validation_0-mean_absolute_error:7257.09131
[247]	validation_0-rmse:10112.49818	validation_0-mean_absolute_error:7256.87451
[248]	validation_0-rmse:10109.48363	validation_0-mean_absolute_error:7255.37451
[249]	validation_0-rmse:10105.30505	validation_0-mean_absolute_error:7252.12451
[250]	validation_0-rmse:10105.37309	validation_0-mean_absolute_error:7252.39404
[251]	validation_0-rmse:10101.86633	validation_0-mean_absolute_error:7250.02295
[252]	validation_0-rmse:10100.85598	validation_0-mean_absolute_error:7250.13477
[253]	validation_0-rmse:10097.32264	validation_0-mean_absolute_error:7248.54736
[254]	validation_0-rmse:10095.52356	vali

[345]	validation_0-rmse:10018.94866	validation_0-mean_absolute_error:7207.95703
[346]	validation_0-rmse:10020.86562	validation_0-mean_absolute_error:7208.68555
[347]	validation_0-rmse:10018.29125	validation_0-mean_absolute_error:7206.63135
[348]	validation_0-rmse:10018.31479	validation_0-mean_absolute_error:7206.86328
[349]	validation_0-rmse:10017.77941	validation_0-mean_absolute_error:7206.50244
[350]	validation_0-rmse:10018.63349	validation_0-mean_absolute_error:7207.02832
[351]	validation_0-rmse:10016.86099	validation_0-mean_absolute_error:7205.91553
[352]	validation_0-rmse:10017.95201	validation_0-mean_absolute_error:7206.97705
[353]	validation_0-rmse:10017.33517	validation_0-mean_absolute_error:7207.37647
[354]	validation_0-rmse:10017.49345	validation_0-mean_absolute_error:7207.35400
[355]	validation_0-rmse:10016.14392	validation_0-mean_absolute_error:7206.57715
[356]	validation_0-rmse:10016.12940	validation_0-mean_absolute_error:7206.73535
[357]	validation_0-rmse:10016.61356	vali

Best trial: 1. Best value: 6955.45:    1%|▌                                                               | 00:02/05:00

[0]	validation_0-rmse:32810.51719	validation_0-mean_absolute_error:27832.09961
[1]	validation_0-rmse:31794.97926	validation_0-mean_absolute_error:26841.21094
[2]	validation_0-rmse:30827.95479	validation_0-mean_absolute_error:25893.86523
[3]	validation_0-rmse:29906.10488	validation_0-mean_absolute_error:24986.68164
[4]	validation_0-rmse:29020.28794	validation_0-mean_absolute_error:24108.39648
[5]	validation_0-rmse:28174.20066	validation_0-mean_absolute_error:23266.38281
[6]	validation_0-rmse:27364.59228	validation_0-mean_absolute_error:22454.69922
[7]	validation_0-rmse:26596.34257	validation_0-mean_absolute_error:21679.51172
[8]	validation_0-rmse:25866.31925	validation_0-mean_absolute_error:20937.64062
[9]	validation_0-rmse:25162.78935	validation_0-mean_absolute_error:20221.49023
[10]	validation_0-rmse:24497.01744	validation_0-mean_absolute_error:19535.30078
[11]	validation_0-rmse:23867.28243	validation_0-mean_absolute_error:18887.54492
[12]	validation_0-rmse:23259.34579	validation_0-me


`callbacks` in `fit` method is deprecated for better compatibility with scikit-learn, use `callbacks` in constructor or`set_params` instead.



[37]	validation_0-rmse:14731.41655	validation_0-mean_absolute_error:10325.84180
[38]	validation_0-rmse:14565.61040	validation_0-mean_absolute_error:10202.87012
[39]	validation_0-rmse:14396.83815	validation_0-mean_absolute_error:10078.19727
[40]	validation_0-rmse:14249.18345	validation_0-mean_absolute_error:9968.41113
[41]	validation_0-rmse:14096.84385	validation_0-mean_absolute_error:9860.70410
[42]	validation_0-rmse:13959.89078	validation_0-mean_absolute_error:9765.37500
[43]	validation_0-rmse:13829.85845	validation_0-mean_absolute_error:9669.55469
[44]	validation_0-rmse:13697.98937	validation_0-mean_absolute_error:9573.27832
[45]	validation_0-rmse:13578.34009	validation_0-mean_absolute_error:9489.33008
[46]	validation_0-rmse:13467.01467	validation_0-mean_absolute_error:9409.61816
[47]	validation_0-rmse:13349.79806	validation_0-mean_absolute_error:9329.64258
[48]	validation_0-rmse:13238.31368	validation_0-mean_absolute_error:9253.83691
[49]	validation_0-rmse:13131.39303	validation_0-m

[141]	validation_0-rmse:10673.14703	validation_0-mean_absolute_error:7613.43555
[142]	validation_0-rmse:10672.55775	validation_0-mean_absolute_error:7613.55225
[143]	validation_0-rmse:10672.97052	validation_0-mean_absolute_error:7613.42236
[144]	validation_0-rmse:10668.32467	validation_0-mean_absolute_error:7610.35107
[145]	validation_0-rmse:10663.85562	validation_0-mean_absolute_error:7608.17285
[146]	validation_0-rmse:10662.27979	validation_0-mean_absolute_error:7606.65527
[147]	validation_0-rmse:10663.75056	validation_0-mean_absolute_error:7608.14062
[148]	validation_0-rmse:10664.96420	validation_0-mean_absolute_error:7609.37598
[149]	validation_0-rmse:10661.44124	validation_0-mean_absolute_error:7607.72119
[150]	validation_0-rmse:10659.21944	validation_0-mean_absolute_error:7605.31299
[151]	validation_0-rmse:10659.13926	validation_0-mean_absolute_error:7603.90088
[152]	validation_0-rmse:10660.14520	validation_0-mean_absolute_error:7604.23682
[153]	validation_0-rmse:10659.09125	vali

[244]	validation_0-rmse:10667.03347	validation_0-mean_absolute_error:7615.47998
[245]	validation_0-rmse:10667.22512	validation_0-mean_absolute_error:7615.80029
[246]	validation_0-rmse:10667.55181	validation_0-mean_absolute_error:7617.61182
[247]	validation_0-rmse:10667.46443	validation_0-mean_absolute_error:7617.71436
[248]	validation_0-rmse:10668.75791	validation_0-mean_absolute_error:7617.31982
[249]	validation_0-rmse:10668.73460	validation_0-mean_absolute_error:7617.47607
[250]	validation_0-rmse:10668.00772	validation_0-mean_absolute_error:7616.95117
[251]	validation_0-rmse:10669.43987	validation_0-mean_absolute_error:7618.94092
[252]	validation_0-rmse:10670.65621	validation_0-mean_absolute_error:7618.11914
[253]	validation_0-rmse:10670.70678	validation_0-mean_absolute_error:7618.35010
[254]	validation_0-rmse:10671.06261	validation_0-mean_absolute_error:7618.71973
[255]	validation_0-rmse:10671.21257	validation_0-mean_absolute_error:7618.95557
[256]	validation_0-rmse:10673.69958	vali

[347]	validation_0-rmse:10721.56532	validation_0-mean_absolute_error:7663.19727
[348]	validation_0-rmse:10724.31123	validation_0-mean_absolute_error:7665.34863
[349]	validation_0-rmse:10723.50824	validation_0-mean_absolute_error:7665.10596
[350]	validation_0-rmse:10723.15110	validation_0-mean_absolute_error:7665.27490
[351]	validation_0-rmse:10723.76648	validation_0-mean_absolute_error:7665.33105
[352]	validation_0-rmse:10722.33273	validation_0-mean_absolute_error:7663.58447
[353]	validation_0-rmse:10723.39260	validation_0-mean_absolute_error:7664.26123
[354]	validation_0-rmse:10723.68306	validation_0-mean_absolute_error:7664.36182
[355]	validation_0-rmse:10724.50401	validation_0-mean_absolute_error:7665.11377
[356]	validation_0-rmse:10724.98961	validation_0-mean_absolute_error:7665.58643
[357]	validation_0-rmse:10725.35337	validation_0-mean_absolute_error:7665.70020
[358]	validation_0-rmse:10725.72542	validation_0-mean_absolute_error:7667.02881
[359]	validation_0-rmse:10726.13203	vali


`callbacks` in `fit` method is deprecated for better compatibility with scikit-learn, use `callbacks` in constructor or`set_params` instead.

Best trial: 1. Best value: 6955.45:    1%|▉                                                               | 00:04/05:00

[W 2023-11-16 23:38:59,863] Trial 3 failed with parameters: {'learning_rate': 0.053723907114161855, 'n_estimators': 842, 'max_depth': 4, 'max_leaves': 78, 'min_child_weight': 2, 'reg_alpha': 0.38249412207847044, 'reg_lambda': 0.029488758938357007, 'min_frequency': 0.09408769155494312} because of the following error: ValueError("feature_names mismatch: ['minmax__remainder__id', 'minmax__remainder__lat', 'minmax__remainder__long', 'minmax__remainder__pop', 'minmax__remainder__price', 'oh__cat__dia_30', 'oh__cat__dia_31', 'oh__cat__dia_infrequent_sklearn', 'oh__cat__mes_infrequent_sklearn', 'oh__cat__anho_2012', 'oh__cat__anho_2013', 'oh__cat__anho_2014', 'oh__cat__anho_2015', 'oh__cat__anho_2016', 'oh__cat__anho_2017', 'oh__cat__anho_2018'] ['minmax__remainder__id', 'minmax__remainder__lat', 'minmax__remainder__long', 'minmax__remainder__pop', 'minmax__remainder__price', 'oh__cat__dia_30', 'oh__cat__dia_31', 'oh__cat__dia_infrequent_sklearn', 'oh__cat__mes_2', 'oh__cat__mes_3', 'oh__cat_




ValueError: feature_names mismatch: ['minmax__remainder__id', 'minmax__remainder__lat', 'minmax__remainder__long', 'minmax__remainder__pop', 'minmax__remainder__price', 'oh__cat__dia_30', 'oh__cat__dia_31', 'oh__cat__dia_infrequent_sklearn', 'oh__cat__mes_infrequent_sklearn', 'oh__cat__anho_2012', 'oh__cat__anho_2013', 'oh__cat__anho_2014', 'oh__cat__anho_2015', 'oh__cat__anho_2016', 'oh__cat__anho_2017', 'oh__cat__anho_2018'] ['minmax__remainder__id', 'minmax__remainder__lat', 'minmax__remainder__long', 'minmax__remainder__pop', 'minmax__remainder__price', 'oh__cat__dia_30', 'oh__cat__dia_31', 'oh__cat__dia_infrequent_sklearn', 'oh__cat__mes_2', 'oh__cat__mes_3', 'oh__cat__mes_infrequent_sklearn', 'oh__cat__anho_2012', 'oh__cat__anho_2013', 'oh__cat__anho_2014', 'oh__cat__anho_2015', 'oh__cat__anho_2016', 'oh__cat__anho_2017', 'oh__cat__anho_2018']
training data did not have the following fields: oh__cat__mes_3, oh__cat__mes_2

In [35]:
mae_xgb_best_prunned = study_pr.best_value
print(f"XGBoost Best MAE: {study_pr.best_value}")
print(f"Optima Trials: {len(study_pr.trials)}")
study_pr.best_params

XGBoost Best MAE: 6955.445573167571
Optima Trials: 4


{'learning_rate': 0.02625721429737358,
 'n_estimators': 416,
 'max_depth': 6,
 'max_leaves': 50,
 'min_child_weight': 5,
 'reg_alpha': 0.523281577730208,
 'reg_lambda': 0.18273613412240297,
 'min_frequency': 0.4410867022803}

In [36]:
params_oh = {k: v for k,v in list(study_pr.best_params.items())[-1:]}
params_xg = {k: v for k,v in list(study_pr.best_params.items())[:-1]}

preprocessor = ColumnTransformer(
    transformers=[
        ('minmax', MinMaxScaler(), [3,5,6,7,12]),
        ('oh', OneHotEncoder(sparse_output=False, **params_oh), [0,1,2]),
    ]
)
preprocessor.set_output(transform="pandas")

pipe = Pipeline(steps=[
    ('date_extraction', date_transformer),
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(seed=42, **params_xg))
])


with open('XGBoost_fitted_pruned.pkl', 'wb') as f:
    pickle.dump(pipe, f)

## 1.5 Visualizaciones (0.5 puntos)

<p align="center">
  <img src="https://media.tenor.com/F-LgB1xTebEAAAAd/look-at-this-graph-nickelback.gif">
</p>


Satisfecho con su trabajo, Fiu le pregunta si es posible generar visualizaciones que permitan entender el entrenamiento de su modelo.

A partir del siguiente <a href = https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/005_visualization.html#visualization>enlace</a>, genere las siguientes visualizaciones:

- Gráfico de historial de optimización
- Gráfico de coordenadas paralelas
- Gráfico de importancia de hiperparámetros

Comente sus resultados: ¿Desde qué *trial* se empiezan a observar mejoras notables en sus resultados? ¿Qué tendencias puede observar a partir del gráfico de coordenadas paralelas? ¿Cuáles son los hiperparámetros con mayor importancia para la optimización de su modelo?

In [42]:
# Inserte su código acá
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances

# Gráfico de historial de optimización con Optuna
plot_optimization_history(study)

In [43]:
# Gráfico de historial de optimización con Optuna y Prunning
plot_optimization_history(study_pr)

In [44]:
#Gráfico de coordenadas paralelas de optimización con Optuna
plot_parallel_coordinate(study)

In [45]:
#Gráfico de coordenadas paralelas de optimización con Optuna y Prunning
plot_parallel_coordinate(study_pr)

In [46]:
# Gráfico de importancia de hiperparámetros con Optuna
plot_param_importances(study)

In [47]:
# Gráfico de importancia de hiperparámetros con Optuna
plot_param_importances(study_pr)

## 1.6 Síntesis de resultados (0.3)

Finalmente, genere una tabla resumen del MAE obtenido en los 5 modelos entrenados (desde Baseline hasta XGBoost con Constraints, Optuna y Prunning) y compare sus resultados. ¿Qué modelo obtiene el mejor rendimiento? 

Por último, cargue el mejor modelo, prediga sobre el conjunto de test y reporte su MAE. ¿Existen diferencias con respecto a las métricas obtenidas en el conjunto de validación? ¿Porqué puede ocurrir esto?

In [53]:
MAES = [mae_dummy, mae_xgb_default, mae_xgb_mono, mae_xgb_best, mae_xgb_best_prunned]

df_summary = pd.DataFrame({
    'Modelo': ['Dummy', 'XGB Default', 'XGB Mono', 'XGB Optuna', 'XGB Prunning'],
    'MAE': MAES
})
print(df_summary.to_string(index=False))

      Modelo          MAE
       Dummy 13543.961388
 XGB Default  7008.988838
    XGB Mono  7017.723225
  XGB Optuna  6581.594175
XGB Prunning  6955.445573


En base a los resultados de los MAE de los modelos, se puede ver que el modelo optimizado con Optuna es el que obtiene el mejor desempeño, con un MAE de 6581.6 aproximadamente. Es por esto que a continuación se cargará el modelo optimizado por Optuna para predecir sobre el conjunto de test y predecir su MAE:

# Conclusión
Eso ha sido todo para el lab de hoy, recuerden que el laboratorio tiene un plazo de entrega de una semana. Cualquier duda del laboratorio, no duden en contactarnos por mail o U-cursos.

<p align="center">
  <img src="https://media.tenor.com/8CT1AXElF_cAAAAC/gojo-satoru.gif">
</p>

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=87110296-876e-426f-b91d-aaf681223468' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>