## ML Modeling Challenge - Continuous Variable Prediction
### **Script:** Data quality, regression model development, and model testing on new (production) data.

**Tags:** #datacleaning, #datapreprocessing, #featureengineering, #MLmodel, #regression

**Author:** Alessio Daniel Hernández Rojas

**Last updated**: 04-01-2026 (Created: 02-01-2026)

**Description:** This script develops and evaluates a model that predicts a continuous target from a dataset of 20 continuous variables. It covers: 1) data cleaning, 2) preprocessing, 3) model building, and 4) evaluation of the model on new data.

**Input:** Dataset with 20 continuous variables and one continuous target variable

**Output:** Predictions, the preprocessing pipeline as a .pkl file, and the ML model as a .pkl file
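
The input/output contract above can be sketched as a round trip: fit a preprocessing step and a model, persist both with joblib, reload them, and predict on new rows. This is only a minimal sketch on synthetic data. The `StandardScaler` and `LinearRegression` are stand-ins for the notebook's actual pipeline and XGBoost model, and the file names are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (20 continuous features)
rng = np.random.default_rng(0)
cols = [f"feature_{i}" for i in range(20)]
X_train = pd.DataFrame(rng.uniform(0, 100, size=(50, 20)), columns=cols)
y_train = X_train["feature_2"] * 0.05 + rng.normal(0, 0.1, size=50)

# Stand-ins for the preprocessing pipeline and the model
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Persist both artifacts as .pkl files, then reload them
with tempfile.TemporaryDirectory() as tmp:
    prep_path = os.path.join(tmp, "preprocessing.pkl")  # placeholder name
    model_path = os.path.join(tmp, "model.pkl")         # placeholder name
    joblib.dump(scaler, prep_path)
    joblib.dump(model, model_path)
    scaler_loaded = joblib.load(prep_path)
    model_loaded = joblib.load(model_path)

# Predict on unseen rows using only the reloaded artifacts
X_new = pd.DataFrame(rng.uniform(0, 100, size=(5, 20)), columns=cols)
y_pred = model_loaded.predict(scaler_loaded.transform(X_new))
print(y_pred.shape)
```

Keeping preprocessing and model as separate artifacts mirrors the two .pkl outputs listed above: new data must pass through the same fitted preprocessing before it reaches the model.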

#### Library installation

In [1]:
# Install the required libraries
#!pip install pandas

#### Library imports

In [2]:
# Standard library
import math
import random
from datetime import datetime

# Data handling and plotting
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import gaussian_kde

# Modeling
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
    root_mean_squared_error,
)
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from xgboost import XGBRegressor


#### Functions

In [3]:
# Save the pipeline and the model
def guardar_pipeline_y_modelo(pipeline, model, 
                              pipeline_path="preprocessing_pipeline.pkl",
                              model_path="xgb_bestmodel.pkl"):
    """
    Save the preprocessing pipeline and the trained model to .pkl files.
    """

    joblib.dump(pipeline, pipeline_path)
    print(f"✅ Preprocessing pipeline saved to: {pipeline_path}")

    joblib.dump(model, model_path)
    print(f"✅ XGBoost model saved to: {model_path}")

# Save predictions
def save_predictions_csv(
    y_pred,
    output_path="predictions_blind_test.csv",
    target_col="target_prediction",
    decimals=2
):
    """
    Save predictions to a CSV file with an index starting at 1.

    Parameters
    ----------
    y_pred : array-like
        Vector of model predictions.
    output_path : str
        Path of the output CSV file.
    target_col : str
        Name of the prediction column.
    decimals : int
        Number of decimal places used for rounding.

    Returns
    -------
    pd.DataFrame or None
        The saved DataFrame (useful for debugging or testing), or None if
        nothing was saved.
    """

    # Basic validation
    if y_pred is None:
        print("❌ Error: y_pred is None. No file was saved.")
        return None

    if len(y_pred) == 0:
        print("⚠️ Warning: no records found to save.")
        return None

    # Convert to a 1-D NumPy array so the DataFrame gets a single column
    y_pred = np.asarray(y_pred).ravel()

    # Build the DataFrame
    df_out = pd.DataFrame(
        {target_col: np.round(y_pred, decimals)},
        index=range(1, len(y_pred) + 1)
    )

    # Write the CSV
    df_out.to_csv(output_path, index_label="index")

    # Success message
    print(
        f"✅ Saved {len(df_out)} predictions successfully "
        f"to the file '{output_path}'"
    )

    return df_out

# Check whether the model has already been fitted
def is_fitted(model):
    """
    Return True if the model has already been fitted, False otherwise.
    Works for most sklearn and XGBoost models.
    """
    try:
        check_is_fitted(model)
        return True
    except Exception:
        # A bare `except:` would also swallow KeyboardInterrupt and SystemExit
        return False

# Read the blind test dataset
def cargar_datos():
    df_blind_test = pd.read_csv('data/blind_test_data.csv')
    print(f'✅ Dataset loaded successfully. {df_blind_test.shape[0]} rows and {df_blind_test.shape[1]} columns.')
    return df_blind_test

class ElasticNetFeatureSelector(BaseEstimator, TransformerMixin):
    """Keep the features whose ElasticNetCV coefficient is non-zero."""

    def fit(self, X, y):
        # Standardize so the penalty treats every feature on the same scale
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)

        model = ElasticNetCV(
            l1_ratio=[0.1, 0.5, 0.9],
            alphas=np.logspace(-4, 0, 50),
            cv=5,
            max_iter=5000,
            random_state=42
        )
        model.fit(X_scaled, y)

        coef = model.coef_
        self.selected_features_ = X.columns[coef != 0].tolist()
        discarded_features = X.columns[coef == 0].tolist()

        # 🔔 Messages
        print(f"✅ ElasticNet selected {len(self.selected_features_)} features")
        print(f"✘ ElasticNet discarded {len(discarded_features)} features")

        if len(self.selected_features_) == 0:
            print("⚠️ Warning: ElasticNet did not select any feature")

        return self

    def transform(self, X):
        valid_cols = [c for c in self.selected_features_ if c in X.columns]

        if len(valid_cols) < len(self.selected_features_):
            missing = set(self.selected_features_) - set(valid_cols)
            print(f"⚠️ Selected features missing in transform(): {list(missing)}")

        return X[valid_cols]

class DropHighCorr(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.9, verbose=True):
        self.threshold = threshold
        self.verbose = verbose

    def fit(self, X, y=None):
        corr = X.corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

        self.to_drop_ = []
        reported = []

        for col in upper.columns:
            high_corr = upper[col][upper[col] > self.threshold]
            if not high_corr.empty:
                self.to_drop_.append(col)
                for idx, val in high_corr.items():
                    reported.append((idx, col, val))

        if self.verbose and reported:
            print("⚠️ Highly correlated variables (>|{:.2f}|):\n".format(self.threshold))
            for v1, v2, val in reported:
                print(f"  - {v1} ↔ {v2} : corr = {val:.3f}")

        if self.verbose and not reported:
            print("✅ No highly correlated variables were found.")

        return self

    def transform(self, X):
        return X.drop(columns=self.to_drop_, errors="ignore")

class AddInteractionFeatures(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Interaction pairs defined internally
        self.feature_pairs = [
            ("num_feature_2", "num_feature_9"),
            ("num_feature_2", "num_feature_11"),
            ("num_feature_2", "num_feature_13")
        ]

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_new = X.copy()
        added_features = []

        for f1, f2 in self.feature_pairs:
            if f1 in X_new.columns and f2 in X_new.columns:
                new_col = f"{f1}_x_{f2}"
                X_new[new_col] = X_new[f1] * X_new[f2]
                added_features.append(new_col)
            else:
                missing = [c for c in (f1, f2) if c not in X_new.columns]
                print(f"⚠️ Interaction skipped because of missing columns: {missing}")

        if added_features:
            print(f"✅ Added {len(added_features)} features: {added_features}")
        else:
            print("⚠️ No interaction feature was added.")

        return X_new

class FiltroUnarias(BaseEstimator, TransformerMixin):
    def __init__(self, umbral=0.95, mostrar_sesgo=True):
        self.umbral = umbral
        self.mostrar_sesgo = mostrar_sesgo

    def fit(self, X, y=None):
        self.unarias_ = []

        for v in X.columns:
            # Relative frequency distribution
            evaluacion = X[v].value_counts(normalize=True).reset_index()
            evaluacion.columns = [v, "proporcion"]

            # Check whether the dominant value exceeds the threshold
            if evaluacion["proporcion"].iloc[0] > self.umbral:
                self.unarias_.append(v)

                if self.mostrar_sesgo:
                    print(f"\n🔍 Variable: {v}")
                    print(
                        evaluacion.to_string(
                            index=False,
                            formatters={"proporcion": "{:.2%}".format}
                        )
                    )

        if self.unarias_:
            print(
                f"✅ Unary-variable filter fitted with threshold {self.umbral}. "
                f"{len(self.unarias_)} columns flagged."
            )
        else:
            print("✅ No unary variables found above the threshold.")

        return self

    def transform(self, X):
        df = X.copy()

        if self.unarias_:
            # "cat_property_type" is always kept, even when flagged as unary
            exception = "cat_property_type"
            cols_a_drop = [col for col in self.unarias_ if col != exception]
            df = df.drop(columns=cols_a_drop, errors="ignore")
            # Report the columns actually dropped, not every flagged column
            print(f"📉 Columns dropped for being unary: {cols_a_drop}")

        return df

class FiltroCompletitud(BaseEstimator, TransformerMixin):
    def __init__(self, umbral=90):
        self.umbral = umbral
        self.conservadas_ = None

    def fit(self, X, y=None):
        comple = pd.DataFrame(X.isnull().sum()).reset_index()
        comple = comple.rename(columns={"index": "columna", 0: "total"})
        comple["completitud"] = (1 - comple["total"] / X.shape[0]) * 100
        self.conservadas_ = comple.loc[comple["completitud"] >= self.umbral, "columna"].tolist()

        eliminadas = comple.loc[comple["completitud"] < self.umbral, ["columna", "completitud"]]
        if not eliminadas.empty:
            print("📉 Columns dropped for low completeness:")
            for _, row in eliminadas.iterrows():
                print(f" - {row['columna']}: {row['completitud']:.2f}% completeness")
        else:
            print("✅ No columns were dropped; all meet the threshold.")
        return self

    def transform(self, X):
        # Keep only the columns that still exist
        valid_cols = [c for c in self.conservadas_ if c in X.columns]
        return X[valid_cols]

class RenombrarPorTipo(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing needs to be learned from the data
        return self

    def transform(self, X):
        df = X.copy()
        # Take ALL columns whose name starts with "feature_"
        cols_to_rename = [c for c in df.columns if c.startswith("feature_")]
        # rename_cols is defined in the next cell and must exist before transform() runs
        df = rename_cols(df, cols_to_rename, prefix="num_")
        print("✅ Columns renamed")
        return df

# Preprocessing pipeline definition
xgb_pipeline = Pipeline([
    ("rename", RenombrarPorTipo()),
    ("completitud", FiltroCompletitud(umbral=80)),
    ("unarias", FiltroUnarias(umbral=0.9)),
    ("interactions", AddInteractionFeatures()),
    ("drop_corr", DropHighCorr(threshold=0.9)),
    ("elasticnet_feature_selection", ElasticNetFeatureSelector())
])
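
To make the `drop_corr` step more concrete, here is a small sketch of the same upper-triangle logic that `DropHighCorr` uses, applied to a toy frame. The column names and the synthetic data are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: num_b is a near-duplicate of num_a, num_c is independent
rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "num_a": a,
    "num_b": a * 2.0 + rng.normal(scale=0.01, size=200),
    "num_c": rng.normal(size=200),
})

# Absolute correlations, keeping only the strict upper triangle so that
# each pair is inspected once and the diagonal is ignored
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is dropped when it correlates above the threshold with any
# column that appears before it
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['num_b']
```

Because only the upper triangle is scanned, the earlier column of each correlated pair survives and the later one is dropped, so the result depends on column order.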


In [4]:
# Rename columns
def rename_cols(df, cols, prefix):
    new_feats = [prefix + col for col in cols]
    df = df.rename(columns=dict(zip(cols, new_feats)))
    return df

# Completeness
def completitud(df):
    comple = pd.DataFrame(df.isnull().sum())
    comple.reset_index(inplace=True)
    comple = comple.rename(columns={"index": "columna", 0: "total"})
    comple["completitud"] = (1 - comple["total"] / df.shape[0]) * 100
    comple = comple.sort_values(by="completitud", ascending=True)
    comple.reset_index(drop=True, inplace=True)
    return comple

# Completeness filter
def filtrar_por_completitud(df, umbral=90):
    """
    Drop the columns of a DataFrame whose completeness is below the given threshold.
    Prints the dropped columns together with their completeness percentage.
    Returns the filtered DataFrame.
    """
    # Compute completeness
    comple = pd.DataFrame(df.isnull().sum())
    comple.reset_index(inplace=True)
    comple = comple.rename(columns={"index": "columna", 0: "total"})
    comple["completitud"] = (1 - comple["total"] / df.shape[0]) * 100

    # Identify the columns that do not meet the threshold
    eliminadas = comple.loc[comple["completitud"] < umbral, ["columna", "completitud"]]
    conservadas = comple.loc[comple["completitud"] >= umbral, "columna"]

    # Print the report
    if not eliminadas.empty:
        print("📉 Columns dropped for low completeness:")
        for _, row in eliminadas.iterrows():
            print(f" - {row['columna']}: {row['completitud']:.2f}% completeness")
    else:
        print("✅ No columns were dropped; all meet the threshold.")

    # Return the filtered DataFrame
    return df[conservadas]

# Weighted unary-variable filter
def unarias_ponderadas(df, umbral=0.95, mostrar_sesgo=True):
    """
    Detect variables in which a single value exceeds the proportion threshold.
    Can also show the skew (percentage distribution) of each detected variable.
    """
    unarias = []

    for v in df.columns:
        # Relative frequency distribution
        evaluacion = df[v].value_counts(normalize=True).reset_index()
        evaluacion.columns = [v, "proporcion"]

        # Check whether the dominant value exceeds the threshold
        if evaluacion["proporcion"].iloc[0] > umbral:
            unarias.append(v)

            if mostrar_sesgo:
                print(f"\n🔍 Variable: {v}")
                print(evaluacion.to_string(index=False, formatters={"proporcion": "{:.2%}".format}))

    if not unarias:
        print("✅ No unary variables found above the threshold.")

    return unarias


def descriptivos(df, tipo_variable = "v_"):
    for col in df.filter(like= tipo_variable ).columns:
        print(col)
        value_counts = df[col].value_counts(1)
        mydf = pd.DataFrame(value_counts)
        mydf['cumulativo'] = df[col].value_counts().cumsum() / df[col].value_counts().sum()
        mydf['total'] = df[col].value_counts()
        mydf['total_cumulativo'] = df[col].value_counts().cumsum()
        display(mydf)
        print()
        print("\n")


def plot_eda_cat(
    df,
    prefix_features,
    prefix_target='tgt_',
    include_target=False,
    title="EDA Categorical Variables"
):
    """
    Plot categorical variables as horizontal bars.

    - Features are selected by `prefix_features`
    - If include_target=True, categorical target variables are included
    """

    # --- Categorical features ---
    feature_cols = [
        col for col in df.columns
        if col.startswith(prefix_features)
        and not pd.api.types.is_numeric_dtype(df[col])
    ]

    # --- Categorical target (optional) ---
    if include_target:
        target_cols = [
            col for col in df.columns
            if col.startswith(prefix_target)
            and not pd.api.types.is_numeric_dtype(df[col])
        ]
    else:
        target_cols = []

    lista_categoricas = feature_cols + target_cols

    for current_cat in lista_categoricas:
        counts = df[current_cat].value_counts().sort_values(ascending=True)
        total = counts.sum()

        n_cat = len(counts)
        fig_height = max(400, 30 * n_cat + 150)

        fig = go.Figure([
            go.Bar(
                x=counts.values,
                y=counts.index,
                orientation='h',
                text=[f"{v} ({v/total*100:.1f}%)" for v in counts],
                textposition='auto'
            )
        ])

        fig.update_layout(
            title=dict(text=f"<b>{title} - {current_cat}</b>", x=0.5),
            xaxis_title="Frequency",
            yaxis_title="Category",
            width=800,
            height=fig_height,
            template="plotly_white",
            showlegend=False
        )

        fig.show()


def plot_eda_num(
    df,
    prefix_features='num_',
    prefix_target='tgt_',
    include_target=False,
    titulo='Exploratory Data Analysis (EDA)'
):
    """
    Generate histogram + KDE + boxplot for numeric variables.

    - Features are selected by `prefix_features`
    - If include_target=True, variables starting with `prefix_target`
      are included as long as they are numeric.
    """

    # --- Numeric features ---
    feature_cols = [
        col for col in df.columns
        if col.startswith(prefix_features)
        and pd.api.types.is_numeric_dtype(df[col])
    ]

    # --- Numeric target (optional) ---
    if include_target:
        target_cols = [
            col for col in df.columns
            if col.startswith(prefix_target)
            and pd.api.types.is_numeric_dtype(df[col])
        ]
    else:
        target_cols = []

    lista_numericas = feature_cols + target_cols
    print(f"Total numeric variables plotted: {len(lista_numericas)}\n")

    for col in lista_numericas:
        x = df[col].dropna()

        hist_vals, bin_edges = np.histogram(x, bins=38, density=True)
        bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])

        kde = gaussian_kde(x)
        kde_x = np.linspace(x.min(), x.max(), 500)
        kde_y = kde(kde_x)

        skew = x.skew()
        mean = x.mean()
        median = x.median()

        fig = make_subplots(
            rows=2, cols=1,
            shared_xaxes=True,
            row_heights=[0.75, 0.25],
            vertical_spacing=0.08
        )

        fig.add_trace(go.Bar(
            x=bin_centers,
            y=hist_vals,
            name="Histograma"
        ), row=1, col=1)

        fig.add_trace(go.Scatter(
            x=kde_x,
            y=kde_y,
            mode="lines",
            name="KDE"
        ), row=1, col=1)

        fig.add_trace(go.Box(
            x=x,
            name="Boxplot",
            orientation='h',
            boxmean=True,
            showlegend=False
        ), row=2, col=1)

        fig.update_layout(
            title=dict(
                text=(
                    f"<b>{titulo} - {col}</b><br>"
                    f"Mean: {mean:.2f}, Median: {median:.2f}, Skewness: {skew:.2f}"
                ),
                x=0.5
            ),
            width=800,
            height=600,
            template="plotly_white"
        )

        fig.update_yaxes(showticklabels=False, row=2, col=1)
        fig.show()



# Plot the correlation matrix
def plot_corr_matrix(
    df,
    prefix_features,
    prefix_target=None,
    include_target=False,
    title="Correlation matrix"
):
    """
    Show a lower-triangular heat map (without the diagonal)
    annotated with the correlation values.

    - Features are selected by `prefix_features`
    - The target is selected by `prefix_target` (if included)

    The target is included for exploratory purposes only.
    """

    # Features
    feature_cols = [c for c in df.columns if c.startswith(prefix_features)]

    # Target by prefix
    if include_target and prefix_target is not None:
        target_cols = [c for c in df.columns if c.startswith(prefix_target)]
        cols = feature_cols + target_cols
    else:
        cols = feature_cols

    df_sub = df[cols]

    # Correlation matrix
    corr = df_sub.corr(numeric_only=True)

    # Lower-triangular mask without the diagonal
    mask = np.tril(np.ones_like(corr, dtype=bool), k=-1)
    corr_masked = corr.where(mask)

    # Hide NaN
    z_vals = corr_masked.where(~corr_masked.isna(), None)

    # Heatmap
    fig = go.Figure(
        data=go.Heatmap(
            z=z_vals.values,
            x=corr_masked.columns,
            y=corr_masked.index,
            colorscale="RdBu",
            zmin=-1,
            zmax=1,
            text=corr_masked.round(2).astype(str).replace("nan", ""),
            texttemplate="%{text}",
            textfont={"size": 8},
            hoverongaps=False,
            colorbar=dict(title="Correlation")
        )
    )

    fig.update_layout(
        title=dict(text=f"<b>{title}</b>", x=0.5, font=dict(size=18)),
        xaxis=dict(
            title=dict(text="<b>Variables</b>", font=dict(size=14)),
            tickangle=45
        ),
        yaxis=dict(
            title=dict(text="<b>Variables</b>", font=dict(size=14)),
            autorange="reversed"
        ),
        width=800,
        height=800,
        margin=dict(l=100, r=50, t=80, b=100)
    )

    fig.show()


from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(df, prefix_features="feature_"):
    """
    Compute the Variance Inflation Factor (VIF) for the numeric variables
    whose names start with the given prefix.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with the variables.
    prefix_features : str, optional
        Prefix of the columns to evaluate (default 'feature_').

    Returns
    -------
    pandas.DataFrame
        DataFrame with columns ['feature', 'VIF'], sorted from highest to lowest VIF.
    """

    # --- Select features by prefix ---
    X = df[[c for c in df.columns if c.startswith(prefix_features)]]

    # --- Keep numeric columns only and drop rows with missing values ---
    X = X.select_dtypes(include="number").dropna()

    # --- Compute VIF (no intercept column is added, so values are uncentered) ---
    vif_df = pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    })

    vif_df_output = vif_df.sort_values("VIF", ascending=False).reset_index(drop=True)
    display(vif_df_output)
    return vif_df_output



# Full EDA
def compute_EDA(df):
    df_eda = df.copy()
    print("Dataset dimensions:", df_eda.shape)
    plot_eda_cat(df_eda, prefix_features='cat_', prefix_target='tgt_', include_target=True, title='Distribution of Categorical Variables')
    plot_eda_num(df_eda, prefix_features='num_', prefix_target='tgt_', include_target=True, titulo='Distributions of Numeric Variables')
    plot_corr_matrix(df_eda, prefix_features='num_', prefix_target='tgt_', include_target=True, title='Correlation Matrix for Numeric Variables')
    compute_vif(df_eda, prefix_features='num_')
    completitud_df = completitud(df_eda)
    display(completitud_df)



# Outlier detection and removal with Isolation Forest
def remove_outliers_iforest(df, prefix='num_', contamination=0.02, random_state=42):
    # Select numeric columns by prefix (rows with missing values are excluded)
    num_cols = [c for c in df.columns if c.startswith(prefix)]
    X = df[num_cols].dropna()

    # Model
    iso = IsolationForest(contamination=contamination, random_state=random_state)
    preds = iso.fit_predict(X)

    # Keep inliers only (preds == -1 marks outliers)
    df_filtered = df.loc[X.index[preds == 1]]

    print(f"✅ Records kept: {len(df_filtered)} / {len(df)} ({len(df_filtered)/len(df)*100:.1f}%)")
    return df_filtered

# Regressor model evaluation
def evaluate_regressor_model(model, X_train, y_train, X_test=None, y_test=None, n_splits=5):
    """
    Evaluate a regressor with cross-validation and, optionally, on a test set.

    Parameters
    ----------
    model : sklearn-like regressor
    X_train : pd.DataFrame or np.array
    y_train : pd.Series or np.array
    X_test : optional, pd.DataFrame or np.array
    y_test : optional, pd.Series or np.array
    n_splits : int, number of CV folds

    Returns
    -------
    metrics : dict with R2, RMSE, MAE and MAPE for train (CV) and test (if provided)
    """

    def mape(y_true, y_pred):
        # Zero targets are divided by 1 instead of 0 to avoid infinite errors
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        return np.mean(np.abs((y_true - y_pred) / np.where(y_true == 0, 1, y_true))) * 100

    # ---- Out-of-fold CV predictions on train ----
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    y_pred_cv = cross_val_predict(model, X_train, y_train, cv=cv)

    metrics = {
        "R2_train_CV": r2_score(y_train, y_pred_cv),
        "RMSE_train_CV": root_mean_squared_error(y_train, y_pred_cv),
        "MAE_train_CV": mean_absolute_error(y_train, y_pred_cv),
        "MAPE_train_CV (%)": mape(y_train, y_pred_cv)
    }

    # ---- Test predictions (if provided) ----
    if X_test is not None and y_test is not None:
        # cross_val_predict only fits clones, so `model` itself may still be unfitted
        fitted_model = model if is_fitted(model) else clone(model).fit(X_train, y_train)
        y_pred_test = fitted_model.predict(X_test)
        metrics.update({
            "R2_test": r2_score(y_test, y_pred_test),
            "RMSE_test": root_mean_squared_error(y_test, y_pred_test),
            "MAE_test": mean_absolute_error(y_test, y_pred_test),
            "MAPE_test (%)": mape(y_test, y_pred_test)
        })

    return metrics

# Identify highly correlated variables and suggest which ones to drop
def drop_high_corr(df, threshold=0.9, verbose=True):
    """
    Identify highly correlated variables and suggest which ones to drop.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with numeric variables.
    threshold : float
        Absolute correlation threshold.
    verbose : bool
        If True, prints the pairs with high correlation.

    Returns
    -------
    to_drop : list
        List of columns to drop.
    """
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

    to_drop = []
    reported = []

    for col in upper.columns:
        high_corr = upper[col][upper[col] > threshold]
        if not high_corr.empty:
            to_drop.append(col)
            for idx, val in high_corr.items():
                reported.append((idx, col, val))

    if verbose and reported:
        print("\n⚠️ Highly correlated variables (>|{:.2f}|):\n".format(threshold))
        for v1, v2, val in reported:
            print(f"  - {v1} ↔ {v2} : corr = {val:.3f}")

    if verbose and not reported:
        print("\n✅ No highly correlated variables were found.")

    return to_drop

# Add multiplicative interaction features
def add_interaction_features(
    X,
    feature_pairs,
    prefix="num_"
):
    """
    Add multiplicative interaction features.

    Parameters
    ----------
    X : pd.DataFrame
    feature_pairs : list of tuples
        E.g. [("num_feature_2", "num_feature_9")]
    prefix : str
        Currently unused; new columns are named "<f1>_x_<f2>" and therefore
        inherit the prefix of the first feature in each pair.

    Returns
    -------
    X_new : pd.DataFrame
    """

    X_new = X.copy()

    for f1, f2 in feature_pairs:
        new_col = f"{f1}_x_{f2}"
        X_new[new_col] = X_new[f1] * X_new[f2]

    return X_new

# Outlier removal on the training set ONLY, with Isolation Forest
def remove_outliers_iforest_train(
    X_train,
    y_train=None,
    prefix="num_",
    contamination=0.02,
    random_state=42,
    verbose=True
):
    """
    Remove outliers from the training set ONLY, using Isolation Forest.

    Parameters
    ----------
    X_train : pd.DataFrame
        Training features
    y_train : pd.Series or None
        Training target (optional, used to keep rows aligned)
    prefix : str
        Prefix of the numeric columns
    contamination : float
        Expected proportion of outliers
    random_state : int
        Seed
    verbose : bool
        Print a summary

    Returns
    -------
    X_train_clean : pd.DataFrame
    y_train_clean : pd.Series, or None when y_train is None
    """

    # Select numeric columns by prefix
    num_cols = [c for c in X_train.columns if c.startswith(prefix)]

    if len(num_cols) == 0:
        raise ValueError("No numeric columns were found with the given prefix.")

    # Fit Isolation Forest on train ONLY
    iso = IsolationForest(
        contamination=contamination,
        random_state=random_state
    )

    preds = iso.fit_predict(X_train[num_cols])
    mask_inliers = preds == 1

    X_train_clean = X_train.loc[mask_inliers]

    if y_train is not None:
        y_train_clean = y_train.loc[X_train_clean.index]
    else:
        y_train_clean = None

    if verbose:
        kept = mask_inliers.sum()
        total = len(mask_inliers)
        print(
            f"✅ Records kept: {kept} / {total} "
            f"({kept/total*100:.1f}%)"
        )

    return X_train_clean, y_train_clean
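
As a hedged illustration of why the outlier filter is fitted on the training split only, the sketch below splits synthetic data first and then removes outliers from the training rows, leaving the test rows untouched. The data, column names, and split size are assumptions for this example; the alignment of `y` by index follows the same idea as `remove_outliers_iforest_train`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Synthetic features and target (illustrative only)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["num_a", "num_b", "num_c"])
y = pd.Series(X["num_a"] * 2.0 + rng.normal(scale=0.1, size=300), name="target")

# Split BEFORE any outlier handling so the test rows never influence the filter
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the detector on the training features only; 1 = inlier, -1 = outlier
iso = IsolationForest(contamination=0.02, random_state=42)
inliers = iso.fit_predict(X_train) == 1

# Drop outlier rows and realign the target through the surviving index
X_train_clean = X_train.loc[inliers]
y_train_clean = y_train.loc[X_train_clean.index]

print(len(X_train), len(X_train_clean), len(X_test))
```

The test set keeps all of its rows, so the reported test metrics reflect the data the model would actually see in production.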




#### Reading the dataset

In [5]:
# Read the training dataset
df = pd.read_csv('data/training_data.csv')
display(df)

index,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,target
0,432.475954,289.373016,481.315600,358.755566,802.659004,176.761177,72.648102,720.969179,36.327684,83.768878,...,4.385848,516.789458,19.624422,13.162440,42.351948,35.920392,20.755984,13.814300,384.497136,14.364922
1,517.596250,330.448341,585.920055,22.684031,169.813240,335.601640,284.451476,748.101047,73.701438,358.147215,...,5.563334,2.960064,20.721878,17.740184,1.726915,167.576065,75.492679,2.480979,303.710869,19.984801
2,189.439350,553.888820,165.833790,202.465927,176.695586,321.155049,407.278389,161.245668,282.269025,221.570899,...,4.536947,581.823741,101.695639,0.653592,486.859084,117.491548,6.420465,20.713314,22.651537,12.944351
3,237.307878,195.894881,416.752252,468.729031,611.693517,301.411711,241.880655,49.597044,122.396821,13.828319,...,5.518968,45.014729,196.350455,47.638515,411.414213,67.142022,115.630943,8.927957,388.240433,14.792440
4,602.845256,16.103208,221.759979,345.765574,558.588369,276.704241,408.069566,19.390813,138.769765,146.662193,...,2.136214,133.590430,197.634584,26.278027,111.127557,172.181136,85.869642,30.537857,625.931837,11.802634
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,939.158462,553.881664,543.088541,71.556673,354.225260,343.143913,233.732499,420.369307,362.371947,310.233672,...,3.666136,713.540081,158.829853,25.995461,143.293235,127.731367,122.722659,48.910185,298.725020,23.716818
796,489.373311,13.531758,120.387215,187.544473,409.964275,523.310429,256.719030,341.095062,55.850912,383.027048,...,3.443595,397.801688,110.880683,17.123976,284.277940,111.041401,124.447221,49.888743,212.752488,16.515457
797,936.496031,140.148546,213.454901,461.841364,86.759083,53.081953,241.042320,379.023230,140.241770,306.861602,...,4.356944,876.638733,43.728076,17.400287,556.740895,100.203033,98.565669,44.833075,844.393623,12.642383
798,783.876505,477.530653,216.890295,175.429988,348.212630,593.979057,55.137074,226.367598,329.873123,302.451012,...,3.849649,324.661269,38.852525,46.472905,352.709718,96.373785,114.783223,51.264147,703.326040,16.186675


#### Raw dataset exploration

In [6]:
# Dataset size
print("Dataset dimensions:", df.shape)

Dataset dimensions: (800, 21)


In [7]:
# Dataset completeness
completitud_df = completitud(df)
print(completitud_df)

       columna  total  completitud
0    feature_0      0        100.0
1    feature_1      0        100.0
2    feature_2      0        100.0
3    feature_3      0        100.0
4    feature_4      0        100.0
5    feature_5      0        100.0
6    feature_6      0        100.0
7    feature_7      0        100.0
8    feature_8      0        100.0
9    feature_9      0        100.0
10  feature_10      0        100.0
11  feature_11      0        100.0
12  feature_12      0        100.0
13  feature_13      0        100.0
14  feature_14      0        100.0
15  feature_15      0        100.0
16  feature_16      0        100.0
17  feature_17      0        100.0
18  feature_18      0        100.0
19  feature_19      0        100.0
20      target      0        100.0
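
The `completitud` helper is defined elsewhere in the notebook. As a rough illustration of what a completeness table like the one above contains, the sketch below rebuilds the three columns with plain pandas. The name `completeness_table` and the reading of `total` as the number of missing values are assumptions, not the helper's actual implementation.

```python
import numpy as np
import pandas as pd

def completeness_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for `completitud`: missing count and % of non-null values per column."""
    missing = df.isna().sum()
    pct = (1 - missing / len(df)) * 100
    return pd.DataFrame({
        "columna": df.columns,
        "total": missing.to_numpy(),
        "completitud": pct.round(2).to_numpy(),
    })

# Toy frame: column "a" has one missing value out of four, column "b" has none
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
table = completeness_table(toy)
print(table)
```

On this toy frame, column `a` comes out 75% complete and column `b` 100% complete.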


In [8]:
# Statistical description of the dataset
display(df.describe())

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,target
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,...,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,468.181612,301.960218,317.132996,283.213456,485.97595,320.953859,217.13633,387.196289,179.485453,196.003058,...,3.372667,479.841915,100.112096,28.334725,277.205999,91.217615,70.927129,27.385266,457.016407,14.631342
std,270.797415,170.691136,176.50192,157.698215,272.59403,185.157189,123.660691,235.131376,104.059309,111.042671,...,1.986369,276.304197,59.149794,16.105155,167.38593,53.950523,40.588905,15.509062,270.650146,5.089503
min,0.916648,0.800119,0.173025,0.308823,0.598527,0.997347,0.402436,0.474825,0.72785,0.264253,...,0.004464,1.70624,0.194306,0.222312,1.726915,0.093789,0.072986,0.101761,0.252919,0.279805
25%,239.33014,157.338244,167.516318,151.302826,250.305362,158.37461,106.876625,195.037853,94.220273,95.052359,...,1.63015,249.56026,48.132939,14.660437,131.516567,44.35797,34.903031,13.815298,221.270792,10.879914
50%,477.75062,303.257176,326.310194,294.574403,493.470486,328.722464,217.09809,364.124238,173.364771,199.587048,...,3.349497,468.536888,100.247827,27.953146,275.950814,87.484343,74.429675,27.766111,462.153497,14.687955
75%,704.650292,448.878174,474.484472,415.806162,720.747672,480.076559,327.411334,588.565017,268.790459,291.546642,...,5.024192,713.291136,149.966755,42.514155,423.110446,137.888018,105.484961,40.402938,688.352373,18.224713
max,940.771543,595.359858,614.271632,549.896216,950.017444,638.199832,426.308251,809.346792,367.084755,384.919108,...,6.859269,979.715063,203.122292,56.467485,566.611509,187.041256,138.675389,53.25474,935.740775,27.360789


#### Renaming features

In [9]:
# List the column names
display(df.columns.to_list())

['feature_0',
 'feature_1',
 'feature_2',
 'feature_3',
 'feature_4',
 'feature_5',
 'feature_6',
 'feature_7',
 'feature_8',
 'feature_9',
 'feature_10',
 'feature_11',
 'feature_12',
 'feature_13',
 'feature_14',
 'feature_15',
 'feature_16',
 'feature_17',
 'feature_18',
 'feature_19',
 'target']

In [10]:
# Manual mapping of variables by type
id_feats = []
date_feats = []
num_feats = [c for c in df.columns if c.startswith("feature_")]
cat_feats = []
text_feats = []
geo_feats = []
tgt_feats = ['target']

# Apply the type prefix to each group of columns
df = rename_cols(df, id_feats, "id_")
df = rename_cols(df, date_feats, "date_")
df = rename_cols(df, num_feats, "num_")
df = rename_cols(df, cat_feats, "cat_")
df = rename_cols(df, text_feats, "text_")
df = rename_cols(df, geo_feats, "geo_")
df = rename_cols(df, tgt_feats, "tgt_")

# Dataset completeness
completitud_df = completitud(df)
print(completitud_df)

           columna  total  completitud
0    num_feature_0      0        100.0
1    num_feature_1      0        100.0
2    num_feature_2      0        100.0
3    num_feature_3      0        100.0
4    num_feature_4      0        100.0
5    num_feature_5      0        100.0
6    num_feature_6      0        100.0
7    num_feature_7      0        100.0
8    num_feature_8      0        100.0
9    num_feature_9      0        100.0
10  num_feature_10      0        100.0
11  num_feature_11      0        100.0
12  num_feature_12      0        100.0
13  num_feature_13      0        100.0
14  num_feature_14      0        100.0
15  num_feature_15      0        100.0
16  num_feature_16      0        100.0
17  num_feature_17      0        100.0
18  num_feature_18      0        100.0
19  num_feature_19      0        100.0
20      tgt_target      0        100.0


In [11]:
# Data types
print(df.dtypes)

num_feature_0     float64
num_feature_1     float64
num_feature_2     float64
num_feature_3     float64
num_feature_4     float64
num_feature_5     float64
num_feature_6     float64
num_feature_7     float64
num_feature_8     float64
num_feature_9     float64
num_feature_10    float64
num_feature_11    float64
num_feature_12    float64
num_feature_13    float64
num_feature_14    float64
num_feature_15    float64
num_feature_16    float64
num_feature_17    float64
num_feature_18    float64
num_feature_19    float64
tgt_target        float64
dtype: object


All columns are already `float64`, so no type conversion is needed.

#### Removing duplicates

In [12]:
# Fully duplicated rows (every row in a duplicate group is counted)
duplicados = df[df.duplicated(keep=False)]
print(f"Number of fully duplicated rows: {duplicados.shape[0]}")
# Overwrite df, keeping the first occurrence of each duplicated row
df = df.drop_duplicates(keep='first')
print(f"DataFrame without fully duplicated rows. Total rows: {df.shape[0]}\n")


Number of fully duplicated rows: 0
DataFrame without fully duplicated rows. Total rows: 800



In this case there is no need to check for duplicates on specific key columns, since the dataset has no identifier columns.
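
The two pandas calls used above count different things, which matters when duplicates do exist. A small sketch on toy data (the frame `toy` is illustrative only): `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates(keep="first")` removes only the later copies.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 1, 2], "y": [10, 10, 20]})

# keep=False flags every member of a duplicate group (both copies here)
n_flagged = int(toy.duplicated(keep=False).sum())

# keep="first" removes only the later copies
deduped = toy.drop_duplicates(keep="first")
n_removed = len(toy) - len(deduped)

print(n_flagged, n_removed)  # -> 2 1
```

So the number printed by the first statement in the cell can be larger than the number of rows actually dropped.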

#### Removing variables by completeness

In [13]:
# Drop variables with completeness below 80%
df = filtrar_por_completitud(df, umbral=80)
# Dataset completeness
completitud_df = completitud(df)
print(completitud_df)

✅ No columns were removed; all meet the threshold.
           columna  total  completitud
0    num_feature_0      0        100.0
1    num_feature_1      0        100.0
2    num_feature_2      0        100.0
3    num_feature_3      0        100.0
4    num_feature_4      0        100.0
5    num_feature_5      0        100.0
6    num_feature_6      0        100.0
7    num_feature_7      0        100.0
8    num_feature_8      0        100.0
9    num_feature_9      0        100.0
10  num_feature_10      0        100.0
11  num_feature_11      0        100.0
12  num_feature_12      0        100.0
13  num_feature_13      0        100.0
14  num_feature_14      0        100.0
15  num_feature_15      0        100.0
16  num_feature_16      0        100.0
17  num_feature_17      0        100.0
18  num_feature_18      0        100.0
19  num_feature_19      0        100.0
20      tgt_target      0        100.0


#### Removing unary categorical variables

In [14]:
# Identify unary categorical variables using a 0.9 threshold
unarias_umbral = unarias_ponderadas(df, 0.9)
print(unarias_umbral)

✅ No unary variables found above the threshold.
[]
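
`unarias_ponderadas` is defined elsewhere in the notebook, so its exact rule is not shown here. One plausible reading of a "unary" variable with a 0.9 threshold is a column whose most frequent value covers at least 90% of the rows. The sketch below implements that reading; `dominant_share` and the `>=` comparison are assumptions.

```python
import pandas as pd

def dominant_share(series: pd.Series) -> float:
    """Share of rows taken by the most frequent value (NaN counted as a value)."""
    return float(series.value_counts(normalize=True, dropna=False).iloc[0])

toy = pd.DataFrame({
    "cat_almost_constant": ["a"] * 19 + ["b"],  # 95% "a"
    "cat_balanced": ["a", "b"] * 10,            # 50% / 50%
})

threshold = 0.9  # same value passed to unarias_ponderadas above
unary_cols = [c for c in toy.columns if dominant_share(toy[c]) >= threshold]
print(unary_cols)  # -> ['cat_almost_constant']
```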


In [15]:
# Drop the unary variables identified above (as columns, not row labels)
df = df.drop(columns=unarias_umbral)
# Dataset completeness
completitud_df = completitud(df)
print(completitud_df)

           columna  total  completitud
0    num_feature_0      0        100.0
1    num_feature_1      0        100.0
2    num_feature_2      0        100.0
3    num_feature_3      0        100.0
4    num_feature_4      0        100.0
5    num_feature_5      0        100.0
6    num_feature_6      0        100.0
7    num_feature_7      0        100.0
8    num_feature_8      0        100.0
9    num_feature_9      0        100.0
10  num_feature_10      0        100.0
11  num_feature_11      0        100.0
12  num_feature_12      0        100.0
13  num_feature_13      0        100.0
14  num_feature_14      0        100.0
15  num_feature_15      0        100.0
16  num_feature_16      0        100.0
17  num_feature_17      0        100.0
18  num_feature_18      0        100.0
19  num_feature_19      0        100.0
20      tgt_target      0        100.0


#### Descriptive statistics for categorical variables

In [16]:
# Categorical descriptives
descriptivos(df, tipo_variable = "cat_")

**Observations**:

There are no categorical variables.


#### Normalization of categorical variables

**Observations**:

There are no categorical variables.

#### Descriptive statistics for numeric variables

In [17]:
# Numeric descriptives
#descriptivos(df, tipo_variable = "num_")
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
num_feature_0,800.0,468.181612,270.797415,0.916648,239.33014,477.75062,704.650292,940.771543
num_feature_1,800.0,301.960218,170.691136,0.800119,157.338244,303.257176,448.878174,595.359858
num_feature_2,800.0,317.132996,176.50192,0.173025,167.516318,326.310194,474.484472,614.271632
num_feature_3,800.0,283.213456,157.698215,0.308823,151.302826,294.574403,415.806162,549.896216
num_feature_4,800.0,485.97595,272.59403,0.598527,250.305362,493.470486,720.747672,950.017444
num_feature_5,800.0,320.953859,185.157189,0.997347,158.37461,328.722464,480.076559,638.199832
num_feature_6,800.0,217.13633,123.660691,0.402436,106.876625,217.09809,327.411334,426.308251
num_feature_7,800.0,387.196289,235.131376,0.474825,195.037853,364.124238,588.565017,809.346792
num_feature_8,800.0,179.485453,104.059309,0.72785,94.220273,173.364771,268.790459,367.084755
num_feature_9,800.0,196.003058,111.042671,0.264253,95.052359,199.587048,291.546642,384.919108


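**Observations**:

Numeric variables:

* No missing values, so no imputation is needed.
* 800 observations.
* Among the columns displayed by `df.describe()`, feature means span roughly two orders of magnitude, from ~3.4 (feature_11) to ~486 (feature_4).
* Standard deviations are large relative to the means (typically more than half the mean).
* Scales differ across features, so scale-sensitive models (e.g. regularized linear models) need standardized features.
* Target variable: mean ≈ 14.6, std ≈ 5.1, range ≈ 0.28 to 27.36.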
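
To make the point about scales concrete, here is a minimal sketch on synthetic data (not the notebook's dataset) of how `StandardScaler` puts two features with very different ranges on a common scale. The scaler is fitted on the training rows only and then reused on the validation rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two synthetic uniform features on very different scales,
# loosely mimicking feature_11 (~0-7) and feature_4 (~0-950)
X_train = np.column_stack([rng.uniform(0, 7, 640), rng.uniform(0, 950, 640)])
X_val = np.column_stack([rng.uniform(0, 7, 160), rng.uniform(0, 950, 160)])

scaler = StandardScaler().fit(X_train)   # statistics come from train only
X_train_std = scaler.transform(X_train)
X_val_std = scaler.transform(X_val)      # validation reuses the train statistics

print(X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
```

After scaling, both training columns have mean 0 and standard deviation 1, so a penalty such as the Elastic Net's treats them evenly. Tree-based models do not depend on feature scale.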



#### Consistency of variables (numeric and categorical)

In this case, values are assumed to fall within acceptable ranges. If the meaning of each feature becomes known later, add the corresponding range checks to this section.

### Exploratory data analysis

In [18]:
# Full visual exploratory analysis
compute_EDA(df)

Dataset dimensions: (800, 21)
Total numeric variables plotted: 21



Unnamed: 0,feature,VIF
0,num_feature_2,4.096813
1,num_feature_3,4.082605
2,num_feature_6,4.035078
3,num_feature_4,4.033025
4,num_feature_18,3.992109
5,num_feature_14,3.970153
6,num_feature_10,3.950958
7,num_feature_9,3.94789
8,num_feature_13,3.935431
9,num_feature_12,3.889785


Unnamed: 0,columna,total,completitud
0,num_feature_0,0,100.0
1,num_feature_1,0,100.0
2,num_feature_2,0,100.0
3,num_feature_3,0,100.0
4,num_feature_4,0,100.0
5,num_feature_5,0,100.0
6,num_feature_6,0,100.0
7,num_feature_7,0,100.0
8,num_feature_8,0,100.0
9,num_feature_9,0,100.0


In [19]:
# Scatter plots of the target against the features with the strongest linear correlation
cols_corr_lineal_high = ['num_feature_2', 'num_feature_9', 'num_feature_11', 'num_feature_13']
for col in cols_corr_lineal_high:
    fig = px.scatter(
        df,
        x=col,
        y="tgt_target",
        title=f"<b>Target vs. {col}</b>",
    )
    fig.show()

**Observations**:

> Categorical:

There are no categorical variables.

> Numeric:

* Distributions are mostly uniform.
* No clear outliers are visible in the plots.
* The target variable looks approximately normal in the plots.

> Correlation:

* feature_2, feature_9, feature_11 and feature_13 show a weak positive linear correlation with the target.
* No problematic multicollinearity among the features (all displayed VIF values are below 5).
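
For reference, the variance inflation factor of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. With an intercept in that regression, it equals the j-th diagonal entry of the inverse correlation matrix. The sketch below uses that identity on synthetic data. `compute_EDA` may use a different routine (for example, one that handles the intercept differently), so absolute values need not match the table above.

```python
import numpy as np

def vif_from_correlation(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
n = 800
a = rng.uniform(0, 1, n)
b = rng.uniform(0, 1, n)
c = a + 0.05 * rng.normal(size=n)   # nearly a copy of `a`, hence collinear with it

vif_independent = vif_from_correlation(np.column_stack([a, b]))
vif_collinear = vif_from_correlation(np.column_stack([a, b, c]))
print(vif_independent.round(2), vif_collinear.round(2))
```

Independent features give values close to 1, while `a` and its noisy copy `c` land well above the usual cut-off of 5.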

#### Train and validation split

In [20]:
from sklearn.model_selection import train_test_split

# Split into train and validation sets (80% / 20%)
X = df.drop(columns=["tgt_target"])
y = df["tgt_target"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

# Keep untouched copies of the split for the final pipeline check
X_train_copy = X_train.copy()
X_val_copy   = X_val.copy()
y_train_copy = y_train.copy()
y_val_copy   = y_val.copy()

print(X_train.shape, X_val.shape)

(640, 20) (160, 20)


#### Removing records with outliers in numeric variables

In [21]:
# Remove outliers from the training set ONLY
X_train, y_train = remove_outliers_iforest_train(
    X_train,
    y_train,
    prefix="num_",
    contamination=0.02
)


✅ Records kept: 627 / 640 (98.0%)
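
`remove_outliers_iforest_train` is defined elsewhere in the notebook. The general pattern it points to can be sketched with scikit-learn's `IsolationForest` on synthetic data: fit on the training rows only, then filter features and target with the same mask. The data and variable names below are illustrative, not the helper's code.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_train = rng.uniform(0, 500, size=(640, 5))
y_train = rng.normal(15, 5, size=640)

iso = IsolationForest(contamination=0.02, random_state=42)
labels = iso.fit_predict(X_train)      # +1 = inlier, -1 = outlier
mask = labels == 1

# Filter features and target with the same mask so they stay aligned
X_train_clean, y_train_clean = X_train[mask], y_train[mask]
print(f"kept {mask.sum()} / {len(mask)} rows")
```

With `contamination=0.02`, roughly 2% of the 640 training rows (about 13) are flagged, which is consistent with the 627 rows kept above. The validation set is left untouched so that it still resembles unseen data.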


#### Imputation of missing values

In this case, there are no missing values.

#### Feature Engineering

In [22]:
# Add multiplicative interaction features
interaction_pairs = [
    ("num_feature_2", "num_feature_9"),
    ("num_feature_2", "num_feature_11"),
    ("num_feature_2", "num_feature_13")
]

X_train = add_interaction_features(X_train, interaction_pairs)
X_val  = add_interaction_features(X_val, interaction_pairs)
display(X_train)
display(X_val)

Unnamed: 0,num_feature_0,num_feature_1,num_feature_2,num_feature_3,num_feature_4,num_feature_5,num_feature_6,num_feature_7,num_feature_8,num_feature_9,...,num_feature_13,num_feature_14,num_feature_15,num_feature_16,num_feature_17,num_feature_18,num_feature_19,num_feature_2_x_num_feature_9,num_feature_2_x_num_feature_11,num_feature_2_x_num_feature_13
264,744.433848,279.603122,440.885131,249.825761,157.830291,625.017237,307.194733,258.690992,273.900606,227.785191,...,180.224975,26.352384,17.955787,37.874871,94.389309,48.491425,51.936572,100427.103901,1975.481957,79458.511540
615,189.704342,169.515720,159.711405,271.028842,501.661833,503.321821,1.949638,122.868615,203.831963,36.765301,...,164.721301,30.198920,332.380560,71.632278,23.645069,49.790943,383.143462,5871.837912,941.464250,26307.870473
329,112.708424,408.293215,79.507889,480.510533,263.485517,242.279372,58.339004,568.262343,265.055920,66.267849,...,5.561363,20.714709,474.128719,77.917635,37.191700,5.916702,599.084617,5268.816756,315.843064,442.172255
342,609.812562,331.544645,276.439661,238.682854,427.169169,511.939539,274.753389,223.864986,267.561102,16.599789,...,133.331068,45.714448,203.742245,142.333688,55.783508,7.889559,659.932887,4588.840078,223.433616,36857.995217
394,821.414806,87.192228,474.434225,389.224064,259.875231,425.253691,305.836166,743.077033,110.102407,247.369275,...,116.163276,2.653386,78.839082,20.770825,41.816844,30.367408,143.959024,117360.450407,2920.861135,55111.833772
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,891.654883,408.168478,205.926767,206.273501,167.634453,240.094591,87.790513,351.663655,181.951406,25.035889,...,101.021418,51.256217,481.739674,17.385602,86.397912,29.727572,587.186525,5155.559641,733.337979,20803.014042
106,176.476694,152.915578,447.580349,239.978405,544.426190,205.916192,15.236929,795.330606,225.513680,369.039811,...,106.542329,30.277714,268.379850,66.543393,84.410002,11.341218,435.518738,165174.967375,752.402925,47686.252677
270,312.072869,519.586266,70.238540,435.518407,641.828073,195.292272,82.071336,744.668102,193.362857,333.416203,...,113.955421,25.926394,240.682738,27.821588,78.565938,9.977632,312.210791,23418.667446,234.160370,8004.062476
435,69.506949,378.216064,510.923901,60.117238,611.489879,434.272489,409.036731,201.968091,354.640203,198.274549,...,126.655925,34.960160,539.976224,90.650051,93.697843,43.994574,877.562649,101303.205996,3058.394576,64711.539372


Unnamed: 0,num_feature_0,num_feature_1,num_feature_2,num_feature_3,num_feature_4,num_feature_5,num_feature_6,num_feature_7,num_feature_8,num_feature_9,...,num_feature_13,num_feature_14,num_feature_15,num_feature_16,num_feature_17,num_feature_18,num_feature_19,num_feature_2_x_num_feature_9,num_feature_2_x_num_feature_11,num_feature_2_x_num_feature_13
696,230.501161,341.176848,227.022576,54.311683,210.603737,535.737568,78.297313,747.406206,52.594414,221.180395,...,100.383182,35.244147,189.417120,133.193441,138.264625,44.989388,181.347223,50212.943152,460.126108,22789.248516
667,776.188804,467.432919,241.685974,283.858851,738.306768,562.440868,360.388571,432.755064,223.549078,60.073949,...,178.202329,31.983917,298.951201,38.888052,84.543431,39.405872,865.934469,14519.030811,794.693227,43069.003360
63,125.400716,226.877674,525.739629,481.685534,743.005516,234.479521,216.628360,119.107896,38.374721,375.412612,...,32.016665,3.373459,41.380871,118.112328,128.841290,50.893061,859.959527,197369.287540,3425.276273,16832.429804
533,530.937076,594.982682,514.077688,51.971303,752.263759,71.946641,133.556079,717.382769,61.596468,284.738164,...,97.240815,8.915045,424.242044,92.325721,37.162258,11.164005,3.137824,146377.536964,2510.948593,49989.333268
66,195.143589,329.264620,76.977000,211.083857,540.630523,199.344441,208.784218,758.789873,81.404173,170.850305,...,178.800434,13.348175,533.458261,1.497290,34.406303,27.105381,835.131218,13151.543889,233.875687,13763.520996
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
589,584.080907,311.310376,175.314512,474.020833,372.299587,270.435811,109.735211,294.797595,144.359117,133.645173,...,34.211975,33.482820,397.187934,54.279408,109.097865,8.082211,903.302312,23429.938179,176.702193,5997.855728
798,783.876505,477.530653,216.890295,175.429988,348.212630,593.979057,55.137074,226.367598,329.873123,302.451012,...,38.852525,46.472905,352.709718,96.373785,114.783223,51.264147,703.326040,65598.689194,834.951490,8426.735619
744,289.794233,476.545691,264.606258,370.854918,234.851373,528.431963,376.291535,476.028290,263.371861,172.650747,...,102.320810,13.299969,13.240139,100.265077,48.585088,14.542984,146.534683,45684.468119,1513.442240,27074.726760
513,784.920768,181.708297,332.208205,50.592577,123.640831,155.137675,360.627965,232.394676,365.058929,352.198630,...,17.746427,29.063848,35.387880,168.462981,124.521589,18.236196,62.663296,117003.274363,895.158694,5895.508620
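
Judging by the new column names and values, each interaction feature appears to be the element-wise product of its two parent columns. The sketch below shows that reading and checks it against row 264 of the training table. `add_products` is a hypothetical stand-in for `add_interaction_features`, which is defined elsewhere in the notebook.

```python
import pandas as pd

def add_products(df: pd.DataFrame, pairs) -> pd.DataFrame:
    """Hypothetical equivalent of add_interaction_features: one product column per pair."""
    out = df.copy()
    for a, b in pairs:
        out[f"{a}_x_{b}"] = out[a] * out[b]
    return out

# Values of row 264 from the training table above
row = pd.DataFrame(
    {"num_feature_2": [440.885131], "num_feature_9": [227.785191]},
    index=[264],
)
result = add_products(row, [("num_feature_2", "num_feature_9")])
product = result["num_feature_2_x_num_feature_9"].iloc[0]
print(round(product, 2))  # close to the 100427.10 shown in the table
```

The small difference in the last decimals comes from the table displaying rounded inputs.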


#### Encoding categorical variables

In this case, there are no categorical variables.

#### Feature Selection

In [23]:
# Drop highly correlated variables
to_drop = drop_high_corr(X_train, threshold=0.9)
X_train = X_train.drop(columns=to_drop)
X_val = X_val.drop(columns=to_drop)



✅ No highly correlated variables found.


In [24]:
# Feature selection via Elastic Net regularization with cross-validation
enet = Pipeline([
    ("scaler", StandardScaler()),
    ("model", ElasticNetCV(
        l1_ratio=[0.1, 0.5, 0.9],
        alphas=np.logspace(-4, 0, 50),
        cv=5,
        max_iter=5000,
        random_state=42
    ))
])

enet.fit(X_train, y_train)

coef = enet.named_steps["model"].coef_

selected_features = X_train.columns[coef != 0]
discarded_features = X_train.columns[coef == 0]

print(f"✔ Selected features: {len(selected_features)}")
print(f"✘ Discarded features: {len(discarded_features)}")

X_train = X_train[selected_features]
X_val = X_val[selected_features]

display(X_train)
display(X_val)


✔ Selected features: 16
✘ Discarded features: 7


Unnamed: 0,num_feature_1,num_feature_2,num_feature_3,num_feature_4,num_feature_6,num_feature_9,num_feature_11,num_feature_12,num_feature_13,num_feature_14,num_feature_15,num_feature_17,num_feature_18,num_feature_19,num_feature_2_x_num_feature_9,num_feature_2_x_num_feature_13
264,279.603122,440.885131,249.825761,157.830291,307.194733,227.785191,4.480718,429.185756,180.224975,26.352384,17.955787,94.389309,48.491425,51.936572,100427.103901,79458.511540
615,169.515720,159.711405,271.028842,501.661833,1.949638,36.765301,5.894784,549.840700,164.721301,30.198920,332.380560,23.645069,49.790943,383.143462,5871.837912,26307.870473
329,408.293215,79.507889,480.510533,263.485517,58.339004,66.267849,3.972475,399.948483,5.561363,20.714709,474.128719,37.191700,5.916702,599.084617,5268.816756,442.172255
342,331.544645,276.439661,238.682854,427.169169,274.753389,16.599789,0.808255,722.529313,133.331068,45.714448,203.742245,55.783508,7.889559,659.932887,4588.840078,36857.995217
394,87.192228,474.434225,389.224064,259.875231,305.836166,247.369275,6.156514,497.190483,116.163276,2.653386,78.839082,41.816844,30.367408,143.959024,117360.450407,55111.833772
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,408.168478,205.926767,206.273501,167.634453,87.790513,25.035889,3.561159,540.343326,101.021418,51.256217,481.739674,86.397912,29.727572,587.186525,5155.559641,20803.014042
106,152.915578,447.580349,239.978405,544.426190,15.236929,369.039811,1.681045,295.538626,106.542329,30.277714,268.379850,84.410002,11.341218,435.518738,165174.967375,47686.252677
270,519.586266,70.238540,435.518407,641.828073,82.071336,333.416203,3.333788,433.247754,113.955421,25.926394,240.682738,78.565938,9.977632,312.210791,23418.667446,8004.062476
435,378.216064,510.923901,60.117238,611.489879,409.036731,198.274549,5.986008,331.308534,126.655925,34.960160,539.976224,93.697843,43.994574,877.562649,101303.205996,64711.539372


Unnamed: 0,num_feature_1,num_feature_2,num_feature_3,num_feature_4,num_feature_6,num_feature_9,num_feature_11,num_feature_12,num_feature_13,num_feature_14,num_feature_15,num_feature_17,num_feature_18,num_feature_19,num_feature_2_x_num_feature_9,num_feature_2_x_num_feature_13
696,341.176848,227.022576,54.311683,210.603737,78.297313,221.180395,2.026786,966.758718,100.383182,35.244147,189.417120,138.264625,44.989388,181.347223,50212.943152,22789.248516
667,467.432919,241.685974,283.858851,738.306768,360.388571,60.073949,3.288123,589.223003,178.202329,31.983917,298.951201,84.543431,39.405872,865.934469,14519.030811,43069.003360
63,226.877674,525.739629,481.685534,743.005516,216.628360,375.412612,6.515157,632.014203,32.016665,3.373459,41.380871,128.841290,50.893061,859.959527,197369.287540,16832.429804
533,594.982682,514.077688,51.971303,752.263759,133.556079,284.738164,4.884376,528.404699,97.240815,8.915045,424.242044,37.162258,11.164005,3.137824,146377.536964,49989.333268
66,329.264620,76.977000,211.083857,540.630523,208.784218,170.850305,3.038254,720.799012,178.800434,13.348175,533.458261,34.406303,27.105381,835.131218,13151.543889,13763.520996
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
589,311.310376,175.314512,474.020833,372.299587,109.735211,133.645173,1.007915,87.282841,34.211975,33.482820,397.187934,109.097865,8.082211,903.302312,23429.938179,5997.855728
798,477.530653,216.890295,175.429988,348.212630,55.137074,302.451012,3.849649,324.661269,38.852525,46.472905,352.709718,114.783223,51.264147,703.326040,65598.689194,8426.735619
744,476.545691,264.606258,370.854918,234.851373,376.291535,172.650747,5.719601,35.424520,102.320810,13.299969,13.240139,48.585088,14.542984,146.534683,45684.468119,27074.726760
513,181.708297,332.208205,50.592577,123.640831,360.627965,352.198630,2.694571,294.071397,17.746427,29.063848,35.387880,124.521589,18.236196,62.663296,117003.274363,5895.508620


In [25]:
# Variable importance according to the Elastic Net coefficients
coef_df = (
    pd.Series(coef, index=enet.feature_names_in_)
      .loc[selected_features]
      .sort_values(key=np.abs, ascending=False)
)

coef_df

num_feature_2                     2.663126
num_feature_13                    1.840883
num_feature_9                     1.781615
num_feature_11                    1.651494
num_feature_14                    0.185016
num_feature_18                    0.143467
num_feature_2_x_num_feature_13    0.120161
num_feature_3                     0.101233
num_feature_17                   -0.093307
num_feature_19                   -0.091890
num_feature_4                    -0.055350
num_feature_6                     0.049421
num_feature_12                    0.044403
num_feature_15                    0.040390
num_feature_1                    -0.016645
num_feature_2_x_num_feature_9     0.013769
dtype: float64
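
Because the Elastic Net is fitted after `StandardScaler`, its coefficients are on a common scale, so their absolute values can be ranked, and a coefficient shrunk exactly to zero marks a discarded feature. A minimal sketch of that mechanism on synthetic data, where only the first two columns drive the target by construction:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(400, 8))
# Only columns 0 and 1 drive the synthetic target; the rest are pure noise
y = 0.05 * X[:, 0] + 0.03 * X[:, 1] + rng.normal(0, 0.5, size=400)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=5000, random_state=42)),
])
pipe.fit(X, y)

coef = pipe.named_steps["model"].coef_
selected = np.flatnonzero(coef != 0)
ranking = np.argsort(-np.abs(coef))   # largest |coefficient| first
print("selected columns:", selected)
print("ranking by |coef|:", ranking)
```

The two informative columns come out on top of the ranking. Noise columns may or may not be zeroed, depending on the penalty strength chosen by cross-validation.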

In [26]:
# Variable importance according to Random Forest
rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

rf_importance = pd.Series(
    rf.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

rf_importance

num_feature_2_x_num_feature_13    0.442829
num_feature_2_x_num_feature_9     0.177318
num_feature_11                    0.126450
num_feature_9                     0.060748
num_feature_2                     0.052515
num_feature_18                    0.042606
num_feature_13                    0.034233
num_feature_3                     0.008820
num_feature_19                    0.007519
num_feature_14                    0.007443
num_feature_1                     0.007431
num_feature_17                    0.006867
num_feature_6                     0.006720
num_feature_4                     0.006437
num_feature_12                    0.006162
num_feature_15                    0.005902
dtype: float64

#### Baseline model (ElasticNet)

In [27]:
# Pipeline

pipe_enet = Pipeline([
    ("scaler", StandardScaler()),
    ("model", ElasticNet(max_iter=5000, random_state=42))
])

# Bayesian Search CV

search_enet = BayesSearchCV(
    pipe_enet,
    {
        "model__alpha": Real(1e-4, 1.0, prior="log-uniform"),
        "model__l1_ratio": Real(0.05, 0.95)
    },
    n_iter=30,
    cv=5,
    scoring="r2",
    n_jobs=-1,
    random_state=42
)

search_enet.fit(X_train, y_train)

# Model evaluation

enet_best = search_enet.best_estimator_

enet_metrics = evaluate_regressor_model(
    enet_best, X_train, y_train, X_val, y_val
)


#### Random Forest model (non-linear, no scaling)

In [28]:
# Pipeline

pipe_rf = Pipeline([
    ("model", RandomForestRegressor(
        n_estimators=300,
        random_state=42,
        n_jobs=-1
    ))
])

# Bayesian Search CV

search_rf = BayesSearchCV(
    pipe_rf,
    {
        "model__n_estimators": Integer(200, 600),
        "model__max_depth": Integer(3, 20),
        "model__min_samples_leaf": Integer(2, 15)
    },
    n_iter=30,
    cv=5,
    scoring="r2",
    n_jobs=-1,
    random_state=42
)

search_rf.fit(X_train, y_train)

# Model evaluation

rf_best = search_rf.best_estimator_

rf_metrics = evaluate_regressor_model(
    rf_best, X_train, y_train, X_val, y_val
)


#### XGBoost model (non-linear, no scaling required)

In [29]:
# Pipeline

pipe_xgb = Pipeline([
    ("model", XGBRegressor(
        objective="reg:squarederror",
        random_state=42,
        n_jobs=-1
    ))
])

# Bayesian Search CV

search_xgb = BayesSearchCV(
    pipe_xgb,
    {
        "model__n_estimators": Integer(200, 800),
        "model__max_depth": Integer(3, 8),
        "model__learning_rate": Real(0.01, 0.3, prior="log-uniform"),
        "model__subsample": Real(0.6, 1.0),
        "model__colsample_bytree": Real(0.6, 1.0)
    },
    n_iter=40,
    cv=5,
    scoring="r2",
    n_jobs=-1,
    random_state=42
)

search_xgb.fit(X_train, y_train)

# Model evaluation

xgb_best = search_xgb.best_estimator_
xgb_best_params = xgb_best.named_steps["model"].get_params()


xgb_metrics = evaluate_regressor_model(
    xgb_best, X_train, y_train, X_val, y_val
)


#### Selecting the best model

In [30]:
results = pd.DataFrame.from_dict(
    {
        "ElasticNet": enet_metrics,
        "RandomForest": rf_metrics,
        "XGBoost": xgb_metrics
    },
    orient="index"
).sort_values(by="R2_test", ascending=False)

display(results)


Unnamed: 0,R2_train_CV,RMSE_train_CV,MAE_train_CV,MAPE_train_CV (%),R2_test,RMSE_test,MAE_test,MAPE_test (%)
XGBoost,0.83831,2.075517,1.633324,15.934728,0.850974,1.829158,1.47071,11.41585
RandomForest,0.754542,2.557253,2.010935,20.75513,0.7714,2.265474,1.841662,14.360814
ElasticNet,0.680622,2.917007,2.232953,21.49822,0.693004,2.625348,2.115523,15.781392


**Observations**:

The `*_test` columns are computed on the validation split. Considering:

* Highest R²_test (**~0.85**)

* Lowest errors (RMSE, MAE, MAPE)

* Good balance between cross-validated training performance and validation performance

* Ability to capture complex non-linear relationships

**XGBoost** was selected as the final model for full training and deployment on unseen data.
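
For readers less familiar with the four metrics in the comparison table, the sketch below computes them by hand on a toy example. The four values are illustrative only, and `evaluate_regressor_model` itself is defined elsewhere in the notebook.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0])

err = y_pred - y_true
mae = np.mean(np.abs(err))                             # mean absolute error
rmse = np.sqrt(np.mean(err ** 2))                      # root mean squared error
mape = np.mean(np.abs(err) / np.abs(y_true)) * 100     # mean absolute percentage error
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"R2={r2:.3f} RMSE={rmse:.3f} MAE={mae:.3f} MAPE={mape:.3f}%")
# -> R2=0.964 RMSE=2.121 MAE=2.000 MAPE=10.625%
```

RMSE and MAE are in the units of the target, MAPE is relative to each true value, and R² compares the squared error against that of always predicting the mean.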

#### Training the final model on the full TRAIN dataset

In [31]:
from sklearn.base import clone

# Define the best model from the tuned hyperparameters
xgb_model = XGBRegressor(**xgb_best_params)
print(is_fitted(xgb_model))  # Check whether the model is fitted (should be False)

#######################################################################################################
######## SANITY CHECK ON THE X_train / y_train SPLIT ONLY (OPTIONAL; YOU CAN COMMENT IT OUT) ##########

# Clone the (unfitted) model
xgb_model_copy = clone(xgb_model)
# Fit the transformers on the training data from the split
X_train_copy_transformed = xgb_pipeline.fit_transform(X_train_copy, y_train_copy)
# Apply the same fitted transformers to the validation data from the split
X_val_copy_transformed = xgb_pipeline.transform(X_val_copy)
xgb_model_copy.fit(X_train_copy_transformed, y_train_copy)
best_xgb_metrics_split = evaluate_regressor_model(xgb_model_copy, X_train_copy_transformed, y_train_copy,  X_val_copy_transformed, y_val_copy)
results = pd.DataFrame.from_dict(
    {
        "XGBoost": best_xgb_metrics_split
    },
    orient="index"
)
display(results)

#######################################################################################################
#######################################################################################################

# Fit the transformers on the full training data
X_transformed = xgb_pipeline.fit_transform(X, y)

# Evaluate the best model on the full training data (cross-validated metrics only)
best_xgb_metrics = evaluate_regressor_model(xgb_model, X_transformed, y)
results = pd.DataFrame.from_dict(
    {
        "XGBoost": best_xgb_metrics
    },
    orient="index"
)
display(results)

# Train the model on the preprocessed data
print(is_fitted(xgb_model))  # Check whether the model is fitted (should be False)
xgb_model.fit(X_transformed, y)


False
✅ Columns renamed
✅ No columns were removed; all meet the threshold.
✅ Unary-variable filter applied with threshold 0.9. 0 columns removed.
✅ No unary variables found above the threshold.
✅ Added 3 features: ['num_feature_2_x_num_feature_9', 'num_feature_2_x_num_feature_11', 'num_feature_2_x_num_feature_13']
✅ No highly correlated variables found.
✅ ElasticNet selected 15 features
✘ ElasticNet discarded 8 features
✅ Columns renamed
✅ Added 3 features: ['num_feature_2_x_num_feature_9', 'num_feature_2_x_num_feature_11', 'num_feature_2_x_num_feature_13']


Unnamed: 0,R2_train_CV,RMSE_train_CV,MAE_train_CV,MAPE_train_CV (%),R2_test,RMSE_test,MAE_test,MAPE_test (%)
XGBoost,0.853151,1.979746,1.57126,15.544922,0.84929,1.839468,1.479795,11.547724


✅ Columns renamed
✅ No columns were removed; all meet the threshold.
✅ Unary-variable filter applied with threshold 0.9. 0 columns removed.
✅ No unary variables found above the threshold.
✅ Added 3 features: ['num_feature_2_x_num_feature_9', 'num_feature_2_x_num_feature_11', 'num_feature_2_x_num_feature_13']
✅ No highly correlated variables found.
✅ ElasticNet selected 13 features
✘ ElasticNet discarded 10 features


Unnamed: 0,R2_train_CV,RMSE_train_CV,MAE_train_CV,MAPE_train_CV (%)
XGBoost,0.862934,1.883078,1.496052,14.434526


False


#### Prediction on new data

In [32]:
# Load the blind test data
df_live = cargar_datos()
# Apply the same fitted transformers to the blind test data
df_live_transformed = xgb_pipeline.transform(df_live)
# Predict
y_pred_live = xgb_model.predict(df_live_transformed)


✅ Dataset loaded successfully. 200 rows and 20 columns.
✅ Columns renamed
✅ Added 3 features: ['num_feature_2_x_num_feature_9', 'num_feature_2_x_num_feature_11', 'num_feature_2_x_num_feature_13']


#### Saving the final model and predictions

In [33]:
# Save the predictions to CSV
save_predictions_csv(y_pred_live, output_path="predictions_blind_test.csv")
# Save the preprocessing pipeline and the trained model
guardar_pipeline_y_modelo(xgb_pipeline, xgb_model, pipeline_path="preprocessing_pipeline.pkl", model_path="xgb_bestmodel.pkl")


✅ 200 predictions saved successfully to 'predictions_blind_test.csv'
✅ Preprocessing pipeline saved to: preprocessing_pipeline.pkl
✅ XGBoost model saved to: xgb_bestmodel.pkl
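
The two `.pkl` artifacts are meant to be reloaded later and applied in the same order: preprocessing first, then the model. How `guardar_pipeline_y_modelo` serializes them is not shown in this section, so the sketch below assumes plain `pickle` and uses small scikit-learn stand-ins (`StandardScaler`, `LinearRegression`) written to a temporary directory, only to show the round trip.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))
y = X @ np.array([0.1, 0.2, 0.3]) + rng.normal(0, 0.1, size=50)

# Stand-ins for the notebook's preprocessing pipeline and XGBoost model
preprocessor = StandardScaler().fit(X)
model = LinearRegression().fit(preprocessor.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    paths = {
        "pre": os.path.join(tmp, "preprocessing.pkl"),
        "model": os.path.join(tmp, "model.pkl"),
    }
    for key, obj in (("pre", preprocessor), ("model", model)):
        with open(paths[key], "wb") as fh:
            pickle.dump(obj, fh)
    # Reload both artifacts, as a separate scoring process would
    with open(paths["pre"], "rb") as fh:
        pre_loaded = pickle.load(fh)
    with open(paths["model"], "rb") as fh:
        model_loaded = pickle.load(fh)

X_new = rng.uniform(0, 100, size=(5, 3))
pred_before = model.predict(preprocessor.transform(X_new))
pred_after = model_loaded.predict(pre_loaded.transform(X_new))
print(np.allclose(pred_before, pred_after))
```

Pickled estimators are generally only safe to reload with the same library versions that produced them, so it helps to record those versions alongside the files.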
