## Building the model

Building on the feature engineering work, we created a set of classes for defining pipelines, so that the preprocessing steps that precede model training can be reproduced and modified easily:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.experimental import enable_iterative_imputer

import sys
sys.path.append('src')
from models.pipeline import CarsPipeline
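
The source of `CarsPipeline` is not shown here, so the following is only a minimal sketch of the general pattern the paragraph above describes: a small custom transformer class chained into a scikit-learn `Pipeline`. The class `ColumnDropper`, the toy DataFrame, and the choice of a median imputer are all hypothetical and are not taken from `models.pipeline`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical step: drop columns that are not used as features."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)


# Toy data (illustrative values only), with one missing value.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "km_driven": [10_000.0, None, 30_000.0],
    "year": [2015, 2018, 2020],
})

toy_pipeline = Pipeline([
    ("drop_name", ColumnDropper(columns=["name"])),
    ("impute", SimpleImputer(strategy="median")),
])

# fit_transform learns the median on this data and returns a NumPy array.
toy_processed = toy_pipeline.fit_transform(toy)
print(toy_processed.shape)
```

Because each step is a class with `fit` and `transform`, a step can be swapped or reordered by editing the list passed to `Pipeline`, and the fitted pipeline can be reused on the test set with `transform` alone.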

In [2]:
# Load and split the data
data = pd.read_csv('../datasets/Car details v3.csv')

data["selling_price_log"] = np.log(data["selling_price"])

X = data.drop(columns=['selling_price', 'selling_price_log'])
y = data['selling_price']
y_log = data['selling_price_log']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X, y_log, test_size=0.3, random_state=42)
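
Because both calls use the same `random_state`, the raw and log-scale targets end up on the same rows. As a hedged alternative, `train_test_split` also accepts several arrays at once, which yields one shared split in a single call. The sketch below uses a small synthetic DataFrame (the column values are made up) rather than the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cars data (values are illustrative only).
toy = pd.DataFrame({
    "km_driven": np.arange(10) * 1000,
    "selling_price": np.linspace(100_000, 1_000_000, 10),
})
X_toy = toy[["km_driven"]]
y_toy = toy["selling_price"]
y_toy_log = np.log(y_toy)

# One call splits features and both targets on the same rows.
X_tr, X_te, y_tr, y_te, y_tr_log, y_te_log = train_test_split(
    X_toy, y_toy, y_toy_log, test_size=0.3, random_state=42
)

# Both targets refer to the same observations, and exp() undoes the log.
same_rows = y_tr.index.equals(y_tr_log.index)
round_trip_ok = np.allclose(np.exp(y_tr_log), y_tr)
print(same_rows, round_trip_ok)
```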

In [3]:
# Fit and transform the data
final_pipeline = CarsPipeline()

X_train_processed = final_pipeline.fit_transform(X_train)
X_test_processed = final_pipeline.transform(X_test)



In [4]:
final_pipeline_log = CarsPipeline()

X_train_processed_log = final_pipeline_log.fit_transform(X_train_log)
X_test_processed_log = final_pipeline_log.transform(X_test_log)



Let's try Ridge as a first simple model, using grid search over the `alpha` hyperparameter:
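
A minimal sketch of what a grid search over Ridge's `alpha` could look like. It runs on synthetic data from `make_regression` rather than on the processed cars data. The grid values and the MAE-based `scoring` choice are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the processed training data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

# Illustrative grid: alpha controls the strength of the L2 penalty.
alpha_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

ridge_search = GridSearchCV(
    estimator=Ridge(),
    param_grid=alpha_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence the sign
    cv=5,
)
ridge_search.fit(X_demo, y_demo)

print("Best alpha:", ridge_search.best_params_["alpha"])
print("Cross-validated MAE:", -ridge_search.best_score_)
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train_processed` and `y_train`. Note that Ridge is sensitive to feature scale, so it only makes sense if the pipeline already standardizes numeric columns.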

In [5]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

In [38]:
# Create the tree (default settings, no depth limit)
regression_original = DecisionTreeRegressor(criterion='squared_error', splitter='best', 
                                   max_depth=None, min_samples_split=2, min_samples_leaf=1, 
                                   random_state=42)
# And fit it
regression_original.fit(X_train_processed, y_train)

In [39]:
regression_original.get_params()

{'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': 42,
 'splitter': 'best'}

In [8]:
from sklearn.tree import export_graphviz
export_graphviz(regression_original, out_file = "arbol_regression.dot",
                feature_names=final_pipeline.final_columns(),
                rounded=True,
                filled=True)

In [40]:
from sklearn.metrics import mean_absolute_error

y_pred_train = regression_original.predict(X_train_processed)
y_pred = regression_original.predict(X_test_processed)

mae_train = mean_absolute_error(y_train, y_pred_train)
mae = mean_absolute_error(y_test, y_pred)

print(f"Training error: {mae_train}")
print(f"Test error: {mae}")

Training error: 3132.088631506009
Test error: 82609.24565633338


In [41]:
num_leaves = regression_original.tree_.n_leaves
print(f"Number of leaves in the tree: {num_leaves}")

Number of leaves in the tree: 4432


In [10]:
# Create the tree for the log-transformed target
regression_log = DecisionTreeRegressor(criterion='squared_error', splitter='best', 
                                   max_depth=None, min_samples_split=2, min_samples_leaf=1, 
                                   random_state=42)
# And fit it
regression_log.fit(X_train_processed_log, y_train_log)

In [11]:
from sklearn.tree import export_graphviz
export_graphviz(regression_log, out_file = "arbol_regression_log.dot",
                feature_names=final_pipeline_log.final_columns(),
                rounded=True,
                filled=True)

In [12]:
from sklearn.metrics import mean_absolute_error

y_pred_train_log = regression_log.predict(X_train_processed_log)
y_pred_log = regression_log.predict(X_test_processed_log)

y_train_inv = np.exp(y_train_log)
y_pred_train_inv = np.exp(y_pred_train_log)

y_test_inv = np.exp(y_test_log)
y_pred_inv = np.exp(y_pred_log)

mae_train = mean_absolute_error(y_train_inv, y_pred_train_inv)
mae = mean_absolute_error(y_test_inv, y_pred_inv)

print(f"Training error: {mae_train}")
print(f"Test error: {mae}")

Training error: 3135.7068671169136
Test error: 81767.63396613808
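
The cells above take the log of the target by hand and undo it with `np.exp` before computing the MAE. As a hedged alternative, scikit-learn's `TransformedTargetRegressor` can wrap both steps, so predictions come back on the original price scale. The sketch below uses synthetic, strictly positive targets. It is not the approach used in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a strictly positive, skewed target (illustrative only).
X_demo, y_raw = make_regression(
    n_samples=300, n_features=5, noise=0.1, random_state=42
)
y_demo = np.exp(y_raw / y_raw.std())

# The wrapper fits the tree on log(y) and applies exp() inside predict().
log_tree = TransformedTargetRegressor(
    regressor=DecisionTreeRegressor(random_state=42),
    func=np.log,
    inverse_func=np.exp,
)
log_tree.fit(X_demo, y_demo)

# Predictions are already on the original scale, so MAE needs no manual exp().
pred = log_tree.predict(X_demo)
mae_demo = mean_absolute_error(y_demo, pred)
print(f"Training MAE on the original scale: {mae_demo}")
```

The design benefit is that the inverse transform cannot be forgotten when evaluating, and the wrapped estimator can be passed directly to cross-validation utilities.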


In [31]:
param_distributions = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2'],
}

In [32]:
from sklearn.model_selection import GridSearchCV

regression = DecisionTreeRegressor(random_state=42)
# Despite its name, random_search is an exhaustive GridSearchCV over param_distributions.
# No `scoring` is passed, so candidates are ranked by the estimator's default score (R^2).
random_search = GridSearchCV(
    estimator=regression,
    param_grid=param_distributions,
    cv=5,
    verbose=1,
    n_jobs=-1,
    error_score='raise'
)

In [33]:
random_search.fit(X_train_processed, y_train)

Fitting 5 folds for each of 972 candidates, totalling 4860 fits
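
The counts in this log line can be checked from the grid itself: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A small sketch using scikit-learn's `ParameterGrid`, with the grid copied from the cell above:

```python
from sklearn.model_selection import ParameterGrid

# Same search space as param_distributions in the notebook.
grid = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(grid))  # 3 * 2 * 6 * 3 * 3 * 3
n_fits = n_candidates * 5                # one fit per candidate per CV fold
print(n_candidates, n_fits)
```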


In [34]:
print("Best hyperparameters:", random_search.best_params_)

Best hyperparameters: {'criterion': 'absolute_error', 'max_depth': 30, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'random'}


In [35]:
best_model = random_search.best_estimator_
test_score = best_model.score(X_test_processed, y_test)  # R^2 on the test set
print("Test set score:", test_score)

Test set score: 0.9352104619586608


In [36]:
from sklearn.metrics import mean_absolute_error

# Get the best model found by the grid search (GridSearchCV)
best_model = random_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test_processed)

# Compute the MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.4f}")

Mean Absolute Error (MAE): 82399.5892


In [37]:
num_leaves = best_model.tree_.n_leaves
print(f"Number of leaves in the tree: {num_leaves}")

Number of leaves in the tree: 2183


In [24]:
# Create the tree (same default settings as regression_original)
regression = DecisionTreeRegressor(criterion='squared_error', splitter='best', 
                                   max_depth=None, min_samples_split=2, min_samples_leaf=1, 
                                   random_state=42)
# And fit it
regression.fit(X_train_processed, y_train)

y_pred_train = regression.predict(X_train_processed)
y_pred = regression.predict(X_test_processed)

mae_train = mean_absolute_error(y_train, y_pred_train)
mae = mean_absolute_error(y_test, y_pred)

# Search space for hyperparameter tuning (not used in this cell)
param_distributions = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2'],
}

print(f"Training error: {mae_train}")
print(f"Test error: {mae}")

Training error: 3132.088631506009
Test error: 82609.24565633338


- Ridge
- Regression tree
- SVR
- Boosting (there are 2)
- Random Forest

In [47]:
import optuna
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import make_scorer, mean_absolute_error

# Define the objective function
def objective(trial):
    # Hyperparameters that Optuna will optimize
    criterion = trial.suggest_categorical('criterion', ['squared_error', 'friedman_mse', 'absolute_error'])
    splitter = trial.suggest_categorical('splitter', ['best', 'random'])
    max_depth = trial.suggest_int('max_depth', 1, 100)  # Increase range
    min_samples_split = trial.suggest_int('min_samples_split', 2, 50)  # Increase range
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 50) 
    max_features = trial.suggest_categorical('max_features', [None, 'sqrt', 'log2'])

    # Build the model
    model = DecisionTreeRegressor(
        criterion=criterion,
        splitter=splitter,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=42
    )

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scorer = make_scorer(mean_absolute_error, greater_is_better=False)
    score = cross_val_score(model, X_train_processed, y_train, cv=kf, scoring=scorer)

    return -score.mean() 

def champion_callback(study, frozen_trial):
    """
    Print only when a trial improves on the best value; otherwise the output is too verbose.
    """

    winner = study.user_attrs.get("winner", None)

    if study.best_value and winner != study.best_value:
        study.set_user_attr("winner", study.best_value)
        if winner:
            improvement_percent = (abs(winner - study.best_value) / study.best_value) * 100
            print(
                f"Trial {frozen_trial.number} achieved value: {frozen_trial.value} with "
                f"{improvement_percent: .4f}% improvement"
            )
        else:
            print(f"Initial trial {frozen_trial.number} achieved value: {frozen_trial.value}")

optuna.logging.set_verbosity(optuna.logging.ERROR)

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=5000, callbacks=[champion_callback])

# Get the best hyperparameters
best_params = study.best_params
print(f"Best hyperparameters found: {best_params}")

# Train a model with the best hyperparameters
best_model = DecisionTreeRegressor(**best_params, random_state=42)
best_model.fit(X_train_processed, y_train)

# Predict on the test set
y_pred = best_model.predict(X_test_processed)

# Compute the MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE) del mejor modelo: {mae:.4f}")

Initial trial 0 achieved value: 131991.09783651488
Trial 4 achieved value: 102672.24560675555 with  28.5558% improvement
Trial 21 achieved value: 100084.61684847799 with  2.5854% improvement
Trial 22 achieved value: 95993.24306881276 with  4.2621% improvement
Trial 45 achieved value: 92807.96007274797 with  3.4321% improvement
Trial 50 achieved value: 89523.86389281756 with  3.6684% improvement
Trial 56 achieved value: 88813.0102103244 with  0.8004% improvement
Trial 60 achieved value: 88425.07624750174 with  0.4387% improvement
Trial 65 achieved value: 88195.4669795951 with  0.2603% improvement
Trial 71 achieved value: 85701.6983347322 with  2.9098% improvement
Trial 127 achieved value: 85134.89649441773 with  0.6658% improvement
Trial 301 achieved value: 84410.999840483 with  0.8576% improvement
Trial 377 achieved value: 84177.60434312848 with  0.2773% improvement
Trial 428 achieved value: 83307.06042795998 with  1.0450% improvement
Trial 611 achieved value: 82303.66996489698 with  1

In [43]:
num_leaves = best_model.tree_.n_leaves
print(f"Number of leaves in the tree: {num_leaves}")

Number of leaves in the tree: 1431
