Objective: The dataset winequality-red_procesados.csv (which went through feature engineering and data analysis in an earlier stage) contains red wines described by physicochemical measurements together with a wine quality score. Build several regression trees, using some hyperparameter search method.

In [1]:
# Import the required libraries
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Build the path to the data file
ruta = os.path.dirname(os.path.abspath('models'))
ruta_datos = os.path.join(ruta, "datasets/winequality-red_procesados.csv")

# Read the file into a DataFrame
wines = pd.read_csv(ruta_datos)

In [3]:
# Show the first records
wines.head()

Unnamed: 0,fixed acidity,volatile acidity,residual sugar,chlorides,total sulfur dioxide,sulphates,alcohol,quality
0,-0.52799,1.02275,-0.57726,-0.245629,-0.36175,-0.631289,-0.990401,5
1,-0.287516,2.067774,0.259784,0.717151,0.724247,0.290757,-0.610922,5
2,-0.287516,1.371091,-0.098949,0.454575,0.29643,0.060245,-0.610922,5
3,1.756519,-1.41564,-0.57726,-0.289392,0.493884,-0.477615,-0.610922,6
4,-0.52799,0.790522,-0.696838,-0.289392,-0.164296,-0.631289,-0.990401,5


In [4]:
# Import the function that splits the data into training and test sets
from sklearn.model_selection import train_test_split

In [5]:
# Separate the predictor variables from the target variable.
X = wines.loc[:, ["fixed acidity", "volatile acidity", "residual sugar", "chlorides", "total sulfur dioxide", "sulphates", "alcohol"]].values
y = wines.loc[:, "quality"].values

In [6]:
# Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify=y)
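The split above passes `stratify=y`, which works here because `quality` takes a small number of integer values. A minimal sketch of what that option does, using made-up labels (`y_demo` and `X_demo` are hypothetical and not part of the wine data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical quality labels: 50 fives, 30 sixes, 20 sevens
y_demo = np.array([5] * 50 + [6] * 30 + [7] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

def class_shares(labels):
    """Return the share of each distinct label, in sorted label order."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

# With stratification, both subsets keep roughly the same label proportions
print("train shares:", class_shares(y_tr))
print("test shares: ", class_shares(y_te))
```

Without `stratify`, rare quality scores could end up over- or under-represented in the test set purely by chance.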

In [7]:
# Standardize the features
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train) 
X_test = sc_X.transform(X_test)
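The scaler above is fitted on the training split only and then applied to the test split. One way to keep that discipline automatic, also inside cross-validation, is to bundle the scaler and the tree in a pipeline. The sketch below assumes synthetic data from `make_regression` as a stand-in for the wine features; `pipe`, `X_demo` and `y_demo` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with 7 feature columns, like X above
X_demo, y_demo = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)

# The scaler inside the pipeline is fitted only on the data passed to fit()
pipe = make_pipeline(
    StandardScaler(),
    DecisionTreeRegressor(min_samples_leaf=10, random_state=42),
)
pipe.fit(X_demo[:240], y_demo[:240])

# predict() applies the already-fitted scaler before calling the tree
preds = pipe.predict(X_demo[240:])
print(preds.shape)
```

Tree splits depend on the ordering of feature values rather than their scale, so for this model the scaling step is mostly a matter of consistency with other estimators.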

In [8]:
# Import the regression tree model class and the MAE metric used below
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

The first tree is created:

In [12]:
# Create the tree
regression_1 = DecisionTreeRegressor(criterion='squared_error', splitter='best',
                                     max_depth=None, min_samples_split=20, min_samples_leaf=10,
                                     random_state=42)
# Train it
regression_1.fit(X_train, y_train)

In [13]:
# MAE on the training and test sets
y_pred_train = regression_1.predict(X_train)
y_pred_test = regression_1.predict(X_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mae = mean_absolute_error(y_test, y_pred_test)

print(f"The training MAE was: {mae_train}")
print(f"The test MAE was: {mae}")

The training MAE was: 0.3763938636340669
The test MAE was: 0.5657314559896184
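The MAE reported above is the mean of the absolute differences between the true quality scores and the predictions, so it is expressed in the same units as `quality`. A small hand-checkable sketch with made-up numbers (`y_true` and `y_hat` are hypothetical, not taken from the wine data):

```python
import numpy as np

y_true = np.array([5, 6, 5, 7])          # hypothetical quality scores
y_hat = np.array([5.5, 6.0, 4.0, 6.0])   # hypothetical tree predictions

abs_errors = np.abs(y_true - y_hat)      # per-sample errors: 0.5, 0.0, 1.0, 1.0
mae_manual = abs_errors.mean()           # (0.5 + 0.0 + 1.0 + 1.0) / 4
print(mae_manual)  # 0.625
```

Read this way, a test MAE near 0.57 means the first tree is off by a bit more than half a quality point on average.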


A second tree is tried:

In [14]:
# Create the tree
regression_2 = DecisionTreeRegressor(criterion='squared_error', splitter='best',
                                     max_depth=None, min_samples_split=100, min_samples_leaf=50,
                                     random_state=42)
# Train it
regression_2.fit(X_train, y_train)

In [15]:
# MAE on the training and test sets
y_pred_train = regression_2.predict(X_train)
y_pred_test = regression_2.predict(X_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mae = mean_absolute_error(y_test, y_pred_test)

print(f"The training MAE was: {mae_train}")
print(f"The test MAE was: {mae}")

The training MAE was: 0.46881174554476834
The test MAE was: 0.538107043575094
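Raising `min_samples_leaf` from 10 to 50 forces larger leaves, which caps how many leaves the tree can grow and therefore how closely it can follow the training data. The sketch below reproduces that comparison on synthetic data from `make_regression`, since the wine CSV is not bundled here; the data and the `results` dictionary are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=7, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

results = {}
for leaf in (10, 50):
    # Same convention as the two trees above: min_samples_split = 2 * min_samples_leaf
    tree = DecisionTreeRegressor(min_samples_split=2 * leaf, min_samples_leaf=leaf, random_state=42)
    tree.fit(X_tr, y_tr)
    results[leaf] = {
        "n_leaves": tree.get_n_leaves(),
        "mae_train": mean_absolute_error(y_tr, tree.predict(X_tr)),
        "mae_test": mean_absolute_error(y_te, tree.predict(X_te)),
    }
    print(leaf, results[leaf])
```

A tree with at least 50 samples per leaf cannot have more than `n_train // 50` leaves, which is why its training error is typically higher while its train/test gap is smaller.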


A third tree is sought, using GridSearchCV to find the best hyperparameters:

In [16]:
from sklearn.model_selection import GridSearchCV

# Create the decision tree model
tree_regressor = DecisionTreeRegressor()

# Define the hyperparameter grid. Each range starts at the smallest value the
# tree accepts (max_depth >= 1, min_samples_split >= 2, min_samples_leaf >= 1);
# combinations with smaller values fail to fit and are scored as nan.
param_grid = [
    {"criterion": ["squared_error"],
     "splitter": ["best"],
     "max_depth": list(range(1, 10)),
     "min_samples_split": list(range(2, 20)),
     "min_samples_leaf": list(range(1, 10)),
     "max_features": [None],
     "ccp_alpha": [x * 0.01 for x in range(0, 50)]}
]

# Configure GridSearchCV
grid_search = GridSearchCV(tree_regressor, param_grid, cv=5, scoring='neg_mean_absolute_error', refit=True)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found:")
print(grid_search.best_params_)

# Keep the best model for prediction
best_model = grid_search.best_estimator_

Best parameters found:
{'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 8, 'max_features': None, 'min_samples_leaf': 9, 'min_samples_split': 19, 'splitter': 'best'}
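The lower bound of each range in the grid matters because the tree rejects integer values below it: 0 for `max_depth` and `min_samples_leaf`, and 0 or 1 for `min_samples_split`. By default `GridSearchCV` does not stop on such combinations; it scores them as nan and emits a warning. The sketch below checks those values directly on synthetic data. It assumes a recent scikit-learn release, where parameter validation errors derive from `ValueError`; older releases may accept some of these values.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=100, n_features=7, random_state=0)

# Integer values just below the valid minimum of each hyperparameter
invalid_settings = [
    {"max_depth": 0},
    {"min_samples_split": 0},
    {"min_samples_split": 1},
    {"min_samples_leaf": 0},
]

rejected = []
for params in invalid_settings:
    try:
        DecisionTreeRegressor(**params).fit(X_demo, y_demo)
    except ValueError as err:  # raised by the estimator's parameter validation
        rejected.append(params)
        print(params, "->", type(err).__name__)
```

Passing `error_score='raise'` to `GridSearchCV` turns the first such failure into an exception, which makes a badly specified grid easier to notice.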




In [17]:
# MAE on the training and test sets
y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mae = mean_absolute_error(y_test, y_pred_test)

print(f"The training MAE was: {mae_train}")
print(f"The test MAE was: {mae}")

The training MAE was: 0.38508626503452775
The test MAE was: 0.5725873826251745
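Because the search uses `scoring='neg_mean_absolute_error'`, the cross-validated score stored in `best_score_` is a negated MAE: higher is better, and flipping the sign gives a value comparable to the MAEs printed above. A minimal sketch on synthetic data, with a deliberately small grid (`small_grid` and `search` are illustrative names):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=7, noise=15.0, random_state=0)

small_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [5, 10, 20]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    small_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MAE across the 5 folds for the best combination
cv_mae = -search.best_score_
print(search.best_params_, cv_mae)
```

Comparing this cross-validated MAE with the test MAE gives a rough check on whether the selected hyperparameters generalize beyond the folds they were chosen on.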


| Tree                      |MAE Train|MAE Test|
|---------------------------|---------|--------|
| 10 observations per leaf  |   0.38  |  0.57  |
| 50 observations per leaf  |   0.47  |  0.54  |
| Hyperparameter grid       |   0.39  |  0.57  |
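One way to read the table is through the gap between test and training MAE, which hints at how much each tree overfits. The sketch below simply re-enters the rounded values from the table into a DataFrame; the column names and the `gap` column are introduced here for illustration.

```python
import pandas as pd

summary = pd.DataFrame(
    {
        "tree": ["10 observations per leaf", "50 observations per leaf", "Hyperparameter grid"],
        "mae_train": [0.38, 0.47, 0.39],
        "mae_test": [0.57, 0.54, 0.57],
    }
)

# Difference between test and training error, rounded to the table's precision
summary["gap"] = (summary["mae_test"] - summary["mae_train"]).round(2)
print(summary.sort_values("gap"))
```

On these numbers the tree with 50 observations per leaf has both the lowest test MAE and the smallest gap, while the grid-search tree behaves much like the first tree.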