This Colab notebook was developed by Arnold Charry Armero.

# Extremely Randomized Trees (Regression)

The Extremely Randomized Trees (Extra Trees) algorithm is an ensemble method built from many decision trees. It works much like Random Forest: both combine many trees that consider random subsets of features, which improves the model's generalization and reduces its variance.

The key difference lies in how much randomness is introduced while the trees are built:

*   In Random Forest, each node is split by searching for the optimal cut point (the threshold that maximizes the impurity reduction) within a random subset of features.
*   In Extra Trees, in addition to selecting the features at random, the cut points are also drawn at random instead of searching for the optimal threshold.

For each selected feature, the algorithm draws a single random cut point (for example, a value between the minimum and maximum of that variable in the current node). It then evaluates each of those random cuts by computing the impurity reduction (for example, with the Gini index or entropy in classification, or the variance in regression) and chooses the one with the largest gain.
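
The split rule described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the helper names `variance_reduction` and `pick_random_split`, the synthetic data, and the choice of three candidate features are all assumptions made for the example.

```python
import numpy as np

def variance_reduction(y, left_mask):
    """Decrease in variance obtained by splitting y into left/right groups."""
    n = len(y)
    n_left = left_mask.sum()
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty is useless
    weighted = (n_left * y[left_mask].var() + n_right * y[~left_mask].var()) / n
    return y.var() - weighted

def pick_random_split(X, y, n_candidate_features, rng):
    """Draw one uniform random threshold per candidate feature and keep the best."""
    features = rng.choice(X.shape[1], size=n_candidate_features, replace=False)
    best = None
    for j in features:
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            continue  # constant feature in this node, cannot be split
        threshold = rng.uniform(lo, hi)  # random cut, no search for the optimum
        gain = variance_reduction(y, X[:, j] <= threshold)
        if best is None or gain > best[2]:
            best = (int(j), float(threshold), float(gain))
    return best

# Synthetic data (hypothetical): only feature 2 drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

feature, threshold, gain = pick_random_split(X_demo, y_demo, n_candidate_features=3, rng=rng)
print(feature, round(threshold, 3), round(gain, 3))
```

A Random Forest node would instead scan the candidate thresholds of each feature and keep the best one, which is the extra work that Extra Trees skips.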

This approach typically makes Extra Trees faster to train than Random Forest, because it avoids the exhaustive search for the best threshold. The additional randomness also decorrelates the trees, which tends to reduce the variance of the ensemble and the risk of overfitting, at the cost of a possible slight increase in bias.

Finally, the ensemble prediction is obtained by combining the results of all the trees:

*   In classification, a majority vote among the trees is used.
*   In regression, the individual predictions are averaged (or, in some variants, combined with a weighted average).
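
The two aggregation rules can be illustrated with made-up per-tree outputs. The arrays below are hypothetical and only show the arithmetic. As a side note, scikit-learn's forest classifiers average the predicted class probabilities of the trees, which is a soft variant of the hard vote shown here.

```python
import numpy as np

# Hypothetical per-tree outputs: rows are trees, columns are samples
tree_regression_preds = np.array([
    [20.0, 31.0, 15.0],
    [22.0, 29.0, 14.0],
    [21.0, 30.0, 16.0],
])
# Regression: average the trees' predictions for each sample
ensemble_regression = tree_regression_preds.mean(axis=0)
print(ensemble_regression)

tree_class_votes = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])
# Classification: most frequent label per sample (column)
ensemble_class = np.array([np.bincount(col).argmax() for col in tree_class_votes.T])
print(ensemble_class)
```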

In short, Extra Trees strikes a balance between speed, simplicity, and generalization ability, thanks to the strong randomness in the selection of both features and cut points.

The implementation in code follows.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score
from pandas.api.types import is_numeric_dtype, is_object_dtype, is_string_dtype
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Machine Learning/Bases de Datos/auto-mpg.csv')

In [None]:
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [None]:
# Make sure the full content of the DataFrame is displayed
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [None]:
df.drop('car name', axis = 1, inplace = True)
df.head(1000)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
0,18.0,8,307.0,130,3504,12.0,70,1
1,15.0,8,350.0,165,3693,11.5,70,1
2,18.0,8,318.0,150,3436,11.0,70,1
3,16.0,8,304.0,150,3433,12.0,70,1
4,17.0,8,302.0,140,3449,10.5,70,1
5,15.0,8,429.0,198,4341,10.0,70,1
6,14.0,8,454.0,220,4354,9.0,70,1
7,14.0,8,440.0,215,4312,8.5,70,1
8,14.0,8,455.0,225,4425,10.0,70,1
9,15.0,8,390.0,190,3850,8.5,70,1


In [None]:
# Number of missing values per column (only columns that have any)
df.isnull().sum().iloc[np.where(df.isnull().sum() != 0)[0]]

Unnamed: 0,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
dtypes: float64(3), int64(4), object(1)
memory usage: 25.0+ KB


In [None]:
# Replace the "?" placeholders with NaN
df.replace("?", np.nan, inplace=True)

# Convert every column that can be parsed to a numeric dtype
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except ValueError:
        # If the conversion fails, leave the column as is (probably categorical)
        pass

# Fill NaN values with the column mean in numeric columns
df = df.apply(lambda col: col.fillna(col.mean()) if np.issubdtype(col.dtype, np.number) else col)
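
As a compact alternative for a single column, `pd.to_numeric` accepts `errors="coerce"`, which turns unparseable entries into NaN in one step. The sketch below uses a hypothetical three-row frame, not the real dataset. In a stricter setup the mean would be computed on the training split only, so that no information from the test rows enters the imputation.

```python
import pandas as pd

# Hypothetical miniature of the horsepower column, with "?" marking a missing value
demo = pd.DataFrame({"horsepower": ["130", "?", "150"], "weight": [3504, 3693, 3436]})

# errors="coerce" converts anything that cannot be parsed (here "?") into NaN
demo["horsepower"] = pd.to_numeric(demo["horsepower"], errors="coerce")

# Fill the gap with the column mean, as in the cell above
demo["horsepower"] = demo["horsepower"].fillna(demo["horsepower"].mean())
print(demo)
```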

In [None]:
df.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
count,398.0,398.0,398.0,398.0,398.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.425879,104.469388,2970.424623,15.56809,76.01005,1.572864
std,7.815984,1.701004,104.269838,38.199187,846.841774,2.757689,3.697627,0.802055
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,1.0
25%,17.5,4.0,104.25,76.0,2223.75,13.825,73.0,1.0
50%,23.0,4.0,148.5,95.0,2803.5,15.5,76.0,1.0
75%,29.0,8.0,262.0,125.0,3608.0,17.175,79.0,2.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,3.0


## Data Preprocessing

In [None]:
# Build the feature matrix X and the target vector y (mpg)
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values

In [None]:
# Detect the categorical columns among the features (column 0, mpg, is the target)
feature_cols = df.columns[1:]
cat_cols = df[feature_cols].select_dtypes(include=['object', 'category']).columns
# Positions must be relative to X, which no longer contains the target column
cat_indices = [feature_cols.get_loc(col) for col in cat_cols]

# Detect the numerical columns
num_indices = [i for i in range(len(feature_cols)) if i not in cat_indices]

# Create the transformer: one-hot encode categorical columns, pass the rest through
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first', sparse_output=False, dtype=int), cat_indices)],
    remainder='passthrough')

## Train/Test Split

In [None]:
# Split the dataset into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

## Data Transformation

In [None]:
# Apply the column transformer (one-hot encoding of categorical columns); no scaling is applied, since tree-based models do not require it
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

In [None]:
# Display the X_train array
np.set_printoptions(threshold=1000)
np.set_printoptions(suppress=True, precision=2, linewidth=200)
print(X_train)

[[  8.  350.  170.  ...  11.4  77.    1. ]
 [  4.  119.  100.  ...  14.8  81.    3. ]
 [  8.  304.  150.  ...  15.5  74.    1. ]
 ...
 [  4.   68.   49.  ...  19.5  73.    2. ]
 [  6.  250.  100.  ...  15.   71.    1. ]
 [  4.   90.   71.  ...  16.5  75.    2. ]]


## Model Training

Now we fit the Extra Trees model:

In [None]:
ExtraTrees_model = ExtraTreesRegressor(n_estimators = 100, max_features = "sqrt", random_state = 0)
ExtraTrees_model.fit(X_train, y_train)

Making a single prediction:

In [None]:
# The values must be given in the original feature order
print("Prediction:", ExtraTrees_model.predict(ct.transform([[4, 97.0, 165, 3693, 10.5, 70, 3]]))[0])

Prediction: 21.682000000000002


In [None]:
# Get the predictions on the test set
y_pred = ExtraTrees_model.predict(X_test)
print(y_pred.reshape(len(y_pred),1))

[[13.9 ]
 [24.83]
 [14.02]
 [21.62]
 [18.09]
 [31.13]
 [35.75]
 [22.93]
 [15.06]
 [25.71]
 [32.4 ]
 [37.37]
 [19.56]
 [31.84]
 [15.67]
 [32.66]
 [27.77]
 [26.15]
 [17.82]
 [32.49]
 [15.09]
 [24.22]
 [23.98]
 [20.66]
 [32.59]
 [27.01]
 [33.16]
 [29.21]
 [29.43]
 [15.93]
 [19.33]
 [29.69]
 [16.27]
 [32.85]
 [20.74]
 [24.65]
 [19.1 ]
 [15.99]
 [33.05]
 [12.44]
 [13.34]
 [14.91]
 [28.  ]
 [27.8 ]
 [29.77]
 [21.97]
 [20.2 ]
 [14.31]
 [22.26]
 [30.5 ]
 [33.16]
 [25.91]
 [16.02]
 [27.61]
 [14.61]
 [11.1 ]
 [19.06]
 [23.47]
 [30.44]
 [17.33]
 [18.37]
 [26.68]
 [18.84]
 [19.92]
 [12.87]
 [14.4 ]
 [13.14]
 [17.99]
 [24.89]
 [13.48]
 [35.81]
 [12.68]
 [23.54]
 [19.1 ]
 [24.66]
 [30.28]
 [29.42]
 [32.18]
 [29.62]
 [14.16]
 [14.06]
 [28.87]
 [32.72]
 [29.45]
 [31.19]
 [35.61]
 [28.73]
 [18.99]
 [30.06]
 [34.75]
 [28.61]
 [13.24]
 [21.91]
 [32.88]
 [28.34]
 [18.02]
 [19.16]
 [26.75]
 [21.45]
 [13.23]
 [21.54]
 [35.9 ]
 [25.76]
 [23.68]
 [37.89]
 [25.07]
 [25.21]
 [14.27]
 [14.36]
 [22.54]
 [25.18]
 

## Model Performance

In [None]:
# Model KPIs: MAE and RMSE are reported as a percentage of the mean of y_test
MAE = mean_absolute_error(y_test, y_pred)
print('MAE: {:0.2f}%'.format(MAE / np.mean(y_test) * 100))
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE: {:0.2f}%'.format(RMSE / np.mean(y_test) * 100))
r2 = r2_score(y_test, y_pred)
print('R2: {:0.2f}'.format(r2))

MAE: 8.00%
RMSE: 11.45%
R2: 0.88
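
For reference, the three metrics can be reproduced by hand with NumPy. The numbers below are hypothetical targets and predictions chosen for easy arithmetic, not values from the notebook.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only
y_true = np.array([20.0, 30.0, 25.0, 15.0])
y_hat = np.array([22.0, 27.0, 25.0, 16.0])

errors = y_true - y_hat

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(errors))

# Root mean squared error: penalizes large errors more heavily
rmse = np.sqrt(np.mean(errors ** 2))

# R2: one minus the ratio of residual to total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Same normalization as above: errors as a percentage of the mean target
print('MAE: {:0.2f}%'.format(mae / y_true.mean() * 100))
print('RMSE: {:0.2f}%'.format(rmse / y_true.mean() * 100))
print('R2: {:0.2f}'.format(r2))
```

Dividing MAE and RMSE by the mean of the target makes them unit-free, which is why they are printed with a percent sign.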


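## Cross-Validation and Grid Search

The root mean squared error is estimated with 10-fold cross-validation on the training set, so that it does not depend on a single train/validation split:

In [None]:
# Apply K-fold cross-validation; the scores are negated MSE values, so the sign is flipped before taking the square root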
scores = cross_val_score(estimator = ExtraTrees_model, X = X_train, y = y_train, cv = 10, scoring = 'neg_mean_squared_error')
print(np.sqrt(-scores.mean()))

2.8855884006082477
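
Note that the value printed above is the square root of the mean MSE across folds, which is not the same as the mean of the per-fold RMSE values. The sketch below shows the difference on three hypothetical fold scores.

```python
import numpy as np

# Hypothetical negated MSE values, in the format returned with scoring='neg_mean_squared_error'
neg_mse_scores = np.array([-4.0, -9.0, -16.0])

# Square root of the mean MSE (the quantity printed in the cell above)
rmse_of_mean_mse = np.sqrt(-neg_mse_scores.mean())

# Mean of the per-fold RMSE values: (2 + 3 + 4) / 3
mean_of_fold_rmse = np.sqrt(-neg_mse_scores).mean()

print(rmse_of_mean_mse, mean_of_fold_rmse)
```

The first quantity is never smaller than the second, so the two should not be compared directly across reports.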


The available ExtraTreesRegressor parameters are listed to decide which ones to tune in the grid search.

In [None]:
ExtraTrees_model = ExtraTreesRegressor(random_state = 0)

In [None]:
# Parameters
ExtraTrees_model.get_params().keys()

dict_keys(['bootstrap', 'ccp_alpha', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'monotonic_cst', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

In [None]:
# Define the parameter values to try
parameters = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2', 0.33]}

In [None]:
# Set up GridSearchCV
full_cv_classifier = GridSearchCV(estimator = ExtraTrees_model,
                                  param_grid = parameters,
                                  cv = 10,
                                  scoring = 'neg_mean_squared_error',
                                  n_jobs = -1,
                                  verbose = 2)

In [None]:
# Fit the grid search
full_cv_classifier.fit(X_train, y_train)

Fitting 10 folds for each of 108 candidates, totalling 1080 fits
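
The counts in the log line above follow directly from the grid: every combination of parameter values is a candidate, and each candidate is fitted once per fold. A quick check with the standard library (the dictionary is copied from the grid above under the hypothetical name `parameters_demo`):

```python
from itertools import product

# Same grid as above, copied under a separate name for the example
parameters_demo = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2', 0.33]}

# Every combination of values is one candidate: 3 * 3 * 2 * 2 * 3
n_candidates = len(list(product(*parameters_demo.values())))

# Each candidate is fitted once per fold (cv = 10)
n_fits = n_candidates * 10
print(n_candidates, n_fits)  # 108 1080
```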


In [None]:
print(full_cv_classifier.best_params_)

{'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}


In [None]:
# Best cross-validated score, converted from negated MSE to RMSE
print(np.sqrt(-full_cv_classifier.best_score_))

2.8330289094688967


The model is now retrained with the best parameters found by the grid search.

In [None]:
# No random_state is set here, so the results vary slightly between runs
model = ExtraTreesRegressor(n_estimators = 300, max_depth = None, max_features = 'sqrt', min_samples_leaf = 1,
                              min_samples_split = 2)
model.fit(X_train,y_train)
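
Retyping the parameters by hand is not strictly necessary: with its default `refit=True`, `GridSearchCV` refits the best configuration on the full training data and exposes it as `best_estimator_`. The sketch below shows this on synthetic data; the dataset, the small grid, and the names `search` and `best_model` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical), kept small so the search runs quickly
X_demo, y_demo = make_regression(n_samples=80, n_features=5, noise=5.0, random_state=0)

search = GridSearchCV(
    estimator=ExtraTreesRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_features": ["sqrt", 1.0]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# The refitted model carries the winning parameters and the search's random_state
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(X_demo[:2]))
```

Because the refitted estimator keeps the `random_state` used during the search, its predictions are reproducible across runs.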

In [None]:
# Get the predictions on the test set
y_pred = model.predict(X_test)
print(y_pred.reshape(len(y_pred),1))

[[13.85]
 [24.51]
 [13.79]
 [21.88]
 [18.01]
 [31.26]
 [35.46]
 [23.03]
 [14.98]
 [25.65]
 [32.02]
 [37.33]
 [19.32]
 [31.48]
 [16.1 ]
 [32.18]
 [27.52]
 [26.38]
 [17.86]
 [32.05]
 [15.21]
 [24.18]
 [24.56]
 [20.34]
 [32.95]
 [27.17]
 [33.03]
 [29.33]
 [29.36]
 [16.13]
 [19.13]
 [29.52]
 [16.26]
 [32.67]
 [20.86]
 [24.93]
 [18.9 ]
 [15.99]
 [32.15]
 [12.21]
 [13.38]
 [14.86]
 [28.04]
 [26.68]
 [29.65]
 [21.96]
 [20.41]
 [14.41]
 [21.28]
 [30.62]
 [32.39]
 [26.06]
 [16.  ]
 [27.27]
 [14.78]
 [11.2 ]
 [19.38]
 [23.23]
 [30.72]
 [17.16]
 [17.92]
 [26.3 ]
 [19.12]
 [19.73]
 [12.93]
 [14.54]
 [13.04]
 [18.13]
 [24.9 ]
 [13.5 ]
 [35.16]
 [12.53]
 [23.91]
 [18.92]
 [24.52]
 [31.05]
 [29.51]
 [32.18]
 [30.21]
 [14.09]
 [13.94]
 [28.67]
 [32.32]
 [30.23]
 [31.44]
 [35.24]
 [28.5 ]
 [19.85]
 [29.21]
 [34.65]
 [27.99]
 [13.  ]
 [21.45]
 [33.13]
 [28.79]
 [17.99]
 [19.53]
 [27.  ]
 [21.82]
 [13.27]
 [21.33]
 [36.89]
 [25.45]
 [23.84]
 [38.02]
 [25.27]
 [25.71]
 [14.16]
 [14.38]
 [22.17]
 [25.08]
 

In [None]:
# Error metrics of the tuned model (RMSE and MAE as a percentage of the mean of y_test)
print('RMSE: {:0.2f}%'.format(np.sqrt(mean_squared_error(y_test, y_pred)) / np.mean(y_test) * 100))
print('MAE: {:0.2f}%'.format(mean_absolute_error(y_test, y_pred) / np.mean(y_test) * 100))
print('R2: {:0.2f}%'.format(r2_score(y_test, y_pred) * 100))

RMSE: 11.58%
MAE: 8.02%
R2: 88.06%


## References

*   Auto MPG Dataset. (2021, December 25). Kaggle. https://www.kaggle.com/datasets/yasserh/auto-mpg-dataset
*   Jacinto, V. R. (2024). Machine learning: Fundamentos, algoritmos y aplicaciones para los negocios, industria y finanzas. Ediciones Díaz de Santos.
*   James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R. https://link.springer.com/content/pdf/10.1007/978-1-0716-1418-1.pdf
*   Vandeput, N. (2021). Data science for supply chain forecasting. de Gruyter.