This Colab notebook was developed by Arnold Charry Armero.

# Random Forests (Regression)

Random Forests are an ensemble algorithm built from multiple Decision Trees. They rely on Bagging (Bootstrap Aggregating), which generates different training samples by resampling with replacement (Bootstrap). Each tree in the forest is trained on one of these samples, which introduces diversity among the models and helps reduce variance (Velasco Rebolledo, 2024). In addition to bagging, Random Forests add a second source of randomness: at each split of a tree, only a random subset of the features is considered rather than all available variables. This increases the diversity among trees and prevents them all from focusing on the same dominant variables, which improves the model's ability to generalize. According to empirical studies, good subset sizes are $ \sqrt{m} $, $ \frac{m}{3} $ or $ \log_{2}(m+1) $, where $m$ is the total number of features; the best size depends on the data set and can be selected with Grid Search (James, Witten, Hastie, & Tibshirani, 2021).
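
As a rough illustration of the two sources of randomness described above, the following sketch draws one bootstrap sample with NumPy and computes the three candidate subset sizes. The array `data`, the seed, the value `m = 30`, and the choice to round down are illustrative assumptions, not values taken from this notebook.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical training set with n = 8 observations (row identifiers only)
data = np.arange(8)

# Bootstrap: draw n row indices with replacement from the original n rows
bootstrap_idx = rng.integers(0, len(data), size=len(data))
bootstrap_sample = data[bootstrap_idx]

# Rows that were never drawn are "out-of-bag" for this particular tree
out_of_bag = np.setdiff1d(data, bootstrap_sample)

# Candidate sizes for the random feature subset tried at each split
m = 30  # hypothetical number of features
subset_sizes = {
    "sqrt(m)": max(1, int(math.sqrt(m))),        # rounding down is an assumption
    "m/3": max(1, m // 3),
    "log2(m+1)": max(1, int(math.log2(m + 1))),
}

print("Bootstrap sample:", bootstrap_sample)
print("Out-of-bag rows:", out_of_bag)
print("Subset sizes:", subset_sizes)
```

Each tree would receive its own bootstrap sample, so the out-of-bag rows differ from tree to tree.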

In regression, each tree produces a numeric prediction for a new observation. The final Random Forest output is the average of all the individual predictions. Mathematically, with $B$ trees and $\hat{f}^{*b}(x)$ denoting the prediction of the tree trained on the $b$-th bootstrap sample, it is defined as follows,

$$ \hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x) $$
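
The averaging in the formula can be checked against scikit-learn directly. The sketch below uses synthetic data from `make_regression` (an assumption made only for illustration, not the taxi data set) and compares the mean of the individual tree predictions with the output of the forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to illustrate the averaging step
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

x_new = X_demo[:3]  # three observations to predict

# Prediction of every individual tree: array of shape (B, 3)
tree_preds = np.array([tree.predict(x_new) for tree in forest.estimators_])

# Averaging over the B trees should reproduce the forest prediction
manual_average = tree_preds.mean(axis=0)
matches = np.allclose(manual_average, forest.predict(x_new))
print("Manual average matches forest.predict:", matches)
```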

A major advantage of this method is that it reduces variance, and increasing the number of trees $B$ does not by itself lead to overfitting. However, beyond a certain number of trees the error stabilizes, so additional trees increase training and prediction cost without adding value to the prediction.
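
To see how the error behaves as trees are added, one option is to track the out-of-bag (OOB) score for a few forest sizes. This is a minimal sketch on synthetic data; the data set, the tree counts, and the seed are assumptions chosen for illustration. On data like this the OOB score typically levels off once the forest is reasonably large.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem, used only for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

oob_r2 = {}
for n_trees in (25, 100, 400):
    forest = RandomForestRegressor(n_estimators=n_trees, oob_score=True, random_state=0)
    forest.fit(X_demo, y_demo)
    # oob_score_ is the R2 computed on the rows each tree did not see
    oob_r2[n_trees] = forest.oob_score_

for n_trees, score in oob_r2.items():
    print(f"{n_trees:>4} trees -> OOB R2 = {score:.3f}")
```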

The implementation in code follows,

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from pandas.api.types import is_numeric_dtype, is_object_dtype, is_string_dtype
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Machine Learning/Bases de Datos/taxi_trip_pricing.csv')

In [None]:
df.head(1000)

Unnamed: 0,Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
0,19.35,Morning,Weekday,3.0,Low,Clear,3.56,0.80,0.32,53.82,36.2624
1,47.59,Afternoon,Weekday,1.0,High,Clear,,0.62,0.43,40.57,
2,36.87,Evening,Weekend,1.0,High,Clear,2.70,1.21,0.15,37.27,52.9032
3,30.33,Evening,Weekday,4.0,Low,,3.48,0.51,0.15,116.81,36.4698
4,,Evening,Weekday,3.0,High,Clear,2.93,0.63,0.32,22.64,15.6180
...,...,...,...,...,...,...,...,...,...,...,...
995,5.49,Afternoon,Weekend,4.0,Medium,Clear,2.39,0.62,0.49,58.39,34.4049
996,45.95,Night,Weekday,4.0,Medium,Clear,3.12,0.61,,61.96,62.1295
997,7.70,Morning,Weekday,3.0,Low,Rain,2.08,1.78,,54.18,33.1236
998,47.56,Morning,Weekday,1.0,Low,Clear,2.67,0.82,0.17,114.94,61.2090


In [None]:
# Number of missing values per column (only columns with at least one)
df.isnull().sum().iloc[np.where(df.isnull().sum() != 0)[0]]

Unnamed: 0,0
Trip_Distance_km,50
Time_of_Day,50
Day_of_Week,50
Passenger_Count,50
Traffic_Conditions,50
Weather,50
Base_Fare,50
Per_Km_Rate,50
Per_Minute_Rate,50
Trip_Duration_Minutes,50


In [None]:
# Check the number of rows and columns in df
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

Rows: 1000
Columns: 11


In [None]:
# Handling of missing values: mean for continuous columns and the target,
# most frequent value for binary and categorical columns
target_col = df.columns[-1]

for col in df.columns:
    if df[col].isnull().any():
        if col == target_col:
            imputer = SimpleImputer(strategy='mean')
        elif is_numeric_dtype(df[col]):
            valores_unicos = df[col].dropna().unique()
            if set(valores_unicos).issubset({0, 1}) and len(valores_unicos) <= 2:
                imputer = SimpleImputer(strategy='most_frequent')
            else:
                imputer = SimpleImputer(strategy='mean')
        elif is_object_dtype(df[col]) or is_string_dtype(df[col]):
            imputer = SimpleImputer(strategy='most_frequent')
        else:
            continue

        # Impute and overwrite the column
        df[col] = imputer.fit_transform(df[[col]]).ravel()
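
The two imputation strategies used in the loop above can be seen on a small example. This is a sketch with a hypothetical four-row frame; the column names are borrowed from the data set, but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one categorical column (hypothetical values)
toy = pd.DataFrame({
    "Base_Fare": [2.0, np.nan, 4.0, 3.0],
    "Weather": pd.Series(["Clear", np.nan, "Rain", "Clear"], dtype=object),
})

# Numeric column: the missing entry becomes the mean of the observed values
mean_imputer = SimpleImputer(strategy="mean")
toy["Base_Fare"] = mean_imputer.fit_transform(toy[["Base_Fare"]]).ravel()

# Categorical column: the missing entry becomes the most frequent category
mode_imputer = SimpleImputer(strategy="most_frequent")
toy["Weather"] = mode_imputer.fit_transform(toy[["Weather"]]).ravel()

print(toy)
```

Fitting the imputer on the full data set, as in the notebook, lets rows that later end up in the test set influence the fill values; fitting on the training split only is a common alternative.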

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Trip_Distance_km       1000 non-null   float64
 1   Time_of_Day            1000 non-null   object 
 2   Day_of_Week            1000 non-null   object 
 3   Passenger_Count        1000 non-null   float64
 4   Traffic_Conditions     1000 non-null   object 
 5   Weather                1000 non-null   object 
 6   Base_Fare              1000 non-null   float64
 7   Per_Km_Rate            1000 non-null   float64
 8   Per_Minute_Rate        1000 non-null   float64
 9   Trip_Duration_Minutes  1000 non-null   float64
 10  Trip_Price             1000 non-null   float64
dtypes: float64(7), object(4)
memory usage: 86.1+ KB


In [None]:
df.describe()

Unnamed: 0,Trip_Distance_km,Passenger_Count,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,27.070547,2.476842,3.502989,1.233316,0.292916,62.118116,56.874773
std,19.400775,1.074311,0.848107,0.418922,0.112662,31.339413,39.46481
min,1.23,1.0,2.01,0.5,0.1,5.01,6.1269
25%,13.1075,2.0,2.77,0.87,0.1975,37.1075,34.57885
50%,26.995,2.476842,3.502989,1.233316,0.292916,62.118116,52.617
75%,37.7825,3.0,4.2025,1.58,0.3825,87.775,67.47665
max,146.067047,4.0,5.0,2.0,0.5,119.84,332.043689


## Data Preprocessing

In [None]:
# Split the DataFrame into the feature matrix X and the target vector y
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [None]:
# Detect the categorical columns
cat_cols = df.select_dtypes(include=['object', 'category']).columns
cat_indices = [df.columns.get_loc(col) for col in cat_cols]

# Detect the numeric columns (all remaining features, excluding the target)
num_indices = [i for i in range(df.shape[1] - 1) if i not in cat_indices]

# Create the transformer: one-hot encode the categorical columns, pass the rest through
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first', sparse_output=False, dtype=int), cat_indices)],
    remainder='passthrough')

## Train/Test Split

In [None]:
# Split the data set: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

## Feature Encoding

In [None]:
# One-hot encode the categorical columns; numeric columns pass through unscaled
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

In [None]:
# Display the X_train array
print(X_train)

[[0 1 0 ... 0.68 0.37 51.92]
 [0 0 0 ... 1.08 0.43 56.54]
 [0 0 0 ... 0.56 0.38 43.27]
 ...
 [0 0 0 ... 1.92 0.19 114.32]
 [0 0 1 ... 1.87 0.32 28.13]
 [0 1 0 ... 0.84 0.22 85.74]]


## Model Training

The Random Forest model is trained next,

In [None]:
Random_Forest_model = RandomForestRegressor(n_estimators = 100, max_features = "sqrt", random_state = 0)
Random_Forest_model.fit(X_train, y_train)

Making a prediction,

In [None]:
# The values must be given in the original column order
print("Prediction:", Random_Forest_model.predict(ct.transform([[30.330000, "Evening", "Weekday", 4.0, "Low",
                                                              "Rain", 2.080000, 1.780000, 0.292916, 54.18]]))[0])

Prediction: 64.63391433448572


In [None]:
# Get the predictions for the test set
y_pred = Random_Forest_model.predict(X_test)
print(y_pred.reshape(len(y_pred),1))

[[ 54.78832713]
 [ 41.46059787]
 [ 41.37084812]
 [ 58.236235  ]
 [ 29.82714747]
 [ 52.02981167]
 [189.56206227]
 [ 44.87588156]
 [ 57.72364993]
 [ 40.10202073]
 [ 50.3485914 ]
 [ 73.24196473]
 [ 45.93330713]
 [ 39.9681974 ]
 [ 58.5513736 ]
 [ 82.0466202 ]
 [ 53.65088967]
 [ 50.88234954]
 [ 40.51155867]
 [ 37.86265865]
 [ 31.39038947]
 [181.32615928]
 [ 50.48769029]
 [ 34.61015204]
 [ 41.82728617]
 [ 48.24415187]
 [ 39.37776093]
 [ 60.35819709]
 [ 31.54044447]
 [ 54.60735347]
 [ 52.50332313]
 [ 65.6755017 ]
 [ 63.70464273]
 [ 53.5726034 ]
 [ 74.85702447]
 [ 48.23835573]
 [ 43.37997842]
 [ 48.8745042 ]
 [ 53.605948  ]
 [176.050287  ]
 [ 65.20018244]
 [ 51.8041612 ]
 [ 44.61052487]
 [ 47.25302976]
 [ 54.11462835]
 [ 39.26074147]
 [ 58.88148547]
 [236.84816395]
 [ 40.98268073]
 [ 57.62179093]
 [ 44.274548  ]
 [ 41.5565746 ]
 [ 34.51197473]
 [ 51.82503147]
 [ 48.05246387]
 [ 45.4073456 ]
 [ 77.34223561]
 [ 46.056822  ]
 [ 55.92816387]
 [ 43.4588212 ]
 [ 31.1456162 ]
 [ 39.61796047]
 [ 56.86

## Model Performance

In [None]:
# Model KPIs (MAE and RMSE are expressed as a percentage of the mean of y_test)
MAE = mean_absolute_error(y_test, y_pred)
print('MAE: {:0.2f}%'.format(MAE / np.mean(y_test) * 100))
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE: {:0.2f}%'.format(RMSE / np.mean(y_test) * 100))
r2 = r2_score(y_test, y_pred)
print('R2: {:0.2f}'.format(r2))

MAE: 19.40%
RMSE: 30.37%
R2: 0.80
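
The three metrics, and the normalization by the mean of the observed values, can be reproduced by hand. The sketch below uses four hypothetical price pairs rather than the notebook's test set.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical observed prices
y_hat = np.array([12.0, 18.0, 33.0, 39.0])   # hypothetical predictions

errors = y_true - y_hat

mae = np.mean(np.abs(errors))                # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))         # root mean squared error
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Same normalization as in the notebook: error relative to the mean target
mae_pct = mae / y_true.mean() * 100
rmse_pct = rmse / y_true.mean() * 100

print(f"MAE: {mae_pct:.2f}%")    # MAE: 8.00%
print(f"RMSE: {rmse_pct:.2f}%")  # RMSE: 8.49%
print(f"R2: {r2:.3f}")           # R2: 0.964
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and a few large misses (such as the high-priced trips in this data set) widen the gap between the two.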


## Cross-Validation and Grid Search

The mean squared error is measured across the different training and validation folds, and its square root is reported,

In [None]:
# Apply 10-fold cross-validation and report the RMSE of the average fold MSE
scores = cross_val_score(estimator = Random_Forest_model, X = X_train, y = y_train, cv = 10, scoring = 'neg_mean_squared_error')
print(np.sqrt(-scores.mean()))

17.864279820073154
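
The cell above takes the square root of the average fold MSE, which is not the same quantity as the average of the per-fold RMSEs. The sketch below shows the difference on two hypothetical fold scores; `scores_demo` is made up for illustration.

```python
import numpy as np

# Hypothetical 'neg_mean_squared_error' scores from a 2-fold run
scores_demo = np.array([-4.0, -16.0])

# What the notebook computes: square root of the average MSE
rmse_of_mean_mse = np.sqrt(-scores_demo.mean())   # sqrt(10)

# Alternative: average of the per-fold RMSE values
mean_of_fold_rmse = np.sqrt(-scores_demo).mean()  # (2 + 4) / 2

print(f"sqrt(mean MSE):    {rmse_of_mean_mse:.4f}")   # 3.1623
print(f"mean of fold RMSE: {mean_of_fold_rmse:.4f}")  # 3.0000
```

Since the square root is concave, the first quantity is never smaller than the second; either is reasonable as long as the same one is used when comparing models.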


The available hyperparameters of the Random Forest are listed to decide which ones to tune with Grid Search.

In [None]:
Random_Forest_model = RandomForestRegressor(random_state = 0)

In [None]:
# Parameters
Random_Forest_model.get_params().keys()

dict_keys(['bootstrap', 'ccp_alpha', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'monotonic_cst', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

In [None]:
# Define the parameter values to try
parameters = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2', 0.33]}

In [None]:
# Set up GridSearchCV
full_cv_classifier = GridSearchCV(estimator = Random_Forest_model,
                                  param_grid = parameters,
                                  cv = 10,
                                  scoring = 'neg_mean_squared_error',
                                  n_jobs = -1,
                                  verbose = 2)

In [None]:
# Fit the grid search on the training data
full_cv_classifier.fit(X_train, y_train)

Fitting 10 folds for each of 108 candidates, totalling 1080 fits
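
The counts in the log line above follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A quick check, with the grid copied into a separate dictionary (`parameters_demo` is an illustrative name):

```python
from math import prod

# Same grid as in the notebook, repeated here so the block is self-contained
parameters_demo = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2", 0.33],
}

# One candidate per combination of values: 3 * 3 * 2 * 2 * 3
n_candidates = prod(len(values) for values in parameters_demo.values())

# Each candidate is fitted once per cross-validation fold (cv = 10)
n_fits = n_candidates * 10

print(n_candidates, n_fits)  # 108 1080
```

Adding one more value to any list multiplies the total, so the grid grows quickly; this is worth keeping in mind before widening the search.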


In [None]:
print(full_cv_classifier.best_params_)

{'max_depth': None, 'max_features': 0.33, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}


In [None]:
# Get the best cross-validated score, expressed as RMSE
print(np.sqrt(-full_cv_classifier.best_score_))

15.481622620421652


The model is now trained again, this time with the optimal parameters.

In [None]:
model = RandomForestRegressor(n_estimators = 100, max_depth = None, max_features = 0.33, min_samples_leaf = 1,
                              min_samples_split = 2)
model.fit(X_train,y_train)
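
As an alternative to retyping the best parameters by hand, `GridSearchCV` keeps a refitted copy of the best model when `refit=True` (the default). The sketch below shows this on synthetic data with a deliberately small grid; the data, grid, and fold count are assumptions for illustration and this is not what the notebook does.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, used only to keep the example fast
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 30], "max_features": ["sqrt", 0.33]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# With refit=True, the best configuration has already been refitted
# on all the data passed to fit()
best_model = search.best_estimator_
demo_preds = best_model.predict(X_demo[:2])

print(search.best_params_)
print(demo_preds)
```

`best_estimator_` also inherits the `random_state` of the estimator passed to the search, whereas the manually rebuilt model in the cell above sets no seed, so its results can vary from run to run.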

In [None]:
# Get the predictions for the test set
y_pred = model.predict(X_test)
print(y_pred.reshape(len(y_pred),1))

[[ 55.98087384]
 [ 41.6975554 ]
 [ 35.7535132 ]
 [ 53.96405867]
 [ 26.17137593]
 [ 52.97181893]
 [239.27993909]
 [ 36.7207382 ]
 [ 60.06826973]
 [ 40.0615686 ]
 [ 53.88366967]
 [ 75.87560847]
 [ 47.5115852 ]
 [ 41.27282707]
 [ 56.11138367]
 [ 85.14907747]
 [ 66.36743693]
 [ 51.85448165]
 [ 34.93711347]
 [ 30.25360647]
 [ 33.56792143]
 [220.06934951]
 [ 47.5973574 ]
 [ 26.40508467]
 [ 40.27374091]
 [ 57.52922187]
 [ 39.94307073]
 [ 61.43471427]
 [ 27.95373147]
 [ 56.53183847]
 [ 53.7971216 ]
 [ 66.33854893]
 [ 68.73693093]
 [ 54.45645947]
 [ 85.06868173]
 [ 41.4407402 ]
 [ 40.56645213]
 [ 46.66852047]
 [ 53.83743647]
 [189.9314418 ]
 [ 64.53606947]
 [ 55.187033  ]
 [ 45.33444473]
 [ 47.8982884 ]
 [ 52.41235604]
 [ 41.96847051]
 [ 60.4471592 ]
 [258.80796061]
 [ 39.0965554 ]
 [ 59.63712967]
 [ 46.42459347]
 [ 44.37078885]
 [ 32.3861152 ]
 [ 50.81162227]
 [ 48.38292293]
 [ 43.45484787]
 [ 73.40904313]
 [ 46.532427  ]
 [ 55.4909214 ]
 [ 41.11161993]
 [ 28.47198367]
 [ 37.16153193]
 [ 49.95

In [None]:
# Compute the tuned model's error metrics on the test set
print('RMSE: {:0.2f}%'.format(np.sqrt(mean_squared_error(y_test, y_pred)) / np.mean(y_test) * 100))
print('MAE: {:0.2f}%'.format(mean_absolute_error(y_test, y_pred) / np.mean(y_test) * 100))
print('R2: {:0.2f}%'.format(r2_score(y_test, y_pred) * 100))

RMSE: 24.03%
MAE: 15.76%
R2: 87.29%


## References

*   Velasco Rebolledo, J. (2024). Machine learning: Fundamentos, algoritmos y aplicaciones para los negocios, industria y finanzas. Ediciones Díaz de Santos.
*   James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R. https://link.springer.com/content/pdf/10.1007/978-1-0716-1418-1.pdf
*   Taxi price regression (2024, December 13). Kaggle. https://www.kaggle.com/datasets/denkuznetz/taxi-price-prediction
*   Vandeput, N. (2021). Data science for supply chain forecasting. de Gruyter.