https://www.cienciadedatos.net/documentos/py08_random_forest_python.html

https://github.com/an-rivas/ENDIREH-data-analysis/blob/preprocesamiento4Cat/OE1_Exploracion/Baseline/Baseline.ipynb

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=forest#sklearn.ensemble.RandomForestClassifier

In [1]:
import pandas as pd
from funciones import CargarPandasDatasetCategoricos, BorrarColumnas, InsertarColumnaNueva

In [2]:
# Data handling
# ==============================================================================
import numpy as np
import pandas as pd

# Plotting
# ==============================================================================
from sklearn import tree  # I have version 0.24.1 installed; this is available from 0.21 onward
from sklearn.tree import export_graphviz
from sklearn.tree import export_text
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure


# Preprocessing and modeling
# ==============================================================================
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ParameterGrid

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

## Load the data

In [3]:
endireh = CargarPandasDatasetCategoricos('datasets/endireh.csv')

In [4]:
endireh.shape

(21551, 58)

## Preprocessing

**scikit-learn's random forests do not accept categorical (object-dtype) data, so I one-hot encode the categorical columns with pandas' `pd.get_dummies`.**
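As a minimal sketch of what `pd.get_dummies` does, the toy frame below (its column names and values are made up for illustration and are not ENDIREH data) has one object-dtype column and one numeric column. Only the object column should be expanded into indicator columns.

```python
import pandas as pd

# Toy frame (hypothetical data): one categorical column, one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "red"],
    "size": [1.0, 2.5, 3.0],
})

# get_dummies expands object/category columns into indicator columns
# and passes numeric columns through unchanged.
encoded = pd.get_dummies(df)

print(encoded)
```

Each distinct category becomes its own column, so a dataset with many categorical variables can grow considerably wider after encoding.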

Extract the target variable _y_.

In [6]:
y = endireh['P9_8'].copy()

Define the columns that do not need to be one-hot encoded (OHE).

In [None]:
## continuous columns
columns = ["P1_2", "P1_2_A", "P9_3"]

## categorical columns that are already one-hot encoded
columns.extend(['P1_4_1', 'P1_4_2', 'P1_4_3', 'P1_4_4', 'P1_4_5', 'P1_4_6', 'P1_4_7', 'P1_4_8', 'P1_4_9'])

Set those columns aside in a copy taken from the original dataset.

In [None]:
endireh_num = endireh[columns].copy()

Drop those columns from the dataset, along with the target variable.

In [None]:
columns.extend(['P9_8'])

In [None]:
endireh_cat = endireh.drop(columns=columns)

Compute the one-hot encoding.

In [None]:
endireh_cat = pd.get_dummies(endireh_cat)

Concatenate the one-hot encoded and continuous datasets.

In [None]:
endireh_ohe = pd.concat([endireh_cat, endireh_num], axis=1)

## Finding the best parameters with _Grid Search_

Declare the parameter grid.

In [8]:
param_grid = ParameterGrid(
    {
        'n_estimators'      : range(160, 165),
        'criterion'         : ['gini', 'entropy'],
        'min_samples_split' : range(330, 341, 2),
        'min_samples_leaf'  : range(90, 101, 2),
        'random_state'      : [42, 666, 5],
    }
)

len(param_grid)

1080
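The count of 1080 can be checked by hand: the grid is a Cartesian product, so its size is the product of the number of values per parameter (5 × 2 × 6 × 6 × 3). The sketch below reproduces that arithmetic with the standard library only. It uses `itertools.product` as a stand-in for `ParameterGrid`, so it illustrates the counting rather than scikit-learn's own code.

```python
from itertools import product
from math import prod

# Same value ranges as the grid declared above.
grid = {
    "n_estimators": range(160, 165),          # 160..164
    "criterion": ["gini", "entropy"],
    "min_samples_split": range(330, 341, 2),  # 330, 332, ..., 340
    "min_samples_leaf": range(90, 101, 2),    # 90, 92, ..., 100
    "random_state": [42, 666, 5],
}

# Number of candidate values per parameter.
n_values = {name: len(values) for name, values in grid.items()}

# Size of the Cartesian product, computed two ways.
n_combinations = prod(n_values.values())
combos = list(product(*grid.values()))

print(n_values)
print(n_combinations, len(combos))
```

Because `random_state` is part of the grid, a third of the combinations differ only in the seed. That triples the run time and measures seed sensitivity rather than exploring new hyperparameter values.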

Create the dictionary that will store the results.

In [9]:
resultados = {'params': [], 'oob_accuracy': []}
importances = []
bestSoFar = [0,0] ## oob score and model

### Loop to fit one model per hyperparameter combination

In [10]:
%%time
print(f'Creando modelos para {len(param_grid)} combinaciones de parámetros.\n')

for i,params in enumerate(param_grid):
    
    modelo = RandomForestClassifier(
                oob_score    = True,
                n_jobs       = -1,
                ** params
             )
    
    modelo.fit(endireh_ohe, y)
    
    resultados['params'].append(params)
    resultados['oob_accuracy'].append(modelo.oob_score_)
    
    if(modelo.oob_score_ > bestSoFar[0]):
        bestSoFar[0] = modelo.oob_score_
        bestSoFar[1] = modelo
    
    importances.append(modelo.feature_importances_)
    
    if i % 50 == 0 or i == len(param_grid) - 1:
        print(f"Modelo {i}: {params} \u2713 ({modelo.oob_score_})")

print('\n')

Creando modelos para 1080 combinaciones de parámetros.

Modelo 0: {'criterion': 'gini', 'min_samples_leaf': 90, 'min_samples_split': 330, 'n_estimators': 160, 'random_state': 42} ✓
Modelo 50: {'criterion': 'gini', 'min_samples_leaf': 90, 'min_samples_split': 336, 'n_estimators': 161, 'random_state': 5} ✓
Modelo 100: {'criterion': 'gini', 'min_samples_leaf': 92, 'min_samples_split': 330, 'n_estimators': 163, 'random_state': 666} ✓
Modelo 650: {'criterion': 'entropy', 'min_samples_leaf': 92, 'min_samples_split': 332, 'n_estimators': 161, 'random_state': 5} ✓
Modelo 700: {'criterion': 'entropy', 'min_samples_leaf': 92, 'min_samples_split': 338, 'n_estimators': 163, 'random_state': 666} ✓
Modelo 750: {'criterion': 'entropy', 'min_samples_leaf': 94, 'min_samples_split': 334, 'n_estimators': 160, 'random_state': 42} ✓
Modelo 800: {'criterion': 'entropy', 'min_samples_leaf': 94, 'min_samples_split': 340, 'n_estimators': 161, 'random_state': 5} ✓
Modelo 850: {'criterion': 'entropy', 'min_sampl
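The loop above takes hours on the full dataset, so the same pattern can be tried first on a small problem. The sketch below is an assumption-laden stand-in: it uses synthetic data from `make_classification` in place of the one-hot encoded ENDIREH features, a deliberately tiny hypothetical grid, and a fixed `random_state`. It fits one forest per combination, records the out-of-bag accuracy, and keeps the best model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in for the encoded features and target (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Tiny hypothetical grid so the sketch finishes in seconds.
param_grid = ParameterGrid({
    "n_estimators": [40, 80],
    "min_samples_leaf": [1, 10],
})

records = []
best_score, best_model = -1.0, None

for params in param_grid:
    # oob_score=True makes the forest estimate accuracy on the samples
    # each tree did not see during bootstrap sampling.
    model = RandomForestClassifier(oob_score=True, random_state=42, **params)
    model.fit(X, y)

    records.append({**params, "oob_accuracy": model.oob_score_})
    if model.oob_score_ > best_score:
        best_score, best_model = model.oob_score_, model

print(f"Best OOB accuracy: {best_score:.3f}")
print(best_model.get_params()["n_estimators"], best_model.get_params()["min_samples_leaf"])
```

The out-of-bag score is why no separate validation split is used here: each tree is evaluated on the samples left out of its bootstrap sample.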

## Viewing the results

### December 20: 1080 combinations

    param_grid = ParameterGrid(
        {
            'n_estimators'      : range(160, 165),
            'criterion'         : ['gini', 'entropy'],
            'min_samples_split' : range(330, 341, 2),
            'min_samples_leaf'  : range(90, 101, 2),
            'random_state'      : [42, 666, 5],
        }
    )
            
Started at 21:03. Reaching experiment 50 took 1 hr 15 min, so the full run of 1080 experiments should take about 27 hours, which puts the finish around midnight on Tuesday.
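The 27-hour figure is a linear extrapolation from the first 50 models. As a quick check of that arithmetic, assuming every model takes about the same time to fit (an assumption, since `n_estimators` and the other parameters vary slightly across the grid):

```python
# Figures taken from the note above.
minutes_for_first_50 = 75   # 1 hr 15 min
total_models = 1080

# Linear extrapolation: constant time per model.
minutes_per_model = minutes_for_first_50 / 50
total_minutes = minutes_per_model * total_models
total_hours = total_minutes / 60

# Clock time at the end, starting from 21:03.
start_minutes = 21 * 60 + 3
days_later, minute_of_day = divmod(int(start_minutes + total_minutes), 24 * 60)
finish = f"{minute_of_day // 60:02d}:{minute_of_day % 60:02d}"

print(f"Estimated total: {total_hours} h")
print(f"Finishes at {finish}, {days_later} calendar day(s) after the start date")
```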

In [None]:
resultados = pd.DataFrame(resultados)
resultados = pd.concat([resultados, resultados['params'].apply(pd.Series)], axis=1)
resultados = resultados.sort_values('oob_accuracy', ascending=False)
resultados = resultados.drop(columns = 'params')
resultados.head(10)

| | oob_accuracy | criterion | min_samples_leaf | min_samples_split | n_estimators | random_state |
|---|---|---|---|---|---|---|
| 0 | 0.71412 | gini | 90 | 330 | 160 | 42 |
| 725 | 0.71412 | entropy | 94 | 330 | 161 | 5 |
| 711 | 0.71412 | entropy | 92 | 340 | 162 | 42 |
| 712 | 0.71412 | entropy | 92 | 340 | 162 | 666 |
| 713 | 0.71412 | entropy | 92 | 340 | 162 | 5 |
| 714 | 0.71412 | entropy | 92 | 340 | 163 | 42 |
| 715 | 0.71412 | entropy | 92 | 340 | 163 | 666 |
| 716 | 0.71412 | entropy | 92 | 340 | 163 | 5 |
| 717 | 0.71412 | entropy | 92 | 340 | 164 | 42 |
| 718 | 0.71412 | entropy | 92 | 340 | 164 | 666 |
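The results cell turns a column of parameter dictionaries into one column per parameter. A possible alternative, sketched here on made-up values (the scores and parameters below are hypothetical, not results from this run), is to build the frame directly from the list of dictionaries, which avoids the row-wise `apply(pd.Series)` step.

```python
import pandas as pd

# Hypothetical results in the same shape as the `resultados` dictionary.
results = {
    "params": [
        {"criterion": "gini", "n_estimators": 160},
        {"criterion": "entropy", "n_estimators": 161},
    ],
    "oob_accuracy": [0.70, 0.72],
}

# A list of dicts becomes one column per key.
table = pd.DataFrame(results["params"])

# Put the score first, then sort from best to worst.
table.insert(0, "oob_accuracy", results["oob_accuracy"])
table = table.sort_values("oob_accuracy", ascending=False)

print(table)
```

The original row index is preserved after sorting, so it still identifies the position of each combination in the grid.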


In [12]:
importances

[array([8.09119557e-05, 1.11540455e-03, 3.71612311e-04, ...,
        1.64193793e-03, 2.43843677e-03, 1.61885552e-03]),
 array([0.00202901, 0.00218497, 0.00110465, ..., 0.00094154, 0.00374129,
        0.01345559]),
 array([0.        , 0.00273238, 0.00088267, ..., 0.00162138, 0.00807813,
        0.0102919 ]),
 array([8.03126079e-05, 1.10714229e-03, 3.68859627e-04, ...,
        1.62977543e-03, 2.42037428e-03, 1.60686400e-03]),
 array([0.00202901, 0.00218497, 0.00110465, ..., 0.00094154, 0.00374129,
        0.01345559]),
 array([0.        , 0.00273238, 0.00088267, ..., 0.00162138, 0.00807813,
        0.0102919 ]),
 array([8.03126079e-05, 1.10714229e-03, 3.68859627e-04, ...,
        1.62977543e-03, 2.42037428e-03, 1.60686400e-03]),
 array([0.00201398, 0.00216879, 0.00109647, ..., 0.00093457, 0.00371357,
        0.01335592]),
 array([0.        , 0.00271272, 0.00087632, ..., 0.00160972, 0.00802001,
        0.01021786]),
 array([8.03126079e-05, 1.10714229e-03, 3.68859627e-04, ...,
        1.62