# Random Forests in Machine Learning

Like decision trees, a random forest can be used for both regression and classification. It builds multiple decision trees and then combines their outputs into a final prediction.
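To make the "combine their outputs" step concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and all variable names are illustrative assumptions, not part of this notebook. In scikit-learn, `RandomForestClassifier` averages the class probabilities predicted by its individual trees, and the code checks that by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real dataset (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

forest_demo = RandomForestClassifier(n_estimators=25, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Each fitted tree is exposed in `estimators_`; average their probabilities.
tree_probas = np.stack(
    [tree.predict_proba(X_demo[:5]) for tree in forest_demo.estimators_]
)
manual_average = tree_probas.mean(axis=0)

# The forest's own probabilities should match the manual average.
forest_probas = forest_demo.predict_proba(X_demo[:5])
print(np.allclose(manual_average, forest_probas))
```

For regression, `RandomForestRegressor` follows the same idea and averages the trees' numeric predictions.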

Random forests have several advantages over individual decision trees:

1. They are less prone to overfitting.
2. They can handle a larger number of input features.
3. They can capture complex non-linear relationships between the input and output variables.
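The overfitting claim can be explored with a small experiment. The sketch below uses noisy synthetic data (an assumption, not this notebook's dataset) and compares the training and test accuracy of one unrestricted tree against a forest. An unrestricted tree usually memorizes the training set, while the forest typically generalizes better, although the exact numbers depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: `flip_y` randomly flips a fraction of the labels.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small_forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# A large gap between train and test accuracy is a sign of overfitting.
tree_train, tree_test = single_tree.score(X_tr, y_tr), single_tree.score(X_te, y_te)
forest_train, forest_test = small_forest.score(X_tr, y_tr), small_forest.score(X_te, y_te)

print(f"Single tree   train={tree_train:.3f}  test={tree_test:.3f}")
print(f"Random forest train={forest_train:.3f}  test={forest_test:.3f}")
```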

In [1]:
import numpy as np
import pandas as pd

import sklearn.model_selection as ms
import sklearn.preprocessing as sp
import sklearn.ensemble as se


# Example case
path: str = './data/tarjetas_credito.csv'
df: pd.DataFrame = pd.read_csv(path)
df.head()

Unnamed: 0,Duracion,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Monto,Clase
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [2]:
# Dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   Duracion  284807 non-null  float64
 1   V1        284807 non-null  float64
 2   V2        284807 non-null  float64
 3   V3        284807 non-null  float64
 4   V4        284807 non-null  float64
 5   V5        284807 non-null  float64
 6   V6        284807 non-null  float64
 7   V7        284807 non-null  float64
 8   V8        284807 non-null  float64
 9   V9        284807 non-null  float64
 10  V10       284807 non-null  float64
 11  V11       284807 non-null  float64
 12  V12       284807 non-null  float64
 13  V13       284807 non-null  float64
 14  V14       284807 non-null  float64
 15  V15       284807 non-null  float64
 16  V16       284807 non-null  float64
 17  V17       284807 non-null  float64
 18  V18       284807 non-null  float64
 19  V19       284807 non-null  float64
 20  V20 

In [3]:
# Descriptive statistics
df.describe()

Unnamed: 0,Duracion,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Monto,Clase
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


**According to Federico:**

The dataset contains almost 285 thousand records of credit card transactions.

The columns are named `V1`, `V2`, etc. because the dataset, like any dataset containing sensitive information, has been anonymized so that the data can be worked with without exposing restricted information.

The only columns with descriptive names are `Duracion`, `Monto`, and `Clase`. The last one is very important, because it indicates whether the transaction was legitimate (`0`) or fraudulent (`1`).

Our goal is therefore to use this algorithm to find out whether there is a strong relationship between the columns of each record and the `Clase` column, so that we can later use the model to judge whether a new incoming transaction is at risk of being fraudulent, given the values of all its other columns.

However, each column represents something different, and their minimum and maximum values differ widely, which can make it harder for a model to find relationships among them. (Tree-based models such as random forests are largely insensitive to feature scale, so this step matters more for other algorithms, but it does no harm here.)

So we are going to **normalize** the data: the values in every column are rescaled so that, in each column, **the minimum value is zero and the maximum value is one**.
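The rescaling described here is the min-max formula, applied per column: `x_scaled = (x - min) / (max - min)`. As a small worked sketch, the code below applies it by hand to three `Monto` values taken from the outputs above (the minimum, the first row, and the maximum) and compares the result with `MinMaxScaler`. The array and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Minimum, first-row, and maximum `Monto` values from the outputs above.
monto = np.array([[0.0], [149.62], [25691.16]])

# Min-max formula applied by hand, column by column.
col_min = monto.min(axis=0)
col_max = monto.max(axis=0)
manual = (monto - col_min) / (col_max - col_min)

# Same transformation through scikit-learn.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(monto)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The middle value comes out at about 0.005824, which is the normalized `Monto` that the notebook reports for the first row.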

In [6]:
# Normalize the data
scale: sp.MinMaxScaler = sp.MinMaxScaler(feature_range=(0, 1))
normalized: np.ndarray = scale.fit_transform(df)
df_normalized: pd.DataFrame = pd.DataFrame(normalized, columns=df.columns)
df_normalized.head()

Unnamed: 0,Duracion,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Monto,Clase
0,0.0,0.935192,0.76649,0.881365,0.313023,0.763439,0.267669,0.266815,0.786444,0.475312,...,0.561184,0.522992,0.663793,0.391253,0.585122,0.394557,0.418976,0.312697,0.005824,0.0
1,0.0,0.978542,0.770067,0.840298,0.271796,0.76612,0.262192,0.264875,0.786298,0.453981,...,0.55784,0.480237,0.666938,0.33644,0.58729,0.446013,0.416345,0.313423,0.000105,0.0
2,6e-06,0.935217,0.753118,0.868141,0.268766,0.762329,0.281122,0.270177,0.788042,0.410603,...,0.565477,0.54603,0.678939,0.289354,0.559515,0.402727,0.415489,0.311911,0.014739,0.0
3,6e-06,0.941878,0.765304,0.868484,0.213661,0.765647,0.275559,0.266803,0.789434,0.414999,...,0.559734,0.510277,0.662607,0.223826,0.614245,0.389197,0.417669,0.314371,0.004807,0.0
4,1.2e-05,0.938617,0.77652,0.864251,0.269796,0.762975,0.263984,0.268968,0.782484,0.49095,...,0.561327,0.547271,0.663392,0.40127,0.566343,0.507497,0.420561,0.31749,0.002724,0.0
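One caveat: the cell above calls `fit_transform` on the whole DataFrame, so the scaler sees every row (and the `Clase` column) before any train/test split. A common alternative, sketched here with random stand-in data and hypothetical variable names, is to fit the scaler on the training features only and reuse it on the test features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 4))    # stand-in for the feature columns
labels = rng.integers(0, 2, size=1000)   # stand-in for `Clase`

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

scaler = MinMaxScaler(feature_range=(0, 1))
f_train_scaled = scaler.fit_transform(f_train)  # learn min/max from training rows only
f_test_scaled = scaler.transform(f_test)        # reuse the training min/max

# Training columns span exactly [0, 1]; test values may fall slightly outside.
print(f_train_scaled.min(axis=0), f_train_scaled.max(axis=0))
print(f_test_scaled.min(), f_test_scaled.max())
```

Leaving the label out of the scaler also keeps `Clase` as the original integers `0` and `1`.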


In [8]:
X: pd.DataFrame = df_normalized.drop('Clase', axis=1)
X.head()

Unnamed: 0,Duracion,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Monto
0,0.0,0.935192,0.76649,0.881365,0.313023,0.763439,0.267669,0.266815,0.786444,0.475312,...,0.582942,0.561184,0.522992,0.663793,0.391253,0.585122,0.394557,0.418976,0.312697,0.005824
1,0.0,0.978542,0.770067,0.840298,0.271796,0.76612,0.262192,0.264875,0.786298,0.453981,...,0.57953,0.55784,0.480237,0.666938,0.33644,0.58729,0.446013,0.416345,0.313423,0.000105
2,6e-06,0.935217,0.753118,0.868141,0.268766,0.762329,0.281122,0.270177,0.788042,0.410603,...,0.585855,0.565477,0.54603,0.678939,0.289354,0.559515,0.402727,0.415489,0.311911,0.014739
3,6e-06,0.941878,0.765304,0.868484,0.213661,0.765647,0.275559,0.266803,0.789434,0.414999,...,0.57805,0.559734,0.510277,0.662607,0.223826,0.614245,0.389197,0.417669,0.314371,0.004807
4,1.2e-05,0.938617,0.77652,0.864251,0.269796,0.762975,0.263984,0.268968,0.782484,0.49095,...,0.584615,0.561327,0.547271,0.663392,0.40127,0.566343,0.507497,0.420561,0.31749,0.002724


In [9]:
y: pd.Series = df_normalized['Clase']

In [10]:
X_train, X_test, y_train, y_test = ms.train_test_split(
    X, y, test_size=0.3, random_state=42
)
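Because only about 0.17% of the rows are fraudulent (the mean of `Clase` in the summary above), a plain random split can give the training and test sets slightly different fraud rates. One option is the `stratify` parameter of `train_test_split`, which preserves the class proportions in both parts. The sketch below uses made-up labels with a similar imbalance; the names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(10_000, 3))
labels = np.zeros(10_000, dtype=int)
labels[:20] = 1  # 0.2% positives, roughly mimicking the fraud rate

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

# With stratification, both parts keep close to the same share of positives.
print(l_train.mean(), l_test.mean())
```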

In [11]:
forest: se.RandomForestClassifier = se.RandomForestClassifier()
forest.fit(X_train, y_train)

In [12]:
forest.score(X_test, y_test)

0.9995786664794073
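`score` returns accuracy, and with roughly 0.17% fraudulent rows a model that always predicts `0` would already score about 0.998. Accuracy alone therefore says little about how many frauds are caught. The sketch below, on synthetic imbalanced data with assumed names, shows complementary metrics from `sklearn.metrics`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 2% positives (illustrative assumption).
X_imb, y_imb = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=42, stratify=y_imb
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Accuracy of a trivial "always legitimate" baseline.
baseline_accuracy = (y_te == 0).mean()

cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class
print("Baseline accuracy:", baseline_accuracy)
print("Model accuracy:   ", model.score(X_te, y_te))
print("Confusion matrix:\n", cm)
print("Precision:", precision_score(y_te, y_pred, zero_division=0))
print("Recall:   ", recall_score(y_te, y_pred, zero_division=0))
```

Recall on the fraudulent class (the share of real frauds that the model flags) is usually the number to watch in this kind of problem.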

In [13]:
# Test the model
new_df: pd.DataFrame = pd.DataFrame({
    'Duracion': [0.000006], 'V1': [0.452345], 'V2': [0.564789], 'V3': [0.123456], 'V4': [0.654321],
    'V5': [0.987654], 'V6': [0.345678], 'V7': [0.234567], 'V8': [0.876543], 'V9': [0.456789],
    'V10': [0.567890], 'V11': [0.678901], 'V12': [0.789012], 'V13': [0.890123], 'V14': [0.901234],
    'V15': [0.012345], 'V16': [0.543210], 'V17': [0.432109], 'V18': [0.321098], 'V19': [0.210987],
    'V20': [0.109876], 'V21': [0.098765], 'V22': [0.887654], 'V23': [0.776543], 'V24': [0.665432],
    'V25': [0.554321], 'V26': [0.443210], 'V27': [0.332109], 'V28': [0.221098], 'Monto': [0.110987]
}, index=[0])
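The values in `new_df` are already on the 0 to 1 scale. A real incoming transaction would arrive in raw units and would need the same rescaling the model was trained with. A minimal sketch, assuming a scaler fitted on the feature columns only (the tiny two-column frame and the name `feature_scaler` are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in for raw training features (values are illustrative).
raw_train = pd.DataFrame({
    "Duracion": [0.0, 1.0, 2.0, 10.0],
    "Monto": [149.62, 2.69, 378.66, 69.99],
})

feature_scaler = MinMaxScaler(feature_range=(0, 1)).fit(raw_train)

# A new raw transaction with the same columns, in the same order.
raw_new = pd.DataFrame({"Duracion": [5.0], "Monto": [100.0]})

# Use `transform`, not `fit_transform`: the min/max must come from the training data.
scaled_new = pd.DataFrame(
    feature_scaler.transform(raw_new), columns=raw_new.columns
)
print(scaled_new)
```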

In [15]:
predict_class = forest.predict(new_df)
predict_class

array([0.])

In [16]:
probability = forest.predict_proba(new_df)
probability

array([[0.61, 0.39]])

In [17]:
print('Predicted class: ', predict_class[0])
print('Probability of being legitimate: ', probability[0][0])
print('Probability of being fraudulent: ', probability[0][1])

Predicted class:  0.0
Probability of being legitimate:  0.61
Probability of being fraudulent:  0.39
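The columns returned by `predict_proba` follow the order of the classifier's `classes_` attribute, so indexing by position works here only because class `0` comes first. `predict` picks the most probable class, which for two classes amounts to a 0.5 cutoff. In fraud detection it can make sense to flag transactions at a lower probability. The sketch below, on synthetic data with assumed names, looks up the positive-class column explicitly and compares two thresholds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (illustrative assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Columns of `predict_proba` follow the order of `classes_`.
fraud_column = list(clf.classes_).index(1)
fraud_proba = clf.predict_proba(X_demo)[:, fraud_column]

# Count the samples flagged at the usual cutoff and at a more cautious one.
flagged_default = (fraud_proba >= 0.5).sum()
flagged_cautious = (fraud_proba >= 0.2).sum()
print(clf.classes_, flagged_default, flagged_cautious)
```

A lower threshold flags at least as many transactions, trading more false alarms for fewer missed frauds.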
