# **HACKATHON CHALLENGE**: *DATA SCIENCE - CAIXA BANK*
*Carlos Cabruja - Data*

## Background

The IBEX 35 is the official index of the Spanish stock exchange, made up of the 35 most traded companies on the market. The index shows in real time whether share prices are rising or falling, so it measures the behavior of this set of stocks.

The IBEX 35 serves as a benchmark for investors in the Spanish market. Its return is the target that fund managers aim to beat.

Modeling the dynamics of this kind of index is therefore essential for decision-making by all stock-market entities.

## Challenge

1. Develop a predictive model for the target variable (whether the IBEX 35 closing price will be higher or lower than the current closing price).

To do so, train your model on the training data (using the tweets as well earns an extra 100 points) and feed the test_x dataset to your model as input to produce the predictions.

2. Write a short document (max. 2 pages) or presentation (max. 4 slides) explaining the solution you used and why you used it.

## Libraries

In [41]:
# Data handling
# ==============================================================================
import numpy as np
import pandas as pd
import nltk
import datetime as dt

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and modeling
# ==============================================================================
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.inspection import permutation_importance
import multiprocessing

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

## Data

In [2]:
# load the datasets we will work with
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test_x.csv')
tweets = pd.read_csv('data/tweets_from2015_#Ibex35.csv')

In [3]:
train # display the training data

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Target
0,1994-01-03,3615.199951,3654.699951,3581.000000,3654.500000,3654.496338,0.0,0
1,1994-01-04,3654.500000,3675.500000,3625.100098,3630.300049,3630.296387,0.0,1
2,1994-01-05,3625.199951,3625.199951,3583.399902,3621.199951,3621.196289,0.0,1
3,1994-01-06,,,,,,,0
4,1994-01-07,3621.199951,3644.399902,3598.699951,3636.399902,3636.396240,0.0,1
...,...,...,...,...,...,...,...,...
6549,2019-05-24,9150.299805,9211.099609,9141.400391,9174.599609,9174.599609,121673100.0,0
6550,2019-05-27,9225.900391,9294.599609,9204.700195,9216.400391,9216.400391,60178000.0,0
6551,2019-05-28,9220.400391,9224.900391,9132.900391,9191.799805,9191.799805,218900800.0,0
6552,2019-05-29,9113.200195,9116.700195,9035.099609,9080.500000,9080.500000,148987100.0,0


In [4]:
test # display the test data

Unnamed: 0,test_index,Date,Open,High,Low,Close,Adj Close,Volume
0,6557,2019-06-05,9136.799805,9173.400391,9095.000000,9150.500000,9150.500000,158753000.0
1,6558,2019-06-06,9169.200195,9246.200195,9136.700195,9169.200195,9169.200195,212720900.0
2,6559,2019-06-07,9186.700195,9261.400391,9185.700195,9236.099609,9236.099609,150664700.0
3,6560,2019-06-10,9284.200195,9302.200195,9248.099609,9294.099609,9294.099609,102323700.0
4,6561,2019-06-11,9288.599609,9332.500000,9273.400391,9282.099609,9282.099609,144701200.0
...,...,...,...,...,...,...,...,...
721,7278,2022-03-25,8314.099609,8363.200195,8286.500000,8330.599609,8330.599609,156189000.0
722,7279,2022-03-28,8354.400391,8485.700195,8354.400391,8365.599609,8365.599609,167961800.0
723,7280,2022-03-29,8451.000000,8621.000000,8419.700195,8614.599609,8614.599609,257812200.0
724,7281,2022-03-30,8583.299805,8597.400391,8508.900391,8550.599609,8550.599609,185389000.0


In [5]:
tweets # display the tweet data

Unnamed: 0,tweetDate,handle,text
0,Sat Apr 09 14:47:45 +0000 2022,abelac62,He hecho el repaso de todos los componentes de...
1,Thu Apr 07 19:14:36 +0000 2022,LluisPerarnau,Els projectes que han presentat les empreses d...
2,Mon Apr 04 16:48:45 +0000 2022,Pegaso121080,"Por si no lo has visto, o no lo encuentras en ..."
3,Tue Apr 05 07:23:16 +0000 2022,zonavalue,📈 #BOLSA: El #Ibex35 abre en 🟢 \n\n🇪🇸 #Ibex35 ...
4,Thu Mar 31 16:07:43 +0000 2022,EPeconomia,"El #Ibex35 retrocede un 0,4% en marzo y un 3,0..."
...,...,...,...
9796,Thu Jan 08 16:41:36 +0000 2015,elEconomistaes,"#Cierre | El #Ibex35 sube un 2,26% hasta los 1..."
9797,Sat Jan 03 17:20:30 +0000 2015,Roger_bolsa,Un vistazo a los #Bluechips del #Ibex #Ibex35....
9798,Sat Jan 10 19:42:45 +0000 2015,Secretosdebolsa,Así comienza la #Bolsa en #2015 Ojo a los sopo...
9799,Sat Jan 10 21:47:17 +0000 2015,Roger_bolsa,Análisis del #BancoSantander #Santander #SAN t...


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6554 entries, 0 to 6553
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       6554 non-null   object 
 1   Open       6421 non-null   float64
 2   High       6421 non-null   float64
 3   Low        6421 non-null   float64
 4   Close      6421 non-null   float64
 5   Adj Close  6421 non-null   float64
 6   Volume     6421 non-null   float64
 7   Target     6554 non-null   int64  
dtypes: float64(6), int64(1), object(1)
memory usage: 409.8+ KB


The missing values (NULLs) in the data will need to be handled.
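
A minimal sketch of what that handling could look like, on a small made-up frame (the values are illustrative and not taken from train.csv): count the nulls per column, then locate the rows where every price column is missing at once, and drop them as one possible treatment.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the train layout; values are made up for illustration
toy = pd.DataFrame({
    "Date": ["1994-01-05", "1994-01-06", "1994-01-07"],
    "Close": [3621.2, np.nan, 3636.4],
    "Volume": [0.0, np.nan, 0.0],
    "Target": [1, 0, 1],
})

null_counts = toy.isnull().sum()                          # nulls per column
price_cols = ["Close", "Volume"]
empty_rows = toy[toy[price_cols].isnull().all(axis=1)]    # rows with no price data at all
cleaned = toy.dropna(subset=price_cols)                   # one possible treatment: drop them
```

Dropping is only one option; whether to drop or impute depends on how the target for those days should be interpreted.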

In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 726 entries, 0 to 725
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   test_index  726 non-null    int64  
 1   Date        726 non-null    object 
 2   Open        726 non-null    float64
 3   High        726 non-null    float64
 4   Low         726 non-null    float64
 5   Close       726 non-null    float64
 6   Adj Close   726 non-null    float64
 7   Volume      726 non-null    float64
dtypes: float64(6), int64(1), object(1)
memory usage: 45.5+ KB


In [8]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9801 entries, 0 to 9800
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweetDate  9799 non-null   object
 1   handle     9798 non-null   object
 2   text       9797 non-null   object
dtypes: object(3)
memory usage: 229.8+ KB


**tweetDate** must be converted to a date type so it can be merged with our train dataframe, and its nulls need to be handled.

## Data cleaning

In [9]:
# how many nulls does tweets have?
tweets.isnull().sum()

tweetDate    2
handle       3
text         4
dtype: int64

In [10]:
# rows with nulls in tweetDate
tweets[tweets.tweetDate.isnull()]

Unnamed: 0,tweetDate,handle,text
6931,,,
9634,,,


These rows are completely empty, so we drop them.

In [11]:
# drop rows with nulls in tweetDate
tweets = tweets.dropna(subset=['tweetDate'])

# rows with nulls in text
tweets[tweets.text.isnull()]

Unnamed: 0,tweetDate,handle,text
1070,Mon Mar 08 07:13:57 +0000 2021,pharma_jonpi,
9667,Y Montoro dando caña....,,


Tweets that lack a publication date or any text are not relevant either.

In [12]:
# drop all remaining nulls
tweets = tweets.dropna()
tweets.isnull().sum()

tweetDate    0
handle       0
text         0
dtype: int64

In [13]:
# convert tweetDate to datetime
try:
    tweets['tweetDate'] = pd.to_datetime(tweets['tweetDate'])
except Exception as e:
    print(e)

Unknown string format: #Bolsa #IBEX35 https://t.co/2wBR9k3hHr


Because **tweetDate** is contaminated with free text, we validate the format of each row.

In [14]:
# a well-formed value such as "Sat Apr 09 14:47:45 +0000 2022"
# has exactly 6 space-separated fields
n_fields = tweets['tweetDate'].astype(str).str.split(' ').str.len()
# drop the rows with badly formatted dates
tweets = tweets[n_fields == 6].copy()

In [15]:
# convert tweetDate to datetime
tweets['tweetDate'] = pd.to_datetime(tweets['tweetDate'])
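
As a possible alternative to filtering on the number of fields, the parsing itself can reject malformed values. The sketch below assumes the Twitter-style layout seen in the data (`Sat Apr 09 14:47:45 +0000 2022`) and uses two made-up sample strings; `errors="coerce"` turns anything that does not match the format into `NaT`, which can then be dropped.

```python
import pandas as pd

# Made-up sample: one well-formed timestamp and one polluted value
raw = pd.Series([
    "Sat Apr 09 14:47:45 +0000 2022",
    "#Bolsa #IBEX35 some stray text",
])

# errors="coerce" replaces values that do not match the format with NaT
parsed = pd.to_datetime(raw, format="%a %b %d %H:%M:%S %z %Y", errors="coerce")

# keep only the rows that parsed correctly
valid = parsed.dropna()
```

This checks the full date layout rather than only the token count, so a six-word tweet fragment sitting in the date column would also be rejected.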

In [16]:
# rename tweetDate to Date
tweets = tweets.rename(columns={'tweetDate': 'Date'})

The handle column is of no use to us, since we will not give any weight to who wrote the tweet.

In [17]:
# drop handle from the tweets
tweets = tweets.drop(['handle'], axis=1)
tweets

Unnamed: 0,Date,text
0,2022-04-09 14:47:45+00:00,He hecho el repaso de todos los componentes de...
1,2022-04-07 19:14:36+00:00,Els projectes que han presentat les empreses d...
2,2022-04-04 16:48:45+00:00,"Por si no lo has visto, o no lo encuentras en ..."
3,2022-04-05 07:23:16+00:00,📈 #BOLSA: El #Ibex35 abre en 🟢 \n\n🇪🇸 #Ibex35 ...
4,2022-03-31 16:07:43+00:00,"El #Ibex35 retrocede un 0,4% en marzo y un 3,0..."
...,...,...
9796,2015-01-08 16:41:36+00:00,"#Cierre | El #Ibex35 sube un 2,26% hasta los 1..."
9797,2015-01-03 17:20:30+00:00,Un vistazo a los #Bluechips del #Ibex #Ibex35....
9798,2015-01-10 19:42:45+00:00,Así comienza la #Bolsa en #2015 Ojo a los sopo...
9799,2015-01-10 21:47:17+00:00,Análisis del #BancoSantander #Santander #SAN t...


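Now we create the sentiment column using NLTK's VADER sentiment analyzer. VADER's lexicon is built for English, which likely explains why most of the Spanish tweets below receive a compound score of 0.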

In [18]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer() # instantiate the analyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\carlo\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [19]:
tweets['sentiment'] = tweets['text'].apply(lambda x: sentiment.polarity_scores(x)['compound'])
tweets

Unnamed: 0,Date,text,sentiment
0,2022-04-09 14:47:45+00:00,He hecho el repaso de todos los componentes de...,0.0000
1,2022-04-07 19:14:36+00:00,Els projectes que han presentat les empreses d...,0.0000
2,2022-04-04 16:48:45+00:00,"Por si no lo has visto, o no lo encuentras en ...",-0.5267
3,2022-04-05 07:23:16+00:00,📈 #BOLSA: El #Ibex35 abre en 🟢 \n\n🇪🇸 #Ibex35 ...,0.0000
4,2022-03-31 16:07:43+00:00,"El #Ibex35 retrocede un 0,4% en marzo y un 3,0...",0.0000
...,...,...,...
9796,2015-01-08 16:41:36+00:00,"#Cierre | El #Ibex35 sube un 2,26% hasta los 1...",0.0000
9797,2015-01-03 17:20:30+00:00,Un vistazo a los #Bluechips del #Ibex #Ibex35....,0.0000
9798,2015-01-10 19:42:45+00:00,Así comienza la #Bolsa en #2015 Ojo a los sopo...,0.0000
9799,2015-01-10 21:47:17+00:00,Análisis del #BancoSantander #Santander #SAN t...,0.0000


In [20]:
# drop text
tweets = tweets.drop(['text'], axis=1)

Next, to merge with train, we reduce each timestamp to a plain calendar date.

Tweets posted on the same date at different times have their sentiment scores averaged.

In [21]:
# keep only the date part (YYYY-MM-DD)
tweets['Date'] = pd.to_datetime(tweets['Date']).dt.date

In [22]:
# drop exact duplicates (same date and same sentiment)
tweets = tweets.drop_duplicates()

# are there dates repeated with different sentiment values?
tweets.groupby(['Date', 'sentiment']).count()

Date,sentiment
2015-01-03,0.0000
2015-01-04,0.0000
2015-01-05,-0.5267
2015-01-05,0.0000
2015-01-06,-0.7430
...,...
2022-04-07,0.0000
2022-04-08,-0.2960
2022-04-08,0.0000
2022-04-09,-0.6249


In [23]:
# find the dates that appear with more than one sentiment value
temp_df = tweets.groupby(['Date']).count()
lista_fechas = temp_df[temp_df['sentiment'] > 1].index

# average the sentiment score over each repeated date
for i in lista_fechas:
    # select the rows for this date
    temp_df = tweets[tweets['Date'] == i]
    # mean score for this date (named so it does not overwrite the `sentiment` analyzer)
    mean_sentiment = temp_df['sentiment'].mean()
    # write the mean back to every row of this date
    tweets.loc[tweets['Date'] == i, 'sentiment'] = mean_sentiment

In [24]:
# drop the rows that are now duplicated
tweets = tweets.drop_duplicates()
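
The per-day averaging can also be written as a single `groupby`. The sketch below uses made-up scores and is not the notebook's exact procedure: it averages over every tweet of the day, whereas the cells above first drop duplicate (Date, sentiment) pairs and therefore average the distinct scores, so the two can differ when several tweets share the same score.

```python
import datetime as dt
import pandas as pd

# Made-up per-tweet scores; two tweets share the same date
toy = pd.DataFrame({
    "Date": [dt.date(2015, 1, 5), dt.date(2015, 1, 5), dt.date(2015, 1, 6)],
    "sentiment": [-0.5, 0.0, -0.75],
})

# one row per date, holding the mean score of that day's tweets
daily = toy.groupby("Date", as_index=False)["sentiment"].mean()
```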

Our tweets table is now clean, so we merge it with train.

In [25]:
# convert train Date to a date
train['Date'] = pd.to_datetime(train['Date']).dt.date
train

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Target
0,1994-01-03,3615.199951,3654.699951,3581.000000,3654.500000,3654.496338,0.0,0
1,1994-01-04,3654.500000,3675.500000,3625.100098,3630.300049,3630.296387,0.0,1
2,1994-01-05,3625.199951,3625.199951,3583.399902,3621.199951,3621.196289,0.0,1
3,1994-01-06,,,,,,,0
4,1994-01-07,3621.199951,3644.399902,3598.699951,3636.399902,3636.396240,0.0,1
...,...,...,...,...,...,...,...,...
6549,2019-05-24,9150.299805,9211.099609,9141.400391,9174.599609,9174.599609,121673100.0,0
6550,2019-05-27,9225.900391,9294.599609,9204.700195,9216.400391,9216.400391,60178000.0,0
6551,2019-05-28,9220.400391,9224.900391,9132.900391,9191.799805,9191.799805,218900800.0,0
6552,2019-05-29,9113.200195,9116.700195,9035.099609,9080.500000,9080.500000,148987100.0,0


In [26]:
# merge train and tweets
train_tweets = pd.merge(train, tweets, on='Date')
train_tweets

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Target,sentiment
0,2015-01-05,10267.200195,10390.799805,9977.799805,9993.299805,9993.290039,299610800.0,1,-0.263350
1,2015-01-06,10040.700195,10060.799805,9871.099609,9871.099609,9871.089844,282855400.0,0,-0.371500
2,2015-01-07,9937.299805,10051.200195,9836.400391,9891.400391,9891.390625,290122400.0,0,0.000000
3,2015-01-08,10053.200195,10143.000000,9970.299805,10115.000000,10114.990234,320452300.0,0,0.000000
4,2015-01-09,10080.000000,10080.000000,9610.099609,9719.000000,9718.990234,789490200.0,1,0.000000
...,...,...,...,...,...,...,...,...,...
1018,2019-05-24,9150.299805,9211.099609,9141.400391,9174.599609,9174.599609,121673100.0,0,-0.299150
1019,2019-05-27,9225.900391,9294.599609,9204.700195,9216.400391,9216.400391,60178000.0,0,-0.284467
1020,2019-05-28,9220.400391,9224.900391,9132.900391,9191.799805,9191.799805,218900800.0,0,-0.441733
1021,2019-05-29,9113.200195,9116.700195,9035.099609,9080.500000,9080.500000,148987100.0,0,-0.138575
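
`pd.merge` performs an inner join by default, which is why the merged table only keeps dates present in both frames (its index runs from 0 to 1021, starting in January 2015, while train has 6554 rows). A small sketch on made-up frames shows the difference from a left join, where `indicator=True` marks which rows found a match:

```python
import datetime as dt
import pandas as pd

# Made-up frames: prices cover three days, tweet scores only two of them
prices = pd.DataFrame({
    "Date": [dt.date(2014, 12, 31), dt.date(2015, 1, 5), dt.date(2015, 1, 6)],
    "Close": [10279.5, 9993.3, 9871.1],
})
scores = pd.DataFrame({
    "Date": [dt.date(2015, 1, 5), dt.date(2015, 1, 6)],
    "sentiment": [-0.26, -0.37],
})

inner = pd.merge(prices, scores, on="Date")   # default how="inner": unmatched days are dropped
left = pd.merge(prices, scores, on="Date", how="left", indicator=True)  # keeps every price row
```

With the left join, days without tweets stay in the table with a missing `sentiment`, which would then need its own treatment.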


Now we clean our training data.

In [27]:
train_tweets.isnull().sum()

Date         0
Open         1
High         1
Low          1
Close        1
Adj Close    1
Volume       1
Target       0
sentiment    0
dtype: int64

In [28]:
# show the rows with nulls in Open
train_tweets[train_tweets.Open.isnull()]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Target,sentiment
442,2017-01-02,,,,,,,0,0.0


This row contains no relevant information, so we drop it.

In [29]:
train_tweets = train_tweets.dropna()
train_tweets.isnull().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
Target       0
sentiment    0
dtype: int64

As a final step, we categorize the influence of the tweets as neutral if the sentiment is 0, positive if it is greater than 0, and negative if it is less than 0.

In [30]:
# convert sentiment to neutral, positivo and negativo
train_tweets['sentiment'] = train_tweets['sentiment'].apply(lambda x: 'neutral' if x == 0 else ('positivo' if x > 0 else 'negativo'))
train_tweets

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Target,sentiment
0,2015-01-05,10267.200195,10390.799805,9977.799805,9993.299805,9993.290039,299610800.0,1,negativo
1,2015-01-06,10040.700195,10060.799805,9871.099609,9871.099609,9871.089844,282855400.0,0,negativo
2,2015-01-07,9937.299805,10051.200195,9836.400391,9891.400391,9891.390625,290122400.0,0,neutral
3,2015-01-08,10053.200195,10143.000000,9970.299805,10115.000000,10114.990234,320452300.0,0,neutral
4,2015-01-09,10080.000000,10080.000000,9610.099609,9719.000000,9718.990234,789490200.0,1,neutral
...,...,...,...,...,...,...,...,...,...
1018,2019-05-24,9150.299805,9211.099609,9141.400391,9174.599609,9174.599609,121673100.0,0,negativo
1019,2019-05-27,9225.900391,9294.599609,9204.700195,9216.400391,9216.400391,60178000.0,0,negativo
1020,2019-05-28,9220.400391,9224.900391,9132.900391,9191.799805,9191.799805,218900800.0,0,negativo
1021,2019-05-29,9113.200195,9116.700195,9035.099609,9080.500000,9080.500000,148987100.0,0,negativo
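
The same three-way rule can be expressed without a nested lambda. This is a sketch on made-up scores using `np.select`, keeping the notebook's label strings:

```python
import numpy as np
import pandas as pd

scores = pd.Series([-0.26335, 0.0, 0.4])   # made-up daily scores

# first matching condition wins; anything else (exactly 0) falls back to "neutral"
labels = pd.Series(
    np.select([scores > 0, scores < 0], ["positivo", "negativo"], default="neutral"),
    index=scores.index,
)
```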


In [31]:
# save the data to a pickle
train_tweets.to_pickle('data/train_tweets.pkl')

## Preprocessing

In [32]:
# convert sentiment to dummy variables
train_tweets = pd.get_dummies(train_tweets, columns=['sentiment'], drop_first=True)
train_tweets

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Target,sentiment_neutral,sentiment_positivo
0,2015-01-05,10267.200195,10390.799805,9977.799805,9993.299805,9993.290039,299610800.0,1,0,0
1,2015-01-06,10040.700195,10060.799805,9871.099609,9871.099609,9871.089844,282855400.0,0,0,0
2,2015-01-07,9937.299805,10051.200195,9836.400391,9891.400391,9891.390625,290122400.0,0,1,0
3,2015-01-08,10053.200195,10143.000000,9970.299805,10115.000000,10114.990234,320452300.0,0,1,0
4,2015-01-09,10080.000000,10080.000000,9610.099609,9719.000000,9718.990234,789490200.0,1,1,0
...,...,...,...,...,...,...,...,...,...,...
1018,2019-05-24,9150.299805,9211.099609,9141.400391,9174.599609,9174.599609,121673100.0,0,0,0
1019,2019-05-27,9225.900391,9294.599609,9204.700195,9216.400391,9216.400391,60178000.0,0,0,0
1020,2019-05-28,9220.400391,9224.900391,9132.900391,9191.799805,9191.799805,218900800.0,0,0,0
1021,2019-05-29,9113.200195,9116.700195,9035.099609,9080.500000,9080.500000,148987100.0,0,0,0
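
With `drop_first=True`, the first category in sorted order (`negativo`) is removed and becomes the baseline, encoded as zeros in both remaining columns. A sketch on a made-up column (`dtype=int` is passed here only so the output shows 0/1 regardless of the pandas version):

```python
import pandas as pd

toy = pd.DataFrame({"sentiment": ["negativo", "neutral", "positivo", "neutral"]})

# "negativo" is dropped and acts as the all-zeros baseline
dummies = pd.get_dummies(toy, columns=["sentiment"], drop_first=True, dtype=int)
```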


In [33]:
# standardization of the numeric variables
num_cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

# apply StandardScaler to the numeric variables
scaler = StandardScaler()
train_tweets[num_cols] = scaler.fit_transform(train_tweets[num_cols])

train_tweets

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Target,sentiment_neutral,sentiment_positivo
0,2015-01-05,0.635807,0.709710,0.378691,0.316976,0.316969,0.722081,1,0,0
1,2015-01-06,0.368240,0.319783,0.252780,0.173236,0.173228,0.545958,0,0,0
2,2015-01-07,0.246093,0.308440,0.211833,0.197115,0.197108,0.622344,0,1,0
3,2015-01-08,0.383007,0.416910,0.369840,0.460128,0.460121,0.941154,0,1,0
4,2015-01-09,0.414666,0.342469,-0.055212,-0.005674,-0.005682,5.871398,1,1,0
...,...,...,...,...,...,...,...,...,...,...
1018,2019-05-24,-0.683598,-0.684222,-0.608297,-0.646035,-0.646033,-1.148293,0,0,0
1019,2019-05-27,-0.594290,-0.585559,-0.533601,-0.596867,-0.596864,-1.794693,0,0,0
1020,2019-05-28,-0.600787,-0.667915,-0.618328,-0.625803,-0.625801,-0.126294,0,0,0
1021,2019-05-29,-0.727424,-0.795765,-0.733737,-0.756722,-0.756719,-0.861185,0,0,0


In [34]:
prep_test = test.copy()
prep_test = prep_test.set_index('test_index')
prep_test['Date'] = pd.to_datetime(prep_test['Date']).dt.date
prep_test[num_cols] = scaler.transform(prep_test[num_cols])
prep_test

test_index,Date,Open,High,Low,Close,Adj Close,Volume
6557,2019-06-05,-0.699545,-0.728768,-0.663052,-0.674383,-0.674380,-0.758532
6558,2019-06-06,-0.661271,-0.642748,-0.613844,-0.652387,-0.652384,-0.191254
6559,2019-06-07,-0.640598,-0.624787,-0.556021,-0.573695,-0.573692,-0.843551
6560,2019-06-10,-0.525420,-0.576578,-0.482387,-0.505471,-0.505468,-1.351683
6561,2019-06-11,-0.520223,-0.540776,-0.452531,-0.519587,-0.519584,-0.906236
...,...,...,...,...,...,...,...
7278,2022-03-25,-1.671409,-1.686100,-1.617117,-1.638806,-1.638805,-0.785483
7279,2022-03-28,-1.623801,-1.541354,-1.536992,-1.597637,-1.597636,-0.661734
7280,2022-03-29,-1.509687,-1.381484,-1.459935,-1.304746,-1.304744,0.282719
7281,2022-03-30,-1.353400,-1.409369,-1.354675,-1.380027,-1.380026,-0.478550
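
The test set is transformed with the mean and standard deviation learned from the training rows, so its values are expressed relative to the training period. A sketch on made-up numbers, including the same computation by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_vals = np.array([[1.0], [2.0], [3.0]])   # made-up training column
test_vals = np.array([[4.0]])                  # made-up test value

toy_scaler = StandardScaler().fit(train_vals)  # learns mean and std from train only
z_train = toy_scaler.transform(train_vals)
z_test = toy_scaler.transform(test_vals)       # reuses the train statistics

# same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (test_vals - train_vals.mean()) / train_vals.std()
```

This is consistent with the strongly negative test values above: they are measured against the average level of the training rows.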


In [35]:
X = train_tweets.drop(['Target'], axis=1).set_index('Date')
y = train_tweets['Target']

## Modeling

### Random Forest Classifier

In [37]:
# Hyperparameter grid to evaluate
# ==============================================================================
param_grid = {'n_estimators': [150],
              'max_depth'   : [None, 3, 10, 20],
              'criterion'   : ['gini', 'entropy']
             }

# Grid search with cross-validation
# ==============================================================================
grid = GridSearchCV(
        estimator  = RandomForestClassifier(random_state = 123),
        param_grid = param_grid,
        scoring    = 'accuracy',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits=5, n_repeats=3, random_state=123), 
        refit      = True,
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X, y = y)

# Results
# ==============================================================================
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head()

Unnamed: 0,param_criterion,param_max_depth,param_n_estimators,mean_test_score,std_test_score,mean_train_score,std_train_score
2,gini,10.0,150,0.58513,0.025883,0.970237,0.003684
6,entropy,10.0,150,0.582522,0.028284,0.953115,0.006213
3,gini,20.0,150,0.579912,0.03371,1.0,0.0
7,entropy,20.0,150,0.578599,0.027613,1.0,0.0
4,entropy,,150,0.575674,0.032995,1.0,0.0


In [38]:
# Best hyperparameters by cross-validation
# ==============================================================================
print("----------------------------------------")
print("Best hyperparameters found (cv)")
print("----------------------------------------")
print(grid.best_params_, ":", grid.best_score_, grid.scoring)

----------------------------------------
Best hyperparameters found (cv)
----------------------------------------
{'criterion': 'gini', 'max_depth': 10, 'n_estimators': 150} : 0.585129921887454 accuracy


In [39]:
random_forest = grid.best_estimator_
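
`RepeatedKFold` shuffles the rows, so some folds are validated on days that precede the days used for training. As a hedged alternative, not what this notebook uses, scikit-learn's `TimeSeriesSplit` keeps every validation fold strictly after its training fold. A sketch on made-up data, assuming the rows are sorted by date:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 4))        # made-up features, assumed ordered in time
y_toy = rng.integers(0, 2, size=200)     # made-up binary target

tscv = TimeSeriesSplit(n_splits=5)
# every validation index comes after all training indices of the same fold
ordered = all(tr.max() < va.min() for tr, va in tscv.split(X_toy))

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, max_depth=10, random_state=123),
    X_toy, y_toy, cv=tscv, scoring="accuracy",
)
```

The same `tscv` object could be passed as the `cv` argument of `GridSearchCV` in place of `RepeatedKFold`.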

### Gradient Boosting Classifier
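
Before launching the search in the cell below, it can help to estimate its cost. This sketch copies the same grid and uses `ParameterGrid` (already imported above) to count the candidate combinations and the resulting number of model fits:

```python
from sklearn.model_selection import ParameterGrid

# same grid as in the gradient boosting cell
param_grid = {'n_estimators'  : [50, 100, 500, 1000],
              'max_features'  : ['auto', 'sqrt', 'log2'],
              'max_depth'     : [None, 1, 3, 5, 10, 20],
              'subsample'     : [0.5, 1],
              'learning_rate' : [0.001, 0.01, 0.1]
             }

n_candidates = len(ParameterGrid(param_grid))   # 4 * 3 * 6 * 2 * 3 combinations
n_fits = n_candidates * 3 * 1                   # n_splits=3, n_repeats=1
print(n_candidates, n_fits)
```

That is 432 candidates and 1296 fits, plus one final refit, several of them with 1000 estimators and unlimited depth.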

In [None]:
# Hyperparameter grid to evaluate
# ==============================================================================
param_grid = {'n_estimators'  : [50, 100, 500, 1000],
              'max_features'  : ['auto', 'sqrt', 'log2'],
              'max_depth'     : [None, 1, 3, 5, 10, 20],
              'subsample'     : [0.5, 1],
              'learning_rate' : [0.001, 0.01, 0.1]
             }

# Grid search with cross-validation
# ==============================================================================
grid = GridSearchCV(
        estimator  = GradientBoostingClassifier(random_state=123),
        param_grid = param_grid,
        scoring    = 'accuracy',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits=3, n_repeats=1, random_state=123), 
        refit      = True,
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X, y = y)

# Results
# ==============================================================================
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head()