## Master's in Data Science

### Machine Learning

Contact: angel.blanco@cunef.edu

# **Model Tuning: Gradient Boosting Classifier**

This notebook runs an additional hyperparameter search and tuning pass for the Gradient Boosting Classifier, in case it gives better results than the Logistic Regression.

### Libraries

In [1]:
import os
from pathlib import Path

# Change the working directory to the project root
current_dir = Path.cwd()

if current_dir.name == "notebooks":
    os.chdir(current_dir.parent)

# Data processing
import pandas as pd

# Preliminary model evaluation
from sklearn.metrics import (
    make_scorer, 
    fbeta_score
)

# Project helper functions
from src.metrics import get_metrics
from src.data import read_train, read_val, read_test
from src.models import write_model

# Models
from sklearn.ensemble import GradientBoostingClassifier

# Hyperparameter search
from sklearn.model_selection import RandomizedSearchCV

# Execution time
import time


# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

### Data loading

In [3]:
x_train, y_train = read_train()
x_val, y_val = read_val()
x_test, y_test = read_test()

In [53]:
results_before = pd.read_csv('tables/metrics.csv')

### GradientBoostingClassifier model

In [8]:
model = GradientBoostingClassifier()

f2_scorer = make_scorer(fbeta_score, beta=2)

parametros = {
    'n_estimators': [100, 200, 300, 400],
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [20, 50, 100, 150],
    'max_features': [1.0, 0.3, 0.1]
}
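The `f2_scorer` above uses `beta=2`, which makes a false negative count four times as much as a false positive. The following is a minimal sketch of that weighting, written from the standard F-beta definition in terms of confusion-matrix counts. The helper name `fbeta_from_counts` and the counts are hypothetical, chosen only for illustration; they are not taken from this dataset.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion-matrix counts; returns 0.0 when the score is undefined."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts: the same number of errors, distributed differently.
more_false_positives = dict(tp=8, fp=4, fn=2)
more_false_negatives = dict(tp=8, fp=2, fn=4)

for label, counts in [("more FP", more_false_positives),
                      ("more FN", more_false_negatives)]:
    f1 = fbeta_from_counts(**counts, beta=1)
    f2 = fbeta_from_counts(**counts, beta=2)
    print(f"{label}: F1={f1:.4f}  F2={f2:.4f}")
```

Swapping the false positives and false negatives leaves F1 unchanged but lowers F2, which is why F2 is a reasonable target when missed positives are the costlier error.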


In [19]:
random_search = RandomizedSearchCV(model, n_iter=10, param_distributions=parametros, scoring=f2_scorer, cv=2)

random_search.fit(x_train, y_train)

print(random_search.best_params_)
print('------------------')
print(f'best f2_score: {random_search.best_score_}')

{'n_estimators': 300, 'min_samples_leaf': 100, 'max_features': 0.3, 'max_depth': 8, 'learning_rate': 0.1}
------------------
best f2_score: 0.9606112514660967


In [62]:
print(f'best f2_score: {random_search.best_score_:.4f}')

best f2_score: 0.9606
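To put the search above in context, it helps to see how much of the hyperparameter space `n_iter=10` actually covers. The sketch below copies the `parametros` dictionary and the `n_iter` and `cv` values from the cells above and counts the combinations using only the standard library. The variable names `n_combinations`, `coverage`, and `n_fits` are introduced here for illustration, and the final refit assumes the `RandomizedSearchCV` default of `refit=True`.

```python
import math

# Same search space as in the cell that defines `parametros`.
parametros = {
    'n_estimators': [100, 200, 300, 400],
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [20, 50, 100, 150],
    'max_features': [1.0, 0.3, 0.1],
}
n_iter = 10  # candidates sampled by the random search
cv = 2       # cross-validation folds

# Total number of distinct hyperparameter combinations in the grid.
n_combinations = math.prod(len(values) for values in parametros.values())

# Fraction of the grid that the random search samples.
coverage = n_iter / n_combinations

# One fit per candidate and fold, plus one refit on the full training set
# (assumes the default refit=True).
n_fits = n_iter * cv + 1

print(f"combinations: {n_combinations}")
print(f"coverage: {coverage:.2%}")
print(f"model fits: {n_fits}")
```

With only a small fraction of the grid sampled and two folds per candidate, the best configuration found may vary noticeably between runs unless `random_state` is fixed.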


In [20]:
write_model(model=random_search, name="random_search_gbc")

In [56]:
# Hyperparameters taken from the best_params_ printed by the random search
final_model = GradientBoostingClassifier(
    n_estimators=300,
    min_samples_leaf=100,
    max_features=0.3,
    max_depth=8,
    learning_rate=0.1,
)

rows = []

name = final_model.__class__.__name__
start = time.perf_counter()
final_model.fit(x_train, y_train)
end = time.perf_counter()
eta = end - start
y_pred = final_model.predict(x_test)
y_prob = final_model.predict_proba(x_test)
    
write_model(model=final_model, name="final_model_gbc")
print(f'saved model {name}')

metrics = get_metrics(y_true=y_test, y_pred=y_pred, y_prob=y_prob)
    
metrics = {"model": name, "time(min)": eta / 60} | {key: [value] for key, value in metrics.items()}
rows.append(pd.DataFrame(metrics))

all_metrics = pd.concat(rows)

saved model final_model


### Before tuning:

In [54]:
results_before = results_before.iloc[[3]]
results_before

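|   | model | time(min) | accuracy | precission | recall | f1_score | f2_score | ROC_0 | ROC_1 |
|---|-------|-----------|----------|------------|--------|----------|----------|-------|-------|
| 3 | GradientBoostingClassifier | 3.1345 | 0.9836 | 0.2017 | 0.174 | 0.1868 | 0.1789 | 0.1213 | 0.8787 |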


### After tuning:

In [57]:
final_results = all_metrics.round(4)
final_results

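|   | model | time(min) | accuracy | precission | recall | f1_score | f2_score | ROC_0 | ROC_1 |
|---|-------|-----------|----------|------------|--------|----------|----------|-------|-------|
| 0 | GradientBoostingClassifier | 6.7927 | 0.989 | 0.5 | 0.0009 | 0.0018 | 0.0011 | 0.1524 | 0.8476 |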


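The tuned model has a worse F2 score (0.0011 on the test set, versus 0.1789 recorded before tuning) and takes roughly twice as long to train (6.79 versus 3.13 minutes). The cross-validated F2 score of 0.9606 from the random search did not carry over to the test set, where recall fell to 0.0009. I have decided to keep the Logistic Regression.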
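As a consistency check on the two result tables, the F2 score can be recomputed from the reported precision and recall. This is a small sketch using the standard F-beta formula; the function name `f2_from_precision_recall` and the `reported` dictionary are introduced here for illustration. Because the inputs are already rounded to four decimals, agreement is only expected up to rounding.

```python
def f2_from_precision_recall(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Values copied from the "Before tuning" and "After tuning" tables.
reported = {
    "before tuning": {"precision": 0.2017, "recall": 0.174, "f2_score": 0.1789},
    "after tuning": {"precision": 0.5, "recall": 0.0009, "f2_score": 0.0011},
}

recomputed = {
    label: f2_from_precision_recall(row["precision"], row["recall"])
    for label, row in reported.items()
}

for label, row in reported.items():
    print(f"{label}: recomputed={recomputed[label]:.4f}  reported={row['f2_score']:.4f}")
```

The recomputed values agree with the tables, so the drop in F2 is driven by the collapse in recall rather than by an error in how the metric was recorded.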