## Master's in Data Science

### Machine Learning

Contact: angel.blanco@cunef.edu

# **Model tuning Logistic Regression**

This notebook carries out the hyperparameter search and tuning for the selected model type.

### Libraries

In [2]:
import os
from pathlib import Path

# Change the working directory to the project root
current_dir = Path.cwd()

if current_dir.name == "notebooks":
    os.chdir(current_dir.parent)


# Processing
import pandas as pd
import numpy as np

# Preliminary model evaluation
from sklearn.metrics import (
    make_scorer,
    fbeta_score
)

# Project helper functions
from src.metrics import get_metrics
from src.data import read_train, read_val, read_test
from src.models import write_model

# Models
from sklearn.linear_model import LogisticRegression

# Hyperparameter search
from sklearn.model_selection import RandomizedSearchCV

# Execution time
import time


# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
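
Note that `warnings.filterwarnings('ignore')` silences every warning for the rest of the session, including convergence and failed-fit warnings raised during a hyperparameter search. A narrower alternative is to suppress a single category only inside a `with` block. The sketch below assumes a made-up `noisy_fit` function standing in for a model fit; it is not part of this project.

```python
import warnings


def noisy_fit() -> int:
    """Hypothetical stand-in for a model fit that emits a warning."""
    warnings.warn("solver did not converge", RuntimeWarning)
    return 1


# Suppress RuntimeWarning only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    quiet_result = noisy_fit()

# Outside the block the warning is emitted again; record it to show that.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_fit()

print(quiet_result, loud_result, len(caught))
```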

### Data loading

In [3]:
x_train, y_train = read_train()
x_val, y_val = read_val()
x_test, y_test = read_test()

In [4]:
results_before = pd.read_csv('tables/metrics.csv')

### Logistic Regression model

First, we run a hyperparameter search to optimize the model.

I used a random search because it evaluates only a fixed number of sampled combinations, which makes it much cheaper in time than a grid search. The trade-off is that it may not find a configuration that beats the untuned model.
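
To put the time argument in numbers, the sketch below counts the cross-validation fits each strategy would need for the search space defined in the next cell. It uses plain Python lists as stand-ins for the real values: the `C` list only mimics the length of `np.logspace(-4, 4, 20)`, and the final refit on the full training set is not counted.

```python
import math

# Stand-in for the search space; only the number of candidate values
# per hyperparameter matters for this count.
space = {
    "penalty": ["l1", "l2", "elasticnet"],
    "C": [10 ** (-4 + 8 * i / 19) for i in range(20)],  # 20 values, like np.logspace(-4, 4, 20)
    "solver": ["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"],
    "multi_class": ["auto", "ovr", "multinomial"],
    "max_iter": [5000],
}

cv_folds = 5  # cv=5 in the search
n_iter = 50   # n_iter=50 in the search

# A grid search tries every combination; a random search tries n_iter of them.
grid_candidates = math.prod(len(values) for values in space.values())
grid_fits = grid_candidates * cv_folds
random_fits = n_iter * cv_folds

print(f"grid search:   {grid_candidates} candidates, {grid_fits} fits")
print(f"random search: {n_iter} candidates, {random_fits} fits")
```

Under these assumptions the grid has 1,080 candidates (5,400 fits) against 50 candidates (250 fits) for the random search.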

I fix the hyperparameter values the search should sample from:

In [5]:
model = LogisticRegression()

f2_scorer = make_scorer(fbeta_score, beta=2)

parametros = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'C': np.logspace(-4, 4, 20),
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
    'multi_class': ['auto', 'ovr', 'multinomial'],
    'max_iter': [5000]
}
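
Not every combination in this space is valid: each solver supports only some penalties, and scikit-learn raises an error when an unsupported pair is fitted. With the default `error_score=np.nan` of `RandomizedSearchCV`, such candidates are scored as NaN instead of aborting the search, so part of the 50 sampled iterations may be wasted. The sketch below counts the usable (solver, penalty) pairs. The `SUPPORTED` table is my transcription of the compatibility list in the `LogisticRegression` documentation and should be checked against the installed scikit-learn version.

```python
from itertools import product

# Assumed solver -> supported penalties, restricted to the penalties searched here.
SUPPORTED = {
    "lbfgs": {"l2"},
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2"},
    "newton-cholesky": {"l2"},
    "sag": {"l2"},
    "saga": {"l1", "l2", "elasticnet"},
}
penalties = ["l1", "l2", "elasticnet"]
solvers = list(SUPPORTED)

# Every pair the search can sample, and the subset the solvers accept.
all_pairs = list(product(solvers, penalties))
valid_pairs = [(s, p) for s, p in all_pairs if p in SUPPORTED[s]]

print(f"{len(valid_pairs)} of {len(all_pairs)} (solver, penalty) pairs are valid")
```

In addition, `elasticnet` needs an `l1_ratio` value, which the space above does not set. `param_distributions` also accepts a list of dictionaries, so one option would be a dictionary per solver containing only its supported penalties.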

In [6]:
random_search = RandomizedSearchCV(
    model,
    param_distributions=parametros,
    n_iter=50,
    scoring=f2_scorer,
    cv=5,
    random_state=34,
)

random_search.fit(x_train, y_train)

print(random_search.best_params_)
print('------------------')
print(f'best f2_score: {random_search.best_score_}')

{'solver': 'sag', 'penalty': 'l2', 'multi_class': 'multinomial', 'max_iter': 5000, 'C': 11.288378916846883}
------------------
best f2_score: 0.013221928339172212
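
For reference, the F2 score reported here is the F-beta score with beta = 2, which weights recall more heavily than precision: F_beta = (1 + beta²) · P · R / (beta² · P + R). A small sketch in plain Python follows. The helper name `fbeta_from_pr` is made up for illustration and is not part of `src.metrics`; the inputs are the validation precision and recall from the result tables further down in this notebook.

```python
def fbeta_from_pr(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Validation precision/recall taken from the tables in this notebook.
untuned_f2 = fbeta_from_pr(precision=0.0812, recall=0.3472)
tuned_f2 = fbeta_from_pr(precision=0.7, recall=0.0124)

print(round(untuned_f2, 4), round(tuned_f2, 4))
```

Up to rounding of the inputs, both values should agree with the `f2_score` column of the validation rows.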


In [7]:
write_model(model=random_search, name="random_search")

In [8]:
final_model = LogisticRegression(**random_search.best_params_)  # ** unpacks the dict into keyword arguments

# Compute the metrics on both sets to check for overfitting
sets = {
    "train": {"x": x_train, "y": y_train},
    "validation": {"x": x_val, "y": y_val}
}

rows = []

name = final_model.__class__.__name__
start_train = time.perf_counter()
final_model.fit(x_train, y_train)
end_train = time.perf_counter()
    
eta_train = end_train - start_train
    
model_path = f'../models/{name}.pkl'

write_model(model=final_model, name=name)
    
print(f'Saved model "{name}" at "{model_path}"')
    
# Compute the metrics for both sets
for set_name, set_data in sets.items():

    start_predict = time.perf_counter()
    y_pred = final_model.predict(set_data["x"])
    end_predict = time.perf_counter()

    eta_predict = end_predict - start_predict

    y_prob = final_model.predict_proba(set_data["x"])

    # Merge the two dictionaries (get_metrics returns a dict; `|` needs Python 3.9+)
    metrics = {
        "model": name,
        "set": set_name,
        "training_time (min)": eta_train / 60,
        "predict_time (sec)": eta_predict,
    } | get_metrics(y_true=set_data["y"], y_pred=y_pred, y_prob=y_prob)

    # pandas expects list-like values when building a DataFrame from a dict
    metrics = {key: [value] for key, value in metrics.items()}

    rows.append(pd.DataFrame(metrics))

all_metrics = pd.concat(rows).round(4)

Saved model "LogisticRegression" at "../models/LogisticRegression.pkl"
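
The timing pattern in the cell above (one `time.perf_counter()` call before and one after each step) could be wrapped in a small helper. This is only a sketch: `timed` is a hypothetical name, not something defined in `src`, and `sum` over a range stands in for `fit` or `predict`.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook this would be e.g. final_model.predict.
total, seconds = timed(sum, range(1_000_000))

# The elapsed time can be merged into a metrics row with the dict union operator.
row = {"model": "Example", "set": "train"} | {"predict_time (sec)": seconds}
print(row)
```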


### Before tuning:

In [13]:
# Filter with a query to keep only this model
results_before.query("model == 'LogisticRegression'")

| | model | set | training_time (min) | predict_time (sec) | accuracy | precission | recall | f1_score | f2_score | ROC_0 | ROC_1 | tuned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | train | 0.1561 | 0.0423 | 0.9164 | 0.8173 | 0.7494 | 0.7819 | 0.7621 | 0.0433 | 0.9567 | False |
| 1 | LogisticRegression | validation | 0.1561 | 0.0077 | 0.9495 | 0.0812 | 0.3472 | 0.1317 | 0.2098 | 0.1591 | 0.8409 | False |


In [14]:
results_before["tuned"] = False

### After tuning:

In [15]:
final_results = all_metrics.round(4)
final_results["tuned"] = True
final_results

| | model | set | training_time (min) | predict_time (sec) | accuracy | precission | recall | f1_score | f2_score | ROC_0 | ROC_1 | tuned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | train | 7.6484 | 0.0698 | 0.989 | 0.5568 | 0.0108 | 0.0213 | 0.0135 | 0.1247 | 0.8753 | True |
| 0 | LogisticRegression | validation | 7.6484 | 0.0945 | 0.9891 | 0.7 | 0.0124 | 0.0244 | 0.0154 | 0.1299 | 0.8701 | True |


In [16]:
pd.concat([
    results_before.query("model == 'LogisticRegression'"), final_results
]).query("set == 'validation'")

| | model | set | training_time (min) | predict_time (sec) | accuracy | precission | recall | f1_score | f2_score | ROC_0 | ROC_1 | tuned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LogisticRegression | validation | 0.1561 | 0.0077 | 0.9495 | 0.0812 | 0.3472 | 0.1317 | 0.2098 | 0.1591 | 0.8409 | False |
| 0 | LogisticRegression | validation | 7.6484 | 0.0945 | 0.9891 | 0.7 | 0.0124 | 0.0244 | 0.0154 | 0.1299 | 0.8701 | True |


The tuned model does not improve on the baseline. On validation, F2 drops from 0.2098 to 0.0154 and recall from 0.3472 to 0.0124, even though accuracy and precision go up, and training time rises from 0.1561 to 7.6484 minutes. Because the search is random, the sampled combinations were probably unlucky, and the result does not justify the extra training time. We therefore keep the model without tuning.

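It is also worth noting that the untuned model overfits: its F2 is 0.7621 on train but only 0.2098 on validation. The tuned model scores about equally low on both sets (0.0135 vs 0.0154), so its problem looks more like underfitting than overfitting.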