Notebook for comparing the models trained on the weighted dataset (P2)

- **Backlog task:** https://github.com/UCM-GIDIA-PD1/c2425-R4/issues/46
- **Purpose of the code:** Compare the models trained on the weighted fights dataframe (P2).
- **Author(s):** Carlos Vallejo.
- **Description and usage:** The goal of this notebook is to determine which model gives the best results.

In [32]:
import os
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, TimeSeriesSplit, cross_val_score
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score, log_loss,
    f1_score, roc_auc_score, precision_score, recall_score, make_scorer,
    balanced_accuracy_score,
)
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.inspection import permutation_importance
import mlflow
import optuna
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


### Data loading and preprocessing

This step is shared by all the models below.

In [4]:
ruta_train =  os.path.join("..","..","..", "data", "P2","train.parquet")
train = pd.read_parquet(ruta_train)

ruta_test =  os.path.join("..","..","..", "data", "P2","test.parquet")
test = pd.read_parquet(ruta_test)

test.head()

Unnamed: 0,DATE,Peleador_A,Peleador_B,WINNER,KD_A,KD_B,SIG_STR_A,SIG_STR_B,TD_PORC_A,TD_PORC_B,...,Puntos_A,Puntos_B,Peleas_A,Peleas_B,KD_DIFF,SIG_STR_DIFF,TD_DIFF,SUB_ATT_DIFF,REV_DIFF,CTRL_DIFF
0,2023-02-04,Derrick Lewis,Serghei Spivac,True,0.4,0.0,0.5864,0.3636,0.12,0.45,...,189.61206,80.3633,25,9,0.4,0.2228,-0.307248,-0.36,0.0,-136.44
1,2023-02-04,Dooho Choi,Kyle Nelson,True,0.36,0.0,0.4824,0.6552,0.08,0.0,...,0.0,0.0,6,5,0.36,-0.1728,0.133333,0.0,0.0,-61.2
2,2023-02-04,Marcin Tybura,Blagoy Ivanov,False,0.0,0.0,0.4276,0.5056,0.2,0.2792,...,139.896213,2.207734e-07,16,6,0.0,-0.078,-0.188071,-0.4,0.0,-56.72
3,2023-02-11,Tyson Pedro,Modestas Bukauskas,True,0.6,0.0,0.6544,0.4112,0.0,0.0,...,24.394312,0.0,8,4,0.6,0.2432,0.0,0.4,0.0,16.88
4,2023-02-11,Islam Makhachev,Alexander Volkanovski,False,0.36,0.24,0.63,0.5844,0.7576,0.12,...,322.872251,534.2818,13,12,0.12,0.0456,0.130692,1.16,-0.8,5.28
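Since the models below are validated with `TimeSeriesSplit`, it can be worth confirming that the two files share the same columns and that the training fights precede the test fights. The following is a minimal sketch on toy data; the helper `check_split` and the toy frames are hypothetical, and the assumption that the P2 split is chronological is ours, not something the notebook states.

```python
import pandas as pd


def check_split(train_df, test_df, date_col="DATE"):
    """Return basic consistency checks for a train/test pair (hypothetical helper)."""
    same_columns = list(train_df.columns) == list(test_df.columns)
    train_end = pd.to_datetime(train_df[date_col]).max()
    test_start = pd.to_datetime(test_df[date_col]).min()
    # Chronological means no training fight happens after the first test fight.
    return {
        "same_columns": same_columns,
        "chronological": bool(train_end <= test_start),
    }


# Toy frames standing in for train.parquet / test.parquet (illustrative values).
toy_train = pd.DataFrame({"DATE": ["2022-12-10", "2023-01-21"], "WINNER": [True, False]})
toy_test = pd.DataFrame({"DATE": ["2023-02-04", "2023-02-11"], "WINNER": [True, False]})

report = check_split(toy_train, toy_test)
print(report)
```

On the real data the call would be `check_split(train, test)`.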


## Decision tree

First we select the variables to use. For the tree we keep every column except the fighters' names, the fight date, and the target (`WINNER`).

In [None]:
columnasQuitar = ["DATE","Peleador_A","Peleador_B", "WINNER"]

X_train_arbol = train.drop(columns=columnasQuitar)
y_train_arbol = train['WINNER']
X_test_arbol = test.drop(columns=columnasQuitar)
y_test_arbol = test['WINNER']

Now we build the model with the hyperparameters that produced the best decision tree. Each entry in the grid holds a single value, so the search only cross-validates that one configuration.

In [16]:
# Balanced class weights (computed for reference; the grid below uses class_weight=None)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train_arbol), y=y_train_arbol)
class_weight_dict = {i: class_weights[i] for i in range(len(class_weights))}
tscv = TimeSeriesSplit(n_splits=5)

# Base model
dt = DecisionTreeClassifier(random_state=42)

# Hyperparameter grid (one value per parameter: the best configuration found)
param_grid = {
    'criterion': ['gini'],  # Function used to measure split quality
    'max_depth': [5],  # Maximum depth of the tree
    'min_samples_split': [2],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1],  # Minimum number of samples in a leaf
    'min_weight_fraction_leaf': [0.10],  # Minimum weighted fraction of samples in a leaf
    'splitter': ['best'],
    'class_weight': [None],  # Class weighting (None = all classes weighted equally)
    'max_features': [None]
}

f1_scorer = make_scorer(f1_score, average='macro')

# Hyperparameter search with time-series cross-validation
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    scoring=f1_scorer,
    cv=tscv,
    n_jobs=-1,
    verbose=2
)

# Fit the model
arbol = grid_search.fit(X_train_arbol, y_train_arbol)


Fitting 5 folds for each of 1 candidates, totalling 5 fits
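The five fits reported above come from `TimeSeriesSplit(n_splits=5)`, which always validates on samples that come after the ones used for training. As a rough illustration, the sketch below lists the folds for a toy array of 12 ordered samples; the array and the names `X_toy`, `tscv_toy` and `folds` are ours and only serve the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 samples assumed to be sorted chronologically.
X_toy = np.arange(12).reshape(-1, 1)

tscv_toy = TimeSeriesSplit(n_splits=5)
folds = list(tscv_toy.split(X_toy))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Each training window grows; the validation block always lies after it.
    print(f"fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

Because every validation block lies strictly after its training window, the rows of `train` need to be in chronological order for this scheme to be meaningful.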


## XGBoost

The selected variables are the same as for the decision tree.

In [18]:
columnasQuitar = ["DATE","Peleador_A","Peleador_B", "WINNER"]

X_train_xgboost = train.drop(columns=columnasQuitar)
y_train_xgboost = train['WINNER']
X_test_xgboost = test.drop(columns=columnasQuitar)
y_test_xgboost = test['WINNER']

We build the model with the best parameters obtained in the notebook "XGBoost_pond.ipynb".

In [None]:
# Fixed parameters (best values found in the tuning notebook)
params = {
    'n_estimators': 350,
    'max_depth': 37,
    'learning_rate': 0.014525356301837976,
    'subsample': 0.6223758299538741,
    'colsample_bytree': 0.5603666759669468,
    'gamma': 0.4820740200787993,
    'min_child_weight': 31,
    'scale_pos_weight': np.float64(1.2905829596412555),
    'reg_alpha': 0.1,
    'tree_method': 'exact',
    # Note: the XGBoost docs describe grow_policy as supported only with
    # tree_method 'hist' or 'approx', so it may have no effect with 'exact'.
    'grow_policy': 'lossguide',
    'random_state': 42
}

tscv = TimeSeriesSplit(n_splits=5)
model = XGBClassifier(**params)

xgboost = model.fit(X_train_xgboost, y_train_xgboost)
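The notebook does not show how the `scale_pos_weight` value was obtained. A common heuristic, suggested in the XGBoost documentation, is the ratio of negative to positive training samples. The sketch below computes that ratio on toy labels; the helper name `scale_pos_weight_from_labels` is hypothetical, and whether the value used above was derived this way is an assumption.

```python
import numpy as np


def scale_pos_weight_from_labels(y):
    """Return n_negative / n_positive for a binary label vector (hypothetical helper)."""
    y = np.asarray(y, dtype=bool)
    n_pos = int(y.sum())
    n_neg = int(y.size - n_pos)
    if n_pos == 0:
        raise ValueError("y contains no positive samples")
    return n_neg / n_pos


# Toy labels: 3 negatives and 2 positives.
y_toy = np.array([True, False, False, True, False])
ratio = scale_pos_weight_from_labels(y_toy)
print(ratio)
```

With the notebook's data the call would be `scale_pos_weight_from_labels(y_train_xgboost)`.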

    

## Logistic regression

For logistic regression we do drop additional variables (the selection process is documented in "regresion_logistica_P2_telmo.ipynb").

In [30]:
columnasQuitar = [
    "DATE", "Peleador_A", "Peleador_B", "WINNER",
    'TD_B_y', 'GRAPPLER_B', 'STR_CLINCH_A_y', 'GRAPPLER_A',
    'STR_HEAD_B_y', 'TD_A_y', 'STR_BODY_B_y', 'STR_HEAD_B_x',
    'STR_DISTANCE_A_x', 'STR_HEAD_A_y', 'TOTAL_STR_B_y', 'STR_HEAD_A_x',
    'STR_DISTANCE_A_y', 'STR_CLINCH_B_y', 'TOTAL_STR_A_y', 'STR_GROUND_B_y',
    'STR_DISTANCE_B_x', 'STR_DISTANCE_B_y', 'TD_A_x', 'STR_GROUND_A_y',
    'STR_BODY_A_y',
]

X_train_rl = train.drop(columns=columnasQuitar)
y_train_rl = train['WINNER']
X_test_rl = test.drop(columns=columnasQuitar)
y_test_rl = test['WINNER']

In [34]:
model = LogisticRegression(
    random_state=42,
    max_iter=1000,  # Raise the iteration limit to help convergence
    solver='saga',  # 'saga' supports both L1 and L2 penalties
    C=0.6556635377191836,  # Inverse regularization strength (tuned value)
    penalty='l1'  # L1 penalty
)

# Pipeline containing only the model (no polynomial features)
pipeline = make_pipeline(
    model
)

# Fit the model on the training data
rl = pipeline.fit(X_train_rl, y_train_rl)
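With the three models fitted, the comparison this notebook aims at comes down to scoring each one on its own test matrix. The sketch below shows one possible way to tabulate the metrics already imported at the top. The helper `evaluate_models` is hypothetical, the data is synthetic, and sorting by macro F1 is our choice (it matches the scorer used for the tree), not a result from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score, log_loss, roc_auc_score,
)
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(fitted):
    """Score fitted classifiers; `fitted` maps name -> (model, X_test, y_test)."""
    rows = []
    for name, (model, X_te, y_te) in fitted.items():
        y_pred = model.predict(X_te)
        y_proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_te, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y_te, y_pred),
            "f1_macro": f1_score(y_te, y_pred, average="macro"),
            "roc_auc": roc_auc_score(y_te, y_proba),
            "log_loss": log_loss(y_te, y_proba),
        })
    # Best macro F1 first.
    return pd.DataFrame(rows).set_index("model").sort_values("f1_macro", ascending=False)


# Synthetic stand-in for the fight data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = X_demo[:240], X_demo[240:], y_demo[:240], y_demo[240:]

demo_models = {
    "tree": (DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr), X_te, y_te),
    "logreg": (LogisticRegression(max_iter=1000).fit(X_tr, y_tr), X_te, y_te),
}
summary = evaluate_models(demo_models)
print(summary)
```

For the models above, the call would look like `evaluate_models({"arbol": (arbol, X_test_arbol, y_test_arbol), "xgboost": (xgboost, X_test_xgboost, y_test_xgboost), "rl": (rl, X_test_rl, y_test_rl)})`, since the fitted `GridSearchCV`, `XGBClassifier` and pipeline objects all expose `predict` and `predict_proba`.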


