## Structured data model
This notebook develops a first model for the Petfinder problem. We start with a very simple baseline to check whether the problem is tractable. We then examine how the proposed kappa metric behaves and inspect the confusion matrix. Finally, we run two hyperparameter optimizations: one evaluated on a train/test split, and another validated with 5-fold CV and tested on the held-out 20% of the data.

In [1]:
# Basic libraries for tables and arrays
import numpy as np
import pandas as pd

# Gradient boosting
import lightgbm as lgb

# scikit-learn helpers
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, StratifiedKFold  # Split and cross-validation
from sklearn.metrics import cohen_kappa_score, accuracy_score, balanced_accuracy_score  # Metrics
# from sklearn.utils import shuffle

# Visualization
from plotly import express as px

# Plot of the confusion matrix normalized by actuals
from utils import plot_confusion_matrix

import os

# Hyperparameter optimization
import optuna
from optuna.artifacts import FileSystemArtifactStore, upload_artifact

# Saving objects to joblib files
from joblib import load, dump


In [2]:
# Set up the folder structure as shown below.

In [3]:
# Paths for file access
# This notebook assumes the following folder structure under BASE_DIR
# (two levels above the folder where the notebook runs).
# /UA_MDM_LDI_II/
# /UA_MDM_LDI_II/input
# /UA_MDM_LDI_II/input/petfinder-adoption-prediction/            <- All competition data files go here
# /UA_MDM_LDI_II/tutoriales/                       <- Put the shared notebooks and scripts here
# /UA_MDM_LDI_II/work/                             <- Notebook results go in subfolders of this folder
# /UA_MDM_LDI_II/work/models/                     <- Trained models as joblib files
# /UA_MDM_LDI_II/work/optuna_temp_artifacts/      <- Files we want to keep as artifacts of an Optuna trial (Optuna copies them to the folder below)
# /UA_MDM_LDI_II/work/optuna_artifacts/           <- Artifact files uploaded to Optuna

# BASE_DIR should point at the folder that contains input/ and work/.
# It is left empty here, so every path below resolves relative to the current working directory.
BASE_DIR = ''

# Training data
PATH_TO_TRAIN = os.path.join(BASE_DIR, "input/petfinder-adoption-prediction/train/train.csv")

# Output folder for trained models
PATH_TO_MODELS = os.path.join(BASE_DIR, "work/models")

# Artifacts to upload to Optuna
PATH_TO_TEMP_FILES = os.path.join(BASE_DIR, "work/optuna_temp_artifacts")

# Artifacts managed by Optuna
PATH_TO_OPTUNA_ARTIFACTS = os.path.join(BASE_DIR, "work/optuna_artifacts")


SEED = 42  # Seed for random processes (so a rerun reproduces the same results)
TEST_SIZE = 0.2  # Fraction held out for test in the train/test split

# Any seed works; we hold out 20% of the dataset for test

In [4]:
# train.csv is all the labeled data we have, so both the train and test sets are built from it

In [5]:
# Tabular data
dataset = pd.read_csv(PATH_TO_TRAIN)

In [6]:
# Dataset columns
dataset.columns

Index(['Type', 'Name', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
       'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
       'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'RescuerID',
       'VideoAmt', 'Description', 'PetID', 'PhotoAmt', 'AdoptionSpeed'],
      dtype='object')

In [7]:
dataset.head()

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Health,Quantity,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed
0,2,Nibble,3,299,0,1,1,7,0,1,...,1,1,100,41326,8480853f516546f6cf33aa88cd76c379,0,Nibble is a 3+ month old ball of cuteness. He ...,86e1089a3,1.0,2
1,2,No Name Yet,1,265,0,1,1,2,0,2,...,1,1,0,41401,3082c7125d8fb66f7dd4bff4192c8b14,0,I just found it alone yesterday near my apartm...,6296e909a,2.0,0
2,1,Brisco,1,307,0,1,2,7,0,2,...,1,1,0,41326,fa90fa5b1ee11c86938398b60abc32cb,0,Their pregnant mother was dumped by her irresp...,3422e4906,7.0,3
3,1,Miko,4,307,0,2,1,2,0,2,...,1,1,150,41401,9238e4f44c71a75282e62f7136c6b240,0,"Good guard dog, very alert, active, obedience ...",5842f1ff5,8.0,2
4,1,Hunter,1,307,0,1,1,0,0,2,...,1,1,0,41326,95481e953f8aed9ec3d16fc4509537e8,0,This handsome yet cute boy is up for adoption....,850a43f90,3.0,2


## FEATURE ENGINEERING
We modify the dataset to see whether doing so improves the predictive power of the base model.

In [8]:
# Impute missing values in Name (assigning the result avoids the chained inplace call, which recent pandas versions warn about)
dataset['Name'] = dataset['Name'].fillna('No Name Yet')
dataset['Name'] = dataset['Name'].replace('', 'No Name Yet')
dataset.isnull().sum()

Type              0
Name              0
Age               0
Breed1            0
Breed2            0
Gender            0
Color1            0
Color2            0
Color3            0
MaturitySize      0
FurLength         0
Vaccinated        0
Dewormed          0
Sterilized        0
Health            0
Quantity          0
Fee               0
State             0
RescuerID         0
VideoAmt          0
Description      12
PetID             0
PhotoAmt          0
AdoptionSpeed     0
dtype: int64

In [9]:
# New variables

# Age categorization
# Create age categories based on the age of the pets (Age is in months)
bins = [0, 12, 60, float('inf')]  # Define the age bins for puppy, adult, and senior
labels = [1, 2, 3]  # 1='puppy' 2='adult' 3='senior'
dataset['AgeCategory'] = pd.cut(dataset['Age'], bins, labels=labels, right=False)

# Breed combination
# Create a new feature that indicates whether the pet is a purebred (1) or a mix (2)
dataset['Breed'] = dataset.apply(lambda row: 2 if row['Breed2'] != 0 else 1, axis=1)

# Color analysis
# Flag pets with more than one color: 1 if Color2 or Color3 is set, 0 otherwise
dataset['IsMultiColored'] = dataset.apply(lambda row: 1 if row['Color2'] != 0 or row['Color3'] != 0 else 0, axis=1)

# Name code: 1 if the pet has a name, 0 if it was listed as 'No Name Yet'
dataset['name_code'] = dataset['Name'].apply(lambda x: 1 if x != 'No Name Yet' else 0)
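
Because `right=False` makes each bin closed on the left and open on the right, a pet whose age sits exactly on a boundary falls into the older category. A minimal sketch with made-up boundary ages (the `ages` values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Same bins and labels as the cell above; the ages are made-up boundary cases.
bins = [0, 12, 60, float('inf')]
labels = [1, 2, 3]  # 1='puppy', 2='adult', 3='senior'

ages = pd.Series([0, 11, 12, 59, 60, 120])

# right=False -> intervals are [0, 12), [12, 60), [60, inf)
age_category = pd.cut(ages, bins, labels=labels, right=False)

print(pd.DataFrame({'Age': ages, 'AgeCategory': age_category}))
```

An age of exactly 12 is therefore labeled adult (2) and an age of exactly 60 is labeled senior (3).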



In [10]:
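# Scale / normalize
# Select the numerical features that you want to normalize or scale
# Note: Breed1 is a breed ID code rather than a true quantity. Min-max scaling is monotonic,
# so it does not change where a tree-based model such as LightGBM can split.
numerical_features = ['Breed1']  # Add the names of the numerical columns you want to scale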

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Apply the scaler to the selected numerical features
dataset[numerical_features] = scaler.fit_transform(dataset[numerical_features])
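
Min-max scaling maps each value to (x - min) / (max - min), so the column ends up in the range 0 to 1. A small sketch of that arithmetic, assuming for illustration that Breed1 runs from 0 to 307 (these bounds are an inference from the scaled values in the notebook output, not something the notebook states):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative Breed1 codes; 0 and 307 are ASSUMED to be the column min and max.
breed1 = np.array([[0], [265], [299], [307]], dtype=float)

scaled = MinMaxScaler().fit_transform(breed1)

# Same computation by hand: (x - min) / (max - min)
manual = (breed1 - breed1.min()) / (breed1.max() - breed1.min())

print(np.round(scaled.ravel(), 6))
```

Under that assumption, codes 265 and 299 map to roughly 0.863 and 0.974.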

In [11]:
dataset.head()

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed,AgeCategory,Breed,IsMultiColored,name_code
0,2,Nibble,3,0.973941,0,1,1,7,0,1,...,8480853f516546f6cf33aa88cd76c379,0,Nibble is a 3+ month old ball of cuteness. He ...,86e1089a3,1.0,2,1,1,1,1
1,2,No Name Yet,1,0.863192,0,1,1,2,0,2,...,3082c7125d8fb66f7dd4bff4192c8b14,0,I just found it alone yesterday near my apartm...,6296e909a,2.0,0,1,1,1,0
2,1,Brisco,1,1.0,0,1,2,7,0,2,...,fa90fa5b1ee11c86938398b60abc32cb,0,Their pregnant mother was dumped by her irresp...,3422e4906,7.0,3,1,1,1,1
3,1,Miko,4,1.0,0,2,1,2,0,2,...,9238e4f44c71a75282e62f7136c6b240,0,"Good guard dog, very alert, active, obedience ...",5842f1ff5,8.0,2,1,1,1,1
4,1,Hunter,1,1.0,0,1,1,0,0,2,...,95481e953f8aed9ec3d16fc4509537e8,0,This handsome yet cute boy is up for adoption....,850a43f90,3.0,2,1,1,0,1


In [12]:
import pandas as pd
import lightgbm as lgb

def add_new_variables_with_lgbm(dataset):
    # Select the features and target variable
    numeric_features = ['Age', 'VideoAmt', 'PhotoAmt', 'AgeCategory', 'IsMultiColored', 'name_code']
    categorical_features = ['Type', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3']

    # One-hot encode the categorical columns (they are stored as numeric codes)
    encoded_features = pd.get_dummies(dataset[categorical_features], columns=categorical_features, prefix=categorical_features)

    # Concatenate the encoded features with the numeric features
    processed_dataset = pd.concat([dataset[numeric_features], encoded_features], axis=1)

    # Create a training dataset
    target = dataset['AdoptionSpeed']
    train_data = lgb.Dataset(processed_dataset, label=target, params={'verbose': -1})

    # Define LightGBM parameters
    lgb_params = {
        'objective': 'multiclass',
        'num_class': 5,  # Assuming AdoptionSpeed has 5 classes
        'metric': 'multi_logloss',
        # Add your other parameters here
    }

    # Train the LightGBM model
    model = lgb.train(lgb_params, train_data)

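    # Get predicted class probabilities for each sample.
    # Caution: the model predicts on the same rows it was trained on, and this runs
    # before the train/test split, so the prediction_class_* columns carry target
    # information for every row, including rows that later end up in test.
    # Scores computed downstream are therefore likely optimistic.
    predictions = model.predict(processed_dataset)

    # Add one new column per class with its predicted probability
    for class_num in range(5):  # Assuming AdoptionSpeed has 5 classes
        dataset[f'prediction_class_{class_num}'] = predictions[:, class_num]

    return dataset

# Call the function on a copy so the original DataFrame is not modified
df_with_new_variables = add_new_variables_with_lgbm(dataset.copy())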

In [13]:
df_with_new_variables

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,AdoptionSpeed,AgeCategory,Breed,IsMultiColored,name_code,prediction_class_0,prediction_class_1,prediction_class_2,prediction_class_3,prediction_class_4
0,2,Nibble,3,0.973941,0,1,1,7,0,1,...,2,1,1,1,1,0.028566,0.231616,0.358729,0.137739,0.243349
1,2,No Name Yet,1,0.863192,0,1,1,2,0,2,...,0,1,1,1,0,0.096615,0.352294,0.246250,0.123992,0.180848
2,1,Brisco,1,1.000000,0,1,2,7,0,2,...,3,1,1,1,1,0.010853,0.270040,0.413996,0.256144,0.048968
3,1,Miko,4,1.000000,0,2,1,2,0,2,...,2,1,1,1,1,0.002941,0.111016,0.247903,0.320008,0.318133
4,1,Hunter,1,1.000000,0,1,1,0,0,2,...,2,1,1,0,1,0.005587,0.343812,0.350237,0.212460,0.087903
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14988,2,No Name Yet,2,0.866450,0,3,1,0,0,2,...,2,1,1,0,0,0.004050,0.251369,0.276092,0.210277,0.258212
14989,2,Serato & Eddie,60,0.863192,264,3,1,4,7,2,...,4,3,2,1,1,0.000915,0.055371,0.165172,0.260624,0.517918
14990,2,Monkies,2,0.863192,266,3,5,6,7,3,...,3,1,2,1,1,0.011357,0.506431,0.226790,0.168735,0.086687
14991,2,Ms Daym,9,0.866450,0,2,4,7,0,1,...,4,1,1,1,1,0.012514,0.094100,0.169759,0.179496,0.544131


In [17]:
dataset=df_with_new_variables

In [18]:

# Hold out 20% for test, stratified by the target
# so that both sets keep the same class proportions
train, test = train_test_split(dataset,
                               test_size = TEST_SIZE,
                               random_state = SEED,
                               stratify = dataset.AdoptionSpeed)
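
To illustrate what `stratify` does, the sketch below splits made-up imbalanced labels and compares the class shares in each part. The label counts are invented for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels with 5 classes (counts are illustrative).
y = np.repeat([0, 1, 2, 3, 4], [100, 200, 300, 250, 150])
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Class shares in each split; with stratify they should be nearly identical.
share_train = np.bincount(y_tr) / len(y_tr)
share_test = np.bincount(y_te) / len(y_te)
print(np.round(share_train, 3), np.round(share_test, 3))
```

Without `stratify`, the shares of the rarer classes could drift between train and test purely by chance.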

In [19]:
# Build lists of text features and numeric features
char_feats = [f for f in dataset.columns if dataset[f].dtype=='O']
numeric_feats = [f for f in dataset.columns if dataset[f].dtype!='O']

In [20]:
# List of numeric features
# This example uses numeric variables only
numeric_feats

['Type',
 'Age',
 'Breed1',
 'Breed2',
 'Gender',
 'Color1',
 'Color2',
 'Color3',
 'MaturitySize',
 'FurLength',
 'Vaccinated',
 'Dewormed',
 'Sterilized',
 'Health',
 'Quantity',
 'Fee',
 'State',
 'VideoAmt',
 'PhotoAmt',
 'AdoptionSpeed',
 'AgeCategory',
 'Breed',
 'IsMultiColored',
 'name_code',
 'prediction_class_0',
 'prediction_class_1',
 'prediction_class_2',
 'prediction_class_3',
 'prediction_class_4']

In [21]:

# Features used in a first trial model (the label, AdoptionSpeed, is left out)
features = ['Type',
 'Age',
 'Breed1',
 'Breed2',
 'Gender',
 'Color1',
 'Color2',
 'Color3',
 'MaturitySize',
 'FurLength',
 'Vaccinated',
 'Dewormed',
 'Sterilized',
 'Health',
 'Quantity',
 'Fee',
 'State',
 'VideoAmt',
 'PhotoAmt',
 'AgeCategory',
 'Breed',
 'IsMultiColored',
 'name_code',
 'prediction_class_0',
 'prediction_class_1',
 'prediction_class_2',
 'prediction_class_3',
 'prediction_class_4']

label = 'AdoptionSpeed'

In [22]:
# Build the train and test dataframes with their respective targets
X_train = train[features]
y_train = train[label]

X_test = test[features]
y_test = test[label]

In [23]:
# Train an initial model without tuning hyperparameters. Only the number of classes and the objective (classification) are specified
# This is LightGBM with its default settings
lgb_params = {
                        'objective': 'multiclass',  # the problem is multiclass
                        'num_class': len(y_train.unique())
                        }


# Build the Dataset object that LightGBM needs for training
lgb_train_dataset = lgb.Dataset(data=X_train,
                                label=y_train)

# Train the model with the default parameters
lgb_model = lgb.train(lgb_params,
                      lgb_train_dataset)

In [24]:
# Get predictions on the test set. For a multiclass model, predict returns one probability per class
# for each row (5 columns here); argmax picks the class with the highest probability
y_pred = lgb_model.predict(X_test).argmax(axis=1)

# Compute the quadratic weighted kappa
cohen_kappa_score(y_test,y_pred, weights = 'quadratic')
# The run shown below scores about 0.45. This is likely optimistic: the prediction_class_* features
# come from a model fit on the full dataset, including the rows that are now in test




0.45429837731585787

In [25]:
# Show the confusion matrix
display(plot_confusion_matrix(y_test,y_pred))
# A strong (yellow) diagonal means the model predicted well

In [26]:
# Let's put the kappa score in perspective


# What is the perfect score? Evaluate the true classes against themselves, i.e. the case where the model assigns every row its true class
cohen_kappa_score(y_test,y_test, weights = 'quadratic')

1.0

In [27]:
# What the confusion matrix looks like in the perfect case
display(plot_confusion_matrix(y_test,y_test))
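
To see what the metric rewards, quadratic weighted kappa can be written as 1 - sum(W * O) / sum(W * E), where O is the confusion matrix, E is the matrix expected by chance from the row and column totals, and W[i, j] = (i - j)^2 penalizes a prediction by its squared distance from the true class. The sketch below implements that formula and compares it with scikit-learn on made-up labels (`quadratic_kappa` is a hypothetical helper name, not part of the notebook's `utils`):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def quadratic_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa computed from the confusion matrix."""
    labels = np.arange(n_classes)
    observed = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    # Expected counts if predictions were independent of the true labels
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Penalty grows with the squared distance between true and predicted class
    weights = (labels[:, None] - labels[None, :]) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Made-up labels for illustration
rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=200)
y_noisy = np.clip(y_true + rng.integers(-1, 2, size=200), 0, 4)
y_constant = np.full(200, 2)

print(quadratic_kappa(y_true, y_true, 5))      # perfect agreement
print(quadratic_kappa(y_true, y_noisy, 5))     # off by at most one class
print(quadratic_kappa(y_true, y_constant, 5))  # always predicts class 2
```

Perfect agreement gives 1, while a model that always predicts the same class gives 0, because its confusion matrix matches the chance expectation exactly.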

In [28]:
# SIMULATIONS

In [29]:
# Try an alternative model: instead of LightGBM's true multiclass objective, use One-vs-All

lgb_params = {
                        'objective': 'multiclassova',  # fits one binary (logistic) classifier per class; the predicted class is the one with the highest score
                        'num_class': len(y_train.unique())
                        }


lgb_train_dataset = lgb.Dataset(data=X_train,
                                label=y_train)


lgb_model = lgb.train(lgb_params,
                      lgb_train_dataset)

In [30]:
# Confusion matrix and kappa for OVA
y_pred = lgb_model.predict(X_test).argmax(axis=1)

display(plot_confusion_matrix(y_test,y_pred))

{'kappa':cohen_kappa_score(y_test,
                y_pred,
                weights = 'quadratic'),
 'accuracy':accuracy_score(y_test,y_pred),
 'balanced_accuracy':balanced_accuracy_score(y_test,y_pred)}




{'kappa': 0.465889464194431,
 'accuracy': 0.5258419473157719,
 'balanced_accuracy': 0.5051412429485322}
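
Accuracy and balanced accuracy can diverge when classes are imbalanced, because balanced accuracy is the unweighted mean of the per-class recall. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up labels: class 0 dominates, classes 1 and 2 are rare.
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Recall per class: fraction of each true class that was predicted correctly
per_class_recall = recall_score(y_true, y_pred, average=None)

print(acc, bal_acc, per_class_recall)
```

Here the model gets most rows right by favoring the majority class, yet it misses class 2 entirely, which balanced accuracy makes visible.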

In [31]:
# In this run multiclassova gives a slightly higher kappa than multiclass (0.466 vs 0.454)

## Hyperparameter optimization: train/test model

In [32]:
# Now we tune hyperparameters to see how LightGBM responds

In [33]:

# Function to optimize. Optuna requires using the trial object to generate the parameters being tuned
def lgb_objective(trial):
    # Parameters for LightGBM
    lgb_params = {      
                        # Fixed parameters
                        'objective': 'multiclass',
                        'verbosity':-1,
                        'num_class': len(y_train.unique()),
                        # Hyperparameters to tune, using suggest_float or suggest_int depending on the data type
                        # Each call gives the parameter name, the minimum value and the maximum value;
                        # log=True is set for parameters that should be searched on a log scale
                        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
                        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
                        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
                        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
                        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
                        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
                        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
                        } 

    # Build the training Dataset object
    lgb_train_dataset = lgb.Dataset(data=X_train,
                                    label=y_train)

    # Fit the model
    lgb_model = lgb.train(lgb_params,
                        lgb_train_dataset)
    
    # Return the score on test. Note that tuning directly against the test set
    # makes this score optimistic as an estimate of performance on new data
    return(cohen_kappa_score(y_test,lgb_model.predict(X_test).argmax(axis=1),
                             weights = 'quadratic'))
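
With `log=True` the search space is handled on a log scale, so for a random draw every order of magnitude between the bounds is equally likely. That matters for a range like 1e-8 to 10, where a plain uniform draw would almost never land on the small values. A NumPy sketch of the idea (an illustration of log-uniform sampling, not Optuna's internal sampler):

```python
import numpy as np

rng = np.random.default_rng(42)
low, high, n = 1e-8, 10.0, 100_000

# Log-uniform: draw uniformly between log(low) and log(high), then exponentiate
log_uniform = np.exp(rng.uniform(np.log(low), np.log(high), size=n))

# Plain uniform draw over the same range, for comparison
uniform = rng.uniform(low, high, size=n)

# Share of draws below 1e-4 under each scheme
print((log_uniform < 1e-4).mean(), (uniform < 1e-4).mean())
```

The range spans nine orders of magnitude and four of them lie below 1e-4, so about 4/9 of the log-uniform draws fall there, against a negligible share of the uniform ones.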

In [34]:
# Define the study to optimize
study = optuna.create_study(direction='maximize', # we want to maximize the metric
                            storage="sqlite:///work/db.sqlite3",  # Specify the storage URL here.
                            study_name="04 - LGB Multiclass", # name of the experiment
                            load_if_exists=True) # resume if the study already exists

# Run 100 trials to search for better parameters
study.optimize(lgb_objective, n_trials=100)

[I 2024-08-18 20:06:39,331] Using an existing study with name '04 - LGB Multiclass' instead of creating a new one.


In [None]:
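# Running this in a terminal should bring up the dashboard
# (paths are relative to the folder that contains work/, matching the storage URL used above)
# optuna-dashboard sqlite:///work/db.sqlite3 --artifact-dir work/optuna_artifacts --port 8081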

In [None]:
# Get the best hyperparameters found
study.best_params

{'lambda_l1': 0.00016911420117746327,
 'lambda_l2': 3.0291532676782044e-07,
 'num_leaves': 23,
 'feature_fraction': 0.9710839320963843,
 'bagging_fraction': 0.6006057268679922,
 'bagging_freq': 6,
 'min_child_samples': 16}

In [None]:
# Replicate the optimization result by retraining the model with the best set of hyperparameters
# Build the parameters from the fixed ones plus the best solution Optuna found
lgb_params = {
    'objective': 'multiclass',
    'verbosity': -1,
    'num_class': len(y_train.unique())
}

# Update lgb_params with study.best_params
lgb_params.update(study.best_params)

lgb_train_dataset = lgb.Dataset(data=X_train,
                                label=y_train)


# Train
lgb_model = lgb.train(lgb_params,
                    lgb_train_dataset)

# Show the confusion matrix and kappa
display(plot_confusion_matrix(y_test,lgb_model.predict(X_test).argmax(axis=1)))

cohen_kappa_score(y_test,lgb_model.predict(X_test).argmax(axis=1),
                             weights = 'quadratic')


0.24313006126269543

In [32]:
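# Optuna's default sampler (TPE) performs a Bayesian-style search for the best hyperparameter values.
# In the run shown above, however, the retrained model scores about 0.24, below the 0.45 of the default
# model, so the expected improvement in kappa is not reproduced here. One possible cause: the study was
# loaded from an earlier run (load_if_exists=True), so its best parameters may come from trials on a
# different feature set.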

## Model with cross-validation and a test set

In [33]:
# Custom metric so LightGBM can evaluate kappa and apply early stopping during cross-validation
# Note: argmax(axis=1) assumes dy_pred arrives as a 2D (rows x classes) array, as in LightGBM 4.x
def lgb_custom_metric_kappa(dy_pred, dy_true):
    metric_name = 'kappa'
    value = cohen_kappa_score(dy_true.get_label(),dy_pred.argmax(axis=1),weights = 'quadratic')
    is_higher_better = True
    return(metric_name, value, is_higher_better)

# Objective function to optimize. Here we run 5-fold CV on the train set.
# The CV score is the objective to optimize. We also use the 5 CV models to predict the test set,
# and record the predictions, confusion matrix and test score in Optuna.
# CV score -> Used to measure how well a set of hyperparameters performs
# Test score -> A sanity check that everything is OK; these rows are not (and must not be) used
# for training or for hyperparameter optimization

def cv_es_lgb_objective(trial):

    # Parameters for LightGBM
    lgb_params = {      
                        # Fixed parameters
                        'objective': 'multiclass',
                        'verbosity':-1,
                        'num_class': len(y_train.unique()),
                        # Hyperparameters to tune, using suggest_float or suggest_int depending on the data type
                        # Each call gives the parameter name, the minimum value and the maximum value;
                        # log=True is set for parameters that should be searched on a log scale
                        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
                        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
                        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
                        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
                        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
                        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
                        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
                        } 

    # Predictions of the 5 CV models on the test data are accumulated in the scores_ensemble matrix
    scores_ensemble = np.zeros((len(y_test),len(y_train.unique())))

    # 5-fold CV score, initialized to 0
    score_folds = 0

    # Number of CV splits
    n_splits = 5

    # Object that produces the stratified CV split
    skf = StratifiedKFold(n_splits=n_splits)

    for i, (if_index, oof_index) in enumerate(skf.split(X_train, y_train)):
        
        # In-fold dataset (used for training)
        lgb_if_dataset = lgb.Dataset(data=X_train.iloc[if_index],
                                        label=y_train.iloc[if_index],
                                        free_raw_data=False)
        
        # Out-of-fold dataset (used to measure CV performance)
        lgb_oof_dataset = lgb.Dataset(data=X_train.iloc[oof_index],
                                        label=y_train.iloc[oof_index],
                                        free_raw_data=False)

        # Train the model
        lgb_model = lgb.train(lgb_params,
                                lgb_if_dataset,
                                valid_sets=lgb_oof_dataset,
                                callbacks=[lgb.early_stopping(10, verbose=False)],
                                feval = lgb_custom_metric_kappa
                                )
        
        # Accumulate the scores (class probabilities) produced by each fold's model
        # These are predictions on the 20% held out for test, which is not used for training in any fold
        scores_ensemble = scores_ensemble + lgb_model.predict(X_test)
        
        # Fold score (rows of the train set that are out of fold in this split)
        score_folds = score_folds + cohen_kappa_score(y_train.iloc[oof_index], 
                                                            lgb_model.predict(X_train.iloc[oof_index]).argmax(axis=1),weights = 'quadratic')/n_splits


    # Save the trial's prediction on the test set
    # Build the file name
    predicted_filename = os.path.join(PATH_TO_TEMP_FILES,f'test_{trial.study.study_name}_{trial.number}.joblib')
    # Copy of the test dataset to store the prediction
    predicted_df = test.copy()
    # Add a pred column with the summed predictions of the 5 folds
    predicted_df['pred'] = [scores_ensemble[p,:] for p in range(scores_ensemble.shape[0])]
    # Write the dataframe to the temp artifacts folder
    dump(predicted_df, predicted_filename)
    # Tell Optuna to attach the generated file to the trial
    upload_artifact(trial, predicted_filename, artifact_store)    

    # Save the confusion matrix
    # File name
    cm_filename = os.path.join(PATH_TO_TEMP_FILES,f'cm_{trial.study.study_name}_{trial.number}.jpg')
    # Write the image file
    plot_confusion_matrix(y_test,scores_ensemble.argmax(axis=1)).write_image(cm_filename)
    # Attach it to the trial
    upload_artifact(trial, cm_filename, artifact_store)

    # Compute the score on the test set and record it in Optuna as an additional metric
    test_score = cohen_kappa_score(y_test,scores_ensemble.argmax(axis=1),weights = 'quadratic')
    trial.set_user_attr("test_score", test_score)

    # Return the 5-fold CV score so Optuna optimizes against it
    return(score_folds)
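
The test prediction in this objective is an ensemble: the class probabilities from the fold models are added up and the class with the largest total wins. Dividing by the number of folds does not change the argmax, so summing and averaging lead to the same predicted class. A NumPy sketch with made-up probabilities standing in for the `lgb_model.predict(X_test)` outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_folds, n_rows, n_classes = 5, 8, 5

# Made-up per-fold probability matrices; each row sums to 1 (Dirichlet draws)
fold_probas = [rng.dirichlet(np.ones(n_classes), size=n_rows) for _ in range(n_folds)]

# Accumulate the probabilities across folds, as the objective does
scores_ensemble = np.zeros((n_rows, n_classes))
for proba in fold_probas:
    scores_ensemble = scores_ensemble + proba

pred_from_sum = scores_ensemble.argmax(axis=1)
pred_from_mean = (scores_ensemble / n_folds).argmax(axis=1)

print(pred_from_sum)
```

The stored `pred` column therefore holds summed scores; dividing by the number of folds would turn them back into probabilities if needed.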

In [34]:
# Initialize Optuna's artifact (file) store
artifact_store = FileSystemArtifactStore(base_path=PATH_TO_OPTUNA_ARTIFACTS)

# Create the study
study = optuna.create_study(direction='maximize',
                            storage="sqlite:///work/db.sqlite3",  # Specify the storage URL here.
                            study_name="04 - LGB Multiclass CV",
                            load_if_exists = True)
# Run the optimization
study.optimize(cv_es_lgb_objective, n_trials=100)


FileSystemArtifactStore is experimental (supported from v3.3.0). The interface can change in the future.

[I 2024-08-18 19:00:16,961] Using an existing study with name '04 - LGB Multiclass CV' instead of creating a new one.

upload_artifact is experimental (supported from v3.3.0). The interface can change in the future.


[I 2024-08-18 19:00:31,180] Trial 268 finished with value: 0.2667309846229608 and parameters: {'lambda_l1': 1.7029266305985666, 'lambda_l2': 3.4331244753116814, 'num_leaves': 24, 'feature_fraction': 0.8241786860233868, 'bagging_fraction': 0.7167045852959953, 'bagging_freq': 1, 'min_child_samples': 21}. Best is trial 45 with value: 1.0.


[I 2024-08-18 19:00:37,066] Trial 269 fin

To view the Optuna dashboard, run this command in a terminal

In [35]:
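# kaleido is needed by Plotly's write_image, which the CV objective uses to save the confusion matrix
!pip install kaleido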




In [36]:
!optuna-dashboard sqlite:///../work/db.sqlite3 --artifact-dir ../work/optuna_artifacts --port 8081

Traceback (most recent call last):
  File "c:\users\s1093678\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\s1093678\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\s1093678\Anaconda3\Scripts\optuna-dashboard.exe\__main__.py", line 7, in <module>
  File "c:\users\s1093678\anaconda3\lib\site-packages\optuna_dashboard\_cli.py", line 119, in main
    storage = get_storage(args.storage, storage_class=args.storage_class)
  File "c:\users\s1093678\anaconda3\lib\site-packages\optuna_dashboard\_storage_url.py", line 59, in get_storage
    return guess_storage_from_url(storage)
  File "c:\users\s1093678\anaconda3\lib\site-packages\optuna_dashboard\_storage_url.py", line 78, in guess_storage_from_url
    raise ValueError(
ValueError: Please specify 'sqlite:///sqlite:///../work/db.sqlite3' to use SQLite3 (RDBStorage)


In [37]:
# For next class: do feature engineering (add features, health, colors, etc.), since so far only the numeric variables were used
# Have a candidate tabular model to move forward with
# We will look at running a model to search for hyperparameters, etc.
# Arrive at a winning model
