---

## 7.3 Model Optimization

For comparison, we used AutoML, a method in which a large number of models are fitted and evaluated in parallel. Tools of this kind are designed to replace the work of a data engineer and/or to serve users who are unfamiliar with the techniques covered in the course.

We used the H2O library, which carries out model selection on a virtual machine or on localhost (our case). The method runs on its own and needs minimal data processing. Here we only had to change the format of the dataframe and tell the model which columns to use as predictors and which columns to predict.

H2O then detects which columns are useless for the task and iterates over a maximum number of models, varying their hyperparameters. It then reports the results in a table and returns as the winner the model with the best metric.

Since we wanted to simulate a minimal setting to see whether our profession really is easily replaceable, we did not apply the column processing we had designed. That said, we did use the features designed by hand and with regular expressions, as well as the BERT embeddings for classification, since for simplicity these transformations were applied to the initial dataframe. This is why the columns containing values separated by ";" were taken as-is, without disaggregating them. Changing this might have improved the results obtained.
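As an illustration of the disaggregation we skipped, the sketch below expands a ";"-separated column into one 0/1 indicator per distinct value. It is a minimal, standard-library example and not the preprocessing used in our pipeline; the helper `split_multivalued`, the column name `genres` and the sample rows are all hypothetical.

```python
def split_multivalued(rows, column, sep=";"):
    """Expand a sep-separated string column into one 0/1 indicator per value."""
    # Collect the distinct values seen in the column, in sorted order.
    vocabulary = sorted({
        value.strip()
        for row in rows
        for value in row[column].split(sep)
        if value.strip()
    })
    expanded = []
    for row in rows:
        present = {value.strip() for value in row[column].split(sep)}
        # Keep every other column untouched and drop the original one.
        new_row = {key: val for key, val in row.items() if key != column}
        for value in vocabulary:
            new_row[f"{column}={value}"] = int(value in present)
        expanded.append(new_row)
    return expanded


# Hypothetical rows; names and values are illustrative only.
games = [
    {"name": "A", "genres": "Action;Indie"},
    {"name": "B", "genres": "Indie;Strategy"},
]
expanded_games = split_multivalued(games, "genres")
print(expanded_games[0])
```

With indicator columns like these, a tool such as AutoML would see each value as its own feature instead of treating every distinct combination as a separate category.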

In [None]:
import h2o
from h2o.automl import H2OAutoML

import sys
import os
project_path = os.path.abspath('..')
sys.path.insert(1, project_path)

### H2O Initialization

In [None]:

"""# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()"""

# Start the H2O cluster (locally)
h2o.init()

### Data loading and processing

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, OneHotEncoder
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import Pipeline
import re
import pandas as pd

from src.features.preprocessing import Nothing, CategoriesTokenizer, boc_many_values, boc_some_values, custom_features, preprocessing_bert


In [3]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split
import numpy as np

MODEL = "distilbert-videogame-descriptions-rating"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def sentence_clf_output(text):
    """Return the SequenceClassifierOutput for the given text."""
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input, return_dict=True, output_hidden_states=True)
    return output

def logits_embedding(clf_output):
    # Return the vector of classification scores (before the softmax layer)
    return clf_output['logits'][0].detach().numpy().reshape(1,5)

def integrar_bert_logits(df_in):
    df = df_in.copy(deep=True)

    embed = lambda row: logits_embedding(sentence_clf_output(row))
    bert_logits = np.concatenate(df['short_description'].apply(embed).to_numpy())  # .reshape(100,3)

    df[['bert1','bert2','bert3','bert4','bert5']] = pd.DataFrame(bert_logits, index= df.index)

    return df

def custom_features(dataframe_in):
    df = dataframe_in.copy(deep=True)

    df['month'] = pd.to_datetime(df['release_date']).dt.month
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.to_julian_date())
    return df
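To make the date features in `custom_features` concrete, the following sketch reproduces them for a single date using only the standard library. It assumes the release date is an ISO-formatted string; the helper names `julian_date` and `date_features` are ours for illustration, and the result should agree with what pandas' `to_julian_date()` returns for a midnight timestamp.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (0001-01-01 is day 1)
# and the Julian date at midnight of the same day.
ORDINAL_TO_JULIAN = 1721424.5


def julian_date(day):
    """Julian date at 00:00 of the given calendar day."""
    return day.toordinal() + ORDINAL_TO_JULIAN


def date_features(iso_string):
    """Month and Julian date for one release date (hypothetical helper)."""
    day = date.fromisoformat(iso_string)
    return {"month": day.month, "release_date": julian_date(day)}


example = date_features("2000-01-01")
print(example)
```

The Julian date turns the release date into a single increasing number, so tree-based and linear models can both use it, while `month` keeps the seasonal information that the raw number hides.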

In [4]:
df_train = pd.read_pickle('train.pickle')
df_train = integrar_bert_logits(df_train)
df_train = custom_features(df_train)

In [5]:
columns = df_train.columns
columns = list(columns)
columns.remove('rating')
columns.remove('estimated_sells')

### Classification

In [6]:
hf_train = h2o.H2OFrame(df_train)


In [7]:
# Run AutoML for 20 base models
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=columns, y='rating', training_frame=hf_train)

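Confusion matrix on the training data (rows: actual class, columns: predicted class):

| Actual \ Predicted | Mixed | Mostly Positive | Negative | Positive | Very Positive | Error | Rate |
|---|---|---|---|---|---|---|---|
| Mixed | 1585.0 | 2.0 | 3.0 | 66.0 | 0.0 | 0.0428744 | 71 / 1,656 |
| Mostly Positive | 6.0 | 1681.0 | 2.0 | 13.0 | 5.0 | 0.0152314 | 26 / 1,707 |
| Negative | 0.0 | 0.0 | 1290.0 | 0.0 | 0.0 | 0.0 | 0 / 1,290 |
| Positive | 0.0 | 3.0 | 4.0 | 1759.0 | 265.0 | 0.1339242 | 272 / 2,031 |
| Very Positive | 0.0 | 3.0 | 0.0 | 0.0 | 1194.0 | 0.0025063 | 3 / 1,197 |
| Total | 1591.0 | 1689.0 | 1299.0 | 1838.0 | 1464.0 | 0.0472021 | 372 / 7,881 |

| k | hit_ratio |
|---|---|
| 1 | 0.9527979 |
| 2 | 0.9989849 |
| 3 | 0.9998732 |
| 4 | 1.0 |
| 5 | 1.0 |

Confusion matrix on the cross-validation data (rows: actual class, columns: predicted class):

| Actual \ Predicted | Mixed | Mostly Positive | Negative | Positive | Very Positive | Error | Rate |
|---|---|---|---|---|---|---|---|
| Mixed | 532.0 | 204.0 | 317.0 | 537.0 | 66.0 | 0.678744 | 1,124 / 1,656 |
| Mostly Positive | 368.0 | 293.0 | 184.0 | 756.0 | 106.0 | 0.8283538 | 1,414 / 1,707 |
| Negative | 335.0 | 112.0 | 575.0 | 254.0 | 14.0 | 0.5542636 | 715 / 1,290 |
| Positive | 292.0 | 229.0 | 115.0 | 1150.0 | 245.0 | 0.4337765 | 881 / 2,031 |
| Very Positive | 66.0 | 74.0 | 35.0 | 580.0 | 442.0 | 0.6307435 | 755 / 1,197 |
| Total | 1593.0 | 912.0 | 1226.0 | 3277.0 | 873.0 | 0.6203527 | 4,889 / 7,881 |

| k | hit_ratio |
|---|---|
| 1 | 0.3796472 |
| 2 | 0.642558 |
| 3 | 0.8318741 |
| 4 | 0.9525441 |
| 5 | 1.0 |

Cross-validation metrics summary:

| metric | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| accuracy | 0.3770185 | 0.004348 | 0.3762981 | 0.3841843 | 0.3757842 | 0.3764928 | 0.3723331 |
| auc | | 0.0 | | | | | |
| err | 0.6229815 | 0.004348 | 0.6237019 | 0.6158157 | 0.6242158 | 0.6235072 | 0.6276669 |
| err_count | 981.8 | 41.045097 | 1021.0 | 989.0 | 995.0 | 992.0 | 912.0 |
| logloss | 1.3966794 | 0.0139617 | 1.3818406 | 1.4025223 | 1.408077 | 1.409489 | 1.3814679 |
| max_per_class_error | 0.8354025 | 0.0062813 | 0.8262108 | 0.8391813 | 0.8318584 | 0.8416666 | 0.8380952 |
| mean_per_class_accuracy | 0.3723334 | 0.0046229 | 0.3659002 | 0.3789127 | 0.3719483 | 0.3730164 | 0.3718895 |
| mean_per_class_error | 0.6276666 | 0.0046229 | 0.6340997 | 0.6210872 | 0.6280518 | 0.6269836 | 0.6281105 |
| mse | 0.5343803 | 0.0045261 | 0.5331347 | 0.5378264 | 0.5343159 | 0.5390667 | 0.5275577 |
| null_deviance | 5016.8887 | 223.34808 | 5201.0483 | 5113.2144 | 5074.2974 | 5066.9873 | 4628.896 |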
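The `hit_ratio` tables above report, for each k, the fraction of rows whose true class is among the k classes the model considers most probable (this is our reading of H2O's output). A minimal sketch of that computation follows; the function `hit_ratio` and the toy probabilities are hypothetical and do not come from our data.

```python
def hit_ratio(probabilities, true_labels, k):
    """Fraction of rows whose true label is among the k most probable classes."""
    hits = 0
    for class_probs, truth in zip(probabilities, true_labels):
        # Rank class names from most to least probable for this row.
        ranked = sorted(class_probs, key=class_probs.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy predicted probabilities for three rows (illustrative values only).
toy_probs = [
    {"Mixed": 0.6, "Positive": 0.3, "Negative": 0.1},
    {"Mixed": 0.2, "Positive": 0.5, "Negative": 0.3},
    {"Mixed": 0.5, "Positive": 0.1, "Negative": 0.4},
]
toy_truth = ["Mixed", "Negative", "Negative"]

for k in (1, 2, 3):
    print(k, hit_ratio(toy_probs, toy_truth, k))
```

For k = 1 the hit ratio is the plain accuracy, which is why the first row of each table equals one minus the overall error of the confusion matrix printed just before it.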


In [8]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)

| model_id | mean_per_class_error | logloss | rmse | mse |
|---|---|---|---|---|
| StackedEnsemble_AllModels_1_AutoML_1_20221214_05007 | 0.625176 | 1.39728 | 0.731049 | 0.534433 |
| StackedEnsemble_BestOfFamily_1_AutoML_1_20221214_05007 | 0.632819 | 1.39886 | 0.731111 | 0.534523 |
| XGBoost_grid_1_AutoML_1_20221214_05007_model_3 | 0.645279 | 1.4475 | 0.739698 | 0.547153 |
| DeepLearning_grid_2_AutoML_1_20221214_05007_model_1 | 0.645524 | 1.49349 | 0.728871 | 0.531253 |
| XGBoost_grid_1_AutoML_1_20221214_05007_model_2 | 0.655344 | 1.49005 | 0.740581 | 0.548461 |
| XGBoost_3_AutoML_1_20221214_05007 | 0.65683 | 1.44231 | 0.740801 | 0.548786 |
| DeepLearning_grid_1_AutoML_1_20221214_05007_model_1 | 0.659124 | 1.50631 | 0.735669 | 0.541209 |
| XGBoost_1_AutoML_1_20221214_05007 | 0.661569 | 1.53459 | 0.7452 | 0.555324 |
| DeepLearning_grid_3_AutoML_1_20221214_05007_model_1 | 0.66446 | 1.49385 | 0.733125 | 0.537472 |
| XGBoost_grid_1_AutoML_1_20221214_05007_model_1 | 0.664501 | 1.47977 | 0.742586 | 0.551433 |
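The leaderboard above is ordered by `mean_per_class_error`, which, as far as we know, is H2O's default ranking metric for multiclass problems. The choice of metric matters: the sketch below takes four rows copied from the leaderboard and picks the winner under different columns. The helper `leader` is hypothetical and is not part of the H2O API.

```python
# (model_id, mean_per_class_error, logloss, rmse), copied from the leaderboard.
leaderboard = [
    ("StackedEnsemble_AllModels_1_AutoML_1_20221214_05007", 0.625176, 1.39728, 0.731049),
    ("StackedEnsemble_BestOfFamily_1_AutoML_1_20221214_05007", 0.632819, 1.39886, 0.731111),
    ("XGBoost_grid_1_AutoML_1_20221214_05007_model_3", 0.645279, 1.4475, 0.739698),
    ("DeepLearning_grid_2_AutoML_1_20221214_05007_model_1", 0.645524, 1.49349, 0.728871),
]

MEAN_PER_CLASS_ERROR, LOGLOSS, RMSE = 1, 2, 3


def leader(rows, metric_index):
    """Return the model_id with the lowest value of the chosen metric."""
    return min(rows, key=lambda row: row[metric_index])[0]


best_by_error = leader(leaderboard, MEAN_PER_CLASS_ERROR)
best_by_logloss = leader(leaderboard, LOGLOSS)
best_by_rmse = leader(leaderboard, RMSE)
print(best_by_error, best_by_logloss, best_by_rmse, sep="\n")
```

Under `mean_per_class_error` and `logloss` the stacked ensemble wins, while under `rmse` the deep learning model comes first, so the "best" model depends on which metric the ranking uses.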


In [54]:
aml.get_best_model()



AutoML returned the model *StackedEnsemble_AllModels_1_AutoML_1_20221214_05007*, which combines several of the models used in the search (similar to what we did ourselves). We could not read its F1 score directly, since the library does not report that metric here. However, its per-class errors suggest results similar to those observed in the grid search. In fact the model was an improvement: when we submitted its predictions to CodaLab, it obtained the highest score among all teams, with an F1 of 0.36.
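Although F1 is not reported, it can be estimated from a confusion matrix. The sketch below does this for the second confusion matrix shown above (overall error 0.62), which we take to be the cross-validated one; that reading, and the use of macro averaging, are assumptions, since we do not know exactly how CodaLab averages the score.

```python
labels = ["Mixed", "Mostly Positive", "Negative", "Positive", "Very Positive"]

# Rows are actual classes, columns are predicted classes (same order as labels).
confusion = [
    [532, 204, 317, 537, 66],
    [368, 293, 184, 756, 106],
    [335, 112, 575, 254, 14],
    [292, 229, 115, 1150, 245],
    [66, 74, 35, 580, 442],
]


def per_class_f1(matrix):
    """F1 per class, using F1 = 2*TP / (actual count + predicted count)."""
    size = len(matrix)
    scores = []
    for i in range(size):
        true_positives = matrix[i][i]
        actual = sum(matrix[i])                              # row total
        predicted = sum(matrix[r][i] for r in range(size))   # column total
        denominator = actual + predicted
        scores.append(2 * true_positives / denominator if denominator else 0.0)
    return scores


f1_scores = per_class_f1(confusion)
macro_f1 = sum(f1_scores) / len(f1_scores)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / sum(map(sum, confusion))

for label, score in zip(labels, f1_scores):
    print(f"{label}: {score:.3f}")
print(f"macro F1: {macro_f1:.3f}, accuracy: {accuracy:.3f}")
```

Under these assumptions the macro F1 comes out at roughly 0.37, in the same range as the 0.36 obtained on CodaLab, and the accuracy matches the k = 1 hit ratio of the same output.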

### Regression

In [18]:
columns.remove('bert1')
columns.remove('bert2')
columns.remove('bert3')
columns.remove('bert4')
columns.remove('bert5')

aml_reg = H2OAutoML(max_models=20, seed=1)
aml_reg.train(x=columns, y='estimated_sells', training_frame=hf_train)

| metric | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| mae | 199762.58 | 31126.373 | 206921.95 | 147853.44 | 225092.47 | 221708.11 | 197236.9 |
| mean_residual_deviance | 1377224100000.0 | 1286725440000.0 | 753386460000.0 | 275603915000.0 | 3520427720000.0 | 1580256200000.0 | 756446330000.0 |
| mse | 1377224100000.0 | 1286725440000.0 | 753386460000.0 | 275603915000.0 | 3520427720000.0 | 1580256200000.0 | 756446330000.0 |
| null_deviance | 3612912800000000.0 | 2400989530000000.0 | 1875035890000000.0 | 2686562440000000.0 | 7590005000000000.0 | 4074870890000000.0 | 1838090440000000.0 |
| r2 | 0.4443374 | 0.225081 | 0.3424686 | 0.8350467 | 0.2600623 | 0.3822952 | 0.4018143 |
| residual_deviance | 2180709380000000.0 | 2061307140000000.0 | 1232540190000000.0 | 442619902000000.0 | 5615082400000000.0 | 2514187690000000.0 | 1099116560000000.0 |
| rmse | 1079212.0 | 515419.12 | 867978.4 | 524979.94 | 1876280.2 | 1257082.4 | 869739.25 |
| rmsle | | 0.0 | | | | | |


In [19]:
lb_reg = aml_reg.leaderboard
lb_reg.head(rows=lb_reg.nrows)

| model_id | rmse | mse | mae | rmsle | mean_residual_deviance |
|---|---|---|---|---|---|
| StackedEnsemble_BestOfFamily_1_AutoML_3_20221214_75429 | 1180390.0 | 1393330000000.0 | 202091 | | 1393330000000.0 |
| StackedEnsemble_AllModels_1_AutoML_3_20221214_75429 | 1198550.0 | 1436530000000.0 | 203202 | | 1436530000000.0 |
| DeepLearning_grid_3_AutoML_3_20221214_75429_model_1 | 1207980.0 | 1459220000000.0 | 208966 | | 1459220000000.0 |
| DeepLearning_grid_2_AutoML_3_20221214_75429_model_1 | 1216490.0 | 1479840000000.0 | 214142 | | 1479840000000.0 |
| DeepLearning_grid_1_AutoML_3_20221214_75429_model_1 | 1319360.0 | 1740710000000.0 | 216681 | | 1740710000000.0 |
| DeepLearning_1_AutoML_3_20221214_75429 | 1372250.0 | 1883070000000.0 | 274241 | | 1883070000000.0 |
| XRT_1_AutoML_3_20221214_75429 | 1383570.0 | 1914270000000.0 | 224037 | 1.35117 | 1914270000000.0 |
| GBM_grid_1_AutoML_3_20221214_75429_model_1 | 1384050.0 | 1915580000000.0 | 251106 | | 1915580000000.0 |
| XGBoost_grid_1_AutoML_3_20221214_75429_model_4 | 1385140.0 | 1918600000000.0 | 257858 | | 1918600000000.0 |
| XGBoost_grid_1_AutoML_3_20221214_75429_model_1 | 1394960.0 | 1945900000000.0 | 260791 | | 1945900000000.0 |


In [57]:
aml_reg.get_best_model()



Our expectations were high for the regression task, where we hoped for an improvement over our voting model. The model chosen by the method was *StackedEnsemble_BestOfFamily_1_AutoML_3_20221214_75429*, which showed promising R2 results. Even so, there was strong variance across the folds (which H2O implements by default). On CodaLab this model was a failure, with an R2 of 0.
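Two things in this paragraph can be checked numerically: how spread out the per-fold R2 values are, and what an R2 of 0 means. The sketch below uses the five fold values from the cross-validation summary and the usual definition R2 = 1 - SS_res / SS_tot. The `sales` values are hypothetical and only illustrate the baseline case; they are not taken from our dataset.

```python
from statistics import mean, stdev

# Per-fold R2 values copied from the cross-validation summary (cv_1 .. cv_5).
fold_r2 = [0.3424686, 0.8350467, 0.2600623, 0.3822952, 0.4018143]
r2_mean = mean(fold_r2)
r2_sd = stdev(fold_r2)  # sample standard deviation
print(f"mean R2 = {r2_mean:.4f}, sd = {r2_sd:.4f}")


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets: always predicting their mean gives R2 = 0.
sales = [100.0, 200.0, 300.0, 400.0]
baseline = [mean(sales)] * len(sales)
print(r2_score(sales, baseline))
print(r2_score(sales, sales))
```

The standard deviation is about half the mean, with a single fold (0.835) pulling the average up, which is the variance we noticed. An R2 of 0 corresponds to doing no better than predicting the mean of the targets for every game.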

### Submission

In [21]:
df_test = pd.read_pickle('test.pickle')
df_test = integrar_bert_logits(df_test)
df_test = custom_features(df_test)

In [23]:
df_test_h2o = h2o.H2OFrame(df_test)



In [48]:
def h2o_pred_to_numpy(h2o_frame):
    pred_df = h2o_frame.as_data_frame()
    return pred_df['predict'].to_numpy()  # .astype('<U15')

In [None]:
y_pred_clf = h2o_pred_to_numpy(aml.predict(df_test_h2o))
y_pred_clf = y_pred_clf.astype('<U15')

In [None]:
y_pred_rgr = h2o_pred_to_numpy(aml_reg.predict(df_test_h2o))

In [None]:
from zipfile import ZipFile
import os

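def generateFiles(predict_data, clf_pipe, rgr_pipe):
    """Generate the files to upload to CodaLab.

    Input
    predict_data: H2OFrame with the input data to predict
    clf_pipe: classification model (here, the trained AutoML object)
    rgr_pipe: regression model (here, the trained AutoML object)

    Output
    predictions.zip, containing predictions_clf.txt and predictions_rgr.txt
    """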
    y_pred_clf = h2o_pred_to_numpy(clf_pipe.predict(predict_data))
    y_pred_clf = y_pred_clf.astype('<U15')
    y_pred_rgr = h2o_pred_to_numpy(rgr_pipe.predict(predict_data))
    
    with open('./predictions_clf.txt', 'w') as f:
        for item in y_pred_clf:
            f.write("%s\n" % item)

    with open('./predictions_rgr.txt', 'w') as f:
        for item in y_pred_rgr:
            f.write("%s\n" % item)

    with ZipFile('predictions.zip', 'w') as zipObj2:
        zipObj2.write('predictions_rgr.txt')
        zipObj2.write('predictions_clf.txt')

    os.remove("predictions_rgr.txt")
    os.remove("predictions_clf.txt")

generateFiles(df_test_h2o,aml,aml_reg)
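The function above writes two temporary text files, zips them and deletes them. An alternative sketch, shown here only as a possible simplification, writes the predictions straight into the archive with `ZipFile.writestr`, so no temporary files are left behind if something fails midway. The function `write_predictions_zip` and the sample predictions are hypothetical; in the notebook the inputs would be `y_pred_clf` and `y_pred_rgr`.

```python
import os
import tempfile
from zipfile import ZipFile


def write_predictions_zip(clf_predictions, rgr_predictions, zip_path):
    """Write one prediction per line into two text members of a zip archive."""
    with ZipFile(zip_path, "w") as archive:
        archive.writestr(
            "predictions_clf.txt",
            "".join(f"{item}\n" for item in clf_predictions),
        )
        archive.writestr(
            "predictions_rgr.txt",
            "".join(f"{item}\n" for item in rgr_predictions),
        )
    return zip_path


# Illustrative predictions written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_zip = write_predictions_zip(
    ["Positive", "Mixed"],
    [1200.0, 350.5],
    os.path.join(demo_dir, "predictions.zip"),
)

# Read the archive back to confirm its contents.
with ZipFile(demo_zip) as archive:
    demo_names = sorted(archive.namelist())
    demo_clf = archive.read("predictions_clf.txt").decode()
print(demo_names)
```

The member names and the one-value-per-line layout match what `generateFiles` produces, so the resulting archive should be interchangeable with the original one.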