# 4.2 - Multimodal model


This is a stacked model built on HuggingFace's BERT. The pretrained `bert-base-multilingual-uncased` model produces the text embeddings, which are then combined with the tabular data from the dataframe and fed as input to a 2-layer perceptron.



![modelo](../img/multimodal.png)


[Reference](https://github.com/georgian-io/Multimodal-Toolkit)
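
To make the idea concrete, here is a minimal NumPy sketch of the fusion step described above: a pooled text embedding is concatenated with the tabular features and passed through a two-layer perceptron that outputs one price per row. Everything in it is an illustrative assumption (random weights, the 768-dimensional embedding size of BERT-base, the placeholder number of tabular features, the hidden size, and the helper name `fuse_and_predict`). It is not the Multimodal-Toolkit implementation, which also offers other ways of combining the features.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_predict(text_emb, tab_feats, w1, b1, w2, b2):
    """Concatenate text and tabular features, then apply a 2-layer MLP."""
    x = np.concatenate([text_emb, tab_feats], axis=1)  # (batch, emb_dim + n_tab)
    hidden = np.maximum(0.0, x @ w1 + b1)              # ReLU hidden layer
    return hidden @ w2 + b2                            # (batch, 1) regression output

# Illustrative sizes: 768 is the BERT-base hidden size, the rest are placeholders.
batch, emb_dim, n_tab, hidden_dim = 4, 768, 28, 64

text_emb = rng.normal(size=(batch, emb_dim))   # stand-in for the BERT output
tab_feats = rng.normal(size=(batch, n_tab))    # stand-in for encoded tabular columns

w1 = rng.normal(scale=0.01, size=(emb_dim + n_tab, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.01, size=(hidden_dim, 1))
b2 = np.zeros(1)

pred = fuse_and_predict(text_emb, tab_feats, w1, b1, w2, b2)
print(pred.shape)  # (4, 1)
```

In a real model the weights would be trained jointly with (or on top of) the transformer rather than drawn at random.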

In [1]:
# libraries

import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

from multimodal_transformers.data import load_data_from_folder
from multimodal_transformers.model import AutoModelWithTabular, TabularConfig

from transformers import AutoTokenizer, AutoConfig, Trainer, TrainingArguments, EvalPrediction

import torch

from typing import Callable, Dict

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split as tts

import warnings
warnings.filterwarnings('ignore')

**Preparing the data**

I load the cleaned listings data and join it with the reviews data.

In [2]:
listings=pd.read_csv('../data/clean_data/listings.csv')

listings.head()

Unnamed: 0,id,host_id,host_is_superhost,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms
0,6369,13660,1,Hispanoamérica,Chamartín,40.45628,-3.67763,Apartment,Private room,2,1,1,0,Real Bed,"{Wifi,""Air conditioning"",Kitchen,Elevator,Heat...",70,0,5,2,15,1,365,22,52,82,82,73,14,1,0,1,0
1,21853,83531,0,Cármenes,Latina,40.40341,-3.74084,Apartment,Private room,1,1,1,1,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,""...",17,0,0,1,8,4,40,0,0,0,162,33,0,2,0,2,0
2,23001,82175,0,Legazpi,Arganzuela,40.38695,-3.69304,Apartment,Entire home/apt,6,2,3,5,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",""Wheelcha...",50,300,30,1,10,15,730,2,2,2,213,0,0,6,6,0,0
3,24805,101471,0,Universidad,Centro,40.42202,-3.70395,Apartment,Entire home/apt,3,1,0,1,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,E...",80,200,30,2,0,5,730,27,57,87,362,9,7,1,1,0,0
4,24836,101653,1,Justicia,Centro,40.41995,-3.69764,Apartment,Entire home/apt,4,1,2,3,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",115,200,0,4,0,3,10,24,54,77,342,67,15,1,1,0,0


In [3]:
reviews=pd.read_csv('../data/raw_data/reviews.csv.gz', compression='gzip', low_memory=False)

reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,6369,29428,2010-03-14,84790,Nancy,Simon and Arturo have the ultimate location in...
1,6369,31018,2010-03-23,84338,David,Myself and Kristy originally planned on stayin...
2,6369,34694,2010-04-10,98655,Marion,We had a great time at Arturo and Simon's ! A ...
3,6369,37146,2010-04-21,109871,Kurt,I very much enjoyed the stay. \r\nIt's a wond...
4,6369,38168,2010-04-26,98901,Dennis,Arturo and Simon are polite and friendly hosts...


In [4]:
# first review of each listing only, for quick testing (not used in the merge further down)

primera=reviews.groupby('listing_id').first().reset_index()

primera.drop(columns=['id', 'date', 'reviewer_id', 'reviewer_name'], inplace=True)

primera.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17204 entries, 0 to 17203
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   listing_id  17204 non-null  int64 
 1   comments    17202 non-null  object
dtypes: int64(1), object(1)
memory usage: 5.7 MB


In [5]:
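# all reviews of each listing concatenated into a single string
# note: 'sum' joins the strings without any separator between reviews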

todas=reviews.groupby('listing_id').agg({'comments': 'sum'}).reset_index()

todas.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17204 entries, 0 to 17203
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   listing_id  17204 non-null  int64 
 1   comments    17204 non-null  object
dtypes: int64(1), object(1)
memory usage: 361.4 MB
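
Aggregating a string column with `'sum'` glues the reviews together with no separator, so the last word of one review runs into the first word of the next. Below is a minimal sketch of an alternative, assuming a single space is an acceptable separator and that missing comments should simply be skipped. The toy frame `toy_reviews` and the helper `join_reviews` are illustrative names, not part of the notebook.

```python
import pandas as pd

def join_reviews(comments: pd.Series, sep: str = " ") -> str:
    """Join the non-missing reviews of one listing with a separator."""
    return sep.join(comments.dropna().astype(str))

# Toy data standing in for the real reviews table.
toy_reviews = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "comments": ["Great flat.", "Very clean.", None, "Nice host."],
})

joined = (toy_reviews.groupby("listing_id")["comments"]
          .agg(join_reviews)
          .reset_index())
print(joined)
```

A listing whose comments are all missing ends up with an empty string here, so it may still need to be filtered out before tokenizing.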


In [6]:
total=listings.merge(todas, left_on='id', right_on='listing_id')

total=total.dropna()

total.head()

Unnamed: 0,id,host_id,host_is_superhost,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,listing_id,comments
0,6369,13660,1,Hispanoamérica,Chamartín,40.45628,-3.67763,Apartment,Private room,2,1,1,0,Real Bed,"{Wifi,""Air conditioning"",Kitchen,Elevator,Heat...",70,0,5,2,15,1,365,22,52,82,82,73,14,1,0,1,0,6369,Simon and Arturo have the ultimate location in...
1,21853,83531,0,Cármenes,Latina,40.40341,-3.74084,Apartment,Private room,1,1,1,1,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,""...",17,0,0,1,8,4,40,0,0,0,162,33,0,2,0,2,0,21853,"Mi experiencia en casa de Adel fue buena, aunq..."
2,24805,101471,0,Universidad,Centro,40.42202,-3.70395,Apartment,Entire home/apt,3,1,0,1,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,E...",80,200,30,2,0,5,730,27,57,87,362,9,7,1,1,0,0,24805,"During my stay, I enjoyed all around and had a..."
3,24836,101653,1,Justicia,Centro,40.41995,-3.69764,Apartment,Entire home/apt,4,1,2,3,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",115,200,0,4,0,3,10,24,54,77,342,67,15,1,1,0,0,24836,Incredible location! Tenty and Goyo were very...
4,26825,114340,0,Legazpi,Arganzuela,40.38985,-3.69011,House,Private room,1,1,1,1,Real Bed,"{Wifi,""Wheelchair accessible"",Doorman,Elevator...",25,0,15,1,0,2,365,30,60,90,365,142,21,1,0,1,0,26825,"Agustina is a great host, she is very thoughtf..."


In [7]:
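# drop identifier columns, which are not used as features
total.drop(columns=['id', 'host_id', 'listing_id'], inplace=True)

# 75% train; the remaining 25% is split evenly into validation and test
train, val=tts(total, train_size=0.75, random_state=42)
# only the first 3000 rows are written to disk, to keep this trial run short
train.iloc[:3000].to_csv('../data/nlp_data/train.csv')

val, test=tts(val, train_size=0.5, random_state=42)
val.iloc[:500].to_csv('../data/nlp_data/val.csv')
test.iloc[:300].to_csv('../data/nlp_data/test.csv')

# shapes of the full splits, not of the truncated CSV files
train.shape, val.shape, test.shape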

((12855, 31), (2142, 31), (2143, 31))
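
The printed shapes follow from applying the two splits in sequence, and they describe the full splits rather than the truncated CSV files. Here is a small sketch of the arithmetic, assuming the merged frame has 17,140 rows after `dropna` (the sum of the three shapes above) and that the training share is rounded down with the remainder going to the held-out part. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math
from typing import Tuple

def split_sizes(n_rows: int, train_size: float) -> Tuple[int, int]:
    """Return (train rows, held-out rows) for a fractional train size."""
    n_train = math.floor(n_rows * train_size)
    return n_train, n_rows - n_train

n_total = 12855 + 2142 + 2143                     # rows in `total`, inferred from the output
n_train, n_holdout = split_sizes(n_total, 0.75)   # first split: train vs. the rest
n_val, n_test = split_sizes(n_holdout, 0.5)       # second split: validation vs. test

print(n_train, n_val, n_test)  # 12855 2142 2143
```

Of those rows, only 3000, 500 and 300 respectively are written to the CSV files that the model reads.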

In [8]:
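label_col='price'  # regression target

# only 'amenities' is passed to the tokenizer; the merged 'comments' column is not listed here
text_cols=['amenities']

categorical_cols=['neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'property_type', 'room_type',
                  'bed_type']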

numerical_cols=['accommodates', 'bathrooms', 'bedrooms', 'beds', 'host_is_superhost', 'latitude', 'longitude',
                'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 
                'minimum_nights', 'maximum_nights', 'availability_30', 'availability_60', 
                'availability_90', 'availability_365', 'number_of_reviews', 
                'number_of_reviews_ltm', 'calculated_host_listings_count', 
                'calculated_host_listings_count_entire_homes', 
                'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms']


In [9]:
def metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    """Regression metrics computed from predictions and true labels."""
    mse=mean_squared_error(labels, preds)
    rmse=np.sqrt(mse)  # avoids the `squared` argument, which recent scikit-learn versions deprecate
    mae=mean_absolute_error(labels, preds)
    r2=r2_score(labels, preds)

    return {'mse': mse,
            'rmse': rmse,
            'mae': mae,
            'r2': r2}

In [10]:
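def compute_metrics() -> Callable[[EvalPrediction], Dict]:
    """Build the metrics callback that is passed to `Trainer`."""

    def compute_metrics_fn(p: EvalPrediction):
        # predictions arrive with shape (n, 1); squeeze them to (n,) to match the labels
        preds=np.squeeze(p.predictions)
        return metrics(preds, p.label_ids)

    return compute_metrics_fn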
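
The `np.squeeze` call is there because a regression head returns predictions of shape `(n, 1)` while the labels have shape `(n,)`. The sketch below uses a `namedtuple` as a stand-in for `EvalPrediction` (an assumption made so the example runs without `transformers`) and toy numbers to show the shapes involved. With plain NumPy arithmetic, skipping the squeeze silently broadcasts the subtraction to an `(n, n)` matrix.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for transformers.EvalPrediction, for illustration only.
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])

p = FakeEvalPrediction(
    predictions=np.array([[70.0], [20.0], [55.0]]),  # shape (3, 1), as a regression head returns
    label_ids=np.array([72.0, 17.0, 50.0]),          # shape (3,)
)

wrong = p.predictions - p.label_ids   # broadcasting yields a (3, 3) matrix
preds = np.squeeze(p.predictions)     # shape (3,)
errors = preds - p.label_ids          # element-wise errors: [-2, 3, 5]
mae = np.abs(errors).mean()

print(wrong.shape, preds.shape, errors)
```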

In [11]:
# tokenizer

tokenizer=AutoTokenizer.from_pretrained('bert-base-multilingual-uncased', truncation=True)

In [12]:
# data loading, as PyTorch tensors

train_data, eval_data, test_data=load_data_from_folder('../data/nlp_data/',
                                                       text_cols,
                                                       tokenizer,
                                                       label_col=label_col,
                                                       categorical_cols=categorical_cols,
                                                       numerical_cols=numerical_cols,
                                                       categorical_encode_type='binary',
                                                       sep_text_token_str=tokenizer.sep_token, 
                                                       max_token_length=512)

In [13]:
# model configuration and loading

config=AutoConfig.from_pretrained('bert-base-multilingual-uncased')

tabular_config=TabularConfig(num_labels=1,  # regression
                             cat_feat_dim=train_data.cat_feats.shape[1],
                             numerical_feat_dim=train_data.numerical_feats.shape[1],
                             combine_feat_method='weighted_feature_sum_on_transformer_cat_and_numerical_feats')

config.tabular_config=tabular_config

modelo=AutoModelWithTabular.from_pretrained('bert-base-multilingual-uncased', config=config)

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertWithTabular: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertWithTabular from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertWithTabular from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertWithTabular were not initialized from the model checkpoint at bert-base-multilingual-uncased and are new

In [14]:
# training

args=TrainingArguments(num_train_epochs=1,
                       per_device_train_batch_size=32,
                       evaluate_during_training=True,  # older transformers argument; later releases use evaluation_strategy instead
                       logging_steps=25,
                       overwrite_output_dir=True,
                       output_dir='../data/nlp_data/logs/model_name',
                       logging_dir='../data/nlp_data/logs/runs')



device=torch.device('mps')  # 'mps' is the Metal GPU backend on Apple Silicon (e.g. M1); use 'cuda' or 'cpu' elsewhere


trainer=Trainer(model=modelo.to(device), 
                args=args, 
                train_dataset=train_data,
                eval_dataset=eval_data,
                compute_metrics=compute_metrics())


trainer.train()
trainer.save_model()

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/94 [00:00<?, ?it/s]

{'loss': 198560.42, 'learning_rate': 3.670212765957447e-05, 'epoch': 0.26595744680851063, 'step': 25}
{'loss': 28450.82, 'learning_rate': 2.340425531914894e-05, 'epoch': 0.5319148936170213, 'step': 50}
{'loss': 59570.24, 'learning_rate': 1.0106382978723404e-05, 'epoch': 0.7978723404255319, 'step': 75}
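
The logged learning rates are consistent with a linear decay from 5e-5 to zero over the 94 training steps, with no warmup. The sketch below reproduces them under those assumptions. The initial rate and the schedule are inferred from the log, since neither is set explicitly in the notebook, and `linear_lr` is a hypothetical helper.

```python
import math

def linear_lr(step: int, total_steps: int, initial_lr: float = 5e-5) -> float:
    """Learning rate after `step` updates under linear decay to zero (no warmup)."""
    return initial_lr * (total_steps - step) / total_steps

# 3000 training rows with a batch size of 32 give 94 steps per epoch.
total_steps = math.ceil(3000 / 32)

for step in (25, 50, 75):
    # columns: step, fraction of the epoch completed, learning rate
    print(step, step / total_steps, linear_lr(step, total_steps))
```

The second column matches the `epoch` values in the log, and the third matches `learning_rate` up to floating-point rounding.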


In [15]:
resultados=[]

In [16]:
# evaluation on the validation data

eval_res=trainer.evaluate(eval_dataset=eval_data)

resultados.append({'eval': eval_res})

Evaluation:   0%|          | 0/63 [00:00<?, ?it/s]

{'eval_loss': 21944.183775390626, 'eval_mse': 21944.183750223598, 'eval_rmse': 148.13569370757205, 'eval_mae': 79.71226263237, 'eval_r2': -0.40728843794817116, 'epoch': 1.0, 'step': 94}


In [17]:
# prediction

y_pred=trainer.predict(test_dataset=test_data).predictions

y_pred[:5]

Prediction:   0%|          | 0/38 [00:00<?, ?it/s]

array([[12.737488],
       [12.422106],
       [11.73135 ],
       [12.513758],
       [12.612514]], dtype=float32)

In [18]:
# evaluation on the test data

test_res=trainer.evaluate(eval_dataset=test_data)

resultados.append({'test': test_res})

Evaluation:   0%|          | 0/38 [00:00<?, ?it/s]

{'eval_loss': 436301.16282714845, 'eval_mse': 436301.22198325617, 'eval_rmse': 660.5310151561819, 'eval_mae': 129.84854205449423, 'eval_r2': -0.040146848582289785, 'epoch': 1.0, 'step': 94}


In [19]:
resultados

[{'eval': {'eval_loss': 21944.183775390626,
   'eval_mse': 21944.183750223598,
   'eval_rmse': 148.13569370757205,
   'eval_mae': 79.71226263237,
   'eval_r2': -0.40728843794817116,
   'epoch': 1.0}},
 {'test': {'eval_loss': 436301.16282714845,
   'eval_mse': 436301.22198325617,
   'eval_rmse': 660.5310151561819,
   'eval_mae': 129.84854205449423,
   'eval_r2': -0.040146848582289785,
   'epoch': 1.0}}]

The model was trained for a single epoch on only 3,000 rows, so these metrics are not representative. R² is negative on both sets, which means the mean squared error exceeds the variance of the target: the model does worse than always predicting the mean price. The predictions, with the first few all close to 12, are not meaningful either. Even so, a stacked model like this one could work very well. Given that R² goes from -0.41 on the validation set to -0.04 on the test set, I am inclined to think that training the model for several dozen epochs would give reasonable results.
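
A negative R² is easy to reproduce. Since R² = 1 - SS_res / SS_tot, it drops below zero whenever the squared error of the model exceeds that of always predicting the mean of the labels. The sketch below uses made-up prices and a constant prediction of 12, similar in spirit to the predictions printed earlier. The numbers are illustrative and not taken from the dataset.

```python
import numpy as np

def r2(labels: np.ndarray, preds: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((labels - preds) ** 2)          # squared error of the model
    ss_tot = np.sum((labels - labels.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

labels = np.array([70.0, 17.0, 50.0, 80.0, 115.0])   # toy prices
constant_low = np.full_like(labels, 12.0)            # a model stuck near 12
mean_baseline = np.full_like(labels, labels.mean())  # always predict the mean

print(r2(labels, constant_low))   # negative: worse than the mean baseline
print(r2(labels, mean_baseline))  # 0.0
```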