# Multimodal Deep Learning for Recommendation (Hands-On Session)

⭐ **The 2024 ACM RecSys Summer School** ⭐

*Bari (Italy), October 09, 2024*

<div>
  <img src="https://recsys.acm.org/wp-content/uploads/2023/11/RecSysBanner_1000_180.png" alt="SisInfLab" width="600">
  <img src="https://recsys.acm.org/wp-content/uploads/2020/07/Recsys-OG.png" alt="recsys" width="200">
</div>

🧑 Instructor: [Daniele Malitesta](https://danielemalitesta.github.io/)

💳 Credits: I'd like to thank [Matteo Attimonelli](mailto:matteo.attimonelli@poliba.it), [Danilo Danese](mailto:danilo.danese@poliba.it), and [Angela Di Fazio](mailto:angela.difazio@poliba.it) for their amazing work in making the code well-structured and smooth to run!

If you use this code for your experiments, please cite our recent work (on arXiv) 🙏

![GitHub Repo stars](https://img.shields.io/github/stars/sisinflab/Ducho-meets-Elliot)
 [![arXiv](https://img.shields.io/badge/arXiv-2409.15857-b31b1b.svg)](https://arxiv.org/abs/2409.15857)

 <img src="https://github.com/sisinflab/Ducho-meets-Elliot/blob/master/framework.png?raw=true"  width="700">

```
@article{DBLP:journals/corr/abs-2409-15857,
  author       = {Matteo Attimonelli and
                  Danilo Danese and
                  Angela Di Fazio and
                  Daniele Malitesta and
                  Claudio Pomo and
                  Tommaso Di Noia},
  title        = {Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation},
  journal      = {CoRR},
  volume       = {abs/2409.15857},
  year         = {2024}
}
```

## Clone the repository

First, we clone the repository to exploit the Ducho + Elliot experimental environment 🐑

In [None]:
!git clone --recursive https://github.com/sisinflab/Ducho-meets-Elliot.git
%cd Ducho-meets-Elliot/

## Set up the working environment

Now, we set up the environment to run the experiments. Conveniently, Google Colab already provides most of the packages we need, so we only install the remaining ones 😎

In [None]:
!pip install pandas loguru alive_progress sentence_transformers
!pip install torch_geometric
!pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cu121.html
!python -m pip install -U prettytable

## Download and visualize the multimodal recommendation datasets

We're now set to download the multimodal recommendation dataset. For the sake of this lecture, we consider the popular **[Amazon Product Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews)** dataset 🛒

Specifically, we consider the following product categories:

| **Datasets**     | **# Users** | **# Items** | **# Interactions** | **Sparsity (%)** |
|------------------|-------------|-------------|--------------------|------------------|
| Office Products  | 4,471       | 1,703       | 20,608             | 99.73%           |
| Digital Music    | 5,082       | 2,338       | 30,623             | 99.74%           |
| Baby             | 19,100      | 6,283       | 80,931             | 99.93%           |
| Toys & Games     | 19,241      | 11,101      | 89,558             | 99.96%           |
| Beauty           | 21,752      | 11,145      | 100,834            | 99.96%           |

For the sake of this hands-on session, we will consider only the "Office Products" dataset!
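As a quick sanity check on the table above, the sparsity column follows from the other three: it is the percentage of user-item pairs with no recorded interaction. Here is a minimal sketch (the function name `sparsity` is ours, not part of Ducho or Elliot):

```python
def sparsity(n_users, n_items, n_interactions):
    """Percentage of the user-item matrix that is empty."""
    density = n_interactions / (n_users * n_items)
    return 100 * (1 - density)

# Statistics copied from the table above: (users, items, interactions)
datasets = {
    'Office Products': (4471, 1703, 20608),
    'Digital Music': (5082, 2338, 30623),
    'Baby': (19100, 6283, 80931),
    'Toys & Games': (19241, 11101, 89558),
    'Beauty': (21752, 11145, 100834),
}

for name, stats in datasets.items():
    print(f'{name}: {sparsity(*stats):.2f}%')
```

Even the densest dataset here (Office Products) has fewer than 3 observed interactions per 1,000 user-item pairs, which is one reason why item side information such as images and descriptions can help.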

In [None]:
# RUN THIS CELL ONLY IF YOU HAVE TIME TO WASTE :-)

%cd Ducho
!python3 ./demos/demo_office/prepare_dataset.py

In [None]:
# OTHERWISE, HERE'S AN ALREADY-PREPARED VERSION OF THE DATASET

import gdown

%cd Ducho
!mkdir -p local/data
gdown.download(f'https://drive.google.com/uc?id=1DSe7osyJ5dmXRsgOkdDMPeHPHCk7eezv', 'demo_office.zip', quiet=False)
!mv demo_office.zip local/data/
!unzip local/data/demo_office.zip
%rm local/data/demo_office.zip

We can visualize one random item from the dataset 👓

In [None]:
import pandas as pd
import random
from matplotlib import pyplot as plt
from PIL import Image
from IPython.display import display, HTML

meta = pd.read_csv('local/data/demo_office/meta.tsv', sep='\t')
random_item = random.choice(meta['asin'].tolist())
description = meta[meta["asin"]==random_item]["description"].values[0]
display(HTML(f"ASIN: {random_item}\n<div style='white-space: pre-wrap; width: 100%;'>{description}</div>"))
img = Image.open(f'local/data/demo_office/images/{random_item}.jpg')
plt.imshow(img)
plt.axis('off')
plt.show()

from prettytable import PrettyTable
table = PrettyTable()

table.field_names = ["USER", "Rating"]
reviews = pd.read_csv('local/data/demo_office/reviews.tsv', sep='\t', header=None)
current_reviews = reviews[reviews[1]==random_item]

for idx, row in current_reviews.iterrows():
  table.add_row([row[0], row[2]])

display(HTML('Clicked by:'))
print(table)

## Check if GPU is available

Before running any GPU-bound process, let's check if the GPU is available:

In [None]:
!nvidia-smi
!nvcc --version
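The same check can be done from Python, which is handy if you want later cells to react to a missing GPU. This is a small sketch: `describe_gpu` is a hypothetical helper of ours, and it only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
import importlib.util

def describe_gpu():
    """Return a short, human-readable description of GPU availability."""
    # Avoid an ImportError if PyTorch is missing from the environment
    if importlib.util.find_spec('torch') is None:
        return 'PyTorch is not installed'
    import torch
    if not torch.cuda.is_available():
        return 'No CUDA device visible to PyTorch'
    return f'Using GPU: {torch.cuda.get_device_name(0)}'

gpu_status = describe_gpu()
print(gpu_status)
```

On Colab, if no CUDA device is reported, switching the runtime type to a GPU one and re-running the notebook is usually enough.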

## Extract multimodal features with Ducho

![GitHub Repo stars](https://img.shields.io/github/stars/sisinflab/Ducho)
 [![arXiv](https://img.shields.io/badge/arXiv-2403.04503-b31b1b.svg)](https://arxiv.org/abs/2403.04503)

If you use Ducho for your experiments, please cite our papers 🙏

<div>
  <img src="https://github.com/sisinflab/Ducho/raw/main/docs/source/img/ducho_v2_overview.png" alt="duccio" width="800">
</div>

```
@inproceedings{DBLP:conf/www/AttimonelliDMPG24,
  author       = {Matteo Attimonelli and
                  Danilo Danese and
                  Daniele Malitesta and
                  Claudio Pomo and
                  Giuseppe Gassi and
                  Tommaso Di Noia},
  title        = {Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction
                  of Multimodal Features in Recommendation},
  booktitle    = {{WWW} (Companion Volume)},
  pages        = {1075--1078},
  publisher    = {{ACM}},
  year         = {2024}
}
```

```
@inproceedings{DBLP:conf/mm/MalitestaGPN23,
  author       = {Daniele Malitesta and
                  Giuseppe Gassi and
                  Claudio Pomo and
                  Tommaso Di Noia},
  title        = {Ducho: {A} Unified Framework for the Extraction of Multimodal Features
                  in Recommendation},
  booktitle    = {{ACM} Multimedia},
  pages        = {9668--9671},
  publisher    = {{ACM}},
  year         = {2023}
}
```

Now, we are all set to extract multimodal product features through Ducho 🦾

In [None]:
# CREATE THE CONFIGURATION FILE FOR DUCHO
def create_config():
  import yaml

  config_ducho = """dataset_path: ./local/data/demo_office
gpu list: 0
visual:
    items:
        input_path: images
        output_path: visual_embeddings_32
        model: [
            { model_name: ResNet50,
              output_layers: avgpool,
              reshape: [224, 224],
              preprocessing: zscore,
              backend: torch,
              batch_size: 32
            }
        ]

textual:
    items:
        input_path: meta.tsv
        item_column: asin
        text_column: description
        output_path: textual_embeddings_32
        model: [
          { model_name: sentence-transformers/all-mpnet-base-v2,
              output_layers: 1,
              clear_text: False,
              backend: sentence_transformers,
              batch_size: 32
          }
        ]

visual_textual:
    items:
        input_path: {visual: images, textual: meta.tsv}
        item_column: asin
        text_column: description
        output_path: {visual: visual_embeddings_32, textual: textual_embeddings_32}
        model: [
          { model_name: openai/clip-vit-base-patch16,
              backend: transformers,
              output_layers: 1,
              batch_size: 32
          }
        ]

  """

  ducho_dir = f"demos/demo_office/config.yml"
  with open(ducho_dir, 'w') as conf_file:
      conf_file.write(config_ducho)

# RUN THE EXTRACTION WITH DUCHO
from ducho.runner.Runner import MultimodalFeatureExtractor
import torch
import os
import numpy as np
import random

def set_seed(seed = 42):
    """Set all seeds to make results reproducible (deterministic mode).

    :param seed: an integer of your choosing
    """
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ":16:8"


def main():
    set_seed()
    extractor_obj = MultimodalFeatureExtractor(config_file_path='./demos/demo_office/config.yml')
    extractor_obj.execute_extractions()


if __name__ == '__main__':
    create_config()
    main()
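Judging from the paths used later in this notebook, Ducho appears to store each model's embeddings under `<output_path>/<backend>/<model_name>/<output_layer>` inside the dataset folder. The helper below is hypothetical (it is not part of Ducho's API) and simply rebuilds those paths from the configuration values, so you know where to look once the extraction ends:

```python
def expected_output_dir(output_path, backend, model_name, output_layer):
    """Build the folder where the embeddings of one extractor are expected.

    Assumption: the layout is inferred from the paths used in this notebook,
    not taken from Ducho's documentation.
    """
    return '/'.join([output_path, backend, model_name, str(output_layer)])

# One entry per extractor declared in the configuration above
extractors = [
    ('visual_embeddings_32', 'torch', 'ResNet50', 'avgpool'),
    ('textual_embeddings_32', 'sentence_transformers',
     'sentence-transformers/all-mpnet-base-v2', 1),
    ('visual_embeddings_32', 'transformers', 'openai/clip-vit-base-patch16', 1),
    ('textual_embeddings_32', 'transformers', 'openai/clip-vit-base-patch16', 1),
]

output_dirs = [expected_output_dir(*extractor) for extractor in extractors]
for folder in output_dirs:
    print(folder)
```

Note that model names containing a slash (e.g., `openai/clip-vit-base-patch16`) end up as nested folders.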

## Dataset splitting and features mapping in Elliot

![GitHub Repo stars](https://img.shields.io/github/stars/sisinflab/Formal-Multimod-Rec)
 [![arXiv](https://img.shields.io/badge/arXiv-2309.05273-b31b1b.svg)](https://arxiv.org/abs/2309.05273)

If everything went smoothly with the feature extraction, we can now: (i) split the original dataset into train/validation/test sets ✂ and (ii) map the multimodal item features to ids aligned with the training set 🗾

To this end, we will use Elliot, our framework for rigorous and reproducible recommender systems evaluation.

If you find it useful for your research, please cite our works 🙏



```
@article{10.1145/3662738,
author = {Malitesta, Daniele and Cornacchia, Giandomenico and Pomo, Claudio and Merra, Felice Antonio and Di Noia, Tommaso and Di Sciascio, Eugenio},
title = {Formalizing Multimedia Recommendation through Multimodal Deep Learning},
year = {2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3662738},
doi = {10.1145/3662738},
note = {Just Accepted},
journal = {ACM Trans. Recomm. Syst.},
month = {apr},
keywords = {Multimodal Deep Learning, Multimedia Recommender Systems, Benchmarking}
}
```



```
@inproceedings{DBLP:conf/sigir/AnelliBFMMPDN21,
  author       = {Vito Walter Anelli and
                  Alejandro Bellog{\'{\i}}n and
                  Antonio Ferrara and
                  Daniele Malitesta and
                  Felice Antonio Merra and
                  Claudio Pomo and
                  Francesco Maria Donini and
                  Tommaso Di Noia},
  title        = {Elliot: {A} Comprehensive and Rigorous Framework for Reproducible
                  Recommender Systems Evaluation},
  booktitle    = {{SIGIR}},
  pages        = {2405--2414},
  publisher    = {{ACM}},
  year         = {2021}
}
```



**Dataset splitting**

* Train = 80% of the dataset

* Test = 20% of the dataset

* Validation = 10% of the training set
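Since the validation set is carved out of the training portion, the final proportions are not 80/10/20. The sketch below works out the effective split for Office Products, under the simplifying assumption that the two ratios are applied to the whole interaction list at once; the counts Elliot actually produces may differ slightly depending on how it applies the ratios.

```python
n_interactions = 20608   # Office Products, from the table above
test_ratio = 0.2         # share of the dataset held out for testing
validation_ratio = 0.1   # share of the remaining training set used for validation

# First cut: test vs. (train + validation)
n_test = round(test_ratio * n_interactions)
n_train_val = n_interactions - n_test

# Second cut: validation vs. final training set
n_val = round(validation_ratio * n_train_val)
n_train = n_train_val - n_val

for name, count in [('train', n_train), ('validation', n_val), ('test', n_test)]:
    print(f'{name}: {count} interactions ({100 * count / n_interactions:.1f}%)')
```

In other words, the model is trained on roughly 72% of the interactions, validated on about 8%, and tested on 20%.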

In [None]:
%cd ..

%mv ./Ducho/local/data/demo_office/visual_embeddings_32 ./data/office
%mv ./Ducho/local/data/demo_office/textual_embeddings_32 ./data/office

%cp ./Ducho/local/data/demo_office/reviews.tsv ./data/office

split_config = '''
experiment:
  backend: pytorch
  data_config:
    strategy: dataset
    dataset_path: ../data/office/reviews.tsv
  splitting:
    save_on_disk: True
    save_folder: ../data/office_splits/
    test_splitting:
      strategy: random_subsampling
      test_ratio: 0.2
    validation_splitting:
      strategy: random_subsampling
      test_ratio: 0.1
  dataset: office
  top_k: 20
  evaluation:
    cutoffs: [ 10, 20 ]
    simple_metrics: [ Recall, nDCG ]
  gpu: 0
  external_models_path: ../external/models/__init__.py
  models:
    MostPop:
      meta:
        verbose: True
        save_recs: False
'''

split_dir = f"./config_files/split_office.yml"
with open(split_dir, 'w') as conf_file:
    conf_file.write(split_config)

# SPLIT INTO TRAIN/VAL/TEST
%env CUBLAS_WORKSPACE_CONFIG=:16:8
!python3 run_split.py --dataset office

%cd ./data/office

# MAP ITEMS TO NUMERICAL IDS
train = pd.read_csv('train.tsv', sep='\t', header=None)
val = pd.read_csv('val.tsv', sep='\t', header=None)
test = pd.read_csv('test.tsv', sep='\t', header=None)

df = pd.concat([train, val, test], axis=0)

users = df[0].unique()
items = df[1].unique()

users_map = {u: idx for idx, u in enumerate(users)}
items_map = {i: idx for idx, i in enumerate(items)}

train[0] = train[0].map(users_map)
train[1] = train[1].map(items_map)

val[0] = val[0].map(users_map)
val[1] = val[1].map(items_map)

test[0] = test[0].map(users_map)
test[1] = test[1].map(items_map)

train.to_csv('train_indexed.tsv', sep='\t', index=False, header=None)
val.to_csv('val_indexed.tsv', sep='\t', index=False, header=None)
test.to_csv('test_indexed.tsv', sep='\t', index=False, header=None)

visual_embeddings_folder = f'visual_embeddings_32/torch/ResNet50/avgpool'
textual_embeddings_folder = f'textual_embeddings_32/sentence_transformers/sentence-transformers/all-mpnet-base-v2/1'

visual_embeddings_folder_indexed = f'visual_embeddings_indexed_32/torch/ResNet50/avgpool'
textual_embeddings_folder_indexed = f'textual_embeddings_indexed_32/sentence_transformers/sentence-transformers/all-mpnet-base-v2/1'

# Create the output folders (no error if they already exist)
os.makedirs(visual_embeddings_folder_indexed, exist_ok=True)
os.makedirs(textual_embeddings_folder_indexed, exist_ok=True)

for key, value in items_map.items():
    np.save(f'{visual_embeddings_folder_indexed}/{value}.npy', np.load(f'{visual_embeddings_folder}/{key}.npy'))
    np.save(f'{textual_embeddings_folder_indexed}/{value}.npy', np.load(f'{textual_embeddings_folder}/{key}.npy'))


visual_embeddings_folder = f'visual_embeddings_32/transformers/openai/clip-vit-base-patch16/1'
textual_embeddings_folder = f'textual_embeddings_32/transformers/openai/clip-vit-base-patch16/1'

visual_embeddings_folder_indexed = f'visual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'
textual_embeddings_folder_indexed = f'textual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'

# Create the output folders (no error if they already exist)
os.makedirs(visual_embeddings_folder_indexed, exist_ok=True)
os.makedirs(textual_embeddings_folder_indexed, exist_ok=True)

for key, value in items_map.items():
    np.save(f'{visual_embeddings_folder_indexed}/{value}.npy', np.load(f'{visual_embeddings_folder}/{key}.npy'))
    np.save(f'{textual_embeddings_folder_indexed}/{value}.npy', np.load(f'{textual_embeddings_folder}/{key}.npy'))


%cd ../../

After these steps, the part of the multimodal dataset that Elliot will use has the following structure:

```
├── office
│   ├── visual_embeddings_indexed_32
│   │   ├── torch
│   │   │   ├── ResNet50
│   │   │   │   ├── avgpool
│   │   │   │   │   ├── 0.npy
│   │   │   │   │   ├── 1.npy
│   │   │   │   │   ├── ...
│   │   ├── transformers
│   │   │   ├── openai
│   │   │   │   ├── clip-vit-base-patch16
│   │   │   │   │   ├── 1
│   │   │   │   │   │   ├── 0.npy
│   │   │   │   │   │   ├── 1.npy
│   │   │   │   │   │   ├── ...
│   ├── textual_embeddings_indexed_32
│   │   ├── sentence_transformers
│   │   │   ├── sentence-transformers
│   │   │   │   ├── all-mpnet-base-v2
│   │   │   │   │   ├── 1
│   │   │   │   │   │   ├── 0.npy
│   │   │   │   │   │   ├── 1.npy
│   │   │   │   │   │   ├── ...
│   │   ├── transformers
│   │   │   ├── openai
│   │   │   │   ├── clip-vit-base-patch16
│   │   │   │   │   ├── 1
│   │   │   │   │   │   ├── 0.npy
│   │   │   │   │   │   ├── 1.npy
│   │   │   │   │   │   ├── ...
│   ├── train_indexed.tsv
│   ├── val_indexed.tsv
│   ├── test_indexed.tsv
```
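Before training, it can be worth verifying that every item id has its own `.npy` file in each indexed folder, since a missing file would only surface later as a loading error. The helper `missing_item_ids` below is a hypothetical utility of ours, demonstrated on a temporary toy folder rather than on the real data:

```python
import os
import tempfile

def missing_item_ids(folder, n_items):
    """Return the ids in [0, n_items) that have no `<id>.npy` file in `folder`."""
    present = {name for name in os.listdir(folder) if name.endswith('.npy')}
    return [item_id for item_id in range(n_items) if f'{item_id}.npy' not in present]

# Toy demonstration: create empty files for ids 0, 1 and 3, leaving out id 2
with tempfile.TemporaryDirectory() as tmp:
    for item_id in (0, 1, 3):
        open(os.path.join(tmp, f'{item_id}.npy'), 'wb').close()
    missing = missing_item_ids(tmp, n_items=4)

print(missing)
```

On the real dataset, you would call it once per embeddings folder with `n_items=len(items_map)` and expect an empty list.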



## Configure and run the experiments

Let's set the hyper-parameters for the model to be trained and tested. We focus on a modified version of VBPR that takes multimodal (rather than only visual) features as input, and we train and evaluate it on the Amazon Office Products dataset.
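To make the model less of a black box: the original VBPR scores a user-item pair as a collaborative dot product plus a dot product between the user's content preferences and a learned projection of the item features. The sketch below assumes that the `comb_mod: concat` option used in the configurations means that visual and textual vectors are concatenated before the projection. This is our reading of the option name, not a transcription of the code in `external/models`. Bias terms are omitted, parameters are random, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, factors = 4, 5, 8
visual_dim, textual_dim = 16, 12  # toy sizes, not the real embedding sizes

# Pre-extracted item features (random stand-ins for the .npy files)
visual = rng.normal(size=(n_items, visual_dim))
textual = rng.normal(size=(n_items, textual_dim))

# Assumed 'concat' combination: one multimodal vector per item
features = np.concatenate([visual, textual], axis=1)

# Parameters that training would learn (random here)
gamma_u = rng.normal(size=(n_users, factors))   # collaborative user factors
gamma_i = rng.normal(size=(n_items, factors))   # collaborative item factors
theta_u = rng.normal(size=(n_users, factors))   # user content preferences
E = rng.normal(size=(visual_dim + textual_dim, factors))  # feature projection

def score(users):
    """VBPR-style scores of the given users for all items (biases omitted)."""
    theta_i = features @ E  # project item features into the latent space
    return gamma_u[users] @ gamma_i.T + theta_u[users] @ theta_i.T

scores = score(np.arange(n_users))
print(scores.shape)
```

During training, these scores would feed a BPR pairwise loss that pushes each observed item above a sampled unobserved one for the same user.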

### First multimodal features configuration

We start with the configuration ResNet50 (visual) + SentenceBert (textual), the most common one from the literature.

In [None]:
import yaml

config_filename = 'hands-on_resnet50_sentencebert'
elliot_config_1 = {
  'experiment': {
    'backend': 'pytorch',
    'data_config': {
      'strategy': 'fixed',
      'train_path': '../data/{0}/train_indexed.tsv',
      'validation_path': '../data/{0}/val_indexed.tsv',
      'test_path': '../data/{0}/test_indexed.tsv',
      'side_information': [
        {
            'dataloader': 'VisualAttribute',
            'visual_features': '../data/{0}/visual_embeddings_indexed_32/torch/ResNet50/avgpool'
        },
        {
            'dataloader': 'TextualAttribute',
            'textual_features': '../data/{0}/textual_embeddings_indexed_32/sentence_transformers/sentence-transformers/all-mpnet-base-v2/1'
        }
      ]
    },
    'dataset': 'office',
    'top_k': 20,
    'evaluation': {
      'cutoffs': [20],
      'simple_metrics': ['Recall', 'Precision', 'nDCG', 'HR']
    },
    'gpu': 0,
    'external_models_path': '../external/models/__init__.py',
    'models': {
      'external.VBPR': {
        'meta': {
          'hyper_opt_alg': 'grid',
          'verbose': True,
          'save_weights': False,
          'save_recs': False,
          'validation_rate': 10,
          'validation_metric': 'Recall@20',
          'restore': False
        },
        'epochs': 200,
        'batch_size': 1024,
        'factors': 64,
        'lr': 0.005,
        'l_w': 1e-5,
        'n_layers': 1,
        'comb_mod': 'concat',
        'modalities': "('visual','textual')",
        'loaders': "('VisualAttribute','TextualAttribute')",
        'seed': 123
      }
    }
  }
}

with open(f'config_files/{config_filename}.yml', 'w') as file:
    yaml.dump(elliot_config_1, file)
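The `simple_metrics` requested in the configuration are all computed on the top-k list recommended to each user and then averaged over users. As a reminder of what they measure, here is a minimal sketch for a single user, using the standard binary-relevance definitions. It is an illustration only: Elliot's own implementation may differ in details, and the function name `metrics_at_k` is ours.

```python
import math

def metrics_at_k(ranked_items, relevant_items, k=20):
    """Recall, Precision, nDCG and HR at cutoff k for one user."""
    top_k = ranked_items[:k]
    hits = [1 if item in relevant_items else 0 for item in top_k]
    n_hits = sum(hits)

    recall = n_hits / len(relevant_items) if relevant_items else 0.0
    precision = n_hits / k
    hit_ratio = 1.0 if n_hits > 0 else 0.0

    # DCG discounts each hit by the log of its (1-based) position + 1
    dcg = sum(hit / math.log2(rank + 2) for rank, hit in enumerate(hits))
    # Ideal DCG: all relevant items ranked at the very top
    idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant_items), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {'Recall': recall, 'Precision': precision, 'nDCG': ndcg, 'HR': hit_ratio}

# Toy user: 5 recommended items, 3 held-out test items, 2 of which are retrieved
result = metrics_at_k(ranked_items=[3, 7, 1, 9, 4], relevant_items={7, 4, 8}, k=5)
print(result)
```

With this reading, an average HR@20 of 0.14 means that about 14% of the users received at least one relevant item in their top-20 list.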

### Run Elliot

Now we are all set to run an experiment with VBPR on Amazon Office with the first multimodal configuration.

In [None]:
from elliot.run import run_experiment

run_experiment(f"config_files/hands-on_resnet50_sentencebert.yml")

### Second multimodal features configuration

Next, we prepare and run the second configuration of multimodal feature extractors. In this case, we use CLIP, a multimodal model that is popular in the deep learning literature but largely overlooked in the recommendation community.



In [None]:
config_filename = 'hands-on_clip'
elliot_config_2 = {
  'experiment': {
    'backend': 'pytorch',
    'data_config': {
      'strategy': 'fixed',
      'train_path': '../data/{0}/train_indexed.tsv',
      'validation_path': '../data/{0}/val_indexed.tsv',
      'test_path': '../data/{0}/test_indexed.tsv',
      'side_information': [
        {
            'dataloader': 'VisualAttribute',
            'visual_features': '../data/{0}/visual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'
        },
        {
            'dataloader': 'TextualAttribute',
            'textual_features': '../data/{0}/textual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'
        }
      ]
    },
    'dataset': 'office',
    'top_k': 20,
    'evaluation': {
      'cutoffs': [20],
      'simple_metrics': ['Recall', 'Precision', 'nDCG', 'HR']
    },
    'gpu': 0,
    'external_models_path': '../external/models/__init__.py',
    'models': {
      'external.VBPR': {
        'meta': {
          'hyper_opt_alg': 'grid',
          'verbose': True,
          'save_weights': False,
          'save_recs': False,
          'validation_rate': 10,
          'validation_metric': 'Recall@20',
          'restore': False
        },
        'epochs': 200,
        'batch_size': 1024,
        'factors': 64,
        'lr': 0.005,
        'l_w': 1e-5,
        'n_layers': 1,
        'comb_mod': 'concat',
        'modalities': "('visual','textual')",
        'loaders': "('VisualAttribute','TextualAttribute')",
        'seed': 123
      }
    }
  }
}

with open(f'config_files/{config_filename}.yml', 'w') as file:
    yaml.dump(elliot_config_2, file)

### Run Elliot

Now we are all set to run an experiment with VBPR on Amazon Office with the second multimodal configuration.

In [None]:
from elliot.run import run_experiment

run_experiment(f"config_files/hands-on_clip.yml")

## Final comments

Different multimodal feature extractors lead to different results. With CLIP, we obtain (in most cases) better recommendation performance than with the usual ResNet50 + SentenceBert combination, even without exploring a wide hyper-parameter space. In the run reported below, the gain is small but consistent across all four metrics (e.g., Recall@20 goes from 0.0737 to 0.0739).





```
# First configuration (ResNet50 + SentenceBert)

{20: {'Recall': 0.07367239361200444, 'Precision': 0.008286736747931112, 'nDCG': 0.034714471891791276, 'HR': 0.1433683739655558}}
```



```
# Second configuration (CLIP)

{20: {'Recall': 0.07386561970547656, 'Precision': 0.008443301274882577, 'nDCG': 0.03508305857211589, 'HR': 0.14538134645493178}}
```
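To put the two result dictionaries side by side, the snippet below computes the relative change of each metric when moving from the first configuration to the second. The numbers are copied from the outputs above; the variable names are ours.

```python
# Results at cutoff 20, copied from the two runs above
resnet_sbert = {'Recall': 0.07367239361200444, 'Precision': 0.008286736747931112,
                'nDCG': 0.034714471891791276, 'HR': 0.1433683739655558}
clip = {'Recall': 0.07386561970547656, 'Precision': 0.008443301274882577,
        'nDCG': 0.03508305857211589, 'HR': 0.14538134645493178}

# Relative change (in %) of CLIP over ResNet50 + SentenceBert
relative_gain = {
    metric: 100 * (clip[metric] - resnet_sbert[metric]) / resnet_sbert[metric]
    for metric in resnet_sbert
}

for metric, gain in relative_gain.items():
    print(f'{metric}@20: {gain:+.2f}%')
```

Keep in mind that these figures come from a single run with a single seed and one hyper-parameter setting, so differences of this size should be confirmed with a wider search before drawing firm conclusions.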

