Recommender Systems Implementation

This repository provides a general framework for preparing datasets and building various recommender systems. It was mainly used to evaluate different RecSys models and to train the BERT4Rec model based on the original paper [1].

The model is implemented in PyTorch.

Getting Started

This repository is tested on Python 3.9 and PyTorch 1.12.0 with CUDA 11.6.

To prepare the working environment, run the following commands:

  1. Clone the repository:
git clone https://github.com/amrohendawi/recSys_framework

cd recSys_framework
  2. Create a virtual environment and activate it:
python3 -m venv venv

source venv/bin/activate
  3. Install the requirements:
pip install -r requirements.txt

Description of the Code Structure

The source code is divided into the following files/folders:

  • config: contains configuration files specific to a model/dataset/evaluation.
  • saved: contains saved models and dataloaders.
  • datasets: contains raw dataset files that still need to be processed.
  • training_data: contains processed, atomic-format dataset files that can be fed into the model.
  • log: contains the log files of the training process.
  • log_tensorboard: contains the TensorBoard logs of the training process.
  • wandb: contains the Weights & Biases logs of the training process.
  • preprocess_dataset.py: preprocesses the raw dataset files into atomic-format dataset files.
  • run.py: runs/trains/evaluates models.

Dataset

The datasets used in the examples are the H&M and ML-100k datasets. The raw dataset files must be stored in the datasets folder. After running preprocess_dataset.py, the atomic-format dataset files are stored in the training_data folder.

To use a different dataset, it must be in atomic format.
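RecBole's atomic format is tab-separated text whose header labels each column as `field_name:field_type`. As a rough illustration (the input column names and sample rows below are hypothetical, not taken from this repo's preprocessing code), a plain ratings CSV could be converted to an .inter file like this:

```python
import csv
import io

# Sketch: convert a plain ratings CSV into atomic .inter content.
# Atomic files are tab-separated; each header cell is "field_name:field_type".
ATOMIC_HEADER = ["user_id:token", "item_id:token", "rating:float", "timestamp:float"]

def csv_to_atomic_inter(csv_text: str) -> str:
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = ["\t".join(ATOMIC_HEADER)]
    for row in reader:
        # The source column names ("user", "item", ...) are assumptions.
        lines.append("\t".join([row["user"], row["item"], row["rating"], row["ts"]]))
    return "\n".join(lines)

raw = "user,item,rating,ts\n196,242,3,881250949\n186,302,3,891717742\n"
atomic = csv_to_atomic_inter(raw)
```

Writing the result to training_data/<dataset>/<dataset>.inter mirrors where preprocess_dataset.py places its output.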

A list of datasets discussed in this repo

General Datasets

| SN | Dataset | Instructions |
|----|---------|--------------|
| 1 | MovieLens | Link |
| 2 | Anime | Link |
| 3 | Epinions | Link |
| 4 | Yelp | Link |
| 5 | Netflix | Link |
| 6 | Book-Crossing | Link |
| 7 | Jester | Link |
| 8 | Douban | Link |
| 9 | Yahoo Music | Link |
| 10 | KDD2010 | Link |
| 11 | Amazon | Link |
| 12 | Pinterest | Link |
| 13 | Gowalla | Link |
| 14 | Last.FM | Link |
| 15 | DIGINETICA | Link |
| 16 | Steam | Link |
| 17 | Ta Feng | Link |
| 18 | Foursquare | Link |
| 19 | Tmall | Link |
| 20 | YOOCHOOSE | Link |
| 21 | Retailrocket | Link |
| 22 | LFM-1b | Link |
| 23 | MIND | Link |

CTR Datasets

| SN | Dataset | Instructions |
|----|---------|--------------|
| 1 | Criteo | Link |
| 2 | Avazu | Link |
| 3 | iPinYou | Link |
| 4 | Phishing websites | Link |
| 5 | Adult | Link |

Knowledge-aware Datasets

| SN | Dataset | Instructions |
|----|---------|--------------|
| 1 | MovieLens | Link |
| 2 | Amazon-book | Link |
| 3 | LFM-1b (tracks) | Link |

Here is a list of RecBole dataset resources that can be converted to atomic format, sorted by importance:

  1. Official RecBole Datasets and Preprocessing
  2. Google Drive
  3. Official list from RecBole

If you want to use a dataset that is not in atomic format, refer to the RecBole documentation or [6] to convert it to atomic format.

Configuration

The configuration files are located in the config folder. A configuration defines the model hyperparameters, the dataset, the training/evaluation parameters, and more.
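As a rough sketch (the values below are illustrative, not copied from config/config_ml-100k.yaml), a RecBole-style configuration names the ID fields, selects which columns to load from the atomic files, and sets training/evaluation parameters:

```yaml
# Illustrative sketch only -- not the actual config/config_ml-100k.yaml.
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
load_col:
  inter: [user_id, item_id, rating, timestamp]

epochs: 50
train_batch_size: 2048
eval_args:
  split: {LS: valid_and_test}
  mode: full
```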

Training

Here's an example of how to train a model on the ml-100k dataset:

  1. Download a dataset from the dataset section and put it in the datasets folder:
cd datasets
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
  2. Preprocess the dataset to generate the atomic-format dataset files:
python preprocess_dataset.py --dataset ml-100k --convert_inter --convert_item --convert_user

These files will then be stored in the training_data folder automatically.

  3. Prepare a configuration file for the model, similar to the one in config/config_ml-100k.yaml. You mainly need to tell the model where to find the dataset files and how to process them (which columns to pick from the .inter, .item, and .user files).

  4. To train a new model on the ml-100k dataset, run the following script:

python run.py --model_name BERT4Rec --config config_ml-100k.yaml --dataset ml-100k
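The flags above suggest a command-line interface roughly like the following sketch. This is a hypothetical reconstruction of run.py's argument parsing, not the actual script:

```python
import argparse

# Hypothetical reconstruction of run.py's CLI, based only on the flags
# shown in the command above; the real script may differ.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run/train/evaluate a recommender model")
    parser.add_argument("--model_name", required=True, help="model to train, e.g. BERT4Rec")
    parser.add_argument("--config", required=True, help="YAML file from the config folder")
    parser.add_argument("--dataset", required=True, help="dataset name, e.g. ml-100k")
    return parser

args = build_parser().parse_args(
    ["--model_name", "BERT4Rec", "--config", "config_ml-100k.yaml", "--dataset", "ml-100k"]
)
```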

Logging

There are two ways to log the training process:

  1. Using Weights & Biases (recommended). To use it, refer to the official RecBole documentation.

  2. Using TensorBoard. RecBole produces TensorBoard logs out of the box. To review the logs, install the tensorboard package, then run the following command:

tensorboard --logdir=log_tensorboard

Evaluation

The following two tables show an extensive evaluation of different recommender models on the H&M dataset.

1. Evaluation results

| index | model | type | duration (m) | best_valid_score | valid_score | recall | MRR | nDCG | hit | precision | map |
|-------|-------|------|--------------|------------------|-------------|--------|-----|------|-----|-----------|-----|
| 0 | Pop | general | 1.48 | 0.4703 | True | 0.6663 | 0.572 | 0.5227 | 0.8842 | 0.1457 | 0.408 |
| 1 | ItemKNN | general | 5.45 | 0.2129 | True | 0.5762 | 0.291 | 0.3158 | 0.8231 | 0.1041 | 0.1909 |
| 2 | BPR | general | 3.4 | 0.2646 | True | 0.412 | 0.3542 | 0.3075 | 0.6366 | 0.0877 | 0.2238 |
| 3 | NeuMF | general | 4.07 | 0.4333 | True | 0.6573 | 0.5276 | 0.4928 | 0.8849 | 0.1402 | 0.3733 |
| 4 | RecVAE | general | 85.48 | 0.4678 | True | 0.6706 | 0.5688 | 0.5209 | 0.8922 | 0.1453 | 0.4039 |
| 5 | LightGCN | general | 118.88 | 0.3259 | True | 0.4859 | 0.4039 | 0.3694 | 0.6709 | 0.1041 | 0.2809 |
| 6 | FFM | context-aware | 5.25 | 0.1766 | True | 0.5615 | 0.2507 | 0.2908 | 0.8036 | 0.0988 | 0.1673 |
| 7 | DeepFM | context-aware | 5.2 | 0.1772 | True | 0.5625 | 0.2496 | 0.2907 | 0.8046 | 0.0991 | 0.1669 |
| 8 | BERT4Rec (2 layers) | sequential | 22.06 | 0.4363 | True | 0.6969 | 0.5409 | 0.5157 | 0.9018 | 0.1427 | 0.3929 |
| 9 | BERT4Rec (4 layers) | sequential | 29.42 | 0.4631 | True | 0.7461 | 0.7952 | 0.5884 | 0.9515 | 0.5502 | 0.4631 |
| 10 | GRU4Rec | sequential | 5.86 | 0.5854 | True | 0.7086 | 0.6778 | 0.6037 | 0.9038 | 0.1591 | 0.4989 |
| 11 | SHAN | sequential | 10.22 | 0.5201 | True | 0.5624 | 0.4984 | 0.4706 | 0.6555 | 0.1076 | 0.4025 |
2. Test results

| index | model | type | duration (m) | best_valid_score | valid_score | recall | MRR | nDCG | hit | precision | map |
|-------|-------|------|--------------|------------------|-------------|--------|-----|------|-----|-----------|-----|
| 0 | Pop | general | 1.48 | 0.4703 | True | 0.7485 | 0.6272 | 0.5904 | 0.9346 | 0.1654 | 0.4703 |
| 1 | ItemKNN | general | 5.45 | 0.2129 | True | 0.6178 | 0.3241 | 0.3461 | 0.8665 | 0.1145 | 0.2129 |
| 2 | BPR | general | 3.4 | 0.2646 | True | 0.4645 | 0.398 | 0.3533 | 0.6814 | 0.1004 | 0.2646 |
| 3 | NeuMF | general | 4.07 | 0.4333 | True | 0.7441 | 0.5835 | 0.5606 | 0.9403 | 0.1606 | 0.4333 |
| 4 | RecVAE | general | 85.48 | 0.4678 | True | 0.7562 | 0.626 | 0.5905 | 0.9421 | 0.1654 | 0.4678 |
| 5 | LightGCN | general | 118.88 | 0.3259 | True | 0.5443 | 0.4422 | 0.4176 | 0.7012 | 0.1175 | 0.3259 |
| 6 | FFM | context-aware | 5.25 | 0.1766 | True | 0.5986 | 0.2638 | 0.3089 | 0.8448 | 0.1074 | 0.1766 |
| 7 | DeepFM | context-aware | 5.2 | 0.1772 | True | 0.6016 | 0.2637 | 0.3102 | 0.8478 | 0.1083 | 0.1772 |
| 8 | BERT4Rec (2 layers) | sequential | 22.06 | 0.4363 | True | 0.7441 | 0.5898 | 0.5622 | 0.9284 | 0.1582 | 0.4363 |
| 9 | BERT4Rec (4 layers) | sequential | 29.42 | 0.4631 | True | 0.7839 | 0.7952 | 0.5884 | 0.8959 | 0.5502 | 0.3871 |
| 10 | GRU4Rec | sequential | 5.86 | 0.5854 | True | 0.8045 | 0.7160 | 0.5129 | 0.8898 | 0.4795 | 0.3852 |
| 11 | SHAN | sequential | 10.22 | 0.5201 | True | 0.7063 | 0.6325 | 0.6012 | 0.7979 | 0.1475 | 0.5201 |
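The metrics in these tables follow their standard top-k definitions. As a minimal sketch (toy data, not from the H&M evaluation), recall@k, MRR, and nDCG@k for a single user can be computed like this:

```python
import math

# Standard top-k metric definitions for one user's ranked recommendation list.
def recall_at_k(ranked, relevant, k):
    # fraction of the relevant items that appear in the top-k
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked, relevant):
    # reciprocal rank of the first relevant item (0 if none is ranked)
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    # binary-relevance DCG normalized by the ideal DCG
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(ranked[:k], start=1) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / idcg

ranked = ["b", "a", "d", "c"]   # model's ranking for one user (toy example)
relevant = {"a", "c"}           # that user's ground-truth items
```

The table values aggregate such per-user scores over all test users.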

References

[1] BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformers

[2] RecBole: A Toolkit for Large-scale Recommendation System

[3] RecBole GitHub

[4] RecBole Datasets

[5] RecBole Tutorial

[6] RecBole Dataset Conversion Tools
