This repository provides a general framework for preparing datasets and building various recommender systems. It was mainly used to evaluate different RecSys models and to train a BERT4Rec model based on the original paper [1].
The model is implemented in PyTorch.
This repository has been tested on Python 3.9 and PyTorch 1.12.0 with CUDA 11.6.
To prepare the working environment, please run the following command:
- Clone the repository:
```shell
git clone https://github.com/amrohendawi/recSys_framework
cd recSys_framework
```
- Create a virtual environment and activate it:
```shell
python3 -m venv venv
source venv/bin/activate
```
- Install the requirements:
```shell
pip install -r requirements.txt
```
The source code is divided into the following files/folders:

- `config`: contains configuration files specific to a model/dataset/evaluation.
- `saved`: contains saved models and dataloaders.
- `datasets`: contains raw dataset files that still need to be processed.
- `training_data`: contains processed atomic-format dataset files that can be fed into the model.
- `log`: contains the log files of the training process.
- `log_tensorboard`: contains the tensorboard log files of the training process.
- `wandb`: contains the wandb log files of the training process.
- `preprocess_dataset.py`: preprocesses the raw dataset files into atomic-format dataset files.
- `run.py`: runs/trains/evaluates models.
The datasets used in the examples are the H&M and ML-100k datasets. The raw dataset files must be stored in the `datasets` folder. After running `preprocess_dataset.py`, the atomic-format dataset files are stored in the `training_data` folder.

To use a different dataset, it must be in atomic format.
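For illustration, an atomic file is a tab-separated text file whose header declares each column as `name:type`, where `token` marks categorical IDs and `float` marks numeric fields. A minimal sketch that writes a toy `.inter` interaction file (the file name and the values are hypothetical, just to show the layout):

```python
import csv

# A few toy (user, item, rating, timestamp) interactions.
rows = [
    ("196", "242", 3.0, 881250949),
    ("186", "302", 3.0, 891717742),
    ("22",  "377", 1.0, 878887116),
]

# Atomic files are tab-separated; the header carries the column types.
with open("toy.inter", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["user_id:token", "item_id:token",
                     "rating:float", "timestamp:float"])
    writer.writerows(rows)
```

The same `name:token`/`name:float` header convention applies to the `.item` and `.user` files.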
Below is a list of the datasets discussed in this repo:
SN | Dataset | Instructions |
---|---|---|
1 | MovieLens | Link |
2 | Anime | Link |
3 | Epinions | Link |
4 | Yelp | Link |
5 | Netflix | Link |
6 | Book-Crossing | Link |
7 | Jester | Link |
8 | Douban | Link |
9 | Yahoo Music | Link |
10 | KDD2010 | Link |
11 | Amazon | Link |
12 | | Link |
13 | Gowalla | Link |
14 | Last.FM | Link |
15 | DIGINETICA | Link |
16 | Steam | Link |
17 | Ta Feng | Link |
18 | Foursquare | Link |
19 | Tmall | Link |
20 | YOOCHOOSE | Link |
21 | Retailrocket | Link |
22 | LFM-1b | Link |
23 | MIND | Link |
SN | Dataset | Instructions |
---|---|---|
1 | Criteo | Link |
2 | Avazu | Link |
3 | iPinYou | Link |
4 | Phishing websites | Link |
5 | Adult | Link |
SN | Dataset | Instructions |
---|---|---|
1 | MovieLens | Link |
2 | Amazon-book | Link |
3 | LFM-1b (tracks) | Link |
The tables above list RecBole datasets that can be formatted to atomic format, sorted by importance.
If you want to use a dataset that is not in atomic format, refer to the RecBole documentation or [6] to convert it to atomic format.
The configuration files are located in the `config` folder. A configuration file defines the model hyperparameters, the dataset, the training/evaluation parameters, and more.
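As a rough sketch, a configuration file in this repo might look like the following. The field names follow RecBole's config conventions; the values here are illustrative and may differ from the actual `config_ml-100k.yaml`:

```yaml
# Illustrative excerpt -- not the actual config_ml-100k.yaml.
data_path: training_data/
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
load_col:
  inter: [user_id, item_id, timestamp]

epochs: 100
train_batch_size: 2048
metrics: [Recall, MRR, NDCG, Hit, Precision, MAP]
topk: 10
valid_metric: MAP@10
```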
Here's an example of how to train a model on the ml-100k dataset:

- Download a dataset from the dataset section and put it in the `datasets` folder:
```shell
cd datasets
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
```
- Preprocess the dataset to generate the atomic-format dataset files:
```shell
python preprocess_dataset.py --dataset ml-100k --convert_inter --convert_item --convert_user
```
These files will then be stored in the `training_data` folder automatically.
- Prepare a configuration file for the model similar to the one in `config/config_ml-100k.yaml`. You mainly need to tell the model where to find the dataset files and how to process them (which columns to pick from the .inter, .item, and .user files).
- To train a new model on the ml-100k dataset, run the following script:
```shell
python run.py --model_name BERT4Rec --config config_ml-100k.yaml --dataset ml-100k
```
There are two ways to log the training process:

- Using Weights & Biases (recommended). To use it, refer to the official RecBole documentation.
- Using Tensorboard. RecBole produces tensorboard logs by default, out of the box. To review the logs, you need to install the `tensorboard` package. Then, run the following command:
```shell
tensorboard --logdir=log_tensorboard
```
The following two tables show an extensive evaluation of different recommender models on the H&M dataset.
1. Evaluation results
index | model | type | duration (m) | best_valid_score | valid_score | recall | MRR | nDCG | hit | precision | map |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Pop | general | 1.48 | 0.4703 | True | 0.6663 | 0.572 | 0.5227 | 0.8842 | 0.1457 | 0.408 |
1 | ItemKNN | general | 5.45 | 0.2129 | True | 0.5762 | 0.291 | 0.3158 | 0.8231 | 0.1041 | 0.1909 |
2 | BPR | general | 3.4 | 0.2646 | True | 0.412 | 0.3542 | 0.3075 | 0.6366 | 0.0877 | 0.2238 |
3 | NeuMF | general | 4.07 | 0.4333 | True | 0.6573 | 0.5276 | 0.4928 | 0.8849 | 0.1402 | 0.3733 |
4 | RecVAE | general | 85.48 | 0.4678 | True | 0.6706 | 0.5688 | 0.5209 | 0.8922 | 0.1453 | 0.4039 |
5 | LightGCN | general | 118.88 | 0.3259 | True | 0.4859 | 0.4039 | 0.3694 | 0.6709 | 0.1041 | 0.2809 |
6 | FFM | context-aware | 5.25 | 0.1766 | True | 0.5615 | 0.2507 | 0.2908 | 0.8036 | 0.0988 | 0.1673 |
7 | DeepFM | context-aware | 5.2 | 0.1772 | True | 0.5625 | 0.2496 | 0.2907 | 0.8046 | 0.0991 | 0.1669 |
8 | BERT4Rec(2 layers) | sequential | 22.06 | 0.4363 | True | 0.6969 | 0.5409 | 0.5157 | 0.9018 | 0.1427 | 0.3929 |
9 | BERT4Rec(4 layers) | sequential | 29.42 | 0.4631 | True | 0.7461 | 0.7952 | 0.5884 | 0.9515 | 0.5502 | 0.4631 |
10 | GRU4Rec | sequential | 5.86 | 0.5854 | True | 0.7086 | 0.6778 | 0.6037 | 0.9038 | 0.1591 | 0.4989 |
11 | SHAN | sequential | 10.22 | 0.5201 | True | 0.5624 | 0.4984 | 0.4706 | 0.6555 | 0.1076 | 0.4025 |
2. Test results
index | model | type | duration (m) | best_valid_score | valid_score | recall | MRR | nDCG | hit | precision | map |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Pop | general | 1.48 | 0.4703 | True | 0.7485 | 0.6272 | 0.5904 | 0.9346 | 0.1654 | 0.4703 |
1 | ItemKNN | general | 5.45 | 0.2129 | True | 0.6178 | 0.3241 | 0.3461 | 0.8665 | 0.1145 | 0.2129 |
2 | BPR | general | 3.4 | 0.2646 | True | 0.4645 | 0.398 | 0.3533 | 0.6814 | 0.1004 | 0.2646 |
3 | NeuMF | general | 4.07 | 0.4333 | True | 0.7441 | 0.5835 | 0.5606 | 0.9403 | 0.1606 | 0.4333 |
4 | RecVAE | general | 85.48 | 0.4678 | True | 0.7562 | 0.626 | 0.5905 | 0.9421 | 0.1654 | 0.4678 |
5 | LightGCN | general | 118.88 | 0.3259 | True | 0.5443 | 0.4422 | 0.4176 | 0.7012 | 0.1175 | 0.3259 |
6 | FFM | context-aware | 5.25 | 0.1766 | True | 0.5986 | 0.2638 | 0.3089 | 0.8448 | 0.1074 | 0.1766 |
7 | DeepFM | context-aware | 5.2 | 0.1772 | True | 0.6016 | 0.2637 | 0.3102 | 0.8478 | 0.1083 | 0.1772 |
8 | BERT4Rec(2 layers) | sequential | 22.06 | 0.4363 | True | 0.7441 | 0.5898 | 0.5622 | 0.9284 | 0.1582 | 0.4363 |
9 | BERT4Rec(4 layers) | sequential | 29.42 | 0.4631 | True | 0.7839 | 0.7952 | 0.5884 | 0.8959 | 0.5502 | 0.3871 |
10 | GRU4Rec | sequential | 5.86 | 0.5854 | True | 0.8045 | 0.7160 | 0.5129 | 0.8898 | 0.4795 | 0.3852 |
11 | SHAN | sequential | 10.22 | 0.5201 | True | 0.7063 | 0.6325 | 0.6012 | 0.7979 | 0.1475 | 0.5201 |
[1] BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformers
[2] RecBole: A Toolkit for Large-scale Recommendation System
[3] RecBole GitHub
[4] RecBole Datasets
[5] RecBole Tutorial