This repository contains the code and experiments for the Master of Science thesis "GAN-based Matrix Factorization for Recommender Systems" at Politecnico di Milano. The abstract is provided below. Full text is available at http://hdl.handle.net/10589/154120.
The last decade has seen an exponential increase in the amount of available information thanks to the ever-growing number of connected devices and the interaction of users with online content such as social media and e-commerce. While this translates into more choices for users given their diverse sets of preferences, it makes it difficult for them to explore this vast amount of information. Recommender systems (RS) aim to alleviate this problem by filtering the content offered to users, predicting either the rating of items by users or the propensity of users to like specific items. The latter is known as Top-N recommendation in the RS community and refers to the problem of recommending items to users, preferably ordered from most likely-to-interact to least likely-to-interact.
RS use two main approaches for providing recommendations to users: collaborative filtering and content-based filtering. One of the main algorithms used in collaborative filtering is matrix factorization, which consists in estimating user preferences by decomposing a user-item interaction matrix into lower-dimensional matrices of latent features of users and items.
The burst of big data has triggered a corresponding response in the machine learning community, which has sought new techniques to extract relevant information from data. One such technique is Generative Adversarial Nets (GAN), proposed in 2014 by Goodfellow et al., which sparked fresh interest in generative modelling. Under this modelling paradigm, GANs have shown great results in estimating high-dimensional, degenerate distributions in Computer Vision, Natural Language Processing and various other scientific fields. Despite their popularity and their ability to learn arbitrary distributions, GANs, and more generally generative modelling, have not been widely applied in RS.
In this thesis we investigate a novel approach that estimates the user and item latent factors in a matrix factorization setting through the application of Generative Adversarial Networks to the generic Top-N recommendation problem. We detail the formulation of this approach and show its performance through different experiments on well-known datasets in the RS community.
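As a point of reference for the matrix factorization setting mentioned in the abstract, here is a minimal sketch of plain, non-adversarial matrix factorization for Top-N recommendation using a truncated SVD. It is not the GANMF model from the thesis. The toy interaction matrix, the number of latent factors and the helper names factorize and top_n are assumptions made purely for illustration.

```python
import numpy as np

# Toy user-item interaction matrix (4 users x 5 items), 1 = interaction.
# Hypothetical data, for illustration only.
URM = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1],
], dtype=float)


def factorize(urm, n_factors):
    """Decompose urm into user factors (users x k) and item factors (items x k)."""
    U, s, Vt = np.linalg.svd(urm, full_matrices=False)
    user_factors = U[:, :n_factors] * s[:n_factors]  # scale columns by singular values
    item_factors = Vt[:n_factors, :].T
    return user_factors, item_factors


def top_n(user_factors, item_factors, urm, user, n):
    """Rank the unseen items of one user by predicted score, best first."""
    scores = user_factors[user] @ item_factors.T
    scores[urm[user] > 0] = -np.inf  # never recommend already-seen items
    return np.argsort(-scores)[:n]


user_factors, item_factors = factorize(URM, n_factors=2)
recommendations = top_n(user_factors, item_factors, URM, user=0, n=2)
```

The predicted score of an item for a user is the dot product of their latent vectors; GAN-based approaches differ in how those latent factors are learned, not in this final ranking step.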
This repo is based on a version of the repo Recsys_Course_AT_PoliMi. In order to run the code and experiments you first need to set up a Python environment. Any environment manager will work, but we suggest conda since it makes it easier to recreate our environment when using a GPU. conda can help with the installation of CUDA and the CUDA toolkit necessary to utilize available GPU(s). We highly recommend running this repo with a GPU since GAN-based recommenders require long running times.
Run the following command to create a new environment with Python 3.6.8 and install all requirements in the file conda_requirements.txt:
conda create -n <name-env> python==3.6.8 --file conda_requirements.txt
The file conda_requirements.txt also contains the packages cudatoolkit==9.0 and cudnn==7.1.2, which conda installs and manages completely separately from any other versions you might already have installed.
Activate the newly created environment:
conda activate <name-env>
Then install the following packages using pip inside the activated environment, since they are not found in the main channel of conda and the conda-forge channel holds only old versions of them:
pip install scikit-optimize==0.7.2 telegram-send==0.25
Alternatively, to use virtualenv and pip instead of conda, first download and install Python 3.6.8 from python.org. Then install virtualenv:
python -m pip install --user virtualenv
Now create a new environment with virtualenv (by default it uses the Python version it was installed with):
virtualenv <path-to-new-env>
Activate the new environment with:
source <path-to-new-env>/bin/activate
Now install the required packages through the file pip_requirements.txt:
pip install -r pip_requirements.txt
Note that if you intend to use a GPU and install the required packages using virtualenv and pip, then you need to install cudatoolkit==9.0 and cudnn==7.1.2 separately, following the instructions for your GPU on nvidia.com.
Before running any experiment or algorithm you need to compile the Cython code that is part of some of the recommenders. You can compile it all with the following command:
python run_compile_all_cython.py
N.B. You need to have the following packages installed before compiling: gcc and python3-dev.
We have provided Python scripts to test-run only the GAN-based algorithms. They can be run with:
python run_GANMF.py
python run_CFGAN.py
By default they download the MovieLens100K dataset, split it into train-test-validation sets with ratio 6-2-2 and use reasonable default hyperparameters for both algorithms on this dataset.
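To make the 6-2-2 ratio concrete, the following sketch partitions interaction indices into train, test and validation parts. It assumes a uniformly random holdout over individual interactions, which may differ from how the scripts actually split the data; the function name split_interactions and the seed are hypothetical.

```python
import numpy as np


def split_interactions(n_interactions, ratios=(0.6, 0.2, 0.2), seed=42):
    """Randomly partition interaction indices into train, test and validation parts."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_interactions)
    n_train = int(round(ratios[0] * n_interactions))
    n_test = int(round(ratios[1] * n_interactions))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


# MovieLens 100K contains 100,000 ratings.
train_idx, test_idx, valid_idx = split_interactions(100000)
```

Each index would then select one (user, item) interaction from the full dataset, so the three parts never share an interaction.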
In order to run all the comparisons with the baselines use the file RecSysExp.py. First compute, for each dataset, 5 mutually exclusive sets:
- Training set: once the best hyperparameters of the recommender are found, it is trained one final time on this set.
- Training set small: the recommender is first trained on this small training set with the aim of finding the best hyperparameters.
- Early stopping set: validation set used to incorporate early stopping in the hyperparameter tuning.
- Validation set: the recommender with the current hyperparameter values is tested against this set.
- Test set: once the best hyperparameters are found, the recommender is finally tested on this set. The results presented are the ones on this set.
Compute the splits for each dataset with the following command:
python RecSysExp.py --build_datasets
To run the tuning of a recommender use the following command:
python RecSysExp.py <recommender-name> [--item | --user] [--run_all | <dataset-name(s)>] [--no_mp]
- recommender-name is a value among: Random, PureSVD, ALS, BPR, SLIMBPR, CFGAN, GANMF, DisGANMF, DeepGANMF, fullGANMF.
- item | user is a flag used only for GAN-based recommenders. It denotes the item/user based training procedure for the selected recommender.
- run_all is a flag that selects all datasets on which to tune the selected recommender. If this flag is set, dataset-name(s) is ignored.
- dataset-name(s) is a value among: LastFM, CiaoDVD, Delicious, 100K, 1M. Multiple values can be given, separated by spaces.
- no_mp is a flag that explicitly requests no parallelism during tuning (by default each dataset is tuned in parallel through Python's multiprocessing module). This flag is necessary for GAN-based algorithms running on a GPU in order not to exhaust the available GPU memory by constructing parallel TensorFlow graphs. It can be omitted for the other baselines.
All results, best hyperparameters and dataset splits are saved into the directory experiments.
In order to run the ablation studies use the script AblationStudy.py. This file implements two functions: ablation_study and feature_matching_cos_sim. Before running it, edit the very last line of the file so that it uses the name of whichever of these two functions you want to run, then execute:
python AblationStudy.py [--run-all | <dataset-name(s)>] [item | user]
- run-all is a flag that asks for the computation of the experiment on all datasets. If it is set, dataset-name(s) is ignored.
- dataset-name(s) is a value among: LastFM, CiaoDVD, Delicious, 100K, 1M. Multiple values can be given, separated by spaces.
- item | user is a flag that sets the training procedure for the GANMF recommender.
Results of the function ablation_study are saved in the directory ablation_study and results of feature_matching_cos_sim are saved in the directory cosine_similarities. This experiment must be run after tuning GANMF, since the best hyperparameters are retrieved from the experiments directory.
In order to test each tuned recommender on the test set (which is created when tuning the hyperparameters) run the following command:
python RunBestParameters.py <recommender-name> [train-mode] [--run-all | <dataset-name(s)>]
- recommender-name is a value among: Random, PureSVD, ALS, BPR, SLIMBPR, CFGAN, GANMF, DisGANMF, DeepGANMF, fullGANMF.
- train-mode is a value among: item, user. It specifies the training procedure for GAN-based recommenders. If omitted for a GAN-based recommender, both training procedures are run. Omit it for the other baselines.
- run-all is a flag that asks for the computation of the experiment on all datasets. If it is set, dataset-name(s) is ignored.
- dataset-name(s) is a value among: LastFM, CiaoDVD, Delicious, 100K, 1M. Multiple values can be given, separated by spaces.