MODL: Massive Online Dictionary Learning
This python package (webpage) implements our two papers from 2016:
Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux. Stochastic Subsampling for Factorizing Huge Matrices. 2017.
Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux. Dictionary Learning for Massive Matrix Factorization. International Conference on Machine Learning, Jun 2016, New York, United States. 2016
It allows to perform sparse / dense matrix factorization on fully-observed/missing data very efficiently, by leveraging random sampling with online learning. It is able to factorize matrices of terabyte scale with hundreds of components in the latent space in a few hours.
This package allows to reproduce the experiments and figures from the papers.
More importantly, it provides https://github.com/scikit-learn/scikit-learn compatible estimators that fully implements the proposed algorithms.
Installing from source with pip
Installation from source is simple. In a command prompt:
git clone https://github.com/arthurmensch/modl.git cd modl pip install -r requirements.txt pip install . cd $HOME py.test --pyargs modl
The package essentially provides three estimators:
DictFact, that computes a matrix factorization from Numpy arrays
fMRIDictFact, that computes sparse spatial maps from fMRI images
ImageDictFact, that computes a patch dictionary from an image
RecsysDictFact, that allows to predict score from a collaborative filtering approach
A fast running example that decomposes a small dataset of resting-fmri data into a 70 components map is provided
It can be adapted for running on the 2TB HCP dataset, by changing the source parameter into 'hcp' (you will need to download the data first)
A fast running example that extracts the patches of a HD image can be run from
It can be adapted to run on AVIRIS data, changing the image source into 'aviris' in the file.
Our core algorithm can be run to perform collaborative filtering very efficiently:
You will need to download datasets beforehand:
make download-movielens1m make download-movielens10m
sacreddependency will be removed
- Release a fetcher for HCP from S3 bucker
- Release examples with larger datasets and benchmarks
Please feel free to report any issue and propose improvements on github.
Related projects :
- spira is a python library to perform collaborative filtering based on coordinate descent. It serves as the baseline for recsys experiments - we hard included it for simplicity.
- scikit-learn is a python library for machine learning. It serves as the basis of this project.
- nilearn is a neuro-imaging library that we wrap in our fMRI related estimators.
Licensed under simplified BSD.
Arthur Mensch, 2015 - present