SubModLib is an easy-to-use, efficient and scalable Python library for submodular optimization with a C++ optimization engine. Submodlib finds its application in summarization, data subset selection, hyper parameter tuning, efficient training etc. Through a rich API, it offers a great deal of flexibility in the way it can be used.
Please check out our latest arxiv preprint: https://arxiv.org/abs/2202.10680
- Rich suite of functions for a wide variety of subset selection tasks:
- regular set (submodular) functions
- submodular mutual information functions
- conditional gain functions
- conditional mutual information functions
- Supports different types of optimizers
- naive greedy
- lazy (accelerated) greedy
- stochastic (random) greedy
- lazier than lazy greedy
- Combines the best of Python's ease of use and C++'s efficiency
- Rich API which gives a variety of options to the user. See this notebook for an example of different usage patterns
- De-coupled function and optimizer paradigm makes it suitable for a wide-variety of tasks
- Comprehensive documentation (available here)
- Modelling capabilities of regular submodular functions
- Modelling capabilities of submodular mutual information (SMI) functions
- Modelling capabilities of conditional gain (CG) functions
- Modelling capabilities of conditional mutual information (CMI) functions
- This notebook contains a quantitative analysis of performance of different functions and role of the parameterization in aspects like query-coverage, query-relevance, privacy-irrelevance and diversity for different SMI, CG and CMI functions as observed on synthetically generated dataset.
- This notebook contains similar analysis on ImageNet dataset.
- For a more detailed discussion on all possible usage patterns, please see Different Options of Usage
$ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib
$ git clone https://github.com/decile-team/submodlib.git
$ cd submodlib
$ pip install .
- Latest documentation is available at readthedocs. However, if local documentation is required to be built, follow these steps::
$ pip install -U sphinx
$ pip install sphinxcontrib-bibtex
$ pip install sphinx-rtd-theme
$ cd docs
$ make clean html
- To run the tests, follow these steps:
$ pip install pytest
$ pytest
# this runs ALL tests$ pytest -m <marker> --verbose --disable-warnings -rA
# this runs test specified by the . Possible markers are mentioned in pyproject.toml file.
It is very easy to get started with submodlib. Using a submodular function in submodlib essentially boils down to just two steps:
- instantiate the corresponding function object
- invoke the desired method on the created object
The most frequently used methods are:
- f.evaluate() - takes a subset and returns the score of the subset as computed by the function f
- f.marginalGain() - takes a subset and an element and returns the marginal gain of adding the element to the subset, as computed by f
- f.maximize() - takes a budget and an optimizer to return an optimal set as a result of maximizing f
For example,
from submodlib import FacilityLocationFunction
objFL = FacilityLocationFunction(n=43, data=groundData, mode="dense", metric="euclidean")
greedyList = objFL.maximize(budget=10,optimizer='NaiveGreedy')
For a more detailed discussion on all possible usage patterns, please see Different Options of Usage
- Regular set (submodular) functions (for vanilla subset selection requiring representation, diversity, importance, relevance, coverage)
- Submodular mutual information functions (for query-focused subset selelction)
- Conditional gain functions (for query-irrelevant/privacy-preserving subset selection)
- Conditional mutual information functions (for joint query-focused and privacy-preserving subset selection)
We demonstrate the representational power and modeling capabilities of different functions qualitatively in the following Google Colab notebooks:
- Modelling capabilities of regular submodular functions
- Modelling capabilities of submodular mutual information (SMI) functions
- Modelling capabilities of conditional gain (CG) functions
- Modelling capabilities of conditional mutual information (CMI) functions
This notebook contains a quantitative analysis of performance of different functions and role of the parameterization in aspects like query-coverage, query-relevance, privacy-irrelevance and diversity for different SMI, CG and CMI functions as observed on synthetically generated dataset. This notebook contains similar analysis on ImageNette dataset.
- NaiveGreedy
- LazyGreedy
- StochasticGreedy
- LazierThanLazyGreedy
- This notebook demonstrates the use and comparison of different optimizers.
- This notebook contains demonstration of using submodlib for an image collection summarization application.
To gauge the performance of submodlib, selection by Facility Location was performed on a randomly generated dataset of 1024-dimensional points. Specifically the following code was run for the number of data points ranging from 50 to 10000.
K_dense = helper.create_kernel(dataArray, mode="dense", metric='euclidean', method="other")
obj = FacilityLocationFunction(n=num_samples, mode="dense", sijs=K_dense, separate_rep=False,pybind_mode="array")
obj.maximize(budget=budget,optimizer=optimizer, stopIfZeroGain=False, stopIfNegativeGain=False, verbose=False, show_progress=False)
The above code was timed using Python's timeit module averaged across three executions each. We report the following numbers:
Number of data points | Time taken (in seconds) |
---|---|
50 | 0.00043 |
100 | 0.001074 |
200 | 0.003024 |
500 | 0.016555 |
1000 | 0.081773 |
5000 | 2.469303 |
6000 | 3.563144 |
7000 | 4.667065 |
8000 | 6.174047 |
9000 | 8.010674 |
10000 | 9.417298 |
If your research makes use of SUBMODLIB, please consider citing:
SUBMODLIB (Submodlib: A Submodular Optimization Library (Kaushal et al., 2022))
@article{kaushal2022submodlib,
title={Submodlib: A submodular optimization library},
author={Kaushal, Vishal and Ramakrishnan, Ganesh and Iyer, Rishabh},
journal={arXiv preprint arXiv:2202.10680},
year={2022}
}
- Vishal Kaushal, Ganesh Ramakrishnan and Rishabh Iyer. Currently maintained by CARAML Lab
Should you face any issues or have any feedback or suggestions, please feel free to contact vishal[dot]kaushal[at]gmail.com
This work is supported by the Ekal Fellowship (www.ekal.org). This work is also supported by the National Science Foundation(NSF) under Grant Number 2106937, a startup grant from UT Dallas, as well as Google and Adobe awards.