abir-de/SELCON

SELCON: Training Data Subset Selection for Regression with Controlled Generalization Error

The code was cleaned and modularized by Sanidhya Anand

Overview

This directory contains the code needed to run SELCON, a data subset selection algorithm for efficient training. In a nutshell, it jointly selects a training subset and the model parameters, subject to a set of constraints that keep the error on the validation set below an acceptable level. We show that solving this problem is equivalent to minimizing a monotone and approximately submodular function.
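
To make the validation constraint concrete, here is a minimal NumPy sketch of the feasibility check behind the problem statement: fit a model on one candidate subset and test whether its validation error stays below a threshold. It is not the SELCON algorithm and not the package's API. The ridge regressor, the helper names `fit_ridge`, `validation_mse` and `satisfies_constraint`, the parameters `lam` and `delta`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def validation_mse(w, X_val, y_val):
    """Mean squared error of the linear model w on the validation set."""
    residual = X_val @ w - y_val
    return float(np.mean(residual ** 2))


def satisfies_constraint(X_trn, y_trn, X_val, y_val, subset, delta, lam=1e-3):
    """Train on the candidate subset only and check validation MSE <= delta."""
    w = fit_ridge(X_trn[subset], y_trn[subset], lam)
    err = validation_mse(w, X_val, y_val)
    return err <= delta, err


# Synthetic, noise-free linear data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X_trn = rng.normal(size=(200, 3))
y_trn = X_trn @ w_true
X_val = rng.normal(size=(50, 3))
y_val = X_val @ w_true

candidate = np.arange(20)  # an arbitrary candidate subset of 20 training points
feasible, err = satisfies_constraint(
    X_trn, y_trn, X_val, y_val, candidate, delta=1e-2
)
print(f"feasible={feasible}, validation MSE={err:.3e}")
```

This sketch only evaluates a single fixed candidate; SELCON, as described above, searches over the subset and the model parameters together.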

If you use this code in your paper, please cite:

@inproceedings{durga2021training,
    title={Training Data Subset Selection for Regression with Controlled Generalization Error},
    author={Durga, S and Iyer, Rishabh and Ramakrishnan, Ganesh and De, Abir},
    booktitle={International Conference on Machine Learning},
    pages={9202--9212},
    year={2021}
}

Installation

You can install the package with pip:
pip install selcon

To run this code fully, you'll need PyTorch (we're using version 1.4.0) and scikit-learn. We've been running our code in Python 3.7.
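
A quick way to confirm that the dependencies listed above are present is to query the installed package metadata. The sketch below uses only the standard library. `installed_version` is a hypothetical helper, and `importlib.metadata` requires Python 3.8 or later (on Python 3.7 the `importlib_metadata` backport offers the same interface).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# PyPI distribution names; "selcon" is taken from the pip command above.
for name in ("selcon", "torch", "scikit-learn"):
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```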

Usage

The SELCON package can be used for either linear or deep subset selection, as follows:

SELCON for linear model

from SELCON.datasets import load_def_data, get_data
from SELCON.linear import Regression

load_def_data loads the datasets used for the experiments in the paper, provided they are available in the 'Dataset' directory.

reg = Regression()

# Convert the NumPy arrays to torch tensors (the data is assumed to be split into train and validation sets already)
X_trn, X_val, Y_trn, Y_val = get_data(x_train, x_val, y_train, y_val)
# Train SELCON to select a subset containing a fraction 0.03 of the training set (no fairness)
reg.train_model(X_trn, Y_trn, X_val, Y_val, fraction=0.03)
# Retrieve the indices of the selected subset
subset_idxs = reg.return_subset()

# Extract the selected subset of the training data for further use
X_sub = X_trn[subset_idxs]
y_sub = Y_trn[subset_idxs]
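
The snippet above assumes that `x_train`, `x_val`, `y_train` and `y_val` already exist as NumPy arrays. Here is a minimal sketch of one way to produce them, using a shuffled split in plain NumPy on synthetic data. The helper `train_val_split`, the 80/20 ratio and the data are assumptions, not part of the package. The last lines estimate how many points a fraction of 0.03 corresponds to, assuming the fraction is taken relative to the training set size; how SELCON itself rounds this number is not stated here.

```python
import numpy as np


def train_val_split(x, y, val_ratio=0.2, seed=0):
    """Shuffle the rows and hold out a val_ratio share of them for validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    n_val = int(round(val_ratio * len(x)))
    val_idx, trn_idx = perm[:n_val], perm[n_val:]
    return x[trn_idx], x[val_idx], y[trn_idx], y[val_idx]


# Synthetic regression data (hypothetical): 1000 samples, 5 features.
rng = np.random.default_rng(42)
x = rng.normal(size=(1000, 5)).astype(np.float32)
y = (x @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)).astype(np.float32)

x_train, x_val, y_train, y_val = train_val_split(x, y)

# Rough size of the subset selected with fraction = 0.03 (assumed rounding).
fraction = 0.03
approx_subset_size = int(round(fraction * len(x_train)))
print(x_train.shape, x_val.shape, approx_subset_size)
```

These arrays can then be passed to `get_data` exactly as in the usage snippet.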

SELCON for deep model

from SELCON.datasets import load_def_data, get_data
from SELCON.deep import DeepSelection
reg = DeepSelection()

# Convert the NumPy arrays to torch tensors (the data is assumed to be split into train and validation sets already)
X_trn, X_val, Y_trn, Y_val = get_data(x_train, x_val, y_train, y_val)
# Train SELCON to select a subset containing a fraction 0.03 of the training set (with fairness)
reg.train_model_fair(X_trn, Y_trn, X_val, Y_val, fraction=0.03)
# Retrieve the indices of the selected subset
subset_idxs = reg.return_subset()

# Extract the selected subset of the training data for further use
X_sub = X_trn[subset_idxs]
y_sub = Y_trn[subset_idxs]
