SubStrat is a Python package designed to reduce AutoML running times on large datasets. It wraps existing AutoML tools such as AutoSklearn, TPOT and H2O. Instead of executing them directly on the entire dataset, SubStrat uses a genetic algorithm to find a small yet representative data subset that preserves a characteristic of the original (its entropy). It then runs the AutoML tool on the small subset and, finally, refines the resulting pipeline by executing a restricted, much shorter AutoML process on the large dataset.
*** SubStrat is still under development, currently supporting AutoSklearn. Follow us here for updates. ***
SubStrat is based on the VLDB 2023 paper:
- Teddy Lazebnik, Amit Somech, and Abraham Itzhak Weinberg. SubStrat: A Subset-Based Optimization Strategy for Faster AutoML. PVLDB, 16(4): 772 - 780, 2022. doi:10.14778/3574245.3574261
- Automated Machine Learning (AutoML): Using the power of tools such as the AutoSklearn library, users can seamlessly train and fine-tune machine learning models on their dataset.
- Genetic Dataset Summarization: SubStrat uses a genetic algorithm for summarizing datasets, providing concise data representations while retaining vital information.
- Run the genDST algorithm to find a data subset that preserves the dataset's entropy.
- Run the AutoML tool on the data subset (fast) and obtain an intermediate ML pipeline.
- Improve the intermediate pipeline by employing a restricted AutoML run on the full dataset.
- Return the optimized ML pipeline.
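The steps above can be illustrated with a toy sketch that uses only the Python standard library. Everything in it is an assumption made for illustration: random sampling stands in for genDST, an exhaustive search over threshold rules stands in for the AutoML tool, and "restricted" is taken to mean "only candidates near the intermediate result". It does not call SubStrat or AutoSklearn.

```python
import random


def accuracy(threshold, rows):
    """Fraction of rows where the rule `x > threshold` matches the label."""
    return sum((x > threshold) == y for x, y in rows) / len(rows)


def search(candidates, rows):
    """Stand-in for an AutoML run: score every candidate and keep the best."""
    return max(candidates, key=lambda t: accuracy(t, rows))


rng = random.Random(0)

# Toy dataset: the label is `x > 0.6`, flipped with 5% probability (label noise).
full = []
for _ in range(5000):
    x = rng.random()
    y = (x > 0.6) != (rng.random() < 0.05)
    full.append((x, y))

# The complete search space of candidate "pipelines" (here: thresholds).
grid = [i / 100 for i in range(101)]

# Step 1: pick a small subset (random sampling stands in for genDST).
subset = rng.sample(full, 250)

# Step 2: run the full search on the small subset (cheap).
intermediate = search(grid, subset)

# Step 3: run a restricted search on the full dataset, near the intermediate result.
restricted = [t for t in grid if abs(t - intermediate) <= 0.05]
final = search(restricted, full)

# Step 4: return (here: print) the refined result.
print(intermediate, final, accuracy(final, full))
```

In this sketch the expensive full-dataset pass scores only the handful of candidates in `restricted` rather than the whole `grid`, which is the source of the time saving the flow aims for.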
Using a venv is highly recommended.
python3 -m venv substrat_venv
cd substrat_venv
source bin/activate
Install the SubStrat package:
pip install substart-automl
from SubStrat import SubStrat
# Initialize SubStrat with a dataset and target column
s = SubStrat(dataset=my_dataset, target_col_name='target')
# Execute the SubStrat flow
cls = s.run()
See basic example here
Example in Google Colab
Provides the primary interface for the AutoML functionalities.
- dataset: Input dataset (pandas DataFrame).
- target_col_name: Name of the target column in the dataset.
- input_classifier: Classifier instance (optional). Defaults to an instance from AutoSklearn.
- summary_algorithm: Algorithm to summarize the dataset. Defaults to GeneticSubAlgorithmn.
- desired_accuracy: Desired accuracy for the output classifier.
- run(): Executes the SubStrat flow and returns an AutoSklearnClassifier.
Implements the genetic algorithm for dataset summarization.
- dataset: The original dataset that needs to be summarized.
- target_column_name: The name of the target column in the dataset.
- sub_row_size: The number of rows for the subset of the dataset (summary). If not provided, it will be calculated based on a predefined rule.
- sub_col_size: The number of columns for the subset of the dataset (summary). If not provided, it will be calculated based on a predefined rule.
- population_size: The number of individuals in the population for the genetic algorithm.
- fitness: The fitness function used in the genetic algorithm. If not provided, a default fitness function will be used.
- selection: The selection operator used in the genetic algorithm. If not provided, a default selection operator will be used.
- mutation_rate: The mutation rate used in the genetic algorithm.
- num_generation: The number of generations the genetic algorithm will run for.
- init_pop: The algorithm used to initialize the population for the genetic algorithm. If not provided, a default algorithm will be used.
- stagnation_limit: Number of generations without improvement in best gene score before stopping.
- time_limit: Maximum time in seconds the run function can execute.
- run(): Executes the genetic algorithm and returns the best subset of the dataset.
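To show how the parameters listed above might interact, here is a minimal genetic-algorithm sketch in plain Python. It is not SubStrat's implementation: the function `run_ga`, the truncation selection, the union-based crossover, and the single-column entropy fitness are all assumptions chosen to keep the example short. Only the keyword names (`sub_row_size`, `population_size`, `mutation_rate`, `num_generation`, `stagnation_limit`, `time_limit`) mirror the list above, and the sketch selects rows only, not columns.

```python
import math
import random
import time
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def run_ga(column, sub_row_size, population_size=20, mutation_rate=0.1,
           num_generation=50, stagnation_limit=10, time_limit=5.0, seed=0):
    """Search for `sub_row_size` row indices whose entropy is close to the full column's."""
    rng = random.Random(seed)
    n = len(column)
    target = entropy(column)

    def fitness(gene):
        # Higher is better: the negated entropy gap between subset and full column.
        return -abs(target - entropy([column[i] for i in gene]))

    def mutate(gene):
        # Swap each index for a random unused one with probability `mutation_rate`.
        gene = list(gene)
        for pos in range(len(gene)):
            if rng.random() < mutation_rate:
                candidate = rng.randrange(n)
                if candidate not in gene:
                    gene[pos] = candidate
        return gene

    def crossover(a, b):
        # Draw the child's indices from the union of both parents (keeps them unique).
        return rng.sample(sorted(set(a) | set(b)), sub_row_size)

    population = [rng.sample(range(n), sub_row_size) for _ in range(population_size)]
    best = max(population, key=fitness)
    best_score = fitness(best)
    stagnant = 0
    deadline = time.monotonic() + time_limit

    for _ in range(num_generation):
        if stagnant >= stagnation_limit or time.monotonic() > deadline:
            break
        # Truncation selection: the fitter half survives and breeds the rest.
        population.sort(key=fitness, reverse=True)
        parents = population[:max(2, population_size // 2)]
        children = [mutate(crossover(*rng.sample(parents, 2)))
                    for _ in range(population_size - len(parents))]
        population = parents + children
        champion = max(population, key=fitness)
        score = fitness(champion)
        if score > best_score:
            best, best_score, stagnant = champion, score, 0
        else:
            stagnant += 1
    return best, best_score


# Toy column with a skewed categorical distribution.
data_rng = random.Random(1)
column = [data_rng.choice("aabbbc") for _ in range(500)]
subset, score = run_ga(column, sub_row_size=30)
print(len(subset), round(score, 4))
```

The loop stops at whichever comes first: `num_generation` iterations, `stagnation_limit` generations without a better best score, or `time_limit` seconds.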
- Add verbose mode.
- Add the option to use more AutoML frameworks, like TPOT.
- Make SubStrat more configurable by the user.
- Make the UX more friendly.
Teddy Lazebnik, Amit Somech, and Abraham Itzhak Weinberg. 2022. SubStrat: A Subset-Based Optimization Strategy for Faster AutoML. Proc. VLDB Endow. 16, 4 (December 2022), 772–780. https://doi.org/10.14778/3574245.3574261