This code accompanies the NeurIPS 2022 paper Structural Kernel Search via Bayesian Optimization and Symbolical Optimal Transport. The repo can be used to import and run our algorithm (see example) and to reproduce the experimental results in the paper. Please cite the paper when the code is used for reproducing, reporting, or extending the method or code.
Our algorithm performs model selection for GPs at the kernel-structure level: it searches over a kernel grammar to find the kernel with the highest value of the model selection criterion. It first translates a kernel expression (base kernels combined with operators) into a tree and then uses BO over tree elements (via optimal transport) to do model selection. The final result is a selected kernel for a given dataset.
After cloning the repo or extracting it from a zip file, first switch to the base folder where this README lies and build the conda environment and the package itself via
conda env create --file environment.yml
conda activate bosot
pip install -e .
To test the algorithm on a simulated toy dataset you can call
python -m bosot.examples.bayesian_optimization_kernel_space
This example also illustrates how to use our method directly in Python.
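To make the idea of translating a kernel expression into a tree more concrete, the following is a minimal, self-contained sketch. It is not the bosot API: the classes `BaseKernel` and `Operator`, the helper functions, and the base kernel names `SE`, `PER`, and `LIN` are hypothetical and chosen only for illustration. The sketch shows the representation only; the optimal-transport comparison between trees is described in the paper and not reproduced here.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class BaseKernel:
    """Leaf node: a base kernel symbol (names here are illustrative)."""
    name: str


@dataclass(frozen=True)
class Operator:
    """Inner node: an operator such as "+" or "*" joining two sub-expressions."""
    symbol: str
    left: "Node"
    right: "Node"


Node = Union[BaseKernel, Operator]


def to_expression(node: Node) -> str:
    """Render the tree back into a fully parenthesized kernel expression."""
    if isinstance(node, BaseKernel):
        return node.name
    return f"({to_expression(node.left)} {node.symbol} {to_expression(node.right)})"


def base_kernel_counts(node: Node) -> Counter:
    """Count how often each base kernel occurs among the leaves of the tree."""
    if isinstance(node, BaseKernel):
        return Counter([node.name])
    return base_kernel_counts(node.left) + base_kernel_counts(node.right)


# Hypothetical expression: (SE * PER) + LIN
tree = Operator("+", Operator("*", BaseKernel("SE"), BaseKernel("PER")), BaseKernel("LIN"))
print(to_expression(tree))  # ((SE * PER) + LIN)
print(base_kernel_counts(tree))
```

Simple summaries of tree elements, such as the leaf counts above, are the kind of symbolic feature over which two kernel expressions could be compared.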
All experiments in the paper are done on publicly available datasets. To run our experiments, one needs to download the datasets and put them all in the same folder: YOUR_OWN_DATA_PATH. Besides the dataset folder YOUR_OWN_DATA_PATH, one also needs an output folder for the experimental results: YOUR_OWN_OUTPUT_PATH. In the following we describe how to access the datasets:
Download the airline-passengers.csv file from the publicly available Airline Passenger repo and put it into YOUR_OWN_DATA_PATH.
Download the publicly available zip file (LGBB Download link) and put the lgbb_original.txt file into YOUR_OWN_DATA_PATH.
To get the Powerplant dataset, first download the zip file from the Powerplant Data Folder. Extract the zip file and open the .xlsx file in Excel. Go to the fifth sheet and save it as power_plant_data.csv in YOUR_OWN_DATA_PATH.
To get the Airfoil dataset, download the file airfoil_self_noise.dat from the Airfoil Data Folder and put it into YOUR_OWN_DATA_PATH.
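Before starting a long run, it can help to check that all four files are in place. The following is a small sketch using only the file names given above; the function name `missing_dataset_files` and the pairing of dataset names with file names are our own illustration and not part of the bosot package.

```python
from pathlib import Path

# File names as given in the dataset instructions above; the pairing with
# dataset names is an assumption made for readable output.
EXPECTED_FILES = {
    "AirlinePassenger": "airline-passengers.csv",
    "LGBB": "lgbb_original.txt",
    "PowerPlant": "power_plant_data.csv",
    "Airfoil": "airfoil_self_noise.dat",
}


def missing_dataset_files(data_dir):
    """Return {dataset name: file name} for every expected file not found in data_dir."""
    data_dir = Path(data_dir)
    return {
        name: file_name
        for name, file_name in EXPECTED_FILES.items()
        if not (data_dir / file_name).is_file()
    }


if __name__ == "__main__":
    # Replace the placeholder with your own dataset folder.
    for name, file_name in missing_dataset_files("YOUR_OWN_DATA_PATH").items():
        print(f"{name}: {file_name} not found")
```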
All necessary dataset preprocessing steps are performed automatically by the dataset wrapper classes. Our proposed kernel search method can be run via the bosot/main.py script as a command-line tool. The dataset is selected by passing one of the strings [Airfoil, LGBB, PowerPlant, AirlinePassenger] as the --base_data_set_name argument. We give example configurations for running the script (the main script has more arguments, which are listed below); you only need to replace YOUR_OWN_DATA_PATH and YOUR_OWN_OUTPUT_PATH with your folders. Note that running the scripts can take a long time:
Running our method (SOT-Kernel-Kernel) on the Airline dataset can be done via:
python -m bosot.main --data_dir=YOUR_OWN_DATA_PATH --experiment_output_dir=YOUR_OWN_OUTPUT_PATH --n_steps=50 --base_data_n_train=100 --base_data_set_name=AirlinePassenger --kernel_grammar_generator_config=CKSWithRQGeneratorConfig --bayesian_optimizer_config=ObjectBOExpectedImprovementEAFewerStepsConfig --run_name=SOTonAirline
Running our method (SOT-Kernel-Kernel) on the LGBB dataset can be done via:
python -m bosot.main --data_dir=YOUR_OWN_DATA_PATH --experiment_output_dir=YOUR_OWN_OUTPUT_PATH --n_steps=50 --base_data_n_train=150 --base_data_set_name=LGBB --kernel_grammar_generator_config=CKSWithRQGeneratorConfig --run_name=SOTonLGBB
Running our method (SOT-Kernel-Kernel) on the Powerplant dataset can be done via:
python -m bosot.main --data_dir=YOUR_OWN_DATA_PATH --experiment_output_dir=YOUR_OWN_OUTPUT_PATH --n_steps=50 --base_data_n_train=500 --base_data_set_name=PowerPlant --kernel_grammar_generator_config=CKSHighDimGeneratorConfig --run_name=SOTonPowerplant
Running our method (SOT-Kernel-Kernel) on the Airfoil dataset can be done via:
python -m bosot.main --data_dir=YOUR_OWN_DATA_PATH --experiment_output_dir=YOUR_OWN_OUTPUT_PATH --n_steps=50 --base_data_n_train=500 --base_data_set_name=Airfoil --kernel_grammar_generator_config=CKSHighDimGeneratorConfig --run_name=SOTonAirfoil
Running the Hellinger-Kernel-Kernel approach on the Airline dataset can be done via:
python -m bosot.main --data_dir=YOUR_OWN_DATA_PATH --experiment_output_dir=YOUR_OWN_OUTPUT_PATH --n_steps=50 --base_data_n_train=100 --base_data_set_name=AirlinePassenger --kernel_kernel_config=BasicHellingerKernelKernelConfig --kernel_grammar_generator_config=CKSWithRQGeneratorConfig --bayesian_optimizer_config=ObjectBOExpectedImprovementConfig --run_name=HellingerOnAirline
Running Greedy Search on the Airline dataset can be done via:
python -m bosot.main --data_dir=YOUR_OWN_DATA_PATH --experiment_output_dir=YOUR_OWN_OUTPUT_PATH --n_steps=50 --base_data_n_train=100 --base_data_set_name=AirlinePassenger --use_heuristic=True --heuristic_search_config=GreedyKernelSearchBaseInitialConfig --num_stages_heuristics=5 --run_name=GreedyOnAirline
Running TreeGEP on the Airline dataset can be done via:
python -m bosot.main --data_dir=YOUR_OWN_DATA_PATH --experiment_output_dir=YOUR_OWN_OUTPUT_PATH --n_steps=50 --base_data_n_train=100 --base_data_set_name=AirlinePassenger --use_heuristic=True --heuristic_search_config=TreeGEPEvolutionaryOptimizerConfig --num_stages_heuristics=5 --run_name=TreeGEPOnAirline
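When several of these runs should be launched one after another, the commands can also be assembled from Python. The following is a sketch that uses only flags and values appearing in the example commands above; the helper `build_command` and the `RUNS` list are hypothetical and not part of bosot. The sketch only prints the commands; nothing is executed.

```python
import shlex
import sys

# Dataset names accepted by --base_data_set_name, as listed in this README.
DATASETS = ("Airfoil", "LGBB", "PowerPlant", "AirlinePassenger")


def build_command(data_dir, output_dir, dataset, n_train, grammar_config, run_name, n_steps=50):
    """Assemble the argument list for one `python -m bosot.main` call."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset name: {dataset!r}")
    return [
        sys.executable, "-m", "bosot.main",
        f"--data_dir={data_dir}",
        f"--experiment_output_dir={output_dir}",
        f"--n_steps={n_steps}",
        f"--base_data_n_train={n_train}",
        f"--base_data_set_name={dataset}",
        f"--kernel_grammar_generator_config={grammar_config}",
        f"--run_name={run_name}",
    ]


# (dataset, n_train, grammar config, run name), copied from the examples above.
RUNS = [
    ("LGBB", 150, "CKSWithRQGeneratorConfig", "SOTonLGBB"),
    ("Airfoil", 500, "CKSHighDimGeneratorConfig", "SOTonAirfoil"),
]

for dataset, n_train, grammar, run_name in RUNS:
    cmd = build_command("YOUR_OWN_DATA_PATH", "YOUR_OWN_OUTPUT_PATH",
                        dataset, n_train, grammar, run_name)
    print(shlex.join(cmd))
```

To actually start a run, each `cmd` list could be passed to `subprocess.run(cmd, check=True)` from within the activated conda environment.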
Our method and the competitor methods in our paper can be started via the bosot/main.py script as a command-line tool. Most parameters come with default values (see bosot/main.py). The following list specifies the most important command-line parameters:
--experiment_output_dir: The output folder where plots and metrics of the kernel search are saved (subfolders are generated automatically).
--data_dir: The input folder where all dataset files lie. The classes in the data_sets folder search here for the files they expect; which files they expect is described above.
--n_steps: Number of BO iterations when a BO method is used (only applies if --use_heuristic=False).
--run_name: Name of the run; it is used to create subfolders and final metric files.
--base_data_set_name: Name of the wrapper class corresponding to a dataset. It expects one of the following values: [Airfoil, LGBB, PowerPlant, AirlinePassenger].
--base_data_n_train: Number of training datapoints that are sampled uniformly from the whole base dataset.
--kernel_grammar_generator_config: Specifies the search space to use. It expects one of the following values: [CKSWithRQGeneratorConfig, CKSHighDimGeneratorConfig]. Here CKSWithRQGeneratorConfig corresponds to the search space for low-dimensional datasets and CKSHighDimGeneratorConfig to the one for higher-dimensional datasets (see paper for details).
--kernel_kernel_config: Specifies the kernel-kernel that is used. It expects one of the following values: [OTWeightedDimsExtendedGrammarKernelConfig, BasicHellingerKernelKernelConfig], where OTWeightedDimsExtendedGrammarKernelConfig is our method (SOT-Kernel-Kernel) and the default value. BasicHellingerKernelKernelConfig corresponds to the Hellinger Kernel-Kernel.
--use_heuristic: Flag indicating whether a heuristic should be used instead of BO.
--num_stages_heuristics: Number of stages in the heuristics, i.e., the number of stages in Greedy Search or the number of generations in TreeGEP.
--heuristic_search_config: Specifies which heuristic should be used in case --use_heuristic=True. It expects one of the following values: [GreedyKernelSearchBaseInitialConfig, TreeGEPEvolutionaryOptimizerConfig].
--seed: Seed value.
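As an illustration of what --base_data_n_train describes, the following sketch draws a uniform random subset of training indices. It is not the implementation used in bosot: the function `sample_train_indices` is hypothetical, and both sampling without replacement and the use of a seed for reproducibility are assumptions on our side.

```python
import random


def sample_train_indices(n_total, n_train, seed):
    """Draw n_train distinct indices uniformly at random from range(n_total)."""
    if n_train > n_total:
        raise ValueError("n_train exceeds the dataset size")
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    return sorted(rng.sample(range(n_total), n_train))


# The same seed gives the same subset, so a run can be repeated exactly.
idx_a = sample_train_indices(n_total=1000, n_train=100, seed=0)
idx_b = sample_train_indices(n_total=1000, n_train=100, seed=0)
print(len(idx_a), idx_a == idx_b)  # 100 True
```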
BOSOT is open-sourced under the AGPL-3.0 license. See the LICENSE file for details.