Most of the Enigmatic functions use Python's keyword parameters to set up experiments. This keeps all the experiment setup arguments in one place and simplifies method calls. Typically, you will have a dictionary with the parameters as follows.
```python
run = {
    "bid"      : "tptp/PUZ",
    "pids"     : ["mzr01", "mzr02"],
    "refs"     : ["mzr02"],
    "limit"    : "T5",
    "cores"    : 32,
    "features" : "C(l,p,x,s,r,h,v,c,d,a):G:P",
    "learner"  : LightGBM(),
    "eargs"    : "--training-examples=3 -s --free-numbers",
    "dataname" : "mytest",
    "iters"    : 6,
}
```
Then you run, for example, the eval/train loop as follows.
```python
from pyprove import log
from enigmatic import models
from enigmatic.learn.lgbooster import LightGBM

run = ... # see above
logger = log.logger("loop", **run)
models.loops(**run)
```
Don't forget to set up the logger; otherwise you might not see any output.
See below for a description of the `run` parameters.
This is the list of `run` parameters supported by Enigmatic and their basic descriptions.
name | type | description |
---|---|---|
`bid` | `str` | benchmark id: directory relative to `$PYPROVE_BENCHMARKS` (or `.`) |
`pids` | `[str]` | strategy id list (filenames relative to the `./strats` directory) |
`limit` | `str` | evaluation time limit (use `T5` for 5 seconds) |
`cores` | `int` | number of cores to use for evaluation (set `$OMP_NUM_THREADS` to control model building) |
`eargs` | `str` | additional command line arguments for `eprover` |
`refs` | `[str]` | reference strategies for model building |
`features` | `str` | Enigmatic features to use when generating training data |
`dataname` | `str` | custom experiment name |
`iters` | `int` | number of iterations for model looping |
`split` | `float` | ratio used to divide train/test data |
`forgets` | `(float, float)` | randomly forget the given fractions of (neg, pos) training samples (can be `(None, None)`) |
`balance` | `int` | automatically keep the pos:neg ratio close to 1:`balance` |
`options` | `[str]` | option flags |
`debug` | `[str]` | debugging flags |
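For illustration, a `run` dictionary using some of the optional parameters might look like this (all values are hypothetical, and the `split` semantics shown in the comment is an assumption):

```python
run = {
    # ... mandatory parameters as in the example above ...
    "split"   : 0.9,          # e.g. 90% of the data for training, 10% for testing
    "forgets" : (0.5, None),  # randomly drop half of the negative samples
    "balance" : 4,            # keep the pos:neg ratio close to 1:4
}
```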
The following option flags are recognized.
flag | description |
---|---|
`headless` | do not use progress bars |
`loop-coop-only` | use only coop strategies when looping (no solo strategies) |
The following debugging flags are recognized.
flag | description |
---|---|
`acc` | compute train/test model accuracies (use together with `split`) |
`train` | keep separate uncompressed train vectors for each problem file in `00TRAINS` |
`nozip` | do not compress training data |
`force` | do not use stored files and recompute everything |
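For example, flags from both tables are passed as string lists in the `run` dictionary (the values here are only an illustration):

```python
run = {
    # ... other parameters as above ...
    "split"   : 0.9,
    "options" : ["headless"],      # no progress bars
    "debug"   : ["acc", "nozip"],  # report train/test accuracies, keep data uncompressed
}
```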
Enigmatic provides an automatic LightGBM model parameter tuner, implemented using `optuna` and motivated by `optuna`'s LightGBM integration. It can be used either manually (if you already have training data) or automatically by Enigmatic model building methods like `models.build`.
If you already have train and test vectors generated in the files `train.in` and `test.in`, and you want to find the model with the best testing accuracy, use the `enigmatic.lgbtune` module as follows.
```python
from pyprove import log
from enigmatic import lgbtune

logger = log.logger("tune")
lgbtune.lgbtune("train.in", "test.in", timeout=3600)
```
This will run for 3600 seconds and create a number of models in a temporary directory (adjustable by the optional argument `d_tmp`).
In the end, you will see output like this:
```
Best model params: {'num_leaves': 512, ...}
Best model accuracy: 72.73% (89.74% / 64.63%)
Best model file: optuna-tmp/min_data/model0002.lgb
```
Other arguments to control the tuning are described below.
parameter | default | description |
---|---|---|
`d_tmp` | `"./optuna-tmp"` | tuner temporary directory |
`phases` | `"l:b:m:r"` | LightGBM parameters to tune in phases (see below) |
`timeout` | `3600` | tuning timeout in seconds |
`iters` | `None` | alternative runtime limit given by the number of models to be built |
`init_params` | `{}` | initial model to build and default values of non-tunable parameters |
`min_leaves` | `256` | the minimal number of leaves in a LightGBM model |
`max_leaves` | `32767` | the maximal number of leaves in a LightGBM model |
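For instance, a tuning run overriding several of these defaults might look as follows (the values are chosen only for illustration):

```python
from pyprove import log
from enigmatic import lgbtune

logger = log.logger("tune")
lgbtune.lgbtune(
    "train.in", "test.in",
    d_tmp="./my-optuna-tmp",             # custom temporary directory
    iters=40,                            # build at most 40 models
    min_leaves=128, max_leaves=8192,     # restrict the num_leaves search space
    init_params={"learning_rate": 0.2},  # fix a non-tunable parameter
)
```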
The LightGBM parameters are tuned in phases. The parameter `phases` controls which parameters are tuned and in which order. The `phases` value is a string of characters from `l`, `b`, `m`, and `r`, separated by `:`. Their meanings are as follows.
phase | LightGBM parameter(s) to tune |
---|---|
`l` | `num_leaves` |
`b` | `bagging_fraction` and `bagging_freq` |
`m` | `min_data` |
`r` | `lambda_l1` and `lambda_l2` |
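For example, to tune only the number of leaves followed by `min_data`, you might call (with an illustrative timeout):

```python
lgbtune.lgbtune("train.in", "test.in", timeout=1800, phases="l:m")
```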
The number of leaves is tuned in phase `l`, which should be the first phase. Using just `phases="l"` is also perfectly fine. The values for the LightGBM parameter `num_leaves` are sampled from the values 2^(n/2) (for all natural numbers `n`) which fall between `min_leaves` and `max_leaves`.
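As a minimal sketch (an illustration of this sampling scheme, not Enigmatic's actual code; the exact rounding is an assumption), the candidate values can be enumerated like this:

```python
def num_leaves_candidates(min_leaves=256, max_leaves=32767):
    """Yield the values 2^(n/2), rounded, between min_leaves and max_leaves."""
    n = 1
    while True:
        value = round(2 ** (n / 2))
        if value > max_leaves:
            break
        if value >= min_leaves:
            yield value
        n += 1

print(list(num_leaves_candidates()))
# [256, 362, 512, 724, 1024, 1448, 2048, 2896, 4096, 5793, 8192, 11585, 16384, 23170]
```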
The runtime limit (`timeout` or `iters` or both) is divided evenly among the phases (hence `iters` should be divisible by the number of phases). For example, with `timeout=3600` and four phases, each phase gets 900 seconds. Note that the timeout is checked only after a model finishes building, so at least one model will be built in each phase.
Unless `init_params` is set to `None`, an initial model will be built at the beginning. This initial model building is not covered by the runtime limit(s). Values from `init_params` are also used to set non-tunable parameter values like `learning_rate` during the tuning. If some parameter value is not present in `init_params` (or `init_params` is `None`), the default value from `enigmatic.learn.lgbooster.DEFAULTS` will be used.
The tuner slightly favors testing accuracy on positive samples. Given the testing accuracies `(posacc, negacc)`, the score of a model is computed as `2*posacc + negacc`, and the model with the highest score is considered the best. The positive accuracy weight can be controlled via `enigmatic.lgbtune.POS_ACC_WEIGHT`.
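Equivalently, as a sketch (the weight value 2 follows from the formula above):

```python
POS_ACC_WEIGHT = 2  # see enigmatic.lgbtune.POS_ACC_WEIGHT

def model_score(posacc, negacc):
    # higher score is better; positive accuracy counts twice
    return POS_ACC_WEIGHT * posacc + negacc
```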
If you have Enigmatic-compatible compressed data files like `train.in-data.npz` and `train.in-label.npz`, you can use them as well. This is in fact highly recommended, as it speeds up data loading. Note, however, that even with compressed data you still pass just `train.in` and `test.in` to `lgbtune`; it will recognize the compressed data automatically. The compressed files must be named as above (suffixes `*-data.npz` and `*-label.npz`).
To compress an SVM-Light text file `train.in`, use `enigmatic.trains.compress` as follows.
```python
from pyprove import log
from enigmatic import trains

logger = log.logger("compress")
trains.compress("train.in")
```
Instead of the standard LightGBM/XGBoost boosters, you can also use automatic model building (with `models.build` or `models.train`). Enigmatic will then tune the parameters for you automatically. Just run as follows.
```python
from enigmatic import models
from enigmatic.learn.autolgb import AutoLgb

run = {
    ...
    "learner": AutoLgb(iters=12),
    ...
}
models.build(**run)
```
`AutoLgb` can additionally take the same parameters as `lgbtune` to control the tuning. The defaults, however, are different (`iters=30`, `phases="l:b:m"`, `timeout=None`, `init_params=None`).
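For example (with hypothetical values), a longer tuning run over all four phases could be configured as:

```python
from enigmatic.learn.autolgb import AutoLgb

run = {
    # ... other run parameters ...
    "learner": AutoLgb(iters=40, phases="l:b:m:r", min_leaves=128),
}
```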