
Conditional Bernoulli Mixtures

Pyramid implements the "Conditional Bernoulli Mixtures (CBM)" algorithm described in the paper

Conditional Bernoulli Mixtures for Multi-label Classification.
Cheng Li, Bingyu Wang, Virgil Pavlu, and Javed Aslam.
In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

Model Architecture: mixture for label dependencies

CBM is designed to provide accurate predictions for multi-label classification problems by capturing label dependencies with a mixture model. It is a discriminative extension of the Bernoulli Mixture Model. CBM reduces multi-label classification to multi-class and binary classifications and uses existing multi-class and binary classifiers as base learners.
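
As a rough sketch of the model in our own notation (not necessarily the exact symbols used in the paper), CBM writes the conditional joint label probability as

p(y \mid x) = \sum_{k=1}^{K} \pi_k(x) \prod_{\ell=1}^{L} b_{k\ell}(y_\ell \mid x)

where \pi_k(x) is the component membership probability produced by the multi-class base learner, and b_{k\ell}(y_\ell \mid x) is the binary base learner for label \ell inside component k. Conditioned on a component, labels are predicted independently; label dependencies are captured by mixing over the K components.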

The package includes three CBM implementations with different base learners: L2 regularized logistic regressions, elastic-net (L1+L2) regularized logistic regressions, and gradient boosted trees.
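
All three versions are launched the same way, each with its own properties file, as shown in the corresponding sections below:

./pyramid config/cbm_lr.properties    # L2 regularized logistic regression base learners
./pyramid config/cbm_en.properties    # elastic-net regularized logistic regression base learners
./pyramid config/cbm_gb.properties    # gradient boosted tree base learners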

Training: sparsity for speed

Our newly designed model training procedure exploits the sparsity in the feature matrix, the label matrix, and the CBM structure itself. This allows training to scale to large datasets with many features and labels, and also allows CBM to use a large number of components to increase its modeling power. The original algorithm described in the ICML paper did not exploit the sparsity in the CBM structure and thus had a complexity linear in the number of mixture components. By exploiting structural sparsity, the computational cost no longer grows linearly with the number of components. In fact, using a large number of components increases the sparsity and can thus even speed up the overall training. On the RCV1-2K dataset (containing 47,236 features, 2,456 labels and 623,847 instances), training a CBM model with 50 components is 3X faster than training a CBM model with 1 component (which is equivalent to the widely used binary relevance method that trains one independent binary classifier for each label).

Prediction: different outputs for different evaluation metrics

CBM currently supports three prediction methods designed for different evaluation metrics: it outputs the joint mode to optimize instance set accuracy; it outputs the marginal modes to optimize instance Hamming loss; and it runs the GFM algorithm to optimize instance F1.
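
In rough notation (ours, for illustration only), the three decision rules are

\hat{y}_{\text{acc}} = \arg\max_{y} p(y \mid x)   (joint mode, for instance set accuracy)
\hat{y}_{\ell} = \mathbb{1}[\, p(y_\ell = 1 \mid x) > 0.5 \,] for each label \ell   (marginal modes, for instance Hamming loss)
\hat{y}_{\text{F1}} = \arg\max_{y} \mathbb{E}_{y' \sim p(\cdot \mid x)}[\, \mathrm{F1}(y', y) \,]   (GFM, for instance F1)

where the expectation in the last rule is taken over the estimated joint label distribution.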

CBM with L2 regularized logistic regression base learners

This is the fastest version. L2 regularized logistic regressions are trained by L-BFGS.

Usage

To run the CBM algorithm, please just type

./pyramid config/cbm_lr.properties

where the cbm_lr.properties file (which can be found in the config folder of the package) specifies all the algorithm parameters, as explained below.

Program Properties

The properties file (a plain text file with each line being a key value pair) specifies all input, output and hyper parameters required by the program. A sample properties file is shown below. The same file can also be found in the config folder associated with the code release. You can modify this file to set up the correct dataset paths on your computer and experiment with different model parameters.

############## input and output ###############

# Full path to input train dataset
input.trainData=/mnt/home/cheng/mlc_data_pyramid/rcv1subset_topics_1/train_test_split/train

# Full path to input validation dataset, if available
# Used for hyper parameter tuning and early stopping
# If no additional validation set is available, leave it blank,
# and a random 20% of the training data will be used as the validation set.
input.validData=

# Full path to input test dataset
input.testData=/mnt/home/cheng/mlc_data_pyramid/rcv1subset_topics_1/train_test_split/test

# Directory for the program output
output.dir=/mnt/home/cheng/out/cbm_lr/rcv1

# Whether to show detailed debugging information
output.verbose=false

################# functions #####################

# Perform hyper parameter tuning before training
# If external validation data is given, the model is trained on the full training data
# and tuned on the given validation data; otherwise, the model is trained on 80% of the training data,
# and tuned on the remaining 20% of the training data.
tune=false

# Train the model on all the available data (excluding test data), using tuned or user specified hyper parameters
# If the external validation data is also given, the model is trained on training data + validation data
train=true

# Load back trained model, make predictions on the test set, and evaluate test performance.
# The program shows several different predictions designed to optimize different evaluation metrics.
test=true

######### prediction method ########

# Whether to allow empty subset to be predicted; 
# true = allow empty prediction
# false = do not allow empty prediction
# auto = allow empty prediction only if the training set contains empty label sets
predict.allowEmpty=auto

# The threshold for skipping components with small contributions
# This is designed to speed up prediction
predict.piThreshold=0.001


######### tune #########
# Hyper parameter tuning uses the validation set to decide logistic regression regularization variance, 
# number of CBM components and number of EM training iterations.
# Users can specify candidate values for variance and components.
# The optimal EM training iterations will be determined automatically by monitoring the validation performance.
# The metric monitored is specified in tune.targetMetric.

# Which target evaluation metric should predictions be optimized for?
# Currently supported metrics: instance_set_accuracy, instance_f1 and instance_hamming_loss.
# Generally speaking, no single model is well suited for all evaluation metrics.
# Optimizing different metrics requires different prediction methods and hyper parameters.
# The program automatically chooses the optimal prediction method designed for each metric.
# The predictor designed for instance set accuracy outputs the joint mode.
# The predictor designed for instance Hamming loss outputs the marginal modes.
# The predictor designed for instance F1 runs the GFM algorithm.
# The metric specified here will serve as the main metric for selecting the best model during hyper parameter tuning
# Once the model is trained, the program shows all different predictions made by different prediction methods
tune.targetMetric=instance_set_accuracy

# What values to try for logistic regression Gaussian prior variance (for L2 regularization)
# Small values indicate strong regularizations
# The variance can greatly affect the performance and thus requires careful tuning
tune.variance.candidates=0.1,1,1000,1000000

# What values to try for number of CBM components
# The default value 50 usually gives good performance
# To reduce tuning time, users can just set tune.numComponents.candidates=50
tune.numComponents.candidates=1,20,50,100


# Evaluate the metric on the validation set every k iterations
# Frequent evaluation may slow down training
# Use a small value (e.g. 1) if we expect the training to take just a few iterations (e.g. 20)
# Use a big value (e.g. 10) if we expect the training to take many iterations (e.g. 200)
tune.monitorInterval=1

# the model training will never stop before it reaches this minimum number of iterations
tune.earlyStop.minIterations=5

# If the validation metric does not improve after k successive evaluations, the training will stop
# for example, if tune.monitorInterval=5 and tune.earlyStop.patience=2, training stops if there is no improvement in 10 iterations
# Using a patience value that is too small may cause the training to stop too early
# Using a patience value that is too big may increase the tuning time
tune.earlyStop.patience=10


######### train #################
# Whether to use the optimal hyper parameter values found by tuning
# These hyper parameters include: train.iterations, train.variance, and train.numComponents

# if true, users do not need to specify these values
# if false or if no tuning has been performed, users need to provide a value for each of them
train.useTunedHyperParameters=false

# When a separate validation set is provided, users can either set train.useValidData=false to only use it for hyper parameter tuning;
# or set train.useValidData=true to train the final model on training set + validation set after tuning
train.useValidData=false

# Number of EM training iterations
train.iterations=6

# Logistic regression Gaussian prior variance (for L2 regularization)
# The variance can greatly affect the performance and thus requires careful tuning
# The default value 1 may not be optimal
train.variance=1000000

# Number of CBM components
# The default value 50 usually gives good performance
train.numComponents=50


# The parameters below usually do not affect the performance much
# Users can use default values

# number of LBFGS parameter updates for LR in each M step
# The default value 15 is good most of the time
# If the train.iterations found by hyper parameter tuning is 1 or 2, each M step is probably doing too much work and the training overfits too quickly. In this case, we can decrease train.updatesPerIteration
train.updatesPerIteration=15

# In each component, skip instances with small membership values (gammas)
# This is designed to speed up training
train.skipDataThreshold=0.00001

# Skip training a classifier for a label in a component if that label almost never appears or almost always appears in that component. 
# A constant output (the prior probability) will be used in this case.
# This is designed to speed up training
train.skipLabelThreshold=0.00001
# Smooth the probability of a non-existent label in a component with its overall probability in the dataset
# This is designed to avoid zero probabilities
train.smoothStrength=0.0001

######## test ##############
# When generating prediction reports for individual label probabilities, labels with probabilities below the threshold will not be displayed
# This only makes the reports more readable; it does not affect the actual prediction in any way.
report.labelProbThreshold=0.2

# the internal Java class name for this application. 
# users do not need to modify this.
pyramid.class=CBMLR

Training the model

First make sure the input and output paths are set correctly on your computer. The training set is mandatory; the validation set is optional; and the test set does not need to be provided at training time. The training algorithm relies on a few hyper parameters (those with the train prefix). If you know how to choose proper hyper parameter values, you can set tune=false and train.useTunedHyperParameters=false, and the program will train with the provided hyper parameters.
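
For example, to skip tuning and train directly with hand-picked hyper parameters, the relevant keys in cbm_lr.properties would look like the following (the values shown are only illustrative; pick values suitable for your data):

tune=false
train=true
train.useTunedHyperParameters=false
train.iterations=6
train.variance=1000000
train.numComponents=50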

Tuning hyper parameters

Sometimes the default hyper parameters do not work well, and picking the right hyper parameters by hand can be tricky. The program provides a simple tuning function that selects the best hyper parameters based on a validation set. You specify some candidate hyper parameter values and a grid search will be performed to pick the best combination. You can also specify in tune.targetMetric the evaluation metric you wish to monitor and optimize during tuning. If you set tune=true and train.useTunedHyperParameters=true, the program will first tune the hyper parameters and then train the model using the tuned values.
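
For example, to tune first and then train with the tuned values, the relevant keys would look roughly like this (the candidate lists are illustrative; adjust them for your data):

tune=true
train=true
train.useTunedHyperParameters=true
tune.targetMetric=instance_set_accuracy
tune.variance.candidates=0.1,1,1000,1000000
tune.numComponents.candidates=1,20,50,100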

Making predictions

Our original paper describes just one prediction method, which aims for high instance set accuracy. But CBM is not limited to optimizing instance set accuracy. In fact, CBM provides a general joint probability estimate, which can be fed into various plug-in predictors designed for different metrics. The program currently supports prediction methods designed for instance set accuracy, instance Hamming loss and instance F1. The predictor designed for instance set accuracy outputs the joint mode. The predictor designed for instance Hamming loss outputs the marginal modes. The predictor designed for instance F1 runs the GFM algorithm. Once the model is trained, the program shows the predictions made by all three prediction methods, computes various evaluation metrics and saves them to the output directory.

In predictions.txt, each line contains the predicted set for an instance together with the probability of the set. An example:

{4, 32}:0.5703314710396906

In label_probabilities.txt, each line contains all individual (marginal) label probabilities for an instance. Labels are sorted by probability in decreasing order. Since there are usually a large number of label candidates, labels with low probabilities are not displayed in order to improve readability (the threshold can be specified in the properties file). An example:

4:0.6144574339227634, 32:0.5964267467053088, 31:0.3740235851502197

A sample output

Running the code using the sample properties file and the RCV1 dataset, you will get some output that looks like:

============================================================
Start training with given hyper parameters:
train.numComponents = 50
train.variance = 1000000.0
train.iterations = 6
The following labels do not actually appear in the training set and therefore cannot be learned:
49, 80
Initializing the model
Initialization done
Training progress: iteration 1
Training progress: iteration 2
Training progress: iteration 3
Training progress: iteration 4
Training progress: iteration 5
Training progress: iteration 6
training done!
time spent on training = 00:03:08.938

============================================================
============================================================

Making predictions on test set with 3 different predictors designed for different metrics:
============================================================
Making predictions on test set with the instance set accuracy optimal predictor
test performance with the instance set accuracy optimal predictor
instance subset accuracy = 0.49866666666666665
instance Jaccard index = 0.693695719095719
instance Hamming loss = 0.013896440129449837
instance F1 = 0.7577262903762904
instance precision = 0.8781555555555556
instance recall = 0.7148875901875902
label Jaccard Index = 0.2854058305082703
label Hamming loss = 0.013896440129449837
label F1 = 0.37039108607561394
label precision = 0.8942756068655454
label recall = 0.3150660378020579
label binary accuracy = 0.9861035598705502
micro Jaccard index = 0.6020389249304912
micro Hamming loss = 0.013896440129449837
micro F1 = 0.7515908827953256
micro precision = 0.8715953307392996
micro recall = 0.6606325638157226

test performance is saved to /mnt/home/cheng/out/cbm_lr/rcv1/test_predictions/instance_accuracy_optimal/performance.txt
predicted sets and their probabilities are saved to /mnt/home/cheng/out/cbm_lr/rcv1/test_predictions/instance_accuracy_optimal/predictions.txt
============================================================
============================================================
Making predictions on test set with the instance F1 optimal predictor
test performance with the instance F1 optimal predictor
instance subset accuracy = 0.483
instance Jaccard index = 0.6977624819624819
instance Hamming loss = 0.013957928802588998
instance F1 = 0.7682737562002266
instance precision = 0.876694312169312
instance recall = 0.7363148388648387
label Jaccard Index = 0.2929142527853247
label Hamming loss = 0.013957928802588997
label F1 = 0.378184127013246
label precision = 0.8801382908284088
label recall = 0.3261233296227414
label binary accuracy = 0.9860420711974109
micro Jaccard index = 0.6090109690871182
micro Hamming loss = 0.013957928802588997
micro F1 = 0.7570003943884163
micro precision = 0.8486609398686206
micro recall = 0.6832096003254348

test performance is saved to /mnt/home/cheng/out/cbm_lr/rcv1/test_predictions/instance_f1_optimal/performance.txt
predicted sets and their probabilities are saved to /mnt/home/cheng/out/cbm_lr/rcv1/test_predictions/instance_f1_optimal/predictions.txt
============================================================
============================================================
Making predictions on test set with the instance Hamming loss optimal predictor
test performance with the instance Hamming loss optimal predictor
instance subset accuracy = 0.473
instance Jaccard index = 0.672190392015392
instance Hamming loss = 0.013744336569579288
instance F1 = 0.7406699374699374
instance precision = 0.9160888888888888
instance recall = 0.6875311087061088
label Jaccard Index = 0.2643949761412336
label Hamming loss = 0.013744336569579288
label F1 = 0.34692530681280287
label precision = 0.9167522803488144
label recall = 0.28151669944960267
label binary accuracy = 0.9862556634304207
micro Jaccard index = 0.5927311085538933
micro Hamming loss = 0.013744336569579288
micro F1 = 0.7442952616051539
micro precision = 0.9121900826446281
micro recall = 0.6285975795789688

test performance is saved to /mnt/home/cheng/out/cbm_lr/rcv1/test_predictions/instance_hamming_loss_optimal/performance.txt
predicted sets and their probabilities are saved to /mnt/home/cheng/out/cbm_lr/rcv1/test_predictions/instance_hamming_loss_optimal/predictions.txt
============================================================
============================================================
computing other predictor-independent metrics
label averaged MAP
0.5297139332502768
instance averaged MAP
0.856168310276265
global AP truncated at 30
0.8131470893569791
individual label probabilities are saved to /mnt/home/cheng/out/cbm_lr/rcv1/test_predictions/label_probabilities.txt
individual log likelihood of the test ground truth label set is written to /mnt/home/cheng/out/cbm_lr/rcv1/test_predictions/ground_truth_log_likelihood.txt
average log likelihood of the test ground truth label sets = -5.574611222079268
This is computed by ignoring test instances with new labels unobserved during training
The following labels do not actually appear in the training set and therefore cannot be learned:
49, 80

============================================================

Sample Hyper Parameters

Roughly tuned hyper parameters are also available for a few datasets. These can be used to reproduce the results in our ICML paper (results may not be exactly the same as we have updated the training algorithm since then). These hyper parameters are tuned to maximize the instance set accuracy on the validation set. Users who are interested in other metrics can stick with the same variance value and re-tune the number of components and training iterations.

SCENE: scene.properties, scene.log
RCV1: rcv1.properties, rcv1.log
TMC: tmc.properties, tmc.log
MEDIAMILL: mediamill.properties, mediamill.log
NUSWIDE: nuswide.properties, nuswide.log

CBM with elasticnet (L1+L2) regularized logistic regression base learners

This version uses both an L1 penalty and an L2 penalty to regularize the logistic regression learners. It performs automatic feature selection and produces compact models, which makes it a good fit for high-dimensional multi-label text data. A coordinate descent algorithm is used to train the logistic regressions.
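
As stated in the configuration comments below, the overall elastic-net regularizer applied to each logistic regression has (roughly) the form

penalty * [ l1Ratio * L1 norm + (1 - l1Ratio) * L2 norm ]

so train.penalty controls the overall regularization strength, while train.l1Ratio trades off the sparsity-inducing L1 term against the L2 term (0 means pure L2, 1 means pure L1, 0.5 means half and half).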

Usage

To run the CBM algorithm with elastic-net logistic regression learners, please just type

./pyramid config/cbm_en.properties

where the cbm_en.properties file (which can be found in the config folder of the package) specifies all the algorithm parameters, as explained below.

Program Properties

############## input and output ###############

# Full path to input train dataset
input.trainData=/mnt/home/cheng/mlc_data_pyramid/rcv1subset_topics_1/train_test_split/train

# Full path to input validation dataset, if available
# Used for hyper parameter tuning and early stopping
# If no additional validation set is available, leave it blank,
# and a random 20% of the training data will be used as the validation set.
input.validData=

# Full path to input test dataset
input.testData=/mnt/home/cheng/mlc_data_pyramid/rcv1subset_topics_1/train_test_split/test

# Directory for the program output
output.dir=/mnt/home/cheng/out/cbm_en/rcv1

# Whether to show detailed debugging information
output.verbose=false

################# functions #####################

# Perform hyper parameter tuning before training
# If external validation data is given, the model is trained on the full training data
# and tuned on the given validation data; otherwise, the model is trained on 80% of the training data,
# and tuned on the remaining 20% of the training data.
tune=true

# Train the model on all the available data (excluding test data), using tuned or user specified hyper parameters
# If the external validation data is also given, the model is trained on training data + validation data
train=true

# Load back trained model, make predictions on the test set, and evaluate test performance.
# The program shows several different predictions designed to optimize different evaluation metrics.
test=true

######### prediction method ########

# Whether to allow empty subset to be predicted; 
# true = allow empty prediction
# false = do not allow empty prediction
# auto = allow empty prediction only if the training set contains empty label sets
predict.allowEmpty=auto

# The threshold for skipping components with small contributions
# This is designed to speed up prediction
predict.piThreshold=0.001


######### tune #########

# Hyper parameter tuning uses the validation set to decide elasticnet penalty and L1 ratio, 
# number of CBM components and number of EM training iterations.
# Users can specify candidate values for penalty, L1 ratio and components.
# The optimal EM training iterations will be determined automatically by monitoring the validation performance.
# The metric monitored is specified in tune.targetMetric.

# Which target evaluation metric should predictions be optimized for?
# Currently supported metrics: instance_set_accuracy, instance_f1 and instance_hamming_loss.
# Generally speaking, no single model is well suited for all evaluation metrics.
# Optimizing different metrics requires different prediction methods and hyper parameters.
# The program automatically chooses the optimal prediction method designed for each metric.
# The predictor designed for instance set accuracy outputs the joint mode.
# The predictor designed for instance Hamming loss outputs the marginal modes.
# The predictor designed for instance F1 runs the GFM algorithm.
# The metric specified here will serve as the main metric for selecting the best model during hyper parameter tuning
# Once the model is trained, the program shows all different predictions made by different prediction methods
tune.targetMetric=instance_f1

# the overall elastic-net penalty is a weighted combination of L1 norm and L2 norm and has the form 
# penalty*[l1Ratio*L1 norm + (1-l1Ratio)*L2 norm]

# What values to try for the overall elastic-net penalty 
# Big values indicate strong regularizations
# The penalty can greatly affect the performance and thus requires careful tuning
tune.penalty.candidates=0.0001,0.000001

# What values to try for L1 Ratio
# Any real number from 0 to 1, where 0 means L2 only, 1 means L1 only, and 0.5 means half L1 and half L2.
tune.l1Ratio.candidates=0.1,0.5

# What values to try for number of CBM components
# The default value 50 usually gives good performance
# To reduce tuning time, users can just set tune.numComponents.candidates=50
tune.numComponents.candidates=50


# Evaluate the metric on the validation set every k iterations
# Frequent evaluation may slow down training
# Use a small value (e.g. 1) if we expect the training to take just a few iterations (e.g. 20)
# Use a big value (e.g. 10) if we expect the training to take many iterations (e.g. 200)
tune.monitorInterval=1

# the model training will never stop before it reaches this minimum number of iterations
tune.earlyStop.minIterations=5

# If the validation metric does not improve after k successive evaluations, the training will stop
# for example, if tune.monitorInterval=5 and tune.earlyStop.patience=2, training stops if there is no improvement in 10 iterations
# Using a patience value that is too small may cause the training to stop too early
# Using a patience value that is too big may increase the tuning time
tune.earlyStop.patience=10


######### train #################

# Whether to use optimal hyper parameter values found by tuning
# These hyper parameters include: train.iterations, train.penalty, train.l1Ratio, and train.numComponents
# if true, users do not need to specify these values
# if false or if no tuning has been performed, users need to provide a value for each of them
train.useTunedHyperParameters=true

# When a separate validation set is provided, users can either set train.useValidData=false to only use it for hyper parameter tuning;
# or set train.useValidData=true to train the final model on training set + validation set after tuning
train.useValidData=false

# Number of EM training iterations
train.iterations=10

# the overall elastic-net penalty is a weighted combination of L1 norm and L2 norm and has the form 
# penalty*[l1Ratio*L1 norm + (1-l1Ratio)*L2 norm]
# Big values indicate strong regularizations
# The penalty can greatly affect the performance and thus requires careful tuning
train.penalty=0.0001

# Any real number from 0 to 1, where 0 means L2 only, 1 means L1 only, and 0.5 means half L1 and half L2.
train.l1Ratio=1.0

# Number of CBM components
# The default value 50 usually gives good performance
train.numComponents=20


# The parameters below usually do not affect the performance much
# Users can use default values

# whether to initialize CBM with random parameters
# default is false and uses BM to initialize CBM
train.randomInitialize=false

# whether to use line search for elastic-net training
# using line search slows down training
# default=false
# on very rare occasions, the training may diverge without line search; if this happens, set train.elasticnet.lineSearch=true
train.elasticnet.lineSearch=false

# whether to speed up elastic-net training using the active set trick
train.elasticnet.activeSet=true

# number of coordinate descent iterations for LR in each M step
# The default value 5 is good most of the time
# If the train.iterations found by hyper parameter tuning is 1 or 2, each M step is probably doing too much work and the training overfits too quickly. In this case, we can decrease train.updatesPerIteration
train.updatesPerIteration=5

# In each component, skip instances with small membership values (gammas)
# This is designed to speed up training
train.skipDataThreshold=0.00001

# Skip training a classifier for a label in a component if that label almost never appears or almost always appears in that component. 
# A constant output (the prior probability) will be used in this case.
# This is designed to speed up training
train.skipLabelThreshold=0.00001

# Smooth the probability of a non-existent label in a component with its overall probability in the dataset
# This is designed to avoid zero probabilities
train.smoothStrength=0.0001

######## test ##############

# When generating prediction reports for individual label probabilities, labels with probabilities below the threshold will not be displayed
# This only makes the reports more readable; it does not affect the actual prediction in any way.
report.labelProbThreshold=0.2





# the internal Java class name for this application. 
# users do not need to modify this.
pyramid.class=CBMEN

CBM with gradient boosting base learners

Gradient boosting learners are more powerful than logistic regression learners and handle non-linearity better. CBM with gradient boosting learners is well suited for complex multi-label classification tasks that require non-linear mappings.

Usage

To run the CBM algorithm with gradient boosting learners, please just type

./pyramid config/cbm_gb.properties

where the cbm_gb.properties file (which can be found in the config folder of the package) specifies all the algorithm parameters, as explained below.

Program Properties

############## input and output ###############

# Full path to input train dataset
input.trainData=/mnt/home/cheng/mlc_data_pyramid/mediamill/train_test_split/train

# Full path to input validation dataset, if available
# Used for hyper parameter tuning and early stopping
# If no additional validation set is available, leave it blank,
# and a random 20% of the training data will be used as the validation set.
input.validData=

# Full path to input test dataset
input.testData=/mnt/home/cheng/mlc_data_pyramid/mediamill/train_test_split/test

# Directory for the program output
output.dir=/mnt/home/cheng/out/cbm_gb/mediamill

# Whether to show detailed debugging information
output.verbose=false

################# functions #####################

# Perform hyper parameter tuning before training
# If external validation data is given, the model is trained on the full training data
# and tuned on the given validation data; otherwise, the model is trained on 80% of the training data,
# and tuned on the remaining 20% of the training data.
tune=false

# Train the model on all the available data (excluding test data), using tuned or user specified hyper parameters
# If the external validation data is also given, the model is trained on training data + validation data
train=true

# Load back trained model, make predictions on the test set, and evaluate test performance.
# The program shows several different predictions designed to optimize different evaluation metrics.
test=true

######### prediction method ########

# Whether to allow empty subset to be predicted; 
# true = allow empty prediction
# false = do not allow empty prediction
# auto = allow empty prediction only if the training set contains empty label sets
predict.allowEmpty=auto

# The threshold for skipping components with small contributions
# This is designed to speed up prediction
predict.piThreshold=0.001


######### tune #########

# Hyper parameter tuning uses the validation set to decide number of regression tree leaves, 
# number of CBM components and number of EM training iterations.
# Users can specify candidate values for number of leaves and components.
# The optimal EM training iterations will be determined automatically by monitoring the validation performance.
# The metric monitored is specified in tune.targetMetric.

# Which target evaluation metric should predictions be optimized for?
# Currently supported metrics: instance_set_accuracy, instance_f1 and instance_hamming_loss.
# Generally speaking, no single model is well suited for all evaluation metrics.
# Optimizing different metrics requires different prediction methods and hyper parameters.
# The program automatically chooses the optimal prediction method designed for each metric.
# The predictor designed for instance set accuracy outputs the joint mode.
# The predictor designed for instance Hamming loss outputs the marginal modes.
# The predictor designed for instance F1 runs the GFM algorithm.
# The metric specified here will serve as the main metric for selecting the best model during hyper parameter tuning
# Once the model is trained, the program shows all different predictions made by different prediction methods
tune.targetMetric=instance_set_accuracy

# What values to try for the number of leaves in each regression tree
# Having more leaves makes the model more powerful, but the training will take longer
tune.numLeaves.candidates=2,5,15
# What values to try for number of CBM components
# The default value 50 usually gives good performance
# To reduce tuning time, users can just set tune.numComponents.candidates=50
tune.numComponents.candidates=1,20,50


# Evaluate the metric on the validation set every k iterations
# Frequent evaluation may slow down training
# Use a small value (e.g. 1) if we expect the training to take just a few iterations (e.g. 20)
# Use a big value (e.g. 10) if we expect the training to take many iterations (e.g. 200)
tune.monitorInterval=5

# the model training will never stop before it reaches this minimum number of iterations
tune.earlyStop.minIterations=5

# If the validation metric does not improve after k successive evaluations, the training will stop
# for example, if tune.monitorInterval=5 and tune.earlyStop.patience=2, training stops if there is no improvement in 10 iterations
# Using a patience value that is too small may cause the training to stop too early
# Using a patience value that is too big may increase the tuning time
tune.earlyStop.patience=100


######### train #################

# Whether to use optimal hyper parameter values found by tuning
# These hyper parameters include: train.iterations, train.numLeaves, and train.numComponents
# if true, users do not need to specify these values
# if false or if no tuning has been performed, users need to provide a value for each of them
train.useTunedHyperParameters=false

# When a separate validation set is provided, users can either set train.useValidData=false to only use it for hyper parameter tuning;
# or set train.useValidData=true to train the final model on training set + validation set after tuning
train.useValidData=false

# Number of EM training iterations
train.iterations=120

# Number of leaves in each regression tree
# Having more leaves makes the model more powerful but the training will be slower
train.numLeaves=5

# Number of CBM components
# The default value 50 usually gives good performance
train.numComponents=5


# The parameters below usually do not affect the performance much
# Users can use default values

# shrink the output of each regression tree by a factor
# by default, no shrinkage is applied
# using a shrinkage rate such as 0.1 may eventually lead to a better test performance but requires more training iterations
train.shrinkage=0.1

# number of gradient boosting updates (i.e. number of trees to fit) in each M step
# The default value 10 is good most of the time
# If the train.iterations found by hyper parameter tuning is 1 or 2, each M step is probably doing too much work and the training overfits too quickly. In this case, we can decrease train.updatesPerIteration
train.updatesPerIteration=20

# In each component, skip instances with small membership values (gammas)
# This is designed to speed up training
train.skipDataThreshold=0.00001

# Skip training a classifier for a label in a component if that label almost never appears or almost always appears in that component. 
# A constant output (the prior probability) will be used in this case.
# This is designed to speed up training
train.skipLabelThreshold=0.00001
# Smooth the probability of a non-existent label in a component with its overall probability in the dataset
# This is designed to avoid zero probabilities
train.smoothStrength=0.000001

######## test ##############

# When generating prediction reports for individual label probabilities, labels with probabilities below the threshold will not be displayed
# This only makes the reports more readable; it does not affect the actual prediction in any way.
report.labelProbThreshold=0.2





# the internal Java class name for this application. 
# users do not need to modify this.
pyramid.class=CBMGB

Running CBM GB on mediamill with the above hyper parameters will produce an output that looks like this.

Sample Datasets

The sample multi-label datasets used in the CBM paper together with some other commonly used datasets can be downloaded here.

Source Code

The source code files related to CBM can be found here, here and here.

References

The CBM algorithm is described in

@inproceedings{li2016conditional,
  author={Li, Cheng and Wang, Bingyu and Pavlu, Virgil and Aslam, Javed A.},
  title={Conditional Bernoulli Mixtures for Multi-label Classification},
  booktitle={Proceedings of the 33rd International Conference on Machine Learning},
  year={2016},
  pages={2482--2491}
}

Optimizing F1 metric by combining CBM and GFM is described in

@inproceedings{wang2018pipeline,
  title={A Pipeline for Optimizing F1-Measure in Multi-label Text Classification},
  author={Wang, Bingyu and Li, Cheng and Pavlu, Virgil and Aslam, Jay},
  booktitle={2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA)},
  pages={913--918},
  year={2018},
  organization={IEEE}
}

The GFM algorithm for general F1 maximization is described in

@article{waegeman2014bayes,
  title={On the Bayes-optimality of F-measure maximizers},
  author={Waegeman, Willem and Dembczynski, Krzysztof and Jachnik, Arkadiusz and Cheng, Weiwei and H{\"u}llermeier, Eyke},
  journal={Journal of Machine Learning Research},
  volume={15},
  number={1},
  pages={3333--3388},
  year={2014}
}