In [2]:
import sys

sys.path.append("../")

In [3]:
%load_ext autoreload
%autoreload 2

In [24]:
from rumboost.rumboost import rum_train
from rumboost.datasets import load_preprocess_LPMC
from rumboost.metrics import cross_entropy
from rumboost.datasets import prepare_dataset

import numpy as np
import lightgbm

# Example: GPU and batch training 

This notebook demonstrates RUMBoost's GPU and batch training features through an example on the LPMC dataset, a mode choice dataset collected in London and developed by Hillel et al. (2018). You can find the original source of data [here](https://www.icevirtuallibrary.com/doi/suppl/10.1680/jsmic.17.00018) and the original paper [here](https://www.icevirtuallibrary.com/doi/full/10.1680/jsmic.17.00018).

We first load the preprocessed dataset and its folds for cross-validation. You can find the data under the Data folder.

In [9]:
# load dataset
LPMC_train, LPMC_test, folds = load_preprocess_LPMC(path="../Data/")

## Training and testing a RUMBoost model with gradient and hessian computed on the GPU

To train a RUMBoost model on the GPU, install `torch` and pass a dictionary to the `torch_tensors` argument of `rum_train()` with the following keys:
- `'device'`: can be cpu, gpu or cuda. We recommend cuda for best results.
- `'torch_compile'`: a boolean indicating whether `torch.compile` is used to compute the gradient and hessian.

Note that this only applies to the calculations within RUMBoost; the LightGBM calculations still run on the CPU. It can still provide a substantial speed-up for models with heavy calculations (large datasets, or nested and cross-nested logit models).

In [10]:
torch_tensors = {
    'device':'cuda',
    'torch_compile':False,
}
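
The cell above hard-codes `'cuda'`. On a machine without a CUDA GPU, a fallback can be useful. The following is a minimal sketch, not part of RUMBoost: the helper name `make_torch_tensors` is hypothetical, and returning `None` when `torch` is missing assumes the caller then simply omits the `torch_tensors` argument.

```python
import importlib.util


def make_torch_tensors(use_compile=False):
    """Build a torch_tensors dictionary, preferring CUDA when it is available.

    Hypothetical helper: returns None when PyTorch is not installed, in which
    case the caller is assumed to leave out the torch_tensors argument.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    return {"device": device, "torch_compile": use_compile}


torch_tensors_auto = make_torch_tensors()
print(torch_tensors_auto)
```

The result uses the same two keys as the dictionary defined above, so it could be passed to `rum_train()` in the same way.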

In [11]:
# parameters
general_params = {
    "n_jobs": -1,
    "num_classes": 4,  # important
    "verbosity": 1,  # specific RUMBoost parameter
    "num_iterations": 3000,
    "early_stopping_round": 100,
}

In [12]:
rum_structure = [
    {
        "utility": [0],
        "variables": [
            "age",
            "female",
            "day_of_week",
            "start_time_linear",
            "car_ownership",
            "driving_license",
            "purpose_B",
            "purpose_HBE",
            "purpose_HBO",
            "purpose_HBW",
            "purpose_NHBO",
            "fueltype_Average",
            "fueltype_Diesel",
            "fueltype_Hybrid",
            "fueltype_Petrol",
            "distance",
            "dur_walking",
        ],
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": 0.1,
            "monotone_constraints": [
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                -1,
                -1,
            ],
            "interaction_constraints": [
                [0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13],
                [14],
                [15],
                [16],
            ],
        },
        "shared": False,
    },
    {
        "utility": [1],
        "variables": [
            "age",
            "female",
            "day_of_week",
            "start_time_linear",
            "car_ownership",
            "driving_license",
            "purpose_B",
            "purpose_HBE",
            "purpose_HBO",
            "purpose_HBW",
            "purpose_NHBO",
            "fueltype_Average",
            "fueltype_Diesel",
            "fueltype_Hybrid",
            "fueltype_Petrol",
            "distance",
            "dur_cycling",
        ],
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": 0.1,
            "monotone_constraints": [
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                -1,
                -1,
            ],
            "interaction_constraints": [
                [0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13],
                [14],
                [15],
                [16],
            ],
        },
        "shared": False,
    },
    {
        "utility": [2],
        "variables": [
            "age",
            "female",
            "day_of_week",
            "start_time_linear",
            "car_ownership",
            "driving_license",
            "purpose_B",
            "purpose_HBE",
            "purpose_HBO",
            "purpose_HBW",
            "purpose_NHBO",
            "fueltype_Average",
            "fueltype_Diesel",
            "fueltype_Hybrid",
            "fueltype_Petrol",
            "distance",
            "dur_pt_access",
            "dur_pt_bus",
            "dur_pt_rail",
            "dur_pt_int_waiting",
            "dur_pt_int_walking",
            "pt_n_interchanges",
            "cost_transit",
        ],
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": 0.1,
            "monotone_constraints": [
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
            ],
            "interaction_constraints": [
                [0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13],
                [14],
                [15],
                [16],
                [17],
                [18],
                [19],
                [20],
                [21],
                [22],
            ],
        },
        "shared": False,
    },
    {
        "utility": [3],
        "variables": [
            "age",
            "female",
            "day_of_week",
            "start_time_linear",
            "car_ownership",
            "driving_license",
            "purpose_B",
            "purpose_HBE",
            "purpose_HBO",
            "purpose_HBW",
            "purpose_NHBO",
            "fueltype_Average",
            "fueltype_Diesel",
            "fueltype_Hybrid",
            "fueltype_Petrol",
            "distance",
            "dur_driving",
            "cost_driving_fuel",
            "congestion_charge",
            "driving_traffic_percent",
        ],
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": 0.1,
            "monotone_constraints": [
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                -1,
                -1,
                -1,
                -1,
                -1,
            ],
            "interaction_constraints": [
                [0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13],
                [14],
                [15],
                [16],
                [17],
                [18],
                [19],
            ],
        },
        "shared": False,
    },
]
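
The four entries of `rum_structure` follow the same pattern: the first 15 variables are unconstrained (`0`), the remaining alternative-specific variables (starting with `distance`) are constrained to be monotonically decreasing (`-1`), and each variable sits in its own interaction group. As a sketch only, the hypothetical helper `utility_spec` below (not part of RUMBoost) builds one entry from two variable lists and reproduces the walking utility.

```python
def utility_spec(utility, unconstrained, decreasing, learning_rate=0.1):
    """Build one rum_structure entry (hypothetical helper, for illustration).

    unconstrained: variables with monotone constraint 0
    decreasing:    variables with monotone constraint -1
    """
    variables = list(unconstrained) + list(decreasing)
    return {
        "utility": [utility],
        "variables": variables,
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": learning_rate,
            # one constraint per variable, in the same order as "variables"
            "monotone_constraints": [0] * len(unconstrained) + [-1] * len(decreasing),
            # one singleton group per variable: no feature interactions
            "interaction_constraints": [[i] for i in range(len(variables))],
        },
        "shared": False,
    }


socio_vars = [
    "age", "female", "day_of_week", "start_time_linear", "car_ownership",
    "driving_license", "purpose_B", "purpose_HBE", "purpose_HBO", "purpose_HBW",
    "purpose_NHBO", "fueltype_Average", "fueltype_Diesel", "fueltype_Hybrid",
    "fueltype_Petrol",
]

walking = utility_spec(0, socio_vars, ["distance", "dur_walking"])
print(len(walking["variables"]), walking["boosting_params"]["monotone_constraints"][-2:])
```

Keeping the constraint lists derived from the variable lists avoids the two going out of sync when a variable is added or removed.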

In [13]:
model_specification = {
    "general_params": general_params,
    "rum_structure": rum_structure,
}

In [14]:
# features and label column names
features = [f for f in LPMC_train.columns if f != "choice"]
label = "choice"

# create lightgbm dataset
lgb_train_set = lightgbm.Dataset(
    LPMC_train[features], label=LPMC_train[label], free_raw_data=False
)
lgb_test_set = lightgbm.Dataset(
    LPMC_test[features], label=LPMC_test[label], free_raw_data=False
)

In [19]:
general_params["num_iterations"] = 1276
general_params["early_stopping_round"] = None

LPMC_model_fully_trained = rum_train(lgb_train_set, model_specification, torch_tensors=torch_tensors)

preds = LPMC_model_fully_trained.predict(lgb_test_set)

ce_test = cross_entropy(preds.cpu().numpy(), lgb_test_set.get_label().astype(int))

print("-" * 50)
print(f"Final negative cross-entropy on the test set: {ce_test}")



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.031607 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 54766, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000233 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 54766, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003215 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1834
[LightGBM] [Info] Number of data points in the train set: 54766, number of used features: 23

## Subsampling and batch training

To speed up training, you can either use subsampling within the RUMBoost training environment or provide batches directly. Batch training is memory-efficient because it avoids duplicating the data during training, which is useful with large datasets. Subsampling is more accurate than batch training for GBDT but requires more memory, as the full dataset needs to be passed to the RUMBoost object.
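
To make the idea of subsampling concrete, here is a minimal NumPy sketch of drawing a fresh random half of the rows at every boosting iteration. It illustrates the concept only and is not RUMBoost's internal implementation; sampling without replacement and redrawing at every iteration are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_data = 54766   # number of training rows reported in the LightGBM log
subsampling = 0.5  # fraction of rows used at each iteration
subsample_size = int(subsampling * num_data)

for iteration in range(3):
    # draw a new subset of row indices (assumed: without replacement)
    idx = np.sort(rng.choice(num_data, size=subsample_size, replace=False))
    # gradients and hessians would be computed on rows idx only
    print(iteration, len(idx), idx[:3])
```

The full dataset stays in memory throughout, which is why subsampling needs more memory than passing small batches.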

### Subsampling

In [22]:
general_params['subsampling'] = 0.5
general_params['subsample_freq'] = 1

LPMC_model_fully_trained = rum_train(lgb_train_set, model_specification, torch_tensors=torch_tensors)

preds = LPMC_model_fully_trained.predict(lgb_test_set)

ce_test = cross_entropy(preds.cpu().numpy(), lgb_test_set.get_label().astype(int))


print("-" * 50)
print(f"Final negative cross-entropy on the test set: {ce_test}")




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000236 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 54766, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000299 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 54766, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000818 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1834

### Batch training

We first need to preprocess the dataset with the function `prepare_dataset()`.

In [35]:
lgb_train_set, lgb_test_set = prepare_dataset(rum_structure, LPMC_train, 4, df_test=[LPMC_test], free_raw_data=False)

------------------------------
[1/4] 	 Loading dataset 1...
	 done! 
------------------------------

------------------------------
[2/4] 	 Loading dataset 2...
	 done! 
------------------------------

------------------------------
[3/4] 	 Loading dataset 3...
	 done! 
------------------------------

------------------------------
[4/4] 	 Loading dataset 4...
	 done! 
------------------------------



Then we can prepare the batches.

In [37]:
#indices for train set
permutations = np.random.permutation(lgb_train_set['num_data'])
batch_size = 4000
batches = np.split(permutations, np.arange(batch_size, len(permutations), batch_size))
# count the batches actually created (also correct when the data size is a multiple of batch_size)
num_batches = len(batches)

#indices for test set
permutations_valid = np.random.permutation(lgb_test_set['num_data'][0])
batch_size_valid = 1000
batches_valid = np.split(permutations_valid, np.arange(batch_size_valid, len(permutations_valid), batch_size_valid))
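
The way `np.split` cuts a permutation into batches can be checked on a toy example. This is a small sketch with made-up sizes (10 and 8 rows, batches of 4), not the LPMC data.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 4

# 10 rows: split points at 4 and 8 give batches of 4, 4 and 2 rows
permutation = rng.permutation(10)
toy_batches = np.split(permutation, np.arange(batch_size, len(permutation), batch_size))
print([len(b) for b in toy_batches])

# every row index ends up in exactly one batch
covered = sorted(np.concatenate(toy_batches).tolist())

# 8 rows: the size is a multiple of batch_size, so there are exactly 2 batches
even_batches = np.split(np.arange(8), np.arange(batch_size, 8, batch_size))
print(len(even_batches))
```

The last batch is smaller whenever the number of rows is not a multiple of the batch size, and taking the length of the resulting list gives the number of batches in both cases.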

Next, we choose how many boosting iterations are run on each batch.

In [38]:
general_params['num_iterations'] = 30
#reset subsampling
general_params['subsampling'] = 1.0
general_params['subsample_freq'] = 0

Finally, we can train the model. The key point is to pass the model trained on the previous batch as an input when training on the next batch.
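
The warm-start pattern can be sketched independently of RUMBoost. In the toy example below, `fit_batch` is a hypothetical stand-in for a training call that continues from a previous model; it only counts boosting rounds, and the sizes are made up.

```python
import numpy as np


def iterate_batches(num_data, batch_size, num_epochs, seed=0):
    """Yield (epoch, batch_indices), reshuffling the rows at the start of each epoch."""
    rng = np.random.default_rng(seed)
    for epoch in range(num_epochs):
        perm = rng.permutation(num_data)
        for idx in np.split(perm, np.arange(batch_size, num_data, batch_size)):
            yield epoch, idx


def fit_batch(prev_model, idx, rounds_per_batch=30):
    """Hypothetical stand-in: continue training from prev_model on rows idx."""
    rounds_so_far = 0 if prev_model is None else prev_model["rounds"]
    return {"rounds": rounds_so_far + rounds_per_batch, "last_batch_size": len(idx)}


model = None  # no initial model for the very first batch
for epoch, idx in iterate_batches(num_data=10, batch_size=4, num_epochs=2):
    model = fit_batch(model, idx)  # the previous model is carried forward

print(model["rounds"])
```

With 3 batches per epoch, 2 epochs and 30 rounds per batch, the final model has accumulated 180 rounds, which mirrors how the ensemble keeps growing across batches in the loop below.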

In [39]:

epoch_cross_entropy_train = 0
epoch_cross_entropy_valid = 0

time_tracker = []

epoch_train_loss = []
epoch_valid_loss = []
print('Start training...')

for i in range(10*num_batches):

    train_idx = batches[i%num_batches]
    valid_idx = batches_valid[i%num_batches]

    small_train_set = {}
    small_test_set = {}
    small_train_set['train_sets'] = []
    small_test_set['valid_sets'] = []


    for j, dataset in enumerate(lgb_train_set['train_sets']):
        if rum_structure[j]['shared']:
            dataset_train_idx = np.concatenate([np.array(train_idx) + (u - rum_structure[j]['utility'][0]) * lgb_train_set['num_data'] for u in rum_structure[j]['utility'][:len(rum_structure[j]['variables'])]])
            dataset_valid_idx = np.concatenate([np.array(valid_idx) + (u - rum_structure[j]['utility'][0]) * lgb_test_set['num_data'][0] for u in rum_structure[j]['utility'][:len(rum_structure[j]['variables'])]])
        else:
            dataset_train_idx = np.array(train_idx)
            dataset_valid_idx = np.array(valid_idx)
        small_train_set['train_sets'].append(dataset.subset(dataset_train_idx))
        small_test_set['valid_sets'].append([lgb_test_set['valid_sets'][0][j].subset(dataset_valid_idx)])

    small_train_set['num_data'] = len(train_idx)
    small_test_set['num_data'] = [len(valid_idx)]
    small_train_set['labels'] = lgb_train_set['labels'][sorted(train_idx)]
    small_test_set['valid_labels'] = [lgb_test_set['valid_labels'][0][sorted(valid_idx)]]
    small_test_set['valid_sets'] = np.array(small_test_set['valid_sets']).T.tolist()

    if i == 0:
        init_models = None
    else: 
        init_models = MTMC_model_fully_trained.boosters

    MTMC_model_fully_trained = rum_train(small_train_set, model_specification, valid_sets=[small_test_set], torch_tensors=torch_tensors, init_models=init_models, keep_training_booster=True)

    epoch_cross_entropy_train += MTMC_model_fully_trained.best_score_train
    epoch_cross_entropy_valid += MTMC_model_fully_trained.best_score
  
    # end of an epoch: report the average losses and reshuffle before the next epoch starts
    if (i + 1) % num_batches == 0:
        permutations = np.random.permutation(lgb_train_set['num_data'])
        batches = np.split(permutations, np.arange(batch_size, len(permutations), batch_size))
        permutations_valid = np.random.permutation(lgb_test_set['num_data'][0])
        batches_valid = np.split(permutations_valid, np.arange(batch_size_valid, len(permutations_valid), batch_size_valid))
        print(f"Epoch: {(i + 1)//num_batches}, Cross-Entropy Train: {epoch_cross_entropy_train/num_batches}, Cross-Entropy Valid: {epoch_cross_entropy_valid/num_batches}")
        print('new mu:', MTMC_model_fully_trained.mu)

      
        epoch_cross_entropy_train = 0
        epoch_cross_entropy_valid = 0

    epoch_train_loss.append(MTMC_model_fully_trained.best_score_train)
    epoch_valid_loss.append(MTMC_model_fully_trained.best_score)

Start training...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000152 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000076 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 17
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000501 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1834
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 23



[41]-----NCE value on train set : 0.9309
---------NCE value on test set 1: 0.8003
[51]-----NCE value on train set : 0.8186
---------NCE value on test set 1: 0.8157
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000129 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 17
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000129 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000168 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1834

KeyboardInterrupt: 