In [None]:
import sys
sys.path.append('../')

In [None]:
from rumboost.utils import *
from rumboost.utility_smoothing import *
from rumboost.rumboost import *
from rumboost.dataset import *
from rumboost.models import *
from rumboost.utility_plotting import *
from rumboost.post_process import *

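import copy  # used for copy.deepcopy(params) in the cross-validation loop

import numpy as np  # used to build the alphas array
import lightgbm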

# Example: Cross-nested logit model (correlation amongst alternatives)

This notebook demonstrates features implemented in RUMBoost through an example on the LPMC dataset, a London mode choice dataset developed by Hillel et al. (2018). The original data are available [here](https://www.icevirtuallibrary.com/doi/suppl/10.1680/jsmic.17.00018) and the original paper [here](https://www.icevirtuallibrary.com/doi/full/10.1680/jsmic.17.00018).

We first load the preprocessed dataset and its folds for cross-validation. You can find the data under the Data folder.

In [None]:
#load dataset
LPMC_train, LPMC_test, folds = load_preprocess_LPMC()

#load model
LPMC_model = LPMC(LPMC_train)

## Cross-Nested Logit model

We relax the assumption that the error terms are i.i.d. Instead, we assume that alternatives are correlated within nests, and that an alternative can belong to several nests, which gives a cross-nested logit-like model. Cross-nested logit probabilities are implemented in RUMBoost. The additional parameters, the scale $\mu$ of each nest and the membership of alternatives to nests, are treated as hyperparameters.
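
To make the probability formula concrete, here is a minimal sketch of the textbook cross-nested logit probabilities, $P(i) = \sum_m P(m)\,P(i \mid m)$, with the global scale normalised to 1. The function name `cnl_probabilities` and its arguments are illustrative and not part of the RUMBoost API; the exact formulation used inside RUMBoost may differ in its details.

```python
import numpy as np

def cnl_probabilities(V, mu, alphas):
    """Textbook cross-nested logit probabilities (illustrative, not RUMBoost code).

    V:      (n_obs, n_alt) array of deterministic utilities
    mu:     (n_nests,) nest scales
    alphas: (n_alt, n_nests) membership of alternative i in nest m
    """
    V = np.asarray(V, dtype=float)
    mu = np.asarray(mu, dtype=float)
    alphas = np.asarray(alphas, dtype=float)

    # Shifting utilities by a per-observation constant leaves the
    # probabilities unchanged and avoids overflow in exp().
    V = V - V.max(axis=1, keepdims=True)

    # terms[n, i, m] = alpha_im^mu_m * exp(mu_m * V_ni)
    terms = (alphas ** mu)[None, :, :] * np.exp(V[:, :, None] * mu[None, None, :])
    nest_sums = terms.sum(axis=1)                      # (n_obs, n_nests)

    # P(i | m): share of alternative i within nest m
    p_alt_given_nest = terms / nest_sums[:, None, :]

    # P(m): probability of nest m
    nest_weights = nest_sums ** (1.0 / mu)
    p_nest = nest_weights / nest_weights.sum(axis=1, keepdims=True)

    # P(i) = sum over nests of P(m) * P(i | m)
    return (p_alt_given_nest * p_nest[:, None, :]).sum(axis=2)

# Made-up utilities for two observations and four alternatives
V_example = np.array([[0.1, -0.3, 0.5, 0.0],
                      [1.0, 0.2, -0.4, 0.3]])
alphas_example = np.array([[0, 1], [0, 1], [1, 0], [0.5, 0.5]])
print(cnl_probabilities(V_example, [1.25, 1.16], alphas_example))
```

When every nest scale equals 1 and each row of the membership matrix sums to 1, this expression collapses to the multinomial logit (softmax) probabilities, which is a convenient sanity check.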

Training a cross-nested logit-like rumboost model requires two additional arguments:

- ```alphas```: a 2D numpy array of shape (number of alternatives, number of nests), e.g. ```np.array([[alpha_00, alpha_01, alpha_02], [alpha_10, alpha_11, alpha_12], [alpha_20, alpha_21, alpha_22], [alpha_30, alpha_31, alpha_32]])```, where ```alpha_ij``` is the degree of membership of alternative ```i``` in nest ```j```
- ```mu```: a list containing the value of mu (as a float) for each nest, e.g. ```[mu_nest_0, mu_nest_1, mu_nest_2]```
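
Because the two arguments must agree in shape, a quick consistency check before training can catch mistakes early. The sketch below is a hypothetical helper (`check_cnl_inputs` is not a RUMBoost function). The conditions that memberships lie in [0, 1], that each row sums to 1, and that each nest scale is at least 1 are the usual cross-nested logit normalisation conventions, assumed here rather than taken from the RUMBoost documentation.

```python
import numpy as np

def check_cnl_inputs(alphas, mu, num_classes):
    """Validate cross-nested logit hyperparameters (hypothetical helper)."""
    alphas = np.asarray(alphas, dtype=float)
    if alphas.ndim != 2:
        raise ValueError("alphas must be a 2D array (alternatives x nests)")

    n_alt, n_nests = alphas.shape
    if n_alt != num_classes:
        raise ValueError(f"alphas has {n_alt} rows, expected {num_classes} alternatives")
    if len(mu) != n_nests:
        raise ValueError(f"mu has {len(mu)} values, expected one per nest ({n_nests})")

    # Assumed conventions: memberships in [0, 1] summing to 1 per alternative
    if np.any(alphas < 0) or np.any(alphas > 1):
        raise ValueError("each alpha_ij must lie in [0, 1]")
    if not np.allclose(alphas.sum(axis=1), 1.0):
        raise ValueError("memberships of each alternative must sum to 1")

    # Assumed convention: nest scales at least 1 (global scale normalised to 1)
    if any(m < 1 for m in mu):
        raise ValueError("each nest scale mu should be >= 1")

    return n_alt, n_nests

# Four alternatives, two nests, as in the example of this notebook
example_alphas = np.array([[0, 1], [0, 1], [1, 0], [0.5, 0.5]])
print(check_cnl_inputs(example_alphas, [1.25, 1.16], num_classes=4))
```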

Here we test a cross-nested logit model with two nests, motorized and flexible. As this is a work in progress, we choose the values of mu and alphas arbitrarily. They will later be chosen with hyperparameter tuning.

In [None]:
mu = [1.25, 1.16] #arbitrary values, not tuned

alphas  = np.array([[0, 1],
                    [0, 1],
                    [1, 0],
                    [0.5, 0.5]])

params = {'n_jobs': -1,
          'num_classes':4,
          'objective':'multiclass',
          'boosting': 'gbdt',
          'monotone_constraints_method': 'advanced',
          'verbosity': -1,
          'num_iterations':3000,
          'early_stopping_round':100,
          'learning_rate':0.1,
          'max_depth':1
          }

rum_structure = bio_to_rumboost(LPMC_model)

#features and label column names
features = [f for f in LPMC_train.columns if f != "choice"]
label = "choice"

#create lightgbm dataset
lgb_train_set = lightgbm.Dataset(LPMC_train[features], label=LPMC_train[label], free_raw_data=False)
lgb_test_set = lightgbm.Dataset(LPMC_test[features], label=LPMC_test[label], free_raw_data=False)

### Cross-Validation

In [None]:
_, _, folds = load_preprocess_LPMC()

ce_loss = 0
num_trees = 0

#5-fold CV
for i, (train_idx, test_idx) in enumerate(folds):

    #train and validation set
    train_set = lgb_train_set.subset(sorted(train_idx))
    test_set = lgb_train_set.subset(sorted(test_idx))

    #deep copy of params
    param = copy.deepcopy(params)

    print('-'*50 + '\n')
    print(f'Iteration {i+1}')

    #train rum_boost with cross-nested arguments
    LPMC_model_trained = rum_train(param,train_set,rum_structure=rum_structure, valid_sets = [test_set], mu=mu, alphas=alphas)

    #aggregate results
    ce_loss += LPMC_model_trained.best_score
    num_trees += LPMC_model_trained.best_iteration
    
    print('-'*50 + '\n')
    print(f'Best cross entropy loss: {LPMC_model_trained.best_score}')
    print(f'Best number of trees: {LPMC_model_trained.best_iteration}')

ce_loss = ce_loss/5
num_trees = num_trees/5
print('-'*50 + '\n')
print(f'Cross validation negative cross entropy loss: {ce_loss}')
print(f'With a number of trees on average of {num_trees}')

--------------------------------------------------

Iteration 1




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6560824787160351
Best number of trees: 687
--------------------------------------------------

Iteration 2




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6444866813873578
Best number of trees: 687
--------------------------------------------------

Iteration 3




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6594268117163683
Best number of trees: 687
--------------------------------------------------

Iteration 4




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6633328079905507
Best number of trees: 687
--------------------------------------------------

Iteration 5




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6666172963883393
Best number of trees: 683
--------------------------------------------------

Cross validation negative cross entropy loss: 0.6579892152397303
With a number of trees on average of 686.2
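
As a check on the aggregation, the averages can be recomputed from the per-fold values printed above. The sketch below also reports the standard deviation across folds, which the loop does not compute, and rounds the average number of trees to an integer usable as ```num_iterations```. Variable names are illustrative.

```python
import numpy as np

# Per-fold results copied from the output above
fold_losses = np.array([
    0.6560824787160351,
    0.6444866813873578,
    0.6594268117163683,
    0.6633328079905507,
    0.6666172963883393,
])
fold_trees = np.array([687, 687, 687, 687, 683])

mean_loss = fold_losses.mean()
std_loss = fold_losses.std(ddof=1)   # sample standard deviation across folds
mean_trees = fold_trees.mean()

# num_iterations must be an integer, so round the average number of trees
final_num_iterations = int(round(mean_trees))

print(f"Mean CV cross-entropy: {mean_loss:.4f} (std {std_loss:.4f})")
print(f"Mean number of trees: {mean_trees}, rounded to {final_num_iterations}")
```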


### Testing the model on out-of-sample data

Cross-validation gives an average best number of trees of 686.2, which we round to 686. We can now train the final version of the model on the full training set and test it on out-of-sample data with the ```predict()``` function. Note that the dataset passed to ```predict()``` must be a lightgbm ```Dataset``` object.

**Also note that you need to specify ```mu``` and ```alphas``` in the predict function to adapt the probability formula accordingly.**

In [None]:
params = {'n_jobs': -1,
          'num_classes':4,
          'objective':'multiclass',
          'boosting': 'gbdt',
          'monotone_constraints_method': 'advanced',
          'verbosity': -1,
          'num_iterations':686,
          #'early_stopping_round':100,
          'learning_rate':0.1,
          'max_depth':1
          }

LPMCCN_model_fully_trained = rum_train(params, lgb_train_set, rum_structure, mu=mu, alphas=alphas)

preds, _, _ = LPMCCN_model_fully_trained.predict(lgb_test_set, mu=mu, alphas=alphas)

ce_test = cross_entropy(preds, lgb_test_set.get_label().astype(int))

print('-'*50)
print(f'Final negative cross-entropy on the test set: {ce_test}')



Finished loading model, total used 686 iterations
Finished loading model, total used 686 iterations
Finished loading model, total used 686 iterations
Finished loading model, total used 686 iterations
--------------------------------------------------
Final negative cross-entropy on the test set: 0.6796725053458201
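
For reference, the metric reported above is a cross-entropy, i.e. the mean negative log of the probability predicted for the chosen alternative, so lower is better. The sketch below shows one way to compute such a metric with numpy. It assumes predictions of shape (n_obs, n_alternatives) and integer labels, and `mean_cross_entropy` is an illustrative name, not the ```cross_entropy``` function shipped with RUMBoost.

```python
import numpy as np

def mean_cross_entropy(probs, labels, eps=1e-15):
    """Mean negative log-probability of the chosen alternative (illustrative)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Probability assigned to the alternative actually chosen in each row
    chosen = probs[np.arange(len(labels)), labels]

    # Clip to avoid log(0) when a chosen alternative gets zero probability
    return float(-np.mean(np.log(np.clip(chosen, eps, 1.0))))

# Made-up predictions for three observations and four alternatives
example_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                          [0.2, 0.5, 0.2, 0.1],
                          [0.25, 0.25, 0.25, 0.25]])
example_labels = np.array([0, 1, 3])
print(mean_cross_entropy(example_probs, example_labels))
```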


# References

Salvadé, N., & Hillel, T. (2024). RUMBoost: Gradient Boosted Random Utility Models. *arXiv preprint [arXiv:2401.11954](https://arxiv.org/abs/2401.11954)*

Hillel, T., Elshafie, M. Z. E. B., & Jin, Y. (2018). Recreating passenger mode choice-sets for transport simulation: A case study of London, UK. *Proceedings of the Institution of Civil Engineers - Smart Infrastructure and Construction*, 171, 29–42. https://doi.org/10.1680/jsmic.17.00018