In [None]:
import sys

sys.path.append("../")

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from rumboost.rumboost import rum_train
from rumboost.datasets import load_preprocess_LPMC
from rumboost.metrics import cross_entropy

import lightgbm

# Example: Nested logit model (correlation amongst alternatives)

This notebook shows features implemented in RUMBoost through an example on the LPMC dataset, a mode choice dataset for London developed by Hillel et al. (2018). You can find the original data [here](https://www.icevirtuallibrary.com/doi/suppl/10.1680/jsmic.17.00018) and the original paper [here](https://www.icevirtuallibrary.com/doi/full/10.1680/jsmic.17.00018).

We first load the preprocessed dataset and its folds for cross-validation. You can find the data under the Data folder.

In [3]:
#load dataset
LPMC_train, LPMC_test, folds = load_preprocess_LPMC()

## Nested Logit model

We relax the assumption that the error terms are independent and identically distributed (i.i.d.). If we assume that some alternatives are correlated, we obtain a nested logit-like model. Nested logit probabilities are implemented in RUMBoost. The additional parameter, the scale $\mu$ of a nest, can be estimated in two ways:
1. by a hyperparameter search
2. by optimisation within the training loop
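
To make the role of $\mu$ concrete, here is a minimal NumPy sketch of textbook nested logit probabilities. It is an illustration, not RUMBoost's internal implementation: the function name `nested_logit_probs` is hypothetical, the top-level scale is assumed to be normalised to 1, and nest ids are assumed to be `0..n-1` so that they index the `mu` list directly.

```python
import numpy as np

def nested_logit_probs(V, nests, mu):
    """Textbook nested logit probabilities (hypothetical helper).

    V     : (n_obs, n_alt) array of utilities
    nests : dict {nest_id: [alternative ids]}, nest ids assumed to be 0..n-1
    mu    : list of nest scales, one per nest (top-level scale assumed to be 1)
    """
    V = np.asarray(V, dtype=float)
    n_obs, n_alt = V.shape
    nest_ids = sorted(nests)
    log_cond = np.zeros((n_obs, n_alt))       # log P(alternative | nest)
    incl = np.zeros((n_obs, len(nest_ids)))   # inclusive value of each nest

    for k, m in enumerate(nest_ids):
        alts = nests[m]
        scaled = mu[m] * V[:, alts]
        # numerically stable log-sum-exp over the alternatives of the nest
        mx = scaled.max(axis=1, keepdims=True)
        lse = mx[:, 0] + np.log(np.exp(scaled - mx).sum(axis=1))
        log_cond[:, alts] = scaled - lse[:, None]
        incl[:, k] = lse / mu[m]

    # log P(nest): softmax over the inclusive values
    mx = incl.max(axis=1, keepdims=True)
    log_nest = incl - (mx + np.log(np.exp(incl - mx).sum(axis=1, keepdims=True)))

    # P(alternative) = P(alternative | nest) * P(nest)
    probs = np.zeros((n_obs, n_alt))
    for k, m in enumerate(nest_ids):
        alts = nests[m]
        probs[:, alts] = np.exp(log_cond[:, alts] + log_nest[:, [k]])
    return probs

# toy usage: two observations, four alternatives, alternatives 2 and 3 share a nest
V_toy = np.array([[0.1, -0.2, 0.5, 0.4], [0.0, 0.3, -0.1, 0.2]])
print(nested_logit_probs(V_toy, {0: [0], 1: [1], 2: [2, 3]}, [1.0, 1.0, 1.5]))
```

With every $\mu$ equal to 1 the sketch reduces to the multinomial logit (softmax) probabilities; values above 1 increase the correlation between alternatives within a nest.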

Training a nested logit-like RUMBoost model requires an additional dictionary in the model specification dictionary. The nested logit dictionary has the following form:

- ```mu```: a list containing the values (as float) of mu for each nest, e.g. ```[mu_nest_0, mu_nest_1]```
- ```nests```: a dictionary representing the nesting structure. Keys are the nest ids, and values are the lists of alternatives in the corresponding nest. For example, ```{0: [0, 1], 1: [2, 3]}``` means that alternatives 0 and 1 are in nest 0, and alternatives 2 and 3 are in nest 1.
- `optimise_mu`: a boolean or a list of booleans. If it is a single boolean set to True, all mu values are optimised with `scipy.optimize.minimize`. If it is a list of booleans, it must have the same length as `mu`, and each entry indicates whether the corresponding value is optimised.
  
In this example, we assume that PT and car are in a 'motorised' nest, while the walking and cycling alternatives are each in their own nest.
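
As an illustration, the three entries described above could look as follows for this nesting structure. This is a sketch under assumptions: the value 1.5 and the `optimise_mu` flags are placeholders, alternatives 2 and 3 are assumed to be PT and car, and `check_nested_logit` is a hypothetical helper that is not part of RUMBoost. The key under which the dictionary is stored in the model specification is not shown here.

```python
# Illustrative nested logit dictionary (values are placeholders)
nested_logit = {
    "mu": [1.0, 1.0, 1.5],                 # one scale per nest
    "nests": {0: [0], 1: [1], 2: [2, 3]},  # nest id -> alternatives in the nest
    "optimise_mu": [False, False, True],   # only the 'motorised' scale is optimised
}

def check_nested_logit(spec, num_classes):
    """Hypothetical sanity check of the structure described above."""
    mu, nests, opt = spec["mu"], spec["nests"], spec["optimise_mu"]
    if len(mu) != len(nests):
        raise ValueError("mu must contain one value per nest")
    # every alternative must appear in exactly one nest
    alts = sorted(a for members in nests.values() for a in members)
    if alts != list(range(num_classes)):
        raise ValueError("each alternative must belong to exactly one nest")
    # optimise_mu is either a single boolean or a list matching mu
    if not isinstance(opt, bool) and len(opt) != len(mu):
        raise ValueError("optimise_mu must be a boolean or have the same length as mu")
    return True

print(check_nested_logit(nested_logit, num_classes=4))
```

Nests with a single alternative are left at a scale of 1 in this sketch, since there is no within-nest correlation to capture.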

### $\mu$ hyperparameter search

We treat $\mu$ as a hyperparameter and use hyperopt to find its optimal value. More details on how to use hyperopt are available [here](https://hyperopt.github.io/hyperopt/).

Note that, to limit computation time, we show here a hyperparameter search with a single iteration. The results in the paper were obtained with 25 iterations.

In [4]:
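import hyperopt
import numpy as np

#specify the nests: alternatives 0 and 1 are alone in their nests, alternatives 2 and 3 share a nest
nest = {0:[0], 1:[1], 2:[2, 3]}

#specify the search space of mu
param_space =  {'mu': hyperopt.hp.uniform('mu', 1, 2)}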

#parameters for the rumboost training
STATIC_PARAMS = {'n_jobs': -1,
                'num_classes':4,
                'objective':'multiclass',
                'boosting': 'gbdt',
                'monotone_constraints_method': 'advanced',
                'verbosity': -1,
                'early_stopping_round':100,
                'num_iterations': 3000,
                'learning_rate':0.1
                }

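#rum_structure dictionary
#note: bio_to_rumboost and LPMC_model are not defined in the cells shown here;
#they must be imported or defined before running this cell
rum_structure = bio_to_rumboost(LPMC_model)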

#features and label column names
features = [f for f in LPMC_train.columns if f != "choice"]
label = "choice"

#create lightgbm dataset
lgb_train_set = lightgbm.Dataset(LPMC_train[features], label=LPMC_train[label], free_raw_data=False)
lgb_test_set = lightgbm.Dataset(LPMC_test[features], label=LPMC_test[label], free_raw_data=False)

#objective for hyperopt
def objective(space):
    params = {**space, **STATIC_PARAMS}
    params['max_depth'] = 1

    #create mu structure
    mu = [1, 1, params.pop("mu")]

    loss = 0
    N_sum = 0

    #k-fold cross validation
    for train_idx, test_idx in folds:
        #obtain training and testing data for this iteration (split of k-Fold)
        train_set = lgb_train_set.subset(sorted(train_idx))
        test_set = lgb_train_set.subset(sorted(test_idx))

        #train rumboost with nested parameters
        clf_trained = rum_train(params, train_set, rum_structure, valid_sets=[test_set], nests=nest, mu=mu)

        #predictions
        proba, _, _ = clf_trained.predict(test_set, nests=nest, mu=mu)

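        #cross-entropy loss: accumulate the negative log-likelihood of the chosen alternatives
        labels = test_set.get_label().astype(int)
        N = len(labels)  #number of observations in this fold
        loss += -np.sum(np.log(proba[np.arange(N), labels]))
        N_sum += N

    #average over all held-out observations
    loss = loss / N_sum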
    return {'loss': loss, 'status': hyperopt.STATUS_OK, 'best_iteration': clf_trained.best_iteration}


#%%
#n_iter=25
n_iter = 1

trials = hyperopt.Trials()
best_classifier = hyperopt.fmin(fn=objective,
                                space=param_space,
                                algo=hyperopt.tpe.suggest,
                                max_evals=n_iter,
                                trials=trials)

print(f'Best mu: {best_classifier["mu"]} \n Best negative CE: {trials.best_trial["result"]["loss"]}')


  0%|          | 0/1 [00:00<?, ?trial/s, best loss=?]

Early stopping at iteration 1490, with a best score on test set of 0.6547736833551879, and on train set of 0.6386020669332474
Finished loading model, total used 1591 iterations   
Finished loading model, total used 1591 iterations   
Finished loading model, total used 1591 iterations   
Finished loading model, total used 1591 iterations   
  0%|          | 0/1 [00:50<?, ?trial/s, best loss=?]

Early stopping at iteration 1485, with a best score on test set of 0.641589520650019, and on train set of 0.6414858373312715
Finished loading model, total used 1586 iterations   
Finished loading model, total used 1586 iterations   
Finished loading model, total used 1586 iterations   
Finished loading model, total used 1586 iterations   
  0%|          | 0/1 [01:41<?, ?trial/s, best loss=?]

Early stopping at iteration 892, with a best score on test set of 0.6585449369594082, and on train set of 0.6409978331417988
Finished loading model, total used 993 iterations    
Finished loading model, total used 993 iterations    
Finished loading model, total used 993 iterations    
Finished loading model, total used 993 iterations    
  0%|          | 0/1 [02:12<?, ?trial/s, best loss=?]

Early stopping at iteration 1339, with a best score on test set of 0.6634461537926124, and on train set of 0.6370079921944384
Finished loading model, total used 1440 iterations   
Finished loading model, total used 1440 iterations   
Finished loading model, total used 1440 iterations   
Finished loading model, total used 1440 iterations   
  0%|          | 0/1 [02:59<?, ?trial/s, best loss=?]

Early stopping at iteration 1108, with a best score on test set of 0.666548858771854, and on train set of 0.6368294978547153
Finished loading model, total used 1209 iterations   
Finished loading model, total used 1209 iterations   
Finished loading model, total used 1209 iterations   
Finished loading model, total used 1209 iterations   
100%|██████████| 1/1 [03:39<00:00, 219.04s/trial, best loss: 0.6571032134915424]
Best mu: 1.7441568376405705 
 Best negative CE: 0.6571032134915424
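
The loss returned by the objective is the negative log-likelihood pooled over all folds and divided by the total number of held-out observations. A minimal sketch of that aggregation on toy probabilities follows; the helper names `fold_neg_log_likelihood` and `pooled_cross_entropy` are hypothetical and the numbers are made up.

```python
import numpy as np

def fold_neg_log_likelihood(proba, labels):
    """Return (sum of -log P(chosen alternative), number of observations) for one fold."""
    proba = np.asarray(proba, dtype=float)
    labels = np.asarray(labels, dtype=int)
    nll = -np.log(proba[np.arange(len(labels)), labels]).sum()
    return nll, len(labels)

def pooled_cross_entropy(fold_results):
    """Pool the per-fold sums and divide by the total number of observations."""
    total_nll = sum(nll for nll, _ in fold_results)
    total_n = sum(n for _, n in fold_results)
    return total_nll / total_n

# toy folds: predicted probabilities and chosen alternatives (made-up values)
fold_a = fold_neg_log_likelihood([[0.7, 0.3], [0.2, 0.8]], [0, 1])
fold_b = fold_neg_log_likelihood([[0.5, 0.5]], [1])
print(pooled_cross_entropy([fold_a, fold_b]))
```

Pooling weights every observation equally, so folds of different sizes contribute in proportion to their size.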


### Cross-Validation

Once we know the optimal value of $\mu$, we can perform cross-validation to obtain the best number of trees.

Note that we use the optimal value of $\mu$ found with a larger hyperparameter search, i.e. approximately 1.17, rather than the value from the single-iteration search above.

In [None]:
import copy

#reload the cross-validation folds
_, _, folds = load_preprocess_LPMC()
params = {'n_jobs': -1,
          'num_classes':4,
          'objective':'multiclass',
          'boosting': 'gbdt',
          'monotone_constraints_method': 'advanced',
          'verbosity': -1,
          'num_iterations':3000,
          'early_stopping_round':100,
          'learning_rate':0.1,
          'max_depth':1
          }

#optimal mu
optimal_mu = [1, 1, 1.166746773143513]

ce_loss = 0
num_trees = 0

#CV with 5 folds
for i, (train_idx, test_idx) in enumerate(folds):

    #create the lightgbm CV training and validation set
    train_set = lgb_train_set.subset(sorted(train_idx))
    test_set = lgb_train_set.subset(sorted(test_idx))

    #deep copy of parameters
    param = copy.deepcopy(params)

    print('-'*50 + '\n')
    print(f'Iteration {i+1}')

    #train the model with rum_train and nest parameters
    LPMC_model_trained = rum_train(param,train_set,rum_structure=rum_structure, valid_sets = [test_set], mu=optimal_mu, nests=nest)

    #aggregate results
    ce_loss += LPMC_model_trained.best_score
    num_trees += LPMC_model_trained.best_iteration
    print('-'*50 + '\n')
    print(f'Best cross entropy loss: {LPMC_model_trained.best_score}')
    print(f'Best number of trees: {LPMC_model_trained.best_iteration}')

ce_loss = ce_loss/5
num_trees = num_trees/5
print('-'*50 + '\n')
print(f'Cross validation negative cross entropy loss: {ce_loss}')
print(f'With a number of trees on average of {num_trees}')

--------------------------------------------------

Iteration 1

Early stopping at iteration 1291, with a best score on test set of 0.6543006671161781, and on train set of 0.6387160257806288
Finished loading model, total used 1392 iterations
Finished loading model, total used 1392 iterations
Finished loading model, total used 1392 iterations
Finished loading model, total used 1392 iterations
--------------------------------------------------

Best cross entropy loss: 0.6543006671161781
Best number of trees: 1291
--------------------------------------------------

Iteration 2

Early stopping at iteration 1625, with a best score on test set of 0.6409292761468673, and on train set of 0.6403367047451071
Finished loading model, total used 1726 iterations
Finished loading model, total used 1726 iterations
Finished loading model, total used 1726 iterations
Finished loading model, total used 1726 iterations
--------------------------------------------------

Best cross entropy loss: 0.6409292761468673
Best number of trees: 1625
--------------------------------------------------

Iteration 3

Early stopping at iteration 921, with a best score on test set of 0.6577617492916592, and on train set of 0.6402262085982907
Finished loading model, total used 1022 iterations
Finished loading model, total used 1022 iterations
Finished loading model, total used 1022 iterations
Finished loading model, total used 1022 iterations
--------------------------------------------------

Best cross entropy loss: 0.6577617492916592
Best number of trees: 921
--------------------------------------------------

Iteration 4

Early stopping at iteration 1232, with a best score on test set of 0.6624246467658123, and on train set of 0.636870502526712
Finished loading model, total used 1333 iterations
Finished loading model, total used 1333 iterations
Finished loading model, total used 1333 iterations
Finished loading model, total used 1333 iterations
--------------------------------------------------

Best cross entropy loss: 0.6624246467658123
Best number of trees: 1232
--------------------------------------------------

Iteration 5

Early stopping at iteration 1148, with a best score on test set of 0.6653590794887231, and on train set of 0.6362457959363418
Finished loading model, total used 1249 iterations
Finished loading model, total used 1249 iterations
Finished loading model, total used 1249 iterations
Finished loading model, total used 1249 iterations
--------------------------------------------------

Best cross entropy loss: 0.6653590794887231
Best number of trees: 1148
--------------------------------------------------

Cross validation negative cross entropy loss: 0.656155083761848
With a number of trees on average of 1243.4
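
The two averages printed above follow directly from the per-fold values. A short check, with the numbers copied from the output of this cell:

```python
# per-fold best scores and best iterations, copied from the output above
fold_scores = [
    0.6543006671161781,
    0.6409292761468673,
    0.6577617492916592,
    0.6624246467658123,
    0.6653590794887231,
]
fold_trees = [1291, 1625, 921, 1232, 1148]

mean_score = sum(fold_scores) / len(fold_scores)
mean_trees = sum(fold_trees) / len(fold_trees)

# the number of boosting iterations must be an integer
final_num_trees = round(mean_trees)

print(f"Mean cross-entropy: {mean_score}")
print(f"Mean number of trees: {mean_trees} (rounded to {final_num_trees})")
```

This is a plain arithmetic mean, so each fold counts equally regardless of its size.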


### Testing the model on out-of-sample data

Now that we have the optimal number of trees (1243, the cross-validation average rounded to an integer), we can train the final version of the model on the full training set and test it on out-of-sample data with the ```predict()``` function. Note that the dataset passed to ```predict()``` must be a lightgbm object.

**Also note that you need to specify ```mu``` and ```nests``` in the predict function to adapt the probability formula accordingly.**

In [None]:
params['num_iterations'] = 1243

LPMCnested_model_fully_trained = rum_train(params, lgb_train_set, rum_structure, mu=optimal_mu, nests=nest)

preds, _, _ = LPMCnested_model_fully_trained.predict(lgb_test_set, mu=optimal_mu, nests=nest)

ce_test = cross_entropy(preds, lgb_test_set.get_label().astype(int))

print('-'*50)
print(f'Final negative cross-entropy on the test set: {ce_test}')



Finished loading model, total used 1243 iterations
Finished loading model, total used 1243 iterations
Finished loading model, total used 1243 iterations
Finished loading model, total used 1243 iterations
--------------------------------------------------
Final negative cross-entropy on the test set: 0.6732825945666746


# References

Salvadé, N., & Hillel, T. (2024). RUMBoost: Gradient Boosted Random Utility Models. *arXiv preprint [arXiv:2401.11954](https://arxiv.org/abs/2401.11954)*

Hillel, T., Elshafie, M. Z. E. B., & Jin, Y. (2018). Recreating passenger mode choice-sets for transport simulation: A case study of London, UK. *Proceedings of the Institution of Civil Engineers - Smart Infrastructure and Construction*, 171, 29–42. https://doi.org/10.1680/jsmic.17.00018