Different outputs from Treelite vs LightGBM on Categorical NaNs #277

Closed
siboehm opened this issue May 25, 2021 · 2 comments · Fixed by #304

siboehm commented May 25, 2021

import lightgbm as lgb
import numpy as np

# Single categorical feature with categories 0, 1, 2; category 0 gets a different label.
X = np.array(30*[[1]] + 30*[[2]] + 30*[[0]])
y = np.array(60 * [5] + 30*[10])
train_data = lgb.Dataset(X, label=y, categorical_feature=[0])
bst = lgb.train({}, train_data, 1)

bst.predict(np.array([[np.NaN], [0.0]])) # array([6.5       , 6.99999999])
bst.save_model("model.txt")

import treelite
import treelite_runtime

# Round-trip the saved model through Treelite's compiled predictor.
model = treelite.Model.load('model.txt', model_format='lightgbm')
model.export_lib(toolchain="gcc", libpath='./mymodel.so', verbose=True)
predictor = treelite_runtime.Predictor('./mymodel.so', verbose=True)
dmat = treelite_runtime.DMatrix(np.array([[np.NaN], [0.0]]))
predictor.predict(dmat) # array([6.99999999, 6.99999999])

# just to make sure it's not due to the LightGBM model export
x = lgb.Booster(model_file="model.txt")
x.predict(np.array([[np.NaN], [0.0]])) # array([6.5       , 6.99999999])

Model plot: [tree visualization of the single-split model, attached in the original issue]

I think this happens because Treelite implements CategoricalDecision from LightGBM's code, where NaNs get mapped to 0.0, but LightGBM itself seems to run CategoricalDecisionInner instead. CategoricalDecisionInner forgoes the NaN ⇒ 0.0 mapping, so NaNs take the right branch, which would explain the different outputs.

So far this is just my hypothesis; I haven't checked yet whether LightGBM actually runs CategoricalDecision or CategoricalDecisionInner.
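
To make the hypothesis concrete, here is a minimal sketch (my own code, not LightGBM's or Treelite's actual implementation) that evaluates this single-split tree both ways: once with the NaN ⇒ 0.0 mapping I believe Treelite's generated code uses, and once with NaN forced down the right branch as I suspect LightGBM does.

import math

# Leaf values and the category set are taken from the model dump below:
# cat_threshold=1 is a bitset containing only category 0, so category 0
# goes to the left leaf and everything else goes to the right leaf.
LEFT_LEAF, RIGHT_LEAF = 6.9999999920527145, 6.5000000039736436
LEFT_CATEGORIES = {0}

def predict_nan_mapped_to_zero(fval):
    # Hypothesized Treelite behavior: NaN is first replaced by 0.0,
    # so it matches category 0 and takes the left branch.
    if math.isnan(fval):
        fval = 0.0
    return LEFT_LEAF if int(fval) in LEFT_CATEGORIES else RIGHT_LEAF

def predict_nan_goes_right(fval):
    # Hypothesized LightGBM behavior: NaN never matches the category
    # bitset and always falls through to the right branch.
    if math.isnan(fval):
        return RIGHT_LEAF
    return LEFT_LEAF if int(fval) in LEFT_CATEGORIES else RIGHT_LEAF

print(predict_nan_mapped_to_zero(float("nan")))  # ~6.99999999, matches Treelite
print(predict_nan_goes_right(float("nan")))      # 6.5, matches LightGBM
print(predict_nan_goes_right(0.0))               # ~6.99999999, both agree here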

Versions: LightGBM==3.2.1, treelite==1.3.0

Model.txt:

tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=0
objective=regression
feature_names=Column_0
feature_infos=-1:0:1:2
tree_sizes=322

Tree=0
num_leaves=2
num_cat=1
split_feature=0
split_gain=500
threshold=0
decision_type=1
left_child=-1
right_child=-2
leaf_value=6.9999999920527145 6.5000000039736436
leaf_weight=30 60
leaf_count=30 60
internal_value=6.66667
internal_weight=0
internal_count=90
cat_boundaries=0 1
cat_threshold=1
is_linear=0
shrinkage=1

end of trees

feature_importances:
Column_0=1

parameters:
[boosting: gbdt]
[objective: regression]
[metric: ]
[tree_learner: serial]
[device_type: cpu]
[linear_tree: 0]
[data: ]
[valid: ]
[num_iterations: 1]
[learning_rate: 0.1]
[num_leaves: 31]
[num_threads: 0]
[deterministic: 0]
[force_col_wise: 0]
[force_row_wise: 0]
[histogram_pool_size: -1]
[max_depth: -1]
[min_data_in_leaf: 20]
[min_sum_hessian_in_leaf: 0.001]
[bagging_fraction: 1]
[pos_bagging_fraction: 1]
[neg_bagging_fraction: 1]
[bagging_freq: 0]
[bagging_seed: 3]
[feature_fraction: 1]
[feature_fraction_bynode: 1]
[feature_fraction_seed: 2]
[extra_trees: 0]
[extra_seed: 6]
[early_stopping_round: 0]
[first_metric_only: 0]
[max_delta_step: 0]
[lambda_l1: 0]
[lambda_l2: 0]
[linear_lambda: 0]
[min_gain_to_split: 0]
[drop_rate: 0.1]
[max_drop: 50]
[skip_drop: 0.5]
[xgboost_dart_mode: 0]
[uniform_drop: 0]
[drop_seed: 4]
[top_rate: 0.2]
[other_rate: 0.1]
[min_data_per_group: 100]
[max_cat_threshold: 32]
[cat_l2: 10]
[cat_smooth: 10]
[max_cat_to_onehot: 4]
[top_k: 20]
[monotone_constraints: ]
[monotone_constraints_method: basic]
[monotone_penalty: 0]
[feature_contri: ]
[forcedsplits_filename: ]
[refit_decay_rate: 0.9]
[cegb_tradeoff: 1]
[cegb_penalty_split: 0]
[cegb_penalty_feature_lazy: ]
[cegb_penalty_feature_coupled: ]
[path_smooth: 0]
[interaction_constraints: ]
[verbosity: 1]
[saved_feature_importance_type: 0]
[max_bin: 255]
[max_bin_by_feature: ]
[min_data_in_bin: 3]
[bin_construct_sample_cnt: 200000]
[data_random_seed: 1]
[is_enable_sparse: 1]
[enable_bundle: 1]
[use_missing: 1]
[zero_as_missing: 0]
[feature_pre_filter: 1]
[pre_partition: 0]
[two_round: 0]
[header: 0]
[label_column: ]
[weight_column: ]
[group_column: ]
[ignore_column: ]
[categorical_feature: 0]
[forcedbins_filename: ]
[objective_seed: 5]
[num_class: 1]
[is_unbalance: 0]
[scale_pos_weight: 1]
[sigmoid: 1]
[boost_from_average: 1]
[reg_sqrt: 0]
[alpha: 0.9]
[fair_c: 1]
[poisson_max_delta_step: 0.7]
[tweedie_variance_power: 1.5]
[lambdarank_truncation_level: 30]
[lambdarank_norm: 1]
[label_gain: ]
[eval_at: ]
[multi_error_top_k: 1]
[auc_mu_weights: ]
[num_machines: 1]
[local_listen_port: 12400]
[time_out: 120]
[machine_list_filename: ]
[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
[num_gpu: 1]

end of parameters

pandas_categorical:null
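
As an aside on reading the dump: as far as I understand the format, cat_threshold stores the split as a bitset of 32-bit words (with cat_boundaries marking which words belong to which split). Here cat_threshold=1 has only bit 0 set, so only category 0 goes to the left child (the ~7.0 leaf). A small helper of my own (not LightGBM API) to decode it:

def categories_going_left(cat_threshold_words, max_categories=64):
    # Each entry of cat_threshold is one 32-bit word of the bitset;
    # bit i set means category i takes the left branch.
    left = []
    for cat in range(max_categories):
        word, bit = divmod(cat, 32)
        if word < len(cat_threshold_words) and (cat_threshold_words[word] >> bit) & 1:
            left.append(cat)
    return left

print(categories_going_left([1]))  # [0] -> only category 0 goes left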

siboehm closed this as completed May 25, 2021
siboehm changed the title from "Different predictions for LightGBM" to "Different predictions for LightGBM NaN" May 25, 2021
siboehm changed the title from "Different predictions for LightGBM NaN" to "Different outputs from Treelite vs LightGBM on Categorical NaNs" May 25, 2021

siboehm (Author) commented May 25, 2021

Sorry, I accidentally closed the issue before I finished creating it.

hcho3 (Collaborator) commented Jul 13, 2021

It turns out that this is due to an apparent bug in LightGBM. I filed microsoft/LightGBM#4468.
