
Why does XGBoost regression predict completely unseen values? #1581

Closed
JoshuaC3 opened this issue Sep 16, 2016 · 9 comments

@JoshuaC3

From what I understand, xgboost is not very good at extrapolating to unseen values, and this is a problem with all tree-based ML models. How come, then, I have a model that predicts negative values when none are found in the training set? I use negatives in this example because they are easier to identify than other previously unseen values.

I get that this is probably not a bug, but I cannot find out or understand why it happens in XGBoost.

Is this my fault for not setting a hyper-parameter correctly? Is there a way to reduce the chance of this happening?

@khotilov
Member

Please, provide an example.

@JoshuaC3
Author

JoshuaC3 commented Sep 20, 2016

This might be a little tricky. I have a large dataset of 500,000 rows by 30 columns. The target variable has absolutely no values less than or equal to zero. I am using the sklearn API in Python. Here is my setup:

```python
import xgboost as xgb

xgbreg = xgb.XGBRegressor(learning_rate=0.1, n_estimators=1000,
                          max_depth=5, min_child_weight=1,
                          gamma=0, subsample=0.8,
                          colsample_bytree=0.8, objective="reg:linear",
                          nthread=-1, scale_pos_weight=1, seed=27)
```

Of the 250,000 examples in the testing set, 4,000 come back with negative values. From the training set, 7,000 came back with negative predictions.

If there is any more specific information I can give, please let me know.

@Laurae2
Contributor

Laurae2 commented Sep 27, 2016

It's an unbounded regression, so it is normal to see such behavior. Since gradient boosting has to minimize the loss function you use, your labels may have bounds, but gradient boosting does not look at those bounds and does not bound the predictions.

For instance, in the following picture you can clearly see that although you have only one variable (X) and only positive labels (Y), a GLM fit by gradient descent with an X*X effect will predict negative numbers after X=5.

[image: GLM fit with an X*X effect predicting negative values beyond X=5]

Here, the same story, but starting right at X=4:

[image: similar GLM fit predicting negative values starting at X=4]

If you want to force the predictions not to be negative, apply a large loss to negative predictions in a custom loss function (a sketch follows at the end of this comment). However, keep in mind this might affect the performance of your model in the end.

Another possibility: bound all the negative predictions to the minimum of your initial predictions.
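For illustration only, here is a minimal sketch of such a custom objective using the native xgb.train API (the squared-error base, the penalty weight of 10.0, and the X_train / y_train names are my assumptions, not a tested recipe):

```python
import numpy as np
import xgboost as xgb

def penalised_squared_error(preds, dtrain):
    """Squared error plus an extra penalty whenever a prediction is negative."""
    labels = dtrain.get_label()
    grad = preds - labels               # gradient of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)          # its constant hessian
    neg = preds < 0
    grad[neg] += 10.0 * preds[neg]      # gradient of 0.5 * 10 * pred^2 on negative preds
    hess[neg] += 10.0                   # pushes negative predictions back toward zero
    return grad, hess

# dtrain = xgb.DMatrix(X_train, label=y_train)   # assumes X_train / y_train exist
# booster = xgb.train({"max_depth": 5, "eta": 0.1}, dtrain,
#                     num_boost_round=1000, obj=penalised_squared_error)
```

The post-hoc alternative is even simpler: clip the final predictions at whatever floor you choose, e.g. np.clip(preds, a_min=0, a_max=None).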

@JoshuaC3
Author

JoshuaC3 commented Sep 27, 2016

@Laurae2 firstly, thank you for a detailed explanation. I can completely see how a GLM carries on predicting after X=5 and in doing so predicts negative values for Y. It is a semi-parametric model and so this inference is plausible.

My understanding* of tree-based models is that each leaf within the tree has a set of rules that restrict a prediction to that leaf. These rules act as bounds on the variables Xi. The prediction for that leaf is then the average of the training-set values that fall into it. From this I cannot see how we can infer/predict/extrapolate a negative value when none have been seen in the training set (see the sketch at the end of this comment).

Further, GBMs are a very different type of machine learning model to GLMs, so comparing the two might lead to some "papering over the cracks" in our understanding of what is going on. Here is a reference from Owen Zhang stating that GBMs cannot extrapolate.

[image: slide from Owen Zhang noting that GBMs cannot extrapolate]
If @Laurae2 or anyone else could shed some light on this then that would be amazing.

*beginner level understanding
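For what it's worth, that single-tree intuition can be checked directly. Here is a rough sketch using scikit-learn's DecisionTreeRegressor as a stand-in (the synthetic data and settings are only placeholders): a lone regression tree predicts leaf means of training labels, so it never leaves their range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(1000, 3))
y = np.abs(X[:, 0]) + 1.0                     # strictly positive labels

tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
preds = tree.predict(rng.uniform(-5, 5, size=(500, 3)))
print(preds.min() >= y.min(), preds.max() <= y.max())   # True True: leaf means stay in range
```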

@Laurae2
Contributor

Laurae2 commented Sep 27, 2016

@JoshuaC3 in xgboost, if you assume a tree is cut at a point X, the cut separates the data in two:

  • First part: value > X => assign a score or continue splitting
  • Second part: value <= X => assign a score or continue splitting

It is not aware of the bounds of the feature values. All it knows is "greater than" or "lower than" when choosing the cut point. The score computed is adjusted depending on the previous trees (this is the "boosting" part). The score itself does not know the bounds of the labels. If the additive effect of the trees pushes a prediction out of bounds, the model cannot see it and will carry on as if nothing were wrong (it is directed only by the loss function).

When you refer to "cannot extrapolate", I'm not sure how it should be interpreted, as it could mean a lot of different things (no context is provided in the slide).

Gradient boosted trees are there to minimize a loss function, not to respect the bounds of your initial inputs. The model sees only the gradient statistics, i.e. the pair (gradient, hessian) for each training observation, computed against your label using the loss function. If 200 trees are saying "-2" score for an observation (to minimize the loss function) and only "+1" score for 500 other trees, you will indeed have a negative prediction of -100 at the end (when you never had one at the beginning!).

As the number of feature inputs to your model increases, it becomes more and more possible to see such behavior (predictions that are literally out of bounds even on the training set, because a set of features dominates the others in a specific cluster of observations, which in turn dominates the gradient statistics used for choosing cut points).

To look at the behavior of your negative samples, I suggest you look at the predictions after every N rounds and select only the rows that end up negative at the end. You will see the trend they follow when you plot (prediction round, mean prediction of those eventually-negative observations), or by choosing a random observation and plotting (prediction round, prediction).
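For example, a rough sketch with the sklearn-API model from earlier in the thread (X_test, the 50-round step, and ntree_limit are my assumptions; newer xgboost releases use iteration_range instead of ntree_limit):

```python
import numpy as np
import xgboost as xgb

booster = xgbreg.get_booster()                    # fitted model from earlier in the thread
dtest = xgb.DMatrix(X_test)

staged = [(n, booster.predict(dtest, ntree_limit=n))
          for n in range(50, 1001, 50)]           # predictions after every 50 rounds

final_preds = staged[-1][1]
neg_rows = final_preds < 0                        # rows that end up negative

for n, preds in staged:
    print(n, preds[neg_rows].mean())              # trend of the eventually-negative rows
```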

@JoshuaC3
Author

JoshuaC3 commented Sep 27, 2016

Thank you @Laurae2. I think this is starting to make more sense. There are one or two places where you lost me, though.

Gradient boosted trees are there to minimize a loss function

Just to clarify, an example would be to reduce RMSE for some regression problem? (I want to make sure I am not getting this mixed up with score).

If 200 trees are saying "-2" score for an observation (to minimize the loss function) and only "+1" score for 500 other trees, you indeed will have a negative prediction of -100 at the end (when you never had one at the beginning!).

  1. Is there a small maths error (200 * -2) + (1 * 500) = 100?
  2. I am a little unsure what is meant by the score in this case? Is the score what becomes the predicted value?
  3. How might you get a score of -1 in the first place?
  4. Are all previous 1:n trees used in prediction? Or just tree n where n=number_of_trees?

Great idea. I hadn't thought to do this. I will try to plot the prediction for every round (or maybe every nth round) and see what it reveals.

@Laurae2
Contributor

Laurae2 commented Sep 27, 2016

Just to clarify, an example would be to reduce RMSE for some regression problem? (I want to make sure I am not getting this mixed up with score).

By default, regression uses the squared error loss in xgboost, if I'm not mistaken. In that case it would be about reducing the MSE metric. RMSE is another possible metric; if you can compute its gradient and hessian, you can minimize it directly as a custom objective.

  1. Is there a small maths error (200 * -2) + (1 * 500) = 100?

Yep, small maths error. I inverted the signs.

  2. I am a little unsure what is meant by the score in this case? Is the score what becomes the predicted value?

The score is the prediction from the tree, so yes it is the predicted value from that tree.

  3. How might you get a score of -1 in the first place?

Here is an extremely simplified version that does not depict the reality:

Imagine your labels are uniformly distributed over [2000, 20000] and you want to optimize the absolute (linear) error. A tree selects a feature which cuts the observations in two:

  • [2000, 6000] => let's say we assign them 4000
  • [6000, 20000] => let's say we assign them 13000

Let's focus on four observations with labels (2000, 6000, 6000, 20000), errors of (-2000, 2000, -7000, 7000), and feature V2 = (1000, 500, 2000, 1000).

Now we have another tree cutting on another feature, and we get the rule V2 > 999:

  • Average error when V2 > 999 => about -667 score
  • Average error when V2 < 999 => 2000 score
  • 2000 [error = -2000, V2 = 1000] => 2000 - 667 = 1333 (out of bounds)
  • 6000 [error = 2000, V2 = 500] => 6000 + 2000 = 8000 (in bounds)
  • 6000 [error = -7000, V2 = 2000] => 6000 - 667 = 5333 (in bounds)
  • 20000 [error = 7000, V2 = 1000] => 20000 - 667 = 19333 (in bounds)

And there you have an out-of-bounds prediction. This example is very simple and leaves out the shrinkage, the exact method for computing the score, and so on.
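A tiny numpy replay of that arithmetic (following the same simplified label-plus-score convention as above, so for illustration only) might look like this:

```python
import numpy as np

labels = np.array([2000., 6000., 6000., 20000.])
errors = np.array([-2000., 2000., -7000., 7000.])   # residuals left by the first tree
v2     = np.array([1000., 500., 2000., 1000.])

# The second tree splits on V2 > 999 and scores each leaf with its mean residual.
score = np.where(v2 > 999,
                 errors[v2 > 999].mean(),            # about -666.7
                 errors[v2 <= 999].mean())           # 2000.0

print(labels + score)   # [ 1333.3  8000.   5333.3 19333.3] -> 1333.3 falls below 2000
```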

@JoshuaC3
Author

JoshuaC3 commented Sep 27, 2016

Ok, I've read over this a few times and it has really helped fill some fundamental gaps in my knowledge. The example is very helpful. Thank you!

P.S. I did add the edit shown below but probably only just before you wrote your answer.

  4. Are all previous 1:n trees used in prediction? Or just tree n where n=number_of_trees?

@Laurae2
Contributor

Laurae2 commented Sep 27, 2016

  4. Are all previous 1:n trees used in prediction? Or just tree n where n=number_of_trees?

All trees by default. You can limit the number of trees used via the ntree_limit parameter during prediction (trees 1:ntree_limit are used).
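For example, with the xgbreg model from earlier (X_test is assumed; which keyword applies depends on your xgboost version):

```python
# Older releases: predict using only the first 100 trees.
preds_100 = xgbreg.predict(X_test, ntree_limit=100)

# Newer releases replaced ntree_limit with iteration_range:
# preds_100 = xgbreg.predict(X_test, iteration_range=(0, 100))
```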

@tqchen tqchen closed this as completed Jul 4, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Oct 24, 2018