
Why does XGBoost regression predict completely unseen values? #1581

Closed
JoshuaC3 opened this issue Sep 16, 2016 · 9 comments

@JoshuaC3

From what I understand, xgboost is not very good at extrapolating to unseen values, and this is a problem with all tree-based ML models. How come, then, I have a model that predicts negative values when none are found in the training set? I use negatives in this example because they are easier to identify than other previously unseen values.

I get that this is probably not a bug, but I cannot find out or understand why it happens in XGBoost.

Is this my fault for not setting a hyper-parameter correctly? Is there a way to reduce the chance of this happening?

@khotilov
Member

Please, provide an example.

@JoshuaC3
Author

JoshuaC3 commented Sep 20, 2016

This might be a little tricky. I have a large dataset of 500,000 rows by 30 columns. The target variable has absolutely no values less than or equal to zero. I am using the sklearn API in Python. Here is my setup:

```python
import xgboost as xgb

xgbreg = xgb.XGBRegressor(learning_rate=0.1, n_estimators=1000,
                          max_depth=5, min_child_weight=1,
                          gamma=0, subsample=0.8,
                          colsample_bytree=0.8, objective="reg:linear",
                          nthread=-1, scale_pos_weight=1, seed=27)
```

Of the 250,000 examples in the testing set, 4,000 come back with negative values. From the training set, 7,000 came back with negative predictions.

If there is any more specific information I can give, please let me know.

@Laurae2
Contributor

Laurae2 commented Sep 27, 2016

It's an unbounded regression, so it is normal to see such behavior. Since gradient boosting has to minimize the loss function you use, your labels may have bounds, but gradient boosting does not look at those bounds and does not bound the predictions.

For instance, in the following picture you can clearly see that although you have only one variable (X) and only positive labels (Y), a GLM fit by gradient descent with an X*X effect will predict negative numbers after X=5.

[image: GLM fit with an X*X effect predicting negative values beyond X=5]

Here, the same story, but starting right at X=4:

[image: similar GLM fit predicting negative values starting at X=4]

If you want to force the predictions not to be negative, apply a large loss to negative predictions in a custom loss function (a sketch follows at the end of this comment). However, keep in mind this might affect the performance of your model in the end.

Another possibility: bound all the negative predictions to the minimum of your initial predictions.
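For illustration only, here is a minimal sketch of such a custom objective using the native xgb.train API (the squared-error base, the penalty weight of 10.0, and the X_train / y_train names are my assumptions, not a tested recipe):

```python
import numpy as np
import xgboost as xgb

def penalised_squared_error(preds, dtrain):
    """Squared error plus an extra penalty whenever a prediction is negative."""
    labels = dtrain.get_label()
    grad = preds - labels               # gradient of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)          # its constant hessian
    neg = preds < 0
    grad[neg] += 10.0 * preds[neg]      # gradient of 0.5 * 10 * pred^2 on negative preds
    hess[neg] += 10.0                   # pushes negative predictions back toward zero
    return grad, hess

# dtrain = xgb.DMatrix(X_train, label=y_train)   # assumes X_train / y_train exist
# booster = xgb.train({"max_depth": 5, "eta": 0.1}, dtrain,
#                     num_boost_round=1000, obj=penalised_squared_error)
```

The post-hoc alternative is even simpler: clip the final predictions at whatever floor you choose, e.g. np.clip(preds, a_min=0, a_max=None).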

@JoshuaC3
Author

JoshuaC3 commented Sep 27, 2016

@Laurae2 firstly, thank you for a detailed explanation. I can completely see how a GLM carries on predicting after X=5 and in doing so predicts negative values for Y. It is a semi-parametric model and so this inference is plausible.

My understanding* of tree-based models is that each leaf within the tree has a set of rules that restrict a prediction to that leaf. These rules act as bounds on the variables Xi. The prediction for that leaf is then the average of the training-set values that fall into it. From this I cannot see how we can infer/predict/extrapolate a negative value when none have been seen in the training set (see the sketch at the end of this comment).

Further, GBMs are a very different type of machine learning model to GLMs, so comparing the two might lead to some "papering over the cracks" in our understanding of what is going on. Here is a reference from Owen Zhang stating that GBMs cannot extrapolate.

[image: slide from Owen Zhang noting that GBMs cannot extrapolate]
If @Laurae2 or anyone else could shed some light on this then that would be amazing.

*beginner level understanding
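For what it's worth, that single-tree intuition can be checked directly. Here is a rough sketch using scikit-learn's DecisionTreeRegressor as a stand-in (the synthetic data and settings are only placeholders): a lone regression tree predicts leaf means of training labels, so it never leaves their range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(1000, 3))
y = np.abs(X[:, 0]) + 1.0                     # strictly positive labels

tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
preds = tree.predict(rng.uniform(-5, 5, size=(500, 3)))
print(preds.min() >= y.min(), preds.max() <= y.max())   # True True: leaf means stay in range
```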

@Laurae2
Contributor

Laurae2 commented Sep 27, 2016

@JoshuaC3 in xgboost, if you assume a tree is cut at a point X, the cut separates the data in two:

  • First part: value > X => assign a score or continue splitting
  • Second part: value <= X => assign a score or continue splitting

It is not aware of the bounds of the feature values. All it knows is "greater than" or "lower than" when choosing the cut point. The score computed is adjusted depending on the previous trees (this is the "boosting" part). The score itself does not know the bounds of the labels. If the additive effect of the trees pushes a prediction out of bounds, the model cannot see it and will carry on as if nothing were wrong (it is directed only by the loss function).

When you refer to "cannot extrapolate", I'm not sure how it should be interpreted, as it could mean a lot of different things (no context is provided in the slide).

Gradient boosted trees are there to minimize a loss function, not to respect the bounds of your initial inputs. The model sees only the gradient statistics, i.e. the pair (gradient, hessian) for each training observation, computed against your label using the loss function. If 200 trees are saying "-2" score for an observation (to minimize the loss function) and only "+1" score for 500 other trees, you will indeed have a negative prediction of -100 at the end (when you never had one at the beginning!).

As the number of feature inputs to your model increases, it becomes more and more possible to see such behavior (predictions that are literally out of bounds even on the training set, because a set of features dominates the others in a specific cluster of observations, which in turn dominates the gradient statistics used for choosing cut points).

To look at the behavior of your negative samples, I suggest you look at the predictions after every N rounds and select only the rows that end up negative at the end. You will see the trend they follow when you plot (prediction round, mean prediction of those eventually-negative observations), or by choosing a random observation and plotting (prediction round, prediction).
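For example, a rough sketch with the sklearn-API model from earlier in the thread (X_test, the 50-round step, and ntree_limit are my assumptions; newer xgboost releases use iteration_range instead of ntree_limit):

```python
import numpy as np
import xgboost as xgb

booster = xgbreg.get_booster()                    # fitted model from earlier in the thread
dtest = xgb.DMatrix(X_test)

staged = [(n, booster.predict(dtest, ntree_limit=n))
          for n in range(50, 1001, 50)]           # predictions after every 50 rounds

final_preds = staged[-1][1]
neg_rows = final_preds < 0                        # rows that end up negative

for n, preds in staged:
    print(n, preds[neg_rows].mean())              # trend of the eventually-negative rows
```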

@JoshuaC3
Author

JoshuaC3 commented Sep 27, 2016

Thank you @Laurae2. I think this is starting to make more sense. There are one or two places where you lost me, though.

Gradient boosted trees are there to minimize a loss function

Just to clarify, an example would be to reduce RMSE for some regression problem? (I want to make sure I am not getting this mixed up with score).

If 200 trees are saying "-2" score for an observation (to minimize the loss function) and only "+1" score for 500 other trees, you indeed will have a negative prediction of -100 at the end (when you never had one at the beginning!).

  1. Is there a small maths error (200 * -2) + (1 * 500) = 100?
  2. I am a little unsure what is meant by the score in this case? Is the score what becomes the predicted value?
  3. How might you get a score of -1 in the first place?
  4. Are all previous 1:n trees used in prediction? Or just tree n where n=number_of_trees?

Great idea. I hadn't thought to do this. I will try to plot the prediction for every round (or maybe every nth round) and see what it reveals.

@Laurae2
Contributor

Laurae2 commented Sep 27, 2016

Just to clarify, an example would be to reduce RMSE for some regression problem? (I want to make sure I am not getting this mixed up with score).

By default, regression uses the squared error loss in xgboost, if I'm not mistaken. In that case it would be about reducing the MSE metric. RMSE is another possible metric; if you can compute its gradient and hessian, you can minimize it directly as a custom objective.

  1. Is there a small maths error (200 * -2) + (1 * 500) = 100?

Yep, small maths error. I inverted the signs.

  2. I am a little unsure what is meant by the score in this case? Is the score what becomes the predicted value?

The score is the prediction from the tree, so yes it is the predicted value from that tree.

  3. How might you get a score of -1 in the first place?

Here is an extremely simplified version that does not depict the reality:

Imagine your labels are uniformly distributed over [2000, 20000] and you want to optimize the absolute (linear) error. A tree selects a feature which cuts the observations in two:

  • [2000, 6000] => let's say we assign them 4000
  • [6000, 20000] => let's say we assign them 13000

Let's focus on four observations with labels (2000, 6000, 6000, 20000), errors of (-2000, 2000, -7000, 7000), and feature V2 = (1000, 500, 2000, 1000).

Now we have another tree cutting on another feature, and we get the rule V2 > 999:

  • Average error when V2 > 999 => about -667 score
  • Average error when V2 < 999 => 2000 score
  • 2000 [error = -2000, V2 = 1000] => 2000 - 667 = 1333 (out of bounds)
  • 6000 [error = 2000, V2 = 500] => 6000 + 2000 = 8000 (in bounds)
  • 6000 [error = -7000, V2 = 2000] => 6000 - 667 = 5333 (in bounds)
  • 20000 [error = 7000, V2 = 1000] => 20000 - 667 = 19333 (in bounds)

And there you have an out-of-bounds prediction. This example is very simple and leaves out the shrinkage, the exact method for computing the score, and so on.
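A tiny numpy replay of that arithmetic (following the same simplified label-plus-score convention as above, so for illustration only) might look like this:

```python
import numpy as np

labels = np.array([2000., 6000., 6000., 20000.])
errors = np.array([-2000., 2000., -7000., 7000.])   # residuals left by the first tree
v2     = np.array([1000., 500., 2000., 1000.])

# The second tree splits on V2 > 999 and scores each leaf with its mean residual.
score = np.where(v2 > 999,
                 errors[v2 > 999].mean(),            # about -666.7
                 errors[v2 <= 999].mean())           # 2000.0

print(labels + score)   # [ 1333.3  8000.   5333.3 19333.3] -> 1333.3 falls below 2000
```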

@JoshuaC3
Author

JoshuaC3 commented Sep 27, 2016

Ok, I've read over this a few times and it has really helped fill some fundamental gaps in my knowledge. The example is very helpful. Thank you!

P.S. I did add the edit shown below but probably only just before you wrote your answer.

  4. Are all previous 1:n trees used in prediction? Or just tree n where n=number_of_trees?

@Laurae2
Contributor

Laurae2 commented Sep 27, 2016

  4. Are all previous 1:n trees used in prediction? Or just tree n where n=number_of_trees?

All trees by default. You can limit the number of trees used via the ntree_limit parameter during prediction (trees 1:ntree_limit are used).
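For example, with the xgbreg model from earlier (X_test is assumed; which keyword applies depends on your xgboost version):

```python
# Older releases: predict using only the first 100 trees.
preds_100 = xgbreg.predict(X_test, ntree_limit=100)

# Newer releases replaced ntree_limit with iteration_range:
# preds_100 = xgbreg.predict(X_test, iteration_range=(0, 100))
```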

@tqchen tqchen closed this as completed Jul 4, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Oct 24, 2018