Why does XGBoost regression predict completely unseen values? #1581
Please provide an example.
This might be a little tricky. I have a large dataset of 500,000 rows by 30 columns. The target variable has absolutely no values less than or equal to zero. I am using the SKLearn API in Python. Here is my setup,

Of the 250,000 examples in the testing set, 4,000 come back with negative values. From the training set, 7,000 came back negative. If there is any more specific information I can give, please say so.
It's an unbounded regression, so it is normal to see such behavior. Gradient boosting has to minimize the loss function you give it: your labels may have bounds, but gradient boosting does not look at those bounds and does not bound its predictions.

For instance, in the following picture you can clearly see that with a single variable (X) and only positive labels (Y), a GLM fit by gradient descent with an X*X effect will predict negative numbers after X=5. [picture omitted] Here, same story, but starting right at X=4: [picture omitted]

If you want to force the predictions not to be negative, use a large loss on the negative samples in a custom loss function. Keep in mind, however, that this may affect the performance of your model in the end. Another possibility: bound all the negative predictions to the minimum of your initial predictions.
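The custom-loss idea above can be sketched as a squared-error objective whose gradient and hessian are inflated whenever the current prediction is negative. The function name and the penalty factor are illustrative assumptions, not from this thread:

```python
import numpy as np

# Hypothetical custom objective: squared error, but with the gradient and
# hessian scaled up by `penalty` whenever the current prediction is negative,
# so the booster pushes much harder to lift those samples above zero.
def penalized_squared_error(preds, labels, penalty=100.0):
    grad = preds - labels              # gradient of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)         # its (constant) hessian
    neg = preds < 0                    # samples currently predicted negative
    grad[neg] *= penalty
    hess[neg] *= penalty
    return grad, hess

grad, hess = penalized_squared_error(np.array([-1.0, 2.0]), np.array([3.0, 3.0]))
# grad -> [-400., -1.], hess -> [100., 1.]
```

With the native API this would be wrapped as `obj=lambda preds, dtrain: penalized_squared_error(preds, dtrain.get_label())` and passed to `xgb.train`; the penalty value is a knob to tune, and as noted above it can cost overall accuracy.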
@Laurae2, firstly, thank you for a detailed explanation. I can completely see how a GLM carries on predicting beyond the training range.

My understanding* of tree-based models is that each leaf within the tree has a set of rules that restrict a prediction to that leaf. These rules act as bounds upon the variables Xi. The prediction for the leaf is then the average of all the training-set values that fall into it. From this I cannot see how we can infer/predict/extrapolate a negative value when none has been seen in the training set.

Further, GBMs are a very different type of machine learning from GLMs, so comparing the two might lead to some "papering over the cracks" in our understanding of what is going on. Here is a reference from Owen Zhang that GBM cannot extrapolate.

*beginner-level understanding
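That intuition is in fact correct for a single tree: each leaf predicts an average of training labels, so one tree alone can never leave the label range. A minimal sketch with invented numbers (the twist, addressed below, is that boosted trees predict residuals added on top of earlier trees):

```python
import numpy as np

# A single regression tree predicts the mean label of each leaf, so its
# predictions always stay within [min(y), max(y)].
y = np.array([1.0, 2.0, 5.0, 8.0])          # all labels positive
leaf_assignment = np.array([0, 0, 1, 1])    # which leaf each sample falls in
leaf_means = np.array([y[leaf_assignment == k].mean() for k in (0, 1)])
preds = leaf_means[leaf_assignment]         # -> [1.5, 1.5, 6.5, 6.5]
```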
@JoshuaC3 in xgboost, if you assume a tree is cut at a point X, it separates the observations in two:
It is not aware of the bounds of the values of the feature. All it knows is "greater than" or "lower than" when choosing the cut point. The score computed is adjusted depending on the previous trees (that is the "boosting"), and the score itself does not know the bounds of the labels. If the additive effect of the trees pushes a prediction out of bounds, the model "cannot see it and will continue as if it was normal", since it is directed only by the loss function.

When you refer to "cannot extrapolate", I'm not sure how it should be interpreted, as it could mean a lot of different things (no context is provided in the slide).

Gradient-boosted trees are there to minimize a loss function, not to respect the bounds of your initial inputs. They see only the gradient statistics, i.e. the pair (gradient, hessian) for each training observation, computed against your label using the loss function. If 200 trees are saying a "-2" score for an observation (to minimize the loss function) and only "+1" for 500 other trees, you will indeed have a negative prediction of -100 at the end, when you never had one at the beginning! As the number of feature inputs to your model increases, such behavior becomes more and more possible: predictions that are literally out of bounds even on the training set, because a set of features dominates the others in a specific cluster of observations, which in turn dominates the gradient statistics used for choosing cut points.

To look at the behavior of your negative samples, I suggest you inspect the predictions after every N rounds, keeping only the rows that end up negative. You will see the trend they follow when you plot the final table of (prediction round, mean prediction of the eventually-negative observations), or by choosing a random observation and plotting (prediction round, prediction).
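The additive mechanism described here can be mimicked with a cumulative sum of per-tree leaf scores for one observation (numbers invented for illustration): even from a positive starting point, enough rounds voting "down" push the running prediction below zero.

```python
import numpy as np

# Each boosting round adds a leaf score to the running prediction.
# Tracking the per-round total is exactly the "look at the predictions
# after every N rounds" diagnostic suggested above.
base_score = 0.5
tree_scores = np.array([0.3, -0.4, 0.2, -0.6, -0.3])  # invented per-round scores
per_round = base_score + np.cumsum(tree_scores)
# per_round -> [0.8, 0.4, 0.6, 0.0, -0.3]: negative by the final round
```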
Thank you @Laurae2. I think this is starting to make more sense. There are one or two places where you lost me, though.

Just to clarify, an example would be reducing RMSE for some regression problem? (I want to make sure I am not getting this mixed up with the score.)

Great idea, I hadn't thought to do this. I will try to plot the prediction for every round (or maybe every nth round) and see what it reveals.
By default, regression in xgboost uses the squared-error loss, if I'm not mistaken. In this case it is about reducing the MSE metric. RMSE is another possible metric; if you can compute its gradient/hessian loss functions, you can minimize it directly.
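A small numeric check of the relationship between the two metrics: RMSE is just the square root of MSE, and since the square root is monotone, a model that minimizes one minimizes the other (numbers invented for illustration).

```python
import numpy as np

# MSE and RMSE on a toy prediction/label pair.
preds = np.array([1.0, 2.0, 4.0])
labels = np.array([1.0, 3.0, 2.0])
mse = np.mean((preds - labels) ** 2)   # (0 + 1 + 4) / 3
rmse = np.sqrt(mse)                    # monotone transform of mse
```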
Yep, small maths error. I inverted the signs.

The score is the prediction from the tree, so yes, it is the predicted value from that tree.
An extremely simplified version that does not depict reality: imagine your labels are uniformly distributed over [2000, 20000] and you want to optimize the linear error. A tree selects a feature which cuts the observations in two:

Let's focus on the predictions (2000, 6000, 6000, 20000) with errors of (-2000, 2000, -7000, 7000), where the feature V2 = (1000, 500, 2000, 1000). Now another tree cuts on that feature, and you get this rule: V2 > 999

And there you have out-of-bound predictions. This example is very simple and does not account for the shrinkage, the exact method used to compute the score, etc.
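The arithmetic of this toy example can be walked through in code. Two assumptions are mine, not stated above: the errors are read as (prediction − label), which makes the implied labels (4000, 4000, 13000, 13000), all inside [2000, 20000]; and the second tree's leaf score is taken as the plain mean residual, with no shrinkage, as the comment notes the example is heavily simplified.

```python
import numpy as np

y  = np.array([4000.0, 4000.0, 13000.0, 13000.0])   # implied labels (assumed)
p1 = np.array([2000.0, 6000.0, 6000.0, 20000.0])    # predictions after the first tree
V2 = np.array([1000.0, 500.0, 2000.0, 1000.0])

resid = y - p1                        # (2000, -2000, 7000, -7000)
leaf = V2 > 999                       # the second tree's split rule
p2 = p1.copy()
p2[leaf] += resid[leaf].mean()        # leaf score = mean residual (assumed)
p2[~leaf] += resid[~leaf].mean()
# p2 -> roughly [2666.7, 4000.0, 6666.7, 20666.7]:
# the last prediction exceeds the [2000, 20000] label range
```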
OK, I've read over this a few times and it has really helped fill some fundamental gaps in my knowledge. The example is very helpful, thank you! P.S. I did add the edit shown below, but probably only just before you wrote your answer.
All trees by default. You can set the limit of trees to use with the ntree_limit parameter during prediction (trees 1 through ntree_limit).
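What limiting the tree count at prediction time does can be shown with a plain-Python stand-in (not the xgboost API itself): sum only the first n per-tree scores on top of the base score. In xgboost this is the `ntree_limit` argument to `predict()`, replaced by `iteration_range` in recent versions.

```python
import numpy as np

# Stand-in for prediction with a limited number of trees: use only the
# first n per-tree leaf scores (invented numbers for one observation).
def predict_first_n(base_score, tree_scores, n):
    return base_score + np.sum(tree_scores[:n])

scores = np.array([0.3, -0.4, 0.2, -0.6, -0.3])
full = predict_first_n(0.5, scores, len(scores))   # all trees
early = predict_first_n(0.5, scores, 2)            # first two trees only
```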
From what I understand, xgboost is not very good at extrapolating to unseen values, and this is a problem with all tree-based ML models. How come, then, I have a model that predicts negative values when none are found in the training set? I use negatives as the example because they are easier to identify than other previously unseen values.

I get that this is probably not a bug, but I cannot find or understand why it happens in XGBoost.

Is this my fault for not setting a hyper-parameter correctly? Is there a way to reduce the chance of this happening?