[SPARK-6025] [MLlib] Add helper method to efficiently compute error in GBT's #4819
Conversation
Test build #28089 has started for PR 4819 at commit
@jkbradley Is this similar to what you had in mind?
Test build #28089 has finished for PR 4819 at commit
Test FAILed.
Force-pushed from fa215cc to 7d4ed48.
Test build #28110 has started for PR 4819 at commit
Test build #28110 has finished for PR 4819 at commit
Test PASSed.
@jkbradley I am assuming that this is what you intended. It works, but I'm not sure about the present design, which differs from the design you had posted in the JIRA.
I am unable to understand how this would work if the existing residual is not passed. Could you also say what the Array[Double] is supposed to contain? And does "evaluator" mean the loss metric (as used in this PR)?
Also, the present code is unoptimized, since there are two passes over the data RDD: one to update the residual and another to compute the error. But that can be taken care of after we discuss the design.
@MechCoder I had intended to use this internally and also to expose a public method. (The "evaluateEachIteration" method was the public one, but feel free to think of a better name.) Yes, the evaluator was the loss metric, which should probably be an optional parameter (defaulting to the training metric).
I'm OK with combining the 2 JIRAs in 1 PR since they are closely related. For the internal optimization, the "residual" to store is not really the residual but rather the cumulative prediction of the ensemble; that in turn can be used to compute both the gradient and the error. (Note it will be important to use the cached residual for computing the gradient, not just the objective.) That may require adding some internal API to ensembles to permit prediction from a pre-computed sum of trees' predictions.
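The cumulative-prediction idea above can be sketched as follows. This is a minimal, non-Spark illustration in plain Scala (no RDDs; `EvaluateEachIterationSketch` and all names are hypothetical, not the Spark API): each point's running ensemble prediction is cached, so the loss at iteration i is computed from that running sum rather than by re-predicting with all i+1 trees.

```scala
// Hypothetical sketch of evaluateEachIteration with a cached cumulative prediction.
// Trees are stubbed as functions Double => Double; data is (label, feature) pairs.
object EvaluateEachIterationSketch {
  def evaluateEachIteration(
      data: Seq[(Double, Double)],          // (label, feature)
      trees: Seq[Double => Double],         // stubbed trees
      treeWeights: Seq[Double],             // per-tree weights
      loss: (Double, Double) => Double      // (prediction, label) => pointwise error
  ): Array[Double] = {
    val n = data.size
    // Cached cumulative ensemble prediction for each data point.
    var cumulative = Array.fill(n)(0.0)
    trees.zip(treeWeights).map { case (tree, w) =>
      // Each iteration only evaluates the newly added tree.
      cumulative = cumulative.zip(data).map { case (c, (_, x)) => c + w * tree(x) }
      // Mean loss of the ensemble containing trees 0..i.
      cumulative.zip(data).map { case (pred, (label, _)) => loss(pred, label) }.sum / n
    }.toArray
  }
}
```

Element i of the returned array is the mean loss of the ensemble of trees 0..i, matching the discussion below; in real Spark code the per-point cumulative predictions would live in a cached RDD rather than a local array.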
Ouch. I just realised what you meant. Scratch my previous couple of comments. :/
@jkbradley Just one quick clarification, please. When you mean
@MechCoder No problem; sorry I didn't make the JIRAs clearer! Calling it
@jkbradley Yes, but each element of the returned array corresponds to the error/loss at each iteration, right?
That's correct: element i should have the error/loss for the ensemble containing trees {0, 1, ..., i}. |
While computing the error for every iteration, with and without validation, the predictions of the previous trees were not being cached, which led to recomputation.
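To make the cost of that recomputation concrete, here is an illustrative plain-Scala contrast (hypothetical names, no Spark): the naive version re-predicts with trees 0..i at every iteration i, which is quadratic in the number of trees, while the cached version carries a running prediction and evaluates only the new tree each iteration.

```scala
// Hypothetical contrast between recomputing and caching per-point predictions.
object CachedPredictionSketch {
  // Naive: iteration i re-evaluates trees 0..i on every point
  // (O(numTrees^2 * n) tree evaluations overall).
  def errorsNaive(xs: Seq[Double], labels: Seq[Double],
                  trees: Seq[Double => Double]): Array[Double] =
    trees.indices.map { i =>
      val preds = xs.map(x => trees.take(i + 1).map(_(x)).sum)
      preds.zip(labels).map { case (p, l) => (p - l) * (p - l) }.sum / xs.size
    }.toArray

  // Cached: carry each point's running prediction; each iteration
  // evaluates only the newly added tree (O(numTrees * n) overall).
  def errorsCached(xs: Seq[Double], labels: Seq[Double],
                   trees: Seq[Double => Double]): Array[Double] = {
    var running = Array.fill(xs.size)(0.0)
    trees.map { tree =>
      running = running.zip(xs).map { case (r, x) => r + tree(x) }
      running.zip(labels).map { case (p, l) => (p - l) * (p - l) }.sum / xs.size
    }.toArray
  }
}
```

Both versions return identical per-iteration errors; only the amount of work differs, which is the point of the caching fix described above.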