
[SPARK-6025] [MLlib] Add helper method to efficiently compute error in GBT's #4819

Closed
wants to merge 1 commit

Conversation

MechCoder
Contributor

While computing the error, with and without validation, for every iteration, the prediction of the previous trees was not being cached, which led to recomputation.
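To make the inefficiency concrete, here is a minimal standalone sketch (plain Python, not the Spark/MLlib API; all names are illustrative) contrasting the uncached approach, which re-evaluates trees 0..i for every iteration i, with a version that caches the cumulative prediction:

```python
# `trees` is a list of per-tree predict functions, `weights` their
# learning-rate-scaled weights; mean squared error is assumed as the metric.

def errors_naive(trees, weights, xs, ys):
    # Re-sums the predictions of trees 0..i for every i:
    # O(n_trees^2) tree evaluations per data point.
    errs = []
    for i in range(len(trees)):
        preds = [sum(w * t(x) for t, w in zip(trees[:i + 1], weights[:i + 1]))
                 for x in xs]
        errs.append(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))
    return errs

def errors_cached(trees, weights, xs, ys):
    # Caches the cumulative prediction: each tree is evaluated
    # exactly once per data point.
    errs = []
    preds = [0.0] * len(xs)
    for t, w in zip(trees, weights):
        preds = [p + w * t(x) for p, x in zip(preds, xs)]
        errs.append(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))
    return errs
```

Both functions return the same per-iteration error array; the cached version just folds each new tree into the running prediction instead of starting from scratch.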

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28089 has started for PR 4819 at commit fa215cc.

  • This patch merges cleanly.

@MechCoder
Contributor Author

@jkbradley Is this similar to what you had in mind?

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28089 has finished for PR 4819 at commit fa215cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28089/
Test FAILed.

@MechCoder MechCoder changed the title [SPARK-6025] Add helper method to efficiently compute error in GBT's [SPARK-6025] [MLlib] Add helper method to efficiently compute error in GBT's Feb 28, 2015
@SparkQA

SparkQA commented Feb 28, 2015

Test build #28110 has started for PR 4819 at commit 7d4ed48.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 28, 2015

Test build #28110 has finished for PR 4819 at commit 7d4ed48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28110/
Test PASSed.

@MechCoder
Contributor Author

@jkbradley I am assuming that this is what you intended. It works, but I'm not sure about the present design, which differs from the one you posted on the JIRA:

def evaluateEachIteration(data: RDD[LabeledPoint], evaluator): Array[Double]

I don't understand how this would work if the existing residual is not passed in. Could you also say what the Array[Double] is supposed to be? And does the evaluator mean the loss metric (as done in this PR)?

@MechCoder
Contributor Author

Also, the present code is unoptimized, since there are two passes across the data RDD: one to update the residual and the other to calculate the error. But that can be taken care of after we discuss the design.
I just want to make sure that I'm on the right track.

@jkbradley
Member

@MechCoder I had intended to use this internally and to expose a public method. (The "evaluateEachIteration" method was the public one, but feel free to think of a better name.) Yes, the evaluator was the loss metric, which should probably be an optional parameter (defaulting to the training metric).

  • https://issues.apache.org/jira/browse/SPARK-6025: This is the JIRA for the public method.
  • https://issues.apache.org/jira/browse/SPARK-5972: This is the JIRA for the internal optimization.

I'm OK with combining the two JIRAs in one PR since they are closely related. For the internal optimization, the "residual" to store is not really the residual but rather the cumulative prediction of the ensemble; that in turn can be used to compute both the gradient and the error. (Note it will be important to use the cached residual for computing the gradient, not just the objective.) That may require adding some internal API to ensembles to permit prediction from a pre-computed sum of trees' predictions.
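A minimal sketch of the design described above, in plain Python rather than Spark (all names are hypothetical, and squared error is assumed as the loss): one boosting step folds the new tree into the cached cumulative predictions, and both the pseudo-residuals/gradients (needed to fit the next tree) and the evaluation error are derived from that same cached sum.

```python
def boosting_step(cached_preds, new_tree, weight, xs, ys):
    # Fold one new tree into the cached cumulative predictions.
    preds = [p + weight * new_tree(x) for p, x in zip(cached_preds, xs)]
    # Both quantities come from the same cached sum, so no tree is
    # re-evaluated:
    gradients = [2.0 * (p - y) for p, y in zip(preds, ys)]  # d/dp (p - y)^2
    error = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)
    return preds, gradients, error
```

The returned `preds` become the `cached_preds` of the next iteration, which is the caching this PR and SPARK-5972 are after.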

@MechCoder
Contributor Author

Ouch. I just realised what you meant.. Scratch my previous couple of comments. :/

@MechCoder
Contributor Author

@jkbradley Just one quick clarification, please.

When you say evaluateEachIteration should return an Array of Doubles, do you mean that each element of the returned array corresponds to the error per iteration (i.e. per tree)? In that case, how does the name errorPerIteration sound?

@MechCoder MechCoder closed this Mar 4, 2015
@MechCoder MechCoder deleted the spark-6025 branch March 4, 2015 04:25
@jkbradley
Member

@MechCoder No problem; sorry I didn't make the JIRAs clearer! Calling it errorPerIteration sounds OK unless we allow users to pass in evaluators, in which case the evaluator might be something new like Accuracy which isn't an "error" metric. I'd still vote for evaluateEachIteration in case we allow this later on.

@MechCoder
Contributor Author

@jkbradley Yes, but each element of the returned array corresponds to the error/loss at each iteration, right?

@jkbradley
Member

That's correct: element i should have the error/loss for the ensemble containing trees {0, 1, ..., i}.
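A toy check of that indexing convention (plain Python, illustrative names only): element i of the returned array is the error of the partial ensemble built from trees 0..i.

```python
trees = [lambda x: 1.0, lambda x: 0.5]  # two constant "trees"
weights = [1.0, 1.0]
xs, ys = [0.0, 0.0], [1.5, 1.5]

errs = []
pred = 0.0  # cumulative prediction (same for every point here)
for t, w in zip(trees, weights):
    pred += w * t(xs[0])
    errs.append((pred - ys[0]) ** 2)
# errs[0] is the error of ensemble {0}; errs[1] of ensemble {0, 1}
```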
