-
Notifications
You must be signed in to change notification settings - Fork 66
[ML] Instrumentation information for boosted tree learning #1031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a first pass through. Overall looks great (and good job on reducing test duplication). I've made some minor comments. My most significant comment is I'd like to avoid the double inheritance: we have historically mandated against this. I can see why you've done it so will ponder the design a bit to see if I can make an alternative suggestion.
lib/maths/CBoostedTreeImpl.cc
Outdated
lossMoments.add(loss); | ||
m_FoldRoundTestLosses[fold][m_CurrentRound] = loss; | ||
numberTrees.push_back(static_cast<double>(forest.size())); | ||
m_Instrumentation->lossValues(std::to_string(fold), std::move(lossValues)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to convert this to a string here? I can't think of one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when I am going to write the loss values in rapid json, fold
becomes the key and it has to be a string. I cannot convert it to string during writing since the string has to exist in memory until the writer finishes writing, otherwise, it just puts some random bytes as the key.
Thank you for your comments @tveasey. I addressed your comments. It would be great if you could review the changes and let me know if it's ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @valeriy42. A couple more minor suggestions, but otherwise looks good. I don't think this needs any further review so I'll approve.
retest |
retest |
) This PR enables instrumentation of the supvervised learning jobs. I needed to add maths::CDataFrameTrainBoostedTreeInstrumentationInterface to have an interface for storing validation error results, hyperparameters, and timing. Since I am passing iteration directly over the new interface, I also refactored nextStep to use an optional string phase as an argument. At the moment, we are not tagging phases, but we intend to in the future. Thus, I added a TODO there. Finally, I extracted the common routines for generating regression and classification data into CDataFrameAnalyzerTrainingFactory. It is not used for feature importance testing so far, cause the generation procedure there is a little bit different. I intend to unify the code in a follow-up PR.
This PR enables instrumentation of the supvervised learning jobs. I needed to add maths::CDataFrameTrainBoostedTreeInstrumentationInterface to have an interface for storing validation error results, hyperparameters, and timing. Since I am passing iteration directly over the new interface, I also refactored nextStep to use an optional string phase as an argument. At the moment, we are not tagging phases, but we intend to in the future. Thus, I added a TODO there. Finally, I extracted the common routines for generating regression and classification data into CDataFrameAnalyzerTrainingFactory. It is not used for feature importance testing so far, cause the generation procedure there is a little bit different. I intend to unify the code in a follow-up PR. Backport for #1031.
This PR enables instrumentation of the supvervised learning jobs. I needed to add
maths::CDataFrameTrainBoostedTreeInstrumentationInterface
to have an interface for storing validation error results, hyperparameters, and timing.Since I am passing
iteration
directly over the new interface, I also refactorednextStep
to use an optional stringphase
as an argument. At the moment, we are not tagging phases, but we intend to in the future. Thus, I added a TODO there.Finally, I extracted the common routines for generating regression and classification data into
CDataFrameAnalyzerTrainingFactory
. It is not used for feature importance testing so far, cause the generation procedure there is a little bit different. I intend to unify the code in a follow-up PR.