Contributor

@valeriy42 valeriy42 commented Feb 27, 2020

This PR enables instrumentation of the supervised learning jobs. I needed to add maths::CDataFrameTrainBoostedTreeInstrumentationInterface to have an interface for storing validation error results, hyperparameters, and timing.

Since I am passing iteration directly over the new interface, I also refactored nextStep to use an optional string phase as an argument. At the moment, we are not tagging phases, but we intend to in the future. Thus, I added a TODO there.
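For readers following along, here is a minimal C++17 sketch of what an instrumentation interface of this kind might look like. The class and member names (InstrumentationSketch, InMemoryInstrumentation, recordExampleRound) are hypothetical and are not the actual maths::CDataFrameTrainBoostedTreeInstrumentationInterface. Modelling the optional phase as std::optional<std::string> is also an assumption; only the lossValues call shape is taken from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch only: not the real ml-cpp interface.
class InstrumentationSketch {
public:
    using TDoubleVec = std::vector<double>;

    virtual ~InstrumentationSketch() = default;

    // The trainer passes the iteration number in directly.
    virtual void iteration(std::size_t iteration) = 0;
    // Accumulate the time spent, in milliseconds.
    virtual void iterationTime(std::uint64_t milliseconds) = 0;
    // Store validation losses for one fold, keyed by the fold name.
    virtual void lossValues(std::string fold, TDoubleVec losses) = 0;
    // Mark the end of a step. The phase is optional because phases
    // are not tagged yet.
    virtual void nextStep(const std::optional<std::string>& phase) = 0;
};

// Simple in-memory implementation, e.g. for use in unit tests.
class InMemoryInstrumentation final : public InstrumentationSketch {
public:
    void iteration(std::size_t iteration) override { m_Iteration = iteration; }
    void iterationTime(std::uint64_t milliseconds) override {
        m_ElapsedMs += milliseconds;
    }
    void lossValues(std::string fold, TDoubleVec losses) override {
        assert(!fold.empty());
        m_Losses[std::move(fold)] = std::move(losses);
    }
    void nextStep(const std::optional<std::string>& phase) override {
        // Untagged steps are recorded with a placeholder name.
        m_Steps.push_back(phase.value_or("untagged"));
    }

    std::size_t currentIteration() const { return m_Iteration; }
    std::uint64_t elapsedMs() const { return m_ElapsedMs; }
    const std::map<std::string, TDoubleVec>& recordedLosses() const {
        return m_Losses;
    }
    const std::vector<std::string>& steps() const { return m_Steps; }

private:
    std::size_t m_Iteration = 0;
    std::uint64_t m_ElapsedMs = 0;
    std::map<std::string, TDoubleVec> m_Losses;
    std::vector<std::string> m_Steps;
};

// Records one illustrative training round with made-up values.
inline InMemoryInstrumentation recordExampleRound() {
    InMemoryInstrumentation instrumentation;
    instrumentation.iteration(1);
    instrumentation.iterationTime(12);
    instrumentation.lossValues(std::to_string(0), {0.5, 0.4});
    instrumentation.nextStep(std::nullopt); // no phase tag yet
    return instrumentation;
}
```

Keeping the interface abstract like this would let the training code report progress without knowing where the results end up, and a test can swap in the in-memory variant.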

Finally, I extracted the common routines for generating regression and classification data into CDataFrameAnalyzerTrainingFactory. It is not used for feature importance testing so far, because the generation procedure there is slightly different. I intend to unify the code in a follow-up PR.

@valeriy42 valeriy42 requested a review from tveasey February 28, 2020 10:24
@valeriy42 valeriy42 removed the WIP label Feb 28, 2020
Contributor

@tveasey tveasey left a comment

I've done a first pass through. Overall looks great (and good job on reducing test duplication). I've made some minor comments. My most significant comment is I'd like to avoid the double inheritance: we have historically mandated against this. I can see why you've done it so will ponder the design a bit to see if I can make an alternative suggestion.

lossMoments.add(loss);
m_FoldRoundTestLosses[fold][m_CurrentRound] = loss;
numberTrees.push_back(static_cast<double>(forest.size()));
m_Instrumentation->lossValues(std::to_string(fold), std::move(lossValues));
Contributor

Is there any reason to convert this to a string here? I can't think of one.

Contributor Author

Yes: when I write the loss values with RapidJSON, the fold becomes the key, so it has to be a string. I cannot convert it to a string at write time, because the string has to stay alive in memory until the writer finishes writing; otherwise the writer just emits some random bytes as the key.
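To illustrate the lifetime issue, here is a minimal C++17 sketch. DeferredWriter is a hypothetical stand-in for any writer that keeps non-owning references to its keys until it flushes; it is not RapidJSON's API, and writeFoldLosses is likewise made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical writer: it stores only a view of each key and reads
// the key bytes later, in flush(). Not RapidJSON's API.
class DeferredWriter {
public:
    void addMember(std::string_view key, double value) {
        assert(!key.empty());
        m_Members.emplace_back(key, value);
    }

    // Serialise all members; every key must still be alive here.
    std::string flush() const {
        std::string out = "{";
        for (std::size_t i = 0; i < m_Members.size(); ++i) {
            if (i > 0) {
                out += ",";
            }
            out += "\"";
            out += m_Members[i].first;
            out += "\":";
            out += std::to_string(m_Members[i].second);
        }
        out += "}";
        return out;
    }

private:
    std::vector<std::pair<std::string_view, double>> m_Members;
};

inline std::string writeFoldLosses(const std::vector<double>& lossPerFold) {
    // Build every key first, so the strings outlive flush() and the
    // vector no longer reallocates once views have been handed out.
    std::vector<std::string> foldKeys;
    foldKeys.reserve(lossPerFold.size());
    for (std::size_t fold = 0; fold < lossPerFold.size(); ++fold) {
        foldKeys.push_back(std::to_string(fold));
    }

    DeferredWriter writer;
    for (std::size_t fold = 0; fold < lossPerFold.size(); ++fold) {
        // Unsafe alternative: writer.addMember(std::to_string(fold), ...)
        // would leave a dangling view, because the temporary string is
        // destroyed at the end of that statement.
        writer.addMember(foldKeys[fold], lossPerFold[fold]);
    }
    return writer.flush();
}
```

Converting the fold to a string before handing it over, as the diff does, puts the caller in control of the key's lifetime in the same way.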

@valeriy42
Contributor Author

Thank you for your comments @tveasey. I addressed your comments. It would be great if you could review the changes and let me know if it's ok.

Contributor

@tveasey tveasey left a comment

Looks good @valeriy42. A couple more minor suggestions, but otherwise looks good. I don't think this needs any further review so I'll approve.

@valeriy42
Contributor Author

retest

@valeriy42
Contributor Author

retest

@valeriy42 valeriy42 merged commit 8d30f4f into elastic:master Mar 18, 2020
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Mar 19, 2020
valeriy42 added a commit that referenced this pull request Mar 19, 2020
Backport for #1031.
@valeriy42 valeriy42 deleted the analysis-stats branch May 6, 2020 11:14