[ML] Set the class assignment probability threshold to maximise minimum recall #926
… class recall by default
…ight change in the future
Looks good altogether. I am just wondering if you should put the threshold member variable into CBoostedTreeImpl.
//! Set whether to try and balance within class accuracy. For classification
//! this reweights examples so approximately the same total loss is assigned
//! to every class.
CBoostedTreeFactory& balanceClassTrainingLoss(bool balance);
This change was short-lived ;)
Indeed. I tested with and without this option and it didn't significantly alter results now that the estimated probability at which to assign to each class is floating. I still think there is mileage in balancing classes better, possibly by oversampling (undersampling) the minority (majority) class when downsampling for each tree, but I haven't been able to get anything to work appreciably better, so I'm just keeping it simple for the time being.
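For readers unfamiliar with the reweighting that the doc comment above describes, here is a minimal sketch of one common way to do it. It is not the code from this PR: `classBalancingWeights` is a hypothetical helper, and it assumes each example's weight is inversely proportional to the frequency of its class, so every class ends up with the same total weight.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the PR): returns one weight per example
// such that the weights of every class present in `labels` sum to the same
// total, namely labels.size() / numberClasses.
std::vector<double> classBalancingWeights(const std::vector<std::size_t>& labels,
                                          std::size_t numberClasses) {
    // Count the examples in each class.
    std::vector<double> counts(numberClasses, 0.0);
    for (std::size_t label : labels) {
        assert(label < numberClasses);
        counts[label] += 1.0;
    }

    // Weight each example inversely to the size of its class. A class with
    // no examples is never indexed here, so there is no division by zero.
    const double n = static_cast<double>(labels.size());
    const double k = static_cast<double>(numberClasses);
    std::vector<double> weights;
    weights.reserve(labels.size());
    for (std::size_t label : labels) {
        weights.push_back(n / (k * counts[label]));
    }
    return weights;
}
```

With three examples of class 0 and one of class 1, each class 0 example gets weight 2/3 and the class 1 example gets weight 2, so both classes contribute a total weight of 2 to the loss.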
CSolvers::maximize(0.0, 1.0, minRecall(0.0), minRecall(1.0), minRecall,
                   1e-3, maxIterations, threshold, minRecallAtThreshold);
LOG_TRACE(<< "threshold = " << threshold
nice!
Since I had only a minor comment, I'll go ahead and approve the PR.
retest
Rather than using a fixed threshold on P(class 1), this optionally supports two strategies for assigning class labels: maximising overall accuracy, or maximising the minimum recall across classes, which is the default.
The default choice avoids pathologies for very imbalanced training data, where we could otherwise assign essentially all values to one class if we sought to maximise overall accuracy.
(We also need to introduce this threshold into the inference model schema. This needs support in the Java inference code to be merged first, so it will be done in a separate PR.)
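To illustrate the idea of choosing the threshold that maximises the minimum class recall, here is a small self-contained sketch for the binary case. It is not the code in this PR: the PR uses `CSolvers::maximize`, whereas this sketch swaps in a plain grid scan over [0, 1], and `minRecall`, `selectThreshold` and the `steps` parameter are hypothetical names chosen for illustration. It assumes labels are 0 or 1 and that both classes occur in the data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the smaller of the two class recalls when class 1 is
// assigned whenever P(class 1) >= threshold. Assumes labels are 0 or 1 and
// that both classes are present (otherwise a recall would be 0 / 0).
double minRecall(const std::vector<double>& probabilities,
                 const std::vector<int>& labels,
                 double threshold) {
    assert(probabilities.size() == labels.size());
    double correct[2] = {0.0, 0.0};
    double total[2] = {0.0, 0.0};
    for (std::size_t i = 0; i < probabilities.size(); ++i) {
        const int predicted = probabilities[i] >= threshold ? 1 : 0;
        total[labels[i]] += 1.0;
        if (predicted == labels[i]) {
            correct[labels[i]] += 1.0;
        }
    }
    return std::min(correct[0] / total[0], correct[1] / total[1]);
}

// Hypothetical helper: scans a uniform grid of candidate thresholds and
// returns the first one achieving the largest minimum recall. A grid scan
// is an assumption made for simplicity, not the solver used in the PR.
double selectThreshold(const std::vector<double>& probabilities,
                       const std::vector<int>& labels,
                       std::size_t steps = 1000) {
    double best = 0.5;
    double bestMinRecall = minRecall(probabilities, labels, best);
    for (std::size_t i = 0; i <= steps; ++i) {
        const double candidate = static_cast<double>(i) / static_cast<double>(steps);
        const double recall = minRecall(probabilities, labels, candidate);
        if (recall > bestMinRecall) {
            best = candidate;
            bestMinRecall = recall;
        }
    }
    return best;
}
```

As an example of the pathology described above, take eight class 0 examples with P(class 1) between 0.05 and 0.3 and two class 1 examples with P(class 1) of 0.35 and 0.45. A fixed threshold of 0.5 labels everything as class 0, which gives 80% accuracy but zero recall for class 1. Scanning for the maximum minimum recall instead picks a threshold just above 0.3, at which both classes are recalled perfectly.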