Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Set the class assignment probability threshold to maximise minimum recall #926

Merged
merged 17 commits into from
Jan 14, 2020

Conversation

tveasey
Copy link
Contributor

@tveasey tveasey commented Jan 9, 2020

Rather than using a fixed threshold on the P(class 1), this switches to optionally supporting two strategies for assigning class labels:

  1. Maximise accuracy (i.e. assign to class with the highest predicted probability).
  2. Maximise the minimum recall of any class (this will be the default behaviour).

The default choice avoids pathologies for very imbalanced training data, where we can essentially assign all values to one class if we seek to maximise overall accuracy.

(We also need to introduce this threshold into the inference model schema. This needs support in the Java inference code to be merged first, and it will be made in a separate PR.)

@tveasey tveasey changed the title [ML] Set the class assignment probability threshold to maximise average recall [ML] Set the class assignment probability threshold to maximise minimum recall Jan 13, 2020
Copy link
Contributor

@valeriy42 valeriy42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good altogether. I am just wondering if you should put the threshold member variable into CBoostedTreeImpl.

include/maths/CBoostedTree.h Outdated Show resolved Hide resolved
//! Set whether to try and balance within class accuracy. For classification
//! this reweights examples so approximately the same total loss is assigned
//! to every class.
CBoostedTreeFactory& balanceClassTrainingLoss(bool balance);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was short-lived ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I tested with and without this option and it didn't significantly alter results now that the estimated probability at which to assign to each class is floating. I still think there is mileage in getting balancing classes better, possibly by oversampling (undersampling) the minority (majority) class when downsampling for each tree, but haven't been able to get anything to work appreciably better, so just keeping it simple for the time being.

Comment on lines +778 to +780
CSolvers::maximize(0.0, 1.0, minRecall(0.0), minRecall(1.0), minRecall,
1e-3, maxIterations, threshold, minRecallAtThreshold);
LOG_TRACE(<< "threshold = " << threshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Copy link
Contributor

@valeriy42 valeriy42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I had only a minor comment, I'll go ahead an approve the PR

@tveasey
Copy link
Contributor Author

tveasey commented Jan 14, 2020

retest

1 similar comment
@tveasey
Copy link
Contributor Author

tveasey commented Jan 14, 2020

retest

@tveasey tveasey merged commit 4d33e6b into elastic:master Jan 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants