[ML] Set the class assignment probability threshold to maximise minimum recall #926

tveasey · 2020-01-09T12:36:14Z

Rather than using a fixed threshold on P(class 1), this adds optional support for two strategies for assigning class labels:

Maximise accuracy (i.e. assign to class with the highest predicted probability).
Maximise the minimum recall of any class (this will be the default behaviour).

The default choice avoids pathologies for very imbalanced training data, where we can essentially assign all values to one class if we seek to maximise overall accuracy.
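To illustrate the difference between the two strategies, here is a minimal sketch for the binary case. It is not the code in this PR: the names `recalls`, `minRecall`, `accuracy` and `thresholdMaximisingMinRecall` are hypothetical, and the uniform grid scan over thresholds is an assumption made for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: per-class recall when we predict class 1 iff
// P(class 1) >= threshold. Returns {recall of class 0, recall of class 1}.
std::pair<double, double> recalls(const std::vector<double>& probabilities,
                                  const std::vector<int>& labels,
                                  double threshold) {
    assert(probabilities.size() == labels.size());
    double correct[2] = {0.0, 0.0};
    double total[2] = {0.0, 0.0};
    for (std::size_t i = 0; i < labels.size(); ++i) {
        assert(labels[i] == 0 || labels[i] == 1);
        int predicted = probabilities[i] >= threshold ? 1 : 0;
        total[labels[i]] += 1.0;
        if (predicted == labels[i]) {
            correct[labels[i]] += 1.0;
        }
    }
    // A class with no examples does not constrain the minimum, so report 1.
    return {total[0] > 0.0 ? correct[0] / total[0] : 1.0,
            total[1] > 0.0 ? correct[1] / total[1] : 1.0};
}

double minRecall(const std::vector<double>& probabilities,
                 const std::vector<int>& labels,
                 double threshold) {
    auto r = recalls(probabilities, labels, threshold);
    return std::min(r.first, r.second);
}

double accuracy(const std::vector<double>& probabilities,
                const std::vector<int>& labels,
                double threshold) {
    assert(!labels.empty() && probabilities.size() == labels.size());
    double correct = 0.0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        int predicted = probabilities[i] >= threshold ? 1 : 0;
        correct += predicted == labels[i] ? 1.0 : 0.0;
    }
    return correct / static_cast<double>(labels.size());
}

// Scan a uniform grid on [0, 1] and keep the first threshold which achieves
// the largest minimum recall. Falls back to 0.5 if nothing improves on it.
double thresholdMaximisingMinRecall(const std::vector<double>& probabilities,
                                    const std::vector<int>& labels,
                                    std::size_t steps = 1000) {
    double best = 0.5;
    double bestMinRecall = minRecall(probabilities, labels, best);
    for (std::size_t i = 0; i <= steps; ++i) {
        double threshold = static_cast<double>(i) / static_cast<double>(steps);
        double candidate = minRecall(probabilities, labels, threshold);
        if (candidate > bestMinRecall) {
            best = threshold;
            bestMinRecall = candidate;
        }
    }
    return best;
}
```

With eight class 0 examples whose P(class 1) is at most 0.3 and two class 1 examples at 0.35 and 0.45, a threshold of 0.5 gives 80% accuracy but zero recall for class 1, whereas the scan picks a threshold just above 0.3 at which both recalls are 1.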

(We also need to introduce this threshold into the inference model schema. That requires support in the Java inference code to be merged first, so the schema change will be made in a separate PR.)

valeriy42

Looks good altogether. I am just wondering if you should put the threshold member variable into CBoostedTreeImpl.

include/maths/CBoostedTree.h

valeriy42 · 2020-01-14T12:46:25Z

include/maths/CBoostedTreeFactory.h

-    //! Set whether to try and balance within class accuracy. For classification
-    //! this reweights examples so approximately the same total loss is assigned
-    //! to every class.
-    CBoostedTreeFactory& balanceClassTrainingLoss(bool balance);


This change was short-lived ;)

Indeed. I tested with and without this option and it didn't significantly alter results now that the estimated probability at which to assign to each class is floating. I still think there is mileage in balancing classes better, possibly by oversampling the minority class (or undersampling the majority class) when downsampling for each tree, but I haven't been able to get anything to work appreciably better, so I'm just keeping it simple for the time being.

valeriy42 · 2020-01-14T13:28:26Z

lib/maths/CDataFrameUtils.cc

+    CSolvers::maximize(0.0, 1.0, minRecall(0.0), minRecall(1.0), minRecall,
+                       1e-3, maxIterations, threshold, minRecallAtThreshold);
+    LOG_TRACE(<< "threshold = " << threshold
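For readers unfamiliar with this kind of call, the hunk above passes a bracket [0, 1], the objective values at its end points, the objective itself, a tolerance of 1e-3 and an iteration limit to a one-dimensional maximiser. The sketch below shows one way such a search can work. It is a generic golden section search written for illustration: `goldenSectionMaximize` is a hypothetical name, its signature is not that of `CSolvers::maximize`, and nothing here is meant to say which algorithm `CSolvers::maximize` actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical stand-in for a bracketed 1-D maximiser (not the ml-cpp API).
// Returns {argmax, max} of f on [a, b], assuming f is unimodal on [a, b].
std::pair<double, double> goldenSectionMaximize(double a,
                                                double b,
                                                const std::function<double(double)>& f,
                                                double tolerance,
                                                std::size_t maxIterations) {
    assert(a < b && tolerance > 0.0);
    const double ratio = (std::sqrt(5.0) - 1.0) / 2.0; // approximately 0.618

    // Two interior points with a < x1 < x2 < b.
    double x1 = b - ratio * (b - a);
    double x2 = a + ratio * (b - a);
    double f1 = f(x1);
    double f2 = f(x2);

    for (std::size_t i = 0; i < maxIterations && (b - a) > tolerance; ++i) {
        if (f1 < f2) {
            // The maximum lies in [x1, b]: drop [a, x1) and reuse x2 as x1.
            a = x1;
            x1 = x2;
            f1 = f2;
            x2 = a + ratio * (b - a);
            f2 = f(x2);
        } else {
            // The maximum lies in [a, x2]: drop (x2, b] and reuse x1 as x2.
            b = x2;
            x2 = x1;
            f2 = f1;
            x1 = b - ratio * (b - a);
            f1 = f(x1);
        }
    }
    return f1 > f2 ? std::make_pair(x1, f1) : std::make_pair(x2, f2);
}
```

Each iteration shrinks the bracket by a factor of about 0.618, so reaching a width of 1e-3 from [0, 1] takes roughly 15 function evaluations. Note that an empirical minimum recall curve is piecewise constant in the threshold, so ties between the two interior points are possible; this sketch simply resolves them towards the left half of the bracket.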


valeriy42

Since I had only a minor comment, I'll go ahead and approve the PR.

tveasey · 2020-01-14T16:17:05Z

retest

tveasey · 2020-01-14T17:26:40Z

retest

Set the threshold at which to assign to class one to maximize average…

3194003

… class recall by default

tveasey added >enhancement review v8.0.0 :ml/DataFrameAnalysis v7.6.0 labels Jan 9, 2020

tveasey added 5 commits January 9, 2020 13:52

Docs

ced4781

Tidy ups, fixes and some more testing

855e59e

Correct docs

09f57b4

Correct function name

45c6232

The minimum class recall has a unique maximum

94d4385

tveasey changed the title ~~[ML] Set the class assignment probability threshold to maximise average recall~~ [ML] Set the class assignment probability threshold to maximise minimum recall Jan 13, 2020

tveasey and others added 8 commits January 13, 2020 09:42

Merge branch 'master' into imbalanced-classes

43c3f59

Scale probabilities to generate scores

3640b25

Support either maximum accuracy or maximum minimum class recall

19fa26b

Missing variable initialisation

242544b

Fix and harden test

101f669

Correct test threshold

092c9b6

Tweak test

2fd513c

Merge branch 'master' into imbalanced-classes

b7c8c34

tveasey requested a review from valeriy42 January 14, 2020 11:13

tveasey added 2 commits January 14, 2020 11:42

We need protection for all shap value reads since feature selection m…

04934b3

…ight change in the future

Consistent naming

9a9fce6

valeriy42 reviewed Jan 14, 2020

valeriy42 approved these changes Jan 14, 2020

Compute decision threshold in train and cache on implementation

2e0952b

tveasey mentioned this pull request Jan 14, 2020

[ML] Disable invalid assertion elastic/elasticsearch#50986

Merged

tveasey merged commit 4d33e6b into elastic:master Jan 14, 2020

tveasey deleted the imbalanced-classes branch January 14, 2020 18:53

tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jan 14, 2020

[ML] Set the class assignment probability threshold to maximise minim…

380d58c

…um recall (elastic#926)

tveasey mentioned this pull request Jan 14, 2020

[7.6][ML] Set the class assignment probability threshold to maximise minimum recall #936

Merged

tveasey added a commit that referenced this pull request Jan 14, 2020

[7.x][ML] Set the class assignment probability threshold to maximise …

0b27e77

…minimum recall (#936) Backport #926.

This was referenced Jan 15, 2020

[ML] Correct the top class probabilities #937

Merged

[ML] Write out classification weights in the model definition #938

Merged

tveasey mentioned this pull request Dec 8, 2020

[ML] Make CDataFrameRegressionModel not assignable #557

Closed
