[ML] Multinomial logistic regression #1037
Conversation
Looks good altogether. I added a couple of minor comments. I haven't seen you normalizing the output probability vector. Did I miss it, or do you not want to normalize?
case 0: {
    // We have a member variable to avoid allocating a temporary each time.
    m_DoublePrediction = prediction;
    m_PredictionSketch.add(m_DoublePrediction, weight);
What is the purpose of m_PredictionSketch?
The basic idea is as follows. We want to find the weight = argmin_w { sum_{i in I} -log([softmax(prediction_i + w)]_{actual_i}) + lambda * w'w }. Here, I is the set of all training examples in the leaf and actual_i is the index of the i'th example's actual category. Rather than working with this function directly, we summarise it by the set {(x_j, c_j)}, where the x_j are some points in prediction space and the c_j are the counts of the predictions in I which are nearest to each x_j. We use a k-means of the predictions in I to choose the x_j; this is calculated sequentially (to accommodate the case where we're using disk storage). This is what m_PredictionSketch is doing. I'll add some class documentation to explain the strategy.
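To make the strategy concrete, here's a minimal sketch of how the summarised objective could be evaluated. Note one assumption of the sketch not spelled out above: each centroid x_j carries a per-class count c_{j,k} (the number of class-k examples whose nearest centroid is x_j), and Eigen is used for the vector arithmetic.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <vector>

using Vector = Eigen::VectorXd;

// Hypothetical summary of one leaf: centroids of the predictions together
// with per-class counts. The per-class split is this sketch's assumption;
// the explanation above only mentions counts c_j per centroid.
struct Centroid {
    Vector prediction;  // x_j, a point in prediction space
    Vector classCounts; // c_{j,k}, count of class-k examples nearest to x_j
};

// loss(w) = sum_j sum_k c_{j,k} * -log([softmax(x_j + w)]_k) + lambda * w'w
double objective(const std::vector<Centroid>& sketch, const Vector& w, double lambda) {
    double loss = lambda * w.dot(w);
    for (const auto& centroid : sketch) {
        Vector z = centroid.prediction + w;
        // Stable log-sum-exp: log(sum_k exp(z_k)).
        double zmax = z.maxCoeff();
        double logZ = zmax + std::log((z.array() - zmax).exp().sum());
        // -log [softmax(z)]_k = logZ - z_k, weighted by the class counts.
        loss += centroid.classCounts.dot((logZ - z.array()).matrix());
    }
    return loss;
}
```

The payoff is that the optimiser only ever touches a handful of centroids per evaluation rather than every example in the leaf.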
LOG_TRACE(<< "x0 = " << x0.transpose());

double loss;
CLbfgs<TDoubleVector> lgbfs{5};
I am probably missing something: where is the 5 coming from?
This is the rank of the Hessian approximation. Generally, people recommend not going too small (less than 3), but you quickly get diminishing returns. For example, 5 is the default for this parameter in R's optim package. We can experiment a bit with this, but this randomised test suggests the choice isn't too bad.
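To illustrate what the parameter controls, here's a minimal sketch of the standard L-BFGS two-loop recursion (a generic illustration, not ml-cpp's CLbfgs): the history size, 5 above, caps how many curvature pairs (s_i, y_i) are retained and hence the rank of the implicit inverse Hessian approximation.

```cpp
#include <Eigen/Dense>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

using Vector = Eigen::VectorXd;
// Curvature pairs (s_i, y_i) with s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i;
// at most m = 5 pairs are kept, oldest first.
using THistory = std::deque<std::pair<Vector, Vector>>;

// Maps the current gradient q = g to an approximate Newton step H^{-1} g
// using only the stored pairs.
Vector twoLoopRecursion(const THistory& history, Vector q) {
    std::vector<double> alpha(history.size());
    // First loop: newest pair to oldest, removing each pair's contribution.
    for (int i = static_cast<int>(history.size()) - 1; i >= 0; --i) {
        const auto& [s, y] = history[i];
        alpha[i] = s.dot(q) / y.dot(s);
        q -= alpha[i] * y;
    }
    // Scale by a diagonal initial estimate of the inverse Hessian.
    if (!history.empty()) {
        const auto& [s, y] = history.back();
        q *= s.dot(y) / y.dot(y);
    }
    // Second loop: oldest pair to newest, adding the contributions back.
    for (std::size_t i = 0; i < history.size(); ++i) {
        const auto& [s, y] = history[i];
        double beta = y.dot(q) / y.dot(s);
        q += (alpha[i] - beta) * s;
    }
    return q; // the step direction is -q
}
```

Each stored pair costs O(n) memory and a couple of extra dot products per iteration, which is consistent with the diminishing-returns observation above.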
Thanks for the review @valeriy42. I've been through your comments and added an explanation of the top level strategy. Regarding the question about normalizing the output probability vector: I don't actually normalise the values of the weights (although with "shrinkage" regularisation they shouldn't get too big). Also, I haven't actually wired this in to
LGTM. Good work. Thank you for adding extensive documentation. 👍
This change implements the loss function for multinomial logistic regression.
Note I've factored out the loss function related unit tests into their own suite. I also needed to make various changes to our online k-means implementation to support CDenseVector, which requires that the vector dimension is passed to the constructor.
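For reference, here's a minimal generic sketch (not the new loss class itself) of the two quantities such a loss function has to provide: the per-example value -log([softmax(prediction)]_actual) and its gradient with respect to the raw predictions, which is the standard softmax(prediction) - onehot(actual).

```cpp
#include <Eigen/Dense>
#include <cmath>

using Vector = Eigen::VectorXd;

// softmax(z)_k = exp(z_k) / sum_l exp(z_l), with the max subtracted for
// numerical stability.
Vector softmax(const Vector& z) {
    Vector p = (z.array() - z.maxCoeff()).exp().matrix();
    return p / p.sum();
}

// Per-example loss: -log([softmax(prediction)]_actual), via stable log-sum-exp.
double logLoss(const Vector& prediction, int actual) {
    double zmax = prediction.maxCoeff();
    double logZ = zmax + std::log((prediction.array() - zmax).exp().sum());
    return logZ - prediction(actual);
}

// Gradient w.r.t. the raw predictions: softmax(prediction) - onehot(actual).
Vector logLossGradient(const Vector& prediction, int actual) {
    Vector g = softmax(prediction);
    g(actual) -= 1.0;
    return g;
}
```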