
[ML] Multinomial logistic regression #1037

Merged: tveasey merged 24 commits into elastic:master on Mar 12, 2020

Conversation

@tveasey (Contributor) commented Mar 5, 2020

This change implements the loss function for multinomial logistic regression.

Note I've factored out the loss-function-related unit tests into their own suite. I also needed to make various changes to our online k-means implementation to support CDenseVector, which requires that the vector dimension is passed to the constructor.
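
For readers without the code in front of them, the loss in question is the softmax cross-entropy: the negative log-likelihood of an example's actual class under the softmax of the per-class predictions. Below is a minimal, self-contained sketch of that loss; it is not the PR's CBoostedTreeLoss code, and the names softmax and multinomialLogisticLoss, plus the use of std::vector in place of CDenseVector, are mine for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// softmax(p)_k = exp(p_k) / sum_j exp(p_j), computed with the usual
// max-subtraction trick for numerical stability.
std::vector<double> softmax(const std::vector<double>& p) {
    double pmax{*std::max_element(p.begin(), p.end())};
    std::vector<double> result(p.size());
    double normalizer{0.0};
    for (std::size_t k = 0; k < p.size(); ++k) {
        result[k] = std::exp(p[k] - pmax);
        normalizer += result[k];
    }
    for (auto& r : result) {
        r /= normalizer;
    }
    return result;
}

// Multinomial logistic loss for one example: -log(softmax(prediction)_actual),
// i.e. the cross-entropy between the one-hot actual class and the predicted
// class distribution.
double multinomialLogisticLoss(const std::vector<double>& prediction, std::size_t actual) {
    return -std::log(softmax(prediction)[actual]);
}
```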

@valeriy42 (Contributor) left a comment

Looks good altogether. I added a couple of minor comments. I haven't seen you normalizing the output probability vector. Did I miss it, or do you not want to normalize?

include/maths/CTools.h (review thread; outdated, resolved)
case 0: {
    // We have a member variable to avoid allocating a temporary each time.
    m_DoublePrediction = prediction;
    m_PredictionSketch.add(m_DoublePrediction, weight);

Contributor:

What is the purpose of m_PredictionSketch?

Contributor Author:

The basic idea is as follows. We want to find

weight = argmin_w { sum_{i in I} -log([softmax(prediction_i + w)]_{actual_i}) + lambda * w'w }

Here, I is the set of all training examples in the leaf and actual_i is the index of the i'th example's actual category. Rather than working with this function directly, we summarise it by the set {(x_j, c_j)}, where the x_j are points in prediction space and the c_j are the counts of the predictions in I nearest to each x_j. We choose the x_j with a k-means of the predictions in I, computed sequentially (to accommodate the case where we're using disk storage). This is what m_PredictionSketch is doing. I'll add some class documentation to explain the strategy.
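
In standard notation, writing p_i for prediction_i and a_i for actual_i, one plausible reading of the above is the following (splitting the centroid counts per class as c_{j,k} is my assumption about how the sketch is consumed, not something stated in the comment):

```latex
% Exact leaf objective, as stated in the comment above:
w^* = \operatorname*{arg\,min}_w \sum_{i \in I} -\log\big([\operatorname{softmax}(p_i + w)]_{a_i}\big) + \lambda\, w^\top w

% Sketched approximation: replace the sum over examples by a sum over
% k-means centroids x_j, weighting class k at centroid j by the count
% c_{j,k} of class-k examples whose prediction is nearest to x_j
% (the per-class split is an assumption on my part):
w^* \approx \operatorname*{arg\,min}_w \sum_j \sum_k c_{j,k}\,\big(-\log[\operatorname{softmax}(x_j + w)]_k\big) + \lambda\, w^\top w
```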

include/maths/CBoostedTreeLoss.h (review thread; resolved)
LOG_TRACE(<< "x0 = " << x0.transpose());

double loss;
CLbfgs<TDoubleVector> lbfgs{5};

Contributor:

I am probably missing something: where is the 5 coming from?

Contributor Author:

This is the rank of the Hessian approximation. Generally, people recommend not making it too small (less than 3), but you quickly get diminishing returns. For example, 5 is the default for this parameter in R's optim package. We can experiment a bit with this, but the randomised test suggests the choice isn't too bad.
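
For context on what the parameter controls: in L-BFGS it is the number m of recent (step, gradient-difference) pairs retained to build the low-rank inverse-Hessian approximation, so 5 means curvature information from the last five iterations. The following is a sketch of the textbook two-loop recursion, not the project's CLbfgs implementation; all names here are illustrative.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

using TVec = std::vector<double>;

double dot(const TVec& a, const TVec& b) {
    double d{0.0};
    for (std::size_t i = 0; i < a.size(); ++i) {
        d += a[i] * b[i];
    }
    return d;
}

// L-BFGS two-loop recursion: approximate H^{-1} g using only the last m
// (s, y) pairs, where s = parameter step and y = gradient difference.
// The memory m (the 5 in the snippet above) bounds the rank of the
// correction applied to the initial diagonal Hessian approximation.
TVec twoLoopRecursion(const TVec& g,
                      const std::deque<TVec>& s,   // steps, oldest first
                      const std::deque<TVec>& y) { // gradient diffs, oldest first
    std::size_t m{s.size()};
    TVec q{g};
    std::vector<double> alpha(m);
    for (std::size_t i = m; i-- > 0;) { // newest to oldest
        double rho{1.0 / dot(y[i], s[i])};
        alpha[i] = rho * dot(s[i], q);
        for (std::size_t k = 0; k < q.size(); ++k) {
            q[k] -= alpha[i] * y[i][k];
        }
    }
    // Scale by gamma = s'y / y'y, a standard initial Hessian guess.
    double gamma{m > 0 ? dot(s[m - 1], y[m - 1]) / dot(y[m - 1], y[m - 1]) : 1.0};
    for (auto& qk : q) {
        qk *= gamma;
    }
    for (std::size_t i = 0; i < m; ++i) { // oldest to newest
        double rho{1.0 / dot(y[i], s[i])};
        double beta{rho * dot(y[i], q)};
        for (std::size_t k = 0; k < q.size(); ++k) {
            q[k] += (alpha[i] - beta) * s[i][k];
        }
    }
    return q; // approximates H^{-1} g; the descent direction is its negation
}
```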

@tveasey (Contributor Author) commented Mar 10, 2020

Thanks for the review @valeriy42. I've been through your comments and added an explanation of the top-level strategy. Regarding:

> I haven't seen you normalizing the output probability vector. Did I miss it, or do you not want to normalize?

I don't actually normalise the values of the weights (although with "shrinkage" regularisation they shouldn't get too big). Also, I haven't yet wired this into api::CDataFrameTrainBoostedTreeClassifierRunner. This PR is already getting large, so I think I'll postpone that to the next one. I'll also finish up the last bit of loss-function testing and add a multi-class unit test. I'm therefore also removing WIP.

@tveasey removed the WIP label on Mar 10, 2020

@valeriy42 (Contributor) left a comment

LGTM. Good work. Thank you for adding extensive documentation. 👍

lib/maths/unittest/CBoostedTreeLossTest.cc (review thread; outdated, resolved)
tveasey and others added 4 commits on March 11, 2020 at 18:20.
Co-Authored-By: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
@tveasey merged commit 28148b2 into elastic:master on Mar 12, 2020
@tveasey deleted the multiclass-8 branch on March 12, 2020 at 14:42
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Mar 12, 2020
tveasey added a commit that referenced this pull request Mar 12, 2020