[ML] Multinomial logistic regression #1037
Conversation
Looks good altogether. I added a couple of minor comments. I haven't seen you normalizing the output probability vector. Did I miss it, or do you not want to normalize?
case 0: {
    // We have a member variable to avoid allocating a temporary each time.
    m_DoublePrediction = prediction;
    m_PredictionSketch.add(m_DoublePrediction, weight);
What is the purpose of m_PredictionSketch?
The basic idea is as follows. We want to find the weight = argmin_w { sum_{i in I} -log([softmax(prediction_i + w)]_{actual_i}) + lambda * w'w }. Here, I is the set of all training examples in the leaf and actual_i is the index of the i'th example's actual category. Rather than working with this function directly, we summarise it by the set {(x_j, c_j)}, where the x_j are some points in prediction space and the c_j are the counts of the predictions in I which are nearest to each x_j. We use a k-means of the predictions in I to choose the x_j; this is calculated sequentially (to accommodate the case where we're using disk storage). This is what m_PredictionSketch is doing. I'll add some class documentation to explain the strategy.
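To make the strategy concrete, here's a minimal sketch of how the summarised objective could be evaluated. Note one assumption of the sketch not spelled out above: each centroid x_j carries a per-class count c_{j,k} (the number of class-k examples whose nearest centroid is x_j), and Eigen is used for the vector arithmetic.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <vector>

using Vector = Eigen::VectorXd;

// Hypothetical summary of one leaf: centroids of the predictions together
// with per-class counts. The per-class split is this sketch's assumption;
// the explanation above only mentions counts c_j per centroid.
struct Centroid {
    Vector prediction;  // x_j, a point in prediction space
    Vector classCounts; // c_{j,k}, count of class-k examples nearest to x_j
};

// loss(w) = sum_j sum_k c_{j,k} * -log([softmax(x_j + w)]_k) + lambda * w'w
double objective(const std::vector<Centroid>& sketch, const Vector& w, double lambda) {
    double loss = lambda * w.dot(w);
    for (const auto& centroid : sketch) {
        Vector z = centroid.prediction + w;
        // Stable log-sum-exp: log(sum_k exp(z_k)).
        double zmax = z.maxCoeff();
        double logZ = zmax + std::log((z.array() - zmax).exp().sum());
        // -log [softmax(z)]_k = logZ - z_k, weighted by the class counts.
        loss += centroid.classCounts.dot((logZ - z.array()).matrix());
    }
    return loss;
}
```

The payoff is that the optimiser only ever touches a handful of centroids per evaluation rather than every example in the leaf.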
LOG_TRACE(<< "x0 = " << x0.transpose());

double loss;
CLbfgs<TDoubleVector> lgbfs{5};
I am probably missing something: where is the 5 coming from?
This is the rank of the Hessian approximation. Generally, people recommend not going too small (less than 3), but you quickly get diminishing returns. For example, 5 is the default for this parameter in R's optim package. We can experiment a bit with this, but this randomised test suggests the choice isn't too bad.
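To illustrate what the parameter controls, here's a minimal sketch of the standard L-BFGS two-loop recursion (a generic illustration, not ml-cpp's CLbfgs): the history size, 5 above, caps how many curvature pairs (s_i, y_i) are retained and hence the rank of the implicit inverse Hessian approximation.

```cpp
#include <Eigen/Dense>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

using Vector = Eigen::VectorXd;
// Curvature pairs (s_i, y_i) with s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i;
// at most m = 5 pairs are kept, oldest first.
using THistory = std::deque<std::pair<Vector, Vector>>;

// Maps the current gradient q = g to an approximate Newton step H^{-1} g
// using only the stored pairs.
Vector twoLoopRecursion(const THistory& history, Vector q) {
    std::vector<double> alpha(history.size());
    // First loop: newest pair to oldest, removing each pair's contribution.
    for (int i = static_cast<int>(history.size()) - 1; i >= 0; --i) {
        const auto& [s, y] = history[i];
        alpha[i] = s.dot(q) / y.dot(s);
        q -= alpha[i] * y;
    }
    // Scale by a diagonal initial estimate of the inverse Hessian.
    if (!history.empty()) {
        const auto& [s, y] = history.back();
        q *= s.dot(y) / y.dot(y);
    }
    // Second loop: oldest pair to newest, adding the contributions back.
    for (std::size_t i = 0; i < history.size(); ++i) {
        const auto& [s, y] = history[i];
        double beta = y.dot(q) / y.dot(s);
        q += (alpha[i] - beta) * s;
    }
    return q; // the step direction is -q
}
```

Each stored pair costs O(n) memory and a couple of extra dot products per iteration, which is consistent with the diminishing-returns observation above.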
Thanks for the review @valeriy42. I've been through your comments and added an explanation of the top level strategy. Regarding the question about normalizing the output probability vector: I don't actually normalise the values of the weights (although with "shrinkage" regularisation they shouldn't get too big). Also, I haven't actually wired this in to
LGTM. Good work. Thank you for adding extensive documentation. 👍
This change implements the loss function for multinomial logistic regression.
Note I've factored out the loss function related unit tests into their own suite. I also needed to make various changes to our online k-means implementation to support CDenseVector, which requires that the vector dimension is passed to the constructor.
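For reference, here's a minimal generic sketch (not the new loss class itself) of the two quantities such a loss function has to provide: the per-example value -log([softmax(prediction)]_actual) and its gradient with respect to the raw predictions, which is the standard softmax(prediction) - onehot(actual).

```cpp
#include <Eigen/Dense>
#include <cmath>

using Vector = Eigen::VectorXd;

// softmax(z)_k = exp(z_k) / sum_l exp(z_l), with the max subtracted for
// numerical stability.
Vector softmax(const Vector& z) {
    Vector p = (z.array() - z.maxCoeff()).exp().matrix();
    return p / p.sum();
}

// Per-example loss: -log([softmax(prediction)]_actual), via stable log-sum-exp.
double logLoss(const Vector& prediction, int actual) {
    double zmax = prediction.maxCoeff();
    double logZ = zmax + std::log((prediction.array() - zmax).exp().sum());
    return logZ - prediction(actual);
}

// Gradient w.r.t. the raw predictions: softmax(prediction) - onehot(actual).
Vector logLossGradient(const Vector& prediction, int actual) {
    Vector g = softmax(prediction);
    g(actual) -= 1.0;
    return g;
}
```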