New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15277
Conversation
*/ | ||
private def compress(features: Vector, filterIndices: Array[Int]): Vector = { | ||
val orderedIndices = filterIndices.sorted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can sort it when training rather than storing unsorted indices and resort them when applying transformation on each vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a better solution still -- accept any input, sort it and store the sorted array. I do want to avoid this unnecessary assertion about the input but yeah this can be optimized. I should have pushed on that in the initial review since it was my intent, and yeah it matters actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found sort in transform will sort the number of sample times. sorry @srowen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen Yes, accept any input, sort it and store the sorted array will spend same computation with my change. But if we remove this requirement, it will introduce incompatible change which should update MimaExcludes
to escape. I revert the change of #14597 did for MimaExcludes
. I think we should not break compatibility if there is no strong requirement. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, this isn't incompatible. It removed a requirement, and continues to work with sorted or unsorted input. That's fine for a minor release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, may be we can deprecate this method, but I think we should to hear more thoughts. What about to open a separate ticket to discuss it? And this one can be moved forward? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to take out, but if you want to leave it, that's OK too. But why not deprecate it? I think it can be discussed here, I don't see why it's separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#15212 depends on this, so it's better that we can get this in ASAP. I will merge this and open ticket for the deprecating issue. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanboliang there's nothing urgent about this or that change. It was not appropriate to merge this. This change undoes an improvement on an existing change several of us reviewed for weeks and isn't actually necessary. I'm going to revert and propose a change that makes less change but still enables the improvement you wanted to make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.take((chiSqTestResult.length * percentile).toInt) | ||
case ChiSqSelector.FPR => | ||
chiSqTestResult.zipWithIndex | ||
.filter{ case (res, _) => res.pValue < alpha } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For fpr
type, it's not necessary to compute .sortBy { case (res, _) => -res.statistic }
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true. Originally I though it would make sense to present the statistic to the model because it has no way to recover it, and further made sense to rank by the statistic anyway, but subsequent changes make that irrelevant.
Test build #66029 has finished for PR 15277 at commit
|
sort in the transform will cause sort too many times. |
*/ | ||
private def compress(features: Vector, filterIndices: Array[Int]): Vector = { | ||
val orderedIndices = filterIndices.sorted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's OK, but it can be left in as a deprecated method rather than used. I don't think the sorting requirement needs to be restored.
.take((chiSqTestResult.length * percentile).toInt) | ||
case ChiSqSelector.FPR => | ||
chiSqTestResult.zipWithIndex | ||
.filter{ case (res, _) => res.pValue < alpha } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true. Originally I though it would make sense to present the statistic to the model because it has no way to recover it, and further made sense to rank by the statistic anyway, but subsequent changes make that irrelevant.
Test build #66093 has finished for PR 15277 at commit
|
Merged into master. Thanks for all your review. |
@yanboliang whoa, no I don't agree with this change. I said so above. There is an easy way to make this work as it did before. I'd appreciate it if you reverted this and addressed my comment. |
@srowen I'm sorry for misunderstand. I'll revert it firstly and let's continue the discussion. Thanks. |
OK thank you. I'd like to keep the ability to accept unsorted indices, because it really isn't something the model needs to demand of its input. It's an implementation detail. You can just sort and store the indices. Optionally deprecating isSorted sounds good to me, but not essential. I don't see why anyone would use it though. |
…el rather than require sorted input
## What changes were proposed in this pull request? Partial revert of apache#15277 to instead sort and store input to model rather than require sorted input ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes apache#15299 from srowen/SPARK-17704.2.
What changes were proposed in this pull request?
Several performance improvement for
ChiSqSelector
:1, Keep
selectedFeatures
ordered ascendent.ChiSqSelectorModel.transform
needselectedFeatures
ordered to make prediction. We should sort it when training model rather than making prediction, since users usually train model once and use the model to do prediction multiple times.2, When training
fpr
typeChiSqSelectorModel
, it's not necessary to sort the ChiSq test result by statistic.How was this patch tested?
Existing unit tests.