Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15277

Closed
wants to merge 2 commits into from

Conversation

yanboliang
Copy link
Contributor

What changes were proposed in this pull request?

Several performance improvement for ChiSqSelector:
1, Keep selectedFeatures ordered ascendent.
ChiSqSelectorModel.transform need selectedFeatures ordered to make prediction. We should sort it when training model rather than making prediction, since users usually train model once and use the model to do prediction multiple times.
2, When training fpr type ChiSqSelectorModel, it's not necessary to sort the ChiSq test result by statistic.

How was this patch tested?

Existing unit tests.

*/
private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
val orderedIndices = filterIndices.sorted
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can sort it when training rather than storing unsorted indices and resort them when applying transformation on each vector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a better solution still -- accept any input, sort it and store the sorted array. I do want to avoid this unnecessary assertion about the input but yeah this can be optimized. I should have pushed on that in the initial review since it was my intent, and yeah it matters actually.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found sort in transform will sort the number of sample times. sorry @srowen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen Yes, accept any input, sort it and store the sorted array will spend same computation with my change. But if we remove this requirement, it will introduce incompatible change which should update MimaExcludes to escape. I revert the change of #14597 did for MimaExcludes. I think we should not break compatibility if there is no strong requirement. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, this isn't incompatible. It removed a requirement, and continues to work with sorted or unsorted input. That's fine for a minor release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, may be we can deprecate this method, but I think we should to hear more thoughts. What about to open a separate ticket to discuss it? And this one can be moved forward? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK to take out, but if you want to leave it, that's OK too. But why not deprecate it? I think it can be discussed here, I don't see why it's separate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#15212 depends on this, so it's better that we can get this in ASAP. I will merge this and open ticket for the deprecating issue. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanboliang there's nothing urgent about this or that change. It was not appropriate to merge this. This change undoes an improvement on an existing change several of us reviewed for weeks and isn't actually necessary. I'm going to revert and propose a change that makes less change but still enables the improvement you wanted to make.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen I have send #15298 to revert this change, and looking forward your PR. Thanks.

.take((chiSqTestResult.length * percentile).toInt)
case ChiSqSelector.FPR =>
chiSqTestResult.zipWithIndex
.filter{ case (res, _) => res.pValue < alpha }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fpr type, it's not necessary to compute .sortBy { case (res, _) => -res.statistic }.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true. Originally I though it would make sense to present the statistic to the model because it has no way to recover it, and further made sense to rank by the statistic anyway, but subsequent changes make that irrelevant.

@yanboliang
Copy link
Contributor Author

cc @mpjlu @srowen @avulanov

@SparkQA
Copy link

SparkQA commented Sep 28, 2016

Test build #66029 has finished for PR 15277 at commit 27e4710.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Copy link

mpjlu commented Sep 28, 2016

sort in the transform will cause sort too many times.
so this looks good to me. thangks

*/
private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
val orderedIndices = filterIndices.sorted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's OK, but it can be left in as a deprecated method rather than used. I don't think the sorting requirement needs to be restored.

.take((chiSqTestResult.length * percentile).toInt)
case ChiSqSelector.FPR =>
chiSqTestResult.zipWithIndex
.filter{ case (res, _) => res.pValue < alpha }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true. Originally I though it would make sense to present the statistic to the model because it has no way to recover it, and further made sense to rank by the statistic anyway, but subsequent changes make that irrelevant.

@SparkQA
Copy link

SparkQA commented Sep 29, 2016

Test build #66093 has finished for PR 15277 at commit 42f12ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

Merged into master. Thanks for all your review.

@asfgit asfgit closed this in f7082ac Sep 29, 2016
@srowen
Copy link
Member

srowen commented Sep 29, 2016

@yanboliang whoa, no I don't agree with this change. I said so above. There is an easy way to make this work as it did before. I'd appreciate it if you reverted this and addressed my comment.

@yanboliang
Copy link
Contributor Author

@srowen I'm sorry for misunderstand. I'll revert it firstly and let's continue the discussion. Thanks.

@srowen
Copy link
Member

srowen commented Sep 29, 2016

OK thank you. I'd like to keep the ability to accept unsorted indices, because it really isn't something the model needs to demand of its input. It's an implementation detail. You can just sort and store the indices.

Optionally deprecating isSorted sounds good to me, but not essential. I don't see why anyone would use it though.

@yanboliang yanboliang deleted the spark-17704 branch September 29, 2016 12:01
srowen added a commit to srowen/spark that referenced this pull request Sep 29, 2016
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 1, 2016
## What changes were proposed in this pull request?

Partial revert of apache#15277 to instead sort and store input to model rather than require sorted input

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#15299 from srowen/SPARK-17704.2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants