[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15277

yanboliang · 2016-09-28T08:10:38Z

What changes were proposed in this pull request?

Several performance improvement for ChiSqSelector:
1, Keep selectedFeatures ordered ascendent.
ChiSqSelectorModel.transform need selectedFeatures ordered to make prediction. We should sort it when training model rather than making prediction, since users usually train model once and use the model to do prediction multiple times.
2, When training fpr type ChiSqSelectorModel, it's not necessary to sort the ChiSq test result by statistic.

How was this patch tested?

Existing unit tests.

yanboliang · 2016-09-28T08:14:54Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

   */
  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
-    val orderedIndices = filterIndices.sorted


We can sort it when training rather than storing unsorted indices and resort them when applying transformation on each vector.

I think there's a better solution still -- accept any input, sort it and store the sorted array. I do want to avoid this unnecessary assertion about the input but yeah this can be optimized. I should have pushed on that in the initial review since it was my intent, and yeah it matters actually.

I just found sort in transform will sort the number of sample times. sorry @srowen

@srowen Yes, accept any input, sort it and store the sorted array will spend same computation with my change. But if we remove this requirement, it will introduce incompatible change which should update MimaExcludes to escape. I revert the change of #14597 did for MimaExcludes. I think we should not break compatibility if there is no strong requirement. Thanks.

I don't understand, this isn't incompatible. It removed a requirement, and continues to work with sorted or unsorted input. That's fine for a minor release.

Hmm, may be we can deprecate this method, but I think we should to hear more thoughts. What about to open a separate ticket to discuss it? And this one can be moved forward? Thanks.

I think it's OK to take out, but if you want to leave it, that's OK too. But why not deprecate it? I think it can be discussed here, I don't see why it's separate.

#15212 depends on this, so it's better that we can get this in ASAP. I will merge this and open ticket for the deprecating issue. Thanks.

@yanboliang there's nothing urgent about this or that change. It was not appropriate to merge this. This change undoes an improvement on an existing change several of us reviewed for weeks and isn't actually necessary. I'm going to revert and propose a change that makes less change but still enables the improvement you wanted to make.

@srowen I have send #15298 to revert this change, and looking forward your PR. Thanks.

yanboliang · 2016-09-28T08:15:58Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

+          .take((chiSqTestResult.length * percentile).toInt)
+      case ChiSqSelector.FPR =>
+        chiSqTestResult.zipWithIndex
+          .filter{ case (res, _) => res.pValue < alpha }


For fpr type, it's not necessary to compute .sortBy { case (res, _) => -res.statistic }.

It's true. Originally I though it would make sense to present the statistic to the model because it has no way to recover it, and further made sense to rank by the statistic anyway, but subsequent changes make that irrelevant.

yanboliang · 2016-09-28T08:30:35Z

cc @mpjlu @srowen @avulanov

SparkQA · 2016-09-28T09:12:21Z

Test build #66029 has finished for PR 15277 at commit 27e4710.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mpjlu · 2016-09-28T12:38:01Z

sort in the transform will cause sort too many times.
so this looks good to me. thangks

srowen · 2016-09-29T09:56:39Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

   */
  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
-    val orderedIndices = filterIndices.sorted


That's OK, but it can be left in as a deprecated method rather than used. I don't think the sorting requirement needs to be restored.

srowen · 2016-09-29T09:58:20Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

+          .take((chiSqTestResult.length * percentile).toInt)
+      case ChiSqSelector.FPR =>
+        chiSqTestResult.zipWithIndex
+          .filter{ case (res, _) => res.pValue < alpha }


It's true. Originally I though it would make sense to present the statistic to the model because it has no way to recover it, and further made sense to rank by the statistic anyway, but subsequent changes make that irrelevant.

SparkQA · 2016-09-29T10:24:46Z

Test build #66093 has finished for PR 15277 at commit 42f12ce.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-09-29T11:30:03Z

Merged into master. Thanks for all your review.

srowen · 2016-09-29T11:34:03Z

@yanboliang whoa, no I don't agree with this change. I said so above. There is an easy way to make this work as it did before. I'd appreciate it if you reverted this and addressed my comment.

yanboliang · 2016-09-29T11:40:09Z

@srowen I'm sorry for misunderstand. I'll revert it firstly and let's continue the discussion. Thanks.

srowen · 2016-09-29T11:49:08Z

OK thank you. I'd like to keep the ability to accept unsorted indices, because it really isn't something the model needs to demand of its input. It's an implementation detail. You can just sort and store the indices.

Optionally deprecating isSorted sounds good to me, but not essential. I don't see why anyone would use it though.

…el rather than require sorted input

## What changes were proposed in this pull request? Partial revert of apache#15277 to instead sort and store input to model rather than require sorted input ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes apache#15299 from srowen/SPARK-17704.2.

ChiSqSelector performance improvement.

27e4710

yanboliang commented Sep 28, 2016

View reviewed changes

yanboliang mentioned this pull request Sep 28, 2016

[SPARK-17645][MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) #15212

Closed

Remove Mima exclude.

42f12ce

srowen requested changes Sep 29, 2016

View reviewed changes

asfgit closed this in f7082ac Sep 29, 2016

yanboliang deleted the spark-17704 branch September 29, 2016 12:01

srowen added a commit to srowen/spark that referenced this pull request Sep 29, 2016

Partial revert of apache#15277 to instead sort and store input to mod…

c0c0174

…el rather than require sorted input

srowen mentioned this pull request Sep 29, 2016

[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15299

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15277

[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15277

yanboliang commented Sep 28, 2016

yanboliang Sep 28, 2016

srowen Sep 28, 2016

mpjlu Sep 28, 2016

yanboliang Sep 29, 2016

srowen Sep 29, 2016

yanboliang Sep 29, 2016

srowen Sep 29, 2016

yanboliang Sep 29, 2016

srowen Sep 29, 2016

yanboliang Sep 29, 2016

yanboliang Sep 28, 2016

srowen Sep 29, 2016

yanboliang commented Sep 28, 2016

SparkQA commented Sep 28, 2016

mpjlu commented Sep 28, 2016

srowen Sep 29, 2016

srowen Sep 29, 2016

SparkQA commented Sep 29, 2016

yanboliang commented Sep 29, 2016

srowen commented Sep 29, 2016

yanboliang commented Sep 29, 2016

srowen commented Sep 29, 2016

[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15277

[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15277

Conversation

yanboliang commented Sep 28, 2016

What changes were proposed in this pull request?

How was this patch tested?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yanboliang commented Sep 28, 2016

SparkQA commented Sep 28, 2016

mpjlu commented Sep 28, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Sep 29, 2016

yanboliang commented Sep 29, 2016

srowen commented Sep 29, 2016

yanboliang commented Sep 29, 2016

srowen commented Sep 29, 2016