[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15299

srowen · 2016-09-29T13:31:58Z

What changes were proposed in this pull request?

Partial revert of #15277 to instead sort and store input to model rather than require sorted input

How was this patch tested?

Existing tests.
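
For illustration, the approach described above (accept indices in any order, sort once, and store the sorted copy in the model) can be sketched roughly as follows. This is a minimal sketch, not the Spark source: the class `SelectorModelSketch` and its `transform` over a plain `Array[Double]` are hypothetical stand-ins, and only the names `selectedFeatures` and `filterIndices` come from this thread.

```scala
// Hypothetical sketch, not the Spark source: a model that accepts indices in
// any order and keeps a sorted copy internally for filtering.
class SelectorModelSketch(val selectedFeatures: Array[Int]) {
  // Sorted once at construction; callers never need to pre-sort.
  private val filterIndices: Array[Int] = selectedFeatures.sorted

  // Keep only the selected positions of a dense feature array,
  // in ascending index order.
  def transform(features: Array[Double]): Array[Double] =
    filterIndices.map(i => features(i))
}

object SelectorModelSketchDemo {
  def main(args: Array[String]): Unit = {
    val model = new SelectorModelSketch(Array(3, 0, 2))
    println(model.transform(Array(10.0, 11.0, 12.0, 13.0)).mkString(", "))
    // prints "10.0, 12.0, 13.0"
  }
}
```

Because the sort happens once in the constructor, callers can pass indices in whatever order feature selection produced them, and the array they passed is left untouched.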

yanboliang · 2016-09-29T14:03:04Z

@srowen I think this PR may fail the MiMa tests, since it makes a binary-incompatible change. The main disagreement between this and #15277 is whether to keep selectedFeatures sorted. I think both are OK, and the two PRs don't differ in computation cost or otherwise. So I would still prefer not to make this change (because of the binary incompatibility), since we don't have a strong requirement for it. I'm still open to hearing others' thoughts. Thanks!

yanboliang · 2016-09-29T14:07:27Z

We pay the sort cost in any case, and it makes no difference whether it happens in fit/training or in the model. So I think if we want to introduce this binary-incompatible change, there should be a strong requirement or user complaints. Thanks!

SparkQA · 2016-09-29T14:32:35Z

Test build #66103 has finished for PR 15299 at commit c0c0174.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-29T14:32:37Z

What is binary incompatible here?

yanboliang · 2016-09-29T14:41:04Z

Oh, isSorted is still there, so this doesn't introduce a binary incompatibility right now. Thanks for the reminder. I'm neutral on this change. Thanks!

srowen · 2016-09-29T14:43:45Z

Yeah, I left it deprecated because I really can't figure out why it was left protected. If you're OK with this, then yes, I'd like to restore these parts from the original change. I think it makes slightly more sense API-wise.

yanboliang · 2016-09-29T15:03:08Z

@srowen It does look strange that it was left protected, and deprecating it looks OK to me unless someone gives a reason otherwise. BTW, please update the Python API docs to reflect that selectedFeatures does not need to be sorted. Thanks.

mpjlu · 2016-09-29T15:19:11Z

Hi @srowen, is @transient needed for either val selectedFeatures or val filterIndices?
Would it be good to define filterIndices as lazy?

srowen · 2016-09-29T15:23:02Z

I don't think it can be transient, because if the model is serialized, this data must be part of it. It is just an array of ints. I don't think lazy makes sense either, because the model would just be holding on to an unsorted array to sort it later. And this is an array of, in any reasonable case, fewer than a thousand elements. Quicksorting it is trivial.

Done, I added the selectedFeatures change.

SparkQA · 2016-09-29T16:24:35Z

Test build #66107 has finished for PR 15299 at commit 02a2b40.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-09-30T08:52:32Z

Hey folks - what exactly is the issue we're concerned about here for binary compat? SPARK-17017 is for 2.1, not branch-2.0.

Is MiMa really an issue since 2.1 is not released yet?

yanboliang · 2016-09-30T09:34:59Z

@MLnick This change does not break binary compatibility for now. It marks isSorted as deprecated, and binary compatibility will break only when we delete that method. That should not be a big issue then, since a deprecated method is unlikely to be in use. What's your opinion on this change? Thanks.

MLnick · 2016-09-30T10:40:42Z

Ah sorry, OK, I was a bit confused. I thought the isSorted method had only been added after 2.0, but it has been around since the beginning. It is indeed public (well, protected), so it is a MiMa issue if removed.

MLnick · 2016-09-30T10:43:58Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala


+  @deprecated("not intended for subclasses to use", "2.1.0")


Yeah, I also fail to see why this needs to be exposed. +1 on deprecation.

MLnick · 2016-09-30T10:47:51Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

@@ -35,14 +35,15 @@ import org.apache.spark.sql.{Row, SparkSession}
 /**
 * Chi Squared selector model.
 *
- * @param selectedFeatures list of indices to select (filter). Must be ordered asc
+ * @param selectedFeatures list of indices to select (filter).


I wonder if we should say "since the model requires sorted indices, selectedFeatures will be sorted" or something - just to make it clear the model does have this requirement, but takes care of that itself?

srowen · 2016-09-30

I don't mind that, though my original theory behind this little change was that the sorting is wholly an implementation detail, so callers don't need to promise, or be promised, anything about the order of these features. It's very small, but do we even need to promise these are sorted here?
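
As background for why sorted order can be treated as an internal detail: one plausible reason a selector wants sorted indices is that it can then filter a sparse vector with a single linear merge over two ascending index arrays. The sketch below assumes that motivation. `SparseFilterSketch` and `compress` are hypothetical names used here for illustration, not the Spark API, and both index arrays are assumed to be sorted ascending.

```scala
// Hypothetical helper, not the Spark source.
// Precondition: `indices` and `filterIndices` are both sorted ascending.
object SparseFilterSketch {
  /** Keeps only the sparse entries whose index appears in filterIndices.
    * Output indices are positions within filterIndices. */
  def compress(
      indices: Array[Int],
      values: Array[Double],
      filterIndices: Array[Int]): (Array[Int], Array[Double]) = {
    val newIndices = Array.newBuilder[Int]
    val newValues = Array.newBuilder[Double]
    var i = 0 // position in the sparse vector's indices
    var j = 0 // position in filterIndices, which is also the output index
    while (i < indices.length && j < filterIndices.length) {
      if (indices(i) == filterIndices(j)) {
        // Selected feature with a stored (non-zero) value: keep it.
        newIndices += j
        newValues += values(i)
        i += 1
        j += 1
      } else if (indices(i) < filterIndices(j)) {
        i += 1 // stored value belongs to a feature that was not selected
      } else {
        j += 1 // selected feature has no stored value in this vector
      }
    }
    (newIndices.result(), newValues.result())
  }
}
```

A merge like this only works on sorted input, which is why a model that sorts its own copy of the indices does not need callers to sort theirs.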

srowen · 2016-10-01T20:11:16Z

Merged to master

Partial revert of apache#15277 to instead sort and store input to model rather than require sorted input

c0c0174

srowen mentioned this pull request Sep 29, 2016

Revert "[SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement." #15298

Closed

Fix selectedFeatures doc

02a2b40

MLnick reviewed Sep 30, 2016


asfgit closed this in b88cb63 Oct 1, 2016

srowen deleted the SPARK-17704.2 branch October 1, 2016 20:24
