Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15299

Closed
wants to merge 2 commits into from

Conversation

srowen
Copy link
Member

@srowen srowen commented Sep 29, 2016

What changes were proposed in this pull request?

Partial revert of #15277 to instead sort and store input to model rather than require sorted input

How was this patch tested?

Existing tests.

@yanboliang
Copy link
Contributor

@srowen I think this PR may fail MiMa tests, since it makes binary incompatible change. The major disagreement between this and #15277 is whether to keep selectedFeatures sorted. I think both is OK and these two PRs have no difference comparing computation cost and others. So I still not prefer to make this change(due to binary incompatible), since we don't have strong requirements to make it. I'm still open to hear others' thoughts. Thanks!

@yanboliang
Copy link
Contributor

We need sort cost in any case, and put it in fit/training or model has no difference. So I think if we want to introduce this binary incompatible change, there should be strong requirements or users complain. Thanks!

@SparkQA
Copy link

SparkQA commented Sep 29, 2016

Test build #66103 has finished for PR 15299 at commit c0c0174.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Sep 29, 2016

What is binary incompatible here?

@yanboliang
Copy link
Contributor

Oh, isSorted is left and it's not introduce binary incompatible right now. Thanks for your remind. I'm neutral for this change. Thanks!

@srowen
Copy link
Member Author

srowen commented Sep 29, 2016

Yeah I left it deprecated because I really can't figure why it was left protected. If you're ok with this then yes I'd like to restore these parts from the original change. I think it makes slightly more sense API wise

@yanboliang
Copy link
Contributor

@srowen It looks strange to left it protected, and deprecating it looks ok to me except someone tells me any reason. BTW, please update Python API docs to reflect that selectedFeatures is not necessary to be sorted. Thanks.

@mpjlu
Copy link

mpjlu commented Sep 29, 2016

hi @srowen , is @transient needed for val selectedFeatures or val filterIndices, one of them?
is it good to define filterIndices lazy?

@srowen
Copy link
Member Author

srowen commented Sep 29, 2016

I don't think it can be transient, because if serialized, this data must be part of the model. It is just an array of ints. I don't think lazy makes sense because it is just holding on to an unsorted array to later sort it. But this is an array of, in any reasonable case, less than a thousand elements. quicksorting it is trivial.

Done, I added the selectedFeatures change.

@SparkQA
Copy link

SparkQA commented Sep 29, 2016

Test build #66107 has finished for PR 15299 at commit 02a2b40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Sep 30, 2016

Hey folks - what exactly is the issue we're concerned about here for binary compat? SPARK-17017 is for 2.1, not branch-2.0.

Is MiMa really an issue since 2.1 is not released yet?

@yanboliang
Copy link
Contributor

@MLnick This change will not break binary compatibility currently. It marks isSorted as deprecated and will break binary compatibility when we delete that method. This should be not a big issue then, since a deprecated method may not be used. What's your opinion about this change? Thanks.

@MLnick
Copy link
Contributor

MLnick commented Sep 30, 2016

Ah sorry ok I was a bit confused - I thought the isSorted method had only been added after 2.0, but it has been around since the beginning. It is indeed public (well protected) so is a MiMa issue if removed.


@deprecated("not intended for subclasses to use", "2.1.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also fail to see why this needs to be exposed. +1 on deprecation.

@@ -35,14 +35,15 @@ import org.apache.spark.sql.{Row, SparkSession}
/**
* Chi Squared selector model.
*
* @param selectedFeatures list of indices to select (filter). Must be ordered asc
* @param selectedFeatures list of indices to select (filter).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should say "since the model requires sorted indices, selectedFeatures will be sorted" or something - just to make it clear the model does have this requirement, but takes care of that itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind that, though, my original theory behind this little change was that the sorting is wholly an implementation detail that callers don't need to promise or be promised about these features. It's very small, but, do we need to even promise these are sorted here?

@srowen
Copy link
Member Author

srowen commented Oct 1, 2016

Merged to master

@asfgit asfgit closed this in b88cb63 Oct 1, 2016
@srowen srowen deleted the SPARK-17704.2 branch October 1, 2016 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants