[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15299
…el rather than require sorted input
@srowen I think this PR may fail MiMa tests, since it makes binary incompatible change. The major disagreement between this and #15277 is whether to keep |
We need |
Test build #66103 has finished for PR 15299 at commit
|
What is binary incompatible here? |
Oh, |
Yeah, I left it deprecated because I really can't figure out why it was left protected. If you're OK with this, then yes, I'd like to restore these parts from the original change. I think it makes slightly more sense API-wise.
@srowen It looks strange to leave it protected, and deprecating it looks OK to me unless someone gives a reason not to. BTW, please update Python API docs to reflect that |
Hi @srowen, is @transient needed for either val selectedFeatures or val filterIndices?
I don't think it can be transient, because if serialized, this data must be part of the model. It is just an array of ints. I don't think lazy makes sense because it would just be holding on to an unsorted array to sort it later. But this is an array of, in any reasonable case, fewer than a thousand elements. Quicksorting it is trivial. Done, I added the selectedFeatures change.
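A minimal sketch of the approach described in this comment, not the actual Spark code: the model keeps the caller's array as given and eagerly derives a sorted copy at construction time. `SelectorModelSketch` and its `transform` method are hypothetical names, and plain Scala arrays stand in for Spark's vector classes.

```scala
// Hypothetical stand-in for the model; not the real ChiSqSelectorModel.
class SelectorModelSketch(val selectedFeatures: Array[Int]) extends Serializable {
  // Eagerly sorted copy used internally. Neither @transient nor lazy:
  // the array is small, so sorting it once at construction is cheap,
  // and it is serialized along with the rest of the model.
  private val filterIndices: Array[Int] = selectedFeatures.sorted

  // Keep only the values at the selected positions of a dense array,
  // in ascending index order.
  def transform(values: Array[Double]): Array[Double] =
    filterIndices.map(i => values(i))
}
```

Because `sorted` returns a new array, `selectedFeatures` still reflects the caller's original order, while all internal filtering goes through the sorted copy.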
Test build #66107 has finished for PR 15299 at commit
|
Hey folks - what exactly is the issue we're concerned about here for binary compat? SPARK-17017 is for 2.1, not branch-2.0. Is MiMa really an issue since 2.1 is not released yet? |
@MLnick This change will not break binary compatibility currently. It marks |
Ah sorry ok I was a bit confused - I thought the |
@deprecated("not intended for subclasses to use", "2.1.0")
Yeah, I also fail to see why this needs to be exposed. +1 on deprecation.
@@ -35,14 +35,15 @@ import org.apache.spark.sql.{Row, SparkSession}
 /**
  * Chi Squared selector model.
  *
- * @param selectedFeatures list of indices to select (filter). Must be ordered asc
+ * @param selectedFeatures list of indices to select (filter).
I wonder if we should say "since the model requires sorted indices, selectedFeatures will be sorted" or something - just to make it clear the model does have this requirement, but takes care of that itself?
I don't mind that, though my original theory behind this little change was that the sorting is wholly an implementation detail: callers don't need to promise anything, or be promised anything, about the order of these features. It's very small, but do we even need to promise here that these are sorted?
Merged to master |
What changes were proposed in this pull request?
Partial revert of #15277 to instead sort and store input to model rather than require sorted input
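To illustrate why a selector might want sorted indices internally without asking callers for them, here is a hedged sketch, not taken from the patch. It assumes a sparse vector given as parallel `indices` and `values` arrays with ascending indices, and a filter with no duplicate entries. `SparseFilterSketch.compress` is a hypothetical helper. It sorts the filter itself and then walks both arrays in a single merge-style pass.

```scala
object SparseFilterSketch {
  // Hypothetical helper: keep the entries of a sparse vector whose index
  // appears in `filter`, re-indexed into the compressed feature space.
  // Assumes `indices` is ascending and `filter` has no duplicates.
  def compress(
      indices: Array[Int],
      values: Array[Double],
      filter: Array[Int]): (Array[Int], Array[Double]) = {
    val sortedFilter = filter.sorted // sorted internally; callers need not pre-sort
    val newIndices = Array.newBuilder[Int]
    val newValues = Array.newBuilder[Double]
    var i = 0 // position in the sparse vector's indices
    var j = 0 // position in sortedFilter
    while (i < indices.length && j < sortedFilter.length) {
      if (indices(i) == sortedFilter(j)) {
        // Selected feature with a stored value: its new index is its rank in the filter.
        newIndices += j
        newValues += values(i)
        i += 1
        j += 1
      } else if (indices(i) < sortedFilter(j)) {
        i += 1 // stored entry is not selected
      } else {
        j += 1 // selected feature has no stored value here
      }
    }
    (newIndices.result(), newValues.result())
  }
}
```

With both arrays in ascending order, one linear pass suffices, and this is the kind of internal requirement that sorting inside the model hides from callers.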
How was this patch tested?
Existing tests.