-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23265][ML]Update multi-column error handling logic in QuantileDiscretizer #20442
Conversation
Test build #86845 has finished for PR 20442 at commit
|
Test build #86848 has finished for PR 20442 at commit
|
) | ||
@Since("1.6.0") | ||
override def transformSchema(schema: StructType): StructType = { | ||
ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting numBucketsArray
when single-column can be an error. Since checkSingleVsMultiColumnParams
doesn't support this usage, I think we may need to check it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment. I will add the check.
($(inputCols), $(outputCols)) | ||
if (isSet(inputCols)) { | ||
require(getInputCols.length == getOutputCols.length, | ||
s"QuantileDiscretizer $this has mismatched Params " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this
need to be in output? Or just The QuantileDiscretizer has ...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason I have $this is because Bucketizer has $this and I am trying to be consistent with Bucketizer implementation.
if (isSet(inputCols)) {
require(getInputCols.length == getOutputCols.length &&
getInputCols.length == getSplitsArray.length, s"Bucketizer $this has mismatched Params " +
s"for multi-column transform. Params (inputCols, outputCols, splitsArray) should have " +
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
Test build #86864 has finished for PR 20442 at commit
|
"inputCols number do not match outputCols") | ||
($(inputCols), $(outputCols)) | ||
require(!isSet(numBucketsArray), | ||
s"numBucketsArray can't be set for single-column QuantileDiscretizer.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check if numBucketsArray
and numBuckets
are set at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about if I should add this check when I changed the code yesterday:
If both numBucketsArray and numBuckets are set, the current code will only take numBucketsArray. Also, numBuckets always has a default value even if it's not set. So yesterday I decided not to add the check.
But I guess it's better to tight the code to make user not set numBuckets explicitly when numBucketsArray is set. I will make the change to add the check.
Test build #86886 has finished for PR 20442 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. We could potentially slightly clean up the error messages in transformSchema
but the behavior is what we want.
Given 2.3.0 RC4 is imminent we can go ahead to merge as is.
I'm re-running tests since the last run is very stale, but +1 for getting this into RC4! |
Test build #4099 has finished for PR 20442 at commit
|
|
Yeah, this is a strong reason to separate explicitly set and default Params in the near future. Let's not block 2.3 on this PR. If you still want to try for 2.3, then I vote for option 1 but don't have a strong preference. But I'm tempted to say we should just wait for 2.4 and do the long-term fix for Params. |
Thanks for the comments. I am in China now for Chinese New Year. Will address the comments when I get back to work on 2/21. |
Sorry for not working on this earlier. Just came back from China yesterday morning. |
Test build #89820 has finished for PR 20442 at commit
|
@huaxingao Thanks for this follow-up! I realized that #19715 introduced a breaking change which we missed in Spark 2.3 QA: In Spark 2.2, a user could set inputCol but not set outputCol (since outputCol has a default value). The new check causes such user code to start failing in Spark 2.3. Since you're already working on this follow-up, would you mind adding a unit test which checks this? (Setting inputCol but not outputCol, and making sure that works.) |
@jkbradley test added. Could you please review? |
Test build #91801 has finished for PR 20442 at commit
|
Any more comments? @MLnick @jkbradley |
Test build #96152 has finished for PR 20442 at commit
|
Test build #97718 has finished for PR 20442 at commit
|
Test build #97702 has finished for PR 20442 at commit
|
Test build #97744 has finished for PR 20442 at commit
|
Test build #97777 has finished for PR 20442 at commit
|
retest this please |
Test build #99014 has finished for PR 20442 at commit
|
jenkins retest this please |
Test build #110145 has finished for PR 20442 at commit
|
) | ||
@Since("1.6.0") | ||
override def transformSchema(schema: StructType): StructType = { | ||
ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkSingleVsMultiColumnParams
can used like ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol, splits), Seq(outputCols, splitsArray))
.
If we want numBuckets
and numBucketsArray
to be exclusively set, you can use checkSingleVsMultiColumnParams
like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya for your quick reply!
The reason I didn't use
ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol, numBuckets),
Seq(outputCols, numBucketsArray))
is that we can actually setNumBuckets for multi columns. I looked the previous conversion, we have decided to allow setNumBuckets for multi columns. In the multi columns case
If however the numBucketsArray param is unset but the numBuckets param is set,
the user is saying they want the same numBuckets across all columns, then we can
use the multi-column version of approxQuantiles in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see. thanks!
If no more comments, I will merge this tomorrow. |
retest this please |
Test build #110380 has finished for PR 20442 at commit
|
Thanks! Merged to master. |
…eDiscretizer ## What changes were proposed in this pull request? SPARK-22799 added more comprehensive error logic for Bucketizer. This PR is to update QuantileDiscretizer match the new error logic in Bucketizer. ## How was this patch tested? Add new unit test. Closes apache#20442 from huaxingao/spark-23265. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
What changes were proposed in this pull request?
SPARK-22799 added more comprehensive error logic for Bucketizer. This PR is to update QuantileDiscretizer match the new error logic in Bucketizer.
How was this patch tested?
Add new unit test.