[SPARK-23265][ML]Update multi-column error handling logic in QuantileDiscretizer #20442

huaxingao · 2018-01-30T20:06:12Z

What changes were proposed in this pull request?

SPARK-22799 added more comprehensive error logic for Bucketizer. This PR updates QuantileDiscretizer to match the new error logic in Bucketizer.

How was this patch tested?

Add new unit test.

SparkQA · 2018-01-30T21:13:40Z

Test build #86845 has finished for PR 20442 at commit 1367ef0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-30T23:59:12Z

Test build #86848 has finished for PR 20442 at commit b35563a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-01-31T02:33:52Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

-    )
+  @Since("1.6.0")
+  override def transformSchema(schema: StructType): StructType = {
+    ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol),


Setting numBucketsArray in the single-column case can be an error. Since checkSingleVsMultiColumnParams doesn't cover this case, I think we may need to check it here.

Thanks for your comment. I will add the check.

viirya · 2018-01-31T03:27:28Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

-      ($(inputCols), $(outputCols))
+    if (isSet(inputCols)) {
+      require(getInputCols.length == getOutputCols.length,
+        s"QuantileDiscretizer $this has mismatched Params " +


Does $this need to be in the output? Or just "The QuantileDiscretizer has ..."?

The only reason I have $this is because Bucketizer has $this and I am trying to be consistent with Bucketizer implementation.

    if (isSet(inputCols)) {
      require(getInputCols.length == getOutputCols.length &&
        getInputCols.length == getSplitsArray.length, s"Bucketizer $this has mismatched Params " +
          s"for multi-column transform. Params (inputCols, outputCols, splitsArray) should have " +

SparkQA · 2018-01-31T06:55:19Z

Test build #86864 has finished for PR 20442 at commit dfaad52.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-01-31T08:36:33Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

-        "inputCols number do not match outputCols")
-      ($(inputCols), $(outputCols))
+      require(!isSet(numBucketsArray),
+        s"numBucketsArray can't be set for single-column QuantileDiscretizer.")


Should we check if numBucketsArray and numBuckets are set at the same time?

I considered adding this check when I changed the code yesterday:
If both numBucketsArray and numBuckets are set, the current code only uses numBucketsArray. Also, numBuckets always has a default value even if it's not set, so yesterday I decided not to add the check.
But I guess it's better to tighten the code so that users can't set numBuckets explicitly when numBucketsArray is set. I will add the check.
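
The distinction between "explicitly set" and "has a default" is what makes this check workable. Here is a minimal sketch of the rules discussed in this thread. `Settings` and `validate` are hypothetical stand-ins written for illustration, not Spark's `Params` API, and the second error message is made up; only the first message comes from the diff above.

```scala
object NumBucketsCheck {
  // Hypothetical, simplified model: explicit values and defaults are tracked separately.
  final case class Settings(
      explicitlySet: Set[String],
      defaults: Set[String] = Set("numBuckets", "outputCol")) {
    // True only when the user set the param explicitly.
    def isSet(name: String): Boolean = explicitlySet.contains(name)
    // True when the param has any value, explicit or default.
    def isDefined(name: String): Boolean = isSet(name) || defaults.contains(name)
  }

  // Left(message) when the combination is rejected, Right(()) otherwise.
  def validate(s: Settings): Either[String, Unit] =
    if (!s.isSet("inputCols") && s.isSet("numBucketsArray"))
      Left("numBucketsArray can't be set for single-column QuantileDiscretizer.")
    else if (s.isSet("inputCols") && s.isSet("numBucketsArray") && s.isSet("numBuckets"))
      Left("numBuckets and numBucketsArray can't both be set.") // illustrative wording
    else
      Right(())
}
```

Because the check uses `isSet` rather than `isDefined`, a numBuckets value that only comes from the default does not trigger the error.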

SparkQA · 2018-01-31T19:52:38Z

Test build #86886 has finished for PR 20442 at commit 776a179.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick

This LGTM. We could potentially slightly clean up the error messages in transformSchema but the behavior is what we want.

Given 2.3.0 RC4 is imminent we can go ahead to merge as is.

jkbradley · 2018-02-16T22:53:39Z

I'm re-running tests since the last run is very stale, but +1 for getting this into RC4!

SparkQA · 2018-02-16T23:28:06Z

Test build #4099 has finished for PR 20442 at commit 776a179.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-02-17T02:50:09Z

numBuckets is a param with a default value and can cause a persistence bug too if we add multi-column error handling logic. I think we have two options:

1. Ignore numBuckets when inputCols and numBucketsArray are set. Don't raise an error if it is set.
2. Similar to outputCol, also skip the default value of numBuckets if inputCols is set when saving the metadata.
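
To illustrate the concern, here is a rough sketch of how a default value could turn into an explicitly set one across a save/load round trip. It assumes the writer stores defaults merged with explicit values and the reader re-applies every stored entry as explicit; that is an assumption about the persistence code at the time, and all names below (`Params`, `save`, `load`, `saveSkippingDefault`) are hypothetical, not Spark APIs.

```scala
object DefaultParamRoundTrip {
  // Hypothetical model of a param container.
  final case class Params(explicitlySet: Map[String, String], defaults: Map[String, String]) {
    def isSet(name: String): Boolean = explicitlySet.contains(name)
  }

  // Assumed save behaviour: defaults and explicit values end up in one map.
  def save(p: Params): Map[String, String] = p.defaults ++ p.explicitlySet

  // Assumed load behaviour: every saved entry comes back as an explicit value.
  def load(saved: Map[String, String], defaults: Map[String, String]): Params =
    Params(explicitlySet = saved, defaults = defaults)

  // Second option above: drop the numBuckets default when inputCols is set.
  def saveSkippingDefault(p: Params): Map[String, String] =
    if (p.isSet("inputCols") && !p.isSet("numBuckets")) save(p) - "numBuckets" else save(p)

  // The condition a strict multi-column check would reject.
  def bothSet(p: Params): Boolean = p.isSet("numBuckets") && p.isSet("numBucketsArray")
}
```

Under these assumptions, an estimator with only inputCols and numBucketsArray set passes the check before saving but fails it after a plain `save`/`load`, while `saveSkippingDefault` keeps it valid.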

jkbradley · 2018-02-17T04:06:21Z

Yeah, this is a strong reason to separate explicitly set and default Params in the near future. Let's not block 2.3 on this PR. If you still want to try for 2.3, then I vote for option 1 but don't have a strong preference. But I'm tempted to say we should just wait for 2.4 and do the long-term fix for Params.

huaxingao · 2018-02-18T01:36:20Z

Thanks for the comments. I am in China now for Chinese New Year. Will address the comments when I get back to work on 2/21.

huaxingao · 2018-02-22T00:26:19Z

Sorry for not working on this earlier. Just came back from China yesterday morning.
Not sure if 2.3 RC4 has already been cut. If this still needs to be merged in 2.3, please let me know and I will take option 1. Otherwise, I will wait for 2.4. I saw viirya already has a fix to separate explicitly set and default params in SPARK-23455.
Thanks all for your help!

SparkQA · 2018-04-25T06:28:59Z

Test build #89820 has finished for PR 20442 at commit db35f61.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2018-06-13T20:48:26Z

@huaxingao Thanks for this follow-up! I realized that #19715 introduced a breaking change which we missed in Spark 2.3 QA: In Spark 2.2, a user could set inputCol but not set outputCol (since outputCol has a default value). The new check causes such user code to start failing in Spark 2.3.

Since you're already working on this follow-up, would you mind adding a unit test which checks this? (Setting inputCol but not outputCol, and making sure that works.)

huaxingao · 2018-06-13T23:32:11Z

@jkbradley test added. Could you please review?

SparkQA · 2018-06-14T00:47:06Z

Test build #91801 has finished for PR 20442 at commit 73aeb1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2018-09-04T21:51:07Z

Any more comments? @MLnick @jkbradley

SparkQA · 2018-09-17T19:55:23Z

Test build #96152 has finished for PR 20442 at commit 73aeb1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-22T06:50:16Z

Test build #97718 has finished for PR 20442 at commit 73aeb1c.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-22T07:05:04Z

Test build #97702 has finished for PR 20442 at commit 73aeb1c.

This patch fails due to an unknown error code, -9.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2018-10-22T09:12:26Z

Test build #97744 has finished for PR 20442 at commit 73aeb1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-22T15:16:01Z

Test build #97777 has finished for PR 20442 at commit 73aeb1c.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2018-11-19T19:19:31Z

retest this please

SparkQA · 2018-11-19T20:36:28Z

Test build #99014 has finished for PR 20442 at commit 73aeb1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2019-09-04T20:24:54Z

jenkins retest this please

SparkQA · 2019-09-04T21:35:29Z

Test build #110145 has finished for PR 20442 at commit 73aeb1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2019-09-06T17:45:32Z

@srowen @holdenk @viirya
Could you please take a look to see if there is anything else I need to do for this PR? Thank you very much in advance!

viirya · 2019-09-06T22:02:41Z

mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala

-    )
+  @Since("1.6.0")
+  override def transformSchema(schema: StructType): StructType = {
+    ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol),


checkSingleVsMultiColumnParams can be used like ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol, splits), Seq(outputCols, splitsArray)).

If we want numBuckets and numBucketsArray to be exclusively set, you can use checkSingleVsMultiColumnParams like that.

Thanks @viirya for your quick reply!
The reason I didn't use

ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol, numBuckets), Seq(outputCols, numBucketsArray))

is that we can actually call setNumBuckets for multiple columns. I looked at the previous conversation, and we decided to allow setNumBuckets for multiple columns. In the multi-column case:

If however the numBucketsArray param is unset but the numBuckets param is set, the user is saying they want the same numBuckets across all columns, then we can use the multi-column version of approxQuantiles in this case.

ah, I see. thanks!
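
The behaviour described in this exchange can be sketched roughly as follows: when numBucketsArray is unset, the single numBuckets value applies to every input column. `resolve` is a hypothetical helper written for illustration, not a method of QuantileDiscretizer.

```scala
object BucketsPerColumn {
  // Returns the bucket count to use for each input column.
  def resolve(
      inputCols: Seq[String],
      numBuckets: Int,
      numBucketsArray: Option[Seq[Int]]): Seq[Int] =
    numBucketsArray match {
      case Some(perColumn) =>
        // An explicit per-column array must line up with the input columns.
        require(perColumn.length == inputCols.length,
          s"inputCols and numBucketsArray have different lengths: " +
            s"(${inputCols.length}, ${perColumn.length}).")
        perColumn
      case None =>
        // No array given: reuse the same numBuckets for every column.
        Seq.fill(inputCols.length)(numBuckets)
    }
}
```

This is why numBuckets and numBucketsArray cannot simply be passed to checkSingleVsMultiColumnParams as a mutually exclusive pair: numBuckets remains meaningful in the multi-column case.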

viirya · 2019-09-09T01:01:31Z

If no more comments, I will merge this tomorrow.

viirya · 2019-09-10T00:21:15Z

retest this please

SparkQA · 2019-09-10T01:44:59Z

Test build #110380 has finished for PR 20442 at commit 73aeb1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-09-10T02:08:55Z

Thanks! Merged to master.

huaxingao · 2019-09-10T02:35:06Z

Thank you very much! @viirya @srowen

…eDiscretizer

## What changes were proposed in this pull request?

SPARK-22799 added more comprehensive error logic for Bucketizer. This PR is to update QuantileDiscretizer match the new error logic in Bucketizer.

## How was this patch tested?

Add new unit test.

Closes apache#20442 from huaxingao/spark-23265.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>

viirya reviewed Jan 31, 2018

huaxingao changed the title ~~[SPARK-23265][SQL]Update multi-column error handling logic in QuantileDiscretizer~~ [SPARK-23265][ML]Update multi-column error handling logic in QuantileDiscretizer Feb 1, 2018

MLnick approved these changes Feb 16, 2018

huaxingao added 5 commits April 24, 2018 21:24

[SPARK-23265][SQL]Update multi-column error handling logic in Quantil…

106f8cf

…eDiscretizer

add check for numBucketsArray length

d22093b

address comments

9674c3c

address comments (2)

f68940b

resolve conflict

db35f61

huaxingao force-pushed the spark-23265 branch from 776a179 to db35f61 Compare April 25, 2018 05:21

adding a test to check setting inputCol only (without setting outputC…

73aeb1c

…ol) works OK

dongjoon-hyun added the ML label Jun 14, 2019

viirya reviewed Sep 6, 2019

srowen approved these changes Sep 8, 2019

viirya approved these changes Sep 9, 2019

viirya closed this in aa805ec Sep 10, 2019

huaxingao deleted the spark-23265 branch September 10, 2019 02:35
