[SPARK-23265][ML]Update multi-column error handling logic in QuantileDiscretizer #20442

Closed
wants to merge 6 commits into from

Conversation

huaxingao
Contributor

What changes were proposed in this pull request?

SPARK-22799 added more comprehensive error logic for Bucketizer. This PR updates QuantileDiscretizer to match the new error logic in Bucketizer.

How was this patch tested?

Add new unit test.

@SparkQA

SparkQA commented Jan 30, 2018

Test build #86845 has finished for PR 20442 at commit 1367ef0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 30, 2018

Test build #86848 has finished for PR 20442 at commit b35563a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    )
    @Since("1.6.0")
    override def transformSchema(schema: StructType): StructType = {
      ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol),

Member

Setting numBucketsArray in the single-column case should be an error. Since checkSingleVsMultiColumnParams doesn't cover this case, I think we may need to check it here.

Contributor Author

Thanks for your comment. I will add the check.

    ($(inputCols), $(outputCols))
    if (isSet(inputCols)) {
      require(getInputCols.length == getOutputCols.length,
        s"QuantileDiscretizer $this has mismatched Params " +

Member

Does this need to be in the output? Or just The QuantileDiscretizer has ...?

Contributor Author

The only reason I have $this is because Bucketizer has $this and I am trying to be consistent with Bucketizer implementation.

    if (isSet(inputCols)) {
      require(getInputCols.length == getOutputCols.length &&
        getInputCols.length == getSplitsArray.length, s"Bucketizer $this has mismatched Params " +
        s"for multi-column transform.  Params (inputCols, outputCols, splitsArray) should have " +

Member

Ok.

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86864 has finished for PR 20442 at commit dfaad52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    "inputCols number do not match outputCols")
    ($(inputCols), $(outputCols))
    require(!isSet(numBucketsArray),
      s"numBucketsArray can't be set for single-column QuantileDiscretizer.")

Member

Should we check if numBucketsArray and numBuckets are set at the same time?

Contributor Author

I considered adding this check when I changed the code yesterday.
If both numBucketsArray and numBuckets are set, the current code only uses numBucketsArray. Also, numBuckets always has a default value even if it's not set explicitly. So yesterday I decided not to add the check.
But I guess it's better to tighten the code so that users cannot set numBuckets explicitly when numBucketsArray is set. I will add the check.
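
The point about the default value can be illustrated with a small sketch: a param with a default is defined even when the user never set it, so a "both set" check has to look only at explicit setting. This is plain Scala written for illustration, not Spark code; `ParamState`, `NumBucketsCheck`, and the default of 2 used in the usage note are assumptions.

```scala
// Simplified stand-in for a param's bookkeeping (hypothetical, not Spark's Params class).
final case class ParamState[T](default: Option[T], explicit: Option[T] = None) {
  // True only when the user supplied a value explicitly.
  def isSet: Boolean = explicit.isDefined
  // True when a value is available from either the user or the default.
  def isDefined: Boolean = explicit.isDefined || default.isDefined
}

object NumBucketsCheck {
  // Hypothetical validation: reject only the case where both params were set explicitly.
  // A defaulted numBuckets does not count as set, so it does not trigger the error.
  def validate(numBuckets: ParamState[Int], numBucketsArray: ParamState[Array[Int]]): Unit =
    require(!(numBuckets.isSet && numBucketsArray.isSet),
      "numBuckets and numBucketsArray cannot both be set explicitly.")
}
```

With `ParamState[Int](default = Some(2))` and an explicitly set array, `validate` passes; it throws an `IllegalArgumentException` only once `numBuckets` is also given an explicit value.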

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86886 has finished for PR 20442 at commit 776a179.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao huaxingao changed the title [SPARK-23265][SQL]Update multi-column error handling logic in QuantileDiscretizer [SPARK-23265][ML]Update multi-column error handling logic in QuantileDiscretizer Feb 1, 2018
Contributor

@MLnick MLnick left a comment

This LGTM. We could potentially slightly clean up the error messages in transformSchema but the behavior is what we want.

Given 2.3.0 RC4 is imminent we can go ahead to merge as is.

@jkbradley
Member

I'm re-running tests since the last run is very stale, but +1 for getting this into RC4!

@SparkQA

SparkQA commented Feb 16, 2018

Test build #4099 has finished for PR 20442 at commit 776a179.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Feb 17, 2018

numBuckets is a param with a default value, so it can cause a persistence bug too if we add multi-column error handling logic. I think we have two options:

  1. Ignore numBuckets when inputCols and numBucketsArray are set. Don't raise an error if it is set.
  2. Similar to outputCol, also skip the default value of numBuckets when saving the metadata if inputCols is set.
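
A rough sketch of the persistence concern, under the assumption that the writer saves default values together with explicit ones and the reader restores every saved entry as explicitly set. This is a plain Scala model, not Spark's persistence code; `PersistenceSketch` and all names inside it are hypothetical.

```scala
object PersistenceSketch {
  // Hypothetical model: explicit values and defaults are tracked separately.
  final case class Params(explicit: Map[String, Any], defaults: Map[String, Any]) {
    def isSet(name: String): Boolean = explicit.contains(name)
    // Merged view: explicit values override defaults.
    def extract: Map[String, Any] = defaults ++ explicit
  }

  // Assumed writer behaviour: the merged view (defaults included) is what gets saved.
  def save(p: Params): Map[String, Any] = p.extract

  // Assumed reader behaviour: every saved entry comes back as explicitly set.
  def load(saved: Map[String, Any], defaults: Map[String, Any]): Params =
    Params(explicit = saved, defaults = defaults)

  // Strict check: numBuckets must not be explicitly set alongside numBucketsArray.
  def strictCheckPasses(p: Params): Boolean =
    !(p.isSet("numBuckets") && p.isSet("numBucketsArray"))
}
```

In this model, an instance that sets only numBucketsArray passes the strict check before saving but fails it after a save/load round trip, because the default numBuckets is restored as if the user had set it. Option 1 avoids the failure by dropping the strict check; option 2 avoids it by not saving the default.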

@jkbradley
Member

Yeah, this is a strong reason to separate explicitly set and default Params in the near future. Let's not block 2.3 on this PR. If you still want to try for 2.3, then I vote for option 1 but don't have a strong preference. But I'm tempted to say we should just wait for 2.4 and do the long-term fix for Params.

@huaxingao
Contributor Author

Thanks for the comments. I am in China now for Chinese New Year. Will address the comments when I get back to work on 2/21.

@huaxingao
Contributor Author

Sorry for not working on this earlier. Just came back from China yesterday morning.
Not sure if 2.3 RC4 has already been cut. If this still needs to be merged in 2.3, please let me know and I will take option 1. Otherwise, I will wait for 2.4. I saw viirya already has a fix to separate explicitly set and default params in SPARK-23455.
Thanks all for your help!

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89820 has finished for PR 20442 at commit db35f61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

@huaxingao Thanks for this follow-up! I realized that #19715 introduced a breaking change which we missed in Spark 2.3 QA: In Spark 2.2, a user could set inputCol but not set outputCol (since outputCol has a default value). The new check causes such user code to start failing in Spark 2.3.

Since you're already working on this follow-up, would you mind adding a unit test which checks this? (Setting inputCol but not outputCol, and making sure that works.)

@huaxingao
Contributor Author

@jkbradley test added. Could you please review?

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91801 has finished for PR 20442 at commit 73aeb1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

Any more comments? @MLnick @jkbradley

@SparkQA

SparkQA commented Sep 17, 2018

Test build #96152 has finished for PR 20442 at commit 73aeb1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97718 has finished for PR 20442 at commit 73aeb1c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97702 has finished for PR 20442 at commit 73aeb1c.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97744 has finished for PR 20442 at commit 73aeb1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97777 has finished for PR 20442 at commit 73aeb1c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 19, 2018

Test build #99014 has finished for PR 20442 at commit 73aeb1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Sep 4, 2019

jenkins retest this please

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110145 has finished for PR 20442 at commit 73aeb1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

@srowen @holdenk @viirya
Could you please take a look to see if there is anything else I need to do for this PR? Thank you very much in advance!

    )
    @Since("1.6.0")
    override def transformSchema(schema: StructType): StructType = {
      ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol),

Member

checkSingleVsMultiColumnParams can be used like ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol, splits), Seq(outputCols, splitsArray)).

If we want numBuckets and numBucketsArray to be mutually exclusive, you can use checkSingleVsMultiColumnParams like that.

Contributor Author

Thanks @viirya for your quick reply!
The reason I didn't use

    ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol, numBuckets),
      Seq(outputCols, numBucketsArray))

is that we can actually call setNumBuckets for multiple columns. I looked at the previous conversation: we decided to allow setNumBuckets for multiple columns. In the multi-column case:

If however the numBucketsArray param is unset but the numBuckets param is set, 
the user is saying they want the same numBuckets across all columns, then we can 
use the multi-column version of approxQuantiles in this case.
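
To make the trade-off concrete, here is a simplified model of a single-vs-multi-column check. It assumes the semantics implied in this thread: the active mode's params must be defined (explicitly or by default), and the other mode's params must not be explicitly set. This is plain Scala for illustration, not the Spark implementation; `SingleVsMultiSketch`, `Stage`, and `check` are hypothetical names.

```scala
object SingleVsMultiSketch {
  // Hypothetical stand-in for a stage: explicitly set param names plus names that have defaults.
  final case class Stage(
      set: Set[String],
      withDefault: Set[String] = Set("numBuckets", "outputCol")) {
    def isSet(p: String): Boolean = set.contains(p)
    def isDefined(p: String): Boolean = isSet(p) || withDefault.contains(p)
  }

  // Returns Left(reason) when the combination is rejected, Right(()) otherwise.
  def check(stage: Stage, single: Seq[String], multi: Seq[String]): Either[String, Unit] = {
    val singleMode = stage.isSet("inputCol")
    val multiMode = stage.isSet("inputCols")
    if (singleMode == multiMode) {
      Left("exactly one of inputCol, inputCols must be set")
    } else {
      val (required, excluded) = if (singleMode) (single, multi) else (multi, single)
      // Params of the inactive mode must not be explicitly set.
      val notApplicable = excluded.filter(p => stage.isSet(p))
      // Params of the active mode must have a value, explicit or default.
      val missing = required.filterNot(p => stage.isDefined(p))
      if (notApplicable.nonEmpty) Left(s"should not be set: ${notApplicable.mkString(", ")}")
      else if (missing.nonEmpty) Left(s"must be set: ${missing.mkString(", ")}")
      else Right(())
    }
  }
}
```

Under this model, a stage with inputCols, outputCols, and an explicit numBuckets passes when only outputCol and outputCols are listed, but is rejected when numBuckets is added to the single-column list, which is why numBuckets is left out of the call.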

Member

ah, I see. thanks!

@viirya
Member

viirya commented Sep 9, 2019

If no more comments, I will merge this tomorrow.

@viirya
Member

viirya commented Sep 10, 2019

retest this please

@SparkQA

SparkQA commented Sep 10, 2019

Test build #110380 has finished for PR 20442 at commit 73aeb1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Sep 10, 2019

Thanks! Merged to master.

@viirya viirya closed this in aa805ec Sep 10, 2019
@huaxingao
Contributor Author

Thank you very much! @viirya @srowen

@huaxingao huaxingao deleted the spark-23265 branch September 10, 2019 02:35
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
…eDiscretizer

## What changes were proposed in this pull request?

SPARK-22799 added more comprehensive error logic for Bucketizer. This PR updates QuantileDiscretizer to match the new error logic in Bucketizer.

## How was this patch tested?

Add new unit test.

Closes apache#20442 from huaxingao/spark-23265.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>