[SPARK-11725][SQL] correctly handle null inputs for UDF #9770

Closed
wants to merge 3 commits into from

cloud-fan
Contributor

If a user declares primitive parameters in a UDF, there is no way for them to null-check those inputs inside the function body. So we assume primitive inputs are null-propagating in this case, and return null if any primitive input is null.
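
A minimal sketch of the problem described above, in plain Scala with no Spark dependency. It is not code from this patch; `PrimitiveNullSketch`, `plusOne`, `callUnchecked`, and `callNullSafe` are hypothetical names chosen for illustration. It relies on the fact that unboxing `null` to `Int` in Scala silently produces `0`, so a function with a primitive parameter never gets the chance to see a null.

```scala
// Illustrative only: plain Scala, no Spark. All names are hypothetical.
object PrimitiveNullSketch {
  // A "UDF" with a primitive parameter: its body cannot observe a null.
  val plusOne: Int => Int = _ + 1

  // Calling through an untyped path, the way an engine hands over boxed values.
  // Unboxing null to Int silently yields 0, so the null is lost.
  def callUnchecked(f: Int => Int, arg: Any): Any = f(arg.asInstanceOf[Int])

  // Null-propagating wrapper: return null instead of invoking the function.
  def callNullSafe(f: Int => Int, arg: Any): Any =
    if (arg == null) null else f(arg.asInstanceOf[Int])
}
```

With these definitions, `callUnchecked(plusOne, null)` evaluates to `1` (the null became `0`), while `callNullSafe(plusOne, null)` evaluates to `null`.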

@cloud-fan
Contributor Author

cc @marmbrus

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46088 has finished for PR 9770 at commit e258b6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(parameterTypes.length == inputs.length)

parameterTypes.zip(inputs).filter(_._1.isPrimitive).map(_._2).foldLeft(udf: Expression) {
  case (result, input) => If(IsNull(input), Literal.create(null, udf.dataType), result)
}
Contributor

I think this would be a lot easier to read in the query plan if you created a single If with Ors.
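
A rough sketch of the difference between the two shapes, using a toy expression tree rather than Catalyst. Every type and function here (`SingleIfSketch`, `Expr`, `Input`, `NullLiteral`, `Udf`, `nested`, `single`) is a hypothetical stand-in and not the code that was merged; the toy `If`, `IsNull`, and `Or` only borrow the names seen in the diff.

```scala
// Toy expression tree, not Catalyst. All names here are illustrative.
object SingleIfSketch {
  sealed trait Expr
  case class Input(name: String) extends Expr
  case class IsNull(child: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr
  case class If(cond: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr
  case object NullLiteral extends Expr
  case class Udf(args: Seq[Expr]) extends Expr

  // One If per primitive input, nested by folding (the shape in the diff above).
  def nested(udf: Expr, primitiveInputs: Seq[Expr]): Expr =
    primitiveInputs.foldLeft(udf) { (result, input) =>
      If(IsNull(input), NullLiteral, result)
    }

  // A single If whose condition ORs all the null checks together.
  def single(udf: Expr, primitiveInputs: Seq[Expr]): Expr =
    primitiveInputs.map(in => IsNull(in): Expr).reduceLeftOption(Or(_, _)) match {
      case Some(anyNull) => If(anyNull, NullLiteral, udf)
      case None          => udf // no primitive inputs, nothing to guard
    }
}
```

For two inputs `a` and `b`, `nested` yields `If(IsNull(b), NullLiteral, If(IsNull(a), NullLiteral, udf))`, whereas `single` yields the flatter `If(Or(IsNull(a), IsNull(b)), NullLiteral, udf)`, which is what would show up in a printed plan.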

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46090 has finished for PR 9770 at commit a8a3067.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2015

Test build #46130 has finished for PR 9770 at commit 9c66274.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(parameterTypes.length == inputs.length)

val inputsNullCheck = parameterTypes.zip(inputs)
// TODO: skip null handling for not-nullable primitive inputs after we can completely
Contributor Author

After an offline discussion with @davies, we think it's dangerous to completely trust the nullable information and optimize based on it, especially for the 1.6 release. Maybe we can do it after 1.6.

cc @marmbrus

Contributor

Given that most of the common code paths do not use nullable (for example generated expressions and joins), there could be corner cases where nullable is not set correctly (for some data sources). I think it's risky for 1.6.

I'd vote to do that in the next release (consider nullable in most places).

Contributor

To play devil's advocate, I think that when the info is wrong, it is usually too conservative (allowing nulls when there are none). Also, I'm not really sure what is going to change between now and 1.7 (i.e. if there are bugs, we need to find them eventually).

That said, I'm fine waiting, but we should use this info eventually given the amount of effort we spend passing it around.
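
To make the trade-off in this thread concrete, here is a small sketch of what skipping the check for non-nullable inputs could look like. It is an assumption-laden illustration, not the deferred optimization itself: `NullableSkipSketch`, `Input`, `inputsToCheck`, and the `trustNullable` switch are all hypothetical, and only `parameterTypes`, `inputs`, and `isPrimitive` echo names from the diff.

```scala
// Illustrative only: a stand-in model, not Spark's Expression API.
object NullableSkipSketch {
  // Hypothetical input expression carrying a nullability flag.
  case class Input(name: String, nullable: Boolean)

  // Selects the inputs that would get a null check: primitive parameters only.
  // With trustNullable = true, inputs declared non-nullable are skipped,
  // which is only safe if the flag is never wrong.
  def inputsToCheck(
      inputs: Seq[Input],
      parameterTypes: Seq[Class[_]],
      trustNullable: Boolean): Seq[Input] = {
    assert(parameterTypes.length == inputs.length)
    parameterTypes.zip(inputs).collect {
      case (tpe, in) if tpe.isPrimitive && (!trustNullable || in.nullable) => in
    }
  }
}
```

If a data source wrongly reports an input as non-nullable, the `trustNullable = true` path would drop the guard and a null could again be unboxed to a default value, which is the risk raised above. A flag that is wrong in the conservative direction only costs an unnecessary check.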

@SparkQA

SparkQA commented Nov 18, 2015

Test build #46169 has finished for PR 9770 at commit 2f04669.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2015

Test build #46171 has finished for PR 9770 at commit f9c38cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Looks great! Merging to master and 1.6.

asfgit pushed a commit that referenced this pull request Nov 18, 2015
If a user declares primitive parameters in a UDF, there is no way for them to null-check those inputs inside the function body. So we assume primitive inputs are null-propagating in this case, and return null if any primitive input is null.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9770 from cloud-fan/udf.

(cherry picked from commit 33b8373)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit closed this in 33b8373 Nov 18, 2015