[SPARK-20534][SQL] Make outer generate exec return empty rows #17810

Closed
hvanhovell wants to merge 5 commits

hvanhovell
Contributor

What changes were proposed in this pull request?

Generate exec does not produce null values when the generator output for an input row is empty and the generate operates in outer mode without join. The cause is that the join=false code path is separate from the join=true code path, and the join=false code path did not handle outer properly. This PR addresses this issue.
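
To make the intended behavior concrete, here is a minimal, Spark-free sketch of the semantics described above. It is not the actual `GenerateExec` code: `GenerateModel`, `InputRow`, `generate`, and `run` are hypothetical names, and `None` stands in for a SQL null. It only assumes that outer mode should turn an empty generator result into one null row, with or without join.

```scala
// Illustrative model only; none of these names exist in Spark.
object GenerateModel {
  // One input row: a key column and the collection the generator explodes.
  final case class InputRow(k: Int, v: Seq[Int])

  // An output row is a list of nullable integer columns (None models null).
  type OutputRow = List[Option[Int]]

  // Generator output for a single input row.
  // In outer mode an empty result is replaced by a single null value.
  def generate(row: InputRow, outer: Boolean): Seq[Option[Int]] = {
    val out: Seq[Option[Int]] = row.v.map(x => Option(x))
    if (out.isEmpty && outer) Seq(Option.empty[Int]) else out
  }

  // join = true prepends the input column to each generated value;
  // join = false emits only the generated value.
  // The outer handling is shared by both paths.
  def run(rows: Seq[InputRow], join: Boolean, outer: Boolean): Seq[OutputRow] =
    rows.flatMap { row =>
      generate(row, outer).map { g =>
        if (join) List(Option(row.k), g) else List(g)
      }
    }

  // Same shape as the data used in the updated test: the second row has an empty list.
  val rows: Seq[InputRow] = Seq(InputRow(1, Seq(1, 2, 3)), InputRow(2, Seq.empty[Int]))
}
```

In this model, `GenerateModel.run(GenerateModel.rows, join = false, outer = true)` yields four rows, the last holding a null, while `outer = false` yields only the three non-null rows.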

How was this patch tested?

Updated outer* tests in GeneratorFunctionSuite.

@SparkQA

SparkQA commented Apr 30, 2017

Test build #76309 has finished for PR 17810 at commit e1dfe8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 30, 2017

Test build #76310 has finished for PR 17810 at commit 1722b9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 30, 2017

Test build #76311 has finished for PR 17810 at commit a24ba6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -91,7 +91,7 @@ class GeneratorFunctionSuite extends QueryTest with SharedSQLContext {
     val df = Seq((1, Seq(1, 2, 3)), (2, Seq())).toDF("a", "intList")
     checkAnswer(
       df.select(explode_outer('intList)),
-      Row(1) :: Row(2) :: Row(3) :: Nil)
+      Row(1) :: Row(2) :: Row(3) :: Row(null) :: Nil)
Member

@gatorsmile gatorsmile Apr 30, 2017

Based on the definition of outer, outer has no effect when join is false. That means df.select(explode_outer('intList)) should have the same outcome as df.select(explode('intList)). Neither should generate null, right?

@param outer when true, each input row will be output at least once, even if the output of the given generator is empty. outer has no effect when join is false.

Contributor Author

So I think we should move away from that definition, and make outer independent from join. The reason for this is that I think it is super confusing that the following queries produce different results:

df.select($"k", explode_outer($"v"))
df.select(explode_outer($"v"))

If you do not want the outer semantic, then just use a regular generator.
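
The inconsistency between the two queries can be sketched without Spark. The model below is an assumption-laden illustration, not Spark code: `OuterJoinCoupling`, `explodeOuter`, and the `outerNeedsJoin` flag are hypothetical names. It treats the first query as the join case and the second as the no-join case, and `None` stands in for null.

```scala
// Illustrative model only; none of these names exist in Spark.
object OuterJoinCoupling {
  type Row = List[Option[Int]]

  // (k, v) pairs; the second row has an empty collection.
  val input: Seq[(Int, Seq[Int])] = Seq((1, Seq(1, 2, 3)), (2, Seq.empty[Int]))

  // outerNeedsJoin = true models the documented rule "outer has no effect when join is false".
  // outerNeedsJoin = false models the proposal: outer is independent of join.
  def explodeOuter(join: Boolean, outerNeedsJoin: Boolean): Seq[Row] = {
    val outerApplies = join || !outerNeedsJoin
    input.flatMap { case (k, v) =>
      val exploded: Seq[Option[Int]] = v.map(x => Option(x))
      // Emit a single null for an empty result only when outer is in effect.
      val generated =
        if (exploded.isEmpty && outerApplies) Seq(Option.empty[Int]) else exploded
      generated.map(g => if (join) List(Option(k), g) else List(g))
    }
  }

  // Values of the generated column only, for comparing the two query shapes.
  def generatedColumn(rows: Seq[Row]): Seq[Option[Int]] = rows.map(_.last)
}
```

Under the coupled rule, the generated column contains a null in the join case but not in the no-join case, so dropping `k` from the select list changes the number of rows. With the rule decoupled, both shapes return the same generated values.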

Member

Contributor Author

Done.

@gatorsmile
Member

LGTM

@SparkQA

SparkQA commented Apr 30, 2017

Test build #76325 has finished for PR 17810 at commit 0eca481.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 1, 2017
## What changes were proposed in this pull request?
Generate exec does not produce `null` values when the generator output for an input row is empty and the generate operates in outer mode without join. The cause is that the `join=false` code path is separate from the `join=true` code path, and the `join=false` code path did not handle outer properly. This PR addresses this issue.

## How was this patch tested?
Updated `outer*` tests in `GeneratorFunctionSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #17810 from hvanhovell/SPARK-20534.

(cherry picked from commit 6b44c4d)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@gatorsmile
Member

gatorsmile commented May 1, 2017

Thanks! Merging to master/2.2

Since https://issues.apache.org/jira/browse/SPARK-13721 was only merged to the 2.2 branch, I did not backport this to earlier versions.

@asfgit asfgit closed this in 6b44c4d May 1, 2017