Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13651] Generator outputs are not resolved correctly resulting in run time error #11497

Closed
wants to merge 2 commits into from

Conversation

dilipbiswal
Copy link
Contributor

What changes were proposed in this pull request?

Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value")

Results in following logical plan

Project [key#2,value#3]
+- Generate explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)), true, false, Some(genoutput), [key#2,value#3]
   +- SubqueryAlias src
      +- Project [_1#0 AS key#2,_2#1 AS value#3]
         +- LocalRelation [_1#0,_2#1], [[id1,value1]]

The above query fails with following runtime error.

java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
    at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
    at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
    at scala.collection.Iterator$class.foreach(Iterator.scala:742)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        <stack-trace omitted.....>

In this case the generated outputs are wrongly resolved from its child (LocalRelation) due to
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Added unit tests in hive/SQLQuerySuite and AnalysisSuite

@dilipbiswal
Copy link
Contributor Author

cc @cloud-fan @gatorsmile

@@ -92,6 +92,16 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
checkAnswer(query, Row(1, 1) :: Row(1, 2) :: Row(1, 3) :: Nil)
}

test("SPARK-13651: generator outputs shouldn't be resolved from its child's outpu") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: outpu -> output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed it.

@gatorsmile
Copy link
Member

ok to test

@gatorsmile
Copy link
Member

LGTM. Please update your PR description with runtime error messages you hit. (Not necessary to post the whole stack. I think, just the first few lines are enough) Thanks!

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #2610 has finished for PR 11497 at commit abf868f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

@dilipbiswal
Copy link
Contributor Author

I ran this test 5 times in my development machine and it failed once. It looks like an intermittent failure. Also i verified the plan for the failing test.

Project [(id#0L % cast(2 as bigint)) AS key#1L,if (isnull(cast(id#0L as int))) null else UDF(cast(id#0L as int)) AS UDF(id)#2]
+- Range 0, 10, 1, 32, [id#0L]

There is no Generate in the plan and so the fix shouldn't affect this testcase.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Mar 4, 2016

Test build #52432 has finished for PR 11497 at commit abf868f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val query =
input
.generate(Explode(generatorInput), join = true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a valid plan? The input to Explode is an Attribute named a, which is not in the output of input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Thank you. You are right. Wenchen, i just realized that its pretty hard to simulate the error in AnalysisSuite. For this problem to happen, we need to have the rules fired in following sequence.

  1. First ResolveGenerate be a no-op because the generator is not resolved.
  2. Generator is resolved through ResolveFunction.
  3. ResolveReference now resolves the generator output attributes from child;s output

in AnalysisSuite we have an empty function registry thus i am unable to simulate this error in this
test. If you are ok, i am thinking of removing this test and getting it tested through SQLQuerySuite.

Please let me know what you think.

@@ -512,6 +512,9 @@ class Analyzer(

// A special case for Generate, because the output of Generate should not be resolved by
// ResolveReferences. Attributes in the output will be resolved by ResolveGenerate.
case g @ Generate(generator, _, _, _, _, _)
if !g.resolved && generator.resolved => g

case g @ Generate(generator, join, outer, qualifier, output, child)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this 2 cases can be simplified to:

case g: Generate if g.generator.resolved => g
case g @ Generate(generator, join, outer, qualifier, output, child) =>
  the generator resolution logic...

We only care about whether generator is resolved or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Thanks !! Made the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we still keep if child.resolved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies Hi Davis, I also was thinking about it. I felt its probably safer to handle the generate plan by these two cases and not fall through the last case like we do for this defect.
@cloud-fan What do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if child.resolved is guaranteed at the beginning of this rule:
case p: LogicalPlan if !p.childrenResolved => p

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Thank you !!

@cloud-fan
Copy link
Contributor

LGTM, cc @davies (who fixed this special case before)

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Can we trigger a test please ?

@gatorsmile
Copy link
Member

test this please

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Mar 7, 2016

Test build #52540 has finished for PR 11497 at commit 93d6e69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Mar 7, 2016

Merging into master, thanks!

@asfgit asfgit closed this in d7eac9d Mar 7, 2016
asfgit pushed a commit that referenced this pull request Mar 7, 2016
…pressions

## What changes were proposed in this pull request?

It's weird that expressions don't always have all the expressions in it. This PR marks `QueryPlan.expressions` final to forbid sub classes overriding it to exclude some expressions. Currently only `Generate` override it, we can use `producedAttributes` to fix the unresolved attribute problem for it.

Note that this PR doesn't fix the problem in #11497

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11532 from cloud-fan/generate.
@dilipbiswal
Copy link
Contributor Author

@cloud-fan @davies @gatorsmile Thank you !!

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…in run time error

## What changes were proposed in this pull request?

```
Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value")
```
Results in following logical plan

```
Project [key#2,value#3]
+- Generate explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)), true, false, Some(genoutput), [key#2,value#3]
   +- SubqueryAlias src
      +- Project [_1#0 AS key#2,_2#1 AS value#3]
         +- LocalRelation [_1#0,_2#1], [[id1,value1]]
```

The above query fails with following runtime error.
```
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
	at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
	at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        <stack-trace omitted.....>
```
In this case the generated outputs are wrongly resolved from its child (LocalRelation) due to
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548
## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Added unit tests in hive/SQLQuerySuite and AnalysisSuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes apache#11497 from dilipbiswal/spark-13651.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…pressions

## What changes were proposed in this pull request?

It's weird that expressions don't always have all the expressions in it. This PR marks `QueryPlan.expressions` final to forbid sub classes overriding it to exclude some expressions. Currently only `Generate` override it, we can use `producedAttributes` to fix the unresolved attribute problem for it.

Note that this PR doesn't fix the problem in apache#11497

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#11532 from cloud-fan/generate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants