Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32624][SQL][FOLLOWUP] Fix regression in CodegenContext.addReferenceObj on nested Scala types #29602

Closed
wants to merge 1 commit into from

Conversation

rednaxelafx
Copy link
Contributor

What changes were proposed in this pull request?

Use CodeGenerator.typeName() instead of Class.getCanonicalName() in CodegenContext.addReferenceObj() for getting the runtime class name for an object.

Why are the changes needed?

#29439 fixed a bug in CodegenContext.addReferenceObj() for Array[Byte] (i.e. Spark SQL's BinaryType) objects, but unfortunately it introduced a regression for some nested Scala types.

For example, for implicitly[Ordering[UTF8String]], after that PR CodegenContext.addReferenceObj() would return ((null) references[0] /* ... */). The actual type for implicitly[Ordering[UTF8String]] is scala.math.LowPriorityOrderingImplicits$$anon$3 in Scala 2.12.10, and Class.getCanonicalName() returns null for that class.

On the other hand, Class.getName() is safe to use for all non-array types, and Janino will happily accept the type name returned from Class.getName() for nested types. CodeGenerator.typeName() happens to do the right thing by correctly handling arrays and otherwise use Class.getName(). So it's a better alternative than Class.getCanonicalName().

Side note: rule of thumb for using Java reflection in Spark: it may be tempting to use Class.getCanonicalName(), but for functions that may need to handle Scala types, please avoid it due to potential issues with nested Scala types.
Instead, use Class.getName() or utility functions in org.apache.spark.util.Utils (e.g. Utils.getSimpleName() or Utils.getFormattedClassName() etc).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit test case for the regression case in CodeGenerationSuite.

…624: it'll get a "null" for class name of some nested Scala types
@rednaxelafx
Copy link
Contributor Author

cc original author and potential reviewers: @wangyum @cloud-fan @maropu @kiszk @viirya

@SparkQA
Copy link

SparkQA commented Sep 1, 2020

Test build #128114 has finished for PR 29602 at commit 2ecd30b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, @rednaxelafx !

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch and nice description to explain the issue! Thanks!

Copy link
Member

@wangyum wangyum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HyukjinKwon
Copy link
Member

Merged to master and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request Sep 1, 2020
…renceObj on nested Scala types

### What changes were proposed in this pull request?

Use `CodeGenerator.typeName()` instead of `Class.getCanonicalName()` in `CodegenContext.addReferenceObj()` for getting the runtime class name for an object.

### Why are the changes needed?

#29439 fixed a bug in `CodegenContext.addReferenceObj()` for `Array[Byte]` (i.e. Spark SQL's `BinaryType`) objects, but unfortunately it introduced a regression for some nested Scala types.

For example, for `implicitly[Ordering[UTF8String]]`, after that PR `CodegenContext.addReferenceObj()` would return `((null) references[0] /* ... */)`. The actual type for `implicitly[Ordering[UTF8String]]` is `scala.math.LowPriorityOrderingImplicits$$anon$3` in Scala 2.12.10, and `Class.getCanonicalName()` returns `null` for that class.

On the other hand, `Class.getName()` is safe to use for all non-array types, and Janino will happily accept the type name returned from `Class.getName()` for nested types. `CodeGenerator.typeName()` happens to do the right thing by correctly handling arrays and otherwise use `Class.getName()`. So it's a better alternative than `Class.getCanonicalName()`.

Side note: rule of thumb for using Java reflection in Spark: it may be tempting to use `Class.getCanonicalName()`, but for functions that may need to handle Scala types, please avoid it due to potential issues with nested Scala types.
Instead, use `Class.getName()` or utility functions in `org.apache.spark.util.Utils` (e.g. `Utils.getSimpleName()` or `Utils.getFormattedClassName()` etc).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new unit test case for the regression case in `CodeGenerationSuite`.

Closes #29602 from rednaxelafx/spark-32624-followup.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 6e5bc39)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@kiszk
Copy link
Member

kiszk commented Sep 2, 2020

Good catch, late LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants