[SPARK-22442][SQL] ScalaReflection should produce correct field names for special characters #19664

viirya · 2017-11-06T09:10:00Z

What changes were proposed in this pull request?

For a class whose field names contain special characters, e.g.:

case class MyType(`field.1`: String, `field 2`: String)

Although we can manipulate the DataFrame/Dataset, the field names are encoded:

scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF
df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string]
scala> df.as[MyType].collect
res7: Array[MyType] = Array(MyType(a,b), MyType(c,d))
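
The encoded names in the output above match what the Scala standard library's `scala.reflect.NameTransformer` produces for characters that are not valid in JVM identifiers. The following is a minimal sketch of that round trip, assuming only scala-library on the classpath; `NameEncodingDemo` is a hypothetical name used for illustration and is not part of this patch.

```scala
import scala.reflect.NameTransformer

object NameEncodingDemo {
  // Encode a source-level name into its JVM-safe form.
  def encode(name: String): String = NameTransformer.encode(name)

  // Decode a JVM-safe name back into the source-level name.
  def decode(name: String): String = NameTransformer.decode(name)

  def main(args: Array[String]): Unit = {
    Seq("field.1", "field 2").foreach { n =>
      val e = encode(n)
      println(s"$n -> $e -> ${decode(e)}")
    }
  }
}
```

The schema problem described here corresponds to using the encoded form where the decoded form is what the input data actually carries.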

This causes a resolution failure when we try to convert data whose field names are not encoded:

spark.read.json(path).as[MyType]
...
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, field.1];
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...

We should use the decoded field names in the Dataset schema.

How was this patch tested?

Added tests.


SparkQA · 2017-11-06T12:04:28Z

Test build #83478 has finished for PR 19664 at commit 319c804.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-06T15:59:35Z

Test build #83486 has finished for PR 19664 at commit a671a83.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class SpecialCharClass(field.1: String, field 2: String)

viirya · 2017-11-07T06:37:05Z

cc @cloud-fan for review.

cloud-fan · 2017-11-07T09:50:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

@@ -214,11 +215,13 @@ case class Invoke(
  override def eval(input: InternalRow): Any =
    throw new UnsupportedOperationException("Only code-generated evaluation is supported.")

+  private lazy val encodedFunctionName = TermName(functionName).encodedName.toString


Does StaticInvoke have the same issue?

Maybe, although I don't have a concrete case that causes the issue for now.

Since we use Invoke to access fields in an object, this is an issue there. I didn't find StaticInvoke used in a similar way, so it should be fine.
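
To illustrate why a field accessor has to be called by its encoded name, here is a small sketch using plain Java reflection rather than Spark's `Invoke` expression. It assumes, as the encoded schema in the PR description suggests, that the compiler emits accessor methods under their encoded names; `EncodedAccessorDemo` and `callAccessor` are hypothetical names for illustration only.

```scala
import scala.reflect.NameTransformer

object EncodedAccessorDemo {
  // Same shape as the class in the PR description.
  case class MyType(`field.1`: String, `field 2`: String)

  // Names of the zero-argument methods as the JVM sees them.
  def jvmAccessorNames: Set[String] =
    classOf[MyType].getDeclaredMethods
      .filter(_.getParameterCount == 0)
      .map(_.getName)
      .toSet

  // Call an accessor given its source-level name: encode first, then look it up.
  def callAccessor(obj: MyType, sourceName: String): AnyRef = {
    val jvmName = NameTransformer.encode(sourceName)
    classOf[MyType].getMethod(jvmName).invoke(obj)
  }
}
```

Looking up `field.1` directly would fail with `NoSuchMethodException`, which is the same kind of mismatch the encoded function name in the hunk above is meant to avoid in generated code.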

cloud-fan · 2017-11-07T09:51:28Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala

+    val argumentsFields = deserializer.asInstanceOf[NewInstance].arguments.flatMap { _.collect {
+      case UpCast(u: UnresolvedAttribute, _, _) => u.name
+    }}
+    assert(argumentsFields(0) == "`field.1`")


Why does it have backticks?

We need to deliberately wrap backticks around a field name such as field.1 because of the dot character. Otherwise UnresolvedAttribute will parse it as two name parts, Seq("field", "1"), and then fail to resolve later.
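
The effect of the backticks can be sketched with a simplified splitter. This is not Spark's parser; `NamePartsDemo.parseNameParts` is a hypothetical helper that only models the behavior described in the reply: dots separate name parts unless they appear inside backticks.

```scala
import scala.collection.mutable.ArrayBuffer

object NamePartsDemo {
  // Hypothetical, simplified splitter: a dot ends the current part
  // unless we are inside a backtick-quoted section.
  def parseNameParts(name: String): Seq[String] = {
    val parts = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inBackticks = false
    name.foreach {
      case '`' =>
        inBackticks = !inBackticks
      case '.' if !inBackticks =>
        parts += current.toString
        current.clear()
      case c =>
        current.append(c)
    }
    parts += current.toString
    parts.toList
  }
}
```

With this model, the unquoted name splits into two parts while the quoted name stays whole, which is why the quoted form resolves against a column literally named field.1.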

cloud-fan · 2017-11-07T09:52:22Z

good catch!

cloud-fan · 2017-11-08T10:44:21Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala

+    assert(serializer.dataType(1).name == "field 2")
+
+    val argumentsFields = deserializer.asInstanceOf[NewInstance].arguments.flatMap { _.collect {
+      case UpCast(u: UnresolvedAttribute, _, _) => u.name


Actually, we should collect u.nameParts here. u.name adds backticks if a name part contains a dot, which makes the test result a little confusing.

Ok. Reasonable.
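
The difference between the two accessors can be sketched with a stand-in type. `Attr` below is hypothetical and only models the behavior stated in the comment (a part containing a dot is rendered with backticks); it is not Spark's `UnresolvedAttribute`.

```scala
object AttributeNameDemo {
  // Hypothetical stand-in for an unresolved attribute: it stores only the name parts.
  final case class Attr(nameParts: Seq[String]) {
    // Render a single string, quoting any part that contains a dot
    // so the joined result stays unambiguous.
    def name: String =
      nameParts.map(p => if (p.contains(".")) s"`$p`" else p).mkString(".")
  }
}
```

Asserting on `nameParts` compares the raw field names, so a test does not need to know about the quoting rule applied by `name`.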

SparkQA · 2017-11-09T06:18:06Z

Test build #83622 has finished for PR 19664 at commit 10db6b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-09T10:55:13Z

thanks, merging to master!

ScalaReflection should produce correct field names for special characters.

319c804

Add test.

a671a83

cloud-fan reviewed Nov 7, 2017


cloud-fan reviewed Nov 8, 2017


Address comment.

10db6b4

asfgit closed this in 40a8aef Nov 9, 2017

viirya deleted the SPARK-22442 branch December 27, 2023 18:34
