[SPARK-22442][SQL] ScalaReflection should produce correct field names for special characters #19664

Closed
wants to merge 3 commits into from


viirya
Member

@viirya viirya commented Nov 6, 2017

What changes were proposed in this pull request?

For a case class whose field names contain special characters, e.g.:

case class MyType(`field.1`: String, `field 2`: String)

Although we can still create and manipulate a DataFrame/Dataset from it, the field names in the schema are encoded:

scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF
df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string]
scala> df.as[MyType].collect
res7: Array[MyType] = Array(MyType(a,b), MyType(c,d))

This causes a resolution failure when we try to convert data whose field names are not encoded:

spark.read.json(path).as[MyType]
...
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, field.1];
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...

We should use the decoded field names in the Dataset schema.
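
The encoded names in the schema above match what the standard library's `scala.reflect.NameTransformer` produces for characters that are not valid in JVM identifiers. The following is a minimal sketch of that round trip using only the Scala standard library. It is an illustration, not the change made to ScalaReflection, and the object name `FieldNameEncoding` is made up for the example.

```scala
import scala.reflect.NameTransformer

// Hypothetical helper object, for illustration only.
object FieldNameEncoding {
  // Names as written in the case class definition (inside backticks).
  val sourceNames: Seq[String] = Seq("field.1", "field 2")

  // Names as they appear on the JVM: '.' becomes $u002E and ' ' becomes $u0020.
  val encodedNames: Seq[String] = sourceNames.map(NameTransformer.encode)

  // Decoding restores the names a user would expect to see in a schema.
  val decodedNames: Seq[String] = encodedNames.map(NameTransformer.decode)

  def main(args: Array[String]): Unit = {
    sourceNames.zip(encodedNames).foreach { case (src, enc) =>
      println(s"$src -> $enc")
    }
  }
}
```

Data read from JSON carries the decoded form (`field.1`, `field 2`), which is why a schema built from the encoded form cannot be resolved against it.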

How was this patch tested?

Added tests.

@SparkQA

SparkQA commented Nov 6, 2017

Test build #83478 has finished for PR 19664 at commit 319c804.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2017

Test build #83486 has finished for PR 19664 at commit a671a83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SpecialCharClass(`field.1`: String, `field 2`: String)

@viirya
Member Author

viirya commented Nov 7, 2017

cc @cloud-fan for review.

@@ -214,11 +215,13 @@ case class Invoke(
override def eval(input: InternalRow): Any =
  throw new UnsupportedOperationException("Only code-generated evaluation is supported.")

private lazy val encodedFunctionName = TermName(functionName).encodedName.toString
Contributor

Does StaticInvoke have the same issue?

Member Author

Maybe, although I don't have a concrete case that triggers the issue for now.

Member Author

@viirya viirya Nov 8, 2017

Since we use Invoke to access the field(s) of an object, this is an issue for Invoke. I didn't find StaticInvoke used in a similar way, so it should be fine.
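
To illustrate why a field accessor has to be looked up by its encoded name, here is a small sketch using plain Java reflection rather than Spark's code generation. The object `EncodedAccessorLookup` and its helpers are hypothetical; `MyType` mirrors the case class from the PR description.

```scala
import scala.reflect.NameTransformer

// Hypothetical helper object, for illustration only.
object EncodedAccessorLookup {
  // Mirrors the example class from the PR description.
  case class MyType(`field.1`: String, `field 2`: String)

  // Look up and call a field accessor by its source-level name.
  // The JVM method carries the encoded name, so encode before the lookup.
  def readField(obj: AnyRef, sourceName: String): AnyRef = {
    val method = obj.getClass.getMethod(NameTransformer.encode(sourceName))
    method.invoke(obj)
  }

  // Returns true if a public method with exactly this name exists on the class.
  def hasMethod(cls: Class[_], name: String): Boolean =
    cls.getMethods.exists(_.getName == name)
}
```

With `val v = EncodedAccessorLookup.MyType("a", "b")`, calling `readField(v, "field.1")` returns the first field, while no method literally named `field.1` exists on the class.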

val argumentsFields = deserializer.asInstanceOf[NewInstance].arguments.flatMap { _.collect {
  case UpCast(u: UnresolvedAttribute, _, _) => u.name
}}
assert(argumentsFields(0) == "`field.1`")
Contributor

Why does it have backticks?

Member Author

We need to deliberately wrap backticks around a field name such as field.1 because of the dot character. Otherwise UnresolvedAttribute will parse it as two name parts, Seq("field", "1"), and resolution will fail later.
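
The effect of the backticks can be sketched with a simplified parser. `parseNameParts` below is a hypothetical stand-in written for this illustration, not Spark's actual parsing code; it only assumes that unquoted dots separate name parts and that backticks suppress the split.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper object, for illustration only.
object AttributeNameSketch {
  // Splits a name on dots, except inside a backtick-quoted section.
  def parseNameParts(name: String): Seq[String] = {
    val parts = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inBackticks = false
    name.foreach {
      case '`' =>
        // Toggle quoting; the backtick itself is not part of the name.
        inBackticks = !inBackticks
      case '.' if !inBackticks =>
        // An unquoted dot ends the current name part.
        parts += current.toString
        current.clear()
      case c =>
        current.append(c)
    }
    parts += current.toString
    parts.toSeq
  }
}
```

Under these assumptions, `parseNameParts("field.1")` yields two parts, while `parseNameParts("`field.1`")` yields the single part `field.1`.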

@cloud-fan
Contributor

good catch!

assert(serializer.dataType(1).name == "field 2")

val argumentsFields = deserializer.asInstanceOf[NewInstance].arguments.flatMap { _.collect {
  case UpCast(u: UnresolvedAttribute, _, _) => u.name
Contributor

Actually, we should collect u.nameParts here. u.name adds backticks if any of the nameParts contains a dot, which makes the test result a little confusing.

Member Author

Ok. Reasonable.
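
The difference between the two accessors discussed in this thread can be sketched as follows. `Attr` is a hypothetical stand-in that models only `nameParts` and `name`; the quoting rule is the behaviour described in the review comment, assumed here rather than copied from Spark.

```scala
// Hypothetical helper object, for illustration only.
object QuotedNameSketch {
  // Minimal stand-in for an unresolved attribute.
  final case class Attr(nameParts: Seq[String]) {
    // Assumed behaviour: quote any part that contains a dot, then join with dots.
    def name: String =
      nameParts.map(p => if (p.contains(".")) s"`$p`" else p).mkString(".")
  }
}
```

With this model, `Attr(Seq("field.1")).name` carries backticks while `Attr(Seq("field 2")).name` does not, so asserting on `nameParts` compares the raw field names in both cases.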

@SparkQA

SparkQA commented Nov 9, 2017

Test build #83622 has finished for PR 19664 at commit 10db6b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 40a8aef Nov 9, 2017
@viirya viirya deleted the SPARK-22442 branch December 27, 2023 18:34