[SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct field names for special characters #19734

Closed
viirya wants to merge 1 commit into apache:branch-2.2 from viirya:SPARK-22442-2.2

Conversation

viirya (Member) commented Nov 13, 2017

## What changes were proposed in this pull request?

For a case class whose field names contain special characters, e.g.:

```scala
case class MyType(`field.1`: String, `field 2`: String)
```

we can still manipulate the resulting DataFrame/Dataset, but the field names are encoded:

```scala
scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF
df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string]

scala> df.as[MyType].collect
res7: Array[MyType] = Array(MyType(a,b), MyType(c,d))
```

This causes a resolution failure when we try to convert data whose field names are not encoded:

```scala
spark.read.json(path).as[MyType]
...
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, field.1];
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...
```

We should use the decoded field names in the Dataset schema.
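For reference, the `$u002E` / `$u0020` tokens are Scala's standard name mangling for non-identifier characters, which `scala.reflect.NameTransformer` can reverse. A minimal sketch of the round trip (plain Scala, no Spark required; the object name is just for illustration):

```scala
import scala.reflect.NameTransformer

object NameManglingDemo extends App {
  // Characters that are not valid Java identifier parts are mangled as $uXXXX
  // (four-digit hex code point), e.g. '.' -> $u002E and ' ' -> $u0020.
  println(NameTransformer.encode("field.1"))       // field$u002E1
  println(NameTransformer.encode("field 2"))       // field$u00202

  // The fix amounts to exposing the decoded form in the Dataset schema.
  println(NameTransformer.decode("field$u002E1"))  // field.1
  println(NameTransformer.decode("field$u00202"))  // field 2
}
```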

## How was this patch tested?

Added tests.
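To illustrate the behavior the fix enables, a write/read round trip of this shape should now resolve (a hypothetical sketch, not the PR's actual test code; `RoundTripCheck` and the temp path are made up for the example):

```scala
import org.apache.spark.sql.SparkSession

case class MyType(`field.1`: String, `field 2`: String)

object RoundTripCheck extends App {
  val spark = SparkSession.builder().master("local[2]").appName("SPARK-22442").getOrCreate()
  import spark.implicits._

  val path = java.nio.file.Files.createTempDirectory("spark-22442").resolve("data").toString

  // With the fix, the Dataset schema carries the decoded names "field.1" and
  // "field 2", so the JSON files are written with those keys.
  Seq(MyType("a", "b"), MyType("c", "d")).toDS().write.json(path)

  // Before the fix, this failed with: cannot resolve '`field$u002E1`' ...
  val roundTripped = spark.read.json(path).as[MyType].collect()
  assert(roundTripped.toSet == Set(MyType("a", "b"), MyType("c", "d")))

  spark.stop()
}
```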

viirya (Member, Author) commented Nov 13, 2017

cc @cloud-fan This is the backport of SPARK-22442 to branch-2.2.

SparkQA commented Nov 13, 2017

Test build #83763 has finished for PR 19734 at commit 8d3fd95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

felixcheung (Member)

OK, thanks. I'm about to tag 2.2.1. Technically this isn't a regression, but I could wait a few hours (I need to wait for the Jenkins build from the branch) if we can merge this ASAP.

felixcheung (Member)

merged to 2.2

asfgit pushed a commit that referenced this pull request Nov 13, 2017

[SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct field names for special characters

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19734 from viirya/SPARK-22442-2.2.
viirya (Member, Author) commented Nov 13, 2017

Thanks @felixcheung

viirya closed this Nov 13, 2017
felixcheung (Member)
viirya (Member, Author) commented Nov 13, 2017

@felixcheung Yes. Looking into it.

viirya (Member, Author) commented Nov 13, 2017

`val TermName: TermNameExtractor` is new in Scala 2.11. For 2.10, we should use the deprecated `newTermName`. I will submit a follow-up.
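For context, this is the source incompatibility in question (a minimal sketch; the version split shown in the comments is illustrative):

```scala
import scala.reflect.runtime.universe._

// Scala 2.11+: TermName is available as a constructor/extractor.
val fieldName = TermName("field")

// Scala 2.10 has no TermName constructor; the equivalent call is
// newTermName("field"), which 2.11 keeps only in deprecated form.
// val fieldName = newTermName("field")
```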

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018

[SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct field names for special characters

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#19734 from viirya/SPARK-22442-2.2.
viirya deleted the SPARK-22442-2.2 branch December 27, 2023