[SPARK-30288][SQL]Remove Parquet Column Name Check by iRakson · Pull Request #26945 · apache/spark

iRakson · 2019-12-19T06:39:49Z

What changes were proposed in this pull request?

Removed the old check for column names when creating a parquet table.

Before changes:

scala> Seq(100).toDF("a b").write.parquet("/tmp/foo")
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

After changes:

scala> Seq(100).toDF("a b").write.parquet("/tmp/dir")

scala> spark.read.parquet("/tmp/dir").show()
+---+
|a b|
+---+
|100|
+---+

scala> Seq(100).toDF("a=b").write.parquet("/tmp/dir2")

scala> spark.read.parquet("/tmp/dir2").show()
+---+
|a=b|
+---+
|100|
+---+

scala> Seq(100).toDF("(a;b)").write.parquet("/tmp/dir3")

scala> spark.read.parquet("/tmp/dir3").show()
+-----+
|(a;b)|
+-----+
|  100|
+-----+

Why are the changes needed?

Parquet now supports all the special characters that we were previously checking for. Parquet originally threw errors when these characters were used, so this validity check was introduced. Parquet no longer throws an exception for column names with special characters.

In the JIRA ticket, one user also pasted the output from creating Parquet tables in pandas, which supports special characters in column names.

Does this PR introduce any user-facing change?

Yes. Users will now be able to create Parquet tables with special characters in column names.

How was this patch tested?

Manually.
Will add unit tests soon.

AmplabJenkins · 2019-12-19T07:14:30Z

Can one of the admins verify this patch?

iRakson · 2019-12-19T08:00:06Z

cc @HyukjinKwon

dongjoon-hyun · 2019-12-19T20:15:46Z

...c/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala

-  def checkFieldName(name: String): Unit = {
-    // ,;{}()\n\t= and space are special characters in Parquet schema
-    checkConversionRequirement(
-      !name.matches(".*[ ,;{}()\n\t=].*"),


@iRakson Could you run the UT locally for SQL module related to this change?

dongjoon-hyun · 2019-12-19T20:23:04Z

cc @rdblue , @zsxwing , @gengliangwang

rdblue · 2019-12-19T21:27:51Z

The characters that were checked in checkFieldName are token delimiters in Parquet's schema parser: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java#L48

This change will work for Spark because Parquet files don't use the IDL form of a schema in file metadata (instead it is converted to Thrift objects). But this change will allow users to create schemas that can't be used by anything that requires parsing a Parquet message type, including Parquet's InputFormat. This will break the string schema representation used commonly to pass a schema in Configuration.

I'd recommend against this, unless someone updates that parser. I'd support making that change in upstream Parquet and then removing the check here. (I used to think this was a bad idea because it would break Avro, but I've changed my mind.)
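To make the delimiter concern above concrete, here is a minimal editorial sketch, not code from parquet-mr or from this PR. It assumes the string form of a schema is split on the same character set named in the removed check's comment; `DelimiterSketch`, `render`, and `tokens` are hypothetical names, and the rendered schema text is a simplified stand-in for a real message type.

```scala
import java.util.StringTokenizer
import scala.collection.mutable.ListBuffer

object DelimiterSketch {
  // Character set from the removed check's comment: ,;{}()\n\t= and space.
  // Assumption: a string-schema parser treats these as token delimiters.
  val Delimiters = " ,;{}()\n\t="

  // Render a one-field schema in a simplified message-type text form.
  def render(fieldName: String): String =
    s"message spark_schema { optional int32 $fieldName; }"

  // Split the schema text on the delimiters, dropping the delimiters themselves.
  def tokens(schema: String): List[String] = {
    val st = new StringTokenizer(schema, Delimiters)
    val out = ListBuffer.empty[String]
    while (st.hasMoreTokens) out += st.nextToken()
    out.toList
  }
}
```

With a field named `ab`, the text splits into five tokens ending in the field name. With `a b` or `a=b`, the name itself is split into two tokens, so a parser reading the string form can no longer recover the original field name.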

dongjoon-hyun · 2019-12-19T21:44:35Z

Thank you, @rdblue .
Given the above advice, I'll close this PR for now, @iRakson . You can reopen this later.

HyukjinKwon · 2019-12-20T01:36:01Z

+1 for following the advice.

brettplarson · 2021-04-29T14:07:57Z

Hello and thanks for making this MR.

Is there any long-term guidance on how columns should be named when Spark is used? Is this documented anywhere in either the Parquet or Spark docs? I am having a hard time finding any specific guidance on naming columns.
Is there a long term plan to address this by the Spark team?

The problem is that people use pandas to create a DataFrame with such an "invalid" name, and it doesn't become an issue until the data is written to Parquet from Spark, which could happen when a project is already pretty far along.

Please let me know,
Thank you!
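Since the check stays in place, one possible workaround is to rename columns before writing, as the error message itself suggests. The following is a minimal sketch, not an official Spark or Parquet recommendation. `ColumnNames`, `isValid`, and `sanitize` are hypothetical helper names, and replacing each rejected character with an underscore is an arbitrary choice.

```scala
object ColumnNames {
  // Characters rejected by the check quoted in this PR: space , ; { } ( ) \n \t =
  val InvalidChars: Set[Char] = " ,;{}()\n\t=".toSet

  // True when the name contains none of the rejected characters.
  def isValid(name: String): Boolean = !name.exists(InvalidChars)

  // Replace every rejected character with an underscore (hypothetical policy).
  def sanitize(name: String): String =
    name.map(c => if (InvalidChars(c)) '_' else c)
}
```

With a DataFrame `df`, the renaming could be applied as `df.toDF(df.columns.map(ColumnNames.sanitize): _*)` before calling `write.parquet`. Different original names can collapse to the same sanitized name (for example `a b` and `a=b` both become `a_b`), so a real pipeline would need to detect and resolve such collisions.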

SPARK-30288

7a02883

dongjoon-hyun reviewed Dec 19, 2019

dongjoon-hyun added the SQL label Dec 19, 2019

dongjoon-hyun closed this Dec 19, 2019

iRakson wants to merge 1 commit into apache:master from iRakson:SPARK-30288

6 participants
