Skip to content

[SPARK-30288][SQL]Remove Parquet Column Name Check#26945

Closed
iRakson wants to merge 1 commit intoapache:masterfrom
iRakson:SPARK-30288
Closed

[SPARK-30288][SQL]Remove Parquet Column Name Check#26945
iRakson wants to merge 1 commit intoapache:masterfrom
iRakson:SPARK-30288

Conversation

@iRakson
Copy link
Contributor

@iRakson iRakson commented Dec 19, 2019

What changes were proposed in this pull request?

Removed the old check for column names when creating a parquet table.

Before Changes:

scala> Seq(100).toDF("a b").write.parquet("/tmp/foo")
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

After changes:

scala> Seq(100).toDF("a b").write.parquet("/tmp/dir")

scala> spark.read.parquet("/tmp/dir").show()
+---+
|a b|
+---+
|100|
+---+
scala> Seq(100).toDF("a=b").write.parquet("/tmp/dir2")

scala> spark.read.parquet("/tmp/dir2").show()
+---+
|a=b|
+---+
|100|
+---+

scala> Seq(100).toDF("(a;b)").write.parquet("/tmp/dir3")

scala> spark.read.parquet("/tmp/dir3").show()
+---+
|(a;b)|
+---+
|100|
+---+

Why are the changes needed?

Now parquet supports all the special characters that we were checking for previously. Initially parquet used to throw errors while using these special characters. So this validity check was introduced. Now parquet do not throw any exception for column names with special characters.

In JIRA also, one of user has pasted the output when creating parquet tables in pandas. There it supports special characters in column names.

Does this PR introduce any user-facing change?

Yes. Now Users will be able to create parquet tables with special characters in column names.

How was this patch tested?

Manually.
Will add unit tests soon.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@iRakson
Copy link
Contributor Author

iRakson commented Dec 19, 2019

cc @HyukjinKwon

def checkFieldName(name: String): Unit = {
// ,;{}()\n\t= and space are special characters in Parquet schema
checkConversionRequirement(
!name.matches(".*[ ,;{}()\n\t=].*"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iRakson Could you run the UT locally for SQL module related to this change?

@dongjoon-hyun
Copy link
Member

cc @rdblue , @zsxwing , @gengliangwang

@rdblue
Copy link
Contributor

rdblue commented Dec 19, 2019

The characters that were checked in checkFieldName are token delimiters in Parquet's schema parser: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java#L48

This change will work for Spark because Parquet files don't use the IDL form of a schema in file metadata (instead it is converted to Thrift objects). But this change will allow users to create schemas that can't be used by anything that requires parsing a Parquet message type, including Parquet's InputFormat. This will break the string schema representation used commonly to pass a schema in Configuration.

I'd recommend against this, unless someone updates that parser. I'd support making that change in upstream Parquet and then removing the check here. (I used to think this was a bad idea because it would break Avro, but I've changed my mind.)

@dongjoon-hyun
Copy link
Member

Thank you, @rdblue .
Given the above the advice, I'll close this PR for now, @iRakson . You can reopen this later.

@HyukjinKwon
Copy link
Member

+1 for following the advice.

@brettplarson
Copy link

Hello and thanks for making this MR.

  • Is there any long term guidance on how column names should be labeled when Spark is used? Is this documented anywhere in either the parquet or spark docs? I am having a hard time finding any specific information on guidance on naming columns.
  • Is there a long term plan to address this by the Spark team?

The problem is that people will use pandas and create a dataframe with this "invalid" name, but then this doesn't become an issue until it's written to parquet from Spark which could potentially happen after a project is pretty far along.

Please let me know,
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants