
[SPARK-27442][SQL] Remove check field name when reading/writing data in parquet #35229

Closed
wants to merge 20 commits

Conversation

@AngersZhuuuu (Contributor) commented Jan 17, 2022

What changes were proposed in this pull request?

Spark should remove the field name check when reading/writing parquet files.

Why are the changes needed?

Support Spark reading existing parquet files with special chars in column names.

Does this PR introduce any user-facing change?

For formats such as Parquet, users can use Spark to read existing files with special chars in column names. They can then wrap the special column name in backquotes, such as `max(t)`, or alias it with `max(t)` as `max_t` and refer to `max_t`.

How was this patch tested?

Added UT

@github-actions github-actions bot added the SQL label Jan 17, 2022
@LuciferYang (Contributor)

Remove check filename when reading data

filename? fieldname?

@srowen (Member) commented Jan 17, 2022

Dumb question, but aren't they prohibited because they'd cause problems as col names in a Spark DataFrame? or no?

@AngersZhuuuu (Contributor, Author)

Remove check filename when reading data

filename? fieldname?

Yea..

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check filename when reading data [SPARK-27442][SQL] Remove check field name when reading data Jan 17, 2022
@AngersZhuuuu (Contributor, Author)

Dumb question, but aren't they prohibited because they'd cause problems as col names in a Spark DataFrame? or no?

We can use backquotes or `AS`. Also, since we can already generate a DataFrame with invalid col names and only fail on write because of the field name check, it won't be a problem.

@HyukjinKwon (Member)

These special characters are disallowed in Parquet side if I remember correctly. Can we double check what special chars are disallowed in Parquet side, and keep the check here?

@HyukjinKwon (Member)

See https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java#L48 as an example. Also dot is not supported either in Parquet (PARQUET-1809).

@HyukjinKwon HyukjinKwon changed the title [SPARK-27442][SQL] Remove check field name when reading data [SPARK-27442][SQL] Remove check field name when reading data in Parquet Jan 18, 2022
@AngersZhuuuu (Contributor, Author) commented Jan 18, 2022

These special characters are disallowed in Parquet side if I remember correctly. Can we double check what special chars are disallowed in Parquet side, and keep the check here?

This PR just supports reading existing parquet files; it won't allow writing such parquet files. If users have existing parquet data written by another system, we can support reading such files and then handle them following Spark's rules.

@cloud-fan (Contributor)

We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what Parquet spec says. Can we use some third-party library to generate such Parquet files? Also cc @sunchao

@AngersZhuuuu (Contributor, Author)

We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what Parquet spec says. Can we use some third-party library to generate such Parquet files? Also cc @sunchao

UT added

@HyukjinKwon (Member) commented Jan 18, 2022

Hey, we should at least disallow ., and should have a proper error message for Parquet specifically per PARQUET-1809 because it doesn't work with reading.

Also, I think we should at least know how the files can be generated before merging this. How were these files created if they did not use Parquet I/O library to write?

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check field name when reading data in Parquet [SPARK-27442][SQL] Remove check field name when reading data Jan 18, 2022
@AngersZhuuuu (Contributor, Author)

Also, I think we should at least know how the files can be generated before merging this. How were these files created if they did not use Parquet I/O library to write?

I created the test file by disabling field name checking on the write side. So users may be reading data written by an old Spark version, another system, or the parquet I/O library directly.

withResourceTempPath("test-data/field_with_invalid_char.snappy.parquet") { dir =>
  val df = spark.read.parquet(dir.getAbsolutePath)
  checkAnswer(df, Row(1, 2, 3) :: Nil)
  assert(df.schema.names.sameElements(Array("max(t)", "a b", "{")))
}
@cloud-fan (Contributor) commented Jan 18, 2022

can we include dot as well? I'm wondering if the parquet lib forbids dot or not during writing.

Contributor Author

can we include dot as well? I'm wondering if the parquet lib forbids dot or not during writing.

Added

@ghost commented Jan 18, 2022

We've been hit by this, the C++ arrow::field API won't limit you on the characters you put in a field name. You can then arrow::Table::Make a table using that field and parquet will happily parquet::arrow::WriteTable it.

@HyukjinKwon (Member) commented Jan 18, 2022

@AngersZhuuuu, please update PR description:

It's OK for Spark to forbid special chars in column names, but when we read existing parquet files, there is no point in forbidding them on the Spark side. This PR removes the field name check when reading existing files.

For Parquet, users can use Spark to read existing files with special chars in column names. They can then use backquotes to wrap a special column name such as max(t), or use max(t) as max_t and then refer to max_t.

This PR not only affects Parquet but also other sources that implement FileFormat.supportFieldName, such as Avro and ORC.

@AngersZhuuuu (Contributor, Author)

@AngersZhuuuu, please update PR description:

It's OK for Spark to forbid special chars in column names, but when we read existing parquet files, there is no point in forbidding them on the Spark side. This PR removes the field name check when reading existing files.

For Parquet, users can use Spark to read existing files with special chars in column names. They can then use backquotes to wrap a special column name such as max(t), or use max(t) as max_t and then refer to max_t.

This PR not only affects Parquet but also other sources that implement FileFormat.supportFieldName, such as Avro and ORC.

Done

@sunchao (Member) commented Jan 18, 2022

We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what Parquet spec says. Can we use some third-party library to generate such Parquet files? Also cc @sunchao

Sorry for chiming in late. Yes I believe other implementations such as C++/Rust don't put this restriction so we can use them to generate test files. Nice to see @AngersZhuuuu already found a solution.

@@ -434,7 +434,8 @@ case class DataSource(
         hs.partitionSchema,
         "in the partition schema",
         equality)
-      DataSourceUtils.verifySchema(hs.fileFormat, hs.dataSchema)
+      DataSourceUtils.verifySchema(hs.fileFormat, hs.dataSchema,
+        !hs.fileFormat.isInstanceOf[ParquetFileFormat])
Member

Hmm I think this doesn't compile?

@@ -81,12 +81,16 @@ object DataSourceUtils extends PredicateHelper {
    * in a driver side.
    */
   def verifySchema(format: FileFormat, schema: StructType): Unit = {
+    checkFieldType(format, schema)
+    checkFieldNames(format, schema)
Member

Does this PR change anything? it looks like a refactoring by pulling out part of verifySchema as a separate method checkFieldType.

Contributor Author

Does this PR change anything? it looks like a refactoring by pulling out part of verifySchema as a separate method checkFieldType.

oh, sorry, one file was not included when I committed the change

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check field name when reading data [SPARK-27442][SQL] Remove check field name when reading existing parquet data Jan 19, 2022
@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check field name when reading existing parquet data [SPARK-27442][SQL] Remove check field name when reading existing data in parquet Jan 19, 2022
@cloud-fan (Contributor)

Yes I believe other implementations such as C++/Rust don't put this restriction so we can use them to generate test files.

Ah good to know it. Then I think a simple change is to just remove ParquetFileFormat.supportFieldName so that we don't check names in both read and write. It also simplifies the test as we can just write a roundtrip test.

@AngersZhuuuu (Contributor, Author) commented Jan 19, 2022

Found the historical commit that introduced this check:
8ab5076#diff-efd57ba9a1b4b5809a10098791ce894ff9edf12236f7da0d61e0e8f8c549d3cc
And this check comes from MessageTypeParser.Tokenizer:

  private static class Tokenizer {
    private StringTokenizer st;
    private int line = 0;
    private StringBuilder currentLine = new StringBuilder();

    public Tokenizer(String schemaString, String string) {
      this.st = new StringTokenizer(schemaString, " ,;{}()\n\t=", true);
    }
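The delimiter string passed to StringTokenizer above (space, `,;{}()`, newline, tab, `=`) is exactly why these characters were treated as invalid: a field name containing any of them would be split into multiple tokens when MessageTypeParser re-reads a schema string. A rough Python stand-in for that tokenization (illustrative only; this is neither Spark nor parquet-mr code):

```python
import re

# Delimiter set from MessageTypeParser.Tokenizer. Java's StringTokenizer
# with returnDelims=true emits each delimiter character as its own token.
DELIMS = " ,;{}()\n\t="

def tokenize(schema_string):
    # re.split with a capturing group keeps the delimiters, mirroring
    # returnDelims=true; empty strings between adjacent delimiters are dropped.
    pattern = "([" + re.escape(DELIMS) + "])"
    return [tok for tok in re.split(pattern, schema_string) if tok]

# A field name like "max(t)" is shredded into several tokens, so the
# parser cannot round-trip such names in a schema string.
print(tokenize("max(t)"))  # → ['max', '(', 't', ')']
print(tokenize("a b"))     # → ['a', ' ', 'b']
```

This also explains why the check lives in the schema parser rather than in the Parquet file format itself: the binary footer stores names verbatim, only the textual schema syntax is ambiguous.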

@@ -4243,6 +4243,18 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       checkAnswer(df3, df4)
     }
   }
 
+  test("SPARK-27442: Spark support read parquet file with invalid char in field name") {
+    withResourceTempPath("test-data/field_with_invalid_char.snappy.parquet") { dir =>
Contributor

We can write the parquet file in the test, instead of generating it ahead.

Contributor Author

We can write the parquet file in the test, instead of generating it ahead.

Updated

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check field name when reading existing data in parquet [SPARK-27442][SQL] Remove check field name when reading/writing data in parquet Jan 19, 2022
@AngersZhuuuu (Contributor, Author)

ping @cloud-fan All related tests removed. For the supported-field-name check, the remaining tests are in the Avro module.

@HyukjinKwon (Member)

Looks fine to me. cc @liancheng FYI

|INSERT OVERWRITE LOCAL DIRECTORY '${path.getCanonicalPath}'
|STORED AS PARQUET
|SELECT
|NAMED_STRUCT('ID', ID, 'IF(ID=1,ID,0)', IF(ID=1,ID,0), 'B', ABS(ID)) AS col1
Contributor

Does Hive Serde fail for this case as well?

Contributor Author

Does Hive Serde fail for this case as well?

Yea, updated

@cloud-fan (Contributor)

An interesting finding is that Hive Parquet Serde has this field name limitation, but Parquet format does not. This Spark behavior is probably copied from Hive originally.

}
}

test("SPARK-36312: ParquetWriteSupport should check inner field") {
Contributor Author

No exception thrown by Hive serde; it seems Hive serde won't check inner fields. So just remove this. cc @cloud-fan

@cloud-fan (Contributor)

thanks, merging to master!
