
[SPARK-27442][SQL] Remove check field name when reading/writing data in parquet #35229

Closed
wants to merge 20 commits

Conversation

@AngersZhuuuu (Contributor) commented Jan 17, 2022

What changes were proposed in this pull request?

Spark should remove the field name check when reading/writing parquet files.

Why are the changes needed?

Support Spark reading existing parquet files with special chars in column names.

Does this PR introduce any user-facing change?

For formats such as Parquet, users can use Spark to read existing files with special chars in column names. They can then wrap the special column name in backquotes, such as `max(t)`, or alias it with `max(t)` as `max_t` and refer to `max_t`.

How was this patch tested?

Added UT

@github-actions github-actions bot added the SQL label Jan 17, 2022
@LuciferYang (Contributor)

Remove check filename when reading data

filename? fieldname?

@srowen (Member) commented Jan 17, 2022

Dumb question, but aren't they prohibited because they'd cause problems as col names in a Spark DataFrame? or no?

@AngersZhuuuu (Contributor, Author)

Remove check filename when reading data

filename? fieldname?

Yea..

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check filename when reading data [SPARK-27442][SQL] Remove check field name when reading data Jan 17, 2022
@AngersZhuuuu (Contributor, Author)

Dumb question, but aren't they prohibited because they'd cause problems as col names in a Spark DataFrame? or no?

We can use backquotes or `AS`. Also, since we can already generate a DataFrame with invalid col names and only fail on write because of the field name check, it won't be a problem.

@HyukjinKwon (Member)

These special characters are disallowed in Parquet side if I remember correctly. Can we double check what special chars are disallowed in Parquet side, and keep the check here?

@HyukjinKwon (Member)

See https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java#L48 as an example. Also dot is not supported either in Parquet (PARQUET-1809).

@HyukjinKwon HyukjinKwon changed the title [SPARK-27442][SQL] Remove check field name when reading data [SPARK-27442][SQL] Remove check field name when reading data in Parquet Jan 18, 2022
@AngersZhuuuu (Contributor, Author) commented Jan 18, 2022

These special characters are disallowed in Parquet side if I remember correctly. Can we double check what special chars are disallowed in Parquet side, and keep the check here?

This PR just supports reading existing parquet files; it won't allow writing such parquet files. If users have existing parquet data written by another system, we can support reading such files and then handle them following Spark's rules.

@cloud-fan (Contributor)

We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what Parquet spec says. Can we use some third-party library to generate such Parquet files? Also cc @sunchao

@AngersZhuuuu (Contributor, Author)

We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what Parquet spec says. Can we use some third-party library to generate such Parquet files? Also cc @sunchao

UT added

@HyukjinKwon (Member) commented Jan 18, 2022

Hey, we should at least disallow ., and should have a proper error message for Parquet specifically per PARQUET-1809 because it doesn't work with reading.

Also, I think we should at least know how the files can be generated before merging this. How were these files created if they did not use Parquet I/O library to write?

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check field name when reading data in Parquet [SPARK-27442][SQL] Remove check field name when reading data Jan 18, 2022
@AngersZhuuuu (Contributor, Author)

Also, I think we should at least know how the files can be generated before merging this. How were these files created if they did not use Parquet I/O library to write?

I created the test file by disabling field name checking on the write side. So users may be reading data written by an old Spark version, another system, or the parquet I/O library directly.

withResourceTempPath("test-data/field_with_invalid_char.snappy.parquet") { dir =>
  val df = spark.read.parquet(dir.getAbsolutePath)
  checkAnswer(df, Row(1, 2, 3) :: Nil)
  assert(df.schema.names.sameElements(Array("max(t)", "a b", "{")))
}
@cloud-fan (Contributor) commented Jan 18, 2022

can we include dot as well? I'm wondering if the parquet lib forbids dot or not during writing.

Contributor Author

can we include dot as well? I'm wondering if the parquet lib forbids dot or not during writing.

Added

@ghost commented Jan 18, 2022

We've been hit by this, the C++ arrow::field API won't limit you on the characters you put in a field name. You can then arrow::Table::Make a table using that field and parquet will happily parquet::arrow::WriteTable it.

@HyukjinKwon (Member) commented Jan 18, 2022

@AngersZhuuuu, please update PR description:

It's OK for Spark to forbid special chars in column names, but when we read existing parquet files, there is no point in forbidding them on the Spark side. This PR removes the field name check when reading existing files.

For Parquet, users can use Spark to read existing files with special chars in column names. They can then use backquotes to wrap a special column name such as max(t), or use max(t) as max_t and then refer to max_t.

This PR not only affects Parquet but also other sources that implement FileFormat.supportFieldName, such as Avro and ORC.

@AngersZhuuuu (Contributor, Author)

@AngersZhuuuu, please update PR description:

It's OK for Spark to forbid special chars in column names, but when we read existing parquet files, there is no point in forbidding them on the Spark side. This PR removes the field name check when reading existing files.

For Parquet, users can use Spark to read existing files with special chars in column names. They can then use backquotes to wrap a special column name such as max(t), or use max(t) as max_t and then refer to max_t.

This PR not only affects Parquet but also other sources that implement FileFormat.supportFieldName, such as Avro and ORC.

Done

@sunchao (Member) commented Jan 18, 2022

We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what Parquet spec says. Can we use some third-party library to generate such Parquet files? Also cc @sunchao

Sorry for chiming in late. Yes I believe other implementations such as C++/Rust don't put this restriction so we can use them to generate test files. Nice to see @AngersZhuuuu already found a solution.

@@ -434,7 +434,8 @@ case class DataSource(
         hs.partitionSchema,
         "in the partition schema",
         equality)
-      DataSourceUtils.verifySchema(hs.fileFormat, hs.dataSchema)
+      DataSourceUtils.verifySchema(hs.fileFormat, hs.dataSchema,
+        !hs.fileFormat.isInstanceOf[ParquetFileFormat])
Member

Hmm I think this doesn't compile?

@@ -81,12 +81,16 @@ object DataSourceUtils extends PredicateHelper {
    * in a driver side.
    */
   def verifySchema(format: FileFormat, schema: StructType): Unit = {
+    checkFieldType(format, schema)
+    checkFieldNames(format, schema)
Member

Does this PR change anything? it looks like a refactoring by pulling out part of verifySchema as a separate method checkFieldType.

Contributor Author

Does this PR change anything? it looks like a refactoring by pulling out part of verifySchema as a separate method checkFieldType.

oh, sorry, one file was not included when I committed the change

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check field name when reading data [SPARK-27442][SQL] Remove check field name when reading existing parquet data Jan 19, 2022
@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check field name when reading existing parquet data [SPARK-27442][SQL] Remove check field name when reading existing data in parquet Jan 19, 2022
@cloud-fan (Contributor)

Yes I believe other implementations such as C++/Rust don't put this restriction so we can use them to generate test files.

Ah good to know it. Then I think a simple change is to just remove ParquetFileFormat.supportFieldName so that we don't check names in both read and write. It also simplifies the test as we can just write a roundtrip test.

@AngersZhuuuu (Contributor, Author) commented Jan 19, 2022

Found the historical commit that introduced this check:
8ab5076#diff-efd57ba9a1b4b5809a10098791ce894ff9edf12236f7da0d61e0e8f8c549d3cc
And this check comes from MessageTypeParser.Tokenizer:

  private static class Tokenizer {
    private StringTokenizer st;
    private int line = 0;
    private StringBuilder currentLine = new StringBuilder();

    public Tokenizer(String schemaString, String string) {
      this.st = new StringTokenizer(schemaString, " ,;{}()\n\t=", true);
    }
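The delimiter string passed to StringTokenizer above (space, `,;{}()`, newline, tab, `=`) is exactly why these characters were treated as invalid: a field name containing any of them would be split into multiple tokens when MessageTypeParser re-reads a schema string. A rough Python stand-in for that tokenization (illustrative only; this is neither Spark nor parquet-mr code):

```python
import re

# Delimiter set from MessageTypeParser.Tokenizer. Java's StringTokenizer
# with returnDelims=true emits each delimiter character as its own token.
DELIMS = " ,;{}()\n\t="

def tokenize(schema_string):
    # re.split with a capturing group keeps the delimiters, mirroring
    # returnDelims=true; empty strings between adjacent delimiters are dropped.
    pattern = "([" + re.escape(DELIMS) + "])"
    return [tok for tok in re.split(pattern, schema_string) if tok]

# A field name like "max(t)" is shredded into several tokens, so the
# parser cannot round-trip such names in a schema string.
print(tokenize("max(t)"))  # → ['max', '(', 't', ')']
print(tokenize("a b"))     # → ['a', ' ', 'b']
```

This also explains why the check lives in the schema parser rather than in the Parquet file format itself: the binary footer stores names verbatim, only the textual schema syntax is ambiguous.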

@@ -4243,6 +4243,18 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       checkAnswer(df3, df4)
     }
   }
 
+  test("SPARK-27442: Spark support read parquet file with invalid char in field name") {
+    withResourceTempPath("test-data/field_with_invalid_char.snappy.parquet") { dir =>
Contributor

We can write the parquet file in the test, instead of generating it ahead.

Contributor Author

We can write the parquet file in the test, instead of generating it ahead.

Updated

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-27442][SQL] Remove check field name when reading existing data in parquet [SPARK-27442][SQL] Remove check field name when reading/writing data in parquet Jan 19, 2022
@AngersZhuuuu (Contributor, Author)

ping @cloud-fan All related tests removed. For the supported-field-name check, the remaining tests are in the Avro module.

@HyukjinKwon (Member)

Looks fine to me. cc @liancheng FYI

|INSERT OVERWRITE LOCAL DIRECTORY '${path.getCanonicalPath}'
|STORED AS PARQUET
|SELECT
|NAMED_STRUCT('ID', ID, 'IF(ID=1,ID,0)', IF(ID=1,ID,0), 'B', ABS(ID)) AS col1
Contributor

Does Hive Serde fail for this case as well?

Contributor Author

Does Hive Serde fail for this case as well?

Yea, updated

@cloud-fan (Contributor)

An interesting finding is that Hive Parquet Serde has this field name limitation, but Parquet format does not. This Spark behavior is probably copied from Hive originally.

}
}

test("SPARK-36312: ParquetWriteSupport should check inner field") {
Contributor Author

No exception thrown by Hive serde; it seems Hive serde won't check inner fields. So just remove this. cc @cloud-fan

@cloud-fan (Contributor)

thanks, merging to master!
