[SPARK-27442][SQL] Remove check field name when reading/writing data in parquet #35229
Conversation
filename? fieldname?
Dumb question, but aren't they prohibited because they'd cause problems as col names in a Spark DataFrame? or no?
Yea..
We can use back quote or
These special characters are disallowed on the Parquet side, if I remember correctly. Can we double-check which special chars are disallowed on the Parquet side, and keep the check here?
See https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java#L48 as an example. Also dot is not supported either in Parquet (PARQUET-1809). |
This PR only supports reading existing Parquet files; it won't allow writing such files. If users have existing Parquet data written by another system, we can support reading such files and then handle them following Spark's rules.
We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what the Parquet spec says. Can we use some third-party library to generate such Parquet files? Also cc @sunchao
UT added
Hey, we should at least disallow Also, I think we should at least know how the files can be generated before merging this. How were these files created if they did not use a Parquet I/O library to write?
I created the test file by disabling field name checking on the write side, so users may be reading data written by an old Spark version, another system, or a Parquet I/O library directly.
withResourceTempPath("test-data/field_with_invalid_char.snappy.parquet") { dir =>
  val df = spark.read.parquet(dir.getAbsolutePath)
  checkAnswer(df, Row(1, 2, 3) :: Nil)
  assert(df.schema.names.sameElements(Array("max(t)", "a b", "{")))
can we include dot as well? I'm wondering if the parquet lib forbids dot or not during writing.
Added
We've been hit by this, the C++
@AngersZhuuuu, please update PR description:
This PR affects not only Parquet but also other sources that implement
Done
Sorry for chiming in late. Yes, I believe other implementations such as C++/Rust don't put this restriction, so we can use them to generate test files. Nice to see @AngersZhuuuu already found a solution.
@@ -434,7 +434,8 @@ case class DataSource(
       hs.partitionSchema,
       "in the partition schema",
       equality)
-    DataSourceUtils.verifySchema(hs.fileFormat, hs.dataSchema)
+    DataSourceUtils.verifySchema(hs.fileFormat, hs.dataSchema,
+      !hs.fileFormat.isInstanceOf[ParquetFileFormat])
Hmm I think this doesn't compile?
@@ -81,12 +81,16 @@ object DataSourceUtils extends PredicateHelper {
   * in a driver side.
   */
  def verifySchema(format: FileFormat, schema: StructType): Unit = {
    checkFieldType(format, schema)
    checkFieldNames(format, schema)
Does this PR change anything? It looks like a refactoring by pulling out part of verifySchema as a separate method checkFieldType.
Oh, sorry, one file was not included when committing the change.
Ah, good to know. Then I think a simple change is to just remove
Found the historical commit of this check.
@@ -4243,6 +4243,18 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
      checkAnswer(df3, df4)
    }
  }

  test("SPARK-27442: Spark support read parquet file with invalid char in field name") {
    withResourceTempPath("test-data/field_with_invalid_char.snappy.parquet") { dir =>
We can write the parquet file in the test, instead of generating it ahead.
Updated
ping @cloud-fan All related tests removed. For checking supported field names, tests remain in
Looks fine to me. cc @liancheng FYI
|INSERT OVERWRITE LOCAL DIRECTORY '${path.getCanonicalPath}'
|STORED AS PARQUET
|SELECT
|NAMED_STRUCT('ID', ID, 'IF(ID=1,ID,0)', IF(ID=1,ID,0), 'B', ABS(ID)) AS col1
Does Hive Serde fail for this case as well?
Yea, updated
An interesting finding is that the Hive Parquet Serde has this field name limitation, but the Parquet format does not. This Spark behavior was probably copied from Hive originally.
    }
  }

  test("SPARK-36312: ParquetWriteSupport should check inner field") {
No exception is thrown by the Hive Serde; it seems the Hive Serde won't check inner fields. So just remove this test. cc @cloud-fan
thanks, merging to master!
What changes were proposed in this pull request?
Remove the field name check in Spark when reading/writing Parquet files.
Why are the changes needed?
Support Spark reading existing Parquet files with special chars in column names.
Does this PR introduce any user-facing change?
Yes. For sources such as Parquet, users can read existing files with special chars in column names, and then either wrap a special column name in back quotes, such as `max(t)`, or rename it with `max(t)` as `max_t` and refer to max_t afterwards.
How was this patch tested?
Added UT