[SPARK-28651][SS] Force the schema of Streaming file source to be nullable #25382
zsxwing wants to merge 2 commits into apache:master
Conversation
Test build #108787 has finished for PR 25382 at commit
HyukjinKwon left a comment:
Yea, there's an inconsistency between the batch and streaming sources. I kind of tried to fix it a long time ago but ended up leaving it as is.
For a temp fix to match it to the batch side, I am fine with this; however, I believe ideally we should respect the nullability ...
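To make the inconsistency concrete, here is a minimal sketch (the path and the printed values are illustrative, assuming Spark before this fix) comparing what the same non-nullable user schema becomes in batch versus streaming:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("nullability-demo")
  .getOrCreate()

// A user-specified schema with a non-nullable column.
val userSchema = new StructType().add("id", LongType, nullable = false)

// Batch: the file source forces every field to nullable.
val batchDf = spark.read.schema(userSchema).json("/tmp/events")
println(batchDf.schema("id").nullable)   // true

// Streaming (before SPARK-28651): the user schema is taken as-is.
val streamDf = spark.readStream.schema(userSchema).json("/tmp/events")
println(streamDf.schema("id").nullable)  // false
```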
cc @cloud-fan and @liancheng. I believe I talked with you guys a long time ago.
eventually we should allow data sources to report their nullability, but Spark should add validation when reading data to make sure the nullability is correct.
Agreed with this. However, I think making the file source report the correct nullability is hard. It's either inferred or user-specified, both of which are error-prone. As Spark right now doesn't validate data to make sure the nullability is correct, it can lead to weird NPEs when processing bad data, or corrupted files when writing. It would be hard for a Spark user to debug such errors. I don't know why we added this behavior for batch queries. Ping @liancheng to comment on this since he added it in https://github.com/apache/spark/pull/6285/files#diff-3d26956194a9a58c7eca9b364395e0c2R249
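Continuing from the session above, a hedged sketch of the failure mode being described (the file path and data are hypothetical): nothing validates the declared nullability against the actual data, so code that trusts a non-nullable field can fail far from the real cause:

```scala
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types._

// The declared schema promises that `name` is never null...
val declared = new StructType().add("name", StringType, nullable = false)

// ...but suppose files arriving in /tmp/people contain {"name": null}.
// Before this fix, the streaming source kept nullable = false as-is,
// and nothing checked the promise against the data.
val people = spark.readStream.schema(declared).json("/tmp/people")

// The optimizer and generated code may elide null checks for fields
// declared non-nullable, so the null can surface later as a confusing
// NullPointerException, or as corrupted output when writing.
val query = people.select(upper(people("name")))
  .writeStream
  .format("console")
  .start()
```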
Unfortunately, I didn't add a comprehensive PR description or comments in the original PR to explain why we wanted a nullable schema (I guess we were rushing the 1.4 release back then)... From what I can recall, possible motivations were Hive compatibility and Parquet interoperability. The Hive metastore doesn't respect nullability and always has nullable columns. Parquet interoperability was in trouble in 1.4 due to ambiguity in the parquet-format spec: different systems were using different schemas when writing nested columns, and forcing the schema to be nullable helped resolve some of those cases (esp. for Hive). But Spark's Parquet interoperability has not been an issue since 1.6.
Test build #108840 has finished for PR 25382 at commit
Merged to master.
We should also update
What changes were proposed in this pull request?
Right now, a batch DataFrame always changes its schema to nullable automatically (see spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala, line 399 at commit 325bc8e), while the streaming file source uses the user-provided schema as-is, so batch and streaming queries over the same files can disagree on nullability. A rough sketch of what forcing a schema to nullable means is shown below.
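For readers unfamiliar with what that line does, here is a rough, hypothetical re-implementation of the idea (Spark's actual helper is a private asNullable method on its type hierarchy, not this function): every field, including nested ones, is marked as allowing nulls.

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch of "force a schema to nullable": recursively mark
// struct fields, array elements, and map values as allowing nulls.
def forceNullable(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(f =>
      f.copy(dataType = forceNullable(f.dataType), nullable = true)))
  case at: ArrayType =>
    at.copy(elementType = forceNullable(at.elementType), containsNull = true)
  case mt: MapType =>
    mt.copy(keyType = forceNullable(mt.keyType),
            valueType = forceNullable(mt.valueType),
            valueContainsNull = true)
  case other => other
}
```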
This PR updates the streaming file source schema to force it to be nullable. I also added a flag spark.sql.streaming.fileSource.schema.forceNullable to disable this change, since some users may rely on the old behavior.
How was this patch tested?
The new unit test.
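For illustration, a usage sketch of the flag described above (assuming a running SparkSession named spark, and that the flag is settable at session level like other SQL confs):

```scala
import org.apache.spark.sql.types._

// Opt out of the new behavior and keep the pre-fix semantics, where the
// streaming file source uses the user-provided schema as-is.
spark.conf.set("spark.sql.streaming.fileSource.schema.forceNullable", "false")

val strictSchema = new StructType().add("id", LongType, nullable = false)
val stream = spark.readStream.schema(strictSchema).json("/tmp/in")
println(stream.schema("id").nullable)  // false: old behavior preserved
```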