
[SPARK-28651][SS]Force the schema of Streaming file source to be nullable #25382

Closed
zsxwing wants to merge 2 commits into apache:master from zsxwing:SPARK-28651

Conversation

@zsxwing (Member) commented Aug 7, 2019

What changes were proposed in this pull request?

Right now, a batch DataFrame always changes the schema to be nullable automatically (see the linked line). But the streaming file source is missing this.

This PR updates the streaming file source schema to force it to be nullable. I also added a flag spark.sql.streaming.fileSource.schema.forceNullable to disable this change, since some users may rely on the old behavior.
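The core idea of the change can be sketched as "rewrite every field in the schema, recursively, with nullable = true". The following is a toy Python model of that transformation, not Spark code; the tuple-based schema representation and the as_nullable helper are hypothetical illustrations (Spark's own schemas are StructType trees):

```python
# Toy sketch: force every field in a (possibly nested) schema to be nullable,
# mirroring what Spark already does for batch file sources.
# A field is modeled as (name, dtype, nullable), where dtype is either a
# type-name string or a list of child fields (a struct). This representation
# is hypothetical, for illustration only.

def as_nullable(field):
    """Return a copy of a schema field with nullable=True, recursing into structs."""
    name, dtype, _nullable = field
    if isinstance(dtype, list):  # struct: recurse into its children
        dtype = [as_nullable(child) for child in dtype]
    return (name, dtype, True)

schema = [
    ("id", "long", False),
    ("event", [("ts", "timestamp", False), ("msg", "string", True)], False),
]
nullable_schema = [as_nullable(f) for f in schema]
```

After the rewrite, both top-level fields and the nested "ts" field are nullable, while names and types are untouched.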

How was this patch tested?

The new unit test.


SparkQA commented Aug 8, 2019

Test build #108787 has finished for PR 25382 at commit 3ed4942.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment


Yeah, there's an inconsistency between the batch and streaming sources. I tried to fix it a long time ago but ended up leaving it as is.

As a temporary fix to match it to the batch side, I am fine with this; however, I believe ideally we should respect the nullability ...

@HyukjinKwon (Member)

cc @cloud-fan and @liancheng. I believe I talked with you guys about this a long time ago.

@cloud-fan (Contributor)

Eventually we should allow data sources to report their nullability, but Spark should add validation when reading data to make sure the declared nullability is correct.

@zsxwing (Member, Author) commented Aug 8, 2019

ideally we should respect the nullability ...

Agreed with this. However, I think making the file source report the correct nullability is hard: it would either be inferred or user-specified, both of which are error-prone. Since Spark currently doesn't validate data to make sure the declared nullability is correct, a wrong declaration can lead to weird NPEs when processing bad data, or to corrupted files when writing. It would be hard for a Spark user to debug such errors.
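The read-time validation discussed above can be illustrated with a toy Python sketch. This is not Spark code; the schema representation and the validate_row function are hypothetical. The point is that checking declared nullability against actual data turns a confusing downstream NPE into an immediate, explainable error:

```python
# Toy sketch of read-time nullability validation (not Spark's implementation).
# A field is modeled as (name, nullable); a row is a tuple of values.

def validate_row(row, schema):
    """Raise a clear error if a value violates the declared nullability,
    instead of letting the null leak into code that assumes non-null."""
    for (name, nullable), value in zip(schema, row):
        if value is None and not nullable:
            raise ValueError(
                f"column '{name}' is declared non-nullable but contains null"
            )
    return row

schema = [("id", False), ("msg", True)]

validate_row((1, None), schema)   # ok: msg is declared nullable
try:
    validate_row((None, "x"), schema)
except ValueError as e:
    print(e)                      # id is non-nullable but got null
```

Without such a check, the bad row would only fail later, deep inside query execution, where the root cause is much harder to trace back to the schema declaration.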

I don't know why we added this behavior for batch queries. Pinging @liancheng to comment on this, since he added it in https://github.com/apache/spark/pull/6285/files#diff-3d26956194a9a58c7eca9b364395e0c2R249

@liancheng (Contributor) commented Aug 8, 2019

Unfortunately, I didn't add a comprehensive PR description or comments in the original PR to explain why we wanted a nullable schema (I guess we were rushing the 1.4 release back then)...

From what I can recall, the likely motivations were Hive compatibility and Parquet interoperability. The Hive metastore doesn't respect nullability and always has nullable columns. Parquet interoperability was problematic in 1.4 due to ambiguity in the parquet-format spec: different systems used different schemas when writing nested columns, and forcing the schema to be nullable helped resolve some of those cases (especially for Hive). However, Spark's Parquet interoperability has no longer been an issue since 1.6.


SparkQA commented Aug 8, 2019

Test build #108840 has finished for PR 25382 at commit b4878b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing zsxwing changed the title [SPARK-28651][SS]Force streaming file source to be nullable [SPARK-28651][SS]Force the schema of Streaming file source to be nullable Aug 8, 2019
@HyukjinKwon (Member)

Merged to master.

@zsxwing zsxwing deleted the SPARK-28651 branch August 9, 2019 17:41
@cloud-fan (Contributor)

We should also update DataSource.resolveRelation: when we batch-read files written by a streaming query, we still return a non-nullable schema.

