[SPARK-28651][SS] Force the schema of Streaming file source to be nullable #25382
zsxwing wants to merge 2 commits into apache:master
Conversation
Test build #108787 has finished for PR 25382 at commit
HyukjinKwon left a comment:
Yea, there's an inconsistency between the batch and streaming sources. I kind of tried to fix it a long time ago but ended up leaving it as is.
For a temp fix to match it to the batch side, I am fine with this; however, I believe ideally we should respect the nullability ...
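To make the inconsistency concrete, here is a minimal sketch (the path and the printed values are illustrative, assuming Spark before this fix) comparing what the same non-nullable user schema becomes in batch versus streaming:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("nullability-demo")
  .getOrCreate()

// A user-specified schema with a non-nullable column.
val userSchema = new StructType().add("id", LongType, nullable = false)

// Batch: the file source forces every field to nullable.
val batchDf = spark.read.schema(userSchema).json("/tmp/events")
println(batchDf.schema("id").nullable)   // true

// Streaming (before SPARK-28651): the user schema is taken as-is.
val streamDf = spark.readStream.schema(userSchema).json("/tmp/events")
println(streamDf.schema("id").nullable)  // false
```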
cc @cloud-fan and @liancheng. I believe I talked with you guys a long time ago.
eventually we should allow data sources to report their nullability, but Spark should add validation when reading data to make sure the nullability is correct.
Agreed with this. However, I think making the file source report the correct nullability is hard. It's either inferred or user-specified, both of which are error-prone. As Spark right now doesn't validate data to make sure the nullability is correct, it can lead to weird NPEs when processing bad data, or corrupted files when writing. It would be hard for a Spark user to debug such errors. I don't know why we added this behavior for batch queries. Ping @liancheng to comment on this since he added it in https://github.com/apache/spark/pull/6285/files#diff-3d26956194a9a58c7eca9b364395e0c2R249
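Continuing from the session above, a hedged sketch of the failure mode being described (the file path and data are hypothetical): nothing validates the declared nullability against the actual data, so code that trusts a non-nullable field can fail far from the real cause:

```scala
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types._

// The declared schema promises that `name` is never null...
val declared = new StructType().add("name", StringType, nullable = false)

// ...but suppose files arriving in /tmp/people contain {"name": null}.
// Before this fix, the streaming source kept nullable = false as-is,
// and nothing checked the promise against the data.
val people = spark.readStream.schema(declared).json("/tmp/people")

// The optimizer and generated code may elide null checks for fields
// declared non-nullable, so the null can surface later as a confusing
// NullPointerException, or as corrupted output when writing.
val query = people.select(upper(people("name")))
  .writeStream
  .format("console")
  .start()
```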
Unfortunately, I didn't add a comprehensive PR description or comments in the original PR to explain why we wanted a nullable schema (I guess we were rushing the 1.4 release back then)... From what I can recall, possible motivations were Hive compatibility and Parquet interoperability. The Hive metastore doesn't respect nullability and always has nullable columns. Parquet interoperability was in trouble in 1.4 due to ambiguity in the parquet-format spec: different systems were using different schemas when writing nested columns, and forcing the schema to be nullable helped resolve some of those cases (esp. for Hive). But Spark's Parquet interoperability has not been an issue since 1.6.
Test build #108840 has finished for PR 25382 at commit
Merged to master.
We should also update
What changes were proposed in this pull request?
Right now, a batch DataFrame always changes its schema to nullable automatically (see spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala, line 399 at commit 325bc8e), while the streaming file source uses the user-provided schema as-is, so batch and streaming queries over the same files can disagree on nullability. A rough sketch of what forcing a schema to nullable means is shown below.
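For readers unfamiliar with what that line does, here is a rough, hypothetical re-implementation of the idea (Spark's actual helper is a private asNullable method on its type hierarchy, not this function): every field, including nested ones, is marked as allowing nulls.

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch of "force a schema to nullable": recursively mark
// struct fields, array elements, and map values as allowing nulls.
def forceNullable(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(f =>
      f.copy(dataType = forceNullable(f.dataType), nullable = true)))
  case at: ArrayType =>
    at.copy(elementType = forceNullable(at.elementType), containsNull = true)
  case mt: MapType =>
    mt.copy(keyType = forceNullable(mt.keyType),
            valueType = forceNullable(mt.valueType),
            valueContainsNull = true)
  case other => other
}
```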
This PR updates the streaming file source schema to force it to be nullable. I also added a flag spark.sql.streaming.fileSource.schema.forceNullable to disable this change, since some users may rely on the old behavior.
How was this patch tested?
The new unit test.
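For illustration, a usage sketch of the flag described above (assuming a running SparkSession named spark, and that the flag is settable at session level like other SQL confs):

```scala
import org.apache.spark.sql.types._

// Opt out of the new behavior and keep the pre-fix semantics, where the
// streaming file source uses the user-provided schema as-is.
spark.conf.set("spark.sql.streaming.fileSource.schema.forceNullable", "false")

val strictSchema = new StructType().add("id", LongType, nullable = false)
val stream = spark.readStream.schema(strictSchema).json("/tmp/in")
println(stream.schema("id").nullable)  // false: old behavior preserved
```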