
[SPARK-35912][SQL] Fix nullability of spark.read.json/spark.read.csv #33436

Closed
wants to merge 3 commits into from

Conversation

cfmcgrady
Contributor

@cfmcgrady cfmcgrady commented Jul 20, 2021

What changes were proposed in this pull request?

Rework PR with suggestions.

This PR makes spark.read.json() have the same behavior as the Datasource API spark.read.format("json").load("path"): Spark should turn a non-nullable user-specified schema into a nullable one by default when using spark.read.json().

Here is an example:

  val schema = StructType(Seq(StructField("value",
    StructType(Seq(
      StructField("x", IntegerType, nullable = false),
      StructField("y", IntegerType, nullable = false)
    )),
    nullable = true
  )))

  val testDS = Seq("""{"value":{"x":1}}""").toDS
  spark.read
    .schema(schema)
    .json(testDS)
    .printSchema()

  // assuming /tmp/json/t1 contains the same JSON record as testDS
  spark.read
    .schema(schema)
    .format("json")
    .load("/tmp/json/t1")
    .printSchema()
  // root
  //  |-- value: struct (nullable = true)
  //  |    |-- x: integer (nullable = true)
  //  |    |-- y: integer (nullable = true)

Before this PR:

// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = false)
 |    |-- y: integer (nullable = false)

After this PR:

// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = true)
 |    |-- y: integer (nullable = true)

  • spark.read.csv() has the same problem.
  • The Datasource API spark.read.format("json").load("path") applies this logic when resolving the relation:

HadoopFsRelation(
  fileCatalog,
  partitionSchema = partitionSchema,
  dataSchema = dataSchema.asNullable,
  bucketSpec = bucketSpec,
  format,
  caseInsensitiveOptions)(sparkSession)
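The dataSchema.asNullable call above recursively marks every field in the schema nullable. As an illustration only (using a toy dict-based schema representation, not Spark's StructType API), the transform can be sketched like this:

```python
# Toy sketch of StructType.asNullable: recursively mark every field nullable.
# The dict-based schema here is illustrative, not Spark's actual API.

def as_nullable(field):
    """Return a copy of a schema field with nullable=True at every level."""
    out = dict(field, nullable=True)
    if field["type"] == "struct":
        out["fields"] = [as_nullable(f) for f in field["fields"]]
    return out

# Mirrors the schema from the example above: a struct with two
# non-nullable integer fields x and y.
schema = {
    "name": "value", "type": "struct", "nullable": True,
    "fields": [
        {"name": "x", "type": "integer", "nullable": False},
        {"name": "y", "type": "integer", "nullable": False},
    ],
}

relaxed = as_nullable(schema)
print([f["nullable"] for f in relaxed["fields"]])  # -> [True, True]
```

Note the original schema object is left untouched; the relaxed copy is what the resolved relation carries.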

Does this PR introduce any user-facing change?

Yes. spark.read.json() and spark.read.csv() no longer respect the nullability of the user-given schema and always turn it into a nullable schema by default.

How was this patch tested?

New test.

@github-actions github-actions bot added the SQL label Jul 20, 2021
@cfmcgrady
Contributor Author

cc @cloud-fan @HyukjinKwon @maropu

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45834/

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45834/

@srowen
Member

srowen commented Jul 20, 2021

For my information - is the purpose here to then cause a failure at runtime when the null values are read, and the original schema is applied, which forbids null?

@cloud-fan
Contributor

is the purpose here to then cause a failure at runtime when the null values are read

No, the original schema won't be applied. This PR turns it into a nullable schema, which is the same behavior as reading JSON/CSV from files.

@SparkQA

SparkQA commented Jul 20, 2021

Test build #141318 has finished for PR 33436 at commit 3974a43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the DOCS label Jul 21, 2021
@SparkQA

SparkQA commented Jul 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45890/

@SparkQA

SparkQA commented Jul 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45890/

@SparkQA

SparkQA commented Jul 21, 2021

Test build #141374 has finished for PR 33436 at commit 741d0c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45913/

@SparkQA

SparkQA commented Jul 21, 2021

Test build #141395 has finished for PR 33436 at commit 56ceec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

Merged to master.

@cfmcgrady
Contributor Author

Thank you for the review. @cloud-fan @HyukjinKwon @srowen

@@ -22,6 +22,10 @@ license: |
* Table of contents
{:toc}

## Upgrading from Spark SQL 3.2 to 3.3

- Since Spark 3.3, Spark turns a non-nullable schema into nullable for API `DataFrameReader.schema(schema: StructType).json(jsonDataset: Dataset[String])` and `DataFrameReader.schema(schema: StructType).csv(csvDataset: Dataset[String])` when the schema is specified by the user and contains non-nullable fields.
Member

I underestimated this problem: it can actually change the results.

import org.apache.spark.sql.types._
val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
  ).option("mode", "FAILFAST").csv(ds).show()

Before:

+---+---+
| f1| f2|
+---+---+
|  a|  b|
+---+---+

After:

+---+----+
| f1|  f2|
+---+----+
|  a|null|
|  a|   b|
+---+----+

I think we should at least add a legacy configuration... let me make a quick follow-up.


HyukjinKwon added a commit that referenced this pull request May 5, 2022
…ng nullability in DataFrame.schema.csv/json(ds)

### What changes were proposed in this pull request?

This PR is a follow-up of #33436 that adds a legacy configuration. It's found that the change can break a valid use case (https://github.com/apache/spark/pull/33436/files#r863271189):

```scala
import org.apache.spark.sql.types._
val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
  ).option("mode", "DROPMALFORMED").csv(ds).show()
```

**Before:**

```
+---+---+
| f1| f2|
+---+---+
|  a|  b|
+---+---+
```

**After:**

```
+---+----+
| f1|  f2|
+---+----+
|  a|null|
|  a|   b|
+---+----+
```

This PR adds a configuration to restore **Before** behaviour.

### Why are the changes needed?

To avoid breaking valid use cases.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided.

### How was this patch tested?

Unit tests were added.

Closes #36435 from HyukjinKwon/SPARK-35912.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
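For reference, the legacy flag introduced by this follow-up can restore the pre-3.3 behaviour at runtime. A minimal config fragment, assuming a running SparkSession named `spark`:

```scala
// Restore pre-3.3 behaviour: respect user-specified nullability in
// DataFrameReader.schema(...).csv(ds) / .json(ds).
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
```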
HyukjinKwon added a commit that referenced this pull request May 5, 2022
…ng nullability in DataFrame.schema.csv/json(ds)

(cherry picked from commit 6689b97)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>