
[SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) #36435

Closed
wants to merge 2 commits into master from HyukjinKwon/SPARK-35912

Conversation

@HyukjinKwon (Member) commented May 3, 2022

What changes were proposed in this pull request?

This PR is a follow-up of #33436 and adds a legacy configuration. It was found that #33436 can break a valid use case (https://github.com/apache/spark/pull/33436/files#r863271189):

```scala
import org.apache.spark.sql.types._
val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
  ).option("mode", "DROPMALFORMED").csv(ds).show()
```

**Before:**

```
+---+---+
| f1| f2|
+---+---+
|  a|  b|
+---+---+
```

**After:**

```
+---+----+
| f1|  f2|
+---+----+
|  a|null|
|  a|   b|
+---+----+
```

With nullability respected, the row "a," puts a null into the non-nullable column f2, so it is treated as malformed and dropped under DROPMALFORMED. Since #33436 forces the user-specified schema to be nullable, that row now survives with a null instead. This PR adds a configuration to restore the **Before** behaviour.

Why are the changes needed?

To avoid breaking valid use cases.

Does this PR introduce any user-facing change?

Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) that, when set to `true`, respects the nullability of the user-specified schema in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)`.
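As a minimal sketch of opting back into the old behaviour (assuming a live `SparkSession` named `spark` with `spark.implicits._` imported so that `toDS` works):

```scala
import org.apache.spark.sql.types._

// Opt back into the legacy behaviour: the nullability declared in the
// user-specified schema is respected when parsing a Dataset[String].
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", true)

val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
  ).option("mode", "DROPMALFORMED").csv(ds).show()
// "a," maps a null into the non-nullable column f2, counts as malformed,
// and is dropped, so only the row ("a", "b") is printed.
```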

How was this patch tested?

Unit tests were added.

@HyukjinKwon changed the title [SPARK-35912][SQL][FOLLOW-UP]Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) May 3, 2022
@HyukjinKwon (Member, Author)

cc @cloud-fan and @cfmcgrady

@MaxGekk (Member) left a comment


LGTM in general except for minor comments.

@cfmcgrady (Contributor) left a comment


LGTM

```scala
StructType(
  StructField("f1", StringType, nullable = false) ::
  StructField("f2", StringType, nullable = false) :: Nil)
).option("mode", "DROPMALFORMED").csv(Seq("a,", "a,b").toDS),
```
Contributor

Shall we also add a test case for "FAILFAST/PERMISSIVE" mode? The behavior is different in the legacy conversion.

Member Author

One test should be enough to check whether nullability is respected, since that is what this PR adds a configuration for.
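As an aside for readers curious about the mode differences mentioned above, a hedged sketch that simply runs all three modes and prints what happens (the per-mode outcomes are deliberately not asserted here; they are left for experimentation):

```scala
import scala.util.Try
import org.apache.spark.sql.types._

// Assumes a SparkSession `spark`, spark.implicits._ in scope, and the
// legacy flag enabled so the declared nullability is respected.
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", true)

val schema = StructType(
  StructField("f1", StringType, nullable = false) ::
  StructField("f2", StringType, nullable = false) :: Nil)

Seq("DROPMALFORMED", "PERMISSIVE", "FAILFAST").foreach { mode =>
  // Try(...) because FAILFAST is expected to throw on the malformed "a,".
  val outcome = Try(
    spark.read.schema(schema)
      .option("mode", mode)
      .csv(Seq("a,", "a,b").toDS)
      .collect()
      .mkString(", "))
  println(s"$mode -> $outcome")
}
```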

@HyukjinKwon (Member, Author)

Merged to master and branch-3.3.

HyukjinKwon added a commit that referenced this pull request May 5, 2022
[SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

Closes #36435 from HyukjinKwon/SPARK-35912.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6689b97)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon deleted the SPARK-35912 branch January 15, 2024 00:50