
[SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) #36435

Closed
wants to merge 2 commits into master from HyukjinKwon/SPARK-35912

Conversation

@HyukjinKwon (Member) commented May 3, 2022

What changes were proposed in this pull request?

This PR is a follow-up of #33436 and adds a legacy configuration. It was found that #33436 can break a valid use case (https://github.com/apache/spark/pull/33436/files#r863271189):

```scala
import org.apache.spark.sql.types._
val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
  ).option("mode", "DROPMALFORMED").csv(ds).show()
```

**Before:**

```
+---+---+
| f1| f2|
+---+---+
|  a|  b|
+---+---+
```

**After:**

```
+---+----+
| f1|  f2|
+---+----+
|  a|null|
|  a|   b|
+---+----+
```

With nullability respected, the row "a," puts a null into the non-nullable column f2, so it is treated as malformed and dropped under DROPMALFORMED. Since #33436 forces the user-specified schema to be nullable, that row now survives with a null instead. This PR adds a configuration to restore the **Before** behaviour.

Why are the changes needed?

To avoid breaking valid use cases.

Does this PR introduce any user-facing change?

Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) that, when set to `true`, respects the nullability of the user-specified schema in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)`.
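As a minimal sketch of opting back into the old behaviour (assuming a live `SparkSession` named `spark` with `spark.implicits._` imported so that `toDS` works):

```scala
import org.apache.spark.sql.types._

// Opt back into the legacy behaviour: the nullability declared in the
// user-specified schema is respected when parsing a Dataset[String].
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", true)

val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
  ).option("mode", "DROPMALFORMED").csv(ds).show()
// "a," maps a null into the non-nullable column f2, counts as malformed,
// and is dropped, so only the row ("a", "b") is printed.
```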

How was this patch tested?

Unit tests were added.

@HyukjinKwon changed the title [SPARK-35912][SQL][FOLLOW-UP]Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) May 3, 2022
@HyukjinKwon (Member, Author)

cc @cloud-fan and @cfmcgrady

@MaxGekk (Member) left a comment


LGTM in general except for minor comments.

@cfmcgrady (Contributor) left a comment


LGTM

```scala
StructType(
  StructField("f1", StringType, nullable = false) ::
  StructField("f2", StringType, nullable = false) :: Nil)
).option("mode", "DROPMALFORMED").csv(Seq("a,", "a,b").toDS),
```
Contributor

Shall we also add a test case for "FAILFAST/PERMISSIVE" mode? The behavior is different in the legacy conversion.

Member Author

One test should be enough to check whether nullability is respected, since that is what this PR adds a configuration for.
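As an aside for readers curious about the mode differences mentioned above, a hedged sketch that simply runs all three modes and prints what happens (the per-mode outcomes are deliberately not asserted here; they are left for experimentation):

```scala
import scala.util.Try
import org.apache.spark.sql.types._

// Assumes a SparkSession `spark`, spark.implicits._ in scope, and the
// legacy flag enabled so the declared nullability is respected.
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", true)

val schema = StructType(
  StructField("f1", StringType, nullable = false) ::
  StructField("f2", StringType, nullable = false) :: Nil)

Seq("DROPMALFORMED", "PERMISSIVE", "FAILFAST").foreach { mode =>
  // Try(...) because FAILFAST is expected to throw on the malformed "a,".
  val outcome = Try(
    spark.read.schema(schema)
      .option("mode", mode)
      .csv(Seq("a,", "a,b").toDS)
      .collect()
      .mkString(", "))
  println(s"$mode -> $outcome")
}
```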

@HyukjinKwon (Member, Author)

Merged to master and branch-3.3.

HyukjinKwon added a commit that referenced this pull request May 5, 2022
[SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

Closes #36435 from HyukjinKwon/SPARK-35912.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6689b97)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon deleted the SPARK-35912 branch January 15, 2024 00:50