[SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) #36435
Conversation
cc @cloud-fan and @cfmcgrady
Resolved review threads (now outdated):
- sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (3 threads)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
LGTM in general except for minor comments.
LGTM
```scala
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
).option("mode", "DROPMALFORMED").csv(Seq("a,", "a,b").toDS),
```
Shall we also add a test case for "FAILFAST"/"PERMISSIVE" mode? The behavior is different in the legacy conversion.
One test should be good enough to see if nullability is respected or not since that's what this PR adds as a configuration.
Merged to master and branch-3.3.
…ng nullability in DataFrame.schema.csv/json(ds)

### What changes were proposed in this pull request?

This PR is a follow-up of #33436 that adds a legacy configuration. It's found that #33436 can break a valid use case (https://github.com/apache/spark/pull/33436/files#r863271189):

```scala
import org.apache.spark.sql.types._

val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
).option("mode", "DROPMALFORMED").csv(ds).show()
```

**Before:**

```
+---+---+
| f1| f2|
+---+---+
|  a|  b|
+---+---+
```

**After:**

```
+---+----+
| f1|  f2|
+---+----+
|  a|null|
|  a|   b|
+---+----+
```

This PR adds a configuration to restore the **Before** behaviour.

### Why are the changes needed?

To avoid breaking valid use cases.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided.

### How was this patch tested?

Unit tests were added.

Closes #36435 from HyukjinKwon/SPARK-35912.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6689b97)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
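To make the Before/After difference concrete, here is a minimal simulation of the described semantics in plain Python. This is an illustration only, not Spark's actual parser: the `parse_csv` helper and its parameter names are hypothetical, and it only models how `DROPMALFORMED` interacts with `nullable = false` fields in the two modes of the new flag.

```python
# Illustrative sketch only -- simulates the semantics described in this PR,
# not Spark's implementation. parse_csv is a hypothetical helper.

def parse_csv(lines, num_fields, respect_nullability):
    """Parse CSV lines under DROPMALFORMED-like semantics.

    respect_nullability=True models the legacy behaviour restored by
    spark.sql.legacy.respectNullabilityInTextDatasetConversion: rows with
    a missing value for a non-nullable field are dropped as malformed.
    False models the default: missing values simply become null.
    """
    rows = []
    for line in lines:
        values = line.split(",")
        # Empty fields read as None (null); pad short rows with None.
        parsed = [v if v != "" else None for v in values]
        parsed += [None] * (num_fields - len(parsed))
        if respect_nullability and any(v is None for v in parsed):
            continue  # DROPMALFORMED: drop rows violating nullable = false
        rows.append(tuple(parsed))
    return rows

lines = ["a,", "a,b"]
print(parse_csv(lines, 2, respect_nullability=True))   # [('a', 'b')]
print(parse_csv(lines, 2, respect_nullability=False))  # [('a', None), ('a', 'b')]
```

The two printed results mirror the **Before** and **After** tables above.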
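Based on the configuration name introduced by this PR, restoring the pre-change behaviour should amount to a one-line config fragment before reading (a sketch, assuming an active `spark` session):

```scala
// Restores the legacy conversion that respects nullable = false
// in DataFrameReader.schema(schema).csv/json(dataset)
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
```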