[SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size #30156
Conversation
(arr: Array[String], size: Int) => arr.padTo(size, null)
} else {
  (arr: Array[String], size: Int) => arr
}
pass as a func to avoid repeating this logic
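A minimal standalone sketch of that suggestion (names are illustrative; `legacyPadNull` stands in for the SQL conf): the flag is checked once and the resulting function is reused for every output row, instead of re-branching per row:

```scala
// Choose the pad function once, based on the (hypothetical) legacy flag.
val legacyPadNull = true // stands in for conf.legacyPadNullWhenValueLessThenSchema

val padNull: (Array[String], Int) => Array[String] =
  if (legacyPadNull) {
    // Pad a short row up to the schema size with nulls, Hive-style.
    (arr: Array[String], size: Int) => arr.padTo(size, null)
  } else {
    // Leave the row as-is; downstream positional access may then fail.
    (arr: Array[String], size: Int) => arr
  }

// A two-column row padded to a three-column schema.
val row = padNull(Array("1", "a"), 3) // Array("1", "a", null)
```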
Nice.
FYI ping @HyukjinKwon @maropu @cloud-fan
@AngersZhuuuu, shall we add a note in the migration guide as well?
Yea, update later.
Updated, ping @HyukjinKwon
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #130315 has finished for PR 30156 at commit
Test build #130308 has finished for PR 30156 at commit
Test build #130314 has finished for PR 30156 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130320 has finished for PR 30156 at commit
docs/sql-migration-guide.md
Outdated
@@ -49,6 +49,8 @@ license: |
- In Spark 3.1, we remove the built-in Hive 1.2. You need to migrate your custom SerDes to Hive 2.3. See [HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) for more details.
- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.
- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when scrip transformation's output value size less then schema size in default-serde mode. If false, we will keep behavior as before.
Could you describe what's `behavior as before`?
Updated
.internal()
.doc("Whether pad null value when transformation output value size less then schema size." +
  "When true, we pad NULL value to keep same behavior with hive." +
  "When false, we keep origin behavior")
Please describe what's `origin behavior` here, too.
Yea.. `original behavior`, updated
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130333 has finished for PR 30156 at commit
docs/sql-migration-guide.md
Outdated
@@ -49,6 +49,8 @@ license: |
- In Spark 3.1, we remove the built-in Hive 1.2. You need to migrate your custom SerDes to Hive 2.3. See [HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) for more details.
- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.
- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when scrip transformation's output value size less then schema size in default-serde mode. If false, we will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
`scrip` -> `script`. Could we a bit more elaborate about "default-serde mode"?
`we will keep original behavior to throw ...` -> `Spark will keep original behavior to throw ...`
Done
.internal()
.doc("Whether pad null value when transformation output value size less then schema size." +
  "When true, we pad NULL value to keep same behavior with hive." +
  "When false, we keep original behavior to throw `ArrayIndexOutOfBoundsException`")
`we` -> `Spark`
Done
Looks fine
Test build #130349 has finished for PR 30156 at commit
Test build #130350 has finished for PR 30156 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130359 has finished for PR 30156 at commit
Ah, sorry can you update the conflict in migration guide? @AngersZhuuuu
Done
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130391 has finished for PR 30156 at commit
@AngersZhuuuu, sorry can you resolve conflict? I will just merge since the conflict is just in md file.
Done
Merged to master
Kubernetes integration test starting
Kubernetes integration test status success
@@ -104,10 +104,16 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))
val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD")
val padNull = if (conf.legacyPadNullWhenValueLessThenSchema) {
The config name sounds like padding is the legacy behavior.
@@ -51,6 +51,8 @@ license: |
- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.
- In Spark 3.1, the `schema_of_json` and `schema_of_csv` functions return the schema in the SQL format in which field names are quoted. In Spark 3.0, the function returns a catalog string without field quoting and in lower case.
- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when script transformation's output value size less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`). If false, Spark will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
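To illustrate the documented difference, here is a standalone Scala sketch (illustrative names only, not Spark's actual row parser): in default-serde mode the script's delimited output is mapped onto the schema by position, so a short row is either null-padded or fails on the first missing index. `padEnabled` stands in for the `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` setting:

```scala
// Hypothetical parser for one line of script output.
def parseRow(line: String, schemaSize: Int, padEnabled: Boolean): Seq[String] = {
  // Split on the field delimiter, keeping empty trailing fields.
  val values = line.split("\t", -1)
  val row = if (padEnabled) values.padTo(schemaSize, null) else values
  // Positional access throws ArrayIndexOutOfBoundsException when the
  // script emitted fewer columns than the schema and padding is off.
  (0 until schemaSize).map(row(_))
}

// With padding: missing trailing columns become null.
parseRow("1\ta", 3, padEnabled = true)   // Seq("1", "a", null)

// Without padding: the original behavior surfaces as an exception.
// parseRow("1\ta", 3, padEnabled = false) // throws ArrayIndexOutOfBoundsException
```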
Please follow other migration guide items: first explain what's the behavior change, then mention how to restore the legacy behavior with the legacy config.
e.g.
- In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.
Got it, with a follow up pr or revert current one? @HyukjinKwon
Followup should be fine.
Test build #130439 has finished for PR 30156 at commit
According to #30202 (comment), I'm going to revert it.
Okay, I am fine with it.
reverted
What changes were proposed in this pull request?
Add a configuration to control the legacy behavior of whether to pad null values when the script transformation's output value size is less than the schema size.
We can't decide whether it's a bug, and some users need the behavior to be the same as Hive's.
Why are the changes needed?
Provides a compatible choice between the historical behavior and Hive's behavior.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing UTs