[SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size #30156

Closed
wants to merge 11 commits

Conversation

AngersZhuuuu
Contributor

What changes were proposed in this pull request?

Add a configuration to control the legacy behavior of whether to pad null values when the output value size is less than the schema size.
We can't decide whether this is a bug, and some users need the behavior to match Hive's.
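For context, a rough sketch of the behavior this config controls (illustrative only; the inline data, script, and config value below are hypothetical, not taken from this PR). In default-serde script transformation, the script may emit fewer fields than the declared output schema, and the config decides whether Spark pads the missing trailing columns with NULL (Hive's behavior) or keeps the original behavior and throws `ArrayIndexOutOfBoundsException`:

// Hypothetical spark-shell sketch. The script emits one field per row,
// while the output schema declares two columns, so the value size (1)
// is less than the schema size (2).
spark.conf.set("spark.sql.legacy.transformationPadNullWhenValueLessThenSchema", "true")
spark.sql("""
  SELECT TRANSFORM(a, b)
    USING 'cut -f1'
    AS (a, b)
  FROM VALUES ('v1', 'v2') AS t(a, b)
""").show()
// true  -> the missing trailing column is padded with NULL: [v1, null] (same as Hive)
// false -> the original behavior is kept and an ArrayIndexOutOfBoundsException is raised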

Why are the changes needed?

Provides a choice between the historical behavior and Hive-compatible behavior.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs

… of whether need to pad null value when value size less then schema size
(arr: Array[String], size: Int) => arr.padTo(size, null)
} else {
(arr: Array[String], size: Int) => arr
}
Contributor Author

pass as a func to avoid repeating this logic
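A minimal standalone sketch of that pattern, for readers skimming the thread (the config value, sample line, and schema size below are illustrative placeholders, not the PR's surrounding code):

// Illustrative stand-in for conf.legacyPadNullWhenValueLessThenSchema.
val legacyPadNull = true

// The padding decision is made once and bound to a function value...
val padNull: (Array[String], Int) => Array[String] =
  if (legacyPadNull) {
    (arr: Array[String], size: Int) => arr.padTo(size, null)
  } else {
    (arr: Array[String], size: Int) => arr
  }

// ...so each script output line reuses it instead of repeating the if/else.
val schemaSize = 3
val fields = padNull("a\tb".split("\t"), schemaSize)
// legacyPadNull = true  -> Array("a", "b", null)
// legacyPadNull = false -> Array("a", "b")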

Member

Nice.

@AngersZhuuuu
Contributor Author

FYI ping @HyukjinKwon @maropu @cloud-fan

@HyukjinKwon
Member

@AngersZhuuuu, shall we add a note in the migration guide as well?

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Oct 27, 2020

@AngersZhuuuu, shall we add a note in the migration guide as well?

Yea, update later

Updated, ping @HyukjinKwon

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34910/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34910/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34916/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34917/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34916/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34917/

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130315 has finished for PR 30156 at commit 8a18234.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130308 has finished for PR 30156 at commit 0f4eeb0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130314 has finished for PR 30156 at commit 710a672.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34922/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34922/

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130320 has finished for PR 30156 at commit 8a18234.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,6 +49,8 @@ license: |
- In Spark 3.1, we remove the built-in Hive 1.2. You need to migrate your custom SerDes to Hive 2.3. See [HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) for more details.

- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.

- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when scrip transformation's output value size less then schema size in default-serde mode. If false, we will keep behavior as before.
Member

Could you describe what the behavior was before?

Contributor Author

Could you describe what the behavior was before?

Updated

.internal()
.doc("Whether pad null value when transformation output value size less then schema size." +
"When true, we pad NULL value to keep same behavior with hive." +
"When false, we keep origin behavior")
Member

Please describe what the original behavior is here, too.

Contributor Author

Please describe what the original behavior is here, too.

yea..original behavior, updated

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34935/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34935/

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130333 has finished for PR 30156 at commit 6fe15a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,6 +49,8 @@ license: |
- In Spark 3.1, we remove the built-in Hive 1.2. You need to migrate your custom SerDes to Hive 2.3. See [HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) for more details.

- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.

- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when scrip transformation's output value size less then schema size in default-serde mode. If false, we will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
Member

scrip -> script. Could we elaborate a bit more on "default-serde mode"?

we will keep original behavior to throw ... -> Spark will keep original behavior to throw ...

Contributor Author

scrip -> script. Could we elaborate a bit more on "default-serde mode"?

we will keep original behavior to throw ... -> Spark will keep original behavior to throw ...

Done

.internal()
.doc("Whether pad null value when transformation output value size less then schema size." +
"When true, we pad NULL value to keep same behavior with hive." +
"When false, we keep original behavior to throw `ArrayIndexOutOfBoundsException`")
Member

we -> Spark

Contributor Author

we -> Spark

Done
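For reference, a sketch of roughly what the resulting entry could look like in SQLConf once these wording changes land. Only the config key, the `.internal()`/`.doc(...)` fragments, and the accessor name come from the diffs quoted in this conversation; the val name, `.version()`, and the default value are assumptions:

// Sketch only; the val would live in object SQLConf and the accessor in class SQLConf.
val LEGACY_PAD_NULL_WHEN_VALUE_LESS_THEN_SCHEMA =
  buildConf("spark.sql.legacy.transformationPadNullWhenValueLessThenSchema")
    .internal()
    .doc("Whether to pad null values when the transformation output value size is less " +
      "than the schema size. When true, Spark pads NULL values to keep the same behavior " +
      "as Hive. When false, Spark keeps the original behavior and throws " +
      "`ArrayIndexOutOfBoundsException`.")
    .version("3.1.0")          // assumed
    .booleanConf
    .createWithDefault(true)   // default value here is an assumption

// Accessor referenced by BaseScriptTransformationExec later in this conversation:
def legacyPadNullWhenValueLessThenSchema: Boolean =
  getConf(LEGACY_PAD_NULL_WHEN_VALUE_LESS_THEN_SCHEMA)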

@HyukjinKwon
Member

Looks fine

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130349 has finished for PR 30156 at commit db7d53a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130350 has finished for PR 30156 at commit 198888f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34962/

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34962/

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130359 has finished for PR 30156 at commit 198888f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Ah, sorry can you update the conflict in migration guide? @AngersZhuuuu

@AngersZhuuuu
Contributor Author

Ah, sorry can you update the conflict in migration guide? @AngersZhuuuu

Done

@SparkQA

SparkQA commented Oct 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34994/

@SparkQA

SparkQA commented Oct 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34994/

@SparkQA

SparkQA commented Oct 29, 2020

Test build #130391 has finished for PR 30156 at commit 3148608.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@AngersZhuuuu, sorry can you resolve conflict? I will just merge since the conflict is just in md file.

@AngersZhuuuu
Contributor Author

@AngersZhuuuu, sorry can you resolve conflict? I will just merge since the conflict is just in md file.

Done

@HyukjinKwon
Member

Merged to master

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35044/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35044/

@@ -104,10 +104,16 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))

val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD")

val padNull = if (conf.legacyPadNullWhenValueLessThenSchema) {
Contributor

The config name sounds like padding is the legacy behavior.

@@ -51,6 +51,8 @@ license: |
- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.

- In Spark 3.1, the `schema_of_json` and `schema_of_csv` functions return the schema in the SQL format in which field names are quoted. In Spark 3.0, the function returns a catalog string without field quoting and in lower case.

- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when script transformation's output value size less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`). If false, Spark will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
Contributor

cloud-fan commented Oct 30, 2020

Please follow the other migration guide items: first explain what the behavior change is, then mention how to restore the legacy behavior with the legacy config.

Contributor

e.g.

  - In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.

Contributor Author

Got it. With a follow-up PR, or revert the current one? @HyukjinKwon

Member

Followup should be fine.

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130439 has finished for PR 30156 at commit 0541ff5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

According to #30202 (comment), I'm going to revert it.

@HyukjinKwon
Member

Okay, I am fine with it.

@cloud-fan
Contributor

reverted
