[SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size #30156

Closed
wants to merge 11 commits

Conversation

AngersZhuuuu
Contributor

What changes were proposed in this pull request?

Add a configuration to control the legacy behavior of whether to pad null values when the output value size is less than the schema size.
We can't decide whether this is a bug, and some users need the behavior to match Hive's.
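For context, a rough sketch of the behavior this config controls (illustrative only; the inline data, script, and config value below are hypothetical, not taken from this PR). In default-serde script transformation, the script may emit fewer fields than the declared output schema, and the config decides whether Spark pads the missing trailing columns with NULL (Hive's behavior) or keeps the original behavior and throws `ArrayIndexOutOfBoundsException`:

// Hypothetical spark-shell sketch. The script emits one field per row,
// while the output schema declares two columns, so the value size (1)
// is less than the schema size (2).
spark.conf.set("spark.sql.legacy.transformationPadNullWhenValueLessThenSchema", "true")
spark.sql("""
  SELECT TRANSFORM(a, b)
    USING 'cut -f1'
    AS (a, b)
  FROM VALUES ('v1', 'v2') AS t(a, b)
""").show()
// true  -> the missing trailing column is padded with NULL: [v1, null] (same as Hive)
// false -> the original behavior is kept and an ArrayIndexOutOfBoundsException is raised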

Why are the changes needed?

Provides a choice between the historical behavior and Hive-compatible behavior.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs

… of whether need to pad null value when value size less then schema size
(arr: Array[String], size: Int) => arr.padTo(size, null)
} else {
(arr: Array[String], size: Int) => arr
}
Contributor Author

pass as a func to avoid repeating this logic
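A minimal standalone sketch of that pattern, for readers skimming the thread (the config value, sample line, and schema size below are illustrative placeholders, not the PR's surrounding code):

// Illustrative stand-in for conf.legacyPadNullWhenValueLessThenSchema.
val legacyPadNull = true

// The padding decision is made once and bound to a function value...
val padNull: (Array[String], Int) => Array[String] =
  if (legacyPadNull) {
    (arr: Array[String], size: Int) => arr.padTo(size, null)
  } else {
    (arr: Array[String], size: Int) => arr
  }

// ...so each script output line reuses it instead of repeating the if/else.
val schemaSize = 3
val fields = padNull("a\tb".split("\t"), schemaSize)
// legacyPadNull = true  -> Array("a", "b", null)
// legacyPadNull = false -> Array("a", "b")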

Member

Nice.

@AngersZhuuuu
Contributor Author

FYI ping @HyukjinKwon @maropu @cloud-fan

@HyukjinKwon
Member

@AngersZhuuuu, shall we add a note in the migration guide as well?

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Oct 27, 2020

@AngersZhuuuu, shall we add a note in the migration guide as well?

Yea, update later

Updated, ping @HyukjinKwon

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34910/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34910/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34916/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34917/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34916/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34917/

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130315 has finished for PR 30156 at commit 8a18234.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130308 has finished for PR 30156 at commit 0f4eeb0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130314 has finished for PR 30156 at commit 710a672.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34922/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34922/

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130320 has finished for PR 30156 at commit 8a18234.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,6 +49,8 @@ license: |
- In Spark 3.1, we remove the built-in Hive 1.2. You need to migrate your custom SerDes to Hive 2.3. See [HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) for more details.

- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.

- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when scrip transformation's output value size less then schema size in default-serde mode. If false, we will keep behavior as before.
Member

Could you describe what the behavior was before?

Contributor Author

Could you describe what the behavior was before?

Updated

.internal()
.doc("Whether pad null value when transformation output value size less then schema size." +
"When true, we pad NULL value to keep same behavior with hive." +
"When false, we keep origin behavior")
Member

Please describe what the original behavior is here, too.

Contributor Author

Please describe what the original behavior is here, too.

yea..original behavior, updated

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34935/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34935/

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130333 has finished for PR 30156 at commit 6fe15a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,6 +49,8 @@ license: |
- In Spark 3.1, we remove the built-in Hive 1.2. You need to migrate your custom SerDes to Hive 2.3. See [HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) for more details.

- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.

- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when scrip transformation's output value size less then schema size in default-serde mode. If false, we will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
Member

scrip -> script. Could we elaborate a bit more on "default-serde mode"?

we will keep original behavior to throw ... -> Spark will keep original behavior to throw ...

Contributor Author

scrip -> script. Could we elaborate a bit more on "default-serde mode"?

we will keep original behavior to throw ... -> Spark will keep original behavior to throw ...

Done

.internal()
.doc("Whether pad null value when transformation output value size less then schema size." +
"When true, we pad NULL value to keep same behavior with hive." +
"When false, we keep original behavior to throw `ArrayIndexOutOfBoundsException`")
Member

we -> Spark

Contributor Author

we -> Spark

Done
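For reference, a sketch of roughly what the resulting entry could look like in SQLConf once these wording changes land. Only the config key, the `.internal()`/`.doc(...)` fragments, and the accessor name come from the diffs quoted in this conversation; the val name, `.version()`, and the default value are assumptions:

// Sketch only; the val would live in object SQLConf and the accessor in class SQLConf.
val LEGACY_PAD_NULL_WHEN_VALUE_LESS_THEN_SCHEMA =
  buildConf("spark.sql.legacy.transformationPadNullWhenValueLessThenSchema")
    .internal()
    .doc("Whether to pad null values when the transformation output value size is less " +
      "than the schema size. When true, Spark pads NULL values to keep the same behavior " +
      "as Hive. When false, Spark keeps the original behavior and throws " +
      "`ArrayIndexOutOfBoundsException`.")
    .version("3.1.0")          // assumed
    .booleanConf
    .createWithDefault(true)   // default value here is an assumption

// Accessor referenced by BaseScriptTransformationExec later in this conversation:
def legacyPadNullWhenValueLessThenSchema: Boolean =
  getConf(LEGACY_PAD_NULL_WHEN_VALUE_LESS_THEN_SCHEMA)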

@HyukjinKwon
Member

Looks fine

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130349 has finished for PR 30156 at commit db7d53a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130350 has finished for PR 30156 at commit 198888f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34962/

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34962/

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130359 has finished for PR 30156 at commit 198888f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Ah, sorry can you update the conflict in migration guide? @AngersZhuuuu

@AngersZhuuuu
Contributor Author

Ah, sorry can you update the conflict in migration guide? @AngersZhuuuu

Done

@SparkQA

SparkQA commented Oct 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34994/

@SparkQA

SparkQA commented Oct 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34994/

@SparkQA

SparkQA commented Oct 29, 2020

Test build #130391 has finished for PR 30156 at commit 3148608.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@AngersZhuuuu, sorry can you resolve conflict? I will just merge since the conflict is just in md file.

@AngersZhuuuu
Contributor Author

@AngersZhuuuu, sorry can you resolve conflict? I will just merge since the conflict is just in md file.

Done

@HyukjinKwon
Member

Merged to master

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35044/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35044/

@@ -104,10 +104,16 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))

val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD")

val padNull = if (conf.legacyPadNullWhenValueLessThenSchema) {
Contributor

The config name sounds like padding is the legacy behavior.

@@ -51,6 +51,8 @@ license: |
- In Spark 3.1, loading and saving of timestamps from/to parquet files fails if the timestamps are before 1900-01-01 00:00:00Z, and loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to Julian to/from Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.parquet.int96RebaseModeInRead` or/and `spark.sql.legacy.parquet.int96RebaseModeInWrite` to `LEGACY`.

- In Spark 3.1, the `schema_of_json` and `schema_of_csv` functions return the schema in the SQL format in which field names are quoted. In Spark 3.0, the function returns a catalog string without field quoting and in lower case.

- In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when script transformation's output value size less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`). If false, Spark will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
Contributor

cloud-fan commented Oct 30, 2020

Please follow the other migration guide items: first explain what the behavior change is, then mention how to restore the legacy behavior with the legacy config.

Contributor

e.g.

  - In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.

Contributor Author

Got it. With a follow-up PR, or revert the current one? @HyukjinKwon

Member

Followup should be fine.

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130439 has finished for PR 30156 at commit 0541ff5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

According to #30202 (comment), I'm going to revert it.

@HyukjinKwon
Member

Okay, I am fine with it.

@cloud-fan
Contributor

reverted
