[SPARK-37013][SQL] Forbid `%0$` usage explicitly to ensure `format_string` has same behavior when using Java 8 and Java 17 #34313

LuciferYang · 2021-10-18T08:00:25Z

What changes were proposed in this pull request?

The following sql has different behavior when using Java 8 and Java 17

--PostgreSQL throw ERROR:  format specifies argument 0, but arguments are numbered from 1
select format_string('%0$s', 'Hello');

Use Java 8

-- !query
select format_string('%0$s', 'Hello')
-- !query schema
struct<format_string(%0$s, Hello):string>
-- !query output
Hello

Use Java 17

-- !query
select format_string('%0$s', 'Hello')
-- !query schema
struct<>
-- !query output
java.util.IllegalFormatArgumentIndexException
Illegal format argument index = 0

The difference in this behavior comes from the change of j.u.Formatter.FormatSpecifier.index method:

Java 8

        private int index(String s) {
            if (s != null) {
                try {
                    index = Integer.parseInt(s.substring(0, s.length() - 1));
                } catch (NumberFormatException x) {
                    assert(false);
                }
            } else {
                index = 0;
            }
            return index;
        }

Java 17

        private void index(String s, int start, int end) {
            if (start >= 0) {
                try {
                    // skip the trailing '$'
                    index = Integer.parseInt(s, start, end - 1, 10);
                    if (index <= 0) {
                       throw new IllegalFormatArgumentIndexException(index);
                    }
                } catch (NumberFormatException x) {
                    throw new IllegalFormatArgumentIndexException(Integer.MIN_VALUE);
                }
            }
        }

A index <= 0 condition is added here to ensure %0$ as a IllegalFormatArgumentIndexException expression.

So the main change of this pr is add a require check to FormatString to manually disable %0$ to ensure that Java 17 and Java 8 have the same behavior.

Why are the changes needed?

Pass UT with JDK 17

Does this PR introduce any user-facing change?

The wrong usage like format_string('%0$s', str) can no longer be used, which is also consistent with PostgreSQL.

How was this patch tested?

Pass the Jenkins or GitHub Action
Manual test mvn clean install -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.SQLQueryTestSuite with Java 11 and Java 17 passed

LuciferYang · 2021-10-18T08:11:20Z

I think there are three ways to fix this issue

Make Java 17 compatible with Java 8, just as this pr does
Using the behavior of Java 17, add judgment FormatString function to prohibit the use of %0$ pattern like

if (pattern.contains("%0$")) {
    throw new IllegalArgumentException
}

and it looks more like what is expected in the test comment: PostgreSQL throw ERROR: format specifies argument 0, but arguments are numbered from 1

Maintain different behaviors and change the test case, just as https://github.com/apache/spark/pull/34153/files# does

cc @srowen @wangyum @dongjoon-hyun @MaxGekk

SparkQA · 2021-10-18T09:30:10Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48835/

SparkQA · 2021-10-18T10:31:55Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48835/

SparkQA · 2021-10-18T13:33:35Z

Test build #144360 has finished for PR 34313 at commit 52aa467.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2021-10-19T16:08:21Z

I'm kinda confused - seems like using index 0 was always disallowed: https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html
"The first argument is referenced by "1$", the second by "2$", etc."

I wonder how it ever worked or what it did? If it was illegal to begin with, we could just let Java 17 enforce it. I'm OK with explicitly enforcing it in order to avoid behavior difference across JVMs for the same Spark version too. I would not edit the format string, I think.

Wouldn't all the numbered indices have to change anyway? otherwise editing the string might result in two arguments at position 1.

LuciferYang · 2021-10-20T02:33:48Z

I'm kinda confused - seems like using index 0 was always disallowed: https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html
"The first argument is referenced by "1$", the second by "2$", etc."

Let me double check this

LuciferYang · 2021-10-20T03:11:36Z

We can use the following case to test it

 import java.util.Locale
    import java.util.Formatter
    val sb = new StringBuffer()
    // Send all output to the Appendable object sb
    val formatter = new Formatter(sb, Locale.US)

    // Explicit argument indices may be used to re-order output.
    formatter.format("%4$2s %3$2s %2$2s %1$2s", "a", "b", "c", "d")
    assert(sb.toString == " d  c  b  a")
    sb.delete(0, sb.length())
    assert(sb.toString.isEmpty)
    formatter.format("%4$2s %3$2s %2$2s %0$2s", "a", "b", "c", "d")
    assert(sb.toString == " d  c  b  a")

This case is similar to the example written in the https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html

The new case tests both %4$2s %3$2s %2$2s %1$2s and %4$2s %3$2s %2$2s %0$2s and they passed when use Java 8, although "The first argument is referenced by "1$", the second by "2$", etc." is specified in the document, but %0$ is not prohibited, and the result of %0$ is equivalent to %1$.

The result of running the above case using Java 11 is the same as that of Java 8.

And Exceptions are thrown only when the above case is run using Java 17

Illegal format argument index = 0
java.util.IllegalFormatArgumentIndexException: Illegal format argument index = 0
    at java.base/java.util.Formatter$FormatSpecifier.index(Formatter.java:2808)
    at java.base/java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2879)
    at java.base/java.util.Formatter.parse(Formatter.java:2747)
    at java.base/java.util.Formatter.format(Formatter.java:2671)
    at java.base/java.util.Formatter.format(Formatter.java:2625)

srowen · 2021-10-20T03:12:36Z

Right so I support either leaving it to error out on Java 17, or explicitly forbidding it in Spark for consistency

LuciferYang · 2021-10-20T03:27:02Z

OK ~

dongjoon-hyun · 2021-10-20T05:35:22Z

+1 for @srowen 's comment and I prefer to forbid explicitly because it's consistent with PostgreSQL too.

wangyum · 2021-10-20T08:17:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

-      !pattern.asInstanceOf[UTF8String].toString.contains("%0$")
-    }
+  private def checkArgumentIndexNotZero(expression: Expression): Boolean = expression match {
+    case pattern: Literal if pattern.dataType == StringType => !pattern.toString.contains("%0$")


case StringLiteral(pattern) => !pattern.contains("%0$")?

SparkQA · 2021-10-20T08:34:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48922/

SparkQA · 2021-10-20T09:18:24Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48922/

SparkQA · 2021-10-20T09:37:04Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48926/

SparkQA · 2021-10-20T10:20:57Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48926/

SparkQA · 2021-10-20T12:08:11Z

Test build #144449 has finished for PR 34313 at commit c57b424.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-20T13:30:28Z

Test build #144453 has finished for PR 34313 at commit 7e8c3d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2021-10-20T13:42:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

+   * behavior of Java 8, Java 11 and Java 17.
+   */
+  private def checkArgumentIndexNotZero(expression: Expression): Boolean = expression match {
+    case StringLiteral(pattern) => !pattern.contains("%0$")


I think there may be more ways to use index 0 - what if you have something between the % and 0$?
This might in practice be close enough, just wondering if there is a simple way to make this more comprehensive without false positives

The formatSpecifier defined in j.u.Formatter is %[argument_index$][flags][width][.precision][t]conversion, from the definition of this format, can we think that there will be no other content between % and 0$?

Ah OK it goes after $ - seems OK then

Or we can move part of the j.u.Formatter.parse() method code from Java 17 to make more strict check

Ah OK it goes after $ - seems OK then

I'm also trying to create cases that may cause misjudgment or missing judgment, but I haven't found them yet.

wangyum · 2021-10-21T01:25:05Z

Do we need to add an item in docs/sql-migration-guide.md? since it's a breaking change for Java 8/11 users.

LuciferYang · 2021-10-21T05:54:33Z

@wangyum done

SparkQA · 2021-10-21T06:46:00Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48963/

SparkQA · 2021-10-21T07:41:09Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48963/

SparkQA · 2021-10-21T11:05:58Z

Test build #144491 has finished for PR 34313 at commit ba7c160.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2021-10-22T04:50:15Z

I copied the release notes change to the JIRA too. I'll merge to master

LuciferYang · 2021-10-22T05:39:44Z

thanks all ~

cloud-fan · 2021-10-27T03:25:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

@@ -1617,6 +1617,8 @@ case class ParseUrl(children: Seq[Expression], failOnError: Boolean = SQLConf.ge
 case class FormatString(children: Expression*) extends Expression with ImplicitCastInputTypes {

  require(children.nonEmpty, s"$prettyName() should take at least 1 argument")
+  require(checkArgumentIndexNotZero(children(0)), "Illegal format argument index = 0")


let's use the new error framework to throw error in newly added code.

@cloud-fan Sorry, is there any sample? I'll fix it later

@cloud-fan Like this PR? https://github.com/apache/spark/pull/34208/files ?

I'm very sorry I haven't started this week

…ow error in `FormatString` ### What changes were proposed in this pull request? This is a followup of #34313. The main change of this pr is change to use the new error framework to throw error when `pattern.contains("%0$")` is true. ### Why are the changes needed? Use the new error framework to throw error ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #34454 from LuciferYang/SPARK-37013-FOLLOWUP. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…ow error in `FormatString` ### What changes were proposed in this pull request? This is a followup of apache/spark#34313. The main change of this pr is change to use the new error framework to throw error when `pattern.contains("%0$")` is true. ### Why are the changes needed? Use the new error framework to throw error ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #34454 from LuciferYang/SPARK-37013-FOLLOWUP. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…of forbidding %0$ usage in format_string ### What changes were proposed in this pull request? Adds a legacy flag `spark.sql.legacy.allowZeroIndexInFormatString` for the breaking change introduced in #34313 and #34454 (followup). The flag is disabled by default. But when it is enabled, restore the pre-change behavior that allows the 0 based index in `format_string`. ### Why are the changes needed? The original commit is a breaking change, and breaking changes should be encouraged to add a flag to turn it off for smooth migration between versions. ### Does this PR introduce _any_ user-facing change? With the default value of the conf, there is no user-facing difference. If users turn this conf on, they can restore the pre-change behavior. ### How was this patch tested? Through unit tests. Closes #36101 from anchovYu/flags-format-string-java. Authored-by: Xinyi Yu <xinyi.yu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…of forbidding %0$ usage in format_string ### What changes were proposed in this pull request? Adds a legacy flag `spark.sql.legacy.allowZeroIndexInFormatString` for the breaking change introduced in #34313 and #34454 (followup). The flag is disabled by default. But when it is enabled, restore the pre-change behavior that allows the 0 based index in `format_string`. ### Why are the changes needed? The original commit is a breaking change, and breaking changes should be encouraged to add a flag to turn it off for smooth migration between versions. ### Does this PR introduce _any_ user-facing change? With the default value of the conf, there is no user-facing difference. If users turn this conf on, they can restore the pre-change behavior. ### How was this patch tested? Through unit tests. Closes #36101 from anchovYu/flags-format-string-java. Authored-by: Xinyi Yu <xinyi.yu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b7af2b3) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

fix SPARPK-37013

52aa467

github-actions bot added the SQL label Oct 18, 2021

Merge branch 'upmaster' into SPARK-37013

7674b55

LuciferYang added 6 commits October 20, 2021 14:45

fix SPARK-37013

c457603

revert

b3bc471

revert

321f67a

revert

8d22f3e

revert

c57b424

refactor

3ffe3f9

wangyum reviewed Oct 20, 2021

View reviewed changes

LuciferYang added 2 commits October 20, 2021 16:22

update text.sql.out

75a1ed2

use StringLiteral

7e8c3d6

LuciferYang changed the title ~~[SPARK-37013][SQL] Ensure format_string has same behavior when using Java 8 and Java 17~~ [SPARK-37013][SQL] Forbid '%0$' explicitly to ensure format_string has same behavior when using Java 8 and Java 17 Oct 20, 2021

LuciferYang changed the title ~~[SPARK-37013][SQL] Forbid '%0$' explicitly to ensure format_string has same behavior when using Java 8 and Java 17~~ [SPARK-37013][SQL] Forbid %0$ usage explicitly to ensure format_string has same behavior when using Java 8 and Java 17 Oct 20, 2021

srowen reviewed Oct 20, 2021

View reviewed changes

srowen approved these changes Oct 20, 2021

View reviewed changes

sql-migration-guide.md

ba7c160

github-actions bot added the DOCS label Oct 21, 2021

srowen approved these changes Oct 21, 2021

View reviewed changes

wangyum approved these changes Oct 22, 2021

View reviewed changes

srowen closed this in ca861eb Oct 22, 2021

cloud-fan reviewed Oct 27, 2021

View reviewed changes

LuciferYang mentioned this pull request Nov 1, 2021

[SPARK-37013][CORE][SQL][FOLLOWUP] Use the new error framework to throw error in FormatString #34454

Closed

anchovYu mentioned this pull request Apr 7, 2022

[SPARK-37013][SQL][FOLLOWUP] Add legacy flag for the breaking change of forbidding %0$ usage in format_string #36101

Closed

LuciferYang deleted the SPARK-37013 branch October 22, 2023 07:43

[SPARK-37013][SQL] Forbid %0$ usage explicitly to ensure format_string has same behavior when using Java 8 and Java 17 #34313

[SPARK-37013][SQL] Forbid %0$ usage explicitly to ensure format_string has same behavior when using Java 8 and Java 17 #34313

Conversation

LuciferYang commented Oct 18, 2021 • edited

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

LuciferYang commented Oct 18, 2021

SparkQA commented Oct 18, 2021

SparkQA commented Oct 18, 2021

SparkQA commented Oct 18, 2021

srowen commented Oct 19, 2021

LuciferYang commented Oct 20, 2021 • edited

LuciferYang commented Oct 20, 2021

srowen commented Oct 20, 2021

LuciferYang commented Oct 20, 2021

dongjoon-hyun commented Oct 20, 2021

Choose a reason for hiding this comment

SparkQA commented Oct 20, 2021

SparkQA commented Oct 20, 2021

SparkQA commented Oct 20, 2021

SparkQA commented Oct 20, 2021

SparkQA commented Oct 20, 2021

SparkQA commented Oct 20, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wangyum commented Oct 21, 2021 • edited

LuciferYang commented Oct 21, 2021

SparkQA commented Oct 21, 2021

SparkQA commented Oct 21, 2021

SparkQA commented Oct 21, 2021

srowen commented Oct 22, 2021

LuciferYang commented Oct 22, 2021

Choose a reason for hiding this comment

LuciferYang Oct 27, 2021 • edited

Choose a reason for hiding this comment

LuciferYang Oct 27, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

[SPARK-37013][SQL] Forbid `%0$` usage explicitly to ensure `format_string` has same behavior when using Java 8 and Java 17 #34313

[SPARK-37013][SQL] Forbid `%0$` usage explicitly to ensure `format_string` has same behavior when using Java 8 and Java 17 #34313

LuciferYang commented Oct 18, 2021 •

edited

LuciferYang commented Oct 20, 2021 •

edited

wangyum commented Oct 21, 2021 •

edited

LuciferYang Oct 27, 2021 •

edited

LuciferYang Oct 27, 2021 •

edited