Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37013][SQL] Forbid %0$ usage explicitly to ensure format_string has same behavior when using Java 8 and Java 17 #34313

Closed
wants to merge 11 commits into from

Conversation

LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Oct 18, 2021

What changes were proposed in this pull request?

The following sql has different behavior when using Java 8 and Java 17

--PostgreSQL throw ERROR:  format specifies argument 0, but arguments are numbered from 1
select format_string('%0$s', 'Hello');

Use Java 8

-- !query
select format_string('%0$s', 'Hello')
-- !query schema
struct<format_string(%0$s, Hello):string>
-- !query output
Hello

Use Java 17

-- !query
select format_string('%0$s', 'Hello')
-- !query schema
struct<>
-- !query output
java.util.IllegalFormatArgumentIndexException
Illegal format argument index = 0

The difference in this behavior comes from the change of j.u.Formatter.FormatSpecifier.index method:

Java 8

        private int index(String s) {
            if (s != null) {
                try {
                    index = Integer.parseInt(s.substring(0, s.length() - 1));
                } catch (NumberFormatException x) {
                    assert(false);
                }
            } else {
                index = 0;
            }
            return index;
        }

Java 17

        private void index(String s, int start, int end) {
            if (start >= 0) {
                try {
                    // skip the trailing '$'
                    index = Integer.parseInt(s, start, end - 1, 10);
                    if (index <= 0) {
                       throw new IllegalFormatArgumentIndexException(index);
                    }
                } catch (NumberFormatException x) {
                    throw new IllegalFormatArgumentIndexException(Integer.MIN_VALUE);
                }
            }
        }

A index <= 0 condition is added here to ensure %0$ as a IllegalFormatArgumentIndexException expression.

So the main change of this pr is add a require check to FormatString to manually disable %0$ to ensure that Java 17 and Java 8 have the same behavior.

Why are the changes needed?

Pass UT with JDK 17

Does this PR introduce any user-facing change?

The wrong usage like format_string('%0$s', str) can no longer be used, which is also consistent with PostgreSQL.

How was this patch tested?

  • Pass the Jenkins or GitHub Action
  • Manual test mvn clean install -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.SQLQueryTestSuite with Java 11 and Java 17 passed

@github-actions github-actions bot added the SQL label Oct 18, 2021
@LuciferYang
Copy link
Contributor Author

I think there are three ways to fix this issue

  1. Make Java 17 compatible with Java 8, just as this pr does

  2. Using the behavior of Java 17, add judgment FormatString function to prohibit the use of %0$ pattern like

if (pattern.contains("%0$")) {
    throw new IllegalArgumentException
}

and it looks more like what is expected in the test comment: PostgreSQL throw ERROR: format specifies argument 0, but arguments are numbered from 1

  1. Maintain different behaviors and change the test case, just as https://github.com/apache/spark/pull/34153/files# does

cc @srowen @wangyum @dongjoon-hyun @MaxGekk

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48835/

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48835/

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Test build #144360 has finished for PR 34313 at commit 52aa467.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Oct 19, 2021

I'm kinda confused - seems like using index 0 was always disallowed: https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html
"The first argument is referenced by "1$", the second by "2$", etc."

I wonder how it ever worked or what it did? If it was illegal to begin with, we could just let Java 17 enforce it. I'm OK with explicitly enforcing it in order to avoid behavior difference across JVMs for the same Spark version too. I would not edit the format string, I think.

Wouldn't all the numbered indices have to change anyway? otherwise editing the string might result in two arguments at position 1.

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Oct 20, 2021

I'm kinda confused - seems like using index 0 was always disallowed: https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html
"The first argument is referenced by "1$", the second by "2$", etc."

Let me double check this

@LuciferYang
Copy link
Contributor Author

We can use the following case to test it

 import java.util.Locale
    import java.util.Formatter
    val sb = new StringBuffer()
    // Send all output to the Appendable object sb
    val formatter = new Formatter(sb, Locale.US)

    // Explicit argument indices may be used to re-order output.
    formatter.format("%4$2s %3$2s %2$2s %1$2s", "a", "b", "c", "d")
    assert(sb.toString == " d  c  b  a")
    sb.delete(0, sb.length())
    assert(sb.toString.isEmpty)
    formatter.format("%4$2s %3$2s %2$2s %0$2s", "a", "b", "c", "d")
    assert(sb.toString == " d  c  b  a")

This case is similar to the example written in the https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html

image

The new case tests both %4$2s %3$2s %2$2s %1$2s and %4$2s %3$2s %2$2s %0$2s and they passed when use Java 8, although "The first argument is referenced by "1$", the second by "2$", etc." is specified in the document, but %0$ is not prohibited, and the result of %0$ is equivalent to %1$.

The result of running the above case using Java 11 is the same as that of Java 8.

And Exceptions are thrown only when the above case is run using Java 17

Illegal format argument index = 0
java.util.IllegalFormatArgumentIndexException: Illegal format argument index = 0
    at java.base/java.util.Formatter$FormatSpecifier.index(Formatter.java:2808)
    at java.base/java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2879)
    at java.base/java.util.Formatter.parse(Formatter.java:2747)
    at java.base/java.util.Formatter.format(Formatter.java:2671)
    at java.base/java.util.Formatter.format(Formatter.java:2625)

@srowen
Copy link
Member

srowen commented Oct 20, 2021

Right so I support either leaving it to error out on Java 17, or explicitly forbidding it in Spark for consistency

@LuciferYang
Copy link
Contributor Author

OK ~

@dongjoon-hyun
Copy link
Member

+1 for @srowen 's comment and I prefer to forbid explicitly because it's consistent with PostgreSQL too.

!pattern.asInstanceOf[UTF8String].toString.contains("%0$")
}
private def checkArgumentIndexNotZero(expression: Expression): Boolean = expression match {
case pattern: Literal if pattern.dataType == StringType => !pattern.toString.contains("%0$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case StringLiteral(pattern) => !pattern.contains("%0$")?

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48922/

@LuciferYang LuciferYang changed the title [SPARK-37013][SQL] Ensure format_string has same behavior when using Java 8 and Java 17 [SPARK-37013][SQL] Forbid '%0$' explicitly to ensure format_string has same behavior when using Java 8 and Java 17 Oct 20, 2021
@LuciferYang LuciferYang changed the title [SPARK-37013][SQL] Forbid '%0$' explicitly to ensure format_string has same behavior when using Java 8 and Java 17 [SPARK-37013][SQL] Forbid %0$ usage explicitly to ensure format_string has same behavior when using Java 8 and Java 17 Oct 20, 2021
@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48922/

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48926/

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48926/

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Test build #144449 has finished for PR 34313 at commit c57b424.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Test build #144453 has finished for PR 34313 at commit 7e8c3d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* behavior of Java 8, Java 11 and Java 17.
*/
private def checkArgumentIndexNotZero(expression: Expression): Boolean = expression match {
case StringLiteral(pattern) => !pattern.contains("%0$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there may be more ways to use index 0 - what if you have something between the % and 0$?
This might in practice be close enough, just wondering if there is a simple way to make this more comprehensive without false positives

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatSpecifier defined in j.u.Formatter is %[argument_index$][flags][width][.precision][t]conversion, from the definition of this format, can we think that there will be no other content between % and 0$?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK it goes after $ - seems OK then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can move part of the j.u.Formatter.parse() method code from Java 17 to make more strict check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK it goes after $ - seems OK then

I'm also trying to create cases that may cause misjudgment or missing judgment, but I haven't found them yet.

@wangyum
Copy link
Member

wangyum commented Oct 21, 2021

Do we need to add an item in docs/sql-migration-guide.md? since it's a breaking change for Java 8/11 users.

@LuciferYang
Copy link
Contributor Author

@wangyum done

@github-actions github-actions bot added the DOCS label Oct 21, 2021
@SparkQA
Copy link

SparkQA commented Oct 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48963/

@SparkQA
Copy link

SparkQA commented Oct 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48963/

@SparkQA
Copy link

SparkQA commented Oct 21, 2021

Test build #144491 has finished for PR 34313 at commit ba7c160.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Oct 22, 2021

I copied the release notes change to the JIRA too. I'll merge to master

@srowen srowen closed this in ca861eb Oct 22, 2021
@LuciferYang
Copy link
Contributor Author

thanks all ~

@@ -1617,6 +1617,8 @@ case class ParseUrl(children: Seq[Expression], failOnError: Boolean = SQLConf.ge
case class FormatString(children: Expression*) extends Expression with ImplicitCastInputTypes {

require(children.nonEmpty, s"$prettyName() should take at least 1 argument")
require(checkArgumentIndexNotZero(children(0)), "Illegal format argument index = 0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use the new error framework to throw error in newly added code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Sorry, is there any sample? I'll fix it later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK ~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very sorry I haven't started this week

cloud-fan pushed a commit that referenced this pull request Nov 23, 2021
…ow error in `FormatString`

### What changes were proposed in this pull request?
This is a followup of #34313. The main change of this pr is  change to use the new error framework to throw error when `pattern.contains("%0$")` is true.

### Why are the changes needed?
Use the new error framework to throw error

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #34454 from LuciferYang/SPARK-37013-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Nov 23, 2021
…ow error in `FormatString`

### What changes were proposed in this pull request?
This is a followup of apache/spark#34313. The main change of this pr is  change to use the new error framework to throw error when `pattern.contains("%0$")` is true.

### Why are the changes needed?
Use the new error framework to throw error

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #34454 from LuciferYang/SPARK-37013-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 8, 2022
…of forbidding %0$ usage in format_string

### What changes were proposed in this pull request?
Adds a legacy flag `spark.sql.legacy.allowZeroIndexInFormatString` for the breaking change introduced in #34313 and #34454 (followup).

The flag is disabled by default. But when it is enabled, restore the pre-change behavior that allows the 0 based index in `format_string`.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes should be encouraged to add a flag to turn it off for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference.
If users turn this conf on, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36101 from anchovYu/flags-format-string-java.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 8, 2022
…of forbidding %0$ usage in format_string

### What changes were proposed in this pull request?
Adds a legacy flag `spark.sql.legacy.allowZeroIndexInFormatString` for the breaking change introduced in #34313 and #34454 (followup).

The flag is disabled by default. But when it is enabled, restore the pre-change behavior that allows the 0 based index in `format_string`.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes should be encouraged to add a flag to turn it off for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference.
If users turn this conf on, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36101 from anchovYu/flags-format-string-java.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b7af2b3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@LuciferYang LuciferYang deleted the SPARK-37013 branch October 22, 2023 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants