[SPARK-37013][SQL] Forbid `%0$` usage explicitly to ensure `format_string` has same behavior when using Java 8 and Java 17 #34313
I think there are three ways to fix this issue
and it looks more like what is expected in the test comment:
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144360 has finished for PR 34313 at commit
I'm kinda confused - seems like using index 0 was always disallowed: https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html. I wonder how it ever worked or what it did? If it was illegal to begin with, we could just let Java 17 enforce it. I'm also OK with explicitly enforcing it, to avoid a behavior difference across JVMs for the same Spark version. I would not edit the format string, I think. Wouldn't all the numbered indices have to change anyway? Otherwise editing the string might result in two arguments at position 1.
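To make the indexing concern concrete, here is a small Java sketch (the class and method names are illustrative and not part of Spark). It shows that explicit argument indices in `java.util.Formatter` are 1-based, and that rewriting `%0$` to `%1$` in place, without renumbering the other specifiers, would leave two specifiers pointing at the same argument:

```java
import java.util.Locale;

// Illustrative only: class and method names are made up for this sketch.
final class ArgumentIndexDemo {
    // Explicit indices are 1-based: %2$s is the second argument, %1$s the first.
    static String swapped(String a, String b) {
        return String.format(Locale.ROOT, "%2$s %1$s", a, b);
    }

    // What a naive rewrite of "%0$s %1$s" into "%1$s %1$s" would produce:
    // both specifiers read the first argument and the second is never used.
    static String naiveRewrite(String a, String b) {
        return String.format(Locale.ROOT, "%1$s %1$s", a, b);
    }

    public static void main(String[] args) {
        System.out.println(swapped("a", "b"));
        System.out.println(naiveRewrite("a", "b"));
    }
}
```

This is only meant to illustrate why silently editing the pattern is riskier than rejecting it.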
Let me double check this
We can use the following case to test it
This case is similar to the example in https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html. The new case tests both
The result of running the above case with Java 11 is the same as with Java 8. Exceptions are thrown only when it is run with Java 17.
Right, so I support either leaving it to error out on Java 17, or explicitly forbidding it in Spark for consistency
OK ~
+1 for @srowen's comment, and I prefer to forbid it explicitly because that's consistent with PostgreSQL too.
    !pattern.asInstanceOf[UTF8String].toString.contains("%0$")
  }
  private def checkArgumentIndexNotZero(expression: Expression): Boolean = expression match {
    case pattern: Literal if pattern.dataType == StringType => !pattern.toString.contains("%0$")
case StringLiteral(pattern) => !pattern.contains("%0$")
?
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144449 has finished for PR 34313 at commit
Test build #144453 has finished for PR 34313 at commit
   * behavior of Java 8, Java 11 and Java 17.
   */
  private def checkArgumentIndexNotZero(expression: Expression): Boolean = expression match {
    case StringLiteral(pattern) => !pattern.contains("%0$")
I think there may be more ways to use index 0 - what if you have something between the % and 0$?
This might in practice be close enough, just wondering if there is a simple way to make this more comprehensive without false positives
The `formatSpecifier` defined in `j.u.Formatter` is `%[argument_index$][flags][width][.precision][t]conversion`. From this definition, can we conclude that nothing else can appear between `%` and `0$`?
Ah OK it goes after $ - seems OK then
Or we can port part of the `j.u.Formatter.parse()` method from Java 17 to make the check stricter
Ah OK it goes after $ - seems OK then
I'm also trying to construct cases that could cause false positives or false negatives, but I haven't found any yet.
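One case that may be worth checking against the `contains("%0$")` approach is an escaped percent sign: in `%%0$s` the `%%` is a literal percent and `0$s` is plain text, yet the substring `%0$` is present. The following Java sketch is a hypothetical stricter scan, not the check merged in this PR; the class and method names are invented, and treating an all-zero digit run such as `00` as index zero is an assumption of the sketch:

```java
// Hypothetical stricter check; NOT the check merged in this PR.
final class ZeroIndexScan {
    // Returns true if some specifier in `pattern` starts with an explicit
    // argument index whose digits are all zero (for example "%0$s").
    static boolean usesZeroIndex(String pattern) {
        int n = pattern.length();
        int i = 0;
        while (i < n) {
            int p = pattern.indexOf('%', i);
            if (p < 0 || p + 1 >= n) {
                return false;                 // no further specifier can start here
            }
            if (pattern.charAt(p + 1) == '%') {
                i = p + 2;                    // "%%" is an escaped percent sign, skip it
                continue;
            }
            int j = p + 1;
            boolean allZero = true;
            while (j < n && pattern.charAt(j) >= '0' && pattern.charAt(j) <= '9') {
                if (pattern.charAt(j) != '0') {
                    allZero = false;
                }
                j++;
            }
            // Digits directly after '%' and followed by '$' form the argument index.
            if (j > p + 1 && j < n && pattern.charAt(j) == '$' && allZero) {
                return true;
            }
            i = Math.max(j, p + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        String escaped = "%%0$s";
        System.out.println(escaped.contains("%0$"));   // substring check flags it
        System.out.println(usesZeroIndex(escaped));    // scan does not
        System.out.println(String.format(escaped));    // formats without any argument
    }
}
```

Whether this extra precision is worth the added code is a judgment call; the simple substring check is easier to reason about and errs on the side of rejecting.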
Do we need to add an item in |
@wangyum done
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144491 has finished for PR 34313 at commit
I copied the release notes change to the JIRA too. I'll merge to master
thanks all ~ |
@@ -1617,6 +1617,8 @@ case class ParseUrl(children: Seq[Expression], failOnError: Boolean = SQLConf.ge
case class FormatString(children: Expression*) extends Expression with ImplicitCastInputTypes {

  require(children.nonEmpty, s"$prettyName() should take at least 1 argument")
  require(checkArgumentIndexNotZero(children(0)), "Illegal format argument index = 0")
let's use the new error framework to throw error in newly added code.
@cloud-fan Sorry, is there any sample? I'll fix it later
@cloud-fan Like this PR? https://github.com/apache/spark/pull/34208/files ?
yea
OK ~
I'm very sorry I haven't started this week
…ow error in `FormatString`

### What changes were proposed in this pull request?
This is a followup of #34313. The main change of this PR is to use the new error framework to throw an error when `pattern.contains("%0$")` is true.

### Why are the changes needed?
Use the new error framework to throw the error.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #34454 from LuciferYang/SPARK-37013-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…of forbidding %0$ usage in format_string

### What changes were proposed in this pull request?
Adds a legacy flag `spark.sql.legacy.allowZeroIndexInFormatString` for the breaking change introduced in #34313 and #34454 (followup). The flag is disabled by default. When it is enabled, it restores the pre-change behavior that allows the zero index in `format_string`.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes are encouraged to come with a flag that turns them off, for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference. If users turn this conf on, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36101 from anchovYu/flags-format-string-java.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b7af2b3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
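As a rough illustration of how such a legacy flag could gate the check, here is a minimal Java sketch. It is not Spark's implementation: the class and method names are invented, and the boolean parameter `allowZeroIndex` simply stands in for the value of `spark.sql.legacy.allowZeroIndexInFormatString` (reading the conf is not shown). The error message is the one used in the `require` call from #34313.

```java
// Sketch only: `allowZeroIndex` stands in for the value of
// spark.sql.legacy.allowZeroIndexInFormatString; how Spark reads it is not shown.
final class LegacyFlagSketch {
    // Rejects patterns containing "%0$" unless the legacy flag is enabled.
    static void validatePattern(String pattern, boolean allowZeroIndex) {
        if (!allowZeroIndex && pattern.contains("%0$")) {
            throw new IllegalArgumentException("Illegal format argument index = 0");
        }
    }

    public static void main(String[] args) {
        validatePattern("%0$s", true);       // legacy flag on: accepted
        try {
            validatePattern("%0$s", false);  // default: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Note that with the flag enabled the pattern is passed through to the JVM's formatter unchanged, so the result would presumably still depend on the Java version in use.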
### What changes were proposed in this pull request?
The following SQL behaves differently when using Java 8 and Java 17:
Use Java 8
Use Java 17
The difference in behavior comes from a change to the `j.u.Formatter.FormatSpecifier.index` method:
Java 8
Java 17
An `index <= 0` condition was added in Java 17, so that `%0$` is treated as an illegal expression and an `IllegalFormatArgumentIndexException` is thrown.
So the main change of this PR is to add a `require` check to `FormatString` that explicitly rejects `%0$`, ensuring that Java 17 and Java 8 behave the same.
### Why are the changes needed?
Pass UT with JDK 17
### Does this PR introduce any user-facing change?
Wrong usage such as `format_string('%0$s', str)` can no longer be used, which is also consistent with PostgreSQL.
### How was this patch tested?
`mvn clean install -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.SQLQueryTestSuite` passed with Java 11 and Java 17
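To observe the JVM difference described above outside of Spark, a small probe along these lines could be run under different Java versions. This is a hedged sketch with invented names; it makes no claim about the exact output on any particular JVM beyond what the description states (an exception on Java 17), and simply reports what the running JVM does:

```java
import java.util.IllegalFormatException;
import java.util.Locale;

// Illustrative probe; class and method names are made up for this sketch.
final class ZeroIndexProbe {
    // Returns the formatted string, or a short marker naming the exception type
    // if the running JVM rejects the pattern.
    static String probe(String pattern, Object... args) {
        try {
            return String.format(Locale.ROOT, pattern, args);
        } catch (IllegalFormatException e) {
            return "throws " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("%1$s -> " + probe("%1$s", "Hello"));
        System.out.println("%0$s -> " + probe("%0$s", "Hello"));
    }
}
```

Catching the base class `IllegalFormatException` keeps the probe compilable on older JDKs as well.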