[SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString #28582
Test build #122854 has finished for PR 28582 at commit
@cloud-fan @HyukjinKwon @bogdanghit Please, review this PR.
Test build #122887 has finished for PR 28582 at commit
@@ -100,10 +111,26 @@ class Iso8601TimestampFormatter(
*/
class FractionTimestampFormatter(zoneId: ZoneId)
extends Iso8601TimestampFormatter(
"", zoneId, TimestampFormatter.defaultLocale, needVarLengthSecondFraction = false) {
TimestampFormatter.defaultPattern,
what's wrong if we still pass the empty string as the pattern?
empty result
ok, it's used by the legacy formatter.
@transient
override protected lazy val formatter = DateTimeFormatterHelper.fractionFormatter

// Converts Timestamp to string according to Hive TimestampWritable convention.
// The code is borrowed from Spark 2.4 DateTimeUtils.timestampToString
is it really needed? `DateTimeFormatterHelper.fractionFormatter` should omit trailing 0 already.
> should omit trailing 0 already.

The reason for making this PR is to pass `java.sql.Timestamp` directly to a legacy formatter, which can accept that type. However, the legacy counterpart of `fractionFormatter` is `SimpleDateFormat`, which cannot omit trailing zeros.

> is it really needed?

I think so. What do you propose instead?
Let's make the comment more clear:
// The new formatter will omit the trailing 0 in the timestamp string, but the legacy formatter can't.
// Here we borrow the code from Spark 2.4 DateTimeUtils.timestampToString to omit the
// trailing 0 for the legacy formatter as well.
We don't need to mention hive at all. This is just for internal consistency.
Thanks. I updated the comment.
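The approach discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: the object and method names (`TrailingZeroSketch`, `formatLegacy`) are hypothetical. It assumes, as Spark 2.4 `DateTimeUtils.timestampToString` did, that `java.sql.Timestamp.toString` already prints the fractional seconds without trailing zeros, so the fraction can be taken from there and appended to the `SimpleDateFormat` output.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

// Hypothetical helper; names are illustrative and not part of Spark.
object TrailingZeroSketch {
  def formatLegacy(ts: Timestamp, timeZone: TimeZone): String = {
    // SimpleDateFormat has no variable-length fraction pattern,
    // so only format down to whole seconds here.
    val legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US)
    legacy.setTimeZone(timeZone)
    val formatted = legacy.format(ts)

    // Timestamp.toString yields "yyyy-mm-dd hh:mm:ss.f..." with trailing
    // zeros of the fraction removed (a zero fraction is printed as ".0").
    // Index 19 assumes a four-digit year.
    val asString = ts.toString
    if (asString.length > 19 && asString.substring(19) != ".0") {
      formatted + asString.substring(19)
    } else {
      formatted
    }
  }
}
```

Only the fraction is taken from `ts.toString`, so the JVM default time zone used by `Timestamp.toString` does not affect the result.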
Test build #122883 has finished for PR 28582 at commit
@@ -584,7 +584,7 @@ select make_date(-44, 3, 15)
-- !query schema
struct<make_date(-44, 3, 15):date>
-- !query output
-0044-03-15
0045-03-15
This is compatible with Spark 2.4 because it uses SimpleDateFormat which cannot format negative years.
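The difference between the two outputs can be reproduced outside Spark with a small sketch. The object name `NegativeYearSketch` is hypothetical. The sketch only contrasts how `java.time` and `SimpleDateFormat` print a year before the common era. It does not convert between the proleptic Gregorian and Julian calendars, so the two values are not claimed to denote the same day.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, Locale}

// Hypothetical demo object; not part of Spark.
object NegativeYearSketch {
  // java.time uses proleptic ISO years, so year -44 prints with a sign.
  def isoString: String = LocalDate.of(-44, 3, 15).toString

  // SimpleDateFormat prints the year of era, so 45 BC shows up as "0045"
  // and the era itself is lost unless the pattern includes "G".
  def legacyString: String = {
    val cal = new GregorianCalendar()
    cal.clear()
    cal.set(Calendar.ERA, GregorianCalendar.BC)
    cal.set(45, Calendar.MARCH, 15)
    new SimpleDateFormat("yyyy-MM-dd", Locale.US).format(cal.getTime)
  }
}
```

Here `isoString` gives `-0044-03-15` and `legacyString` gives `0045-03-15`, matching the two values in the golden file diff.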
Test build #122893 has finished for PR 28582 at commit
Test build #122902 has finished for PR 28582 at commit
thanks, merging to master/3.0!
…n toHiveString

### What changes were proposed in this pull request?
1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings:
    - TimestampFormatter:
        - `def format(ts: Timestamp): String`
        - `def format(instant: Instant): String`
    - DateFormatter:
        - `def format(date: Date): String`
        - `def format(localDate: LocalDate): String`
2. Re-use the added methods from `HiveResult.toHiveString`
3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests for toHiveString and new tests in `TimestampFormatterSuite`.

Closes #28582 from MaxGekk/opt-format-old-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5d67331)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ectly

### What changes were proposed in this pull request?
Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR #28582 added the methods to avoid conversions to days and microseconds.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `RowJsonSuite`.

Closes #28620 from MaxGekk/toJson-format-Java-datetime-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ectly

### What changes were proposed in this pull request?
Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR #28582 added the methods to avoid conversions to days and microseconds.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `RowJsonSuite`.

Closes #28620 from MaxGekk/toJson-format-Java-datetime-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 7f36310)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings:
    - TimestampFormatter:
        - `def format(ts: Timestamp): String`
        - `def format(instant: Instant): String`
    - DateFormatter:
        - `def format(date: Date): String`
        - `def format(localDate: LocalDate): String`
2. Re-use the added methods from `HiveResult.toHiveString`
3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing tests for toHiveString and new tests in `TimestampFormatterSuite`.
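
The overhead described under "Why are the changes needed?" can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SketchTimestampFormatter` and `RoundTripSketch` are hypothetical names, and the fixed pattern and `java.time` formatter stand in for whatever the real formatters use. The point is only that a caller holding a `Timestamp` or `Instant` previously had to go through microseconds, which the formatter then converted back to a Java type.

```scala
import java.sql.Timestamp
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical stand-in for a timestamp formatter; not Spark's class.
class SketchTimestampFormatter(zoneId: ZoneId) {
  private val formatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(zoneId)

  // Micros-based entry point: has to rebuild an Instant before formatting.
  def format(us: Long): String =
    formatter.format(Instant.EPOCH.plus(us, ChronoUnit.MICROS))

  // Overloads in the spirit of the PR: take the Java types as they are.
  def format(instant: Instant): String = formatter.format(instant)
  def format(ts: Timestamp): String = format(ts.toInstant)
}

object RoundTripSketch {
  // The extra conversion a caller needs when only format(us: Long) exists.
  def toMicros(ts: Timestamp): Long = {
    val i = ts.toInstant
    Math.addExact(
      Math.multiplyExact(i.getEpochSecond, 1000000L),
      (i.getNano / 1000).toLong)
  }
}
```

With these definitions, `f.format(ts)` and `f.format(RoundTripSketch.toMicros(ts))` return the same string, but the first skips the conversion to microseconds and back.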