[SPARK-24457][SQL] Improving performance of stringToTimestamp #21505
When you get the calendar instance next time, isn't its time out of date?
Please refer to line 130 for this. Before the calendar instance is returned, it is reinitialized to the time at which it was originally created.
@viirya ^ Does that answer your question, or do you mean something else?
Isn't timeInMillis also stored when you first update this map entry? So the next time you access this calendar, you just set it to the old timeInMillis. Is that right?
Yes, correct.
I agree with @viirya's comment. Don't we need to set the value to System.currentTimeMillis()?
@kiszk Thanks, I'm updating that. BTW, can you please help me understand a scenario where that is absolutely needed?
hmm, I think System.currentTimeMillis() is UTC-based?
oh, Calendar.getTimeInMillis and setTimeInMillis are also UTC-based.
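To illustrate the point about epoch milliseconds, here is a minimal sketch. The object and method names are hypothetical and not part of the PR. It pins two calendars in different zones to the same instant and compares the stored instant with a derived local field.

```scala
import java.util.{Calendar, TimeZone}

// Hypothetical helper for illustration only; not code from the PR.
object EpochMillisSketch {
  // Build a Calendar in the given zone and pin it to the given epoch instant.
  def at(zoneId: String, epochMillis: Long): Calendar = {
    val c = Calendar.getInstance(TimeZone.getTimeZone(zoneId))
    c.setTimeInMillis(epochMillis)
    c
  }

  def main(args: Array[String]): Unit = {
    val utc = at("UTC", 0L)
    val tokyo = at("Asia/Tokyo", 0L)
    // The stored instant is zone-independent; only derived fields differ.
    println(utc.getTimeInMillis == tokyo.getTimeInMillis)
    println(s"${utc.get(Calendar.HOUR_OF_DAY)} vs ${tokyo.get(Calendar.HOUR_OF_DAY)}")
  }
}
```

The time zone only affects how the instant is broken down into fields such as HOUR_OF_DAY, which is why storing and restoring timeInMillis does not depend on the calendar's zone.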
Doesn't clear reset the timezone of that Calendar instance?
Nope. It clears all the fields and zone is not a field.
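A quick way to check this claim is sketched below, assuming a standard JDK. The object and method names are made up for illustration. It calls clear() and then inspects both the zone and one calendar field.

```scala
import java.util.{Calendar, TimeZone}

// Hypothetical check for illustration only; not code from the PR.
object ClearKeepsZoneSketch {
  // Returns the zone ID after clear() and whether YEAR is still marked as set.
  def afterClear(zoneId: String): (String, Boolean) = {
    val c = Calendar.getInstance(TimeZone.getTimeZone(zoneId))
    c.clear()
    (c.getTimeZone.getID, c.isSet(Calendar.YEAR))
  }
}
```

For example, `ClearKeepsZoneSketch.afterClear("Asia/Tokyo")` should report the original zone ID together with an unset YEAR field.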
We would appreciate it if you could post the performance numbers before and after this PR.
Do we need to keep a Calendar for each of many time zones? Since getCalendar takes a time zone as input, we can keep just one Calendar instance and set it to the given time zone in getCalendar. WDYT? Regarding performance, is there a big difference?
Usually, only the default time zone is used. However, a Cast involving dates that is invoked with an explicit time zone may use a different one. For correctness, I think it is necessary to support multiple time zones.
Caching only the calendar for the default time zone and creating a new instance for any other time zone would also work correctly.
Yes, it should work functionally if we check a given time zone every time.
Do you know the typical access pattern for time zones? If there is temporal locality with respect to the time zone, we do not have to use mutable.Map.
@kiszk @viirya I've run benchmarks with and without the mutable.Map implementation. It looks like setting the time zone on a calendar instance is a costly operation that drags performance down. Since the number of time zones cannot be large, maintaining a map will not add much memory overhead. So I suggest going with the mutable.Map approach. Comments?
How much does performance drop without a map here?
I ran a benchmark and the following is the output:
string to timestamp calendar caching:    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
with map                                         8 /    9         12.9          77.7       1.0X
without map                                     11 /   12          9.4         106.3       0.7X
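For readers who want to reproduce a comparison of this kind, here is a rough, single-threaded timing sketch. It is not the benchmark from the PR: the object name, the zone list, and the iteration counts are assumptions. "Without map" is interpreted here as one shared Calendar whose zone is switched on each call, as suggested earlier in the thread. Absolute numbers will vary by JVM and machine.

```scala
import java.util.{Calendar, TimeZone}
import scala.collection.mutable

// Hypothetical timing harness for illustration only; not the PR's benchmark.
object CalendarLookupTiming {
  private val zones: Array[TimeZone] =
    Array("UTC", "Asia/Tokyo", "America/Los_Angeles").map(id => TimeZone.getTimeZone(id))

  // Variant 1: one cached Calendar per time zone.
  private val byZone = mutable.Map.empty[TimeZone, Calendar]
  def withMap(tz: TimeZone): Calendar =
    byZone.getOrElseUpdate(tz, Calendar.getInstance(tz))

  // Variant 2: a single Calendar whose time zone is switched on every call.
  private val shared = Calendar.getInstance()
  def withoutMap(tz: TimeZone): Calendar = {
    shared.setTimeZone(tz)
    shared
  }

  // Crude wall-clock timing; returns elapsed nanoseconds for `iterations` lookups.
  def time(iterations: Int)(lookup: TimeZone => Calendar): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < iterations) {
      lookup(zones(i % zones.length)).setTimeInMillis(0L)
      i += 1
    }
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    // Warm up both variants before measuring.
    time(100000)(withMap)
    time(100000)(withoutMap)
    println(s"with map:    ${time(1000000)(withMap)} ns")
    println(s"without map: ${time(1000000)(withoutMap)} ns")
  }
}
```

A harness like Spark's Benchmark utility or JMH would give more trustworthy numbers; this sketch only shows the shape of the two variants being compared.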
I'm fine with the map approach.
+1 for map approach
It seems setTimeInMillis results in all fields being set. If so, clear is redundant.
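One way to probe this observation is sketched below, with hypothetical names. It is a spot check rather than a proof. It reuses two calendars, one with clear() before setTimeInMillis and one without, and compares every field afterwards.

```scala
import java.util.{Calendar, TimeZone}

// Hypothetical spot check for illustration only; not code from the PR.
object ClearRedundancySketch {
  // Read every calendar field (ERA through DST_OFFSET) into a sequence.
  private def allFields(c: Calendar): Seq[Int] =
    (0 until Calendar.FIELD_COUNT).map(i => c.get(i))

  def fieldsMatch(zoneId: String, staleMillis: Long, freshMillis: Long): Boolean = {
    val tz = TimeZone.getTimeZone(zoneId)
    val withClear = Calendar.getInstance(tz)
    val withoutClear = Calendar.getInstance(tz)
    // Simulate a previous use that left both calendars at a stale instant.
    withClear.setTimeInMillis(staleMillis)
    withoutClear.setTimeInMillis(staleMillis)
    // Reinitialize: once with clear(), once without.
    withClear.clear()
    withClear.setTimeInMillis(freshMillis)
    withoutClear.setTimeInMillis(freshMillis)
    allFields(withClear) == allFields(withoutClear)
  }
}
```

If `fieldsMatch` returns true for the inputs of interest, that is consistent with clear() being unnecessary before setTimeInMillis.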
@kiszk I've added a benchmark according to your recommendation, please review.
…ing Calendar instances for input timezones instead of creating new everytime
The PR title is too long and truncated. Can you shorten it?
cc @cloud-fan
@viirya Done.
LGTM
import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils}
import org.apache.spark.util.Benchmark

object StringToTimestampBenchmark {
This benchmark tests Calendar creation, not string-to-timestamp conversion. How about naming it DateTimeUtilsBenchmark?
def getCalendar(timeZone: TimeZone): Calendar = {
  val c = threadLocalComputedCalendarsMap.get()
    .getOrElseUpdate(timeZone, {
nit: getOrElseUpdate(timeZone, Calendar.getInstance(timeZone))
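The one-line form works because the second parameter of `getOrElseUpdate` on Scala's `mutable.Map` is passed by name, so the expression is evaluated only when the key is missing. A small sketch with made-up names that counts the evaluations:

```scala
import scala.collection.mutable

// Hypothetical demonstration for illustration only; not code from the PR.
object GetOrElseUpdateSketch {
  // Returns how many times the default expression ran across `lookups` lookups of one key.
  def evaluations(lookups: Int): Int = {
    var count = 0
    val cache = mutable.Map.empty[String, Int]
    (1 to lookups).foreach { _ =>
      // The block is a by-name argument: it runs only on a cache miss.
      cache.getOrElseUpdate("UTC", { count += 1; 42 })
    }
    count
  }
}
```

So wrapping `Calendar.getInstance(timeZone)` in an explicit block adds nothing; the expression is already deferred.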
ok to test
Test build #92969 has finished for PR 21505 at commit
srowen left a comment
How big a performance impact are we talking?
@@ -0,0 +1,53 @@
/*
Does this really need to go in the code base?
Yea, that's what I was thinking.
srowen left a comment
What about the other calls to Calendar.getInstance in this module? Can, or should, they use this utility too? I'm OK with this, but have a few outstanding comments above.
}
}

def getCalendar(timeZone: TimeZone): Calendar = {
Can this be private?
ping @ssonker to update or close.
gentle ping @ssonker
I need to update this PR with more changes, will open a new one! Closing this.
What changes were proposed in this pull request?
As of now, the stringToTimestamp function in DateTimeUtils creates a calendar instance on each call. This change maintains a thread-local map from time zone to calendar and creates just one calendar per time zone. Whenever a calendar instance is requested for a given time zone, it is looked up in the map, reinitialized, and returned.
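The approach described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the code from the PR. The object name and the reinitialization steps are guesses based on the review discussion; only `getCalendar` and `threadLocalComputedCalendarsMap` appear in the diff fragments.

```scala
import java.util.{Calendar, TimeZone}
import scala.collection.mutable

// Hypothetical sketch; the object name and method body are assumptions.
object CalendarCacheSketch {
  // One map per thread, because java.util.Calendar is not thread-safe.
  private val threadLocalComputedCalendarsMap =
    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
      override def initialValue(): mutable.Map[TimeZone, Calendar] = mutable.Map.empty
    }

  def getCalendar(timeZone: TimeZone): Calendar = {
    // Create a Calendar only the first time this thread sees the time zone.
    val c = threadLocalComputedCalendarsMap.get()
      .getOrElseUpdate(timeZone, Calendar.getInstance(timeZone))
    // Reinitialize so fields from a previous use do not leak into this one.
    // The exact reset sequence here is an assumption.
    c.clear()
    c.setTimeInMillis(System.currentTimeMillis())
    c
  }
}
```

Because the same instance is handed back on every call from a given thread, callers must not hold on to the returned Calendar across calls to `getCalendar` for the same time zone.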
How was this patch tested?
Using existing test cases.
Please review http://spark.apache.org/contributing.html before opening a pull request.