Skip to content

Conversation

@ssonker
Copy link

@ssonker ssonker commented Jun 7, 2018

What changes were proposed in this pull request?

As of now, stringToTimestamp function in DateTimeUtils creates a calendar instance on each call. This change maintains a thread-local timezone to calendar map, and creates just one calendar for each timezone. Whenever a calendar instance is queried given a timezone, it is looked-up inside the map, reinitialized and returned.

How was this patch tested?

Using existing test cases.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you get the calendar instance next time, isn't its time out of date?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer line 130 for this. Before returning the calendar instance, it is reinitialized to the time it was originally created.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya ^ Does that answer you question, or you mean something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't timeInMillis also stored when you first update this map entity? So next time you access this calendar, you just set it with the old timeInMillis. Isn't it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @viirya 's comment. Do we need to set the value of System.currentTimeMillis()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk Thanks, I'm updating that. BTW, can you please help me understand a scenario where that is absolutely needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk @viirya I've updated the code. Please review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think System.currentTimeMillis() is UTC-based?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, Calendar.getTimeInMillis and setTimeInMillis are also UTC-based.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't clear reset the timezone of that Calendar instance?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. It clears all the fields and zone is not a field.

@kiszk
Copy link
Member

kiszk commented Jun 7, 2018

We would appreciate it if you put the performance before and after this PR?
It would be good to use Benchmark class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to keep Calendar for many timezone? Since getCalendar takes a time zone input, we can just keep one Calendar instance, and set it with given timezone in getCalendar. WDYT? Regarding performance, is there big difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, only the default time zone is used. To execute Cast regarding date is called with a timezone may use another timezone. For the correctness, I think that it is necessary to support multiple timezones.

To enable caching for default time zone and to create an instance for other time zones would also work correctly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk I think @viirya meant having just one thread-local calendar instance. That should also work, isn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should work functionally if we check a given time zone every time.
Do you know the typical access pattern of time zone? If there is temporal locality regarding time zone, we do not have to use mutale.Map.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk @viirya I've tried running benchmarks for with/without mutable.Map implementation. Looks like setting timezone in a calendar instance is a costly operation and it drags the performance down. As the number of timezones cannot be large, maintaining a map will not be a huge memory overhead. So, I suggest going with the mutable.Map approach. Comments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much performance down without a map here?

Copy link
Author

@ssonker ssonker Jun 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran a benchmark and following is the output:

string to timestamp calendar caching:    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
with map                                         8 /    9         12.9          77.7       1.0X
without map                                     11 /   12          9.4         106.3       0.7X

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the map approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for map approach

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya @kiszk Do you agree with this commit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems setTimeInMillis can result in all fields set. If so, clear is redundant.

@ssonker
Copy link
Author

ssonker commented Jun 11, 2018

@kiszk I've added a benchmark according to your recommendation, please review.

…ing Calendar instances for input timezones instead of creating new everytime
@ssonker
Copy link
Author

ssonker commented Jun 12, 2018

@kiszk @viirya Do you have more review comments that need to be incorporated? If not, can you please get this merged?

@viirya
Copy link
Member

viirya commented Jun 12, 2018

The PR title is too long and truncated. Can you shorten it?

@viirya
Copy link
Member

viirya commented Jun 12, 2018

cc @cloud-fan

@ssonker ssonker changed the title [SPARK-24457][SQL] Improving performance of stringToTimestamp by cach… [SPARK-24457][SQL] Improving performance of stringToTimestamp Jun 12, 2018
@ssonker
Copy link
Author

ssonker commented Jun 12, 2018

@viirya Done.
cc: @cloud-fan

@kiszk
Copy link
Member

kiszk commented Jun 12, 2018

LGTM

import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils}
import org.apache.spark.util.Benchmark

object StringToTimestampBenchmark {
Copy link
Contributor

@cloud-fan cloud-fan Jun 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark tests Calendar creation, not string to timestamp. how about DateTimeUtilsBenchmark?


def getCalendar(timeZone: TimeZone): Calendar = {
val c = threadLocalComputedCalendarsMap.get()
.getOrElseUpdate(timeZone, {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: getOrElseUpdate(timeZone, Calendar.getInstance(timeZone))

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jul 13, 2018

Test build #92969 has finished for PR 21505 at commit c940381.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How big a performance impact are we talking?

@@ -0,0 +1,53 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to go in the code base?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that's what I was thinking.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other calls to Calendar.getInstance in this module -- can, should, others use this utility? I'm OK with this, but have a few outstanding comments above.

}
}

def getCalendar(timeZone: TimeZone): Calendar = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be private?

@HyukjinKwon
Copy link
Member

ping @ssonker to update or close.

@kiszk
Copy link
Member

kiszk commented Aug 9, 2018

gentle ping @ssonker

@ssonker
Copy link
Author

ssonker commented Aug 9, 2018

I need to update this PR with more changes, will open a new one! Closing this.

@ssonker ssonker closed this Aug 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants