Speed up date_histogram by precomputing ranges #61467

Merged
merged 15 commits into elastic:master on Sep 24, 2020

Conversation

nik9000
Member

@nik9000 nik9000 commented Aug 24, 2020

A few of us were talking about ways to speed up the `date_histogram`
using the index for the timestamp rather than the doc values. To do that
we'd have to pre-compute all of the "round down" points in the index. It
turns out that *just* precomputing those values speeds up rounding
fairly significantly:

```
Benchmark  (count)      (interval)                   (range)            (zone)  Mode  Cnt          Score         Error  Units
before    10000000  calendar month  2000-10-28 to 2000-10-31               UTC  avgt   10   96461080.982 ±  616373.011  ns/op
before    10000000  calendar month  2000-10-28 to 2000-10-31  America/New_York  avgt   10  130598950.850 ± 1249189.867  ns/op
after     10000000  calendar month  2000-10-28 to 2000-10-31               UTC  avgt   10   52311775.080 ±  107171.092  ns/op
after     10000000  calendar month  2000-10-28 to 2000-10-31  America/New_York  avgt   10   54800134.968 ±  373844.796  ns/op
```

That's a 46% speed up when there isn't a time zone and a 58% speed up
when there is.

This doesn't work for every time zone; specifically, time zones that have two
midnights in a single day due to daylight saving time produce wonky
results, so they don't get the optimization.

Second, this requires a few expensive computations up front to make the
transition array. And if the transition array is too large then we give
up and use the original mechanism, throwing away all of the work we did
to build the array. This seems appropriate for most usages of `round`,
but this change uses it for *all* usages of `round`. That seems ok for
now, but it might be worth investigating in a follow up.
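
To make the technique concrete, here's a rough sketch of the array-based rounding described above. It's only an illustration of the idea; the class name and scaffolding are mine, not the actual `Rounding.java` implementation. The "round down" points are pre-computed once for the prepared range, and each `round` call is answered with a binary search instead of redoing the calendar math:

```java
import java.util.Arrays;

/**
 * Sketch: values[] holds the pre-computed "round down" points, sorted
 * ascending and covering the prepared [min, max] range; count is the
 * number of valid entries.
 */
class ArrayRoundingSketch {
    private final long[] values;
    private final int count;

    ArrayRoundingSketch(long[] values, int count) {
        this.values = values;
        this.count = count;
    }

    long round(long utcMillis) {
        int idx = Arrays.binarySearch(values, 0, count, utcMillis);
        if (idx < 0) {
            // On a miss binarySearch returns (-(insertion point) - 1); the
            // bucket's start is the entry just before the insertion point.
            idx = -2 - idx;
        }
        return values[idx];
    }
}
```

If building the array turns out to be too expensive, or the array would be too large, the idea above is abandoned and the original rounding is used instead.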

I ran a macrobenchmark as well which showed an 11% performance
improvement. *BUT* the benchmark wasn't tuned for my desktop, so it
overwhelmed the machine and might have produced "funny" results. I think it is
pretty clear that this is an improvement, but know that the measurement is
a bit weird:

Before:
```
|               Min Throughput | hourly_agg |        0.11 |  ops/s |
|            Median Throughput | hourly_agg |        0.11 |  ops/s |
|               Max Throughput | hourly_agg |        0.11 |  ops/s |
|      50th percentile latency | hourly_agg |      650623 |     ms |
|      90th percentile latency | hourly_agg |      821478 |     ms |
|      99th percentile latency | hourly_agg |      859780 |     ms |
|     100th percentile latency | hourly_agg |      864030 |     ms |
| 50th percentile service time | hourly_agg |     9268.71 |     ms |
| 90th percentile service time | hourly_agg |        9380 |     ms |
| 99th percentile service time | hourly_agg |     9626.88 |     ms |
|100th percentile service time | hourly_agg |     9884.27 |     ms |
|                   error rate | hourly_agg |           0 |      % |
```

After:
```
|               Min Throughput | hourly_agg |        0.12 |  ops/s |
|            Median Throughput | hourly_agg |        0.12 |  ops/s |
|               Max Throughput | hourly_agg |        0.12 |  ops/s |
|      50th percentile latency | hourly_agg |      519254 |     ms |
|      90th percentile latency | hourly_agg |      653099 |     ms |
|      99th percentile latency | hourly_agg |      683276 |     ms |
|     100th percentile latency | hourly_agg |      686611 |     ms |
| 50th percentile service time | hourly_agg |     8371.41 |     ms |
| 90th percentile service time | hourly_agg |     8407.02 |     ms |
| 99th percentile service time | hourly_agg |     8536.64 |     ms |
|100th percentile service time | hourly_agg |     8538.54 |     ms |
|                   error rate | hourly_agg |           0 |      % |
```

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 24, 2020
@nik9000
Member Author

nik9000 commented Aug 24, 2020

Instead of building a shed or playing video games, this was my weekend project...

I think it is a start towards the kind of thing we'd need in order to use the Lucene index to speed up date ranges. And I think it is a nice incremental improvement. And it'll still apply in some cases when we don't want to use the search index. Also, it manages to re-use a ton of tests to make sure that the pre-computing is correct.

@nik9000 nik9000 requested a review from imotov August 24, 2020 13:29
Contributor

@jpountz jpountz left a comment

I like the idea.

* Time zones with two midnights get "funny" non-continuous rounding
* that isn't compatible with the pre-computed array rounding.
*/
private static final Set<String> HAS_TWO_MIDNIGHTS = Set.of("America/Moncton", "America/St_Johns", "Canada/Newfoundland");
Contributor

would it be doable to detect timezones that don't work with this optimization at runtime instead of maintaining an allowlist?

Member Author

@nik9000 nik9000 Aug 24, 2020

I can't think of a way to do that right now. At least, not a good way.

Contributor

I thought Canada was changing time at 2 am; on the other hand, some other known time zones, such as Asia/Gaza for example, do seem to have two midnights. I did a quick and dirty test and got a slightly different list of time zones in my JVM:

```
America/Asuncion 2020-03-22T00:00 -> 2020-03-21T23:00
America/Havana 2020-11-01T01:00 -> 2020-11-01T00:00
America/Santiago 2020-04-05T00:00 -> 2020-04-04T23:00
America/Scoresbysund 2020-10-25T01:00 -> 2020-10-25T00:00
Asia/Amman 2020-10-30T01:00 -> 2020-10-30T00:00
Asia/Beirut 2020-10-25T00:00 -> 2020-10-24T23:00
Asia/Damascus 2020-10-30T00:00 -> 2020-10-29T23:00
Asia/Gaza 2020-10-31T01:00 -> 2020-10-31T00:00
Asia/Hebron 2020-10-31T01:00 -> 2020-10-31T00:00
Asia/Tehran 2020-09-21T00:00 -> 2020-09-20T23:00
Atlantic/Azores 2020-10-25T01:00 -> 2020-10-25T00:00
Chile/Continental 2020-04-05T00:00 -> 2020-04-04T23:00
Cuba 2020-11-01T01:00 -> 2020-11-01T00:00
Iran 2020-09-21T00:00 -> 2020-09-20T23:00
```

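For reference, a scan like that could look roughly like the sketch below. This is my own illustration using the `java.time` API, not the actual test code, and the output depends on the JVM's tzdata:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.zone.ZoneOffsetTransition;
import java.time.zone.ZoneRules;
import java.util.TreeSet;

public class FindTransitionsThatMoveBackADay {
    public static void main(String[] args) {
        Instant start = Instant.parse("2020-01-01T00:00:00Z");
        Instant end = Instant.parse("2021-01-01T00:00:00Z");
        for (String id : new TreeSet<>(ZoneId.getAvailableZoneIds())) {
            ZoneRules rules = ZoneId.of(id).getRules();
            ZoneOffsetTransition t = rules.nextTransition(start);
            while (t != null && t.getInstant().isBefore(end)) {
                // Overlaps that land on an earlier local day are the problematic case.
                if (t.getDateTimeAfter().toLocalDate().isBefore(t.getDateTimeBefore().toLocalDate())) {
                    System.out.println(id + " " + t.getDateTimeBefore() + " -> " + t.getDateTimeAfter());
                }
                t = rules.nextTransition(t.getInstant());
            }
        }
    }
}
```
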
Member Author

Very nice!

Contributor

Is this really only a problem when there are two midnights, or is it a more general problem when the time goes back due to daylight savings, and you just need special values of offset to make the bug occur if the transition is not around midnight?

Contributor

I've played with `offset` to double check and can't cause issues with it. However, do you think we could detect timezones that don't work dynamically instead of relying on a static list? E.g. could we iterate transitions in the considered interval and check whether there's one that brings us to a different day?

Member Author

We could use something like what the test case uses to find candidates, but it'd require loading all of the time zone rules on startup. I'm hoping that the test can prevent us from having to do that by being very careful about this list.

Contributor

Actually I wasn't thinking about testing all timezones up-front on startup; I was more thinking of doing the test when building the Rounding object, by only looking at transitions of the considered timezone.

Member Author

Hmmmm - the assert that I have below sort of asserts that. But it isn't nearly as strong. I'll think about it!

Member Author

@jpountz I've pushed a change that does this.
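
The runtime check being referred to could look something like the sketch below: iterate the zone's transitions over the prepared range and give up on the array optimization if any transition jumps back to the previous local day, unless it lands exactly at midnight, which rounding still handles. This is an illustration of the approach, not the exact code in the change:

```java
import java.time.Instant;
import java.time.zone.ZoneOffsetTransition;
import java.time.zone.ZoneRules;

class BackwardsTransitionCheck {
    /** Sketch: true if [minUtcMillis, maxUtcMillis) contains a transition that breaks array rounding. */
    static boolean anyTransitionMovesBackToPreviousDay(ZoneRules rules, long minUtcMillis, long maxUtcMillis) {
        ZoneOffsetTransition t = rules.nextTransition(Instant.ofEpochMilli(minUtcMillis));
        while (t != null && t.getInstant().toEpochMilli() < maxUtcMillis) {
            boolean sameLocalDay = t.getDateTimeBefore().getDayOfMonth() == t.getDateTimeAfter().getDayOfMonth();
            boolean exactlyAtMidnight = t.getDateTimeBefore().toLocalTime().toNanoOfDay() == 0;
            if (sameLocalDay == false && exactlyAtMidnight == false) {
                return true; // not safe: fall back to the regular rounding
            }
            t = rules.nextTransition(t.getInstant());
        }
        return false;
    }
}
```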

@@ -1015,7 +1031,7 @@ public byte id() {

     @Override
     public Prepared prepare(long minUtcMillis, long maxUtcMillis) {
-        return wrapPreparedRounding(delegate.prepare(minUtcMillis, maxUtcMillis));
+        return wrapPreparedRounding(delegate.prepare(minUtcMillis - offset, maxUtcMillis - offset));
Contributor

is this change fixing an existing bug?

Member Author

Yes, but I think the bug is worse with this change.

Member Author

The new assertions in the ArrayRounding fail without this.

*/
private static class ArrayRounding implements Prepared {
private final long[] values;
private int max;
Contributor

make it final?

Member Author

👍. I have no idea why I didn't do that the first time around.

@Override
public long round(long utcMillis) {
int idx = Arrays.binarySearch(values, 0, max, utcMillis);
if (idx < 0) {
Contributor

maybe assert that idx is neither -1 nor -1 - max?

Member Author

Sure!

Contributor

@imotov imotov left a comment

Looks good to me in general, but the timezone list is also a concern of mine. I played a bit with ZoneRules; the information that is needed is definitely there, we just need to figure out how to correctly interpret it.


/**
* Implementation of {@link Prepared} using pre-calculated "round down" points.
*/
private static class ArrayRounding implements Prepared {
Contributor

This will need to implement new methods added in #61369.

@nik9000
Member Author

nik9000 commented Aug 27, 2020 via email

Member

@not-napoleon not-napoleon left a comment

Realized I started a review and never submitted it. Sorry about that. LGTM, couple nits if you feel like addressing them.

* rounding if the transition is <strong>at</strong> midnight. Just not
* if it is <strong>after</strong> it.
*/
private static final Map<String, Long> FORWARDS_BACKWRADS_ZONES = Map.ofEntries(
Member

Nit - It looks like this maps timezone name to milliseconds since epoch for when the zone changed to use a different transition time. Can you just add a comment clarifying that's how the map is structured? It seems unlikely we'd need to add to it, but just in case we do, it'd help to have a reference for how to do so handy.

Member Author

👍

if (i >= max) {
return orig;
}
assert values[i - 1] == orig.round(rounded - 1);
Member

It's not super clear to me what this assert is guarding against. Can you add a comment please?

Member Author

👍

assertKnownMovesBacktoPreviousDay("America/Moncton", "2005-10-29T03:01:00");
assertKnownMovesBacktoPreviousDay("America/St_Johns", "2010-11-07T02:31:00");
assertKnownMovesBacktoPreviousDay("Canada/Newfoundland", "2010-11-07T02:31:00");
assertKnownMovesBacktoPreviousDay("Pacific/Guam", "1969-01-25T13:01:00");
Member Author

This test fails in CI but not for me locally! I'm looking into it. Might be the specific JVM or something.

Contributor

As far as I understand, this list constantly changes, so it might differ between JVM versions and OSes.

Member Author

I had sort of thought the past was relatively well established, but it looks like it is a Java 12+ thing....

@nik9000
Member Author

nik9000 commented Sep 23, 2020

@jpountz would you like to have another look at this? I think I managed to remove the list of bad time zones.

if (transition.getDateTimeBefore().getDayOfMonth() == transition.getDateTimeAfter().getDayOfMonth()) {
return false;
}
if (transition.getDateTimeBefore().getMinute() == 0) {
Contributor

should we check getHour() too for safety? E.g. if such a thing existed as moving from 1 AM to 11 PM the day before, it looks like this code would still consider that the transition doesn't move back to the previous day?

Member Author

Yeah!

Member Author

I pushed a change that makes sure the "NANO_OF_DAY" is 0. That seems fairly good at covering it.

@nik9000
Member Author

nik9000 commented Sep 24, 2020

Thanks @jpountz !

@nik9000
Member Author

nik9000 commented Sep 24, 2020

run elasticsearch-ci/packaging-sample-windows

@nik9000 nik9000 merged commit d9acac2 into elastic:master Sep 24, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Sep 24, 2020
nik9000 added a commit that referenced this pull request Sep 24, 2020
…62880)

@nik9000
Member Author

nik9000 commented Sep 27, 2020

I looked at implementing this for interval roundings (7d instead of month) and I can't do it easily! It turns out that a critical method, `nextRoundingValue`, is kind of "weird" for those roundings. It doesn't really do what it was documented to do. I think we have to fix that first, then we can likely use this technique to speed up those roundings.

We could also start investigating speeding up date_histogram by not rounding at all - now that we have the bounds we can use a search instead of loading the values. Something like that!

Also, we now have a much better estimate for the upper bound of the number of buckets a particular date_histogram could make. We don't do anything with that information now, but I expect we could save quite a lot of memory with it.

nik9000 added a commit that referenced this pull request Oct 5, 2020
…62881)

nik9000 added a commit that referenced this pull request Nov 9, 2020
This speeds up `date_histogram` aggregations without a parent or
children. This is quite common - it's the aggregation that Kibana's Discover
uses all over the place. Also, we hope to be able to use the same
mechanism to speed aggs with children one day, but that day isn't today.

The kind of speedup we're seeing is fairly substantial in many cases:
```
|                              |                                            |  before |   after |    |
| 90th percentile service time |           date_histogram_calendar_interval | 9266.07 | 1376.13 | ms |
| 90th percentile service time |   date_histogram_calendar_interval_with_tz | 9217.21 | 1372.67 | ms |
| 90th percentile service time |              date_histogram_fixed_interval | 8817.36 | 1312.67 | ms |
| 90th percentile service time |      date_histogram_fixed_interval_with_tz | 8801.71 | 1311.69 | ms | <-- discover's agg
| 90th percentile service time | date_histogram_fixed_interval_with_metrics | 44660.2 | 43789.5 | ms |
```

This uses the work we did in #61467 to precompute the rounding points for
a `date_histogram`. Now, when we know the rounding points we execute the
`date_histogram` as a `range` aggregation. This is nice for three reasons:
1. We can further rewrite the `range` aggregation (see below)
2. We don't need to allocate a hash to convert rounding points
   to ordinals.
3. We can send precise cardinality estimates to sub-aggs.

Points 2 and 3 above are nice, but most of the speed difference comes from
point 1. Specifically, we now look into executing `range` aggregations as
a `filters` aggregation. Normally the `filters` aggregation is quite slow
but when it doesn't have a parent or any children then we can execute it
"filter by filter" which is significantly faster. So fast, in fact, that
it is faster than the original `date_histogram`.

The `range` aggregation is *fairly* careful in how it rewrites, giving up
on the `filters` aggregation if it won't collect "filter by filter" and
falling back to its original execution mechanism.


So an aggregation like this:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "fixed_interval": "60d",
        "time_zone": "America/New_York"
      }
    }
  }
}
```

is executed like:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "range": {
        "field": "dropoff_datetime",
        "ranges": [
          {"from": 1415250000000, "to": 1420434000000},
          {"from": 1420434000000, "to": 1425618000000},
          {"from": 1425618000000, "to": 1430798400000},
          {"from": 1430798400000, "to": 1435982400000},
          {"from": 1435982400000, "to": 1441166400000},
          {"from": 1441166400000, "to": 1446350400000},
          {"from": 1446350400000, "to": 1451538000000},
          {"from": 1451538000000}
        ]
      }
    }
  }
}
```

Which in turn is executed like this:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "filters": {
        "filters": {
          "1": {"range": {"dropoff_datetime": {"gte": "2014-12-30 00:00:00", "lt": "2015-01-05 05:00:00"}}},
          "2": {"range": {"dropoff_datetime": {"gte": "2015-01-05 05:00:00", "lt": "2015-03-06 05:00:00"}}},
          "3": {"range": {"dropoff_datetime": {"gte": "2015-03-06 00:00:00", "lt": "2015-05-05 00:00:00"}}},
          "4": {"range": {"dropoff_datetime": {"gte": "2015-05-05 00:00:00", "lt": "2015-07-04 00:00:00"}}},
          "5": {"range": {"dropoff_datetime": {"gte": "2015-07-04 00:00:00", "lt": "2015-09-02 00:00:00"}}},
          "6": {"range": {"dropoff_datetime": {"gte": "2015-09-02 00:00:00", "lt": "2015-11-01 00:00:00"}}},
          "7": {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00", "lt": "2015-12-31 00:00:00"}}},
          "8": {"range": {"dropoff_datetime": {"gte": "2015-12-31 00:00:00"}}}
        }
      }
    }
  }
}
```

And *that* is faster because we can execute it "filter by filter".

Finally, notice the `range` query filtering the data. That is required for
the data set that I'm using for testing. The "filter by filter" collection
mechanism for the `filters` agg needs special case handling when the query
is a `range` query and the filter is a `range` query and they are both on
the same field. That special case handling "merges" the range query.
Without it, "filter by filter" collection is substantially slower. It's still
quite a bit quicker than the standard `filter` collection, but not nearly
as fast as it could be.
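
As a rough illustration of that "merge" (my own sketch, not the actual implementation): when the top-level query and a bucket's filter are both ranges on the same field, their bounds can be intersected into one range up front, so each bucket's filter only has to check a single interval.

```java
class RangeMergeSketch {
    /**
     * Sketch: intersect the query's [queryFrom, queryTo) bounds with one
     * bucket's [bucketFrom, bucketTo) bounds (millis since epoch). The
     * result may be empty, in which case the bucket can be skipped entirely.
     */
    static long[] merge(long queryFrom, long queryTo, long bucketFrom, long bucketTo) {
        long from = Math.max(queryFrom, bucketFrom);
        long to = Math.min(queryTo, bucketTo);
        return new long[] { from, to }; // empty when from >= to
    }
}
```
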
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Nov 9, 2020
nik9000 added a commit that referenced this pull request Nov 9, 2020
Labels
:Analytics/Aggregations Aggregations
>enhancement
Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
v7.10.0
v8.0.0-alpha1