
MergeIterator: allocate less memory at first #4341

Merged: 4 commits merged into master on Jul 6, 2021

Conversation

@bboreham (Contributor) commented on Jul 6, 2021

What this PR does:

We were allocating 24 batches up front for every input stream, where each batch holds up to 12 samples.
By allowing c.batches to reallocate when needed, we avoid pre-allocating enough memory for all possible scenarios.
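
The core of the change is the pre-allocation in newMergeIterator (the full diff appears further down); a condensed before/after, with comments added here and the surrounding struct-literal context assumed:

// Before: 2 * promchunk.BatchSize (= 24) batch slots reserved per input stream.
batches:    make(batchStream, 0, len(its)*2*promchunk.BatchSize),
batchesBuf: make(batchStream, len(its)*2*promchunk.BatchSize),

// After: start with one slot per stream; the slices grow on demand
// (via append / mergeStreams) only when a query actually needs more.
batches:    make(batchStream, 0, len(its)),
batchesBuf: make(batchStream, len(its)),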

Also fix an inaccurate end time on the chunks test data, which was throwing off the benchmark, and add more realistic test sizes: at 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks span about three weeks (1,000 × 30 minutes ≈ 21 days), a highly unrepresentative test.

Which issue(s) this PR fixes:
Fixes #1195

Benchmarks

name                                                                                                                             old time/op    new time/op    delta
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_Bigchunk-4              12.3ms ± 4%    12.1ms ± 2%     ~     (p=0.548 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_Bigchunk-4              30.3ms ± 3%    30.5ms ± 3%     ~     (p=0.548 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_Varbit-4                8.96ms ± 4%    8.80ms ± 0%     ~     (p=0.190 n=5+4)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_Varbit-4                21.8ms ± 3%    21.6ms ± 3%     ~     (p=0.548 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_DoubleDelta-4           10.7ms ± 4%    10.5ms ± 2%     ~     (p=0.310 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_DoubleDelta-4           26.5ms ± 3%    26.5ms ± 4%     ~     (p=0.841 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4    12.4ms ± 6%    12.4ms ± 5%     ~     (p=0.690 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4    31.7ms ± 4%    31.2ms ± 3%     ~     (p=0.421 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4     1.23ms ± 5%    1.22ms ± 3%     ~     (p=0.548 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4     3.11ms ± 2%    3.13ms ± 2%     ~     (p=0.421 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4       17.2µs ± 4%    13.9µs ± 2%  -18.89%  (p=0.008 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4       44.3µs ± 2%    36.4µs ± 5%  -17.86%  (p=0.008 n=5+5)

name                                                                                                                             old alloc/op   new alloc/op   delta
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_Bigchunk-4              85.0kB ± 0%    74.6kB ± 0%  -12.16%  (p=0.008 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_Bigchunk-4               320kB ± 0%     288kB ± 0%   -9.84%  (p=0.029 n=4+4)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_Varbit-4                 213kB ± 0%     202kB ± 0%   -4.86%  (p=0.029 n=4+4)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_Varbit-4                 703kB ± 0%     672kB ± 0%     ~     (p=0.079 n=4+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_DoubleDelta-4            213kB ± 0%     202kB ± 0%   -4.86%  (p=0.008 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_DoubleDelta-4            703kB ± 0%     672kB ± 0%   -4.48%  (p=0.008 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4    85.0kB ± 0%    74.6kB ± 0%  -12.16%  (p=0.029 n=4+4)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4     320kB ± 0%     288kB ± 0%   -9.84%  (p=0.008 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4     19.6kB ± 0%     9.3kB ± 0%  -52.76%  (p=0.008 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4     65.3kB ± 0%    33.8kB ± 0%  -48.25%  (p=0.008 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4       11.9kB ± 0%     1.6kB ± 0%  -86.94%  (p=0.008 n=5+5)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4       35.2kB ± 0%     3.7kB ± 0%  -89.54%  (p=0.008 n=5+5)

name                                                                                                                             old allocs/op  new allocs/op  delta
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_Bigchunk-4               1.01k ± 0%     1.01k ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_Bigchunk-4               3.02k ± 0%     3.02k ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_Varbit-4                 2.01k ± 0%     2.01k ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_Varbit-4                 6.02k ± 0%     6.02k ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_DoubleDelta-4            3.01k ± 0%     3.01k ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_DoubleDelta-4            9.02k ± 0%     9.02k ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4     1.01k ± 0%     1.01k ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1000_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4     3.02k ± 0%     3.02k ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4        113 ± 0%       113 ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_100_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4        323 ± 0%       323 ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_1_encoding:_PrometheusXorChunk-4         14.0 ± 0%      14.0 ± 0%     ~     (all equal)
NewChunkMergeIterator_CreateAndIterate/chunks:_1_samples_per_chunk:_100_duplication_factor:_3_encoding:_PrometheusXorChunk-4         26.0 ± 0%      26.0 ± 0%     ~     (all equal)

Checklist

  • Tests updated
  • Documentation added (N/A)
  • CHANGELOG.md updated

The `through` time is supposed to be the last time in the chunk, and
having it one step higher was throwing off other tests and benchmarks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
At 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks
is about three weeks, a highly un-representative test.

Instant queries, such as those done by the ruler, will only fetch one
chunk from each ingester.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
We were allocating 24x the number of streams of batches, where each
batch holds up to 12 samples.

By allowing `c.batches` to reallocate when needed, we avoid the need
to pre-allocate enough memory for all possible scenarios.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
@@ -112,8 +112,7 @@ func (c *mergeIterator) buildNextBatch(size int) bool {
for len(c.h) > 0 && (len(c.batches) == 0 || c.nextBatchEndTime() >= c.h[0].AtTime()) {
c.nextBatchBuf[0] = c.h[0].Batch()
c.batchesBuf = mergeStreams(c.batches, c.nextBatchBuf[:], c.batchesBuf, size)
copy(c.batches[:len(c.batchesBuf)], c.batchesBuf)
c.batches = c.batches[:len(c.batchesBuf)]
Contributor

This is a no-op, right? Did it impact performance?

Contributor

My guess about this change (but Bryan can confirm or negate), is that we had to do this change because c.batches may need to grow after the change in newMergeIterator(). @bboreham is my understanding correct?

Contributor Author

Yes the append will grow the slice if required, whereas copy will panic. TestMergeIter/DoubleDelta fails if you don't make this change.
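
A standalone toy program (an illustration added here, not the Cortex code) showing the behaviour discussed above: append reallocates the backing array when capacity runs out, whereas the old copy-based pattern first reslices the destination to the required length, and that reslice panics once the length exceeds the pre-allocated capacity.

package main

import "fmt"

func main() {
	src := []int{1, 2, 3, 4}

	// append grows the backing array as needed.
	dst := make([]int, 0, 2)
	dst = append(dst[:0], src...)
	fmt.Println("append:", dst, "cap:", cap(dst))

	// The copy-based pattern reslices the destination to len(src) first;
	// that reslice panics because len(src) exceeds the capacity of small.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("copy pattern panicked:", r)
		}
	}()
	small := make([]int, 0, 2)
	copy(small[:len(src)], src) // panics: slice bounds out of range [:4] with capacity 2
	fmt.Println("unreachable")
}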

- batches: make(batchStream, 0, len(its)*2*promchunk.BatchSize),
- batchesBuf: make(batchStream, len(its)*2*promchunk.BatchSize),
+ batches: make(batchStream, 0, len(its)),
+ batchesBuf: make(batchStream, len(its)),
Contributor

I don't recall exactly why the pre-allocation was so big - wondering if you know why?

I can't think of a reason why this would affect correctness either, and the perf results speak for themselves...

@pracucci (Contributor) Jul 6, 2021

From the correctness perspective, this change should be fine. batchesBuf looks to be written only by mergeStreams() which extends the slice if required.
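
The extend-if-required idea referred to here reads roughly like this (a generic, hypothetical helper for illustration, not the actual mergeStreams signature):

package main

import "fmt"

// ensureLen reuses the result slice's backing array when it is large
// enough, otherwise allocates a bigger one, then reslices to length n.
func ensureLen(result []int, n int) []int {
	if cap(result) < n {
		result = make([]int, n, 2*n) // grow with some headroom
	}
	return result[:n]
}

func main() {
	buf := make([]int, 0, 2)
	buf = ensureLen(buf, 5)         // capacity 2 is too small, so a new array is allocated
	fmt.Println(len(buf), cap(buf)) // 5 10
}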

@pracucci (Contributor) left a comment

I don't see any reason not to measure the impact in prod 🎉 We can merge and deploy to measure the impact on both queries and rules. Worst case, rolling back this change is trivial.

@bboreham merged commit 95fedaa into master on Jul 6, 2021
@bboreham deleted the tune-merge-iterator branch on July 6, 2021 at 13:26
alvinlin123 pushed a commit to ac1214/cortex that referenced this pull request Jan 14, 2022
* MergeIterator: allocate less memory at first

We were allocating 24x the number of streams of batches, where each
batch holds up to 12 samples.

By allowing `c.batches` to reallocate when needed, we avoid the need
to pre-allocate enough memory for all possible scenarios.

* chunk_test: fix inaccurate end time on chunks

The `through` time is supposed to be the last time in the chunk, and
having it one step higher was throwing off other tests and benchmarks.

* MergeIterator benchmark: add more realistic sizes

At 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks
is about three weeks, a highly un-representative test.

Instant queries, such as those done by the ruler, will only fetch one
chunk from each ingester.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Successfully merging this pull request may close these issues:

Streaming queries are very inefficient