
sql: shrink SampleReservoir capacity on memory exhaustion #65491

Merged
merged 3 commits on Jun 9, 2021

Conversation


@michae2 michae2 commented May 19, 2021

sql: dynamically shrink sampleAggregator capacity as needed

To avoid bias, sampleAggregator must always use a capacity <= the
capacity of each samplerProcessor feeding it. This was always implicitly
true before, as every samplerProcessor and sampleAggregator used the
same fixed capacity. But the next few commits will give samplerProcessors
(and sampleAggregators) the ability to dynamically decrease capacity
when out of memory, meaning we now sometimes need to resize
sampleAggregator to keep this invariant true.

This commit does not yet change any behavior because the capacities of
samplerProcessors are still static. The next few commits will change that.

Release note: None
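The shrink-only invariant above can be sketched with a toy reservoir. This is a hypothetical simplification (the `Reservoir` and `SampledRow` types here stand in for the real ones in pkg/sql/stats and differ from the actual implementation): capacity may only ever decrease, and shrinking evicts the highest-rank samples first.

```go
package main

import (
	"container/heap"
	"fmt"
)

// SampledRow pairs a sampled row with its random rank (hypothetical
// simplification of the real struct).
type SampledRow struct {
	Row  []int
	Rank uint64
}

// Reservoir is a max-heap on Rank, so the worst (highest-rank) sample
// sits at the root and is evicted first.
type Reservoir struct{ samples []SampledRow }

func (r *Reservoir) Len() int           { return len(r.samples) }
func (r *Reservoir) Less(i, j int) bool { return r.samples[i].Rank > r.samples[j].Rank }
func (r *Reservoir) Swap(i, j int)      { r.samples[i], r.samples[j] = r.samples[j], r.samples[i] }
func (r *Reservoir) Push(x interface{}) { r.samples = append(r.samples, x.(SampledRow)) }
func (r *Reservoir) Pop() interface{} {
	n := len(r.samples)
	s := r.samples[n-1]
	r.samples = r.samples[:n-1]
	return s
}

// MaybeResize shrinks the reservoir to capacity k by evicting the
// highest-rank samples. It never grows: growing would over-represent
// later rows relative to earlier rows and bias the sample.
func (r *Reservoir) MaybeResize(k int) bool {
	if k >= len(r.samples) {
		return false
	}
	heap.Init(r)
	for len(r.samples) > k {
		heap.Pop(r)
	}
	return true
}

func main() {
	r := &Reservoir{samples: []SampledRow{{Rank: 5}, {Rank: 1}, {Rank: 3}, {Rank: 2}, {Rank: 4}}}
	r.MaybeResize(2)
	fmt.Println("kept", len(r.samples), "samples")
}
```

Because eviction always removes the highest ranks, the surviving samples are exactly what a smaller reservoir would have retained, which is why a parent aggregator can safely shrink to match its children.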


sql: fix memory accounting in SampleReservoir.copyRow

When overwriting the highest-rank sample with a new one, we might have
smaller datums than before, so we should account for memory use
shrinking as well as growing. (Of course, those datums are in a
datumAlloc, so we don't know exactly when they will be GC'd... but this
is the best accounting we can do without hooking into the garbage
collector.)

Release note: None
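The accounting fix can be illustrated with a signed delta against a budget. This is a minimal sketch under stated assumptions: `BoundAccount` and `accountForOverwrite` are hypothetical stand-ins, not CockroachDB's actual memory-accounting API.

```go
package main

import (
	"errors"
	"fmt"
)

// BoundAccount is a hypothetical stand-in for a memory account: it
// tracks bytes used against a fixed budget.
type BoundAccount struct {
	used, budget int64
}

// Grow reserves n more bytes, failing if the budget would be exceeded.
func (a *BoundAccount) Grow(n int64) error {
	if a.used+n > a.budget {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

// Shrink credits n bytes back to the account.
func (a *BoundAccount) Shrink(n int64) { a.used -= n }

// accountForOverwrite adjusts the account when a sample row is
// overwritten in place: the new datums may be smaller than the old
// ones, so the delta can be negative as well as positive.
func accountForOverwrite(acc *BoundAccount, beforeSize, afterSize int64) error {
	if delta := afterSize - beforeSize; delta > 0 {
		return acc.Grow(delta) // may fail if the budget is exhausted
	} else if delta < 0 {
		acc.Shrink(-delta) // credit the freed bytes back
	}
	return nil
}

func main() {
	acc := &BoundAccount{budget: 100}
	_ = acc.Grow(80)                     // existing sample uses 80 bytes
	_ = accountForOverwrite(acc, 80, 40) // overwrite with a smaller row
	fmt.Println("bytes accounted:", acc.used)
}
```

Only accounting for growth, as before the fix, would leave the account permanently inflated whenever a large sample is replaced by a smaller one.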


sql: shrink SampleReservoir capacity on memory exhaustion

Fixes: #62206

Instead of returning an error when out of memory, make
SampleReservoir.SampleRow dynamically reduce capacity and retry. This
will result in a lower-resolution histogram, but lower-resolution is
better than no histogram at all!

Release note (bug fix): fix histogram generation when table statistics
collection reaches memory limits.


sql: add minimum SampleReservoir capacity

Below a certain number of samples, the histograms would be so imprecise
that they would not be worth generating. Add the ability to set this
threshold in SampleReservoir. This commit sets the threshold to the
max number of histogram buckets, which was picked because it seemed
sensible, not for any mathematical or empirical reason.

Release note: None
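Taken together, the last two commits amount to a retry loop with a floor. The following is a hypothetical sketch of that strategy (the function name `sampleRowWithRetry` and the halving step are illustrative assumptions; they are not the PR's exact code): on an out-of-memory error, cut capacity and retry, and give up once capacity would fall below the minimum.

```go
package main

import (
	"errors"
	"fmt"
)

var errOutOfMemory = errors.New("out of memory")

// sampleRowWithRetry tries an allocating operation at the given
// reservoir capacity. On an out-of-memory error it halves the capacity
// and retries, giving up once the capacity would drop below minCapacity
// (e.g. the maximum number of histogram buckets). It returns the final
// capacity along with the last error, if any.
func sampleRowWithRetry(capacity, minCapacity int, try func(capacity int) error) (int, error) {
	for {
		err := try(capacity)
		if err == nil || !errors.Is(err, errOutOfMemory) {
			return capacity, err
		}
		next := capacity / 2 // halve to amortize reallocation cost
		if next < minCapacity {
			// Too few samples for a meaningful histogram: give up.
			return capacity, err
		}
		capacity = next
	}
}

func main() {
	// Pretend the memory budget only fits 3000 samples.
	try := func(capacity int) error {
		if capacity > 3000 {
			return errOutOfMemory
		}
		return nil
	}
	capacity, err := sampleRowWithRetry(10000, 200, try)
	fmt.Println(capacity, err)
}
```

A lower-resolution histogram is still useful, but below the floor the loop fails with the original error rather than producing a wildly inaccurate one.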

@cockroach-teamcity

This change is Reviewable

@michae2 michae2 requested a review from a team May 19, 2021 23:47

michae2 commented May 19, 2021

I'm still working on testing, but this is probably RFAL.

@yuzefovich yuzefovich left a comment

I'll leave it to @rytaft to review it much closer, just a couple of quick comments from me.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/rowexec/sampler.go, line 115 at r4 (raw file):

	sampleSize, minSampleSize := int(spec.SampleSize), int(spec.MinSampleSize)
	if minSampleSize < 1 || minSampleSize > sampleSize {

super nit: this logic is duplicated in two places. Might be worth extracting into a helper.


pkg/sql/stats/row_sampling.go, line 134 at r2 (raw file):

		// for the additional memory used after copying inside copyRow.
		if sr.memAcc != nil {
			if err := sr.memAcc.Grow(ctx, int64(len(rowCopy))*int64(rowCopy[0].Size())); err != nil {

This line seems suspicious. If we have a row of INT, BYTES, DECIMAL, we will account for 3 x INT. Then in copyRow our beforeSize will be rowCopy[i].Size i.e. each of the datums separately.

@michae2 michae2 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/stats/row_sampling.go, line 134 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This line seems suspicious. If we have a row of INT, BYTES, DECIMAL, we will account for 3 x INT. Then in copyRow our beforeSize will be rowCopy[i].Size i.e. each of the datums separately.

I believe at this point rowCopy[0].Size() will only be EncDatumOverhead because we have not yet set the values to anything, so this should be equivalent to the sum of Size() of each element of rowCopy. But this line confused me, too. I will change it to use rowCopy.Size().

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/stats/row_sampling.go, line 134 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

I believe at this point rowCopy[0].Size() will only be EncDatumOverhead because we have not yet set the values to anything, so this should be equivalent to the sum of Size() of each element of rowCopy. But this line confused me, too. I will change it to use rowCopy.Size().

Oh yeah, indeed, we haven't performed a copy yet, nvm then.

@rytaft rytaft left a comment

Very cool! Needs some tests, but definitely seems like it's on the right track.

Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 7 of 7 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2)


pkg/sql/rowexec/sample_aggregator.go, line 287 at r1 (raw file):

				// Inverted sample row.
				// Retain the rows with the top ranks.
				if capacity, err := row[s.numRowsCol].GetInt(); err == nil {

This could use an update to the comment explaining what's happening.


pkg/sql/rowexec/sample_aggregator.go, line 309 at r1 (raw file):

			// Sample row.
			// Retain the rows with the top ranks.
			if capacity, err := row[s.numRowsCol].GetInt(); err == nil {

ditto


pkg/sql/rowexec/sampler_test.go, line 102 at r1 (raw file):

		}
		for i := 2; i < len(outTypes); i++ {
			if i != 3 /* num rows col */ && !row[i].IsNull() {

nit: I'd just make a const var called numRowsCol. While you're at it, it would be awesome if you could improve this test by making const vars for the other columns too.


pkg/sql/stats/row_sampling.go, line 82 at r1 (raw file):

}

func (sr *SampleReservoir) Cap() int {

nit: needs a comment


pkg/sql/stats/row_sampling.go, line 117 at r1 (raw file):

		samp := heap.Pop(sr).(SampledRow)
		if sr.memAcc != nil {
			sr.memAcc.Shrink(ctx, int64(samp.Row.Size()))

I'm not sure we can actually call shrink here without allocating a completely new array. Even though samples no longer references these rows, I don't think they can be garbage collected until all references to the underlying array are gone, at which point the entire array will be GC'd.
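The point about the backing array can be shown concretely. This is a hypothetical sketch (a toy `SampledRow` and `shrinkSamples` helper, not the PR's code): copying the kept samples into a freshly allocated slice releases the old backing array, whereas merely re-slicing would keep every evicted row reachable.

```go
package main

import "fmt"

// SampledRow is a toy stand-in for a sampled row and its rank.
type SampledRow struct {
	Row  []byte
	Rank uint64
}

// shrinkSamples copies the kept samples into a freshly allocated,
// exactly-sized slice. Merely re-slicing (samples = samples[:k]) would
// keep the old backing array reachable, along with every evicted row it
// references, so that memory could never actually be reclaimed.
func shrinkSamples(samples []SampledRow, k int) []SampledRow {
	if k >= len(samples) {
		return samples
	}
	out := make([]SampledRow, k)
	copy(out, samples[:k])
	return out
}

func main() {
	samples := make([]SampledRow, 8, 16)
	shrunk := shrinkSamples(samples, 4)
	// The new slice has its own, exactly-sized backing array.
	fmt.Println(len(shrunk), cap(shrunk))
}
```

Once no references to the original slice remain, the garbage collector can reclaim the old array and the rows it pointed to, making the accounting `Shrink` call honest.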


pkg/sql/stats/row_sampling.go, line 124 at r3 (raw file):

}

func (sr *SampleReservoir) trySampleRow(

Would be good to add a function comment here


pkg/sql/stats/row_sampling.go, line 151 at r3 (raw file):

	if len(sr.samples) > 0 && rank < sr.samples[0].Rank {
		if err := sr.copyRow(ctx, evalCtx, sr.samples[0].Row, row); err != nil {
			// XXX At this point sr.samples[0].Row could have a mix of old and new

What's the XXX for?


pkg/sql/stats/row_sampling.go, line 176 at r3 (raw file):

		}
		// We've used too much memory. Remove the highest-rank sample and try again.
		sr.Resize(ctx, len(sr.samples)-1)

I think we probably want to be more aggressive here -- otherwise we'll keep hitting the memory limit. Instead of just removing one sample at a time, maybe we should cut the number of samples in half (or reduce by some other fraction of the total). (This will also amortize the cost of creating a new array on each Resize, which as I mention above, is something I think you need to do to make this work correctly).


pkg/sql/stats/row_sampling.go, line 80 at r4 (raw file):

	}
	if minNumSamples < 1 {
		return errors.AssertionFailedf("minNumSamples (%d) < 1", numSamples)

numSamples -> minNumSamples


pkg/sql/stats/row_sampling.go, line 191 at r4 (raw file):

		err := sr.trySampleRow(ctx, evalCtx, row, rank)
		if err == nil || !sqlerrors.IsOutOfMemoryError(err) ||
			len(sr.samples) <= 1 || cap(sr.samples) <= sr.minNumSamples {

Do you still need to check if len(sr.samples) <= 1? Isn't that handled by minNumSamples?


rytaft commented May 20, 2021

I think you also need to run make bazel-generate

@michae2 michae2 force-pushed the sampler branch 2 times, most recently from f9708f6 to ee0deda Compare May 26, 2021 06:02
@michae2 michae2 left a comment

Well, still no test but I've:

  • addressed the comments (thank you)
  • added some logging
  • refactored the retry in SampleReservoir
  • fixed a spot I missed in sampleAggregator.generateHistogram

Test is coming, I swear, but pushing this to see what TeamCity thinks.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)


pkg/sql/rowexec/sample_aggregator.go, line 287 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This could use an update to the comment explaining what's happening.

OK, how's this?


pkg/sql/rowexec/sample_aggregator.go, line 309 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

ditto

OK.


pkg/sql/rowexec/sampler.go, line 115 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

super nit: this logic is duplicated in two places. Might be worth extracting into a helper.

OK, I moved it into SampleReservoir.Init.


pkg/sql/rowexec/sampler_test.go, line 102 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: I'd just make a const var called numRowsCol. While you're at it, it would be awesome if you could improve this test by making const vars for the other columns too.

OK, this should be a little better now.


pkg/sql/stats/row_sampling.go, line 117 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I'm not sure we can actually call shrink here without allocating a completely new array. Even though samples no longer references these rows, I don't think they can be garbage collected until all references to the underlying array are gone, at which point the entire array will be GC'd.

Oh crap, good catch! Done.


pkg/sql/stats/row_sampling.go, line 134 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Oh yeah, indeed, we haven't performed a copy yet, nvm then.

OK, this is now using rowCopy.Size().


pkg/sql/stats/row_sampling.go, line 124 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Would be good to add a function comment here

Done.


pkg/sql/stats/row_sampling.go, line 151 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

What's the XXX for?

This was a type of comment we used at a previous company. Changed it to WARNING.


pkg/sql/stats/row_sampling.go, line 176 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think we probably want to be more aggressive here -- otherwise we'll keep hitting the memory limit. Instead of just removing one sample at a time, maybe we should cut the number of samples in half (or reduce by some other fraction of the total). (This will also amortize the cost of creating a new array on each Resize, which as I mention above, is something I think you need to do to make this work correctly).

Yes, you're right. Done.


pkg/sql/stats/row_sampling.go, line 80 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

numSamples -> minNumSamples

Ah! Good catch! But I ended up changing it anyway. 😄


pkg/sql/stats/row_sampling.go, line 191 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Do you still need to check if len(sr.samples) <= 1? Isn't that handled by minNumSamples?

It is in almost every case, but in the (extremely unlikely) case where we hit a memory limit with only 0 or 1 row sampled the len check would matter. (The memory limit would have to be very very low for this to happen.)

@michae2 michae2 left a comment

Latest force push:

  • New commit adding memory limit testing to row_sampling_test.go.
  • This uncovered bugs in SampleReservoir.Pop and SampleReservoir.MaybeResize which were fixed.
  • Also fixed some small compilation errors and a few comments.

I believe there will be one more commit and force push to add some testing around samplerProcessor and sampleAggregator, so you don't have to look yet (though you're welcome to).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)

@rytaft rytaft left a comment

Reviewed 4 of 8 files at r5, 2 of 2 files at r6, 5 of 8 files at r9, 1 of 2 files at r10, 3 of 3 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/rowexec/sample_aggregator.go, line 287 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

OK, how's this?

Looks good


pkg/sql/rowexec/sample_aggregator.go, line 379 at r6 (raw file):

		if sr.MaybeResize(ctx, int(capacity)) {
			log.Infof(
				ctx, "histogram samples reduced from %d to %d to match sample processor",

nit: sample -> sampler


pkg/sql/rowexec/sample_aggregator.go, line 401 at r11 (raw file):

	} else if sr.Cap() != prevCapacity {
		log.Infof(
			ctx, "histogram samples reduced from %d to %d due to excessive memory utilization",

are we logging this same message multiple times? or do these all cover different cases?


pkg/sql/stats/row_sampling.go, line 60 at r11 (raw file):

	// minNumSamples is the minimum K needed for sampling to be meaningful. If the
	// reservoir capacity would fall below this, SampleRow will err instead of
	// decreasing K further.

nit: Here and below throughout this file you reference "K", but there's no variable named K, and I think it just adds confusion. I'd just say "capacity" (if that's the intended meaning) instead of "K".


pkg/sql/stats/row_sampling_test.go, line 45 at r11 (raw file):

		} else if sr.Cap() != prevCapacity {
			t.Logf(
				"samples reduced from %d to %d during SampleRow",

Instead of just logging, maybe assert that the number of samples was reduced by an expected amount?

@michae2 michae2 left a comment

Latest push:

  • rebases on master tip
  • adds some testing for samplerProcessor and sampleAggregator
  • fixes a bug in SampleReservoir.retryMaybeResize that this testing caught
  • no longer silently drops errors returned by SampleReservoir.Init
  • tweaks some comments and commit messages slightly

I think this is RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)


pkg/sql/rowexec/sample_aggregator.go, line 401 at r11 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

are we logging this same message multiple times? or do these all cover different cases?

These cover two different cases:

  1. A child samplerProcessor ran out of memory, and this sampleAggregator has to shrink to match the capacity of that child.
  2. This sampleAggregator itself ran out of memory.

pkg/sql/stats/row_sampling.go, line 60 at r11 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: Here and below throughout this file you reference "K", but there's no variable named K, and I think it just adds confusion. I'd just say "capacity" (if that's the intended meaning) instead of "K".

"K" comes from the big comment above the definition of SampleReservoir. I've changed some of the comments so that they say "reservoir capacity (K)" instead of just "K", is that better?


pkg/sql/stats/row_sampling_test.go, line 45 at r11 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Instead of just logging, maybe assert that the number of samples was reduced by an expected amount?

It's a good point, but I'm worried that specifying an exact amount would cause the test to flake if memory usage changed slightly and the SampleReservoir shrunk more or less than it did before.

@michae2 michae2 force-pushed the sampler branch 2 times, most recently from b82bd9f to 3ce7580 Compare June 7, 2021 18:14
@michae2 michae2 left a comment

Sigh, sorry for one more push. As it turns out, SampleReservoir needs to support a capacity of 0 (samplerProcessor and sampleAggregator use this when histograms are disabled), so a slight change to SampleReservoir.Init and SampleReservoir.retryMaybeResize is needed. This should make the tests happy.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)

@rytaft rytaft left a comment

This is getting really close!

It seems like there are a lot of commits -- can you squash any of them? Also make sure that each commit builds/passes tests individually.

Correct me if I'm wrong, but I don't see any tests that specifically verify the goal of this PR to degrade gracefully with tightening memory conditions. So instead of just verifying that we do or don't run out of memory in a certain test case, it would be good to check that in cases where we would have run out of memory previously, now we succeed, but the number of samples is reduced to an expected number (both in the sampler and the sampleAggregator). You can use make stress to make sure that you're not introducing flakes.

Reviewed 9 of 9 files at r12, 1 of 1 files at r13, 2 of 4 files at r14, 5 of 8 files at r15, 1 of 2 files at r16, 3 of 3 files at r17, 1 of 1 files at r18, 1 of 1 files at r20, 4 of 8 files at r24, 1 of 4 files at r25, 1 of 2 files at r26, 3 of 3 files at r27, 1 of 1 files at r28, 1 of 1 files at r30.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/rowexec/sample_aggregator_test.go, line 305 at r20 (raw file):

	)

	inputRows = [][]int{

would help to add a comment to explain what the following cases are testing


pkg/sql/rowexec/sample_aggregator_test.go, line 348 at r20 (raw file):

	runSampleAggregator(
		t, server, sqlDB, kvDB, st, &evalCtx, 1<<15 /* memLimitBytes */, false, /* expectOutOfMemory */
		2 /* numSamples */, 4 /* aggNumSamples */, inputRows, expected,

are these supposed to be different?


pkg/sql/rowexec/sample_aggregator_test.go, line 152 at r30 (raw file):

	spec := &execinfrapb.SampleAggregatorSpec{
		SampleSize:       aggNumSamples,
		MinSampleSize:    2,

you could pass this as a param too


pkg/sql/rowexec/sampler_test.go, line 208 at r20 (raw file):

	runSampler(t, numRows, numSamples, numSamples/2, memLimitBytes, true /* expectOutOfMemory */)
	runSampler(t, numRows, numSamples, numSamples/3, memLimitBytes, true /* expectOutOfMemory */)
	runSampler(t, numRows, numSamples, numSamples/4, memLimitBytes, false /* expectOutOfMemory */)

what are these tests showing?


pkg/sql/stats/row_sampling.go, line 60 at r11 (raw file):

Previously, michae2 (Michael Erickson) wrote…

"K" comes from the big comment above the definition of SampleReservoir. I've changed some of the comments so that they say "reservoir capacity (K)" instead of just "K", is that better?

ah got it - thanks!


pkg/sql/stats/row_sampling.go, line 131 at r30 (raw file):

// introducing bias. Increasing capacity would cause later rows to be
// over-represented relative to earlier rows.)
func (sr *SampleReservoir) MaybeResize(ctx context.Context, k int) bool {

mention in the comment what the return param means (or just remove the return param -- is it even used?)


pkg/sql/stats/row_sampling.go, line 152 at r30 (raw file):

// retryMaybeResize tries to execute a memory-allocating operation, shrinking
// the capacity of the reservoir (K) as necessary until the operation succeeds
// or fails with a non-out-of-memory error.

nit: this implies it would never fail with an out of memory error, but that doesn't seem to be the case


pkg/sql/stats/row_sampling_test.go, line 45 at r11 (raw file):

Previously, michae2 (Michael Erickson) wrote…

It's a good point, but I'm worried that specifying an exact amount would cause the test to flake if memory usage changed slightly and the SampleReservoir shrunk more or less than it did before.

Maybe you could assert that it's within a range? I think it would be better to deal with flakes and tweak the test as needed than to have it silently fail. You can check for possible flakes by running make stress on the test. If it runs a large number of times without failing it should be good to go.

@michae2 michae2 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


pkg/sql/rowexec/sample_aggregator_test.go, line 348 at r20 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

are these supposed to be different?

Here we're varying the memory limit, and below we're varying numSamples and aggNumSamples. But it's not really clear. Let me clean this up and add some comments.


pkg/sql/rowexec/sampler_test.go, line 208 at r20 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

what are these tests showing?

This is the closest to the testing you describe (though not exactly the same, and not as clear): it holds the memory limit fixed while decreasing minNumSamples, the required number of samples, until finding the point at which we can stay within the memory limit. minNumSamples = numSamples is equivalent to the old behavior (fail if we do not have enough memory for the requested capacity). From there we decrease the required number of samples until we're inside the memory limit.

But I think what you described (holding number of samples fixed while varying the memory limit) is a better test, so I will add that.


pkg/sql/stats/row_sampling_test.go, line 45 at r11 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Maybe you could assert that it's within a range? I think it would be better to deal with flakes and tweak the test as needed than to have it silently fail. You can check for possible flakes by running make stress on the test. If it runs a large number of times without failing it should be good to go.

Good points, you are right. I will do that.

@rytaft rytaft left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


pkg/sql/rowexec/sampler_test.go, line 208 at r20 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This is the closest to the testing you describe (though not exactly the same, and not as clear): it holds the memory limit fixed while decreasing minNumSamples, the required number of samples, until finding the point at which we can stay within the memory limit. minNumSamples = numSamples is equivalent to the old behavior (fail if we do not have enough memory for the requested capacity). From there we decrease the require number of samples until we're inside the memory limit.

But I think what you described (holding number of samples fixed while varying the memory limit) is a better test, so I will add that.

This seems like a good test too, maybe just add a comment to explain

@michae2 michae2 left a comment

Ok, latest push squashes this all into three commits, and enhances the testing a bit, hopefully to good effect. 🙂

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)


pkg/sql/rowexec/sample_aggregator_test.go, line 305 at r20 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

would help to add a comment to explain what the following cases are testing

Done (down in the table of test cases).


pkg/sql/rowexec/sample_aggregator_test.go, line 152 at r30 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

you could pass this as a param too

Done.


pkg/sql/rowexec/sampler_test.go, line 208 at r20 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This seems like a good test too, maybe just add a comment to explain

Done.


pkg/sql/stats/row_sampling.go, line 131 at r30 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

mention in the comment what the return param means (or just remove the return param -- is it even used?)

Done. (It is only used to simplify some logging in sample_aggregator.go.)


pkg/sql/stats/row_sampling.go, line 152 at r30 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: this implies it would never fail with an out of memory error, but that doesn't seem to be the case

You're right, I changed the comment a bit.

@michae2 michae2 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)


pkg/sql/rowexec/sample_aggregator_test.go, line 348 at r20 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Here we're varying the memory limit, and below we're varying numSamples and aggNumSamples. But it's not really clear. Let me clean this up and add some comments.

(Done.)


pkg/sql/stats/row_sampling_test.go, line 45 at r11 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Good points, you are right. I will do that.

I ended up checking it exactly, which does seem fine.

@rytaft rytaft left a comment

nit: I wouldn't call this a bug fix (in your release note). "performance improvement" seems more accurate.

Otherwise, :lgtm_strong: Nice work getting this over the finish line!

Let's let this bake on master for a few weeks before backporting to 21.1 and maybe 20.2.

Reviewed 11 of 11 files at r31, 10 of 10 files at r32, 4 of 4 files at r33.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)

To avoid bias, sampleAggregator must always use a capacity <= the
capacity of each child samplerProcessor feeding it. This was always
implicitly true before, as every samplerProcessor and sampleAggregator
used the same fixed capacity. But the next few commits will give
samplerProcessors (and sampleAggregators) the ability to dynamically
shrink capacity when out of memory, meaning we now sometimes need to
resize sampleAggregator to keep this invariant true.

(Resizing sampleAggregator really means resizing the underlying
SampleReservoir, so most of the changes are there.)

This commit does not yet change any behavior, because the capacities of
samplerProcessors are still static. The next few commits will change that.

Also factor the giant closure out of TestSampleAggregator into an
external function, to make table-driven testing easier.

Release note: None

Fixes: cockroachdb#62206

Instead of returning an error when out of memory, make
SampleReservoir.SampleRow dynamically reduce capacity and retry. Fewer
samples will result in a less accurate histogram, but less accurate is
probably still better than no histogram at all, up to a point.

If the number of samples falls below a minimum threshold, then give up
and disable histogram collection, as we were doing originally, rather
than using a wildly inaccurate histogram.

Also fix some memory accounting in SampleReservoir.copyRow.

Release note (performance improvement): continue to generate histograms
when table statistics collection reaches memory limits, instead of
disabling histogram generation.

SampleAggregator can also run out of memory when gathering the datums
for histogram generation. Push this datum gathering down into
SampleReservoir so that we can use the same dynamic capacity-shrinking
strategy as used in SampleRow to reclaim some memory and try again.

Release note: None

michae2 commented Jun 8, 2021

TFTR!

bors r+


craig bot commented Jun 9, 2021

Build succeeded:

Successfully merging this pull request may close these issues.

stats: table has no histogram statistics created