
Fix memory leak caused by object pool and zero copy#2839

Merged
bboreham merged 2 commits into cortexproject:master from storyicon:master
Jul 7, 2020

Conversation

@storyicon
Contributor

@storyicon storyicon commented Jul 6, 2020

What this PR does:
The purpose of this PR is to fix a memory leak in the Cortex Ingester. After it is merged, it should reduce the Ingester's runtime memory by at least 30%.

Which issue(s) this PR fixes:
Fixes #2665

Although the PR code looks very simple, the principle behind it is more complicated. I have tried to make it as easy to understand as possible; if any part is wrong, please correct me.
While troubleshooting the Ingester's memory problem, I simplified Ingester.Push so that it does almost nothing, like below:

// Push implements client.IngesterServer
func (i *Ingester) Push(ctx context.Context, req *client.WriteRequest) (*client.WriteResponse, error) {
    defer client.ReuseSlice(req.Timeseries)
    return &client.WriteResponse{}, nil
}

Even with the above modification, the Ingester still used a very large amount of memory (60GB+) until it OOMed, so I concluded the problem must be in client.ReuseSlice.

However, when I commented out the reuse of TimeSeries, the problem was solved: the Ingester used only 700MB of memory and was very stable, like the following:

func ReuseTimeseries(ts *TimeSeries) {
	// ts.Labels = ts.Labels[:0]
	// ts.Samples = ts.Samples[:0]
	// timeSeriesPool.Put(ts)
}

By looking closely at the code, I found the source of the problem. First, in ReuseTimeseries:

ts.Labels = ts.Labels[:0]
ts.Samples = ts.Samples[:0]

It does not release the real underlying array. We know that slice’s real structure is as follows:

type sliceHeader struct {
    Length        int
    Capacity      int
    ZerothElement *byte
}

The above operation only sets Length in the sliceHeader to 0; the underlying array still cannot be released. Secondly, in the deserialization of a Label you can see:

bs.Value = yoloString(dAtA[iNdEx:postIndex])

func yoloString(buf []byte) string {
	return *((*string)(unsafe.Pointer(&buf)))
}

Deserialization uses zero copy: Label.Name and Label.Value refer to the original []byte, which prevents the original request body from being released.

This may be a bit abstract, so here is an example:

Request A's body is 1MB of []byte in total (I call it RawSliceA). After Unmarshal, timeSeriesA0 contains 10 Labels. Due to the use of zero copy, these 10 Labels actually refer into RawSliceA.
After request A is processed, timeSeriesA0 is put into timeSeriesPool, and at this point it still holds references into RawSliceA.
Request B's body is 500KB of []byte in total. After Unmarshal, timeSeriesB0 contains 3 Labels. During Unmarshal, timeSeriesA0 is taken from timeSeriesPool and reused. Is RawSliceA released at this point?
No. RawSliceA may not be released until request B has been fully processed, or even until the next request.
The whole process is like this:
image

This example is a bit similar to golang/go#23199.

For the problems mentioned above, I have the following thoughts:

  1. There is no problem using zero copy like yoloString, because most of the bytes of RawSlice describe Labels, so the trade-off is cost-effective.
  2. There is no problem using a memory pool for timeSeries; it lets the Labels and Samples slices be reused.
  3. Even though the GC periodically cleans up the pool, a large number of consecutive requests will still waste a lot of memory.
  4. Before putting a timeSeries into timeSeriesPool, we must clear the Name and Value of timeSeries.Labels to drop the references into RawSlice.

After testing, this modification greatly reduces the runtime memory of the Ingester without introducing new problems.

before:
image

after:
image
(The illustration above was made after commenting out Push's real logic)

Signed-off-by: storyicon <storyicon@foxmail.com>
Contributor

@jtlisi jtlisi left a comment

Great work figuring this out. This looks good to me and I’m excited to see what effects this change has on our environments.

Contributor

@gouthamve gouthamve left a comment

Looks sane! Thanks for the deep dive into this!

@bboreham
Contributor

bboreham commented Jul 7, 2020

May I ask how large the incoming messages are? E.g. how many samples do you allow per Push()?

// ReuseTimeseries puts the timeseries back into a sync.Pool for reuse.
func ReuseTimeseries(ts *TimeSeries) {
	for i := 0; i < len(ts.Labels); i++ {
		ts.Labels[i].Name = ""
Contributor

Suggest // Name and Value may point into a large gRPC buffer, so clear the reference to allow GC

Contributor Author

I accepted your suggestion, thanks!

Signed-off-by: storyicon <storyicon@foxmail.com>
@storyicon
Contributor Author

storyicon commented Jul 7, 2020

@bboreham Hello, in my production environment, Prometheus's scrape_interval is 25s; Cortex receives 53 Pushes per second on average, and each Push contains 5530 timeseries on average. Among them:

  1. A single TimeSeries has up to 16 Labels and at least 1 Label
  2. Each TimeSeries contains only 1 sample.

On average, the buffer for each Push request is 1,211,066 bytes, of which samples occupy 88,480 bytes and labels occupy 1,122,582 bytes.
At present, my Cortex uses block storage + S3, and it always OOMs after running for about 10h. I want to reduce scrape_interval to 10s while keeping Cortex's memory stable. I use three 252GB, 48-core Linux machines, and replication_factor is 2 (configuring 3 makes it OOM faster).
If you can give me some suggestions, I would appreciate it.

@storyicon
Contributor Author

storyicon commented Jul 7, 2020

Reproducing this problem is simple; here is a demo:

Test0

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"sync"
	"time"
	"unsafe"
)

var timeSeriesPool = sync.Pool{
	New: func() interface{} {
		return &TimeSeries{
			Labels: make([]Label, 0, 10),
		}
	},
}

func yoloString(buf []byte) string {
	return *((*string)(unsafe.Pointer(&buf)))
}

type TimeSeries struct {
	Labels []Label
}

type Label struct {
	Name  string
	Value string
}

func unmarshal(data []byte, labelsSize int) *TimeSeries {
	ts := timeSeriesPool.Get().(*TimeSeries)
	for i := 0; i < labelsSize; i++ {
		ts.Labels = append(ts.Labels, Label{
			Name:  yoloString(data[:5]),
			Value: yoloString(data[5:10]),
		})
	}
	return ts
}

func reuseTimeSeries(ts *TimeSeries) {
	// for i := 0; i < len(ts.Labels); i++ {
	// 	ts.Labels[i].Name = ""
	// 	ts.Labels[i].Value = ""
	// }
	ts.Labels = ts.Labels[:0]
	timeSeriesPool.Put(ts)
}

func Push(data []byte, labelSize int, withGC bool) {
	ts := unmarshal(data, labelSize)
	defer reuseTimeSeries(ts)
	if withGC {
		// Manually trigger GC
		runtime.GC()
		// Return memory to operating system
		debug.FreeOSMemory()
	}
	// Reference variables to prevent memory optimization
	fmt.Println("Push Done", ts)
}

func buildPushBuffer(size int, fillByte byte) []byte {
	data := make([]byte, size)
	for i := 0; i < len(data); i++ {
		// Fill bytes to prevent Unix memory optimization.
		// This is not needed on windows.
		data[i] = fillByte
	}
	return data
}

func main() {
	// allocate 500M of memory and pass it to Push
	Push(buildPushBuffer(1024*1024*500, 1), 10, false)
	// allocate another 500M and pass it to Push
	Push(buildPushBuffer(1024*1024*500, 1), 1, true)

	// sleep so memory usage can be inspected
	fmt.Println("sleep")
	time.Sleep(time.Hour)
}

Running the above code, you can see that the program's memory usage is 1GB instead of 500MB; the principle is the same as in the picture above.

Test1

Uncomment the following code in Test0 and re-run:

// for i := 0; i < len(ts.Labels); i++ {
// 	ts.Labels[i].Name = ""
// 	ts.Labels[i].Value = ""
// }

You can see that the memory usage is 500MB instead of 1GB, because the zero-copy references to the original buffer are cleared before the timeSeries is put into timeSeriesPool.

Test2

Change the following code in Test0

ts.Labels = append(ts.Labels, Label{
    Name:  yoloString(data[:5]),
    Value: yoloString(data[5:10]),
})

into:

ts.Labels = append(ts.Labels, Label{
    Name:  string(data[:5]),
    Value: string(data[5:10]),
})

The memory usage should be only around 30MB, because no zero copy is used: the ts returned by unmarshal no longer refers to the original buffer, so the two 500MB buffers are cleaned up by the GC.

@bboreham
Contributor

bboreham commented Jul 7, 2020

(sorry my comments are somewhat duplicated in #2665)

I understand the behaviour in Go of the part you fixed. What I don't understand is the gap between a Go heap of 1-3GB and an RSS of 90+GB.

In my Cortex clusters RSS is between 30% and 50% more than max Go heap.

Contributor

@pracucci pracucci left a comment

Thanks for working on this! Approving because I don't see any downside to doing it, but I would be glad to learn more, because it's quite surprising to me that you're getting this much benefit.

A few questions, please:

  1. Could you share a heap pprof showing the memory utilization without the fix?
  2. Could you share a graph to show how frequently the Go GC runs before and after the fix?

@storyicon
Contributor Author

@pracucci I'm sorry, I will provide this data later, because it is already late at night in my city.

@storyicon
Contributor Author

@bboreham Hello, I replied in #2665 about the reason for the difference in our pprof data.

@bboreham bboreham merged commit 4dce560 into cortexproject:master Jul 7, 2020
@storyicon
Contributor Author

storyicon commented Jul 8, 2020

@pracucci The GC gap is huge and even exceeds my expectations.

Before Fix

image

After Fix

image

In order to rule out other influences, I modified the Push function like this during the whole test (both Before Fix and After Fix):

func (i *Ingester) Push(ctx context.Context, req *client.WriteRequest) (*client.WriteResponse, error) {
	defer client.ReuseSlice(req.Timeseries)
	return &client.WriteResponse{}, nil
}

The only difference between Before Fix and After Fix is that After Fix adds:

for i := 0; i < len(ts.Labels); i++ {
    ts.Labels[i].Name = ""
    ts.Labels[i].Value = ""
}

That is, the content submitted in this PR.

The latest release version (cd9e38d) was used in the test.
Although this optimization looks exciting, at present I cannot guarantee that Cortex will still enjoy such a large memory advantage once the normal logic of the Push function is restored, because I suspect there are other object pools in the Ingester holding references to memory that should be released.

@pracucci
Contributor

pracucci commented Jul 8, 2020

These charts look like they prove my gut feeling: what you actually get is the GC running far more frequently. This explains why the "leak" is not visible in the heap; it's all about how often the GC runs.

@storyicon
Contributor Author

storyicon commented Jul 8, 2020

@pracucci Your gut feeling is really good. Can you explain why the GC makes the number of heap objects drop but fails to reduce the ingester's resident memory? If I use a ticker to trigger GC manually (including debug.FreeOSMemory()) frequently, do you think I can keep the memory at 500MB instead of 50GB?

@pracucci
Contributor

pracucci commented Jul 8, 2020

@pracucci Your gut feeling is really good! Can you explain why the GC makes the number of heap objects drop but fails to reduce the ingester's resident memory? If I use a ticker to trigger GC manually (including debug.FreeOSMemory()) frequently, do you think I can keep the memory at 500MB instead of 50GB?

I don't have an answer (I would need to further investigate it, possibly being able to reproduce it locally). Manually triggering the GC could be an experiment you can do just for testing, but if this PR fixes your issue then I would just be happy and move on 😉

@storyicon
Contributor Author

@pracucci Your gut feeling is really good! Can you explain why the GC makes the number of heap objects drop but fails to reduce the ingester's resident memory? If I use a ticker to trigger GC manually (including debug.FreeOSMemory()) frequently, do you think I can keep the memory at 500MB instead of 50GB?

I don't have an answer (I would need to further investigate it, possibly being able to reproduce it locally). Manually triggering the GC could be an experiment you can do just for testing, but if this PR fixes your issue then I would just be happy and move on 😉

@pracucci
These things are fascinating. I will run some experiments to find out. Pracucci, you were the first person to reply when I raised an issue in the Cortex project. Thank you! 😔

@bboreham
Contributor

bboreham commented Jul 8, 2020

Go has a mechanism to control how fast GC runs - the GOGC environment variable.
https://golang.org/pkg/runtime/debug/#SetGCPercent

Note frequent GC may chew up a lot of CPU.

@storyicon
Contributor Author

Go has a mechanism to control how fast GC runs - the GOGC environment variable.
https://golang.org/pkg/runtime/debug/#SetGCPercent

Note frequent GC may chew up a lot of CPU.

@bboreham
Thanks, bboreham. I will continue to track this problem. If there is an error in my statement of the principle above, I hope you can correct me. In fact, I don't care about who is right, only about what is true. ☹️

@bboreham
Contributor

bboreham commented Jul 8, 2020

If there is an error in my statement of principle above, I hope you can correct me.

Sorry I don't know what this refers to.

@storyicon
Contributor Author

If there is an error in my statement of principle above, I hope you can correct me.

Sorry I don't know what this refers to.

Sorry, English is not my mother tongue. I meant the explanation I gave under "Which issue(s) this PR fixes".

@bboreham
Contributor

bboreham commented Jul 8, 2020

Fixing the issue is fine; thank you for the PR.

My last comment referred to manually running GC, which is a totally different subject.

@storyicon
Contributor Author

storyicon commented Jul 8, 2020

Fixing the issue is fine; thank you for the PR.

My last comment referred to manually running GC, which is a totally different subject.

I understand. I haven't tried using this function to control GC behavior; it is a new idea to me. I will keep digging into it (including manually triggered GC tests), thank you!


Successfully merging this pull request may close these issues.

About the memory of ingester

5 participants