
feat: add document retries #99

Merged
merged 37 commits into elastic:main on Feb 7, 2024

Conversation

@kruskall kruskall (Member) commented Jan 4, 2024

Blocked by #100 (to make testing easier)

Adds document retry logic. Follow-up to #91 (important for understanding how the compression logic changed).

### Compression off:

In this case we simply iterate over the array and resubmit the failed events to the buffer. We can count them using newlines.
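A minimal sketch of what that newline counting can look like; `indexNth` is an illustrative helper (not necessarily the PR's actual code) that locates the n-th newline-delimited item in the uncompressed copy of the request body:

```go
package sketch

import "bytes"

// indexNth returns the n-th newline-delimited item in the uncompressed
// request body, counting from zero. Illustrative only: the real helper in
// the PR may differ, but the idea is the same, locate failed items by
// counting newlines and copy them back into the retry buffer.
func indexNth(body []byte, n int) []byte {
	for i := 0; ; i++ {
		nl := bytes.IndexByte(body, '\n')
		if nl < 0 {
			if i == n && len(body) > 0 {
				return body // last item without a trailing newline
			}
			return nil
		}
		if i == n {
			return body[:nl]
		}
		body = body[nl+1:]
	}
}
```

Each failed item found this way is appended to the buffer for the next flush.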

### Compression on:

Retrieve the offset of the failed event and decompress the "gzip container". Loop over the array and resubmit the failed events to the buffer.

### Testing

This PR adds tests to cover as many edge cases as possible: one event for each gzip container, multiple events per gzip container, all events in a single gzip container. Tests also run the retry logic twice to ensure there are no conflicts between flushes.

### Update

The approach was simplified: iterate over the array and resubmit the failed events to the buffer. For gzip we read in batches to avoid decompressing the whole request.
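The gzip side of that simplification could look roughly like the sketch below: the body is inflated in bounded batches rather than all at once. `forEachBatch` and `batchSize` are hypothetical names, and a real caller would still have to stitch items that span batch boundaries.

```go
package sketch

import (
	"bytes"
	"compress/gzip"
	"io"
)

// forEachBatch inflates the compressed request body in fixed-size batches
// and hands each batch to fn, so the whole request is never held
// decompressed in memory at once.
func forEachBatch(compressed []byte, batchSize int, fn func(batch []byte) error) error {
	gr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return err
	}
	defer gr.Close()

	buf := make([]byte, batchSize)
	for {
		n, rerr := gr.Read(buf)
		if n > 0 {
			if err := fn(buf[:n]); err != nil {
				return err
			}
		}
		if rerr == io.EOF {
			return nil
		}
		if rerr != nil {
			return rerr
		}
	}
}
```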

@kruskall kruskall requested a review from a team as a code owner January 4, 2024 08:40
@marclop marclop (Contributor) left a comment

I've noticed that the tests have MaxRequests: 1, which is why the tests are passing. However, in a more realistic scenario with MaxRequests > 1, the documents stored in the local bulk_indexer won't be flushed to Elasticsearch until the bulk indexer is cycled through the channel.

go-docappender/appender.go, lines 470 to 479 in c8c4e54:

```go
indexer := active
active = nil
attrs := metric.WithAttributeSet(a.config.MetricAttributes)
a.errgroup.Go(func() error {
	var err error
	took := timeFunc(func() {
		err = a.flush(a.errgroupContext, indexer)
	})
	indexer.Reset()
	a.available <- indexer
```

We should prevent that from happening. I think we may need to introduce a new channel for bulk_indexers that have items from previous 429 failures, and treat it with higher priority than a.available. There are some edge cases we need to handle (like closing the appender), but it would be better than simply sending the bulk_indexer with cached items back to the channel like empty indexers.
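For illustration, such a priority could be expressed with a second channel that is always polled first; `retryable` and the placeholder `BulkIndexer` type are hypothetical here, only `available` exists in the appender today:

```go
package sketch

// BulkIndexer stands in for the package's bulk indexer type.
type BulkIndexer struct{}

// acquireIndexer prefers indexers that still hold documents from a previous
// 429 response over fresh, empty ones, by polling the retry channel first.
func acquireIndexer(retryable, available <-chan *BulkIndexer) *BulkIndexer {
	// Fast path: take a retryable indexer if one is already waiting.
	select {
	case bi := <-retryable:
		return bi
	default:
	}
	// Otherwise block until either channel yields an indexer.
	select {
	case bi := <-retryable:
		return bi
	case bi := <-available:
		return bi
	}
}
```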

Another concern is that there's currently no limit on the number of times 429s will be retried. In theory, the entire buffer could be filled with documents that have not been indexed due to 429s, which would cause only 1 new event to be sent in each subsequent flush. Have you given any thought on how we could configure an upper limit of consecutive retries?

appender_test.go (outdated)

```go
}{
	"nocompression": {
		cfg: docappender.Config{
			MaxRequests: 1,
```

Contributor:

These tests work because MaxRequests is set to 1; however, if it's set to 2 or 3, the documents stored for subsequent retries won't be flushed in the next bulk request. Instead, the other 1 or 2 bulk indexers need to be flushed first, and only after that happens will the buffered "failed" docs be sent over to Elasticsearch.

@kruskall kruskall (Member Author) commented Feb 1, 2024

> I've noticed that the tests have MaxRequests: 1, which is why the tests are passing. However, in a more realistic scenario with MaxRequests > 1, the documents stored in the local bulk_indexer won't be flushed to Elasticsearch until the bulk indexer is cycled through the channel.
>
> We should prevent that from happening. I think we may need to introduce a new channel for bulk_indexers that have items from previous 429 failures, and treat it with higher priority than a.available. There are some edge cases we need to handle (like closing the appender), but it would be better than simply sending the bulk_indexer with cached items back to the channel like empty indexers.

To be honest, I think this is fine. We already changed quite a lot of things because of concerns around complexity. If ES is returning 429s there are a lot of events buffered/coming in. I don't think retried events are gonna stay "idle" for a long time.
I think we should get rid of the "active" channel entirely and let the runtime handle the channels but that should go in a separate PR.

> Another concern is that there's currently no limit on the number of times 429s will be retried. In theory, the entire buffer could be filled with documents that have not been indexed due to 429s, which would cause only 1 new event to be sent in each subsequent flush. Have you given any thought on how we could configure an upper limit of consecutive retries?

Thanks for this! This should be fixed 👍

@marclop marclop (Contributor) left a comment

> To be honest, I think this is fine. We already changed quite a lot of things because of concerns around complexity. If ES is returning 429s there are a lot of events buffered/coming in. I don't think retried events are gonna stay "idle" for a long time.

I don't think it's acceptable to do all of this retry work and then return a potentially full bulkIndexer back to the available channel, where it may or may not be flushed in the future.

While it may be the case that 429s tend to happen at higher throughputs and the bulkIndexer may not remain in that channel for long, it won't be flushed on appender.Close(), leading to those buffered events being completely lost.

Could you please add a test that ensures that previously buffered 429s are flushed once more on indexer.Close()?

In the case where 429s are a considerable % of the total documents in a bulk request, say 30-40%, we'd leave a bunch of half-full (or almost entirely full if 429 % is even higher) bulkIndexers in that available channel.

> I think we should get rid of the "active" channel entirely and let the runtime handle the channels but that should go in a separate PR.

Could you elaborate on this? I'm not following your reasoning and how the runtime would be of any help to us in this case.

```diff
@@ -55,6 +55,9 @@ type Config struct {
 	// If MaxRequests is less than or equal to zero, the default of 10 will be used.
 	MaxRequests int
 
+	// MaxDocumentRetries holds the maximum number of document retries
+	MaxDocumentRetries int
```
Contributor:
Should this default to 3?

Member Author:

No, I'd prefer to keep it at 0 and have this behaviour opt-in for users of this library

Member Author:

We can revisit this in the future
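For context, a rough sketch of how an opt-in cap like MaxDocumentRetries can be enforced per document; it assumes a per-position retry counter similar to the retryCounts map that shows up in the bulk_indexer diff further down, and the helper name is purely illustrative:

```go
package sketch

// shouldRetry reports whether the document at position pos may be buffered
// for another attempt. Leaving MaxDocumentRetries at 0 keeps retries
// disabled, so the behaviour stays opt-in.
func shouldRetry(retryCounts map[int]int, pos, maxDocumentRetries int) bool {
	if maxDocumentRetries <= 0 {
		return false
	}
	if retryCounts[pos] >= maxDocumentRetries {
		delete(retryCounts, pos) // give up; the document is dropped
		return false
	}
	retryCounts[pos]++
	return true
}
```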

@kruskall kruskall (Member Author) commented Feb 5, 2024

> I don't think it's acceptable to do all of this retry work and then return a potentially full bulkIndexer back to the available channel, where it may or may not be flushed in the future.

The idea is that they will always be flushed in the future, either the next time data comes in or after the flush interval (the latter currently doesn't work because of the active-channel approach we are using).

> While it may be the case that 429s tend to happen at higher throughputs and the bulkIndexer may not remain in that channel for long, it won't be flushed on appender.Close(), leading to those buffered events being completely lost.
>
> Could you please add a test that ensures that previously buffered 429s are flushed once more on indexer.Close()?

Mmh, this is a bug. Sorry about that! It should be fixed now!

> In the case where 429s are a considerable % of the total documents in a bulk request, say 30-40%, we'd leave a bunch of half-full (or almost entirely full if 429 % is even higher) bulkIndexers in that available channel.

I don't see this as an issue. This would decrease the throughput of the bulk indexers which is fine as the whole point of the 429s is to slow down.

> Could you elaborate on this? I'm not following your reasoning and how the runtime would be of any help to us in this case.

Ideally, we would just read from the channel and let the runtime switch between them, removing the active channel completely. The flush interval would then cause a flush request for the bulk indexers that still have events in the buffer. The issue of bulk indexers remaining in a "limbo" would not apply anymore because they would all be active.
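Roughly, that direction could look like the per-indexer worker loop below; worker, docs, and flushFn are hypothetical names and not existing appender APIs, so this is only a sketch of the idea:

```go
package sketch

import (
	"context"
	"time"
)

// worker owns one bulk indexer's buffer: it drains incoming documents and
// flushes either when enough are buffered or when the flush interval
// elapses, so no indexer can sit in a "limbo" channel with unflushed events.
func worker(ctx context.Context, docs <-chan []byte, interval time.Duration,
	threshold int, flushFn func(buffered [][]byte)) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var buffered [][]byte

	flush := func() {
		if len(buffered) > 0 {
			flushFn(buffered)
			buffered = nil
		}
	}
	for {
		select {
		case <-ctx.Done():
			flush() // final flush on shutdown
			return
		case d := <-docs:
			buffered = append(buffered, d)
			if len(buffered) >= threshold {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}
```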

@kruskall kruskall requested a review from marclop February 5, 2024 10:14
@v1v v1v added the safe-to-test Automated label for running bench-diff on forked PRs label Feb 5, 2024
@simitt simitt commented Feb 5, 2024

The bench-diff check is green, but looking at the logs, some benchmark errors happened; please take a look and either fix the issue or loop in the automation team in case these are not legitimate errors.

@marclop marclop (Contributor) left a comment

Changes mostly LGTM; I think we need to retry flushing even if the error group returns an error.

> Could you elaborate on this? I'm not following your reasoning and how the runtime would be of any help to us in this case.
>
> Ideally, we would just read from the channel and let the runtime switch between them, removing the active channel completely. The flush interval would then cause a flush request for the bulk indexers that still have events in the buffer. The issue of bulk indexers remaining in a "limbo" would not apply anymore because they would all be active.

I still don't follow what you meant.

Also, have you run some benchmarks using the tilt environment and monitoring how the metrics look? I think it's worth doing so before merging.

Comment on lines +192 to +201

```diff
+if err := a.errgroup.Wait(); err != nil {
+	return err
+}
-return a.errgroup.Wait()
+close(a.available)
+for bi := range a.available {
+	if err := a.flush(context.Background(), bi); err != nil {
+		return fmt.Errorf("failed to flush events on close: %w", err)
+	}
+}
+return nil
```
Contributor:

Returning early will cause the non-empty bulkIndexers to not be flushed. What do you think about the snippet below?

```go
// Wait until all active indexers have been flushed.
err := a.errgroup.Wait()
close(a.available)
for bi := range a.available {
	a.errgroup.Go(func() error {
		if e := a.flush(context.Background(), bi); e != nil {
			return fmt.Errorf("failed to flush events on close: %w", e)
		}
		return nil
	})
}
return errors.Join(err, a.errgroup.Wait())
```

@kruskall kruskall (Member Author) Feb 6, 2024:

This is on purpose to be consistent with the current behaviour:

go-docappender/appender.go, lines 482 to 492 in 9748039:

```go
var err error
took := timeFunc(func() {
	err = a.flush(a.errgroupContext, indexer)
})
indexer.Reset()
a.available <- indexer
a.addUpDownCount(1, &a.availableBulkRequests, a.metrics.availableBulkRequests)
a.metrics.flushDuration.Record(context.Background(), took.Seconds(),
	attrs,
)
return err
```

Even with retries disabled, if a flush fails the errgroup returns an error and does not block/wait for the others to finish.

Comment:

Could you open a follow-up issue then to tackle this? With retry behavior we are moving more towards reliability vs. dropping events, so the ask to change the behavior for flushing non-empty bulk indexers seems totally fair.

Contributor:

That means that we won't be retrying one last time. If you could add a comment about that in the code, that'd be great to make more obvious to readers.

@kruskall kruskall (Member Author) commented Feb 6, 2024

> Also, have you run some benchmarks using the tilt environment and monitoring how the metrics look? I think it's worth doing so before merging.

I only ran benchmarks using the appender benchmarks in the repo, but can you clarify this so we are aligned? Are you asking to test the performance of the retry code or the overhead on "normal operation" (requests succeeding)?

@simitt simitt commented Feb 6, 2024

> Also, have you run some benchmarks using the tilt environment and monitoring how the metrics look? I think it's worth doing so before merging.
>
> I only ran benchmarks using the appender benchmarks in the repo, but can you clarify this so we are aligned? Are you asking to test the performance of the retry code or the overhead on "normal operation" (requests succeeding)?

I do not want to speak for Marc, but IMO we should have some numbers on how this impacts overall performance. I would expect:

- no performance change if there are no retries
- retry behavior to add only a reasonable amount of overhead, while slowing down the processing of new events

Testing this e2e under load in a dev environment is essential before starting the promotion process.

@marclop marclop (Contributor) commented Feb 7, 2024

@simitt, that's right.

@kruskall, I'll leave it up to you whether you want to test this as part of the dependency update or as part of the PR. If we go the dependency promotion path, we may approve the PR without any regression testing. I understand the retry behavior is opt-in rather than opt-out, but that's what regression testing is for.

marclop previously approved these changes Feb 7, 2024

@marclop marclop (Contributor) left a comment

LGTM. It'd be great to test this either as part of the dependency update or before we merge.

@axw axw (Member) left a comment

If we don't need to export Indexnth then it would be better not to, to keep the public API tidy. The way the document indexes are found looks a bit inefficient, but we could improve it in a follow up since it should not be on a hot path.

```go
b.retryCounts[b.itemsAdded] = count

if b.gzipw != nil {
	gr, err := gzip.NewReader(bytes.NewReader(b.copyBuf))
```
Member:

It looks like we're decompressing and scanning through the request body for every failed item. Should we be decompressing once and scanning through incrementally? e.g. if you have failures at index 1 and 8, first skip to index 1, then skip from index 1 to index 8 without decompressing and reprocessing up to index 1.

Member Author:

I might be misunderstanding something, but decompressing the whole request body is not acceptable. It was explicitly set as a goal to implement retries without decompressing the request body.

Member:

I don't mean decompress the whole payload at once, I meant create the decompressor once and stream through the results. AFAICS there's no need to call gzip.NewReader more than once.

Instead you could lazily create it when seeing the first 429 result, wrap it with bufio.NewScanner, and stream through the lines, skipping non-429 result lines as needed.
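A minimal sketch of that single-pass idea, assuming a hypothetical sorted retryPositions slice (this is not the code that was merged): the gzip reader is created once and the scanner keeps advancing between failed indices instead of restarting from the beginning.

```go
package sketch

import (
	"bufio"
	"bytes"
	"compress/gzip"
)

// copyFailedOnce creates the gzip reader a single time and streams through
// the decompressed lines, copying only those at the (ascending) retry
// positions and skipping the rest without re-reading earlier data.
func copyFailedOnce(compressed []byte, retryPositions []int, out *bytes.Buffer) error {
	gr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return err
	}
	defer gr.Close()

	scanner := bufio.NewScanner(gr)
	next := 0 // next retry position to copy
	for line := 0; next < len(retryPositions) && scanner.Scan(); line++ {
		if line == retryPositions[next] {
			out.Write(scanner.Bytes())
			out.WriteByte('\n')
			next++
		}
	}
	return scanner.Err()
}
```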

@kruskall kruskall merged commit 94e1408 into elastic:main Feb 7, 2024
5 checks passed
@kruskall kruskall deleted the feat/document-retry branch February 7, 2024 15:00