
feat: add document retries #99

Merged
merged 37 commits into elastic:main on Feb 7, 2024

Conversation

@kruskall kruskall (Member) commented Jan 4, 2024

Blocked by #100 (to make testing easier)

Adds document retry logic. Follow-up to #91 (important for understanding how the compression logic changed).

### Compression off:

In this case we simply iterate over the array and resubmit the failed events to the buffer. We can count them using newlines.
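A minimal sketch of what that newline counting can look like; `indexNth` is an illustrative helper (not necessarily the PR's actual code) that locates the n-th newline-delimited item in the uncompressed copy of the request body:

```go
package sketch

import "bytes"

// indexNth returns the n-th newline-delimited item in the uncompressed
// request body, counting from zero. Illustrative only: the real helper in
// the PR may differ, but the idea is the same, locate failed items by
// counting newlines and copy them back into the retry buffer.
func indexNth(body []byte, n int) []byte {
	for i := 0; ; i++ {
		nl := bytes.IndexByte(body, '\n')
		if nl < 0 {
			if i == n && len(body) > 0 {
				return body // last item without a trailing newline
			}
			return nil
		}
		if i == n {
			return body[:nl]
		}
		body = body[nl+1:]
	}
}
```

Each failed item found this way is appended to the buffer for the next flush.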

### Compression on:

Retrieve the offset of the failed event and decompress the "gzip container". Loop over the array and resubmit the failed events to the buffer.

### Testing

This PR adds tests to cover as many edge cases as possible: one event for each gzip container, multiple events per gzip container, all events in a single gzip container. Tests also run the retry logic twice to ensure there are no conflicts between flushes.

### Update

The approach was simplified: iterate over the array and resubmit the failed events to the buffer. For gzip we read in batches to avoid decompressing the whole request.
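The gzip side of that simplification could look roughly like the sketch below: the body is inflated in bounded batches rather than all at once. `forEachBatch` and `batchSize` are hypothetical names, and a real caller would still have to stitch items that span batch boundaries.

```go
package sketch

import (
	"bytes"
	"compress/gzip"
	"io"
)

// forEachBatch inflates the compressed request body in fixed-size batches
// and hands each batch to fn, so the whole request is never held
// decompressed in memory at once.
func forEachBatch(compressed []byte, batchSize int, fn func(batch []byte) error) error {
	gr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return err
	}
	defer gr.Close()

	buf := make([]byte, batchSize)
	for {
		n, rerr := gr.Read(buf)
		if n > 0 {
			if err := fn(buf[:n]); err != nil {
				return err
			}
		}
		if rerr == io.EOF {
			return nil
		}
		if rerr != nil {
			return rerr
		}
	}
}
```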

@kruskall kruskall requested a review from a team as a code owner January 4, 2024 08:40
@marclop marclop (Contributor) left a comment

I've noticed that the tests have MaxRequests: 1, which is why the tests are passing. However, in a more realistic scenario with MaxRequests > 1, the documents stored in the local bulk_indexer won't be flushed to Elasticsearch until the bulk indexer is cycled through the channel.

go-docappender/appender.go, lines 470 to 479 in c8c4e54:

```go
indexer := active
active = nil
attrs := metric.WithAttributeSet(a.config.MetricAttributes)
a.errgroup.Go(func() error {
	var err error
	took := timeFunc(func() {
		err = a.flush(a.errgroupContext, indexer)
	})
	indexer.Reset()
	a.available <- indexer
```

We should prevent that from happening. I think we may need to introduce a new channel for bulk_indexers that have items from previous 429 failures, and treat it with higher priority than a.available. There are some edge cases we need to handle (like closing the appender), but it would be better than simply sending the bulk_indexer with cached items back to the channel like empty indexers.
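For illustration, such a priority could be expressed with a second channel that is always polled first; `retryable` and the placeholder `BulkIndexer` type are hypothetical here, only `available` exists in the appender today:

```go
package sketch

// BulkIndexer stands in for the package's bulk indexer type.
type BulkIndexer struct{}

// acquireIndexer prefers indexers that still hold documents from a previous
// 429 response over fresh, empty ones, by polling the retry channel first.
func acquireIndexer(retryable, available <-chan *BulkIndexer) *BulkIndexer {
	// Fast path: take a retryable indexer if one is already waiting.
	select {
	case bi := <-retryable:
		return bi
	default:
	}
	// Otherwise block until either channel yields an indexer.
	select {
	case bi := <-retryable:
		return bi
	case bi := <-available:
		return bi
	}
}
```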

Another concern is that there's currently no limit on the number of times 429s will be retried. In theory, the entire buffer could be filled with documents that have not been indexed due to 429s, which would cause only 1 new event to be sent in each subsequent flush. Have you given any thought on how we could configure an upper limit of consecutive retries?

appender_test.go (outdated)

```go
}{
	"nocompression": {
		cfg: docappender.Config{
			MaxRequests: 1,
```

Contributor:

These tests work because MaxRequests is set to 1; however, if it's set to 2 or 3, the documents stored for subsequent retries won't be flushed in the next bulk request. Instead, the other 1 or 2 bulk indexers need to be flushed first, and only after that happens will the buffered "failed" docs be sent over to Elasticsearch.

@kruskall kruskall (Member Author) commented Feb 1, 2024

> I've noticed that the tests have MaxRequests: 1, which is why the tests are passing. However, in a more realistic scenario with MaxRequests > 1, the documents stored in the local bulk_indexer won't be flushed to Elasticsearch until the bulk indexer is cycled through the channel.
>
> We should prevent that from happening. I think we may need to introduce a new channel for bulk_indexers that have items from previous 429 failures, and treat it with higher priority than a.available. There are some edge cases we need to handle (like closing the appender), but it would be better than simply sending the bulk_indexer with cached items back to the channel like empty indexers.

To be honest, I think this is fine. We already changed quite a lot of things because of concerns around complexity. If ES is returning 429s there are a lot of events buffered/coming in. I don't think retried events are gonna stay "idle" for a long time.
I think we should get rid of the "active" channel entirely and let the runtime handle the channels but that should go in a separate PR.

> Another concern is that there's currently no limit on the number of times 429s will be retried. In theory, the entire buffer could be filled with documents that have not been indexed due to 429s, which would cause only 1 new event to be sent in each subsequent flush. Have you given any thought on how we could configure an upper limit of consecutive retries?

Thanks for this! This should be fixed 👍

@marclop marclop (Contributor) left a comment

> To be honest, I think this is fine. We already changed quite a lot of things because of concerns around complexity. If ES is returning 429s there are a lot of events buffered/coming in. I don't think retried events are gonna stay "idle" for a long time.

I don't think it's acceptable to do all of this retry work and then return a potentially full bulkIndexer back to the available channel, where it may or may not be flushed in the future.

While it may be the case that 429s tend to happen at higher throughputs and the bulkIndexer may not remain in that channel for long, it won't be flushed on appender.Close(), leading to those buffered events being completely lost.

Could you please add a test that ensures that previously buffered 429s are flushed once more on indexer.Close()?

In the case where 429s are a considerable % of the total documents in a bulk request, say 30-40%, we'd leave a bunch of half-full (or almost entirely full if 429 % is even higher) bulkIndexers in that available channel.

> I think we should get rid of the "active" channel entirely and let the runtime handle the channels but that should go in a separate PR.

Could you elaborate on this? I'm not following your reasoning and how the runtime would be of any help to us in this case.

```diff
@@ -55,6 +55,9 @@ type Config struct {
 	// If MaxRequests is less than or equal to zero, the default of 10 will be used.
 	MaxRequests int
 
+	// MaxDocumentRetries holds the maximum number of document retries
+	MaxDocumentRetries int
```
Contributor:
Should this default to 3?

Member Author:

No, I'd prefer to keep it at 0 and have this behaviour opt-in for users of this library

Member Author:

We can revisit this in the future
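For context, a rough sketch of how an opt-in cap like MaxDocumentRetries can be enforced per document; it assumes a per-position retry counter similar to the retryCounts map that shows up in the bulk_indexer diff further down, and the helper name is purely illustrative:

```go
package sketch

// shouldRetry reports whether the document at position pos may be buffered
// for another attempt. Leaving MaxDocumentRetries at 0 keeps retries
// disabled, so the behaviour stays opt-in.
func shouldRetry(retryCounts map[int]int, pos, maxDocumentRetries int) bool {
	if maxDocumentRetries <= 0 {
		return false
	}
	if retryCounts[pos] >= maxDocumentRetries {
		delete(retryCounts, pos) // give up; the document is dropped
		return false
	}
	retryCounts[pos]++
	return true
}
```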

@kruskall kruskall (Member Author) commented Feb 5, 2024

> I don't think it's acceptable to do all of this retry work and then return a potentially full bulkIndexer back to the available channel, where it may or may not be flushed in the future.

The idea is that they will always be flushed in the future, either the next time data comes in or after the flush interval (the latter currently doesn't work because of the active-channel approach we are using).

> While it may be the case that 429s tend to happen at higher throughputs and the bulkIndexer may not remain in that channel for long, it won't be flushed on appender.Close(), leading to those buffered events being completely lost.
>
> Could you please add a test that ensures that previously buffered 429s are flushed once more on indexer.Close()?

Mmh, this is a bug. Sorry about that! It should be fixed now!

> In the case where 429s are a considerable % of the total documents in a bulk request, say 30-40%, we'd leave a bunch of half-full (or almost entirely full if 429 % is even higher) bulkIndexers in that available channel.

I don't see this as an issue. This would decrease the throughput of the bulk indexers which is fine as the whole point of the 429s is to slow down.

> Could you elaborate on this? I'm not following your reasoning and how the runtime would be of any help to us in this case.

Ideally, we would just read from the channel and let the runtime switch between them, removing the active channel completely. The flush interval would then cause a flush request for the bulk indexers that still have events in the buffer. The issue of bulk indexers remaining in a "limbo" would not apply anymore because they would all be active.
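Roughly, that direction could look like the per-indexer worker loop below; worker, docs, and flushFn are hypothetical names and not existing appender APIs, so this is only a sketch of the idea:

```go
package sketch

import (
	"context"
	"time"
)

// worker owns one bulk indexer's buffer: it drains incoming documents and
// flushes either when enough are buffered or when the flush interval
// elapses, so no indexer can sit in a "limbo" channel with unflushed events.
func worker(ctx context.Context, docs <-chan []byte, interval time.Duration,
	threshold int, flushFn func(buffered [][]byte)) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var buffered [][]byte

	flush := func() {
		if len(buffered) > 0 {
			flushFn(buffered)
			buffered = nil
		}
	}
	for {
		select {
		case <-ctx.Done():
			flush() // final flush on shutdown
			return
		case d := <-docs:
			buffered = append(buffered, d)
			if len(buffered) >= threshold {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}
```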

@kruskall kruskall requested a review from marclop February 5, 2024 10:14
@v1v v1v added the safe-to-test Automated label for running bench-diff on forked PRs label Feb 5, 2024
@simitt simitt commented Feb 5, 2024

The bench-diff check is green, but looking at the logs, some benchmark errors happened; please take a look and either fix the issue or loop in the automation team in case these are not legitimate errors.

@marclop marclop (Contributor) left a comment

Changes mostly LGTM; I think we need to retry flushing even if the error group returns an error.

> Could you elaborate on this? I'm not following your reasoning and how the runtime would be of any help to us in this case.
>
> Ideally, we would just read from the channel and let the runtime switch between them, removing the active channel completely. The flush interval would then cause a flush request for the bulk indexers that still have events in the buffer. The issue of bulk indexers remaining in a "limbo" would not apply anymore because they would all be active.

I still don't follow what you meant.

Also, have you run some benchmarks using the tilt environment and monitoring how the metrics look? I think it's worth doing so before merging.

Comment on lines +192 to +201

```diff
+if err := a.errgroup.Wait(); err != nil {
+	return err
+}
-return a.errgroup.Wait()
+close(a.available)
+for bi := range a.available {
+	if err := a.flush(context.Background(), bi); err != nil {
+		return fmt.Errorf("failed to flush events on close: %w", err)
+	}
+}
+return nil
```
Contributor:

Returning early will cause the non-empty bulkIndexers to not be flushed. What do you think about the snippet below?

```go
// Wait until all active indexers have been flushed.
err := a.errgroup.Wait()
close(a.available)
for bi := range a.available {
	a.errgroup.Go(func() error {
		if e := a.flush(context.Background(), bi); e != nil {
			return fmt.Errorf("failed to flush events on close: %w", e)
		}
		return nil
	})
}
return errors.Join(err, a.errgroup.Wait())
```

@kruskall kruskall (Member Author) Feb 6, 2024:

This is on purpose to be consistent with the current behaviour:

go-docappender/appender.go, lines 482 to 492 in 9748039:

```go
var err error
took := timeFunc(func() {
	err = a.flush(a.errgroupContext, indexer)
})
indexer.Reset()
a.available <- indexer
a.addUpDownCount(1, &a.availableBulkRequests, a.metrics.availableBulkRequests)
a.metrics.flushDuration.Record(context.Background(), took.Seconds(),
	attrs,
)
return err
```

Even with retries disabled, if a flush fails the errgroup returns an error and does not block/wait for the others to finish.

Comment:

Could you open a follow-up issue then to tackle this? With retry behavior we are moving more towards reliability vs. dropping events, so the ask to change the behavior for flushing non-empty bulk indexers seems totally fair.

Contributor:

That means that we won't be retrying one last time. If you could add a comment about that in the code, that'd be great to make more obvious to readers.

@kruskall kruskall (Member Author) commented Feb 6, 2024

> Also, have you run some benchmarks using the tilt environment and monitoring how the metrics look? I think it's worth doing so before merging.

I only ran benchmarks using the appender benchmarks in the repo, but can you clarify this so we are aligned? Are you asking to test the performance of the retry code or the overhead on "normal operation" (requests succeeding)?

@simitt simitt commented Feb 6, 2024

> Also, have you run some benchmarks using the tilt environment and monitoring how the metrics look? I think it's worth doing so before merging.
>
> I only ran benchmarks using the appender benchmarks in the repo, but can you clarify this so we are aligned? Are you asking to test the performance of the retry code or the overhead on "normal operation" (requests succeeding)?

I do not want to speak for Marc, but IMO we should have some numbers on how this impacts overall performance. I would expect:

- no performance change if there are no retries
- retry behavior to add only a reasonable amount of overhead, while slowing down the processing of new events

Testing this e2e under load in a dev environment is essential before starting the promotion process.

@marclop marclop (Contributor) commented Feb 7, 2024

@simitt, that's right.

@kruskall, I'll leave it up to you whether you want to test this as part of the dependency update or as part of the PR. If we go the dependency promotion path, we may approve the PR without any regression testing. I understand the retry behavior is opt-in rather than opt-out, but that's what regression testing is for.

marclop previously approved these changes Feb 7, 2024

@marclop marclop (Contributor) left a comment

LGTM. It'd be great to test this either as part of the dependency update or before we merge.

@axw axw (Member) left a comment

If we don't need to export Indexnth then it would be better not to, to keep the public API tidy. The way the document indexes are found looks a bit inefficient, but we could improve it in a follow up since it should not be on a hot path.

```go
b.retryCounts[b.itemsAdded] = count

if b.gzipw != nil {
	gr, err := gzip.NewReader(bytes.NewReader(b.copyBuf))
```
Member:

It looks like we're decompressing and scanning through the request body for every failed item. Should we be decompressing once and scanning through incrementally? e.g. if you have failures at index 1 and 8, first skip to index 1, then skip from index 1 to index 8 without decompressing and reprocessing up to index 1.

Member Author:

I might be misunderstanding something, but decompressing the whole request body is not acceptable. It was explicitly set as a goal to implement retries without decompressing the request body.

Member:

I don't mean decompress the whole payload at once, I meant create the decompressor once and stream through the results. AFAICS there's no need to call gzip.NewReader more than once.

Instead you could lazily create it when seeing the first 429 result, wrap it with bufio.NewScanner, and stream through the lines, skipping non-429 result lines as needed.
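A minimal sketch of that single-pass idea, assuming a hypothetical sorted retryPositions slice (this is not the code that was merged): the gzip reader is created once and the scanner keeps advancing between failed indices instead of restarting from the beginning.

```go
package sketch

import (
	"bufio"
	"bytes"
	"compress/gzip"
)

// copyFailedOnce creates the gzip reader a single time and streams through
// the decompressed lines, copying only those at the (ascending) retry
// positions and skipping the rest without re-reading earlier data.
func copyFailedOnce(compressed []byte, retryPositions []int, out *bytes.Buffer) error {
	gr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return err
	}
	defer gr.Close()

	scanner := bufio.NewScanner(gr)
	next := 0 // next retry position to copy
	for line := 0; next < len(retryPositions) && scanner.Scan(); line++ {
		if line == retryPositions[next] {
			out.Write(scanner.Bytes())
			out.WriteByte('\n')
			next++
		}
	}
	return scanner.Err()
}
```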

@kruskall kruskall merged commit 94e1408 into elastic:main Feb 7, 2024
5 checks passed
@kruskall kruskall deleted the feat/document-retry branch February 7, 2024 15:00