[beats receivers] The interactions between otelconsumer and the ES exporter lose data on partial failure of _bulk #45707

@cmacknz

Description

The interactions between the otelconsumer and the ES exporter likely lose data in cases where there is a document-level failure in a batch request and the collector restarts. Most of these problems are likely bugs in the ES exporter, but we should add detailed failure-mode testing here to ensure we preserve Filebeat's at-least-once delivery guarantee.

For reference, when you make a _bulk request to Elasticsearch, the response gives you document-level HTTP status codes. This lets you detect partial failures and identify exactly which documents failed. The Beats Elasticsearch output uses this to drop permanent failures (e.g. a mapping conflict on a single document) and retry individual documents (e.g. document-level 429s). This happens here:

eventsToRetry, stats := client.bulkCollectPublishFails(bulkResult)
stats.reportToObserver(client.observer)
if len(eventsToRetry) > 0 {
    span.Context.SetLabel("events_failed", len(eventsToRetry))
    batch.RetryEvents(eventsToRetry)
} else {
    batch.ACK()
}

The bulk indexer used in the ES exporter has similar logic here.
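To make the per-document semantics concrete, here is a minimal, self-contained sketch of the kind of classification both snippets rely on. The bulkResponse type and the exact status policy here are illustrative assumptions, not code that either project ships:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// bulkResponse mirrors the relevant shape of an Elasticsearch _bulk response:
// "errors" is true if any item failed, and each item carries its own HTTP
// status code under its action key ("index", "create", ...).
type bulkResponse struct {
    Errors bool `json:"errors"`
    Items  []map[string]struct {
        Status int             `json:"status"`
        Error  json.RawMessage `json:"error,omitempty"`
    } `json:"items"`
}

// classify splits item indexes into retryable and permanent failures:
// 429s (and server errors) are worth retrying, while other 4xx responses
// such as mapping conflicts are permanent and get dropped.
func classify(body []byte) (retryable, permanent []int, err error) {
    var resp bulkResponse
    if err := json.Unmarshal(body, &resp); err != nil {
        return nil, nil, err
    }
    for i, item := range resp.Items {
        for _, result := range item { // one action key per item
            switch {
            case result.Status == http.StatusTooManyRequests || result.Status >= 500:
                retryable = append(retryable, i)
            case result.Status >= 400:
                permanent = append(permanent, i)
            }
        }
    }
    return retryable, permanent, nil
}

func main() {
    body := []byte(`{"errors":true,"items":[
        {"index":{"status":201}},
        {"index":{"status":429}},
        {"index":{"status":400,"error":{"type":"mapper_parsing_exception"}}}]}`)
    retryable, permanent, _ := classify(body)
    fmt.Println("retry:", retryable, "drop:", permanent) // retry: [1] drop: [2]
}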

The otelconsumer, by contrast, only receives a single error covering the whole batch of logs it tried to ingest:

err := out.logsConsumer.ConsumeLogs(ctx, pLogs)
if err != nil {
    // Permanent errors shouldn't be retried. This typically means
    // the data cannot be serialized by the exporter that is attached
    // to the pipeline or that the destination refuses the data because
    // it cannot decode it. Retrying in this case is useless.
    //
    // See https://github.com/open-telemetry/opentelemetry-collector/blob/1c47d89/receiver/doc.go#L23-L40
    if consumererror.IsPermanent(err) {
        st.PermanentErrors(len(events))
        batch.Drop()
    } else {
        st.RetryableErrors(len(events))
        batch.Retry()
    }
    return fmt.Errorf("failed to send batch events to otel collector: %w", err)
}
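A short sketch of what that whole-batch semantics means in practice, using the collector's consumererror package (the error message is made up):

package main

import (
    "errors"
    "fmt"

    "go.opentelemetry.io/collector/consumer/consumererror"
)

func main() {
    // The exporter hands back exactly one error for the whole pLogs payload.
    // Wrapping it with NewPermanent marks every event in the batch as
    // undeliverable, even if only one document was actually rejected.
    err := consumererror.NewPermanent(errors.New("mapping conflict on one document"))

    if consumererror.IsPermanent(err) {
        // The receiver has no way to drop just the offending event.
        fmt.Println("entire batch dropped")
    }
}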

It can detect permanent failures, but I think only at the granularity of the entire batch.

The bulk indexer can return nil on flush even when there were document-level errors. Problem one: our otelconsumer can't see this and will ACK the batch, even though the failed documents haven't been persisted in Elasticsearch yet.

The bulk indexer logic linked earlier covers for this by re-adding failed documents to the indexer's writer for retry, but those logs will then get flushed on a different pushDataLogs call (the ES exporter's ConsumeLogs function) in the future, decoupling any eventual error from the original call into our otelconsumer. This is problem two.

To observe this causing problems, you'd need partial _bulk failures that exhaust any configured retries, and/or a collector restart, so that the ACK in otelconsumer has already moved us past the uningested data in Filebeat's registry file while ingesting a log file.

We need to do some targeted testing here where we force partial _bulk failures (e.g. a permanent mapping conflict on only one document, and a long-lived but recoverable 429 response on only one document).
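A sketch of the kind of test double that could drive this: a stub Elasticsearch whose _bulk handler always reports one success, one permanent failure, and one retryable failure. The server and response bodies are assumptions about a sufficient harness, not existing test code:

package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
)

// newPartialFailureES returns a stub Elasticsearch that answers every _bulk
// request with one success, one permanently failing document (mapping
// conflict), and one retryable document (429), regardless of the request body.
func newPartialFailureES() *httptest.Server {
    return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if r.URL.Path != "/_bulk" {
            w.WriteHeader(http.StatusOK)
            return
        }
        w.Header().Set("Content-Type", "application/json")
        fmt.Fprint(w, `{"took":1,"errors":true,"items":[
            {"index":{"status":201}},
            {"index":{"status":400,"error":{"type":"mapper_parsing_exception","reason":"conflict"}}},
            {"index":{"status":429,"error":{"type":"circuit_breaking_exception","reason":"overloaded"}}}]}`)
    }))
}

func main() {
    es := newPartialFailureES()
    defer es.Close()
    fmt.Println("point the exporter at", es.URL) // use as the ES endpoint in the test config
}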

I think the fundamental problem is that the otelconsumer can't see that only some documents may have succeeded; we are relying on retries in the ES exporter that can be cut short by a collector restart. We could treat partial failures as total failures, which would avoid data loss but cause us to re-index far more documents than we do today.
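As a sketch of that trade-off (flushResult and errorForBatch are hypothetical names, not the exporter's API), collapsing any document-level failure into a whole-batch error would look roughly like:

package main

import "fmt"

// flushResult is a hypothetical summary of one _bulk flush; the real bulk
// indexer tracks outcomes per document rather than as counters.
type flushResult struct {
    Failed int // documents not indexed, for any reason
    Total  int
}

// errorForBatch collapses a partial failure into a whole-batch retryable
// error, so the receiver retries (and re-indexes) everything instead of
// silently ACKing. This trades duplicate indexing for at-least-once delivery.
func errorForBatch(res flushResult) error {
    if res.Failed == 0 {
        return nil
    }
    return fmt.Errorf("bulk flush: %d/%d documents failed; treating partial failure as total",
        res.Failed, res.Total)
}

func main() {
    fmt.Println(errorForBatch(flushResult{Failed: 1, Total: 500}))
}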
