Support fallback / overflow output when primary output is down. #39703

faec · 2024-05-23T17:41:20Z

If the configured output is down for an extended period, the memory queue may hit its maximum capacity, and additional data that could have been collected during the outage may be lost. In this situation, it's desirable to support a fallback, so input received during that time is sent to a persistent receiver (e.g. a Kafka topic or a local file) instead of being lost permanently.
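To make the proposed behavior concrete, here is a minimal sketch of a bounded in-memory queue that diverts events to a fallback once it is full. It is an illustration only, not Beats or Agent code: `FallbackQueue`, `publish`, `drain`, and the `fallback` / `send` callables are hypothetical names chosen for this example.

```python
from collections import deque


class FallbackQueue:
    """Bounded in-memory queue that diverts overflow to a fallback sink."""

    def __init__(self, max_events, fallback):
        self.max_events = max_events
        self.fallback = fallback  # hypothetical callable taking one event
        self.events = deque()

    def publish(self, event):
        """Queue the event; return False if it went to the fallback instead."""
        if len(self.events) >= self.max_events:
            self.fallback(event)
            return False
        self.events.append(event)
        return True

    def drain(self, send):
        """Send queued events in order, stopping at the first failure.

        send is a hypothetical callable that returns True on success.
        Returns the number of events delivered.
        """
        sent = 0
        while self.events:
            if not send(self.events[0]):
                break
            self.events.popleft()
            sent += 1
        return sent
```

In this sketch the fallback only sees events that arrive while the queue is full, so nothing extra is written while ingestion is healthy, unlike a disk queue that every event passes through.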

The scenario motivating this is a Kubernetes deployment where memory is at a premium but there is still a desire to fully recover from medium-length outages. The disk queue is not an option because it can't be used from Agent (and technical constraints prevent that from changing in the near term). Additionally, the disk queue incurs a permanent overhead of writing all events to disk and reading them back, even when ingestion is healthy. Fallback outputs would give a near-term option for outage recovery without the same Agent-level technical prerequisites.

Questions about this approach:

How can we scope this to have comprehensible behavior at the user level? Which scenarios / outputs should we support? Even if we have full generic output-fallback capability in the internal pipeline, it might make sense to only expose combinations that are well-tested.
Since the motivating and possibly most common use case is recovering from outages, how can we streamline the recovery process? Can we make it easy to send overflow events to a static file and reingest that file once the output is back?
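As one possible shape for the file-based recovery flow asked about above, the sketch below appends overflow events to a newline-delimited JSON file and replays them once the output is reachable again. Everything here is an assumption made for illustration: the function names `spill` and `reingest`, the NDJSON format, and the `send` callable are hypothetical and do not describe an existing Beats feature.

```python
import json
import os


def spill(path, event):
    """Append one event to the overflow file as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def reingest(path, send):
    """Replay spilled events through send(); keep whatever is not delivered.

    send is a hypothetical callable that returns True on success.
    Returns the number of events delivered.
    """
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        events = [json.loads(line) for line in f if line.strip()]

    sent = 0
    for event in events:
        if not send(event):
            break
        sent += 1

    remaining = events[sent:]
    if remaining:
        # Rewrite via a temp file so the overflow file is replaced in one step.
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for event in remaining:
                f.write(json.dumps(event) + "\n")
        os.replace(tmp, path)
    else:
        os.remove(path)
    return sent
```

This version gives at-least-once delivery: if the process stops after some events are sent but before the file is rewritten, those events would be sent again on the next run, so the receiving side would need to tolerate duplicates.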

elasticmachine · 2024-05-23T17:41:22Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

ycombinator · 2024-05-23T18:20:02Z

FWIW, Fluentd has this capability, possibly only for certain outputs: https://docs.fluentd.org/output#secondary-output (thanks @henrikno!). And there's a feature request for this capability in FluentBit: fluent/fluent-bit#1632.

nimarezainia · 2024-05-23T20:37:23Z

Thanks @faec. I'd also like to think of this in the context of disaster recovery, when a switchover to a secondary cluster is required. The logic to trigger that switchover may be different (somehow tracking cluster availability), but the action is the same as above.

faec added the enhancement and Team:Elastic-Agent-Data-Plane labels May 23, 2024
