Support fallback / overflow output when primary output is down. #39703

Open
faec opened this issue May 23, 2024 · 3 comments
Labels
enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@faec
Contributor

faec commented May 23, 2024

If the configured output is down for an extended period, the memory queue may hit its maximum capacity, and additional data that could have been collected during the outage may be lost. In this situation, it's desirable to support a fallback, so input received during that time is sent to a persistent receiver (e.g. a Kafka topic or a local file) instead of being lost permanently.
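To make the proposed behavior concrete, here is a minimal sketch of overflow routing: events queue for the primary output until a bounded in-memory queue fills, and anything beyond that is diverted to a fallback. All names here (`Event`, `Output`, `MemOutput`, `FallbackRouter`) are hypothetical and do not correspond to the actual Beats pipeline or output interfaces; the queue is also simplified to a plain slice.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a stand-in for a pipeline event (hypothetical type).
type Event struct {
	Message string
}

// Output is a hypothetical minimal output interface, not the Beats API.
type Output interface {
	Publish(e Event) error
}

// ErrUnavailable signals that an output cannot accept events right now.
var ErrUnavailable = errors.New("output unavailable")

// MemOutput collects events in memory; Down simulates an outage.
type MemOutput struct {
	Down   bool
	Events []Event
}

func (m *MemOutput) Publish(e Event) error {
	if m.Down {
		return ErrUnavailable
	}
	m.Events = append(m.Events, e)
	return nil
}

// FallbackRouter holds events for Primary in a bounded queue and
// diverts new events to Fallback once that queue is full.
type FallbackRouter struct {
	Primary  Output
	Fallback Output
	Capacity int
	queue    []Event
}

// QueueLen reports how many events are waiting for the primary output.
func (r *FallbackRouter) QueueLen() int { return len(r.queue) }

// Flush drains the queue to Primary in order, stopping at the first failure.
func (r *FallbackRouter) Flush() {
	for len(r.queue) > 0 {
		if err := r.Primary.Publish(r.queue[0]); err != nil {
			return
		}
		r.queue = r.queue[1:]
	}
}

// Enqueue drains what it can, then queues the event if there is room.
// If the queue is still full, the event goes to Fallback instead of being dropped.
func (r *FallbackRouter) Enqueue(e Event) error {
	r.Flush()
	if len(r.queue) < r.Capacity {
		r.queue = append(r.queue, e)
		r.Flush()
		return nil
	}
	return r.Fallback.Publish(e)
}

func main() {
	primary := &MemOutput{Down: true} // simulate an outage
	fallback := &MemOutput{}
	r := &FallbackRouter{Primary: primary, Fallback: fallback, Capacity: 2}

	for i := 1; i <= 5; i++ {
		_ = r.Enqueue(Event{Message: fmt.Sprintf("event-%d", i)})
	}
	fmt.Println("queued:", r.QueueLen(), "fallback:", len(fallback.Events))

	primary.Down = false // outage ends
	r.Flush()
	fmt.Println("primary after recovery:", len(primary.Events))
}
```

With a capacity of 2 and five events arriving during the outage, two events stay queued for the primary and three are diverted to the fallback. Note that in this sketch the diverted events are not sent to the primary automatically after recovery; getting them back is the separate recovery question raised below.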

The scenario motivating this is a Kubernetes deployment where memory is at a premium but there is still the desire to fully recover from medium-length outages. The disk queue is not an option because it can't be used from Agent (and technical constraints prevent that from changing in the near-term). Additionally, the disk queue incurs a permanent overhead of writing and reading all events from disk even when ingestion is healthy. Fallback outputs would give a near-term option for outage recovery without the same Agent-level technical prerequisites.

Questions about this approach:

  • How can we scope this to have comprehensible behavior at the user level? Which scenarios / outputs should we support? Even if we have full generic output-fallback capability in the internal pipeline, it might make sense to only expose combinations that are well-tested.
  • Since the motivating and possibly most common use case is recovering from outages, how can we streamline the recovery process? Can we make it easy to send overflow events to a static file and reingest that file once the output is back?
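
As one possible shape for the file-based recovery flow in the second question, the sketch below appends overflow events to a static file and replays that file once the output is back. The newline-delimited JSON format and the names `SpillToFile`, `Reingest`, `replay`, and `runDemo` are assumptions made for illustration; nothing here reflects an existing Beats feature or API.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Event is a stand-in for a pipeline event (hypothetical type).
type Event struct {
	Message string `json:"message"`
}

// SpillToFile appends events to path as newline-delimited JSON.
func SpillToFile(path string, events []Event) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	enc := json.NewEncoder(f) // Encode writes one JSON value followed by a newline
	for _, e := range events {
		if err := enc.Encode(e); err != nil {
			f.Close()
			return err
		}
	}
	return f.Close()
}

// replay reads path line by line and passes each event to publish, in order.
// bufio.Scanner's default line limit applies; very large events would need sc.Buffer.
func replay(path string, publish func(Event) error) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	count := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var e Event
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			return count, err
		}
		if err := publish(e); err != nil {
			return count, err
		}
		count++
	}
	return count, sc.Err()
}

// Reingest replays the overflow file and removes it only after every event
// was published. A failed replay leaves the file in place so it can be
// retried, at the cost of possibly re-sending events that already went out.
func Reingest(path string, publish func(Event) error) (int, error) {
	count, err := replay(path, publish)
	if err != nil {
		return count, err
	}
	return count, os.Remove(path)
}

// runDemo spills two batches to a temporary file and reingests them.
func runDemo() (int, []string, error) {
	dir, err := os.MkdirTemp("", "overflow")
	if err != nil {
		return 0, nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "overflow.ndjson")

	if err := SpillToFile(path, []Event{{Message: "a"}, {Message: "b"}}); err != nil {
		return 0, nil, err
	}
	if err := SpillToFile(path, []Event{{Message: "c"}}); err != nil {
		return 0, nil, err
	}

	var got []string
	n, err := Reingest(path, func(e Event) error {
		got = append(got, e.Message)
		return nil
	})
	return n, got, err
}

func main() {
	n, got, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reingested", n, "events:", got)
}
```

Deleting the file only after a complete replay keeps the recovery step retryable, which favors at-least-once delivery over avoiding duplicates; a real design would need to decide whether that trade-off is acceptable.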
@faec faec added enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 23, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@ycombinator
Contributor

ycombinator commented May 23, 2024

FWIW, Fluentd has this capability, possibly only for certain outputs: https://docs.fluentd.org/output#secondary-output (thanks @henrikno!). And there's a feature request for this capability in FluentBit: fluent/fluent-bit#1632.

@nimarezainia
Contributor

Thanks @faec. I'd also like to think of this in the context of disaster recovery, when a switchover to a secondary cluster is required. The logic to trigger that switchover may be different (somehow tracking cluster availability), but the action is the same as above.
