If the configured output is down for an extended period, the memory queue may hit its maximum capacity, and additional data that could have been collected during the outage may be lost. In this situation, it's desirable to support a fallback, so input received during that time is sent to a persistent receiver (e.g. a Kafka topic or a local file) instead of being lost permanently.
The scenario motivating this is a Kubernetes deployment where memory is at a premium but there is still the desire to fully recover from medium-length outages. The disk queue is not an option because it can't be used from Agent (and technical constraints prevent that from changing in the near-term). Additionally, the disk queue incurs a permanent overhead of writing and reading all events from disk even when ingestion is healthy. Fallback outputs would give a near-term option for outage recovery without the same Agent-level technical prerequisites.
Questions about this approach:

- How can we scope this so the behavior is comprehensible at the user level? Which scenarios / outputs should we support? Even if the internal pipeline gains fully generic output-fallback capability, it might make sense to expose only combinations that are well-tested.
- Since the motivating and probably most common use case is recovering from outages, how can we streamline the recovery process? Can we make it easy to send overflow events to a static file and reingest that file once the output is back?
Thanks @faec. I'd also like to think about this in the context of disaster recovery, where a switchover to a secondary cluster is required. The logic to trigger that switchover may be different (somehow tracking cluster availability), but the action is the same as above.