Support the Beats disk queue in Elastic Agent #3490

Open
cmacknz opened this issue Sep 29, 2023 · 5 comments
Labels
Team:Elastic-Agent Label for the Agent team

Comments

@cmacknz
Member

cmacknz commented Sep 29, 2023

Beats support a disk queue that has been GA for some time, but it cannot be used with the Elastic Agent. Part of the reason is that the Elastic Agent does not allow the queue to be configured at all, although this will change once elastic/beats#36693 is merged.

Those changes would allow a user to enable the Beats disk queue. With no other changes, this would instruct each Beat to create a disk queue in the same directory. The disk queue is not shared between processes: there is one disk queue per process, and the per-process queues will conflict when they attempt to use the same files in the same directory.

For the disk queue to work properly when running under the Elastic Agent without a dedicated shipper process, we need to orchestrate the queue directories correctly in the agent itself. Specifically, we need to:

  1. Create a dedicated directory in the agent installation path for the disk queue files. The natural choice for the disk queue location would be the per-component run directory in the versioned data path, but this would require the entire queue to be copied on upgrade. I think we should avoid this because the disk queue can be large (100+ MB depending on configuration and usage), and instead create a dedicated directory outside of the versioned data path that is shared between versions of the Elastic Agent. We will likely need a file lock in the directory to ensure only one version can read from this directory at a time.

  2. In the dedicated queue directory, provision a unique disk queue sub-directory for each component since queues cannot be shared between processes. The disk queue for a component should be removed when the component is removed from the agent policy.

  3. Allow the user to configure the dedicated disk queue directory. Users may want the disk queue to reside on a dedicated volume, which will be particularly important when the Elastic Agent is running on Kubernetes and the user wishes for the disk queues to be stored on a persistent volume claim.

We will also need to performance test the Elastic Agent running with the disk queue, and compare it to the Elastic Agent without the disk queue. The disk queue has a performance penalty because events must be serialized before being written to disk. We should quantify what this penalty is, particularly when the Elastic Agent is supervising multiple Beats each with their own disk queue.
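
As a very rough starting point for reasoning about the serialization cost alone, something like the following could time how long it takes to encode a batch of events. This is only a sketch under stated assumptions: the `event` type and `serializeAll` function are hypothetical, and `encoding/json` is chosen for illustration and is not necessarily the encoding the Beats disk queue uses. It does not measure disk writes, and a real comparison would need to run the Elastic Agent with and without the disk queue enabled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// event is a hypothetical, simplified stand-in for a Beats event.
type event struct {
	Timestamp time.Time         `json:"@timestamp"`
	Message   string            `json:"message"`
	Fields    map[string]string `json:"fields"`
}

// serializeAll encodes every event and returns the total number of encoded
// bytes together with the wall-clock time spent encoding.
func serializeAll(events []event) (int, time.Duration, error) {
	start := time.Now()
	total := 0
	for i := range events {
		b, err := json.Marshal(&events[i])
		if err != nil {
			return 0, 0, fmt.Errorf("encoding event %d: %w", i, err)
		}
		total += len(b)
	}
	return total, time.Since(start), nil
}

func main() {
	events := make([]event, 1000)
	for i := range events {
		events[i] = event{
			Timestamp: time.Now(),
			Message:   fmt.Sprintf("log line %d", i),
			Fields:    map[string]string{"host": "example"},
		}
	}
	n, elapsed, err := serializeAll(events)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encoded %d events, %d bytes, in %s\n", len(events), n, elapsed)
}
```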

The final caveat to this implementation is that the disk queue will only be supported for inputs which are based on Beats. We should add the ability for agent specification files to declare whether they support the disk queue configuration. The one special case to consider is endpoint-security which always uses a disk queue that is different from the one implemented in Beats. We will need to make this obvious to users.
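
One way the specification check could look is sketched below. Everything here is hypothetical: the `componentSpec` type and its `SupportsDiskQueue` and `ManagesOwnQueue` fields do not exist in the agent specification format today, and `decideDiskQueue` is just an illustration of returning a user-visible reason along with the decision.

```go
package main

import "fmt"

// componentSpec is a hypothetical, simplified view of an agent specification.
type componentSpec struct {
	Name string
	// SupportsDiskQueue is a hypothetical flag a spec file could declare.
	SupportsDiskQueue bool
	// ManagesOwnQueue marks components that always use their own queue
	// implementation, independent of the Beats one.
	ManagesOwnQueue bool
}

// queueDecision says whether the Beats disk queue settings should be passed
// to a component, with a reason that can be surfaced to the user.
type queueDecision struct {
	Apply  bool
	Reason string
}

// decideDiskQueue determines whether a requested disk queue configuration
// applies to the given component.
func decideDiskQueue(spec componentSpec, requested bool) queueDecision {
	switch {
	case !requested:
		return queueDecision{false, "disk queue not requested"}
	case spec.ManagesOwnQueue:
		return queueDecision{false, spec.Name + " always uses its own disk queue; Beats queue settings are ignored"}
	case !spec.SupportsDiskQueue:
		return queueDecision{false, spec.Name + " does not support the Beats disk queue"}
	default:
		return queueDecision{true, "Beats disk queue enabled for " + spec.Name}
	}
}

func main() {
	specs := []componentSpec{
		{Name: "filebeat", SupportsDiskQueue: true},
		{Name: "endpoint-security", ManagesOwnQueue: true},
		{Name: "some-non-beats-input"},
	}
	for _, s := range specs {
		d := decideDiskQueue(s, true)
		fmt.Printf("%s: apply=%t (%s)\n", s.Name, d.Apply, d.Reason)
	}
}
```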

@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Sep 29, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@blakerouse
Contributor

Elastic Agent spawns each component with its own work directory, which is consistent across restarts. Why can't the disk queue just write to the work directory that is given to the process when it is started by the Elastic Agent?

@cmacknz
Member Author

cmacknz commented Oct 3, 2023

> Why can't the disk queue just write to the work directory that is given to the process when it is started by the Elastic Agent?

That works as long as we don't have to copy the queue on upgrade, for the reasons mentioned in elastic/beats#35615 (comment)

@blakerouse
Contributor

> > Why can't the disk queue just write to the work directory that is given to the process when it is started by the Elastic Agent?
>
> That works as long as we don't have to copy the queue on upgrade, for the reasons mentioned in elastic/beats#35615 (comment)

It is placed in the run directory which is copied.

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/upgrade/upgrade.go#L166

@mbudge

mbudge commented Oct 5, 2023

Disk queues reduce the risk of data loss, but at the same time I can see disk queues hammering some of our busy production servers.

Suggestion: use the in-memory queue and fall back to the disk queue only when there is a network issue.

4 participants