
Allow queue configuration to be specified under the output type #35615

Closed
2 tasks
Tracked by #16
faec opened this issue May 30, 2023 · 24 comments · Fixed by #36788
Labels: enhancement, Team:Elastic-Agent (Label for the Agent team)

@faec (Contributor) commented May 30, 2023

Beats queue configurations are specified at the top level of the config, grouped by queue type, e.g. queue.mem or queue.disk. Recent work to support the shipper involved exposing queue configuration hooks to the output itself during initialization. We should follow up on this by moving the queue configuration entirely into the output block, for example:

output.elasticsearch:
  hosts: ["https://localhost:9200"]
  queue.mem:
    events: 5000

or

output.logstash:
  hosts: ["127.0.0.1:5044"]
  queue.disk:
    max_size: 10GB

This settings block, when present, should use the same structure and behavior as the existing Beats queue settings, and should replace the root-level configuration (though we will still support root-level as a fallback).
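As a sketch of the intended precedence (the values here are illustrative, not defaults), an output-level queue block would override a root-level one:

```yaml
# Root-level queue settings: used only as a fallback
queue.mem:
  events: 4096

output.elasticsearch:
  hosts: ["https://localhost:9200"]
  # When present, this block replaces the root-level queue.mem above
  queue.mem:
    events: 5000
```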

Outputs that support this will automatically gain the ability to specify queue configurations through Agent, both with and without the shipper enabled. (However without the shipper it will apply these settings separately in each input process, duplicating the queue for each input type, so we should be careful how we communicate this option.)
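To illustrate the caveat above with a hypothetical back-of-the-envelope calculation (this is not Beats code): without the shipper, each input process instantiates its own queue, so the effective total is a multiple of the configured size.

```python
# Hypothetical illustration: without the shipper, each Beat input process
# gets its own copy of the configured queue, so the effective total
# scales with the number of running input types.
def effective_total_events(configured_events: int, num_input_processes: int) -> int:
    """Total in-memory queue slots across all input processes."""
    return configured_events * num_input_processes

# e.g. queue.mem.events: 5000 with 3 input types running
print(effective_total_events(5000, 3))  # prints 15000
```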

In addition, please expose the idle connection timeout for the Elasticsearch output, which is currently wired through:

httpcommon.WithKeepaliveSettings{IdleConnTimeout: s.IdleConnTimeout},

This setting, together with the queue settings, will allow users to tune for high-scale workloads.
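Based on the setting name used in the linked commits (`idle_connection_timeout`), the combined configuration might look like the following; the exact key placement is an assumption:

```yaml
output.elasticsearch:
  hosts: ["https://localhost:9200"]
  idle_connection_timeout: 60s   # keep-alive idle timeout for ES connections
  queue.mem:
    events: 5000
```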

  • Implement Queue Settings for Outputs
  • Allow configuring IdleConnectionTimeout setting for Elasticsearch output
@faec added the enhancement and Team:Elastic-Agent labels May 30, 2023
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@andrewkroh (Member)

Relates: elastic/elastic-agent#284

@cmacknz (Member) commented Jun 8, 2023

Inability to configure the underlying Beat queue parameters is currently one of the largest blockers for scaling the agent. High-volume use cases need a larger queue to achieve the required throughput, and we prevent this right now.

Implementing this will unblock those use cases, although it will result in some potentially unintuitive behavior until we have the shipper: the total queue size will not be what is in the output configuration, but will instead be proportional to the number of running Beat inputs.

@cmacknz (Member) commented Jun 13, 2023

Some relevant comments on elastic/kibana#158699 (comment)

When we do this, it will allow agent to configure both the memory and disk queue. I think we want to explicitly prevent configuring the agent with a disk queue until we support it properly, for example by correctly sharing the queue contents across upgrades by default.

@faec (Contributor, Author) commented Jun 13, 2023

I think we want to explicitly prevent configuring the agent with a disk queue until we support it properly, for example by correctly sharing the queue contents across upgrades by default.

The easiest way to do this is to just not allow the disk queue to be configured in output settings yet, otherwise we'll need the outputs to special-case their config parsing based on whether they're running under Agent, which ideally they shouldn't know or care about. This sounds like a reasonable limitation to me -- there's no reason for us to advertise this new way of configuring the queue (to Beats users, at least), it just lets Beats work with Agent's unit-based configs, so a lack of disk queue support is unlikely to bite anyone. Any concerns about this approach?

@cmacknz (Member) commented Jun 13, 2023

Any concerns about this approach?

No, the simple approach is best here.

@zez3 commented Aug 28, 2023

So the Fleet YAML syntax would be something like?

worker:24
queue.disk.max_size:150GB

or?

@cmacknz (Member) commented Aug 28, 2023

Likely yes, the syntax will be confirmed once we have finished the implementation.

@zez3 commented Sep 1, 2023

Hmm, something seems to be off

  d79fc350-8546-11ed-b830-93d76b562dd8:
    "0": w
    "1": o
    "2": r
    "3": k
    "4": e
    "5": r
    "6": ':'
    "7": "2"
    "8": "4"
    "9": ' '
    "10": q
    "11": u
    "12": e
    "13": u
    "14": e
    "15": .
    "16": d
    "17": i
    "18": s
    "19": k
    "20": .
    "21": m
    "22": a
    "23": x
    "24": _
    "25": s
    "26": i
    "27": z
    "28": e
    "29": ':'
    "30": "3"
    "31": "0"
    "32": "0"
    "33": G
    "34": B
    api_key: mykey
    hosts:
    - https://mydom.mytld:9243
    type: elasticsearch
  default:
    api_key: somekey
    hosts:
    - https://mydom.mytld:9243
    type: elasticsearch

@zez3 commented Sep 4, 2023

Got it working with the correct YAML syntax.
The strange thing is that Fleet accepted the incorrect values even though it performs some validation checks.
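For reference, the earlier snippet was mangled because YAML requires a space after the colon and one key per line; inferring from the values in the pasted config, the corrected form would be:

```yaml
worker: 24
queue.disk.max_size: 300GB
```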

@pierrehilbert assigned leehinman and unassigned faec Sep 7, 2023
@pierrehilbert (Collaborator)

@leehinman as discussed, I assigned you this issue for the next sprint.

@strawgate (Contributor)

Can we ensure that flush.timeout is configurable for the queue with this work?

@cmacknz (Member) commented Sep 8, 2023

This would allow configuring every parameter of the Beats memory queue through an agent policy: https://www.elastic.co/guide/en/beats/filebeat/current/configuring-internal-queue.html#configuration-internal-queue-memory

@cmacknz (Member) commented Sep 8, 2023

That includes both flush.timeout and flush.min_events.

This would technically also allow configuring the Beats disk queue, but we will likely disallow it initially, since with the current architecture it would create one disk queue per unique input type in the agent policy, which is probably not what most people would expect.

leehinman added a commit to leehinman/beats that referenced this issue Sep 27, 2023
- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615
@cmacknz (Member) commented Sep 28, 2023

When we merge this let's add an agent changelog entry as well explaining where this can be set.

@cmacknz (Member) commented Sep 28, 2023

This change should probably have a docs issue associated with it to make sure that this functionality is explained properly in the agent documentation.

I don't know that we need to document this for standalone Beats because it doesn't enable anything that wasn't already possible.

@nimarezainia (Contributor)

This change should probably have a docs issue associated with it to make sure that this functionality is explained properly in the agent documentation.

I don't know that we need to document this for standalone Beats because it doesn't enable anything that wasn't already possible.

@kilfoyle FYI: the configuration items are already described in this section: https://www.elastic.co/guide/en/beats/filebeat/current/configuring-internal-queue.html. We are just exposing them in the Agent output (via the advanced YAML box).

the possible settings are now:

output.elasticsearch:   # any output
  hosts: ["https://localhost:9200"]
  queue.mem:
    events: 5000
    flush.min_events: 512
    flush.timeout: 5s

@nimarezainia (Contributor)

This would technically also allow configuring the Beats disk queue, but we will likely disallow it initially since this would create one disk queue per unique input type with the current architecture in the agent policy which is probably not what most people would expect.

@cmacknz if we do this, we can claim disk queue support as well. Fair enough that this queue exists in every Beat and is not what the user expects, but then again so is the internal queue (until this change). The main usage for the spooling is resiliency when we are disconnected; is there a reason why we can't do the spooling on the input rather than once on the output?

My vote is to enable this while we are at it.

@zez3 commented Sep 29, 2023

@cmacknz

What do you mean by:

by correctly sharing the queue contents across upgrades by default.

?

The Agent is quite often upgraded, most of the time together with the whole stack. Can you please describe in more detail the issue with the queue contents after an upgrade?

leehinman added a commit to leehinman/beats that referenced this issue Sep 29, 2023
- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615
@kilfoyle (Contributor)

This change should probably have a docs issue associated with it to make sure that this functionality is explained properly in the agent documentation.

Thanks Craig and Nima for the heads up! I'll look after this docs issue in the upcoming sprint.

@cmacknz (Member) commented Sep 29, 2023

The Agent is quite often upgraded, most of the time together with the whole stack. Can you please describe in more detail the issue with the queue contents after an upgrade?

See elastic/elastic-agent#3490 which explains what needs to happen for the Elastic Agent to support the Beats disk queue. This requires some understanding of the internal architecture of the agent. See https://github.com/elastic/elastic-agent/blob/main/docs/architecture.md.

The disk queue will be preserved between upgrades, but we need to do it correctly and without having to copy it since it can be quite large. We need to special case this in the agent to make sure it happens properly.

@blakerouse (Contributor)

The Agent is quite often upgraded, most of the time together with the whole stack. Can you please describe in more detail the issue with the queue contents after an upgrade?

See elastic/elastic-agent#3490 which explains what needs to happen for the Elastic Agent to support the Beats disk queue. This requires some understanding of the internal architecture of the agent. See https://github.com/elastic/elastic-agent/blob/main/docs/architecture.md.

The disk queue will be preserved between upgrades, but we need to do it correctly and without having to copy it since it can be quite large. We need to special case this in the agent to make sure it happens properly.

Do we really not want to copy it? What happens when it's corrupted on upgrade and the rollback then fails because we didn't copy it and the new version corrupted it?

@cmacknz (Member) commented Oct 3, 2023

Do we really not want to copy it? What happens when it's corrupted on upgrade and the rollback then fails because we didn't copy it and the new version corrupted it?

The primary reason is that it can be GBs in size; the default size is 10 GB.

I don't want us to block the completion of upgrades on copying a possibly 10 GB file. Copying a file that large is also likely to run into disk space constraints, since the system needs to be able to store it twice.

We need to handle the corrupted disk queue case, but I don't think we can solve it by copying on upgrade; even though that is conceptually the simplest option, I don't think it is practical.

@zez3 commented Oct 3, 2023

copying a possibly 10 GB file

FYI: When I was using Graylog we had 300 GB journal files for queue buffering. This was more of a safety net for those moments when we had really high (>100k EPS) load, or when we had to perform administration tasks that brought our Elasticsearch cluster offline. When the 300 GB was reaching full (>90%) capacity, we were declaring that node dead to our LB.

My goal is to get the queue status from the Agent status (once the shipper is finalized), or directly from the underlying Filebeat via its HTTP endpoint.

leehinman added a commit to leehinman/beats that referenced this issue Oct 6, 2023
- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615
leehinman added a commit to leehinman/beats that referenced this issue Oct 9, 2023
- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615
leehinman added a commit to leehinman/beats that referenced this issue Oct 19, 2023
- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615