
[META]Allow specifying different and multiple fleet servers in agent policy #903

Closed
ruflin opened this issue Nov 19, 2021 · 27 comments
Labels: 8.6-candidate, Meta, Team:Fleet

Comments

@ruflin
Member

ruflin commented Nov 19, 2021

Over the past few weeks we have seen several users trying to run multiple fleet-servers. This issue is to discuss the different scenarios and share initial thoughts on how we could approach this. The goal is to come to a conclusion, feed the information into the documentation, and use it as a guideline for future features.

Core concepts

  • No load balancing: It is not Elastic Agent's job to load balance between multiple fleet-servers. This must be done at the infrastructure level through a proxy or DNS.
  • Failover: An Elastic Agent supports multiple fleet-server URLs for failover. By default the first URL is picked.
  • All info is shared across fleet-servers: Every fleet-server always has the same information, no matter where it is deployed.
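The failover concept can be sketched as follows (an illustration only, not actual Elastic Agent code; the URLs and the reachability check are made up):

```python
def pick_fleet_server(urls, is_reachable):
    """Return the first reachable fleet-server URL, treating list order as priority."""
    for url in urls:
        if is_reachable(url):
            return url
    return None  # nothing reachable; the agent would keep retrying


# Example: the first (local) fleet-server is down, so the agent
# falls back to the next URL in the list.
urls = ["https://fleet.local:8220", "https://fleet.example.com:8220"]
chosen = pick_fleet_server(urls, lambda u: "example.com" in u)
# chosen == "https://fleet.example.com:8220"
```

Note that this deliberately does no load balancing: with every server reachable, every agent picks the first URL.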

Scenarios

Below is a list of scenarios describing the expected behaviours that follow the above core concepts. Not all scenarios are supported today.

Scenario 1: Multiple fleet-servers, all Elastic Agents connect to all fleet-servers

In scenario 1 the user has multiple fleet-servers for redundancy or scale purposes. The user is expected to set up a proxy in front of the fleet-servers or use DNS to access the multiple fleet-servers. In the Fleet UI, a single fleet-server URL is used.

Scenario 2: Elastic Cloud only

The user connects all their Elastic Agents to the fleet-server in Elastic Cloud. As ESS already has a proxy in front and allows spinning up redundant fleet-servers, the setup already works as expected today.

Scenario 3: Elastic Cloud fleet-server and on-prem

The user is using Elastic Cloud with the fleet-server but also runs an on-prem fleet-server to have a fleet-server closer to their Elastic Agents. By default the user wants their local fleet-server to be used. In the Fleet UI, the user puts an additional fleet-server URL before the Elastic Cloud fleet-server URL. The user-defined URL is the one used by default. In case the local fleet-server URL is not reachable, Elastic Agents fall back to the Elastic Cloud fleet-server URL.

The local fleet-server is expected to keep its version in sync with the hosted fleet-server.

Scenario 4: Multi policy with multi data center

In this scenario the user has multiple data centers with local fleet-servers and policies specific to each data center. In the Fleet UI, a global fleet-server is specified, and in addition a fleet-server per policy can be specified. The fleet-server specified in the policy is the first in the list, so it will be picked as the default for all Elastic Agents that are part of the policy.
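An agent policy for scenario 4 might carry an ordered fleet-server list roughly like this (a hypothetical sketch; only the `fleet.hosts` key is mentioned later in this thread, and the exact shape and host names are made up):

```yaml
# Hypothetical agent policy fragment: the policy-specific fleet-server comes
# first, so agents on this policy pick it by default; the global fleet-server
# acts as the fallback.
fleet:
  hosts:
    - "https://fleet.dc1.internal:8220"    # data-center-local fleet-server (default)
    - "https://fleet.global.example:8220"  # global fleet-server (fallback)
```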

@ruflin
Member Author

ruflin commented Nov 19, 2021

@scunningham @joshdover @michel-laterman @nimarezainia Would be great to get your take on the above.

@ruflin ruflin added the Team:Elastic-Agent-Control-Plane label Nov 19, 2021
@ruflin ruflin self-assigned this Nov 19, 2021
@scunningham

@ruflin Makes sense. We could consider an automatic mode where the agent regularly pings each listed URL and determines the shortest route, either by round-trip time or traceroute. However, that seems like a nice-to-have, and I am not sure customers would use it.

I think the bigger concern is going to be the multitude of outbound connections from the integrations to Elasticsearch. There is very little volume to the Fleet Servers in comparison. Is the idea here to allow different ES outputs per policy?

@ruflin
Member Author

ruflin commented Nov 19, 2021

@scunningham To keep the discussion focused on fleet-server only, should we open a different issue / thread for the ES outputs per policy to set some guidelines there too?

@nimarezainia

Thanks @ruflin for this discussion. I think if we work towards defining what Scenario 4 would look like (i.e. FS per policy), it would address many of the multi-site use cases we see today. In the near future, with either Logstash support or the secure proxy concept, we could move towards having the control and data plane aggregated at a site, and reduce the number of connections coming out of some of these DC or DMZ sites.

What should be the next steps? I will clean up the multi-site document to be more generic and we can use that.

@SHolzhauer

Hi, wanted to pitch in regarding scenario 4.

A use case we are working/struggling with is not necessarily multiple data centers but different connection types.

There is one Elastic Stack running on premise (cloud, but self-managed) where the bulk of the servers use internal DNS/connections to connect to a load balancer with 2+ (HA) fleet servers.

There are also endpoints (laptops) for the employees with the agent installed. These laptops have an always-on VPN allowing them to connect via the internal DNS/connections (the same as the servers). However, when out of the office, when the VPN is disabled, or when a host is isolated, the internal connections are unavailable.

So it's a different setup, but allowing endpoints to be specified per agent policy should resolve the issues/setup complications.

@ruflin
Member Author

ruflin commented Nov 23, 2021

@SHolzhauer Thanks for chiming in. When the VPN is disabled, what fleet-server do you expect these Elastic Agents to connect to? Is it a global fallback one?

@nimarezainia I think first the teams (Elastic Agent / Fleet / Security) need to agree that the above scenarios are what we expect. Especially scenario 4 needs a bit more discussion, as it also impacts Fleet. But if everyone is aligned, we can get it moving.

@michel-laterman
Contributor

So my understanding of the proposals is that if we specify more than one server address, the agent will try them in order so that the 1st address (either from the settings or from the policy) will be attempted before any others (such as the cloud address if available)?

@ruflin
Member Author

ruflin commented Nov 24, 2021

@michel-laterman Correct. First one wins as long as it is reachable.

@SHolzhauer

@SHolzhauer Thanks for chiming in. When the VPN is disabled, what fleet-server do you expect these Elastic Agents to connect to? Is it a global fallback one?

@ruflin When the VPN is enabled we would expect it to use the internal fleet endpoint.
When the vpn is disabled or when the host is isolated we would expect it to use the external fleet endpoint.

But if we can specify a fleet and ES endpoint per agent policy we already have a solution.

@blakerouse
Contributor

Sorry, but the statement that the Elastic Agent doesn't currently do round-robin across multiple Fleet Servers when multiple URLs are provided is not true. Actually, for each request it tries to pick the last-used endpoint, as long as that endpoint has not had an error in the past X minutes.

See: https://github.com/elastic/beats/blob/b5e94143d774f2434432a578f7ef5bbd71002bca/x-pack/elastic-agent/pkg/remote/client.go#L241
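A rough model of that behaviour (illustrative Python, not the actual Go client linked above; the 10-minute window is an assumed stand-in for the "X minutes"):

```python
ERROR_WINDOW = 10 * 60  # assumed X = 10 minutes, in seconds

def next_host(hosts, last_used, last_error_at, now):
    """Reuse the last-used endpoint unless it errored within the window;
    otherwise fall through to the first recently-healthy host in the list."""
    def healthy(host):
        # A host with no recorded error counts as healthy.
        return now - last_error_at.get(host, float("-inf")) > ERROR_WINDOW

    if last_used in hosts and healthy(last_used):
        return last_used
    for host in hosts:
        if healthy(host):
            return host
    return hosts[0]  # every host errored recently; start over from the top
```

This stickiness is why the "first URL wins" priority semantics discussed above would require a behaviour change.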

@aarju

aarju commented Feb 9, 2022

Here is a real-world 'multi cloud' scenario we are looking at internally at Elastic. We want to migrate to using Fleet for security and o11y in Elastic Cloud. Elastic Cloud currently spans 53 different cloud regions across multiple providers. Each of these regions has its own stack into which we collect the o11y and security data. We then use Cross-Cluster Search so we can use a single 'overview' cluster to query and alert on data across all of the regional clusters. For a deep dive into our current configuration, you can check out this blog post.

The question is how do we migrate from the current filebeat and auditbeat system to using Fleet to deploy and manage Elastic Agents in a multi-region, multi-cloud environment? Instead of having 53 different isolated Fleet servers, we would like to have a single 'primary' fleet server that can manage the policies, agents, and integrations for all of the other fleet servers. Without that, if we need to use an integration such as OSQuery or Elastic Security's Host Isolation, we have to log into each of the 53 fleet servers to run the query.

@mostlyjason

@aarju today we can provide a single control plane per cluster in Fleet. So if you want to provide a single control plane over all 53 regions, you must enroll all those agents into a single cluster. Each cluster can have multiple Fleet Servers, which you can install in as many places as you want. For example, you could have a Fleet Server for each region. All those Fleet Servers connect back to a single Elasticsearch cluster, which coordinates across them.

Currently, the control plane and data plane must use the same cluster. However, we are planning to add Logstash output support in 8.2 elastic/kibana#104987, and remote Elasticsearch output shortly after. That means you could have a dedicated cluster for your control plane in Fleet, and ship the data to your regional clusters. You could also use CCS across that data.

There are a few more details to dive into (regional routing and keeping integrations in sync), but would this architecture meet your needs at a high level?

@aarju

aarju commented Feb 9, 2022

Thanks for the input @mostlyjason!

We are really looking forward to this capability in 8.2, I think that will help a lot with this problem.

Currently, the control plane and data plane must use the same cluster. However, we are planning to add Logstash output support in 8.2 elastic/kibana#104987, and remote Elasticsearch output shortly after. That means you could have a dedicated cluster for your control plane in Fleet, and ship the data to your regional clusters. You could also use CCS across that data.

@ruflin ruflin assigned ph and unassigned ruflin May 23, 2022
@ruflin
Member Author

ruflin commented May 23, 2022

Assigning this to @ph to take over the lead on it. I would still like to see us drive this to a conclusion on the long-term vision/goal, to guide us on any future implementations. @joshdover

@joshdover
Contributor

It seems there is a general preference for exploring option 4: being able to assign Fleet Server host(s) per agent policy, which generally fits well with how we expect users to use agent policies to group agents. IMO it makes sense to start there and see which use cases we can't solve with this solution.

@joshdover
Contributor

joshdover commented May 23, 2022

Think I jumped the gun a bit on this one. Looking closer at https://github.com/elastic/elastic-agent/blob/ce95d6b5f36a43a927112517e91885570305e219/internal/pkg/remote/client.go#L222, it appears that we do currently round-robin Fleet Servers and choose the host that was used least recently. We would need to stop doing that before exploring option 4.

@nimarezainia

Think I jumped the gun a bit on this one. Looking closer at https://github.com/elastic/elastic-agent/blob/ce95d6b5f36a43a927112517e91885570305e219/internal/pkg/remote/client.go#L222, it appears that we do currently round-robin Fleet Servers and choose the host that was used least recently. We would need to stop doing that before exploring option 4.

@joshdover
if the fleet server list is per policy, the behavior you describe here would pertain to those servers in that policy, correct?

@scunningham

I have no objection to binding fleet server selection per policy; however, endpoints do roam, and a secondary selection methodology should be implemented to properly select from multiple fleet servers within the context of a policy. I would prefer to see a user-prioritized list with fallback to lower priority on failure. I am thinking in particular of laptops that roam in/out of a VPN regularly, or customers that travel across regions.

@nimarezainia

@scunningham we have a requirement from the endpoint team to dynamically join agents to a policy for pretty much the same use case you mention above. One use case, I believe, was when someone travels from one location to another: there may be different security requirements at the new location. Is that correct?

If a policy has a defined set of Fleet Servers, and we could switch agents between policies - do you see us needing the prioritized list with fallback?

@scunningham

I suppose that's another way to do it. We did a similar design at Endgame where Agents could be automatically assigned to a specific policy by applying a prioritized set of rules to metadata regularly. We did, however, also have the concept of statically assigned agents; those for which the policy would never change. This was desirable for certain classes of user personas, for example C level executives. In those cases you may still need a mechanism to prioritize amongst a set of fleet servers.

Even within a policy, it may be the case that not all fleet servers are treated equally: for example, I want the customer to try to hit the VPN fleet server first and, only if it is not accessible, fall back to a regional or global fleet server.

@joshdover
Contributor

I personally prefer we decouple the dynamic policy assignment feature from having a prioritized Fleet Server hosts list. The latter is generically useful for simpler deployments that don't need dynamic policy assignments, and for the case Sean mentioned above. A prioritized list is also likely much lower effort to deliver than dynamic policy assignment, and would unblock releasing documentation/guidance on this.

@nicpenning

Hello all, I would like to drop a little feedback here with what I have experienced.

I am in the first camp: Scenario 1 where we have multiple agents and multiple Fleet servers all hosted on-premise (0 cloud).

1 Fleet server in the DMZ (DNS fleet.example.org) and 1 Fleet server on the LAN (fleet.local.org).

The current issue (as of version 8.3.1) is that when agents leave the LAN and connect from hostile networks, they cannot resolve the fleet.local.org address. Instead of failing over to the DMZ node at fleet.example.org, the agent gets stuck on the failure. The expected behavior is that if an agent cannot resolve the DNS name of one server, or fails to connect in any way, it tries the other Fleet servers it knows about.

More specifically on the DNS error, I see this:

[elastic_agent][warn] DNS lookup failure "fleet.org.local": lookup fleet1.state.sd.local: no such host
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post "https://fleet.org.local:8220/api/fleet/agents/redacted/checkin?": lookup fleet1.state.sd.local: no such host

The only workaround is getting the system back onto the local network so it can connect to the internal Fleet server, or rebooting the device and hoping it will hit the DMZ node and connect.

Unless I am missing something, it doesn't seem that the Elastic Agent will try other Fleet servers when it fails to look up a host. Perhaps there are some Fleet settings I am unaware of that can change this behavior? Either way, I wanted to chime in here, since this is a much-needed capability: ideally, Fleet server traffic to the DMZ nodes would come only from external assets, with the internal Fleet server(s) serving internal assets.
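The expected behaviour described here could be sketched like this (an illustration of the expectation, not the agent's current code; `fetch` is a hypothetical check-in call that raises `OSError`, which `socket.gaierror` for DNS failures subclasses):

```python
import socket

def checkin(urls, fetch):
    """Try each fleet-server URL in order; any OSError (including a DNS
    lookup failure, socket.gaierror) advances to the next URL instead of
    getting stuck retrying the same unreachable host."""
    last_err = None
    for url in urls:
        try:
            return fetch(url)
        except OSError as err:
            last_err = err  # remember the failure and try the next host
    raise ConnectionError("no fleet-server reachable") from last_err


# Example with a fake fetch: the LAN host fails DNS, the DMZ host answers.
def fake_fetch(url):
    if "local" in url:
        raise socket.gaierror("no such host")
    return "ok"

result = checkin(
    ["https://fleet.local.org:8220", "https://fleet.example.org:8220"],
    fake_fetch,
)
# result == "ok"
```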

@ph ph removed their assignment Aug 11, 2022
@mukeshelastic mukeshelastic changed the title [Discuss] Using multiple fleet-servers Allow specifying different and multiple fleet servers in agent policy Aug 23, 2022
@nchaulet
Member

nchaulet commented Oct 3, 2022

@michel-laterman I am wondering if there is any Fleet Server work to do here, or if it's just a matter of populating fleet.hosts differently in the agent policy from Kibana, depending on what the user is configuring, and showing a different enroll command too.

@jlind23
Contributor

jlind23 commented Oct 4, 2022

@jen-huang Nima is assigned to this issue; shouldn't it be an engineer from your team instead?

@jlind23 jlind23 assigned nimarezainia and unassigned nimarezainia Oct 4, 2022
@nimarezainia

@jlind23 @jen-huang I think we should treat this as a discuss issue and the others as the implementation issues.

@jlind23 jlind23 changed the title Allow specifying different and multiple fleet servers in agent policy [META]Allow specifying different and multiple fleet servers in agent policy Oct 5, 2022
@jlind23 jlind23 added the Meta label Oct 5, 2022
@jlind23
Contributor

jlind23 commented Oct 5, 2022

Discussed this with @blakerouse and @michalpristas today: no changes are needed on fleet-server's end.
@nchaulet No different enrolment command is needed.

@nimarezainia

elastic/kibana#137785
