
[META]Allow specifying different and multiple fleet servers in agent policy #903

Closed
ruflin opened this issue Nov 19, 2021 · 27 comments
Labels: 8.6-candidate, Meta, Team:Fleet

Comments

@ruflin
Member

ruflin commented Nov 19, 2021

Over the past few weeks we have seen several users trying to run multiple fleet-servers. This issue is to discuss the different scenarios and share initial thoughts on how we could approach this. The goal is to come to a conclusion, feed the information into the documentation, and use it as a guideline for future features.

Core concepts

  • No load balancing: It is not Elastic Agent's job to load balance between multiple fleet-servers. This must be done at the infrastructure level through a proxy or DNS.
  • Failover: An Elastic Agent supports multiple fleet-server URLs for failover. By default the first URL is picked.
  • All info is shared across fleet-servers: Every fleet-server always has the same information, no matter where it is deployed.
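The failover concept can be sketched as follows (an illustration only, not actual Elastic Agent code; the URLs and the reachability check are made up):

```python
def pick_fleet_server(urls, is_reachable):
    """Return the first reachable fleet-server URL, treating list order as priority."""
    for url in urls:
        if is_reachable(url):
            return url
    return None  # nothing reachable; the agent would keep retrying


# Example: the first (local) fleet-server is down, so the agent
# falls back to the next URL in the list.
urls = ["https://fleet.local:8220", "https://fleet.example.com:8220"]
chosen = pick_fleet_server(urls, lambda u: "example.com" in u)
# chosen == "https://fleet.example.com:8220"
```

Note that this deliberately does no load balancing: with every server reachable, every agent picks the first URL.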

Scenarios

Below is a list of scenarios describing the expected behaviours that follow the above core concepts. Not all scenarios are supported today.

Scenario 1: Multiple fleet-servers, all Elastic Agents connect to all fleet-servers

In scenario 1 the user has multiple fleet-servers for redundancy or scale purposes. The user is expected to set up a proxy in front of the fleet-servers or use DNS to access the multiple fleet-servers. In the Fleet UI, a single fleet-server URL is used.

Scenario 2: Elastic Cloud only

The user connects all their Elastic Agents to the fleet-server in Elastic Cloud. As ESS already has a proxy in front and allows spinning up redundant fleet-servers, the setup already works as expected today.

Scenario 3: Elastic Cloud fleet-server and on-prem

The user is using Elastic Cloud with the fleet-server but also runs an on-prem fleet-server to have a fleet-server closer to their Elastic Agents. By default the user wants their local fleet-server to be used. In the Fleet UI, the user puts an additional fleet-server URL before the Elastic Cloud fleet-server URL. The user-defined URL is the one used by default. In case the local fleet-server URL is not reachable, Elastic Agents fall back to the Elastic Cloud fleet-server URL.

The local fleet-server is expected to keep its version in sync with the hosted fleet-server.

Scenario 4: Multi policy with multi data center

In this scenario the user has multiple data centers with local fleet-servers and policies specific to each data center. In the Fleet UI, a global fleet-server is specified, and in addition a fleet-server per policy can be specified. The fleet-server specified in the policy is the first in the list, so it will be picked as the default for all Elastic Agents that are part of the policy.
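An agent policy for scenario 4 might carry an ordered fleet-server list roughly like this (a hypothetical sketch; only the `fleet.hosts` key is mentioned later in this thread, and the exact shape and host names are made up):

```yaml
# Hypothetical agent policy fragment: the policy-specific fleet-server comes
# first, so agents on this policy pick it by default; the global fleet-server
# acts as the fallback.
fleet:
  hosts:
    - "https://fleet.dc1.internal:8220"    # data-center-local fleet-server (default)
    - "https://fleet.global.example:8220"  # global fleet-server (fallback)
```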

@ruflin
Member Author

ruflin commented Nov 19, 2021

@scunningham @joshdover @michel-laterman @nimarezainia Would be great to get your take on the above.

@ruflin ruflin added the Team:Elastic-Agent-Control-Plane label Nov 19, 2021
@ruflin ruflin self-assigned this Nov 19, 2021
@scunningham

@ruflin Makes sense. We could consider an automatic mode where the agent regularly pings each listed URL and determines the shortest route, either by round-trip time or traceroute. However, that seems like a nice-to-have, and I am not sure customers would use it.

I think the bigger concern is going to be the multitude of outbound connections from the integrations to Elasticsearch. There is very little volume to the Fleet Servers in comparison. Is the idea here to allow different ES outputs per policy?

@ruflin
Member Author

ruflin commented Nov 19, 2021

@scunningham To keep the discussion focused on fleet-server only, should we open a different issue / thread for the ES outputs per policy to set some guidelines there too?

@nimarezainia

Thanks @ruflin for this discussion. I think if we work towards defining what Scenario 4 would look like (i.e. FS per policy), it would address many of the multi-site use cases we see today. In the near future, with either Logstash support or the secure proxy concept, we could move towards having the control and data plane aggregated at a site, and reduce the number of connections coming out of some of these DC or DMZ sites.

What should be the next steps? I will clean up the multi-site document to be more generic and we can use that.

@SHolzhauer

Hi, wanted to pitch in regarding scenario 4.

A use case we are working/struggling with is not necessarily multiple data centers but different connection types.

There is one Elastic Stack running on premise (cloud, but self-managed) where the bulk of the servers use internal DNS/connections to connect to a load balancer with 2+ (HA) fleet servers.

There are also endpoints (laptops) for the employees with the agent installed. These laptops have an always-on VPN allowing them to connect via the internal DNS/connections (the same as the servers). However, when out of the office, when the VPN is disabled, or when a host is isolated, the internal connections are unavailable.

So it's a different setup, but allowing endpoints to be specified per agent policy should resolve the issues/setup complications.

@ruflin
Member Author

ruflin commented Nov 23, 2021

@SHolzhauer Thanks for chiming in. When the VPN is disabled, what fleet-server do you expect these Elastic Agents to connect to? Is it a global fallback one?

@nimarezainia I think first the teams (Elastic Agent / Fleet / Security) need to agree that the above scenarios are what we expect. Especially scenario 4 needs a bit more discussion, as it also impacts Fleet. But if everyone is aligned, we can get it moving.

@michel-laterman
Contributor

So my understanding of the proposals is that if we specify more than one server address, the agent will try them in order so that the 1st address (either from the settings or from the policy) will be attempted before any others (such as the cloud address if available)?

@ruflin
Member Author

ruflin commented Nov 24, 2021

@michel-laterman Correct. First one wins as long as it is reachable.

@SHolzhauer

@SHolzhauer Thanks for chiming in. When the VPN is disabled, what fleet-server do you expect these Elastic Agents to connect to? Is it a global fallback one?

@ruflin When the VPN is enabled we would expect it to use the internal fleet endpoint.
When the vpn is disabled or when the host is isolated we would expect it to use the external fleet endpoint.

But if we can specify a fleet and ES endpoint per agent policy we already have a solution.

@blakerouse
Contributor

Sorry, but the statement that the Elastic Agent doesn't currently do round-robin across multiple Fleet Servers when multiple URLs are provided is not true. Actually, for each request it tries to pick the last-used endpoint, as long as that endpoint has not had an error in the past X minutes.

See: https://github.com/elastic/beats/blob/b5e94143d774f2434432a578f7ef5bbd71002bca/x-pack/elastic-agent/pkg/remote/client.go#L241
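A rough model of that behaviour (illustrative Python, not the actual Go client linked above; the 10-minute window is an assumed stand-in for the "X minutes"):

```python
ERROR_WINDOW = 10 * 60  # assumed X = 10 minutes, in seconds

def next_host(hosts, last_used, last_error_at, now):
    """Reuse the last-used endpoint unless it errored within the window;
    otherwise fall through to the first recently-healthy host in the list."""
    def healthy(host):
        # A host with no recorded error counts as healthy.
        return now - last_error_at.get(host, float("-inf")) > ERROR_WINDOW

    if last_used in hosts and healthy(last_used):
        return last_used
    for host in hosts:
        if healthy(host):
            return host
    return hosts[0]  # every host errored recently; start over from the top
```

This stickiness is why the "first URL wins" priority semantics discussed above would require a behaviour change.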

@aarju

aarju commented Feb 9, 2022

Here is a real-world 'multi cloud' scenario we are looking at internally at Elastic. We want to migrate to using Fleet for security and o11y in Elastic Cloud. Elastic Cloud currently spans 53 different cloud regions across multiple providers. Each of these regions has its own stack into which we collect the o11y and security data. We then use Cross-Cluster Search so we can use a single 'overview' cluster to query and alert on data across all of the regional clusters. For a deep dive into our current configuration, you can check out this blog post.

The question is how do we migrate from the current filebeat and auditbeat system to using Fleet to deploy and manage Elastic Agents in a multi-region, multi-cloud environment? Instead of having 53 different isolated Fleet servers, we would like to have a single 'primary' fleet server that can manage the policies, agents, and integrations for all of the other fleet servers. Without that, if we need to use an integration such as OSQuery or Elastic Security's Host Isolation, we have to log into each of the 53 fleet servers to run the query.

@mostlyjason

@aarju today we can provide a single control plane per cluster in Fleet. So if you want to provide a single control plane over all 53 regions, you must enroll all those agents into a single cluster. Each cluster can have multiple Fleet Servers, which you can install in as many places as you want. For example, you could have a Fleet Server for each region. All those Fleet Servers connect back to a single Elasticsearch cluster, which coordinates across them.

Currently, the control plane and data plane must use the same cluster. However, we are planning to add Logstash output support in 8.2 elastic/kibana#104987, and remote Elasticsearch output shortly after. That means you could have a dedicated cluster for your control plane in Fleet, and ship the data to your regional clusters. You could also use CCS across that data.

There are a few more details to dive into (regional routing and keeping integrations in sync), but would this architecture meet your needs at a high level?

@aarju

aarju commented Feb 9, 2022

Thanks for the input @mostlyjason!

We are really looking forward to this capability in 8.2, I think that will help a lot with this problem.

Currently, the control plane and data plane must use the same cluster. However, we are planning to add Logstash output support in 8.2 elastic/kibana#104987, and remote Elasticsearch output shortly after. That means you could have a dedicated cluster for your control plane in Fleet, and ship the data to your regional clusters. You could also use CCS across that data.

@ruflin ruflin assigned ph and unassigned ruflin May 23, 2022
@ruflin
Member Author

ruflin commented May 23, 2022

Assigning this to @ph to take over the lead on it. I would still like to see us drive this to a conclusion on the long-term vision/goal, to guide us on any future implementations. @joshdover

@joshdover
Contributor

It seems there is a general preference for exploring option 4: being able to assign Fleet Server host(s) per agent policy, which generally fits well with how we expect users to use agent policies to group agents. IMO it makes sense to start there and see which use cases we can't solve with this solution.

@joshdover
Contributor

joshdover commented May 23, 2022

Think I jumped the gun a bit on this one. Looking closer at https://github.com/elastic/elastic-agent/blob/ce95d6b5f36a43a927112517e91885570305e219/internal/pkg/remote/client.go#L222, it appears that we do currently round-robin Fleet Servers and choose the host that was used least recently. We would need to stop doing that before exploring option 4.

@nimarezainia

Think I jumped the gun a bit on this one. Looking closer at https://github.com/elastic/elastic-agent/blob/ce95d6b5f36a43a927112517e91885570305e219/internal/pkg/remote/client.go#L222, it appears that we do currently round-robin Fleet Servers and choose the host that was used least recently. We would need to stop doing that before exploring option 4.

@joshdover
if the fleet server list is per policy, the behavior you describe here would pertain to those servers in that policy, correct?

@scunningham

I have no objection to binding fleet server selection per policy; however, endpoints do roam, and a secondary selection methodology should be implemented to properly select from multiple fleet servers within the context of a policy. I would prefer to see a user-prioritized list with fallback to lower priority on failure. I am thinking in particular of laptops that roam in/out of a VPN regularly, or customers that travel across regions.

@nimarezainia

@scunningham we have a requirement from the endpoint team to dynamically join agents to a policy for pretty much the same use case you mention above. One use case, I believe, was when someone travels from one location to another: there may be different security requirements at the new location. Is that correct?

If a policy has a defined set of Fleet Servers, and we could switch agents between policies - do you see us needing the prioritized list with fallback?

@scunningham

I suppose that's another way to do it. We did a similar design at Endgame where Agents could be automatically assigned to a specific policy by applying a prioritized set of rules to metadata regularly. We did, however, also have the concept of statically assigned agents; those for which the policy would never change. This was desirable for certain classes of user personas, for example C level executives. In those cases you may still need a mechanism to prioritize amongst a set of fleet servers.

Even within a policy, it may be the case that not all fleet servers are treated equally: for example, I want the customer to try to hit the VPN fleet server first and, only if it is not accessible, fall back to a regional or global fleet server.

@joshdover
Contributor

I personally prefer we decouple the dynamic policy assignment feature from having a prioritized Fleet Server hosts list. The latter is generically useful for simpler deployments that don't need dynamic policy assignments, and for the case Sean mentioned above. A prioritized list is also likely much lower effort to deliver than dynamic policy assignment, and would unblock releasing documentation/guidance on this.

@nicpenning

Hello all, I would like to drop a little feedback here with what I have experienced.

I am in the first camp: Scenario 1 where we have multiple agents and multiple Fleet servers all hosted on-premise (0 cloud).

1 Fleet server in the DMZ (DNS fleet.example.org) and 1 Fleet server on the LAN (fleet.local.org).

The current issue (as of version 8.3.1) is that when agents leave the LAN and connect from hostile networks, they cannot resolve the fleet.local.org address. Instead of failing over to the DMZ node at fleet.example.org, the agent gets stuck on the failure. The expected behavior is that if an agent cannot resolve the DNS name of one server, or fails to connect in any way, it tries the other Fleet servers it knows about.

More specifically on the DNS error, I see this:

[elastic_agent][warn] DNS lookup failure "fleet.org.local": lookup fleet1.state.sd.local: no such host
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post "https://fleet.org.local:8220/api/fleet/agents/redacted/checkin?": lookup fleet1.state.sd.local: no such host

The only workaround is getting the system back onto the local network so it can connect to the internal Fleet server, or rebooting the device and hoping it will hit the DMZ node and connect.

Unless I am missing something, it doesn't seem that the Elastic Agent will try other Fleet servers when it fails to look up a host. Perhaps there are some Fleet settings I am unaware of that can change this behavior? Either way, I wanted to chime in here, since this is a much-needed capability: ideally, Fleet server traffic to the DMZ nodes would come only from external assets, with the internal Fleet server(s) serving internal assets.
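The expected behaviour described here could be sketched like this (an illustration of the expectation, not the agent's current code; `fetch` is a hypothetical check-in call that raises `OSError`, which `socket.gaierror` for DNS failures subclasses):

```python
import socket

def checkin(urls, fetch):
    """Try each fleet-server URL in order; any OSError (including a DNS
    lookup failure, socket.gaierror) advances to the next URL instead of
    getting stuck retrying the same unreachable host."""
    last_err = None
    for url in urls:
        try:
            return fetch(url)
        except OSError as err:
            last_err = err  # remember the failure and try the next host
    raise ConnectionError("no fleet-server reachable") from last_err


# Example with a fake fetch: the LAN host fails DNS, the DMZ host answers.
def fake_fetch(url):
    if "local" in url:
        raise socket.gaierror("no such host")
    return "ok"

result = checkin(
    ["https://fleet.local.org:8220", "https://fleet.example.org:8220"],
    fake_fetch,
)
# result == "ok"
```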

@ph ph removed their assignment Aug 11, 2022
@mukeshelastic mukeshelastic changed the title [Discuss] Using multiple fleet-servers Allow specifying different and multiple fleet servers in agent policy Aug 23, 2022
@nchaulet
Member

nchaulet commented Oct 3, 2022

@michel-laterman I am wondering if there is any Fleet Server work to do here, or if it's just a matter of populating fleet.hosts differently in the agent policy from Kibana, depending on what the user is configuring, and showing a different enroll command too.

@jlind23
Contributor

jlind23 commented Oct 4, 2022

@jen-huang Nima is assigned to this issue; shouldn't it be an engineer from your team instead?

@jlind23 jlind23 assigned nimarezainia and unassigned nimarezainia Oct 4, 2022
@nimarezainia

@jlind23 @jen-huang I think we should treat this as a discuss issue and the others as the implementation issues.

@jlind23 jlind23 changed the title Allow specifying different and multiple fleet servers in agent policy [META]Allow specifying different and multiple fleet servers in agent policy Oct 5, 2022
@jlind23 jlind23 added the Meta label Oct 5, 2022
@jlind23
Contributor

jlind23 commented Oct 5, 2022

Discussed this with @blakerouse and @michalpristas today: no changes are needed on fleet-server's end.
@nchaulet No different enrolment command is needed.

@nimarezainia

elastic/kibana#137785
