[META] Allow specifying different and multiple fleet servers in agent policy #903
Comments
@scunningham @joshdover @michel-laterman @nimarezainia Would be great to get your take on the above.
@ruflin Makes sense. We could consider an automatic mode where the agent would regularly ping each listed URL and determine the shortest route, either by round-trip time or traceroute. However, that seems like a nice-to-have, and I'm not sure customers would use it. I think the bigger concern is going to be the multitude of outbound connections from the integrations to Elasticsearch. There is very little volume to the Fleet Servers in comparison. Is the idea here to allow different ES outputs per policy?
@scunningham To keep the discussion focused on fleet-server only, should we open a different issue / thread for the ES outputs per policy to set some guidelines there too?
thanks @ruflin for this discussion. I think if we work towards defining what Scenario 4 would look like - i.e. FS per policy - it would address many of the multi-site use cases we see today. In the near future, with either Logstash support or the secure proxy concept, we could move towards having the control and data planes aggregated at a site and reduce the number of connections coming out of some of these DC or DMZ sites. What should be the next steps? I will clean up the multi-site document to be more generic and we can use that.
Hi, wanted to pitch in regarding Scenario 4. A use case we are working/struggling with is not necessarily multiple data centers but different connection types. There is one Elastic Stack running on premise (cloud but self-managed) where the bulk of the servers use internal DNS/connections. There are also endpoints (laptops) for the employees with the agent installed. These laptops have an always-on VPN allowing them to connect to the internal DNS/connections (the same as the servers). However, when out of office, with the VPN disabled, or when you isolate a host, the internal connections are unavailable. So it's a different setup, but allowing endpoints to be specified per agent policy should resolve the issues/setup complications.
@SHolzhauer Thanks for chiming in. When the VPN is disabled, what fleet-server do you expect these Elastic Agents to connect to? Is it a global fallback one? @nimarezainia I think first, the teams (Elastic Agent / Fleet / Security) need to agree that the above scenarios are what we expect. Scenario 4 especially needs a bit more discussion, as it also impacts Fleet. But if everyone is aligned, we can get it moving.
So my understanding of the proposals is that if we specify more than one server address, the agent will try them in order so that the 1st address (either from the settings or from the policy) will be attempted before any others (such as the cloud address if available)? |
@michel-laterman Correct. First one wins as long as it is reachable.
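The "first one wins" semantics discussed above can be sketched as a simple ordered scan. This is an illustrative sketch, not the actual agent implementation; the function name and the injected reachability check are invented for the example:

```go
package main

import "fmt"

// firstReachable returns the first host in the ordered list that passes the
// reachability check — the "first one wins as long as it is reachable"
// semantics. The reachability check is injected so the selection logic can
// be tested without real network calls.
func firstReachable(hosts []string, reachable func(string) bool) (string, bool) {
	for _, h := range hosts {
		if reachable(h) {
			return h, true
		}
	}
	return "", false
}

func main() {
	hosts := []string{
		"https://fleet.local.example:8220", // e.g. from the policy, tried first
		"https://fleet.cloud.example:8220", // e.g. cloud fallback
	}
	// Pretend the local fleet-server is unreachable.
	up := map[string]bool{"https://fleet.cloud.example:8220": true}
	h, ok := firstReachable(hosts, func(h string) bool { return up[h] })
	fmt.Println(h, ok)
}
```

Keeping the list ordered (rather than load-balanced) is what lets a policy-provided address take precedence over the cloud address.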
@ruflin When the VPN is enabled, we would expect it to use the internal fleet endpoint. But if we can specify a fleet and ES endpoint per agent policy, we already have a solution.
Sorry, but the statement that the Elastic Agent doesn't currently do round-robin across multiple Fleet Servers when multiple URLs are provided is not true. Actually, for each request it tries to pick the last used endpoint, as long as it has not had an error in the past X minutes.
Here is a real world 'multi cloud' scenario we are looking at internally in Elastic. We want to migrate to using Fleet for security and o11y in Elastic Cloud. Elastic Cloud currently spans 53 different cloud regions across multiple providers. Each of these regions has its own stack where we collect the o11y and security data. We then use Cross Cluster Search so we can use a single 'overview' cluster to query and alert on data across all of the regional clusters. For a deep dive of our current configuration you can check out this blog post. The question is how do we migrate from the current filebeat and auditbeat system to using Fleet to deploy and manage elastic agents in a multi-region, multi-cloud environment? Instead of having 53 different isolated Fleet servers we would like to have a single 'primary' fleet server that can manage the policies, agents, and integrations for all of the other fleet servers. Without that, if we need to use an integration such as OSQuery or Elastic Security's Host Isolation, we have to log into each of the 53 fleet servers to run the query.
@aarju today we can provide a single control plane per cluster in Fleet. So if you want to provide a single control plane over all 53 regions, you must enroll all those agents into a single cluster. Each cluster can have multiple Fleet Servers, which you can install in as many places as you want. For example, you could have a Fleet Server for each region. All those Fleet Servers connect back to a single Elasticsearch cluster, which coordinates across them. Currently, the control plane and data plane must use the same cluster. However, we are planning to add Logstash output support in 8.2 elastic/kibana#104987, and remote Elasticsearch output shortly after. That means you could have a dedicated cluster for your control plane in Fleet, and ship the data to your regional clusters. You could also use CCS across that data. There are a few more details to dive into (regional routing and keeping integrations in sync), but would this architecture meet your needs at a high level?
Thanks for the input @mostlyjason! We are really looking forward to this capability in 8.2, I think that will help a lot with this problem.
Assigning this to @ph to take over the lead on it. I would still like to see us drive this to a conclusion on the long-term vision / goal to guide any future implementations. @joshdover
It seems there is a general preference for exploring option 4 - being able to assign Fleet Server host(s) per agent policy - which generally fits well with how we expect users to use agent policies to group agents. IMO it makes sense to start there and see which use cases we can't solve with this solution.
Think I jumped the gun a bit on this one. Looking closer at https://github.com/elastic/elastic-agent/blob/ce95d6b5f36a43a927112517e91885570305e219/internal/pkg/remote/client.go#L222, it appears that we do currently round-robin Fleet Servers and choose the host that was used least recently. We would need to stop doing that before exploring option 4.
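For context, the least-recently-used selection described above can be sketched roughly as follows. This is a simplified illustration of the behaviour, not the actual `client.go` code; the type and method names are invented for the example:

```go
package main

import (
	"fmt"
	"time"
)

// hostPicker chooses the host that was used least recently — a simplified
// sketch of the round-robin behaviour referenced above. A host that has
// never been used has a zero last-used time and so is picked first.
type hostPicker struct {
	hosts    []string
	lastUsed map[string]time.Time
}

func newHostPicker(hosts []string) *hostPicker {
	return &hostPicker{hosts: hosts, lastUsed: make(map[string]time.Time)}
}

func (p *hostPicker) pick() string {
	var best string
	var bestTime time.Time
	first := true
	for _, h := range p.hosts {
		t := p.lastUsed[h] // zero time if never used
		if first || t.Before(bestTime) {
			best, bestTime, first = h, t, false
		}
	}
	p.lastUsed[best] = time.Now()
	return best
}

func main() {
	p := newHostPicker([]string{"fleet-a:8220", "fleet-b:8220"})
	fmt.Println(p.pick()) // fleet-a:8220 (never used)
	fmt.Println(p.pick()) // fleet-b:8220 (least recently used)
	fmt.Println(p.pick()) // fleet-a:8220 again
}
```

The point of the comment above is that this balancing behaviour conflicts with option 4, where the per-policy host must be preferred in order rather than rotated.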
@joshdover
I have no objection to binding fleet server selection per policy; however, endpoints do roam, and a secondary selection methodology should be implemented to properly select from multiple fleet servers within the context of a policy. I would prefer to see a user-prioritized list with fallback to lower priority on failure. I am thinking in particular of laptops that roam in/out of a VPN regularly, or customers that travel across regions.
@scunningham we have a requirement from the endpoint team to dynamically join agents to a policy for pretty much the same use case you mention above. One use case, I believe, was when someone travels from one location to another, there may be different security requirements at the new location. Is that correct? If a policy has a defined set of Fleet Servers, and we could switch agents between policies - do you see us needing the prioritized list with fallback?
I suppose that's another way to do it. We did a similar design at Endgame where agents could be automatically assigned to a specific policy by regularly applying a prioritized set of rules to metadata. We did, however, also have the concept of statically assigned agents; those for which the policy would never change. This was desirable for certain classes of user personas, for example C-level executives. In those cases you may still need a mechanism to prioritize amongst a set of fleet servers. Even within a policy, it may be the case that not all fleet servers are treated equally: for example, I want the customer to try the VPN fleet server first, and only if it is not accessible fall back to a regional or global fleet server.
I personally prefer we decouple the dynamic policy assignment feature from having a prioritized Fleet Server hosts list. The latter is generically useful for simpler deployments that don’t need dynamic policy assignments, and for the case Sean mentioned above. A prioritized list is also likely much lower effort than dynamic policy assignment to deliver, and would unblock releasing documentation/guidance on this.
Hello all, I would like to drop a little feedback here with what I have experienced. I am in the first camp: Scenario 1, where we have multiple agents and multiple Fleet servers all hosted on-premise (0 cloud): 1 Fleet server in the DMZ (DNS fleet.example.org) and 1 Fleet server on the LAN (fleet.local.org). The current issue (as of version 8.3.1) is that agents that leave the LAN and connect from hostile networks cannot resolve the fleet.local.org address. Instead of failing and moving on to the DMZ node at fleet.example.org, the agent is stuck on the failure. The expected behavior is that if an agent cannot resolve the DNS name of one server, or fails to connect in any way, it then tries the other Fleet servers it knows about. More specifically on the DNS error, I see this:
The only workaround is getting the system back onto the local network so it can connect to the internal Fleet server, or rebooting the device and hoping it will hit the DMZ node and connect. Unless I am missing something, it doesn't seem that the Elastic Agent will try other Fleet servers when it fails to look up a host. Perhaps there is some Fleet setting I am unaware of that can change this behavior? Either way, I wanted to chime in here since this is a much-needed capability: ideally we would like to minimize Fleet server traffic to the DMZ nodes to external assets, and use the internal Fleet server(s) for internal assets.
@michel-laterman I am wondering if there is any fleet-server work to do here, or if it's just a matter of populating differently
@jen-huang Nima is assigned to that issue, shouldn't it be an engineer from your team instead?
@jlind23 @jen-huang I think we should treat this as the discussion issue and the others as the implementation issues.
Discussed this with @blakerouse and @michalpristas today; no changes are needed on fleet-server's end.
Over the past few weeks we have seen several users trying to run multiple fleet-servers. This issue is to discuss the different scenarios and share initial thoughts on how we could approach this. The goal is to come to a conclusion, feed the information into the documentation, and have it as a guideline for future features.
Core concepts
Scenarios
Below is a list of scenarios describing the expected behaviours that follow the above core concepts. Not all scenarios are supported today.
Scenario 1: Multiple fleet-servers, all Elastic Agents connect to all fleet-servers
In scenario 1 the user has multiple fleet-servers for redundancy or scaling purposes. The user is expected to set up a proxy in front of the fleet-servers or use DNS to access the multiple fleet-servers. In the Fleet UI, a single fleet-server URL is used.
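As a sketch, in a self-managed setup that single URL would point at the proxy or DNS name that fronts the fleet-servers. The setting name below is taken from self-managed Kibana configuration and may vary by stack version, so treat it as an assumption to verify against your version's docs:

```yaml
# kibana.yml — Scenario 1 sketch: a single fleet-server URL that resolves
# to a proxy / DNS name load-balancing across the redundant fleet-servers.
# (Setting name may differ across stack versions; verify before use.)
xpack.fleet.agents.fleet_server.hosts: ["https://fleet.example.org:8220"]
```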
Scenario 2: Elastic Cloud only
The user connects all of its Elastic Agents to the fleet-server in Elastic Cloud. As ESS already has a proxy in front and allows spinning up redundant fleet-servers, the setup already works as expected today.
Scenario 3: Elastic Cloud fleet-server and on prem
The user is using Elastic Cloud with the fleet-server but also runs an on-prem fleet-server to have a fleet-server closer to its Elastic Agents. By default the user wants the local fleet-servers to be used. In the Fleet UI, the user puts an additional fleet-server URL before the Elastic Cloud fleet-server URL. The user-defined URL is the one used by default. In case the local fleet-server URL is not reachable, Elastic Agents fall back to the Elastic Cloud fleet-server URL.
The local fleet-server is expected to keep its version in sync with the hosted fleet-server.
Scenario 4: Multi policy with multi data center
In this scenario the user has multiple data centers with local fleet-servers and policies specific to each data center. In the Fleet UI, a global fleet-server is specified, and in addition a fleet-server per policy can be specified. The fleet-server specified in the policy is first in the list, so it will be picked as the default by all Elastic Agents that are part of the policy.
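A purely hypothetical sketch of what Scenario 4 could look like at the policy level; the field name and structure below do not exist today and are invented here only to illustrate the proposal:

```yaml
# Hypothetical agent policy fragment for Scenario 4 (illustrative only —
# the fleet_server_hosts field is not an existing API).
name: dc-eu-west-policy
fleet_server_hosts:
  - https://fleet.dc-eu-west.example:8220   # per-policy fleet-server, tried first
  - https://fleet.global.example:8220       # global fleet-server, fallback
```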