Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Add agent policy API to add settings not yet supported by UI #158699

Closed
3 tasks
juliaElastic opened this issue May 31, 2023 · 20 comments · Fixed by #159414
Closed
3 tasks

[Fleet] Add agent policy API to add settings not yet supported by UI #158699

juliaElastic opened this issue May 31, 2023 · 20 comments · Fixed by #159414
Assignees
Labels
Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@juliaElastic
Copy link
Contributor

juliaElastic commented May 31, 2023

To increase supportability, Fleet should support a way to add custom settings to agent policies.
This will enable unblocking clients when a policy change is required to fix agents managed by Fleet.
Today this is not easy because Fleet UI/API only enables updating certain fields in agent policy.

Use cases:
Global parameters on the agent policy (monitoring, timeout settings)

Tasks:

  • Create API: POST /api/fleet/agent_policies/:id/custom_settings
    • The request body should be key value pairs that are going to be appended to the agent policy
  • Make sure existing agent policy API updates don't remove custom settings
  • The custom settings should be visible in the View policy UI as yaml

Open question:

  • Do we need any whitelisting on which fields can be added as custom settings?
  • Do we allow to override existing fields in custom settings e.g. agent.download.sourceURI, outputs?

cc @joshdover

@juliaElastic juliaElastic added Team:Fleet Team label for Observability Data Collection Fleet team Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. labels May 31, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Copy link
Contributor

The custom settings should be visible in the View policy UI as yaml

I think they also need to be visible at the bottom of the "settings" tab of the Agent policy. It's relevant there because these settings could override other things in the policy and create issues. I'd even argue they should be shown in a yellow callout banner w/ a link to API documentation on how to remove them.

/api/fleet/agent_policies/:id/custom_settings

Looking at this again I think the API should be /api/fleet/agent_policies/:id/overrides instead to make it more obvious this is out-of-the-norm option.

  • Do we need any whitelisting on which fields can be added as custom settings?
  • Do we allow to override existing fields in custom settings e.g. agent.download.sourceURI, outputs?

I think we should make this API quite flexible so it can be used to workaround any bugs in Fleet. If we restrict this too much it will lose it's value. That said, I think it may make sense to not support any keys with the inputs prefix as it's highly unlikely these overrides will work for overriding keys in that array and we have other ways to ship fixes out-of-band to that part of the config (a prerelease version of a package).

@joshdover
Copy link
Contributor

cc @cmacknz

@juliaElastic
Copy link
Contributor Author

I think they also need to be visible at the bottom of the "settings" tab of the Agent policy

Do you mean as a read only box or as a textarea to update? The latter increases the scope slightly to add yaml validation, something like we have in Fleet Server integration policy "Advanced YAML settings"

@joshdover
Copy link
Contributor

Readonly is fine IMO

@joshdover
Copy link
Contributor

I would even prefer we break this into two phases and do 0 UI changes in the first phase. Most important thing is to unblock fixing customer deployments. Next most important thing is helping users know when they're still using these overrides.

@joshdover joshdover changed the title [Fleet] Add agent policy API to save custom settings [Fleet] Add agent policy API to add settings not yet supported by UI May 31, 2023
@nchaulet
Copy link
Member

nchaulet commented Jun 5, 2023

Do we need a new API for that could it be an extra property on the existing agent policy API?

POST /agent_policies
{
"overrides": {
"key": "val"
}
}

@juliaElastic
Copy link
Contributor Author

Do we need a new API for that could it be an extra property on the existing agent policy API?

We could, though perhaps the intention was to keep the overrides as an escape hatch and not include in the main API as it is not intended for normal use.

@nchaulet nchaulet self-assigned this Jun 6, 2023
@nchaulet
Copy link
Member

nchaulet commented Jun 7, 2023

We could, though perhaps the intention was to keep the overrides as an escape hatch and not include in the main API as it is not intended for normal use.

I think with or without a new API this should be pointed in the API doc. I do not have a strong argument for a new API or reuse the policy API for that. Reusing the policy API is maybe a little simpler to implement/maintains.

@jlind23
Copy link
Contributor

jlind23 commented Jun 7, 2023

I think with or without a new API this should be pointed in the API doc.

I agree with that.

I do not have a strong argument for a new API or reuse the policy API for that

Aren't we mixing things here? I would rather add a new experimental API, that could lately be removed if needed. Experimental will also avoid us officially supporting it.

@joshdover
Copy link
Contributor

One reason we were thinking that it should be a separate API is to prevent accidental removals of any custom settings. For example, if we use the existing API we have to make sure that PUT /agent_policies from our agent policy settings page doesn't forget to include the previous overrides. This is just one example, others would be impacted as well.

I do agree it would be more straightforward though to not handle this as a special case with a separate API. Another option could be to use the existing API but require that clients explicitly set an overrides key to null in order to remove these settings on PUTs.

@nchaulet
Copy link
Member

nchaulet commented Jun 8, 2023

One reason we were thinking that it should be a separate API is to prevent accidental removals of any custom settings. For example, if we use the existing API we have to make sure that PUT /agent_policies from our agent policy settings page doesn't forget to include the previous overrides. This is just one example, others would be impacted as well.

Actually all of the PUT API fleet API are in fact PATCH so if the user do not explicitly send a null value it will not be cleared.

@lucabelluccini
Copy link
Contributor

To improve supportability:

  1. Can we highlight a custom setting has been pushed to Elastic Agents in the diagnostic?
    The idea is to immediately spot if a user is adding custom settings. As the custom settings will not be validated and they can be any setting we usually allow in a Beat product, it is crucial to spot if a user added them without manually checking every single rendered configuration on Elastic Agent.

  2. Can we ensure we have a documentation and (when it will exist) a warning in the custom setting section of the policy UI to warn we do not test against any possible combination of settings which will be passed down to the components of Elastic Agent and things can break?

  3. Clearly state the maturity (tech preview, etc...) of this feature in docs so that users will not abuse of this?

@jlind23
Copy link
Contributor

jlind23 commented Jun 13, 2023

Can we highlight a custom setting has been pushed to Elastic Agents in the diagnostic?
The idea is to immediately spot if a user is adding custom settings. As the custom settings will not be validated and they can be any setting we usually allow in a Beat product, it is crucial to spot if a user added them without manually checking every single rendered configuration on Elastic Agent.

Custom settings will be set the same way as standard settings so i'm not sure this is something we should highlight. However we might probably want to create an audit log entry for this.

We should probably pull @kilfoyle and @karenzone to discuss the docs options here.

@lucabelluccini
Copy link
Contributor

Custom settings will be set the same way as standard settings so i'm not sure this is something we should highlight.

Yes, but those settings are not getting validated (at least from what I understood from the issue) and they can break the behavior (we will not test any possible combination of them + they will possibly override some defaults we are using).

E.g. if a user adds:

queue.disk:
  max_size: 10GB

What will happen?
If this setting gets pushed down from EA to components:

  • Each component might generate a queue consuming up to 10GB (they're not preallocated but still could happen)?
  • When we upgrade, are the queues going to be copied over?

@cmacknz
Copy link
Member

cmacknz commented Jun 13, 2023

If a user adds queue.disk to the agent policy today, nothing will happen because the agent policy doesn't support queue configuration yet. This would just be ignored.

If elastic/beats#35615 is implemented as planned, then in theory they could try to configure the disk queue before it is supported properly (sharing the queue across upgrades for example), and the agent should probably refuse to configure it in that case.

In general if someone puts valid YAML into the agent policy for a feature or directive that isn't supported it will be silently ignored. This is the same behavior Beats has. This probably isn't ideal but fixing this is challenging, ideally we would generate an error or warning when this happens.

@lucabelluccini
Copy link
Contributor

If a user adds queue.disk to the agent policy today, nothing will happen because the agent policy doesn't support queue configuration yet. This would just be ignored.

Thank you @cmacknz I agree - I am aware the queue settings are ignored at the moment.

But if we implement this feature - we are opening the door to users possibly expecting any setting which is injected via the YAML section will work.

We already had situations where:

  • users struggle to know if a setting is taken in account or not (due to the fact Beats doesn't validate the settings) - customers add a setting with a typo, it is silently accepted but it's actually not used because it is never parsed by the binaries/components
  • users might use valid settings (example queue.disk) to overcome to some current limitations with Elastic Agent and expect it to work in production, while we're introducing this setting to apply some "light" workaround

That's why I would like to make it clear in the diagnostics and Elastic Agent policy the custom settings added by the user are visible as it might be crucial to spot if the user injected some "unsupported" setting in the freeform YAML.

@kilfoyle
Copy link
Contributor

Regarding the docs:

Can we ensure we have a documentation and (when it will exist) a warning in the custom setting section of the policy UI to warn we do not test against any possible combination of settings which will be passed down to the components of Elastic Agent and things can break?

As far as I know the only API docs we have currently are these generated API docs which will include the newly added API descriptions, but when the settings UI is updated we can certainly also add some warning text to the corresponding Fleet UI settings page.

@nimarezainia
Copy link
Contributor

nimarezainia commented Jun 21, 2023

Would it be possible to get the full list of what possible configuration can be applied here?

My initial gut feel is that this is only a tool for Support/Engineering to resolve customer issues and not something that we want to publicize too much in our documentation. I would like to avoid unintended consequences of users applying a config to the policy.

@juliaElastic
Copy link
Contributor Author

juliaElastic commented Jun 27, 2023

One use case to use this API is to configure the agent download timeout: elastic/elastic-agent#4580

To set download timeout:

  1. Create an agent policy
  2. Take the agent policy id, name, namespace (mandatory for overrides request)
  3. Update agent policy with overrides
PUT kbn:/api/fleet/agent_policies/24292680-158e-11ee-b8ce-f39328f70950
{
   "name": "Test policy",
    "namespace": "default",
    "overrides": {
         "agent": {
              "download": {
                "timeout": "120s"
              }
            }
    }
}
  1. Go to Fleet / Agent policy / Click policy / View policy
  2. Verify that the download timeout is set under agent
image
  1. TODO: Verify that the download timeout applies when the agent is being upgraded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants