
[Fleet] Allow to reset agents log level to honor agent policy log level #180778

Closed
5 tasks done
criamico opened this issue Apr 15, 2024 · 26 comments · Fixed by #183434
Assignees
Labels
QA:Ready for Testing Code is merged and ready for QA to validate Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@criamico
Contributor

criamico commented Apr 15, 2024

Follow-up of #158861

With #158861 we added a select in Agent policy settings that allows the user to set the log level per policy. However, if the user has also changed the log level at the agent level, this overrides the previous setting and the agent policy doesn't have control of the log level for that agent (see this comment).

  • Implement a reset mechanism that resets the agents' log level to match what's in the agent policy. See @kpollich's comment about possible ways to implement this mechanism
    • Verify that fleet server supports this functionality by passing it to elastic agent
    • Verify that the elastic agent supports this functionality
    • Verify that a fleet server agent supports this functionality
  • Enable visibility of the Agent policy setting by switching this flag to true. It is currently set to false.
@criamico criamico added the Team:Fleet Team label for Observability Data Collection Fleet team label Apr 15, 2024
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@lucabelluccini
Contributor

Hello 👋 @criamico
I would like to ask if we will follow/allow the current behavior:

  • Elastic agents will default to INFO
  • It will be possible to set a default logging level at Elastic Agent policy level (e.g. Policy X log level is ERROR)
  • It will be possible to set a temporary log level for a single Elastic Agent (valid until next restart of EA or manual reset via API)
  • It will be possible to set a temporary log level for multiple Elastic Agents (valid until next restart of EA or manual reset via API)

@nimarezainia
Contributor

@lucabelluccini this particular issue will not be changing the default behavior.
It would allow the user to set the log level for all agents in Policy X to the same level.
You can still set the log level on the agent directly (details page), and that will override the policy. This issue is about how we reset that override so the agent goes back to whatever the policy has.

To address your last point (there's no direct way):

  1. Go to the details page of each agent and set it there (what you can do today).
  2. In the Fleet UI, select multiple agents, move them to a new policy, and apply the log level (needs these new changes).

hope this makes sense.

@nimarezainia
Contributor

@kpollich @criamico lmk if this can be done in sprint 28 - I am not sure of the level of effort. thank you.

@kpollich
Member

Level of effort is unclear without some investigation on how passing a null log level through the agent action is handled by Fleet Server today. We could probably move this into the current sprint 28 to do that investigation, but if it turns out we need to make changes to Fleet Server to support this kind of reset I doubt we'll be able to finish the implementation in this sprint.

@kpollich kpollich self-assigned this Apr 16, 2024
@criamico
Contributor Author

I am not sure of the level of effort

Besides what @kpollich said, we also need to define the UI for this change. I think that we have two possible ways:

  1. When the user chooses the log level in agent policy settings, we also reset the log level for all the agents on the policy. In this case we should inform the user about this. Something like "This setting will reset the log level for all the agents on the policy"

  2. We give the user a choice:

  • The user selects the log level in agent policy settings and submits it
  • We show a modal asking the user if they want to reset the log level policy for all the agents on the policy


@kpollich
Member

We could also do the reset on a per-agent basis, so there'd be a new option in the individual agent log level dropdown to "reset". I think since the only path to diverging an agent's log level from its parent policy's is to select an option in that dropdown, it'd be okay to expose the reset function as a single-agent operation, basically in reverse of the initial change.

The common workflow here as I understand it:

  1. User creates an agent policy with log level info
  2. User notices one agent misbehaving
  3. User checks agent's logs within fleet ui
  4. User changes log level for that agent to debug
  5. User sees helpful logs at the debug level, takes corrective action
  6. User resets the agent's log level back to the policy default

We'll also want to make sure this option is obvious when the agent-level setting is divergent from its policy, so maybe this should live outside of the dropdown and be a button instead. It'd be easy to set the agent back to info but still be divergent, as the agent-level setting is still selected.

@juliaElastic
Contributor

juliaElastic commented May 3, 2024

I tried using null in the settings action to reset the log level of an agent. (The same action is used when changing the log level in the agent details UI.)
First, I had to adjust the actions API schema to get past the validation and allow null (schema.nullable).

```
POST kbn:/api/fleet/agents/93636369-e6fc-44f6-9049-c056f03c1ccd/actions
{"action":{"type":"SETTINGS","data":{"log_level":null}}}
```

```
{
  "statusCode": 400,
  "error": "Bad Request",
  "message": """[request body.action]: types that failed validation:
- [request body.action.1.data.log_level]: types that failed validation:
 - [request body.action.data.log_level.0]: expected value to equal [debug]
 - [request body.action.data.log_level.1]: expected value to equal [info]
 - [request body.action.data.log_level.2]: expected value to equal [warning]
 - [request body.action.data.log_level.3]: expected value to equal [error]"""
}
```

Then the action worked, but fleet-server gave an error (the agent went unhealthy). So fleet-server has to be updated to accept resetting the log level. The error comes from here. The change looks simple, but I'm not sure if there is any complexity in resetting the log level to the one that comes from the agent policy. @nchaulet or @michel-laterman could probably estimate better.

```
Actions: invalid log level, expected debug|info|warning|error and received ''
```
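The schema relaxation Julia describes can be sketched as a standalone validator. This is illustrative only: the actual Kibana code uses `@kbn/config-schema` (e.g. `schema.nullable(...)` around the level literals), while the names and helper below are assumptions for the sake of a runnable example.

```typescript
// The four levels the original schema accepted.
type LogLevel = "debug" | "info" | "warning" | "error";

const LOG_LEVELS = ["debug", "info", "warning", "error"] as const;

// Hypothetical standalone equivalent of wrapping the level schema in
// schema.nullable(...): a concrete level stays valid, and null now
// passes validation (interpreted as "reset to the policy level")
// instead of producing the 400 shown above.
function isValidLogLevelValue(value: unknown): value is LogLevel | null {
  return (
    value === null ||
    (typeof value === "string" &&
      (LOG_LEVELS as readonly string[]).includes(value))
  );
}
```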

One more thing that has to be fixed in the UI:
The agent log level defaults to info on the Agent details page. This means that even if the agent policy is set to something else, e.g. Warning, we can't change the individual agent's log level to info in the UI, because the Apply changes button is not shown.


@kpollich
Member

kpollich commented May 3, 2024

Thanks, @juliaElastic for taking a look. I think we have what we need to get started on an implementation here next sprint.

@nchaulet
Member

nchaulet commented May 3, 2024

The change in Fleet Server doesn't look too complicated, but I guess we will have to make a similar change in elastic-agent for the null settings value to be handled there too: https://github.com/elastic/elastic-agent/blob/fd7984b1d70dc968ba67fb8f4221905e508d6a06/internal/pkg/agent/application/actions/handlers/handler_action_settings.go#L41

@kpollich kpollich removed their assignment May 3, 2024
@nimarezainia
Contributor

When the "reset" is set on the agent, I guess the log level will revert only when the policy payload is received from Fleet. Or is the agent storing the log level value that was received in the last policy payload?

@kpollich
Member

kpollich commented May 6, 2024

When the "reset" is set on the agent, I guess the log level will revert only when the policy payload is received from Fleet. Or is the agent storing the log level value that was received in the last policy payload?

There's a dedicated action for changing the log level of an agent, so "resetting" it will happen in real time without a policy update.

@nimarezainia
Contributor

When the "reset" is set on the agent, I guess the log level will revert only when the policy payload is received from Fleet. Or is the agent storing the log level value that was received in the last policy payload?

There's a dedicated action for changing the log level of an agent, so "resetting" it will happen in real time without a policy update.

Does this resetting cause a policy update to occur?
I'm trying to understand how the resetting action by the user on the agent reverts the log level to what is indicated in the latest version of the agent policy.

@kpollich
Member

kpollich commented May 7, 2024

Does this resetting cause a policy update to occur?

No it does not. Changing the log level from the existing dropdown sets the log level value on the agent document in .fleet-agents, e.g.


So, the action associated with the dropdown is just coupled to this field on the .fleet-agents index; it doesn't require a policy update to change. The "reset" action would just wipe out what's in the agent.logging_level property here and allow the agent to fall back to what's defined on its policy.
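Putting the two pieces of this thread together, the action body can be built with a small helper. This is a sketch, not Fleet code; only the request-body shape is taken from the examples shown in this thread.

```typescript
// Shape of the SETTINGS action body used in this thread's examples.
interface SettingsAction {
  action: {
    type: "SETTINGS";
    // A concrete level sets agent.logging_level on the .fleet-agents
    // document; null clears it so the agent falls back to its policy.
    data: { log_level: "debug" | "info" | "warning" | "error" | null };
  };
}

function buildLogLevelAction(
  logLevel: "debug" | "info" | "warning" | "error" | null
): SettingsAction {
  return { action: { type: "SETTINGS", data: { log_level: logLevel } } };
}

// Reset payload, to POST to /api/fleet/agents/<AGENT_ID>/actions:
const resetBody = JSON.stringify(buildLogLevelAction(null));
```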

@pchila
Member

pchila commented May 8, 2024

@kpollich , @nimarezainia I am currently implementing the elastic-agent side of this change.
You can find the draft PR here: elastic/elastic-agent#3090

As part of this PR I am allowing a reset of the log level through the settings action https://github.com/elastic/elastic-agent/pull/3090/files#diff-2bd1a4e0a667e4ae3065d0fba47659d4157dd6dc41528aed1eee3aa41aeaa163R19

The idea is that the settings action will now allow an empty value of log level and interpret that as a reset for the log level of the specific agent.

When the agent receives the action to clear its specific log level it will fall back to the policy log level (current implementation keeps track of it) or the hardcoded default log level if there's no agent-level or policy-level log level specified.
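The fallback order described above (agent-specific level, then policy level, then the hardcoded default) amounts to something like the following. This is an illustrative TypeScript sketch of the Go logic, with `info` assumed as the hardcoded default.

```typescript
const DEFAULT_LOG_LEVEL = "info"; // assumed hardcoded default

// The agent-specific override wins; otherwise fall back to the policy
// level; otherwise use the default. A null agent level is what the
// reset action produces.
function effectiveLogLevel(
  agentLevel: string | null,
  policyLevel: string | null
): string {
  return agentLevel ?? policyLevel ?? DEFAULT_LOG_LEVEL;
}
```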

Currently struggling with the fact that the agent-specific log level is populated at startup even before any settings action is received but hopefully we can remove that value without too much damage.

@nimarezainia
Contributor

Currently struggling with the fact that the agent-specific log level is populated at startup even before any settings action is received but hopefully we can remove that value without too much damage.

So this should be whatever the system default is (INFO) at startup. Aren't there logs sent before the policy is downloaded that may be important to us during that startup phase?

@pchila
Member

pchila commented May 9, 2024

@nimarezainia

There are indeed some logs that the agent writes before downloading a policy...
The problem is that during initialization elastic-agent stores the default log level in the same config field as when it's received through the settings action, making it impossible to distinguish where such a value came from.
I removed the code that stores the default log level in that field for a fleet-managed agent, and the agent starts and works normally.

I may need to check again, but I think I saw elastic-agent using a warning log level if we don't define a default log level at startup... I will check again.

Do we have a preference on which log level we want to set at startup of a managed agent?

@nimarezainia
Contributor

Do we have a preference on which log level we want to set at startup of a managed agent ?

I reckon it should be a default, whatever that is set to (in the future we may change the default from info to something else).

Couldn't this be written into the elastic-agent.yml file, which is read at startup and then overwritten by the policy when it's received (which happens for all the configuration in that file)? Again, I don't know how important the logs sent during this window are. More importantly, whether they are needed during troubleshooting.

@jen-huang
Contributor

jen-huang commented May 13, 2024

Currently struggling with the fact that the agent-specific log level is populated at startup even before any settings action is received but hopefully we can remove that value without too much damage.

I removed the code that stores the default log level in such a field for a fleet-managed agent and the agent starts and works normally.

With your changes, does it mean that a new agent will not have local_metadata.elastic.agent.log_level set? I am trying to determine the same information for the UI work: whether any log level was manually set by the user either at the policy or agent level. It will be helpful if this field is unset until user sends the settings action, and unset again if they send a null value for log level.

Edit: I tested the changes and the metadata field is still set at enrollment. I think we'll always want to keep this metadata, so unsetting it is not an option. But I could not find a way to determine the source of the log level.

@pchila is there a way to expose the source of the log level? Something like this would be really nice:

```
local_metadata.elastic.agent:
  log_level_source: 'agent_setting' | 'agent_policy' | 'default'
```
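If such a field existed, the UI decision would be trivial. The sketch below works over the hypothetical `log_level_source` values proposed here; note this field does not exist in elastic-agent today.

```typescript
// Hypothetical source values for the agent's effective log level.
type LogLevelSource = "agent_setting" | "agent_policy" | "default";

// Only an explicit agent-level override diverges from the policy, so
// only then would a "Reset to policy" action be meaningful.
function shouldOfferReset(source: LogLevelSource): boolean {
  return source === "agent_setting";
}
```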

Btw, I tried enrolling a new agent into a policy that already had log level set to debug. The agent successfully sent debug logs but the metadata was reporting level as info. Let me know if you want this as a separate bug report (it happens with the build from your PR and 8.15 snapshot), or if I am misunderstanding how this field should behave:


@jen-huang
Contributor

On the good news side, I was able to send null and have the agent receive it correctly, remain healthy, and log at the level set in its policy:

```
POST kbn:/api/fleet/agents/accc6c31-8142-4eac-aa33-571f8100c556/actions
{"action":{"type":"SETTINGS","data":{"log_level": null}}}
```

```
14:52:26.455 [elastic_agent][info] Settings action done, setting agent log level to 
14:52:26.455 [elastic_agent][debug] Successfully dispatched action: 'action_id: 0e137694-7481-4e9c-bf51-e8f334ff2b99, type: SETTINGS, log_level: '
```

This required no Fleet Server changes. The fleet-server LOC that Julia mentioned in #180778 (comment) contains a similar error, but the original error actually originates from elastic-agent.

@juliaElastic @nchaulet (and @michel-laterman?) Even though it works without Fleet Server changes for a normal agent, maybe the func here:

https://github.com/elastic/fleet-server/blob/b0b84f6f90b4292748d9dc4d1a9546d7470937ea/internal/pkg/config/logging.go#L73

needs to be updated to match this one?

https://github.com/elastic/fleet-server/blob/b0b84f6f90b4292748d9dc4d1a9546d7470937ea/internal/pkg/config/fleet.go#L21

(my poor understanding of Go leads me to think the first file is for handling fleet server log level?)

jen-huang added a commit that referenced this issue May 13, 2024
…#183203)

## Summary

Spinning out a few unrelated tweaks from the work for
#180778. This PR:

- In addition to strings, allows React components to be used for
configured settings `description`
- Fixes small i18n issues
- Fixes some setting names to have sentence casing and adjust some
description copies
- Moves deprecated unenrollment timeout setting to bottom of list

Screenshot with all changes:


![image](https://github.com/elastic/kibana/assets/1965714/0c7b692b-e47f-4013-a95c-6ec49f5976e4)

---------

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
@juliaElastic
Contributor

juliaElastic commented May 14, 2024

@jen-huang How did you test with the agent changes? The latest agent 8.15-SNAPSHOT is one week old and doesn't include the log level changes yet.

EDIT: I managed to test by building elastic-agent from source, and I'm seeing the same behaviour: the reset action seems to work, though we might want to improve the log message (Settings action done, setting agent log level to ) to say that the log level is reset to default.
The linked Validate function looks to be for fleet-server.log; I'm not sure if that code is triggered when resetting the agent log level. We can probably test with a fleet-server agent.

@pchila
Member

pchila commented May 14, 2024

With your changes, does it mean that a new agent will not have local_metadata.elastic.agent.log_level set? I am trying to determine the same information for the UI work: whether any log level was manually set by the user either at the policy or agent level. It will be helpful if this field is unset until user sends the settings action, and unset again if they send a null value for log level.

Edit: I tested the changes and the metadata field is still set at enrollment. I think we'll always want to keep this metadata, so unsetting it is not an option. But I could not find a way to determine the source of the log level.

@pchila is there a way to expose the source of the log level? Something like this would be really nice:

```
local_metadata.elastic.agent:
  log_level_source: 'agent_setting' | 'agent_policy' | 'default'
```

@jen-huang in the current agent implementation we don't keep track of where a specific configuration value comes from; we merge everything into a single config and use that to run the agent. This kind of rules out the log_level_source field unless we change the format of the fleet.enc file and rewrite/re-encrypt it for every change in log level.

The local_metadata that you are looking at comes from the log level set for the specific agent, as the fleet policy log level is not persisted as agent metadata but is kept in memory at runtime (if we overwrite the log level settings for the specific agent, we will no longer know if that is a policy value or a settings value).

Btw, I tried enrolling a new agent into a policy that already had log level set to debug. The agent successfully sent debug logs but the metadata was reporting level as info. Let me know if you want this as a separate bug report (it happens with the build from your PR and 8.15 snapshot), or if I am misunderstanding how this field should behave:

Since the policy log level value is not written to disk, you are still seeing the default value of the agent log level (in case it's not set, we output "info" anyway). To check the actual log level you can use elastic-agent inspect, which outputs the internal (actual) state of the agent; you will see the correct log level there.

Edit: as for the bug, the local metadata reflects what is written on disk; not sure if that should change if the log level from the policy is taken into account... Maybe @michalpristas @cmacknz can weigh in here...

@jen-huang
Contributor

jen-huang commented May 14, 2024

With log_level metadata not representing the actual state of the agent, and no log_level_source field, the UI will have a lot of shortcomings.

Consider the scenario where the policy is set to debug and no agent log level is set. That agent's metadata log_level will report info, which means we will select info in the agent log level dropdown, leading the user to believe that is the current log level. We'll also be offering a Reset log level option even though there is no agent-level override, because we don't know the source of the log level.

The Reset offering is acceptable even though it's not ideal. But at minimum we need something that represents the real log level state of the agent.

@cmacknz
Member

cmacknz commented May 14, 2024

+1, the log_level in the metadata sent from agent can't be wrong. The system is in a state of internal conflict if agent is reporting to Fleet that it uses the info log level, but is actually operating at the debug log level.

I really like the idea of having a log_level_source so we can easily tell why we have the log level that we do. Although, it does seem from a pure correctness perspective we are fine just having the local_metadata.elastic.agent.log_level be correct regardless of what set it.

@jen-huang
Contributor

jen-huang commented May 14, 2024

I created two followup issues:

@ycombinator @pchila The first bug should be addressed for 8.15, I consider this a blocker for the UI work. However both should be addressed for the ideal & intended UX.

Edit: one more small bug:

@jen-huang
Contributor

The linked Validate function looks to be for the fleet-server.log, I'm not sure if that code is triggered when resetting the agent log level. Probably we can test with a fleet-server agent.

I confirmed that reset works for a fleet-server agent as well (with a build from source) and it remains healthy:

```
10:47:52.127 [elastic_agent][info] Unit state changed fleet-server-default-fleet-server-fleet_server-f9d2e5c2-9874-4cf8-b16d-793dc64ada17 (CONFIGURING->HEALTHY): Running on policy with Fleet Server integration: fleet-server-policy
10:48:45.849 [elastic_agent][info] Settings action done, setting agent log level to 
```

EDIT: I managed to test by building elastic-agent from source, and seeing the same behaviour, the reset action seems to work, though we might want to improve the log message to say that the log level is reset to default Settings action done, setting agent log level to

Agreed, opened another small bug for this: elastic/elastic-agent#4749

jen-huang added a commit that referenced this issue May 16, 2024
‼️ Should be reverted if
elastic/elastic-agent#4747 does not make
8.15.0.

## Summary

Resolves #180778 

This PR allows agent log level to be reset back to the level set on its
policy (or if not set, simply the default agent level, see
elastic/elastic-agent#3090).

To achieve this, this PR:
- Allows `null` to be passed for the log level settings action, i.e.:

```
POST kbn:/api/fleet/agents/<AGENT_ID>/actions
{"action":{"type":"SETTINGS","data":{"log_level": null}}}
```
- Enables the agent policy log level setting implemented in
#180607
- Always show `Apply changes` on the agent details > Logs tab
- For agents >= 8.15.0, always show `Reset to policy` on the agent
details > Logs tab
- Ensures both buttons are disabled if user does not have access to
write to agents
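The ">= 8.15.0" gate above can be sketched as a simple version check. The real Kibana code presumably uses a semver helper; this standalone comparison (and its handling of `-SNAPSHOT` suffixes) is an illustrative assumption.

```typescript
// Show "Reset to policy" only for agents that understand the null
// log_level settings action, i.e. version 8.15.0 or later.
function supportsLogLevelReset(agentVersion: string): boolean {
  const [core] = agentVersion.split("-"); // strip -SNAPSHOT etc.
  const [major, minor, patch] = core.split(".").map(Number);
  if (major !== 8) return major > 8;
  if (minor !== 15) return minor > 15;
  return patch >= 0;
}
```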

<img width="1254" alt="image"
src="https://github.com/elastic/kibana/assets/1965714/bcdf763e-2053-4071-9aa8-8bcb57b8fee1">

<img width="1267" alt="image"
src="https://github.com/elastic/kibana/assets/1965714/182ac54d-d5ad-435f-9376-70bb24f288f9">

### Caveats
1. The reported agent log level is not accurate if agent is using the
level from its policy and does not have a log level set on its own level
(elastic/elastic-agent#4747), so the initial
selection on the agent log level could be wrong
2. We have no way to tell where the log level came from
(elastic/elastic-agent#4748), so that's why
`Apply changes` and `Reset to policy` are always shown

### Testing
Use the latest `8.15.0-SNAPSHOT` for agents or fleet server to test this
change

### Checklist

Delete any items that are not applicable to this PR.

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
@jen-huang jen-huang added the QA:Ready for Testing Code is merged and ready for QA to validate label May 17, 2024