[Elastic Agent] The system process metricsets should not report as degraded when metrics are partially collected #40542

cmacknz · 2024-08-15T19:47:20Z

In #40400 we added the ability for metricsets to report their status using the Elastic Agent control protocol to make errors more visible. One consequence of this has been surfacing previously hidden problems collecting information about PIDs.

It appears that by default we fail to get complete metrics for some PIDs on both Windows and Linux, which leaves the Elastic Agent permanently unhealthy in the Fleet UI with no quick fix:

[Flaky Test]: TestLongRunningAgentForLeaks/TestHandleLeak – Metricbeat input status reporting makes Windows agent permanently degraded elastic-agent#5300
[metricbeat/system module] should differentiate kernel and user-space threads #40537

While this functionality is useful and has helped us find bugs, we do not want users to immediately experience unhealthy agents they can't obviously fix. As a stopgap while we fix the underlying problems, let's keep reporting the error message but have the system process metricsets report as healthy. This will put the error messages in the UI without making the agent unhealthy, which most users treat as something to fix quickly.

For example, we regularly see something like the following in our leak detection tests:

    - id: system/metrics-default
      state:
        message: 'Healthy: communicating with pid ''1556'''
        pid: 0
        state: 2
        units:
            input-system/metrics-default-system/metrics-system-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                state: DEGRADED
                message: |-
                    Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                    error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                    GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                payload:
                    streams:
                        system/metrics-system.process.summary-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                            error: |-
                                Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                                error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                                GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                            status: DEGRADED

The proposal is that we keep the error messages, but report the state as healthy.

    - id: system/metrics-default
      state:
        message: 'Healthy: communicating with pid ''1556'''
        pid: 0
        state: 2
        units:
            input-system/metrics-default-system/metrics-system-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                state: HEALTHY
                message: |-
                    Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                    error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                    GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                payload:
                    streams:
                        system/metrics-system.process.summary-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                            error: |-
                                Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                                error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                                GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                            status: HEALTHY

This will keep the information accessible for bug reports but should hopefully reduce the perceived urgency of the problem and the level of support cases. The control protocol always allows both a state and a message regardless of what the state is, see here.

There will be a follow-up issue to make it possible to switch between these two behaviors.

elasticmachine · 2024-08-15T19:47:22Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz · 2024-08-15T19:58:30Z

I also think we should put this choice behind a module or metricset level configuration flag. I created #40543 to track implementing that.

leehinman · 2024-08-15T20:06:57Z

Long term could we look at a way for users to configure that certain error conditions are acceptable or don't contribute to a DEGRADED state?

I'm thinking that as a user I might want to say "Yes, I know we will never be able to collect process stats from process X, but for any other process I want to know if there are errors"

cmacknz · 2024-08-15T20:18:32Z

Yes, I think a granular filter would make sense, either as PIDs that contribute to errors or just an exclusion or inclusion list of process names/paths/PIDs to not even bother collecting metrics from.

In the case of not having enough privileges, we would probably want to come up with well-known error codes users can filter on (I'd probably just use the OS error codes, e.g. EACCES) rather than making our error messages the API for this.

ycombinator · 2024-08-20T19:14:55Z

Please remember to revert elastic/elastic-agent#5301 once this issue here is resolved.

ycombinator · 2024-08-23T13:57:56Z

@VihasMakwana Now that you've resolved this issue via #40565, please remember to revert elastic/elastic-agent#5301, probably on Monday once the changes from #40565 have made it into a Beats snapshot release.

VihasMakwana · 2024-08-23T14:09:30Z

@ycombinator yes, that's on my list. I'll do that on Monday.

cmacknz added the Team:Elastic-Agent-Data-Plane label Aug 15, 2024

ycombinator mentioned this issue Aug 15, 2024

Agent Gets Unhealthy on updating endpoint policy elastic/elastic-agent#5288

Closed

VihasMakwana mentioned this issue Aug 19, 2024

[metricbeat/system][windows] - Metricbeat reports DEGRADED while running in privileged mode #40484

Open

pierrehilbert assigned VihasMakwana Aug 19, 2024

VihasMakwana mentioned this issue Aug 22, 2024

[metricbeat][system/process, system/process_summary]: mark module as healthy if metrics are partially filled #40565

Merged

6 tasks

VihasMakwana closed this as completed in #40565 Aug 23, 2024

mergify bot mentioned this issue Aug 23, 2024

[8.15](backport #40565) [metricbeat][system/process, system/process_summary]: mark module as healthy if metrics are partially filled #40602

Closed

6 tasks

VihasMakwana mentioned this issue Aug 26, 2024

Revert "Relax leak test condition from Healthy to not Failed." elastic/elastic-agent#5356

Merged
