Agent should collect and report CPU and memory usage of service runtime components #4083

ycombinator · 2024-01-12T22:58:17Z

Relates Endpoint Security metrics are not available #1944

Currently, Agent collects and reports CPU and memory usage of command runtime components (and even that is incomplete). It is not collecting and reporting CPU and memory usage of service runtime components, e.g. Endpoint, leading to undercounting.

elasticmachine · 2024-01-12T22:58:19Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz · 2024-01-15T14:22:18Z

This was blocked by elastic/beats#17314, which is now resolved.

fearful-symmetry · 2024-03-04T21:46:07Z

So, I'm tinkering around with how to add monitoring config for something like endpoint to the v1_monitor.go setup code. It's a little harder than I expected, as most of the monitoring config seems to be generated before components start, which means we can't just pass a PID to the monitoring code that it'll use to run some normal process metrics code. The alternative would be to search for the process based on a regex name match, witch seems...kinda hacky. The other option is to invent some new monitoring scheme that that we can send a PID to.

There's also the question of how we report these metrics to begin with; the existing monitoring code uses the beat/stats and json metricsets, which aren't available to us if we're monitoring something like endpoint. If we use the system/process metricset, which seems like the most obvious solution, we either have to format the metrics so they'll be in a structure that Kibana expects for its monitoring queries, or we change the kibana queries used for generating usage metrics. Not sure which implementation is best.

fearful-symmetry · 2024-03-04T22:25:54Z

So, still poking around the codebase, but I'm not sure I see good way to pass the PID, since the monitoring config seems to happen at the same time as the rest of the config, which looks like it would happen before a given be is (re)started. Assuming I'm right, what we could do is restart the monitoring beat after everything else has started, which seems a bit needless and could also result in missing data.

cmacknz · 2024-03-06T16:52:38Z

@nfritts @brian-mckinney we want to include endpoint's CPU and memory usage metrics in Fleet. To do this we need to know the endpoint process's PID.

If we added the current PID as a field that can be reported in the CheckinObserved message is that something that would be easy for endpoint to implement?

This seemed like the simplest approach to us but wanted to confirm. Otherwise we'd be iterating through every PID on the system looking for the one that maps to endpoint-security regularly. If you have other ideas feel free to propose them.

nfritts · 2024-03-06T17:21:03Z

Overall I don't think adding our PID should be too much work. There might be an easy alternative once we migrate to the named pipe, but we might be able to just start sending the PID when we move to the named pipe gRPC and do both at once.

fearful-symmetry · 2024-03-08T17:04:26Z

@nfritts should we or someone file an issue on the endpoint side so we can track this? Is there a timeframe for when this might get done?

@cmacknz should we just go ahead with providing the PID via endpoint and have this issue block on that implementation?

cmacknz · 2024-03-08T17:34:20Z

Yes we are going to need the PID to continue. The next step should be on us though to define how they provide the PID.

@fearful-symmetry put up a PR proposing how we do this in the control protocol so the endpoint team knows where to put the PID.

fearful-symmetry · 2024-03-08T18:19:13Z

So, I'm digging more into the code, and there's an extra caveat here, which is how updates to the monitoring beats happen. The monitoring config is injected into the config model, which happens when we call coordinator.refreshComponentModel() I don't really know enough about the agent backend to know if it's wise to do this whenever we get a new PID from endpoint. Because (I think) this is called before the beats and endpoint start, it might be possible that we call refreshComponentModel(), and then immediately call it again once we get a PID from endpoint.

@faec / @blakerouse you know more about the coordinator, is it advisable to just call coordinator.refreshComponentModel() whenever we want to update the monitoring beat config, without affecting other running components or adding too much overhead?

Assuming it is, I don't think it would be too hard to add some logic to the coordinator to update the component model if we get a PID update from endpoint.

brian-mckinney · 2024-03-08T18:46:03Z

@cmacknz Sorry I missed this question the other day. It would be pretty easy for endpoint to incorporate the pid into CheckinObserved. We would just need the updated protobuf definitions. Ping me on the PR, or when they are updated and I will get on it 👍

brian-mckinney · 2024-04-01T20:13:30Z

Endpoint PR has been merged: https://github.com/elastic/endpoint-dev/pull/14338

intxgo · 2024-04-02T10:31:50Z

we want to include endpoint's CPU and memory usage metrics in Fleet. To do this we need to know the endpoint process's PID.

Alternatively Endpoint could send you the expected metric document in the CheckinObserved message, but Agent shouldn't have any problem querying the basic memory and CPU usage.

elasticmachine · 2024-05-05T15:51:16Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

## What does this PR do? This is part of elastic/elastic-agent#4083 This adds a `GetOneRootEvent()`. method to `ProcStats` we need this for elastic/elastic-agent#4083 , as that issue requires metricbeat's system/process metricset to monitor a specific PID, which it can't do now. This will have to be followed up with a change to beat itself to rope this in. This also cleans up a bit of code and adds some docs. ## Why is it important? needed for elastic/elastic-agent#4083 ## Checklist - [x] My code follows the style guidelines of this project - [x] I have commented my code, particularly in hard-to-understand areas - [x] I have added tests that prove my fix is effective or that my feature works - [ ] I have added an entry in `CHANGELOG.md`

jlind23 · 2024-05-15T08:09:02Z

@fearful-symmetry now that elastic/elastic-agent-system-metrics#150 has been merged, what is remaining here?

fearful-symmetry · 2024-05-17T16:09:24Z

@jlind23 mostly blocked by SDHes this week, I'm hoping to have a PR next week. It's 95% done.

ycombinator added bug Something isn't working Team:Elastic-Agent Label for the Agent team labels Jan 12, 2024

cmacknz mentioned this issue Jan 17, 2024

[Fleet] Fix agent memory query elastic/kibana#174458

Open

pierrehilbert assigned fearful-symmetry Feb 6, 2024

jlind23 mentioned this issue Mar 26, 2024

Endpoint Security metrics are not available #1944

Closed

fearful-symmetry mentioned this issue Mar 26, 2024

Add checkin_pid to CheckinObserved elastic/elastic-agent-client#106

Merged

pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 5, 2024

fearful-symmetry mentioned this issue May 9, 2024

Add GetOneRootEvent method to process handler elastic/elastic-agent-system-metrics#150

Merged

4 tasks

kpollich mentioned this issue May 15, 2024

[Fleet UI] CPU/memory usage query for Agent Listing page should only consider only latest data elastic/kibana#174799

Open

This was referenced May 17, 2024

let the system/process metricset monitor a single PID elastic/beats#39620

Merged

Add component values to endpoint metrics elastic/integrations#9928

Merged

Allow agent to monitor endpoint #4789

Merged

fearful-symmetry closed this as completed in #4789 Jun 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent should collect and report CPU and memory usage of service runtime components #4083

Agent should collect and report CPU and memory usage of service runtime components #4083

ycombinator commented Jan 12, 2024 •

edited by cmacknz

elasticmachine commented Jan 12, 2024

cmacknz commented Jan 15, 2024

fearful-symmetry commented Mar 4, 2024

fearful-symmetry commented Mar 4, 2024

cmacknz commented Mar 6, 2024

nfritts commented Mar 6, 2024

fearful-symmetry commented Mar 8, 2024

cmacknz commented Mar 8, 2024

fearful-symmetry commented Mar 8, 2024

brian-mckinney commented Mar 8, 2024

brian-mckinney commented Apr 1, 2024

intxgo commented Apr 2, 2024

elasticmachine commented May 5, 2024

jlind23 commented May 15, 2024

fearful-symmetry commented May 17, 2024

Agent should collect and report CPU and memory usage of service runtime components #4083

Agent should collect and report CPU and memory usage of service runtime components #4083

Comments

ycombinator commented Jan 12, 2024 • edited by cmacknz

elasticmachine commented Jan 12, 2024

cmacknz commented Jan 15, 2024

fearful-symmetry commented Mar 4, 2024

fearful-symmetry commented Mar 4, 2024

cmacknz commented Mar 6, 2024

nfritts commented Mar 6, 2024

fearful-symmetry commented Mar 8, 2024

cmacknz commented Mar 8, 2024

fearful-symmetry commented Mar 8, 2024

brian-mckinney commented Mar 8, 2024

brian-mckinney commented Apr 1, 2024

intxgo commented Apr 2, 2024

elasticmachine commented May 5, 2024

jlind23 commented May 15, 2024

fearful-symmetry commented May 17, 2024

ycombinator commented Jan 12, 2024 •

edited by cmacknz