Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent should collect and report CPU and memory usage of service runtime components #4083

Closed
ycombinator opened this issue Jan 12, 2024 · 15 comments · Fixed by #4789
Closed
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@ycombinator
Copy link
Contributor

ycombinator commented Jan 12, 2024

Currently, Agent collects and reports CPU and memory usage of command runtime components (and even that is incomplete). It is not collecting and reporting CPU and memory usage of service runtime components, e.g. Endpoint, leading to undercounting.

@ycombinator ycombinator added bug Something isn't working Team:Elastic-Agent Label for the Agent team labels Jan 12, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Copy link
Member

cmacknz commented Jan 15, 2024

This was blocked by elastic/beats#17314, which is now resolved.

@fearful-symmetry
Copy link
Contributor

So, I'm tinkering around with how to add monitoring config for something like endpoint to the v1_monitor.go setup code. It's a little harder than I expected, as most of the monitoring config seems to be generated before components start, which means we can't just pass a PID to the monitoring code that it'll use to run some normal process metrics code. The alternative would be to search for the process based on a regex name match, witch seems...kinda hacky. The other option is to invent some new monitoring scheme that that we can send a PID to.

There's also the question of how we report these metrics to begin with; the existing monitoring code uses the beat/stats and json metricsets, which aren't available to us if we're monitoring something like endpoint. If we use the system/process metricset, which seems like the most obvious solution, we either have to format the metrics so they'll be in a structure that Kibana expects for its monitoring queries, or we change the kibana queries used for generating usage metrics. Not sure which implementation is best.

@fearful-symmetry
Copy link
Contributor

So, still poking around the codebase, but I'm not sure I see good way to pass the PID, since the monitoring config seems to happen at the same time as the rest of the config, which looks like it would happen before a given be is (re)started. Assuming I'm right, what we could do is restart the monitoring beat after everything else has started, which seems a bit needless and could also result in missing data.

@cmacknz
Copy link
Member

cmacknz commented Mar 6, 2024

@nfritts @brian-mckinney we want to include endpoint's CPU and memory usage metrics in Fleet. To do this we need to know the endpoint process's PID.

If we added the current PID as a field that can be reported in the CheckinObserved message is that something that would be easy for endpoint to implement?

This seemed like the simplest approach to us but wanted to confirm. Otherwise we'd be iterating through every PID on the system looking for the one that maps to endpoint-security regularly. If you have other ideas feel free to propose them.

@nfritts
Copy link

nfritts commented Mar 6, 2024

Overall I don't think adding our PID should be too much work. There might be an easy alternative once we migrate to the named pipe, but we might be able to just start sending the PID when we move to the named pipe gRPC and do both at once.

@fearful-symmetry
Copy link
Contributor

@nfritts should we or someone file an issue on the endpoint side so we can track this? Is there a timeframe for when this might get done?

@cmacknz should we just go ahead with providing the PID via endpoint and have this issue block on that implementation?

@cmacknz
Copy link
Member

cmacknz commented Mar 8, 2024

Yes we are going to need the PID to continue. The next step should be on us though to define how they provide the PID.

@fearful-symmetry put up a PR proposing how we do this in the control protocol so the endpoint team knows where to put the PID.

@fearful-symmetry
Copy link
Contributor

So, I'm digging more into the code, and there's an extra caveat here, which is how updates to the monitoring beats happen. The monitoring config is injected into the config model, which happens when we call coordinator.refreshComponentModel() I don't really know enough about the agent backend to know if it's wise to do this whenever we get a new PID from endpoint. Because (I think) this is called before the beats and endpoint start, it might be possible that we call refreshComponentModel(), and then immediately call it again once we get a PID from endpoint.

@faec / @blakerouse you know more about the coordinator, is it advisable to just call coordinator.refreshComponentModel() whenever we want to update the monitoring beat config, without affecting other running components or adding too much overhead?

Assuming it is, I don't think it would be too hard to add some logic to the coordinator to update the component model if we get a PID update from endpoint.

@brian-mckinney
Copy link

@cmacknz Sorry I missed this question the other day. It would be pretty easy for endpoint to incorporate the pid into CheckinObserved. We would just need the updated protobuf definitions. Ping me on the PR, or when they are updated and I will get on it 👍

@brian-mckinney
Copy link

Endpoint PR has been merged: https://github.com/elastic/endpoint-dev/pull/14338

@intxgo
Copy link
Contributor

intxgo commented Apr 2, 2024

we want to include endpoint's CPU and memory usage metrics in Fleet. To do this we need to know the endpoint process's PID.

Alternatively Endpoint could send you the expected metric document in the CheckinObserved message, but Agent shouldn't have any problem querying the basic memory and CPU usage.

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 5, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

fearful-symmetry added a commit to elastic/elastic-agent-system-metrics that referenced this issue May 10, 2024
## What does this PR do?

This is part of elastic/elastic-agent#4083 

This adds a `GetOneRootEvent()`. method to `ProcStats` we need this for
elastic/elastic-agent#4083 , as that issue
requires metricbeat's system/process metricset to monitor a specific
PID, which it can't do now. This will have to be followed up with a
change to beat itself to rope this in. This also cleans up a bit of code
and adds some docs.

## Why is it important?

needed for elastic/elastic-agent#4083 

## Checklist

- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have added an entry in `CHANGELOG.md`
@jlind23
Copy link
Contributor

jlind23 commented May 15, 2024

@fearful-symmetry now that elastic/elastic-agent-system-metrics#150 has been merged, what is remaining here?

@fearful-symmetry
Copy link
Contributor

@jlind23 mostly blocked by SDHes this week, I'm hoping to have a PR next week. It's 95% done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants