[TRACKING] Enhanced libs stats / metrics #1347

Closed
20 tasks
incertum opened this issue Sep 12, 2023 · 2 comments
Labels
kind/feature New feature or request
Comments

@incertum
Contributor

Motivation

Gain 24/7 production visibility into how Falco is functioning, so that we can assess what may be causing high CPU usage, high memory usage, or possible memory leaks.

Feature

Expose additional existing libsinsp stats and add more relevant stats. Ultimately, the goal is to make them available in Falco metrics.

This will help us better debug CPU and memory usage of Falco or custom libs clients in production, especially because periodic metrics snapshots can be taken 24/7. Running Falco in special debug mode is more difficult in production for various reasons.

This issue tracks candidate metrics to add on top of the substantial number of metrics and counters that we already expose to consumers:

libsinsp state:

Here are some existing counters, found by searching the code for #ifdef GATHER_INTERNAL_STATS, that we should evaluate:

  • uint64_t m_n_preemptions;
  • uint64_t m_n_noncached_fd_lookups;
  • uint64_t m_n_cached_fd_lookups;
  • uint64_t m_n_failed_fd_lookups;
  • uint64_t m_n_threads;
  • uint64_t m_n_fds;
  • uint64_t m_n_added_fds;
  • uint64_t m_n_removed_fds;
  • uint64_t m_n_stored_evts;
  • uint64_t m_n_store_drops;
  • uint64_t m_n_retrieved_evts;
  • uint64_t m_n_retrieve_drops;

sinsp_thread_manager

  • m_failed_lookups
  • m_cached_lookups
  • m_non_cached_lookups
  • m_added_threads
  • m_removed_threads
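
As a rough illustration of how the lookup counters above could be turned into a periodic metric, here is a minimal sketch. The `lookup_counters` struct and both helper functions are hypothetical and not part of libsinsp. The sketch assumes that the counters only ever increase and that every lookup is counted as either cached or non-cached, with failed lookups reported separately.

```cpp
#include <cstdint>

// Hypothetical snapshot of lookup counters. The fields mirror the counters
// listed above (for example m_cached_lookups, m_non_cached_lookups and
// m_failed_lookups), but this struct is NOT part of libsinsp.
struct lookup_counters {
    uint64_t cached;
    uint64_t non_cached;
    uint64_t failed;
};

// Assuming the counters are monotonic, the activity within one snapshot
// interval is the difference between two consecutive snapshots.
inline lookup_counters delta(const lookup_counters& prev, const lookup_counters& cur) {
    lookup_counters d;
    d.cached = cur.cached - prev.cached;
    d.non_cached = cur.non_cached - prev.non_cached;
    d.failed = cur.failed - prev.failed;
    return d;
}

// Fraction of lookups served from the cache, or 0.0 if there were none.
// Assumption: total lookups = cached + non_cached.
inline double cache_hit_ratio(const lookup_counters& c) {
    const uint64_t total = c.cached + c.non_cached;
    if (total == 0) {
        return 0.0;
    }
    return static_cast<double>(c.cached) / static_cast<double>(total);
}
```

A falling hit ratio between snapshots, or a growing gap between added and removed threads or fds, is the kind of signal that could point to state-table growth without running Falco in a debug mode.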

overall server load

In addition to the recently added CPU and memory usage snapshot metrics, we should also expose the following:

  • Overall server CPU usage
  • Overall server memory usage
  • The total number of currently running threads on the server, which serves as the ground truth for assessing the stability of our libsinsp state cache
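
One possible way to collect these host-wide numbers on Linux is to parse `/proc/stat`, `/proc/meminfo` and `/proc/loadavg`. The sketch below is only an illustration: all struct and function names are hypothetical, it is not how libs implements its metrics, and it parses from streams and strings so that it can be exercised without touching the real files.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

struct host_load {
    double load_1m = 0.0;
    uint64_t runnable_entities = 0;  // currently runnable tasks
    uint64_t total_entities = 0;     // all existing tasks (processes and threads)
};

// Parses the single line of /proc/loadavg, e.g. "0.20 0.18 0.12 1/80 11206".
inline bool parse_loadavg(std::istream& in, host_load& out) {
    double load_5m = 0.0, load_15m = 0.0;
    char slash = 0;
    if (!(in >> out.load_1m >> load_5m >> load_15m
             >> out.runnable_entities >> slash >> out.total_entities)) {
        return false;
    }
    return slash == '/';
}

struct host_memory {
    uint64_t total_kb = 0;
    uint64_t available_kb = 0;
};

// Extracts MemTotal and MemAvailable (both in kB) from /proc/meminfo content.
inline bool parse_meminfo(std::istream& in, host_memory& out) {
    bool have_total = false, have_available = false;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key;
        uint64_t value = 0;
        if (!(fields >> key >> value)) continue;
        if (key == "MemTotal:") { out.total_kb = value; have_total = true; }
        else if (key == "MemAvailable:") { out.available_kb = value; have_available = true; }
    }
    return have_total && have_available;
}

struct cpu_times {
    uint64_t busy = 0;
    uint64_t total = 0;
};

// Parses the aggregate "cpu" line of /proc/stat. Field order:
// user nice system idle iowait irq softirq steal guest guest_nice.
// The guest fields are skipped on the assumption that they are already
// accounted for in user and nice.
inline bool parse_cpu_line(const std::string& line, cpu_times& out) {
    std::istringstream fields(line);
    std::string label;
    if (!(fields >> label) || label != "cpu") return false;
    uint64_t value = 0, total = 0, idle = 0;
    int n = 0;
    while (n < 8 && fields >> value) {
        total += value;
        if (n == 3 || n == 4) idle += value;  // idle and iowait
        ++n;
    }
    if (n < 4) return false;
    out.total = total;
    out.busy = total - idle;
    return true;
}

// CPU usage in [0, 1] between two samples; 0.0 if no time has elapsed.
inline double cpu_usage(const cpu_times& prev, const cpu_times& cur) {
    const uint64_t elapsed = cur.total - prev.total;
    if (elapsed == 0) return 0.0;
    return static_cast<double>(cur.busy - prev.busy) / static_cast<double>(elapsed);
}
```

The second number in the fourth field of `/proc/loadavg` counts every task that currently exists on the host, so it could be compared against the size of the libsinsp thread table as the ground truth mentioned above.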

@incertum added the kind/feature (New feature or request) label on Sep 12, 2023
@incertum added this to the 0.14.0 milestone on Sep 12, 2023

@Andreagit97
Member

This seems like a reasonable point! I'm a little bit worried about our planning for Falco 0.37, though. Looking at the libs 0.14.0 milestone and the Falco 0.37 milestone, we already have tons of stuff, like:

  • ia32 support
  • k8s client
  • falco-driver-loader refactor
  • memleak issue
  • ...

Probably we need to discuss what we really want to do in the next release :/

@sboschman

To gain insight into all these metrics, it would be great to expose them via a Prometheus-style HTTP metrics endpoint or to integrate an OpenTelemetry client, so that we can easily push them to a metrics database and view, graph, and analyse them with, for example, Grafana.
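
To make the Prometheus-style suggestion more concrete, here is a minimal sketch of rendering a set of counters in the Prometheus text exposition format, which is what an HTTP `/metrics` handler would return as its body. The `metric` struct, the `render_prometheus` function, and any metric names passed to it are hypothetical and only for illustration. This is not an existing Falco or libs API, and the HTTP server itself is left out.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one exported metric.
struct metric {
    std::string name;  // e.g. a made-up name such as "example_cached_lookups_total"
    std::string help;  // one-line human readable description
    std::string type;  // "counter" for monotonic values, "gauge" otherwise
    uint64_t value;
};

// Renders metrics in the Prometheus text exposition format:
//   # HELP <name> <help>
//   # TYPE <name> <type>
//   <name> <value>
inline std::string render_prometheus(const std::vector<metric>& metrics) {
    std::ostringstream out;
    for (const auto& m : metrics) {
        out << "# HELP " << m.name << ' ' << m.help << '\n'
            << "# TYPE " << m.name << ' ' << m.type << '\n'
            << m.name << ' ' << m.value << '\n';
    }
    return out.str();
}
```

Monotonic counters such as the lookup and added/removed counts would map to the `counter` type, while point-in-time values such as the current number of threads or fds would map to `gauge`.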

3 participants