Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup(libsinsp): consolidation and extension of libsinsp stats / metrics sinsp_stats_v2 #1433

Merged
merged 10 commits into from
Nov 17, 2023

Conversation

incertum
Copy link
Contributor

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area API-version

/area build

/area CI

/area driver-kmod

/area driver-bpf

/area driver-modern-bpf

/area libscap-engine-bpf

/area libscap-engine-gvisor

/area libscap-engine-kmod

/area libscap-engine-modern-bpf

/area libscap-engine-nodriver

/area libscap-engine-noop

/area libscap-engine-source-plugin

/area libscap-engine-savefile

/area libscap-engine-udig

/area libscap

/area libpman

/area libsinsp

/area tests

/area proposals

Does this PR require a change in the driver versions?

/version driver-API-version-major

/version driver-API-version-minor

/version driver-API-version-patch

/version driver-SCHEMA-version-major

/version driver-SCHEMA-version-minor

/version driver-SCHEMA-version-patch

What this PR does / why we need it:

See #1347: Expanding libsinsp stats will help to debug and optimize libsinsp functionality around the threadtable and other aspects in production settings.

In essence, this PR refactors / consolidates / simplifies the libsinsp stats interface and adds valuable metrics that already existed, but weren't easily accessible by clients such as Falco. Plus added overall CPU and memory usages of the host to put the agents' usage easier in relation to the current server load.

Justification for removing the old sinsp stats:

First every metric was preserved with the replacement solution, except the counters below were duplicative of the scap counters, hence we do not re-copy them in the new sinsp stats solution

m_stats->m_n_seen_evts = stats.n_evts;
m_stats->m_n_drops = stats.n_drops;
m_stats->m_n_preemptions = stats.n_preemptions;

The following 2 metrics are snapshot metrics anyways, therefore they now take place in libsinsp::stats::get_sinsp_stats_v2 just like the resource utilization snapshots.

uint64_t m_n_threads;
uint64_t m_n_fds;

As said all other counters that were on the hot path are preserved in the simple sinsp_stats_v2 struct.

Please note that GATHER_INTERNAL_STATS is currently broken in 2 places: When activated libsinsp would not compile because of missing updates in fdinfo.h and a legacy threadtable loop approach void sinsp_thread_manager::update_statistics() -> This is why I believe no one could have seriously used those stats for some time.

Additional information:

During the October core maintainer meeting we discussed, that we are all fine with having all these stats / metrics in the sinsp_stats_v2_buffer and while this PR at least removes and consolidate legacy libsinsp stats, we are all on the same page that as we further cleanup and refactor libsinsp the libsinsp stats functionality will be part of future cleanups.

Question:

Are we ok with just having all these counters just like that on the hot path or do we want to gate them? Meaning another if condition or is it cheaper to just increment or decrement the respective count real fast?

Which issue(s) this PR fixes:

#1347

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

cleanup(libsinsp): consolidation and extension of libsinsp stats / metrics `sinsp_stats_v2 `

@incertum
Copy link
Contributor Author

The corresponding Falco PR is up and here is some example output of the new metrics falcosecurity/falco#2883 (comment)

@incertum
Copy link
Contributor Author

Had a new idea on how to expose additional stats / metrics around the container engine cache.

Missing container images in the logs has been called out for a very very long time. While in some cases when a new container just comes up we can't do much about it, we at least want to make sure everything we can improve has been considered.

It will help to investigate falcosecurity/falco#2708 better.

@Andreagit97 Andreagit97 added this to the 0.14.0 milestone Oct 23, 2023
@incertum incertum force-pushed the sinsp-stats-v2-extension branch 3 times, most recently from ca550c7 to 3b2dc0b Compare October 24, 2023 16:40
Copy link
Contributor

@FedeDP FedeDP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this PR; generally speaking, i'd avoid the check for m_inspector != nullptr since that is already taken for granted in so many places.

Also, i'd disable stats by default and i'd add a sinsp_enable_stats to enable them, making the m_sinsp_stats_v2 field a unique_ptr that is only allocated if stats are enabled.

@@ -88,11 +88,24 @@ bool sinsp_container_manager::remove_inactive_containers()
});

auto containers = m_containers.lock();
if (m_inspector != nullptr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these checks are necessary since eg: L69 above is already assuming that m_inspector cannot be null.

userspace/libsinsp/fdinfo.cpp Show resolved Hide resolved
#ifdef GATHER_INTERNAL_STATS
m_inspector->m_stats->m_n_added_fds++;
#endif
if (m_inspector != nullptr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as other comment, 3 lines above we are already accessing m_inspector with no checks; i'd avoid polluting the code :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same below.

#ifdef GATHER_INTERNAL_STATS
m_inspector->m_thread_manager->m_failed_lookups->decrement();
#endif
if (m_inspector != nullptr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again!

@@ -1121,7 +1122,8 @@ VISIBILITY_PRIVATE
bool m_is_dumping;
const scap_machine_info* m_machine_info;
const scap_agent_info* m_agent_info;
scap_stats_v2 m_sinsp_stats_v2[SINSP_MAX_RESOURCE_UTILIZATION];
sinsp_stats_v2 m_sinsp_stats_v2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since stats were protected by a macro before, can we retains a similar approach by making this back a unique_ptr and only creating it if requested (ie: by a sinsp_enable_stats() API or similar that must be called before the opening of the inspector).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need a unique_ptr, just ensure the struct is zero-initialized and have an option to disable the stats updates?

If you disable the stats and then read them and expect non-zero values, I'm not sure you have someone else to blame :D

Copy link
Contributor

@FedeDP FedeDP Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would save us lots of if m_sinsp_stats_v2 != nil checks, neat!

EDIT: on a second thought, we still have the checks, and still have to embed a "huge" struct inside the inspector + a boolean to know if stats are enabled.
I think a single unique_ptr would allow us:

  • to not waste space when not needed
  • do not use the bool at all since we already have the ptr != nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gnosek ^ you also onboard, if yes can go ahead and re-push later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FedeDP pushed a suggestion, plz let me know. It seems checking the inspector would only be needed in one place and in fact in most cases it's already de-referenced anyways before.

In addition, luckily had a pod sandbox running in the background and realized we should not be counting sandboxes as missing container images ... that's the one important metric I really would like to check on in prod, so we better get it right.

@incertum
Copy link
Contributor Author

@FedeDP thanks for answering the question if we want a flag to enable everything.

Re the null checks, some unit tests create conditions where the inspector in fact would be NULL, hence follow up question:

  • Change unit tests?
  • Check if inspector is null only sometimes?
  • Never check if inspector is null?

Since you didn't object against removing the old implementation I assume that means it's ok?

@FedeDP
Copy link
Contributor

FedeDP commented Oct 25, 2023

thanks for answering the question if we want a flag to enable everything.

Uh i totally missed that and answered it randomly 🤣

hence follow up question:

I'd say either 1 or 3; checking if inspector is null "sometimes" makes no great sense since it has a penalty and we cannot still be sure that inspector is always not null.

Since you didn't object against removing the old implementation I assume that means it's ok?

I haven't got strong opinion, i think it is ok to remove/improve it like you did.
But i am not an user of the API :)
cc @gnosek

@incertum
Copy link
Contributor Author

@FedeDP and @gnosek I added the flag in for now and kept is as simple struct as before as @gnosek suggested, but I can always change it if needed. Personally feel these stats should be as simple as possible in any regards.

Also cleaned up the git history a little bit.

Once we decide on #1433 (comment) I'll make those changes.

@@ -59,7 +59,8 @@ class sinsp_with_test_input : public ::testing::Test {
sinsp m_inspector;

void open_inspector(scap_mode_t mode = SCAP_MODE_TEST) {
m_inspector.open_test_input(m_test_data.get(), mode);
m_inspector.open_test_input(m_test_data.get(), mode);
m_inspector.set_sinsp_stats_v2_enabled(true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Setting this in the unit test only did not work.

@@ -2708,6 +2660,27 @@ void sinsp::set_proc_scan_log_interval_ms(uint64_t val)
m_proc_scan_log_interval_ms = val;
}

void sinsp::set_sinsp_stats_v2_enabled()
{
m_sinsp_stats_v2 = std::make_unique<sinsp_stats_v2>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FedeDP forgot to add checks if it's NULL in case you would call it multiple times. Will re-push, but I'll wait for your feedback first in case there is more.

@incertum
Copy link
Contributor Author

@jasondellaluce is also on board with this interim refactor that will allow us to consume the new metrics.
I created #1463 to track future API refactors we already agreed on.

One more question: Do we or can we merge this and the Falco PR this week? Else I would propose the week of Nov 13th as Jason and I will be out at KubeCon NA.

This Falco PR falcosecurity/falco#2879 also needs to be merged before the corresponding Falco PR falcosecurity/falco#2883 or I consolidate those 2 PRs in one.
In any case we should update Falco so it's not incompatible anymore with current libs master.

Copy link
Member

@Andreagit97 Andreagit97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. To be honest, I'm not fully convinced by the approach we are following with these stats, we have sinsp_stats, machine_info_stats, agent_info_stats, and resource_stats ... some in the platforms some with a shared pointer in the inspector and then populated with an external method libsinsp::stats::get_.... Probably before going on with further steps, I would like to plan a final design to understand what is the final aim because at the moment it is not clear to me

{
if (m_sinsp_stats_v2 == nullptr)
{
m_sinsp_stats_v2 = std::make_unique<sinsp_stats_v2>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want a shared or a unique pointer? if consumers want to manage these metrics probably we want to use a shared pointer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct needs to be a shared pointer exactly because of what you wrote here

uint64_t m_n_failed_thread_lookups;
uint64_t m_n_added_threads;
uint64_t m_n_removed_threads;
uint32_t m_n_missing_container_images;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why uint32_t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems large enough and uint64_t likely never needed. I doubt machines would go above 200-500 containers per host at a time anytime soon ... and let me know if we need additional comments to clarify that these container counters are not monotonic and the current view only similar to n_threads and n_fds. We could make them monotonic, but then we need to hook into the container engine differently. I know it's always a mess and believe me likely no winning ever with finding the perfect metrics.

@@ -44,6 +44,7 @@ limitations under the License.
#include "plugin_filtercheck.h"
#include "strl.h"
#include "scap-int.h"
#include "stats.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we include sinsp.h so probably we don't need stats.h also here

Comment on lines 63 to 65
m_inspector.set_sinsp_stats_v2_enabled();
// Extra call to see we don;t fail
m_inspector.set_sinsp_stats_v2_enabled();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would enable the stats directly on the tests we need, not in all tests by default

m_failed_lookups->increment();
#endif
// Extra m_inspector nullptr check
if (m_inspector != nullptr && m_inspector->m_sinsp_stats_v2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thread manager is contained into the inspector so probably we don't need this null check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the problem child condition: asan will fail if we don't check here because of TEST(sinsp_thread_manager, remove_non_existing_thread), change the unit test?

@@ -1444,14 +1444,6 @@ void sinsp_thread_manager::clear()
m_last_tid = 0;
m_last_flush_time_ns = 0;
m_n_drops = 0;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SInce we have removed all these fields at this point i would remove also m_n_drops and add it to sinsp_stats. Maybe we could use a more meaningful name like full_threadtable_drops. In this way we can handle this stat like all the others without a dedicated method get_m_n_drops

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but came at a small costs in terms of changing a g_logger log that now only fires if sinsp stats v2 are enabled.

Comment on lines 1532 to 1535
if (m_inspector->m_sinsp_stats_v2)
{
m_inspector->m_sinsp_stats_v2->m_n_added_threads++;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this to the end of the function because here it is not counting the added_threads, the table could be full. This counts the requests of adding a thread

buffer[SINSP_STATS_V2_N_FDS].value.u64 = 0;
threadinfo_map_t* threadtable = thread_manager->get_threads();
threadtable->loop([&] (sinsp_threadinfo& tinfo) {
sinsp_fdtable* fdtable = tinfo.get_fd_table();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we need a check sinsp_fdtable != nullptr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ more segfaults is the last thing we need, which makes we wonder if we could come up with a more proper project inspection in this regard as sometimes we say we trust the pointer to be never NULL and prefer to save the overhead of checking vs sometimes we definitely need to check ...

@Andreagit97
Copy link
Member

Before going on with code changes, I have some general questions not related to the code:

  • I saw that in Falco (and I imagine also in other consumers) we use the get_capture_stats method to obtain the number of drops/events. On the other hand, with stats_v2 we are using an agnostic approach, where the final consumer receives a vector of metrics already populated by sinsp. My question here is, do we want scap_stats_v2 to replace the old scap_stats? If yes, how do we obtain the specific number of drops/events from this agnostic approach? Do we want to keep these specific numbers or the final goal is to expose a set of metrics with a Prometheus endpoint?
  • If the final goal is to expose only a Prometheus endpoint using something like https://github.com/jupp0r/prometheus-cpp, are we doing the right thing by exposing a vector of struct scap_stats_v2? At this point we could directly create Prometheus structs in sinsp and expose them to the consumers, so they can add them to the registry and expose them.

WDYT?

incertum and others added 10 commits November 13, 2023 22:53
- Prepare m_sinsp_stats_v2 buffer to hold more stats and metrics than resource utilization metrics
- Expand resource utilization metrics (e.g. include overall host CPU and memory usages)

Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
- Finalize exposing valuable legacy libsinsp stats / metrics / counters related to the sinsp state tables (e.g. adding, removing, storing, failed lookup activities) to the consolidated libsinsp stats
- Add counter of currently cached containers as well
- Includes additional cleanups

Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
- Get container cache stats via hijacking the periodic container cache flush functionality
- Includes some general cleanup

Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Co-authored-by: Federico Di Pierro <nierro92@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
* make m_sinsp_stats_v2 a smart pointer initialized only when enabling sinsp_stats_v2 counters
* only keep m_inspector nullptr check for `sinsp_thread_manager::remove_thread(int64_t tid)`

Co-authored-by: Federico Di Pierro <nierro92@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Co-authored-by: Andrea Terzolo <andreaterzolo3@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
@incertum
Copy link
Contributor Author

@Andreagit97 thank you, I have moved the discussion to relevant issues.

@Andreagit97
Copy link
Member

Andreagit97 commented Nov 14, 2023

thank you, I will take a look in these days! sorry, but I'm a bit busy with k8s plugin + falco config/falco-driver-loader + various cleanup 😖 In the meanwhile if someone has other ideas please jump in 🙏

Copy link
Member

@Andreagit97 Andreagit97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana
Copy link
Contributor

poiana commented Nov 17, 2023

LGTM label has been added.

Git tree hash: b1e038d84fffa650a145d188fae3ef2244f92de1

Copy link
Contributor

@jasondellaluce jasondellaluce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana
Copy link
Contributor

poiana commented Nov 17, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Andreagit97, incertum, jasondellaluce

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Andreagit97,incertum,jasondellaluce]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana merged commit 9866af1 into falcosecurity:master Nov 17, 2023
26 checks passed
@incertum incertum deleted the sinsp-stats-v2-extension branch December 8, 2023 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants