Async container info #1326

Merged: mstemm merged 15 commits into dev on Apr 19, 2019

Conversation

@mstemm (Contributor) commented Mar 7, 2019

Changes to stop looking up docker container metadata inline with event processing and perform the lookups in the background instead. A new event PPME_CONTAINER_JSON_E (*) signals when the metadata lookup is complete and full information about the container exists.

While the metadata lookup is underway, a "stub" container is created with name="incomplete" and image="incomplete". This ensures that there is a valid container object for processes running in a container. The obvious downside is that, while the lookup is underway, nothing is known about the container that initial activity is running in, other than the container id.

This fixes #1321.

(*) The event previously existed, but was marked EC_INTERNAL so it was hidden and never returned by the inspector. It was used to save container information to trace files.
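
To make the flow concrete, here is a rough sketch of the stub-container step described above (field names follow sinsp_container_info as it appears elsewhere in this PR; add_container and the scheduling call are simplified stand-ins, not the exact code):

// Sketch only: illustrates the "incomplete" stub described above.
sinsp_container_info stub;
stub.m_id = container_id;          // the only thing known up front
stub.m_type = CT_DOCKER;           // docker runs first; cri may replace this later
stub.m_name = "incomplete";        // placeholder values until metadata arrives
stub.m_image = "incomplete";
stub.m_metadata_complete = false;  // flipped back to true when the async lookup finishes

// Register the stub so threads in the container resolve to a valid object,
// then kick off the background metadata fetch (hypothetical helper call).
manager->add_container(stub, tinfo);
schedule_async_lookup(container_id);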

@mstemm (Contributor, Author) commented Mar 7, 2019

I think this is in pretty good shape, although I still want to do some perf tests. Could you take a look?

@mstemm (Contributor, Author) commented Mar 8, 2019

Ok, could you take another look? One pending source of failures on OSX is tbb. I could just protect all of that with HAS_CAPTURE, but I'd like to avoid it if possible.

@adalton (Contributor) left a comment:

The only remaining comment is non-blocking.

@mstemm (Contributor, Author) commented Mar 11, 2019

Any opinions on getting rid of m_enabled = false on the first failed container metadata lookup? I think it's more robust this way: in busy environments I found it was pretty easy to hit a single lookup failure, which would effectively disable all future lookups.

@@ -195,7 +156,6 @@ docker::docker_response libsinsp::container_engine::docker::get_docker(sinsp_con
{
case 0: /* connection failed, apparently */
g_logger.format(sinsp_logger::SEV_NOTICE, "Docker connection failed, disabling Docker container engine");
m_enabled = false;
Contributor:

where does this happen now? or do we keep trying to hit docker forever now?

@mstemm (Author):

Yeah that was my last general comment on the PR. We always try to fetch forever now. The general consensus was that this was better than giving up forever after the first failure.

/* PPME_CONTAINER_JSON_E */{"container", EC_INTERNAL, EF_SKIPPARSERESET | EF_MODIFIES_STATE, 1, {{"json", PT_CHARBUF, PF_NA} } },
/* PPME_CONTAINER_JSON_X */{"container", EC_INTERNAL, EF_UNUSED, 0},
/* PPME_CONTAINER_JSON_E */{"container", EC_PROCESS, EF_MODIFIES_STATE, 1, {{"json", PT_CHARBUF, PF_NA} } },
/* PPME_CONTAINER_JSON_X */{"container", EC_PROCESS, EF_UNUSED, 0},
Contributor:

Why the flags change? (not saying it's unneeded, trying to understand)

@mstemm (Author):

EC_INTERNAL means they aren't returned from sinsp::next(), but we rely on the container event now to indicate that a container has been created. EF_SKIPPARSERESET means the event would be skipped by falco and also that the thread state isn't properly created in sinsp_parser::reset(), but we want the thread info for the event.

return Json::FastWriter().write(obj);
}

bool sinsp_container_manager::container_to_sinsp_event(const string& json, sinsp_evt* evt)
bool sinsp_container_manager::container_to_sinsp_event(const string& json, sinsp_evt* evt, int64_t tid)
Contributor:

aren't tids unsigned?

container_info.m_type = s_cri_runtime_type;
// It might also be a docker container, so set the type
// to UNKNOWN for now.
container_info.m_type = CT_UNKNOWN;
Contributor:

What happens when both docker and cri fail to return metadata for a container? Are we stuck with CT_UNKNOWN?

@mstemm (Author):

Yes. I suppose we could have a combined type or just pick one of them, but I think it'd be better to have this explicit type for when we don't know the metadata.

@mstemm (Author) commented Mar 28, 2019:

@gnosek and I chatted offline and we're going to get rid of CT_UNKNOWN and rely on m_metadata_complete to internally denote that container info is incomplete. Since docker runs first, it will set the type to CT_DOCKER. If cri looks up the info successfully, it will set the type properly. If both fail, the type will remain at CT_DOCKER, which seems as good as you can do given that the cgroup layouts are the same.

// use this to ensure that the tid of the CONTAINER_JSON event
// we eventually emit is the top running thread in the
// container.
typedef tbb::concurrent_hash_map<std::string, int64_t> top_tid_table;
Contributor:

Why do we need to track the top threadinfo (or rather, why do we need a special thread chosen from the container at all)?

What happens if we fail to find an event from the toppest-of-top tids during the async lookup (and presumably use a child tid instead)?

@mstemm (Author):

We track the threadinfo so the container event has some thread info associated with it. And we need to update this on the fly as it's very common that the initial process that created the container exits once the actual entrypoint is started, so without any extra steps you'd have a container event with a zombie threadinfo.

It's not the end of the world if it's not exactly the top thread, but it is useful for it to be some thread in the container.
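
For reference, a minimal, self-contained sketch of how a tbb::concurrent_hash_map can track the current "top tid" per container (illustrative only; update_top_tid/lookup_top_tid are hypothetical helpers, not the PR's exact code):

#include <cstdint>
#include <string>
#include <tbb/concurrent_hash_map.h>

typedef tbb::concurrent_hash_map<std::string, int64_t> top_tid_table;

// Record (or overwrite) the best-known tid for a container. The accessor holds
// a per-bucket lock, so concurrent updates from the capture and async lookup
// threads are safe.
void update_top_tid(top_tid_table& table, const std::string& container_id, int64_t tid)
{
	top_tid_table::accessor acc;
	table.insert(acc, container_id);  // inserts a default value if the key is new
	acc->second = tid;
}

// Find the tid to attach to the eventual CONTAINER_JSON event; -1 means no
// thread is known for this container.
int64_t lookup_top_tid(const top_tid_table& table, const std::string& container_id)
{
	top_tid_table::const_accessor acc;
	return table.find(acc, container_id) ? acc->second : -1;
}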

Contributor:

I'm still not convinced we need a tid in the container event. Where do we use it?

@mattpag (Contributor) left a comment:

Since we aren't doing any filtering of containerinfos with incomplete metadata in the container.* filterchecks code, use cases like sysdig 'container.id != host' -p "%container.image" are now returning incomplete until the fetching is completed.
Can we go ahead and do it? Or have you already thought about this and something worse happens if we do?

This also means that Falco rules with "negative predicates" (!=, not in, ...) based on most container fields will misbehave unless we address it in some way, right?

@@ -134,7 +134,8 @@ class sinsp_container_info
m_cpu_period(100000),
m_has_healthcheck(false),
m_healthcheck_exe(""),
m_is_pod_sandbox(false)
m_is_pod_sandbox(false),
m_metadata_complete(true)
@mattpag (Contributor) commented Apr 1, 2019:

Why isn't the default value false? Can't we set it to true immediately if the metadata fetch is synchronous and when we have all the data for the async case? (this is also related to the "top-level" PR review comment)

@mstemm (Author):

I kept the default at true so container engines that don't have any notion of background metadata lookups can just create a container_info and know the metadata will be flagged as complete. It's only docker (and later cri) that will set the flag to false and then true when the metadata is all fetched.
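
In other words, the intended lifecycle is roughly the following (a sketch, not the exact PR code):

// Engines with synchronous lookups just use the default:
sinsp_container_info info;             // m_metadata_complete == true by default

// The docker (and later cri) engine flips it around the async fetch:
stub.m_metadata_complete = false;      // when the "incomplete" stub is created
// ... background lookup runs ...
full_info.m_metadata_complete = true;  // when the full metadata is stored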

@mstemm (Contributor, Author) commented Apr 1, 2019

About the container info with incomplete metadata: I'll check, but I thought the way I had it set up, the container event (i.e. an event that can be returned via sinsp::next()) is only emitted when the info is complete. The container manager does create a container info object with image, etc. == incomplete, but I intended that no sinsp event is emitted for it.

Yeah, double-checked with the following: sudo ./userspace/sysdig/sysdig evt.type=container while running docker run busybox:latest. Only got 1 container event:

56033 16:37:10.038194346 0 <NA> (3091) > container json={"container":{"Mounts":[],"cpu_period":100000,"cpu_quota":0,"cpu_shares":1024,"env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"id":"59447b0e4b9e","image":"busybox:latest","imagedigest":"sha256:061ca9704a714ee3e8b80523ec720c64f6209ad3f97c0ff7cb9ec7d19f15149f","imageid":"d8233ab899d419c58cf3634c0df54ff5d8acc28f8173f09c21df4a07229e1205","imagerepo":"busybox","imagetag":"latest","ip":"172.17.0.2","is_pod_sandbox":false,"labels":null,"memory_limit":0,"name":"determined_mclean","port_mappings":[],"privileged":false,"swap_limit":0,"type":0}}

@mattpag (Contributor) commented Apr 2, 2019

Yes the container event is emitted only when the metadata fetching is completed, but that's not what I'm talking about.

Since the container_info is available from the container_manager as soon as we start the async fetch, container filterchecks will be able to fetch it and use it but will return wrong results.
Just try the example command I pasted above and you'll see that incomplete is returned for all the syscalls processed while the async fetch is still in progress (since the filtercheck handling code will pick up the "placeholder" container_info), then when it completes you'll start seeing the correct image name.

And let me reiterate: this means that Falco rules with "negative predicates" (!=, not in, ...) based on most container fields will misbehave unless we address it in some way, because you'll have false positives (if we return incomplete for the first few hundred syscalls, a rule like container.image not in (foo, bar) will incorrectly trigger for all of them).

@mstemm (Contributor, Author) commented Apr 2, 2019

Ok, here's a falco rules change that should make the container_started macro only trigger when there is complete metadata: falcosecurity/falco#570.

const Json::Value& mesos_task_id = container["mesos_task_id"];
if(!mesos_task_id.isNull() && mesos_task_id.isConvertibleTo(Json::stringValue))
{
container_info.m_mesos_task_id = mesos_task_id.asString();
}

#ifdef HAS_ANALYZER
Contributor:

nit: can we drop the #ifdef? I'd hope we eventually remove the flag completely

@mstemm (Author):

Sure, I'll remove it, but I'll remove it around all references of m_metadata_deadline, not just this one, right?

Commits

Add a non-blocking queue to the inspector, and in sinsp::next() try to read events from that queue, similar to how we look at m_metaevt. If any events are found, they are returned from sinsp::next() instead of being read from libscap.

In order to support truly portable sinsp_evt objects, we need (optional) storage for the matching libscap event in m_pevt. This is m_pevt_storage, which is NULL by default and freed when non-NULL. It will be used to pass container events to the inspector via the queue.
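
A self-contained sketch of the non-blocking queue pattern this commit describes (using tbb::concurrent_queue; the struct and names are illustrative, not the inspector's actual members):

#include <cstdio>
#include <string>
#include <tbb/concurrent_queue.h>

// Stand-in for a queued container event; in the real code the queued object
// carries its own copy of the scap event buffer (m_pevt_storage) so it stays
// valid until sinsp::next() hands it out.
struct pending_container_evt { std::string json; };

int main()
{
	tbb::concurrent_queue<pending_container_evt*> pending;

	// Producer side (async lookup callback): push a fully formed event.
	pending.push(new pending_container_evt{"{\"container\":{}}"});

	// Consumer side (sinsp::next()): drain the queue without blocking before
	// falling back to the normal libscap read path.
	pending_container_evt* evt = nullptr;
	while(pending.try_pop(evt))
	{
		std::printf("dispatching container event: %s\n", evt->json.c_str());
		delete evt;
	}
	return 0;
}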
Make sure container_json events have a real thread id by passing it as an argument to container_to_sinsp_event.

Modify notify_new_container to use the non-blocking queue to pass events to the inspector instead of m_meta_evt.

Previously, CONTAINER_JSON events had some, but not all, of the info that was in a sinsp_container_info object. Fix this so all important info is both dumped to json in container_to_json and parsed in parse_container_json_evt.

Refactor docker metadata fetches to be asynchronous, using the framework
in sysdig::async_key_value_source.

docker_async_source (a global static, so it can be long-lived) is now responsible for looking up docker metadata. Some methods that used to be in docker::, like parse_docker() and get_docker(), move to docker_async_source. It dequeues requests, calls parse_docker() to look up the information as needed, and calls store_value() to pass the metadata back to the caller.
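
Roughly, the async_key_value_source pattern this describes looks like the following (a sketch based on the description above; the framework's exact signatures live in sysdig's async_key_value_source header and may differ in detail, and the base-class constructor arguments are omitted):

// Sketch only: a docker_async_source along the lines described above.
class docker_async_source : public sysdig::async_key_value_source<std::string, sinsp_container_info>
{
protected:
	// Runs on the background thread managed by the framework.
	void run_impl() override
	{
		std::string container_id;

		// Pull queued lookup requests and resolve each one off the
		// event-processing path (dequeue method name approximate).
		while(dequeue_next_key(container_id))
		{
			sinsp_container_info info;
			info.m_id = container_id;

			// Blocking HTTP fetch + JSON parse, now safely off the hot path
			// (argument list simplified here).
			parse_docker(container_id, info);

			// Hand the result back to the framework; this triggers the caller's
			// callback, which ultimately emits the PPME_CONTAINER_JSON_E event.
			store_value(container_id, info);
		}
	}
};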

docker::resolve now uses parse_docker_async to schedule the lookup with docker_async_source. Before scheduling the lookup it creates a stub container with type UNKNOWN (UNKNOWN because you can't tell the difference between cri and docker containers from the cgroup alone), with the id set and the name/image set to "incomplete". This ensures that threads have some associated container info object. In the callback, once the full metadata is available, it calls notify_new_container, which creates the CONTAINER_JSON event and pushes it to the inspector.

There's a fair amount of bookkeeping to make sure that the container metadata has a valid tid. The very first tid that creates the container often exits after forking off the real container entrypoint, so you need to keep track of the "top tid" in the container for every call to docker::resolve() and replace it if you find it has exited.

Previously, on the first error fetching container metadata, a flag
m_enabled would be set to false, and all subsequent attempts to fetch
container metadata would be skipped. Now that lookups are done in the
background, I think it makes sense to always try a lookup for every
container, even after failures. So remove m_enabled entirely from the
docker engine.

Also, as a part of this change, reorganize the way the async lookups are done to better support the different OS and build choices (Linux, Windows, and HAS_CAPTURE). Instead of compiling out the curl handles when HAS_CAPTURE is false, always compile the code for them but don't even attempt any lookups in docker::resolve. Note that docker_async_source::get_docker no longer changes behavior based on HAS_CAPTURE.

cri::resolve might be called after docker::resolve, so there could be a container_info object with type == UNKNOWN.

Update it to handle this, i.e. do the lookup anyway.

We need this now that tbb is a core part of the inspector and the curl part of the docker container engine isn't #ifdef'd with HAS_CAPTURE.

Both are already downloaded/built as dependencies of sysdig, so it's just a matter of CMakeLists.txt config.

Also, disable libidn2 explicitly when building libcurl. On OSX builds this gets picked up from an autodetected library, while it does not get picked up on Linux. We already disable libidn1, so also disable libidn2.

Instead of inferring is_pod_sandbox from the container name (which was only done for docker), save/load it to json directly from the container info. This ensures it works for container engines other than docker.

Instead of having a CT_UNKNOWN type, initially set the type to CT_DOCKER when starting the async lookup. Cri runs next, and if the grpc lookup completes successfully this will be replaced with CT_CRIO. If both the grpc and docker metadata lookups fail, the type will remain at CT_DOCKER.

It's a bit simpler than including sinsp.h, which has many, many dependent header files.

It's required for the container.* filterchecks to work.

Call detect_docker, which identifies a docker container, before starting the async lookup thread. If no docker containers are ever used, the async thread won't be started.

Instead of just setting the image to incomplete, set all container metadata fields to incomplete.

It's filled in when using the analyzer, but we can define it in both cases and just not use it when there is no analyzer.

@mstemm merged commit 199d5a6 into dev on Apr 19, 2019
@mstemm deleted the async-container-info branch on April 19, 2019 01:25
@davideschiera restored the async-container-info branch on May 23, 2019 03:30

Closes: Docker container metadata fetching is blocking, which can result in dropped system calls (#1321)