Agent is unable to kill container when task definition memory is reached #794
Hi @dm03514, thanks for filing the issue. I'd like a little more information before I can help, though. Could you add the output of
Thanks,
Emailing you the task definition. Thank you!
Thanks for the additional info @dm03514. Can you confirm a few suspicions I have?
Assuming those are both true, what you're seeing is related to #124 (comment). Docker's --memory flag, and the associated API, default to configuring swap memory equal to the amount requested in the flag.
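To make that default concrete, here is a small sketch (assuming Docker's documented behavior of allowing swap equal to the memory limit when only `--memory` is given; the 256m value and image name are arbitrary):

```shell
# Assumption: with only --memory set, Docker allows swap equal to the
# memory limit, so total memory+swap defaults to twice the --memory value.
mem_limit_mb=256
default_total_mb=$((mem_limit_mb * 2))   # memory + equal swap
echo "memory=${mem_limit_mb}m total memory+swap=${default_total_mb}m"

# To forbid swap for the container entirely, pass both flags with the
# same value (hypothetical invocation):
#   docker run --memory 256m --memory-swap 256m myimage
```

Setting `--memory-swap` equal to `--memory` rules swap out as an explanation for memory-limit weirdness.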
Thank you, I'll check first thing tomorrow morning (EST).
It looks like we do not have swap enabled, based on:
And IO burst balance was observed to be dropping, per the person helping us with AWS support.
Thank you for the info. I haven't been able to reproduce this on my end, but if you have repro steps, please let me know. Could you send me the container instance arn, docker logs, and ecs-agent logs on an instance where this is happening? You can use the ECS Logs Collector to grab the logs as well as some helpful system logs.
We're debugging the exact same issue here, but I believe the issue lies with the kernel, not the ECS agent or even Docker (the OOM killer lives in the kernel). A very basic container (we've seen it with a few different Node.js-based apps and one collectd container) reaches its memory limit, is observed to sit between 99.9% and 100% of the limit, and starts chewing through IO reads on the Docker volume, which eventually exhausts our burst balance, at which point the host (and other workloads) become pretty unhappy. The container may or may not eventually be OOM-killed, but not as soon as one would expect. In one case I directly observed,
A few things that seem relevant to note:
@dm03514 if we figure it out I'll make sure you hear about it; would appreciate the same!
@bobzoller absolutely, I was afraid the problem was going to be in the OS :( Debugging those sorts of issues is pretty far beyond my experience level. Have you happened to have any success with any different kernel versions? :p Looking for the easy way out :)
@swettk and I spent more time with this today and have a plausible theory: as the container approaches its memory limit, it causes major page faults. This could be why we see high reads but almost no writes, and why they are reads from the Docker disk, not the swap disk. The container eventually crosses the actual cgroup memory limit and will then get OOM-killed, but while it hangs out at that boundary you may end up thrashing your disk. We're personally planning to investigate:
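To make the theory above concrete, here is a minimal sketch of where the major-page-fault counter lives. The sample `memory.stat` contents below are made up; on a real host of that era the file sits under the container's memory cgroup, e.g. `/cgroup/memory/docker/<id>/memory.stat`:

```shell
# Illustrative memory.stat contents; real files have many more fields.
stat_file=$(mktemp)
cat > "$stat_file" <<'EOF'
cache 1048576
rss 6291456
pgmajfault 42
total_pgmajfault 123456
EOF

# A container thrashing at its memory limit shows this counter climbing
# rapidly; a healthy one barely moves it.
faults=$(awk '$1 == "total_pgmajfault" { print $2 }' "$stat_file")
echo "total_pgmajfault=$faults"
rm -f "$stat_file"
```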
@bobzoller and/or @dm03514 I'd like to add one more avenue of investigation. Try increasing your task's memory limit a bit. The page faults you're seeing are likely due to the page cache being partially flushed to free up more process memory. This, in turn, causes your application to need to re-read portions of itself (or its dependencies) from disk. Depending on your application's structure, this can cause a fairly tight feedback loop where for any unit of work to proceed, lots of disk IO will occur.
Correct, @jhaynes. As I said, major page faults are absolutely the problem, and increasing the memory limit will resolve it until you bump up against the limit again. Since we're striving for container isolation and protecting the health of the host, we chose to write a simple reaper that runs on every ECS instance and stops containers that have crossed a major page fault threshold we chose based on our environment (happy containers might cause 300/day, while sad containers can rack up hundreds of thousands within a few minutes). Running it every minute using cron has been effective: these containers are now killed off within 60 seconds of starting to thrash the disk, and the host recovers without intervention. ECS reschedules the container if necessary, and we notify the responsible engineer so they can investigate later. 👌 Our script looks something like this:

```sh
#!/bin/sh
# don't kill containers using these images even if they're misbehaving
EXCLUDES_PATTERN=$(cat <<'EOF' | xargs | sed 's/ /|/g'
amazon/amazon-ecs-agent
EOF
)

# list all the candidate containers
targets=$(docker ps --no-trunc --format '{{.ID}} {{.Image}}' | grep -Ev "$EXCLUDES_PATTERN" | awk '{ print $1; }' | xargs)

for target in $targets; do
  cd "/cgroup/memory/docker/$target" || exit
  info="id=$target $(docker inspect --format 'image={{.Config.Image}} StartedAt="{{.State.StartedAt}}"' "$target") pgmajfault=$(grep total_pgmajfault memory.stat | awk '{print $2;}')"
  value=$(echo "$info" | awk '{ print $4;}' | sed 's/pgmajfault=//g')
  if [ "$value" -gt 10000 ]; then
    echo "Executing docker stop on container due to $value major page faults ($info)"
    docker stop "$target" &
  fi
  cd - || exit
done
wait
```

HTH!
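For the "every minute using cron" part, the wiring might look like this (the install path, file names, and log location are hypothetical, not from the original comment):

```shell
# Hypothetical crontab entry; adjust the path to wherever the reaper
# script actually lives on your ECS instances.
cron_line='* * * * * root /usr/local/bin/pgmajfault-reaper.sh >> /var/log/pgmajfault-reaper.log 2>&1'
echo "$cron_line"   # e.g. append to /etc/crontab or drop into /etc/cron.d/
```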
@dm03514 I'm inclined to close this since it isn't directly related to an ECS issue. However, if you or @bobzoller wind up with other questions or issues, feel free to open bugs here or engage directly with AWS Support.
@jhaynes @bobzoller Just ran into this issue myself and am wondering whether an "out-of-agent" cron job is still the recommended course of action?
We still run our cron job "reaper" just in case, but since moving off Amazon Linux onto Ubuntu we haven't seen a single occurrence. I'd assume this has more to do with kernel version and less to do with distro, but I can't tell you for sure. FWIW we're currently running kernel
@bobzoller thanks. I am seeing this on an
Thanks @bobzoller for the wonderful script...

```sh
#!/bin/bash -e
##
# Use this annotated script as a base for killing containers misbehaving on
# reaching their memory limit
#
# Requirements:
# - `jq` must be installed on the ECS machine
##

# don't kill containers using these images even if they're misbehaving
EXCLUDES_PATTERN=$(cat <<'EOF' | xargs | sed 's/ /|/g'
amazon/amazon-ecs-agent
EOF
)

# list all the candidate containers
targets=$(docker ps --no-trunc --format '{{.ID}} {{.Image}}' | grep -Ev "$EXCLUDES_PATTERN" | awk '{ print $1; }' | xargs)

for target in $targets; do
  # get taskId and dockerId from the ECS agent introspection API
  task=$(curl -s "http://localhost:51678/v1/tasks?dockerid=$target")
  taskId=$(echo "$task" | jq -r ".Arn" | cut -d "/" -f 2)
  dockerId=$(echo "$task" | jq -r ".Containers[0].DockerId")
  memoryStatsFile="/cgroup/memory/ecs/$taskId/$dockerId/memory.stat"

  # skip current target if we cannot find the memory stats file;
  # it might not be managed by ECS
  if ! [ -f "$memoryStatsFile" ]; then
    echo "Memory stats not found for taskId=$taskId dockerId=$dockerId" && continue
  fi

  info="id=$target $(docker inspect --format 'image={{.Config.Image}} StartedAt="{{.State.StartedAt}}"' "$target") pgmajfault=$(grep total_pgmajfault "$memoryStatsFile" | awk '{print $2;}')"
  majorPageFaults=$(echo "$info" | awk '{ print $4;}' | sed 's/pgmajfault=//g')
  if [ "$majorPageFaults" -gt 5000 ]; then
    echo "Stopping container due to major page faults exceeding threshold ($info)"
    docker stop "$target"
  fi
done
```
We are also having the same problem on
Besides the reaper cron, has anyone found a reasonable solution?
amzn-ami-2017.09.i-amazon-ecs-optimized is still affected by the issue. |
We hit this issue ourselves when someone configured to little memory to a task. I think one part of the problem is that the container never reaches its memory limit. I tested this by giving a container that requires 128MB RAM just to start, only 8 MB. The container (according to quay.io/vektorlab/ctop, My biggest annoyance with this, is that it is really hard to detect. I could use the scripte provided by vikalpj, and log the output to a log group in CloudWatch, and trigger an alarm on new events. But that is not my expectations of a the ECS product, I expect it to kill the container and inform me why. Now it just trashed the disk. @jhaynes, are you open to re-opent this issue, or look at alternatives to log this with ecs-agent ? |
Yes, all the fs cache has disappeared, but the application is not "out of memory". The application allocated only 6 MB, but when the kernel needs to access the code of the application, it is not available in memory, so it has to read it from disk. As if it were running 100% on swap, except for the heap memory segment. The workaround I have is to configure ECS tasks with a memory reservation (aka "soft" limit) big enough to fit the process image and all the files needed by the application. Then you hope that your application will never break the limit, or that if it does, it will be a big allocation that breaks the limit all at once, before any disk thrashing occurs, allowing the OOM killer to destroy your process. Obviously you have to spend some time reading docker stats for your workload. And if your application leaks slowly, you will hit the problem again and again. Maybe some fine tuning on sys/vm could fix or seriously alleviate the issue, but I would like to have the official ECS AMI configured with a correct setting.
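The soft-limit workaround described above corresponds to the `memoryReservation` field in an ECS container definition. A fragment might look like this (the name, image, and values are illustrative; `memory` is the hard limit in MiB, `memoryReservation` the soft one):

```json
{
  "name": "app",
  "image": "example/app:latest",
  "memory": 512,
  "memoryReservation": 384
}
```

The gap between the two values is the headroom you are betting your process image and working files fit into.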
This issue, or one very similar to it, appears to still be present (hello from the tail end of 2019). Is there any official documentation on how to approach it, as it appears to have been closed intentionally as not-fixed?
Memory limit enforcement is carried out by the host operating system's cgroups and OOM killer. Just as we don't expect the whole OS to shut down because one of the processes it runs has eaten up the memory(*), we shouldn't expect that of containers. In fact, they are not much more than processes running on a system. What we usually observe is the OOM killer ending the processes that caused the exhaustion. In the case of containers that, per good practice, contain only one process, a kill by the OOM killer has the effect of terminating the container, as this particular PID 1 process and the container are the same thing. The problem begins when an additional manager is introduced in the container, whether by forking additional processes or by using tools like supervisord, systemd, etc. Here's an example with plain Docker:
The container allocates almost all of the available memory. CPU usage and disk reads are high (the lack of memory means the app and library files can't be kept in the page cache and are constantly reread for execution).
At the same time, the oom killer tries to end processes that caused the exhaustion:
which are then spawned again and again, per the stress-ng docs: "If the out of memory killer (OOM) on Linux kills the worker or the allocation fails then the allocating process starts all over again." Many apps behave similarly; what else could they do to keep working if some of their workers have been stopped or killed? Theoretically, even if enforcing limits were the ECS agent's responsibility, as a result of the OS's intervention memory usage stays below the given threshold, so the agent wouldn't be able to take any action. How to approach that?
(*) For non-containerized systems, this is actually possible with the kernel setting vm.panic_on_oom = 1.
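For completeness, that kernel setting would be applied like this on a non-containerized host (a config fragment only; as noted above, it is not a fix for the container case):

```
# /etc/sysctl.d/99-panic-on-oom.conf  (apply with: sysctl --system)
vm.panic_on_oom = 1
```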
I am running a container, and when the hard task memory limit is reached it is not killed. In addition to not dying, it begins to do a large amount of docker.io.read_bytes (observed from the ECS Datadog integration).

Agent version: 1.14.1

Stats show that the container ID frequently reaches 100% memory, and shows BLOCK I/O perpetually increasing (the application should only be using block I/O to read a configuration file during startup).

The container remains up:

Sometimes the agent IS able to kill the container after 10-20 minutes:

Also, once the container is in a 100% state, if I try to exec -it <container_id> /bin/bash it will hang for a while and then register the SIGKILL, almost like it finally recognizes SIGKILL only after I exec.

The daemonization feature and auto restart are critical to keeping resource-depletion failures from taking down other services, and I would really appreciate any insight possible.

Thank you