
Processing with a high memory load is forcibly terminated with "shim reaped" #2202

Closed
yohiram opened this issue Mar 12, 2018 · 19 comments

Comments

@yohiram

yohiram commented Mar 12, 2018

Hello. I'm sorry for my poor English.

I run a TensorFlow program in Docker, and it is forcibly terminated with "shim reaped".

The processing has a high memory load, but as far as the output of the top command is concerned, there is plenty of free memory. My server has 256 GB of RAM, and the program uses at most 600 MB - 4 GB while it runs. The program runs without problems when it is not run in Docker. I have tried most of the memory-related docker run options, and I have tried the devicemapper, overlay, and overlay2 storage drivers.

Could you tell me whether this is caused by how I am using Docker, or whether it is a known issue that is planned to be fixed?


BUG REPORT INFORMATION


It occurs when executing high-load TensorFlow processing (train()) in Docker.

Describe the results you received:

In the console: error 137

In syslog (using overlay2):

Mar 12 13:02:44 gpu1 dockerd[9951]: time="2018-03-12T13:02:44+09:00" level=info msg="shim reaped" id=f24980703b5d13089200d6bd84bf4c8648a9d4216633e5da8ecd237f0ff6e0bb module="containerd/tasks"
Mar 12 13:02:44 gpu1 dockerd[9951]: time="2018-03-12T13:02:44.340010941+09:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 12 13:03:05 gpu1 sshd[13321]: Set /proc/self/oom_score_adj to 0

Describe the results you expected:

I want to finish processing without error.

Output of containerd --version:

containerd github.com/containerd/containerd v1.0.2 cfd04396dc68220d1cecbe686a6cc3aa5ce3667c

Output of sudo docker info:

Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 1
Server Version: 18.02.0-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c (expected: 9b55aab90508bd389d7654c4baf173a981477d55)
runc version: N/A (expected: 9f9c96235cc97674e935002fc3d78361b696a69e)
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.1.50
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 251.8GiB
Name: gpu1.dssci.ssk.yahoo.co.jp
ID: 47CX:FHBR:53ZM:IH3N:ZUEC:256P:D76K:QDSV:OHJ4:R4QR:JIWA:Z5EX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
@kunalkushwaha
Contributor

I tried to reproduce this behavior on my local machine, but couldn't.
I didn't use TensorFlow; instead I used a simple program that allocates memory. I tried up to 5 GB, and it worked fine.

A full docker daemon log may help.
You can try to confirm whether it is a memory issue by running the following:

$ docker run -it kunalkushwaha/high-mem-allocator 4

The above container will allocate 4 GB of memory and wait for the Enter key before releasing it.

Also, for your container, you can try --oom-kill-disable and see whether an OOM kill still happens.
More options are documented at https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory
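
For reference, a memory limit and the OOM-kill switch can be combined in a single run roughly as follows (the image is the one from the command above; the limit values are illustrative, not the reporter's actual settings):

$ docker run -it --memory=8g --memory-swap=8g --oom-kill-disable kunalkushwaha/high-mem-allocator 4

With a memory limit set and --oom-kill-disable in place, a process that exceeds the limit hangs instead of being killed, which makes it easier to tell whether the kernel OOM killer is what is terminating the container.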

@stevvooe
Member

@yohiram In general, I would recommend troubleshooting this in https://github.com/moby/moby. There may be problems in docker contributing to the killing of the process.

In general, a "shim reaped" message simply means that the shim process was collected properly after an exit. This doesn't necessarily indicate a defect.

Other things you can try include updating to a newer kernel, if possible. If that helps, let us know and we'll see what we can do.

@yohiram
Author

yohiram commented Mar 13, 2018

@kunalkushwaha Thank you for your reply. I tried running your docker image several times, but it did not reproduce my problem. Thank you for making the debugging tool!

Using "journalctl -u docker.service" log
'''
Mar 13 12:16:14 gpu1.dssci.ssk.yahoo.co.jp dockerd[9951]: time="2018-03-13T12:16:14+09:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/41203fce6a0b160af14466e06f310ce1831d87541ff96ab0bd9d71c906955849/shim.sock" debug=false module="containerd/tasks" pid=14517
'''
When the error occurs.
'''
Mar 13 12:10:53 gpu1.dssci.ssk.yahoo.co.jp dockerd[9951]: time="2018-03-13T12:10:53+09:00" level=info msg="shim reaped" id=aadc71bc6c8b5b5168a9d79074d94e0146adee65f4013dd9524b43b80bbd7234 module="containerd/tasks"
Mar 13 12:10:53 gpu1.dssci.ssk.yahoo.co.jp dockerd[9951]: time="2018-03-13T12:10:53.778550447+09:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
'''
No other logs are displayed at run time, so I'm confused.

I also tried the --oom-kill-disable option, but it didn't help for some reason.
I tried various other --memory options, etc., but they didn't help either.

However, when the --kernel-memory option is specified, it seems to take longer before the error occurs than in other cases.
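
For context, a kernel-memory limit is passed to docker run roughly as follows (the image name and values below are illustrative placeholders, not the command actually used):

$ docker run -it --kernel-memory=4g --memory=8g my-tensorflow-image

--kernel-memory caps kernel-side allocations (slab, socket buffers, and so on) charged to the container's cgroup, separately from the user-space memory governed by --memory.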

@yohiram
Author

yohiram commented Mar 13, 2018

@stevvooe Thank you for your reply.
I tried it with two kernel versions; the problem has not been resolved so far.
It may be better to try it with other versions as well.

CentOS Linux (4.1.50) 7 (Core)
CentOS Linux (3.10.0-693.5.2.el7.x86_64) 7 (Core)

I'll try to investigate further via moby/moby. Thank you for the pointer!

@yohiram
Author

yohiram commented Mar 13, 2018

Even if this doesn't get fixed, it won't cause fatal problems for my work, so the investigation may take some time.
You don't need to spend time on this problem. Thank you for taking your valuable time!

@crosbymichael
Member

You should be able to use sudo journalctl -k to see whether it really is the OOM killer killing your task. That could help pinpoint the reason; you can also look in your application's logs for why it is being killed.

The logs you are seeing from docker/containerd are standard log messages for when a task is killed; there is nothing important in them. Also, could this be something with GPU memory/resources and nothing to do with system RAM?
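
One way to check for OOM-killer activity is to filter the kernel log; the exact message wording varies between kernel versions, so treat the pattern below as a rough sketch:

$ sudo journalctl -k | grep -iE 'out of memory|oom|killed process'

If nothing matches around the time the container exits, the kernel OOM killer is probably not what is sending the SIGKILL.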

@yohiram
Author

yohiram commented Mar 20, 2018

Thanks for your advice. I tried it immediately.

Mar 20 16:18:13 *** kernel: docker0: port 1(veth145d1b6) entered disabled state
Mar 20 16:18:13 *** kernel: vethde1e5ed: renamed from eth0
Mar 20 16:18:13 *** kernel: docker0: port 1(veth145d1b6) entered disabled state
Mar 20 16:18:13 *** kernel: device veth145d1b6 left promiscuous mode
Mar 20 16:18:13 *** kernel: docker0: port 1(veth145d1b6) entered disabled state

In particular, no abnormal message seems to be displayed. This phenomenon happens even with CPU only...
I can't yet rule out RAM as the cause.

@kunalkushwaha
Contributor

Have you enabled debug mode in the docker daemon?

@yohiram
Author

yohiram commented Mar 20, 2018

Yes, I have enabled the debug mode.

@yohiram
Author

yohiram commented Mar 20, 2018

I'm sorry to act on my own, but I'm closing this issue.
Since this problem occurs only in my own environment, no one else seems to be running into it.
And I don't absolutely need to solve this problem. Thanks for the kind advice!

@yohiram yohiram closed this as completed Mar 20, 2018
@trumpetlicks

I have to admit, I wouldn't mind seeing this reopened, as I am having the same problem with exactly the same output from journalctl.

The problem seems intermittent: sometimes the software inside the container runs for hours before being forcefully restarted, and other times for less than half an hour. @yohiram - did you ever find an answer, or any better understanding of what is going on?

@pbelskiy

pbelskiy commented Aug 2, 2019

I have the same output in journalctl. My JNLP slave for Jenkins dies spontaneously when many builds are created and aborted in a short period of time.

And I cannot reproduce this if I disable the network-related config in docker-compose.yml.

@filipsuk

I am facing the same issue in a Google Cloud Compute Docker environment with high-memory Node.js tasks. Have you found a solution?

@crosbymichael
Member

For those of you who are still encountering this issue, can you please open a new issue with information for us to debug and reproduce what you are seeing?

@jwongz

jwongz commented Oct 24, 2019

I am facing the same issue on CentOS (4.1.0-28.el7), Docker version 18.03-ce.

@htfy96

htfy96 commented Apr 20, 2020

We also encountered a similar issue on 19.03.8-ce on Fedora 31. For fellow subscribers, some issues in moby have the same error message: https://github.com/moby/moby/issues?q=is%3Aissue+is%3Aopen+shim+reaped . It may be worth taking a look.

@tsu1980

tsu1980 commented May 13, 2020

In my case, this was caused by a stack overflow problem in my code.

@jwongz

jwongz commented May 18, 2020

Hi all,
I resolved it by disabling Transparent Huge Pages. It's OK now.

@CrackerHax

CrackerHax commented Jun 16, 2020

I had this same problem. You can try disabling transparent hugepages (as jwongz suggested above).

https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
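
For reference, THP can usually be disabled at runtime through sysfs; this is a sketch, the setting does not persist across reboots, and the exact paths can vary by kernel (the MongoDB page above covers making it permanent):

$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Afterwards, cat /sys/kernel/mm/transparent_hugepage/enabled should show [never] as the selected value.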
