Disk I/O fails with EREMOTEIO on Oracle OCI #2180
Container Linux Version
Disk I/O succeeds.
There are two cases, both involving EREMOTEIO (errno 121).
(The last line of that log is not normally present, but happened to appear when I went to capture a sample.)
This is caused by
Afterward, the root filesystem has not been resized. Manually running
and resizes the filesystem.
These log messages, appearing once per boot:
This doesn't reproduce 100% reliably, but generally seems to happen at least once every two or three instance launches.
I have not seen this on bare metal, only VMs. I also reproduced case 1 with a 4.12.14 kernel as well as the 4.13.3 kernel above.
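For reference, the errno 121 mentioned above can be mapped back to its symbolic name with Python's standard `errno` module. This is only a minimal sketch: `describe_errno` is a hypothetical helper, and the 121 to EREMOTEIO mapping is Linux-specific, so other platforms may report a different name or none at all.

```python
import errno
import os
from typing import Tuple


def describe_errno(code: int) -> Tuple[str, str]:
    """Return (symbolic name, OS message) for an errno value.

    Hypothetical helper for illustration. The name comes from the
    platform's own errno table, so results differ between systems.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    message = os.strerror(code)
    return name, message


if __name__ == "__main__":
    # 121 is the value quoted in this issue; on Linux it is EREMOTEIO.
    name, message = describe_errno(121)
    print(f"errno 121: {name} ({message})")
```

This can help when grepping logs that only record the numeric value.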
I'm now unable to reproduce this either with a newly-built image or with an existing image that was failing before. Something may have changed in the implementation of the OCI iSCSI target. kola now has additional checks (coreos/mantle#733) that should catch the problem if it shows up again, so I'll close for now. Please reopen if the problem recurs.
Just hit this on Alpha 1590.0.0.
I had our engineers take a look at this. We've seen occasional issues, and suggest some tweaks to the iSCSI configuration:
In iscsid.conf, set the replacement_timeout to 6000:
Red Hat and derivatives (CentOS, Oracle Linux, etc.) use dracut in their initramfs and require us to pass this in as a kernel parameter as well; otherwise the initramfs doesn't use the setting for the initial iSCSI connection it establishes (via iscsi-start):
We also advise turning off noop timeouts for the root volume only:
On Red Hat-based distributions, iscsi drops a file in:
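A minimal sketch for checking whether a config file carries the replacement timeout of 6000 suggested in this comment. It assumes the setting uses the open-iscsi key `node.session.timeo.replacement_timeout` (the thread only says "replacement_timeout") and that the file consists of simple `key = value` lines; the function names are hypothetical.

```python
from typing import Dict

# Assumed open-iscsi key name; the thread only says "replacement_timeout".
REPLACEMENT_KEY = "node.session.timeo.replacement_timeout"


def parse_iscsid_conf(text: str) -> Dict[str, str]:
    """Parse `key = value` lines, skipping blank lines and `#` comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def replacement_timeout_ok(text: str, wanted: int = 6000) -> bool:
    """Return True if the assumed key is present and equals `wanted`."""
    value = parse_iscsid_conf(text).get(REPLACEMENT_KEY)
    return value is not None and value.isdigit() and int(value) == wanted
```

To use it, read the contents of the iscsid.conf in question and pass the text to `replacement_timeout_ok`. The check only inspects the file; it says nothing about what an already-established session is using.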
We already configure similar settings before starting
I've confirmed that
A network trace shows the target returning ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE in response to a WRITE SAME. The kernel has code to handle this case, but apparently it's not completely functioning; I'm investigating whether torvalds/linux@d5ce4c3 fixes it. Meanwhile, writing 0 to
I have two questions for you:
This is awesome work. Thank you!
On Wed, Jan 3, 2018 at 6:05 PM Benjamin Gilbert wrote:
1. Yes, sometimes. In this case I think it's likely that we'll backport.
2. On 1632, you can also select Docker 1.12 (https://admin.coreos.com/blog/toward-docker-17-in-container-linux#choosing-a-docker-version).
@bgilbert lastly, I want to confirm whether there is a way, similar to the one documented for Docker 1.12, to select a Docker version in the 17.x range. I am asking because Docker 17.09 works perfectly for me, and it does not make much sense to drop all the way back to Docker 1.12 just to work around the above bug.
Of course, I cannot wait for the Torcx release coming in May 18.
Ah, I see the "What about Docker 17.03?" part of the blog post you shared. It probably applies to my question too: revert to 1.12 until Torcx is released.
In that case, I'll go ahead and do that, but as you mentioned, it would be great to get this fix backported.