Disk Corruption #51
Not sure if this is the proper place to open an issue, as I'm unclear where the problem is actually happening, but we are experiencing disk corruption on the worker nodes. We have tried v20, v23, and v24, and the root volumes showed corruption in as little as a couple of hours and at most within 4 days.
What ends up happening is that we start to see these errors in the logs:
Then eventually the node runs out of disk space, fails instance status checks, and it gets replaced by the auto scaling group.
Any help would be appreciated.
Although v22 isn't all roses either - we've seen a somewhat related issue of Docker freezing up (node not ready with "PLEG is not healthy").
We're currently experimenting with a different Docker version.
I don't have a reliable way to reproduce this, which makes it hard to confirm either the problem or a fix. However, I converted the root fs to ext4 to see if I would still have issues, and my nodes have been much more stable, so I believe it is xfs-related. In regards to that link, I did check the dd_type, and it was set correctly as far as I could tell.
We also believe it has something to do with xfs, mostly because of this mail thread, which describes a very similar issue.
I haven't tried ext4 and don't have a repro either, although it does fail reliably in our setup. We're currently using v22 of the AMI + Docker 18.06.1ce, which has been a stable combination thus far. I have also provided information about specific instances and occurrences to AWS support; they told me that the EKS team has passed the issue on to the EC2 team. Waiting to see where that gets us.
added a commit on Oct 18, 2018
We have seen a similar disk corruption issue occur roughly once every two weeks in a cluster of six m5.xlarge nodes. The nodes are lightly loaded: ~5% CPU usage, ~50% memory usage, ~50 pods. We haven't been able to track down any causal pattern or unusual system logs; the corruption appears to occur at random.
This doesn't immediately impact service, only monitoring, so while we lack a repro we have obtained a snapshot of an affected volume. We use 100G root volumes, which after corruption are consistently reported as
Attaching the affected volume to a separate instance and running an
Comparing those two numbers:
This looks like a bit-flip in the superblock free data block count, just as described in the above mail thread.
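A quick sanity check for the bit-flip hypothesis: two counters that differ by a single flipped bit have an XOR that is an exact power of two. A minimal sketch of that check (the values below are made up for illustration, not the actual counters from our volume):

```python
def is_single_bit_flip(a: int, b: int) -> bool:
    """True if a and b differ in exactly one bit position."""
    x = a ^ b
    # A power of two has exactly one bit set: x & (x - 1) clears it.
    return x != 0 and (x & (x - 1)) == 0

# Hypothetical counter values, not the ones from the affected volume:
good = 0x16FFFF0
bad = good ^ (1 << 34)  # simulate a flip of bit 34

print(is_single_bit_flip(good, bad))  # True
print(is_single_bit_flip(good, good ^ 0b11))  # False: two bits differ
```

If the reported free-block count and the recomputed one pass this check, a single-bit corruption of the superblock `fdblocks` field is a plausible explanation, as in the mail thread above.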
The origin of that thread is https://phabricator.wikimedia.org/T199198, in which the authors appear to have narrowed down the issue to the free inode btree (finobt) feature. However, in our case it is already disabled:
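For anyone wanting to verify their own nodes: `xfs_info` reports the flag as `finobt=0` or `finobt=1` in its meta-data section. A small sketch that pulls the flag out of captured output (the sample text below is illustrative, not the actual output from our nodes):

```python
import re

# Illustrative xfs_info-style output; not copied from our cluster:
sample = (
    "meta-data=/dev/xvda1  isize=512  agcount=4, agsize=6553600 blks\n"
    "         =            crc=1      finobt=0 spinodes=0\n"
)

def finobt_enabled(xfs_info_output: str) -> bool:
    """Return True if the finobt feature flag is set in xfs_info output."""
    m = re.search(r"finobt=(\d)", xfs_info_output)
    if m is None:
        raise ValueError("finobt flag not found in output")
    return m.group(1) == "1"

print(finobt_enabled(sample))  # False
```

In practice you would feed this the output of `xfs_info <mountpoint>` run on the node.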