Paced TCP downloads in 1745.6.0 break down #2457
Comments
+1
I can reproduce the behaviour on CoreOS. It takes up to 90 seconds before the download drops below 1 KB/s, but it's 100% reproducible for me.
I cannot reproduce it with
We have also been experiencing issues with the latest version; several docker pulls hang mid-way for several minutes (20-30 min). Not reproducible on older versions.
Thanks for the report. coreos/linux@a6f81fc from 4.14.48 seems like a possible culprit, but I haven't been able to reproduce the problem and so can't test. Do you see this on a freshly-installed machine with default configs? Has anyone seen this problem on any cloud platform? One thing you could do is test with beta 1772.2.0 and, if that succeeds, 1772.3.0. That would help determine whether 4.14.48 is in fact responsible.
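For anyone running that A/B test, a rough sketch of what to check on each image; the download URL and rate limit are just the reproduction case used in this thread, not required values:

```bash
# record the running kernel first (per this thread, 1772.2.0 ships 4.14.47 and 1772.3.0 ships 4.14.48)
uname -r

# rate-limited download; -w prints the average transfer speed once curl exits
curl --limit-rate 1M -o /dev/null \
  -w 'average download speed: %{speed_download} bytes/s\n' \
  https://stable.release.core-os.net/amd64-usr/current/coreos_production_image.bin.bz2
# on an affected kernel the average ends up far below the 1 MB/s limit
```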
@bgilbert I experience this with a VM freshly created from the official image: https://stable.release.core-os.net/amd64-usr/1745.6.0/coreos_production_openstack_image.img.bz2. I just boot the VM, log in to it, and run the above curl command. I'll try to reproduce with the beta release as suggested.
I confirm I can't reproduce the described issue on 1772.2.0 using kernel 4.14.47. I tried this on AWS with fresh VMs.
I am hitting this too on 1745.6.0 nodes running in EC2, specifically the docker pulls hanging part. Some things that seem to make the hanging happen more often:
I've put together some terraform that makes it easy to replicate: https://github.com/paulcichonski/replicate-docker-pull-hanging.
@paulcichonski Thanks for the repro! I've been able to reproduce on 1745.6.0, and have been unable to reproduce on these images:
Could someone who's been experiencing this problem confirm that one of the last two images (hopefully the last one) fixes the problem for them? (Note that those are developer images which will not update themselves, so don't use them as a long-term fix.)
I'm experiencing this issue on the 1772.3.0 beta release on EC2. Certain Docker image pulls from Docker Hub hang mid-download.
I no longer see the issue with the image you provided.
@paulcichonski Thanks.
This should be fixed in beta 1772.4.0 and stable 1745.7.0, due shortly. Alpha is not affected. Thanks, everyone, for the detailed reports, reproduction, and testing.
Rolling out updates to affected machines is going to be difficult, since the issue also affects retrieval of the update payload. We're investigating mitigations.
We're now rolling out 1745.7.0 and 1772.4.0, which contain the fix. We have blacklisted the affected versions (1745.6.0 and 1772.3.0) from updating, because the update process overwrites the other USR partition and would therefore remove the ability to roll back to a previous version. To update to 1745.7.0 or 1772.4.0, roll back to the previous version, then let the update process proceed normally. We're investigating ways to update the stuck 1745.6.0 and 1772.3.0 systems and will update this bug once we know more.
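For reference, a rough sketch of what the rollback step can look like, along the lines of the Container Linux manual-rollback procedure; the disk device and partition numbers below are examples and will differ per machine:

```bash
# inspect the GPT priority attributes of the two USR slots
# (USR-A is usually partition 3, USR-B partition 4 on /dev/sda)
cgpt show /dev/sda

# raise the boot priority of the slot that still holds the previous release,
# then reboot into it
sudo cgpt prioritize -i 3 /dev/sda
sudo reboot
```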
Any chance a manual upgrade will be possible? A download does not always stall... is there no way to kill the downloading process and retry?
Repro: kola spawn -p aws --aws-type m3.large --aws-ami ami-401f5e38
curl --limit-rate 1M -o /dev/null https://stable.release.core-os.net/amd64-usr/current/coreos_production_image.bin.bz2
As pointed out by @lucab, running sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0 on affected systems appears to work around the issue. Note, however, that the setting only applies to newly-opened TCP connections; existing connections are not affected by it.
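A minimal sketch of applying the workaround and keeping it across reboots; the drop-in filename is just an example:

```bash
# disable TCP receive-buffer auto-tuning; takes effect for new connections only
sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0

# optionally persist the setting across reboots via a sysctl.d drop-in
echo 'net.ipv4.tcp_moderate_rcvbuf = 0' | sudo tee /etc/sysctl.d/90-tcp-rcvbuf-workaround.conf
```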
@zyclonite We plan to update affected systems automatically, but we're still working out the details. Right now, because we've blacklisted the affected versions, the only way to upgrade them is to roll back first.
@bgilbert Thank you. We went the rollback route; it was not as complicated as we thought, and there were no issues after the upgrade.
@bgilbert I'm stuck in upgrade limbo. I'm currently running 1772.3.0 and I'm unable to roll back to 1772.2.0, since the partially-applied 1772.4.0 appears to have overwritten it. Is there a way to download the image without bandwidth control and apply it manually?
We've now rolled out fixes to both the beta and stable channels. All machines running 1772.3.0 or 1745.6.0 should either a) have updated to the fixed versions or b) still be in the process of downloading the update.

We have implemented mitigations on the public CoreUpdate server which we believe will help affected nodes update in a reasonable amount of time. We're monitoring the effect of those mitigations and will continue to adjust them as necessary.

Machines that update from a private CoreUpdate server do not benefit from these mitigations, unfortunately. If you're in that situation, or if you'd like to update more quickly, you should be able to apply the update by running:
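A sketch of what this amounts to, assuming the sysctl workaround described above plus a restart of the update client so it opens fresh connections; the exact commands may differ slightly:

```bash
# work around the stalled-download behaviour for new TCP connections
sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0

# restart update-engine so it reconnects and can fetch the payload at full speed
sudo systemctl restart update-engine

# optionally trigger an immediate update check
update_engine_client -check_for_update
```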
In the worst case, even if mitigation is ineffective and the above commands are not executed, we believe machines will still eventually update themselves to a fixed version. In that case, downloading and applying the update may take a week or more, but it should eventually complete.
@DrMurx There's not a completely straightforward way to apply the update manually. You're correct that the partially-applied 1772.4.0 would have overwritten 1772.2.0. Did you restart update-engine after applying the sysctl? The sysctl only affects new TCP connections, not established ones.
@bgilbert I did restart update-engine after applying the sysctl, and the update eventually went through. You don't have a dedicated feature request tracker, do you? Because given this experience I'd suggest that update_engine should handle stalled downloads more gracefully.
@DrMurx Glad to hear it got resolved. We track feature requests in this issue tracker as well. However, update_engine is not actively maintained and won't be used as the update system for the successor to Container Linux, so we're unlikely to make major adjustments to it.
The public CoreUpdate server is now seeing only a handful of active downloads for these updates. It appears that the mitigations have been effective. If you have machines which are still failing to update from 1772.3.0 or 1745.6.0 (after following the instructions above, in the case of a private CoreUpdate instance), please leave a comment here.
@bgilbert thanks so much for your prompt response to this issue and for keeping us informed through the entire process. It was really helpful and gives me a lot of confidence in continuing to use CoreOS.
We're going to drop the mitigations from the public CoreUpdate server on October 4. If you're still launching machines with stable 1745.6.0 or beta 1772.3.0, you should switch to a newer release.
Mitigations have been dropped from the public CoreUpdate server.
Issue Report
It looks like any application on a CoreOS 1745.6.0 node that does not process data as fast as the source can provide it will see its transfer speed break down to well below its actual processing speed.
Docker image downloads were affected first, but it can easily be reproduced with curl.
Reverting to the prior version, 1745.5.0, on the same host does not exhibit the behaviour, but going back to 1745.6.0 does.
It also happens across machines (of the same type).
Bug
Container Linux Version
Environment
Baremetal
Intel E7-4870
2x Cisco VIC ENIC in a bond (active-backup), MTU 9000, 10000baseT/Full
Expected Behavior
As in the prior version (1745.5.0) on the same machine: the download proceeds at (more or less) the limited speed.
This is not limited to curl; Docker daemon downloads, and presumably other clients, are affected as well.
Whenever the client does not process the data as fast as the network delivers it, the transfer speed breaks down.
Actual Behavior
The transfer rate quickly drops far below the speed limit.
This also happens with Kubernetes stopped and iptables rules cleared.
Reproduction Steps
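The reproduction is the rate-limited curl transfer quoted earlier in this thread; a sketch of what to run and what to watch for:

```bash
# download a large file with a 1 MB/s client-side rate limit and discard the data
curl --limit-rate 1M -o /dev/null \
  https://stable.release.core-os.net/amd64-usr/current/coreos_production_image.bin.bz2

# on 1745.6.0 the speed shown in curl's progress meter falls far below the 1 MB/s
# limit (down to the low KB/s range) within a minute or two; on 1745.5.0 it stays
# near the limit for the whole transfer
```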
Other Information
A tcpdump seems to indicate that the client scales the window down to 384 bytes, at roughly 10 packets per second.
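For reference, a hedged example of how such a capture might be taken; the interface name and host filter are placeholders for whatever the affected transfer actually uses:

```bash
# watch the ACKs the client sends during the paced download and note the advertised
# window ("win N" in tcpdump's output) and the inter-packet timing
sudo tcpdump -ni eth0 -ttt 'tcp and host stable.release.core-os.net' | grep 'win '
```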
Prior version: