Paced TCP downloads in 1745.6.0 break down #2457

Closed
fwiesel opened this Issue Jun 13, 2018 · 30 comments

Comments

@fwiesel

fwiesel commented Jun 13, 2018

Issue Report

It looks like any application on a CoreOS 1745.6.0 node that does not process incoming data as fast as the source can deliver it will see its transfer speed break down to far below the application's actual processing speed.

Docker image downloads were affected first, but the problem can easily be reproduced with curl.

Reverting to the prior version (1745.5.0) on the same host makes the behaviour disappear, and going back to 1745.6.0 brings it back.
It also happens across machines (of the same type).

Bug

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1745.6.0
VERSION_ID=1745.6.0
BUILD_ID=2018-06-08-0926
PRETTY_NAME="Container Linux by CoreOS 1745.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Bare metal
Intel E7-4870
2x Cisco VIC ENIC in a bond (active-backup), MTU 9000, 10000baseT/Full

Expected Behavior

As in the prior version (1745.5.0), same machine.

curl --limit-rate 1M -o /dev/null http://<high-speed-low-latency-source>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  956M    0 8439k    0     0  1023k      0  0:15:57  0:00:08  0:15:49 1020k

The download proceeds at (more or less) the limited speed.
This is not limited to curl: docker daemon downloads are affected as well, and presumably others.
Whenever the client does not process the data as fast as the network delivers it, the speed breaks down.

Actual Behavior

curl --limit-rate 1M -o /dev/null http://<high-speed-low-latency-source>
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
 0  956M    0 6482k    0     0  41779      0  6:40:17  0:02:38  6:37:39   623

The transfer rate quickly drops far below the speed limit.
This also happens with kubernetes stopped and iptables rules cleared.

Reproduction Steps

  1. Run curl with a speed limit lower than the source can deliver; the faster the source, the better.
  2. Wait for the traffic to drop far below the limit, presumably once some buffer fills (in our case at ~6 MiB).
  3. Reboot the same host into the prior version (1745.5.0).
  4. Run the same command and observe the expected behaviour.
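The reproduction amounts to a deliberately slow reader on a fast connection. A minimal sketch of the same idea without curl (the host, pacing numbers, and plain-HTTP assumption are placeholders, not from the report):

```python
import socket
import time

# Deliberately slow TCP reader, mimicking `curl --limit-rate 1M`:
# the network can deliver faster than we consume, so the kernel's
# receive buffer fills. On an affected kernel the advertised window
# then collapses and throughput falls far below the intended pace.
CHUNK = 128 * 1024       # read 128 KiB at a time...
RATE = 1024 * 1024       # ...paced to roughly 1 MiB/s

def slow_download(host, port=80, path="/", max_seconds=30):
    """Fetch `path` over plain HTTP, reading slower than the link speed."""
    sock = socket.create_connection((host, port))
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    total = 0
    start = time.monotonic()
    while time.monotonic() - start < max_seconds:
        data = sock.recv(CHUNK)
        if not data:
            break
        total += len(data)
        time.sleep(len(data) / RATE)  # pace the *reader*, not the network
    sock.close()
    return total
```

On a healthy kernel this should sustain roughly RATE bytes per second against a fast source; on an affected one the effective rate should collapse once the receive buffer fills.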

Other Information

A tcpdump seems to indicate that the client shrinks the advertised receive window to 384 bytes, at roughly 10 packets per second.
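To quantify that from a capture, one can pull the advertised window out of tcpdump's default one-line text output. A small sketch (the sample line is invented for illustration, not taken from the actual capture):

```python
import re

# tcpdump's one-line format prints "win N" for each segment. On an
# affected client the advertised receive window shrinks toward a few
# hundred bytes. This sample line is invented for illustration.
SAMPLE = ("12:00:00.000000 IP 10.0.0.2.43210 > 10.0.0.1.80: "
          "Flags [.], ack 123456, win 384, length 0")

def advertised_window(line):
    """Return the advertised window from one tcpdump text line, or None."""
    m = re.search(r"\bwin (\d+)", line)
    return int(m.group(1)) if m else None

print(advertised_window(SAMPLE))  # -> 384
```

Note that unless the capture includes the SYN, tcpdump cannot apply the window-scale option, so it is safest to compare the trend rather than absolute byte values.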

Prior version:

cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1745.5.0
VERSION_ID=1745.5.0
BUILD_ID=2018-05-31-0701
PRETTY_NAME="Container Linux by CoreOS 1745.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

@fwiesel fwiesel changed the title from Paced TCP downloads in 4.14.48-coreos-r1 slow down to Paced TCP downloads in 4.14.48-coreos-r1 break down Jun 13, 2018

@urfuwo

urfuwo commented Jun 13, 2018

+1

@databus23

databus23 commented Jun 13, 2018

I can reproduce the behaviour on CoreOS 1745.6.0 running as a VM on ESXi, so this seems to be a general problem, not related to any particular hardware.

It takes up to 90 seconds before the download drops below 1 KB/s, but it's 100% reproducible for me on 1745.6.0 and I can't make it happen on 1745.5.0.

@fwiesel

fwiesel commented Jun 13, 2018

I cannot reproduce it with 4.14.48-041448-generic (from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14.48/) on Ubuntu Xenial.

$ curl --limit-rate 1M -o /dev/null http://<some-source>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 16  956M   16  157M    0     0  1024k      0  0:15:56  0:02:37  0:13:19 1024

@fwiesel fwiesel changed the title from Paced TCP downloads in 4.14.48-coreos-r1 break down to Paced TCP downloads in 1745.6.0 break down Jun 13, 2018

@zikphil

zikphil commented Jun 13, 2018

We have also been experiencing issues with the latest version: several docker pulls hang mid-way for several minutes (20-30m). Not reproducible on older versions.

@bgilbert

Member

bgilbert commented Jun 13, 2018

Thanks for the report. coreos/linux@a6f81fc from 4.14.48 seems like a possible culprit, but I haven't been able to reproduce the problem and so can't test. Do you see this on a freshly-installed machine with default configs? Has anyone seen this problem on any cloud platform?

One thing you could do is test with beta 1772.2.0 and, if that succeeds, 1772.3.0. That would help determine whether 4.14.48 is in fact responsible.

@databus23

databus23 commented Jun 13, 2018

@bgilbert I experience this with a freshly created VM from the official image: https://stable.release.core-os.net/amd64-usr/1745.6.0/coreos_production_openstack_image.img.bz2.

I just boot the VM, log in, and run the above curl command.

I'll try to reproduce with 1772.2.0.

@sjourdan

sjourdan commented Jun 13, 2018

I confirm I can't reproduce the described issue on 1772.2.0 using kernel 4.14.47, but I can on 1772.3.0 using 4.14.48-coreos-r1.

I tried this on AWS with fresh VMs.

@paulcichonski

paulcichonski commented Jun 13, 2018

I am hitting this too on 1745.6.0 nodes running in EC2, specifically the hanging docker pulls. Some things that seem to make the hanging happen more often:

  • using an S3-backed Docker registry: pulls from there (especially of large image layers) seem to hang almost all the time, whereas pulling the same image from Docker Hub does not hang.
  • using non-EBS-optimized instance types.

I've put together some terraform that makes it easy to replicate: https://github.com/paulcichonski/replicate-docker-pull-hanging.

@bgilbert

bgilbert commented Jun 14, 2018

@paulcichonski Thanks for the repro! I've been able to reproduce on 1745.6.0, and have been unable to reproduce on these images:

Description                                   us-west-2 AMI
Alpha 1800.1.0                                ami-f623628e
1745.6.0 with coreos/linux@a6f81fc reverted   ami-abf4b7d3
1745.6.0 with coreos/linux@02db557 applied    ami-df1153a7

Could someone who's been experiencing this problem confirm that one of the last two images (hopefully the last one) fixes the problem for them?

(Note that those are developer images which will not update themselves, so don't use them as a long-term fix.)

@Jwpe

Jwpe commented Jun 14, 2018

I'm experiencing this issue on 1772.3.0 beta release on EC2. Certain docker image pulls from Docker hub hang mid-download.

@paulcichonski

paulcichonski commented Jun 14, 2018

I no longer see the issue with ami-df1153a7 (1745.6.0 with coreos/linux@02db557 applied).

@bgilbert

bgilbert commented Jun 14, 2018

@paulcichonski Thanks.

@bgilbert

bgilbert commented Jun 14, 2018

This should be fixed in beta 1772.4.0 and stable 1745.7.0, due shortly. Alpha is not affected. Thanks, everyone, for the detailed reports, reproduction, and testing.

@bgilbert bgilbert closed this Jun 14, 2018

@bgilbert

bgilbert commented Jun 14, 2018

Rolling updates to affected machines is going to be difficult, since the issue also affects retrieval of the update payload. We're investigating mitigations.

@bgilbert bgilbert reopened this Jun 14, 2018

@ajeddeloh

ajeddeloh commented Jun 14, 2018

We're now rolling out 1745.7.0 and 1772.4.0, which contain the fix. We have blacklisted the affected versions (1745.6.0 and 1772.3.0) from updating directly, because the update process overwrites the other USR partition and would therefore destroy the ability to roll back. To update to 1745.7.0 or 1772.4.0, roll back to the previous version first, then let the update process proceed normally.

We’re investigating ways to update the stuck 1745.6.0 and 1772.3.0 systems and will update this bug once we know more.

@bgilbert

bgilbert commented Jun 15, 2018

@zyclonite

zyclonite commented Jun 15, 2018

Any chance a manual upgrade will be possible? A download does not always stall... is there no way to kill the downloading process and retry?

@bgilbert

bgilbert commented Jun 15, 2018

Repro:

kola spawn -p aws --aws-type m3.large --aws-ami ami-401f5e38
curl --limit-rate 1M -o /dev/null https://stable.release.core-os.net/amd64-usr/current/coreos_production_image.bin.bz2
@bgilbert

bgilbert commented Jun 15, 2018

As pointed out by @lucab, running

sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0

on affected systems appears to work around the issue. Note, however, that the setting only takes effect for new TCP connections; existing connections are unaffected.
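For machines that may reboot before a fixed release lands, the setting can also be persisted with a standard sysctl.d drop-in. This is a hedged sketch; the file name below is an arbitrary choice, not one used by CoreOS:

```shell
# Apply immediately (affects new TCP connections only):
sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0

# Persist across reboots (file name is an arbitrary choice):
echo 'net.ipv4.tcp_moderate_rcvbuf = 0' \
  | sudo tee /etc/sysctl.d/90-tcp-moderate-rcvbuf.conf
```

Remember to remove the drop-in once the machine is running a fixed release, since receive-buffer auto-tuning is normally desirable.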

@bgilbert

bgilbert commented Jun 15, 2018

@zyclonite We plan to update affected systems automatically, but we're still working out the details. Right now, because we've blacklisted the affected versions, the only way to upgrade them is to roll back first.

@zyclonite

zyclonite commented Jun 15, 2018

@bgilbert thank you. We went the rollback route; it was not as complicated as we feared.

No issues after the upgrade.

@DrMurx

DrMurx commented Jun 17, 2018

@bgilbert I'm stuck in upgrade limbo: I'm currently running 1772.3.0 while update_engine tries to download 1772.4.0.

I'm unable to roll back to 1772.2.0; the machine doesn't boot if I try, so its partition has presumably already been partially overwritten by the upgrade. Disabling net.ipv4.tcp_moderate_rcvbuf as you suggested didn't work for me either.

Is there a way to download the image manually and dd it to the partition?

@bgilbert

bgilbert commented Jun 17, 2018

We've now rolled out fixes to both the beta and stable channels. All machines running 1772.3.0 or 1745.6.0 should either a) have updated to the fixed versions or b) still be in the process of downloading the update. We have implemented mitigations on the public CoreUpdate server which we believe will help affected nodes update in a reasonable amount of time. We're monitoring the effect of those mitigations and will continue to adjust them as necessary.

Machines that update from a private CoreUpdate server do not benefit from these mitigations, unfortunately. If you're in that situation, or if you'd like to update more quickly, you should be able to apply the update by running:

sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0
sudo systemctl restart update-engine
update_engine_client -update

In the worst case, even if mitigation is ineffective and the above commands are not executed, we believe machines will still eventually update themselves to a fixed version. In that case, downloading and applying the update may take a week or more, but it should eventually complete.

@bgilbert

bgilbert commented Jun 17, 2018

@DrMurx There's not a completely straightforward way to apply the update manually. You're correct that the partially-applied 1772.4.0 would have overwritten 1772.2.0. Did you restart update-engine after applying the sysctl? The sysctl only affects new TCP connections, not established ones.

@DrMurx

DrMurx commented Jun 17, 2018

@bgilbert I did restart update-engine after the sysctl tuning. It was still stuck at 0.8%, but later the situation suddenly resolved itself (maybe due to your changes on your servers).

You don't have a dedicated feature request tracker, do you? Given this experience, I'd suggest that update-engine first download the payload to a temporary file before applying any changes to the lower-priority partition.

@bgilbert

bgilbert commented Jun 17, 2018

@DrMurx Glad to hear it got resolved. We track feature requests in this issue tracker as well. However, update_engine is not actively maintained and won't be used as the update system for the successor to Container Linux, so we're unlikely to make major adjustments to it.

@bgilbert

bgilbert commented Jun 17, 2018

The public CoreUpdate server is now seeing only a handful of active downloads for these updates. It appears that the mitigations have been effective. If you have machines which are still failing to update from 1772.3.0 or 1745.6.0 (after following the instructions above in the case of a private CoreUpdate instance) please leave a comment here.

@Jwpe

Jwpe commented Jun 18, 2018

@bgilbert thanks so much for your prompt response to this issue and for keeping us informed through the entire process. It was really helpful and gives me a lot of confidence in continuing to use CoreOS.

@bgilbert

bgilbert commented Sep 19, 2018

We're going to drop the mitigations from the public CoreUpdate server on October 4. If you're still launching machines with stable 1745.6.0 or beta 1772.3.0, you should switch to a newer release.

@bgilbert

bgilbert commented Oct 5, 2018

Mitigations have been dropped from the public CoreUpdate server.
