Paced TCP downloads in 1745.6.0 break down #2457

Closed
fwiesel opened this Issue Jun 13, 2018 · 30 comments

Comments

@fwiesel

fwiesel commented Jun 13, 2018

Issue Report

It looks like any application on a CoreOS 1745.6.0 node that does not process incoming data as fast as the source can deliver it will see its transfer speed break down to far below the application's actual processing speed.

Docker image downloads were affected first, but the problem can easily be reproduced with curl.

Reverting to the prior version (1745.5.0) on the same host makes the behaviour disappear, and going back to 1745.6.0 brings it back.
It also happens across machines (of the same type).

Bug

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1745.6.0
VERSION_ID=1745.6.0
BUILD_ID=2018-06-08-0926
PRETTY_NAME="Container Linux by CoreOS 1745.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Bare metal
Intel E7-4870
2x Cisco VIC ENIC in a bond (active-backup), MTU 9000, 10000baseT/Full

Expected Behavior

As in the prior version (1745.5.0), same machine.

curl --limit-rate 1M -o /dev/null http://<high-speed-low-latency-source>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  956M    0 8439k    0     0  1023k      0  0:15:57  0:00:08  0:15:49 1020k

The download proceeds at (more or less) the limited speed.
This is not limited to curl: docker daemon downloads are affected as well, and presumably others.
Whenever the client does not process the data as fast as the network delivers it, the speed breaks down.

Actual Behavior

curl --limit-rate 1M -o /dev/null http://<high-speed-low-latency-source>
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
 0  956M    0 6482k    0     0  41779      0  6:40:17  0:02:38  6:37:39   623

The transfer rate quickly drops far below the speed limit.
This also happens with kubernetes stopped and iptables rules cleared.

Reproduction Steps

  1. Run curl with a speed limit lower than the source can deliver; the faster the source, the better.
  2. Wait for the traffic to drop far below the limit, presumably once some buffer fills (in our case at ~6 MiB).
  3. Reboot the same host into the prior version (1745.5.0).
  4. Run the same command and observe the expected behaviour.
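The reproduction amounts to a deliberately slow reader on a fast connection. A minimal sketch of the same idea without curl (the host, pacing numbers, and plain-HTTP assumption are placeholders, not from the report):

```python
import socket
import time

# Deliberately slow TCP reader, mimicking `curl --limit-rate 1M`:
# the network can deliver faster than we consume, so the kernel's
# receive buffer fills. On an affected kernel the advertised window
# then collapses and throughput falls far below the intended pace.
CHUNK = 128 * 1024       # read 128 KiB at a time...
RATE = 1024 * 1024       # ...paced to roughly 1 MiB/s

def slow_download(host, port=80, path="/", max_seconds=30):
    """Fetch `path` over plain HTTP, reading slower than the link speed."""
    sock = socket.create_connection((host, port))
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    total = 0
    start = time.monotonic()
    while time.monotonic() - start < max_seconds:
        data = sock.recv(CHUNK)
        if not data:
            break
        total += len(data)
        time.sleep(len(data) / RATE)  # pace the *reader*, not the network
    sock.close()
    return total
```

On a healthy kernel this should sustain roughly RATE bytes per second against a fast source; on an affected one the effective rate should collapse once the receive buffer fills.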

Other Information

A tcpdump seems to indicate that the client shrinks the advertised receive window to 384 bytes, at roughly 10 packets per second.
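To quantify that from a capture, one can pull the advertised window out of tcpdump's default one-line text output. A small sketch (the sample line is invented for illustration, not taken from the actual capture):

```python
import re

# tcpdump's one-line format prints "win N" for each segment. On an
# affected client the advertised receive window shrinks toward a few
# hundred bytes. This sample line is invented for illustration.
SAMPLE = ("12:00:00.000000 IP 10.0.0.2.43210 > 10.0.0.1.80: "
          "Flags [.], ack 123456, win 384, length 0")

def advertised_window(line):
    """Return the advertised window from one tcpdump text line, or None."""
    m = re.search(r"\bwin (\d+)", line)
    return int(m.group(1)) if m else None

print(advertised_window(SAMPLE))  # -> 384
```

Note that unless the capture includes the SYN, tcpdump cannot apply the window-scale option, so it is safest to compare the trend rather than absolute byte values.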

Prior version:

cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1745.5.0
VERSION_ID=1745.5.0
BUILD_ID=2018-05-31-0701
PRETTY_NAME="Container Linux by CoreOS 1745.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

@fwiesel fwiesel changed the title from Paced TCP downloads in 4.14.48-coreos-r1 slow down to Paced TCP downloads in 4.14.48-coreos-r1 break down Jun 13, 2018

@urfuwo

urfuwo commented Jun 13, 2018

+1

@databus23

databus23 commented Jun 13, 2018

I can reproduce the behaviour on CoreOS 1745.6.0 running as a VM on ESXi, so this seems to be a general problem, not related to any particular hardware.

It takes up to 90 seconds before the download drops below 1 KB/s, but it's 100% reproducible for me on 1745.6.0 and I can't make it happen on 1745.5.0.

@fwiesel

fwiesel commented Jun 13, 2018

I cannot reproduce it with 4.14.48-041448-generic (from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14.48/) on Ubuntu Xenial.

$ curl --limit-rate 1M -o /dev/null http://<some-source>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 16  956M   16  157M    0     0  1024k      0  0:15:56  0:02:37  0:13:19 1024

@fwiesel fwiesel changed the title from Paced TCP downloads in 4.14.48-coreos-r1 break down to Paced TCP downloads in 1745.6.0 break down Jun 13, 2018

@zikphil

zikphil commented Jun 13, 2018

We have also been experiencing issues with the latest version: several docker pulls hang mid-way for several minutes (20-30m). Not reproducible on older versions.

@bgilbert

Member

bgilbert commented Jun 13, 2018

Thanks for the report. coreos/linux@a6f81fc from 4.14.48 seems like a possible culprit, but I haven't been able to reproduce the problem and so can't test. Do you see this on a freshly-installed machine with default configs? Has anyone seen this problem on any cloud platform?

One thing you could do is test with beta 1772.2.0 and, if that succeeds, 1772.3.0. That would help determine whether 4.14.48 is in fact responsible.

@databus23

databus23 commented Jun 13, 2018

@bgilbert I experience this with a freshly created VM from the official image: https://stable.release.core-os.net/amd64-usr/1745.6.0/coreos_production_openstack_image.img.bz2.

I just boot the VM, log in, and run the above curl command.

I'll try to reproduce with 1772.2.0.

@sjourdan

sjourdan commented Jun 13, 2018

I confirm I can't reproduce the described issue on 1772.2.0 using kernel 4.14.47, but I can on 1772.3.0 using 4.14.48-coreos-r1.

I tried this on AWS with fresh VMs.

@paulcichonski

paulcichonski commented Jun 13, 2018

I am hitting this too on 1745.6.0 nodes running in EC2, specifically the hanging docker pulls. Some things that seem to make the hanging happen more often:

  • using an S3-backed Docker registry: pulls from there (especially of large image layers) seem to hang almost all the time, whereas pulling the same image from Docker Hub does not hang.
  • using non-EBS-optimized instance types.

I've put together some terraform that makes it easy to replicate: https://github.com/paulcichonski/replicate-docker-pull-hanging.

@bgilbert

bgilbert commented Jun 14, 2018

@paulcichonski Thanks for the repro! I've been able to reproduce on 1745.6.0, and have been unable to reproduce on these images:

Description                                   us-west-2 AMI
Alpha 1800.1.0                                ami-f623628e
1745.6.0 with coreos/linux@a6f81fc reverted   ami-abf4b7d3
1745.6.0 with coreos/linux@02db557 applied    ami-df1153a7

Could someone who's been experiencing this problem confirm that one of the last two images (hopefully the last one) fixes the problem for them?

(Note that those are developer images which will not update themselves, so don't use them as a long-term fix.)

@Jwpe

Jwpe commented Jun 14, 2018

I'm experiencing this issue on 1772.3.0 beta release on EC2. Certain docker image pulls from Docker hub hang mid-download.

@paulcichonski

paulcichonski commented Jun 14, 2018

I no longer see the issue with ami-df1153a7 (1745.6.0 with coreos/linux@02db557 applied).

@bgilbert

bgilbert commented Jun 14, 2018

@paulcichonski Thanks.

@bgilbert

bgilbert commented Jun 14, 2018

This should be fixed in beta 1772.4.0 and stable 1745.7.0, due shortly. Alpha is not affected. Thanks, everyone, for the detailed reports, reproduction, and testing.

@bgilbert bgilbert closed this Jun 14, 2018

@bgilbert

bgilbert commented Jun 14, 2018

Rolling updates to affected machines is going to be difficult, since the issue also affects retrieval of the update payload. We're investigating mitigations.

@bgilbert bgilbert reopened this Jun 14, 2018

@ajeddeloh

ajeddeloh commented Jun 14, 2018

We're now rolling out 1745.7.0 and 1772.4.0, which contain the fix. We have blacklisted the affected versions (1745.6.0 and 1772.3.0) from updating directly, because the update process overwrites the other USR partition and would therefore destroy the ability to roll back. To update to 1745.7.0 or 1772.4.0, roll back to the previous version first, then let the update process proceed normally.

We’re investigating ways to update the stuck 1745.6.0 and 1772.3.0 systems and will update this bug once we know more.

@bgilbert

bgilbert commented Jun 15, 2018

@zyclonite

zyclonite commented Jun 15, 2018

Any chance a manual upgrade will be possible? A download does not always stall... is there no way to kill the downloading process and retry?

@bgilbert

bgilbert commented Jun 15, 2018

Repro:

kola spawn -p aws --aws-type m3.large --aws-ami ami-401f5e38
curl --limit-rate 1M -o /dev/null https://stable.release.core-os.net/amd64-usr/current/coreos_production_image.bin.bz2
@bgilbert

bgilbert commented Jun 15, 2018

As pointed out by @lucab, running

sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0

on affected systems appears to work around the issue. Note, however, that the setting only takes effect for new TCP connections; existing connections are unaffected.
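For machines that may reboot before a fixed release lands, the setting can also be persisted with a standard sysctl.d drop-in. This is a hedged sketch; the file name below is an arbitrary choice, not one used by CoreOS:

```shell
# Apply immediately (affects new TCP connections only):
sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0

# Persist across reboots (file name is an arbitrary choice):
echo 'net.ipv4.tcp_moderate_rcvbuf = 0' \
  | sudo tee /etc/sysctl.d/90-tcp-moderate-rcvbuf.conf
```

Remember to remove the drop-in once the machine is running a fixed release, since receive-buffer auto-tuning is normally desirable.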

@bgilbert

bgilbert commented Jun 15, 2018

@zyclonite We plan to update affected systems automatically, but we're still working out the details. Right now, because we've blacklisted the affected versions, the only way to upgrade them is to roll back first.

@zyclonite

zyclonite commented Jun 15, 2018

@bgilbert thank you. We went the rollback route; it was not as complicated as we feared.

No issues after the upgrade.

@DrMurx

DrMurx commented Jun 17, 2018

@bgilbert I'm stuck in upgrade limbo: I'm currently running 1772.3.0 while update_engine tries to download 1772.4.0.

I'm unable to roll back to 1772.2.0; the machine doesn't boot if I try, so its partition has presumably already been partially overwritten by the upgrade. Disabling net.ipv4.tcp_moderate_rcvbuf as you suggested didn't work for me either.

Is there a way to download the image manually and dd it to the partition?

@bgilbert

bgilbert commented Jun 17, 2018

We've now rolled out fixes to both the beta and stable channels. All machines running 1772.3.0 or 1745.6.0 should either a) have updated to the fixed versions or b) still be in the process of downloading the update. We have implemented mitigations on the public CoreUpdate server which we believe will help affected nodes update in a reasonable amount of time. We're monitoring the effect of those mitigations and will continue to adjust them as necessary.

Machines that update from a private CoreUpdate server do not benefit from these mitigations, unfortunately. If you're in that situation, or if you'd like to update more quickly, you should be able to apply the update by running:

sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0
sudo systemctl restart update-engine
update_engine_client -update

In the worst case, even if mitigation is ineffective and the above commands are not executed, we believe machines will still eventually update themselves to a fixed version. In that case, downloading and applying the update may take a week or more, but it should eventually complete.

@bgilbert

bgilbert commented Jun 17, 2018

@DrMurx There's not a completely straightforward way to apply the update manually. You're correct that the partially-applied 1772.4.0 would have overwritten 1772.2.0. Did you restart update-engine after applying the sysctl? The sysctl only affects new TCP connections, not established ones.

@DrMurx

DrMurx commented Jun 17, 2018

@bgilbert I did restart update-engine after the sysctl tuning. It was still stuck at 0.8%, but later the situation suddenly resolved itself (maybe due to your changes on your servers).

You don't have a dedicated feature request tracker, do you? Given this experience, I'd suggest that update-engine first download the payload to a temporary file before applying any changes to the lower-priority partition.

@bgilbert

bgilbert commented Jun 17, 2018

@DrMurx Glad to hear it got resolved. We track feature requests in this issue tracker as well. However, update_engine is not actively maintained and won't be used as the update system for the successor to Container Linux, so we're unlikely to make major adjustments to it.

@bgilbert

bgilbert commented Jun 17, 2018

The public CoreUpdate server is now seeing only a handful of active downloads for these updates. It appears that the mitigations have been effective. If you have machines which are still failing to update from 1772.3.0 or 1745.6.0 (after following the instructions above in the case of a private CoreUpdate instance) please leave a comment here.

@Jwpe

Jwpe commented Jun 18, 2018

@bgilbert thanks so much for your prompt response to this issue and for keeping us informed through the entire process. It was really helpful and gives me a lot of confidence in continuing to use CoreOS.

@bgilbert

bgilbert commented Sep 19, 2018

We're going to drop the mitigations from the public CoreUpdate server on October 4. If you're still launching machines with stable 1745.6.0 or beta 1772.3.0, you should switch to a newer release.

@bgilbert

bgilbert commented Oct 5, 2018

Mitigations have been dropped from the public CoreUpdate server.
