ERL-698: 200ms delay in file:sendfile on Linux #3877

OTP-Maintainer · 2018-08-08T18:02:06Z

Original reporter: igor
Affected version: OTP-21.0
Fixed in version: OTP-21.2
Component: erts
Migrated from: https://bugs.erlang.org/browse/ERL-698

The attached [escript|https://bugs.erlang.org/secure/attachment/11219/sendfile_test.erl] demonstrates the problem. The script first sends a short string with {{gen_tcp:send}} and then a small file with {{file:sendfile}}. The receiver measures the time between getting the packet with the short string and fully receiving the file. This time ends up always being around 200ms on several Linux systems that I've tried, with various OTP versions starting with 18 and ending with 21. This is rather unfortunate given that the main reason for using {{sendfile}} is performance improvement.

The text was updated successfully, but these errors were encountered:

OTP-Maintainer · 2018-08-08T20:05:52Z

igor said:

The behavior is explained by how {{sendfile}} is implemented on Linux. Namely, it is implemented through [splice|http://man7.org/linux/man-pages/man2/splice.2.html] system call, which takes the length of the data to be "spliced" as one of the parameters. It assumes that {{len}} parameter is meaningful, i.e., it uses it to determine if more data is coming. Specifically, it sets {{MSG_MORE}} more flag until {{len}} bytes have been sent, which kernel then in turn uses as a hint that more data is coming and will try to do "nagle-ish" type of algorithm to ensure optimally-sized segments are sent on the wire when the destination is a TCP socket. Hence, passing {{2^64-1}} as the file size makes it wait for more data. If you try to strace the attached escript, you'll see that {{sendfile}} system call actually completes immediately, i.e., the delay is really coming from the TCP stack. Note that this behavior is not affected by setting {{TCP_NODELAY}} socket option.

This is 100% reproducible for me with the attached escript, which sends the data over loopback. It is harder to reproduce over the real network and with larger files but it definitely happens - that's how I bumped into this issue. I suspect the differences are due to Linux optimizations for TSO (TCP segment offloading), where it tries to fill close to a "window-full" of data before passing the buffer to the NIC.

OTP-Maintainer · 2018-08-08T22:24:37Z

igor said:

I see a few ways to fix this:

1. Pass the correct file size from {{sendfile/2}} to {{sendfile/5}} (see [pull request|https://github.com/erlang/otp/pull/1916]). This is a very simple change but it has some drawbacks:
    * Erlang doesn't seem to have an API for getting the file size from a file descriptor and {{file:read_file_info}} is relatively expensive - the kernel will have to parse the Filename again just to get to the inode.
    * It's an extra system call that is unnecessary for OSs other than Linux
    * This won't work if you ever decide to support anything other than regular files in sendfile and has an obvious race with somebody modifying the file while {{sendfile}} is running. The race clearly exists anyway though.

2. Use [{{TCP_CORK}}|https://yarchive.net/comp/linux/sendfile.html] to "cork" the destination socket for the duration of {{sendfile}} (see [pull request|https://github.com/erlang/otp/pull/1915]). This is somewhat more involved but is presumably the proper way to fix this. As an added advantage, this empirically results in fewer TCP segments getting produced and will work with "special" files if they are ever supported.

3. A variation of 1) above - rather than getting the file size in {{file.erl}}, get it in {{inet_drv.c}}. I.e., if the file size passed as a parameter is {{2^64-1}} and we're on linux, get the real file size through {{fstats}} and set it as {{desc->sendfile.length}}. I don't have a pull request with this but let me know if this is the solution you prefer and I'll be happy to make one. I kinda dislike this since {{TCP_CORK}} looks like a cleaner solution if we're changing {{inet_drv.c}} anyway but this would work too.

OTP-Maintainer · 2018-08-10T06:58:00Z

lukas said:

I read this article by the nginx folks: https://thoughts.t37.net/nginx-optimization-understanding-sendfile-tcp-nodelay-and-tcp-nopush-c55cdd276765

From that I gather that on Linux (and possibly the BSDs as well), what you want to have is both NODELAY and CORK. Would you agree with that?

I'm wondering if it would be better to add CORK/NOPUSH to the general gen_tcp options and use them from erlang instead of doing the code in c?

i.e.

{code}
%% Internal sendfile functions
sendfile(#file_descriptor{ module = Mod } = Fd, Sock, Offset, Bytes,
	 ChunkSize, Headers, Trailers, Opts)
  when is_port(Sock) ->
    ok = inet:setopts(Sock, [{nopush, true}]),
    case Mod:sendfile(Fd, Sock, Offset, Bytes, ChunkSize, Headers, Trailers,
		      Opts) of
	{error, enotsup} ->
	    Res = sendfile_fallback(Fd, Sock, Offset, Bytes, ChunkSize,
			      Headers, Trailers),
            ok = inet:setopts(Sock, [{nopush, false}]),
            Res;
	Else ->
            ok = inet:setopts(Sock, [{nopush, false}]),
	    Else
    end;
sendfile(_,_,_,_,_,_,_,_) ->
    {error, badarg}.
{code}

and maybe add a note in the documentation that it is preferred to have the supplied socket set in nodelay mode?

OTP-Maintainer · 2018-08-11T01:25:41Z

igor said:

> I'm wondering if it would be better to add CORK/NOPUSH to the general gen_tcp options and use them from erlang instead of doing the code in c?

Adding this as a general gen_tcp option definitely sounds quite useful regardless of the {{sendfile}} behavior, imho. If you're willing to expose it as a socket option available to the user code, I think you could credibly claim that setting it around the calls to {{file:sendfile}} is user's responsibility and be done with this...

> what you want to have is both NODELAY and CORK

web servers will likely want both since sendfile is nearly always the last thing to send in a response but in general they are kinda orthogonal; if anything, TCP_CORK is anti-NODELAY (and will "overwrite" TCP_NODELAY). However, {{nodelay}} option is already available to the user and we might as well leave it up to the user to set it - {{sendfile}} doesn't directly affect Nagle's behavior. Other applications may want to keep the socket corked even after sendfile if they have anything else to send on it immediately.

I'm not too familiar with BSD's NOPUSH but it seems similar enough, especially given [this comment|http://dotat.at/writing/nopush.html].

As far as the implementation above, several comments:

* we'd probably want to check what the user setting for this option was and restore it after {{sendfile}} completes rather than just unsetting {{nopush}}. Ideally, including the case of user setting it through {{raw}} socket option ;) The idea is that "good" user code would presumably want to control the corking to include  any "headers/footers" around {{sendfile}}.
* we need to be careful about what sort of socket we're dealing with. E.g., Unix domain sockets are not affected by this 200ms delay so it's not necessary; moreover, setsockopt with {{TCP_CORK}} will fail on them, as well as on UDP sockets. 

I try to handle both of the above in my pull request.

Note that there is also a {{UDP_CORK}} socket option on Linux (set with {{IPPROTO_UDP}} level), but we almost certainly do not want it set on UDP sockets for {{sendfile}} - it will make the kernel pack as much data as it can into the UDP datagrams, effectively resulting in 64K UDP datagrams that are very likely to be fragmented in transit. It might be worth exposing this through the same {{nopush}} setsockopt just for completeness though.

All of the above is for Linux, BSD might be somewhat different.

OTP-Maintainer · 2018-08-16T06:01:59Z

lukas said:

So, I've been thinking and discussing with some colleagues about this and we do like the changes that you have done, but want the changes to be done using the inet API instead. i.e. add {{inet:getopts(S, \[nopush\])}} and {{inet:setopts(S,\[\{nopush,boolean\}\])}} that works on TCP/UDP/SCTP.

As you say we could after this claim that it is up to the user to use the options, but I believe that it would be better to have a good default, so I think that the code in file:sendfile should be modified to set TCP_CORK on streaming sockets over the sendfile call if it is not set already. If we later discover that it is sometimes a good thing to not set TCP_CORK we'll just have to add an option later on that disables it. The documentation should also be updated to explain the tradeoffs and reasoning.

Does your PR that sets the size of the file to be sent still make sense if we implement TCP_CORK?

OTP-Maintainer · 2018-08-18T14:51:35Z

igor said:

Yes, this sounds reasonable.

> Does your PR that sets the size of the file to be sent still make sense if we implement TCP_CORK?

I don't think so. The only other positive side-effect of setting the right file size that I bumped into so far is that the linux kernel seems to set the PUSH flag on the last segment of the file if it knows the right size. I could do more testing of this if you think it's important. I'm not sure if it's worth the trouble anyway - I'd probably set the proper size if that size could be obtained from the file descriptor but not if we need to go through the current {{file:read_file_info}} with filename. Just an IMHO...

OTP-Maintainer · 2018-08-20T04:08:01Z

lukas said:

I'll close that PR then for now, if we later discover that it would be a good thing to have we can always reopen it. I wouldn't be against then adding support for file:read_file_info/1 to get the info from an open fd.

Will you make an new/updated PR that exposes the nopush option in inet?

OTP-Maintainer · 2018-08-22T17:11:02Z

igor said:

> Will you make an new/updated PR that exposes the nopush option in inet?

Yeah, I can give it a try but it might take some time...

OTP-Maintainer · 2018-10-09T21:11:53Z

igor said:

My apologies for the delay. I've finally gotten some time to work on this; here are the results so far.

See my [pull request|https://github.com/erlang/otp/pull/1979] for the {{nopush}} TCP socket option. This works on Linux/FreeBSD/OSX as far as I can tell; I can't directly test on Windows but I did test the (simulated) behavior when this option is not defined by the OS. A few comments about the change:

* The {{nopush}} option is silently ignored on OSs that don't support it; it sounds like this should make using it easier and the same approach is taken for a few other options in inet_drv - though some do return an error when not supported.
* I couldn't find anything similar to {{nopush}} for SCTP; let me know if I just missed it and I'll get it added.
* I didn't add {{nopush}} for UDP. It's only available on Linux (as {{UDP_CORK}}) and, frankly, I have a hard time coming up with (m)any use cases where it could be useful. I can certainly add it if you feel it's a good idea.
* {{nopush}} behaves rather differently on OSX than everywhere else; namely, unsetting {{nopush}} does *not* cause the accumulated data to be sent out immediately. This is rather unfortunate to put it mildly...

OTP-Maintainer · 2018-10-09T21:38:25Z

igor said:

As far as the original issue with the delay in {{file:sendfile}} is concerned, I implemented a fix along the lines you suggested (I put the changes into  prim_inet.erl since dealing with sockets in file.erl didn't feel right). However, the OSX behavior with {{nopush}} turned out to be a real bummer - my original escript test works just fine on Linux/FreeBSD with the fix but results in a 5 seconds (!!) delay on OSX. You can view the change in [my fork|https://github.com/igors/otp/commit/8e45a074d7756f226851c26a638d925f752447a3] but I don't think merging it makes sense at the moment, hence no pull request.

Given all that, I guess I can see several ways forward:
# Add an OS check  when setting {{nopush}} in prim_inet and don't set it if we're on OSX
# Add {{file:read_file_info}} variant that takes a file descriptor and just use that to get the real file size in {{file:sendfile/2}}
# Change inet_drv code to completely ignore {{nopush}} option on OSX
# Go back to something like my original pull request

I can't say that any of the above is particularly attractive. The first one is probably the easiest, the second one is probably the cleanest; the third one makes sense because {{nopush}} on OSX is so badly broken that it's not clear if exposing it even makes sense; the last one is there mostly for completeness ;)

OTP-Maintainer · 2018-10-17T04:12:47Z

john said:

I've merged your PR. Thanks!

(https://github.com/erlang/otp/pull/1979)

OTP-Maintainer added bug Issue is reported as a bug team:VM Assigned to OTP team VM priority:medium labels Feb 10, 2021

OTP-Maintainer added this to the OTP-21.2 milestone Feb 10, 2021

OTP-Maintainer assigned garazdawi Feb 10, 2021

OTP-Maintainer closed this as completed Feb 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ERL-698: 200ms delay in file:sendfile on Linux #3877

ERL-698: 200ms delay in file:sendfile on Linux #3877

OTP-Maintainer commented Aug 8, 2018

OTP-Maintainer commented Aug 8, 2018

OTP-Maintainer commented Aug 8, 2018

OTP-Maintainer commented Aug 10, 2018

OTP-Maintainer commented Aug 11, 2018

OTP-Maintainer commented Aug 16, 2018

OTP-Maintainer commented Aug 18, 2018

OTP-Maintainer commented Aug 20, 2018

OTP-Maintainer commented Aug 22, 2018

OTP-Maintainer commented Oct 9, 2018

OTP-Maintainer commented Oct 9, 2018

OTP-Maintainer commented Oct 17, 2018

ERL-698: 200ms delay in file:sendfile on Linux #3877

ERL-698: 200ms delay in file:sendfile on Linux #3877

Comments

OTP-Maintainer commented Aug 8, 2018

OTP-Maintainer commented Aug 8, 2018

OTP-Maintainer commented Aug 8, 2018

OTP-Maintainer commented Aug 10, 2018

OTP-Maintainer commented Aug 11, 2018

OTP-Maintainer commented Aug 16, 2018

OTP-Maintainer commented Aug 18, 2018

OTP-Maintainer commented Aug 20, 2018

OTP-Maintainer commented Aug 22, 2018

OTP-Maintainer commented Oct 9, 2018

OTP-Maintainer commented Oct 9, 2018

OTP-Maintainer commented Oct 17, 2018