handle broken locking on remote filesystems better #685

eassmann · 2013-04-22T09:21:27Z

On my CentOS 5.8 system, fish hangs for a long time (minutes) after every command, built-in or external, though not after a blank line. The command output and new prompt will be printed, then the hang occurs.

It also hangs after entering a comment, although in this case before printing a prompt.

It does not hang when a command is executed as 'fish -c ls'

strace snippet executing 'ls', the blank lines indicate long waits:

write(1, "assmann@n101 \33[32m~\33[30m\33(B\33[m> "..., 35) = 35
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
futex(0x9980030, FUTEX_WAIT_PRIVATE, 2, NULL

) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=30, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}

) = -1 EIO (Input/output error)
close(6)                                = 0
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_RDONLY) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fcntl(6, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}

) = -1 ENOLCK (No locks available)
close(6)                                = 0

ridiculousfish · 2013-04-22T10:39:21Z

Is your home directory perhaps on an NFS mount, or any filesystem other than something simple and local?

eassmann · 2013-04-22T11:03:57Z

Yes, it is on NFS. I neglected to mention that because (a) the file system is usually fast, (b) I tried running fish in a local mount and it did not change anything, and (c) I simply did not remember that it might be about the home directory rather than the working directory.

Anyway, running fish with $HOME set to a local directory gets rid of the hangs, so your guess was spot on. It would still be nice to be able to use fish. Changing my $HOME is not really an option, but I could conceivably have the .config/fish on the local FS, if that is possible.

(To elucidate: The machine is a cluster with a shared NFS and "scratch" directories on local HDDs. I do want my $HOME to be on the NFS, but it would be okay to run fish only on the login node and have the config directory on its scratch FS.)

zanchey · 2013-04-22T12:56:52Z

I run fish with an NFS-mounted home directory, and it seems fine. Are you running an NFS lock manager? As you may know, rpcinfo -p $SERVER against your NFS server should show an entry for nlockmgr.

ridiculousfish · 2013-04-22T17:34:17Z

I am definitely interested in the answer to zanchey's question.

I would also like to make fish not fall over in the presence of an NFS mount, even one that does not support file locking. Does anyone have any suggestions for how best to set one up for testing?

eassmann · 2013-04-23T07:01:33Z

zanchey: I see that all of the syscalls that caused the trouble seem to be locking-related :-). nlockmgr does show up in the output of rpcinfo:

   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    873  rquotad
    100011    2   udp    873  rquotad
    100011    1   tcp    876  rquotad
    100011    2   tcp    876  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp  37999  nlockmgr
    100021    3   udp  37999  nlockmgr
    100021    4   udp  37999  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100021    1   tcp  33570  nlockmgr
    100021    3   tcp  33570  nlockmgr
    100021    4   tcp  33570  nlockmgr
    100005    1   udp    965  mountd
    100005    1   tcp    968  mountd
    100005    2   udp    965  mountd
    100005    2   tcp    968  mountd
    100005    3   udp    965  mountd
    100005    3   tcp    968  mountd

However, the NFS share is mounted 'nolock', if that has anything to do with it. I had a problem a while ago where some program failed because it was not able to do locks on the NFS. I do not now remember what program it was, but a quick web search suggests that this has to do with flock() rather than fcntl(), so maybe it is unrelated.

For completeness, I will also mention that sometimes on this machine, yum will hang in a futex syscall (contrary to fish, it never recovers). One then needs to remove some stale lock files.

ridiculousfish: About the 'presence' of an NFS mount, the mere presence seems not to be enough. When I start fish with HOME set to a local mount, I can go to the NFS mount and work there without problems. (Sorry if this was obvious.)

ridiculousfish · 2013-04-27T08:16:57Z

What's happening is that fish attempts to take a file lock and then append to history. If it can't take a file lock, it falls back to the old mode of creating a new file containing the old and new history, and then moving that in place. Creating the new file is slow, as it requires copying the contents of the old file each time. This is currently not multithreaded.

ridiculousfish · 2013-04-27T22:27:46Z

I relaxed the constraint that we must take a lock to do the history append-in-place, and now we always write complete records with O_APPEND, which is closer to what bash does. This probably won't always work on lockless NFS - you may lose some new history items. But performance should be better most of the time.

Fixed as 3e69e5b . Thanks for reporting this.
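As a rough illustration of the approach described above (try the lock, but append the complete record with O_APPEND either way), here is a minimal Python sketch. It is not the code from the commit: the function name `append_history_record` is hypothetical, the O_CREAT flag is added only so the example is self-contained, and the record layout is copied from the write() call in the strace output in this thread. Note that the lock attempt itself still blocks for as long as the filesystem makes it block; the sketch only shows that a failed lock no longer prevents the append.

```python
import errno
import fcntl
import os
import time


def append_history_record(path, command, when=None):
    """Append one complete record to path.

    Returns True if the write happened under a lock, False if locking
    failed and the record was appended without one.
    """
    when = int(time.time()) if when is None else when
    # Layout taken from the strace output: "- cmd: ls\n   when: 1367317539\n"
    record = f"- cmd: {command}\n   when: {when}\n".encode()

    # O_CREAT is an assumption made for this sketch; the strace shows
    # only O_WRONLY|O_APPEND.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    locked = False
    try:
        try:
            # Blocking whole-file write lock via fcntl(), like F_SETLKW.
            fcntl.lockf(fd, fcntl.LOCK_EX)
            locked = True
        except OSError as exc:
            # ENOLCK and EIO are the two failures reported in this thread.
            if exc.errno not in (errno.ENOLCK, errno.EIO):
                raise
        # One write() call for the whole record, relying on O_APPEND.
        os.write(fd, record)
    finally:
        os.close(fd)  # closing the descriptor also drops the lock
    return locked
```

Without the lock, two shells can still race when one of them rewrites the file, which is the "you may lose some new history items" caveat mentioned above.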

eassmann · 2013-04-30T12:58:09Z

I have now had a chance to try this out. Sorry to say this, but the problem persists. I will also say that the lag is much longer than I would expect for making a copy of the history.

Actually, after the patch fish now seems to hang forever:

write(1, "\33[38;5;245m\342\217\216\33[0m              "..., 126) = 126
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
write(1, "assmann@n101 \33[32m~\33[30m\33(B\33[m> "..., 35) = 35
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\0", 1)                        = 1
futex(0x14f8cfc0, FUTEX_WAIT_PRIVATE, 2, NULL

) = 0

similar to before, then

rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available)
write(6, "- cmd: ls\n   when: 1367317539\n", 30) = 30
close(6)                                = 0
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
futex(0x14f8cfc0, FUTEX_WAKE_PRIVATE, 1) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\3", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\1", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL

and I have not seen that call return.

ridiculousfish · 2013-04-30T17:56:16Z

Thanks for your conscientious testing.

open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available)
write(6, "- cmd: ls\n   when: 1367317539\n", 30) = 30
close(6)

That shows fish attempting to take the lock, failing, appending to the history anyways, and then closing it, which is as designed. So the fix appears to be working correctly so far.

select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\3", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\1", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL

This shows fish waiting on input from file descriptors. (fd 4 is probably the iothread completion pipe.) 0 is stdin. So it looks like fish is just waiting for input on stdin.

eassmann · 2013-05-03T08:20:30Z

I stand corrected. At the select, fish is really just waiting for input. I cannot reconstruct why I thought it had crashed the other day.

However, it is still not usable because the calls futex(0x81ef030, FUTEX_WAIT_PRIVATE, 2, NULL) and fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) take such a long time. If you want, the problem is 33% solved, since 1 out of originally 3 lagging syscalls has been avoided :-).

frogshead · 2013-05-03T10:06:22Z

I had a lagging problem on my Ubuntu box as well when my history file was corrupted; there were some invalid or empty commands in it (or something like that). When I wiped out the whole file and created a new one, the lagging was gone.

ridiculousfish · 2013-05-03T21:42:25Z

You might also just have had a lot of history. We cap it at 262144 ( = 256 * 1024) items. History searches may take the lock and not release it for a while. We should do better - reader/writer locks, or more fine grained locking.

ridiculousfish · 2013-05-03T21:48:20Z

I'm surprised that the F_SETLKW takes a long time - I would expect it to fail immediately if locks are not supported.

zanchey · 2013-05-04T01:47:12Z

(A side note - if you use strace -T it will print the amount of time spent in each call.)

zanchey · 2015-10-19T13:24:52Z

Was there ever any progress with this? I wondered idly if NFSv4 might be displaying some unusual semantics here, but there hasn't been any activity in the last two years. I'm not sure if @xebtl is still having the problem?

eassmann · 2015-10-19T13:33:18Z

From my side, no progress. More precisely, I simply stopped trying after the previous discussion — I do not use the machine in question (a cluster) all that much, so it was not worth more effort to me. I actually use the machine even less these days, but if it is of interest to you, I can certainly try the current version of fish.

On the other hand, is NFSv4 a new version? The NFS on our cluster is not being updated, if that was your point.

zanchey · 2015-10-19T13:46:57Z

NFSv3 and NFSv4 are two quite different protocols, and certainly at our site we have aggressively avoided migration to NFSv4 due to a variety of problems - Linux server performance, reduced fault tolerance and changed semantics. We have picked up at least one problem in fish that occurs only on NFSv4 mounts, though.

I'd be interested to know if it still happens. There's been quite a bit of change to the code in this area.

eassmann · 2015-10-19T16:13:15Z

How do I find out if my cluster is using v3 or v4?

eassmann · 2015-10-22T08:57:16Z

@zanchey: nfsstat says “Client nfs v3”, so I guess it is NFSv3.

eassmann · 2015-10-22T09:07:27Z

I compiled the current 2.2.0 tarball from the website. Unfortunately, it seems that the problem persists:

futex(0x184cb980, FUTEX_WAIT_PRIVATE, 2, NULL) = 0 <289.078032>
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0 <0.000058>
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6 <0.326795>
fcntl(6, F_GETFD)                       = 0 <0.000029>
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0 <0.000021>
fstat(6, {st_mode=S_IFREG|0644, st_size=182, ...}) = 0 <0.000020>
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available) <59.370349>
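For longer traces, the per-call timings that strace -T appends in angle brackets can be picked out mechanically. The following is a small Python sketch, with the hypothetical helper name `slow_syscalls` and an arbitrary default threshold of one second; the sample lines are taken from the output above.

```python
import re

# strace -T appends the time spent in each call as "<seconds>" at the
# end of the line; "<unfinished ...>" markers do not match this pattern.
DURATION = re.compile(r"<(\d+\.\d+)>\s*$")


def slow_syscalls(lines, threshold=1.0):
    """Return (seconds, line) pairs for calls slower than threshold,
    slowest first."""
    hits = []
    for line in lines:
        match = DURATION.search(line)
        if match is None:
            continue
        seconds = float(match.group(1))
        if seconds > threshold:
            hits.append((seconds, line.rstrip()))
    return sorted(hits, reverse=True)


sample = [
    'futex(0x184cb980, FUTEX_WAIT_PRIVATE, 2, NULL) = 0 <289.078032>',
    'rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0 <0.000058>',
    'open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6 <0.326795>',
    'fcntl(6, F_GETFD)                       = 0 <0.000029>',
    'fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) '
    '= -1 ENOLCK (No locks available) <59.370349>',
]

for seconds, line in slow_syscalls(sample):
    print(f"{seconds:10.3f}s  {line}")
```

On this sample only the futex wait and the F_SETLKW call exceed the threshold, which matches the two calls identified as slow earlier in the thread.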

zanchey · 2015-10-22T14:10:02Z

My suspicion is that the NFS server isn't running an NFS locking manager (rpcinfo -p SERVERNAME | grep nlockmgr should produce some output if the locking manager is present). If that's all OK, could you try building the current git master? Or using a nightly build?

eassmann · 2015-10-27T13:09:33Z

It seems the locking manager is present.

I failed to build from master because the configuration needs autoconf 2.60 but only 2.59 is available for my OS (CentOS 5.11). Installing the nightly RPM failed as well, I think also because the OS release is too old. Is there any way to get around the autoconf dependency?

zanchey · 2015-10-27T13:31:14Z

Try autoreconf rather than autoconf to get the source to build.

eassmann · 2015-10-27T15:35:11Z

That still complains

$ autoreconf
configure.ac:13: error: Autoconf version 2.60 or higher is required
configure.ac:13: the top level
autom4te: /usr/bin/m4 failed with exit status: 63
aclocal: autom4te failed with exit status: 63
autoreconf: aclocal failed with exit status: 63

I tried to remove the constraint from configure.ac, but then autoreconf reports an error and the resulting configure fails with

./configure: line 2069: AC_PROG_SED: command not found
CXXFLAGS: -g -O2
checking for msgfmt... msgfmt
./configure: line 2236: syntax error near unexpected token `$doxygen_version,'
./configure: line 2236: `            AS_VERSION_COMPARE($doxygen_version, $doxygen_minimum,'

zanchey · 2015-10-29T07:31:12Z

Of course - my apologies.

Try the nightly tarball, which ships with the configure file already generated. You should just be able to run ./configure and then make install. You can download it from the openSUSE build service - e.g. fish_2.2.0-392-g1055864.orig.tar.gz.

eassmann · 2015-10-29T09:33:55Z

That tarball built, but the problem is still there. Sorry.

zanchey · 2015-10-30T14:10:54Z

No, that's good to know!

krader1961 · 2016-12-13T00:11:54Z

Is anyone still interested in debugging this? If not we should close it. Getting a strace with timing data along with fish -d5 output would be really helpful.

eassmann · 2016-12-13T13:45:12Z

I ran another test with the build of fish-2.2.0-392-g1055864 on the same machine, using the command strace -fTostrace ./fish -d 5 2>stderr. I entered ls^M^D and waited for fish to exit.

I put the complete output in a gist. The syscalls that take a long time are

#line   PID
----- -----
 4999 28559 fcntl(6, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>
 5107 28561 futex(0x1d7d4b60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
 5196 28551 futex(0x1d7d4b60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
 5197 28559 <... fcntl resumed> )             = -1 ENOLCK (No locks available) <30.004000>
 5203 28561 <... futex resumed> )             = 0 <29.925293>
 5210 28551 <... futex resumed> )             = 0 <29.597652>
 5241 28551 fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>
 5351 28551 <... fcntl resumed> )             = -1 EIO (Input/output error) <150.011093>

As far as I could tell, fish's corresponding debug output was

fish: path_get_path( 'l' )
fish: path_get_path( 'ls' )
fish: path_get_path( 'ls' )
fish: path_get_path( 'ls' )

If you want to see output with different options, just let me know. If you want me to try with a newer version, ditto, though I would have to compile that.

krader1961 · 2016-12-13T16:03:36Z

This is a bug in the NFS statd service. See for example https://bugzilla.kernel.org/show_bug.cgi?id=8159. Your OS is broken. Nothing fish can do about it other than not attempt to obtain locks on the history file and I don't think we're going to do that.

krader1961 · 2016-12-13T18:37:43Z

@eassmann: You can fix your immediate problem by setting $XDG_DATA_HOME to a path on a local file system so your fish history file is no longer on NFS.

While walking my dogs this morning it occurred to me we can deal with this type of situation better. Simply time how long it takes to obtain the lock (or fails as the case may be) and if it is above a threshold (say 250 ms) then emit a warning and stop trying to lock the history file. It's better to risk losing history than to have an unusable shell.
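A minimal Python sketch of that idea follows. It is not the change that eventually went into fish: the class name `HistoryLocker`, the `try_lock` method, and the injectable `clock` and `lock` parameters are all hypothetical and exist only to make the logic easy to follow and to test. The 250 ms limit is the figure suggested above.

```python
import fcntl
import sys
import time

LOCK_TIME_LIMIT = 0.250  # seconds; the "say 250 ms" threshold suggested above


class HistoryLocker:
    """Locks the history file until locking proves too slow, then
    stops trying for the rest of the session."""

    def __init__(self, limit=LOCK_TIME_LIMIT, clock=time.monotonic):
        self.limit = limit
        self.clock = clock
        self.abandoned = False

    def try_lock(self, fd, lock=fcntl.lockf):
        """Return True if fd is now locked, False otherwise."""
        if self.abandoned:
            return False  # locking was already found to be too slow

        start = self.clock()
        try:
            lock(fd, fcntl.LOCK_EX)  # blocking fcntl()-style write lock
            ok = True
        except OSError:
            ok = False  # e.g. ENOLCK or EIO, as seen in this thread
        elapsed = self.clock() - start

        # Success or failure, a slow attempt disables further locking.
        if elapsed > self.limit:
            print(
                f"warning: locking the history file took {elapsed:.3f}s; "
                "history will no longer be locked",
                file=sys.stderr,
            )
            self.abandoned = True
        return ok
```

The first slow attempt is still paid in full, since a blocking lock request cannot be cut short this way; the gain is that every later command skips the lock instead of stalling again.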

eassmann · 2016-12-14T09:29:12Z

@krader1961: Thanks for the tip about $XDG_DATA_HOME.

"It's better to risk losing history than to have an unusable shell."

I absolutely agree.

ridiculousfish closed this as completed Apr 27, 2013

ridiculousfish reopened this Apr 30, 2013

krader1961 added the needs more info label Dec 13, 2016

krader1961 added the question label Dec 13, 2016

krader1961 added enhancement and removed needs more info question labels Dec 13, 2016

krader1961 changed the title ~~fish hangs on CentOS 5.8~~ handle broken locking on remote filesystems better Dec 13, 2016

krader1961 self-assigned this Dec 16, 2016

krader1961 modified the milestones: fish 2.5.0, fish-future Dec 16, 2016

krader1961 closed this as completed in 483e9fd Dec 17, 2016

zanchey added the release notes label Dec 19, 2016

github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2020
