handle broken locking on remote filesystems better #685

Closed
eassmann opened this Issue Apr 22, 2013 · 32 comments

Comments

Projects
None yet
5 participants
@eassmann

On my CentOS 5.8 system, fish hangs for a long time (minutes) after every command, built-in or external, though not after a blank line. The command output and new prompt will be printed, then the hang occurs.

It also hangs after entering a comment, although in this case before printing a prompt.

It does not hang when a command is executed as 'fish -c ls'

strace snippet executing 'ls', the blank lines indicate long waits:

write(1, "assmann@n101 \33[32m~\33[30m\33(B\33[m> "..., 35) = 35
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo
...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo
...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo
...}) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
futex(0x9980030, FUTEX_WAIT_PRIVATE, 2, NULL

) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=30, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}

) = -1 EIO (Input/output error)
close(6)                                = 0
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_RDONLY) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fcntl(6, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}

) = -1 ENOLCK (No locks available)
close(6)                                = 0
@ridiculousfish

This comment has been minimized.

Show comment
Hide comment
@ridiculousfish

ridiculousfish Apr 22, 2013

Member

Is your home directory perhaps on an NFS mount, or any filesystem other than something simple and local?

Member

ridiculousfish commented Apr 22, 2013

Is your home directory perhaps on an NFS mount, or any filesystem other than something simple and local?

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Apr 22, 2013

Yes, it is on NFS. I neglected to mention that because (a) the file system is usually fast, (b) I tried running fish in a local mount and it did not change anything, and (c) I did simply not remember that it might be about the home directory rather than the working directory.

Anyway, running fish with $HOME set to a local directory gets rid of the hangs, so your guess was spot on. It would still be nice to be able to use fish. Changing my $HOME is not really an option, but I could conceivably have the .config/fish on the local FS, if that is possible.

(To elucidate: The machine is a cluster with a shared NFS and "scratch" directories on local HDDs. I do want my $HOME to be on the NFS, but it would be okay to run fish only on the login node and have the config directory on its scratch FS.)

Yes, it is on NFS. I neglected to mention that because (a) the file system is usually fast, (b) I tried running fish in a local mount and it did not change anything, and (c) I did simply not remember that it might be about the home directory rather than the working directory.

Anyway, running fish with $HOME set to a local directory gets rid of the hangs, so your guess was spot on. It would still be nice to be able to use fish. Changing my $HOME is not really an option, but I could conceivably have the .config/fish on the local FS, if that is possible.

(To elucidate: The machine is a cluster with a shared NFS and "scratch" directories on local HDDs. I do want my $HOME to be on the NFS, but it would be okay to run fish only on the login node and have the config directory on its scratch FS.)

@zanchey

This comment has been minimized.

Show comment
Hide comment
@zanchey

zanchey Apr 22, 2013

Member

I run fish with an NFS-mounted home directory, and it seems fine. Are you running an NFS lock manager? As you may know, rpcinfo -p $SERVER against your NFS server should show an entry for nlockmgr.

Member

zanchey commented Apr 22, 2013

I run fish with an NFS-mounted home directory, and it seems fine. Are you running an NFS lock manager? As you may know, rpcinfo -p $SERVER against your NFS server should show an entry for nlockmgr.

@ridiculousfish

This comment has been minimized.

Show comment
Hide comment
@ridiculousfish

ridiculousfish Apr 22, 2013

Member

I am definitely interested in the answer to zanchey's question.

I would also like to make fish not fall over in the presence of even a NFS mount that does not support file locking. Does anyone have any suggestions for how best to set one up for testing?

Member

ridiculousfish commented Apr 22, 2013

I am definitely interested in the answer to zanchey's question.

I would also like to make fish not fall over in the presence of even a NFS mount that does not support file locking. Does anyone have any suggestions for how best to set one up for testing?

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Apr 23, 2013

zanchey: I see that all of the syscalls that caused the trouble seem to be locking-related :-). nlockmgr does show up in the output of rpcinfo:

   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    873  rquotad
    100011    2   udp    873  rquotad
    100011    1   tcp    876  rquotad
    100011    2   tcp    876  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp  37999  nlockmgr
    100021    3   udp  37999  nlockmgr
    100021    4   udp  37999  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100021    1   tcp  33570  nlockmgr
    100021    3   tcp  33570  nlockmgr
    100021    4   tcp  33570  nlockmgr
    100005    1   udp    965  mountd
    100005    1   tcp    968  mountd
    100005    2   udp    965  mountd
    100005    2   tcp    968  mountd
    100005    3   udp    965  mountd
    100005    3   tcp    968  mountd

However, the NFS share is mounted 'nolock', if that has anything to do with it. I had a problem a while ago where some program failed because it was not able to do locks on the NFS. I do not now remember what program it was, but a quick web search suggests that this has to do with flock() rather than fcntl(), so maybe it is unrelated.

For completeness, I will also mention that sometimes on this machine, yum will hang in a futex syscall (contrary to fish, it never recovers). One then needs to remove some stale lock files.

ridiculousfish: About the 'presence' of an NFS mount, the mere presence seems not to be enough. When I start fish with HOME set to a local mount, I can go to the NFS mount and work there without problems. (Sorry if this was obvious.)

zanchey: I see that all of the syscalls that caused the trouble seem to be locking-related :-). nlockmgr does show up in the output of rpcinfo:

   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    873  rquotad
    100011    2   udp    873  rquotad
    100011    1   tcp    876  rquotad
    100011    2   tcp    876  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp  37999  nlockmgr
    100021    3   udp  37999  nlockmgr
    100021    4   udp  37999  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100021    1   tcp  33570  nlockmgr
    100021    3   tcp  33570  nlockmgr
    100021    4   tcp  33570  nlockmgr
    100005    1   udp    965  mountd
    100005    1   tcp    968  mountd
    100005    2   udp    965  mountd
    100005    2   tcp    968  mountd
    100005    3   udp    965  mountd
    100005    3   tcp    968  mountd

However, the NFS share is mounted 'nolock', if that has anything to do with it. I had a problem a while ago where some program failed because it was not able to do locks on the NFS. I do not now remember what program it was, but a quick web search suggests that this has to do with flock() rather than fcntl(), so maybe it is unrelated.

For completeness, I will also mention that sometimes on this machine, yum will hang in a futex syscall (contrary to fish, it never recovers). One then needs to remove some stale lock files.

ridiculousfish: About the 'presence' of an NFS mount, the mere presence seems not to be enough. When I start fish with HOME set to a local mount, I can go to the NFS mount and work there without problems. (Sorry if this was obvious.)

@ridiculousfish

This comment has been minimized.

Show comment
Hide comment
@ridiculousfish

ridiculousfish Apr 27, 2013

Member

What's happening is that fish attempts to take a file lock and then append to history. If it can't take a file lock, it falls back to the old mode of creating a new file containing the old and new history, and then moving that in place. Creating the new file is slow, as it requires copying the contents of the old file each time. This is currently not multithreaded.

Member

ridiculousfish commented Apr 27, 2013

What's happening is that fish attempts to take a file lock and then append to history. If it can't take a file lock, it falls back to the old mode of creating a new file containing the old and new history, and then moving that in place. Creating the new file is slow, as it requires copying the contents of the old file each time. This is currently not multithreaded.

@ridiculousfish

This comment has been minimized.

Show comment
Hide comment
@ridiculousfish

ridiculousfish Apr 27, 2013

Member

I relaxed the constraint that we must take a lock to do the history append-in-place, and now we always write complete records with O_APPEND, which is closer to what bash does. This probably won't always work on lockless NFS - you may lose some new history items. But performance should be better most of the time.

Fixed as 3e69e5b . Thanks for reporting this.

Member

ridiculousfish commented Apr 27, 2013

I relaxed the constraint that we must take a lock to do the history append-in-place, and now we always write complete records with O_APPEND, which is closer to what bash does. This probably won't always work on lockless NFS - you may lose some new history items. But performance should be better most of the time.

Fixed as 3e69e5b . Thanks for reporting this.

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Apr 30, 2013

I have now had a chance to try this out. Sorry to say this, but the problem persists. I will also say that the lag is much longer than I would expect for making a copy of the history.

Actually, after the patch fish now seems to hang forever:

write(1, "\33[38;5;245m\342\217\216\33[0m              "..., 126) = 126
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
write(1, "assmann@n101 \33[32m~\33[30m\33(B\33[m> "..., 35) = 35
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\0", 1)                        = 1
futex(0x14f8cfc0, FUTEX_WAIT_PRIVATE, 2, NULL

) = 0

similar as before, then

rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available)
write(6, "- cmd: ls\n   when: 1367317539\n", 30) = 30
close(6)                                = 0
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
futex(0x14f8cfc0, FUTEX_WAKE_PRIVATE, 1) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\3", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\1", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL

and I have not seen that call return.

I have now had a chance to try this out. Sorry to say this, but the problem persists. I will also say that the lag is much longer than I would expect for making a copy of the history.

Actually, after the patch fish now seems to hang forever:

write(1, "\33[38;5;245m\342\217\216\33[0m              "..., 126) = 126
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
write(1, "assmann@n101 \33[32m~\33[30m\33(B\33[m> "..., 35) = 35
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\0", 1)                        = 1
futex(0x14f8cfc0, FUTEX_WAIT_PRIVATE, 2, NULL

) = 0

similar as before, then

rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available)
write(6, "- cmd: ls\n   when: 1367317539\n", 30) = 30
close(6)                                = 0
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
futex(0x14f8cfc0, FUTEX_WAKE_PRIVATE, 1) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\3", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\1", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL

and I have not seen that call return.

@ridiculousfish

This comment has been minimized.

Show comment
Hide comment
@ridiculousfish

ridiculousfish Apr 30, 2013

Member

Thanks for your conscientious testing.

open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available)
write(6, "- cmd: ls\n   when: 1367317539\n", 30) = 30
close(6)

That shows fish attempting to take the lock, failing, appending to the history anyways, and then closing it, which is as designed. So the fix appears to be working correctly so far.

select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\3", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\1", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL

This shows fish waiting on input from file descriptors. (fd 4 is probably the iothread completion pipe.) 0 is stdin. So it looks like fish is just waiting for input on stdin.

Member

ridiculousfish commented Apr 30, 2013

Thanks for your conscientious testing.

open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available)
write(6, "- cmd: ls\n   when: 1367317539\n", 30) = 30
close(6)

That shows fish attempting to take the lock, failing, appending to the history anyways, and then closing it, which is as designed. So the fix appears to be working correctly so far.

select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\3", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\1", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL

This shows fish waiting on input from file descriptors. (fd 4 is probably the iothread completion pipe.) 0 is stdin. So it looks like fish is just waiting for input on stdin.

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann May 3, 2013

I stand corrected. At the select, fish is really just waiting for input. I cannot reconstruct why I thought it had crashed the other day.

However, it is still not usable because the calls futex(0x81ef030, FUTEX_WAIT_PRIVATE, 2, NULL) and fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) take such a long time. If you want, the problem is 33% solved, since 1 out of originally 3 lagging syscalls has been avoided :-).

eassmann commented May 3, 2013

I stand corrected. At the select, fish is really just waiting for input. I cannot reconstruct why I thought it had crashed the other day.

However, it is still not usable because the calls futex(0x81ef030, FUTEX_WAIT_PRIVATE, 2, NULL) and fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) take such a long time. If you want, the problem is 33% solved, since 1 out of originally 3 lagging syscalls has been avoided :-).

@frogshead

This comment has been minimized.

Show comment
Hide comment
@frogshead

frogshead May 3, 2013

Contributor

i had lagging problem my ubuntu box as well when my history file was corrupted, there were some invalid or empty commands (or something). When I wipe out whole file and created new one then lagging were gone.

Contributor

frogshead commented May 3, 2013

i had lagging problem my ubuntu box as well when my history file was corrupted, there were some invalid or empty commands (or something). When I wipe out whole file and created new one then lagging were gone.

@ridiculousfish

This comment has been minimized.

Show comment
Hide comment
@ridiculousfish

ridiculousfish May 3, 2013

Member

You might also just had a lot of history. We cap it at 262144 ( = 256 * 1024) items. History searches may take the lock and not release it for a while. We should do better - reader/writer locks, or more fine grained locking.

Member

ridiculousfish commented May 3, 2013

You might also just had a lot of history. We cap it at 262144 ( = 256 * 1024) items. History searches may take the lock and not release it for a while. We should do better - reader/writer locks, or more fine grained locking.

@ridiculousfish

This comment has been minimized.

Show comment
Hide comment
@ridiculousfish

ridiculousfish May 3, 2013

Member

I'm surprised that the F_SETLKW takes a long time - I would expect it to immediately if locks are not supported.

Member

ridiculousfish commented May 3, 2013

I'm surprised that the F_SETLKW takes a long time - I would expect it to immediately if locks are not supported.

@zanchey

This comment has been minimized.

Show comment
Hide comment
@zanchey

zanchey May 4, 2013

Member

(A side note - if you use strace -T it will print the amount of time spent in each call.)

Member

zanchey commented May 4, 2013

(A side note - if you use strace -T it will print the amount of time spent in each call.)

@zanchey

This comment has been minimized.

Show comment
Hide comment
@zanchey

zanchey Oct 19, 2015

Member

Was there ever any progress with this? I wondered idly if NFSv4 might be displaying some unusual semantics here, but there hasn't been any activity in the last two years. I'm not sure if @xebtl is still having the problem?

Member

zanchey commented Oct 19, 2015

Was there ever any progress with this? I wondered idly if NFSv4 might be displaying some unusual semantics here, but there hasn't been any activity in the last two years. I'm not sure if @xebtl is still having the problem?

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Oct 19, 2015

From my side, no progress. More precisely, I simply stopped trying after the previous discussion — I do not use the machine in question (a cluster) all that much, so it was not worth more effort to me. I actually use the machine even less these days, but if it is of interest to you, I can certainly try the current version of fish.

On the other hand, is NFSv4 a new version? The NFS on our cluster is not being updated, if that was your point.

From my side, no progress. More precisely, I simply stopped trying after the previous discussion — I do not use the machine in question (a cluster) all that much, so it was not worth more effort to me. I actually use the machine even less these days, but if it is of interest to you, I can certainly try the current version of fish.

On the other hand, is NFSv4 a new version? The NFS on our cluster is not being updated, if that was your point.

@zanchey

This comment has been minimized.

Show comment
Hide comment
@zanchey

zanchey Oct 19, 2015

Member

NFSv3 and NFSv4 are two quite different protocols, and certainly at our site we have aggressively avoided migration to NFSv4 due to a variety of problems - Linux server performance, reduced fault tolerance and changed semantics. We have picked up at least one problem in fish that occurs only on NFSv4 mounts, though.

I'd be interested to know if it still happens. There's been quite a bit of change to the code in this area.

Member

zanchey commented Oct 19, 2015

NFSv3 and NFSv4 are two quite different protocols, and certainly at our site we have aggressively avoided migration to NFSv4 due to a variety of problems - Linux server performance, reduced fault tolerance and changed semantics. We have picked up at least one problem in fish that occurs only on NFSv4 mounts, though.

I'd be interested to know if it still happens. There's been quite a bit of change to the code in this area.

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Oct 19, 2015

How do I find out if my cluster is using v3 or v4?

How do I find out if my cluster is using v3 or v4?

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Oct 22, 2015

@zanchey: nfsstat says “Client nfs v3”, so I guess it is NFSv3.

@zanchey: nfsstat says “Client nfs v3”, so I guess it is NFSv3.

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Oct 22, 2015

I compiled the current 2.2.0 tarball from the website. Unfortunately, it seems that the problem persists:

futex(0x184cb980, FUTEX_WAIT_PRIVATE, 2, NULL) = 0 <289.078032>
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0 <0.000058>
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6 <0.326795>
fcntl(6, F_GETFD)                       = 0 <0.000029>
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0 <0.000021>
fstat(6, {st_mode=S_IFREG|0644, st_size=182, ...}) = 0 <0.000020>
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available) <59.370349>

I compiled the current 2.2.0 tarball from the website. Unfortunately, it seems that the problem persists:

futex(0x184cb980, FUTEX_WAIT_PRIVATE, 2, NULL) = 0 <289.078032>
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0 <0.000058>
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6 <0.326795>
fcntl(6, F_GETFD)                       = 0 <0.000029>
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0 <0.000021>
fstat(6, {st_mode=S_IFREG|0644, st_size=182, ...}) = 0 <0.000020>
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available) <59.370349>
@zanchey

This comment has been minimized.

Show comment
Hide comment
@zanchey

zanchey Oct 22, 2015

Member

My suspicion is that the NFS server isn't running an NFS locking manager (rpcinfo -p SERVERNAME | grep nlockmgr should produce some output if the locking manager is present). If that's all OK, could you try building the current git master? Or using a nightly build?

Member

zanchey commented Oct 22, 2015

My suspicion is that the NFS server isn't running an NFS locking manager (rpcinfo -p SERVERNAME | grep nlockmgr should produce some output if the locking manager is present). If that's all OK, could you try building the current git master? Or using a nightly build?

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Oct 27, 2015

It seems the locking manager is present.

I failed to build from master because the configuration needs autoconf 2.60 but only 2.59 is available for my OS (CentOS 5.11). Installing the nightly RPM failed as well, I think also because the OS release is too old. Is there any way to get around the autoconf dependency?

It seems the locking manager is present.

I failed to build from master because the configuration needs autoconf 2.60 but only 2.59 is available for my OS (CentOS 5.11). Installing the nightly RPM failed as well, I think also because the OS release is too old. Is there any way to get around the autoconf dependency?

@zanchey

This comment has been minimized.

Show comment
Hide comment
@zanchey

zanchey Oct 27, 2015

Member

Try autoreconf rather than autoconf to get the source to build.

Member

zanchey commented Oct 27, 2015

Try autoreconf rather than autoconf to get the source to build.

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Oct 27, 2015

That still complains

$ autoreconf                                                                       
configure.ac:13: error: Autoconf version 2.60 or higher is required                                                    
configure.ac:13: the top level                                                                                         
autom4te: /usr/bin/m4 failed with exit status: 63                                                                      
aclocal: autom4te failed with exit status: 63                                                                          
autoreconf: aclocal failed with exit status: 63                                                                        

I tried to remove the constraint from configure.ac, but then autoreconf reports an error and the resulting configure fails with

./configure: line 2069: AC_PROG_SED: command not found
CXXFLAGS: -g -O2
checking for msgfmt... msgfmt
./configure: line 2236: syntax error near unexpected token `$doxygen_version,'
./configure: line 2236: `            AS_VERSION_COMPARE($doxygen_version, $doxygen_minimum,'

That still complains

$ autoreconf                                                                       
configure.ac:13: error: Autoconf version 2.60 or higher is required                                                    
configure.ac:13: the top level                                                                                         
autom4te: /usr/bin/m4 failed with exit status: 63                                                                      
aclocal: autom4te failed with exit status: 63                                                                          
autoreconf: aclocal failed with exit status: 63                                                                        

I tried to remove the constraint from configure.ac, but then autoreconf reports an error and the resulting configure fails with

./configure: line 2069: AC_PROG_SED: command not found
CXXFLAGS: -g -O2
checking for msgfmt... msgfmt
./configure: line 2236: syntax error near unexpected token `$doxygen_version,'
./configure: line 2236: `            AS_VERSION_COMPARE($doxygen_version, $doxygen_minimum,'
@zanchey

This comment has been minimized.

Show comment
Hide comment
@zanchey

zanchey Oct 29, 2015

Member

Of course - my apologies.

Try the nightly tarball, which has the configure file generated. You should just be able to use ./configure; and make install. You can download it from the openSUSE build service - e.g. fish_2.2.0-392-g1055864.orig.tar.gz.

Member

zanchey commented Oct 29, 2015

Of course - my apologies.

Try the nightly tarball, which has the configure file generated. You should just be able to use ./configure; and make install. You can download it from the openSUSE build service - e.g. fish_2.2.0-392-g1055864.orig.tar.gz.

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Oct 29, 2015

That tarball built, but the problem is still there. Sorry.

That tarball built, but the problem is still there. Sorry.

@zanchey

This comment has been minimized.

Show comment
Hide comment
@zanchey

zanchey Oct 30, 2015

Member

No, that's good to know!

Member

zanchey commented Oct 30, 2015

No, that's good to know!

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Dec 13, 2016

Contributor

Is anyone still interested in debugging this? If not we should close it. Getting a strace with timing data along with fish -d5 output would be really helpful.

Contributor

krader1961 commented Dec 13, 2016

Is anyone still interested in debugging this? If not we should close it. Getting a strace with timing data along with fish -d5 output would be really helpful.

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Dec 13, 2016

I ran another test with the the build of fish-2.2.0-392-g1055864 on the same machine, using the command strace -fTostrace ./fish -d 5 2>stderr. I entered ls^M^D and waited for fish to exit.

I put the complete output in a gist. The syscalls that take a long time are

#line   PID
----- -----
 4999 28559 fcntl(6, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>
 5107 28561 futex(0x1d7d4b60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
 5196 28551 futex(0x1d7d4b60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
 5197 28559 <... fcntl resumed> )             = -1 ENOLCK (No locks available) <30.004000>
 5203 28561 <... futex resumed> )             = 0 <29.925293>
 5210 28551 <... futex resumed> )             = 0 <29.597652>
 5241 28551 fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>
 5351 28551 <... fcntl resumed> )             = -1 EIO (Input/output error) <150.011093>

As far as I could tell, fish's corresponding debug output was

fish: path_get_path( 'l' )
fish: path_get_path( 'ls' )
fish: path_get_path( 'ls' )
fish: path_get_path( 'ls' )

If you want to see output with different options, just let me know. If you want me to try with a newer version, ditto, though I would have to compile that.

I ran another test with the the build of fish-2.2.0-392-g1055864 on the same machine, using the command strace -fTostrace ./fish -d 5 2>stderr. I entered ls^M^D and waited for fish to exit.

I put the complete output in a gist. The syscalls that take a long time are

#line   PID
----- -----
 4999 28559 fcntl(6, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>
 5107 28561 futex(0x1d7d4b60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
 5196 28551 futex(0x1d7d4b60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
 5197 28559 <... fcntl resumed> )             = -1 ENOLCK (No locks available) <30.004000>
 5203 28561 <... futex resumed> )             = 0 <29.925293>
 5210 28551 <... futex resumed> )             = 0 <29.597652>
 5241 28551 fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>
 5351 28551 <... fcntl resumed> )             = -1 EIO (Input/output error) <150.011093>

As far as I could tell, fish's corresponding debug output was

fish: path_get_path( 'l' )
fish: path_get_path( 'ls' )
fish: path_get_path( 'ls' )
fish: path_get_path( 'ls' )

If you want to see output with different options, just let me know. If you want me to try with a newer version, ditto, though I would have to compile that.

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Dec 13, 2016

Contributor

This is a bug in the NFS statd service. See for example https://bugzilla.kernel.org/show_bug.cgi?id=8159. Your OS is broken. Nothing fish can do about it other than not attempt to obtain locks on the history file and I don't think we're going to do that.

Contributor

krader1961 commented Dec 13, 2016

This is a bug in the NFS statd service. See for example https://bugzilla.kernel.org/show_bug.cgi?id=8159. Your OS is broken. Nothing fish can do about it other than not attempt to obtain locks on the history file and I don't think we're going to do that.

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Dec 13, 2016

Contributor

@eassmann: You can fix your immediate problem by setting $XDG_DATA_HOME to a path on a local file system so your fish history file is no longer on NFS.

While walking my dogs this morning it occurred to me we can deal with this type of situation better. Simply time how long it takes to obtain the lock (or fails as the case may be) and if it is above a threshold (say 250 ms) then emit a warning and stop trying to lock the history file. It's better to risk losing history than to have an unusable shell.

Contributor

krader1961 commented Dec 13, 2016

@eassmann: You can fix your immediate problem by setting $XDG_DATA_HOME to a path on a local file system so your fish history file is no longer on NFS.

While walking my dogs this morning it occurred to me we can deal with this type of situation better. Simply time how long it takes to obtain the lock (or fails as the case may be) and if it is above a threshold (say 250 ms) then emit a warning and stop trying to lock the history file. It's better to risk losing history than to have an unusable shell.

@krader1961 krader1961 changed the title from fish hangs on CentOS 5.8 to handle broken locking on remote filesystems better Dec 13, 2016

@eassmann

This comment has been minimized.

Show comment
Hide comment
@eassmann

eassmann Dec 14, 2016

@krader1961: Thanks for the tip about $XDG_DATA_HOME.

It's better to risk losing history than to have an unusable shell.

I absolutely agree.

@krader1961: Thanks for the tip about $XDG_DATA_HOME.

It's better to risk losing history than to have an unusable shell.

I absolutely agree.

@krader1961 krader1961 self-assigned this Dec 16, 2016

@krader1961 krader1961 modified the milestones: fish 2.5.0, fish-future Dec 16, 2016

krader1961 added a commit to krader1961/fish-shell that referenced this issue Dec 17, 2016

check if locking takes too long
If acquiring a lock on the history or uvar file takes more than 250 ms
disable locking of the file. On systems with broken remote file system
locking it can cause tens of seconds delay after running each command
which can make the shell borderline unusable.

This also changes history file locking to use flock() rather than
fcntl() to be consistent with uvar file locking. It also implements the
250 ms time limit before giving up on locking.

Fixes #685

krader1961 added a commit to krader1961/fish-shell that referenced this issue Dec 17, 2016

check if locking takes too long
If acquiring a lock on the history or uvar file takes more than 250 ms
disable locking of the file. On systems with broken remote file system
locking it can cause tens of seconds delay after running each command
which can make the shell borderline unusable.

This also changes history file locking to use flock() rather than
fcntl() to be consistent with uvar file locking. It also implements the
250 ms time limit before giving up on locking.

Fixes #685
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment