Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle broken locking on remote filesystems better #685

Closed
eassmann opened this issue Apr 22, 2013 · 32 comments
Closed

handle broken locking on remote filesystems better #685

eassmann opened this issue Apr 22, 2013 · 32 comments
Assignees
Labels
enhancement release notes Something that is or should be mentioned in the release notes
Milestone

Comments

@eassmann
Copy link

On my CentOS 5.8 system, fish hangs for a long time (minutes) after every command, built-in or external, though not after a blank line. The command output and new prompt will be printed, then the hang occurs.

It also hangs after entering a comment, although in this case before printing a prompt.

It does not hang when a command is executed as 'fish -c ls'

strace snippet executing 'ls', the blank lines indicate long waits:

write(1, "assmann@n101 \33[32m~\33[30m\33(B\33[m> "..., 35) = 35
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo
...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo
...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo
...}) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
futex(0x9980030, FUTEX_WAIT_PRIVATE, 2, NULL

) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=30, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}

) = -1 EIO (Input/output error)
close(6)                                = 0
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_RDONLY) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fcntl(6, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}

) = -1 ENOLCK (No locks available)
close(6)                                = 0
@ridiculousfish
Copy link
Member

Is your home directory perhaps on an NFS mount, or any filesystem other than something simple and local?

@eassmann
Copy link
Author

Yes, it is on NFS. I neglected to mention that because (a) the file system is usually fast, (b) I tried running fish in a local mount and it did not change anything, and (c) I did simply not remember that it might be about the home directory rather than the working directory.

Anyway, running fish with $HOME set to a local directory gets rid of the hangs, so your guess was spot on. It would still be nice to be able to use fish. Changing my $HOME is not really an option, but I could conceivably have the .config/fish on the local FS, if that is possible.

(To elucidate: The machine is a cluster with a shared NFS and "scratch" directories on local HDDs. I do want my $HOME to be on the NFS, but it would be okay to run fish only on the login node and have the config directory on its scratch FS.)

@zanchey
Copy link
Member

zanchey commented Apr 22, 2013

I run fish with an NFS-mounted home directory, and it seems fine. Are you running an NFS lock manager? As you may know, rpcinfo -p $SERVER against your NFS server should show an entry for nlockmgr.

@ridiculousfish
Copy link
Member

I am definitely interested in the answer to zanchey's question.

I would also like to make fish not fall over in the presence of even a NFS mount that does not support file locking. Does anyone have any suggestions for how best to set one up for testing?

@eassmann
Copy link
Author

zanchey: I see that all of the syscalls that caused the trouble seem to be locking-related :-). nlockmgr does show up in the output of rpcinfo:

   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    873  rquotad
    100011    2   udp    873  rquotad
    100011    1   tcp    876  rquotad
    100011    2   tcp    876  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp  37999  nlockmgr
    100021    3   udp  37999  nlockmgr
    100021    4   udp  37999  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100021    1   tcp  33570  nlockmgr
    100021    3   tcp  33570  nlockmgr
    100021    4   tcp  33570  nlockmgr
    100005    1   udp    965  mountd
    100005    1   tcp    968  mountd
    100005    2   udp    965  mountd
    100005    2   tcp    968  mountd
    100005    3   udp    965  mountd
    100005    3   tcp    968  mountd

However, the NFS share is mounted 'nolock', if that has anything to do with it. I had a problem a while ago where some program failed because it was not able to do locks on the NFS. I do not now remember what program it was, but a quick web search suggests that this has to do with flock() rather than fcntl(), so maybe it is unrelated.

For completeness, I will also mention that sometimes on this machine, yum will hang in a futex syscall (contrary to fish, it never recovers). One then needs to remove some stale lock files.

ridiculousfish: About the 'presence' of an NFS mount, the mere presence seems not to be enough. When I start fish with HOME set to a local mount, I can go to the NFS mount and work there without problems. (Sorry if this was obvious.)

@ridiculousfish
Copy link
Member

What's happening is that fish attempts to take a file lock and then append to history. If it can't take a file lock, it falls back to the old mode of creating a new file containing the old and new history, and then moving that in place. Creating the new file is slow, as it requires copying the contents of the old file each time. This is currently not multithreaded.

@ridiculousfish
Copy link
Member

I relaxed the constraint that we must take a lock to do the history append-in-place, and now we always write complete records with O_APPEND, which is closer to what bash does. This probably won't always work on lockless NFS - you may lose some new history items. But performance should be better most of the time.

Fixed as 3e69e5b . Thanks for reporting this.

@eassmann
Copy link
Author

I have now had a chance to try this out. Sorry to say this, but the problem persists. I will also say that the lag is much longer than I would expect for making a copy of the history.

Actually, after the patch fish now seems to hang forever:

write(1, "\33[38;5;245m\342\217\216\33[0m              "..., 126) = 126
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
write(1, "assmann@n101 \33[32m~\33[30m\33(B\33[m> "..., 35) = 35
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
fstat(2, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\0", 1)                        = 1
futex(0x14f8cfc0, FUTEX_WAIT_PRIVATE, 2, NULL

) = 0

similar as before, then

rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available)
write(6, "- cmd: ls\n   when: 1367317539\n", 30) = 30
close(6)                                = 0
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
futex(0x14f8cfc0, FUTEX_WAKE_PRIVATE, 1) = 0
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\3", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\1", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL

and I have not seen that call return.

@ridiculousfish
Copy link
Member

Thanks for your conscientious testing.

open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available)
write(6, "- cmd: ls\n   when: 1367317539\n", 30) = 30
close(6)

That shows fish attempting to take the lock, failing, appending to the history anyways, and then closing it, which is as designed. So the fix appears to be working correctly so far.

select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\3", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\1", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL)    = 1 (in [4])
read(4, "\2", 1)                        = 1
select(5, [0 3 4], NULL, NULL, NULL

This shows fish waiting on input from file descriptors. (fd 4 is probably the iothread completion pipe.) 0 is stdin. So it looks like fish is just waiting for input on stdin.

@eassmann
Copy link
Author

eassmann commented May 3, 2013

I stand corrected. At the select, fish is really just waiting for input. I cannot reconstruct why I thought it had crashed the other day.

However, it is still not usable because the calls futex(0x81ef030, FUTEX_WAIT_PRIVATE, 2, NULL) and fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) take such a long time. If you want, the problem is 33% solved, since 1 out of originally 3 lagging syscalls has been avoided :-).

@frogshead
Copy link
Contributor

i had lagging problem my ubuntu box as well when my history file was corrupted, there were some invalid or empty commands (or something). When I wipe out whole file and created new one then lagging were gone.

@ridiculousfish
Copy link
Member

You might also just had a lot of history. We cap it at 262144 ( = 256 * 1024) items. History searches may take the lock and not release it for a while. We should do better - reader/writer locks, or more fine grained locking.

@ridiculousfish
Copy link
Member

I'm surprised that the F_SETLKW takes a long time - I would expect it to immediately if locks are not supported.

@zanchey
Copy link
Member

zanchey commented May 4, 2013

(A side note - if you use strace -T it will print the amount of time spent in each call.)

@zanchey
Copy link
Member

zanchey commented Oct 19, 2015

Was there ever any progress with this? I wondered idly if NFSv4 might be displaying some unusual semantics here, but there hasn't been any activity in the last two years. I'm not sure if @xebtl is still having the problem?

@eassmann
Copy link
Author

From my side, no progress. More precisely, I simply stopped trying after the previous discussion — I do not use the machine in question (a cluster) all that much, so it was not worth more effort to me. I actually use the machine even less these days, but if it is of interest to you, I can certainly try the current version of fish.

On the other hand, is NFSv4 a new version? The NFS on our cluster is not being updated, if that was your point.

@zanchey
Copy link
Member

zanchey commented Oct 19, 2015

NFSv3 and NFSv4 are two quite different protocols, and certainly at our site we have aggressively avoided migration to NFSv4 due to a variety of problems - Linux server performance, reduced fault tolerance and changed semantics. We have picked up at least one problem in fish that occurs only on NFSv4 mounts, though.

I'd be interested to know if it still happens. There's been quite a bit of change to the code in this area.

@eassmann
Copy link
Author

How do I find out if my cluster is using v3 or v4?

@eassmann
Copy link
Author

@zanchey: nfsstat says “Client nfs v3”, so I guess it is NFSv3.

@eassmann
Copy link
Author

I compiled the current 2.2.0 tarball from the website. Unfortunately, it seems that the problem persists:

futex(0x184cb980, FUTEX_WAIT_PRIVATE, 2, NULL) = 0 <289.078032>
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0 <0.000058>
open("/home/assmann/.config/fish/fish_history", O_WRONLY|O_APPEND) = 6 <0.326795>
fcntl(6, F_GETFD)                       = 0 <0.000029>
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0 <0.000021>
fstat(6, {st_mode=S_IFREG|0644, st_size=182, ...}) = 0 <0.000020>
fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOLCK (No locks available) <59.370349>

@zanchey
Copy link
Member

zanchey commented Oct 22, 2015

My suspicion is that the NFS server isn't running an NFS locking manager (rpcinfo -p SERVERNAME | grep nlockmgr should produce some output if the locking manager is present). If that's all OK, could you try building the current git master? Or using a nightly build?

@eassmann
Copy link
Author

It seems the locking manager is present.

I failed to build from master because the configuration needs autoconf 2.60 but only 2.59 is available for my OS (CentOS 5.11). Installing the nightly RPM failed as well, I think also because the OS release is too old. Is there any way to get around the autoconf dependency?

@zanchey
Copy link
Member

zanchey commented Oct 27, 2015

Try autoreconf rather than autoconf to get the source to build.

@eassmann
Copy link
Author

That still complains

$ autoreconf                                                                       
configure.ac:13: error: Autoconf version 2.60 or higher is required                                                    
configure.ac:13: the top level                                                                                         
autom4te: /usr/bin/m4 failed with exit status: 63                                                                      
aclocal: autom4te failed with exit status: 63                                                                          
autoreconf: aclocal failed with exit status: 63                                                                        

I tried to remove the constraint from configure.ac, but then autoreconf reports an error and the resulting configure fails with

./configure: line 2069: AC_PROG_SED: command not found
CXXFLAGS: -g -O2
checking for msgfmt... msgfmt
./configure: line 2236: syntax error near unexpected token `$doxygen_version,'
./configure: line 2236: `            AS_VERSION_COMPARE($doxygen_version, $doxygen_minimum,'

@zanchey
Copy link
Member

zanchey commented Oct 29, 2015

Of course - my apologies.

Try the nightly tarball, which has the configure file generated. You should just be able to use ./configure; and make install. You can download it from the openSUSE build service - e.g. fish_2.2.0-392-g1055864.orig.tar.gz.

@eassmann
Copy link
Author

That tarball built, but the problem is still there. Sorry.

@zanchey
Copy link
Member

zanchey commented Oct 30, 2015

No, that's good to know!

@krader1961
Copy link
Contributor

Is anyone still interested in debugging this? If not we should close it. Getting a strace with timing data along with fish -d5 output would be really helpful.

@eassmann
Copy link
Author

I ran another test with the the build of fish-2.2.0-392-g1055864 on the same machine, using the command strace -fTostrace ./fish -d 5 2>stderr. I entered ls^M^D and waited for fish to exit.

I put the complete output in a gist. The syscalls that take a long time are

#line   PID
----- -----
 4999 28559 fcntl(6, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>
 5107 28561 futex(0x1d7d4b60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
 5196 28551 futex(0x1d7d4b60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
 5197 28559 <... fcntl resumed> )             = -1 ENOLCK (No locks available) <30.004000>
 5203 28561 <... futex resumed> )             = 0 <29.925293>
 5210 28551 <... futex resumed> )             = 0 <29.597652>
 5241 28551 fcntl(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0} <unfinished ...>
 5351 28551 <... fcntl resumed> )             = -1 EIO (Input/output error) <150.011093>

As far as I could tell, fish's corresponding debug output was

fish: path_get_path( 'l' )
fish: path_get_path( 'ls' )
fish: path_get_path( 'ls' )
fish: path_get_path( 'ls' )

If you want to see output with different options, just let me know. If you want me to try with a newer version, ditto, though I would have to compile that.

@krader1961
Copy link
Contributor

This is a bug in the NFS statd service. See for example https://bugzilla.kernel.org/show_bug.cgi?id=8159. Your OS is broken. Nothing fish can do about it other than not attempt to obtain locks on the history file and I don't think we're going to do that.

@krader1961
Copy link
Contributor

@eassmann: You can fix your immediate problem by setting $XDG_DATA_HOME to a path on a local file system so your fish history file is no longer on NFS.

While walking my dogs this morning it occurred to me we can deal with this type of situation better. Simply time how long it takes to obtain the lock (or fails as the case may be) and if it is above a threshold (say 250 ms) then emit a warning and stop trying to lock the history file. It's better to risk losing history than to have an unusable shell.

@krader1961 krader1961 changed the title fish hangs on CentOS 5.8 handle broken locking on remote filesystems better Dec 13, 2016
@eassmann
Copy link
Author

@krader1961: Thanks for the tip about $XDG_DATA_HOME.

It's better to risk losing history than to have an unusable shell.

I absolutely agree.

@krader1961 krader1961 self-assigned this Dec 16, 2016
@krader1961 krader1961 modified the milestones: fish 2.5.0, fish-future Dec 16, 2016
@zanchey zanchey added the release notes Something that is or should be mentioned in the release notes label Dec 19, 2016
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement release notes Something that is or should be mentioned in the release notes
Projects
None yet
Development

No branches or pull requests

5 participants