handle broken locking on remote filesystems better #685
Comments
Is your home directory perhaps on an NFS mount, or any filesystem other than something simple and local? |
Yes, it is on NFS. I neglected to mention that because (a) the file system is usually fast, (b) I tried running fish in a local mount and it did not change anything, and (c) I simply did not remember that it might be about the home directory rather than the working directory. Anyway, running fish with $HOME set to a local directory gets rid of the hangs, so your guess was spot on. It would still be nice to be able to use fish. Changing my $HOME is not really an option, but I could conceivably have the .config/fish on the local FS, if that is possible. (To elucidate: The machine is a cluster with a shared NFS and "scratch" directories on local HDDs. I do want my $HOME to be on the NFS, but it would be okay to run fish only on the login node and have the config directory on its scratch FS.) |
I run fish with an NFS-mounted home directory, and it seems fine. Are you running an NFS lock manager? As you may know, |
I am definitely interested in the answer to zanchey's question. I would also like to make fish not fall over in the presence of even an NFS mount that does not support file locking. Does anyone have any suggestions for how best to set one up for testing? |
zanchey: I see that all of the syscalls that caused the trouble seem to be locking-related :-). nlockmgr does show up in the output of rpcinfo:
However, the NFS share is mounted 'nolock', if that has anything to do with it. I had a problem a while ago where some program failed because it was not able to do locks on the NFS. I do not now remember what program it was, but a quick web search suggests that this has to do with flock() rather than fcntl(), so maybe it is unrelated. For completeness, I will also mention that sometimes on this machine, yum will hang in a futex syscall (contrary to fish, it never recovers). One then needs to remove some stale lock files. ridiculousfish: About the 'presence' of an NFS mount, the mere presence seems not to be enough. When I start fish with HOME set to a local mount, I can go to the NFS mount and work there without problems. (Sorry if this was obvious.) |
What's happening is that fish attempts to take a file lock and then append to history. If it can't take a file lock, it falls back to the old mode of creating a new file containing the old and new history, and then moving that in place. Creating the new file is slow, as it requires copying the contents of the old file each time. This is currently not multithreaded. |
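The two paths described above can be sketched roughly as follows. This is a minimal illustration in Python, not fish's actual code (fish is written in C++): the function names `try_lock`, `rewrite_history` and `append_history` are hypothetical, and the sketch assumes a non-blocking lock attempt so that the fallback is easy to reach.

```python
import fcntl
import os
import tempfile


def try_lock(fd):
    """Try to take an exclusive lock without blocking; report success as a bool."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def rewrite_history(path, record):
    """Slow path: copy the old contents plus the new record, then rename into place."""
    with open(path, "rb") as old_file:
        old = old_file.read()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp_file:
        tmp_file.write(old + record.encode())
    os.replace(tmp_path, path)


def append_history(path, record, try_lock=try_lock):
    """Append under a lock if possible; otherwise fall back to a full rewrite."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        if try_lock(fd):
            try:
                os.write(fd, record.encode())
            finally:
                fcntl.lockf(fd, fcntl.LOCK_UN)
            return "append"
    finally:
        os.close(fd)
    rewrite_history(path, record)
    return "rewrite"
```

The slow path reads the entire old file on every call, which is why its cost grows with the size of the history.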
I relaxed the constraint that we must take a lock to do the history append-in-place, and now we always write complete records with O_APPEND, which is closer to what bash does. This probably won't always work on lockless NFS - you may lose some new history items. But performance should be better most of the time. Fixed as 3e69e5b . Thanks for reporting this. |
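For illustration, appending a complete record with O_APPEND and no lock might look like the sketch below. It is written in Python rather than fish's C++, and `append_record` is a hypothetical name, not a function from commit 3e69e5b.

```python
import os


def append_record(path, record):
    """Write one complete, newline-terminated record with a single write() call."""
    data = record.encode()
    if not data.endswith(b"\n"):
        data += b"\n"
    # O_APPEND makes the kernel position each write at the current end of file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, data)
    finally:
        os.close(fd)
    # A short write would leave a partial record; report it to the caller.
    return written == len(data)
```

On a local POSIX filesystem the kernel handles the end-of-file positioning. The Linux open(2) manual page warns that O_APPEND may corrupt files on NFS when several processes append at once, which is consistent with the caveat above about losing items.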
I have now had a chance to try this out. Sorry to say this, but the problem persists. I will also say that the lag is much longer than I would expect for making a copy of the history. Actually, after the patch fish now seems to hang forever:
similar as before, then
and I have not seen that call return. |
Thanks for your conscientious testing.
That shows fish attempting to take the lock, failing, appending to the history anyways, and then closing it, which is as designed. So the fix appears to be working correctly so far.
This shows fish waiting on input from file descriptors. (fd 4 is probably the iothread completion pipe.) 0 is stdin. So it looks like fish is just waiting for input on stdin. |
I stand corrected. At the However, it is still not usable because the calls |
I had a lagging problem on my Ubuntu box as well when my history file was corrupted; it contained some invalid or empty commands (or something like that). When I wiped the whole file and created a new one, the lag was gone. |
You might also just have had a lot of history. We cap it at 262144 ( = 256 * 1024) items. History searches may take the lock and not release it for a while. We should do better - reader/writer locks, or more fine grained locking. |
I'm surprised that the F_SETLKW takes a long time - I would expect it to return immediately if locks are not supported. |
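For reference, F_SETLKW is the blocking form of the fcntl() lock request and F_SETLK is the non-blocking form. The sketch below probes a file with the non-blocking form and sorts the outcome into a few cases. It is an illustration only: `probe_lock` and `classify_lock_error` are hypothetical names, and nothing in this thread establishes whether a non-blocking request would avoid the delay seen on this particular NFS mount.

```python
import errno
import fcntl


def classify_lock_error(err):
    """Map an OSError from a lock request to a coarse category."""
    if err.errno in (errno.EAGAIN, errno.EACCES):
        return "busy"  # another process holds a conflicting lock
    if err.errno == errno.ENOLCK:
        return "unavailable"  # no locks available, e.g. remote locking failed
    return "error"


def probe_lock(fd):
    """Request an exclusive lock without blocking, then release it again."""
    try:
        # With LOCK_NB, Python's lockf issues a non-blocking fcntl() lock request.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as err:
        return classify_lock_error(err)
    fcntl.lockf(fd, fcntl.LOCK_UN)
    return "ok"
```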
(A side note - if you use |
Was there ever any progress with this? I wondered idly if NFSv4 might be displaying some unusual semantics here, but there hasn't been any activity in the last two years. I'm not sure if @xebtl is still having the problem? |
From my side, no progress. More precisely, I simply stopped trying after the previous discussion — I do not use the machine in question (a cluster) all that much, so it was not worth more effort to me. I actually use the machine even less these days, but if it is of interest to you, I can certainly try the current version of fish. On the other hand, is NFSv4 a new version? The NFS on our cluster is not being updated, if that was your point. |
NFSv3 and NFSv4 are two quite different protocols, and certainly at our site we have aggressively avoided migration to NFSv4 due to a variety of problems - Linux server performance, reduced fault tolerance and changed semantics. We have picked up at least one problem in fish that occurs only on NFSv4 mounts, though. I'd be interested to know if it still happens. There's been quite a bit of change to the code in this area. |
How do I find out if my cluster is using v3 or v4? |
@zanchey: |
I compiled the current 2.2.0 tarball from the website. Unfortunately, it seems that the problem persists:
|
My suspicion is that the NFS server isn't running an NFS locking manager ( |
It seems the locking manager is present. I failed to build from master because the configuration needs autoconf 2.60 but only 2.59 is available for my OS (CentOS 5.11). Installing the nightly RPM failed as well, I think also because the OS release is too old. Is there any way to get around the autoconf dependency? |
Try |
That still complains
I tried to remove the constraint from
|
Of course - my apologies. Try the nightly tarball, which has the configure file generated. You should just be able to use |
That tarball built, but the problem is still there. Sorry. |
No, that's good to know! |
Is anyone still interested in debugging this? If not we should close it. Getting a strace with timing data along with |
I ran another test with the build of fish-2.2.0-392-g1055864 on the same machine, using the command I put the complete output in a gist. The syscalls that take a long time are
As far as I could tell, fish's corresponding debug output was
If you want to see output with different options, just let me know. If you want me to try with a newer version, ditto, though I would have to compile that. |
This is a bug in the NFS |
@eassmann: You can fix your immediate problem by setting While walking my dogs this morning it occurred to me we can deal with this type of situation better. Simply time how long it takes to obtain the lock (or to fail, as the case may be), and if it is above a threshold (say 250 ms), emit a warning and stop trying to lock the history file. It's better to risk losing history than to have an unusable shell. |
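The timing idea above could be sketched like this. It is a hedged Python illustration of the proposal, not the change that fish eventually made: the class name `TimedLocker`, its parameters, and the warning text are all assumptions; only the 250 ms threshold comes from the comment.

```python
import fcntl
import sys
import time


def blocking_lock(fd):
    """Default lock function: a blocking exclusive lock request."""
    fcntl.lockf(fd, fcntl.LOCK_EX)


class TimedLocker:
    """Times each lock attempt and gives up on locking once one is too slow."""

    def __init__(self, threshold=0.25, lock_fn=blocking_lock, clock=time.monotonic):
        self.threshold = threshold  # seconds; 250 ms as suggested above
        self.lock_fn = lock_fn
        self.clock = clock
        self.abandoned = False

    def acquire(self, fd):
        """Return True if the lock is held; the caller must release it."""
        if self.abandoned:
            return False  # locking was disabled earlier; proceed without it
        start = self.clock()
        try:
            self.lock_fn(fd)
            held = True
        except OSError:
            held = False
        elapsed = self.clock() - start
        if elapsed > self.threshold:
            print("warning: history lock took %.0f ms; locking disabled"
                  % (elapsed * 1000), file=sys.stderr)
            self.abandoned = True
        return held
```

A caller would keep one `TimedLocker` for the lifetime of the shell, so a single slow attempt costs one delay and every later history write skips the lock.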
@krader1961: Thanks for the tip about
I absolutely agree. |
On my CentOS 5.8 system, fish hangs for a long time (minutes) after every command, built-in or external, though not after a blank line. The command output and new prompt will be printed, then the hang occurs.
It also hangs after entering a comment, although in this case before printing a prompt.
It does not hang when a command is executed as 'fish -c ls'
strace snippet executing 'ls', the blank lines indicate long waits: