-
Notifications
You must be signed in to change notification settings - Fork 3.5k
Retry try_lock
in case rename fails
#14793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Deal with errors returned by `File.rename` On Unix-like systems, rename typically overwrites the destination On Windows it fails with `eexist` In `try_lock` retry locking In `unlock` swallow the error and do a best effort cleanup
Hey @lukaszsamson!
Are you able to reproduce this on your machine? For me the file is replaced, even if open:
Perhaps it has to do with file system, settings or permissions. It would be good to know if the issue only appears with the file open, or regardless. |
File.rename!(port_path, lock_path) | ||
# We linked to lock_N successfully, so port_path should exist. | ||
# On Windows, renaming to an existing destination returns :eexist. | ||
# In that case, another process won the race; we signal a retry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no races here, only one process should be able to obtain lock_N
and go into take-over.
We need to figure out in which circumstances File.rename/2
fails and find a workaround for that. It is expected that the file may be open temporarily, as another Elixir process reads its content while locking; in that case we could wait a bit and retry the rename. The question is if being open is the issue.
Also, for full context, does the exception you posted always happen, or is it non-deterministic? |
Assuming it's non-deterministic, I think I know how this happens. I just learnt that Here's my understanding of
It could be atomic by simply using the What can happen is that between 3. and 4. a concurrent process creates B (in our case I need to test some ideas to see if we can skip |
I wasn't able to reproduce it locally. I've only seen it in ElixirLS telemetry and only for windows users. I'd say it's non-deterministic and rather unlikely. Maybe it would be easier to reproduce with multiple processes? See
The otp code https://github.com/erlang/otp/blob/940ec0f6f0370ecf5cd93cae31fd91f4651ddca6/erts/emulator/nifs/win32/win_prim_file.c#L1240 can return
The directory errors are not likely so the probable scenario is:
|
I opened #14800 with an alternative to |
For reference, below is a specific race condition, even if we retry (this PR). In practice it is extremely unlikely, because it requires several processes interleaving operations in just the right order, but it is not impossible.
|
Thank you for the report and the proposed fix! |
Deal with errors returned by
File.rename
On Unix-like systems, rename typically overwrites the destination On Windows it fails with
eexist
In
try_lock
retry lockingIn
unlock
swallow the error and do a best effort cleanupThis PR addresses an error observed in ElixirLS logs on Windows: