lukaszsamson
Contributor

Deal with errors returned by `File.rename`:
On Unix-like systems, rename typically overwrites the destination. On Windows it fails with `eexist`.
In `try_lock`, retry locking.
In `unlock`, swallow the error and do a best-effort cleanup.

This PR addresses an error observed in ElixirLS logs on Windows:

an exception was raised:
    ** (File.RenameError) could not rename from "<REDACTED: user-file-path>" to "<REDACTED: user-file-path>": file already exists
        (elixir 1.18.4) <REDACTED: user-file-path>:793: File.rename!_2
USER_PATH_mix_sync_lock.ex:306: Mix.Sync.Lock.take_over_2
USER_PATH_mix_sync_lock.ex:179: Mix.Sync.Lock.try_lock_4
USER_PATH_mix_sync_lock.ex:138: Mix.Sync.Lock.lock_2
USER_PATH_mix_sync_lock.ex:108: Mix.Sync.Lock.with_lock_3
USER_PATH_mix_tasks_deps.loadpaths.ex:68: Mix.Tasks.Deps.Loadpaths.run_1
USER_PATH_mix_task.ex:495: anonymous fn_3 in Mix.Task.run_task_5
USER_PATH_mix_tasks_loadpaths.ex:37: Mix.Tasks.Loadpaths.run_1
USER_PATH_mix_task.ex:495: anonymous fn_3 in Mix.Task.run_task_5
USER_PATH_mix_tasks_compile.ex:136: Mix.Tasks.Compile.run_1
USER_PATH_mix_task.ex:495: anonymous fn_3 in Mix.Task.run_task_5

@jonatanklosko
Member

Hey @lukaszsamson!

> On Windows it fails with eexist

Are you able to reproduce this on your machine? For me the file is replaced, even if open:

iex(1)> File.write!("a.txt", "a")
:ok
iex(2)> File.write!("b.txt", "b")
:ok
iex(3)> File.open("b.txt")
{:ok, #PID<0.107.0>}
iex(4)> File.rename!("a.txt", "b.txt")
:ok
iex(5)> File.read!("b.txt")
"a"

Perhaps it has to do with the file system, settings, or permissions. It would be good to know whether the issue only appears when the file is open, or regardless.

File.rename!(port_path, lock_path)
# We linked to lock_N successfully, so port_path should exist.
# On Windows, renaming to an existing destination returns :eexist.
# In that case, another process won the race; we signal a retry.
Member

@jonatanklosko jonatanklosko Sep 26, 2025

There should be no races here, only one process should be able to obtain lock_N and go into take-over.

We need to figure out in which circumstances File.rename/2 fails and find a workaround for that. It is expected that the file may be open temporarily, as another Elixir process reads its content while locking; in that case we could wait a bit and retry the rename. The question is if being open is the issue.
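The "wait a bit and retry" idea could look roughly like the sketch below. It is only an illustration: `RenameRetrySketch`, the attempt count, and the 10 ms pause are hypothetical and are not taken from `Mix.Sync.Lock`.

```elixir
defmodule RenameRetrySketch do
  # Hypothetical helper: rename `source` to `destination`, retrying a
  # bounded number of times when File.rename/2 reports :eexist.
  def rename_with_retry(source, destination, attempts \\ 3)

  # Out of attempts: surface the error to the caller.
  def rename_with_retry(_source, _destination, 0), do: {:error, :eexist}

  def rename_with_retry(source, destination, attempts) when attempts > 0 do
    case File.rename(source, destination) do
      :ok ->
        :ok

      {:error, :eexist} ->
        # Assumed pause; the right value (if any) is an open question.
        Process.sleep(10)
        rename_with_retry(source, destination, attempts - 1)

      {:error, reason} ->
        # Any other error is returned unchanged.
        {:error, reason}
    end
  end
end
```

This only helps if the failure is transient, which is exactly what still needs to be established.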

@jonatanklosko
Member

jonatanklosko commented Sep 26, 2025

Also, for full context, does the exception you posted always happen, or is it non-deterministic?

@jonatanklosko
Member

jonatanklosko commented Sep 26, 2025

Assuming it's non-deterministic, I think I know how this happens.

I just learnt that File.rename is not atomic on Windows:

Here's my understanding of rename(A, B), assuming both A and B exist:

  1. Try moving A to B without overwriting.
  2. Since B exists, step 1 fails and the fallback proceeds.
  3. Move B to TMP with overwriting (MOVEFILE_REPLACE_EXISTING).
  4. Try moving A to B without overwriting (after step 3, B should not exist).
  5. If B exists, step 4 fails and eexist is returned.

It could be atomic by simply using the MOVEFILE_REPLACE_EXISTING flag, but the current implementation takes a different route in order to mirror certain Unix semantics (I am not sure which specifically; perhaps making sure that on failure the original file stays in place).

What can happen is that between 3. and 4. a concurrent process creates B (in our case lock_0), hence the failure.
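To make the window concrete, here is a small in-memory model of the steps listed above. It is a sketch of my understanding, not the OTP implementation: the "file system" is a plain map, `WinRenameModel` and its functions are hypothetical names, and cleanup of TMP is not modelled.

```elixir
defmodule WinRenameModel do
  # The file system is modelled as a map of path => contents.

  # Move without overwriting: fails with :eexist when `to` is present.
  def move_no_replace(fs, from, to) do
    cond do
      not Map.has_key?(fs, from) ->
        {:error, :enoent}

      Map.has_key?(fs, to) ->
        {:error, :eexist}

      true ->
        {contents, fs} = Map.pop(fs, from)
        {:ok, Map.put(fs, to, contents)}
    end
  end

  # Move with overwriting (models MOVEFILE_REPLACE_EXISTING).
  def move_replace(fs, from, to) do
    case Map.pop(fs, from) do
      {nil, _fs} -> {:error, :enoent}
      {contents, fs} -> {:ok, Map.put(fs, to, contents)}
    end
  end

  # rename(a, b); `interleave` runs between steps 3 and 4 and stands in
  # for whatever a concurrent process does in that window.
  def rename(fs, a, b, interleave \\ fn fs -> fs end) do
    case move_no_replace(fs, a, b) do
      {:ok, fs} ->
        {:ok, fs}

      {:error, :eexist} ->
        {:ok, fs} = move_replace(fs, b, "TMP")
        fs = interleave.(fs)
        move_no_replace(fs, a, b)

      other ->
        other
    end
  end
end

fs = %{"A" => "new", "B" => "old"}

# Without interference the fallback succeeds.
{:ok, _fs} = WinRenameModel.rename(fs, "A", "B")

# A concurrent process recreating B between steps 3 and 4 yields :eexist.
racer = fn fs -> Map.put(fs, "B", "racer") end
{:error, :eexist} = WinRenameModel.rename(fs, "A", "B", racer)
```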

I need to test some ideas to see if we can skip File.rename/2 altogether.

@lukaszsamson
Contributor Author

> Are you able to reproduce this on your machine?
> does the exception you posted always happen, or is it non-deterministic?

I wasn't able to reproduce it locally. I've only seen it in ElixirLS telemetry, and only for Windows users. I'd say it's non-deterministic and rather unlikely. Maybe it would be easier to reproduce with multiple processes? See
https://www.erlang.org/doc/apps/kernel/file.html

> File operations are only guaranteed to appear atomic when going through the same file server. A NIF or other OS process may observe intermediate steps on certain operations on some operating systems, eg. renaming an existing file on Windows, or write_file_info/2 on any OS at the time of writing.

> eexist - Destination is not an empty directory. On some platforms, also given when Source and Destination are not of the same type.

The OTP code at https://github.com/erlang/otp/blob/940ec0f6f0370ecf5cd93cae31fd91f4651ddca6/erts/emulator/nifs/win32/win_prim_file.c#L1240 can return eexist in a few cases, for the following Win32 errors:

ERROR_DIR_NOT_EMPTY (The directory is not empty)
ERROR_CANNOT_MAKE (The directory or file cannot be created)
ERROR_ALREADY_EXISTS (Cannot create a file when that file already exists)
ERROR_FILE_EXISTS (The file exists)

The directory errors are unlikely here, so the probable scenario is:

  1. The first MoveFileExW call (https://github.com/erlang/otp/blob/940ec0f6f0370ecf5cd93cae31fd91f4651ddca6/erts/emulator/nifs/win32/win_prim_file.c#L1249) fails, and the last error is ERROR_ALREADY_EXISTS or ERROR_FILE_EXISTS.
  2. The code reaches the fallback rename attempt, which goes through a temp file (https://github.com/erlang/otp/blob/940ec0f6f0370ecf5cd93cae31fd91f4651ddca6/erts/emulator/nifs/win32/win_prim_file.c#L1300).
  3. The NIF moves the destination out of the way to a temp path.
  4. If another process creates or recreates the destination path during this window, the second move (old → new) can fail again with ERROR_FILE_EXISTS or ERROR_ALREADY_EXISTS, which windows_to_posix_errno maps to EEXIST.
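If someone wants to try reproducing this on Windows, a rough stress script along these lines might help. It is an untested sketch: `RenameStress`, the `rename_stress` directory, and the iteration count are all made up, and since the quoted documentation only promises atomic appearance within a single file server, it is meant to be started from several OS processes at the same time.

```elixir
defmodule RenameStress do
  # Hypothetical stress helper; directory and file names are made up.
  def run(dir, iterations) do
    File.mkdir_p!(dir)
    dest = Path.join(dir, "lock_0")
    # OS process id, used to keep source names unique across runs.
    id = System.pid()

    results =
      for i <- 1..iterations do
        src = Path.join(dir, "src_#{id}_#{i}")
        File.write!(src, id)
        # Recreate the destination so concurrent runs keep racing on it.
        _ = File.write(dest, id)
        result = File.rename(src, dest)
        # Remove the source if the rename did not consume it.
        _ = File.rm(src)
        result
      end

    # Count how often each result (:ok or {:error, reason}) occurred.
    Enum.frequencies(results)
  end
end

Path.join(System.tmp_dir!(), "rename_stress")
|> RenameStress.run(1_000)
|> IO.inspect(label: "rename results")
```

Any `{:error, :eexist}` entries in the printed frequencies would support the scenario above; their absence would not rule it out.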

@jonatanklosko
Member

I opened #14800 with an alternative to File.rename/2. Using File.rename/2 introduces race conditions, even if we retry. It briefly removes the destination file (lock_0) and it is the same category of problems as removing lock_0 in unlock/2 (which we intentionally don't do).

@jonatanklosko
Member

For reference, below is a specific race condition, even if we retry (this PR). In practice it is extremely unlikely, because it requires several processes interleaving operations in just the right order, but it is not impossible.

  • no process holds the lock, lock_0 is stale
  • p1 and p2 try to link to lock_0, both fail and then detect it's stale
  • p1 links to lock_1
  • p2 fails to link to lock_, and reads lock_1 port
  • p1 goes into takeover and calls File.rename/2, which briefly removes lock_0
  • p3 links to lock_0, it works because the spot is empty, it has the lock
  • p1 fails in the middle of File.rename/2 (the reported error), it retries
  • p1 crashes
  • p2 connects to lock_1 port, which fails (p1 socket no longer open)
  • p2 links to lock_2
  • p2 starts takeover and calls File.rename/2, this overrides the active lock held by p3
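The interleaving above can be replayed against a toy model. This is a simplification for illustration only: the lock directory is a map of file name to owner, `LockRaceModel` and its functions are hypothetical, and ports, sockets, and stale detection are left out.

```elixir
defmodule LockRaceModel do
  # link/3 models creating a lock file: it fails if the name is taken.
  def link(dir, name, owner) do
    if Map.has_key?(dir, name) do
      {:error, :eexist}
    else
      {:ok, Map.put(dir, name, owner)}
    end
  end

  # First half of the non-atomic rename: the destination is moved away.
  def rename_begin(dir, dest), do: Map.delete(dir, dest)

  # Second half: move the source onto the destination without overwriting.
  def rename_finish(dir, src, dest) do
    if Map.has_key?(dir, dest) do
      {:error, :eexist}
    else
      {owner, dir} = Map.pop(dir, src)
      {:ok, Map.put(dir, dest, owner)}
    end
  end

  # Replays the steps from the list and returns the final directory.
  def replay do
    dir = %{"lock_0" => :stale}
    {:error, :eexist} = link(dir, "lock_0", :p1)
    {:error, :eexist} = link(dir, "lock_0", :p2)
    {:ok, dir} = link(dir, "lock_1", :p1)
    {:error, :eexist} = link(dir, "lock_1", :p2)
    # p1 starts the takeover; lock_0 is briefly gone.
    dir = rename_begin(dir, "lock_0")
    # p3 finds the spot empty and takes the lock.
    {:ok, dir} = link(dir, "lock_0", :p3)
    # p1 fails in the middle of the rename (the reported error), then crashes.
    {:error, :eexist} = rename_finish(dir, "lock_1", "lock_0")
    # p2 cannot reach p1, links to lock_2 and starts its own takeover.
    {:ok, dir} = link(dir, "lock_2", :p2)
    dir = rename_begin(dir, "lock_0")
    {:ok, dir} = rename_finish(dir, "lock_2", "lock_0")
    dir
  end
end

# lock_0 now belongs to p2 although p3 still believes it holds the lock.
:p2 = LockRaceModel.replay()["lock_0"]
```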

@josevalim josevalim closed this Sep 30, 2025
@josevalim
Member

Thank you for the report and the proposed fix!
