Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On Windows, System.IO.Directory.Delete() intermittently fails quietly, is unexpectedly asynchronous #27958

Closed
mklement0 opened this issue Nov 19, 2018 · 9 comments

Comments

@mklement0
Copy link

Note: cmd.exe's rd /s and PowerShell's Remove-Item are equally affected: see here and here.

Note: The problem occurs only on Windows (the problem also affects the "full" .NET Framework there, albeit with slightly different symptoms).

The Windows API functions DeleteFile() and RemoveDirectory() functions are inherently asynchronous (emphasis added):

The DeleteFile function marks a file for deletion on close. Therefore, the file deletion does not occur until the last handle to the file is closed. Subsequent calls to CreateFile to open the file fail with ERROR_ACCESS_DENIED."

The RemoveDirectory function marks a directory for deletion on close. Therefore, the directory is not removed until the last handle to the directory is closed."

System.IO.Directory.Delete() fails to account for this asynchronous behavior, which has two implications:

  • Problem (a): Trying to delete a nonempty directory (which invariably requires recursive deletion of its content first) can fail - infrequently, but it does happen, and the failure is quiet.

  • Problem (b): Trying to recreate a successfully deleted directory or a file immediately afterwards may fail - intermittently (easier to provoke than the other problem, but a more exotic use case overall).

    • Note: You could argue that it's not .NET Core's responsibility to create a synchronous experience in this case, given that the underlying system API is asynchronous; however, it certainly makes for more robust, predictable programs if synchronicity is ensured - as is the case on Unix-like platforms, where the system APIs for file removal are synchronous.

Problem (a) is due to using depth-first recursion without accounting for the asynchronous deletion behavior; the problem, along with the solution, is described in this YouTube video (starts at 7:35).

Note that the source code does at least hint at the asynchronous behavior:

https://github.com/dotnet/corefx/blob/40364e539572e9dad9c8a2eb165fc9af28e5664a/src/System.IO.FileSystem/src/System/IO/FileSystem.Windows.cs#L532-L534

However, in practice the recursive removal of content does fail intermittently - and quietly - leaving both the target directory and remnants inside it behind.

Important:

  • For convenience, PowerShell Core is used to reproduce the problem below.
  • The repros below use the functions defined at the bottom, which must be defined first.
  • The tests create temp. dir. $HOME/tmpDir - remove it manually afterwards, if still present.

Steps to reproduce

Setup:

  • Open a PowerShell Core shell.
  • Paste and submit the code with the two function definitions below at the prompt to define the functions used in the tests.

Problem (a):

Assert-ReliableDirRemoval .net

Problem (b):

Assert-SyncDirRemoval .net

Expected behavior

The functions should loop indefinitely (terminate them with Ctrl+C), emitting a . in each iteration.

Actual behavior

Eventually - and that there is no predictable time frame is indicative of the problem - an error will occur, on the order of minutes or even longer.

Problem (a):

Reliable dir-tree-removal test: Repeatedly creating and removing a directory subtree, using .net for removal.
........Deletion failed quietly or with exit code 0, tmpDir still has content: sub

That is, recursive removal of the target dir's content failed quietly due to async timing issues, and both the target directory and remnants inside it linger.

Problem (b):

Synchronous dir-removal test: Repeatedly creating and removing a directory, using ps for removal.
.................New-Item : An item with the specified name C:\Users\jdoe\tmpDir already exists.

That is, recreating the target dir failed, because the prior removal hadn't yet completed.

Aux. function definitions:

Functions Assert-ReliableDirRemoval and Assert-SyncDirRemoval

You can paste the entire code at the prompt in order to define these functions.

function Assert-ReliableDirRemoval {
  param([parameter(Mandatory)] [ValidateSet('cmd', '.net', 'ps')] [string] $method)
  Write-Host "Reliable dir-tree-removal test: Repeatedly creating and removing a directory subtree, using $method for removal."
  # Treat all errors as fatal.
  $ErrorActionPreference = 'Stop'
  Set-Location $HOME  # !! Seemingly, a dir. in the user's home dir. tree reproduces the symptom fastest - $env:TEMP does not.
  # Remove a preexisting directory first, if any.
  Remove-Item -EA Ignore -Recurse tmpDir
  While ($true) {
    Write-Host -NoNewline .
    # Create a subdir. tree.
    $null = New-Item -Type Directory tmpDir, tmpDir/sub, tmpDir/sub/sub, tmpDir/sub/sub/sub
    # Fill each subdir. with 1000 empty test files.
    "tmpDir", "tmpDir/sub", "tmpDir/sub/sub", "tmpDir/sub/sub/sub"| ForEach-Object {
      $dir = $_
      1..1e3 | ForEach-Object { $null > "$dir/$_" }
    }
    # Now remove the entire dir., which invariably involves deleting its contents
    # recursively first.
    switch ($method) {
      'ps'  { Remove-Item -Recurse tmpDir }
      '.net' { [System.IO.Directory]::Delete((Convert-Path tmpDir), $true) }
      'cmd'  { cmd /c rd /s /q tmpDir; if ($LASTEXITCODE) { Throw } }
      # !! If rd /s fails during recursive removal due to async issues, it emits a stderr line, but 
      # !! does NOT report a nonzero exit code. We detect this case below.
    }
    # Does the dir. unexpectedly still exist?
    # This can happen for two reasons:
    #  - The removal of the top-level directory itself has been requested, but isn't complete yet.
    #  - Removing the content of the top-level directory failed quietly.
    while (Test-Path tmpDir) {
      if ($output = Get-ChildItem -EA Ignore -Force -Name tmpDir) { # Still has content?
        Throw "Deletion failed quietly or with exit code 0, tmpDir still has content: $output"
      } else {
        # Top-level directory removal isn't complete yet.
        # Wait for removal to complete, so we can safely recreate the directory in the next iteration.
        # This loop should exit fairly quickly.
        Write-Host -NoNewline !; Start-Sleep -Milliseconds 100
      }      
    }    
  }
}

function Assert-SyncDirRemoval { 
  param([parameter(Mandatory)] [ValidateSet('cmd', '.net', 'ps')] [string] $method)
  Write-Host "Synchronous dir-removal test: Repeatedly creating and removing a directory, using $method for removal."
  # Treat all errors as fatal.
  $ErrorActionPreference = 'Stop'
  Set-Location $HOME  # !! Seemingly, a dir. in the user's home dir. tree reproduces the symptom fastest - $env:TEMP does not.
  While ($true) {
    Write-Host -NoNewline .
    # Remove the test dir., which invariably involves deleting its contents recursively first.
    # Note: This could itself fail intermittently, but with just 10 files and no subdirs. is unlikely to.
    if (Test-Path tmpDir) {
      switch ($method) {
        'ps' { Remove-Item -Recurse tmpDir }
        '.net' { [System.IO.Directory]::Delete((Convert-Path tmpDir), $true) }
         'cmd'  { cmd /c rd /s /q tmpDir; if ($LASTEXITCODE) { Throw } }
         # !! If rd /s fails during recursive removal due to async issues, it emits a stderr line, but 
         # !! does NOT report a nonzero exit code. We detect this case below.
      }
    }
    # (Re-)create the dir. with 10 empty test files.
    # If the previous removal hasn't fully completed yet, this will fail.
    # Note: [System.IO.Directory]::Delete() could have quietly failed to remove the contents of the dir.
    #       due to async timing accidents, but, again, this is unlikely with 1- files and 
    try {
      $null = New-Item -Type Directory tmpDir
    } catch {
      # Handle the case where removal failed
      # quietly due to async vagaries while removing the dir's *content*, which
      # can happen with [System.IO.Directory]::Delete().
      # Note that if removal succeeded but is pending - the scenario we're trying to
      # get to occur - Get-ChildItem reports an access-denied error, which we ignore
      # in favor of reporting the original error from the dir. re-creation attempt.
      if ($output = Get-ChildItem -EA Ignore -Force -Name tmpDir) { # Still has content?
        Write-Warning "Ignoring failed content removal, retrying..."
        # Simply try again
        Start-Sleep -Milliseconds 500; Remove-item tmpDir -Recurse
        continue
      }
      # Re-creation failed due to async removal.
      Throw $_
    }
    1..10 | ForEach-Object { $null > tmpDir/$_ }
  }
}

Environment data

Microsoft.NETCore.App version 2.1.5
@MarcoRossignoli
Copy link
Member

/cc @JeremyKuhne

@JeremyKuhne
Copy link
Member

We're sort of stuck here due to the nature of file deletion on Windows as described. I know that Windows is looking at changing the deletion behavior to immediately remove files from the namespace on deletion. It is a tricky thing to do as there is almost certainly code that actually depends on the current Windows behavior.

I believe we'd be more likely to introduce performance or reliability regressions by attempting to rename than we would be to fix problems. Additionally, if the Windows change makes it in this will be moot. (Possibly in the Spring update?)

@JeremyKuhne
Copy link
Member

I've validated the new deletion behavior is in Windows 10 1903. Filenames are immediately freed when delete is called. Closing as that gets rid of the underlying problem here.

@CoskunSunali
Copy link

I realize this issue has been closed but I am having the same issue on 1903.

var source = GetSource();
var target = GetTarget();

if (Directory.Exists(target))
{
    Directory.Delete(target, true);
}

Directory.Move(source, target);

The code above fails when moving the directory if the process is run in release mode (or debug mode without a debugger attached) and throws the following exception:

$TARGET$ below is just for demonstration purposes.

System.IO.IOException: Cannot create '$TARGET$' because a file or directory with the same name already exists.
   at System.IO.Directory.Move(String sourceDirName, String destDirName)

If I make the thread sleep for a few seconds after calling Directory.Delete or put a debug breakpoint and continue a few seconds after calling Directory.Delete, the things work fine.

Is there a suggested workaround from the corefx team?

.NET Core SDK (reflecting any global.json):
Version:   3.0.100
Commit:    04339c3a26

Runtime Environment:
OS Name:     Windows
OS Version:  10.0.18362
OS Platform: Windows
RID:         win10-x64
Base Path:   C:\Program Files\dotnet\sdk\3.0.100\

Host (useful for support):
  Version: 3.0.0
  Commit:  7d57652f33

.NET Core SDKs installed:
  2.2.402 [C:\Program Files\dotnet\sdk]
  3.0.100 [C:\Program Files\dotnet\sdk]                             
                                                                                                                                                                            .NET Core runtimes installed:                                                                                             Microsoft.AspNetCore.All 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.WindowsDesktop.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

@mklement0
Copy link
Author

Agreed, @CoskunSunali; I can confirm that both (PowerShell-based) tests in the OP still eventually fail.

Pasting the functions in a PowerShell window and then running Assert-SyncDirRemoval .net is sufficient.

Assert-ReliableDirRemoval .net also eventually fails, but takes much longer.

@JeremyKuhne: Any thoughts?

Does the WinAPI now officially guarantee synchronous behavior and did the implementation fall short, or is there something else going on?

@JeremyKuhne
Copy link
Member

Any thoughts?

Fundamentally it's a bad idea to reuse paths right after you delete them if you can help it. If you can't, you really need to add some sort of retry logic. One other possible thing to try is moving the directory to a temporary name and deleting that.

While Windows has been making changes it is only in the most recent versions and they haven't
even documented the changes they've made afaik. There is no reference to updated behavior in DeleteFile(), but you easily observe the behavior as we have in our tests.

As I mentioned above there really isn't anything .NET can do to mitigate this as we don't have the context to know we should retry. Anything "smart" we would try and do would likely introduce more issues than we would resolve. :/

cc: @carlossanlop

@mklement0
Copy link
Author

mklement0 commented Oct 21, 2019

Understood re not wanting to fix this in .NET.

At the WinAPI level It would be nice to know when this will eventually be fixed reliably - your comment that accompanied closing this issue mistakenly gave the impression that this has already happened.

A link to an official promise / announcement / roadmap - if there is such a thing - would be great, so users can reliably tell when it will become safe to rely on code that assumes synchronous deletion.

Fundamentally it's a bad idea to reuse paths right after you delete them

It's only a bad idea with a filesystem that behaves asynchronously, and it is the latter I would consider a bad idea.

@JeremyKuhne
Copy link
Member

A link to an official promise / announcement / roadmap - if there is such a thing - would be great, so users can reliably tell when it will become safe to rely on code that assumes synchronous deletion.

I don't think there is one and I can't find one. :(

It's only a bad idea with a filesystem that behaves asynchronously, and it is the latter I would consider a bad idea.

Bad idea or not, that is the way NTFS has worked for 30 years. Some people actually rely on it working that way as a type of locking mechanism. :)

Even outside of NTFS, delete/recreate isn't the most reliable pattern and should be avoided when it isn't strictly needed as it can cause grief with multiple instances. Quite often I see this pattern with temporary files- which are much better suited to unique names.

@mklement0
Copy link
Author

Note:

  • The following update is based on running the Assert-ReliableDirRemoval .net test above successfully in Windows 10 version 20H2 and the information gleaned from this Stack Overflow post.

  • As of this writing, the documentation of the underlying DeleteFile Windows API makes no mention of the change in behavior.


Starting with Windows 10 version 1909, (at least) build 18363.657 (I don't know that Windows Server version and build that corresponds to; run winver.exe to check your version and build), the DeleteFile Windows API function now exhibits synchronous behavior at the Windows API level, which implicitly solves the problems with PowerShell's Remove-Item and .NET's System.IO.File.Delete / System.IO.Directory.Delete (but, curiously, not with cmd.exe's rd /s).

@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants