Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: File copy is no longer retried, causing builds to randomly fail #9169

Closed
Nezz opened this issue Aug 25, 2023 · 15 comments
Closed

[Bug]: File copy is no longer retried, causing builds to randomly fail #9169

Nezz opened this issue Aug 25, 2023 · 15 comments
Assignees
Labels
author-responded Author responded, needs response from dev team. bug needs-triage Have yet to determine what bucket this goes in. Priority:2 Work that is important, but not critical for the release

Comments

@Nezz
Copy link

Nezz commented Aug 25, 2023

Issue Description

We started receiving random build failures caused by failing file copy operations:

C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): error MSB3021: Unable to copy file "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEngine.UnityCurlModule.dll" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.UnityCurlModule.dll". Access to the path 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.UnityCurlModule.dll' is denied. [C:\jenkins\workspace\<redacted>.csproj]

After investigating we found that this is caused by a regression between .NET SDK 7.0.306 (MSBuild version 17.6.8+c70978d4d for .NET) and 7.0.400 (MSBuild version 17.7.1+971bf70db for .NET). The older version worked around this by doing a retry:

C:\Program Files\dotnet\sdk\7.0.306\Microsoft.Common.CurrentVersion.targets(4862,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.SceneTemplateModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.SceneTemplateModule.xml". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.SceneTemplateModule.xml' because it is being used by another process. [C:\jenkins\workspace\<redacted>.csproj]

The new version fails right away if there is already a retry queued:

[2023-08-24T08:22:18.186Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' because it is being used by another process.  [C:\jenkins\workspace\redacted2.csproj]
[2023-08-24T08:22:18.186Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): error MSB3021: Unable to copy file "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Access to the path 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' is denied. [C:\jenkins\workspace\redacted1.csproj]

Note that in the logs above the two lines are logged for different projects.

Steps to Reproduce

Build a large project that references many of the same files. Exact details or minimal repro project is not provided because root cause has been identified above.

Expected Behavior

File copy operations should be retried

Actual Behavior

File copy operations are not retried and fail

Analysis

Relevant logs:

[2023-08-24T08:22:07.879Z] MSBuild version 17.7.1+971bf70db for .NET
[2023-08-24T08:22:08.081Z] Build started 8/24/2023 8:22:08 AM.
[2023-08-24T08:22:08.384Z]      1>Project "C:\jenkins\workspace\redacted.sln" on node 1 (Clean target(s)).
[2023-08-24T08:22:08.384Z]      1>ValidateSolutionConfiguration:
[2023-08-24T08:22:08.384Z]          Building solution configuration "Debug|Any CPU".
[2023-08-24T08:22:08.585Z]      1>Project "C:\jenkins\workspace\redacted.sln" (1) is building "C:\jenkins\workspace\redacted.csproj" (2) on node 1 (Clean target(s)).
[2023-08-24T08:22:08.586Z]      2>CoreClean:
[2023-08-24T08:22:08.586Z]          Creating directory "obj\Debug\".
[2023-08-24T08:22:08.687Z]      2>Done Building Project "C:\jenkins\workspace\redacted.csproj" (Clean target(s)).
[2023-08-24T08:22:09.846Z]      1>Done Building Project "C:\jenkins\workspace\redacted.sln" (Clean target(s)).
[2023-08-24T08:22:09.846Z] 
[2023-08-24T08:22:09.846Z] Build succeeded.
[2023-08-24T08:22:09.846Z]     0 Warning(s)
[2023-08-24T08:22:09.846Z]     0 Error(s)
[2023-08-24T08:22:09.846Z] 
[2023-08-24T08:22:09.846Z] Time Elapsed 00:00:01.80
[2023-08-24T08:22:10.249Z] SonarScanner for MSBuild 5.13
[2023-08-24T08:22:10.249Z] Using the .NET Core version of the Scanner for MSBuild
...
[2023-08-24T08:22:18.185Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEngine.UnityWebRequestModule.dll" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.UnityWebRequestModule.dll". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.UnityWebRequestModule.dll' because it is being used by another process.  [C:\jenkins\workspace\redacted2.csproj]
[2023-08-24T08:22:18.185Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEngine.PhysicsModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.PhysicsModule.xml". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.PhysicsModule.xml' because it is being used by another process.  [C:\jenkins\workspace\redacted1.csproj]
[2023-08-24T08:22:18.186Z] CSC : warning CS8021: No value for RuntimeMetadataVersion found. No assembly containing System.Object was found nor was a value for RuntimeMetadataVersion specified through options. [C:\jenkins\workspace\Yousician.Analytics.Yap.Tests.csproj]
[2023-08-24T08:22:18.186Z]   Sonar: (UniRx.csproj) Project processed successfully
[2023-08-24T08:22:18.186Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' because it is being used by another process.  [C:\jenkins\workspace\redacted2.csproj]
[2023-08-24T08:22:18.186Z] CSC : warning CS8021: No value for RuntimeMetadataVersion found. No assembly containing System.Object was found nor was a value for RuntimeMetadataVersion specified through options. [C:\jenkins\workspace\Yousician.Networking.Tests.csproj]
[2023-08-24T08:22:18.186Z] CSC : warning CS2008: No source files specified. [C:\jenkins\workspace\redacted.csproj]
[2023-08-24T08:22:18.186Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): error MSB3021: Unable to copy file "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Access to the path 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' is denied. [C:\jenkins\workspace\redacted1.csproj]
...
[2023-08-24T08:22:54.586Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): error MSB3021: Unable to copy file "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Access to the path 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' is denied. [C:\jenkins\workspace\redacted1.csproj]
[2023-08-24T08:22:54.586Z]     12614 Warning(s)
[2023-08-24T08:22:54.586Z]     1 Error(s)
[2023-08-24T08:22:54.586Z] 
[2023-08-24T08:22:54.586Z] Time Elapsed 00:00:42.18

Versions & Configurations

MSBuild version 17.7.1+971bf70db for .NET

@Nezz Nezz added bug needs-triage Have yet to determine what bucket this goes in. labels Aug 25, 2023
@AR-May
Copy link
Member

AR-May commented Aug 29, 2023

Team triage: @GangWang01 could you try to reproduce the bug?

@KalleOlaviNiemitalo
Copy link

There is a similar report in #9190.

@KalleOlaviNiemitalo
Copy link

The log looks like each file got ERROR_SHARING_VIOLATION initially but ERROR_ACCESS_DENIED on the retry. src/Tasks/Copy.cs does not normally retry after ERROR_ACCESS_DENIED but this can be changed via the MSBUILDALWAYSRETRY environment variable. I don't see any obvious change in this logic between 17.6.8+c70978d4d and 17.7.1+971bf70db, though.

@RobSwDev
Copy link

Our builds started failing for the same reason on 9th August - the day after the 7.0.4 sdk was released. Which seemed too neat to be a coincidence.

However, looks like our build agents didn't upgrade to 7.0.4 until a couple of days later - we'd already had loads of failures by then. So it's not that simple - at least for us.

Could this be due to some other Microsoft environmental change around the same time?

@Nezz
Copy link
Author

Nezz commented Aug 31, 2023

It could also be a windows update. The base virtual machine image we use with 7.0.400 was newer than the one we had with 7.0.306.

@KalleOlaviNiemitalo
Copy link

Between the virtual machine images, is there any difference in the file system minifilters listed by fltmc? (I expect not, but it's quick to check.)

@Nezz
Copy link
Author

Nezz commented Sep 1, 2023

Problematic VM:

09:45:00  Filter Name                     Num Instances    Altitude    Frame
09:45:00  ------------------------------  -------------  ------------  -----
09:45:00  storqosflt                              0       244000         0
09:45:00  wcifs                                   0       189900         0
09:45:00  CldFlt                                  0       180451         0
09:45:00  FileCrypt                               0       141100         0
09:45:00  luafv                                   1       135000         0
09:45:00  npsvctrig                               1        46000         0
09:45:00  Wof                                     1        40700         0

Good VM:

10:02:44  Filter Name                     Num Instances    Altitude    Frame
10:02:44  ------------------------------  -------------  ------------  -----
10:02:44  storqosflt                              0       244000         0
10:02:44  wcifs                                   0       189900         0
10:02:44  CldFlt                                  0       180451         0
10:02:44  FileCrypt                               0       141100         0
10:02:44  luafv                                   1       135000         0
10:02:44  npsvctrig                               1        46000         0
10:02:44  Wof                                     1        40700         0

Looks like no difference?

@Nezz
Copy link
Author

Nezz commented Sep 1, 2023

Not sure if it's relevant, but we call dotnet build with the -nodereuse:false parameter because on Windows we had issues with lingering dotnet processes that can leave files locked. We don't need that parameter on macOS or Linux.

@GangWang01
Copy link
Contributor

I couldn't repro the issue. But I did some tries to understand what happened. As KalleOlaviNiemitalo mentioned

The log looks like each file got ERROR_SHARING_VIOLATION initially but ERROR_ACCESS_DENIED on the retry. src/Tasks/Copy.cs does not normally retry after ERROR_ACCESS_DENIED but this can be changed via the MSBUILDALWAYSRETRY environment variable.

it was ERROR_ACCESS_DENIED on the retry broke the retry referring to

msbuild/src/Tasks/Copy.cs

Lines 827 to 843 in 3c910ba

if (code == NativeMethods.ERROR_ACCESS_DENIED)
{
// ERROR_ACCESS_DENIED can either mean there's an ACL preventing us, or the file has the readonly bit set.
// In either case, that's likely not a race, and retrying won't help.
// Retrying is mainly for ERROR_SHARING_VIOLATION, where someone else is using the file right now.
// However, there is a limited set of circumstances where a copy failure will show up as access denied due
// to a failure to reset the readonly bit properly, in which case retrying will succeed. This seems to be
// a pretty edge scenario, but since some of our internal builds appear to be hitting it, provide a secret
// environment variable to allow overriding the default behavior and forcing retries in this circumstance as well.
if (!s_alwaysRetryCopy)
{
throw;
}
else
{
LogDiagnostic("Retrying on ERROR_ACCESS_DENIED because MSBUILDALWAYSRETRY = 1");
}
.

With this simple App.zip, I tried

  • Lock the target file
    Copy with retries worked well. See binlog RetryCopying.zip

  • Set the target file readonly or change ACL to deny the access
    There was no copy with retries. It directly failed with access denied. See binlog AccessDenied.zip

I guess in this issue overwriting the same target file by multiple copy tasks got the file's ACL being initialized and next copy task happened to overwrite the file, then access denied occurred and broke the retry. Not sure about this.

@Nezz
Copy link
Author

Nezz commented Sep 7, 2023

I'm not sure what causes it, but export MSBUILDALWAYSRETRY=1 resolved the issue completely.

@AR-May AR-May added the Priority:2 Work that is important, but not critical for the release label Sep 12, 2023
@GangWang01
Copy link
Contributor

GangWang01 commented Sep 13, 2023

@Nezz Glad MSBUILDALWAYSRETRY=1resolved the issue.

We still need your help to understand what causes the file locked or access denied and what needs to improve from msbuild if any.
Can you use Process Monitor to filter out which process gets the target file of copy task locked or its ACL changed while reproducing this issue? If possible, it’s better to provide the process monitor log as well as build binary log (see how to Providing MSBuild Binary Logs for investigation). Thank you!

Tips about Process Monitor:
procmon

@YuliiaKovalova YuliiaKovalova added the needs-more-info Issues that need more info to continue investigation. label Nov 7, 2023
@ghost ghost removed the needs-triage Have yet to determine what bucket this goes in. label Nov 7, 2023
@YuliiaKovalova
Copy link
Member

Hi @Nezz,
Could you provide the requested information?
Thank you!

@Nezz
Copy link
Author

Nezz commented Nov 7, 2023

Sadly these issues happen on AWS EC2 instances that are terminate as soon as the job finishes running and there is no remote desktop access. Is there a way to obtain this information via the command line?

@ghost ghost added needs-triage Have yet to determine what bucket this goes in. author-responded Author responded, needs response from dev team. and removed needs-more-info Issues that need more info to continue investigation. labels Nov 7, 2023
@rokonec
Copy link
Contributor

rokonec commented Feb 1, 2024

I am not sure if there is a follow up with this issue. We have workaround, and without repro or process monitor log we don't know how to detect this transient file access issue.
We can consider MSBUILDALWAYSRETRY as default behavior but it will make build with wrong user entered path took longer before fail.
I recommend to lower priority of this to P3.
@rainersigwald @YuliiaKovalova do you agree?

@YuliiaKovalova
Copy link
Member

Close it as a low priority bug.
It can be reconsidered in the future.

@YuliiaKovalova YuliiaKovalova closed this as not planned Won't fix, can't repro, duplicate, stale Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author-responded Author responded, needs response from dev team. bug needs-triage Have yet to determine what bucket this goes in. Priority:2 Work that is important, but not critical for the release
Projects
None yet
Development

No branches or pull requests

7 participants