New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalize RAR output paths #6533
Conversation
@@ -993,7 +993,7 @@ public ITaskItem[] SuggestedRedirects | |||
public ITaskItem[] FilesWritten | |||
{ | |||
set { /*Do Nothing, Inputs not Allowed*/ } | |||
get { return _filesWritten.ToArray(); } | |||
get { return ReferenceTable.NormalizeAsArray(_filesWritten); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of the normalization should run more than once per RAR execution except FilesWritten and UnresolvedAssemblyConflicts, which run once per access. My hunch is that that isn't an issue, but if it is, I'm happy to cache it so we only do that once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's cache it. Everything about RAR is a huge perf bottleneck.
So I made tests pass and looked at what the outputs look like. It looks correct as far as paths having been converted to full, normalized paths, but it does look a little odd. I'm wondering if I should have a somewhat simpler "remove extra slashes and Also, this looks like a breaking change to me. I routed all the normalization through a single function, so it's really easy to add a change wave. I think I should. Opposition? |
{ | ||
ti.ItemSpec = FileUtilities.NormalizePath(ti.ItemSpec); | ||
return ti; | ||
}).ToArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd try to avoid unnecessary allocations here as it's a hot path. I'd pre-allocate the final array, and the for
over each item and fill it in (to even avoid the allocation for the foreach iterator).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, an IEqualityComparer that ignores trailing directory separators canbe used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would miss that C:\\\foo is the same as C:\foo
@@ -1021,7 +1021,7 @@ public String DependsOnNETStandard | |||
/// been outputted in MSB3277. Otherwise empty. | |||
/// </summary> | |||
[Output] | |||
public ITaskItem[] UnresolvedAssemblyConflicts => _unresolvedConflicts.ToArray(); | |||
public ITaskItem[] UnresolvedAssemblyConflicts => ReferenceTable.NormalizeAsArray(_unresolvedConflicts); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only accessed once? because if it is read twice or more, we'll do the work every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be accessed many times, depending on the user's use case. I can cache it.
@@ -2687,6 +2687,17 @@ internal void GetReferenceItems | |||
copyLocalFiles = copyLocalItems.ToArray(); | |||
} | |||
|
|||
internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items) | |
internal static ITaskItem[] NormalizePathsAsArray(List<ITaskItem> items) |
OR even simpler...
internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items) | |
internal static ITaskItem[] NormalizePaths(List<ITaskItem> items) |
How about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if you saw, but I took the middle suggestion. I do want to mention "Array," since otherwise it would be very non-obvious (without looking at its return type) that it's returning an array instead of just normalizing it.
@@ -64,7 +64,7 @@ public void ConflictBetweenNonCopyLocalDependencies() | |||
Assert.True(ContainsItem(t.ResolvedDependencyFiles, s_myLibraries_V2_GDllPath)); | |||
|
|||
Assert.Single(t.SuggestedRedirects); | |||
Assert.True(ContainsItem(t.SuggestedRedirects, @"D, Culture=neutral, PublicKeyToken=aaaaaaaaaaaaaaaa")); // "Expected to find suggested redirect, but didn't" | |||
Assert.True(ContainsItem(t.SuggestedRedirects, @$"{"D"}, Culture=neutral, PublicKeyToken=aaaaaaaaaaaaaaaa")); // "Expected to find suggested redirect, but didn't" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can undo this one. An earlier version required a legitimate test change but no more.
// Sort for stable outputs. (These came from a dictionary, which has undefined enumeration order.) | ||
Array.Sort(primaryFiles, TaskItemSpecFilenameComparer.GenericComparer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be concerned that we're changing the sort order here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the comment, I don't think so. The comment and the sort have been there since the initial (GitHub) code commit, but it sounds like it's important that if you run RAR twice, the order is the same, but not so important what that order actually is. This maintains that there is a correct ordering, albeit reordering it slightly.
get { return _filesWritten.ToArray(); } | ||
get | ||
{ | ||
if (_filesWrittenArray?.Length != _filesWritten.Count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition surprises me. Is a length change the only way it can be invalidated?
What is the reason to do this caching rather than just do it? This field will be read once on task completion in the normal case, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had it without the caching but was told I should change it. See #6533 (comment)
Having a length check was probably a bad plan. It would be quite expensive to do a proper check, though, so I think I should just remove the caching here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's remove it, unless there are in-code accesses to these getters. The engine itself should access them only once/task invocation.
primaryFiles = NormalizePathsAsArray(primaryItems); | ||
dependencyFiles = NormalizePathsAsArray(dependencyItems); | ||
relatedFiles = NormalizePathsAsArray(relatedItems); | ||
satelliteFiles = NormalizePathsAsArray(satelliteItems); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at how hard it would be to move the normalization to the input layer (so that it gets put in the cache).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking at this bit, I don't think this is an easy change because in addition to finding canonical forms for all input paths, we would have to find canonical forms for dependencies, and that would get complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you should normalize task inputs, but the inputs to the cache--the outputs of the task, but at the layer where they're created, not at output time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that clarifies things, thank you. Now I'm back to "seems like a good move, but I have no idea how hard that is."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think the newest changes should be equivalent, but I'm not at all confident.
It seems like the only way anything is added to References is via that one AddReference method. I'm also looking to see if I can move it any earlier. At worst, this is cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important part of the change is:
8ed6c4b
I got a little carried away with various cleanup things in ReferenceTable. If it were just one or two changes, I'd push for them to stay in this PR. With this many, though, I'm fining moving them all to a separate PR to make this cleaner. Fair warning, though: that might lead to more random changes 🙂 Also, the last is maybe questionable. |
Yes, please reduce this PR to just the change needed. |
Some references can be resolved later. I'd initially missed it because References wasn't invoked directly, but rather a Reference was retrieved then later modified. I believe this is the only place that happens, however. The first check essentially checks whether it has already been resolved, and the second place where the ChangeWave was enabled ensures it is canonicalized.
if (reference.FullPath.Length > 0 && ChangeWaves.AreFeaturesEnabled(ChangeWaves.Wave17_0)) | ||
{ | ||
// Saves effort and makes deduplication possible downstream | ||
reference.FullPath = FileUtilities.NormalizePath(reference.FullPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the FullPath
prop has a side-effect of re-running the IsWinMDFile
check. Do you think it would be worth optimizing? Perhaps by introducing a "NormalizeFullPath" method on Reference
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me! I only did it in one case because the other case had previously set FullPath explicitly, so it was intentionally running the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I'm curious about these statements:
msbuild/src/Tasks/AssemblyDependency/Reference.cs
Lines 491 to 493 in d150e93
_fullPathWithoutExtension = null; | |
_fileNameWithoutExtension = null; | |
_directoryName = null; |
They were executed in the previous version. Now you set only _fullPath
and leave these fields unchanged. Intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think _fileNameWithoutExtension should care, and I added the other two. Looking through the rest of the method, if the _fullPath was not null or empty before, this shouldn't make it null or empty, so I think that should be unchanged. Whether something is a winmd file or not also shouldn't care if the path is normalized or not.
Fixes #6501
Context
RAR can output paths not in their canonical forms. This allows for there to be multiple identical paths, only distinguished by extra directory separator characters, for example, which can lead to duplicate work or failing to find paths in a cache.
Changes Made
This normalizes all paths output by RAR, ensuring any given path is in its canonical form.
Testing
To follow.
Notes