Normalize RAR output paths #6533

Forgind · 2021-06-07T21:16:32Z

Context

RAR can output paths that are not in canonical form. As a result, several spellings of the same path can coexist, distinguished only by, for example, extra directory separator characters. That can lead to duplicate work or to cache lookups that miss a path that is already cached.
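To make the problem concrete, here is a minimal sketch of how two spellings of one file end up as two cache keys. It is illustrative only: `PathCacheDemo` and its members are hypothetical names, and `Path.GetFullPath` stands in for MSBuild's `FileUtilities.NormalizePath`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PathCacheDemo
{
    // Builds two spellings of the same file: one clean, one with a redundant "sub/.." hop.
    public static (string Clean, string Noisy) MakeSpellings(string root)
    {
        string clean = Path.Combine(root, "lib", "A.dll");
        string noisy = Path.Combine(root, "lib", "sub", "..", "A.dll");
        return (clean, noisy);
    }

    // Counts distinct cache keys, optionally canonicalizing each path first.
    public static int CountDistinctKeys(IEnumerable<string> paths, bool normalize)
    {
        var keys = new HashSet<string>(StringComparer.Ordinal);
        foreach (string path in paths)
        {
            // Path.GetFullPath is an assumed stand-in for FileUtilities.NormalizePath.
            keys.Add(normalize ? Path.GetFullPath(path) : path);
        }
        return keys.Count;
    }
}
```

Without normalization the two spellings are counted as two keys; with it they collapse into one.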

Changes Made

This normalizes all paths output by RAR, ensuring any given path is in its canonical form.

Testing

To follow.

Notes

Forgind · 2021-06-07T21:19:39Z

src/Tasks/AssemblyDependency/ResolveAssemblyReference.cs

@@ -993,7 +993,7 @@ public ITaskItem[] SuggestedRedirects
        public ITaskItem[] FilesWritten
        {
            set { /*Do Nothing, Inputs not Allowed*/ }
-            get { return _filesWritten.ToArray(); }
+            get { return ReferenceTable.NormalizeAsArray(_filesWritten); }


None of the normalization should run more than once per RAR execution except FilesWritten and UnresolvedAssemblyConflicts, which run once per access. My hunch is that that isn't an issue, but if it is, I'm happy to cache it so we only do that once.

Let's cache it. Everything about RAR is a huge perf bottleneck.

Forgind · 2021-06-07T22:03:20Z

So I made tests pass and looked at what the outputs look like. It looks correct as far as paths having been converted to full, normalized paths, but it does look a little odd. I'm wondering if I should have a somewhat simpler "remove extra slashes and folderName\.. bits" call instead.

Also, this looks like a breaking change to me. I routed all the normalization through a single function, so it's really easy to add a change wave. I think I should. Opposition?

KirillOsenkov · 2021-06-07T23:50:49Z

src/Tasks/AssemblyDependency/ReferenceTable.cs

+            {
+                ti.ItemSpec = FileUtilities.NormalizePath(ti.ItemSpec);
+                return ti;
+            }).ToArray();


I'd try to avoid unnecessary allocations here as it's a hot path. I'd pre-allocate the final array, then use a for loop over each item to fill it in (to even avoid the allocation for the foreach iterator).
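A rough sketch of that shape, assuming a reduced item type: `IItem` and `SimpleItem` are hypothetical stand-ins for `ITaskItem`, and `Path.GetFullPath` again stands in for `FileUtilities.NormalizePath`. This is not the code that was merged.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for Microsoft.Build.Framework.ITaskItem, reduced to the one member used here.
public interface IItem
{
    string ItemSpec { get; set; }
}

public sealed class SimpleItem : IItem
{
    public SimpleItem(string itemSpec) => ItemSpec = itemSpec;
    public string ItemSpec { get; set; }
}

public static class NormalizeSketch
{
    // Normalizes each item's path in place and copies the items into an array sized up front.
    public static IItem[] NormalizePathsAsArray(List<IItem> items)
    {
        var result = new IItem[items.Count];   // single allocation for the output
        for (int i = 0; i < items.Count; i++)  // index loop instead of LINQ Select/ToArray
        {
            IItem item = items[i];
            if (item.ItemSpec.Length > 0)      // Path.GetFullPath rejects empty strings
            {
                item.ItemSpec = Path.GetFullPath(item.ItemSpec); // assumed stand-in for FileUtilities.NormalizePath
            }
            result[i] = item;
        }
        return result;
    }
}
```

The items themselves are mutated and reused, so the only new object per call is the result array.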

Alternatively, an IEqualityComparer that ignores trailing directory separators can be used.

That would miss that C:\\\foo is the same as C:\foo
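For illustration, a comparer along those lines might look like the sketch below (`TrailingSeparatorComparer` is a hypothetical name). It only trims separators at the end of the string, which is why repeated separators in the middle of a path still compare as different.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical comparer: treats paths as equal when they differ only by trailing separators.
public sealed class TrailingSeparatorComparer : IEqualityComparer<string>
{
    private static readonly char[] s_separators =
        { Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar };

    // Purely textual: strips separators from the end and nothing else.
    private static string Trim(string path) => path == null ? null : path.TrimEnd(s_separators);

    public bool Equals(string x, string y) =>
        string.Equals(Trim(x), Trim(y), StringComparison.OrdinalIgnoreCase);

    public int GetHashCode(string path) =>
        StringComparer.OrdinalIgnoreCase.GetHashCode(Trim(path));
}
```

With this comparer, `foo` followed by a separator equals `foo`, but a path with a doubled separator in the middle does not equal the single-separator form.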

KirillOsenkov · 2021-06-07T23:51:34Z

src/Tasks/AssemblyDependency/ResolveAssemblyReference.cs

@@ -1021,7 +1021,7 @@ public String DependsOnNETStandard
        /// been outputted in MSB3277. Otherwise empty.
        /// </summary>
        [Output]
-        public ITaskItem[] UnresolvedAssemblyConflicts => _unresolvedConflicts.ToArray();
+        public ITaskItem[] UnresolvedAssemblyConflicts => ReferenceTable.NormalizeAsArray(_unresolvedConflicts);


Is this only accessed once? If it is read twice or more, we'll do the work every time.

It could be accessed many times, depending on the user's use case. I can cache it.

Nirmal4G · 2021-06-15T02:26:21Z

src/Tasks/AssemblyDependency/ReferenceTable.cs

@@ -2687,6 +2687,17 @@ internal void GetReferenceItems
            copyLocalFiles = copyLocalItems.ToArray();
        }

+        internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items)


Suggested change

-        internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items)
+        internal static ITaskItem[] NormalizePathsAsArray(List<ITaskItem> items)

OR even simpler...

Suggested change

-        internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items)
+        internal static ITaskItem[] NormalizePaths(List<ITaskItem> items)

How about this?

I don't know if you saw, but I took the middle suggestion. I do want to mention "Array," since otherwise it would be very non-obvious (without looking at its return type) that it's returning an array instead of just normalizing it.

rainersigwald · 2021-06-15T21:48:43Z

src/Tasks.UnitTests/AssemblyDependency/SuggestedRedirects.cs

@@ -64,7 +64,7 @@ public void ConflictBetweenNonCopyLocalDependencies()
            Assert.True(ContainsItem(t.ResolvedDependencyFiles, s_myLibraries_V2_GDllPath));

            Assert.Single(t.SuggestedRedirects);
-            Assert.True(ContainsItem(t.SuggestedRedirects, @"D, Culture=neutral, PublicKeyToken=aaaaaaaaaaaaaaaa")); // "Expected to find suggested redirect, but didn't"
+            Assert.True(ContainsItem(t.SuggestedRedirects, @$"{"D"}, Culture=neutral, PublicKeyToken=aaaaaaaaaaaaaaaa")); // "Expected to find suggested redirect, but didn't"


I can undo this one. An earlier version required a legitimate test change but no more.

rainersigwald · 2021-06-15T21:50:06Z

src/Tasks/AssemblyDependency/ReferenceTable.cs

             // Sort for stable outputs. (These came from a dictionary, which has undefined enumeration order.)
            Array.Sort(primaryFiles, TaskItemSpecFilenameComparer.GenericComparer);


Should we be concerned that we're changing the sort order here?

From the comment, I don't think so. The comment and the sort have been there since the initial (GitHub) code commit. It sounds like what matters is that running RAR twice produces the same order, not what that order actually is. This keeps the ordering deterministic, although the order itself may change slightly.
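As a sketch of that argument: sorting the values copied out of a dictionary yields the same sequence no matter how the dictionary enumerates them. `FileNameThenSpecComparer` is a hypothetical stand-in; the exact rules of `TaskItemSpecFilenameComparer` are not shown in this thread, so the ordering used here (file name, then full spec) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for TaskItemSpecFilenameComparer: orders by file name, then by full spec.
public sealed class FileNameThenSpecComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        int byName = string.Compare(
            Path.GetFileName(x), Path.GetFileName(y), StringComparison.OrdinalIgnoreCase);
        return byName != 0 ? byName : string.Compare(x, y, StringComparison.OrdinalIgnoreCase);
    }
}

public static class StableOutputSketch
{
    // Copies the values out of the dictionary and sorts them, so the result
    // does not depend on the dictionary's enumeration order.
    public static string[] SortedPaths(Dictionary<string, string> pathsByAssemblyName)
    {
        var paths = new string[pathsByAssemblyName.Count];
        pathsByAssemblyName.Values.CopyTo(paths, 0);
        Array.Sort(paths, new FileNameThenSpecComparer());
        return paths;
    }
}
```

Normalizing the specs before sorting can move an item relative to its neighbors, but two runs over the same inputs still agree with each other.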

rainersigwald · 2021-06-15T21:51:40Z

src/Tasks/AssemblyDependency/ResolveAssemblyReference.cs

-            get { return _filesWritten.ToArray(); }
+            get
+            {
+                if (_filesWrittenArray?.Length != _filesWritten.Count)


This condition surprises me. Is a length change the only way it can be invalidated?

What is the reason to do this caching rather than just do it? This field will be read once on task completion in the normal case, right?

I originally had it without the caching but was told I should change it. See #6533 (comment)

Having a length check was probably a bad plan. It would be quite expensive to do a proper check, though, so I think I should just remove the caching here.

Yeah, let's remove it, unless there are in-code accesses to these getters. The engine itself should access them only once/task invocation.
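A small sketch of why a length check alone is a weak invalidation signal. `LengthCheckedCache` is a hypothetical reduction of the getter above to plain strings: replacing an element leaves the count unchanged, so the stale array is served again.

```csharp
using System.Collections.Generic;

public sealed class LengthCheckedCache
{
    private readonly List<string> _source = new List<string>();
    private string[] _cached;

    public List<string> Source => _source;

    // Rebuilds the array only when the element count differs from the cached copy.
    public string[] Snapshot
    {
        get
        {
            if (_cached?.Length != _source.Count)
            {
                _cached = _source.ToArray();
            }
            return _cached;
        }
    }
}
```

Adding or removing an element refreshes the snapshot, but an in-place replacement such as `Source[0] = "b.dll"` does not.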

Forgind · 2021-06-21T15:34:47Z

src/Tasks/AssemblyDependency/ReferenceTable.cs

+            primaryFiles = NormalizePathsAsArray(primaryItems);
+            dependencyFiles = NormalizePathsAsArray(dependencyItems);
+            relatedFiles = NormalizePathsAsArray(relatedItems);
+            satelliteFiles = NormalizePathsAsArray(satelliteItems);


Look at how hard it would be to move the normalization to the input layer (so that it gets put in the cache).

After looking at this bit, I don't think this is an easy change because in addition to finding canonical forms for all input paths, we would have to find canonical forms for dependencies, and that would get complicated.

I don't think you should normalize task inputs. I meant the inputs to the cache: the outputs of the task, normalized at the layer where they're created rather than at output time.

Ah, that clarifies things, thank you. Now I'm back to "seems like a good move, but I have no idea how hard that is."

Ok, I think the newest changes should be equivalent, but I'm not at all confident.

It seems like the only way anything is added to References is via that one AddReference method. I'm also looking to see if I can move it any earlier. At worst, this is cleaner.

The important part of the change is 8ed6c4b.

Forgind · 2021-07-08T00:14:00Z

I got a little carried away with various cleanup things in ReferenceTable. If it were just one or two changes, I'd push for them to stay in this PR. With this many, though, I'm fine with moving them all to a separate PR to make this cleaner. Fair warning, though: that might lead to more random changes 🙂

Also, the last is maybe questionable.

rainersigwald · 2021-07-09T14:53:45Z

Yes, please reduce this PR to just the change needed.

src/Tasks/AssemblyDependency/ReferenceTable.cs

Some references can be resolved later. I'd initially missed it because References wasn't invoked directly, but rather a Reference was retrieved then later modified. I believe this is the only place that happens, however. The first check essentially checks whether it has already been resolved, and the second place where the ChangeWave was enabled ensures it is canonicalized.
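Roughly, the two places can be pictured as in the sketch below. Everything here is a hypothetical reduction: `SketchReference` and `SketchReferenceTable` are not the real types, `ResolveLater` is an invented name for the late-resolution path, the boolean flag stands in for the change wave check, and `Path.GetFullPath` stands in for `FileUtilities.NormalizePath`.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical, heavily reduced stand-ins for RAR's Reference and ReferenceTable.
public sealed class SketchReference
{
    public string FullPath { get; set; } = string.Empty; // empty means "not resolved yet"
}

public sealed class SketchReferenceTable
{
    private readonly Dictionary<string, SketchReference> _references =
        new Dictionary<string, SketchReference>();
    private readonly bool _normalize; // assumed stand-in for the change wave gate

    public SketchReferenceTable(bool normalize) => _normalize = normalize;

    public SketchReference GetReference(string assemblyName) => _references[assemblyName];

    // Entry point for new references: canonicalize here if the path is already known.
    public void AddReference(string assemblyName, SketchReference reference)
    {
        Canonicalize(reference);
        _references[assemblyName] = reference;
    }

    // Late resolution: the reference was added unresolved and receives its path afterwards.
    public void ResolveLater(string assemblyName, string resolvedPath)
    {
        SketchReference reference = _references[assemblyName];
        reference.FullPath = resolvedPath;
        Canonicalize(reference);
    }

    private void Canonicalize(SketchReference reference)
    {
        // Skip unresolved references; Path.GetFullPath would reject an empty string.
        if (_normalize && reference.FullPath.Length > 0)
        {
            reference.FullPath = Path.GetFullPath(reference.FullPath);
        }
    }
}
```

With both paths covered, every resolved reference stored in the table carries a canonical path, whether it was resolved before or after being added.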

ladipro · 2021-07-12T15:32:18Z

src/Tasks/AssemblyDependency/ReferenceTable.cs

+            if (reference.FullPath.Length > 0 && ChangeWaves.AreFeaturesEnabled(ChangeWaves.Wave17_0))
+            {
+                // Saves effort and makes deduplication possible downstream
+                reference.FullPath = FileUtilities.NormalizePath(reference.FullPath);


Setting the FullPath prop has a side-effect of re-running the IsWinMDFile check. Do you think it would be worth optimizing? Perhaps by introducing a "NormalizeFullPath" method on Reference.

Sounds reasonable to me! I only did it in one case because the other case had previously set FullPath explicitly, so it was intentionally running the check.

Thank you! I'm curious about these statements:

msbuild/src/Tasks/AssemblyDependency/Reference.cs, lines 491 to 493 in d150e93:

            _fullPathWithoutExtension = null;
            _fileNameWithoutExtension = null;
            _directoryName = null;

They were executed in the previous version. Now you set only _fullPath and leave these fields unchanged. Intentional?

I don't think _fileNameWithoutExtension should care, and I added the other two. Looking through the rest of the method, if the _fullPath was not null or empty before, this shouldn't make it null or empty, so I think that should be unchanged. Whether something is a winmd file or not also shouldn't care if the path is normalized or not.
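To illustrate the trade-off discussed in this thread, here is a hedged sketch of a dedicated normalization method. `CachedPathReference` is a hypothetical reduction of `Reference`, and the WinMD check is reduced to an extension test purely so that the number of times it runs can be observed; the real check is not shown here.

```csharp
using System;
using System.IO;

public sealed class CachedPathReference
{
    private string _fullPath = string.Empty;
    private string _fullPathWithoutExtension; // lazily computed from _fullPath
    private string _directoryName;            // lazily computed from _fullPath

    // Counts how often the stand-in WinMD check ran.
    public int WinMDChecks { get; private set; }
    public bool IsWinMDFile { get; private set; }

    public string FullPath
    {
        get => _fullPath;
        set
        {
            _fullPath = value;
            _fullPathWithoutExtension = null;
            _directoryName = null;
            // Assumed stand-in for the real WinMD check.
            IsWinMDFile = string.Equals(
                Path.GetExtension(value), ".winmd", StringComparison.OrdinalIgnoreCase);
            WinMDChecks++;
        }
    }

    public string DirectoryName => _directoryName ??= Path.GetDirectoryName(_fullPath);

    public string FullPathWithoutExtension =>
        _fullPathWithoutExtension ??= Path.ChangeExtension(_fullPath, null);

    // Canonicalizes the stored path without re-running the WinMD check.
    // The two path-derived caches are reset because their values embed the old spelling.
    public void NormalizeFullPath()
    {
        if (_fullPath.Length == 0)
        {
            return; // nothing resolved yet
        }
        _fullPath = Path.GetFullPath(_fullPath); // assumed stand-in for FileUtilities.NormalizePath
        _fullPathWithoutExtension = null;
        _directoryName = null;
    }
}
```

A cached file name without extension would not change under normalization, which matches the reasoning above for leaving that field alone.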

Forgind commented Jun 7, 2021

KirillOsenkov reviewed Jun 7, 2021

rainersigwald added this to the 17.0 milestone Jun 14, 2021

Nirmal4G reviewed Jun 15, 2021

rainersigwald changed the title from "Dedup rar paths" to "Normalize RAR output paths" Jun 15, 2021

rainersigwald requested changes Jun 15, 2021

Forgind force-pushed the dedup-rar-paths branch from 52389a2 to 75b16a9 June 16, 2021 20:20

rainersigwald approved these changes Jun 18, 2021

Forgind commented Jun 21, 2021

Commit d47bf26: Normalize a different way

Forgind force-pushed the dedup-rar-paths branch from fbff0ec to d47bf26 July 9, 2021 17:43

rainersigwald reviewed Jul 9, 2021

Commit 8daaeb4: Add comment + put under change wave

rainersigwald approved these changes Jul 9, 2021

ladipro reviewed Jul 12, 2021

Forgind added 2 commits July 12, 2021 11:07

Commit ed55465: Expose simple path normalization method

Commit eecf857: Add changed fields

ladipro approved these changes Jul 13, 2021

Forgind added the merge-when-branch-open label (PRs that are approved, except that there is a problem that means we are not merging stuff right now) Jul 19, 2021

ladipro merged commit 3e71818 into dotnet:main Jul 19, 2021

ladipro mentioned this pull request Feb 13, 2023: Quick post-mortem of recently implemented RAR optimizations #8432 (Closed)
