Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize RAR output paths #6533

Merged
merged 5 commits into from Jul 19, 2021
Merged

Normalize RAR output paths #6533

merged 5 commits into from Jul 19, 2021

Conversation

Forgind
Copy link
Member

@Forgind Forgind commented Jun 7, 2021

Fixes #6501

Context

RAR can output paths not in their canonical forms. This allows for there to be multiple identical paths, only distinguished by extra directory separator characters, for example, which can lead to duplicate work or failing to find paths in a cache.

Changes Made

This normalizes all paths output by RAR, ensuring any given path is in its canonical form.

Testing

To follow.

Notes

@@ -993,7 +993,7 @@ public ITaskItem[] SuggestedRedirects
public ITaskItem[] FilesWritten
{
set { /*Do Nothing, Inputs not Allowed*/ }
get { return _filesWritten.ToArray(); }
get { return ReferenceTable.NormalizeAsArray(_filesWritten); }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the normalization should run more than once per RAR execution except FilesWritten and UnresolvedAssemblyConflicts, which run once per access. My hunch is that that isn't an issue, but if it is, I'm happy to cache it so we only do that once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's cache it. Everything about RAR is a huge perf bottleneck.

@Forgind
Copy link
Member Author

Forgind commented Jun 7, 2021

So I made tests pass and looked at what the outputs look like. It looks correct as far as paths having been converted to full, normalized paths, but it does look a little odd. I'm wondering if I should have a somewhat simpler "remove extra slashes and folderName\.. bits" call instead.

Also, this looks like a breaking change to me. I routed all the normalization through a single function, so it's really easy to add a change wave. I think I should. Opposition?

{
ti.ItemSpec = FileUtilities.NormalizePath(ti.ItemSpec);
return ti;
}).ToArray();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd try to avoid unnecessary allocations here as it's a hot path. I'd pre-allocate the final array, and the for over each item and fill it in (to even avoid the allocation for the foreach iterator).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, an IEqualityComparer that ignores trailing directory separators canbe used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would miss that C:\\\foo is the same as C:\foo

@@ -1021,7 +1021,7 @@ public String DependsOnNETStandard
/// been outputted in MSB3277. Otherwise empty.
/// </summary>
[Output]
public ITaskItem[] UnresolvedAssemblyConflicts => _unresolvedConflicts.ToArray();
public ITaskItem[] UnresolvedAssemblyConflicts => ReferenceTable.NormalizeAsArray(_unresolvedConflicts);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only accessed once? because if it is read twice or more, we'll do the work every time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be accessed many times, depending on the user's use case. I can cache it.

@rainersigwald rainersigwald added this to the 17.0 milestone Jun 14, 2021
@@ -2687,6 +2687,17 @@ internal void GetReferenceItems
copyLocalFiles = copyLocalItems.ToArray();
}

internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items)
Copy link
Contributor

@Nirmal4G Nirmal4G Jun 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items)
internal static ITaskItem[] NormalizePathsAsArray(List<ITaskItem> items)

OR even simpler...

Suggested change
internal static ITaskItem[] NormalizeAsArray(List<ITaskItem> items)
internal static ITaskItem[] NormalizePaths(List<ITaskItem> items)

How about this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if you saw, but I took the middle suggestion. I do want to mention "Array," since otherwise it would be very non-obvious (without looking at its return type) that it's returning an array instead of just normalizing it.

@rainersigwald rainersigwald changed the title Dedup rar paths Normalize RAR output paths Jun 15, 2021
@@ -64,7 +64,7 @@ public void ConflictBetweenNonCopyLocalDependencies()
Assert.True(ContainsItem(t.ResolvedDependencyFiles, s_myLibraries_V2_GDllPath));

Assert.Single(t.SuggestedRedirects);
Assert.True(ContainsItem(t.SuggestedRedirects, @"D, Culture=neutral, PublicKeyToken=aaaaaaaaaaaaaaaa")); // "Expected to find suggested redirect, but didn't"
Assert.True(ContainsItem(t.SuggestedRedirects, @$"{"D"}, Culture=neutral, PublicKeyToken=aaaaaaaaaaaaaaaa")); // "Expected to find suggested redirect, but didn't"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can undo this one. An earlier version required a legitimate test change but no more.

Comment on lines 2677 to 2679
// Sort for stable outputs. (These came from a dictionary, which has undefined enumeration order.)
Array.Sort(primaryFiles, TaskItemSpecFilenameComparer.GenericComparer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be concerned that we're changing the sort order here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the comment, I don't think so. The comment and the sort have been there since the initial (GitHub) code commit, but it sounds like it's important that if you run RAR twice, the order is the same, but not so important what that order actually is. This maintains that there is a correct ordering, albeit reordering it slightly.

get { return _filesWritten.ToArray(); }
get
{
if (_filesWrittenArray?.Length != _filesWritten.Count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition surprises me. Is a length change the only way it can be invalidated?

What is the reason to do this caching rather than just do it? This field will be read once on task completion in the normal case, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had it without the caching but was told I should change it. See #6533 (comment)

Having a length check was probably a bad plan. It would be quite expensive to do a proper check, though, so I think I should just remove the caching here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's remove it, unless there are in-code accesses to these getters. The engine itself should access them only once/task invocation.

src/Tasks/AssemblyDependency/ResolveAssemblyReference.cs Outdated Show resolved Hide resolved
primaryFiles = NormalizePathsAsArray(primaryItems);
dependencyFiles = NormalizePathsAsArray(dependencyItems);
relatedFiles = NormalizePathsAsArray(relatedItems);
satelliteFiles = NormalizePathsAsArray(satelliteItems);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at how hard it would be to move the normalization to the input layer (so that it gets put in the cache).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at this bit, I don't think this is an easy change because in addition to finding canonical forms for all input paths, we would have to find canonical forms for dependencies, and that would get complicated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should normalize task inputs, but the inputs to the cache--the outputs of the task, but at the layer where they're created, not at output time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that clarifies things, thank you. Now I'm back to "seems like a good move, but I have no idea how hard that is."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think the newest changes should be equivalent, but I'm not at all confident.

It seems like the only way anything is added to References is via that one AddReference method. I'm also looking to see if I can move it any earlier. At worst, this is cleaner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important part of the change is:
8ed6c4b

@Forgind
Copy link
Member Author

Forgind commented Jul 8, 2021

I got a little carried away with various cleanup things in ReferenceTable. If it were just one or two changes, I'd push for them to stay in this PR. With this many, though, I'm fining moving them all to a separate PR to make this cleaner. Fair warning, though: that might lead to more random changes 🙂

Also, the last is maybe questionable.

@rainersigwald
Copy link
Member

Yes, please reduce this PR to just the change needed.

Some references can be resolved later. I'd initially missed it because References wasn't invoked directly, but rather a Reference was retrieved then later modified. I believe this is the only place that happens, however. The first check essentially checks whether it has already been resolved, and the second place where the ChangeWave was enabled ensures it is canonicalized.
if (reference.FullPath.Length > 0 && ChangeWaves.AreFeaturesEnabled(ChangeWaves.Wave17_0))
{
// Saves effort and makes deduplication possible downstream
reference.FullPath = FileUtilities.NormalizePath(reference.FullPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the FullPath prop has a side-effect of re-running the IsWinMDFile check. Do you think it would be worth optimizing? Perhaps by introducing a "NormalizeFullPath" method on Reference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me! I only did it in one case because the other case had previously set FullPath explicitly, so it was intentionally running the check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'm curious about these statements:

_fullPathWithoutExtension = null;
_fileNameWithoutExtension = null;
_directoryName = null;

They were executed in the previous version. Now you set only _fullPath and leave these fields unchanged. Intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think _fileNameWithoutExtension should care, and I added the other two. Looking through the rest of the method, if the _fullPath was not null or empty before, this shouldn't make it null or empty, so I think that should be unchanged. Whether something is a winmd file or not also shouldn't care if the path is normalized or not.

@Forgind Forgind added the merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now. label Jul 19, 2021
@ladipro ladipro merged commit 3e71818 into dotnet:main Jul 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ResolveAssemblyReferences: Canonicalize output paths
6 participants