Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Match-on-metadata #6035

Merged
merged 17 commits into from Feb 24, 2021
Merged

Optimize Match-on-metadata #6035

merged 17 commits into from Feb 24, 2021

Conversation

@Forgind
Copy link
Member

@Forgind Forgind commented Jan 14, 2021

Fixes AB#1261123 (feedback ticket)

Context

Checking metadata for matches was taking (#fragments in item) * (#fragments in item to remove) * (#metadata). For many users, this was extremely slow.

Changes Made

Put fragments into a custom MetadataSet object before checking for a match. This takes it from O(n * r * m) to O(m * (n + r)). Since m is normally small, this is a substantial improvement.

Testing

Time on a single operation went from ~180 seconds to ~150 ms for the penultimate version. It also passes @cdmihai's extensive unit tests.*

Notes

Is there any reason to keep the error for referencing multiple items? As a user, I would expect not-match-on-metadata to ignore metadata and match-on-metadata to ignore the item.
Is it ok to rely on IEnumerable returning element in a consistent order?
Should we keep returning no-match if all the metadata are empty on both items? It makes more sense to me to have that be a match, and it would simplify the code slightly. If it makes more sense to others, now would be the time to do it, since the code isn't that old.

* Except one on paths. Not sure why that's failing, since it's passing in VS, just not from the command line. Will look into it more.

@Forgind Forgind requested review from ladipro and cdmihai Jan 14, 2021
@BenVillalobos
Copy link
Member

@BenVillalobos BenVillalobos commented Jan 14, 2021

I believe if you do AB/<feedback ticket number> then a tool that kevin pilch made will auto-link to the ticket

@@ -249,7 +249,18 @@
}
else
{
itemsToRemove = FindItemsUsingMatchOnMetadata(group, child, bucket, matchOnMetadata, matchingOptions);
ImmutableList<string> metadataList = matchOnMetadata.ToImmutableList();

This comment has been minimized.

@cdmihai

cdmihai Jan 15, 2021
Contributor

I'd keep this as a separate method to have ExecuteRemove read nicer by only contain top level intent and delegating more nitty gritty implementation details to helper methods. Like the other if branch.

StringComparer comparer = options == MatchOnMetadataOptions.CaseSensitive ? StringComparer.Ordinal : StringComparer.OrdinalIgnoreCase;
children = new Dictionary<string, MetadataSet<P, I>>(comparer);
normalize = options == MatchOnMetadataOptions.PathLike ? p => FileUtilities.NormalizePathForComparisonNoThrow(p, Environment.CurrentDirectory) : p => p;
foreach (ItemSpec<P, I>.ItemExpressionFragment frag in itemSpec.Fragments)

This comment has been minimized.

@cdmihai

cdmihai Jan 15, 2021
Contributor

How does this type check, isn't ItemSpec.Fragments a list of the base class ItemSpecFragment? (List<ItemSpecFragment>)

This comment has been minimized.

@Forgind

Forgind Jan 16, 2021
Author Member

Good question. It might be casting it automatically when it sees that, but I'm not sure. I don't think we should care about other kinds of fragments, since I don't think they can have metadata, so I think this is ok, but that was a lot of "I think"s.

This comment has been minimized.

@Forgind

Forgind Jan 25, 2021
Author Member

As far as whether this is actually a safe, I should have mentioned that the type is checked before the MetadataSet is constructed, so it should always be correct.

Copy link
Contributor

@ladipro ladipro left a comment

Neat! I have left comments - mostly nits, but the logic in Contains appears to be incorrect, please add a test case exercising all code paths there.

@@ -14,6 +14,7 @@ class RemoveOperation : LazyItemOperation
{
readonly ImmutableList<string> _matchOnMetadata;
readonly MatchOnMetadataOptions _matchOnMetadataOptions;
private MetadataSet<P, I> metadataSet;

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

nit: Naming convention, field name is missing the leading underscore.

{
if (metadataSet == null)
{
metadataSet = new MetadataSet<P, I>(_matchOnMetadataOptions, _matchOnMetadata, _itemSpec);

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

nit: This lazy initialization of metadataSet is not thread safe. Is it guaranteed to always run only on one thread?

This comment has been minimized.

@Forgind

Forgind Jan 16, 2021
Author Member

I can put it in the constructor—shouldn't affect perf, assuming operations are always executed.

@@ -504,4 +459,78 @@ public GlobFragment(string textFragment, string projectDirectory)
&& TextFragment[2] == '*'
&& FileUtilities.IsAnySlash(TextFragment[3]);
}

internal class MetadataSet<P, I> where P : class, IProperty where I : class, IItem, IMetadataTable

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

super-nit: The class can be sealed.

private Dictionary<string, MetadataSet<P, I>> children;
Func<string, string> normalize;
Comment on lines 465 to 466

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

nit: Names missing leading underscores, both fields can be readonly.

children = new Dictionary<string, MetadataSet<P, I>>(comparer);
}

// Relies on IEnumerable returning the metadata in a reasonable order. Reasonable?

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

The question is whether the matching IEnumerable<> is passed to the constructor and to Contains. It appears to be true although I wonder if there's a way to enforce it in the code.

if (current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child))
{
current = child;
}
else
{
current.children.Add(normalizedString, new MetadataSet<P, I>(comparer));
current = current.children[normalizedString];
}
Comment on lines 494 to 502

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

nit: Redundant dictionary lookup.

Suggested change
if (current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child))
{
current = child;
}
else
{
current.children.Add(normalizedString, new MetadataSet<P, I>(comparer));
current = current.children[normalizedString];
}
if (!current.children.TryGetValue(normalizedString, out MetadataSet<P, I> child))
{
child = new MetadataSet<P, I>(comparer);
current.children.Add(normalizedString, child);
}
current = child;
internal bool Contains(IEnumerable<string> metadata)
{
bool nonEmptyFound = false;
MetadataSet<P, I> curr = this;

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

super-nit: current to match the name used in Add?

@@ -249,7 +249,18 @@ private void ExecuteRemove(ProjectItemGroupTaskItemInstance child, ItemBucket bu
}
else
{
itemsToRemove = FindItemsUsingMatchOnMetadata(group, child, bucket, matchOnMetadata, matchingOptions);
ImmutableList<string> metadataList = matchOnMetadata.ToImmutableList();

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

Why the conversion to ImmutableList<>? You should be able to make the MetadataSet<,> constructor take IEnumerable<>.

{
if (!String.IsNullOrEmpty(m))
{
nonEmptyFound = true;

This comment has been minimized.

@ladipro

ladipro Jan 15, 2021
Contributor

Why isn't this return false? The previous code was checking if all metadata is defined and their value matches. I suggest adding test coverage for the case where m is empty.

This comment has been minimized.

@Forgind

Forgind Jan 16, 2021
Author Member

From #4997:
if a metadata name doesn't exist in both items, then it's considered a match. Special case: if none of the metadata exist in both items, then it should not match.

But you're right—the code I'm replacing checks for string.Empty and automatically returns false. Given our conversation in the PR review meeting, I'll change to that. Thanks!

@@ -1988,63 +1988,6 @@ public void KeepWithItemReferenceOnNonmatchingMetadata()
items.ElementAt(3).GetMetadataValue("d").ShouldBe("d");
}

[Fact]
public void FailWithMatchingMultipleMetadata()

This comment has been minimized.

@cdmihai

cdmihai Jan 21, 2021
Contributor

Don't delete the tests, instead replace the exception assertion with an assertion on I2's contents

This comment has been minimized.

@BenVillalobos

BenVillalobos Jan 22, 2021
Member

The new test implies you can match with multiple data now though. If we want to keep that, the FailWithMatchingMultipleMetadata test is actually invalid.

This comment has been minimized.

@Forgind

Forgind Jan 22, 2021
Author Member

Changed it to "RemoveWithMatchingMultipleMetdata." I did enable matching multiple.

.HelpKeyword.ShouldBe("MSBuild.OM_MatchOnMetadataIsRestrictedToOnlyOneReferencedItem");
Project project = ObjectModelHelpers.CreateInMemoryProject(content);
IEnumerable<ProjectItem> items = project.ItemsIgnoringCondition.Where(i => i.ItemType.Equals("I2"));
items.Count().ShouldBe(3);

This comment has been minimized.

@cdmihai

cdmihai Jan 21, 2021
Contributor

For both of the tests, assert that the right items are left in I2.

@cdmihai
Copy link
Contributor

@cdmihai cdmihai commented Jan 21, 2021

/azp run

@azure-pipelines
Copy link

@azure-pipelines azure-pipelines bot commented Jan 21, 2021

Azure Pipelines successfully started running 1 pipeline(s).

internal sealed class MetadataSet<P, I> where P : class, IProperty where I : class, IItem, IMetadataTable
{
private readonly Dictionary<string, MetadataSet<P, I>> _children;

This comment has been minimized.

@cdmihai

cdmihai Jan 21, 2021
Contributor

Why use this convoluted data structure instead of dumping all metadata values from the Itemspec into a flat set of normalized metadata values?

This comment has been minimized.

@BenVillalobos

BenVillalobos Jan 22, 2021
Member

Still not 100% sure I agree with the data structure, but I want to clarify my understanding from the PR review today.

A flat set of normalized metadata values would prevent us from being able to determine exactly which metadata values from which items were matched.

We'd need to do this to enable the logic in a test like RemoveWithMultipleItemReferenceOnMatchingMetadata.

<I3 Remove='@(I1);@(I2)' MatchOnMetadata='M1' />

This remove attribute effectively translates to:
Remove items in I3 where:

  • I1 OR I2 match on the following metadata:
    • M1

I believe this also enables something like:

<I3 Remove='@(I1);@(I2)' MatchOnMetadata='M1;M2' />

Where this remove attribute translates to:
Remove items in I3 where:

  • I1 OR I2 match on the following metadata:
    • M1 AND M2

@Forgind is this understanding correct? If so I think there should be a test for multiple items and multiple metadata.

This comment has been minimized.

@Forgind

Forgind Jan 22, 2021
Author Member

Wrote detailed comment for MetadataTrie class. Will push soon.

ProjectFileErrorUtilities.VerifyThrowInvalidProjectFile(
_matchOnMetadata.IsEmpty || _itemSpec.Fragments.All(f => f is ItemSpec<ProjectProperty, ProjectItem>.ItemExpressionFragment),
new BuildEventFileInfo(string.Empty),
"OM_MatchOnMetadataIsRestrictedToOnlyOneReferencedItem");

This comment has been minimized.

@ladipro

ladipro Jan 21, 2021
Contributor

Please don't forget to change the error message to reflect the relaxed validation.

Copy link
Member

@BenVillalobos BenVillalobos left a comment

Should we keep returning no-match if all the metadata are empty on both items?

I can theoretically see a use case for "Match on everything that has no metadata." Maybe we can use a special keyword/symbol here to represent that? Might be best to file that as an issue and see if that gets traction.

@@ -1988,63 +1988,6 @@ public void KeepWithItemReferenceOnNonmatchingMetadata()
items.ElementAt(3).GetMetadataValue("d").ShouldBe("d");
}

[Fact]
public void FailWithMatchingMultipleMetadata()

This comment has been minimized.

@BenVillalobos

BenVillalobos Jan 22, 2021
Member

The new test implies you can match with multiple data now though. If we want to keep that, the FailWithMatchingMultipleMetadata test is actually invalid.

ExecuteTask(task, lookup);
ICollection<ProjectItemInstance> items = lookup.GetItems("I2");
items.Count().ShouldBe(3);
items.ElementAt(0).EvaluatedInclude.ShouldBe("a2");

This comment has been minimized.

@BenVillalobos

BenVillalobos Jan 22, 2021
Member

So matching the names of metadata is case insensitive, but their values are not. The values being case sensitive makes sense. I remember us having a discussion on the topic in a linux issue.

Just thinking out loud.

This comment has been minimized.

@Forgind

Forgind Jan 22, 2021
Author Member

MatchOnMetadata allows three options: case sensitive, case insensitive, and pathlike. The third is most similar to what you were thinking, I think, with defaulting to case insensitive regardless of OS.

Copy link
Contributor

@ladipro ladipro left a comment

Apologies for getting back to this after so many days.

Playing devil's advocate, imagine that instead of the Trie, we use a simple HashSet<string> where the keys are a concatenation of the respective metadata value with a delimiter & proper escaping to make sure that matching doesn't have false positives. What would be the advantages and disadvantages of this approach?

Using your example from the code comment, the HashSet would contain something like:

"eating|talking|sleeping"
"sleeping|eating|talking"
"talking|sleeping|eating"

so it wouldn't match the Forgind item with the "signature" of "sleeping|talking|eating".

/// Item Child has metadata GoodAt="sleeping" BadAt="eating" OkAt="talking"
/// Item Adolescent has metadata GoodAt="talking" BadAt="sleeping" OkAt="eating"
/// Specifying these three metadata:
/// Item Forgind with metadata GoodAt="sleeping" BadAt="talking" OkAt="eating"

This comment has been minimized.

@ladipro

ladipro Jan 29, 2021
Contributor

😄

@Forgind
Copy link
Member Author

@Forgind Forgind commented Jan 29, 2021

You clearly get the advantage (disadvantage? 😋) of a simpler data structure, but the logic of how to construct the key is more complicated, which partially balances it out. Time to hash a particular string is officially constant but in practice scales according to the length of the string (nlg(n) and assumes the string is short, if I remember correctly), so I don't think the total hashing time would be less. We would win on lookups in the dictionary, but that isn't really relevant. We would also have to factor in time-to-construct the string, which would also always create a new string unless there happens to be exactly one metadatum, and we put in an optimization to not add a character to the end of that. (The argument against an optimization is that we would have to find out that we don't need the extra character, which would either require counting the length of an enumerable (fast, but still theoretically O(n)) or creating something like a StringBuilder, adding it to that StringBuilder, and having a check that the SB has only had one thing added to it before stringifying it. Admittedly, that part should be better with StringTools, so this argument is pretty weak.)

The biggest disadvantage in my mind is with reusing strings, since even if we concatenated them all in one step, we'd still make a new string for each item, and if there are several metadata or the metadata are large (noting that these are metadata values, not metadata keys), it could end up consuming a lot of memory. This is a case in which StringTools would struggle because each of the metadata values, once concatenated to others, could be unique, in which case we'd make a large number of unique large strings. Even matching on two metadata, if the values are large, say, "description" and "path" (with long paths enabled), these could be hundreds or even thousands of characters. With a metadata trie, we don't make any new strings. With a single hash set, we would.

@ladipro
Copy link
Contributor

@ladipro ladipro commented Feb 1, 2021

Thank you for the detailed analysis!

With a metadata trie, we don't make any new strings.

Actually we do if MatchOnMetadataOptions.PathLike is used, though I agree that usually the footprint of the Trie would be smaller than that of a flat set. As long as the Trie is not significantly less efficient in the worst case (all values distinct and path-normalized) I think that a slightly more complex datastructure is justified. Thanks again for taking care of this one!

@ladipro
ladipro approved these changes Feb 1, 2021
@ladipro ladipro merged commit 30bbc5a into dotnet:master Feb 24, 2021
8 checks passed
8 checks passed
license/cla All CLA requirements met.
Details
@azure-pipelines
msbuild-pr Build #20210122.8 succeeded
Details
@azure-pipelines
msbuild-pr (Linux Core) Linux Core succeeded
Details
@azure-pipelines
msbuild-pr (Windows Core) Windows Core succeeded
Details
@azure-pipelines
msbuild-pr (Windows Full Release (no bootstrap)) Windows Full Release (no bootstrap) succeeded
Details
@azure-pipelines
msbuild-pr (Windows Full) Windows Full succeeded
Details
@azure-pipelines
msbuild-pr (macOS Core) macOS Core succeeded
Details
@azure-pipelines
msbuild-pr (macOS Mono) macOS Mono succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants