Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExceptBy doesn't include duplicates by comparer in first set that should not be excluded #7656

Closed
bsonnino opened this issue Jan 22, 2022 · 16 comments · Fixed by dotnet/docs#39182
Assignees
Labels
area-System.Linq 📚 documentation doc-enhancement Improve the current content Pri1 Indicates issues/PRs that are high priority 📌 seQUESTered Identifies that an issue has been imported into Quest.
Milestone

Comments

@bsonnino
Copy link

bsonnino commented Jan 22, 2022

Description

If the first set has duplicates according to the comparer function, the second one is removed from the resulting set, thus being an error.
The resulting set should include all duplicates that are not in the first set

Reproduction Steps

If you have something like this:

public record Person(string Name, int Age);

var people = new List<Person>
{
    new Person("John", 30 ),
    new Person("Peter", 40),
    new Person("Mary", 20 ),
    new Person("Jane", 30 ),
    new Person("Larry", 50),
    new Person("Anne", 50 ),
    new Person("Paul", 20),
};

var excludedAges = new List<int> {30,40};
var people1 = people.ExceptBy(excludedAges, p => p.Age);
Console.WriteLine($"{string.Join("\n", people1)}");

Expected behavior

Person { Name = Mary, Age = 20 }
Person { Name = Larry, Age = 50 }
Person { Name = Anne, Age = 50 }
Person { Name = Paul, Age = 20 }

Actual behavior

Person { Name = Mary, Age = 20 }
Person { Name = Larry, Age = 50 }

Regression?

No response

Known Workarounds

var people2 = people.Where(p => !excludedAges.Contains(p.Age));

Configuration

.NET 6.0.101
Windows 11 21H2
x64

Other information

The code in Except.cs (line 94) uses a HashSet, thus eliminating the duplicates due to the comparer, which are not real duplicates


Associated WorkItem - 200997

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Jan 22, 2022

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

If the first set has duplicates according to the comparer function, the second one is removed from the resulting set, thus being an error.
The resulting set should include all duplicates that are not in the first set

Reproduction Steps

If you have something like this:

public record Person(string Name, int Age);

var people = new List<Person>
{
    new Person("John", 30 ),
    new Person("Peter", 40),
    new Person("Mary", 20 ),
    new Person("Jane", 30 ),
    new Person("Larry", 50),
    new Person("Anne", 50 ),
    new Person("Paul", 20),
};

var excludedAges = new List<int> {30,40};
var people1 = people.ExceptBy(excludedAges, p => p.Age);
Console.WriteLine($"{string.Join("\n", people1)}");

Expected behavior

Person { Name = Mary, Age = 20 }
Person { Name = Larry, Age = 50 }
Person { Name = Anne, Age = 50 }
Person { Name = Paul, Age = 20 }

Actual behavior

Person { Name = Mary, Age = 20 }
Person { Name = Larry, Age = 50 }

Regression?

No response

Known Workarounds

var people2 = people.Where(p => !excludedAges.Contains(p.Age));

Configuration

.NET 6.0.101
Windows 11 21H2
x64

Other information

The code in Except.cs (line 94) uses a HashSet, thus eliminating the duplicates due to the comparer, which are not real duplicates

Author: bsonnino
Assignees: -
Labels:

area-System.Linq, untriaged

Milestone: -

@Clockwork-Muse
Copy link

Except - and by extension, ExceptBy - is one of the "Set Operations" (along with Union and Intersect), and duplicates are always removed (this behavior is standard among pretty much all languages/libraries, because it's such a fundamental concept).

If your use case requires retaining duplicates, you should use Where instead.


@stephentoub
That aside, the documentation for ExceptBy completely lacks any documentation of this behavior, lacking a Remarks section that the other two have, or even the aside about "distinct elements" in the parameter/return section.

@stephentoub
Copy link
Member

@stephentoub That aside, the documentation for ExceptBy completely lacks any documentation of this behavior, lacking a Remarks section that the other two have, or even the aside about "distinct elements" in the parameter/return section.

@eiriktsarpalis

@eiriktsarpalis
Copy link
Member

Yes, a remarks section elaborating a bit more on what "set semantics" means in this case might be warranted (and in particular, how the projector delegate modulates the definition of the source set) might be warranted. Moving to dotnet-api-docs.

@PRMerger4 PRMerger4 added the Pri3 Indicates issues/PRs that are low priority label Jan 27, 2022
@eiriktsarpalis eiriktsarpalis transferred this issue from dotnet/runtime Jan 27, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Jan 27, 2022

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

If the first set has duplicates according to the comparer function, the second one is removed from the resulting set, thus being an error.
The resulting set should include all duplicates that are not in the first set

Reproduction Steps

If you have something like this:

public record Person(string Name, int Age);

var people = new List<Person>
{
    new Person("John", 30 ),
    new Person("Peter", 40),
    new Person("Mary", 20 ),
    new Person("Jane", 30 ),
    new Person("Larry", 50),
    new Person("Anne", 50 ),
    new Person("Paul", 20),
};

var excludedAges = new List<int> {30,40};
var people1 = people.ExceptBy(excludedAges, p => p.Age);
Console.WriteLine($"{string.Join("\n", people1)}");

Expected behavior

Person { Name = Mary, Age = 20 }
Person { Name = Larry, Age = 50 }
Person { Name = Anne, Age = 50 }
Person { Name = Paul, Age = 20 }

Actual behavior

Person { Name = Mary, Age = 20 }
Person { Name = Larry, Age = 50 }

Regression?

No response

Known Workarounds

var people2 = people.Where(p => !excludedAges.Contains(p.Age));

Configuration

.NET 6.0.101
Windows 11 21H2
x64

Other information

The code in Except.cs (line 94) uses a HashSet, thus eliminating the duplicates due to the comparer, which are not real duplicates

Author: bsonnino
Assignees: -
Labels:

Pri3, area-System.Linq, :watch: Not Triaged

Milestone: -

@jeffhandley jeffhandley added untriaged New issue has not been triaged by the area owner and removed ⌚ Not Triaged labels Apr 29, 2022
@eiriktsarpalis eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label May 5, 2022
@eiriktsarpalis eiriktsarpalis added this to the Backlog milestone May 5, 2022
@rcdailey
Copy link

I wasted a ton of time trying to hunt down why duplicate elements were missing after ExceptBy(). I don't understand why dropping duplicates is a feature and not a bug. Maybe some wikipedia page can explain the deep theory of this, but I wouldn't call it a "fundamental concept". Seems like a violation of single responsibility pattern to me. If I wanted distinct output wouldn't I write that into my LINQ query? I find this hidden behavior very confusing and error prone. But I don't know if my vote will count for much on this.

rcdailey added a commit to recyclarr/recyclarr that referenced this issue May 21, 2022
When using filters like `exclude`, it was possible for terms to not get
synced when they should have. This was due to a misunderstanding of how
`ExceptBy()` and `IntersectBy()` work. According to [an issue][1] on the
dotnet runtime repo, this is by design. The fix is to just avoid those
in favor of `Where()`.

Fixes #69.

[1]: dotnet/dotnet-api-docs#7656
@Clockwork-Muse
Copy link

@rcdailey - See the "Set Operations" link in my earlier comment.

That aside, probably the largest mainstream use of set mathematics in programming (and possibly the originator?) is SQL; the primary reason for Except (and a large percentage of the rest of the remaining LINQ methods) is for use in LINQ-to-SQL. And at that point different behavior based on the datasource (SQL database vs some other type of database) would be a source of major bugs.

@rcdailey
Copy link

rcdailey commented May 21, 2022

I saw the link, although admittedly I am uninterested in it. Mathematics stuff isn't my deal.

My misunderstanding was that ExceptBy() was a fancy shortcut for Where(x => !second.Contains(x)). But that appears to not be the case. The documentation is misleading as well; at least at its face:

The returned sequence contains only the elements from the first input sequence that are not in the second input sequence.

No where is it mentioned that duplicates are filtered out. Even the graphical example lacks this. Every example I see out there has no duplicate keys. So I think it's understandable that:

  1. Programmers exist that don't understand "set mathematics" and relevant theories
  2. There's no reasonable indication of the secret Distinct()-like behavior is hidden inside.

I understand that what you are describing is "technically correct", but that's not always the best kind of correct. I prefer straightforward documentation that explains things in a way that makes sense from a programming standpoint: That is, pre-conditions and post-conditions clearly explained. What we have instead is something more akin to what I'd find in a college math book, which I personally find less intuitive.

Maybe I'm alone on this. But that's my feedback on the matter.

@Clockwork-Muse
Copy link

The documentation is misleading as well; at least at its face:

Yes, that's why this issue is still open - as stated earlier, the documentation should be updated.

@AbstractionsAs
Copy link

This is clearly a bug.

Can anyone come up with any use case solved by the current implementation?

The implementation that makes sense, though, the one where the distinct operation is made on the final set, not on the key - that solves a very common use case: to filter a list based on a blacklisted property.

@Clockwork-Muse
Copy link

The implementation that makes sense, though, the one where the distinct operation is made on the final set

That presumes that elements have some equatable key, outside of the one used for the call to ExceptBy(), which may not be the case.

Besides the fact that changing this behavior would be an insanely massive breaking change.

If you want the behavior you're thinking of, the Where(e => !someSet.Contains(e)) will give you the behavior you want (and be more performant).

@marckruzik
Copy link

I just encountered this behavior, with all duplicates removed.
I agree that the documentation could provide more explanation.

Maybe @IEvangelist could help? He edited the documentation and provided code in 2021.

@IEvangelist IEvangelist self-assigned this Jan 17, 2024
@IEvangelist IEvangelist added Pri1 Indicates issues/PRs that are high priority doc-enhancement Improve the current content 📁 Repo - docs 📚 documentation and removed Pri3 Indicates issues/PRs that are low priority labels Jan 17, 2024
@IEvangelist
Copy link
Member

I just encountered this behavior, with all duplicates removed. I agree that the documentation could provide more explanation.

Maybe @IEvangelist could help? He edited the documentation and provided code in 2021.

Sure, see dotnet/docs#39182

@marckruzik
Copy link

Thanks @IEvangelist!

@github-actions github-actions bot added 📌 seQUESTered Identifies that an issue has been imported into Quest. and removed 🗺️ reQUEST Triggers an issue to be imported into Quest. labels Jan 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Linq 📚 documentation doc-enhancement Improve the current content Pri1 Indicates issues/PRs that are high priority 📌 seQUESTered Identifies that an issue has been imported into Quest.
Projects
No open projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.