Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: The Except and ExceptBy methods #110121

Closed
Mr0N opened this issue Nov 24, 2024 · 10 comments
Closed

[API Proposal]: The Except and ExceptBy methods #110121

Mr0N opened this issue Nov 24, 2024 · 10 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Linq

Comments

@Mr0N
Copy link

Mr0N commented Nov 24, 2024

Background and motivation

I propose adding an overload to the Except and ExceptBy methods that allows for removing elements from the provided array without removing duplicates.

API Proposal

IEnumerable<TSource> ExceptIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second)
{
    var set = new HashSet<TSource>(second);

    foreach (TSource element in first)
    {
     
        if (!set.Contains(element))
        {
            yield return element;
        }
    }
}

API Usage

using System.ComponentModel;
using System.Linq;


var range = new int[] { 1 };
var rangeEx = new int[] { 1, 2, 3, 4, 5 ,2,5};

var result = rangeEx.ExceptBy(range, a => a).ToArray();
Write(result);
var write = rangeEx.Except(range);
Write(write.ToArray());

var iter = ExceptIterator(rangeEx, range);
Write(iter.ToArray());
Console.ReadKey();




void Write(params int[] values)
{
    Console.WriteLine($"Array:{string.Join(',',values)}");
}
IEnumerable<TSource> ExceptIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second)
{
    var set = new HashSet<TSource>(second);

    foreach (TSource element in first)
    {
     
        if (!set.Contains(element))
        {
            yield return element;
        }
    }
}

Array:2,3,4,5
Array:2,3,4,5
Array:2,3,4,5,2,5

Alternative Designs

No response

Risks

No response

@Mr0N Mr0N added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 24, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Nov 24, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

@Clockwork-Muse
Copy link
Contributor

nit: Except (along with Union and Intersect) are terms that come from set math, so any method wouldn't/shouldn't be named this way.

For the rare cases where something like this is needed, most people can probably use the following extension method:

IEnumerable<TSource> WhereNotIn<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second)
{
    var set = new HashSet<TSource>(second);

    return first.Where(e => !set.Contains(e));
}

@eiriktsarpalis
Copy link
Member

Duplicate of #107338, #75066, and #102112.

@eiriktsarpalis eiriktsarpalis closed this as not planned Won't fix, can't repro, duplicate, stale Nov 24, 2024
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Nov 24, 2024
@Clockwork-Muse
Copy link
Contributor

@eiriktsarpalis - This isn't complaining about the existing behavior, so not a strict duplicate?

@Mr0N
Copy link
Author

Mr0N commented Nov 25, 2024

I’m not suggesting changing the logic of the existing method, but rather adding an overload where you can choose whether to remove duplicates. For example, it could be implemented like this:

public static IEnumerable<TSource> Except<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second, bool removeDuplicate = true)

`

@eiriktsarpalis
Copy link
Member

Adding boolean flags that fundamentally change the semantics of a function isn't considered good practice, and is definitely not something we've done before in LINQ. For better or for worse Intersect and Except reflect the semantics of SQL, and it would be more appropriate to consider separate method groups for that functionality. @Clockwork-Muse's use of WhereIn and WhereNotIn seems more appropriate in that regard, but I think these would need to be filed as a distinct proposal.

@Mr0N
Copy link
Author

Mr0N commented Nov 25, 2024

Adding boolean flags that fundamentally change the semantics of a function isn't considered good practice, and is definitely not something we've done before in LINQ. For better or for worse Intersect and Except reflect the semantics of SQL, and it would be more appropriate to consider separate method groups for that functionality. @Clockwork-Muse's use of WhereIn and WhereNotIn seems more appropriate in that regard, but I think these would need to be filed as a distinct proposal.

Well, yes, the SQL method is more understandable, but for C#, it's not immediately clear how to use this method. In SQL, you can use a LEFT JOIN to quickly find the difference between two tables, but in C#, it's quite difficult to achieve, especially in terms of performance.

It would be great if there were some method that could work between two collections and leverage all possible optimizations to speed up the process, similar to what SQL offers.

@Mr0N
Copy link
Author

Mr0N commented Nov 25, 2024

Without losing extra data

@Clockwork-Muse
Copy link
Contributor

but in C#, it's quite difficult to achieve, especially in terms of performance

It would be great if there were some method that could work between two collections and leverage all possible optimizations to speed up the process, similar to what SQL offers.

This is mostly because RDBMSs put a lot more work into their dynamic optimizers than would be reasonable for C#. If you're dealing with a large enough dataset to actually affect program performance, stick it into an actual database (especially because chances are you're going to want to do more things with it).


In SQL, you can use a LEFT JOIN to quickly find the difference between two tables

... I really wish (the iSeries version of) DB2's EXCEPTION JOIN would be added to the standard....

@Mr0N
Copy link
Author

Mr0N commented Nov 26, 2024

It would be good to implement such an algorithm in C#, and then use it to separate one collection from another. It seems like a complex algorithm at first glance.

https://en.wikipedia.org/wiki/Sort-merge_join

chat gpt:

Merge Join (Merging Join)

The Merge Join is an algorithm used for efficiently executing a JOIN operation, especially when both tables are already sorted by the column(s) involved in the join. Its key advantage lies in its linear traversal of rows, making it highly performant for large, sorted datasets.


How Does Merge Join Work?

Input Requirements:

  • Both tables must be sorted on the join column(s). If not, the database management system (DBMS) performs a sort operation before initiating the Merge Join.

Data Comparison:

  1. The algorithm starts with the first row of each table.
  2. It compares the join keys:
    • If the key in one table is smaller, the pointer for that table advances to the next row.
    • If the keys match, a result row is produced, and both pointers advance.
    • This continues until all rows are processed.

Example

Table A:

ID | Name -- | -- 1 | John 2 | Alice 3 | Bob

This behavior is crucial for accurate join results but may increase the size of the output significantly.


Advantages of Merge Join

  1. Efficient for Sorted Data: Works linearly, O(n+m)O(n+m), when data is pre-sorted.
  2. Scalable for Large Datasets: Ideal for large tables with indexes on join columns.
  3. Low Memory Overhead: Requires minimal additional memory compared to Hash Join.

Limitations

  1. Sorting Overhead: If the input data isn’t sorted, the pre-sorting step can be expensive, O(nlog⁡n)O(nlogn).
  2. Equality-Only Conditions: Only supports joins based on equality (=). For conditions like >, <, or !=, other join algorithms are needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Linq
Projects
None yet
Development

No branches or pull requests

3 participants