Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding an AggregateBy method to LINQ #91533

Closed
eiriktsarpalis opened this issue Sep 4, 2023 · 18 comments · Fixed by #92089
Closed

Consider adding an AggregateBy method to LINQ #91533

eiriktsarpalis opened this issue Sep 4, 2023 · 18 comments · Fixed by #92089
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented Sep 4, 2023

Background & Motivation

It is fairly common for LINQ users to want to calculate an aggregate for enumerables, grouped by a specific key. This is currently achievable using the GroupBy methods, however this approach is often inefficient as it forces intermediate IGrouping allocations. This is highlighted as the motivation of the CountBy API proposal.

AggregateBy is an extension of the existing Aggregate methods that lets users fold a source enumerable, grouped by key. It is a useful method that generalizes methods such as GroupBy and CountBy, but also works well for other applications. Similar APIs available in frameworks like Spark are called foldByKey and aggregateByKey.

API Proposal

namespace System.Linq;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
        this IEnumerable<TSource> source,
        TAccumulate seed,
        Func<TAccumulate, TSource, TAccumulate> func,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;

    public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
        this IEnumerable<TSource> source,
        Func<TKey, TAccumulate> seed,
        Func<TKey, TAccumulate, TSource, TAccumulate> func,
        Func<TSource, TKey> keySelector, 
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

public static partial class Queryable
{
    public static IQueryable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
        this IQueryable<TSource> source,
        TAccumulate seed,
        Expression<Func<TAccumulate, TSource, TAccumulate>> func,
        Expression<Func<TSource, TKey>> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;

    public static IQueryable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TKey, TAccumulate>> seed,
        Expression<Func<TKey, TAccumulate, TSource, TAccumulate>> func,
        Expression<Func<TSource, TKey>> keySelector, 
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

API Usage

Implementing LongCountBy

static IEnumerable<KeyValuePair<TKey, long>> LongCountBy<TSource, TKey>(IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    => source.AggregateBy(seed: 0L, (count, _) => ++count, keySelector);

Implementing GroupBy

static IEnumerable<KeyValuePair<TKey, List<TSource>>> GroupBy<TSource, TKey>(IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    => source.AggregateBy(seed: _ => new List<TSource>(), (_, group, element) => { group.Add(element); return group; }, keySelector);

Calculating total scores by key

var data = new (string id, int score)[]
{
    ("0", 42),
    ("1", 5),
    ("2", 4),
    ("1", 10),
    ("0", 25),
};

data.AggregateBy(seed: 0, (totalScore, curr) => totalScore += curr.score, keySelector: entry => entry.id);
//(0, 67)
//(1, 15)
//(2, 4)

Open Questions

We might want to consider the order of type parameters and delegates in the type signature. As proposed it follows the convention that TKey and the keySelector delegate go to the end of the method signature but perhaps that's not as natural here.

Reference Implementation

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TSource, TKey> keySelector) where TKey : notnull
{
    using IEnumerator<TSource> enumerator = source.GetEnumerator();

    if (!enumerator.MoveNext())
    {
        yield break;
    }

    foreach (KeyValuePair<TKey, TAccumulate> entry in PopulateDictionary(enumerator))
    {
        yield return entry;
    }

    Dictionary<TKey, TAccumulate> PopulateDictionary(IEnumerator<TSource> enumerator)
    {
        var dict = new Dictionary<TKey, TAccumulate>();
        do
        {
            TSource value = enumerator.Current;
            TKey key = keySelector(value);
            ref TAccumulate? acc = ref CollectionsMarshal.GetValueRefOrAddDefault(dict, key, out bool exists);
            acc = func(exists ? acc! : seed, value);
        } while (enumerator.MoveNext());

        return dict;
    }
}

Originally posted by @eiriktsarpalis in #77716 (comment)

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 4, 2023
@ghost
Copy link

ghost commented Sep 4, 2023

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

A potential alternative is having an AggregateBy method:

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
    this IEnumerable<TSource> source, 
    TAccumulate seed, 
    Func<TAccumulate, TKey, TSource, TAccumulate> func, 
    Func<TSource, TKey> keySelector);

which lets you express CountBy like so

source.AggregateBy(0, (count, _, _) => ++count, keySelector);

Originally posted by @eiriktsarpalis in #77716 (comment)

Author: eiriktsarpalis
Assignees: -
Labels:

area-System.Linq, untriaged

Milestone: -

@eiriktsarpalis eiriktsarpalis added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Sep 4, 2023
@eiriktsarpalis eiriktsarpalis added this to the Future milestone Sep 4, 2023
@viceroypenguin
Copy link

I would suggest that the seed may be determined by the key, so I propose:

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TKey, TAccumulate>(
	this IEnumerable<TSource> source,
	Func<TSource, TKey> keySelector,
	Func<TKey, TAccumulate> seedSelector,
	Func<TAccumulate, TKey, TSource, TAccumulate> accumulator)

@chrisoverzero
Copy link

I would suggest that the seed may be determined by the key […]

What does this mean? Which TKey value would be passed to this function? Can you show an example of using this proposed signature?

@viceroypenguin
Copy link

@chrisoverzero Dummy example for how the method would be called:

var sum = Enumerable.Range(1, 100)
  .AggregateBy(
    keySelector: x => x % 10,
    seedSelector: k => k + 100,
    accumulator: (sum, key, src) => sum + src);

It's probably rare that the seed would need to be selected per key, but it would be simple to design for it and would help significantly in those cases.

@eiriktsarpalis
Copy link
Member Author

What's a real-world use case where the seed value is predicated on the key? In most applications I can think of the value of "zero" is fixed (typically either 0 or an empty list). FWIW I'm starting to think that accepting the key parameter in the accumulator delegate might already be overkill, and this probably signature is sufficient:

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
    this IEnumerable<TSource> source, 
    TAccumulate seed, 
    Func<TAccumulate, TSource, TAccumulate> func, 
    Func<TSource, TKey> keySelector);

Assuming the keySelector delegate is cheap, with a bit of work it should be straightforward to express your method in terms of the above:

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TKey, TAccumulate>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, TAccumulate> seedSelector,
    Func<TAccumulate, TKey, TSource, TAccumulate> accumulator)
{
    return source.
        AggregateBy<TSource, KeyValuePair<TKey, TAccumulate>?, TKey>(
            seed: null,
            func: (state, source) =>
            {
                if (state is not { Key: TKey key, Value: TAccumulate acc })
                {
                    key = keySelector(source);
                    acc = seedSelector(key);
                }

                acc = accumulator(acc, key, source);
                return new(key, acc);
            },
            keySelector: keySelector)
        .Select(kvp => kvp.Value!.Value);
}

@viceroypenguin
Copy link

@eiriktsarpalis Fair enough.

Two more questions:

  • Why put func parameter before keySelector parameter?
  • Should this be an IReadOnlyDictionary<,> or similar, to match prompt behavior of Aggregate, similar to the discussion in [API Proposal]: Linq CountBy method #77716?

@eiriktsarpalis
Copy link
Member Author

Why put func parameter before keySelector parameter?

Because we tend to put the key selector delegate at the very end of a method signature in *By variants, but didn't give it much thought beyond that.

Should this be an IReadOnlyDictionary<,> or similar, to match prompt behavior of Aggregate, similar to the discussion

Possibly, what ends up being used in CountBy will most likely influence the design here.

@eiriktsarpalis
Copy link
Member Author

I've just updated the OP with a full-blown proposal and examples. Your feedback would be appreciated.

@viceroypenguin
Copy link

@eiriktsarpalis Thanks for updating the proposal. I like the overload option of using a static seed vs a dynamic one.

I would argue that the keySelector should be the first parameter in both overloads. I know that keySelector is the last parameter in some *By variants, but I think this is the wrong way to look at it. Instead, I would say that it is, in every operator, the first non-sequence parameter.

  • GroupBy: first parameter
  • CountBy: first parameter
  • MaxBy/MinBy: first parameter
  • OrderBy/ThenBy: first parameter
  • UnionBy: sequence, then keySelector
  • IntersectBy: sequence, then keySelector
  • DistinctBy: sequence, then keySelector
  • ExceptBy: sequence, then keySelector

In the most relevant similarity (GroupBy), it is the first parameter, which leads naturally into following parameters which answer: "what are we doing with the key?", "what are we doing with the group?", etc.

In the set-based *By variants, the parameters follow an order: bind two sequences together, then define how to identify the key that binds the sequences together. This is also a natural read, but is less relevant, because there is no need for a follow-up parameter after keySelector.

So, from a reading standpoint, the most natural order of parameters to me would go: 1. select a key, 2. define a seed from the key (or use a static key), 3. accumulate based on the group selected by the key.

I welcome any comments to the contrary, as this is just my opinion on the matter.

@viceroypenguin
Copy link

viceroypenguin commented Sep 12, 2023

Also, if we are allowing more than one overload, does it make sense to offer a resultSelector overload as well, to match public static TResult Aggregate<TSource,TAccumulate,TResult> (this System.Collections.Generic.IEnumerable<TSource> source, TAccumulate seed, Func<TAccumulate,TSource,TAccumulate> func, Func<TAccumulate,TResult> resultSelector);?

@bartonjs
Copy link
Member

  • We changed <TSource, TAccumulate, TKey> to <TSource, TKey, TAccumulate>
  • We moved keySelector to be the second parameter.
  • We removed a TKey from the func on the second overload. If it's actually useful we can add it to both shapes in the future.
namespace System.Linq;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TKey, TAccumulate>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        TAccumulate seed,
        Func<TAccumulate, TSource, TAccumulate> func,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;

    public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TKey, TAccumulate>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector, 
        Func<TKey, TAccumulate> seed,
        Func<TAccumulate, TSource, TAccumulate> func,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

public static partial class Queryable
{
    public static IQueryable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TKey, TAccumulate>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        TAccumulate seed,
        Expression<Func<TAccumulate, TSource, TAccumulate>> func,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;

    public static IQueryable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TKey, TAccumulate>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector, 
        Expression<Func<TKey, TAccumulate>> seed,
        Expression<Func<TAccumulate, TSource, TAccumulate>> func,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Sep 12, 2023
@eiriktsarpalis eiriktsarpalis added the help wanted [up-for-grabs] Good issue for external contributors label Sep 12, 2023
@eiriktsarpalis eiriktsarpalis modified the milestones: Future, 9.0.0 Sep 12, 2023
@eiriktsarpalis
Copy link
Member Author

Also, if we are allowing more than one overload, does it make sense to offer a resultSelector overload

The existing overloads have full control over the aggregate and how it is folder, why add another transform at the end? Even if it were necessary in certain scenaria, it seems to me that chaining it with a Select call is probably preferable.

@manandre
Copy link
Contributor

I am a bit late here as it is already approved, but why don't we have an overload without the seed as in the existing Aggregate Linq operator?

@viceroypenguin
Copy link

I am a bit late here as it is already approved, but why don't we have an overload without the seed as in the existing Aggregate Linq operator?

Not providing the seed would require that consumers specify all types for the generic method, in order to specify the TAccumulate type. This would reduce usability in many cases (anonymous types, long tuples, etc.).

@manandre
Copy link
Contributor

manandre commented Sep 12, 2023

I am a bit late here as it is already approved, but why don't we have an overload without the seed as in the existing Aggregate Linq operator?

Not providing the seed would require that consumers specify all types for the generic method, in order to specify the TAccumulate type. This would reduce usability in many cases (anonymous types, long tuples, etc.).

I was thinking about this kind of overload, without TAccumulate:

public static IEnumerable<KeyValuePair<TKey, TSource>> AggregateBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TSource, TSource> func,
    IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;

@eiriktsarpalis
Copy link
Member Author

I was thinking about this kind of overload

This is a reducer method, so should most probably be called ReduceBy rather than AggregateBy.

@manandre
Copy link
Contributor

The CountBy implementation has just been merged. It is then time to go to the next level!
@eiriktsarpalis I can handle the implementation for AggregateBy if you agree.

@eiriktsarpalis
Copy link
Member Author

@manandre go for it 👍

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 14, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 18, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Oct 18, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants