Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Introduce LeftJoin LINQ operator #110292

Open
roji opened this issue Dec 1, 2024 · 26 comments
Open

[API Proposal]: Introduce LeftJoin LINQ operator #110292

roji opened this issue Dec 1, 2024 · 26 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq
Milestone

Comments

@roji
Copy link
Member

roji commented Dec 1, 2024

Background and motivation

Background

LINQ has a Join operator, which, like its SQL INNER JOIN counterpart, correlates elements of two sequences based on matching keys; the LINQ join implementation internally creates a Lookup for the inner sequence, and then loops over the outer sequence, doing a lookup for the matching inner elements. In SQL database parlance, this is known as the hash join strategy (SQL Server docs, PostgreSQL docs as well as this useful post).

In addition to the above, SQL also has LEFT JOIN, which returns outer elements even if there's no corresponding inner ones; LINQ, in contrast, lacks this operator. The LINQ conceptual documentation shows how to combine existing operators to achieve a left join:

var query = students
    .GroupJoin(departments, student => student.DepartmentID, department => department.ID, (student, departmentList) => new { student, subgroup = departmentList })
    .SelectMany(
        joinedSet => joinedSet.subgroup.DefaultIfEmpty(),
        (student, department) => new
        {
            student.student.FirstName,
            student.student.LastName,
            Department = department.Name
        });

There are two issues with the above suggestion:

  • It's complicated, requiring combining 3 different LINQ operators in a specific way, and is easy to accidentally get wrong. Many EF users have complained about the complexity of this construct for expressing a simple SQL LEFT JOIN.
  • It is inefficient - the combination of operators adds significant overhead compared to a single operator using a hash join (as Join does).

Proposal

This proposes introducing a 1st-class LeftJoin operator, which operates very similar to Join, except that it returns outer elements for which no inner element could be correlated. Aside from being much simpler to use than GroupJoin/SelectMany, it would also simply use Lookup internally - just like Join - and would therefore be much faster.

An initial implementation shows significant performance improvement compared to GroupJoin/SelectMany; LeftJoin is always faster than the equivalent GroupJoin/SelectMany construct, since GroupJoin itself constructs and uses a Lookup internally to implement an inner join internally - just like LeftJoin does - but also adds additional work on top.

BenchmarkDotNet v0.14.0, macOS Sequoia 15.1.1 (24B91) [Darwin 24.1.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
Method InnersPerOuter OuterCount Mean Error StdDev Ratio RatioSD Gen0 Gen1 Gen2 Allocated Alloc Ratio
LeftJoin 1 1 97.73 ns 0.402 ns 0.376 ns 0.64 0.00 0.0573 - - 480 B 0.71
GroupJoin_SelectMany 1 1 153.02 ns 0.370 ns 0.309 ns 1.00 0.00 0.0811 - - 680 B 1.00
LeftJoin 1 10 488.66 ns 5.764 ns 5.391 ns 0.57 0.01 0.2031 - - 1704 B 0.57
GroupJoin_SelectMany 1 10 856.40 ns 4.305 ns 4.027 ns 1.00 0.01 0.3567 - - 2984 B 1.00
LeftJoin 1 100 4,814.61 ns 19.884 ns 17.627 ns 0.59 0.00 1.7090 0.0534 - 14344 B 0.54
GroupJoin_SelectMany 1 100 8,173.26 ns 37.560 ns 35.134 ns 1.00 0.01 3.1586 0.1068 - 26424 B 1.00
LeftJoin 1 1000 47,619.93 ns 227.566 ns 201.731 ns 0.59 0.00 16.2964 3.4790 - 136728 B 0.53
GroupJoin_SelectMany 1 1000 80,828.42 ns 323.010 ns 302.144 ns 1.00 0.01 30.6396 6.5918 - 256808 B 1.00
LeftJoin 10 1 300.03 ns 3.089 ns 2.580 ns 0.70 0.01 0.1316 - - 1104 B 0.85
GroupJoin_SelectMany 10 1 430.06 ns 3.919 ns 3.273 ns 1.00 0.01 0.1554 - - 1304 B 1.00
LeftJoin 10 10 2,954.84 ns 28.864 ns 25.587 ns 0.87 0.01 0.9460 0.0076 - 7944 B 0.86
GroupJoin_SelectMany 10 10 3,397.03 ns 9.497 ns 8.419 ns 1.00 0.00 1.1024 0.0114 - 9224 B 1.00
LeftJoin 10 100 31,873.63 ns 120.088 ns 106.455 ns 0.81 0.02 9.1553 0.7324 - 76744 B 0.86
GroupJoin_SelectMany 10 100 39,299.04 ns 743.548 ns 856.271 ns 1.00 0.03 10.5591 0.8545 - 88824 B 1.00
LeftJoin 10 1000 402,837.52 ns 5,140.127 ns 4,808.078 ns 0.88 0.01 90.8203 30.2734 - 760728 B 0.86
GroupJoin_SelectMany 10 1000 457,658.94 ns 5,117.136 ns 4,786.572 ns 1.00 0.01 104.9805 34.6680 - 880808 B 1.00
LeftJoin 100 1 1,906.55 ns 9.026 ns 8.443 ns 0.83 0.01 0.6981 0.0019 - 5848 B 0.97
GroupJoin_SelectMany 100 1 2,310.49 ns 29.862 ns 27.933 ns 1.00 0.02 0.7210 0.0038 - 6048 B 1.00
LeftJoin 100 10 18,861.98 ns 366.707 ns 560.001 ns 0.85 0.03 6.5918 0.2747 - 55384 B 0.98
GroupJoin_SelectMany 100 10 22,285.02 ns 189.605 ns 177.356 ns 1.00 0.01 6.7444 0.2747 - 56664 B 1.00
LeftJoin 100 100 197,041.60 ns 2,033.092 ns 1,802.283 ns 0.83 0.01 65.6738 15.8691 - 551144 B 0.98
GroupJoin_SelectMany 100 100 236,428.77 ns 1,920.429 ns 1,702.410 ns 1.00 0.01 67.1387 16.6016 - 563224 B 1.00
LeftJoin 100 1000 2,628,960.79 ns 22,819.664 ns 21,345.528 ns 0.85 0.03 656.2500 316.4063 - 5504731 B 0.98
GroupJoin_SelectMany 100 1000 3,113,006.15 ns 62,259.419 ns 95,076.717 ns 1.00 0.04 671.8750 328.1250 - 5624811 B 1.00
LeftJoin 1000 1 17,487.96 ns 203.775 ns 180.641 ns 0.84 0.01 5.8594 0.1221 - 49056 B 1.00
GroupJoin_SelectMany 1000 1 20,818.61 ns 251.607 ns 210.103 ns 1.00 0.01 5.8594 0.1831 - 49256 B 1.00
LeftJoin 1000 10 175,500.86 ns 2,377.458 ns 1,856.162 ns 0.77 0.02 58.1055 11.9629 - 487464 B 1.00
GroupJoin_SelectMany 1000 10 228,064.76 ns 4,414.845 ns 4,907.089 ns 1.00 0.03 58.3496 12.6953 - 488744 B 1.00
LeftJoin 1000 100 2,092,691.16 ns 39,429.196 ns 77,829.392 ns 0.86 0.04 582.0313 250.0000 - 4871947 B 1.00
GroupJoin_SelectMany 1000 100 2,422,638.76 ns 47,560.763 ns 84,539.215 ns 1.00 0.05 582.0313 246.0938 - 4884027 B 1.00
LeftJoin 1000 1000 42,875,766.40 ns 738,934.130 ns 691,199.444 ns 0.89 0.02 5900.0000 1700.0000 100.0000 48712842 B 1.00
GroupJoin_SelectMany 1000 1000 47,966,803.96 ns 925,858.818 ns 1,066,220.392 ns 1.00 0.03 5888.8889 1666.6667 111.1111 48832891 B 1.00
Benchmark code
[MemoryDiagnoser]
public class LeftJoinBenchmarks
{
    [Params(1, 10, 100, 1000, Priority = 1)]
    // [Params(1, Priority = 1)]
    public int InnersPerOuter { get; set; }

    [Params(1, 10, 100, 1000, Priority = 2)]
    // [Params(1, Priority = 2)]
    public int OuterCount { get; set; }

    private Outer[] _outers = null!;
    private Inner[] _inners = null!;

    class Outer
    {
        public int Id { get; set; }
        public string? OuterPayload { get; set; }
    }

    class Inner
    {
        public int Id { get; set; }
        public int OuterId { get; set; }
        public string? InnerPayload { get; set; }
    }

    private const int RandomSeed = 42;

    [GlobalSetup]
    public void Setup()
    {
        _outers = new Outer[OuterCount];
        _inners = new Inner[OuterCount * InnersPerOuter];

        var remainingInners = new List<Inner>(OuterCount * InnersPerOuter);
        for (var outerId = 0; outerId < OuterCount; outerId++)
        {
            _outers[outerId] = new Outer { Id = outerId, OuterPayload = $"Outer{outerId}" };

            for (var j = 0; j < InnersPerOuter; j++)
            {
                var innerId = outerId * j + j;
                remainingInners.Add(new Inner { Id = innerId, OuterId = outerId, InnerPayload = $"Inner{innerId}" });
            }
        }

        var random = new Random(RandomSeed);

        for (var i = 0; i < _inners.Length; i++)
        {
            var j = random.Next(0, remainingInners.Count);
            _inners[i] = remainingInners[j];
            remainingInners.RemoveAt(j);
        }

        Debug.Assert(remainingInners.Count == 0);
    }

    [Benchmark]
    public int LeftJoin()
        => _outers.LeftJoin(_inners, o => o.Id, i => i.OuterId, (o, i) => new { o.OuterPayload, i?.InnerPayload }).Count();

    [Benchmark(Baseline = true)]
    public int GroupJoin_SelectMany()
        => _outers
            .GroupJoin(_inners, o => o.Id, i => i.OuterId, (o, inners) => new { Outer = o, Inners = inners })
            .SelectMany(
                joinedSet => joinedSet.Inners.DefaultIfEmpty(),
                (o, i) => new
                {
                    o.Outer.OuterPayload,
                    i?.InnerPayload
                })
            .Count();
}
LeftJoin operator prototype implementation
private static IEnumerable<TResult> LeftJoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner?, TResult> resultSelector, IEqualityComparer<TKey>? comparer)
{
    using (IEnumerator<TOuter> e = outer.GetEnumerator())
    {
        if (e.MoveNext())
        {
            Lookup<TKey, TInner> innerLookup =
                Lookup<TKey, TInner>.CreateForJoin(inner, innerKeySelector, comparer);
            do
            {
                TOuter item = e.Current;
                Grouping<TKey, TInner>? g = innerLookup.GetGrouping(outerKeySelector(item), create: false);
                if (g is null)
                {
                    yield return resultSelector(item, default);
                }
                else
                {
                    int count = g._count;
                    TInner[] elements = g._elements;
                    for (int i = 0; i != count; ++i)
                    {
                        yield return resultSelector(item, elements[i]);
                    }
                }
            } while (e.MoveNext());
        }
    }
}

Note: The current LINQ documentation for GroupJoin/SelectMany shows using AsQueryable, for no apparent reason. The addition of AsQueryable here adds very significant perf overhead - see dotnet/docs#43807 for benchmarks and a proposal to remove AsQueryable from that code sample.

Additional Notes

  • This proposal also helps LINQ provider such as EF - the ability to directly express a SQL LEFT JOIN comes up regularly.
  • For inner value types, the operator returns default when an outer has no inners; this makes it impossible to distinguish between an inner not being found, and the inner being found but being null. This is similar to e.g. FirstOrDefault; although it's quite contrived for LeftJoin, see below for some notes and alternative API designs.
  • Mostly for symmetry's sake, we could also introduce RightJoin(), which is the reverse of LeftJoin() (i.e. elements from the inner sequence are returned if no correlated outer is found). Right joins are seldom used in SQL, and it's always possible to flip the sequences around to express the join as a left join instead.
  • In theory, we could also introduce an operator to perform a FULL OUTER JOIN, which is a combination of LEFT and RIGHT JOIN (elements from both outer and inner are returned even if there's no correlated element on the other side). This is, however, a rare operation and not generally very useful, and is more complicated to implement efficiently. We can do this later if the need arises, but should keep naming in mind.
  • Naming-wise:
    • Following the full operation name, we could call the operator LeftOuterJoin instead of LeftJoin; though the OUTER keyword is optional in all major SQL databases, and routinely dropped. LeftJoin is a lighter, more accessible name.
    • It may seem a bit odd to introduce the very SQL-oriented LeftJoin, RightJoin. However, LINQ operators already follow SQL naming (e.g. Where/Select rather than Filter/Map).
    • We could have JoinLeft, JoinRight to have all join operators grouped together in Intellisense (though the naming is less intuitive).
  • Corresponding syntax could be added to LINQ expression syntax, on the C# side (just like join) - but this is optional.
  • We should consider an analyzer+code fix to transform the current recommended GroupJoin/SelectMany into the new LeftJoin - including with the unneeded AsQueryable shown in the docs - as many people have been copy-pasting that and have very slow code (see this).

/cc @jeffhandley @dotnet/area-system-linq @dotnet/efteam

API Proposal

namespace System.Linq;

public static class Enumerable
{
    // Alternative naming: LeftOuterJoin
    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector);

    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);

// ... plus optionally RightJoin version of the above
}

public static class Queryable
{
    public static System.Linq.IQueryable<TResult> LeftJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter,TInner?,TResult>> resultSelector);
	
    public static System.Linq.IQueryable<TResult> Join<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter,TInner?,TResult>> resultSelector,
        System.Collections.Generic.IEqualityComparer<TKey>? comparer);
}

// If we decide to also introduce RightJoin (or RightOuterJoin):

public static class Enumerable
{
    // Alternative naming: RightOuterJoin
    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector);

    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);

// ... plus optionally RightJoin version of the above
}

public static class Queryable
{
    public static System.Linq.IQueryable<TResult> RightJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter?,TInner,TResult>> resultSelector);
	
    public static System.Linq.IQueryable<TResult> RightJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter?,TInner,TResult>> resultSelector,
        System.Collections.Generic.IEqualityComparer<TKey>? comparer);
}

API Usage

var query = outers.LeftJoin(inners, o => o.Id, i => i.OuterId, (o, i) => new { o.OuterPayload, i?.InnerPayload });

Value types and alternative LeftJoin API shapes

As pointed out above, the fact that the result selector accepts a defaultable inner means that it's impossible to distinguish between no inner being found for an outer, and the situation where the inner itself happens to be the default (thanks for discussion on this, @eiriktsarpalis). This problem isn't specific to this proposal, other operators (e.g. FirstOrDefault) have the same problem.

An alternative API to address this would pass a boolean to the result selector, representing whether an inner was found or not:

public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,  
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
	Func<TOuter, bool, TInner, TResult> resultSelector);

Alternatively, with the upcoming introduction of discriminated unions to .NET, an Optional type could allow the same thing:

public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,  
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
	Func<TOuter, Optional<TInner>, TResult> resultSelector);

However, it seems that cases where the distinction between "not found" and "found but default" matters are especially rare for joining; the inner key selector would have to accept a null/default inner and "extract" a key out of that (matching the outer key); not impossible, but definitely feels contrived. It's also possible to work around the ambiguity (in some cases) by switching to a nullable value type.

In other words, we should IMHO avoid making the API more complex/heavy for everyone because of a 1% case.

Previous related issues

@roji roji added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Dec 1, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Dec 1, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 1, 2024
@teo-tsirpanis teo-tsirpanis added area-System.Linq and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Dec 1, 2024
@roji roji removed api-suggestion Early API idea and discussion, it is NOT ready for implementation untriaged New issue has not been triaged by the area owner labels Dec 1, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

@roji roji added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Dec 1, 2024
@jeffhandley
Copy link
Member

Your Additional Notes covered all the questions and naming thoughts I had, @roji. Thanks for including all of that.

@Clockwork-Muse
Copy link
Contributor

@stephentoub - Is this something we should also consider for PLINQ?

(For others - because of the internals of PLINQ, it's not possible for third parties to implement their own operators, as it is with normal LINQ)

@stephentoub
Copy link
Member

stephentoub commented Dec 2, 2024

@stephentoub - Is this something we should also consider for PLINQ?

From an API perspective, #98689 covers adding to PLINQ any APIs added to LINQ that PLINQ doesn't already have. I don't think this particular API is any more special than the others already omitted, so I'd just want to lump this one in with those.

From an implementation perspective, I suspect the PLINQ implementation would just be the code @roji wrote in his opening comments, with that operator implemented by delegating to GroupJoin/SelectMany. Doing a full-fledged open-coded implementation of PLINQ's would be a lot of difficult to validate code.

because of the internals of PLINQ, it's not possible for third parties to implement their own operators

When doing an open-coded implementation, sure. But anyone can layer an implementation on top of the existing operators, just as with LINQ, e.g.

public static ParallelQuery<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
    this ParallelQuery<TOuter> outer, ParallelQuery<TInner> inner,
    Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner?, TResult> resultSelector) =>
    outer.GroupJoin(inner, outerKeySelector, innerKeySelector, (outerItem, innerItem) => (outerItem, innerItem))
         .SelectMany(joinedSet => joinedSet.innerItem.DefaultIfEmpty(), (joinedSet, innerItem) => resultSelector(joinedSet.outerItem, innerItem));

@roji
Copy link
Member Author

roji commented Dec 2, 2024

From an implementation perspective, I suspect the PLINQ implementation would just be the code @roji wrote in his opening comments, with that operator implemented by delegating to GroupJoin/SelectMany. Doing a full-fledged open-coded implementation of PLINQ's would be a lot of difficult to validate code.

@stephentoub I know very little about PLINQ, but assuming an implementation of Join already exists, wouldn't an implementation of LeftJoin be very similar (just as the non-PLINQ implementation of LeftJoin is very similar to the Join implementation)?

@stephentoub
Copy link
Member

stephentoub commented Dec 2, 2024

wouldn't an implementation of LeftJoin be very similar (just as the non-PLINQ implementation of LeftJoin is very similar to the Join implementation)?

Likely. But to put it in context, this is the LINQ implementation of Join:

private static IEnumerable<TResult> JoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector, IEqualityComparer<TKey>? comparer)
{
using (IEnumerator<TOuter> e = outer.GetEnumerator())
{
if (e.MoveNext())
{
Lookup<TKey, TInner> lookup = Lookup<TKey, TInner>.CreateForJoin(inner, innerKeySelector, comparer);
if (lookup.Count != 0)
{
do
{
TOuter item = e.Current;
Grouping<TKey, TInner>? g = lookup.GetGrouping(outerKeySelector(item), create: false);
if (g is not null)
{
int count = g._count;
TInner[] elements = g._elements;
for (int i = 0; i != count; ++i)
{
yield return resultSelector(item, elements[i]);
}
}
}
while (e.MoveNext());
}
}
}

and this monster is the PLINQ implementation:
/// <summary>
/// A join operator takes a left query tree and a right query tree, and then yields the
/// matching pairs between the two. LINQ supports equi-key-based joins. Hence, a key-
/// selection function for the left and right data types will yield keys of the same
/// type for both. We then merely have to match elements from the left with elements from
/// the right that have the same exact key. Note that this is an inner join. In other
/// words, outer elements with no matching inner elements do not appear in the output.
///
/// Hash-joins work in two phases:
///
/// (1) Building - we build a hash-table from one of the data sources. In the case
/// of this specific operator, the table is built from the hash-codes of
/// keys selected via the key selector function. Because elements may share
/// the same key, the table must support one-key-to-many-values.
/// (2) Probing - for each element in the data source not used for building, we
/// use its key to look into the hash-table. If we find elements under this
/// key, we just enumerate all of them, yielding them as join matches.
///
/// Because hash-tables exhibit on average O(1) lookup, we turn what would have been
/// an O(n*m) algorithm -- in the case of nested loops joins -- into an O(n) algorithm.
/// We of course require some additional storage to do so, but in general this pays.
/// </summary>
/// <typeparam name="TLeftInput"></typeparam>
/// <typeparam name="TRightInput"></typeparam>
/// <typeparam name="TKey"></typeparam>
/// <typeparam name="TOutput"></typeparam>
internal sealed class JoinQueryOperator<TLeftInput, TRightInput, TKey, TOutput> : BinaryQueryOperator<TLeftInput, TRightInput, TOutput>
{
private readonly Func<TLeftInput, TKey> _leftKeySelector; // The key selection routine for the outer (left) data source.
private readonly Func<TRightInput, TKey> _rightKeySelector; // The key selection routine for the inner (right) data source.
private readonly Func<TLeftInput, TRightInput, TOutput> _resultSelector; // The result selection routine.
private readonly IEqualityComparer<TKey>? _keyComparer; // An optional key comparison object.
//---------------------------------------------------------------------------------------
// Constructs a new join operator.
//
internal JoinQueryOperator(ParallelQuery<TLeftInput> left, ParallelQuery<TRightInput> right,
Func<TLeftInput, TKey> leftKeySelector,
Func<TRightInput, TKey> rightKeySelector,
Func<TLeftInput, TRightInput, TOutput> resultSelector,
IEqualityComparer<TKey>? keyComparer)
: base(left, right)
{
Debug.Assert(left != null && right != null, "child data sources cannot be null");
Debug.Assert(leftKeySelector != null, "left key selector must not be null");
Debug.Assert(rightKeySelector != null, "right key selector must not be null");
Debug.Assert(resultSelector != null, "need a result selector function");
_leftKeySelector = leftKeySelector;
_rightKeySelector = rightKeySelector;
_resultSelector = resultSelector;
_keyComparer = keyComparer;
_outputOrdered = LeftChild.OutputOrdered;
SetOrdinalIndex(OrdinalIndexState.Shuffled);
}
public override void WrapPartitionedStream<TLeftKey, TRightKey>(
PartitionedStream<TLeftInput, TLeftKey> leftStream, PartitionedStream<TRightInput, TRightKey> rightStream,
IPartitionedStreamRecipient<TOutput> outputRecipient, bool preferStriping, QuerySettings settings)
{
Debug.Assert(rightStream.PartitionCount == leftStream.PartitionCount);
if (LeftChild.OutputOrdered)
{
if (ExchangeUtilities.IsWorseThan(LeftChild.OrdinalIndexState, OrdinalIndexState.Increasing))
{
PartitionedStream<TLeftInput, int> leftStreamInt =
QueryOperator<TLeftInput>.ExecuteAndCollectResults(leftStream, leftStream.PartitionCount, OutputOrdered, preferStriping, settings)
.GetPartitionedStream();
WrapPartitionedStreamHelper<int, TRightKey>(
ExchangeUtilities.HashRepartitionOrdered(leftStreamInt, _leftKeySelector, _keyComparer, null, settings.CancellationState.MergedCancellationToken),
rightStream, outputRecipient, settings.CancellationState.MergedCancellationToken);
}
else
{
WrapPartitionedStreamHelper<TLeftKey, TRightKey>(
ExchangeUtilities.HashRepartitionOrdered(leftStream, _leftKeySelector, _keyComparer, null, settings.CancellationState.MergedCancellationToken),
rightStream, outputRecipient, settings.CancellationState.MergedCancellationToken);
}
}
else
{
WrapPartitionedStreamHelper<int, TRightKey>(
ExchangeUtilities.HashRepartition(leftStream, _leftKeySelector, _keyComparer, null, settings.CancellationState.MergedCancellationToken),
rightStream, outputRecipient, settings.CancellationState.MergedCancellationToken);
}
}
//---------------------------------------------------------------------------------------
// This is a helper method. WrapPartitionedStream decides what type TLeftKey is going
// to be, and then call this method with that key as a generic parameter.
//
private void WrapPartitionedStreamHelper<TLeftKey, TRightKey>(
PartitionedStream<Pair<TLeftInput, TKey>, TLeftKey> leftHashStream, PartitionedStream<TRightInput, TRightKey> rightPartitionedStream,
IPartitionedStreamRecipient<TOutput> outputRecipient, CancellationToken cancellationToken)
{
if (RightChild.OutputOrdered && LeftChild.OutputOrdered)
{
PairOutputKeyBuilder<TLeftKey, TRightKey> outputKeyBuilder = new PairOutputKeyBuilder<TLeftKey, TRightKey>();
IComparer<Pair<TLeftKey, TRightKey>> outputKeyComparer =
new PairComparer<TLeftKey, TRightKey>(leftHashStream.KeyComparer, rightPartitionedStream.KeyComparer);
WrapPartitionedStreamHelper<TLeftKey, TRightKey, Pair<TLeftKey, TRightKey>>(leftHashStream,
ExchangeUtilities.HashRepartitionOrdered(rightPartitionedStream, _rightKeySelector, _keyComparer, null, cancellationToken),
outputKeyBuilder, outputKeyComparer, outputRecipient, cancellationToken);
}
else
{
LeftKeyOutputKeyBuilder<TLeftKey, int> outputKeyBuilder = new LeftKeyOutputKeyBuilder<TLeftKey, int>();
WrapPartitionedStreamHelper<TLeftKey, int, TLeftKey>(leftHashStream,
ExchangeUtilities.HashRepartition(rightPartitionedStream, _rightKeySelector, _keyComparer, null, cancellationToken),
outputKeyBuilder, leftHashStream.KeyComparer, outputRecipient, cancellationToken);
}
}
private void WrapPartitionedStreamHelper<TLeftKey, TRightKey, TOutputKey>(
PartitionedStream<Pair<TLeftInput, TKey>, TLeftKey> leftHashStream, PartitionedStream<Pair<TRightInput, TKey>, TRightKey> rightHashStream,
HashJoinOutputKeyBuilder<TLeftKey, TRightKey, TOutputKey> outputKeyBuilder, IComparer<TOutputKey> outputKeyComparer,
IPartitionedStreamRecipient<TOutput> outputRecipient, CancellationToken cancellationToken)
{
int partitionCount = leftHashStream.PartitionCount;
PartitionedStream<TOutput, TOutputKey> outputStream =
new PartitionedStream<TOutput, TOutputKey>(partitionCount, outputKeyComparer, OrdinalIndexState);
for (int i = 0; i < partitionCount; i++)
{
JoinHashLookupBuilder<TRightInput, TRightKey, TKey> rightLookupBuilder =
new JoinHashLookupBuilder<TRightInput, TRightKey, TKey>(rightHashStream[i], _keyComparer);
outputStream[i] = new HashJoinQueryOperatorEnumerator<TLeftInput, TLeftKey, TRightInput, TRightKey, TKey, TOutput, TOutputKey>(
leftHashStream[i], rightLookupBuilder, _resultSelector, outputKeyBuilder, cancellationToken);
}
outputRecipient.Receive(outputStream);
}
internal override QueryResults<TOutput> Open(QuerySettings settings, bool preferStriping)
{
QueryResults<TLeftInput> leftResults = LeftChild.Open(settings, false);
QueryResults<TRightInput> rightResults = RightChild.Open(settings, false);
return new BinaryQueryOperatorResults(leftResults, rightResults, this, settings, false);
}
//---------------------------------------------------------------------------------------
// Returns an enumerable that represents the query executing sequentially.
//
internal override IEnumerable<TOutput> AsSequentialQuery(CancellationToken token)
{
IEnumerable<TLeftInput> wrappedLeftChild = CancellableEnumerable.Wrap(LeftChild.AsSequentialQuery(token), token);
IEnumerable<TRightInput> wrappedRightChild = CancellableEnumerable.Wrap(RightChild.AsSequentialQuery(token), token);
return wrappedLeftChild.Join(
wrappedRightChild, _leftKeySelector, _rightKeySelector, _resultSelector, _keyComparer);
}
//---------------------------------------------------------------------------------------
// Whether this operator performs a premature merge that would not be performed in
// a similar sequential operation (i.e., in LINQ to Objects).
//
internal override bool LimitsParallelism
{
get { return false; }
}
}
/// <summary>
/// Class to build a HashJoinHashLookup of right elements for use in Join operations.
/// </summary>
/// <typeparam name="TElement"></typeparam>
/// <typeparam name="TOrderKey"></typeparam>
/// <typeparam name="THashKey"></typeparam>
internal sealed class JoinHashLookupBuilder<TElement, TOrderKey, THashKey> : HashLookupBuilder<TElement, TOrderKey, THashKey>
{
private readonly QueryOperatorEnumerator<Pair<TElement, THashKey>, TOrderKey> _dataSource; // data source. For building.
private readonly IEqualityComparer<THashKey>? _keyComparer; // An optional key comparison object.
internal JoinHashLookupBuilder(QueryOperatorEnumerator<Pair<TElement, THashKey>, TOrderKey> dataSource, IEqualityComparer<THashKey>? keyComparer)
{
Debug.Assert(dataSource != null);
_dataSource = dataSource;
_keyComparer = keyComparer;
}
public override HashJoinHashLookup<THashKey, TElement, TOrderKey> BuildHashLookup(CancellationToken cancellationToken)
{
HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> lookup =
new HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>>(_keyComparer);
JoinBaseHashBuilder baseHashBuilder = new JoinBaseHashBuilder(lookup);
BuildBaseHashLookup(_dataSource, baseHashBuilder, cancellationToken);
return new JoinHashLookup(lookup);
}
protected override void Dispose(bool disposing)
{
Debug.Assert(_dataSource != null);
_dataSource.Dispose();
}
/// <summary>
/// Adds TElement,TOrderKey values to a HashLookup of HashLookupValueLists.
/// </summary>
private readonly struct JoinBaseHashBuilder : IBaseHashBuilder<TElement, TOrderKey>
{
private readonly HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> _base;
public JoinBaseHashBuilder(HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> baseLookup)
{
Debug.Assert(baseLookup != null);
_base = baseLookup;
}
public bool Add(THashKey hashKey, TElement element, TOrderKey orderKey)
{
HashLookupValueList<TElement, TOrderKey> currentValue = default(HashLookupValueList<TElement, TOrderKey>);
if (!_base.TryGetValue(hashKey, ref currentValue))
{
currentValue = new HashLookupValueList<TElement, TOrderKey>(element, orderKey);
_base.Add(hashKey, currentValue);
return false;
}
else
{
if (currentValue.Add(element, orderKey))
{
// We need to re-store this element because the pair is a value type.
_base[hashKey] = currentValue;
}
return true;
}
}
}
/// <summary>
/// A wrapper for the HashLookup returned by JoinHashLookupBuilder.
///
/// Since Join operations do not require a default, this just passes the call on to the base lookup.
/// </summary>
private sealed class JoinHashLookup : HashJoinHashLookup<THashKey, TElement, TOrderKey>
{
private readonly HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> _base;
internal JoinHashLookup(HashLookup<THashKey, HashLookupValueList<TElement, TOrderKey>> baseLookup)
{
Debug.Assert(baseLookup != null);
_base = baseLookup;
}
public override bool TryGetValue(THashKey key, ref HashLookupValueList<TElement, TOrderKey> value)
{
return _base.TryGetValue(key, ref value);
}
}
}
}

@roji
Copy link
Member Author

roji commented Dec 2, 2024

Point taken :) Maybe we can have a common implementation with a flag/enum that determines left vs. inner... If that works..

@JohnGalt1717
Copy link

Please add leftjoin for query syntax. (And crossjoin and rightjoin)

This is desperately needed

@bachratyg
Copy link

Corresponding syntax could be added to LINQ expression syntax, on the C# side (just like join) - but this is optional.

Please please please don't skip query syntax! Operators that introduce new variables (SelectMany, Join, GroupBy) are a pain to both read and write using method syntax.

@OJacot-Descombes
Copy link

A version having no result selector returning an IEnumerable<(TOuter, TInner?)> would be handy. Note that the Enumerable.Index(IEnumerable) Method returns a value tuple enumeration as well.

@CyrusNajmabadi
Copy link
Member

Language change requests (like query syntax) would need a discussion opened for that idea at dotnet/csharplang

@roji
Copy link
Member Author

roji commented Dec 3, 2024

@OJacot-Descombes this proposal intentionally follows the existing Enumerable.Join API shape very closely (that's the right model here, rather than Enumerable.Index). I'd prefer keeping additional new APIs as a separate, later discussion for both Join and LeftJoin, assuming the latter makes it in.

@CyrusNajmabadi yep. I'd like to get sign-off on the operator here first, once that happens I'll start the language conversation.

@eiriktsarpalis
Copy link
Member

For inner value types, the operator returns default when an outer has no inners.

I know this just reflects convention in existing methods such as FirstOrDefault, but I do think this approach is problematic in the general case. As any user of LINQ has probably already internalized, value sequences must be lifted to Nullable<T> and in reference types null doesn't necessarily represent the absence of a value. I'm wondering if we could do better going forward, e.g. by passing a boolean to the delegate or perhaps looking at using true option types assuming the language does get DU support soon.

@roji
Copy link
Member Author

roji commented Dec 3, 2024

Updated the proposal above, with the signatures for the Queryable variants, for RightJoin (in case we decide to add it), and with notes and alternative API proposals based on a conversation with @eiriktsarpalis on the ambiguity caused by the operator passing default to the result selector's inner parameter.

I agree that there's an ambiguity here in the general caes, though I think the cases where it matters are generally rare/contrived to warrant a more complex API... See above for more notes.

@eiriktsarpalis eiriktsarpalis added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Dec 3, 2024
@eiriktsarpalis eiriktsarpalis added this to the 10.0.0 milestone Dec 3, 2024
@aloraman
Copy link

aloraman commented Dec 3, 2024

Have you looked at alternative implementations, such as https://github.com/morelinq/MoreLINQ/blob/master/MoreLinq/LeftJoin.cs ?

P.S. People ask for cross join, right join in query syntax. If you would consider it, please, consider also adding support for theta joins (that is, joining not on equality (equijoins), but arbitrary conditions)

@terrajobst
Copy link
Member

terrajobst commented Dec 3, 2024

Video

  • We should do both LeftJoin and RightJoin as reordering the Linq query expression is much more involved than it would be in SQL
  • We could do an overload that doesn't take a result selector and returns a tuple. But we decided to keep this separate as we should do the same for the existing Join method
  • We decided against including Outer because the existing Join method omitted Inner.
namespace System.Linq;

public static class Enumerable
{
    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector);

    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);

    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector);

    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);
}

public static class Queryable
{
    public static IQueryable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter, TInner?, TResult>> resultSelector);
	
    public static IQueryable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter, TInner?, TResult>> resultSelector,
        IEqualityComparer<TKey>? comparer);
    
    public static IQueryable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter?, TInner, TResult>> resultSelector);
	
    public static IQueryable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter?, TInner, TResult>> resultSelector,
        IEqualityComparer<TKey>? comparer);
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Dec 3, 2024
@jogibear9988
Copy link

Linq2db already also added extension methods like this for query syntax:

https://linq2db.github.io/articles/sql/Join-Operators.html

@jogibear9988
Copy link

linq2db has:

LeftJoin, RightJoin, InnerJoin, FullJoin and CrossJoin

they are defined here: https://github.com/linq2db/linq2db/blob/507843d091b4d28c1808d568a729a456de4dde53/Source/LinqToDB/LinqExtensions.cs#L3438

@roji
Copy link
Member Author

roji commented Dec 3, 2024

Have you looked at alternative implementations, such as https://github.com/morelinq/MoreLINQ/blob/master/MoreLinq/LeftJoin.cs ?

Yep. One comment on that API is that it seems overly complex - requiring both a firstSelector and a bothSelector; it does account for the case of disambiguating between "not found" and "found but default" (see "Value types and alternative LeftJoin API shapes" in the OP), but as with the other proposals, it seems to make the basic API more complicated for a 1% edge case.

P.S. People ask for cross join, right join in query syntax. If you would consider it, please, consider also adding support for theta joins (that is, joining not on equality (equijoins), but arbitrary conditions)

Left join has received overwhelmingly more requests than the others, which is why this proposal concentrates on it (note that RightJoin is included as well). Cross join specifically is already quite easy to express (from ... from ..., SelectMany), so it doesn't necessarily require a dedicated operator.

But of course, the fact that other operators aren't included in this specific proposal doesn't mean that they won't be added in the future - I'm just not sure we've seen lots of demand for them.

@obiwanjacobi
Copy link

Decl: So you have a TOuter and a TInner. Which one is the left (or right) one?
Usage: Is the this extension argument the left one: left.LeftJoin(right, ...)?

@roji
Copy link
Member Author

roji commented Dec 11, 2024

@obiwanjacobi this works exactly like the existing (non-left) Join(), and the signature expresses this:

public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,  
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
	Func<TInner, TKey> innerKeySelector,
	Func<TOuter, TInner?, TResult> resultSelector);

In other words, in x.LeftJoin(y, ...), x is the TOuter and y is the TInner.

@obiwanjacobi
Copy link

@roji but with a normal Join left or right is not important as it is here.
Is TOuter left? Or TInner? It is not clear.
Tell which side is seen as left (and right).
Or are you going for the Compare() vibe :-)

@JohnGalt1717
Copy link

Indeed. The inner should be the non-nullable entries, and the outer should be the optional join.

RightJoin join would be the opposite and CrossJoin would be just "left and right" with both being optional.

@terrajobst
Copy link
Member

terrajobst commented Dec 11, 2024

Maybe I'm missing something but in regular SQL it looks like this:

SELECT *
FROM   Outer o
        LEFT JOIN Inner i ON i.Id = O.Id

In C# it looks like this:

var results = dbContext.Outer.LeftJoin(dbContext.Inner, o => o.Id, i => i.Id, (o, i) => (Outer: o, Inner: i));

The same argument can be made for RightJoin. Outer is always on the left, inner is always on the right. LeftJoin and RightJoin signatures don't need to change to reflect that.

The confusing bit I guess is that in SQL, the join types use INNER and OUTER terminology, that is also used to describe the sides of the join. The way I think of this is that a join like X JOIN Y ON P can be implemented as nested loops like FOREACH X { FOREACH Y { IF (P) RETURN X, Y } }. There, outer and inner refer to loops (the inner loop vs the outer loop). Obviously, a SQL DB tries hard to avoid nested loops because it's O(X*Y), but for the purposes here that's an implementation detail.

@Clockwork-Muse
Copy link
Contributor

CrossJoin would be just "left and right" with both being optional.

No, that's FULL OUTER JOIN.
A CROSSJOIN is a full cartesian product of all rows in the table multiplied by the rows in the result set. It's equivalent to:

SELECT ...
FROM <table reference>
INNER JOIN <other reference>
ON 1 = 1

Both sides are always required.

(This capability is rarely useful with tables with real data, instead more often being used with subquery result sets and to build other such computed sets. This is particularly useful when constructing ad-hoc buckets for date/time window analysis).

@terrajobst
Copy link
Member

terrajobst commented Dec 11, 2024

We already have a cross join (kinda) with SelectMany().

Also, in standard SQL you can actually express a cross join:

SELECT *
FROM   Outer o
        CROSS JOIN Inner i

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq
Projects
None yet
Development

No branches or pull requests