Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undesirable composite ORDER BY clause when loading included data on a one-to-many relation #19571

Closed
ghost opened this issue Jan 11, 2020 · 25 comments

Comments

@ghost
Copy link

ghost commented Jan 11, 2020

Entity Framework is generating a composite ORDER BY clause when a query on one entity .Includes() an entity under a one-to-many relation and that related entity is loaded in the SELECT clause. The order by clause includes the primary keys of every single entity in the query as a whole, not just those under that relation. None of this happens if the relation is on a *-to-one relation. I haven't tested many-to-many.

This in combination with the change in EF 3.0 to utilize joins instead of multiple selects can have severe negative performance implications for queries with .Include() calls that worked well enough prior to EF 3.0. Not only does that mean there's extra load on the database server to do these orderings - the sql generated by EF 3.1 for one of our large queries executes in two minutes with the ORDER BY clause and 30 seconds without it - but also the database server isn't able to even start streaming results until it more or less has all the results assembled, leading to client-side timeouts.

I'm not sure if this is a bug or just undesirable behavior. I've found nothing documenting this on pages like https://docs.microsoft.com/en-us/ef/core/querying/related-data, and I've found no clear way to disable it or get around it, absent rewriting these queries or doing something ugly with a command interceptor, but I could have just missed a source of information. I totally understand the pivot to joins implies some things need to be rewritten, but this ORDER BY especially on every entity is excessive and mandates us to manually fine tune data loading where we were previously able to rely on EF for it, even when joins alone are fine.

Solutions in order of preference:

  1. Don't introduce these order by clauses unless asked for in the query
  2. Document this and provide some way to opt out of this behavior on a global or per-query basis.
  3. Document this and don't order by other entitles' keys outside the one-to-many-relation
  4. Document this behavior loudly and provide best practices to work around it

Steps to reproduce

See my gist with code and comments at https://gist.github.com/cwhitelaw/df7d12778966e16e0119a4ffce0a06dc?ts=4

Model:

public class MainEntity
{
	public int Id { get; set; }
	public int SingleRelatedEntityId { get; set; }
	public SingleRelatedEntity SingleRelatedEntity { get; set; }
	public List<ManyRelatedEntity> ManyRelatedEntities { get; set; }
}
public class SingleRelatedEntity
{
	public int Id { get; set; }
}
public class ManyRelatedEntity
{
	public int Id { get; set; }
	public int MainEntityId { get; set; }
	public MainEntity MainEntity { get; set; }
}

Output (the problem manifests in the queries labeled with main entities including to-many relation):

Query: main entities
SELECT [m].[Id], [m].[SingleRelatedEntityId]
FROM [MainEntities] AS [m]

Query: single-related entities
SELECT [s].[Id]
FROM [SingleRelatedEntities] AS [s]

Query: many-related entities
SELECT [m].[Id], [m].[MainEntityId]
FROM [ManyRelatedEntities] AS [m]

Query: main entities including to-one relation
SELECT [m].[Id], [m].[SingleRelatedEntityId], [s].[Id]
FROM [MainEntities] AS [m]
INNER JOIN [SingleRelatedEntities] AS [s] ON [m].[SingleRelatedEntityId] = [s].[Id]

Query: main entities including to-many relation
SELECT [m].[Id], [m].[SingleRelatedEntityId], [m0].[Id], [m0].[MainEntityId]
FROM [MainEntities] AS [m]
LEFT JOIN [ManyRelatedEntities] AS [m0] ON [m].[Id] = [m0].[MainEntityId]
ORDER BY [m].[Id], [m0].[Id]

Query: main entities including to-many relation, order by MainEntity.Id descending
SELECT [m].[Id], [m].[SingleRelatedEntityId], [m0].[Id], [m0].[MainEntityId]
FROM [MainEntities] AS [m]
LEFT JOIN [ManyRelatedEntities] AS [m0] ON [m].[Id] = [m0].[MainEntityId]
ORDER BY [m].[Id] DESC, [m0].[Id]

Query: main entities including both to-one and to-many relation
SELECT [m].[Id], [m].[SingleRelatedEntityId], [s].[Id], [m0].[Id], [m0].[MainEntityId]
FROM [MainEntities] AS [m]
INNER JOIN [SingleRelatedEntities] AS [s] ON [m].[SingleRelatedEntityId] = [s].[Id]
LEFT JOIN [ManyRelatedEntities] AS [m0] ON [m].[Id] = [m0].[MainEntityId]
ORDER BY [m].[Id], [s].[Id], [m0].[Id]

Query: many-related entities through MainEntity relation
SELECT [m0].[Id], [m0].[MainEntityId]
FROM [MainEntities] AS [m]
INNER JOIN [ManyRelatedEntities] AS [m0] ON [m].[Id] = [m0].[MainEntityId]

Query: many-related entities including their MainEntity
SELECT [m].[Id], [m].[MainEntityId], [m0].[Id], [m0].[SingleRelatedEntityId]
FROM [ManyRelatedEntities] AS [m]
INNER JOIN [MainEntities] AS [m0] ON [m].[MainEntityId] = [m0].[Id]

Further technical details

EF Core version: 3.1.0
Database provider: Microsoft.EntityFrameworkCore.SqlServer 3.1.0
Target framework: .NET Core 3.1
Operating system: Windows 10 x64
IDE: Visual Studio 2019

@ghost ghost added the type-bug label Jan 11, 2020
@ajcvickers
Copy link
Member

@roji @maumar Any thoughts on this?

@roji roji added the area-perf label Jan 13, 2020
@roji
Copy link
Member

roji commented Jan 13, 2020

I remember doing some performance tests on ordering by primary keys some time ago (on PostgreSQL), and definitely did not see any difference close to 2 minutes vs. 30 seconds - but we should definitely look into this. I can do a quick perf investigation here.

@CWhitelaw this isn't exactly my domain, but to give some context... The ORDER By clause is added when materializing 1-to-many relations, so that the principal is grouped together while loading the dependent. If we don't have this, rows come at totally random order, and we must track all principals and look them up again on for each row. Having said that, if the ORDER BY necessarily triggers such a perf degradation I definitely agree we should re-examine this.

@ghost
Copy link
Author

ghost commented Jan 13, 2020

@roji Thanks, it had just occurred to me this morning that this might be the use case: you can read rows until you get a different PK column and know you're done reading that entity and yield it to whatever is consuming the IEnumerable. This is a compelling trade off to make especially for memory usage and I see the value. What I don't understand following that is, using my schema as an example, why does it also need to order by the ManyRelatedEntity id? The ordering on the just the MainEntityId seems like it would get the job done unless the ManyRelatedEntityId in turn were to include other data on a one to many relation, in which case we would in turn just additionally want to order on that Id and not also the id of whatever ID it is referencing.

For more context on the performance, the 2m -> 30s example came up on a query with a lot of recursive includes (7 deep), some being to-many relations and some being to-one, so it's one we are rewriting regardless for EF Core 3. It also has some internal inefficiencies of its own we will simplify but I'm sure there are legitimate cases for stuff like this especially for larger projects.

_dbContext.E1
    .Include(x => x.E2) // Single, never null
        .ThenInclude(x => x.E3s) // Many
            .ThenInclude(x => x.E4s) // Many (more or less a many-many join table, and ~10000s records)
                .ThenInclude(x => x.E5) // Single, never null (consequently, also ~10000s of records), 
                    .ThenInclude(x => x.E6) // Single, never null
                        .ThenInclude(x => x.E7) // Many, up to ~10 rows
    .Include(x => x.E2) // Single (same as previous include)
        .ThenInclude(x => x.E8) // Many, 0-100s of rows
    .FirstOrDefaultAsync(e => e.Id == oneId);

@roji
Copy link
Member

roji commented Jan 13, 2020

What I don't understand following that is, using my schema as an example, why does it also need to order by the ManyRelatedEntity id?

That's a good observation and we've raised this internally - it seems like the last dependent entity does not need necessarily need an ORDER BY on its primary key.

However, from the rest of your comment, it seems like the performance degradation you're experiencing could be completely unrelated to the ORDER BY clauses, but rather a case of eager loading being done with a single query. In a nutshell, versions before 3.0 sent a separate query for each related entity, where since 3.0 EF Core always generates a single query - which can cause a so-called "cartesian explosion" if too many joins are present. There are ways to rewrite your query to return to the pre-3.0 multiple query mechanism: please see this comment.

For now I'm going to assume that any perf issue here is related to the single-query issue, rather than to the ORDER BY clause - could you please take a look and confirm?

@ghost
Copy link
Author

ghost commented Jan 13, 2020

Yes I'm aware of the single query change and what that implies, and we are rewriting some queries to that end. The comment you linked seemed to be doing something effectively the same as my initial steps toward that - that comment uses a FirstOrDefault to force a load while I was just using Load() the Db.Context.Entry.Collection, Db.Context.Entry.Reference, or .Query() queryables starting from either of those two.

However, just to clarify, that 30s -> 2min difference is not an EF 2.x to 3.1 change - it's predicated solely on the the ORDER BY clause on the same query generated by EF Core 3.1, so the joins themselves were only 30 seconds even for such a large query. So I think the ordering is still significant. But now that I see why ordering on the primary key of just one entity is desirable, I could test it against the minimal orderings rather than against none.

Also, there's still separately the issue of this case

				LoadQuery("main entities including both to-one and to-many relation",
					dbContext.MainEntities
						.Include(x => x.SingleRelatedEntity)
						.Include(x => x.ManyRelatedEntities));

which outputs an order by all three primary keys rather than just the two under the relation. So it seems like it could be a really ugly surprise that if you had something like this

dbContext.MainEntities
.Include(x => x.SingleRelatedEntity1)
.Include(x => x.SingleRelatedEntity2)
.Include(x => x.SingleRelatedEntity3)
    .ThenInclude(x => x.SingleRelatedEntity4)
.Include(x => x.SingleRelatedEntity5)

And then you added an additional one-to-many relation in that query somewhere, suddenly your query is ordering by 7 different IDs instead of 0.

One particularly rough aspect of this is when all the ordering forces the DB to assemble all the results before it can spit any back out (at least for SQL server and the case of the 2m query i reported), which really reduces the value that the ordering would provide in allowing EF to yield one entity at a time, and is the main problem when it comes to timeouts.

@roji
Copy link
Member

roji commented Jan 13, 2020

However, just to clarify, that 30s -> 2min difference is not an EF 2.x to 3.1 change - it's predicated solely on the the ORDER BY clause on the same query generated by EF Core 3.1, so the joins themselves were only 30 seconds even for such a large query. So I think the ordering is still significant. But now that I see why ordering on the primary key of just one entity is desirable, I could test it against the minimal orderings rather than against none.

So to be sure I understand - can you please confirm you're seeing a 30s -> 2min difference on a single-join query (e.g. the query tagged "main entities including to-many relation" above), predicated solely on the existence or absence of the ORDER BY clause? If so, I'll definitely look into this.

@ghost
Copy link
Author

ghost commented Jan 13, 2020

That is correct. I grabbed the single statement SQL that EF Core 3.1 spit out for the query in my initial comment with all the .ThenIncludes, and just removed the order-by clause which was ordering by a bunch of primary key IDs.

For the example cases in the gist like "main entities including to-many relation", I only created those to generate SQL output to illustrate the order-by clauses, and I haven't done any load testing with data against that example schema or those queries in particular. I imagine results can vary greatly by the queries themselves and also the size of the tables, number of relations that that satisfy the query in those 1-many Include() calls, the DB provider, etc.

@roji
Copy link
Member

roji commented Jan 13, 2020

Thanks, I'll look into this and will get back to you soon if I have any questions.

@ghost
Copy link
Author

ghost commented Jan 13, 2020

@roji I realized i might have misunderstood what you meant by "single-join query" and wanted to clarify:

This is the query in question where we saw drastic the performance difference.

_dbContext.E1
    .Include(x => x.E2) // Single, never null
        .ThenInclude(x => x.E3s) // Many
            .ThenInclude(x => x.E4s) // Many (more or less a many-many join table, and ~10000s records)
                .ThenInclude(x => x.E5) // Single, never null (consequently, also ~10000s of records), 
                    .ThenInclude(x => x.E6) // Single, never null
                        .ThenInclude(x => x.E7) // Many, up to ~10 rows
    .Include(x => x.E2) // Single (same as previous include)
        .ThenInclude(x => x.E8) // Many, 0-100s of rows
    .FirstOrDefaultAsync(e => e.Id == oneId);

It results in a single query, but with many joins and some subqueries, and an order-by on 8 fields. So it certainly has more than one join if that was the question at hand - sorry if I misunderstood.

I ran it again just to check - without the order-bys, results started coming back coming back within seconds and completing in 33s like before, and with the order-bys it took 2:58s this time and results don't start coming back until nearly the end of that time. The result set is 45862 rows.

I don't yet have any performance numbers for smaller items like the sample queries in the gist.

@ghost
Copy link
Author

ghost commented Jan 13, 2020

To my great surprise, there's exactly one ID that's causing the bottleneck here. It's E8's ID.

With all 8 order-by fields (ran this once just prior to make sure it's a warm run)
Results start being returned at 1:50
Query completes in 2:04

By simply removing E8's ID from the order by clause (leaving 7 other elements in the order-by critieria)
Results start being returned immediately
Query completes in 0:18

Without any order-by, the query also completes in 18 seconds.

I tried removing other fields from the order by without any significant change, yet this one was drastic.

Using azure data studio if that's relevant. It's an azure SQL database and I don't currently have a dedicated instance/environment for this so the results at different times are pretty variable but repeated runs at any given point in time seem pretty consistent.

This might be a good motivation to only be ordering on the left side of the one-to-many relation rather than both but I don't know how representative this case is. Hope this helps.

@roji
Copy link
Member

roji commented Jan 13, 2020

Thanks for the added info, it's important to know that the degradation is (for now) seen on a query with 8 includes/joins. This also means that you may be able to work around the issue here by breaking the query into separate parts as described in #18022.

Out of curiosity, if E8 seems to be the cause of the issue, do you have the same problem when removing the entire first include chain (i.e. E2-E7)?

Also, is there any chance you can provide your (minimal!) model, with an idea of the data required to reproduce this? Also, any chance you could check if this occurs on a regular, on-premise SQL Server instance rather than Azure?

This might be a good motivation to only be ordering on the left side of the one-to-many relation rather than both but I don't know how representative this case is. Hope this helps.

I think it would be good to get to the bottom of this even though not ordering on the dependent side here may mitigate the issue you're specifically seeing.

@roji
Copy link
Member

roji commented Jan 13, 2020

Also, if you could post the specific SQL being generated that could be helpful.

@ghost
Copy link
Author

ghost commented Jan 14, 2020

I haven't been able to get back to this yet, but is what I found when I left off: It doesn't happen with E8.Id alone, and I was able to isolate the behavior down to the order by on E8 being combined with one other particular key in that ThenInclude chain but I don't recall as of now which one.

The SQL had 3 subqueries and looked like this. Seemed like a pretty good query - it knew to grab just one row from the first select.

SELECT <a bunch of fields>
FROM (SELECT TOP 1 <fields> FROM <E1 join E2>)
LEFT JOIN (<subquery on E3 through E7>)
LEFT JOIN <E8>

I ran the explain plan on both and saw there was a Sort operation right at the end that accounted for 100% of the cost, and I can't see it anywhere in the plan if I just remove the E8.Id from that clause (presumably, other changes to the clause would induce the same change but i haven't yet verified). This was on a dev server, while on a production server the explain showed 50% due to the sort and 50% for "Parallelism (Gather Streams)", which i assume is just because the larger instance has at least another core on it. Without the complete order-by clause, most of the cost goes to a clustered index scan on E8, but the query as a whole is quite fast. I'm not extremely familiar with specifics of SQL Server explain plans, but I was surprised in that if that "sort" isn't doing the ordering, then what would be doing it?

What's more perplexing about this is that E8 has 4 entries in this case, and 0 of them are actually associated with the test case in question.

I'll come back when I'm able with a minified model and the actual SQL plus results against a local sql server dev edition installation, unless these details signal any potential answers.

@roji
Copy link
Member

roji commented Jan 14, 2020

@CWhitelaw thanks for all the time and effort you're putting into this.

I will try to play around with a mock model tomorrow, to try and reproduce your findings - the first step here is to definitely get minimal repro, and then we can go into analyzing the plan and understanding exactly what effect the ORDER BY has and why it's degrading performance. Please let me know if you manage to do this on your side.

@ghost
Copy link
Author

ghost commented Jan 21, 2020

I've been pretty slammed but I think I have a minimal enough repro now, minus the data.

Gist: https://gist.github.com/cwhitelaw/7dc6d93a0a1025cd4d3a2569300a46c5?ts=4

You can see through the execution plan that despite having order by clauses everywhere, that sort comes in only when the query gets to including the E4s, while including everything except the E8s is no problem. The last query is effectively the original one I had posted about.

In the minimal case where we just go up to E4s, altering the order by clause to just order by E8 alone is enough to retain the sort operation, but I could have sworn that wasn't the case when i was testing on the real database either. Also notably the sort operation produces a lower estimated cost (~40%), but i'm betting that's the case because there's only enough columns here to support the relations and not actual data itself. It's also possible that the sort operation is a red herring but the correlation is pretty strong so far.

--Query: Baseline - E1 FirstOrDefault(ID=1) and its E8s
SELECT [t].[Id], [t].[E2Id], [e0].[Id], [e0].[E1Id]
FROM (
    SELECT TOP(1) [e].[Id], [e].[E2Id]
    FROM [E1] AS [e]
    WHERE [e].[Id] = 1
) AS [t]
LEFT JOIN [E8] AS [e0] ON [t].[Id] = [e0].[E1Id]
ORDER BY [t].[Id], [e0].[Id]

--Query: Include up to E2
SELECT [t].[Id], [t].[E2Id], [t].[Id0], [e1].[Id], [e1].[E1Id]
FROM (
    SELECT TOP(1) [e].[Id], [e].[E2Id], [e0].[Id] AS [Id0]
    FROM [E1] AS [e]
    INNER JOIN [E2] AS [e0] ON [e].[E2Id] = [e0].[Id]
    WHERE [e].[Id] = 1
) AS [t]
LEFT JOIN [E8] AS [e1] ON [t].[Id] = [e1].[E1Id]
ORDER BY [t].[Id], [t].[Id0], [e1].[Id]

--Query: Include up to E3s
SELECT [t].[Id], [t].[E2Id], [t].[Id0], [e1].[Id], [e1].[E2Id], [e2].[Id], [e2].[E1Id]
FROM (
    SELECT TOP(1) [e].[Id], [e].[E2Id], [e0].[Id] AS [Id0]
    FROM [E1] AS [e]
    INNER JOIN [E2] AS [e0] ON [e].[E2Id] = [e0].[Id]
    WHERE [e].[Id] = 1
) AS [t]
LEFT JOIN [E3] AS [e1] ON [t].[Id0] = [e1].[E2Id]
LEFT JOIN [E8] AS [e2] ON [t].[Id] = [e2].[E1Id]
ORDER BY [t].[Id], [t].[Id0], [e1].[Id], [e2].[Id]

--Query: Include up to E4s
SELECT [t].[Id], [t].[E2Id], [t].[Id0], [t0].[Id], [t0].[E2Id], [t0].[Id0], [t0].[E3Id], [t0].[E5Id], [e3].[Id], [e3].[E1Id]
FROM (
    SELECT TOP(1) [e].[Id], [e].[E2Id], [e0].[Id] AS [Id0]
    FROM [E1] AS [e]
    INNER JOIN [E2] AS [e0] ON [e].[E2Id] = [e0].[Id]
    WHERE [e].[Id] = 1
) AS [t]
LEFT JOIN (
    SELECT [e1].[Id], [e1].[E2Id], [e2].[Id] AS [Id0], [e2].[E3Id], [e2].[E5Id]
    FROM [E3] AS [e1]
    LEFT JOIN [E4] AS [e2] ON [e1].[Id] = [e2].[E3Id]
) AS [t0] ON [t].[Id0] = [t0].[E2Id]
LEFT JOIN [E8] AS [e3] ON [t].[Id] = [e3].[E1Id]
ORDER BY [t].[Id], [t].[Id0], [t0].[Id], [t0].[Id0], [e3].[Id]

--Query: Include everything
SELECT [t].[Id], [t].[E2Id], [t].[Id0], [t1].[Id], [t1].[E2Id], [t1].[Id0], [t1].[E3Id], [t1].[E5Id], [t1].[Id00], [t1].[E6Id], [t1].[Id1], [t1].[Id2], [t1].[E6Id0], [e6].[Id], [e6].[E1Id]
FROM (
    SELECT TOP(1) [e].[Id], [e].[E2Id], [e0].[Id] AS [Id0]
    FROM [E1] AS [e]
    INNER JOIN [E2] AS [e0] ON [e].[E2Id] = [e0].[Id]
    WHERE [e].[Id] = 1
) AS [t]
LEFT JOIN (
    SELECT [e1].[Id], [e1].[E2Id], [t0].[Id] AS [Id0], [t0].[E3Id], [t0].[E5Id], [t0].[Id0] AS [Id00], [t0].[E6Id], [t0].[Id1], [t0].[Id2], [t0].[E6Id0]
    FROM [E3] AS [e1]
    LEFT JOIN (
        SELECT [e2].[Id], [e2].[E3Id], [e2].[E5Id], [e3].[Id] AS [Id0], [e3].[E6Id], [e4].[Id] AS [Id1], [e5].[Id] AS [Id2], [e5].[E6Id] AS [E6Id0]
        FROM [E4] AS [e2]
        INNER JOIN [E5] AS [e3] ON [e2].[E5Id] = [e3].[Id]
        INNER JOIN [E6] AS [e4] ON [e3].[E6Id] = [e4].[Id]
        LEFT JOIN [E7] AS [e5] ON [e4].[Id] = [e5].[E6Id]
    ) AS [t0] ON [e1].[Id] = [t0].[E3Id]
) AS [t1] ON [t].[Id0] = [t1].[E2Id]
LEFT JOIN [E8] AS [e6] ON [t].[Id] = [e6].[E1Id]
ORDER BY [t].[Id], [t].[Id0], [t1].[Id], [t1].[Id0], [t1].[Id00], [t1].[Id1], [t1].[Id2], [e6].[Id]

--Query: Include everything except the E8s
SELECT [t].[Id], [t].[E2Id], [t].[Id0], [t1].[Id], [t1].[E2Id], [t1].[Id0], [t1].[E3Id], [t1].[E5Id], [t1].[Id00], [t1].[E6Id], [t1].[Id1], [t1].[Id2], [t1].[E6Id0]
FROM (
    SELECT TOP(1) [e].[Id], [e].[E2Id], [e0].[Id] AS [Id0]
    FROM [E1] AS [e]
    INNER JOIN [E2] AS [e0] ON [e].[E2Id] = [e0].[Id]
    WHERE [e].[Id] = 1
) AS [t]
LEFT JOIN (
    SELECT [e1].[Id], [e1].[E2Id], [t0].[Id] AS [Id0], [t0].[E3Id], [t0].[E5Id], [t0].[Id0] AS [Id00], [t0].[E6Id], [t0].[Id1], [t0].[Id2], [t0].[E6Id0]
    FROM [E3] AS [e1]
    LEFT JOIN (
        SELECT [e2].[Id], [e2].[E3Id], [e2].[E5Id], [e3].[Id] AS [Id0], [e3].[E6Id], [e4].[Id] AS [Id1], [e5].[Id] AS [Id2], [e5].[E6Id] AS [E6Id0]
        FROM [E4] AS [e2]
        INNER JOIN [E5] AS [e3] ON [e2].[E5Id] = [e3].[Id]
        INNER JOIN [E6] AS [e4] ON [e3].[E6Id] = [e4].[Id]
        LEFT JOIN [E7] AS [e5] ON [e4].[Id] = [e5].[E6Id]
    ) AS [t0] ON [e1].[Id] = [t0].[E3Id]
) AS [t1] ON [t].[Id0] = [t1].[E2Id]
ORDER BY [t].[Id], [t].[Id0], [t1].[Id], [t1].[Id0], [t1].[Id00], [t1].[Id1], [t1].[Id2]

@roji
Copy link
Member

roji commented Jan 21, 2020

@CWhitelaw thanks for putting this together, I'll look into it in the next few days and investigate.

@roji
Copy link
Member

roji commented Jan 23, 2020

@CWhitelaw I've hacked up some SQL scripts to load data into your tables, to repro the speed difference. I haven't yet had time to play around with actual values, but as you know you're data maybe you can help out and suggest the numbers that would reproduce the issue?

To make sure we're talking about the same thing, the idea is to run the following query:

SELECT [t].[Id], [t].[E2Id], [t].[Id0], [t1].[Id], [t1].[E2Id], [t1].[Id0], [t1].[E3Id], [t1].[E5Id], [t1].[Id00], [t1].[E6Id], [t1].[Id1], [t1].[Id2], [t1].[E6Id0], [e6].[Id], [e6].[E1Id]
FROM (
    SELECT TOP(1) [e].[Id], [e].[E2Id], [e0].[Id] AS [Id0]
    FROM [E1s] AS [e]
    INNER JOIN [E2s] AS [e0] ON [e].[E2Id] = [e0].[Id]
    WHERE [e].[Id] = 1
) AS [t]
LEFT JOIN (
    SELECT [e1].[Id], [e1].[E2Id], [t0].[Id] AS [Id0], [t0].[E3Id], [t0].[E5Id], [t0].[Id0] AS [Id00], [t0].[E6Id], [t0].[Id1], [t0].[Id2], [t0].[E6Id0]
    FROM [E3s] AS [e1]
    LEFT JOIN (
        SELECT [e2].[Id], [e2].[E3Id], [e2].[E5Id], [e3].[Id] AS [Id0], [e3].[E6Id], [e4].[Id] AS [Id1], [e5].[Id] AS [Id2], [e5].[E6Id] AS [E6Id0]
        FROM [E4s] AS [e2]
        INNER JOIN [E5s] AS [e3] ON [e2].[E5Id] = [e3].[Id]
        INNER JOIN [E6s] AS [e4] ON [e3].[E6Id] = [e4].[Id]
        LEFT JOIN [E7s] AS [e5] ON [e4].[Id] = [e5].[E6Id]
    ) AS [t0] ON [e1].[Id] = [t0].[E3Id]
) AS [t1] ON [t].[Id0] = [t1].[E2Id]
LEFT JOIN [E8s] AS [e6] ON [t].[Id] = [e6].[E1Id]
ORDER BY [t].[Id], [t].[Id0], [t1].[Id], [t1].[Id0], [t1].[Id00], [t1].[Id1], [t1].[Id2], [e6].[Id];

... once with the final ORDER BY, and once without it, and see significant perf differences.

SQL to generate data:

DECLARE @e2s INT = 1000; -- e1s and e2s
DECLARE @e3s INT = 20;   -- per e2
DECLARE @e5s INT = 20;   -- per e6
DECLARE @e6s INT = 1000; -- e6s
DECLARE @e7s INT = 10;   -- per e6
DECLARE @e8s INT = 10;  -- per e1
DECLARE @e4s INT = @e2s * @e3s; -- between e3 and e5

-- E1+E2
DECLARE @e2 INT = 1;
WHILE @e2 <= @e2s
BEGIN
    INSERT INTO [E2s] DEFAULT VALUES;
    INSERT INTO [E1s] ([E2Id]) VALUES (@e2);

    -- E3
    DECLARE @e3 INT = 1;
    WHILE @e3 <= @e3s
    BEGIN
        INSERT INTO [E3s] ([E2Id]) VALUES (@e2);
        SET @e3 = @e3 + 1;
    END;
        
    -- E8
    DECLARE @e8 INT = 1;
    WHILE @e8 <= @e8s
    BEGIN
        INSERT INTO [E8s] ([E1Id]) VALUES (@e2);
        SET @e8 = @e8 + 1;
    END;

    SET @e2 = @e2 + 1;
END;

-- E6
DECLARE @e6 INT = 1;
WHILE @e6 <= @e6s
BEGIN
    INSERT INTO [E6s] DEFAULT VALUES;

    -- E7
    DECLARE @e7 INT = 1;
    WHILE @e7 <= @e7s
    BEGIN
        INSERT INTO [E7s] ([E6Id]) VALUES (@e6);
        SET @e7 = @e7 + 1;
    END;

    -- E5
    DECLARE @e5 INT = 1;
    WHILE @e5 <= @e5s
    BEGIN
        INSERT INTO [E5s] ([E6Id]) VALUES (@e6);
        SET @e5 = @e5 + 1;
    END;

    SET @e6 = @e6 + 1;
END;
    
-- E4 (many-to-many between E3 and E5)
DECLARE @e4 INT = 1;
WHILE @e4 <= @e4s
BEGIN
    INSERT INTO [E4s] ([E3Id], [E5Id]) VALUES (@e4, @e4);
    SET @e4 = @e4 + 1;
END;

-- DELETE DATA
DELETE FROM [E8s] WHERE 1=1;
DELETE FROM [E1s] WHERE 1=1;
DELETE FROM [E3s] WHERE 1=1;
DELETE FROM [E2s] WHERE 1=1;
DELETE FROM [E5s] WHERE 1=1;
DELETE FROM [E7s] WHERE 1=1;
DELETE FROM [E6s] WHERE 1=1;
DELETE FROM [E4s] WHERE 1=1;
DBCC CHECKIDENT ('E8s', RESEED, 0);
DBCC CHECKIDENT ('E1s', RESEED, 0);
DBCC CHECKIDENT ('E2s', RESEED, 0);
DBCC CHECKIDENT ('E3s', RESEED, 0);
DBCC CHECKIDENT ('E5s', RESEED, 0);
DBCC CHECKIDENT ('E6s', RESEED, 0);
DBCC CHECKIDENT ('E7s', RESEED, 0);
DBCC CHECKIDENT ('E4s', RESEED, 0);

Thanks again for your help.

@ghost
Copy link
Author

ghost commented Jan 27, 2020

Here's some rough numbers on table size

e1 300
e2 300
e3 75000
e4 2500000
e5 2500000
e6 1500
e7 5000
e8 5

I think there might be a mismatch in how some of the entities are generated so here's a rough outline how I would think to do it from scratch.

create 300 e1s

create 5 e8s, each associated with any distinct e1 (there are barely any entries in this table, despite its impact on the query)

create 300 e2s, one per e1
create 250 e3s per e2, associated with that e2

create 1500 e6s
create 3 E7s per e6 associated with that e6

for each e3:
  choose an arbitrary e6
  do 30 times:
    create an e5 associated with that e6
    create an e4 associated with that e3 and that e5

That should arrive at approximately those numbers.

@ajcvickers ajcvickers added this to the 5.0.0 milestone Feb 3, 2020
@ghost
Copy link
Author

ghost commented Feb 4, 2020

Checking in - any findings etc. on this? I see this issue got added to a milestone yesterday.

@roji
Copy link
Member

roji commented Feb 5, 2020

@CWhitelaw no specific findings, but only because I haven't had time to look into it. It's in the milestone which means we definitely intend to investigate this for 5.0.

@smitpatel will also be interested in this.

@smitpatel
Copy link
Member

Last ordering term is not necessary for result computation. Though it is hard to keep track & add/remove in code complexity. We can do that if there is compelling evidence that it causes huge perf penalty. Apart from that, nothing new from #18022

@Stasinevich
Copy link

Stasinevich commented Feb 7, 2020

Hi

I've encountered this issue in the current week. Even on 4 Includes (joins) I got a composite 'order by' clause. The performance penalty sometimes is huge - for instance query with single/double fields in order by is executed within 2 secs, whether with 8+ it takes more than 2 minutes. Also, SQL server grants a lot more memory for such SQL query plans.

I have found an only single workaround for now - split queries

@roji
Copy link
Member

roji commented Feb 7, 2020

Apart from that, nothing new from #18022

The new thing for me is that the slowdown in this case comes specifically from the ORDER BY clause - the same query without the clause performs significantly better (2 minutes vs. 30 seconds). This is more specific than general perf degradation caused by cartesian explosion (I do wonder how much of the complaints in #18022 come from the ORDER BY and how much from other factors).

Opened #19828 to specifically track removing the last ORDER BY.

Some general thoughts on this (@smitpatel and @ajcvickers can correct any inaccuracies, am new to this domain):

  • The ORDER BYs are necessary in order to have streaming behavior (otherwise related entities are located in random places), which is important for large resultsets.
  • However, it could be claimed that streaming is meaningless for tracking queries, since materialized entities are referenced from the change tracker anyway. Would it make sense to consider dropping the ORDER BYs from tracked queries, and just identity resolution?
  • For non-tracking queries, streaming makes more sense (as materialized entities aren't referenced by EF), and so ORDER BY seems necessary. In theory we could allow the user to express whether streaming is required or not, and take this into account. This would also require implementing identity resolution for non-streaming, non-tracked queries, making them a bit more similar to tracked queries. Not sure this is worth the effort.
  • I'm not sure what the exact current behavior is for queries which don't return entities at all - I think I'm seeing ORDER BY although it shouldn't be necessary?

@smitpatel smitpatel removed this from the 5.0.0 milestone Feb 7, 2020
@ajcvickers ajcvickers added this to the 5.0.0 milestone Feb 10, 2020
@roji
Copy link
Member

roji commented Feb 26, 2020

Opened #20076 to track the removal of ORDER BY for buffered tracked queries. Between that, #19828 (remove last ORDER BY) and #18022 ("easy API" to split queries), I don't think there's anything more to do here - am closing this issue.

@CWhitelaw at this point, the right thing to do is to manually split the query as described in #18022 (and to switch to the "easy API" once it's introduced). If, after splitting the query, you still encounter a scenario where an ORDER BY generates a significant perf degradation, please post a repro for that to #20076 - that would help us evaluate the necessity of doing that.

@roji roji closed this as completed Feb 26, 2020
@roji roji removed this from the 5.0.0 milestone Feb 26, 2020
@ghost
Copy link
Author

ghost commented Feb 26, 2020

Glad to hear these are under consideration and yes that's the road we're going down at this point. I hope these proposals make it through for EF 5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants