
Use single SQL query instead of split queries #12098

Open
benmccallum opened this Issue May 22, 2018 · 49 comments

benmccallum commented May 22, 2018

I've been investigating an issue when using AutoMapper's Queryable projection extensions on top of EF Core 2.1-rc1-final. Full details, including troubleshooting logs with the SQL statements, are at the linked issue (AutoMapper/AutoMapper#2639), and EfAutomapperTroubleshooting.zip is a repro.

Basically I've got a Booking which has multiple BookingItems. I have a DTO for each type and have configured them to be used by AutoMapper. When I execute the query on EF6 the JOIN optimization occurs (it's even smart enough that I don't need to specify the .Include(b => b.BookingItems)). On EF Core, however, two DB queries fire (one for Booking and the other for BookingItems). Explicitly specifying the Include potentially would work, but it's ignored by the "IgnoredIncludes" cleverness, so I can't force that behaviour even if I wanted to.
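
For context, a minimal sketch of the kind of AutoMapper setup described above (the DTO shapes and mapping registrations are assumptions; the real classes live in the linked repro):

// using AutoMapper; using AutoMapper.QueryableExtensions;
// Hypothetical minimal mapping configuration.
var config = new MapperConfiguration(cfg =>
{
    cfg.CreateMap<Booking, BookingDto>();
    cfg.CreateMap<BookingItem, BookingItemDto>();
});

// AutoMapper builds the Select() projection, so only mapped columns are queried.
var bookingDtos = db.Bookings
    .AsNoTracking()
    .ProjectTo<BookingDto>(config)
    .ToList();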

So there's two issues running on EF Core I'm trying to get past:

  1. The JOIN optimization isn't "figured out" automatically.
  2. Trying to force the JOIN with .Include(b => b.BookingItems) doesn't work either as EF Core is being clever but not clever enough. I can't see a way to turn this off or opt out per execution/context.

I obviously understand that EF Core is not EF 6, but I'm just wondering if this would ever be supported, as it's a common scenario these days. The AutoMapper projection extensions are awesome. I'll try to do this without AutoMapper and see if that can be my workaround for now.

Steps to reproduce

See linked issue logged at AutoMapper and the downloadable .sln zip in my later comment.

Further technical details

EF Core version: 2.1.0-rc1-final
Database Provider: Microsoft.EntityFrameworkCore.SqlServer
Operating system: Windows 10 Pro, Version: 1803, OS build: 17134.48
IDE: Visual Studio 2017 15.7.1

benmccallum commented May 22, 2018

I also tried it with a straight-up LINQ query and hand-written projections, and the same 2 queries are generated. This isn't really an option for me even if it did work, though, as I'm building a GraphQL service that passes required field names down from the client; I then use the .ProjectTo<T>(null, requiredFieldNames) overload to basically have AutoMapper generate the perfect SQL query for me. It'd be magical if it could work end-to-end and produce the JOIN I'd like.

bookingDtos = db.Bookings
    .AsNoTracking()
    .Select(b => new BookingDto
    {
        Id = b.Id,
        Name = b.Name,
        BookingItems = b.BookingItems.Select(bi => new BookingItemDto
        {
            Id = bi.Id,
            Name = bi.Name
        }).ToList()
    })
    .ToArray();
info: Microsoft.EntityFrameworkCore.Database.Command[20101]
      Executed DbCommand (4ms) [Parameters=[], CommandType='Text', CommandTimeout='30']
      SELECT [b].[Id], [b].[Name]
      FROM [Booking] AS [b]
      ORDER BY [b].[Id]
info: Microsoft.EntityFrameworkCore.Database.Command[20101]
      Executed DbCommand (7ms) [Parameters=[], CommandType='Text', CommandTimeout='30']
      SELECT [t].[Id], [b.BookingItems].[Id] AS [Id0], [b.BookingItems].[Name], [b.BookingItems].[BookingId]
      FROM [BookingItem] AS [b.BookingItems]
      INNER JOIN (
          SELECT [b0].[Id]
          FROM [Booking] AS [b0]
      ) AS [t] ON [b.BookingItems].[BookingId] = [t].[Id]
      ORDER BY [t].[Id]

smitpatel commented May 22, 2018

It's not N+1, it's just 2 queries. 2 queries instead of 1 is by design. What is your specific issue with that?

benmccallum changed the title from "N+1 with projections, even with Include" to "2 queries with one:many projections, even with Include" May 22, 2018

benmccallum commented May 22, 2018

Sorry, you're right. So just to clarify, you are saying it's by design to do 2 queries when it can be achieved with one using a JOIN? If that's the case then no worries, it's just surprising.

smitpatel commented May 22, 2018

@benmccallum - Yes, it is by design. While it can be achieved with 1 query, that is likely to be less performant: a single query for a collection (include or join) causes the dreaded cartesian product.
To give an example, let's assume you have 50 bookings, and each booking has 5 bookingitems. With 1 query, it will do a join between both and cause 250 rows to be returned from the server, and each booking's data will be repeated exactly 5 times. If you add another collection include (assume 10 bookingdetails per booking), the total size of the data becomes 2,500 rows, with a lot of duplicated data present.
With split queries, we send 1 query for each collection include. So you get 3 queries with 50/250/500 results each, with no duplication of data. Reference includes are still done using a join in a single query. Of course there is a trade-off based on cardinality which would make multiple queries or a single query more efficient. Due to join explosion in the case of multiple collection joins, we decided to adopt the 2-query approach in EF Core.
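
To make that arithmetic concrete, a rough sketch of the flattened shape a single joined query implies (BookingDetails is the hypothetical second collection from the example above, not part of the actual repro):

// Hypothetical single-query (joined) shape for the 50 x 5 x 10 example:
// the flattened result has 50 * 5 * 10 = 2,500 rows, and each booking's
// columns are repeated 5 * 10 = 50 times.
var rows = db.Bookings
    .SelectMany(b => b.BookingItems, (b, item) => new { Booking = b, Item = item })
    .SelectMany(x => x.Booking.BookingDetails, // hypothetical second collection
        (x, detail) => new
        {
            x.Booking.Id,
            x.Booking.Name,
            ItemId = x.Item.Id,
            DetailId = detail.Id
        })
    .ToList();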

benmccallum commented May 22, 2018

Thanks heaps for taking the time to explain this @smitpatel. That makes total sense. Closing this out.

elyasnew commented Jun 28, 2018

There are 2 independent queries, which means the second query has nothing to do with the first query, so I still don't understand how running 2 queries prevents data duplication!

smitpatel commented Jun 28, 2018

@elyasnew - They are not independent queries. The 2nd query fetches the related data (for the collection include) for the records in the first query. If you want all the data in one query then you need a join, which will duplicate the data from the first query for every related record in the 2nd query.

brainoverflow98 commented Nov 10, 2018

@benmccallum - Yes, it is by design. While it can be achieved with 1 query, that is likely to be less performant: a single query for a collection (include or join) causes the dreaded cartesian product.
To give an example, let's assume you have 50 bookings, and each booking has 5 bookingitems. With 1 query, it will do a join between both and cause 250 rows to be returned from the server, and each booking's data will be repeated exactly 5 times. If you add another collection include (assume 10 bookingdetails per booking), the total size of the data becomes 2,500 rows, with a lot of duplicated data present.
With split queries, we send 1 query for each collection include. So you get 3 queries with 50/250/500 results each, with no duplication of data. Reference includes are still done using a join in a single query. Of course there is a trade-off based on cardinality which would make multiple queries or a single query more efficient. Due to join explosion in the case of multiple collection joins, we decided to adopt the 2-query approach in EF Core.

Thanks for the clarification. I think it is really important to split the queries in some cases where you have too much duplicate data, but can you make it optional in case some people don't want to use it that way? Something like:

_db.Blogs.Include(b=>b.Posts).ThenInclude(p=>p.Comments).Where(b=>b.Id == 1).SingleQuery().ToList();

That would be useful, especially when trying to return a single blog - we do not need to split queries for a single blog. I know some scenarios where the duplicate data is larger than the actual data, but for the most part people might want to use a single query, especially once you take into account that round-trip time and statement processing make up most of the query time.

dpsenner commented Nov 20, 2018

@smitpatel I would like to reopen the discussion about this issue.

I just wrote a single LINQ query and expected a single SQL query to be sent to the database. See the referenced issue for more information. But I was astonished that Entity Framework Core did something totally different and even resorted to joining tuples in memory. This was unexpected.

This said, I do believe that when a developer writes one LINQ query, he expresses the intention to run that LINQ query as a single SQL query, and writes it such that it can be run as a single SQL query against the database. If the developer did not want to offload that operation to the database, he would write two LINQ queries and join the results in the application server.

I do believe that the developer should have a choice. It is the developer's responsibility to write LINQ queries that implement the functional requirements in a sane way. The decision to generate two queries makes it impossible to implement streaming techniques over several hundred thousand records with their related records, while it silently increases the memory requirements of the machine that runs the application server.

Can I somehow persuade the Entity Framework Core team to rethink the original decision? I do not think that the Entity Framework Core team has to protect developers from writing bad LINQ queries. For instance, I would not mind if Entity Framework Core threw an exception if it is unable to generate a sensible SQL query from a bad LINQ query.

divega added this to the Discussions milestone Nov 20, 2018

divega commented Nov 21, 2018

@dpsenner @roji Thanks, this is indeed an interesting discussion. I would like to point out that although EF Core creates a separate query whenever you project or include collections, for each of these cases there is generally an equivalent query (that is, a query that returns the exact same rows) that is translated by EF Core to a single query with JOINs. E.g.:

// this will translate as three SQL queries
db.Blogs.Include(b=>b.Posts).ThenInclude(p=>p.Comments).Where(b=>b.Id == 1).ToList();
// this will translate as a single SQL query
db.Comments.Include(b=>b.Post).ThenInclude(b=>b.Blog).Where(b=>b.Post.Blog.BlogId == 1).ToList();

The difference between the two queries is that you are traversing the navigation properties in the opposite direction.

  • For the first query, we interpret that your main intent is to get a Blog that has a BlogId equal to 1. Your secondary intent is to get related information about the posts and the comments. Based on this, we translate to one query to get the blog, and two additional queries to get the related posts and comments. In this case, your query is traversing the navigation properties from the one end toward the many end. Had we tried to translate to a single query, the number of times the blog's data is repeated across the result rows would have exploded.

  • For the second query, we interpret that your primary intent is to get the comments for a specific blog, and your secondary intent is to get the data for the related posts and blogs. Our hypothesis is that you already expect the cardinality of the results to be the same as the number of comments for that blog in the database. You are traversing navigations from the many ends towards the one ends, each time narrowing the cardinality of the additional data you are asking for. So when we translate this to a single query, the number of rows in the results is going to be the same as if you were querying for the comments alone. However, we are duplicating the data in all the columns of post and blog.

You are welcome to think that the distinction between the two queries (the direction in which you are traversing relationships) is arbitrary, and we could have used the same translation. However I would argue that we are simply using the order in which navigations are being traversed as a hint on how we translate the query. A useful way to think about how EF Core decides whether to translate to one or multiple SQL queries is that it is a heuristic that we apply to avoid the explosion of the cardinality of the root of the query that you are consuming. In a way, it is an attempt to achieve least astonishment, but focused on that specific aspect of the query.

Also regarding the assertion that this doesn't allow streaming: for the first query, we actually still stream the results of Blog on the same query. We buffer the results from Posts and Comments, but if your database (and connection string) supports multiple active result sets (MARS), we only buffer them incrementally. E.g. we will buffer together all the posts for the first blog, then when you skip to the next blog, we will buffer all the posts for the second blog, etc.

In any case, we are reopening this issue to discuss if there are other ways we can do this in the future. Obviously, it becomes much harder or impossible for you to write a query that gets resolved as a single query if the reverse navigation properties are absent in the model, or if one of the leaf entities is an owned entity.
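
As a side note on the MARS point above: for SQL Server, MARS is opt-in via the connection string. A minimal sketch (server and database names are placeholders):

// In DbContext.OnConfiguring; without MultipleActiveResultSets=True, SqlClient
// allows only one open data reader per connection, so EF Core has to buffer more
// of the collection results, as described above.
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    => optionsBuilder.UseSqlServer(
        "Server=.;Database=BookingsDb;Trusted_Connection=True;MultipleActiveResultSets=True");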

divega reopened this Nov 21, 2018

divega removed this from the Discussions milestone Nov 21, 2018

divega changed the title from "2 queries with one:many projections, even with Include" to "Two queries with one:many projections, even with Include" Nov 21, 2018

roji commented Nov 21, 2018

@divega that's very useful information I didn't know - thanks! I'd suggest maybe adding this to the docs, as I'm sure not many people are aware of the distinction and the two SQL generation options. Maybe an "advanced query techniques" page would be appropriate for this kind of thing...

At least IMO that's quite a satisfactory state of affairs... Even if the "default" mode (traversal from one to many) streams only the one end and buffers the collections, it's understandable that in some cases users may wish to avoid it, even at the expense of a cartesian product explosion. You seem to already provide a way to express it, and requiring that reverse navigation be present doesn't seem like a big problem to me. Owned entities may present a more serious problem though.

One last somewhat-related note... Npgsql indeed does not support MARS - you can only have one reader open on a connection at a given point in time. However, PostgreSQL does support server-side SQL cursors, where you start a query with DECLARE and progressively fetch results with FETCH, allowing multiple query resultsets to be traversed simultaneously. At some point I considered implementing MARS support in Npgsql over cursors (via lower-level support in the protocol) but ended up deciding against it for several reasons - and in any case the option is there for users to do so themselves with DECLARE/FETCH. Treating this as a higher-level, SQL-based MARS, in theory EF Core could use it to prevent buffering even when MARS isn't supported - I think it's even part of the SQL standard and implemented across databases. That may be a bit optimistic though: IIRC MySQL only allows cursors in stored procedures, and PostgreSQL itself requires cursors to be in a transaction, which also complicates things.

Just dumping a bit more context/options for us to think about...

divega commented Nov 21, 2018

@roji agreed. Two other “crazy ideas” that come to mind re streaming and MARS:

  1. Someday we could just change how DbContext works to open additional connections when MARS is required but not supported and we are only doing reads.
  2. Assuming you implement connection multiplexing, you could do something homologous at the DbConnection vs. physical connection level, i.e. they don’t need to be 1:1 anymore.

dpsenner commented Nov 21, 2018

I'm glad to see that the EF Core team is open to discussing this issue.

@divega If a DbContext begins to spawn connections, I would raise serious concerns regarding scalability.

As for the two queries mentioned above, I have totally different expectations as to how this should behave.

// this will translate as three SQL queries
db.Blogs.Include(b=>b.Posts).ThenInclude(p=>p.Comments).Where(b=>b.Id == 1).ToList();
// this will translate as a single SQL query
db.Comments.Include(b=>b.Post).ThenInclude(b=>b.Blog).Where(b=>b.Post.Blog.BlogId == 1).ToList();

I expect both LINQ queries to map to one SQL query each. If I did want to run these as separate queries, I could implement it in the application this way:

// first query
var blogs = await dbContext.Blogs.Include(b => b.Posts).ToListAsync();
foreach (var blogPost in blogs.SelectMany(b => b.Posts)) {
  // n queries
  var blogPostComments = await dbContext.Comments.Where(c => c.PostId == blogPost.Id).ToListAsync();
  // do whatever needs to be done
}

or even do fancy stuff with batching:

// fetch blogs from the database in batches
// (InBatchesOf is a hypothetical batching helper; a sketch follows below)
foreach (var blogs in dbContext.Blogs.InBatchesOf(100)) {
  // fetch all blog posts for the set of blogs in one query and aggregate them into a lookup
  var blogPostsLookup = dbContext.Posts.Where(p => blogs.Select(b => b.Id).ToArray().Contains(p.BlogId)).ToLookup(t => t.BlogId, t => t);
  // then process all blog posts for each blog
  foreach (var blog in blogs) {
    foreach (var blogPost in blogPostsLookup[blog.Id]) {
      // fancy stuff
    }
  }
}

This said, LINQ-to-SQL translation should map LINQ to SQL, not do fancy logic.
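
The InBatchesOf helper used above is not an EF Core or LINQ API; a minimal sketch of what such a hypothetical extension could look like (Skip/Take paging, so the source needs a stable OrderBy to page correctly):

// using System.Collections.Generic; using System.Linq;
// Hypothetical batching extension, purely illustrative.
public static class BatchingExtensions
{
    public static IEnumerable<List<T>> InBatchesOf<T>(this IQueryable<T> source, int batchSize)
    {
        var skip = 0;
        while (true)
        {
            // Each iteration issues one paged query against the database.
            var batch = source.Skip(skip).Take(batchSize).ToList();
            if (batch.Count == 0)
                yield break;
            yield return batch;
            skip += batchSize;
        }
    }
}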

benmccallum commented Nov 21, 2018

Thanks for the tips about reversing the query traversal to make a single query.

In my case, I'm translating GraphQL queries down to DB queries via AutoMapper's .ProjectTo<TDto>(fieldsToMap), so the query has to be driven top-down, meaning I'll always get the query splitting, even if I'm only requesting one top-level record (a blog in the examples above).

For my situation to be optimized, I'd need to be able to tell EF that I'm only retrieving one blog somehow - or is it smart enough to recognise when the query filters on a primary key and/or TOP 1? If not, feature request!
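
For reference, the reversed traversal @divega described above, filtered to a single blog id, would look roughly like this (a sketch of a hand-written query; whether it can be combined with the ProjectTo-driven GraphQL pipeline is exactly the open question here):

// Sketch: traverse from the many end so EF Core translates to a single joined query.
var blogId = 1; // the single top-level record being requested
var comments = db.Comments
    .Include(c => c.Post)
        .ThenInclude(p => p.Blog)
    .Where(c => c.Post.Blog.BlogId == blogId)
    .ToList();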

roji commented Nov 21, 2018

@divega

Someday we could just change how DbContext works to open additional connections when MARS is required but not supported and we are only doing reads.

Definitely an option! Although we'd have to watch out for transactions, which also affect reads...

Assuming you implement connection multiplexing, you could do something homologous at the DbConnection vs. physical connection level. I.e. they don’t need to be 1:1 anymore.

Yeah - that resembles your 1st suggestion in a way: since every query can be dispatched to an arbitrary physical connection, there's no more constraint on having two queries on the same connection. The same caveat with transactions applies though.

But multiplexing introduces a problem the other way: if you have multiple queries queued up for different producers on the same connection, and the 2nd producer wants to process results but the 1st one still hasn't, then we have a problem... We can either delay the 2nd one (seems unacceptable) or buffer the 1st resultset in memory - in which case we've killed streaming... I still have to think about multiplexing a bit more, but it definitely creates some issues in this area.

roji commented Nov 21, 2018

@dpsenner, here's my take on your argument:

I expect both linq queries to map to one sql query each. If I did want to run this as separate queries I could implement it in the application in this way:

The problem, again, is that executing this kind of query in a single SQL query can be extremely bad for performance, because of the reasons that have already been explained above: duplication of principal entity fields, leading to tons of unnecessary data being transferred and thrown away. This simply doesn't seem like the right option for the default usecase in an average application.

To summarize, when deciding what EF Core should do by default, the decision seems to be between:

  1. A single SQL query with joins
    a. Pro: totally streaming in all scenarios and providers
    b. Possibly a pro: maintains the one-SQL-per-one-LINQ-query idea. Up to now I'm not really convinced why this should be a principle of EF Core or what actual value it brings to users
    c. Con: potentially huge amounts of needless data transferred because of duplication of principal entity fields
  2. A query per include
    a. Pro: no data duplication
    b. Con: only partially streamable (i.e. with a non-MARS provider)

To me, the decision between the two options isn't trivial. However, since the streaming issue isn't universal (MARS is supported in some providers and could be supported in others), and I don't really subscribe to the idea that one LINQ query should produce one SQL query, the 2nd option really seems superior.

One last point is whether the multiple queries produced in the 2nd option are batched. If they're not, we're adding more database network roundtrips, which isn't good - but this could be fixed in EF Core (#10878 tracks this).

Finally, the idea suggested by @divega above could indeed mitigate the streaming concern when provider MARS isn't available, but I agree with your point that it hogs more connections and so could be problematic.

dpsenner commented Nov 21, 2018

I'm glad that you too share the opinion that entity framework core should not be allowed to spawn connections.

As said, for me Entity Framework is an object-relational mapping framework. As such it is responsible for mapping LINQ to a SQL command and mapping the relational results to an object model. If a LINQ expression is badly formulated, it should result in a NotImplementedException or NotSupportedException. This behavior is comparable to a badly formulated SQL query: nobody would expect the database engine to autocomplete missing aggregate functions in combination with GROUP BY. When faced with an exception, the developer will figure out a way to formulate a LINQ expression that gets the data from the database and mapped into the application server. I know very well that this is a very strict viewpoint that not many people share. But in this viewpoint Entity Framework Core no longer has to implement guesswork on LINQ query trees, like the ordering of property traversals mentioned in the comment by @divega. Further, developers get the benefit of being able to implement sane and stable application servers.

Of course some developers will write bad queries. They will also be able to write code that crashes. Nobody expects the compiler to protect them from doing so.

roji commented Nov 21, 2018

I personally don't really agree with much of what you wrote (my views are my own):

As said, for me entity framework is an object relational mapping framework. As such it is responsible to map linq to a sql command and map the relational results to an object model.

I don't think there's anything in the idea of an ORM that requires a single LINQ query to be translated to a single SQL query - ORMs map objects to relational databases, and can do so in various ways. LINQ specifically is a pretty .NET-specific concept, so it seems odd to say that the concept of ORMs somehow necessarily implies something about how LINQ should be translated.

If a linq expression is badly formulated, it results in a NotImplementedException or NotSupportedException. This behavior is comparable to a badly formulated sql query.

I don't really think the analogy holds well - there's a very big difference between an incorrect expression or query, and the question of how to translate a correct one.

I think it's more appropriate to compare to compilers (or more accurately transpilers), which translate from one language to another. These tools routinely perform optimizations and other manipulations to make sure your code runs in the best way possible - that's one of their main job descriptions. Following your logic we may say that method inlining is a bad idea, because a C# method must always be translated to an IL and assembly method...

Of course some developers will write bad queries.

Agreed, but we're not talking about a bad query, but rather about what happens with the default, standard EF Core join construct. At the end of the day, if the proposal is that EF Core translate the standard join syntax (Include()) via an SQL construct that results in extremely bad performance for the common scenario, then I think everyone will end up losing.

dpsenner commented Nov 21, 2018

everyone will end up losing

Of course this should not happen.

SQL construct that results in extremely bad performance

Bad performance can also happen when a queried database column is missing an index or if the index has bad selectivity.

what happens with the default, standard EF Core join construct

Entity Framework Core should behave consistently and not do unexpected things. To me it was, and still is, unexpected that one LINQ query runs as two SQL queries, because I, being a human, would write it as one SQL query. Including hundreds of columns from different tables and joining them all together will always end up producing huge result sets, and I would not expect Entity Framework Core to do any magic about this. But is it really Entity Framework Core's responsibility to deal with such badly designed queries by loading everything into memory and doing the operation in the application server?

ajcvickers commented Dec 17, 2018

See also #9014, which is essentially a duplicate of this.

dpsenner commented Jan 21, 2019

We have new findings to share. Recently we implemented new features and wanted to fetch an entity, from here on referred to as e1, along with two of its relations, from here on referred to as e2 and e3, in a single view. In this scenario we observe Entity Framework Core running one query to fetch e1 (e1.Id, e1.Name, e1.Class.Id, e1.Class.Name, e1.Class.Meaning, IsArchived), one query to fetch e2 (TagNames) and n+1 queries to fetch e3 (component.Id, component.Amount, component.AmountUnit, component.Name, component.Sequence) - one per e1 - with WHERE @_outer_Id = component.id.

return await Storage
	.Material
	.Where(e1 => e1.Id == id)
	.Select(e1 => new
	{
		e1.Id,
		e1.Name,
		Class = new
		{
			e1.Class.Id,
			e1.Class.Name,
			e1.Class.Meaning,
		},
		IsArchived = e1.Archive.Any(t1 => t1.During.Contains(now)),
		TagNames = e1.Tags.Select(t1 => t1.DomainObjectTag.Name).ToArray(),
		Components = e1.ItemsWhereComposite.Select(component => new
		{
			component.Id,
			component.Amount,
			AmountUnitId = component.AmountUnit.Id,
			AmountUnitName = component.AmountUnit.Name,
			AmountUnitMeaning = component.AmountUnit.Meaning,
			component.Component.Name,
			component.Sequence,
		}).ToArray(),
	})
	.SingleOrDefaultAsync(); // terminal operator assumed - the original snippet ended at the projection

As a human, I would probably write the query (against Postgres) as follows, because it yields the best behavior for our use case:

SELECT e1.id, e1.name, e2.id, e2.name AS class_name, e2.meaning as class_meaning, (
    SELECT CASE
        WHEN EXISTS (
            SELECT 1
            FROM material.material_archive AS t1
            WHERE ((t1.during @> tooling.now_utc()) = TRUE) AND (e1.id = t1.material_id))
        THEN TRUE::bool ELSE FALSE::bool
    END
) AS is_archived,
ARRAY(
	select
	tag.name
	from material.material_tag as material_tag
	inner join fas.domain_object_tag as tag
		on material_tag.domain_object_tag_id = tag.id
	where
		material_tag.material_id = e1.id
) as tag_names,
(
	select array_agg(bill_of_material)
	from
	(
		select
				component.id,
				component.sequence,
				component.amount ,
				unit.name,
				unit.meaning,
				component_material.id,
				component_material.name
		from
			material.bill_of_material_item as component
		inner join material.material as component_material
			on component_material.id = component.ingredient_material_id
		inner join measuring.unit as unit
			on unit.id = component.unit_id
		where
			component.material_id = e1.id
	) bill_of_material
) as components
FROM (
    select * from material.f_filter_material_unarchived(NULL)
) AS e1
INNER JOIN material.material_class AS e2 ON e1.material_class_id = e2.id
ORDER BY e1.id

smitpatel commented Jan 25, 2019

With a few linked issues where multiple queries actually caused problems in the results, we decided to move to a single SQL query for all cases. Yes, it could cause cartesian product explosion, which users would be able to work around with manual loading patterns.

smitpatel removed the needs-design label Jan 25, 2019

smitpatel changed the title from "Two queries with one:many projections, even with Include" to "Use single SQL query instead of split queries" Jan 25, 2019

dpsenner commented Jan 25, 2019

That is sad to hear, but thanks for sharing the plan. With this outline we can adapt and fix our implementation accordingly. Unfortunately, that means we are going to have to write a lot of SQL queries and manual type conversions.

smitpatel commented Jan 25, 2019

@dpsenner - Use of ARRAY in a query was always provider specific and cannot be implemented inside the base EF Core provider. Apart from that, as I mentioned earlier, doing a split query manually is not too much code. It's not ideal for all cases, but given all the pending features, providing multiple strategies to do the same include (like one query vs. many queries) introduces complexity in the code while providing little value. Of course the value proposition could change based on feedback, in which case multiple strategies can be revisited in the future.
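
For readers wondering what that manual pattern looks like, a minimal sketch using the Blog/Post example from earlier in the thread (property names such as BlogId and a settable Posts collection are assumptions):

// Manual "split query": one query per collection, stitched together in memory.
var blogs = db.Blogs.AsNoTracking().ToList();        // 1st query: the principals
var blogIds = blogs.Select(b => b.BlogId).ToList();  // keys captured locally

var postsByBlog = db.Posts.AsNoTracking()            // 2nd query: all related posts at once
    .Where(p => blogIds.Contains(p.BlogId))
    .AsEnumerable()
    .ToLookup(p => p.BlogId);

foreach (var blog in blogs)
    blog.Posts = postsByBlog[blog.BlogId].ToList();  // stitch in memory

// With tracking queries, EF Core's navigation fix-up makes the stitching loop
// unnecessary, as the Tree/Branch/Leaf snippet further down the thread shows.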

roji commented Jan 28, 2019

@dpsenner, regarding your query above, it's not out of the realm of possibility for the Npgsql provider to translate ToArray() in a projection to array_agg() on PostgreSQL (and produce ARRAY() subqueries), but this definitely won't happen anytime soon (and definitely post-EF Core 3.0). As @smitpatel wrote, constructs that are specific to a particular database implementation are always going to be less supported (if at all).

Regardless, while the use of ARRAY() and array_agg() in this way may be elegant, it again creates subqueries which can be extremely expensive - I suspect this will perform much more slowly on the PostgreSQL side than a simple, traditional join (even if the latter may end up sending more data over the network). This is very similar to our discussion in npgsql/Npgsql.EntityFrameworkCore.PostgreSQL#712.

rosslovas commented Mar 13, 2019

With a few linked issues where multiple queries actually caused problems in the results, we decided to move to a single SQL query for all cases. Yes, it could cause cartesian product explosion, which users would be able to work around with manual loading patterns.

Given this, is there (or will there be) a way to load nested collections without looping? For example, given Trees with Branches with Leaves:

context.Entry(tree).Collection(p => p.Branches).Query().Include(b => b.Leaves).Load();

This would work but could result in a "cartesian explosion" after this change (and even now Includes in some situations aren't great because of #12022, which is why I want to move to manual loads now if I can), and this isn't possible:

context.Entry(tree).Collection(p => p.Branches).Load();
context.Entry(tree).Collection(p => p.Branches).Collection(c => c.Leaves).Load();

This would work but obviously does one query per Branch:

context.Entry(tree).Collection(p => p.Branches).Load();
foreach (var branch in tree.Branches)
{
    context.Entry(branch).Collection(b => b.Leaves).Load();
}

I would expect to be able to generate something that will select Leaves where Branch.Id IN (2, 5, 6, 9, ...) and then have EF associate each Leaf that comes back with the right Branch but I'm not aware of any way to do that right now.

smitpatel commented Mar 13, 2019

I would expect to be able to generate something that will select Leaves where Branch.Id IN (2, 5, 6, 9, ...) and then have EF associate each Leaf that comes back with the right Branch but I'm not aware of any way to do that right now.

If you are doing a tracking query then that is very easy to do:

context.Entry(tree).Collection(p => p.Branches).Load();
context.Set<Leaf>().Where(l => tree.Branches.Select(e => e.Id).Contains(l.BranchId)).ToList();

rosslovas commented Mar 13, 2019

Thanks, that'll work for my purposes, although I have to capture tree.Branches.Select(e => e.Id) in a local, otherwise it evaluates the Contains client-side. I think some way to do this directly involving .Load() would be really useful, as I can easily see another developer (or me, if I forget) re-inlining the tree.Branches.Select(e => e.Id) and ruining performance, or just not understanding what the point of evaluating an IQueryable like this is, so I'll have to comment heavily each time I do it.
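
Spelled out, the capture-in-a-local variant being described looks like this (a sketch against EF Core 2.2, following the Tree/Branch/Leaf names above):

// Capturing the ids in a local list makes the Contains translate to a SQL IN (...)
// instead of being evaluated client-side on EF Core 2.x.
var branchIds = tree.Branches.Select(b => b.Id).ToList();
context.Set<Leaf>()
    .Where(l => branchIds.Contains(l.BranchId))
    .ToList(); // tracked Leaf entities are fixed up onto their Branch navigations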

smitpatel commented Mar 13, 2019

I have to capture tree.Branches.Select(e => e.Id) in a local otherwise it evaluates the Contains client-side.

Which version of the package are you using? We have already improved that in 3.0 and it should not cause client eval anymore. There can likely still be bugs, but the plan in 3.0 is to make sure that inlining also does server eval.

rosslovas commented Mar 14, 2019

Which version of the package are you using? We have already improved that in 3.0 and it should not cause client eval anymore.

I'm using 2.2.3. Glad to hear that's been improved.

marekvse commented Apr 20, 2019

I just got a surprise too, seeing my simple query behave differently than in EF6. When I write a query like this, I'm aware of the cartesian product and know it is a much smaller deal than running the sub-query separately for each row.

return (
    from r in DC.OrganizationUserRoles
    where bytForRoleId == 0 || (bytForRoleId > 0 && r.OrgUserRoleId == bytForRoleId)
    orderby r.OrgUserRoleId
    select new OrgRolePermissionInfo() {
        RoleId = r.OrgUserRoleId,
        RoleName = r.OrgUserRoleName,
        Permissions = (
            from dp in r.RolePermissions
            from p in DC.RolePermissions.Where(x => x.OrgId == forOrgId && x.OrgUserRoleId == dp.OrgUserRoleId && x.PermissionId == dp.PermissionId).DefaultIfEmpty()
            where dp.OrgId == 0
            select new OrgRolePermission() {
                PermissionId = dp.Permission.PermissionId,
                Value = p.PermissionValue,
                Types = dp.Permission.PermissionApplicables.ToList()
            }
        ).ToList()
    }
).ToList();
I would think that retrieving the four fields with every Type from PermissionApplicables would be far less costly than issuing a sub-query and running it to get the Types for each RolePermission separately.
This makes me pretty scared, actually, regarding the performance.
As mentioned by some people before, if I thought it would be more efficient to issue separate queries, I'd write it that way. In fact, I would probably write it with Contains to issue only one sub-query and then populate the Types for each row from the one result set I'd receive, which I would suppose (not sure, but I'd think so anyway) would be faster than issuing N separate queries to the database.

Am I wrong about any of this?

(I'm not sure why my code is not displaying properly.. :( Can anyone help with that? I'm not usually commenting much so I'm not sure if there is a bug on this site or I'm doing something wrong by using the <> tool, which inserts backticks, between which I inserted the code.. )

ErikEJ commented Apr 20, 2019

Did you measure??

marekvse commented Apr 20, 2019

No, I did not in this case, as I just found out that it issues so many queries; but going across processes to the database more often than necessary, plus the database compiling each query... I'd definitely expect that to take more time than compiling one query and returning, let's say, 40 bytes of additional data per row.
(Anyone know why my code is not being presented colorized and multiline here?)

marekvse commented Apr 20, 2019

OK, so I've spent a couple of hours measuring the difference...
And sure enough, getting one resultset and then - in my case - manually populating the class hierarchy (something EF6 used to do by itself without a problem) is much faster than calling the database 9 times instead. The average time increase for 9 calls vs 1 call to get the same data is 337%! And if I ask my function to populate all my permission roles (I have 7 in there right now), the difference goes up to 499%, which is of course no surprise, since doing the same thing with 50 (!) queries in EF Core versus the single query EF6 used to get the same result is bound to have an impact. Not a good one either.
And I'm not even mentioning that writing the query backwards to be able to measure this is really not easy, as it is a very unnatural way to design a query. Doing this and debugging it regularly for other queries will take lots of additional time and lots more code... Or could it be done more easily? Here's my testing code from my real app:
(Sorry if it comes out not formatted again. I'm happy to fix it if anyone can point me to what I'm doing wrong in this comment editor.)

            var bytForRoleId = (byte)forRole;
            bytForRoleId = 0;
            var w = new System.Diagnostics.Stopwatch();
            w.Start();

            var q1 = (
                from r in DC.OrganizationUserRoles
                where bytForRoleId == 0 || (bytForRoleId > 0 && r.OrgUserRoleId == bytForRoleId)
                orderby r.OrgUserRoleId
                select new OrgRolePermissionInfo() {
                    RoleId = r.OrgUserRoleId,
                    RoleName = r.OrgUserRoleName,
                    RoleDescr = r.OrgUserRoleDescr,
                    Permissions = (
                        from dp in r.RolePermissions
                        from p in DC.RolePermissions.Where(x => x.OrgId == forOrgId && x.OrgUserRoleId == dp.OrgUserRoleId && x.PermissionId == dp.PermissionId).DefaultIfEmpty()
                        where dp.OrgId == 0
                        select new OrgRolePermission() {
                            Permission = dp.Permission,
                            DefaultValue = dp.PermissionValue,
                            Value = p.PermissionId > 0 ? p.PermissionValue : dp.PermissionValue,
                            Inherited = p.PermissionId > 0 ? false : true,
                            Types = dp.Permission.PermissionApplicables.ToList()
                        }
                    ).ToList()
                }
            ).ToList();

            w.Stop();
            var ms1 = w.ElapsedTicks;

            w.Restart();

            var q2 = (
                from a in DC.PermissionApplicables
                from rp in DC.RolePermissions.Where(x => x.PermissionId == a.PermissionId).DefaultIfEmpty()
                from p in DC.RolePermissions.Where(x => x.OrgId == forOrgId && x.OrgUserRoleId == rp.OrgUserRoleId && x.PermissionId == rp.PermissionId).DefaultIfEmpty()
                from pp in DC.Permissions.Where(x => x.PermissionId == rp.PermissionId)
                from r in DC.OrganizationUserRoles.Where(x => bytForRoleId == 0 || (bytForRoleId > 0 && x.OrgUserRoleId == bytForRoleId)).DefaultIfEmpty()
                where rp.OrgId == 0 && (pp.PermissionId != 0 && pp.PermissionId != 3)
                orderby r.OrgUserRoleId, a.PermissionId, a.CanWhat
                select new Test {
                    RoleId = r.OrgUserRoleId,
                    RoleName = r.OrgUserRoleName,
                    RoleDescr = r.OrgUserRoleDescr,
                    Perm = pp,
                    DefaultValue = rp.PermissionValue,
                    Value = (p.PermissionId > 0 ? p.PermissionValue : rp.PermissionValue),
                    Inherited = p.PermissionId > 0 ? false : true,
                    PermId = a.PermissionId,
                    CanWhat = a.CanWhat,
                    UiText = a.UiText,
                    UiExplanation = a.UiExplanation,
                    Includes = a.Includes,
                    ApplicableToAppUser = a.ApplicableToAppUser
                }
            ).ToList();

            w.Stop();
            var ms2 = w.ElapsedTicks;

            w.Restart();

            Test row;
            byte iLastRoleId = 0;
            byte iLastPermId = 0;
            short iLastCanWhat = 0;
            OrgRolePermissionInfo o = null;
            var lst = new List<OrgRolePermissionInfo>();
            OrgRolePermission rolePerm = null;

            for (var i = 0; i < q2.Count; i++) {
                row = q2[i];
                if(row.RoleId != iLastRoleId) {
                    iLastRoleId = row.RoleId;
                    iLastPermId = 0;
                    iLastCanWhat = 0;

                    o = new OrgRolePermissionInfo() {
                        RoleId = row.RoleId,
                        RoleName = row.RoleName,
                        RoleDescr = row.RoleDescr,
                        Permissions = new List<OrgRolePermission>()
                    };
                    lst.Add(o);
                }

                if(row.PermId != iLastPermId) {
                    iLastPermId = row.PermId;
                    iLastCanWhat = 0;
                    rolePerm = new OrgRolePermission() {
                        Permission = row.Perm,
                        DefaultValue = row.DefaultValue,
                        Value = row.Value,
                        Inherited = row.Inherited,
                        Types = new List<PermissionApplicable>()
                    };
                    o.Permissions.Add(rolePerm);
                }

                if(row.CanWhat != iLastCanWhat) {
                    iLastCanWhat = row.CanWhat;
                    rolePerm.Types.Add(new PermissionApplicable() {
                        PermissionId = row.PermId,
                        ApplicableToAppUser = row.ApplicableToAppUser,
                        CanWhat = row.CanWhat,
                        Includes = row.Includes,
                        UiExplanation = row.UiExplanation,
                        UiText = row.UiText
                    });
                }
            }

            w.Stop();
            var ms3 = w.ElapsedTicks;

marekvse commented Apr 20, 2019

Comments to the above code:
First of all, it is really strange how the code block starts several lines after its actual beginning. But at least it is readable now. Anyway...
The q1 query is the "natural" one - the way it feels natural to write, and the one EF6 would issue a single database query for and fill correctly.
The q2 query is the one I attempted to write backwards. As I mentioned, that was pretty hard to debug. It gives the same results now, which is enough for the measurement, but I'd not be completely confident with it for production code in this form.
The second query is ordered by the entity keys of the involved entities and then iterated over to populate the list of object hierarchies to return.

I need to write queries like this all the time. Yet, according to @roji, this is not a common case? Can I do it better? I'm not happy at all with a 5-fold decrease in performance, or more as more returned objects are expected. With so many queries, the DB server will not be very scalable. :-(

roji commented Apr 20, 2019

@marekvse you may have missed it, but the decision has already been made to move to a single SQL query (with a classical join) for all cases - see this comment above.

FWIW, whether a single query will perform better than one query per joined table depends on your specific query, various environmental concerns (e.g. network latency), etc. It's very easy to show scenarios in which the single-query approach will be very bad for perf. However, it's true that a single query has several non-perf advantages (it always allows streaming regardless of MARS support, and it avoids potential subtle transaction isolation issues), and users should be able to get the previous behavior by coding for it explicitly.

marekvse commented Apr 21, 2019

Yes, I indeed did miss that. Thank you @roji. I just updated from EF Core 2.2.0 to 2.2.4, but I get the same result - the same 50 queries instead of one for the q1 query from the code above. Is there a setting I have to change, or do I somehow need to alter the way the query is written?
Overnight I decided I was going to write a native SQL sproc and get the data that way, which would at least let me write the query naturally and understandably. I'm glad I probably won't have to do that, but I'm not sure what's wrong with my LINQ in q1 to achieve the desired result.

divega commented Apr 21, 2019

@marekvse, the change @roji is referring to is coming in EF Core 3.0.
