Optimizer causes duplicate query when filtering or paginating a related field #159
Comments
Hey @vitosamson, oh, you just hit a corner case the optimizer is not considering. Yeah, we could add an Would you like to try to open a PR for those issues?
I started digging into it. I'm able to get the optimizer to include filter fields in the prefetch (mostly, still some small issues), but I realized that even with that, it wouldn't prevent unnecessary subsequent queries from being duplicated since
So I was thinking a different way to approach this might be to use a
{
  foos {
    greenBars: bars(filters: {color: "green"}) {
      id
    }
    blueBars: bars(filters: {color: "blue"}) {
      id
    }
  }
}
the loader would need to support looking up
Thoughts?
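To make the loader idea a bit more concrete, here is a minimal sketch in plain Python (no Django or Strawberry involved). It assumes each loader key is a (foo_id, filters) pair and groups keys by filter set, so one lookup runs per distinct filter set rather than one per parent. BARS, QUERY_LOG, fetch_bars and batch_load_bars are hypothetical names, and fetch_bars only simulates what would be a single filtered IN query.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the bar table.
BARS = [
    {"id": 1, "foo_id": 1, "color": "green"},
    {"id": 2, "foo_id": 1, "color": "blue"},
    {"id": 3, "foo_id": 2, "color": "green"},
]

QUERY_LOG = []  # records each simulated DB query


def fetch_bars(foo_ids, filters):
    """Simulate one query: WHERE foo_id IN (...) AND <filters>."""
    QUERY_LOG.append((tuple(sorted(foo_ids)), filters))
    wanted = dict(filters)
    return [
        bar
        for bar in BARS
        if bar["foo_id"] in foo_ids
        and all(bar[name] == value for name, value in wanted.items())
    ]


def batch_load_bars(keys):
    """Resolve keys of the form (foo_id, filters).

    filters is a hashable tuple of (field, value) pairs. One simulated
    query runs per distinct filter set, and results come back in the
    same order as keys.
    """
    foo_ids_by_filters = defaultdict(set)
    for foo_id, filters in keys:
        foo_ids_by_filters[filters].add(foo_id)

    results = {}
    for filters, foo_ids in foo_ids_by_filters.items():
        grouped = defaultdict(list)
        for bar in fetch_bars(foo_ids, filters):
            grouped[bar["foo_id"]].append(bar)
        for foo_id in foo_ids:
            results[(foo_id, filters)] = grouped[foo_id]
    return [results[key] for key in keys]
```

With the aliased query above, the green and blue selections would become two filter sets, so this sketch issues two lookups in total regardless of how many foos are returned.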
@vitosamson when writing this lib I considered using dataloaders instead of prefetch, but it got really complex. Not to mention that it doesn't fill the Django cache (which leaves some custom resolvers unable to optimize that), and it also doesn't have proper support for sync resolvers.
The main issue here is that the optimizer is not able to handle nested filtering/ordering/pagination right now. To do that it would probably need to introspect the arguments and produce a
For small results there's always the option of doing that in memory. For example, I have a lot of cases in a project where I sort or filter results directly in Python:
@gql.django.type(SomeModel)
class SomeModelType:
    @gql.django.field(prefetch_related=["items"])
    def items(self, root: SomeModel, color: str | None = None) -> list[SomeModelItem]:
        # Reuses the prefetched items; the filtering happens in Python.
        items = root.items.all()
        if color is not None:
            items = [i for i in items if i.color == color]
        return items
This works fine when the expected number of items per root object is small enough, and also has the advantage of being able to reuse the prefetched items instead of doing 2 prefetches (which, in this case, is an optimization). In your example, the optimizer would do 2 queries, one for
But obviously, if the number of items is too big then this solution is not good. For those cases you can do something like this:
@gql.django.type(SomeModel)
class SomeModelType:
    @gql.django.field(prefetch_related=[
        # The Prefetch lookup is the relation name, passed as a string.
        lambda info: Prefetch("items", SomeModelItem.objects.filter(...), to_attr="_filtered_items"),
    ])
    def items(self, root: SomeModel, color: str | None = None) -> list[SomeModelItem]:
        return root._filtered_items
In there you would need to retrieve the filter arguments from
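As a rough illustration of the in-memory option, the sketch below uses plain Python objects in place of model instances to show how a single already-fetched list can serve two differently filtered aliases of the same field. Item and filter_items are hypothetical names, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Stand-in for a prefetched model instance."""
    id: int
    color: str


def filter_items(prefetched, color=None):
    """Filter an already-fetched list in Python; no further query is needed."""
    if color is None:
        return list(prefetched)
    return [item for item in prefetched if item.color == color]


# One prefetched list serves both aliases of the same field.
prefetched = [Item(1, "green"), Item(2, "blue"), Item(3, "green")]
green_ids = [item.id for item in filter_items(prefetched, "green")]
blue_ids = [item.id for item in filter_items(prefetched, "blue")]
```

The cost is that every item for each root object is loaded, which is why this only makes sense when that number stays small.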
Got it, thanks.
Is there a good/supported way to do that? I'm also curious if there are any utilities for determining if a specific field was requested. Currently I'm doing something like this:
@gql.field
def some_related_field(self, info) -> RelatedModelType:
    qs = RelatedModel.objects.filter(...)
    field = next(field for field in info.selected_fields if field.name == "someRelatedField")
    if any(selection.name == "user" for selection in field.selections):
        qs = qs.select_related("user")
    return qs
This is a weird related field that actually isn't related via FKs so I can't rely on the optimizer to do it for me. I'm wondering if anything currently exists such that I don't have to spell out all the
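The manual lookup above could be wrapped in a small helper. This is only a sketch: Selection is a stand-in for the entries of info.selected_fields (assumed to expose name and selections, as the snippet above uses them), is_selected is a hypothetical helper rather than an existing utility, and fragments are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class Selection:
    """Stand-in for a selected field: a name plus its nested selections."""
    name: str
    selections: list = field(default_factory=list)


def is_selected(selections, path):
    """Return True if the dotted path (e.g. "someRelatedField.user") was requested."""
    head, _, rest = path.partition(".")
    for selection in selections:
        # Entries without a name attribute are skipped.
        if getattr(selection, "name", None) == head:
            if not rest or is_selected(selection.selections, rest):
                return True
    return False
```

A resolver could then call something like is_selected(info.selected_fields, "someRelatedField.user") before adding select_related("user").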
The optimizer itself does this. It will introspect the many to many relations for prefetch_related and also prefetch_related inside it if there are multiple nested relations. But when you write your own prefetch you have to do it yourself currently. We can try to improve that in the optimizer, which is related to the support for filtering/pagination in it.
You can maybe use the optimize function directly in your qs? The only issue here is the n+1 problem, since you are going to generate one qs (and thus, one db query) per result that has the field. In that case, you can try to use the
Hey @bellini666, I wanted to follow up on this with another related question, though slightly different from the original. We've got a
from typing import Generic, TypeVar

from django.core.paginator import EmptyPage, Paginator
from django.db.models import QuerySet
from strawberry.types import Info
from strawberry_django_plus import gql
from strawberry_django_plus.optimizer import DjangoOptimizerExtension, OptimizerConfig, optimize

# models, Report and PaginationInput are defined elsewhere in the project (not shown).

T = TypeVar("T")

@gql.type
class Paginated(Generic[T]):
    items: list[T]
    count: int

    @staticmethod
    def from_queryset(queryset: QuerySet, pagination: PaginationInput):
        paginator = Paginator(queryset, pagination.limit)
        try:
            items = paginator.page(pagination.page)
            return Paginated(items=items.object_list, count=items.paginator.count)
        except EmptyPage:
            return Paginated(items=[], count=0)

@gql.django.type(models.Queue)
class Queue:
    name: gql.auto

    @gql.django.field(field_name="reportdata_set")
    def report_data_set(self, pagination: PaginationInput, info: Info) -> Paginated["QueueReportData"]:
        qs = optimize(
            self.reportdata_set.all(),
            info,
            config=OptimizerConfig(
                enable_only=True,
                enable_select_related=True,
                enable_prefetch_related=True,
            ),
        )
        return Paginated.from_queryset(qs, pagination)

@gql.django.type(models.QueueReportData)
class QueueReportData:
    id: gql.auto  # noqa
    report: Report = gql.django.field()

@gql.type
class Query:
    queues: list[Queue] = gql.django.field()

schema = gql.Schema(query=Query, extensions=[DjangoOptimizerExtension])
{
  queues {
    name
    reportDataSet {
      count
      items {
        id
        report { id }
      }
    }
  }
}
The optimizer doesn't seem to be able to support this sort of configuration. For example, looking at the DB queries made, I see duplicate queries for the related field id
SELECT
DISTINCT "app_queue"."id",
"app_queue"."name"
FROM
"app_queue";
SELECT
COUNT(*) AS "__count"
FROM
"app_queuereportdata"
WHERE
"app_queuereportdata"."queue_id" = 4;
SELECT
"app_queuereportdata"."id"
FROM
"app_queuereportdata"
WHERE
"app_queuereportdata"."queue_id" = 4
LIMIT
1;
-- I believe this is the extra query that's caused by not including queue_id in the initial query
SELECT
"app_queuereportdata"."id",
"app_queuereportdata"."queue_id"
FROM
"app_queuereportdata"
WHERE
"app_queuereportdata"."id" = 3
LIMIT
21; -- not quite sure where this 21 is coming from 🤷
whereas if I change the resolver to return a simple list like so:
@gql.django.field(field_name="reportdata_set")
def report_data_set(self, info: Info) -> list["QueueReportData"]:
    return self.reportdata_set.all()
then the optimizer correctly includes
SELECT
DISTINCT "app_queue"."id",
"app_queue"."name"
FROM
"app_queue";
SELECT
"app_queuereportdata"."id",
"app_queuereportdata"."queue_id"
FROM
"app_queuereportdata"
WHERE
"app_queuereportdata"."queue_id" IN (4)
I assume this is because the optimizer sees that
Wondering if you've got any ideas? Is there some way to coerce
Sorry for the wall of text, I wanted to include enough code examples to illustrate what's going on.
@vitosamson yes, it is because of the
The optimizer has the same issues with relay connections, which are handled by this code: https://github.com/blb-ventures/strawberry-django-plus/blob/main/strawberry_django_plus/optimizer.py#L88
Out of curiosity: is there a reason why you defined your own
Also, the relay implementation from this repo will soon be merged into strawberry (strawberry-graphql/strawberry#2511). Its usage will be a lot easier/cleaner after that :)
Honestly, mostly because it wasn't really clear how to use it 😅 It would be helpful if the docs contained some examples, though the preview docs for that PR into strawberry have some which look good.
Also, our models don't have globally unique IDs. I'm not sure if that's an absolute requirement for relay. Does the relay implementation require that all of our node types implement the
query {
  node(id: "<some id>") {
    id
    ... on Fruit {
      name
      weight
    }
  }
}
or would we still be able to do something like
query {
  fruit(id: "<some id>") {
    id
    name
    weight
  }
}
I have to be honest with you, I'm not really good at writing docs =P. I really have to spend some time improving them... Do you want to open a PR for that? :)
When you implement the
Yes and no =P. If you define a single But you can also define your own I personally implement e.g. you have a mutation that can receive both a It also makes it easier for clients (e.g. apollo client) to cache objects because their ids will always be unique. So refetching them is also easy.
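To illustrate why such ids end up unique even when the models' primary keys are not, here is a small sketch. It assumes the common Relay convention of base64-encoding "TypeName:pk"; to_global_id and from_global_id are hypothetical helper names, not the library's API.

```python
import base64


def to_global_id(type_name, pk):
    """Encode a type name and primary key into one opaque, unique id."""
    raw = f"{type_name}:{pk}"
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def from_global_id(global_id):
    """Decode an id back into (type_name, pk); pk comes back as a string."""
    raw = base64.b64decode(global_id.encode("ascii")).decode("utf-8")
    type_name, _, pk = raw.partition(":")
    return type_name, pk
```

Two rows that share pk 1 in different tables get different ids this way, and the decoded type name tells a single node-style lookup which model to query.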
Oh yeah, that is pretty neat. I think probably the only blocker for us is that we're integrating graphql into an existing app which already uses page-based pagination, so I'm not sure how straightforward it would be to retrofit it to use cursor-based pagination... but I'll keep playing around with it. Are there plans to have relay connections automatically support things like filtering, ordering, and optimization? It looks like we'd need to handle all that manually in the connection resolver, is that correct or am I doing something wrong?
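One possible way to bridge the two styles, sketched under assumptions: if the connection uses offset-based cursors (assumed here to be base64 of "arrayconnection:<offset>", which should be checked against the actual implementation), a page/limit pair can be translated into first/after arguments. CURSOR_PREFIX, offset_to_cursor and page_to_connection_args are hypothetical names.

```python
import base64

CURSOR_PREFIX = "arrayconnection"  # assumed offset-cursor prefix


def offset_to_cursor(offset):
    """Encode a zero-based item offset as an opaque cursor."""
    raw = f"{CURSOR_PREFIX}:{offset}"
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def page_to_connection_args(page, limit):
    """Translate a 1-based page and a page size into first/after arguments."""
    if page < 1 or limit < 1:
        raise ValueError("page and limit must be positive")
    skipped = (page - 1) * limit
    # "after" points at the last item of the previous page; None means start.
    after = offset_to_cursor(skipped - 1) if skipped else None
    return {"first": limit, "after": after}
```

This only holds while the underlying ordering is stable between requests, which is the usual caveat of offset-style cursors.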
That's all already supported =P When you do
You can even define your own resolver and return a queryset on it to customize what comes in the connection, and the relay will do the rest for you. Like:
@gql.type
class Query:
    some_conn: gql.relay.Connection[SomeType] = gql.django.connection()

    @gql.relay.connection()
    def some_conn_only_active(self) -> Iterable[SomeType]:
        return cast(Iterable[SomeType], SomeModel.objects.filter(is_active=True))
And that's it. If you add extra arguments to that resolver they will be included in the final query, together with
Given a schema like:
and a query like:
the optimizer will cause two DB queries against the bar table. The first is because the optimizer determines that we need to prefetch bars, and it does so without the filters. A second query is then done with the WHERE clause for the filters.
Aside from being inefficient, this prefetching also causes an error when paginating the related field. Given the following query:
The optimizer will perform the prefetch without any pagination. A second query is then done by slicing the queryset here: https://github.com/strawberry-graphql/strawberry-graphql-django/blob/d57d767dc9574030888b5c36db4869e54bd24aff/strawberry_django/pagination.py#L25
The problem is that because the queryset (.all() on the related manager) has already been evaluated by the optimizer's prefetch, slicing it for pagination returns a list rather than another queryset. The list then propagates down to get_queryset_as_list here:
strawberry-django-plus/strawberry_django_plus/field.py
Lines 261 to 266 in a635956
and because _fetch_all doesn't exist on a list, it raises an exception.
I think these might be two separate issues but I discovered the first while investigating the second. Let me know if it would be better to split these into separate GitHub issues.
I'd imagine the pagination issue could be solved with an isinstance(qs, QuerySet) check before calling _fetch_all, but preferably the optimizer would determine if we need to do any filtering/pagination/ordering and do that in the prefetch the first time so that a second query isn't performed at all.
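A minimal sketch of the suggested guard, using a stand-in class instead of a real Django QuerySet so it runs on its own. FakeQuerySet and get_results_as_list are hypothetical names; the real check would be isinstance(qs, QuerySet) inside the library's own helper.

```python
class FakeQuerySet:
    """Stand-in for a Django QuerySet, only to show the control flow."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._result_cache = None

    def _fetch_all(self):
        # Evaluate once and cache, as a real queryset would.
        if self._result_cache is None:
            self._result_cache = list(self._rows)

    def __getitem__(self, key):
        # Slices only. Mirrors the behavior described above: once the
        # queryset has been evaluated, slicing returns a plain list.
        if self._result_cache is not None:
            return self._result_cache[key]
        return FakeQuerySet(self._rows[key])


def get_results_as_list(qs):
    """Return results as a list, calling _fetch_all only on querysets."""
    if isinstance(qs, FakeQuerySet):
        qs._fetch_all()
        return qs._result_cache
    # Already a list (e.g. a slice of a prefetched queryset): nothing to fetch.
    return list(qs)
```

This avoids the exception but not the redundant prefetch, so it addresses only the second of the two problems described.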