This repository has been archived by the owner on Jul 8, 2023. It is now read-only.

Optimizer causes duplicate query when filtering or paginating a related field #159

Closed
vitosamson opened this issue Dec 27, 2022 · 11 comments · Fixed by #235
Labels: bug (Something isn't working), help wanted (Extra attention is needed)



vitosamson commented Dec 27, 2022

Given a schema like:

class Foo(Model):
    pass

class Bar(Model):
    status = CharField()
    foo = ForeignKey(Foo, related_name="bars")

@gql.django.type(Foo)
class FooType:
    id: auto
    bars: List["BarType"]

@gql.django.filters.filter(Bar)
class BarFilter:
    status: str

@gql.django.type(Bar, filters=BarFilter, pagination=True)
class BarType:
    id: auto

and a query like:

{
  foo(id: 1) {
    bars(filters: {status: "whatever"}) {
      id
    }
  }
}

the optimizer causes two DB queries against the bar table. The first happens because the optimizer determines that bars needs to be prefetched, and it does the prefetch without the filters. A second query is then issued with the WHERE clause for the filters.
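Roughly, this is what seems to happen under the hood (an illustrative sketch based on the models above, not the library's actual code):

# 1) the optimizer prefetches the relation, but without the filter:
foo = Foo.objects.prefetch_related("bars").get(pk=1)
# 2) the nested resolver then applies the filter on the related manager; the filtered
#    queryset can't be served from the unfiltered prefetch cache, so it hits the DB again:
bars = foo.bars.filter(status="whatever")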

Aside from being inefficient, this prefetching also causes an error when paginating the related field. Given the following query:

{
  foo(id: 1) {
    bars(pagination: {limit: 2}) {
      id
    }
  }
}

The optimizer will perform the prefetch without any pagination. A second query is then done by slicing the queryset here: https://github.com/strawberry-graphql/strawberry-graphql-django/blob/d57d767dc9574030888b5c36db4869e54bd24aff/strawberry_django/pagination.py#L25

The problem is that because the queryset (.all() on the related manager) has already been evaluated by the optimizer's prefetch, slicing it for pagination returns a list rather than another queryset. The list then propagates down to get_queryset_as_list here:

qs = self.get_queryset(filter_with_perms(qs, info), info, **kwargs)
if not skip_fetch and not isinstance(self, relay.ConnectionField):
    # This is what QuerySet does internally to fetch results.
    # After this, iterating over the queryset should be async safe
    qs._fetch_all()  # type:ignore
return qs

and because _fetch_all doesn't exist on a list, it raises an exception.
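To make the failure mode concrete, here is a minimal reproduction sketch (assuming the schema above and Django's prefetch cache behavior):

foo = Foo.objects.prefetch_related("bars").get(pk=1)
qs = foo.bars.all()   # served from the prefetch cache, i.e. already evaluated
page = qs[:2]         # slicing an evaluated queryset returns a plain list
page._fetch_all()     # AttributeError: 'list' object has no attribute '_fetch_all'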


I think these might be two separate issues, but I discovered the first while investigating the second. Let me know if it would be better to split these into separate GitHub issues.

I'd imagine the pagination issue could be solved with an isinstance(qs, QuerySet) check before calling _fetch_all, but ideally the optimizer would determine whether any filtering/pagination/ordering is needed and apply it in the prefetch the first time, so that a second query isn't performed at all.
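A minimal sketch of that guard, applied to the snippet quoted above (not the library's actual code, just the shape of the check):

from django.db.models import QuerySet

qs = self.get_queryset(filter_with_perms(qs, info), info, **kwargs)
if not skip_fetch and not isinstance(self, relay.ConnectionField) and isinstance(qs, QuerySet):
    # only querysets need (and support) the explicit fetch; prefetched lists are already materialized
    qs._fetch_all()  # type:ignore
return qs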

@bellini666 (Member) commented

Hey @vitosamson ,

Oh, you just hit a corner case the optimizer is not considering.

Yeah, we could add an and isinstance(qs, QuerySet) check to the fetch part, which would avoid the error from propagating, but the pagination certainly needs to be handled better by the optimizer.

Would you like to try to open a PR for those issues?

bellini666 added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Dec 28, 2022
@vitosamson (Author) commented

I started digging into it. I'm able to get the optimizer to include filter fields in the prefetch (mostly; there are still some small issues), but I realized that even with that, it wouldn't prevent the subsequent duplicate query, since strawberry_django doesn't know that a prefetch was performed and has no access to the prefetched queryset. (Please correct me if I'm misunderstanding here.)

So I was thinking a different way to approach this might be to use a DataLoader, such that the optimizer could prime the loader with the rows returned from the prefetch, and the subsequent query could fetch the rows from there... but it seems like that could get real complex real fast. Imagine a query like:

{
  foos {
    greenBars: bars(filters: {color: "green"}) {
      id
    }

    blueBars: bars(filters: {color: "blue"}) {
      id
    }
  }
}

the loader would need to support looking up bars by both foo_id and color (and whatever other filters one might add to that field).
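For illustration, such a composite-key loader might look roughly like this (purely hypothetical code, and a good example of why it gets complex; the color field is taken from the query above, and sync_to_async is needed because strawberry's DataLoader is async):

from collections import defaultdict
from asgiref.sync import sync_to_async
from strawberry.dataloader import DataLoader

async def load_bars(keys: list[tuple[int, str]]) -> list[list[Bar]]:
    # keys are (foo_id, color) pairs; each distinct color still needs its own query,
    # so the composite key only partially batches the work
    results: dict[tuple[int, str], list[Bar]] = defaultdict(list)
    for color in {c for _, c in keys}:
        foo_ids = [fid for fid, c in keys if c == color]
        bars = await sync_to_async(list)(Bar.objects.filter(foo_id__in=foo_ids, color=color))
        for bar in bars:
            results[(bar.foo_id, color)].append(bar)
    return [results[key] for key in keys]

bars_loader = DataLoader(load_fn=load_bars)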

Thoughts?

@bellini666 (Member) commented

@vitosamson when writing this lib I considered using dataloaders instead of prefetch, but it got really complex. Not to mention that dataloaders don't fill the django cache (which keeps some custom resolvers from being able to take advantage of it) and they also don't have proper support for sync resolvers.

The main issue here is that the optimizer is not able to handle nested filtering/ordering/pagination right now. To do that it would probably need to introspect the arguments and produce a Prefetch object with the exact query that the nested resolver would run, so that django can reuse the cached result.
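A rough sketch of that idea, using the Foo/Bar models from the original example (hypothetical; not what the optimizer currently generates):

from django.db.models import Prefetch

# build the prefetch with the same filter the nested resolver would apply,
# storing it on a separate attribute so the unfiltered relation stays untouched
qs = Foo.objects.prefetch_related(
    Prefetch(
        "bars",
        queryset=Bar.objects.filter(status="whatever"),
        to_attr="_filtered_bars",
    )
)
# the nested resolver could then read foo._filtered_bars instead of re-querying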

For small results there's always the option to do that in memory. I have a lot of cases in a project where I sort or filter results directly in python, for example:

@gql.django.type(SomeModel)
class SomeModelType:
    @gql.django.field(prefetch_related=["items"])
    def items(self, root: SomeModel, color: str | None = None) -> list[SomeModelItem]:
        items = root.items.all()
        if color is not None:
            items = [i for i in items if i.color == color]
        return items

This works fine when the expected number of items per root object is small enough, and it also has the advantage of reusing the prefetched items instead of doing 2 prefetches (which, in this case, is an optimization). In your example the optimizer would do 2 queries, one for greenBars and another for blueBars, but here only one query is done and you then filter the results in memory.

But obviously, if the number of items is too big then this solution is not good. For those cases you can do something like this:

@gql.django.type(SomeModel)
class SomeModelType:
    @gql.django.field(prefetch_related=[
        lambda info: Prefetch("items", SomeModelItem.objects.filter(...), to_attr="_filtered_items"),
    ])
    def items(self, root: SomeModel, color: str | None = None) -> list[SomeModelItem]:
        return root._filtered_items

In there you would need to retrieve the filter arguments from info and filter the items accordingly.
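For example, something along these lines could pull the argument out of Info (a sketch; the "items" field name and the color argument are assumptions carried over from the snippet above):

from strawberry.types import Info

def get_color_argument(info: Info) -> str | None:
    # strawberry exposes the resolved selection set, including arguments, on info.selected_fields
    field = next((f for f in info.selected_fields if f.name == "items"), None)
    return field.arguments.get("color") if field is not None else None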

@vitosamson (Author) commented

Got it, thanks.

retrieve the filter arguments from info

Is there a good/supported way to do that? I'm also curious if there are any utilities for determining if a specific field was requested. Currently I'm doing something like this:

@gql.field
def some_related_field(self, info) -> RelatedModelType:
    qs = RelatedModel.objects.filter(...)

    field = next(field for field in info.selected_fields if field.name == "someRelatedField")
    if any(selection.name == "user" for selection in field.selections):
        qs = qs.select_related("user")

    return qs

This is a weird related field that isn't actually related via FKs, so I can't rely on the optimizer to do it for me. I'm wondering if anything currently exists so that I don't have to spell out all the field for field in info.selected_fields if field.name and selection.name for selection in field.selections bits myself, and the same would go for checking for specific arguments.

@bellini666 (Member) commented

Is there a good/supported way to do that? I'm also curious if there are any utilities for determining if a specific field was requested. Currently I'm doing something like this:

The optimizer itself does this. It introspects the many-to-many relations to build prefetch_related calls, and it also nests further prefetch_related calls when there are multiple nested relations.

But when you write your own prefetch you currently have to do it yourself. We can try to improve that in the optimizer, which is related to its support for filtering/pagination.

This is a weird related field that isn't actually related via FKs, so I can't rely on the optimizer to do it for me. I'm wondering if anything currently exists so that I don't have to spell out all the field for field in info.selected_fields if field.name and selection.name for selection in field.selections bits myself, and the same would go for checking for specific arguments.

You can maybe use the optimize function directly on your qs? The only issue here is the n+1 problem, since you would be generating one qs (and thus one db query) per result that has the field. In that case, you can try @gql.django.field(prefetch_related=[lambda info: Prefetch("items", optimize(SomeModelItem.objects.filter(...), info), to_attr="_filtered_items")]), and your resolver would just do a return [i for i in self._filtered_items if <condition>] for the current field.
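Formatted for readability, that suggestion would look roughly like this (a sketch; the "items" relation, the placeholder filter, and the import paths are assumptions):

from django.db.models import Prefetch
from strawberry_django_plus.optimizer import optimize  # import path assumed

@gql.django.field(
    prefetch_related=[
        lambda info: Prefetch(
            "items",
            optimize(SomeModelItem.objects.filter(active=True), info),  # placeholder filter
            to_attr="_filtered_items",
        ),
    ]
)
def items(self, root: SomeModel, color: str | None = None) -> list[SomeModelItem]:
    # reuse the already-optimized prefetch and apply the per-field condition in memory
    return [i for i in root._filtered_items if color is None or i.color == color]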

@vitosamson (Author) commented

Hey @bellini666, I wanted to follow up on this with another related question, though slightly different from the original.

We've got a Paginated[T] type that we're using like this in a custom resolver:

T = TypeVar("T")

@gql.type
class Paginated(Generic[T]):
    items: list[T]
    count: int

    @staticmethod
    def from_queryset(queryset: QuerySet, pagination: PaginationInput):
        paginator = Paginator(queryset, pagination.limit)

        try:
            items = paginator.page(pagination.page)
            return Paginated(items=items.object_list, count=items.paginator.count)
        except EmptyPage:
            return Paginated(items=[], count=0)

@gql.django.type(models.Queue)
class Queue:
    name: gql.auto

    @gql.django.field(field_name="reportdata_set")
    def report_data_set(self, pagination: PaginationInput, info: Info) -> Paginated["QueueReportData"]:
        qs = optimize(
            self.reportdata_set.all(),
            info,
            config=OptimizerConfig(
                enable_only=True,
                enable_select_related=True,
                enable_prefetch_related=True,
            ),
        )
        return Paginated.from_queryset(qs, pagination)

@gql.django.type(models.QueueReportData)
class QueueReportData:
    id: gql.auto  # noqa
    report: Report = gql.django.field()

@gql.type
class Query:
    queues: list[Queue] = gql.django.field()

schema = gql.Schema(query=Query, extensions=[DjangoOptimizerExtension])

and a query like:

{
  queues {
    name
    reportDataSet {
      count
      items {
        id
        report { id }
      }
    }
  }
}

The optimizer doesn't seem to support this sort of configuration. For example, looking at the DB queries made, I see extra queries to fetch the related queue_id for each QueueReportData returned:

SELECT
  DISTINCT "app_queue"."id",
  "app_queue"."name"
FROM
  "app_queue";

SELECT
  COUNT(*) AS "__count"
FROM
  "app_queuereportdata"
WHERE
  "app_queuereportdata"."queue_id" = 4;

SELECT
  "app_queuereportdata"."id"
FROM
  "app_queuereportdata"
WHERE
  "app_queuereportdata"."queue_id" = 4
LIMIT
  1;

-- I believe this is the extra query that's caused by not including queue_id in the initial query
SELECT
  "app_queuereportdata"."id",
  "app_queuereportdata"."queue_id"
FROM
  "app_queuereportdata"
WHERE
  "app_queuereportdata"."id" = 3
LIMIT
  21;   -- not quite sure where this 21 is coming from 🤷 

whereas if I change the resolver to return a simple list like so:

    @gql.django.field(field_name="reportdata_set")
    def report_data_set(self, info: Info) -> list["QueueReportData"]:
        return self.reportdata_set.all()

then the optimizer correctly includes queue_id in the initial query and doesn't require an additional query for each QueueReportData:

SELECT
  DISTINCT "app_queue"."id",
  "app_queue"."name"
FROM
  "app_queue";


SELECT
  "app_queuereportdata"."id",
  "app_queuereportdata"."queue_id"
FROM
  "app_queuereportdata"
WHERE
  "app_queuereportdata"."queue_id" IN (4)

I assume this is because the optimizer sees that info.return_type is that Paginated type, which isn't a strawberry/django type, and therefore it doesn't know how to optimize the queryset I've provided to it.

Wondering if you've got any ideas? Is there some way to coerce info to indicate to the optimizer that it's actually optimizing on the type of the Paginated.items field? Or am I totally off the mark here?

Sorry for the wall of text, I wanted to include enough code examples to illustrate what's going on.

@bellini666 (Member) commented

@vitosamson yes, it is because of the Paginated usage.

The optimizer has the same issues with relay connections, which is handled by this code: https://github.com/blb-ventures/strawberry-django-plus/blob/main/strawberry_django_plus/optimizer.py#L88

Out of curiosity: is there a reason why you defined your own Paginated implementation instead of using relay's Connection? The way you are using it is actually quite similar.

Also, the relay implementation from this repo will soon be merged into strawberry (strawberry-graphql/strawberry#2511). Its usage will be a lot easier/cleaner after that :)

@vitosamson (Author) commented

Is there a reason to why you defined your own Paginated implementation instead of using relay's Connection?

Honestly, mostly because it wasn't really clear how to use it 😅 it would be helpful if the docs contained some examples, though the preview docs for that PR into strawberry have some which look good.

Also, our models don't have globally unique IDs. I'm not sure if that's an absolute requirement for relay. Does the relay implementation require that all of our node types implement the Node interface, and therefore if we want to have a field that fetches one single instance of that type, we'd have to use fragment spreading as in this example?

For example, to retrieve a single fruit given its unique ID:

query {
  node(id: "<some id>") {
    id
    ... on Fruit {
      name
      weight
    }
  }
}

or would we still be able to do something like

query {
  fruit(id: "<some id>") {
    id
    name
    weight
  }
}

@bellini666 (Member) commented

Honestly, mostly because it wasn't really clear how to use it 😅 it would be helpful if the docs contained some examples, though the preview docs for that PR into strawberry have some which look good.

I have to be honest with you, I'm not really good at writing docs =P. I really have to spend some time improving them...

Do you want to open a PR for that? :)

Also, our models don't have globally unique IDs. I'm not sure if that's an absolute requirement for relay.

When you implement the Node interface in an object, it will automatically handle global ids for you. If it is a django type, you don't need to worry about it (just don't define your own id, because the interface already does that). If you have your own object implementing Node, you would need to define its abstract methods (it should become much simpler in the future once [this PR](https://github.com/strawberry-graphql/strawberry/pulls?q=is%3Apr+is%3Aopen+relay) gets merged into strawberry).
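For example, a django type implementing Node could look like this (a minimal sketch; the model name is an assumption based on the fruit example above):

@gql.django.type(models.Fruit)
class Fruit(gql.relay.Node):
    # no explicit id field: the Node interface provides the GlobalID for you
    name: gql.auto
    weight: gql.auto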

Does the relay implementation require that all of our node types implement the Node interface, and therefore if we want to have a field that fetches one single instance of that type, we'd have to use fragment spreading as in this example?

Yes and no =P. If you only define the single node field, you would only be able to retrieve objects that implement the Node interface.

But you can also define your own fruit: Fruit = gql.django.node(), which acts just like a fruit: Fruit field that takes a pk: ID! argument, except that it takes an id: GlobalID! argument instead.
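Side by side, the two field styles would look something like this (a sketch; the field names are arbitrary):

@gql.type
class Query:
    # plain field: resolves a Fruit by primary key, i.e. fruitByPk(pk: ID!)
    fruit_by_pk: Fruit = gql.django.field()
    # node field: resolves a Fruit by GlobalID, i.e. fruit(id: GlobalID!), no fragment spreading needed
    fruit: Fruit = gql.django.node()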

I personally implement Node in all my django (and some non-django) types because there are a lot of advantages to using it. Not only can I paginate any type using a connection, but the global id also makes it easy to receive objects as arguments in a generic way.

e.g. you have a mutation that can receive either a FooType's id or a BarType's id. If you define it as some_resolver(id: ID) you would have to add an extra argument to tell which type it is. If you do some_resolver(id: GlobalID) you have all the info there: id.type_name has the type name, id.node_id has the actual id, and you can even call id.resolve_node(info) to ask relay to retrieve the object for you.
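As a sketch, that generic-argument pattern could look like this (the resolver name and return type are hypothetical; only the GlobalID attributes mentioned above are used):

from strawberry.types import Info

@gql.django.field
def describe(self, info: Info, id: gql.relay.GlobalID) -> str:
    # id.type_name says whether this is a FooType or a BarType id,
    # id.node_id is the underlying id, and resolve_node() asks relay to fetch the object
    obj = id.resolve_node(info)
    return f"{id.type_name}:{id.node_id} -> {obj}"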

It also makes it easier for clients (e.g. apollo client) to cache objects because their ids will always be unique. So refetching them is also easy.

@vitosamson (Author) commented

Oh yeah, that is pretty neat. I think the only blocker for us is that we're integrating graphql into an existing app which already uses page-based pagination, so I'm not sure how straightforward it would be to retrofit it to use cursor-based pagination... but I'll keep playing around with it.

Are there plans to have relay connections automatically support things like filtering, ordering, and optimization? It looks like we'd need to handle all that manually in the connection resolver, is that correct or am I doing something wrong?

@bellini666 (Member) commented Mar 7, 2023

Are there plans to have relay connections automatically support things like filtering, ordering, and optimization? It looks like we'd need to handle all that manually in the connection resolver, is that correct or am I doing something wrong?

That's all already supported =P

When you do some_conn: gql.relay.Connection[SomeType] = gql.django.connection(), its default resolver will run the optimizer as if it were a List[SomeType] and will also add the filters/ordering that the type defines (or you can pass them as arguments to connection()). You only need to define your own methods for the Node subclass if it is not a gql.django.type().

You can even define your own resolver that returns a queryset to customize what goes into the connection, and relay will do the rest for you. Like:

from typing import Iterable, cast

@gql.type
class Query:
    some_conn: gql.relay.Connection[SomeType] = gql.django.connection()

    @gql.relay.connection()
    def some_conn_only_active(self) -> Iterable[SomeType]:
        return cast(Iterable[SomeType], SomeModel.objects.filter(is_active=True))

And that's it.

If you add extra arguments to that resolver, they will be included in the final query, together with first/last/before/after and filters/order if those are used, without you needing to handle them.
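For example, an extra argument on a connection resolver might look like this (a sketch following the snippet above; the color field and names are assumptions):

from typing import Iterable, cast

@gql.type
class Query:
    @gql.relay.connection()
    def some_conn_by_color(self, color: str) -> Iterable[SomeType]:
        # color is exposed as a GraphQL argument and combined with
        # first/last/before/after and any filters/ordering automatically
        return cast(Iterable[SomeType], SomeModel.objects.filter(color=color))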
