This repository has been archived by the owner on Jul 8, 2023. It is now read-only.

Optimizer causes duplicate query when filtering or paginating a related field #159

Closed
vitosamson opened this issue Dec 27, 2022 · 11 comments · Fixed by #235
Labels: bug (Something isn't working), help wanted (Extra attention is needed)



vitosamson commented Dec 27, 2022

Given a schema like:

class Foo(Model):
    pass

class Bar(Model):
    status = CharField()
    foo = ForeignKey(Foo, related_name="bars")

@gql.django.type(Foo)
class FooType:
    id: auto
    bars: List["BarType"]

@gql.django.filters.filter(Bar)
class BarFilter:
    status: str

@gql.django.type(Bar, filters=BarFilter, pagination=True)
class BarType:
    id: auto

and a query like:

{
  foo(id: 1) {
    bars(filters: {status: "whatever"}) {
      id
    }
  }
}

the optimizer causes two DB queries against the bar table. The first happens because the optimizer determines that bars needs to be prefetched, and it does the prefetch without the filters. A second query is then issued with the WHERE clause for the filters.
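Roughly, this is what seems to happen under the hood (an illustrative sketch based on the models above, not the library's actual code):

# 1) the optimizer prefetches the relation, but without the filter:
foo = Foo.objects.prefetch_related("bars").get(pk=1)
# 2) the nested resolver then applies the filter on the related manager; the filtered
#    queryset can't be served from the unfiltered prefetch cache, so it hits the DB again:
bars = foo.bars.filter(status="whatever")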

Aside from being inefficient, this prefetching also causes an error when paginating the related field. Given the following query:

{
  foo(id: 1) {
    bars(pagination: {limit: 2}) {
      id
    }
  }
}

The optimizer will perform the prefetch without any pagination. A second query is then done by slicing the queryset here: https://github.com/strawberry-graphql/strawberry-graphql-django/blob/d57d767dc9574030888b5c36db4869e54bd24aff/strawberry_django/pagination.py#L25

The problem is that because the queryset (.all() on the related manager) has already been evaluated by the optimizer's prefetch, slicing it for pagination returns a list rather than another queryset. The list then propagates down to get_queryset_as_list here:

qs = self.get_queryset(filter_with_perms(qs, info), info, **kwargs)
if not skip_fetch and not isinstance(self, relay.ConnectionField):
    # This is what QuerySet does internally to fetch results.
    # After this, iterating over the queryset should be async safe
    qs._fetch_all()  # type:ignore
return qs

and because _fetch_all doesn't exist on a list, it raises an exception.
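To make the failure mode concrete, here is a minimal reproduction sketch (assuming the schema above and Django's prefetch cache behavior):

foo = Foo.objects.prefetch_related("bars").get(pk=1)
qs = foo.bars.all()   # served from the prefetch cache, i.e. already evaluated
page = qs[:2]         # slicing an evaluated queryset returns a plain list
page._fetch_all()     # AttributeError: 'list' object has no attribute '_fetch_all'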


I think these might be two separate issues, but I discovered the first while investigating the second. Let me know if it would be better to split these into separate GitHub issues.

I'd imagine the pagination issue could be solved with an isinstance(qs, QuerySet) check before calling _fetch_all, but ideally the optimizer would determine whether any filtering/pagination/ordering is needed and apply it in the prefetch the first time, so that a second query isn't performed at all.
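A minimal sketch of that guard, applied to the snippet quoted above (not the library's actual code, just the shape of the check):

from django.db.models import QuerySet

qs = self.get_queryset(filter_with_perms(qs, info), info, **kwargs)
if not skip_fetch and not isinstance(self, relay.ConnectionField) and isinstance(qs, QuerySet):
    # only querysets need (and support) the explicit fetch; prefetched lists are already materialized
    qs._fetch_all()  # type:ignore
return qs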

@bellini666 (Member) commented

Hey @vitosamson ,

Oh, you just hit a corner case the optimizer is not considering.

Yeah, we could add an and isinstance(qs, QuerySet) check to the fetch part, which would avoid the error from propagating, but the pagination certainly needs to be handled better by the optimizer.

Would you like to try to open a PR for those issues?

bellini666 added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Dec 28, 2022
@vitosamson (Author) commented

I started digging into it. I'm able to get the optimizer to include filter fields in the prefetch (mostly; there are still some small issues), but I realized that even with that, it wouldn't prevent the subsequent duplicate query, since strawberry_django doesn't know that a prefetch was performed and has no access to the prefetched queryset. (Please correct me if I'm misunderstanding here.)

So I was thinking a different way to approach this might be to use a DataLoader, such that the optimizer could prime the loader with the rows returned from the prefetch, and the subsequent query could fetch the rows from there... but it seems like that could get real complex real fast. Imagine a query like:

{
  foos {
    greenBars: bars(filters: {color: "green"}) {
      id
    }

    blueBars: bars(filters: {color: "blue"}) {
      id
    }
  }
}

the loader would need to support looking up bars by both foo_id and color (and whatever other filters one might add to that field).
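For illustration, such a composite-key loader might look roughly like this (purely hypothetical code, and a good example of why it gets complex; the color field is taken from the query above, and sync_to_async is needed because strawberry's DataLoader is async):

from collections import defaultdict
from asgiref.sync import sync_to_async
from strawberry.dataloader import DataLoader

async def load_bars(keys: list[tuple[int, str]]) -> list[list[Bar]]:
    # keys are (foo_id, color) pairs; each distinct color still needs its own query,
    # so the composite key only partially batches the work
    results: dict[tuple[int, str], list[Bar]] = defaultdict(list)
    for color in {c for _, c in keys}:
        foo_ids = [fid for fid, c in keys if c == color]
        bars = await sync_to_async(list)(Bar.objects.filter(foo_id__in=foo_ids, color=color))
        for bar in bars:
            results[(bar.foo_id, color)].append(bar)
    return [results[key] for key in keys]

bars_loader = DataLoader(load_fn=load_bars)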

Thoughts?

@bellini666 (Member) commented

@vitosamson when writing this lib I considered using dataloaders instead of prefetch, but it got really complex. Not to mention that dataloaders don't fill the django cache (which keeps some custom resolvers from being able to take advantage of it) and they also don't have proper support for sync resolvers.

The main issue here is that the optimizer is not able to handle nested filtering/ordering/pagination right now. To do that it would probably need to introspect the arguments and produce a Prefetch object with the exact query that the nested resolver would run, so that django can reuse the cached result.
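A rough sketch of that idea, using the Foo/Bar models from the original example (hypothetical; not what the optimizer currently generates):

from django.db.models import Prefetch

# build the prefetch with the same filter the nested resolver would apply,
# storing it on a separate attribute so the unfiltered relation stays untouched
qs = Foo.objects.prefetch_related(
    Prefetch(
        "bars",
        queryset=Bar.objects.filter(status="whatever"),
        to_attr="_filtered_bars",
    )
)
# the nested resolver could then read foo._filtered_bars instead of re-querying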

For small results there's always the option to do that in memory. I have a lot of cases in a project where I sort or filter results directly in python, for example:

@gql.django.type(SomeModel)
class SomeModelType:
    @gql.django.field(prefetch_related=["items"])
    def items(self, root: SomeModel, color: str | None = None) -> list[SomeModelItem]:
        items = root.items.all()
        if color is not None:
            items = [i for i in items if i.color == color]
        return items

This works fine when the expected number of items per root object is small enough, and it also has the advantage of reusing the prefetched items instead of doing 2 prefetches (which, in this case, is an optimization). In your example the optimizer would do 2 queries, one for greenBars and another for blueBars, but here only one query is done and you then filter the results in memory.

But obviously, if the number of items is too big then this solution is not good. For those cases you can do something like this:

@gql.django.type(SomeModel)
class SomeModelType:
    @gql.django.field(prefetch_related=[
        lambda info: Prefetch("items", SomeModelItem.objects.filter(...), to_attr="_filtered_items"),
    ])
    def items(self, root: SomeModel, color: str | None = None) -> list[SomeModelItem]:
        return root._filtered_items

In there you would need to retrieve the filter arguments from info and filter the items accordingly.
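For example, something along these lines could pull the argument out of Info (a sketch; the "items" field name and the color argument are assumptions carried over from the snippet above):

from strawberry.types import Info

def get_color_argument(info: Info) -> str | None:
    # strawberry exposes the resolved selection set, including arguments, on info.selected_fields
    field = next((f for f in info.selected_fields if f.name == "items"), None)
    return field.arguments.get("color") if field is not None else None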

@vitosamson (Author) commented

Got it, thanks.

retrieve the filter arguments from info

Is there a good/supported way to do that? I'm also curious if there are any utilities for determining if a specific field was requested. Currently I'm doing something like this:

@gql.field
def some_related_field(self, info) -> RelatedModelType:
    qs = RelatedModel.objects.filter(...)

    field = next(field for field in info.selected_fields if field.name == "someRelatedField")
    if any(selection.name == "user" for selection in field.selections):
        qs = qs.select_related("user")

    return qs

This is a weird related field that isn't actually related via FKs, so I can't rely on the optimizer to do it for me. I'm wondering if anything currently exists so that I don't have to spell out all the field for field in info.selected_fields if field.name and selection.name for selection in field.selections bits myself, and the same would go for checking for specific arguments.

@bellini666 (Member) commented

Is there a good/supported way to do that? I'm also curious if there are any utilities for determining if a specific field was requested. Currently I'm doing something like this:

The optimizer itself does this. It introspects the many-to-many relations to build prefetch_related calls, and it also nests further prefetch_related calls when there are multiple nested relations.

But when you write your own prefetch you currently have to do it yourself. We can try to improve that in the optimizer, which is related to its support for filtering/pagination.

This is a weird related field that isn't actually related via FKs, so I can't rely on the optimizer to do it for me. I'm wondering if anything currently exists so that I don't have to spell out all the field for field in info.selected_fields if field.name and selection.name for selection in field.selections bits myself, and the same would go for checking for specific arguments.

You can maybe use the optimize function directly on your qs? The only issue here is the n+1 problem, since you would be generating one qs (and thus one db query) per result that has the field. In that case, you can try @gql.django.field(prefetch_related=[lambda info: Prefetch("items", optimize(SomeModelItem.objects.filter(...), info), to_attr="_filtered_items")]), and your resolver would just do a return [i for i in self._filtered_items if <condition>] for the current field.
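Formatted for readability, that suggestion would look roughly like this (a sketch; the "items" relation, the placeholder filter, and the import paths are assumptions):

from django.db.models import Prefetch
from strawberry_django_plus.optimizer import optimize  # import path assumed

@gql.django.field(
    prefetch_related=[
        lambda info: Prefetch(
            "items",
            optimize(SomeModelItem.objects.filter(active=True), info),  # placeholder filter
            to_attr="_filtered_items",
        ),
    ]
)
def items(self, root: SomeModel, color: str | None = None) -> list[SomeModelItem]:
    # reuse the already-optimized prefetch and apply the per-field condition in memory
    return [i for i in root._filtered_items if color is None or i.color == color]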

@vitosamson (Author) commented

Hey @bellini666, I wanted to follow up on this with another related question, though slightly different from the original.

We've got a Paginated[T] type that we're using like this in a custom resolver:

T = TypeVar("T")

@gql.type
class Paginated(Generic[T]):
    items: list[T]
    count: int

    @staticmethod
    def from_queryset(queryset: QuerySet, pagination: PaginationInput):
        paginator = Paginator(queryset, pagination.limit)

        try:
            items = paginator.page(pagination.page)
            return Paginated(items=items.object_list, count=items.paginator.count)
        except EmptyPage:
            return Paginated(items=[], count=0)

@gql.django.type(models.Queue)
class Queue:
    name: gql.auto

    @gql.django.field(field_name="reportdata_set")
    def report_data_set(self, pagination: PaginationInput, info: Info) -> Paginated["QueueReportData"]:
        qs = optimize(
            self.reportdata_set.all(),
            info,
            config=OptimizerConfig(
                enable_only=True,
                enable_select_related=True,
                enable_prefetch_related=True,
            ),
        )
        return Paginated.from_queryset(qs, pagination)

@gql.django.type(models.QueueReportData)
class QueueReportData:
    id: gql.auto  # noqa
    report: Report = gql.django.field()

@gql.type
class Query:
    queues: list[Queue] = gql.django.field()

schema = gql.Schema(query=Query, extensions=[DjangoOptimizerExtension])

and a query like:

{
  queues {
    name
    reportDataSet {
      count
      items {
        id
        report { id }
      }
    }
  }
}

The optimizer doesn't seem to support this sort of configuration. For example, looking at the DB queries made, I see extra queries to fetch the related queue_id for each QueueReportData returned:

SELECT
  DISTINCT "app_queue"."id",
  "app_queue"."name"
FROM
  "app_queue";

SELECT
  COUNT(*) AS "__count"
FROM
  "app_queuereportdata"
WHERE
  "app_queuereportdata"."queue_id" = 4;

SELECT
  "app_queuereportdata"."id"
FROM
  "app_queuereportdata"
WHERE
  "app_queuereportdata"."queue_id" = 4
LIMIT
  1;

-- I believe this is the extra query that's caused by not including queue_id in the initial query
SELECT
  "app_queuereportdata"."id",
  "app_queuereportdata"."queue_id"
FROM
  "app_queuereportdata"
WHERE
  "app_queuereportdata"."id" = 3
LIMIT
  21;   -- not quite sure where this 21 is coming from 🤷 

whereas if I change the resolver to return a simple list like so:

    @gql.django.field(field_name="reportdata_set")
    def report_data_set(self, info: Info) -> list["QueueReportData"]:
        return self.reportdata_set.all()

then the optimizer correctly includes queue_id in the initial query and doesn't require an additional query for each QueueReportData:

SELECT
  DISTINCT "app_queue"."id",
  "app_queue"."name"
FROM
  "app_queue";


SELECT
  "app_queuereportdata"."id",
  "app_queuereportdata"."queue_id"
FROM
  "app_queuereportdata"
WHERE
  "app_queuereportdata"."queue_id" IN (4)

I assume this is because the optimizer sees that info.return_type is that Paginated type, which isn't a strawberry/django type, and therefore it doesn't know how to optimize the queryset I've provided to it.

Wondering if you've got any ideas? Is there some way to coerce info to indicate to the optimizer that it's actually optimizing on the type of the Paginated.items field? Or am I totally off the mark here?

Sorry for the wall of text, I wanted to include enough code examples to illustrate what's going on.

@bellini666 (Member) commented

@vitosamson yes, it is because of the Paginated usage.

The optimizer has the same issues with relay connections, which is handled by this code: https://github.com/blb-ventures/strawberry-django-plus/blob/main/strawberry_django_plus/optimizer.py#L88

Out of curiosity: is there a reason why you defined your own Paginated implementation instead of using relay's Connection? The way you are using it is actually quite similar.

Also, the relay implementation from this repo will soon be merged into strawberry (strawberry-graphql/strawberry#2511). Its usage will be a lot easier/cleaner after that :)

@vitosamson (Author) commented

Is there a reason to why you defined your own Paginated implementation instead of using relay's Connection?

Honestly, mostly because it wasn't really clear how to use it 😅 it would be helpful if the docs contained some examples, though the preview docs for that PR into strawberry have some which look good.

Also, our models don't have globally unique IDs. I'm not sure if that's an absolute requirement for relay. Does the relay implementation require that all of our node types implement the Node interface, and therefore if we want to have a field that fetches one single instance of that type, we'd have to use fragment spreading as in this example?

For example, to retrieve a single fruit given its unique ID:

query {
  node(id: "<some id>") {
    id
    ... on Fruit {
      name
      weight
    }
  }
}

or would we still be able to do something like

query {
  fruit(id: "<some id>") {
    id
    name
    weight
  }
}

@bellini666 (Member) commented

Honestly, mostly because it wasn't really clear how to use it 😅 it would be helpful if the docs contained some examples, though the preview docs for that PR into strawberry have some which look good.

I have to be honest with you, I'm not really good at writing docs =P. I really have to spend some time improving them...

Do you want to open a PR for that? :)

Also, our models don't have globally unique IDs. I'm not sure if that's an absolute requirement for relay.

When you implement the Node interface in an object, it will automatically handle global ids for you. If it is a django type, you don't need to worry about it (just don't define your own id, because the interface already does that). If you have your own object implementing Node, you would need to define its abstract methods (it should become much simpler in the future once [this PR](https://github.com/strawberry-graphql/strawberry/pulls?q=is%3Apr+is%3Aopen+relay) gets merged into strawberry).
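For example, a django type implementing Node could look like this (a minimal sketch; the model name is an assumption based on the fruit example above):

@gql.django.type(models.Fruit)
class Fruit(gql.relay.Node):
    # no explicit id field: the Node interface provides the GlobalID for you
    name: gql.auto
    weight: gql.auto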

Does the relay implementation require that all of our node types implement the Node interface, and therefore if we want to have a field that fetches one single instance of that type, we'd have to use fragment spreading as in this example?

Yes and no =P. If you only define the single node field, you would only be able to retrieve objects that implement the Node interface.

But you can also define your own fruit: Fruit = gql.django.node(), which acts just like a fruit: Fruit field that takes a pk: ID! argument, except that it takes an id: GlobalID! argument instead.
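Side by side, the two field styles would look something like this (a sketch; the field names are arbitrary):

@gql.type
class Query:
    # plain field: resolves a Fruit by primary key, i.e. fruitByPk(pk: ID!)
    fruit_by_pk: Fruit = gql.django.field()
    # node field: resolves a Fruit by GlobalID, i.e. fruit(id: GlobalID!), no fragment spreading needed
    fruit: Fruit = gql.django.node()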

I personally implement Node in all my django (and some non-django) types because there are a lot of advantages to using it. Not only can I paginate any type using a connection, but the global id also makes it easy to receive objects as arguments in a generic way.

e.g. you have a mutation that can receive either a FooType's id or a BarType's id. If you define it as some_resolver(id: ID) you would have to add an extra argument to tell which type it is. If you do some_resolver(id: GlobalID) you have all the info there: id.type_name has the type name, id.node_id has the actual id, and you can even call id.resolve_node(info) to ask relay to retrieve the object for you.
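As a sketch, that generic-argument pattern could look like this (the resolver name and return type are hypothetical; only the GlobalID attributes mentioned above are used):

from strawberry.types import Info

@gql.django.field
def describe(self, info: Info, id: gql.relay.GlobalID) -> str:
    # id.type_name says whether this is a FooType or a BarType id,
    # id.node_id is the underlying id, and resolve_node() asks relay to fetch the object
    obj = id.resolve_node(info)
    return f"{id.type_name}:{id.node_id} -> {obj}"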

It also makes it easier for clients (e.g. apollo client) to cache objects because their ids will always be unique. So refetching them is also easy.

@vitosamson (Author) commented

Oh yeah, that is pretty neat. I think the only blocker for us is that we're integrating graphql into an existing app which already uses page-based pagination, so I'm not sure how straightforward it would be to retrofit it to use cursor-based pagination... but I'll keep playing around with it.

Are there plans to have relay connections automatically support things like filtering, ordering, and optimization? It looks like we'd need to handle all that manually in the connection resolver, is that correct or am I doing something wrong?

@bellini666 (Member) commented Mar 7, 2023

Are there plans to have relay connections automatically support things like filtering, ordering, and optimization? It looks like we'd need to handle all that manually in the connection resolver, is that correct or am I doing something wrong?

That's all already supported =P

When you do some_conn: gql.relay.Connection[SomeType] = gql.django.connection(), its default resolver will run the optimizer as if it were a List[SomeType] and will also add the filters/ordering that the type defines (or you can pass them as arguments to connection()). You only need to define your own methods for the Node subclass if it is not a gql.django.type().

You can even define your own resolver that returns a queryset to customize what goes into the connection, and relay will do the rest for you. Like:

from typing import Iterable, cast

@gql.type
class Query:
    some_conn: gql.relay.Connection[SomeType] = gql.django.connection()

    @gql.relay.connection()
    def some_conn_only_active(self) -> Iterable[SomeType]:
        return cast(Iterable[SomeType], SomeModel.objects.filter(is_active=True))

And that's it.

If you add extra arguments to that resolver, they will be included in the final query, together with first/last/before/after and filters/order if those are used, without you needing to handle them.
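For example, an extra argument on a connection resolver might look like this (a sketch following the snippet above; the color field and names are assumptions):

from typing import Iterable, cast

@gql.type
class Query:
    @gql.relay.connection()
    def some_conn_by_color(self, color: str) -> Iterable[SomeType]:
        # color is exposed as a GraphQL argument and combined with
        # first/last/before/after and any filters/ordering automatically
        return cast(Iterable[SomeType], SomeModel.objects.filter(color=color))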
