Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QuerySet from subquery #16814

Closed
wants to merge 5 commits into from
Closed

Conversation

AivanF
Copy link

@AivanF AivanF commented Apr 29, 2023

1. Problem

Django 4.2 enabled us to filter by calculated values and this is awesome! But for now it's impossible to perform a filter exactly after a filter by annotated value. The following Django ORM code:

Model.objects.all() \
    .annotate(
        ord=Window(
            expression=RowNumber(),
            partition_by=F('related_id'),
            order_by=[F("date_created").desc()]
        )
    ) \
    .filter(ord=1) \
    .filter(date_created__lte=some_datetime)

Leads to this SQL query:

SELECT * 
FROM (
    SELECT 
      id, related_id, values, date_created
      ROW_NUMBER() OVER (
        PARTITION BY related_id
        ORDER BY date_created DESC
      ) AS ord
    FROM model_table
    WHERE date_created <= '2022-02-24 00:00:00+00:00'
)
WHERE ord = 1

However, in some cases the order of filtering statements leads to different result data and hence is highly important.

2. Existing Solutions

To overcome this problem, it's possible to use subquery in WHERE clause:

sub = Model.objects.all() \
    .annotate(
        ord=Window(
            expression=RowNumber(),
            partition_by=F('related_id'),
            order_by=[F('date_created').desc()]
        )
    ) \
    .filter(ord=1)

result = Model.objects.all() \
    .filter(id__in=sub.values_list('id')) \
    .filter(date_created__lte=some_datetime)

Which has bad performance due to hash join.

Or RawSQL could be used:

sub = Model.objects.all() \
    .annotate(
        ord=Window(
            expression=RowNumber(),
            partition_by=F('related_id'),
            order_by=[F('date_created').desc()]
        )
    ) \
    .filter(ord=1)
sql, params = sub.query.sql_with_params()

result = Model.objects.raw(
    'SELECT * FROM ({}) "model_table" WHERE date_created <= %s'.format(sql),
    params + (some_datetime,)
)

But this breaks ORM philosophy and prevents it's features.

There are external libraries solving the problem, but they are limited and unmaintained, e.g PyPI: django-sub-query. Meanwhile, the described logic is pretty simple and could be better implemented inside Django ORM itself.
Honestly, I tried to do it as a separate library, but this was too monstrous. I also aimed to create subqueries without much new code, using existing features that generate FROM (...), but it also was overcomplicated.

3. My Solution with From-Subquery Feature

Now QuerySet provides as_subquery method:

Model.objects.all() \
    .annotate(
        ord=Window(
            expression=RowNumber(),
            partition_by=F('related_id'),
            order_by=[F("date_created").desc()]
        )
    ) \
    .filter(ord=1) \
    .as_subquery() \
    .filter(date_created__lte=some_datetime)

Which leads to a SQL query like:

SELECT id, related_id, values, date_created
FROM (
    SELECT * 
    FROM (
        SELECT 
          id, related_id, values, date_created
          ROW_NUMBER() OVER (
            PARTITION BY related_id
            ORDER BY date_created DESC
          ) AS ord
        FROM model_table
    )
    WHERE ord = 1
) model_table
WHERE date_created <= '2022-02-24 00:00:00+00:00'

To achieve this, I:

  • Added as_subquery method to QuerySet class which created a separate django.db.models.sql.Query object referencing old one as a subquery.
  • Added inner_query property to django.db.models.sql.Query class
  • Removed forced unique aliasing in SQLCompiler.get_qualify_sql. I'm not into this masking stuff, but, as far as I understood, this could be better done with unmasking and masking again... But I don't understand why to add aliases at all?
  • SQLCompiler.get_from_clause returns subquery when it is presented.
  • Checked unit tests got no new fails.
  • Added new test case for the added feature.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! Thank you for your contribution 💪

As it's your first contribution be sure to check out the patch review checklist.

If you're fixing a ticket from Trac make sure to set the "Has patch" flag and include a link to this PR in the ticket!

If you have any design or process questions then you can ask in the Django forum.

Welcome aboard ⛵️!

@shangxiao
Copy link
Contributor

Hi @AivanF & welcome!

All PRs require a ticket and all feature request tickets require some discussion in the Django Forum first.

Is there a ticket that this PR is fixing?

@AivanF
Copy link
Author

AivanF commented Apr 29, 2023

@shangxiao Nope, there is no ticket yet. Google Groups: Django Developers – is it the forum?

@charettes
Copy link
Member

charettes commented Apr 29, 2023

This is already an accepted ticket for this feature so there's no need for a wider discussion, see ticket-24462 and #11715, #12602 for proposed implementations.

This approach won't fly though as this is an abuse of Query.qualify and only supports very basic operations and will fail on most non-trivial queries (alias collisions, aggregation, already nested queries, further annotations that require joining).

What we need here is a set of primitives that resemble what currently lives get_qualify_sql (subquery pushdown, masking, window filtering support) but can be generalized in a way that allows further chaining. get_qualify_sql can get away with a lot of things today because it's terminal in the sense that it evaluated at compilation time. It's another can of worms to put this in place and allow further filtering.

@shangxiao
Copy link
Contributor

@AivanF Yup there is that but the new & preferred method is the forum: Info here:
https://code.djangoproject.com/wiki/DevelopersMailingList

@felixxm
Copy link
Member

felixxm commented May 11, 2023

Closing due to inactivity and lack of follow up.

@felixxm felixxm closed this May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants