Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Smarter paginator default #8278

Open
stof opened this issue Sep 21, 2020 · 8 comments
Open

[RFC] Smarter paginator default #8278

stof opened this issue Sep 21, 2020 · 8 comments

Comments

@stof
Copy link
Member

stof commented Sep 21, 2020

The Paginator class exposes settings to configure how queries are paginated:

  • fetchJoinCollections, which defaults to true
  • useOutputWalker, which defaults to using it if the query does not already have a custom output walker (we can only have one output walker. Not sure things actually work at all if a custom output walker is applied as it would apply on the modified query)
  • the CountWalker (using when not using the output walker) allows to control its DISTINCT flag through a query hint storing a boolean, defaulting to true

The existing defaults are set to maximize compatibility with queries, as they are the one producing the correct result (and supporting the query at all) for most queries. But they are also the one producing the worst queries from a performance point of view.
Note that useOutputWalker controls the implementation used for getting the results of the slice when fetchJoinCollections is true and also the implementation used for getting the count.

Limitations of various configs:

  • fetchJoinCollections = false will produce the wrong results (potentially very badly broken) if the query is actually fetch-joining a collection, due to the fact that an hydrated result spans multiple SQL result in such case.
  • useOutputWalker = false does not support queries using HAVING
  • useOutputWalker = false does not support entities using a composite identifier
  • useOutputWalker = false does not support entities using a foreign key as identifier
  • useOutputWalker = false does not support queries using a field from a toMany relation in the ORDER BY clause
  • useOutputWalker = false does not support queries using multiple entities in FROM

We could have smarter defaults, which would try automatically choose the best behavior (keeping the ability to configure them explicitly for cases where the default cannot be smart enough, as we must still choose correctness and compatibility by default).

For fetchJoinCollections:

  • if the query does not make joins, we are 100% sure we don't fetch join a collection so we can set it to false
  • if the query only does joins on toOne relations, it is also safe to set this flag to false. But this case might be harder to identify based on the query AST.

For useOutputWalker, the cases of composite identifiers or association identifiers in entities can be detected (the walkers already validate those) and checking for HAVING and multiple FROM is easy. Regarding the last case, it is also possible to validate it (the code does it in the tree walker currently). So we could be smarter about using a tree walker. So it would be better to try to use the tree walker as much as possible.

For the CountWalker, setting DISTINCT to false might produce a broken count if there are multiple SQL results for the same hydrated result (which might happen in case of joins)

In practice, the Paginator class is often used wrapped inside a higher-level API (as it is low-level on purpose) like Pagerfanta, Zend Paginator or KnpPaginator. But it can also be used directly (in a custom higher-level API) as done in https://github.com/symfony/demo. Such higher-level APIs would have 3 choices:

  • expose the Doctrine settings while keeping the same defaults
  • expose the Doctrine settings but change the defaults to different hardcoded ones (trading performance against query support by default)
  • expose the Doctrine settings but with smarter defaults (none of the libraries I know about do it)
  • take control of the settings (smart or not smart)

Applying smarter defaults in these higher-level libraries has several drawbacks:

  • it requires duplicating the implementation in all of these
  • it requires them to duplicate internal logic f the Paginator regarding how the validations are done, which is quite tied to the DQL AST (even though it is more stable now that a few years back)
  • it does not help projects using the Paginator tool with their own higher-level API

In practice, libraries are either exposing the same defaults or changing them to a way which does not support all queries. And in most cases, this pushes the burden to projects to discover what they should do to get a good performance.

@greg0ire
Copy link
Member

Apparently, adding a detection for smarter defaults would be very welcome:

described are executed. We hope to automate the detection for this in
the future.

@stof
Copy link
Member Author

stof commented Sep 22, 2020

What makes things worse is that only part of these settings are documented, making things even harder to configure properly.

@mpdude
Copy link
Contributor

mpdude commented Jan 23, 2023

Would it be possible to figure out what settings to use when we'd have the AST from the query parser at hand?

@stof
Copy link
Member Author

stof commented Jan 24, 2023

As described in the issue, most of those things can be figured out as the current code validates those. But nobody has taken the time to implement them until now.

@d-ph
Copy link

d-ph commented Jul 16, 2024

I was about to create a similar RFC github issue about the default performance of the Paginator, and I'm glad I spent 10 minutes to use the "search" function to find this ticket. I reached the exact same conclusions as stof after debugging why simple cruds in my webapp take seconds to load. What's not ideal, however, is seeing that this issue remains opened with limited response from the project managers since 2020.

The following part especially is such a cheap change that it hurts seeing it not being in place already:

if the query does not make joins, we are 100% sure we don't fetch join a collection so we can set it [i.e. the fetchJoinCollections] to false

This alone improved some of my queries from 3.5s to 0.1s.


It took me a while to trace how the Paginator builds its queries, and to understand my options, so in case someone reads this comment and wishes to "just get stuff done" (or wishes to compare approaches on how to improve the situation while this ticket remains open), here's what I did (in my case: with Pagerfanta):

use Pagerfanta\Doctrine\ORM\QueryAdapter;
use Pagerfanta\Pagerfanta;

public function createPagerfantaPaginator(
    QueryBuilder $queryBuilder,
    bool $hasXToManyJoins = null,
    // ... other args
): Pagerfanta {
    $query = $queryOrQueryBuilder->getQuery();
    
    /*
     * Case when the query has any joins, any of which potentially being [X]ToMany: run the more expensive
     * paginator queries.
     */
    $hasXToManyJoins ??= count($queryOrQueryBuilder->getDQLParts()['join']) > 0;
    
    if (!$hasXToManyJoins) {
        $query->setHint(CountWalker::HINT_DISTINCT, false);
    }
    
    $adapter = new QueryAdapter(
        $query,
        fetchJoinCollection: $hasXToManyJoins,
        useOutputWalkers: $hasXToManyJoins,
    );

    // ... continue creating the Pagerfanta paginator using the $adapter created above
}

I'm looking forward to something along the above lines being implemented in the Doctrine Paginator itself, especially (and as stof already mentioned) because:

  1. This affect everyone, including the higher-level Paginator tools. It takes everyone's time to discover this problem and to action it.
  2. The Paginator already accepts an instance of QueryBuilder in its constructor, so it already is in a position to choose better defaults based on the wealth of information that the QueryBuilder conveys about the DQL query being built.

Thank you.

@greg0ire
Copy link
Member

greg0ire commented Jul 16, 2024

What's not ideal, however, is seeing that this issue remains opened with limited response from the project managers since 2020.

Ok, I'll be more precise then: this seems like a good idea, and I would be open to reviewing PRs implementing the changes described above, or improving the documentation to include a description of the current settings, as well as merging them of course. Just make sure to send reasonably-sized PRs, with the lowest hanging fruits / less risky stuff dealt with first please.

@d-ph
Copy link

d-ph commented Jul 19, 2024

Thanks for clarifying, and apologies if I sounded too belligerent.

I could spend some time to write up a PR for further discussion. I would be grateful however if you could look at my proposition below and give an "initial ok" for it, so that a rough direction is initially deemed acceptable.

This entire ticket is essentially (and I suspect stof would agree) about doing the following:

If a query doesn't have any joins, or all the joins are *ToOne, then the following defaults should be used:

  • "fetchJoinCollections" should be set to false
  • "useOutputWalker" should be set to false (unless the query is also subject to the caveats mentioned by stof, like using a HAVING clause)
  • "CountWalker::HINT_DISTINCT" hint should be set to false (which would be effective only if "useOutputWalker" is also false)

Proposition:

  1. The $fetchJoinCollections argument in Paginator::__construct() becomes nullable and is defaulted to null. When the Paginator is constructed with that argument set to null, an "auto-detection" is going to be carried out by the Paginator on whether the argument should be true or false. Likewise, when $useOutputWalker is null (and the query does not use a "custom output walker"), then an auto-detection of that argument is also going to be done. In the same vain, when the query does not have a CountWalker::HINT_DISTINCT set on it, and the $useOutputWalker is concluded to be false, then an auto-detection of whether the hint should be set to true or false is also going to be carried out.

  2. In the first PR, the auto-detection is going to be attempted only when a QueryBuilder (instead of a Query) is supplied to the Paginator.

2.1. When an instance of a Query is supplied, and the $fetchJoinCollections is set to null, then $fetchJoinCollections is going to be set back to true (which is the current behaviour).

  1. If the QueryBuilder has no joins, then all the defaults mentioned at the beginning of this comment are going to be set.

  2. I will spend 30 minutes to see whether it's JustTM possible to get the information from the QueryBuilder whether the joins are *ToMany or not. If that's easy to do, this would also be done in the first PR.


Bottom line: very simple code change. No AST interpretation and modification. Pretty much what I did in the code snippet mentioned earlier.

Thoughts?

@greg0ire
Copy link
Member

On paper it sounds OK, I think we might want to make this opt-in at least at first, so that people can try it and report back, then default to null in a next minor or even major if we want to be careful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants