[Feature] Avoid wrapping in subqueries where possible and respect `order by` clauses with `dbt show --limit` #9600

joellabes · 2024-02-19T23:35:41Z

Is this a new bug in dbt-core?

I believe this is a new bug in dbt-core
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

At the moment, the limit flag on dbt show is achieved by wrapping the main query in a subquery, so:

select *
    from (
        with summarised as (select * from analytics_dev.dbt_jlabes.messages)
        
        select 
            len(text), 
            *  
        from summarised 
        order by len(text) asc
    ) as model_limit_subq
    limit 500

instead of the more idiomatic

with summarised as (select * from analytics_dev.dbt_jlabes.messages)

select 
    len(text), 
    *  
from summarised 
order by len(text) asc
limit 500

Because of this, the output isn't always correctly sorted (screenshot from the Cloud IDE, but behaviour is consistent across Cloud IDE, Cloud CLI and Core CLI)

Expected Behavior

Sort order to be respected. The easiest way to do this would be to not wrap in a subquery and instead append the limit directly, as shown above.

I imagine that we're doing it this way to abstract over differences in syntax across adapters (lookin at you, SQL Server)

(Potentially we're also doing it to avoid dbt show --inline "select * from my_table limit 10" --limit 100 from rendering two limit clauses and causing an error. But IMO if you do that, you should get the error from the DWH and then remove one of your limits.)

Steps To Reproduce

Unfortunately it reproduces sporadically - I'd hoped to provide you a basic query like dbt_utils.generate_series and sort by that, but I think there's something to do with how each DWH distributes the results that makes this tricky to force into existence.

Relevant log output

No response

Environment

- dbt: 1.6.latest and 1.7.0

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2024-02-20T19:48:55Z

Thanks for raising this @joellabes !

Agreed, it makes sense that we'd want to respect order by within the underlying query.

We'd also want to respect limit within the underlying query too though. If we make the proposed change as-is, I think we'd be at risk of breaking preview queries that are working currently. So I'm going to re-categorize this as a feature request.

Workaround

The standard definition is:

{% macro default__get_limit_subquery_sql(sql, limit) %}
    select *
    from (
        {{ sql }}
    ) as model_limit_subq
    limit {{ limit }}
{% endmacro %}

You can get the behavior you're after by creating a file within your project named something like macros/overrides/get_limit_subquery_sql.sql:

{% macro default__get_limit_subquery_sql(sql, limit) %}
    {{ sql }}
    limit {{ limit }}
{% endmacro %}

joellabes · 2024-02-20T19:58:15Z

ooooh i like that workaround. on my way to propose a PR in internal-analytics 👀

I think we'd be at risk of breaking preview queries that are working currently.

@dbeatty10 can you clarify what you mean here? Aren't preview queries by definition ephemeral (not in the materialization sense) and unbreakable?

dbeatty10 · 2024-02-20T20:27:56Z

Like you mentioned originally, any query that uses a limit won't work with the macro override above.

Here's an example that doesn't work:

dbt show --inline "select 1 as id limit 3"

joellabes · 2024-02-20T20:30:40Z

OK got it! I guess my pitch is that previews don't have to be protected from that happening - if you accidentally wind up issuing select 1 as id limit 3 limit 100, you solve it by going "oops!" and removing the --limit flag (or the limit statement from your actual query, as appropriate)

dsmdavid · 2024-03-15T09:43:49Z

Hi @dbeatty10 splitting hairs here :-) I don't know if I would say this is a feature request or a bug fixing. Yes, I hear you that some queries currently working would stop working (the double limit queries); however, I could also frame it as there are queries that used to work as expected before that are now not throwing errors but returning incorrect results compared to previous versions (e.g. if you downgrade the environment to 1.6 they would work as expected) so I would consider this more of a regression bug that a new feature 😅
Appreciate the workarounds. Also there's another (at least for Snowflake reported by Jack Nguyen https://getdbt.slack.com/archives/CBSQTAPLG/p1698024014827109?thread_ts=1698004749.652379&cid=CBSQTAPLG )
using a explicit limit inside the query instead of relying on dbt's wrap.

dbeatty10 · 2024-03-16T00:44:34Z

Very good points @dsmdavid -- we could reasonably categorize it as any of those.

We've taken a couple swings at this:

initially using the PEP249 / DB API 2.0 python driver methods in fix #7390: push down limit filtering to adapter #7545
then switching to adding a SQL LIMIT clause in Inline limit in SQL sent from dbt show #8641

The latter has interactions with ORDER BY clauses that is the subject of this issue.

Ideal

In the most ideal scenario, we'd just do the following:

Add a limit if none exists in the SQL. If a limit already exists in the SQL, and it is less than the provided CLI/Cloud limit, then there's nothing to do. Otherwise, replace the preexisting SQL limit with the CLI/Cloud limit.

The devil would be in the implementation details due to the query re-writing, so this could be a large lift. We'd probably opt for a more practical option that is more easy to implement, like below.

Less ideal, but more practical

See below for a not-quite-perfect workaround that seems reasonably practical.

There's two variants we could choose between when applying LIMIT to the end of the query, each with their own trade-offs:

Don't use a subquery -- accepting the trade-off that a pre-existing LIMIT clause will lead to a syntax error.
Use a subquery -- accepting the trade-off that any pre-existing ORDER BY clause may be clobbered silently.

Is there any way that we have the best of both words without re-writing the query? I don't think so.

But we can do better than just choosing one to apply to all scenarios. Rather, we can dynamically choose one or other depending on the situation.

Proposal

I'd propose that we choose between these two (imperfect!) strategies:

IFF there is a pre-existing ORDER BY clause, Don't use a subquery.
IFF there is a pre-existing LIMIT clause, Use a subquery.

Both variants works great in all scenarios except the following edge case:

select 1 as id
order by id
limit 3

Either one will break down in this scenario, but each in a different way: Using variant 1 will lead to a syntax error, whereas for variant 2, the underlying database may clobber the order by clause.

Code

Code to override get_show_sql, etc:

macros/overrides/show.sql

{% macro get_show_sql(compiled_code, sql_header, limit) -%}
  {%- if sql_header -%}
  {{ sql_header }}
  {%- endif -%}
  {%- if limit is not none -%}
    {%- if use_limit_subquery(compiled_code) is sameas true -%}
      {{ get_limit_subquery_sql(compiled_code, limit) }}
    {%- else -%}
      {{ get_limit_sql(compiled_code, limit) }}
    {%- endif -%}
  {%- else -%}
  {{ compiled_code }}
  {%- endif -%}
{% endmacro %}

{% macro get_limit_subquery_sql(sql, limit) %}
  {{ adapter.dispatch('get_limit_subquery_sql', 'dbt')(sql, limit) }}
{% endmacro %}

{% macro default__get_limit_subquery_sql(sql, limit) %}
    select *
    from (
        {{ sql }}
    ) as model_limit_subq
    limit {{ limit }}
{% endmacro %}

{% macro get_limit_sql(sql, limit) %}
  {{ adapter.dispatch('get_limit_sql', 'dbt')(sql, limit) }}
{% endmacro %}

{% macro default__get_limit_sql(sql, limit) %}
    {{ sql }}
    limit {{ limit }}
{% endmacro %}

{% macro use_limit_subquery(compiled_code) %}
  {{ adapter.dispatch('use_limit_subquery', 'dbt')(compiled_code) }}
{% endmacro %}

{% macro default__use_limit_subquery(compiled_code) %}
  {%- do return('order by' not in compiled_code.lower()) -%}
{% endmacro %}

Examples for variant 1

Please bear with the "LaTeX" casing I adopted in the example below. This mixed-casing is included to ensure the proposed solution is case-insensitive.

The example below demonstrates variant 1. To switch these examples to variant 2, just replace the use_limit_subquery macro with this definition instead:

{% macro use_limit_subquery(compiled_code) %}
  {%- do return('limit' in compiled_code.lower()) -%}
{% endmacro %}

Happy path 👍

The following models work as intended when using dbt show along with variant 1 as described above.

models/without_limit_or_order_by.sql

select 1 as id

dbt show runs this query:

    select *
    from (
        select 1 as id
    ) as model_limit_subq
    limit 5

models/without_order_by.sql

select 1 as id
LiMiT 3

dbt show runs this query:

    select *
    from (
        select 1 as id
LiMiT 3
    ) as model_limit_subq
    limit 5

models/without_limit.sql

select 1 as id
OrDeR bY id

dbt show runs this query:

    select 1 as id
OrDeR bY id
    limit 5

models/with_commented_limit.sql

select 1 as id
OrDeR bY id
-- LiMiT 3

dbt show runs this query:

    select 1 as id
OrDeR bY id
-- LiMiT 3
    limit 5

Unhappy path / edge case 👎

models/with_limit_and_order_by.sql

select 1 as id
OrDeR bY id
LiMiT 3

dbt show runs this query and generates a database syntax error:

    select 1 as id
OrDeR bY id
LiMiT 3
    limit 5

models/with_commented_order_by.sql

select 1 as id
-- OrDeR bY id
LiMiT 3

dbt show runs this query and generates a database syntax error:

    select 1 as id
-- OrDeR bY id
LiMiT 3
    limit 5

Making each variant configurable

We could consider making a flag so that users can toggle between which trade-off they want to accept (limit that clobbers order by vs. double limit syntax error). If we surfaced a flag, we could name it something like --limit-subquery / --no-limit-subquery.

joellabes · 2024-03-18T19:10:54Z

Enthusiastic about a flag where you can choose your tradeoff!

I assume the above sample use_limit_subquery macros are intended as a proof of concept, but for the sake of completeness... code like row_number() over (partition by some_id order by date desc) and is_limited_liability_corp precludes us from using them as-is without Scunthorping ourselves.

dbeatty10 · 2024-03-19T16:24:08Z

Good point @joellabes ! Yep, this is a perfect example of two competing edge cases.

Variant 1 would be best for the is_limited_liability_corp edge case, whereas variant 2 would be better for the row_number() over (partition by some_id order by date desc) edge case.

As mentioned earlier, the code provided earlier comes with trade-offs that we can't fully get around without fully re-writing the query itself to either overwrite an existing LIMIT clause or to append one (which is ideal, but not trivial).

To avoid bewilderment/frustration while wrestling with this, I have to remind myself of the true root cause:

for some reason I don't understand, some data platforms don't preserve order by clauses when wrapped in a subquery .

So all these gymnastics are just trying to do our best to workaround that undesirable behavior.

Here's a few follow-up ideas that are not mutually exclusive and could be used in combination.

Idea 1

One idea would be making the "show strategy" configurable somehow on a model-by-model basis.

e.g., add a model config like the following:

limit_stategy

The available strategies could be named like this (or alternatives):

append_limit
subquery_limit / potentially_clobber_order_by
none / empty

The last one is to not apply any additional limit at all and just trust that the user is handling limits within the model itself.

Idea 2

Another idea is to allow users to make assertions about their model via a model-level config:

e.g., add a model config like the following:

has_limit and/or has_order_by

If we added a model-level config that allows users to be explicit, then we could raise appropriate warnings during inference, like:

Warning: It looks like the `abc` model might include a `LIMIT` clause at the end of the query.
If not, set the `has_limit` model config to `false`.
Warning: It looks like the `abc` model might include a `ORDER BY` clause at the end of the query.
If not, set the `has_order_by` model config to `false`.

Idea 3

The real edge case is when both LIMIT and ORDER BY are detected. The crux of the issue is that we know we want to limit the number of results in dbt show so that we don't use an excessive amount memory or screen real estate when the model has a ton of rows.

For users that have a limit clause at the end of their model code, we could encourage a pattern like the following that uses either the hard-coded limit in the model or the CLI limit (depending on which is least).

Suppose the user had this before:

limit 17

Then we'd encourage them to use the invocation_args_dict and the min filter to dynamically choose the lesser:

limit {{ [17, invocation_args_dict.get("limit", 999)] | min }}

Gnarly! 😬

But if the model has this code, then we can bypass adding another limit altogether because the model will already have it. But like in the other scenarios discussed above, we'd probably need the user to explicitly say that they don't need/want anything added to the show query (for this model specifically). i.e., the none / empty strategy described above.

joellabes added bug Something isn't working triage labels Feb 19, 2024

dbeatty10 added enhancement New feature or request and removed bug Something isn't working labels Feb 20, 2024

dbeatty10 changed the title ~~[Bug] dbt show --limit doesn't respect order by clauses and should avoid wrapping in subqueries where possible~~ [Feature] Avoid wrapping in subqueries where possible and respect order by clauses with dbt show --limit Feb 20, 2024

dbeatty10 added Refinement Maintainer input needed and removed triage labels Feb 27, 2024

graciegoheen added the Impact: Exp label Mar 15, 2024

This was referenced Mar 19, 2024

[Core] The schemas context variable has its own page but database_schemas does not dbt-labs/docs.getdbt.com#5117

Open

[Core] The schemas context variable has its own page but invocation_args_dict does not dbt-labs/docs.getdbt.com#5118

Open

jtcohen6 mentioned this issue May 14, 2024

Change the default behavior of get_show_sql to avoid wrapping in subquery dbt-labs/dbt-adapters#207

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Avoid wrapping in subqueries where possible and respect `order by` clauses with `dbt show --limit` #9600

[Feature] Avoid wrapping in subqueries where possible and respect `order by` clauses with `dbt show --limit` #9600

joellabes commented Feb 19, 2024

dbeatty10 commented Feb 20, 2024

joellabes commented Feb 20, 2024 •

edited

Loading

dbeatty10 commented Feb 20, 2024

joellabes commented Feb 20, 2024

dsmdavid commented Mar 15, 2024 •

edited

Loading

dbeatty10 commented Mar 16, 2024 •

edited

Loading

joellabes commented Mar 18, 2024

dbeatty10 commented Mar 19, 2024

[Feature] Avoid wrapping in subqueries where possible and respect order by clauses with dbt show --limit #9600

[Feature] Avoid wrapping in subqueries where possible and respect order by clauses with dbt show --limit #9600

Comments

joellabes commented Feb 19, 2024

Is this a new bug in dbt-core?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

Which database adapter are you using with dbt?

Additional Context

dbeatty10 commented Feb 20, 2024

Workaround

joellabes commented Feb 20, 2024 • edited Loading

dbeatty10 commented Feb 20, 2024

joellabes commented Feb 20, 2024

dsmdavid commented Mar 15, 2024 • edited Loading

dbeatty10 commented Mar 16, 2024 • edited Loading

Ideal

Less ideal, but more practical

Proposal

Code

Examples for variant 1

Happy path 👍

Unhappy path / edge case 👎

Making each variant configurable

joellabes commented Mar 18, 2024

dbeatty10 commented Mar 19, 2024

Idea 1

Idea 2

Idea 3

[Feature] Avoid wrapping in subqueries where possible and respect `order by` clauses with `dbt show --limit` #9600

[Feature] Avoid wrapping in subqueries where possible and respect `order by` clauses with `dbt show --limit` #9600

joellabes commented Feb 20, 2024 •

edited

Loading

dsmdavid commented Mar 15, 2024 •

edited

Loading

dbeatty10 commented Mar 16, 2024 •

edited

Loading