Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Avoid wrapping in subqueries where possible and respect order by clauses with dbt show --limit #9600

Open
2 tasks done
joellabes opened this issue Feb 19, 2024 · 8 comments
Labels
enhancement New feature or request Impact: Exp Refinement Maintainer input needed

Comments

@joellabes
Copy link
Contributor

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

At the moment, the limit flag on dbt show is achieved by wrapping the main query in a subquery, so:

select *
    from (
        with summarised as (select * from analytics_dev.dbt_jlabes.messages)
        
        select 
            len(text), 
            *  
        from summarised 
        order by len(text) asc
    ) as model_limit_subq
    limit 500

instead of the more idiomatic

with summarised as (select * from analytics_dev.dbt_jlabes.messages)

select 
    len(text), 
    *  
from summarised 
order by len(text) asc
limit 500

Because of this, the output isn't always correctly sorted (screenshot from the Cloud IDE, but behaviour is consistent across Cloud IDE, Cloud CLI and Core CLI)
image

Expected Behavior

Sort order to be respected. The easiest way to do this would be to not wrap in a subquery and instead append the limit directly, as shown above.

I imagine that we're doing it this way to abstract over differences in syntax across adapters (lookin at you, SQL Server)

(Potentially we're also doing it to avoid dbt show --inline "select * from my_table limit 10" --limit 100 from rendering two limit clauses and causing an error. But IMO if you do that, you should get the error from the DWH and then remove one of your limits.)

Steps To Reproduce

Unfortunately it reproduces sporadically - I'd hoped to provide you a basic query like dbt_utils.generate_series and sort by that, but I think there's something to do with how each DWH distributes the results that makes this tricky to force into existence.

Relevant log output

No response

Environment

- dbt: 1.6.latest and 1.7.0

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@joellabes joellabes added bug Something isn't working triage labels Feb 19, 2024
@dbeatty10
Copy link
Contributor

Thanks for raising this @joellabes !

Agreed, it makes sense that we'd want to respect order by within the underlying query.

We'd also want to respect limit within the underlying query too though. If we make the proposed change as-is, I think we'd be at risk of breaking preview queries that are working currently. So I'm going to re-categorize this as a feature request.

Workaround

The standard definition is:

{% macro default__get_limit_subquery_sql(sql, limit) %}
    select *
    from (
        {{ sql }}
    ) as model_limit_subq
    limit {{ limit }}
{% endmacro %}

You can get the behavior you're after by creating a file within your project named something like macros/overrides/get_limit_subquery_sql.sql:

{% macro default__get_limit_subquery_sql(sql, limit) %}
    {{ sql }}
    limit {{ limit }}
{% endmacro %}

@dbeatty10 dbeatty10 added enhancement New feature or request and removed bug Something isn't working labels Feb 20, 2024
@dbeatty10 dbeatty10 changed the title [Bug] dbt show --limit doesn't respect order by clauses and should avoid wrapping in subqueries where possible [Feature] Avoid wrapping in subqueries where possible and respect order by clauses with dbt show --limit Feb 20, 2024
@joellabes
Copy link
Contributor Author

joellabes commented Feb 20, 2024

ooooh i like that workaround. on my way to propose a PR in internal-analytics 👀

I think we'd be at risk of breaking preview queries that are working currently.

@dbeatty10 can you clarify what you mean here? Aren't preview queries by definition ephemeral (not in the materialization sense) and unbreakable?

@dbeatty10
Copy link
Contributor

Like you mentioned originally, any query that uses a limit won't work with the macro override above.

Here's an example that doesn't work:

dbt show --inline "select 1 as id limit 3"

@joellabes
Copy link
Contributor Author

OK got it! I guess my pitch is that previews don't have to be protected from that happening - if you accidentally wind up issuing select 1 as id limit 3 limit 100, you solve it by going "oops!" and removing the --limit flag (or the limit statement from your actual query, as appropriate)

@dbeatty10 dbeatty10 added Refinement Maintainer input needed and removed triage labels Feb 27, 2024
@dsmdavid
Copy link

dsmdavid commented Mar 15, 2024

Hi @dbeatty10 splitting hairs here :-) I don't know if I would say this is a feature request or a bug fixing. Yes, I hear you that some queries currently working would stop working (the double limit queries); however, I could also frame it as there are queries that used to work as expected before that are now not throwing errors but returning incorrect results compared to previous versions (e.g. if you downgrade the environment to 1.6 they would work as expected) so I would consider this more of a regression bug that a new feature 😅
Appreciate the workarounds. Also there's another (at least for Snowflake reported by Jack Nguyen https://getdbt.slack.com/archives/CBSQTAPLG/p1698024014827109?thread_ts=1698004749.652379&cid=CBSQTAPLG )
using a explicit limit inside the query instead of relying on dbt's wrap.

@dbeatty10
Copy link
Contributor

dbeatty10 commented Mar 16, 2024

Very good points @dsmdavid -- we could reasonably categorize it as any of those.

We've taken a couple swings at this:

The latter has interactions with ORDER BY clauses that is the subject of this issue.

Ideal

In the most ideal scenario, we'd just do the following:

Add a limit if none exists in the SQL. If a limit already exists in the SQL, and it is less than the provided CLI/Cloud limit, then there's nothing to do. Otherwise, replace the preexisting SQL limit with the CLI/Cloud limit.

The devil would be in the implementation details due to the query re-writing, so this could be a large lift. We'd probably opt for a more practical option that is more easy to implement, like below.

Less ideal, but more practical

See below for a not-quite-perfect workaround that seems reasonably practical.

There's two variants we could choose between when applying LIMIT to the end of the query, each with their own trade-offs:

  1. Don't use a subquery -- accepting the trade-off that a pre-existing LIMIT clause will lead to a syntax error.
  2. Use a subquery -- accepting the trade-off that any pre-existing ORDER BY clause may be clobbered silently.

Is there any way that we have the best of both words without re-writing the query? I don't think so.

But we can do better than just choosing one to apply to all scenarios. Rather, we can dynamically choose one or other depending on the situation.

Proposal

I'd propose that we choose between these two (imperfect!) strategies:

  1. IFF there is a pre-existing ORDER BY clause, Don't use a subquery.
  2. IFF there is a pre-existing LIMIT clause, Use a subquery.

Both variants works great in all scenarios except the following edge case:

select 1 as id
order by id
limit 3

Either one will break down in this scenario, but each in a different way: Using variant 1 will lead to a syntax error, whereas for variant 2, the underlying database may clobber the order by clause.

Code

Code to override get_show_sql, etc:

macros/overrides/show.sql

{% macro get_show_sql(compiled_code, sql_header, limit) -%}
  {%- if sql_header -%}
  {{ sql_header }}
  {%- endif -%}
  {%- if limit is not none -%}
    {%- if use_limit_subquery(compiled_code) is sameas true -%}
      {{ get_limit_subquery_sql(compiled_code, limit) }}
    {%- else -%}
      {{ get_limit_sql(compiled_code, limit) }}
    {%- endif -%}
  {%- else -%}
  {{ compiled_code }}
  {%- endif -%}
{% endmacro %}

{% macro get_limit_subquery_sql(sql, limit) %}
  {{ adapter.dispatch('get_limit_subquery_sql', 'dbt')(sql, limit) }}
{% endmacro %}

{% macro default__get_limit_subquery_sql(sql, limit) %}
    select *
    from (
        {{ sql }}
    ) as model_limit_subq
    limit {{ limit }}
{% endmacro %}

{% macro get_limit_sql(sql, limit) %}
  {{ adapter.dispatch('get_limit_sql', 'dbt')(sql, limit) }}
{% endmacro %}

{% macro default__get_limit_sql(sql, limit) %}
    {{ sql }}
    limit {{ limit }}
{% endmacro %}

{% macro use_limit_subquery(compiled_code) %}
  {{ adapter.dispatch('use_limit_subquery', 'dbt')(compiled_code) }}
{% endmacro %}

{% macro default__use_limit_subquery(compiled_code) %}
  {%- do return('order by' not in compiled_code.lower()) -%}
{% endmacro %}

Examples for variant 1

Please bear with the "LaTeX" casing I adopted in the example below. This mixed-casing is included to ensure the proposed solution is case-insensitive.

The example below demonstrates variant 1. To switch these examples to variant 2, just replace the use_limit_subquery macro with this definition instead:

{% macro use_limit_subquery(compiled_code) %}
  {%- do return('limit' in compiled_code.lower()) -%}
{% endmacro %}

Happy path 👍

The following models work as intended when using dbt show along with variant 1 as described above.

models/without_limit_or_order_by.sql

select 1 as id

dbt show runs this query:

    select *
    from (
        select 1 as id
    ) as model_limit_subq
    limit 5

models/without_order_by.sql

select 1 as id
LiMiT 3

dbt show runs this query:

    select *
    from (
        select 1 as id
LiMiT 3
    ) as model_limit_subq
    limit 5

models/without_limit.sql

select 1 as id
OrDeR bY id

dbt show runs this query:

    select 1 as id
OrDeR bY id
    limit 5

models/with_commented_limit.sql

select 1 as id
OrDeR bY id
-- LiMiT 3

dbt show runs this query:

    select 1 as id
OrDeR bY id
-- LiMiT 3
    limit 5

Unhappy path / edge case 👎

models/with_limit_and_order_by.sql

select 1 as id
OrDeR bY id
LiMiT 3

dbt show runs this query and generates a database syntax error:

    select 1 as id
OrDeR bY id
LiMiT 3
    limit 5

models/with_commented_order_by.sql

select 1 as id
-- OrDeR bY id
LiMiT 3

dbt show runs this query and generates a database syntax error:

    select 1 as id
-- OrDeR bY id
LiMiT 3
    limit 5

Making each variant configurable

We could consider making a flag so that users can toggle between which trade-off they want to accept (limit that clobbers order by vs. double limit syntax error). If we surfaced a flag, we could name it something like --limit-subquery / --no-limit-subquery.

@joellabes
Copy link
Contributor Author

Enthusiastic about a flag where you can choose your tradeoff!

I assume the above sample use_limit_subquery macros are intended as a proof of concept, but for the sake of completeness... code like row_number() over (partition by some_id order by date desc) and is_limited_liability_corp precludes us from using them as-is without Scunthorping ourselves.

@dbeatty10
Copy link
Contributor

Good point @joellabes ! Yep, this is a perfect example of two competing edge cases.

Variant 1 would be best for the is_limited_liability_corp edge case, whereas variant 2 would be better for the row_number() over (partition by some_id order by date desc) edge case.

As mentioned earlier, the code provided earlier comes with trade-offs that we can't fully get around without fully re-writing the query itself to either overwrite an existing LIMIT clause or to append one (which is ideal, but not trivial).

To avoid bewilderment/frustration while wrestling with this, I have to remind myself of the true root cause:

  • for some reason I don't understand, some data platforms don't preserve order by clauses when wrapped in a subquery .

So all these gymnastics are just trying to do our best to workaround that undesirable behavior.

Here's a few follow-up ideas that are not mutually exclusive and could be used in combination.

Idea 1

One idea would be making the "show strategy" configurable somehow on a model-by-model basis.

e.g., add a model config like the following:

  • limit_stategy

The available strategies could be named like this (or alternatives):

  • append_limit
  • subquery_limit / potentially_clobber_order_by
  • none / empty

The last one is to not apply any additional limit at all and just trust that the user is handling limits within the model itself.

Idea 2

Another idea is to allow users to make assertions about their model via a model-level config:

e.g., add a model config like the following:

  • has_limit and/or has_order_by

If we added a model-level config that allows users to be explicit, then we could raise appropriate warnings during inference, like:

Warning: It looks like the `abc` model might include a `LIMIT` clause at the end of the query.
If not, set the `has_limit` model config to `false`.
Warning: It looks like the `abc` model might include a `ORDER BY` clause at the end of the query.
If not, set the `has_order_by` model config to `false`.

Idea 3

The real edge case is when both LIMIT and ORDER BY are detected. The crux of the issue is that we know we want to limit the number of results in dbt show so that we don't use an excessive amount memory or screen real estate when the model has a ton of rows.

For users that have a limit clause at the end of their model code, we could encourage a pattern like the following that uses either the hard-coded limit in the model or the CLI limit (depending on which is least).

Suppose the user had this before:

limit 17

Then we'd encourage them to use the invocation_args_dict and the min filter to dynamically choose the lesser:

limit {{ [17, invocation_args_dict.get("limit", 999)] | min }}

Gnarly! 😬

But if the model has this code, then we can bypass adding another limit altogether because the model will already have it. But like in the other scenarios discussed above, we'd probably need the user to explicitly say that they don't need/want anything added to the show query (for this model specifically). i.e., the none / empty strategy described above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Impact: Exp Refinement Maintainer input needed
Projects
None yet
Development

No branches or pull requests

4 participants