Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] clone creating "view pointers" instead of "cloned tables" on 1.8 / "Keep on latest version" #10296

Open
2 tasks done
jeremyyeo opened this issue Jun 12, 2024 · 2 comments
Labels
bug Something isn't working clone related to the dbt clone command

Comments

@jeremyyeo
Copy link
Contributor

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

On the latest dbt version ("Keep on latest version") in dbt Cloud - dbt clone is sometimes creating view pointers of the target table instead of actually creating a cloned table.

Internal thread: https://dbt-labs.slack.com/archives/C05FWBP9X1U/p1717992405960779

Expected Behavior

If the target table exist as a table - it should be cloned via create or replace table db.ci.foo clone db.prod.foo; instead of a view pointer create or replace view db.ci.foo as (select * from db.prod.foo);

Steps To Reproduce

Not able to repro yet.

Relevant log output

Internal thread above has a lot of logging.

Environment

- OS:
- Python:
- dbt: Keep on latest version

Which database adapter are you using with dbt?

snowflake

Additional Context

I believe this is a core thing as opposed to an adapter thing so filing this here.

@jeremyyeo jeremyyeo added bug Something isn't working triage labels Jun 12, 2024
@dbeatty10 dbeatty10 added clone related to the dbt clone command and removed triage labels Jun 12, 2024
@sigveks
Copy link

sigveks commented Jun 25, 2024

This has also been observed in 1.7

@jeremyyeo
Copy link
Contributor Author

jeremyyeo commented Jun 26, 2024

  1. Given this is not happening to exclusively 1.8 (keep on latest version in dbt Cloud) - there must be something else afoot.
  2. The macro that controls whether the defer relation (prod object) is cloned into the CI schema as a "cloned table" or "pointer view" is the clone materialization here:
  3. This line specifically has to evaluate to true if a relation will be "cloned table":
    {%- if other_existing_relation and other_existing_relation.type == 'table' and can_clone_table -%}
  4. Currently, we do log out the objects that exist in a schema - but we don't log out what type the object is - for example:
10:33:18  While listing relations in database=prod_jyeo, schema=dbt_jyeo, found: ABC, BAR, BAZ, FOO

I propose overriding the clone macro to do 2 additional things in the mean time:

  1. Log out more information about the other_existing_relation / defer relation that is retrieved from the cache.
  2. Do a double take and check the information schema AGAIN for the type of the defer relation.

In order to do that - add the following macro to your dbt project:

-- macros/clone_override.sql
{%- materialization clone, default -%}

  {%- set relations = {'relations': []} -%}

  {%- if not defer_relation -%}
      -- nothing to do
      {{ log("No relation found in state manifest for " ~ model.unique_id, info=True) }}
      {{ return(relations) }}
  {%- endif -%}

  {%- set existing_relation = load_cached_relation(this) -%}

  {%- if existing_relation and not flags.FULL_REFRESH -%}
      -- noop!
      {{ log("Relation " ~ existing_relation ~ " already exists", info=True) }}
      {{ return(relations) }}
  {%- endif -%}

  {%- set other_existing_relation = load_cached_relation(defer_relation) -%}

  {#/* See https://github.com/dbt-labs/dbt-core/issues/10296 */-#}
  {{ log('CLONE DEBUG: ' ~ other_existing_relation ~ ' type is ' ~ other_existing_relation.type) }}

  {% set check_type_query -%}
    select case when table_type = 'BASE TABLE' then 'table' else table_type end as table_type
      from {{ defer_relation.database }}.information_schema.tables
     where lower(table_schema) =  lower('{{ defer_relation.schema }}')
       and lower(table_name) = lower('{{ defer_relation.identifier}}')
  {%- endset %}
  {% set recheck = run_query(check_type_query) %}
  {% set defer_rel_type_recheck = recheck.columns['TABLE_TYPE'].values()[0] %}
  {{ log('CLONE DEBUG (recheck type): defer relation type is ' ~ defer_rel_type_recheck) }}

  -- If this is a database that can do zero-copy cloning of tables, and the other relation is a table, then this will be a table
  -- Otherwise, this will be a view

  {% set can_clone_table = can_clone_table() %}

  {%- if other_existing_relation and (other_existing_relation.type == 'table' or defer_rel_type_recheck == 'table') and can_clone_table -%}

      {%- set target_relation = this.incorporate(type='table') -%}
      {% if existing_relation is not none and not existing_relation.is_table %}
        {{ log("Dropping relation " ~ existing_relation ~ " because it is of type " ~ existing_relation.type) }}
        {{ drop_relation_if_exists(existing_relation) }}
      {% endif %}

      -- as a general rule, data platforms that can clone tables can also do atomic 'create or replace'
      {% call statement('main') %}
          {% if target_relation and defer_relation and target_relation == defer_relation %}
              {{ log("Target relation and defer relation are the same, skipping clone for relation: " ~ target_relation) }}
          {% else %}
              {{ create_or_replace_clone(target_relation, defer_relation) }}
          {% endif %}

      {% endcall %}

      {% set should_revoke = should_revoke(existing_relation, full_refresh_mode=True) %}
      {% do apply_grants(target_relation, grant_config, should_revoke=should_revoke) %}
      {% do persist_docs(target_relation, model) %}

      {{ return({'relations': [target_relation]}) }}

  {%- else -%}

      {%- set target_relation = this.incorporate(type='view') -%}

      -- reuse the view materialization
      -- TODO: support actual dispatch for materialization macros
      -- Tracking ticket: https://github.com/dbt-labs/dbt-core/issues/7799
      {% set search_name = "materialization_view_" ~ adapter.type() %}
      {% if not search_name in context %}
          {% set search_name = "materialization_view_default" %}
      {% endif %}
      {% set materialization_macro = context[search_name] %}
      {% set relations = materialization_macro() %}
      {{ return(relations) }}

  {%- endif -%}

{%- endmaterialization -%}

Caveats: Snowflake only. You must not have mixed cased objects in your Snowflake db - i.e. both table "FOO" and "foo" exist - there can every be only one prod foo object - check the way we're querying the information_schema in the query above with lower() on both sides of the equality.

With a clone override like above and a table and view model like so:

-- models/foo.sql
{{ config(materialized='table') }}
select 2 id

-- models/bar.sql
{{ config(materialized='view') }}
select * from {{ ref('foo') }}

We get additional logging information in the debug logs:

$ dbt --debug clone -s state:modified+ --defer --state target_old --target dev
10:33:16  Began running node model.my_dbt_project.foo
10:33:16  Re-using an available connection from the pool (formerly list_development_jyeo_dbt_jyeo, now model.my_dbt_project.foo)
10:33:16  Began compiling node model.my_dbt_project.foo
10:33:16  Began executing node model.my_dbt_project.foo
10:33:16  On "model.my_dbt_project.foo": cache miss for schema "prod_jyeo.dbt_jyeo", this is inefficient
10:33:17  Using snowflake connection "model.my_dbt_project.foo"
10:33:17  On model.my_dbt_project.foo: /* {"app": "dbt", "dbt_version": "1.8.2", "profile_name": "sf", "target_name": "dev", "node_id": "model.my_dbt_project.foo"} */
show objects in prod_jyeo.dbt_jyeo limit 10000
10:33:17  Opening a new connection, currently in state closed
10:33:18  SQL status: SUCCESS 4 in 2.0 seconds
10:33:18  While listing relations in database=prod_jyeo, schema=dbt_jyeo, found: ABC, BAR, BAZ, FOO
10:33:18  CLONE DEBUG: "PROD_JYEO"."DBT_JYEO"."FOO" type is table
10:33:18  Using snowflake connection "model.my_dbt_project.foo"
10:33:18  On model.my_dbt_project.foo: /* {"app": "dbt", "dbt_version": "1.8.2", "profile_name": "sf", "target_name": "dev", "node_id": "model.my_dbt_project.foo"} */
select case when table_type = 'BASE TABLE' then 'table' else table_type end as table_type
      from prod_jyeo.information_schema.tables
     where lower(table_schema) =  lower('dbt_jyeo')
       and lower(table_name) = lower('foo')
10:33:20  SQL status: SUCCESS 1 in 1.0 seconds
10:33:20  CLONE DEBUG (recheck type): defer relation type is table
10:33:20  Writing runtime sql for node "model.my_dbt_project.foo"
10:33:20  Using snowflake connection "model.my_dbt_project.foo"
10:33:20  On model.my_dbt_project.foo: /* {"app": "dbt", "dbt_version": "1.8.2", "profile_name": "sf", "target_name": "dev", "node_id": "model.my_dbt_project.foo"} */
create or replace
      transient
      table development_jyeo.dbt_jyeo.foo
      clone prod_jyeo.dbt_jyeo.foo
10:33:21  SQL status: SUCCESS 1 in 1.0 seconds
10:33:21  On model.my_dbt_project.foo: Close
10:33:22  Finished running node model.my_dbt_project.foo
10:33:22  Began running node model.my_dbt_project.bar
10:33:22  Re-using an available connection from the pool (formerly model.my_dbt_project.foo, now model.my_dbt_project.bar)
10:33:22  Began compiling node model.my_dbt_project.bar
10:33:22  Began executing node model.my_dbt_project.bar
10:33:22  CLONE DEBUG: "PROD_JYEO"."DBT_JYEO"."BAR" type is view
10:33:22  Using snowflake connection "model.my_dbt_project.bar"
10:33:22  On model.my_dbt_project.bar: /* {"app": "dbt", "dbt_version": "1.8.2", "profile_name": "sf", "target_name": "dev", "node_id": "model.my_dbt_project.bar"} */
select case when table_type = 'BASE TABLE' then 'table' else table_type end as table_type
      from prod_jyeo.information_schema.tables
     where lower(table_schema) =  lower('dbt_jyeo')
       and lower(table_name) = lower('bar')
10:33:22  Opening a new connection, currently in state closed
10:33:24  SQL status: SUCCESS 1 in 2.0 seconds
10:33:24  CLONE DEBUG (recheck type): defer relation type is VIEW
10:33:24  Writing runtime sql for node "model.my_dbt_project.bar"
10:33:24  Using snowflake connection "model.my_dbt_project.bar"
10:33:24  On model.my_dbt_project.bar: /* {"app": "dbt", "dbt_version": "1.8.2", "profile_name": "sf", "target_name": "dev", "node_id": "model.my_dbt_project.bar"} */
create or replace   view development_jyeo.dbt_jyeo.bar
  
   as (
    select * from prod_jyeo.dbt_jyeo.bar
  );
10:33:25  SQL status: SUCCESS 1 in 1.0 seconds
10:33:25  On model.my_dbt_project.bar: Close
10:33:25  Finished running node model.my_dbt_project.bar

Prior to cloning the prod object - the debug logs will tell us:

  1. What was the type of the prod object (table or view) as stored in the cache generated at the start of the run.
  2. What was the type of the prod object (table or view) again, as retrieved directly from the information_schema the second time just prior to doing the clone operation.

It would be useful to know if (1) and (2) both line up - this means 100% certainty that Snowflake is returning the fact that the prod object is a confirmed view - rightly or wrongly - that's what Snowflake returns so perhaps there's some external processes that is converting the prod object to a view when users think that it is actually a table.

If (1) and (2) don't line up - it is also useful I think?

Additionally, by logging out other_existing_relation - we would know that we looked into the cache and found the existence of the object in the cache and potentially if that is logging out as None then there is a bug here with peering into the cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working clone related to the dbt clone command
Projects
None yet
Development

No branches or pull requests

3 participants