Custom schemas: table already exists #38

Closed
eamontaaffe opened this issue Nov 25, 2019 · 7 comments · Fixed by #52


Issues with re-running workflows when using custom schemas.

When I create a model with a custom schema configured:

-- models/clean/clean_accounts.sql
{{ config(alias='accounts', schema='clean', materialized='table') }}
select * from {{ source('incoming', 'accounts') }}

I am able to run the workflow successfully once:

> dbt run
...
Completed successfully

However, if I run the same workflow again I get an error:

> dbt run
...
Runtime Error in model clean_accounts (models/clean/clean_accounts.sql)
  Database Error
    org.apache.spark.sql.AnalysisException: `dev_clean`.`accounts` already exists.;

Instead, the table should be dropped and recreated. If we repeat the same exercise without the schema='clean' configuration, everything works as expected.
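
For what it's worth, my understanding is that on the second run dbt compiles and runs something roughly like the following (paraphrased, not verbatim output; the source schema name is only a placeholder), which fails because nothing drops `dev_clean`.`accounts` first:

  create table dev_clean.accounts
  as
  select * from incoming.accounts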


eamontaaffe commented Dec 12, 2019

So what I think is happening here is that dbt isn't picking up that the table already exists when it attempts a subsequent run. As a result, it never runs drop_relation on the old table.

https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/include/spark/macros/materializations/table.sql#L14-L16

The old relation is identified using the adapter method get_relation:

https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/include/spark/macros/materializations/table.sql#L5
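
Putting those two together, the relevant part of the table materialization looks roughly like this (paraphrased from the lines linked above, not an exact copy):

  {%- set old_relation = adapter.get_relation(database=database, schema=schema, identifier=identifier) -%}
  ...
  -- setup: if the target relation already exists, drop it
  {% if old_relation -%}
    {{ adapter.drop_relation(old_relation) }}
  {%- endif %}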

However, the get_relation method isn't defined in the adapter implementation here:

https://github.com/fishtown-analytics/dbt-spark/blob/master/dbt/adapters/spark/impl.py

We are really keen to get this issue solved and would like to put in a pull request. Is it possible to get someone to check my logic so I know I'm heading in the right direction?

cc: @drewbanin


eamontaaffe commented Dec 13, 2019

Question: why do we need to check whether the table exists at all? Couldn't we just use the IF EXISTS keyword to ensure that the table is dropped?

https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/include/spark/macros/materializations/table.sql#L13-L16

So we could instead do something like:

  -- setup: if the target relation already exists, drop it
  -- Notice there is no need to check if the relation exists before dropping it.
  {{ adapter.drop_relation(old_relation) }}

The macro implementation of spark__drop_relation already uses IF EXISTS, so we don't even need to update it:

https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/include/spark/macros/adapters.sql#L108-L113
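
For reference, that macro is roughly the following (paraphrased from the link above):

  {% macro spark__drop_relation(relation) -%}
    {% call statement('drop_relation', auto_begin=False) -%}
      drop {{ relation.type }} if exists {{ relation }}
    {%- endcall %}
  {% endmacro %}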

@drewbanin

hey @eamontaaffe - thanks for your thoughtful writeup here! I appreciate your patience - it was hard to get back in the swing of the dbt-spark plugin, but I'm excited to get this (and the other open PRs in this repo) merged!

I think the change you've proposed here is uncontroversial - let me pick this up with you in the open PR.

@eamontaaffe

Awesome, thanks @drewbanin. I'm excited about some of the other issues & PRs that are currently open too! Version 15 support will be amazing.

We have been using the changes proposed in #42 for the last 4-5 days and they seem to be behaving as expected.


jtcohen6 commented Feb 3, 2020

In the spirit of figuring out what was actually going wrong with adapter.get_relation, I discovered the cause: in Spark, unlike in other dbt adapters, database and schema are one and the same. However, when a custom schema is declared in a model config, only the schema property of the materialization is updated. So when dbt checks the cache here for a table matching both the database and schema of the model, it supplies the custom schema for schema but the default (target.database) for database.
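
To make that concrete with the example from the original post (assuming target.schema and target.database both resolve to dev, so the custom schema renders as dev_clean), the lookup on the second run is effectively:

  {# illustrative only, not actual dbt code or output #}
  {%- set old_relation = adapter.get_relation(database='dev', schema='dev_clean', identifier='accounts') -%}
  {# because database and schema are one and the same in Spark, the cached entry for
     `dev_clean`.`accounts` never matches on database, so old_relation is none and the drop is skipped #}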

I think we should fix get_relation itself, rather than use the workaround in #42. We could redefine all get_relation calls to look like this:

{%- set old_relation = adapter.get_relation(database=schema, schema=schema, identifier=identifier) -%}

Or we could re-implement cache.get_relations for the Spark adapter so that it only checks for a matching schema. I'm leaning toward the latter; what do you think, @drewbanin?

@drewbanin


jtcohen6 commented Feb 4, 2020

@drewbanin Good call. Is #52 what you had in mind?
