Custom schemas: table already exists #38

Closed
eamontaaffe opened this issue Nov 25, 2019 · 7 comments · Fixed by #52


Issues with re-running workflows when using custom schemas.

When I create a model with a custom schema configured:

-- models/clean/clean_accounts.sql
{{ config(alias='accounts', schema='clean', materialized='table') }}
select * from {{ source('incoming', 'accounts') }}

I am able to run the workflow successfully once:

> dbt run
...
Completed successfully

However, if I run the same workflow again I get an error:

> dbt run
...
Runtime Error in model clean_accounts (models/clean/clean_accounts.sql)
  Database Error
    org.apache.spark.sql.AnalysisException: `dev_clean`.`accounts` already exists.;

Instead, the table should be dropped and recreated. If we repeat the same exercise without the schema='clean' configuration, everything works as expected.
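
For what it's worth, my understanding is that on the second run dbt compiles and runs something roughly like the following (paraphrased, not verbatim output; the source schema name is only a placeholder), which fails because nothing drops `dev_clean`.`accounts` first:

  create table dev_clean.accounts
  as
  select * from incoming.accounts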


eamontaaffe commented Dec 12, 2019

So what I think is happening here is that dbt isn't picking up that the table already exists when it attempts a subsequent run. As a result, it never runs drop_relation on the old table.

https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/include/spark/macros/materializations/table.sql#L14-L16

The old relation is identified using the adapter method get_relation:

https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/include/spark/macros/materializations/table.sql#L5
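
Putting those two together, the relevant part of the table materialization looks roughly like this (paraphrased from the lines linked above, not an exact copy):

  {%- set old_relation = adapter.get_relation(database=database, schema=schema, identifier=identifier) -%}
  ...
  -- setup: if the target relation already exists, drop it
  {% if old_relation -%}
    {{ adapter.drop_relation(old_relation) }}
  {%- endif %}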

However, the get_relation method isn't defined in the adapter implementation here:

https://github.com/fishtown-analytics/dbt-spark/blob/master/dbt/adapters/spark/impl.py

We are really keen to get this issue solved and would like to put in a pull request. Is it possible to get someone to check my logic so I know I'm heading in the right direction?

cc: @drewbanin


eamontaaffe commented Dec 13, 2019

Question: why do we need to check whether the table exists at all? Couldn't we just use the IF EXISTS keyword to ensure that the table is dropped?

https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/include/spark/macros/materializations/table.sql#L13-L16

So we could instead do something like:

  -- setup: if the target relation already exists, drop it
  -- Notice there is no need to check if the relation exists before dropping it.
  {{ adapter.drop_relation(old_relation) }}

The macro implementation of spark__drop_relation already uses IF EXISTS, so we don't even need to update it:

https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/include/spark/macros/adapters.sql#L108-L113
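
For reference, that macro is roughly the following (paraphrased from the link above):

  {% macro spark__drop_relation(relation) -%}
    {% call statement('drop_relation', auto_begin=False) -%}
      drop {{ relation.type }} if exists {{ relation }}
    {%- endcall %}
  {% endmacro %}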

@drewbanin

hey @eamontaaffe - thanks for your thoughtful writeup here! I appreciate your patience - it was hard to get back in the swing of the dbt-spark plugin, but I'm excited to get this (and the other open PRs in this repo) merged!

I think the change you've proposed here is uncontroversial - let me pick this up with you in the open PR.

@eamontaaffe

Awesome, thanks @drewbanin. I'm excited about some of the other issues & PRs that are currently open too! Version 15 support will be amazing.

We have been using the changes proposed in #42 for the last 4-5 days and they seem to be behaving as expected.


jtcohen6 commented Feb 3, 2020

In the spirit of figuring out what was actually going wrong with adapter.get_relation, I discovered the cause: in Spark, unlike in other dbt adapters, database and schema are one and the same. However, when a custom schema is declared in a model config, only the schema property of the materialization is updated. So when dbt checks the cache here for a table matching both the database and schema of the model, it supplies the custom schema for schema but the default (target.database) for database.
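
To make that concrete with the example from the original post (assuming target.schema and target.database both resolve to dev, so the custom schema renders as dev_clean), the lookup on the second run is effectively:

  {# illustrative only, not actual dbt code or output #}
  {%- set old_relation = adapter.get_relation(database='dev', schema='dev_clean', identifier='accounts') -%}
  {# because database and schema are one and the same in Spark, the cached entry for
     `dev_clean`.`accounts` never matches on database, so old_relation is none and the drop is skipped #}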

I think we should fix get_relation itself, rather than use the workaround in #42. We could redefine all get_relation calls to look like this:

{%- set old_relation = adapter.get_relation(database=schema, schema=schema, identifier=identifier) -%}

Or we could re-implement cache.get_relations for the Spark adapter so that it only checks for a matching schema. I'm leaning toward the latter; what do you think, @drewbanin?

@drewbanin


jtcohen6 commented Feb 4, 2020

@drewbanin Good call. Is #52 what you had in mind?
