
When a table gets dropped elsewhere, analysis fails and triggers multiple repetitions of create table #215

Closed
danfran opened this issue Sep 15, 2021 · 3 comments
Labels
bug Something isn't working

Comments


danfran commented Sep 15, 2021

Describe the bug

Hi everyone, I am currently going through an issue with dbt (0.20.1) in production that seems related to "concurrent updates" when I run different, not directly related, queries from different models at the same time. As a simplified description, my models folder is divided into subfolders like this:

models:
  model_1:
  	table_1.sql
  	table_2.sql
  	...
  model_2:
  	table_14.sql
  	...
  ...
  model_25:
  	...
  	table_117.sql

for a total of 117 models.
Then at some point, Process-A starts model_7 (dbt run --models model_7 on Server-A), which contains table_65.sql. From the beginning of the logs for this process I can see something like:

2021-09-14 08:43:56.293473 (MainThread): Acquiring new spark connection "model.my_proj.table_1".
2021-09-14 08:43:56.385776 (MainThread): Acquiring new spark connection "model.my_proj.table_2".
....
2021-09-14 08:43:56.390074 (MainThread): Acquiring new spark connection "model.my_proj.table_117".

Meanwhile a separate Process-B starts model_9 (dbt run --models model_9 on Server-B), which requires dropping the table table_82 (table_65 and table_82 are not related).
Back on Process-A, the log now shows (most likely caused by Process-B):

2021-09-14 08:44:35.863983 (ThreadPoolExecutor-1_0): Error while running:
/* {"app": "dbt", "dbt_version": "0.20.1", "profile_name": "my_proj_spark", "target_name": "prod", "connection_name": "list_None_my_models"} */
show table extended in my_models like '*'
  
2021-09-14 08:44:35.864026 (ThreadPoolExecutor-1_0): Database Error
  org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'table_82' not found in database 'my_models'

This last log message is repeated 3 times in a row. For some reason this also triggers the build of table_65 3 times, as I can see from the log:

2021-09-14 08:44:36.189855 (Thread-1): /* {"app": "dbt", "dbt_version": "0.20.1", "profile_name": "my_proj_spark", "target_name": "prod", "node_id": "model.my_proj.table_65"} */

      create table model_7.table_65


	using parquet
    
    partitioned by (country,month_selected)
    
    
    location 's3://my_data/....'
    
    as ....


2021-09-14 08:44:36.281855 (Thread-1): /* {"app": "dbt", "dbt_version": "0.20.1", "profile_name": "my_proj_spark", "target_name": "prod", "node_id": "model.my_proj.table_65"} */

      create table model_7.table_65


	using parquet
    
    partitioned by (country,month_selected)
    
    
    location 's3://my_data/....'
    
    as ....

 2021-09-14 08:44:36.399835 (Thread-1): /* {"app": "dbt", "dbt_version": "0.20.1", "profile_name": "my_proj_spark", "target_name": "prod", "node_id": "model.my_proj.table_65"} */

      create table model_7.table_65


	using parquet
    
    partitioned by (country,month_selected)
    
    
    location 's3://my_data/....'
    
    as ....

which understandably makes the process fail with this error:

2021-09-14 08:44:36.944890 (Thread-1): Database Error
  org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Table my_proj.table_65 already exists. You need to drop it first.

If I am reading the logs correctly, is there any workaround to avoid multiple triggers of the same process when an unrelated table disappears and makes the analysis fail? Or is it a dbt bug? Thank you

Steps To Reproduce

Please look at the bug description. I think the process can be reproduced locally by following those steps.
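At the metastore level, the race can be sketched with the two statements already visible in the logs above (illustrative only; the schema and table names come from the logs, and the two statements run from two separate sessions):

-- Session A (Process-A): the schema-wide metadata query dbt issues while building its relation cache
show table extended in my_models like '*'

-- Session B (Process-B), running concurrently: the unrelated model run drops one of the listed tables
drop table my_models.table_82

-- If the drop lands while Session A is still resolving the listed tables, Session A
-- fails with NoSuchTableException for 'table_82', as in the Process-A log above.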

Expected behavior

The expected behaviour is that unrelated tables being dropped elsewhere does not cause a failure.

Screenshots and log output

Please look at the bug description.

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: AWS/EMR 6.3)

The output of dbt --version:

installed version: 0.20.1
   latest version: 0.20.2

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 0.19.2
  - snowflake: 0.19.2
  - redshift: 0.19.2
  - postgres: 0.19.2
  - spark: 0.20.1

The operating system you're using:
Ubuntu 20.04 (on AWS/EC2)

The output of python --version:
Python 3.8.14

Additional context

Running this process on AWS/EMR 6.3 with Glue as the "Hive" metastore replacement for Spark.

danfran added the bug and triage labels on Sep 15, 2021

danfran commented Sep 20, 2021

I found the root of the previous issue (at least I hope). This is happening because dbt runs, via Spark, the following command: show table extended in my_glue_schema like '*'. In my case the same schema is used by several other processes that can generate tables with "bad" schemas or without a "Location" (S3 path) defined (a view, for example). This makes the highlighted command fail, so an eventual "drop table" or "insert overwrite" won't happen and only the create table runs. Why does dbt need to run show table extended in my_glue_schema like '*'? Can it be skipped somehow?
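For illustration, a minimal sketch of that scenario (the view name is made up, and whether the statement actually fails this way depends on the Glue/metastore setup described above):

-- Some other process creates a view in the shared schema; a view carries no S3 "Location".
create view my_glue_schema.some_view as select 1 as dummy

-- dbt's schema-wide metadata query then has to describe that view as well; per the
-- description above, objects without a usable Location make this statement fail, so the
-- subsequent "drop table" / "insert overwrite" never runs and only the "create table" is attempted.
show table extended in my_glue_schema like '*'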


danfran commented Sep 20, 2021

I have been pointed to this previous issue: #93. That seems to be a major problem for my case, as we cannot run separate parts of what is effectively a single "tree", since they may overlap on the creation/dropping of tables expected by one tree or the other.

In environments like Glue Catalog via EMR it may not be required though, as the "layer" is the Catalog service itself. Not sure if I am missing the point here. In that case, could that command be made "optional"? For example something like:

show table extended in my_glue_schema in ('table_1', 'table_2', ...etc...)
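(For what it's worth, Spark's show table extended pattern already accepts | as a separator between multiple names, so a narrower lookup along these lines might be possible if the adapter built the pattern from the selected relations only; the table names are illustrative:)

-- Hypothetical narrower lookup: only the relations selected for the run are described.
show table extended in my_glue_schema like 'table_1|table_2|table_65'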

jtcohen6 commented
Sorry for the delay @danfran! I'm going to close this in favor of the discussion and next steps in #228, since I believe a solid resolution there would also avoid this bug.
