Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor of granularity in tables. #103

Closed
alanmcruickshank opened this issue Mar 10, 2022 · 0 comments · Fixed by #104
Closed

Refactor of granularity in tables. #103

alanmcruickshank opened this issue Mar 10, 2022 · 0 comments · Fixed by #104

Comments

@alanmcruickshank
Copy link
Contributor

#81 migrated several OR statements into a consistent artifact_run_id. This was a good improvement but results in a few tricky issues about granularity.

Situation:

  • dbt cloud job, which runs seed, run and test (I don't believe the problem I'm highlighting would occur in the case of running build).
  • After each step we run upload_artifacts macros.
  • The created artifacts would have run_results for each step, with each of the models/tests/seeds executed at each step (which is good) and also a duplicate manifest for each, with a different command_invocation_id but the same artifact_run_id.
  • The current proposed MERGE statements in Switch to merges in the upload scripts #99 don't solve this situation currently because we're merging on artifact_run_id AND atrifact_generated_at.
  • This means that when we start joining on artifact_run_id in models we get a cartesian join and more rows than we need. Most specifically in stg_dbt__node_executions, but also in fct_dbt__snapshot_executions, fct_dbt__seed_executions & fct_dbt__model_executions.

I think the solution to this is to change the join in these cases to join only on command_invocation_id which is unique to a given artifact - and avoids the cartesian join.

alanmcruickshank added a commit to alanmcruickshank/dbt_artifacts that referenced this issue Mar 10, 2022
NiallRees added a commit that referenced this issue Mar 12, 2022
* Dedupe V2 artifacts script

* qualify with schema and database

* qualify drop

* qualify insert

* Add a V1 dedupe script

* Add tests for V1 and V2 dedupe script

* Remove dedupe in artifacts staging model

* Dedupe using a qualify rather than a distinct

* Add order by clauses

* Switch to dedpue on command_invocation_id #103

* Rename run_manifest -> manifest

* Update README

* Dedupe on the correct granularity

* Clone rather than truncate > insert

* fix the copy pasta

* cannot clone temporary and non-temporary tables.

* Fix clone statement

* Update macros/dedupe_artifacts_v1.sql

* Update macros/dedupe_artifacts_v2.sql

Co-authored-by: Niall Woodward <niall@niallrees.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant