#81 migrated several OR statements into a consistent `artifact_run_id`. This was a good improvement, but it results in a few tricky issues around granularity.
Situation:
* A dbt Cloud job runs `seed`, `run` and `test` as separate steps (I don't believe the problem I'm highlighting would occur when running `build`).
* After each step we run the `upload_artifacts` macros.
* The created artifacts have `run_results` for each step, with each of the models/tests/seeds executed at that step (which is good), and also a duplicate manifest for each step, with a different `command_invocation_id` but the same `artifact_run_id`.
* The `MERGE` statements currently proposed in "Switch to merges in the upload scripts" (#99) don't solve this, because we're merging on `artifact_run_id` AND `artifact_generated_at`.
* This means that when we start joining on `artifact_run_id` in models we get a cartesian join and more rows than we need. Most specifically in `stg_dbt__node_executions`, but also in `fct_dbt__snapshot_executions`, `fct_dbt__seed_executions` & `fct_dbt__model_executions`.
I think the solution is to change the join in these cases to join only on `command_invocation_id`, which is unique to a given artifact and so avoids the cartesian join.
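To make the fan-out concrete, here is a minimal sketch using Python's built-in `sqlite3`. The `manifests` and `run_results` tables, their columns, and the IDs are hypothetical stand-ins invented for illustration, not the package's real schema. It only assumes what is described above: three steps share one `artifact_run_id`, each has its own `command_invocation_id`, and each uploads a full manifest.

```python
import sqlite3

# Hypothetical, heavily simplified stand-ins for the uploaded artifact tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table manifests (
        command_invocation_id text, artifact_run_id text, node_id text
    );
    create table run_results (
        command_invocation_id text, artifact_run_id text, node_id text
    );
""")

# One job run ("run-1") with three steps; each step executes one node.
steps = {
    "inv-seed": "seed.my_seed",
    "inv-run": "model.my_model",
    "inv-test": "test.my_test",
}
all_nodes = list(steps.values())

for invocation_id, executed_node in steps.items():
    # Every step uploads a full manifest, so each node appears three times
    # under the same artifact_run_id.
    conn.executemany(
        "insert into manifests values (?, ?, ?)",
        [(invocation_id, "run-1", node) for node in all_nodes],
    )
    # run_results only contains the node executed in that step.
    conn.execute(
        "insert into run_results values (?, ?, ?)",
        (invocation_id, "run-1", executed_node),
    )


def count_joined_rows(join_key):
    """Count rows when joining run_results to manifests on the given key."""
    query = f"""
        select count(*)
        from run_results as r
        join manifests as m
          on r.{join_key} = m.{join_key}
         and r.node_id = m.node_id
    """
    return conn.execute(query).fetchone()[0]


rows_by_run_id = count_joined_rows("artifact_run_id")
rows_by_invocation_id = count_joined_rows("command_invocation_id")
print(rows_by_run_id, rows_by_invocation_id)  # prints "9 3"
```

In this toy data, joining on `artifact_run_id` matches each execution against all three copies of the manifest, while joining on `command_invocation_id` returns one row per execution.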
* Dedupe V2 artifacts script
* qualify with schema and database
* qualify drop
* qualify insert
* Add a V1 dedupe script
* Add tests for V1 and V2 dedupe script
* Remove dedupe in artifacts staging model
* Dedupe using a qualify rather than a distinct
* Add order by clauses
* Switch to dedupe on command_invocation_id #103
* Rename run_manifest -> manifest
* Update README
* Dedupe on the correct granularity
* Clone rather than truncate > insert
* fix the copy pasta
* cannot clone temporary and non-temporary tables.
* Fix clone statement
* Update macros/dedupe_artifacts_v1.sql
* Update macros/dedupe_artifacts_v2.sql
Co-authored-by: Niall Woodward <niall@niallrees.com>