Refactor of granularity in tables. #103

alanmcruickshank · 2022-03-10T14:23:21Z

#81 migrated several OR statements into a consistent artifact_run_id. This was a good improvement but results in a few tricky issues about granularity.

Situation:

dbt cloud job, which runs seed, run and test (I don't believe the problem I'm highlighting would occur in the case of running build).
After each step we run upload_artifacts macros.
The created artifacts would have run_results for each step, with each of the models/tests/seeds executed at each step (which is good) and also a duplicate manifest for each, with a different command_invocation_id but the same artifact_run_id.
The current proposed MERGE statements in Switch to merges in the upload scripts #99 don't solve this situation currently because we're merging on artifact_run_id AND atrifact_generated_at.
This means that when we start joining on artifact_run_id in models we get a cartesian join and more rows than we need. Most specifically in stg_dbt__node_executions, but also in fct_dbt__snapshot_executions, fct_dbt__seed_executions & fct_dbt__model_executions.

I think the solution to this is to change the join in these cases to join only on command_invocation_id which is unique to a given artifact - and avoids the cartesian join.

The text was updated successfully, but these errors were encountered:

* Dedupe V2 artifacts script * qualify with schema and database * qualify drop * qualify insert * Add a V1 dedupe script * Add tests for V1 and V2 dedupe script * Remove dedupe in artifacts staging model * Dedupe using a qualify rather than a distinct * Add order by clauses * Switch to dedpue on command_invocation_id #103 * Rename run_manifest -> manifest * Update README * Dedupe on the correct granularity * Clone rather than truncate > insert * fix the copy pasta * cannot clone temporary and non-temporary tables. * Fix clone statement * Update macros/dedupe_artifacts_v1.sql * Update macros/dedupe_artifacts_v2.sql Co-authored-by: Niall Woodward <niall@niallrees.com>

alanmcruickshank added a commit to alanmcruickshank/dbt_artifacts that referenced this issue Mar 10, 2022

Switch to dedpue on command_invocation_id brooklyn-data#103

7caea3e

This was referenced Mar 10, 2022

Fix cartesian join bug #104

Merged

Dedupe script #94

Merged

Switch to merges in the upload scripts #99

Merged

NiallRees closed this as completed in #104 Mar 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor of granularity in tables. #103

Refactor of granularity in tables. #103

alanmcruickshank commented Mar 10, 2022

Refactor of granularity in tables. #103

Refactor of granularity in tables. #103

Comments

alanmcruickshank commented Mar 10, 2022