Duplicate entries in the report. #1420

Open
annav00 opened this issue Feb 19, 2024 · 5 comments
Labels
Bug Something isn't working dbt package High priority Created by Linear-GitHub Sync

Comments

@annav00

annav00 commented Feb 19, 2024

Describe the bug

The on-run-end hook saves data to the artifact tables. When tests and models are run in parallel, duplicate entries sometimes appear in those tables. As a result, the generated report shows duplicate records for the same test.

To Reproduce
Steps to reproduce the behavior:

  1. Run tests/models in parallel as separate queries.
  2. Generate the report.

Expected behavior
The report contains exactly one entry for each test.

Screenshots
Example: 1 record in elementary_test_results × 2 records in dbt_sources × 6 records in dbt_tests = 12 records in the report.
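
To check whether an artifact table currently holds duplicates, a query along these lines may help. It is a minimal sketch that assumes the artifact tables live in a schema named elementary and carry a unique_id column, as the workaround queries further down this thread suggest; adjust both to your project.

```sql
-- Lists every unique_id that appears more than once in dbt_tests.
-- Assumption: schema "elementary" and column "unique_id" match your setup.
select
    unique_id,
    count(*) as row_count
from elementary.dbt_tests
group by unique_id
having count(*) > 1
order by row_count desc;
```

The same query can be pointed at dbt_sources or the other artifact tables by changing the table name. An empty result means the table has no duplicated unique_id values.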


Environment (please complete the following information):

  • edr Version: 0.13.2
  • dbt package Version: 0.13.0

Additional context
It may be possible to prevent duplicates in the artifact tables when runs execute in parallel. Alternatively, the query that loads data for the report could select only distinct records.

@annav00 annav00 added Bug Something isn't working Triage 👀 labels Feb 19, 2024
@MICHM137

Hello,

I can confirm this bug, which is a bit annoying. When we run models in parallel, we end up with duplicates in the tables.
As a workaround, I have to run the following queries periodically.

-- Rebuild each artifact table, keeping one row per unique_id (the earliest by generated_at).
create or replace table elementary.dbt_tests as (
    select * from elementary.dbt_tests qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_models as (
    select * from elementary.dbt_models qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_sources as (
    select * from elementary.dbt_sources qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_exposures as (
    select * from elementary.dbt_exposures qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_columns as (
    select * from elementary.dbt_columns qualify row_number() over (partition by unique_id order by generated_at) = 1
);

@haritamar
Collaborator

Hi @annav00 and @MICHM137 ,
Sorry for the delay in responding here. Can you please confirm whether this issue is still relevant to you?
Also, which databases are you using?

I'll mark this as high priority on our end.

In the meantime, one workaround you can consider is setting the var cache_artifacts to False. This forces a full replace of the artifacts on every run, which I think should prevent duplicates (though it can increase the on_run_end duration).

(When caching is enabled we only insert a diff, and I think there's probably a race there.)

@haritamar haritamar added High priority Created by Linear-GitHub Sync dbt package and removed Triage 👀 labels May 28, 2024
@mattxxi

mattxxi commented May 29, 2024

Hey (this is MICHM137),
I would avoid setting cache_artifacts to False because the on_run_end hook already takes a long time and makes our pipelines much longer than they are without Elementary.
Do you plan to optimize the on_run_end hook?
Thanks for your answer.

@haritamar
Collaborator

Hi @mattxxi ,
Yeah, makes sense. We implemented the cache for performance reasons; I just wanted to point out the alternative.

I think the duplicate entries that appear when the cache is enabled result from a race in the delete_and_insert macro, which we need to fix.
I don't have an immediate time frame for it, but I'm guessing we'll prioritize it in the near future.
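
For readers following along, here is a rough illustration of how a delete-then-insert pattern can produce duplicates when two runs overlap. It is not the actual output of the delete_and_insert macro. It assumes the delete and the insert are separate statements that do not share a lock or transaction, and the staging table names (dbt_tests_diff_a, dbt_tests_diff_b) are hypothetical.

```sql
-- Hypothetical interleaving of two parallel dbt invocations, A and B.
-- Both staging tables are assumed to contain a row for the same unique_id.

-- A: remove the rows it is about to rewrite
delete from elementary.dbt_tests
where unique_id in (select unique_id from dbt_tests_diff_a);

-- B: issues its own delete before A's insert has landed, so it removes nothing new
delete from elementary.dbt_tests
where unique_id in (select unique_id from dbt_tests_diff_b);

-- A: inserts its copy of the row
insert into elementary.dbt_tests
select * from dbt_tests_diff_a;

-- B: inserts a second copy of the same unique_id
insert into elementary.dbt_tests
select * from dbt_tests_diff_b;
```

Under these assumptions the table ends up with two rows for the shared unique_id, which matches the symptom described in this issue.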

@sindhuthirugnanam

Do you have a timeline for the fix? Is the issue also happening in the latest version of Elementary?
