Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests fail due to apparent duplicate rows in apple_store__subscription_report #15

Open
2 of 4 tasks
casparwylie opened this issue Sep 27, 2023 · 13 comments
Open
2 of 4 tasks
Labels
status:scoping Currently being scoped

Comments

@casparwylie
Copy link

casparwylie commented Sep 27, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

The test case unique_combination_of_columns in the apple_store__subscription_report table fails because the data does seem to have duplicates (where multiple rows have the same date_day, account_id, app_id, subscription_name, territory_long, state).

Looking into it further, the duplicates seem to exist in the transformed data from app_store.sales_subscription_event_summary (though the Fivetran assigned Primary keys are still all distinct).

I raised a ticket with Fivetran support who have said the duplicates are not a mistake, meaning I believe the test in the DBT package should change.

Relevant error log or model output

No response

Expected behavior

Tests should pass regardless of duplicates for the listed fields.

dbt Project configurations

N/A

Package versions

0.3.1

What database are you using dbt with?

bigquery

dbt Version

1.5.0

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.
@fivetran-joemarkiewicz
Copy link
Collaborator

Hi @casparwylie thank you for raising this issue. Did Fivetran support say why these duplicates are not a mistake?

I also was looking at a past issue #12 and saw another customer experienced a similar error due to the territory_name being slightly different. Would you be able to share a few of these duplicate records so we may see if they have some differences that we may be able to correct within the package.

If they do not differ, I would encourage reaching out to Apple to understand why there are duplicates in the source data. I do not believe there should be duplicate subscription report entries in the raw data. That seems like a data integrity issue that this failed test is appropriately flagging.

@casparwylie
Copy link
Author

To clarify, the rows are still unique by the fivetran/meta fields, just often not by date_day, account_id, app_id, subscription_name, territory_long, state.

We are seeing plenty of duplicate rows where every column except the meta columns (e.g _index) are the same in sales_subscription_event_summary.

@fivetran-markgaughran am I right in saying from your end, the duplicates are expected?

@fivetran-markgaughran
Copy link

Hi @fivetran-joemarkiewicz @casparwylie ,

duplicates do not exist in the SALES_SUBSCRIPTION_EVENT_SUMMARY table for the Fivetran assigned Primary keys but they do appear to exist for the transformation output unique keys (date_day, account_id, app_id, subscription_name, territory_long, state), thus causing the transformation to fail.

@fivetran-joemarkiewicz
Copy link
Collaborator

Thanks for adding context @casparwylie and @fivetran-markgaughran.

@casparwylie would you be able to share an example of a duplicate in the apple_store__subscription_report? Mainly I would be curious to take a look at the territory_long field as this has caused some issues in the past with Apple not being consistent with territory naming. An example of the duplicate record will help us understand what next steps may be needed to resolve the error.

@fivetran-joemarkiewicz fivetran-joemarkiewicz added the status:scoping Currently being scoped label Sep 29, 2023
@casparwylie
Copy link
Author

I'm not sure why, but the tests are now passing, likely due to a new historic sync. I now can't find examples other than what I described above! I'm closing the issue. Thank you both.

@casparwylie
Copy link
Author

casparwylie commented Oct 5, 2023

Apologies but the issue as resurfaced now. Here is are 2 fresh examples in JSON result format given the query

SELECT date_day, account_id, app_id, subscription_name, territory_long, state, count(*) as qty 
 FROM `project.apple_store.apple_store__subscription_report` 
 GROUP BY date_day, account_id, app_id, subscription_name, territory_long, state
 HAVING count(*)> 1

(in total there are 211183 results)

[{
    "date_day": "2022-07-21",
    "account_id": "<our account id>",
    "app_id": null,
    "subscription_name": "Offer name",
    "territory_long": "Armenia",
    "state": null,
    "qty": "2"
}, {
    "date_day": "2022-11-10",
    "account_id": "<our account id>",
    "app_id": null,
    "subscription_name": "Offer name",
    "territory_long": "Armenia",
    "state": null,
    "qty": "2"
}]

Let me know your thoughts. Thank you.

@casparwylie casparwylie reopened this Oct 5, 2023
@casparwylie
Copy link
Author

@fivetran-joemarkiewicz Hey - just wondering if any updates on this! Thanks.

@fivetran-joemarkiewicz
Copy link
Collaborator

Hi @casparwylie I am sorry to see that the issue has resurfaced. Would you be able to share the select * of one of those duplicates you came across? I am wondering if this is in fact a duplicate issue that needs to be traced back to the source or code logic in the package, or if this is a scenario where we simply need to update our tests to factor in more than the specified fields for uniqueness.

@casparwylie
Copy link
Author

I've included the query that fetched all the rows (and hidden some more sensitive properties) in the previous comment. Is there any column(s) in particular you'd be keen to see?

@fivetran-joemarkiewicz
Copy link
Collaborator

Yeah I am wondering if there are any columns where you saw the rows were not unique? If they are sensitive no need to share, but I am curious if rows were duplicates across every single field?

Additionally, it would be worthwhile to check the source again and make sure these duplicates don't exist there.

@casparwylie
Copy link
Author

Yea so we are seeing plenty of duplicate rows where every column except the meta columns (e.g _index) are the same in sales_subscription_event_summary. However, unrelated, it looks like app_name is the only difference in the duplicates in apple_store__subscription_report. I suppose if the app_name changes in the app store, the report here causes duplicates.

@fivetran-joemarkiewicz
Copy link
Collaborator

@casparwylie thank you for sharing! The insight into the app_name duplicates does make sense and probably is something we should update in our test to account for the name of the app as that may change.

However, I am still struggling with the duplicates in the source that are only not duplicates due to the Fivetran metadata columns. Would you be interested in meeting sometime this week for my team and I to review these live with you and determine the best approach forward?

@casparwylie
Copy link
Author

My team aren't going to be looking at this anymore so probably not neccassary - the first issue mentioned is probably the main one though in case you're keen to look into it further! Thanks anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:scoping Currently being scoped
Projects
None yet
Development

No branches or pull requests

3 participants