
🐛 Destination Snowflake: duplicate rows on retries when using incremental staging #8832

Closed
joshuataylor opened this issue Dec 16, 2021 · 4 comments · Fixed by #9141

joshuataylor commented Dec 16, 2021

Environment

  • Airbyte version: v0.33.12-alpha
  • OS Version / Instance: Ubuntu
  • Deployment: Docker
  • Source Connector and version: Postgres 0.3.17
  • Destination Connector and version: Snowflake 0.1.2
  • Severity: High
  • Step where error happened: Sync job

Current Behavior

When a sync fails and has to retry, all rows that were already uploaded to the stage during the failed attempt are appended to the stage again, resulting in duplicate rows.

Expected Behavior

The destination should not contain duplicate rows.

Uploading data from stage:  stream xxx. schema public, tmp table _airbyte_tmp_boa_xxx, stage PUBLIC_AIRBYTE_RAW_XXX

The stage should contain a per-sync folder, as described in the Snowflake documentation (https://docs.snowflake.com/en/user-guide/data-load-local-file-system-stage.html):

put file:///data/data.csv @~/staged/SOME UUID;

This way the UUID is scoped to a single sync attempt; retries should use a new UUID, and on failure the files under the old UUID should be deleted.
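A minimal sketch of that per-attempt scoping in Python (hypothetical helper names, not the actual connector code):

```python
import uuid


def stage_path_for_attempt(stage: str = "@~/staged") -> str:
    """Build a per-attempt stage prefix so retries never mix files.

    Each sync attempt gets its own UUID folder; a failed attempt's
    folder could be cleaned up (e.g. REMOVE @~/staged/<uuid>) before
    the next attempt uploads anything.
    """
    return f"{stage}/{uuid.uuid4()}"


def put_statement(local_file: str, stage_prefix: str) -> str:
    """Render the PUT command that uploads a local file into the
    attempt-scoped stage folder (to be executed via a Snowflake cursor)."""
    return f"PUT file://{local_file} {stage_prefix};"


prefix = stage_path_for_attempt()
sql = put_statement("/data/data.csv", prefix)
```

The key property is that two attempts can never share a prefix, so a load scoped to the current attempt's folder cannot pick up leftovers from a failed one.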

Attempt 1:

XXX GB | XXXX records | 1h 36m 14s
/tmp/workspace/3/0/logs.log.

Attempt 2:

XX.XX GB | XXXX records | 1h 41m 54s
/tmp/workspace/3/1/logs.log.

Steps to Reproduce

  1. Create a new destination with SF
  2. At the end when it's inserting into SF (or during the process), cancel the query in SF
  3. It then retries (good!), but the result has duplicate rows.

Are you willing to submit a PR?

Maybe?

@joshuataylor joshuataylor added needs-triage type/bug Something isn't working labels Dec 16, 2021
@alafanechere alafanechere changed the title When syncing to Snowflake with Incremental Staging, retries will duplicate rows 🐛 Destination Snowflake: Incremental Staging, retries will duplicate rows Dec 16, 2021
@alafanechere alafanechere added area/connectors Connector related issues and removed needs-triage labels Dec 16, 2021
@alafanechere alafanechere changed the title 🐛 Destination Snowflake: Incremental Staging, retries will duplicate rows 🐛 Destination Snowflake: duplicate rows on retries when using incremental staging Dec 16, 2021
@alafanechere alafanechere added the priority/critical Critical priority! label Dec 16, 2021
@sherifnada
Contributor

@joshuataylor if I understand correctly, the problem is that all files in the stage are loaded, rather than only the files from this particular sync, correct?

@joshuataylor
Author

Correct. The files stay in the stage, so when a sync fails and retries, new files are added alongside the old ones; the load then picks up files from both attempt 1 and attempt 2, duplicating the rows.
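A toy model of that failure mode (plain Python, not the connector's implementation): the stage accumulates files across attempts, and a load that reads the whole stage picks up both copies.

```python
# Toy model: the stage is just a list of uploaded file names.
stage = []


def upload(files, attempt):
    # Hypothetical per-attempt naming; the real stage layout differs.
    stage.extend(f"attempt{attempt}/{name}" for name in files)


upload(["part0.csv", "part1.csv"], attempt=1)  # attempt 1 fails after uploading
upload(["part0.csv", "part1.csv"], attempt=2)  # retry uploads the same data again

# A COPY INTO that reads the whole stage loads all four files,
# so every row from attempt 1 appears twice in the destination.
loaded = list(stage)
assert len(loaded) == 4
```

Scoping the load to the current attempt's folder (as proposed above) would make `loaded` contain only the two attempt-2 files.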

@VitaliiMaltsev VitaliiMaltsev self-assigned this Dec 17, 2021
@VitaliiMaltsev
Contributor

@joshuataylor please advise how to cancel a query in Snowflake. I believe we need the query ID for that.

@VitaliiMaltsev
Contributor

@joshuataylor please ignore my previous comment; I already found the approach I needed.

6 participants