Destination MSSQL duplicate array records when using incremental append (nested tables) #9465
Comments
Hi, thanks for creating the issue @marcosmarxm. I would like to help with this; any idea where I can start checking? Is it normal for normalization to re-run when no new data is fetched? Thanks
Nested streams are not de-duplicated, see Line 142 in ecfc9e1
You could use a custom transformation where you can specify how to de-duplicate sub-streams.
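As a rough illustration of what "specify how to de-duplicate sub-streams" could mean, here is a minimal Python sketch that keeps the first occurrence of each distinct row by content hash. It only shows the logic; it is not an Airbyte API, and a real custom transformation would express the same idea in whatever form the transformation step accepts. The column names `parent_id` and `synced_at` and both helper functions are hypothetical.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Stable hash of a row's content (keys sorted so ordering does not matter)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedupe_rows(rows: list, ignore: tuple = ()) -> list:
    """Keep the first occurrence of each distinct row, ignoring the given columns."""
    seen = set()
    unique = []
    for row in rows:
        # Drop bookkeeping columns that differ between runs before hashing.
        key = row_fingerprint({k: v for k, v in row.items() if k not in ignore})
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sub-stream rows: the first two differ only in the run marker.
rows = [
    {"parent_id": "1", "name": "abcd", "synced_at": "run-1"},
    {"parent_id": "1", "name": "abcd", "synced_at": "run-2"},
    {"parent_id": "1", "name": "efgh", "synced_at": "run-2"},
]
deduped = dedupe_rows(rows, ignore=("synced_at",))
```

Since the sub-stream has no cursor or primary key, the whole row content (minus per-run columns) acts as the de-duplication key here; that choice is an assumption and would wrongly merge rows that are legitimately identical.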
Thanks for the response @ChristopheDuong. I'm working with incremental append, and what you say makes sense because there is no cursor/primary key; I agree with that. The problem is more that data is duplicated on every run, even when there is no new data. This generates thousands of rows on each sync, every 5 minutes, and re-running normalization on every sync is not efficient. Is this expected behavior? Shouldn't normalization run only on new data? For example, these 3 syncs with no data (0 bytes) each generated new rows in the nested table:
Is there a way to check at this line whether the sync finished without any new records, so that normalization is not re-run? Does that make sense? @ChristopheDuong airbyte/airbyte-workers/src/main/java/io/airbyte/workers/DefaultNormalizationWorker.java Line 47 in 2115f7a
No, that's not right. If there is even one record for a stream unrelated to your sub-stream in your connection, normalization would be triggered and rows would be appended to your un-nested table too. You should really look at custom transformations, where you can specify how to de-duplicate sub-streams.
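A toy model of this point, assuming a hypothetical connection-level check (this is not the code in `DefaultNormalizationWorker.java`): if normalization is skipped only when every stream in the connection read zero records, a single record in any other stream still triggers it. `should_run_normalization` and `other_stream` are made-up names; `forms` is the stream from the logs.

```python
def should_run_normalization(records_per_stream: dict) -> bool:
    """Hypothetical connection-level check: run if any stream received records."""
    return any(count > 0 for count in records_per_stream.values())


# "forms" read 0 records, but an unrelated stream in the same connection read 1.
sync_counts = {"forms": 0, "other_stream": 1}
run = should_run_normalization(sync_counts)
```

Under this assumption the check would avoid the duplicate rows only for syncs where the whole connection is empty, so it would not replace de-duplicating the sub-stream.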
Thanks for the response @ChristopheDuong. Sorry, I didn't understand well; I'm probably missing some context. Is there an example case where normalization needs to re-run with no new records found (not counting a data reset)?
Zendesk ticket #1758 has been linked to this issue.
Comment made from Zendesk by Marcos Marx on 2022-08-01 at 12:32:
Is this your first time deploying Airbyte: No
OS Version / Instance: Ubuntu 20.04 (2 vCPU ARM)
Memory / Disk: 8GB / 40GB
Deployment: Docker compose
Airbyte Version: 0.35.4-alpha
Source name/version: Custom connector
Destination name/version: MSSQL 0.1.13
Step: Sending nested array data for normalization from a custom connector to MSSQL using incremental append duplicates data in the nested tables on each sync run. Even when no records are found, new rows are added; the logs say "Read 0 records from forms stream".
Description: Even when no records are found, normalization is executed and duplicates the rows in the nested tables created from arrays.
From the Slack conversation:
Hello @Marcos Marx (Airbyte), yes, the data is very simple, something like this: [{id: "1", name: "test", sub_objects: [{name: "abcd"}]}]. It works perfectly with the main table, but for the sub_objects table, every sync with no extra data re-runs normalization, and that step duplicates data in the sub_objects table.
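The reported symptom can be reproduced with a simplified model, using the sample record above. This is a sketch of the behavior described in this issue, not Airbyte's normalization code: it assumes each normalization run un-nests the `sub_objects` array from the same raw data and appends the result to the nested table. The helper names and the `parent_id` column are hypothetical.

```python
raw_records = [{"id": "1", "name": "test", "sub_objects": [{"name": "abcd"}]}]


def unnest(records, array_field):
    """Flatten one array field into child rows that point back at the parent id."""
    rows = []
    for record in records:
        for item in record.get(array_field, []):
            rows.append({"parent_id": record["id"], **item})
    return rows


def run_normalization_append(raw, nested_table):
    """Append-only step: child rows are added without any de-duplication."""
    nested_table.extend(unnest(raw, "sub_objects"))


sub_objects_table = []
row_counts = []
for _ in range(3):  # three syncs that read 0 new records
    run_normalization_append(raw_records, sub_objects_table)
    row_counts.append(len(sub_objects_table))
```

In this model the nested table grows by one identical row per run even though the raw data never changes, which matches the growth seen in the sub_objects table.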