
Destination MSSQL duplicate array records when using incremental append (nested tables) #9465

Open
marcosmarxm opened this issue Jan 13, 2022 · 8 comments


Is this your first time deploying Airbyte: No
OS Version / Instance: Ubuntu 20.04 (2 vCPU ARM)
Memory / Disk: 8GB / 40GB
Deployment: Docker compose
Airbyte Version: 0.35.4-alpha
Source name/version: Custom connector
Destination name/version: MSSQL 0.1.13
Step: Syncing nested array data from a custom connector to MSSQL using incremental append. Records in the nested tables are duplicated on each sync run; even when no records are found, new rows are added, and the logs say "Read 0 records from forms stream".
Description: Even when no records are found, normalization still executes and duplicates the rows in the nested tables generated from arrays.

From the Slack conversation:
Hello @Marcos Marx (Airbyte), yes, the data is very simple, something like this: [{id: "1", name: "test", sub_objects: [{name: "abcd"}]}]. It works perfectly for the main table, but for the sub_objects table, every sync with no extra data re-runs normalization, and at that step the data in the sub_objects table is duplicated.

marcosmarxm added the type/bug, area/connectors, and normalization labels on Jan 13, 2022
agrass (Contributor) commented Jan 13, 2022

Hi, thanks for creating the issue @marcosmarxm. I would like to help with this; any idea where I can start looking? Is it normal that normalization re-runs when no new data is fetched? Thanks.

ChristopheDuong (Contributor) commented Jan 13, 2022

Nested streams are not de-duplicated; see this comment in the normalization code:

# nested streams can't be deduped like their parents (as they may not share the same cursor/primary keys)

You could use a custom transformation, in which you can specify how to de-duplicate sub-streams.
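In practice, such a custom transformation often reduces to a window-function dedup. Here is a minimal T-SQL sketch, assuming the example data above lands in a nested table named forms_sub_objects, with _airbyte_forms_hashid linking back to the parent row and _airbyte_emitted_at as the sync timestamp (Airbyte's default metadata column naming); the partition keys are assumptions and should be whatever uniquely identifies a sub-record in your data:

```sql
-- Keep only the most recently emitted copy of each sub-object.
-- Table and column names are assumptions based on the example data above.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY _airbyte_forms_hashid, name  -- assumed natural key of the sub-object
            ORDER BY _airbyte_emitted_at DESC         -- latest sync wins
        ) AS row_num
    FROM forms_sub_objects
)
SELECT *
FROM ranked
WHERE row_num = 1;
```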

agrass (Contributor) commented Jan 13, 2022

Thanks for the response @ChristopheDuong. I'm working with incremental append, and what you say makes sense because there is no shared cursor/primary key, I agree. But the problem is more that data is duplicated on every run, even runs with no data. This generates thousands of rows on each sync, every 5 minutes, and re-running normalization on every sync is not very efficient. Is this expected behavior, or what do you think? Shouldn't normalization run only on new data?

For example, these 3 syncs with no data (0 bytes) each generated new rows in the nested table:
[screenshot: sync history, 2022-01-11 at 18:53]

agrass (Contributor) commented Jan 14, 2022

Is there a way I can check, here on this line, whether the process is a sync with no new records, so as to avoid re-running normalization? Does that make sense? @ChristopheDuong

ChristopheDuong (Contributor) commented:

No, that's not right.

If there is even one record for some other stream in your connection, unrelated to your sub-stream, normalization will be triggered and rows will be appended to your un-nested table too.

You should really look at custom transformations, where you can specify how to de-duplicate sub-streams:
https://docs.airbyte.com/operator-guides/transformation-and-normalization/transformations-with-airbyte
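Per that docs page, the dedup can live in a small dbt project that Airbyte runs as a custom transformation after each sync. A hedged sketch of such a model, reusing the query above (the model name, source definition, and key columns are illustrative assumptions, not code from the Airbyte repo):

```sql
-- models/forms_sub_objects_deduped.sql (hypothetical model name)
-- Materializes a deduplicated copy of the nested table after every sync.
{{ config(materialized='table') }}

with ranked as (
    select
        *,
        row_number() over (
            partition by _airbyte_forms_hashid, name  -- assumed natural key
            order by _airbyte_emitted_at desc
        ) as row_num
    from {{ source('airbyte_raw', 'forms_sub_objects') }}  -- assumes a matching sources.yml entry
)

select *
from ranked
where row_num = 1
```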

agrass (Contributor) commented Jan 14, 2022

Thanks for the response @ChristopheDuong. Sorry, I didn't understand fully; I'm probably missing some context. Is there an example case where normalization needs to re-run when no new records are found? Not counting when you reset your data.

marcosmarxm (Member, Author) commented:

Zendesk ticket #1758 has been linked to this issue.

marcosmarxm (Member, Author) commented:

Comment made from Zendesk by Marcos Marx on 2022-08-01 at 12:32:

Hello Jaafar, your issue is similar to #9465.
Currently you need to de-duplicate nested records yourself, probably by exporting the normalization project and executing the dedup there.
The main reason for this is:
# nested streams can't be deduped like their parents (as they may not share the same cursor/primary keys)
 
 
