Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 CDC Sources to normalized destinations: deleted_at field is not getting updated for deleted records. #8830

Closed
PiyushBajaj-7 opened this issue Dec 16, 2021 · 20 comments · Fixed by #12846

Comments

@PiyushBajaj-7
Copy link

Is this your first time deploying Airbyte: Yes
OS Version / Instance: Compute engine n2-highmem-8
Deployment: Docker
Airbyte Version: 0.33.12-alpha
Source name/version: MySql 0.5.1
Destination name/version: BigQuery 0.5.0

@alafanechere
Copy link
Contributor

Hi @PiyushBajaj-7 could you please provide more details about your sync method, I guess you are using CDC in MySQL. A screenshot of your MySQL config + of your connection setup would be great. Moreover please detail the current behavior, the expected behavior and bug reproduction step. It will help us investigate your problem accurately!

@alafanechere alafanechere changed the title MySQL (Deduped+history) to Bigquery: Deleted records are not getting marked in final table. _ab_cdc_deleted_at field is not getting updated for deleted records. Please could you look into the issue. 🐛 CDC MySQL (Deduped+history) to Bigquery: _ab_cdc_deleted_at field is not getting updated for deleted records. Dec 16, 2021
@alafanechere alafanechere added area/connectors Connector related issues cdc and removed needs-triage labels Dec 16, 2021
@alafanechere alafanechere changed the title 🐛 CDC MySQL (Deduped+history) to Bigquery: _ab_cdc_deleted_at field is not getting updated for deleted records. 🐛 CDC MySQL (Deduped+history) to Bigquery: deleted_at field is not getting updated for deleted records. Dec 16, 2021
@burhanuddinhusain
Copy link

Hi Team - Yes, we are using CDC method. We have created a pipeline to sync data between mysql to bigquery every 15 minutes. Inserts and Updates are syncing properly on bigquery side whenever there are any changes on mysql side. However, deletes are not reflecting in final bigquery table. Delete changes reflects in the temp table created by airbyte but doesn't reflect in the final table. Please could you look into the issue. I will attach some pictures which will help

@burhanuddinhusain
Copy link

Hi Team - Yes, we are using CDC method. We have created a pipeline to sync data between mysql to bigquery every 15 minutes. Inserts and Updates are syncing properly on bigquery side whenever there are any changes on mysql side. However, deletes are not reflecting in final bigquery table. Delete changes reflects in the temp table created by airbyte but doesn't reflect in the final table. Please could you look into the issue. I will attach some pictures which will help

image

@alafanechere
Copy link
Contributor

@subodh1810 are you aware of this bug or a similar one? Is it MySQL specific or might be a concern for other CDC sources?

@neethuthomas04
Copy link

Hey, Thanks for the raising this issue - we're are also facing the same issue with MySQL -> BigQuery delete
image

@burhanuddinhusain
Copy link

Hi Team - Do we have any updates on the issue?

@alafanechere
Copy link
Contributor

Hi @burhanuddinhusain, our team prioritized this issue for scoping, as you can see in the above notification.

@satyadeepk
Copy link

We are seeing this issue with Postgres CDC Source as well. So it might not be specific to MySQL as such.

@ZeroStack
Copy link

We are also experiencing this issue with Postgres CDC source to BQ for deduped history.

@marcosmarxm
Copy link
Member

@satyadeepk and @ZeroStack can you check the _raw_tables Airbyte had created and check if the _ab_cdc_deleted_at has data?

@satyadeepk
Copy link

@marcosmarxm Yes I can see records in _raw_tables with _ab_cdc_deleted_at not null. But the same is not reflected in the final table.

Example: "_ab_cdc_deleted_at":"2022-01-19T14:44:09.049Z"

@nekprj
Copy link

nekprj commented Apr 2, 2022

I also got similar issue #11256

@ordonezf
Copy link

ordonezf commented Apr 5, 2022

I have the same issue as @nekprj, I'm on version 0.35.58-alpha with postgres source 0.4.9 and postgres destination 0.3.15.

  • In _aibyte_raw_ I can see the deleted row.
  • In the _scd table the record seems deleted and there is no _airbyte_active_row set to 1 for that entry history.
  • But in the actual table I can still see the record I deleted and the _ab_cdc_deleted_at.

Since the data arrives correctly to the destination maybe there is a bug in the dbt script that generates the final table?

@ameyabhope
Copy link

Observing the same issue for 0.35.56-alpha with postgres source 0.4.9 and bigquery destination.

Would be happy to contribute if pointed in the right direction.

@azhard
Copy link
Contributor

azhard commented Apr 21, 2022

Any updates here? This is a huge blocker for my team right now as the main reason we're moving to CDC replication is to capture deletions.

@omri-shaiko
Copy link

We are also experiencing this issue with MySQL CDC source to Snowflake for deduped history.
Airbyte version is: 0.36.2-alpha

@alafanechere
Copy link
Contributor

@ameyabhope you mentioned on Discourse that you'd want to work on this fix. Could you please sync with @edgao to make sure your work does not overlap? Thanks!

@edgao
Copy link
Contributor

edgao commented May 6, 2022

summary from talking with @ameyabhope (thanks for working with me on this, it was super helpful!):

so the approach I'm planning:

  • modify stream_processor to also fetch deleted records from the current normalization run
  • add a post-hook (here) to delete records from the final table with active_row = 0
  • figure out how to exclude the deleted_at column from generate_final_model

Exit criteria: (@grishick fyi)

  • records which were deleted in the source (i.e. have a non-null _ab_cdc_deleted_at column in the SCD table) are deleted from the final table
  • the final table no longer has an _ab_cdc_deleted_at column, since it will always be null

@grishick
Copy link
Contributor

grishick commented May 16, 2022

@edgao
Copy link
Contributor

edgao commented May 16, 2022

@grishick yep, same thing.

@edgao edgao changed the title 🐛 CDC MySQL (Deduped+history) to Bigquery: deleted_at field is not getting updated for deleted records. 🐛 CDC Sources to normalized destinations: deleted_at field is not getting updated for deleted records. May 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment