
Investigate the effect of huge WAL logs on Postgres CDC source sync #5837

Closed
subodh1810 opened this issue Sep 3, 2021 · 4 comments

@subodh1810

A user reported that their Postgres source connector running in CDC mode was not able to fetch changes during incremental updates. Take a look at the following logs:

2021-09-02 08:23:46 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-02 08:23:46 INFO i.d.c.p.PostgresStreamingChangeEventSource(searchWalPosition):268 - {} - Searching for WAL resume position
2021-09-02 08:28:44 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-02 08:28:44 INFO i.a.i.d.i.DebeziumRecordIterator(computeNext):98 - {} - Closing cause next is returned as null

The two log lines are 5 minutes apart. We wait up to 5 minutes for Debezium to return a record; none were returned, so we shut down the sync. The last log line emitted by Debezium is the "Searching for WAL resume position" message above.

The user reported that they dropped and recreated the logical replication slot and now have successful syncs. The WAL was at ~250 GB before they dropped the slot.

Our guess is that 5 minutes may not be enough to find the resume position when the WAL grows too large.
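
If this comes up again, it is worth capturing how far behind the slot actually is before dropping it, by comparing the slot's restart_lsn with the current WAL position. A minimal JDBC sketch, assuming a slot named airbyte_slot and placeholder connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class WalBacklogCheck {
  public static void main(String[] args) throws Exception {
    // restart_lsn is how far back Postgres must keep WAL for this slot; the diff against
    // pg_current_wal_lsn() is the backlog Debezium has to work through on the next sync.
    String sql = "SELECT slot_name, "
        + "pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS backlog "
        + "FROM pg_replication_slots WHERE slot_name = ?";
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/mydb", "postgres", "postgres"); // assumed connection
         PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setString(1, "airbyte_slot"); // assumed slot name
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.printf("%s backlog: %s%n", rs.getString("slot_name"), rs.getString("backlog"));
        }
      }
    }
  }
}
```

pg_current_wal_lsn and pg_wal_lsn_diff are standard Postgres functions (10+), so the same query can also be run directly in psql.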

Slack ref: https://airbytehq.slack.com/archives/C01MFR03D5W/p1630662717338800?thread_ts=1630572024.268600&cid=C01MFR03D5W
logs-1338-0.txt

subodh1810 added the type/bug (Something isn't working) label Sep 3, 2021
sherifnada added the area/connectors (Connector related issues) label Sep 10, 2021
@sherifnada

Additional instance of this happening in https://github.com/airbytehq/oncall/issues/90:
logs-31085-0 (2).txt

@grishick

Exit criteria:

  • Test (could be manual)
  • Test results: how long it takes to read a 250 GB WAL
  • Test scenarios (a load-generation sketch follows below):
    1. The WAL is occupied with events from the table we are syncing
    2. The WAL is occupied with events from a table we are not syncing

Estimate: L (one sprint). Setting this up will take some time.
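
To set up either scenario, the WAL can be inflated with bulk inserts into either the synced table or an unrelated one, then measured with the slot backlog query from the first comment. A rough load-generation sketch, assuming a local Postgres and made-up table/connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WalLoadGenerator {
  public static void main(String[] args) throws Exception {
    // Scenario 1: point this at the table Airbyte syncs; Scenario 2: at an unrelated table.
    String table = args.length > 0 ? args[0] : "wal_filler"; // assumed table name
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/mydb", "postgres", "postgres"); // assumed connection
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS " + table
          + " (id bigserial PRIMARY KEY, payload text)");
      // Each batch inserts ~100 MB of row data (100k rows x 1 KB); ~2500 batches
      // approaches the ~250 GB backlog reported above.
      for (int batch = 0; batch < 2500; batch++) {
        stmt.execute("INSERT INTO " + table + " (payload) "
            + "SELECT repeat('x', 1024) FROM generate_series(1, 100000)");
      }
    }
  }
}
```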

@tuliren commented May 18, 2022

Observation

  • For WAL records from a syncing table, new incremental records are categorized as already processed and ignored; it seems that incremental syncs are not working (see the sketch after this list).
  • For WAL records from a non-syncing table, all incremental records are iterated through, which can lead to a timeout for the syncing table.
  • Minor: tons of INFO-level logs, either about irrelevant messages or about message processing.
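
The first bullet looks like an offset-comparison problem: if each event's LSN is compared against the LSN in the saved offset and everything at or below it is treated as already processed, new records get dropped whenever the saved LSN has moved ahead of where it should be. A purely illustrative sketch of that failure mode (the class and field names are made up, not the actual Airbyte or Debezium code):

```java
// Illustrative only: shows how comparing each event's LSN against the LSN in the saved
// offset can classify genuinely new events as "already processed" if the saved LSN ends
// up ahead of the events that still need to be emitted.
public class LsnFilter {
  private final long committedLsn; // LSN taken from the connector's saved offset/state

  public LsnFilter(long committedLsn) {
    this.committedLsn = committedLsn;
  }

  /** Returns true if the event should be emitted downstream. */
  public boolean isNew(long eventLsn) {
    return eventLsn > committedLsn; // events at or below the saved LSN are silently skipped
  }
}
```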

Postgres CDC Process

  • 16-18 seconds to get the first LSN.
  • 20 seconds to start processing messages.
  • Process 20K-30K irrelevant messages per second.
  • Process 4K relevant messages per second (see the back-of-the-envelope timing below).
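
Putting those rates against the 5-minute wait gives a rough upper bound on how much backlog a sync can work through before giving up. The rates below come from the measurements above; everything else is arithmetic:

```java
public class TimeoutBudget {
  public static void main(String[] args) {
    long windowSeconds = 5 * 60;          // the 5-minute wait before the sync gives up
    long irrelevantPerSecond = 25_000;    // midpoint of the observed 20K-30K/s
    long relevantPerSecond = 4_000;       // observed rate for relevant messages

    // ~7.5 million irrelevant messages fit in the window; a 250 GB WAL dominated by a
    // non-synced table can easily contain more change events than that before the first
    // relevant record, which matches the observed timeouts.
    System.out.printf("Irrelevant messages skippable in 5 min: ~%,d%n",
        windowSeconds * irrelevantPerSecond);
    System.out.printf("Relevant messages processable in 5 min: ~%,d%n",
        windowSeconds * relevantPerSecond);
  }
}
```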

Full notes: Doc

@tuliren commented Jun 2, 2022

The investigation is done. Here are the follow-up items:
