
Investigate the effect of huge WAL logs on Postgres CDC source sync #5837

Closed
subodh1810 opened this issue Sep 3, 2021 · 4 comments

@subodh1810

A user reported that their Postgres source connector running in CDC mode was not able to fetch changes during incremental updates. Take a look at the following logs:

2021-09-02 08:23:46 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-02 08:23:46 INFO i.d.c.p.PostgresStreamingChangeEventSource(searchWalPosition):268 - {} - Searching for WAL resume position
2021-09-02 08:28:44 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-02 08:28:44 INFO i.a.i.d.i.DebeziumRecordIterator(computeNext):98 - {} - Closing cause next is returned as null

The two log lines are 5 minutes apart. We wait up to 5 minutes for Debezium to return a record; none were returned, so we shut down the sync. The last log line emitted by Debezium is the "Searching for WAL resume position" message above.

The user reported that they dropped and recreated the logical replication slot and now have successful syncs. The WAL was at ~250 GB before they dropped the slot.

Our guess is that 5 minutes may not be enough to find the resume position when the WAL grows too large.
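
If this comes up again, it is worth capturing how far behind the slot actually is before dropping it, by comparing the slot's restart_lsn with the current WAL position. A minimal JDBC sketch, assuming a slot named airbyte_slot and placeholder connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class WalBacklogCheck {
  public static void main(String[] args) throws Exception {
    // restart_lsn is how far back Postgres must keep WAL for this slot; the diff against
    // pg_current_wal_lsn() is the backlog Debezium has to work through on the next sync.
    String sql = "SELECT slot_name, "
        + "pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS backlog "
        + "FROM pg_replication_slots WHERE slot_name = ?";
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/mydb", "postgres", "postgres"); // assumed connection
         PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setString(1, "airbyte_slot"); // assumed slot name
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.printf("%s backlog: %s%n", rs.getString("slot_name"), rs.getString("backlog"));
        }
      }
    }
  }
}
```

pg_current_wal_lsn and pg_wal_lsn_diff are standard Postgres functions (10+), so the same query can also be run directly in psql.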

Slack ref: https://airbytehq.slack.com/archives/C01MFR03D5W/p1630662717338800?thread_ts=1630572024.268600&cid=C01MFR03D5W
logs-1338-0.txt

subodh1810 added the type/bug (Something isn't working) label Sep 3, 2021
sherifnada added the area/connectors (Connector related issues) label Sep 10, 2021
@sherifnada

Additional instance of this happening in https://github.com/airbytehq/oncall/issues/90:
logs-31085-0 (2).txt

@grishick

Exit criteria:

  • Test (could be manual)
  • Test results: how long it takes to read a 250 GB WAL
  • Test scenarios (a load-generation sketch follows below):
    1. The WAL is occupied with events from the table we are syncing
    2. The WAL is occupied with events from a table we are not syncing

Estimate: L (one sprint). Setting this up will take some time.
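
To set up either scenario, the WAL can be inflated with bulk inserts into either the synced table or an unrelated one, then measured with the slot backlog query from the first comment. A rough load-generation sketch, assuming a local Postgres and made-up table/connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WalLoadGenerator {
  public static void main(String[] args) throws Exception {
    // Scenario 1: point this at the table Airbyte syncs; Scenario 2: at an unrelated table.
    String table = args.length > 0 ? args[0] : "wal_filler"; // assumed table name
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/mydb", "postgres", "postgres"); // assumed connection
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS " + table
          + " (id bigserial PRIMARY KEY, payload text)");
      // Each batch inserts ~100 MB of row data (100k rows x 1 KB); ~2500 batches
      // approaches the ~250 GB backlog reported above.
      for (int batch = 0; batch < 2500; batch++) {
        stmt.execute("INSERT INTO " + table + " (payload) "
            + "SELECT repeat('x', 1024) FROM generate_series(1, 100000)");
      }
    }
  }
}
```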

@tuliren commented May 18, 2022

Observation

  • For WAL records from a syncing table, new incremental records are categorized as already processed and ignored; it seems that incremental syncs are not working (see the sketch after this list).
  • For WAL records from a non-syncing table, all incremental records are iterated through, which can lead to a timeout for the syncing table.
  • Minor: tons of INFO-level logs, either about irrelevant messages or about message processing.
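
The first bullet looks like an offset-comparison problem: if each event's LSN is compared against the LSN in the saved offset and everything at or below it is treated as already processed, new records get dropped whenever the saved LSN has moved ahead of where it should be. A purely illustrative sketch of that failure mode (the class and field names are made up, not the actual Airbyte or Debezium code):

```java
// Illustrative only: shows how comparing each event's LSN against the LSN in the saved
// offset can classify genuinely new events as "already processed" if the saved LSN ends
// up ahead of the events that still need to be emitted.
public class LsnFilter {
  private final long committedLsn; // LSN taken from the connector's saved offset/state

  public LsnFilter(long committedLsn) {
    this.committedLsn = committedLsn;
  }

  /** Returns true if the event should be emitted downstream. */
  public boolean isNew(long eventLsn) {
    return eventLsn > committedLsn; // events at or below the saved LSN are silently skipped
  }
}
```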

Postgres CDC Process

  • 16-18 seconds to get the first LSN.
  • 20 seconds to start processing messages.
  • Process 20K-30K irrelevant messages per second.
  • Process 4K relevant messages per second (see the back-of-the-envelope timing below).
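
Putting those rates against the 5-minute wait gives a rough upper bound on how much backlog a sync can work through before giving up. The rates below come from the measurements above; everything else is arithmetic:

```java
public class TimeoutBudget {
  public static void main(String[] args) {
    long windowSeconds = 5 * 60;          // the 5-minute wait before the sync gives up
    long irrelevantPerSecond = 25_000;    // midpoint of the observed 20K-30K/s
    long relevantPerSecond = 4_000;       // observed rate for relevant messages

    // ~7.5 million irrelevant messages fit in the window; a 250 GB WAL dominated by a
    // non-synced table can easily contain more change events than that before the first
    // relevant record, which matches the observed timeouts.
    System.out.printf("Irrelevant messages skippable in 5 min: ~%,d%n",
        windowSeconds * irrelevantPerSecond);
    System.out.printf("Relevant messages processable in 5 min: ~%,d%n",
        windowSeconds * relevantPerSecond);
  }
}
```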

Full notes: Doc

@tuliren commented Jun 2, 2022

The investigation is done. Here are the follow-up items:
