Non-transactional backfill path #56

Closed
bobvawter opened this issue Oct 6, 2021 · 2 comments · Fixed by #73

Comments

@bobvawter

There are related problems to consider:

  • initial_scan over large data sets
  • High update rates on the source cluster that overwhelm the target db because of the 3x write amplification of staging and then flushing on a resolved timestamp
  • High update rates or very large amounts of incoming data which exceed the maximum transaction size in the target database during resolved-timestamp flush operations.

These would seem to call for some kind of direct-feed approach, where cdc-sink bypasses the staging table and populates the target tables directly. For the initial_scan case, this should be relatively safe, as there is no expectation that the target database would be usable until the backfill is complete. For the other cases, we may want to allow the operator to place cdc-sink into a non-transactional, "catch up" mode to accommodate operational issues that may be encountered.
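To make the difference between the two write paths concrete, here is a minimal in-memory sketch. It is not cdc-sink's implementation: `SketchSink`, `Mutation`, and the method names are hypothetical, and plain Python containers stand in for the staging and target tables.

```python
from dataclasses import dataclass, field


@dataclass
class Mutation:
    key: str
    value: str
    ts: int  # logical timestamp of the change (hypothetical representation)


@dataclass
class SketchSink:
    """Hypothetical model of a sink with a staged path and a direct-feed path."""

    immediate: bool = False
    staging: list = field(default_factory=list)  # stands in for the staging table
    target: dict = field(default_factory=dict)   # stands in for the target table

    def on_mutation(self, m: Mutation) -> None:
        if self.immediate:
            # Direct feed: one write, straight to the target.
            # Ordering and transactional grouping are ignored in this sketch.
            self.target[m.key] = m.value
        else:
            # Staged path: the row is written once here, then written again on flush.
            self.staging.append(m)

    def on_resolved(self, resolved_ts: int) -> None:
        if self.immediate:
            return  # nothing is staged, so there is nothing to flush
        ready = sorted(
            (m for m in self.staging if m.ts <= resolved_ts),
            key=lambda m: m.ts,
        )
        for m in ready:
            self.target[m.key] = m.value
        self.staging = [m for m in self.staging if m.ts > resolved_ts]
```

In the staged path the target only changes when a resolved timestamp arrives; in the direct-feed path it changes on every incoming row, which is why the target should not be treated as consistent while that mode is active.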

@bobvawter

Thought for later: receiving a resolved timestamp need not actually flush all pending writes to the destination tables, but just record a target timestamp for an ambient, batching, flush process.

For applications that only need to arrive at an eventually-consistent state (i.e., the target db is just a standby cluster and is not serving fully-consistent queries), you just need to make sure that the flush process has completed up to the last resolved timestamp.

The current behavior of transactional flushes could be maintained by optionally eliminating any limits in the flush query.
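A rough sketch of that idea, under stated assumptions: `AmbientFlusher`, `batch_limit`, and the other names below are hypothetical, and in-memory containers stand in for the staging and target tables. A resolved timestamp only moves a marker, and a separate flush step applies at most `batch_limit` staged rows per call. Passing `batch_limit=None` removes the limit, so everything up to the marker is applied in one step, which corresponds to the current transactional behavior.

```python
class AmbientFlusher:
    """Hypothetical model: resolved timestamps set a goal, flushing happens in batches."""

    def __init__(self, batch_limit=None):
        self.batch_limit = batch_limit  # None means "no limit in the flush query"
        self.staging = []               # staged rows as (ts, key, value) tuples
        self.target = {}                # stands in for the target table
        self.target_ts = 0              # latest resolved timestamp received
        self.flushed_ts = 0             # all rows at or below this have been applied

    def stage(self, ts, key, value):
        self.staging.append((ts, key, value))

    def on_resolved(self, ts):
        # Record the goal only; no rows are flushed here.
        self.target_ts = max(self.target_ts, ts)

    def flush_once(self):
        """Apply one batch of staged rows and return how many were applied."""
        self.staging.sort()  # oldest rows first
        ready = sum(1 for row in self.staging if row[0] <= self.target_ts)
        n = ready if self.batch_limit is None else min(ready, self.batch_limit)
        batch, self.staging = self.staging[:n], self.staging[n:]
        for _ts, key, value in batch:
            self.target[key] = value
        if not any(row[0] <= self.target_ts for row in self.staging):
            self.flushed_ts = self.target_ts
        return len(batch)

    def caught_up(self):
        # True once the flush process has reached the last resolved timestamp.
        return self.flushed_ts == self.target_ts
```

A standby-style consumer would call `flush_once()` repeatedly (for example on a timer) and treat the target as usable once `caught_up()` returns true. Between batches the target holds a partial view of the resolved window, which is the trade-off for staying under the target's transaction size limit.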

@bobvawter

Work in #73 adds an immediate mode wherein data rows are immediately written to the target tables.

It might be interesting to enable this automatically for targets that have no previous resolved timestamp.
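One way that automatic selection could look, as a sketch only: `choose_mode`, `last_resolved`, and `force_immediate` are hypothetical names, not options from #73.

```python
from typing import Optional


def choose_mode(last_resolved: Optional[int], force_immediate: bool = False) -> str:
    """Pick a write mode for a target (hypothetical helper).

    last_resolved is the most recent resolved timestamp recorded for the
    target, or None if the target has never seen one (e.g. a fresh backfill).
    """
    if force_immediate or last_resolved is None:
        return "immediate"
    return "staged"
```

With this rule, a brand-new target starts in immediate mode for its initial scan, while a target that already has a resolved timestamp stays on the staged path unless the operator overrides it.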

@bobvawter bobvawter linked a pull request Dec 15, 2021 that will close this issue
@bobvawter bobvawter removed a link to a pull request Dec 15, 2021
@bobvawter bobvawter linked a pull request Dec 17, 2021 that will close this issue