Support pglogical replication #64

Closed
7 tasks done
bobvawter opened this issue Nov 29, 2021 · 0 comments · Fixed by #91
bobvawter commented Nov 29, 2021

This is a tracking issue for consuming a logical replication feed from PostgreSQL as a source of row data.

Much of the necessary wire-protocol support is already present in jackc/pglogrepl.

In the desired end state, cdc-sink would support multiple frontends for row data, with a common backend that supports transactionally-consistent staging and reification of source data.

It's likely that a mapping/transform layer will exist in cdc-sink at some point in the future to perform on-the-fly schema or data-type adjustments. The design should accommodate this as part of the desired end state, but it will not be added in the initial implementation.

Plan needs (revise as necessary):

  • Add automated integration testing with CRDB
  • Source-code reorganization to clearly identify front- and back-half concerns
  • Clean up existing configuration ergonomics (pglogical is now a separate subcommand.)
  • Extend test-rig to also provide a matrix of postgres versions
  • Register and de-register a replication slot in a source database
  • Backfill and ingest
  • Checkpointing and error-recovery
  • Lease the feed across multiple cdc-sink instances (pglogical is now a separate subcommand.)
  • Fail usefully in the presence of incompatible upstream schema changes
@bobvawter bobvawter self-assigned this Nov 29, 2021
bobvawter added a commit that referenced this issue Dec 1, 2021
If there are multiple changes to the same row that have been staged, we only
want to apply the latest one, to avoid multiple updates to the same target row
when flushing a resolved timestamp. This is currently accomplished by having
cdc-sink read all staged changes in reverse-MVCC order and tracking which keys
have been touched.

This change moves the deduplication into CRDB to reduce the number of rows
returned to cdc-sink.  It is also a precursor to single-statement promotion in

The Line type is broken out with a Mutation base type that only holds the PK
and data to be upserted.  The timestamp information is unused by the UPSERT, so
we don't want to retrieve it.

X-Ref: #64
X-Ref: #70
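For readers unfamiliar with the approach, the client-side deduplication that this commit message describes as the previous behavior (read staged changes newest-first, track which keys have been touched) might look roughly like the sketch below. The field names, the `latestPerKey` helper, and the timestamp representation are assumptions made for illustration and are not copied from cdc-sink; only the `Line`/`Mutation` split follows the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// Mutation holds only what an UPSERT needs: the primary key and the data.
// The field names are hypothetical.
type Mutation struct {
	Key  string
	Data string
}

// Line is a staged change: a Mutation plus its MVCC timestamp.
// The two-part timestamp here is an assumption for illustration.
type Line struct {
	Mutation
	Nanos   int64
	Logical int
}

// latestPerKey returns the newest Mutation for each key. It walks the staged
// lines in reverse-MVCC (newest first) order and skips keys already seen.
func latestPerKey(staged []Line) []Mutation {
	sorted := make([]Line, len(staged))
	copy(sorted, staged)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Nanos != sorted[j].Nanos {
			return sorted[i].Nanos > sorted[j].Nanos
		}
		return sorted[i].Logical > sorted[j].Logical
	})

	seen := make(map[string]struct{}, len(sorted))
	var out []Mutation
	for _, l := range sorted {
		if _, dup := seen[l.Key]; dup {
			continue // an older change to a key that was already emitted
		}
		seen[l.Key] = struct{}{}
		out = append(out, l.Mutation)
	}
	return out
}

func main() {
	staged := []Line{
		{Mutation: Mutation{Key: "a", Data: "v1"}, Nanos: 1},
		{Mutation: Mutation{Key: "b", Data: "v1"}, Nanos: 2},
		{Mutation: Mutation{Key: "a", Data: "v2"}, Nanos: 3},
	}
	fmt.Println(latestPerKey(staged)) // prints [{a v2} {b v1}]
}
```

Doing the same filtering inside the database, as the commit proposes, avoids shipping the superseded rows to the client at all.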
bobvawter added a commit that referenced this issue Dec 3, 2021
bobvawter added a commit that referenced this issue Dec 6, 2021
This change defers reifying the contents of a Line until the actual upsert into
a table. The goal is to improve efficiency and to make it simpler to generate
batches of lines from sources other than a CockroachDB CDC feed.

Related: #64
@bobvawter bobvawter linked a pull request Feb 24, 2022 that will close this issue