
Three-way merge #487

Open · 16 of 22 tasks

bobvawter opened this issue Sep 18, 2023 · 0 comments
Assignees: bobvawter
Labels: enhancement (New feature or request)

Comments

bobvawter (Member) commented Sep 18, 2023

This is the top-level tracking issue for work related to implementing a three-way merge operation.

Plan:

bobvawter added the "enhancement" label Sep 18, 2023
bobvawter self-assigned this Sep 18, 2023
bobvawter added a commit that referenced this issue Oct 6, 2023
This change is part of #487 to support three-way merges. It adds a before field
to types.Mutation and allows this data to be persisted.
bobvawter added a commit that referenced this issue Oct 6, 2023
This change is part of #487 to support three-way merges. It adds a before field
to types.Mutation and allows this data to be persisted. Code related to
gzipping the mutation data has been extracted to helpers to compress both the
before and data fields.
bobvawter added a commit that referenced this issue Oct 6, 2023
This change is part of #487 to support three-way merges.

This change allows the CDC handler to extract the `before` attribute from
incoming requests to populate the `Mutation.Before` field.

The `server/integration_test.go` is updated to configure a RetireOffset and
to inspect the staged mutation to ensure that the before value was recorded.
Once merging is actually implemented, we can return to this test to verify
end-to-end behavior.
bobvawter added a commit that referenced this issue Oct 6, 2023
This change is part of #487 to support three-way merges.

This change allows the CDC handler to extract the `before` attribute from
incoming requests to populate the `Mutation.Before` field. We use the new
`RetireOffset` setting to allow the staged data to be inspected by
`handler_test.go`.

The `server/integration_test.go` is similarly updated to inspect the staged
mutation to ensure that the before value was recorded.  Once merging is
actually implemented, we can return to this test to verify end-to-end behavior.
bobvawter added a commit that referenced this issue Oct 9, 2023
This change is part of #487 to support three-way merges.

This change allows the CDC handler to extract the `before` or `cdc_prev`
attributes from incoming requests to populate the `Mutation.Before` field. That
is, if a tabular changefeed is created with the `diff` option, or a query
changefeed includes the `cdc_prev` column, the before data will be persisted.

The CDC queries endpoints had not been previously tested in
`server/integration_test.go`, so there's a little bit of churn to cover some
unhandled edge cases around the presence or absence of the `diff` option.

This change uses the new-ish `RetireOffset` configuration to allow the staged
records to be inspected even though there may be concurrent resolved timestamps
being processed.

Uninteresting uses of json.NewDecoder() have been replaced with
json.Unmarshal(). Some error messages have been edited for clarity.
bobvawter added a commit that referenced this issue Oct 9, 2023
This change is part of #487 to support three-way merges.

This change removes support for storing apply configurations in the staging
database. The userscript is now integrated into all cdc-sink modes and provides
superior ergonomics. The upcoming merge function would have to be configured
through the userscript and does not make sense to persist. Furthermore, having
two distinct ways of accomplishing a task is confusing.

Breaking Change: Any deployments using table-based configuration of data
application behaviors must instead switch to the userscript for configuration.

The wiki pages have been scrubbed for any references to the table-based approach:
X-Ref: https://github.com/cockroachdb/cdc-sink/wiki/Data-Behaviors
X-Ref: https://github.com/cockroachdb/cdc-sink/wiki/User-Scripts
bobvawter added a commit that referenced this issue Oct 11, 2023
This change is part of #487 to support three-way merges.

This change adds support for declaring a user-defined merge function within the
userscript. The goja version is updated so that we can create a lightweight
wrapper around the ident.Map that will store the reified mutation values.
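The "lightweight wrapper" storing reified mutation values can be approximated in plain Go. This sketch assumes identifier lookups are matched case-insensitively, in the spirit of cdc-sink's `ident.Map`; the `PropMap` type and its methods are hypothetical, not the real API.

```go
package main

import (
	"fmt"
	"strings"
)

// PropMap is a hypothetical stand-in for the wrapper described above: it
// stores reified mutation values keyed by identifier, matching keys
// case-insensitively.
type PropMap struct {
	vals map[string]any
}

func NewPropMap() *PropMap {
	return &PropMap{vals: map[string]any{}}
}

// Put stores a value under a normalized form of the identifier.
func (m *PropMap) Put(key string, v any) {
	m.vals[strings.ToLower(key)] = v
}

// Get retrieves a value regardless of the caller's capitalization.
func (m *PropMap) Get(key string) (any, bool) {
	v, ok := m.vals[strings.ToLower(key)]
	return v, ok
}

func main() {
	m := NewPropMap()
	m.Put("UserName", "carl")
	v, ok := m.Get("username")
	fmt.Println(v, ok) // carl true
}
```

A wrapper with this shape is what the goja runtime would expose to the JavaScript merge function, so the userscript sees ordinary property access while the Go side keeps identifier semantics consistent.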
bobvawter added a commit that referenced this issue Oct 14, 2023
This change is part of #487 to support three-way merges.

This change updates the apply package to call into the merge function when
targeting CockroachDB or PostgreSQL. The conditional-upsert SQL is extended to
return the index of the conflicting data and the contents of the blocking row.
The blocking and conflicting row data are then used to drive the merge function.

The merge API added in PR #534 is refined. The relevant types are extracted
into their own package, which contains a new "Bag" type. A Bag holds reified
properties and can represent the data in a mutation or in a database row. It
additionally classifies properties as "mapped" or "unmapped" according to
whether or not the property maps onto a known column. Some of the bookkeeping
previously in the apply code to track missing or extra properties is simplified
by letting the Bag keep track of unexpected input properties.

The upsert code also becomes recursive. Mutations are reified into Bags and are
applied. If a Bag generates a conflict, the merge function will be called to
produce a Bag that will be unconditionally applied. Once all conflicts have been
resolved, the accumulated Bags will be upserted by a recursive call to the
upsert method. There is only ever one level of recursion.
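The mapped/unmapped classification can be sketched as a small Go type. Names here (`bag`, `newBag`, `put`) are illustrative only, assuming the behavior described above rather than the actual cdc-sink types.

```go
package main

import "fmt"

// bag is a minimal sketch of the "Bag" type described above: it holds
// reified properties and classifies each as mapped (backed by a known
// target column) or unmapped (an unexpected input property).
type bag struct {
	columns  map[string]bool // known target columns
	mapped   map[string]any
	unmapped map[string]any
}

func newBag(columns []string) *bag {
	b := &bag{
		columns:  map[string]bool{},
		mapped:   map[string]any{},
		unmapped: map[string]any{},
	}
	for _, c := range columns {
		b.columns[c] = true
	}
	return b
}

// put stores a property, routing it by whether it maps onto a column.
func (b *bag) put(prop string, v any) {
	if b.columns[prop] {
		b.mapped[prop] = v
	} else {
		b.unmapped[prop] = v
	}
}

func main() {
	b := newBag([]string{"id", "name"})
	b.put("id", 1)
	b.put("name", "carl")
	b.put("legacy_field", true) // unexpected input property
	fmt.Println(len(b.mapped), len(b.unmapped)) // 2 1
}
```

Because the Bag itself tracks which properties failed to map, the apply code no longer needs separate bookkeeping to detect extra input properties.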
bobvawter added a commit that referenced this issue Oct 16, 2023
This change is part of #487 to support three-way merges.

This change adds, but does not integrate, support for a per-target-schema DLQ.
cdc-sink, as a design rule, does not perform any schema changes in the target
database. If the user wants to use the DLQ, they will need to create the
destination table. The only requirements on the DLQ table are that it contain
certain well-known column names with appropriate types. A basic schema is
suggested by cdc-sink, and this suggested schema is used by the DLQ tests.

The justification for this hand-waving is that the DLQ becomes, in essence, a
part of the user's application and will likely need to be part of a
schema-management system. We cannot predict how the DLQ entries will be used,
indexed, etc. so integration with a minimum number of well-known columns seems
like it should give the user maximum flexibility.
bobvawter added a commit that referenced this issue Oct 18, 2023
This change is part of #487 to support three-way merges.

PR #540 and #543 were submitted separately, so the apply code did not support
writing to the DLQ. This change completes the wiring and allows conflicts to be
written to the queue.

The unexported apply.newApply() function was made a method on the factory type,
to decrease the number of arguments.
bobvawter added a commit that referenced this issue Oct 18, 2023
This change is part of #487 to support three-way merges.

This change adds merge.Standard and exposes it to the userscript. The merge
function identifies the properties which have changed between the before and
proposed bags. If the value of the property in the before bag equals the value
of the property in the target, the change is applied.

Property equivalency is presently defined as "serializes to the same JSON
bytes". The Go JSON serializer is deterministic and we have a rather fluid
type system, so this seems like a reasonable initial implementation.

A fallback merge function can be composed with merge.Standard to handle
properties with application-specific semantics. The script test shows the
composition of the standard merge with a counter-like field that is only ever
incremented. Properties which cannot be automatically merged are indicated by a
new Conflict.Unmerged field and corresponding userscript binding. This fallback
function can also be used to "merge or else" by using a trivial fallback that
always returns the name of a DLQ. This, too, is demonstrated in the test script.

The merge.Conflict.Existing field is renamed to Target. It either contains the
existing state of the row in the target database, or it contains the data that
merge.Standard determines should be stored in the target. The change in sort
order improves readability in the tests: Before, Proposed, Target -> Expected.
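The merge.Standard behavior described above can be sketched in a few lines of Go. This is a simplified model under the stated rules (changed-property detection, JSON-bytes equivalency, unmerged reporting); `standardMerge` and `jsonEqual` are illustrative names, not the real API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// jsonEqual mirrors the equivalency rule described above: two values are
// equal if they serialize to the same JSON bytes.
func jsonEqual(a, b any) bool {
	ab, aerr := json.Marshal(a)
	bb, berr := json.Marshal(b)
	return aerr == nil && berr == nil && bytes.Equal(ab, bb)
}

// standardMerge sketches merge.Standard: for each property changed between
// before and proposed, apply the change when the target still holds the
// before value; otherwise report the property as unmerged.
func standardMerge(before, proposed, target map[string]any) (merged map[string]any, unmerged []string) {
	merged = map[string]any{}
	for k, v := range target {
		merged[k] = v
	}
	for k, proposedVal := range proposed {
		beforeVal := before[k]
		if jsonEqual(beforeVal, proposedVal) {
			continue // property unchanged by the incoming mutation
		}
		if jsonEqual(beforeVal, target[k]) {
			merged[k] = proposedVal // clean three-way apply
		} else if !jsonEqual(proposedVal, target[k]) {
			unmerged = append(unmerged, k) // true conflict: all three differ
		}
	}
	return merged, unmerged
}

func main() {
	before := map[string]any{"id": 1, "name": "old", "count": 1}
	proposed := map[string]any{"id": 1, "name": "new", "count": 2}
	target := map[string]any{"id": 1, "name": "old", "count": 5} // count diverged
	merged, unmerged := standardMerge(before, proposed, target)
	fmt.Println(merged["name"], unmerged) // new [count]
}
```

In the real system a fallback merge function, composed after this standard pass, would get a chance to resolve the entries left in the unmerged list (for example, by re-deriving a counter) or to route the conflict to a DLQ.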
bobvawter added a commit that referenced this issue Oct 21, 2023
This change is part of #487 to support three-way merges.

This change adds merge.Standard and exposes it to the userscript. The merge
function identifies the properties which have changed between the before and
proposed bags. If the value of the property in the before bag equals the value
of the property in the target, the change is applied.

Property equivalency is presently defined as "serializes to the same JSON
bytes". The golang json serializer is deterministic and we have a rather fluid
typesystem, so this seems like a reasonable initial implementation.

A fallback merge function can be composed with merge.Standard to handle
properties with application-specific semantics. The script test shows the
composition of the standard merge with a counter-like field that is only ever
incremented. Properties which cannot be automatically merged are indicated by a
new Conflict.Unmerged field and corresponding userscript binding. This fallback
function can also be used to "merge or else" by using a trivial fallback that
always returns the name of a dlq. This, too, is demonstrated in the test script.

The merge.Conflict.Existing field is renamed to Target. It either contains the
existing state of the row in the target database, or it contains the data that
merge.Standard determines should be stored in the target. The change in sort
order improves readability in the tests: Before, Proposed, Target -> Expected.
bobvawter added a commit that referenced this issue Oct 27, 2023
This change is part of #487 to support three-way merges.

This change adds merge.Standard and exposes it to the userscript. The merge
function identifies the properties which have changed between the before and
proposed bags. If the value of the property in the before bag equals the value
of the property in the target, the change is applied.

Property equivalency is presently defined as "serializes to the same JSON
bytes". The golang json serializer is deterministic and we have a rather fluid
typesystem, so this seems like a reasonable initial implementation.

A fallback merge function can be composed with merge.Standard to handle
properties with application-specific semantics. The script test shows the
composition of the standard merge with a counter-like field that is only ever
incremented. Properties which cannot be automatically merged are indicated by a
new Conflict.Unmerged field and corresponding userscript binding. This fallback
function can also be used to "merge or else" by using a trivial fallback that
always returns the name of a dlq. This, too, is demonstrated in the test script.

The merge.Conflict.Existing field is renamed to Target. It either contains the
existing state of the row in the target database, or it contains the data that
merge.Standard determines should be stored in the target. The change in sort
order improves readability in the tests: Before, Proposed, Target -> Expected.
github-merge-queue bot pushed a commit that referenced this issue Oct 27, 2023
bobvawter added a commit that referenced this issue Oct 31, 2023
This change moves tracking of partially-applied resolved timestamp windows into
the staging tables by adding a new `applied` column. The goal of this change is
to move some state-tracking out of the cdc resolver loop into the stage
package. Tracking apply status on a per-mutation basis improves idempotency of
cdc-sink when the userscript has non-idempotent behaviors (e.g.: three-way
merge). It also allows us to export monitoring data around mutations which may
have slipped through the cracks or to detect when a migration process has
completely drained. Fine-grained tracking will also be useful for unifying the
non-transactional modes into a single behavior.

Many unused methods in the stage API have been deleted. The "unstaging" SQL
query is now generated with a Go template and is tested similarly to the
apply package.

The cdc package performs less work to track partial application of large
individual changes. It just persists the contents of the UnstageCursor as a
performance enhancement. Exactly-once behavior is provided by the applied
column.

The change to `server/integration_test.go` is due to the unstage processing
being a one-shot. The test being performed duplicates an existing test in
`cdc/handler_test.go`.

Breaking change: The `--selectBatchSize` flag is deprecated in favor of two
different flags `--largeTransactionLimit` and `--timestampWindowSize` which,
respectively, enable partial processing of a single, over-sized transaction and
a general limit on the total amount of data to be unstaged.

Breaking change: A staging schema migration is required; this is documented in
the migrations directory.

X-Ref: #487
X-Ref: #504
X-Ref: #565
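The bookkeeping that the `applied` column provides can be sketched in memory. The types below are hypothetical simplifications; the real implementation tracks this state in the staging tables via template-generated SQL.

```go
package main

import "fmt"

// staged models one row in a staging table. Applied mirrors the
// new column recording whether the mutation reached the target.
type staged struct {
	Key     string
	Time    int // simplified stand-in for an HLC timestamp
	Applied bool
}

// unstage selects at most limit un-applied mutations at or before
// the resolved timestamp and marks them applied, so replaying the
// same window is a no-op: the column, not the caller's cursor,
// provides the exactly-once behavior.
func unstage(table []staged, resolved, limit int) []string {
	var out []string
	for i := range table {
		m := &table[i]
		if m.Applied || m.Time > resolved || len(out) >= limit {
			continue
		}
		m.Applied = true
		out = append(out, m.Key)
	}
	return out
}

func main() {
	table := []staged{{"a", 1, false}, {"b", 2, false}, {"c", 5, false}}
	fmt.Println(unstage(table, 3, 10)) // first pass picks up a and b
	fmt.Println(unstage(table, 3, 10)) // replaying the window is empty
}
```

The `limit` parameter plays the role of the new window-size controls: an over-sized transaction can be drained across several passes without re-applying rows already marked.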
github-merge-queue bot pushed a commit that referenced this issue Nov 2, 2023
Labels
enhancement New feature or request