Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Firestore #191

Closed
wants to merge 5 commits into from
Closed

Firestore #191

wants to merge 5 commits into from

Conversation

bobvawter
Copy link
Member

@bobvawter bobvawter commented Jul 27, 2022

This PR contains a sequence of stacked commits that add support for consuming documents from Google Firestore. The goja JavaScript runtime is incorporated, since input documents may require arbitrarily complex mapping onto a relational schema. The core logical-loop code is also augmented with a backfill mode so that documents can be consumed in a page-oriented fashion before switching to streaming subscriptions. This code is in use at a customer and they consider it to be feature-complete with respect to their use case. It is also expected that this work will make subsequent document stores easy to add.

The cdc-sink@v1.d.ts file documents the userscript API and shows the end-point of this PR.

This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Jul 27, 2022

Codecov Report

Base: 67.08% // Head: 64.18% // Decreases project coverage by -2.89% ⚠️

Coverage data is based on head (980fb5b) compared to base (f8d5c57).
Patch coverage: 54.89% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #191      +/-   ##
==========================================
- Coverage   67.08%   64.18%   -2.90%     
==========================================
  Files          92      111      +19     
  Lines        5854     7441    +1587     
==========================================
+ Hits         3927     4776     +849     
- Misses       1610     2219     +609     
- Partials      317      446     +129     
Impacted Files Coverage Δ
internal/cmd/fslogical/fslogical.go 0.00% <0.00%> (ø)
internal/script/help_command.go 0.00% <0.00%> (ø)
internal/script/options.go 0.00% <0.00%> (ø)
internal/source/fslogical/jsonload/main.go 0.00% <0.00%> (ø)
internal/source/logical/dialect.go 100.00% <ø> (ø)
internal/source/mylogical/config.go 16.04% <0.00%> (ø)
internal/source/mylogical/provider.go 73.91% <ø> (-2.09%) ⬇️
internal/source/pglogical/config.go 9.52% <0.00%> (ø)
internal/source/pglogical/provider.go 52.72% <ø> (-1.66%) ⬇️
internal/target/sinktest/integration.go 71.42% <ø> (ø)
... and 46 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@bobvawter bobvawter marked this pull request as ready for review July 27, 2022 17:52
Copy link
Contributor

@sravotto sravotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 32 of 32 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @bobvawter and @BramGruneir)

a discussion (no related file):
Could clarify the restrictions on the target table (if any). It seems to me that docs in a collection need to have the same fields. Perhaps I'm missing something.



-- commits line 40 at r1:
Are we assuming that all the documents in the collection have the same fields?


internal/cmd/fslogical/fslogical.go line 68 at r1 (raw file):

}

// metricsServer starts a trivial prometheus endpoint server which runs

I think we should move this to util/metrics (perhaps in a separate PR). We have this code repeated all over.


internal/source/fslogical/integration_test.go line 27 at r1 (raw file):

)

func TestMain(m *testing.M) {

Do we need tests for different data types?

Copy link
Member Author

@bobvawter bobvawter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @bobvawter, @BramGruneir, and @sravotto)


-- commits line 40 at r1:

Previously, sravotto (silvano) wrote…

Are we assuming that all the documents in the collection have the same fields?

For the specific use-case this is being built for, yes. That said, it might be interesting to allow the source document to be inserted as a JSONB column. If there's an extras column, then we could lift the default restriction in the apply package that all properties of a mutation must map onto (at least one) database column. Operators could perform schema changes at their leisure to unpack the extras column into whatever schema ultimately makes sense.


internal/cmd/fslogical/fslogical.go line 68 at r1 (raw file):

Previously, sravotto (silvano) wrote…

I think we should move this to util/metrics (perhaps in a separate PR). We have this code repeated all over.

Agreed, that's #117

@bobvawter bobvawter force-pushed the firestore branch 10 times, most recently from 55b8169 to 07201d8 Compare August 9, 2022 19:17
@bobvawter bobvawter force-pushed the firestore branch 5 times, most recently from 909101e to 37e9a95 Compare August 16, 2022 17:17
Copy link
Member

@BramGruneir BramGruneir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 20 files at r5, 41 of 42 files at r6, 4 of 4 files at r7, all commit messages.
Reviewable status: 51 of 52 files reviewed, 16 unresolved discussions (waiting on @bobvawter, @BramGruneir, and @sravotto)


README.md line 566 at r7 (raw file):

## Google Firestore replication

`cdc-sink` supports [Google Cloud Firestore](https://cloud.google.com/firestore) as a source of

Please add a note about how only a single instance of cdc-sink should be used when handling FS replication.


internal/source/fslogical/config.go line 54 at r6 (raw file):

	UpdatedAtProperty ident.Ident

	docIDTemp        string

What are these temps needed for? Ah, I see. I thought pflag allowed you to add arbitrary code to convert/validate.


internal/source/fslogical/consistent_point.go line 66 at r6 (raw file):

	o := other.(*consistentPoint)

	tt := t.AsTime()

can you just rename the variables to make it a bit more readable?


internal/source/fslogical/fslogical.go line 35 at r6 (raw file):

	cfg  *loopConfig
	fs   *firestore.Client
	st   *Tombstones

tomb? Why st?


internal/source/fslogical/fslogical.go line 66 at r6 (raw file):

	ctx context.Context, ch chan<- logical.Message, state logical.State,
) error {
	// From will either be the update time of the last streamed record,

What is "From"? I understand the concept obviously, but there is no from variable.


internal/source/fslogical/fslogical.go line 102 at r6 (raw file):

	ctx context.Context, ch chan<- logical.Message, now time.Time, cp *consistentPoint,
) (_ *consistentPoint, moreWork bool, _ error) {

nit: empty line


internal/source/fslogical/fslogical.go line 111 at r6 (raw file):

		Limit(d.cfg.BackfillBatch)
	if !cp.IsZero() {
		if cp.AsID() == "" {

What's the logic here? Can you add a comment?


internal/source/fslogical/injector.go line 25 at r6 (raw file):

)

// Start creates a PostgreSQL logical replication loop using the

Firestore


internal/source/fslogical/tombstones.go line 63 at r6 (raw file):

//
// This implementation assumes that document ids within a single
// collection are not recycled.

Is this a valid assumption? It should be possible to remove a tombstone if a new mutation was previously deleted.


internal/source/fslogical/tombstones.go line 102 at r6 (raw file):

) error {
	cp, _ := state.GetConsistentPoint().(*consistentPoint)
	q := t.coll.

Is this sorting necessary?


internal/source/fslogical/jsonload/main.go line 53 at r6 (raw file):

	eg, egCtx := errgroup.WithContext(ctx)

	ch := make(chan map[string]interface{}, *workers)

Can you rename these variables to something clearer? Ch is channel, sure but for what? Ct is what? Count?


internal/source/logical/loop.go line 201 at r6 (raw file):

			// from backfill to streaming modes) or if the outer
			// context is being shut down.
			if err == nil {

Why split this up?

@bobvawter bobvawter force-pushed the firestore branch 5 times, most recently from c2cffd1 to 12add65 Compare August 25, 2022 21:37
@bobvawter bobvawter force-pushed the firestore branch 3 times, most recently from d8f0411 to 2e03569 Compare August 31, 2022 17:47
@bobvawter bobvawter force-pushed the firestore branch 3 times, most recently from 88290ee to a3d1202 Compare September 1, 2022 18:27
@bobvawter bobvawter force-pushed the firestore branch 3 times, most recently from 3dbd2b3 to 5a546ba Compare September 14, 2022 19:12
This change adds a proper ParseIdent function and reworks the Relative function
to be implemented in terms of ParseIdent. ParseIdent consumes possibly-quoted
sequences of UTF8 data to produce an Ident.

The motivation for this change is an upcoming use-case that make use of quoted
schema names that contain multi-byte characters.
This change allows an Ident to be used as the target of a CLI flag. It also
simplifies the existing logical/config.go.
This change adds the goja JavaScript runtime to cdc-sink in order to support
arbitrary user-defined behavior at runtime. The motivating use case is to be
able to support migrations from document stores where the source document
schema does not have a direct mapping to a tabular schema. Consider the case of
a document with nested objects or collections; a single document may expand
into multiple rows in several tables.

The userscript is washed through esbuild to convert the source to
ES5-compatible, CommonJS-packaged scripts. This has the nice side-effect of
allowing the userscript to be written as TypeScript, for which a .d.ts binding
is provdided. A top-level CLI command is provided to print the bindings file
and additional help related to userscripts.

Future work and notes:
* Execution through the userscript is single-threaded. If this presents a
  limitation on ultimate throughput, it would be straightforward to have
  multiple, independent JS runtimes.
* Access to the target database could be added in the future, but is not
  currently part of the user API.
* An experiment to bundle the V8 runtime was performed, but we wound up
  spending as much time marshaling data over the cgo barrier as executing the
  user code.

h/t: @sjbarag for feedback on TypeScript API.
This change prepares for the Firestore integration by allowing dialects to
implement a non-transaction backfill mode which then switches to the existing,
fully-transactional mode.

The Config type is broken out into an interface to allow all CLI flags and
configuration to be driven via the user-script.

If configured, the user-script is called via a shim around the existing
logical.Events interface. The logical-replication source is added to
Events.OnData to facilitate routing within the user-script.
This change adds support for using Google Firestore as a logical replication
source.

Tests are run against the Firestore SDK Emulator, which is packaged into a
private Docker container, hosted within GitHub packages.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants