Skip to content

Commit

Permalink
fslogical: Initial add of Firestore support.
Browse files Browse the repository at this point in the history
This change adds support for using Google Firestore as a logical replication
source.

Tests are run against the Firestore SDK Emulator, which is packaged into a
private Docker container, hosted within GitHub packages.
  • Loading branch information
bobvawter committed Sep 20, 2022
1 parent 4e2c340 commit 980fb5b
Show file tree
Hide file tree
Showing 22 changed files with 2,305 additions and 25 deletions.
7 changes: 6 additions & 1 deletion .github/docker-compose.yml
Original file line number Diff line number Diff line change
Expand Up @@ -29,13 +29,18 @@ services:
image: cockroachdb/cockroach-unstable:v22.2.0-alpha.3
network_mode: host
command: start-single-node --insecure --store type=mem,size=2G
firestore:
image: ghcr.io/cockroachdb/cdc-sink/firestore-emulator:latest
# Expose the emulator on port 8181 to avoid conflict with CRDB admin UI.
ports:
- "8181:8080"
mysql-v8:
image: mysql:8-debian
platform: linux/x86_64
environment:
MYSQL_ROOT_PASSWORD: SoupOrSecret
MYSQL_DATABASE: _cdc_sink
command:
command:
--default-authentication-plugin=mysql_native_password
--gtid-mode=on
--enforce-gtid-consistency=on
Expand Down
9 changes: 9 additions & 0 deletions .github/firestore/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Build from a Node base, add the required JDK, then install.
# https://firebase.google.com/docs/emulator-suite/install_and_configure
FROM node:alpine
RUN apk add openjdk11
RUN npm install -g firebase-tools
# Pre-cache the firestore emulator.
RUN firebase setup:emulators:firestore
COPY emulators.json .
CMD ["firebase", "emulators:start", "--only", "firestore", "-c", "emulators.json"]
5 changes: 5 additions & 0 deletions .github/firestore/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Firestore emulator container

This directory builds a container which runs the Google Cloud Firestore emulator. We save time in
each of the tests by pre-caching the jar files which the firebase cli will download. We also
configure the emulator to bind to all addresses instead of loopback.
7 changes: 7 additions & 0 deletions .github/firestore/emulators.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
{
"emulators": {
"firestore": {
"host": "0.0.0.0"
}
}
}
48 changes: 48 additions & 0 deletions .github/workflows/firestore.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Build the Firestore emulator docker image and push it to the GitHub
# package repository.
name: Firestore Emulator
permissions:
contents: read
packages: write
on:
push:
branches: [ master ]
paths:
- ".github/firestore/**"
pull_request:
paths:
- ".github/firestore/**"

env:
REGISTRY: ghcr.io
IMAGE_NAME: ghcr.io/${{ github.repository }}/firestore-emulator

jobs:
# https://docs.github.com/en/actions/publishing-packages/publishing-docker-images#publishing-images-to-github-packages
push:
name: Push Firestore Emulator to GH Packages
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: actions/checkout@v3

- name: Log in to GitHub Package Registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.IMAGE_NAME }}

- name: Build and push Docker image
uses: docker/build-push-action@v3
with:
context: ./.github/firestore
push: true
tags: ${{ env.IMAGE_NAME }}:latest
labels: ${{ steps.meta.outputs.labels }}
2 changes: 2 additions & 0 deletions .github/workflows/golang.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,7 @@ jobs:
# directory.
cockroachdb: [ v22.1 ]
integration:
- "firestore"
- "mysql-v8"
- "mysql-mariadb-v10"
- "postgresql-v11"
Expand All @@ -156,6 +157,7 @@ jobs:
- cockroachdb: v22.2 # Note this is currently pinned to alpha.3
env:
COVER_OUT: coverage-${{ matrix.cockroachdb }}-${{ matrix.integration }}.out
FIRESTORE_EMULATOR_HOST: 127.0.0.1:8181
JUNIT_OUT: junit-${{ matrix.cockroachdb }}-${{ matrix.integration }}.xml
TEST_OUT: go-test-${{ matrix.cockroachdb }}-${{ matrix.integration }}.json
steps:
Expand Down
120 changes: 120 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -561,6 +561,126 @@ SET GLOBAL gtid_slave_pos='0-1-1';
- Run `cdc-sink mylogical` with at least the `--sourceConn`, `--targetConn`
, `--defaultGTIDSet` and `--targetDB`. Set `--defaultGTIDSet` to the GTID state shown above.

## Google Firestore replication

`cdc-sink` supports [Google Cloud Firestore](https://cloud.google.com/firestore) as a source of
logical-replication data. It is very likely that a data model which is appropriate for a document
store will need signification revisions to be useful in a relational, tabular database. This
transformation can be accomplished within cdc-sink with a userscript (see the output
of `cdc-sink userscript` for details).

```text
Usage:
cdc-sink fslogical [flags]
Flags:
--applyTimeout duration the maximum amount of time to wait for an update to be applied (default 30s)
--backfillBatchSize int the number of documents to load when backfilling (default 10000)
--bytesInFlight int apply backpressure when amount of in-flight mutation data reaches this limit (default 10485760)
--credentials string a file containing JSON service credentials.
--docID ident the column name (likely the primary key) to populate with the document id (default id)
--fanShards int the number of concurrent connections to use when writing data in fan mode (default 16)
-h, --help help for fslogical
--loopName string identifies the logical replication loops in metrics (default "fslogical")
--metricsAddr string a host:port to serve metrics from at /_/varz
--projectID string override the project id contained in the credentials file
--retryDelay duration the amount of time to sleep between replication retries (default 10s)
--stagingDB ident a SQL database to store metadata in (default _cdc_sink)
--standbyTimeout duration how often to commit the consistent point (default 5s)
--targetConn string the target cluster's connection string
--targetDB ident the SQL database in the target cluster to update
--targetDBConns int the maximum pool size to the target cluster (default 1024)
--tombstoneCollection string the name of a collection that contains document Tombstones
--tombstoneCollectionProperty ident the property name in a tombstone document that contains the original collection name (default collection)
--tombstoneIgnoreUnmapped skip, rather than reject, any tombstone documents that do not map to a target table
--updatedAt ident the name of a document property used for high-water marks (default updated_at)
--userscript string the path to a configuration script, see userscript subcommand
Global Flags:
--logDestination string write logs to a file, instead of stdout
--logFormat string choose log output format [ fluent, text ] (default "text")
-v, --verbose count increase logging verbosity to debug; repeat for trace
```

Source document collections are mapped onto target tables within the destination database. A
document is the unit of atomicity when applying updates. Updates to collections are applied to their
target tables via independent replication loops. This is to say that code which consumes replicated
data from the target database may encounter skew, since there are no synchronization guarantees
provided by Firestore.

The use of an `extras` column will allow source documents with variable schemas to be stored in
a `JSONB` column, to facilitate future schema changes.

### Document requirements

All documents to be replicated must have a timestamp property which is set
to `FieldValue.serverTimestamp()` or its equivalent in your SDK of choice. By default, this property
is named `updated_at`, but the specific property name can be changed with the `--updatedAt` flag.
This server-assigned timestamp allows `cdc-sink` to behave in a resumable manner.

The source document ID will be provided as a property specified by the `--docID` flag, which
defaults to `id`. This property must be used as the primary key of the destination table. Future
re-keying of the replicated data is possible by adding a unique secondary index on columns with a
reasonable `DEFAULT` value, such as `gen_random_uuid()`.

### Configuring source collections

The use of Firestore as a replication source is configured through cdc-sink's userscript mechanism.
Top-level collections are configured by declaring a source and a target table or a dispatch
function. [Collection group queries](https://firebase.google.com/docs/firestore/query-data/queries#collection-group-query)
can be used by using a `group:` prefix with the collection name. If documents have sub-collections
with dynamic names, the `recurse` source option will iterate over each document's sub-collections
when the document is updated.

```typescript
// This module name is recognized by the userscript loader. A corresponding .d.ts file
// can be retrieved by running "cdc-sink userscript --api"
import * as api from "cdc-sink@v1";

// This call will map a top-level collection onto a single table. A simple
// mapping like this can work for "flat" documents that translate directly
// to SQL rows.
api.configureSource("my-collection", {target: "my_table"});

// This call creates a collection-group query. This is useful when
// documents contain a sub-collection with a common name.
api.configureSource("group:subcollection", {target: "other_table"});

// More complex mappings of documents to rows can be accomplished by using
// a dispatch function. This will recive the source document and return a
// mapping of destintation tables to, potentially, multiple rows.
api.configureSource("complex-collection", {
dispatch: (doc, meta) => {
return {
"parent_table": [{"some": "data"}],
"child_table": [{"child": 1}, {"child": 2}],
"join_table": [{}, {}, {}]
}
},
deletesTo: "parent_table",
})
```

### Hard deletion

It is not possible to receive notification of deleted documents when `cdc-sink` is not running. If
users require a "hard delete" solution, it is necessary to write tombstone documents to a separate
collection.

The structure of a tombstone is as follows:

```json
{
"id": "AABBCCDD",
"collection": "my-collection",
"updated_at": "2022-08-11T13:01:59Z"
}
```

The specific property names used in the tombstone document can be configured by the `--docID`
, `--tombstoneCollectionProperty`, and `--updatedAt` flags. The `--tombstoneCollection` flag enables
the behavior.

## Security Considerations

At a high level, `cdc-sink` accepts network connections to apply arbitrary mutations to the target
Expand Down
19 changes: 15 additions & 4 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,11 @@ module github.com/cockroachdb/cdc-sink
go 1.19

require (
cloud.google.com/go/firestore v1.6.1
github.com/cockroachdb/apd v1.1.0
github.com/cockroachdb/crlfmt v0.0.0-20220610162206-024b567ce87b
github.com/dop251/goja v0.0.0-20220815083517-0c74f9139fd6
github.com/evanw/esbuild v0.15.5
github.com/dop251/goja v0.0.0-20220915101355-d79e1b125a30
github.com/evanw/esbuild v0.15.8
github.com/go-mysql-org/go-mysql v1.6.0
github.com/gofrs/uuid v4.3.0+incompatible
github.com/golang-jwt/jwt/v4 v4.4.2
Expand All @@ -27,13 +28,17 @@ require (
github.com/spf13/pflag v1.0.5
github.com/stretchr/testify v1.8.0
golang.org/x/lint v0.0.0-20210508222113-6edffad5e616
golang.org/x/net v0.0.0-20220919171627-f8f703f97925
golang.org/x/net v0.0.0-20220920152717-4a395b0a80a1
golang.org/x/sync v0.0.0-20220907140024-f12130a52804
golang.org/x/tools v0.1.12
google.golang.org/api v0.96.0
google.golang.org/grpc v1.49.0
honnef.co/go/tools v0.3.3
)

require (
cloud.google.com/go v0.104.0 // indirect
cloud.google.com/go/compute v1.10.0 // indirect
github.com/BurntSushi/toml v1.2.0 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.1.2 // indirect
Expand All @@ -43,7 +48,10 @@ require (
github.com/dlclark/regexp2 v1.7.0 // indirect
github.com/go-sourcemap/sourcemap v2.1.3+incompatible // indirect
github.com/golang/protobuf v1.5.2 // indirect
github.com/google/go-cmp v0.5.9 // indirect
github.com/google/subcommands v1.2.0 // indirect
github.com/googleapis/enterprise-certificate-proxy v0.1.0 // indirect
github.com/googleapis/gax-go/v2 v2.5.1 // indirect
github.com/inconshreveable/mousetrap v1.0.1 // indirect
github.com/jackc/chunkreader/v2 v2.0.1 // indirect
github.com/jackc/pgio v1.0.0 // indirect
Expand All @@ -59,14 +67,17 @@ require (
github.com/shopspring/decimal v1.3.1 // indirect
github.com/siddontang/go v0.0.0-20180604090527-bdc77568d726 // indirect
github.com/siddontang/go-log v0.0.0-20190221022429-1e957dd83bed // indirect
go.opencensus.io v0.23.0 // indirect
go.uber.org/atomic v1.10.0 // indirect
golang.org/x/crypto v0.0.0-20220919173607-35f4265a4bc0 // indirect
golang.org/x/exp/typeparams v0.0.0-20220916125017-b168a2c6b86b // indirect
golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4 // indirect
golang.org/x/oauth2 v0.0.0-20220909003341-f21342109be1 // indirect
golang.org/x/sys v0.0.0-20220919091848-fb04ddd9f9c8 // indirect
golang.org/x/text v0.3.7 // indirect
golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 // indirect
google.golang.org/genproto v0.0.0-20220919141832-68c03719ef51 // indirect
google.golang.org/appengine v1.6.7 // indirect
google.golang.org/genproto v0.0.0-20220920164045-a2a065f3c118 // indirect
google.golang.org/protobuf v1.28.1 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)
Loading

0 comments on commit 980fb5b

Please sign in to comment.