Derivation Connectors + SQL MegaPR #985

Merged
merged 28 commits into from
Apr 5, 2023

jgraettinger
Member

@jgraettinger jgraettinger commented Mar 27, 2023

Description:

This is a substantial refactoring of the Flow protocols and runtime,
which introduces Derivation Connectors as a threaded-through concept.

It also refactors our capture and materialization protocols so that captures, derivations, and materializations are logically cohesive and so that our JSON and gRPC protocols have identical semantics and message representations.

There are many other included changes; see individual comments for more granular updates.


@jgraettinger jgraettinger force-pushed the johnny/more-sql branch 14 times, most recently from 1b18078 to 727a543 Compare March 31, 2023 05:44
Remove Shuffle as a persisted message.
It's now computed as needed from constituent parts.

Remove DerivationSpec as a top-level task type.
Instead, nest it within CollectionSpec.

Remove remaining aspects of V1 derivations.

Add NetworkPort.

Rename message fields to consolidate gRPC and JSON into unified
protocol.
Capture protocol is now a single bidi stream of Request / Response.

Push-based captures are removed entirely (replaced with connector
networking feature).

Client is simplified -- it handles only a pull-based workflow now,
not push as well.
Materialization protocol is now a single bidi stream of Request /
Response.

The major structure of the protocol is unchanged, but messages (notably
Load and Store) are simplified to present a unified gRPC / JSON
protocol.
Runtime is a set of internal protocol extensions used by the Flow runtime.
It homes all message types which are not part of Flow's public protocol
interface.

We will eventually move more message types, like Shuffle and
JournalShuffle, into this package.
Introduce new derivation protocol, structured as a bidi stream of
Request / Response.
Add a consolidated protobuf package for ops Log and Stat message types,
with a JSON serialization which matches our current production
messages.

Move ShardLabeling to this package and re-export from go/labels, to
avoid a circular dependency.
Regenerate all messages for their source protobuf updates.

Introduce a build.rs mechanism to generate canonical serde
serializations for all protobuf types that matches the Go-side behavior,
and handles JSON embedded within protobuf messages (again, in a manner
that matches our use of json.RawMessage from Go types).

This lets us have single message types that are shared between Go and
Rust, and which have canonical protobuf and JSON encodings, which is now
the basis for our JSON protocols.
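
As a rough illustration of the embedded-JSON handling described above (not the generated code itself), the Go sketch below round-trips a document through a type with a `json.RawMessage` field. `ExampleSpec`, its fields, and `roundTrip` are hypothetical names chosen for this example; only the use of `json.RawMessage` comes from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExampleSpec is a hypothetical message type, not one of Flow's real types.
// Config holds raw JSON bytes, so it is embedded in the output as JSON
// rather than being re-encoded as an escaped string.
type ExampleSpec struct {
	Name   string          `json:"name"`
	Config json.RawMessage `json:"config"`
}

// roundTrip decodes a document into ExampleSpec and encodes it again.
func roundTrip(doc string) (string, error) {
	var spec ExampleSpec
	if err := json.Unmarshal([]byte(doc), &spec); err != nil {
		return "", err
	}
	out, err := json.Marshal(spec)
	return string(out), err
}

func main() {
	out, err := roundTrip(`{"name":"acme/example","config":{"b":1,"a":[true,null]}}`)
	if err != nil {
		panic(err)
	}
	// The nested config keeps its original key order because it is never
	// parsed into a Go map or struct.
	fmt.Println(out)
}
```

The Rust side would need to produce the same bytes for the same message, which is presumably what the generated serde implementations are for.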

Add new regression tests for all message encodings (protobuf and JSON).

Remove proto-convert, as it's no longer needed.
Remove V1 `derivation` stanza of collections, and add new `derive`
stanza for derivation connectors. Define connector types for images,
SQLite, and TypeScript.

Also update and unify how sources are specified across materializations,
derivations, and tests.
Before, we would significantly alter specification structure as it was
loaded and processed, interning chunks into various tables that held
their representations.

Now, we avoid altering specification structure at all during source
loading. This makes it possible to round-trip loaded sources back out
into marshalled specifications and get exactly-matched DOM structure.

Then, add routines which re-introduce inlining and indirection
operations for manipulating sources, as part of an overall build
pipeline.

This factoring lets us better handle specification validation, which
requires fully inlined specifications, as well as the transformation
operations that our CLI flowctl offers over specifications.
Rework validation to account for changes to specification
representations in tables. Top-level specs are now a fully-inlined
concern, rather than being highly normalized across many tables of
constituent parts.

Introduce a mechanism for dynamically resolving referenced collections
which are not explicitly included in the set of validated specs.
This is used by flowctl to dynamically resolve named collections
as-needed, but is not used by control-plane agent builds yet.
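
The resolution mechanism above can be pictured with a small Go sketch. It assumes a hypothetical `Resolver` callback standing in for a control-plane lookup, and represents specs as plain strings; none of these names come from the actual implementation.

```go
package main

import "fmt"

// Resolver fetches a collection spec by name. It is a hypothetical stand-in
// for a lookup against the control plane.
type Resolver func(name string) (spec string, ok bool)

// resolveAll returns a spec for every referenced name. Specs already in the
// validated set are preferred; the resolver is consulted only for names that
// are missing. Names that cannot be resolved at all are returned separately.
func resolveAll(included map[string]string, referenced []string, fetch Resolver) (map[string]string, []string) {
	out := make(map[string]string, len(referenced))
	var missing []string
	for _, name := range referenced {
		if spec, ok := included[name]; ok {
			out[name] = spec
		} else if spec, ok := fetch(name); ok {
			out[name] = spec
		} else {
			missing = append(missing, name)
		}
	}
	return out, missing
}

func main() {
	included := map[string]string{"acme/orders": "local spec"}
	fetch := func(name string) (string, bool) {
		if name == "acme/users" {
			return "fetched spec", true
		}
		return "", false
	}
	specs, missing := resolveAll(included, []string{"acme/orders", "acme/users", "acme/unknown"}, fetch)
	fmt.Println(len(specs), missing)
}
```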

Account for capture / derive / materialize protocol additions and changes.
Add a derivation connector implementation for TypeScript, which
encapsulates details for code generation and stubs which were previously
part of the runtime proper.

Derivations are powered by Deno.
Add a derivation connector implementation for SQL (specifically SQLite)
which uses event-based SQL lambda blocks with bound $parameters.

The connector provides long-lived and persistent state through the
instrumented SQLite VFS (powered by Gazette's store-sqlite
implementation and backing recovery log).
This crate is replacing the current derive crate. The CGO channel
pattern used by that crate and its Go-side bindings is going away, in
favor of a pure gRPC interface between Rust and Go.

The derivation protocol is fully implemented, as a middleware stack of
layered Request and Response protocol interceptors which accomplish
various processing aspects of derivations.
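
For readers unfamiliar with the layered-interceptor pattern, here is a minimal Go sketch of the idea. It is deliberately simplified to a single request and response per call, whereas the real protocol is a stream, and the `Request`, `Response`, `Handler`, and `Interceptor` types are placeholders invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Request and Response are simplified placeholders for protocol messages.
type Request struct{ Doc string }
type Response struct{ Doc string }

// Handler processes one request and produces one response.
type Handler func(Request) Response

// Interceptor wraps a Handler. It may transform the request before it
// reaches the inner handler, and the response on its way back out.
type Interceptor func(next Handler) Handler

// stack composes interceptors so that the first one listed is outermost.
func stack(inner Handler, layers ...Interceptor) Handler {
	for i := len(layers) - 1; i >= 0; i-- {
		inner = layers[i](inner)
	}
	return inner
}

// trimRequest is a request-side layer: it normalizes whitespace.
func trimRequest(next Handler) Handler {
	return func(r Request) Response {
		r.Doc = strings.TrimSpace(r.Doc)
		return next(r)
	}
}

// tagResponse is a response-side layer: it brackets the inner result.
func tagResponse(next Handler) Handler {
	return func(r Request) Response {
		resp := next(r)
		resp.Doc = "[" + resp.Doc + "]"
		return resp
	}
}

// upper is the innermost handler in this example.
func upper(r Request) Response {
	return Response{Doc: strings.ToUpper(r.Doc)}
}

func main() {
	h := stack(upper, trimRequest, tagResponse)
	fmt.Println(h(Request{Doc: "  hello "}).Doc) // prints "[HELLO]"
}
```

Each layer only knows about the handler directly beneath it, so processing steps can be added or reordered without touching the others.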

Also introduce TaskService and TaskRuntime, which are a means of
building up an async tokio runtime which is accessible from Go,
using unix domain sockets, and with full plumbing of operations logs
from tokio's tracing.
…erive-refactor)

Rework `generate` command to reflect that generation is now a concern of
derivation connectors.

Add a `preview` command which reads the source collection(s) of a
derivation and shows what its output would be.

Add a materialize-fixture command which maps a materialization fixture
into a series of materialization protocol messages, in service of
testing connectors which are being developed.

Update to use new routines in `sources` crate for manipulating or
extending loaded sources.

Validation now dynamically retrieves referenced collections as-needed
from the control plane. This allows TypeScript or SQLite generation to
function without having to explicitly pull specs.

Back out use of system key-ring for refresh token storage.
This was brittle on many systems and a source of user confusion.
Captures, derivations, and materializations are all expressed as a
single bidi stream of Request / Response. The messages are now
identical, whether encoded as delimited protobuf or JSON.
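
To make the Request / Response stream shape concrete, the following Go sketch reads a sequence of JSON requests and writes one JSON response per request. The message fields (`id`, `ack`) are invented for illustration, and the assumption that JSON messages are newline-delimited is mine rather than something stated above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Request and Response are hypothetical, heavily reduced messages. Their
// fields do not mirror Flow's actual protocol.
type Request struct {
	ID int `json:"id"`
}

type Response struct {
	Ack int `json:"ack"`
}

// serve decodes a stream of JSON requests from in and writes one JSON
// response per request to out, until the input is exhausted.
func serve(in io.Reader, out io.Writer) error {
	dec := json.NewDecoder(in)
	enc := json.NewEncoder(out) // Encode appends a newline after each message.
	for {
		var req Request
		if err := dec.Decode(&req); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		if err := enc.Encode(Response{Ack: req.ID}); err != nil {
			return err
		}
	}
}

// serveString is a convenience wrapper over in-memory strings.
func serveString(in string) (string, error) {
	var out strings.Builder
	err := serve(strings.NewReader(in), &out)
	return out.String(), err
}

func main() {
	out, err := serveString("{\"id\":1}\n{\"id\":2}\n")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```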

Remove all passed arguments (`spec`, `validate`, `pull`, etc.) as well as
the `--log.level` argument. The LOG_LEVEL environment variable is used
instead.
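
A connector might pick up its log level along these lines. This is only a sketch: the set of accepted level names and the `info` default are assumptions, and `logLevel` is a hypothetical helper rather than part of any Flow library.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// logLevel returns the level named by the LOG_LEVEL environment variable.
// The accepted names and the "info" fallback are assumptions made for this
// example. The lookup function is injected so the logic is easy to test.
func logLevel(getenv func(string) string) string {
	switch lvl := strings.ToLower(getenv("LOG_LEVEL")); lvl {
	case "trace", "debug", "info", "warn", "error":
		return lvl
	default:
		return "info"
	}
}

func main() {
	fmt.Println(logLevel(os.Getenv))
}
```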

Add an async-process crate as a replacement for tokio::process, which
avoids a signal handler that the latter installs which is incompatible
with the Go runtime.

Remove crates/connector-protocol, as there no longer _is_ a
differentiated JSON protocol.
TaskService creates a Rust tokio runtime and gRPC handler stack for
serving core runtime functions.

Extract RocksDB environment descriptors into an encapsulated concern
(previously this was buried in the implementation for derivations).
In the future, we should examine switching captures and materializations
to also use RocksDB.
Captures remove handling for push-based ingest.

Capture and materialization stat processing is refactored to use the new
ops.Stat message type, both for consistency and to reduce SLOC.
OpsPublisher.PublishStats is now used for stat publishing, where before
it was not.

Capture stats were also subtly wrong, in that they previously produced
multiple stats documents for what is actually only a single capture transaction.
This is fixed.

Derivations now use a TaskService to do almost all of the heavy lifting.
Use of the previous DeriveAPI binding is removed.

Various updates to account for protocol changes.
These are mostly self-evident, mechanical, or otherwise uninteresting
changes that reflect fallout of other changes.
Remove vestiges of V1 derivation TypeScript generation.
No functional changes in how derivations work, just bring them over to
the new paradigm.

Update tests to stub out & document how they can actually be manually run.

Format a bunch of schemas.
…(derive-refactor)

Updated catalog tests use a mix of TypeScript and Deno.

Add an enhanced acmeBank example, using SQL state.

Remove E2E tests that no longer have a role (push-capture, or state-less
Airbyte connectors). Add a new test for a basic source connector and
generally refactor. This still needs more work for verifying log
outputs.
Package `deno` as a release binary.

Don't build `flowctl` with musl.

Add temporary support for using a connectors PR image while we co-land
changes to both of these repositories. We'll back this out once
everything is merged.
@jgraettinger jgraettinger changed the title Rebased Mega Commit WIP Derivation Connectors + SQL MegaPR Apr 3, 2023
@jgraettinger jgraettinger marked this pull request as ready for review April 3, 2023 20:25
When reading Load and Store requests, as we handed these buffers to the
caller and they may have retained them.
So that stat messages may be published within the shard's current
transaction.
@jgraettinger jgraettinger merged commit 9f41194 into master Apr 5, 2023
@oliviamiannone oliviamiannone added the docs pending Improvements or additions to documentation noted or in progress label Apr 10, 2023
@mdibaiee mdibaiee deleted the johnny/more-sql branch April 19, 2023 16:04
2 participants