Derivation Connectors + SQL MegaPR #985
Merged
Remove Shuffle as a persisted message. It's now computed as needed from constituent parts. Remove DerivationSpec as a top-level task type. Instead, nest it within CollectionSpec. Remove remaining aspects of V1 derivations. Add NetworkPort. Rename message fields to consolidate gRPC and JSON into unified protocol.
Capture protocol is now a single bidi stream of Request / Response. Push-based captures are removed entirely (replaced with connector networking feature). Client is simplified -- it handles only a pull-based workflow now, not push as well.
Materialization protocol is now a single bidi stream of Request / Response. The major structure of the protocol is unchanged, but messages (notably Load and Store) are simplified to present a unified gRPC / JSON protocol.
Runtime holds internal protocol extensions used by the Flow runtime. It houses all message types which are not part of Flow's public protocol interface. We will eventually move more message types, like Shuffle and JournalShuffle, into this package.
Introduce new derivation protocol, structured as a bidi stream of Request / Response.
Add a consolidated protobuf package for ops Log and Stat message types, with a JSON serialization which matches our current productionized messages. Move ShardLabeling to this package and re-export from go/labels, to avoid a circular dependency.
Regenerate all messages for their source protobuf updates. Introduce a build.rs mechanism to generate canonical serde serializations for all protobuf types, which match the Go-side behavior and handle JSON embedded within protobuf messages (again, in a manner that matches our use of json.RawMessage from Go types). This lets us have single message types that are shared between Go and Rust, and which have canonical protobuf and JSON encodings, which is now the basis for our JSON protocols. Add new regression tests for all message encodings (protobuf and JSON). Remove proto-convert, as it's no longer needed.
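As a rough illustration of the Go-side behavior that the generated serde serializations are meant to match, the sketch below embeds an already-encoded JSON document in a message through json.RawMessage. `ExampleSpec` and its field names are hypothetical and are not real Flow message types; only `encoding/json` from the Go standard library is used.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExampleSpec is a hypothetical message, not a real Flow type.
// Config carries a JSON document that has already been encoded.
type ExampleSpec struct {
	Name   string          `json:"name"`
	Config json.RawMessage `json:"config"`
}

// encodeSpec marshals a spec. The RawMessage bytes are embedded as a
// nested JSON value, rather than as a quoted or base64 string.
func encodeSpec(name, configJSON string) (string, error) {
	b, err := json.Marshal(ExampleSpec{Name: name, Config: json.RawMessage(configJSON)})
	return string(b), err
}

// decodeConfig unmarshals a spec and returns the embedded document's bytes,
// which are retained without being interpreted.
func decodeConfig(specJSON string) (string, error) {
	var s ExampleSpec
	err := json.Unmarshal([]byte(specJSON), &s)
	return string(s.Config), err
}

func main() {
	out, err := encodeSpec("acmeCo/example", `{"a":1}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints {"name":"acmeCo/example","config":{"a":1}}
}
```

A Rust type with a matching serde implementation would need to produce the same nested form for the two encodings to be interchangeable.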
Remove V1 `derivation` stanza of collections, and add new `derive` stanza for derivation connectors. Define connector types for images, SQLite, and TypeScript. Also update and unify how sources are specified across materializations, derivations, and tests.
Before, we would significantly alter specification structure as it was loaded and processed, interning chunks into various tables that held their representations. Now, we avoid altering specification structure at all during source loading. This makes it possible to round-trip loaded sources back out into marshalled specifications and get exactly-matched DOM structure. Then, add routines which re-introduce inlining and indirection operations for manipulating sources, as part of an overall build pipeline. This factoring lets us better handle specification validation, which requires fully inlined specifications, as well as the transformation operations that our CLI flowctl offers over specifications.
Rework validation to account for changes to specification representations in tables. Top-level specs are now a fully-inlined concern, rather than being highly normalized across many tables of constituent parts. Introduce a mechanism for dynamically resolving referenced collections which are not explicitly included in the set of validated specs. This is used by flowctl to dynamically resolve named collections as-needed, but is not used by control-plane agent builds yet. Account for capture / derive / materialize protocol additions and changes.
Add a derivation connector implementation for TypeScript, which encapsulates details for code generation and stubs which were previously part of the runtime proper. Derivations are powered by Deno.
Add a derivation connector implementation for SQL (specifically SQLite) which uses event-based SQL lambda blocks with bound $parameters. The connector provides long-lived and persistent state through the instrumented SQLite VFS (powered by Gazette's store-sqlite implementation and backing recovery log).
This crate is replacing the current derive crate. The CGO channel pattern used by that crate and its Go-side bindings is going away, in favor of a pure gRPC interface between Rust and Go. The derivation protocol is fully implemented, as a middleware stack of layered Request and Response protocol interceptors which accomplish various processing aspects of derivations. Also introduce TaskService and TaskRuntime, which are a means of building up an async tokio runtime which is accessible from Go, using unix domain sockets, and with full plumbing of operations logs from tokio's tracing.
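The layered-interceptor idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: the actual stack is written in Rust and operates over streams, while the sketch below handles one message at a time, and `Request`, `Response`, `Handler`, `Layer`, `stack`, and `tag` are all hypothetical names.

```go
package main

import "fmt"

// Request and Response are hypothetical stand-ins for protocol messages.
type Request struct{ Doc string }
type Response struct{ Doc string }

// Handler processes one request. A Layer wraps a Handler so that it can
// intercept the request on the way in and the response on the way out.
type Handler func(Request) Response
type Layer func(next Handler) Handler

// stack wraps inner with the given layers; layers[0] ends up outermost.
func stack(inner Handler, layers ...Layer) Handler {
	for i := len(layers) - 1; i >= 0; i-- {
		inner = layers[i](inner)
	}
	return inner
}

// tag returns a Layer that marks both directions with its name, which
// makes the order of interception visible.
func tag(name string) Layer {
	return func(next Handler) Handler {
		return func(req Request) Response {
			req.Doc += ">" + name
			resp := next(req)
			resp.Doc += "<" + name
			return resp
		}
	}
}

// run pushes a document through two layers around a stub connector.
func run(doc string) string {
	connector := func(r Request) Response { return Response{Doc: r.Doc + "|connector|"} }
	return stack(connector, tag("a"), tag("b"))(Request{Doc: doc}).Doc
}

func main() {
	fmt.Println(run("x")) // prints x>a>b|connector|<b<a
}
```

Requests pass through the layers outermost first, and responses unwind in the opposite order, which is what lets each layer own one processing aspect.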
…erive-refactor) Rework `generate` command to reflect that generation is now a concern of derivation connectors. Add a `preview` command which reads the source collection(s) of a derivation and shows what its output would be. Add a materialize-fixture command which maps a materialization fixture into a series of materialization protocol messages, in service of testing connectors which are being developed. Update to use new routines in `sources` crate for manipulating or extending loaded sources. Validation now dynamically retrieves referenced collections as-needed from the control plane. This allows TypeScript or SQLite generation to function without having to explicitly pull specs. Back out use of system key-ring for refresh token storage. This was brittle on many systems and a source of user confusion.
Captures, derivations, and materializations are all expressed as a single bidi stream of Request / Response. The messages are now identical, whether encoded as delimited protobuf or JSON. Remove all passed arguments `spec`, `validate`, `pull`, etc., as well as the `--log.level` argument. The LOG_LEVEL environment variable is used instead. Add an async-process crate as a replacement for tokio::process, which avoids a signal handler that the latter installs which is incompatible with the Go runtime. Remove crates/connector-protocol, as there no longer _is_ a differentiated JSON protocol.
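From a connector's point of view, the JSON form of such a stream might look like the sketch below: requests arrive one JSON document at a time on an input stream, and a response is written for each. The `Request` and `Response` shapes, the `handled` field, and the `info` default are assumptions made for illustration, not the actual protocol messages; only reading `LOG_LEVEL` from the environment comes from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// Request and Response are hypothetical shapes. The operation is selected
// by which field of the message is set, not by a command-line argument.
type Request struct {
	Spec     json.RawMessage `json:"spec,omitempty"`
	Validate json.RawMessage `json:"validate,omitempty"`
}

type Response struct {
	Handled string `json:"handled"`
}

// serve reads a stream of JSON requests and writes one newline-delimited
// JSON response per request, until the input is exhausted.
func serve(r io.Reader, w io.Writer) error {
	dec, enc := json.NewDecoder(r), json.NewEncoder(w)
	for {
		var req Request
		if err := dec.Decode(&req); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		resp := Response{Handled: "unknown"}
		switch {
		case len(req.Spec) > 0:
			resp.Handled = "spec"
		case len(req.Validate) > 0:
			resp.Handled = "validate"
		}
		if err := enc.Encode(resp); err != nil {
			return err
		}
	}
}

// serveString is a convenience wrapper for exercising serve in memory.
func serveString(in string) (string, error) {
	var sb strings.Builder
	err := serve(strings.NewReader(in), &sb)
	return sb.String(), err
}

// logLevel reads LOG_LEVEL; the "info" fallback is an assumption.
func logLevel() string {
	if v := os.Getenv("LOG_LEVEL"); v != "" {
		return v
	}
	return "info"
}

func main() {
	fmt.Fprintln(os.Stderr, "log level:", logLevel())
	if err := serve(os.Stdin, os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```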
TaskService creates a Rust tokio runtime and gRPC handler stack for serving core runtime functions. Extract RocksDB environment descriptors into an encapsulated concern (previously this was buried in the implementation for derivations). In the future, we should examine switching captures and materializations to also use RocksDB.
Captures remove handling for push-based ingest. Capture and materialization stat processing is refactored to use the new ops.Stat message type, for consistency and fewer SLOC. OpsPublisher.PublishStats is now used for stat publishing, where before it was not. Capture stats were also subtly wrong, in that they previously produced multiple stats documents for what is actually only a single capture transaction. This is fixed. Derivations now use a TaskService to do almost all of the heavy lifting. Use of the previous DeriveAPI binding is removed. Various updates to account for protocol changes.
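To make the capture-stats fix concrete, here is a minimal sketch of accumulating counts across a whole transaction and producing exactly one stats document at commit. It does not use the real ops.Stat type or OpsPublisher; `txnStats`, `bindingStats`, and their methods are hypothetical names chosen for the example.

```go
package main

import "fmt"

// bindingStats holds per-binding totals for one transaction.
type bindingStats struct {
	Docs  int
	Bytes int
}

// txnStats accumulates totals for the transaction currently in progress.
type txnStats struct {
	byBinding map[string]bindingStats
}

// observe records one captured document of the given size for a binding.
func (t *txnStats) observe(binding string, size int) {
	if t.byBinding == nil {
		t.byBinding = make(map[string]bindingStats)
	}
	s := t.byBinding[binding]
	s.Docs++
	s.Bytes += size
	t.byBinding[binding] = s
}

// commit returns the single stats document for the transaction and resets
// the accumulator for the next one.
func (t *txnStats) commit() map[string]bindingStats {
	out := t.byBinding
	t.byBinding = nil
	return out
}

func main() {
	var acc txnStats
	acc.observe("acmeCo/a", 10)
	acc.observe("acmeCo/b", 5)
	acc.observe("acmeCo/a", 20)
	fmt.Println(acc.commit()) // one document covering the whole transaction
}
```

Emitting only from `commit`, rather than from each `observe`, is what keeps the count of stats documents at one per transaction.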
These are mostly self-evident, mechanical, or otherwise uninteresting changes that reflect fallout of other changes.
Remove vestiges of V1 derivation TypeScript generation.
No functional changes in how derivations work, just bring them over to the new paradigm. Update tests to stub out & document how they can actually be manually run. Format a bunch of schemas.
…(derive-refactor) Updated catalog tests use a mix of TypeScript and Deno. Add an enhanced acmeBank example, using SQL state. Remove E2E tests that no longer have a role (push-capture, or stateless Airbyte connectors). Add a new test for a basic source connector and generally refactor -- this still needs more work for verifying log outputs.
Package `deno` as a release binary. Don't build `flowctl` with musl. Add temporary support for using a connectors PR image while we co-land changes to both of these repositories. We'll back this out once everything is merged.
When reading Load and Store requests, as we handed these buffers to the caller and they may have retained them.
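The hazard this note points at looks like ordinary buffer aliasing, which the sketch below reproduces under that assumption. It is not the actual change: `readFrames` is a hypothetical function that either reuses one read buffer across frames or copies each frame before handing it to a caller that retains it.

```go
package main

import "fmt"

// readFrames simulates reading frames and handing each to a caller that
// keeps a reference. With reuse=true every frame shares one backing array,
// so later reads overwrite what the caller retained earlier.
func readFrames(frames []string, reuse bool) []string {
	buf := make([]byte, 0, 16) // shared read buffer
	var retained [][]byte      // what the caller held on to

	for _, f := range frames {
		var b []byte
		if reuse {
			buf = append(buf[:0], f...) // overwrites the previous frame in place
			b = buf
		} else {
			b = append([]byte(nil), f...) // fresh copy owned by the caller
		}
		retained = append(retained, b)
	}

	out := make([]string, len(retained))
	for i, b := range retained {
		out[i] = string(b)
	}
	return out
}

func main() {
	frames := []string{"load-1", "stor-2"}
	fmt.Println(readFrames(frames, true))  // prints [stor-2 stor-2]
	fmt.Println(readFrames(frames, false)) // prints [load-1 stor-2]
}
```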
So that stat messages may be published within the shard's current transaction.
Description:
This is a substantial refactoring of the Flow protocols and runtime,
which introduces Derivation Connectors as a threaded-through concept.
It also refactors our capture and materialization protocols so that captures, derivations, and materializations are logically cohesive and so that our JSON and gRPC protocols have identical semantics and message representations.
There are many other included changes; see individual comments for more granular updates.