Skip to content

estuary/flow

Repository files navigation

CI Slack | Docs home | Free account | Data platform comparison reference | Email list

Build millisecond-latency, scalable, future-proof data pipelines in minutes.

Estuary Flow is a DataOps platform that integrates all of the systems you use to produce, process, and consume data.

Flow unifies today's batch and streaming paradigms so that your systems – current and future – are synchronized around the same datasets, updating in milliseconds.

With a Flow pipeline, you:

  • πŸ“· Capture data from your systems, services, and SaaS into collections: millisecond-latency datasets that are stored as regular files of JSON data, right in your cloud storage bucket.

  • 🎯 Materialize a collection as a view within another system, such as a database, key/value store, Webhook API, or pub/sub service.

  • 🌊 Derive new collections by transforming from other collections, using the full gamut of stateful stream workflow, joins, and aggregations β€” in real time.

Workflow Overview

Using Flow

Flow combines a low-code UI for essential workflows and a CLI for fine-grain control over your pipelines. Together, the two interfaces comprise Flow's unified platform. You can switch seamlessly between them as you build and refine your pipelines, and collaborate with a wider breadth of data stakeholders.

➑️ Sign up for a free Flow account here.

See the BSL license for information on using Flow outside the managed offering.

Resources

Support

The best (and fastest) way to get support from the Estuary team is to join the community on Slack.

You can also email us.

Connectors

Captures and materializations use connectors: plug-able components that integrate Flow with external data systems. Estuary's in-house connectors focus on high-scale technology systems and change data capture (think databases, pub-sub, and filestores).

Flow can run Airbyte community connectors using airbyte-to-flow, allowing us to support a greater variety of SaaS systems.

See our website for the full list of currently supported connectors.

If you don't see what you need, request it here.

How does it work?

Flow builds on a real-time streaming broker created by the same founding team called Gazette.

Because of this, Flow collections are both a batch dataset – they're stored as a structured "data lake" of general-purpose files in cloud storage – and a stream, able to commit new documents and forward them to readers within milliseconds. New use cases read directly from cloud storage for high-scale backfills of history, and seamlessly transition to low-latency streaming on reaching the present.

What makes Flow so fast?

Flow mixes a variety of architectural techniques to achieve great throughput without adding latency:

  • Optimistic pipelining, using the natural back-pressure of systems to which data is committed.
  • Leveraging reduce annotations to group collection documents by key wherever possible, in memory, before writing them out.
  • Co-locating derivation states (registers) with derivation compute: registers live in an embedded RocksDB that's replicated for durability and machine re-assignment. They update in memory and only write out at transaction boundaries.
  • Vectorizing the work done in external Remote Procedure Calls (RPCs) and even process-internal operations.
  • Marrying the development velocity of Go with the raw performance of Rust, using a zero-copy CGO service channel.