What would you like to happen?
Proposed Overall Structure
- New Metadata Plugin Type: DataStreamPluginType
  A new top-level plugin type (similar to other metadata plugin types in Hop). It should live in the core/engine or as its own module under plugins/metadata (or a dedicated plugins/datastreams category).
- Metadata Element: Data Stream (user-facing name)
  - Annotated with @HopMetadata(key = "data-stream", name = "Data Stream", ...)
  - Stored as JSON in the project’s metadata/data-stream/ folder.
  - Fully manageable and editable in the Metadata perspective.
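As a sketch, the metadata element could follow the usual Hop metadata pattern; the class name and properties below are assumptions for illustration, and only the @HopMetadata key and name are taken from this proposal:

```java
// Sketch only: class and property names are assumptions; the @HopMetadata
// key and name come from this proposal. Getters/setters omitted for brevity.
@HopMetadata(
    key = "data-stream",
    name = "Data Stream",
    description = "A named, pluggable stream of rows")
public class DataStream extends HopMetadataBase implements IHopMetadata {

  // Selected DataStreamPluginType implementation: Arrow Socket, Arrow Flight, Arrow File, ...
  @HopMetadataProperty
  private String implementationType;

  // Common settings shared by all implementations
  @HopMetadataProperty
  private int batchSize;

  @HopMetadataProperty
  private String schemaReference;
}
```

Because the class only carries serializable properties, the standard Hop metadata serializer would handle the JSON persistence and the Metadata perspective editing for free.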
- Two New Transforms (to be placed under plugins/transforms):
  - Data Stream Output – writes rows to a selected Data Stream (producer side)
  - Data Stream Input – reads rows from a selected Data Stream (consumer side)
- Pluggable Implementations (via the new DataStreamPluginType):
  - Arrow Socket – for same-machine or simple multi-process streaming using Arrow IPC over TCP or Unix domain sockets.
  - Arrow Flight – ideal for distributed scenarios (Spark / Flink) and network use cases; supports parallel writes from multiple executors via DoPut to a central Flight server.
  - Arrow File – simplest decoupled option; writes Arrow IPC streams or files (including partitioned writes to local disk, HDFS, or S3). Downstream processes can read and optionally merge the data.

  Additional implementations (ZeroMQ + Arrow, Kafka with Arrow serialization, shared memory, etc.) can be added later without changing the core metadata or the two transforms.
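To make the pluggable contract concrete, here is a minimal self-contained sketch in plain Java. The interface and class names are assumptions, not Hop API, and an in-memory queue stands in for the real Arrow transports; it only shows how the two transforms could stay independent of the concrete implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: a shared contract that every DataStreamPluginType
// implementation (Arrow Socket, Arrow Flight, Arrow File, ...) could fulfill,
// so Data Stream Output/Input never depend on a concrete transport.
public class DataStreamSketch {

  /** Hypothetical contract for all Data Stream implementations. */
  interface IDataStream {
    void write(Object[] row) throws InterruptedException; // producer side (Data Stream Output)
    Object[] read() throws InterruptedException;          // consumer side (Data Stream Input)
  }

  /** Stand-in transport: an in-memory bounded queue instead of Arrow IPC/Flight. */
  static class InMemoryDataStream implements IDataStream {
    private final BlockingQueue<Object[]> queue;

    InMemoryDataStream(int bufferSize) {
      this.queue = new ArrayBlockingQueue<>(bufferSize);
    }

    @Override
    public void write(Object[] row) throws InterruptedException {
      queue.put(row); // blocks when the buffer is full (back-pressure)
    }

    @Override
    public Object[] read() throws InterruptedException {
      return queue.take(); // blocks until a row is available
    }
  }

  public static void main(String[] args) throws Exception {
    IDataStream stream = new InMemoryDataStream(16);

    // What a Data Stream Output transform would do per row:
    stream.write(new Object[] {"row-1", 42});

    // What a Data Stream Input transform would do per row:
    Object[] row = stream.read();
    System.out.println(row[0] + "," + row[1]);
  }
}
```

Swapping the transport then only means registering another implementation of the contract under the plugin type; the transforms and the metadata stay untouched.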
How It Would Work in Practice
- Define a Data Stream in the Metadata perspective:
  - Name: e.g. PythonArrowStream, SparkCollector, ExternalPythonConsumer
  - Implementation: choose Arrow Socket, Arrow Flight, or Arrow File
  - Implementation-specific settings (port range, Flight endpoint, base path, etc.)
  - Common settings: batch size, direction, optional schema reference, buffer configuration
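For illustration, a Data Stream serialized into metadata/data-stream/ might look roughly like the fragment below; every field name here is an assumption, since the actual JSON shape would simply follow whatever properties the metadata class declares:

```json
{
  "name": "PythonArrowStream",
  "implementation": "arrow-socket",
  "batchSize": 10000,
  "direction": "output",
  "portRangeStart": 33000,
  "portRangeEnd": 33010
}
```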
- Use in Pipelines:
  - Drop a Data Stream Output transform and select the desired Data Stream by name (dropdown, just like a database connection).
  - Drop a Data Stream Input transform and select the same (or another) Data Stream by name.
  - Connections are initialized lazily when the transform first needs them.
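The lazy initialization mentioned above can be sketched with a memoizing holder in plain Java (names are illustrative, not Hop API): the transform constructs the holder cheaply at init time, and the underlying connection is only opened on the first row that needs it.

```java
import java.util.function.Supplier;

// Sketch of the lazy-connection idea: the expensive factory (open a socket,
// Flight client, file writer, ...) runs at most once, on first use.
public class LazyConnectionSketch {

  /** Wraps an expensive factory so it runs at most once, thread-safely. */
  static class Lazy<T> implements Supplier<T> {
    private final Supplier<T> factory;
    private volatile T value;

    Lazy(Supplier<T> factory) {
      this.factory = factory;
    }

    @Override
    public synchronized T get() {
      if (value == null) {
        value = factory.get();
      }
      return value;
    }
  }

  public static void main(String[] args) {
    Lazy<String> connection = new Lazy<>(() -> {
      System.out.println("opening connection"); // expensive setup happens here
      return "connected";
    });

    System.out.println("transform initialized"); // nothing opened yet
    System.out.println(connection.get());        // first row: opens the connection
    System.out.println(connection.get());        // later rows: reuses it
  }
}
```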
- Distributed Support (Hop on Spark / Hop on Flink):
  - Metadata is automatically serialized and distributed to all executors (standard Hop/Beam behavior).
  - Arrow Flight handles parallel writes naturally.
  - Arrow Socket can use port ranges or a thin merge layer.
  - Arrow File lets each executor write partitioned files that can be merged downstream.
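The port-range idea for the Arrow Socket implementation can be sketched in plain Java: each executor probes the configured range and binds the first free port, so parallel producers on one host never collide (the range bounds are illustrative).

```java
import java.io.IOException;
import java.net.ServerSocket;

// Sketch of port-range selection for the Arrow Socket implementation.
public class PortRangeSketch {

  /** Returns a server socket bound to the first free port in [from, to]. */
  static ServerSocket bindFirstFree(int from, int to) throws IOException {
    for (int port = from; port <= to; port++) {
      try {
        return new ServerSocket(port);
      } catch (IOException alreadyInUse) {
        // Port taken by another executor on this host; try the next one.
      }
    }
    throw new IOException("No free port in range " + from + "-" + to);
  }

  public static void main(String[] args) throws IOException {
    try (ServerSocket first = bindFirstFree(33000, 33010);
         ServerSocket second = bindFirstFree(33000, 33010)) {
      // Two "executors" on the same host end up on distinct ports.
      System.out.println(first.getLocalPort() != second.getLocalPort());
    }
  }
}
```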
This design keeps the user experience simple and metadata-driven while providing powerful, high-performance back-ends under the hood. It builds directly on existing Hop patterns such as Pipeline Probe metadata and the metadata injection / selection mechanisms.
Issue Priority
Priority: 3
Issue Component
Component: Transforms