[Feature Request]: Data Stream metadata and plugin type #6963

@mattcasters

Description

What would you like to happen?

Proposed Overall Structure

  • New Metadata Plugin Type: DataStreamPluginType
    A new top-level plugin type, similar to other metadata plugin types in Hop. It could live in core/engine, as its own module under plugins/metadata, or in a dedicated plugins/datastreams category.

  • Metadata Element: Data Stream (user-facing name)

    • Annotated with @HopMetadata(key = "data-stream", name = "Data Stream", ...)
    • Stored as JSON in the project’s metadata/data-stream/ folder.
    • Fully manageable and editable in the Metadata perspective.
  • Two New Transforms (to be placed under plugins/transforms):

    • Data Stream Output – writes rows to a selected Data Stream (producer side)
    • Data Stream Input – reads rows from a selected Data Stream (consumer side)
  • Pluggable Implementations (via the new DataStreamPluginType):

    1. Arrow Socket – for same-machine or simple multi-process streaming using Arrow IPC over TCP or Unix domain sockets.
    2. Arrow Flight – ideal for distributed scenarios (Spark / Flink) and network use cases; supports parallel writes from multiple executors via DoPut to a central Flight server.
    3. Arrow File – simplest decoupled option; writes Arrow IPC streams or files (including partitioned writes to local disk, HDFS, or S3). Downstream processes can read and optionally merge the data.

Additional implementations (ZeroMQ + Arrow, Kafka with Arrow serialization, shared memory, etc.) can be added later without changing the core metadata or the two transforms.
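To make the pluggability concrete, here is a minimal, self-contained sketch of the kind of contract a DataStreamPluginType implementation (Arrow Socket, Arrow Flight, Arrow File, and later additions) might fulfil, with a trivial in-memory stand-in for a real backend. All names here (IDataStream, writeBatch, readBatch) are hypothetical, not existing Hop API.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical contract each pluggable implementation would fulfil.
// None of these names exist in Hop today; they only illustrate the idea.
interface IDataStream {
  void writeBatch(List<Object[]> rows) throws Exception; // producer side
  List<Object[]> readBatch() throws Exception;           // consumer side; null when drained
  void close() throws Exception;
}

// Trivial in-memory implementation, standing in for a real backend
// such as an Arrow IPC socket or a Flight endpoint.
class InMemoryDataStream implements IDataStream {
  private final Queue<List<Object[]>> batches = new ArrayDeque<>();

  @Override public void writeBatch(List<Object[]> rows) { batches.add(rows); }
  @Override public List<Object[]> readBatch() { return batches.poll(); }
  @Override public void close() { batches.clear(); }
}

public class DataStreamSketch {
  public static void main(String[] args) throws Exception {
    IDataStream stream = new InMemoryDataStream();
    // Producer side (Data Stream Output transform) writes a batch of rows.
    stream.writeBatch(List.of(new Object[] {"row1", 1}, new Object[] {"row2", 2}));
    // Consumer side (Data Stream Input transform) reads it back.
    List<Object[]> batch = stream.readBatch();
    System.out.println(batch.size());        // prints 2
    System.out.println(stream.readBatch());  // prints null (stream drained)
    stream.close();
  }
}
```

Because the two transforms only talk to this interface, new backends can be registered with the plugin type without touching the transforms or the metadata element.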

How It Would Work in Practice

  1. Define a Data Stream in the Metadata perspective:

    • Name: e.g. PythonArrowStream, SparkCollector, ExternalPythonConsumer
    • Implementation: choose Arrow Socket, Arrow Flight, or Arrow File
    • Implementation-specific settings (port range, Flight endpoint, base path, etc.)
    • Common settings: batch size, direction, optional schema reference, buffer configuration
  2. Use in Pipelines:

    • Drop a Data Stream Output transform and select the desired Data Stream by name (dropdown, just like a database connection).
    • Drop a Data Stream Input transform and select the same (or another) Data Stream by name.
    • Connections are initialized lazily when the transform first needs them.
  3. Distributed Support (Hop on Spark / Hop on Flink):

    • Metadata is automatically serialized and distributed to all executors (standard Hop/Beam behavior).
    • Arrow Flight handles parallel writes naturally.
    • Arrow Socket can use port ranges or a thin merge layer.
    • Arrow File lets each executor write partitioned files that can be merged downstream.
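Since the metadata element would be stored as JSON under metadata/data-stream/, a saved definition from step 1 might look roughly like the fragment below. The file name reuses the PythonArrowStream example above; all field names are assumptions for illustration, not a defined schema.

```json
{
  "name": "PythonArrowStream",
  "implementation": "arrow-socket",
  "batchSize": 10000,
  "direction": "output",
  "settings": {
    "hostname": "localhost",
    "portRange": "50051-50060"
  }
}
```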

This design keeps the user experience simple and metadata-driven while providing powerful, high-performance back-ends under the hood. It builds directly on existing Hop patterns such as Pipeline Probe metadata and the metadata injection / selection mechanisms.

Issue Priority

Priority: 3

Issue Component

Component: Transforms
