[Feature Request]: Provide a user-facing api to stage and download large file dependencies onto Beam SDK workers #28331

@tvalentyn

What would you like to happen?

Users sometimes need to provision large files to SDK workers.

Beam Artifact staging API capabilities are not directly exposed to Python SDK users, beyond options to stage well-defined Python dependency artifacts, such as --extra_package, see:

def create_job_resources(options, # type: PipelineOptions

Currently available options for staging large resources (covering this from Beam Python SDK perspective):

Some options are not straightforward, or are outright hacky, and others have usability or performance drawbacks. A user-facing API dedicated to staging data dependencies could fill this gap and handle large files more robustly. The API could be consumed by Beam users directly, and by Beam transforms, such as RunInference, to declare and stage the data dependencies of a specific transform.
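To make the request more concrete, here is a rough sketch of the behaviour such an API might have. Everything in it is hypothetical: `StagedArtifact`, `stage_file` and `fetch_artifact` are illustrative names, not existing Beam APIs, and a local directory stands in for the real staging location. It only shows the two halves of the idea: record the file and its hash at submission time, then download it once per worker and reuse the verified copy.

```python
import hashlib
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class StagedArtifact:
    """Hypothetical record describing one staged data dependency."""
    name: str
    staged_path: str
    sha256: str


def _sha256(path, chunk_size=1 << 20):
    # Hash in chunks so large files are never loaded into memory at once.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_file(local_path, staging_dir):
    """Submission side (hypothetical): copy the file to staging and record its hash."""
    src = Path(local_path)
    dest_dir = Path(staging_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copyfile(src, dest)
    return StagedArtifact(name=src.name, staged_path=str(dest), sha256=_sha256(dest))


def fetch_artifact(artifact, worker_dir):
    """Worker side (hypothetical): fetch once, then reuse the verified local copy."""
    worker = Path(worker_dir)
    worker.mkdir(parents=True, exist_ok=True)
    target = worker / artifact.name
    if target.exists() and _sha256(target) == artifact.sha256:
        return target  # Already downloaded and intact.
    tmp = target.with_name(target.name + ".tmp")
    shutil.copyfile(artifact.staged_path, tmp)
    if _sha256(tmp) != artifact.sha256:
        tmp.unlink()
        raise IOError("checksum mismatch for artifact %r" % artifact.name)
    tmp.replace(target)  # Publish only after the copy has been verified.
    return target
```

In a real implementation the copy steps would presumably go through the Beam artifact staging and retrieval services rather than the local filesystem, and a transform such as RunInference could declare its `StagedArtifact`s so that workers fetch them before processing starts.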

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
