[Feature Request]: Provide a user-facing api to stage and download large file dependencies onto Beam SDK workers

### What would you like to happen?

Users sometimes need to provision large files to SDK workers.

The Beam artifact staging API is not directly exposed to Python SDK users. The only exposed options stage well-defined Python dependency artifacts, such as `--extra_package`; see: https://github.com/apache/beam/blob/7a4cbc18f97b4795eb00d4f14bc0790c564e5c9e/sdks/python/apache_beam/runners/portability/stager.py#L165

Currently available options for staging large resources (from the Beam Python SDK perspective):
* If you need to stage a large model to run predictions, consider the Beam RunInference API instead: https://beam.apache.org/documentation/transforms/python/elementwise/runinference/. The API already takes care of downloading the model and might improve over time.
* Include your data dependency in a custom container. This increases the container image size and slows worker startup: because Docker compresses images, both downloading the image and decompressing it during the pull take longer. The Dataflow runner also currently needs additional flags to run large container images (increase the default `--disk_size_gb=...` and use `--experiments=disable_worker_container_image_prepull`).
* Use a Python package that downloads the large file when the package is installed. See: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython . A custom `gsutil cp` command can be added to a `setup.py` such as https://github.com/apache/beam/blob/99b2f7bd7939203138d4a5e18463339455fda461/sdks/python/apache_beam/examples/complete/juliaset/setup.py#L79 .
* Use a custom container with a custom entrypoint that downloads the data dependency (e.g. via a `gsutil cp` command) before starting the Beam SDK workers: https://cloud.google.com/dataflow/docs/guides/using-custom-containers#custom-entrypoint.
* At the pipeline level, use [shared.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/shared.py) to download the dependency once per process, or [multi_process_shared.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/multi_process_shared.py) to download an artifact once per machine. Beam RunInference transforms use these utilities and fetch models via the FileSystems API; example: https://github.com/apache/beam/blob/99b2f7bd7939203138d4a5e18463339455fda461/sdks/python/apache_beam/ml/inference/sklearn_inference.py#L59.
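
To illustrate the kind of boilerplate users write today for the download-at-startup options, here is a minimal, stdlib-only sketch of a "fetch if missing" helper. It is not how `shared.py` or `multi_process_shared.py` are implemented, and the names `fetch_if_missing` and `open_source` are hypothetical. It does not guarantee a single download per machine: two processes that start at the same moment may both download, but the atomic rename keeps the final file consistent.

```python
import os
import shutil
import tempfile
from typing import BinaryIO, Callable


def fetch_if_missing(dest_path: str, open_source: Callable[[], BinaryIO]) -> str:
    """Copy a remote file to dest_path unless a copy already exists.

    open_source is a hypothetical callback that returns a readable binary
    stream for the remote file. It is only called when a download is needed.
    """
    if os.path.exists(dest_path):
        # Already fetched by this process or by another one on the machine.
        return dest_path

    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)

    # Write to a temporary file in the same directory and rename it into
    # place, so other processes never observe a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp, open_source() as src:
            shutil.copyfileobj(src, tmp)
        os.replace(tmp_path, dest_path)
    finally:
        # Clean up the temporary file if the copy failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return dest_path
```

In a Beam pipeline, such a helper would typically be called from `DoFn.setup`, with `open_source` wrapping `FileSystems.open(remote_path)` from `apache_beam.io.filesystems`. Every pipeline that needs a large file has to carry some variant of this logic, which is the gap a dedicated staging API would close.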

Some of these options are not straightforward, or are outright hacky, and some have disadvantages in usability or performance. A user-facing API dedicated to staging data dependencies could fill the gap and provide more robust handling of large files. The API could be consumed by Beam users directly, and by Beam transforms, such as RunInference, to declare and stage the data dependencies of a specific transform.

### Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

### Issue Components

- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner

[Feature Request]: Provide a user-facing api to stage and download large file dependencies onto Beam SDK workers #28331
