What would you like to happen?
Users sometimes need to provision large files to SDK workers.
Beam Artifact staging API capabilities are not directly exposed to Python SDK users, beyond options to stage well-defined Python dependency artifacts, such as --extra_package, see:
|
def create_job_resources(options, # type: PipelineOptions |
Currently available options for staging large resources (covering this from Beam Python SDK perspective):
- If you need to stage a large model to run predictions, consider Beam RunInference API instead: https://beam.apache.org/documentation/transforms/python/elementwise/runinference/. The API already takes care of downloading the model and might improve overtime.
- Include your data dependency in custom containers. This increases container image size, and worker startup will be slower. Because Docker compresses images, not only downloading time will increase but also decompressing the container image during the pull. Also Dataflow runner currently needs additional flags to run large container images (increase the default
--disk_size_gb=..., use --experiments=disable_worker_container_image_prepull)
- Use a python package that will download a large file upon package installation. See: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython . A custom
gsutil cp command can be used in
|
CUSTOM_COMMANDS = [['echo', 'Custom command worked!']] |
.
- Use a custom container with a custom entrypoint that will download a data dependency in (e.g. via
gsutil cp) command before starting Beam SDK workers: https://cloud.google.com/dataflow/docs/guides/using-custom-containers#custom-entrypoint.
- On pipeline level users can use shared.py and download the dependency once per process or use multi_process_shared.py to download an artifact once per machine. Beam RunInference transforms uses these utilities and fetches models via FileSystems API, example:
|
file = FileSystems.open(model_uri, 'rb') |
.
Some options are not straightforward if not too hacky and some have disadvantages in usability or performance. A user-facing API dedicated to staging data dependencies can fill in the gap and provide a more robust handling of staging large files. The API can be consumed by Beam users directly, and by Beam Transforms, such as RunInference, for declaring and staging data dependencies of a specific transform.
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components
What would you like to happen?
Users sometimes need to provision large files to SDK workers.
Beam Artifact staging API capabilities are not directly exposed to Python SDK users, beyond options to stage well-defined Python dependency artifacts, such as
--extra_package, see:beam/sdks/python/apache_beam/runners/portability/stager.py
Line 165 in 7a4cbc1
Currently available options for staging large resources (covering this from Beam Python SDK perspective):
--disk_size_gb=..., use--experiments=disable_worker_container_image_prepull)gsutil cpcommand can be used inbeam/sdks/python/apache_beam/examples/complete/juliaset/setup.py
Line 79 in 99b2f7b
gsutil cp) command before starting Beam SDK workers: https://cloud.google.com/dataflow/docs/guides/using-custom-containers#custom-entrypoint.beam/sdks/python/apache_beam/ml/inference/sklearn_inference.py
Line 59 in 99b2f7b
Some options are not straightforward if not too hacky and some have disadvantages in usability or performance. A user-facing API dedicated to staging data dependencies can fill in the gap and provide a more robust handling of staging large files. The API can be consumed by Beam users directly, and by Beam Transforms, such as RunInference, for declaring and staging data dependencies of a specific transform.
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components