Skip to content

Conversation

@jld-adriano
Copy link

Why are the changes needed?

This PR introduces a more efficient and reliable way to check for image existence, especially for images hosted on Amazon ECR. Previously, Flytekit relied solely on Docker to check if an image existed, which can be slow or fail if the Docker daemon is unavailable. By directly querying ECR when applicable, we reduce reliance on Docker and improve performance.

What changes were proposed in this pull request?

This PR implements a new mechanism to check for image existence directly in Amazon ECR before falling back to Docker.

  1. New Helper Functions: Added is_ecr_registry, check_aws_cli_and_creds, and check_ecr_image_exists in flytekit/image_spec/image_spec.py to:
    • Identify ECR registry URLs.
    • Verify AWS CLI installation and credential configuration.
    • Perform direct ECR image existence checks using aws ecr describe-images.
  2. Modified ImageSpec.exist(): The exist() method now:
    • Prioritizes an ECR check if the image registry is ECR and AWS CLI/credentials are available.
    • Gracefully falls back to the existing Docker-based check if the ECR check fails or conditions are not met.
  3. Documentation: Added docs/ecr_image_check.md detailing the feature, its benefits, requirements, and usage.

How was this patch tested?

Comprehensive unit tests were added in tests/flytekit/unit/core/image_spec/test_ecr_check.py to cover:

  • ECR registry detection.
  • AWS CLI and credential availability checks.
  • Successful ECR image existence checks.
  • Fallback scenarios to Docker when ECR checks fail or conditions are not met.
  • Behavior with non-ECR registries.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Slack Thread

Co-authored-by: adriano <adriano@exa.ai>
@jld-adriano jld-adriano changed the title Check image source and conditions dont depend on docker for ECR Jul 15, 2025
@jld-adriano
Copy link
Author

i tested and it works! both for missing and existing images

@jld-adriano jld-adriano merged commit 9cace8d into master Jul 15, 2025
devin-ai-integration bot added a commit that referenced this pull request Dec 12, 2025
…nv vars

This fixes bug #1 where overriding nproc_per_node via with_overrides would not
change the number of processes for single-node elastic tasks.

For single-node elastic tasks (task_type='python-task'), the _execute method
reads PET_NPROC_PER_NODE, PET_NNODES, PET_MAX_RESTARTS, and PET_MONITOR_INTERVAL
from environment variables. However, these env vars were never being set in the
task template during serialization.

The fix adds an environment property override to PytorchElasticFunctionTask that
includes the elastic config as environment variables. This ensures that when
task_config is modified via with_overrides, the elastic configuration is
correctly passed to the pod via environment variables.

Combined with the previous fix (dynamic task_type property), this now fully
supports:
- Bug #1: single-node (1 proc) -> single-node (multiple procs) override
- Bug #2: single-node (1 proc) -> multi-node (multiple procs) override

Co-Authored-By: carlos@exa.ai <carlosmarques.personal@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants