Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ci image versions #129

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

suzhoum
Copy link
Contributor

@suzhoum suzhoum commented Jun 7, 2024

Issue #, if available:

Description of changes:
This PR fixes a few bugs in CI.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -152,7 +152,7 @@ jobs:
filters: |
other_than_docs:
- '!(docs/**)**'
- name: Test Tabular Cloud
- name: Test Text Cloud
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

@@ -395,6 +395,10 @@ def _get_image_uri(self, framework_version: str, instance_type: str, custom_imag
image_scope="training",
instance_type=instance_type,
)
if instance_type == "gpu":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for testing, I created a few images with the correct ray version. I can remove them before merging.

@@ -44,3 +44,4 @@ head_node_type: head
initialization_commands:
# To get access to DLC image
- aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
- aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this region support with us-east-1 image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user uses images before 0.8.0, I think they are hosted in us-east-1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, shall we host all images in the same region say "us-east-1"?

gpu_training_image = "369469875935.dkr.ecr.us-east-1.amazonaws.com/autogluon-nightly-training:gpu-latest"
cpu_inference_image = "369469875935.dkr.ecr.us-east-1.amazonaws.com/autogluon-nightly-inference:cpu-latest"
gpu_inference_image = "369469875935.dkr.ecr.us-east-1.amazonaws.com/autogluon-nightly-inference:gpu-latest"
cpu_training_image = "369469875935.dkr.ecr.us-east-1.amazonaws.com/weisu:cpu-latest"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to use weisu's image instead of fixing the nightly image?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is for testing. I created those just now.

backend_kwargs={
"initialization_commands": [
"aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com",
"aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link

github-actions bot commented Jun 8, 2024

Job PR-129-7d9fda5 is done.
Docs are uploaded to https://d12sc05jpx1wj5.cloudfront.net/PR-129/7d9fda5/index.html

Copy link

github-actions bot commented Jun 8, 2024

Job PR-129-4ea70d5 is done.
Docs are uploaded to https://d12sc05jpx1wj5.cloudfront.net/PR-129/4ea70d5/index.html

Copy link

github-actions bot commented Jun 9, 2024

Job PR-129-737a840 is done.
Docs are uploaded to https://d12sc05jpx1wj5.cloudfront.net/PR-129/737a840/index.html

@suzhoum
Copy link
Contributor Author

suzhoum commented Jun 10, 2024

@tonyhoo @prateekdesai04 CI is green after limiting the ray install in DLC images to align with what we have in https://github.com/autogluon/autogluon/blob/4825cff578c1cfad8997bd8b62dcd5749d0e55f1/core/setup.py#L51.
One issue I'm not sure about is that test_cluster always fails when first run on push https://github.com/autogluon/autogluon-cloud/actions/runs/9443270050 due to permission issue. However, when rerun the failed test in Action, the test passes https://github.com/autogluon/autogluon-cloud/actions/runs/9434703352.

@@ -395,6 +395,10 @@ def _get_image_uri(self, framework_version: str, instance_type: str, custom_imag
image_scope="training",
instance_type=instance_type,
)
if "g4dn" in instance_type:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GPU image shall be applicable to all GPU instances. We shall update the logics here to be more robust for all GPU instance types:

def is_gpu_instance(instance_type):
    # A list of instance types that have GPUs
    gpu_instances = [
        'g2', 'g3', 'g4', 'p2', 'p3', 'p4', 'p4d', 'p5', 'p3dn', 'g4dn', 'g5', 'g5g'
    ]
    return any(instance_type.startswith(prefix) for prefix in gpu_instances)

def select_image():
    instance_type = get_instance_type()
    if is_gpu_instance(instance_type):
        return "369469875935.dkr.ecr.us-east-1.amazonaws.com/weisu:gpu-latest"
    else:
        return "369469875935.dkr.ecr.us-east-1.amazonaws.com/weisu:cpu-latest"
      ```

Copy link
Contributor Author

@suzhoum suzhoum Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will actually remove the whole commit 6cc23a7. This is to override the logic of getting the versioned DLC images using the sagemaker api for testing my custom images

@@ -6,7 +6,7 @@ max_workers: 1

docker:
# The image uri will be dynamically replaced by cloud predictor
image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/autogluon-training:0.7.0-cpu-py39-ubuntu20.04"
image: "763104351884.dkr.ecr.us-west-2.amazonaws.com/autogluon-training:1.1.0-cpu-py310-ubuntu20.04"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to host the image to us-east-1 if not there already

Copy link
Contributor

@prateekdesai04 prateekdesai04 Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the image is already present in east-1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -39,6 +39,10 @@ def __init__(
image_scope="training",
instance_type=instance_type,
)
if "g4dn" in instance_type:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, let's consolidate the image selection logic in a common function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, I will drop the commit.

@suzhoum suzhoum added the safe to test This PR is safe to be tested in the cloud label Jun 10, 2024
@github-actions github-actions bot removed the safe to test This PR is safe to be tested in the cloud label Jun 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants