Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt ray flyteplugin to Kuberay 1.1.0 #5067

Merged
merged 9 commits into from
Mar 29, 2024
Merged

Adapt ray flyteplugin to Kuberay 1.1.0 #5067

merged 9 commits into from
Mar 29, 2024

Conversation

ByronHsu
Copy link
Contributor

@ByronHsu ByronHsu commented Mar 15, 2024

Why are the changes needed?

From Kuberay 1.0.0 to 1.1.0, few changes is made in rayjob CRD, including:

  1. Rayjob status, which flyteplugin uses to determine flyte status
  2. RuntimeEnv is deprecated, which flyteplugin uses to create rayjob
  3. Few new fields introduced

image

What changes were proposed in this pull request?

In general, the change is to support kuberay 1.1.0 but ensure backward compatibility at the same time. I will first walk through the change i made, then enumerate user scenerio.

Flyteidl

  1. Deprecate but not remove runtime_env
  2. Add runtime_env_yaml

Flytekit

  1. Serialize runtime_env to yaml and pass to runtime_env_yaml

Flyteplugin

  1. Remove all v1apha1 logics
  2. If runtime_env_yaml is not present, convert runtime_env to yaml to ensure backward comp
  3. Re-map rayjob status to flyte task status. As suggested by @Yicheng-Lu-llll , only rely on jobdeployment status

How was this patch tested?

User scenario

flytekit flyte kuberay Result
Old Old 1.0.0 just works
New New 1.1.0 works
Old New 1.1.0 works because new flyte also handle base64 runtime_env
New Old 1.0.0 works because new flytekit also passes runtime_env

I have run end-to-end tests on local sandbox with the following tests.

  1. rayjob succeeds
import typing

from flytekit import ImageSpec, Resources, task, workflow

commit_hash = "5a57f299c1ae789c9e2f269302becf13ca3b21e8"
ray_plugin = f"git+https://github.com/flyteorg/flytekit.git@{commit_hash}#subdirectory=plugins/flytekit-ray"

custom_image = ImageSpec(
    name="ray-flyte-plugin",
    registry="localhost:30000",
    packages=[ray_plugin],
    apt_packages=['git', 'wget', 'vim'],
)

if custom_image.is_container():
    import ray
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

@ray.remote
def f(x):
    print(a)
    print(f"ray task get {x}")
    return x * x

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
    runtime_env={"pip": []},
)

@task(
    task_config=ray_config,
    requests=Resources(mem="4Gi", cpu="1"),
    container_image=custom_image,
)
def ray_task(n: int) -> typing.List[int]:
    futures = [f.remote(i) for i in range(n)]
    l = ray.get(futures)
    print(l)
    return l


@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)

if __name__ == "__main__":
    f.remote(x=2)
image
  1. rayjob fails
import typing

from flytekit import ImageSpec, Resources, task, workflow

commit_hash = "5a57f299c1ae789c9e2f269302becf13ca3b21e8"
ray_plugin = f"git+https://github.com/flyteorg/flytekit.git@{commit_hash}#subdirectory=plugins/flytekit-ray"

custom_image = ImageSpec(
    name="ray-flyte-plugin",
    registry="localhost:30000",
    packages=[ray_plugin],
    apt_packages=['git', 'wget', 'vim'],
)

if custom_image.is_container():
    import ray
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

@ray.remote
def f(x):
    a = 1 // 0
    print(f"ray task get {x}")
    return x * x

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)]
)

@task(
    task_config=ray_config,
    requests=Resources(mem="4Gi", cpu="1"),
    container_image=custom_image,
)
def ray_task(n: int) -> typing.List[int]:
    futures = [f.remote(i) for i in range(n)]
    l = ray.get(futures)
    print(l)
    return l


@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)

if __name__ == "__main__":
    f.remote(x=2)
image

Related PRs

flyteorg/flytekit#2274

Notes

If they upgrade flyte, they also have to upgrade kuberay to 1.1.0. Have to add doc somewhere

@ByronHsu ByronHsu changed the title init Adapt ray flyteplugin to Kuberay 1.1.0 Mar 15, 2024
@ByronHsu ByronHsu marked this pull request as ready for review March 16, 2024 00:02
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 16, 2024
Copy link

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 48.88889% with 23 lines in your changes are missing coverage. Please review.

Project coverage is 59.11%. Comparing base (78a690f) to head (c9cbbda).
Report is 9 commits behind head on master.

Files Patch % Lines
flyteplugins/go/tasks/plugins/k8s/ray/ray.go 48.88% 21 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5067      +/-   ##
==========================================
+ Coverage   59.00%   59.11%   +0.11%     
==========================================
  Files         645      645              
  Lines       55672    55546     -126     
==========================================
- Hits        32847    32837      -10     
+ Misses      20230    20120     -110     
+ Partials     2595     2589       -6     
Flag Coverage Δ
unittests 59.11% <48.88%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ByronHsu
Copy link
Contributor Author

@pingsutw @Yicheng-Lu-llll could you provide guidance on how to resolve the make generate CI errors? Thanks!

flyte-single-binary-local.yaml Outdated Show resolved Hide resolved
flyteidl/protos/flyteidl/plugins/ray.proto Outdated Show resolved Hide resolved
// RuntimeEnvYAML represents the runtime environment configuration
// provided as a multi-line YAML string.
string runtime_env_yaml = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// RuntimeEnvYAML represents the runtime environment configuration
// provided as a multi-line YAML string.
string runtime_env_yaml = 5;
// RuntimeEnvYAML represents the runtime environment configuration
// provided as a multi-line YAML string.
string runtime_env_yaml = 5;

@@ -22,7 +22,6 @@ var (
IncludeDashboard: true,
DashboardHost: "0.0.0.0",
EnableUsageStats: false,
KubeRayCrdVersion: "v1alpha1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@peterghaddad peterghaddad Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ByronHsu after this MR is complete is there no going back to v1alpha1?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure if Flyte has decided to do a hard switchover from v1alpha1 to v1. If not, then I would recommend with staying on v1alpha1 as the default value for the time being.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1alpha1 in kuberay 1.0.0 and kuberay 1.1.0 are different. That being said, if you upgrade flyte but keep kuberay at 1.0.0. The new v1alpha1 imported in flyte cannot be run on the old kuberay. Also, there is no purpose to keep both v1alpha1 and v1, so we decide to discard v1alpha1.

If users really want to stay at v1alpha1, they can feel free to use the old flyte and kuberay 1.0.0. But if they want to upgrade kuberay to 1.1.0, it is required for them to upgrade flyte as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there new API changes to the KubeRay v1 CRDs or are users required to update KubeRay to v1.1.0 instead of say v1.0.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can stills stay at v1.0.0 with old flyte server. See (old, old) row
image

@EngHabu
Copy link
Contributor

EngHabu commented Mar 26, 2024

Thank you for the due diligence in the testing matrix. I've one additional question related to backward compatibility. Will this handle running jobs in a graceful way? or will it fail to read the old CRD completely? I imagine if somebody has Ray jobs running and an upgrade happened at the same time.

cc @pingsutw

@pingsutw
Copy link
Member

cc @kevin85421 Will kuberay handle running jobs in a graceful way

@kevin85421
Copy link

Will this handle running jobs in a graceful way? or will it fail to read the old CRD completely?

KubeRay does not provide a conversion webhook to handle CRD upgrades gracefully. I haven't tried upgrading the CRD while RayJob custom resources were running, but I suspect there will be some issues, especially if the running RayJob uses runtimeEnv, which was deprecated in KubeRay v1.1.0.

@ByronHsu
Copy link
Contributor Author

ByronHsu commented Mar 26, 2024

Even though there might be issues as @kevin85421 mentioned, we only use rayjob (which is a batch job) in flyte. it shall be tolerable for users to rerun the job if the old one failed due to incompatibility.

Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
Signed-off-by: Pin-Lun Hsu <byhsu@linkedin.com>
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 28, 2024
@ByronHsu ByronHsu merged commit 8bb5b8c into master Mar 29, 2024
47 of 48 checks passed
@ByronHsu ByronHsu deleted the byhsu/kuberay-1-1-0 branch March 29, 2024 05:32
@eapolinario eapolinario mentioned this pull request Mar 29, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request lgtm This PR has been approved by a maintainer size:XL This PR changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants