Add support for Multiple Architecture types #1539

akhurana001 · 2021-10-01T01:14:49Z

Motivation: Why do you think this is important?

Multiple cloud providers (AWS etc.) have extensive support for ARM based instances now and we want to explore running Flyte workloads on ARM instances due to multiple benefits over x86. In general, the goal is to enable platform admins to configure multiple architecture types as well as Flyte users to select between them based on which one they want to use/image support etc.

Goal: What should the final outcome look like, ideally?
Platform admins be able to configure multiple architecture support within the same flyte installation and users being able to specify which one to execute on.

[Optional] Propose: Link/Inline OR Additional context
The goal is to enable Platform admins to maintain both ARM and x86 ASGs on each Kubernetes cluster and to intelligently route workloads to the right ASG based on which architecture the user image was built/tested/registered for.

This can be done with adding taints to the ASGs and tolerations/node-selector to pods something Flyte already does for Interruptible enabled Tasks. However, for this use-case, it might be easier for users to tag the whole repo/sha with ARM rather than tagging individual tasks/nodes.

I was thinking we can re-use labels support at Launch-Plan level to implement this: Users can set architecture label either at project level via flyte-config or directly at a launch-plan level which will then be routed correctly to the Flyte backend for those workflow versions/sha and corresponding tolerations and node-selector be added to those pods by FlytePropeller/plugins (similar to how Interruptible works).

Let me know what your thoughts are on this.

kumare3 · 2021-10-01T03:25:35Z

Quick comment, it's flytekit tested on arm?
Also let's not as anything to flytekit config, but to launchplan or workflow spec.

akhurana001 · 2021-10-01T20:03:00Z

Quick comment, it's flytekit tested on arm? Also let's not as anything to flytekit config, but to launchplan or workflow spec.

Haven't tested extensively but I did run a basic spark/python job on ARM (flytekit=0.16) and it worked.

akhurana001 · 2021-10-05T00:26:36Z

Synced-up with Haytham. Some notes:

The architecture field is closely related to the image/version and is probably a better fit as part of Container IDL . It's better to have this strongly typed here as opposed to labels on launch-plans (thou overrides probably would still be label based)
For UX, flyte-config is probably a good solution to default at project level.
We would probably need to support overrides at launch-plan levels for larger user/projects. Ideally this should be strongly typed but label based override is ok.

For backend implementation, two potential options are:

Letting FlytePropeller/FlytePlugins add tolerations/node-selector based on the architecture field similar to how interruptible works today.
Let FlytePropeller/Plugins add a label to each pod and a mutating webhook handle adding tolerations/node-selector. This can even work with plugins which don't have tolerations/node-selector support exposed in the CRD.

akhurana001 · 2021-10-07T00:41:54Z

Hi @kumare3 @EngHabu @anandswaminathan , based on our discussions:

I am leaning towards adding an Explicit Architecture field in Container IDL and letting FlytePropeller/FlytePlugins to add tolerations/node-selector based on the architecture field similar to how interruptible works today.

For the UX, this field can be defaulted by users as part of register command (pyflyte register) which is preferable over the flyte-config default approach (@kumare3 ?) . Additionally, users can override this at a task level if need be for any container task.

We can also add a mutating webhook but that seems like an overkill since even for the webhook, FlytePropeller/Plugins will need to understand the arch-field and add a label to be consumed by the webhook and if plugins own this part, I rather have them add the tolerations/node-selector directly (@EngHabu ?)

Let me know if there are any concerns here else I will start on the changes.

EngHabu · 2021-10-08T16:45:25Z

I think that's a great start.. my main concern with plugins doing that (also applies to interruptible btw) is that plugins that create other CRDs that then create Pods (e.g. Distributed PyTorch) do not expose the full PodSpec but they do expose labels and it seems to be a common convention that created pods get a copy of the CRD labels (if not annotations as well)... hence why I think the WebHook model works better...

That said, I think it's ok to start this way and move to something else later on... as long as we are happy with the Idl representation and UX..

The one thing that's not clear to me from the summary is what will go into that architecture field in Container IDL? is it a single architecture or all supported architectures?

message Container {
   // An ordered list of architectures supported by this container image. The plugins that will run this container will attempt to match this list against supported machine types in the cluster and pick the first available one.
   repeated string architecture = X;

Is that what you meant? or did you mean they will just set a single architecture and the task fails if that architecture isn't available in the cluster? (they can still do that even if it's a list by setting a single item there)...

akhurana001 · 2021-10-08T21:45:49Z

Is that what you meant? or did you mean they will just set a single architecture and the task fails if that architecture isn't available in the cluster? (they can still do that even if it's a list by setting a single item there)...

I was thinking of an enum i.e. users only specify a single architecture and fail if its not present.

If we let them specify multiple, then it gets more complex into how does the Platform decide on one and what happens if the workflows fail on one but maybe succeed on another so I would prefer if we let users choose one and use that. Users would need to work with the Platform admin to understand which architectures are supported and choose.

github-actions · 2023-08-27T00:40:31Z

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions · 2023-09-04T00:38:13Z

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions · 2024-07-30T00:09:01Z

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

akhurana001 added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Oct 1, 2021

akhurana001 mentioned this issue Oct 19, 2021

Add Architecture field to Container flyteorg/flyteidl#226

Merged

This was referenced Nov 16, 2021

Add support for architecture flyteorg/flytekit#752

Merged

Add support for architecture flyteorg/flyteadmin#294

Open

pingsutw assigned akhurana001 Nov 22, 2021

kumare3 removed the untriaged This issues has not yet been looked at by the Maintainers label Jan 11, 2022

reverson mentioned this issue May 27, 2022

Add support for Architecture Field flyteorg/flytekit#1028

Closed

github-actions bot added the stale label Aug 27, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 4, 2023

eapolinario reopened this Nov 2, 2023

github-actions bot removed the stale label Nov 3, 2023

github-actions bot added the stale label Jul 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for Multiple Architecture types #1539

Add support for Multiple Architecture types #1539

akhurana001 commented Oct 1, 2021 •

edited

Loading

kumare3 commented Oct 1, 2021

akhurana001 commented Oct 1, 2021

akhurana001 commented Oct 5, 2021

akhurana001 commented Oct 7, 2021

EngHabu commented Oct 8, 2021

akhurana001 commented Oct 8, 2021

github-actions bot commented Aug 27, 2023

github-actions bot commented Sep 4, 2023

github-actions bot commented Jul 30, 2024

Add support for Multiple Architecture types #1539

Add support for Multiple Architecture types #1539

Comments

akhurana001 commented Oct 1, 2021 • edited Loading

kumare3 commented Oct 1, 2021

akhurana001 commented Oct 1, 2021

akhurana001 commented Oct 5, 2021

akhurana001 commented Oct 7, 2021

EngHabu commented Oct 8, 2021

akhurana001 commented Oct 8, 2021

github-actions bot commented Aug 27, 2023

github-actions bot commented Sep 4, 2023

github-actions bot commented Jul 30, 2024

akhurana001 commented Oct 1, 2021 •

edited

Loading