Support ARM-based AWS instances #1528

Closed
deliahu opened this issue Nov 3, 2020 · 6 comments · Fixed by #2268

Labels: enhancement (New feature or request), research (Determine technical constraints), timecapped (Assigned a limited amount of time)
Milestone: v0.37

Comments

deliahu (Member) commented Nov 3, 2020

Notes

Only the containers that run on worker nodes need to be compiled for ARM (a build sketch follows the list):

  1. Dequeuer - add target OS/arch args to the Dockerfile and build with docker buildx.
  2. Enqueuer - add target OS/arch args to the Dockerfile and build with docker buildx.
  3. Proxy - add target OS/arch args to the Dockerfile and build with docker buildx.
  4. Async gateway - add target OS/arch args to the Dockerfile and build with docker buildx.
  5. Fluent Bit - build with docker buildx.
  6. Node exporter - build with docker buildx.
  7. Kube RBAC proxy - doesn't have an arm64 version, but we can build one.
  8. Kubexit - need to enable the fork to build an arm64 version.
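For items 1-4, a minimal sketch of what the Dockerfile change and buildx invocation might look like (the paths, Go package, and image tag below are hypothetical, not taken from the Cortex repo). BuildKit populates TARGETOS/TARGETARCH automatically for each platform in the build:

    # Dockerfile sketch: multi-stage build that cross-compiles a Go binary
    FROM golang:1.16 AS builder
    ARG TARGETOS
    ARG TARGETARCH
    WORKDIR /src
    COPY . .
    # Cross-compile for the platform buildx is currently targeting
    RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /app ./cmd/proxy

    FROM alpine:3.13
    COPY --from=builder /app /app
    ENTRYPOINT ["/app"]

    # Build and push a single multi-arch manifest (tag is a placeholder)
    docker buildx build --platform linux/amd64,linux/arm64 -t example.com/cortexlabs/proxy:latest --push .

The same buildx invocation should cover items 5 and 6, since those images only need to be rebuilt per platform rather than modified.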
@deliahu deliahu added the enhancement New feature or request label Nov 3, 2020
@deliahu deliahu added this to To prioritize in Cortex via automation Nov 3, 2020
imagine3D-ai commented

> Just the containers that run on worker nodes need to be compiled for ARM:

  • fluentd has an ARM build
  • cloudwatch-agent doesn't seem to have an ARM build
  • image-downloader containers will need to be updated
  • Modifications will likely need to be made to the API pod containers

What is the timeline on these enhancements?

deliahu (Member, Author) commented Nov 4, 2020

@imagine3D-ai we don't currently have a timeline for ARM instance support. Which instance type are you hoping to use, and is cost reduction your only motivation for using it (and if so, how much would it save you)?

imagine3D-ai commented

Cost is not my only motivation (although c6g.medium is cheaper than t3.medium and more powerful); Compute Optimized instances seem to be more powerful and better suited to machine learning inference than T3 instances.

deliahu (Member, Author) commented Nov 4, 2020

@imagine3D-ai each model behaves a bit differently: some lend themselves to machines with more memory relative to CPU, and others to more/faster CPU relative to memory. The latest "Compute Optimized" non-ARM instances are the c5 and c5a series. "large" is the smallest size for those (as opposed to "medium"), but since you can serve multiple APIs, or multiple replicas of the same API, on a single instance, a larger instance type will not be more expensive if you have multiple APIs or multiple replicas of one API.
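As a rough bin-packing illustration (prices are approximate us-east-1 on-demand rates from around this time, not quoted in this thread): two replicas that each occupy a t3.medium (~$0.0416/hr each, ~$0.083/hr total) cost roughly the same as both replicas packed onto a single c5.large (~$0.085/hr), so the larger instance type adds essentially no per-replica cost.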

@deliahu deliahu removed this from To prioritize in Cortex Nov 26, 2020
sevro commented Mar 1, 2021

Are the required enhancements listed here the same for, say, running the realtime API locally on a Jetson? I am considering taking a swing at this rather than using another model server; I would much rather use Cortex.

The CLI fails to run at all, so I guess that would need to be fixed as well:

datenstrom@ant:~$ cortex
Traceback (most recent call last):
  File "/home/datenstrom/.local/bin/cortex", line 8, in <module>
    sys.exit(run())
  File "/home/datenstrom/.local/lib/python3.6/site-packages/cortex/binary/__init__.py", line 32, in run
    process = subprocess.run([get_cli_path()] + sys.argv[1:], cwd=os.getcwd())
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 8] Exec format error: '/home/datenstrom/.local/lib/python3.6/site-packages/cortex/binary/cli'
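A plausible diagnosis, assuming an aarch64 Jetson (this is an assumption, not confirmed in the thread): "Exec format error" usually means the bundled cli binary was compiled for a different architecture than the host, e.g. x86_64. One way to check, using the path from the traceback:

    # inspect the packaged CLI binary's architecture
    file /home/datenstrom/.local/lib/python3.6/site-packages/cortex/binary/cli
    # an x86_64-only build would report 'ELF 64-bit LSB executable, x86-64'
    uname -m   # should print aarch64 on a Jetson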

vishalbollu (Contributor) commented

The Cortex features that used to manage docker container deployments (also referred to as Cortex local) have been deprecated and are no longer supported. We happened to build a model server along our journey to building a distributed model inference cluster; creating a model server isn't our primary focus.

That said, if you would like to adapt Cortex local to a different architecture, you can take a look at Cortex v0.25, which is the last version of Cortex with local support. The requirements listed in this ticket pertain to making the different components of the Cortex cluster ARM-compatible before ARM instances can be supported. Off the top of my head, Cortex local relies on Docker, and you may have to recompile the cortex Go binary for your architecture as well.
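If you do try recompiling, a minimal cross-compile sketch for an arm64 Jetson (the package path is a guess; check where the CLI's main package lives in the v0.25 tree):

    # from a checkout of cortexlabs/cortex at v0.25
    GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -o bin/cortex ./cli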

@vishalbollu vishalbollu added research Determine technical constraints timecapped Assigned a limited amount of time labels Jun 8, 2021
@deliahu deliahu added this to the v0.37 milestone Jun 22, 2021