Linux-aarch64 support? #136

Open
1 task done
dicta opened this issue Oct 8, 2021 · 23 comments

Comments

@dicta

dicta commented Oct 8, 2021

Issue:

Tensorflow builds are not currently run for aarch64. Raising this issue here before just opening a PR to add this to the migrator since I suspect there may be issues with running linux-aarch64 CI with a package of this size. Note that GPU support is not necessary for my use case, I'd just like to get to the point on this architecture where enough is in place to open and work with pretrained model files.

@hmaarrfk
Contributor

There are issues with running Azure CI with a package this size!

We have some machines that can be used to manually trigger builds.

Many of us are using our personal x86 64-bit machines and OSX machines to build this package anyway.

I think you should add it to the migrator to ensure that all the dependencies are migrated.

I suspect you will need #142 or even #140 to go through first.

@ngam
Contributor

ngam commented Feb 2, 2022

fyi: one could emulate aarch64 on M1 Macs relatively smoothly, e.g. using multipass.run (I believe I built tf2.7 once for aarch64 back in November)

@maresb

maresb commented Oct 29, 2022

I have colleagues with the Apple M1 chipset, and we do lots of Dockerized development. Aarch64 is the relevant platform for Docker on M1. (In contrast, emulation of linux-64 runs roughly 6× more slowly.) I realize that there are lots of competing priorities, but it would be great to support this eventually.
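As a rough illustration of the platform point, here is a small Python sketch that maps what the interpreter reports to the corresponding conda platform name. A Linux container running natively under Docker on an M1 reports `aarch64`, while macOS itself on the same chip reports `arm64`. The helper name `conda_subdir` and the lookup table are assumptions made for this example, not part of any conda tooling.

```python
import platform

# Conda platform (subdir) names for a few common (system, machine) pairs.
_SUBDIRS = {
    ("Linux", "x86_64"): "linux-64",
    ("Linux", "aarch64"): "linux-aarch64",
    ("Linux", "ppc64le"): "linux-ppc64le",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
}


def conda_subdir(system=None, machine=None):
    """Hypothetical helper: guess the conda subdir for a system/machine pair.

    Falls back to the values reported by the running interpreter when the
    arguments are omitted, and returns "unknown" for unlisted pairs.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    return _SUBDIRS.get((system, machine), "unknown")


if __name__ == "__main__":
    # Inside a native container on an M1 this is expected to be linux-aarch64.
    print(conda_subdir())
```

Running this inside the container is a quick way to confirm that Docker is not silently emulating linux-64.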

@hmaarrfk
Contributor

PRs always welcome.

@iamthebot

@hmaarrfk so it's not possible to do the build on Azure? What would be roughly needed for the PR? I can take a stab at it. We plan on using aarch64 (AWS Graviton) heavily so TF builds for aarch64 would be super useful.

@jakirkham
Member

Not Mark, but my guess is one would want to do something like these 2 PRs (this could be one PR):

@hmaarrfk
Contributor

hmaarrfk commented Nov 9, 2022

We have been building TF locally.

So ultimately, we don't need it to pass on azure, just to get "far enough" to convince us the recipe is working (very loose requirement), then typically somebody follows:

https://github.com/conda-forge/cfep/blob/main/cfep-03.md

@njzjz
Member

njzjz commented Nov 9, 2022

May I ask whether the core members have machines for aarch64 and ppc64le, or have to build them on linux-64?

@hmaarrfk
Contributor

hmaarrfk commented Nov 9, 2022

I have been building qt main using emulation.

It takes time, but maybe we can restrict the number of builds for aarch.

@ngam
Contributor

ngam commented Nov 10, 2022

FWIW, I believe the Jax developers implemented a few recent changes in upstream tensorflow that should make this (aarch64, strictly speaking) relatively straightforward/doable, though I personally didn't check and I likely won't be able to help much to initiate an effort. As I said above, if you have an M1/M2 machine, you could also reasonably test and build this there (in Docker, Multipass, etc.). If you start and get stuck, please tag me and I will try to help.

As a rough guess, I think simply following what the bot would do (e.g. look at another PR and mimic it) should likely go a long way.

@iamthebot

iamthebot commented Nov 10, 2022

Ok-- I'm going to take a stab at this when I have a moment (might be 2-3 weeks). Looks like OSX ARM64 is already covered so it's mainly linux aarch64 that we need.

The actual compilation/usage of tensorflow on aarch64 works fine (we've already tried it). Main thing is ensuring the build uses similar build settings to what the other architectures use.

I have access to actual aarch64 linux instances which should be better for initial testing since the TF build is hefty. On my M1 mac it's not all that fast.

@ngam
Contributor

ngam commented Nov 18, 2022

Current blocker if someone wants to help: conda-forge/tensorboard-data-server-feedstock#14

@Tobias-Fischer
Contributor

Hi @iamthebot, are you happy to give this a go? We now have tensorboard for linux-aarch64: conda-forge/tensorboard-data-server-feedstock#18

As you said, I think mainly what it takes is a powerful aarch64 machine. Unfortunately all I got is a raspberry pi 4. I think the conda-forge aarch instance still does not work (@hmaarrfk @isuruf)?

@maresb

maresb commented Feb 13, 2023

> As you said, I think mainly what it takes is a powerful aarch64 machine. Unfortunately all I got is a raspberry pi 4.

Another common example is an Apple M1 or M2 running Docker. (I don't have one myself.)

@Tobias-Fischer
Contributor

Oh awesome, I did not know that @maresb. Do you have any instructions on how to go about this? I guess a modified version of the linux instructions in #291?

Anyhow, currently we are stuck because we only built tensorboard 2.12 for aarch64, and tensorflow 2.12 is not yet released (see #301)

@maresb

maresb commented Feb 13, 2023

I'm not sure, but I'm guessing you could just run the Linux script (of course replacing linux_64_*) in a Docker container.

Regarding having built only 2.12, if it's not too much work to rebuild, you should be able to request that a v2.11 branch be created in tensorboard-data-server-feedstock and do a corresponding build there.

@Tobias-Fischer
Contributor

According to the release notes, tensorboard 2.12 is the first one to support manylinux2014 and thus aarch64 (https://github.com/tensorflow/tensorboard/releases). So I guess it's easiest to wait - I am not in a rush :).

@iamthebot

Dumb question-- do we really need to rebuild from source or can we just leverage the compiled libs from the official releases? Per @Tobias-Fischer that's the implication, right?

While I can certainly do a few one-off aarch64 builds, I'm not sure I can commit to a dedicated box for this or to the overhead of running these builds on a cadence.

@Tobias-Fischer
Contributor

Unfortunately, in conda-forge we cannot leverage pre-compiled libs. I found that using an M1-powered MacBook works well for linux-aarch64 builds using the build-locally.py which uses Docker.
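For anyone wanting to try this, a minimal sketch of how one might pick the linux-aarch64 configs and assemble the `build-locally.py` invocation is below. It assumes the usual feedstock layout, where rendered variant configs live in `.ci_support/` with file names starting `linux_aarch64_`, and that `build-locally.py` accepts a config name as its positional argument. The helper names `aarch64_configs` and `build_command` are made up for illustration. The code only builds the command list; it does not start Docker or run the build.

```python
import sys
from pathlib import Path


def aarch64_configs(feedstock_dir):
    """Hypothetical helper: list linux_aarch64 config names, sorted.

    Assumes variant configs are stored as .ci_support/<name>.yaml.
    Only the trailing ".yaml" is stripped, so dots inside the name survive.
    """
    ci_support = Path(feedstock_dir) / ".ci_support"
    return sorted(p.stem for p in ci_support.glob("linux_aarch64_*.yaml"))


def build_command(config):
    """Hypothetical helper: the command that would build one config.

    Assumes it is run from the feedstock root, next to build-locally.py.
    """
    return [sys.executable, "build-locally.py", config]


if __name__ == "__main__":
    # Print one candidate command per aarch64 config in the current feedstock.
    for name in aarch64_configs("."):
        print(" ".join(build_command(name)))
```

Printing the commands first makes it easy to pick a single variant, which matters for a build as long as TensorFlow's.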

@iamthebot

iamthebot commented Mar 7, 2023

> Unfortunately, in conda-forge we cannot leverage pre-compiled libs. I found that using an M1-powered MacBook works well for linux-aarch64 builds using the build-locally.py which uses Docker.

If that's the case, I'm happy to help with some builds (I have access to an M1 Mac as well), assuming I'm not a single point of failure. Like I said, I can also easily run builds on an aarch64 linux instance. What would be needed to get started?

@hmaarrfk
Contributor

hmaarrfk commented Mar 7, 2023

Did emulation really not work? Typically, I can spare my workstation for 1 day. 6 hours natively = 30 hours of emulation. So if you have a few more than 2 cores, you can get that down to 5 hours. I guess tensorflow is already 3 hours natively, so maybe 15 hours for aarch with emulation?
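The arithmetic in that estimate can be written out as a short Python sketch. The 5x slowdown is inferred from the figures in the comment (6 hours becoming 30, 3 hours becoming 15) and is an assumption, not a measurement. The function names and the 24-hour window (the "1 day" of workstation time) are likewise illustrative.

```python
def emulated_hours(native_hours, slowdown=5.0):
    """Estimate emulated build time from native build time.

    The default slowdown of 5x is an assumption inferred from the
    "6 hours = 30 hours" figure in the thread.
    """
    return native_hours * slowdown


def fits_in_window(native_hours, window_hours=24.0, slowdown=5.0):
    """Check whether the emulated build fits in the available machine time."""
    return emulated_hours(native_hours, slowdown) <= window_hours


if __name__ == "__main__":
    for native in (3.0, 6.0):
        est = emulated_hours(native)
        print(f"{native:.0f} h native -> about {est:.0f} h emulated, "
              f"fits in one day: {fits_in_window(native)}")
```

Under these assumptions a 3-hour native build fits in a one-day window under emulation, while a 6-hour one does not.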

@Tobias-Fischer
Contributor

Our progress is here: #301

Basically, we are stuck with some build system changes that seem to have occurred. I might give it a go once they release another RC.

@ngam
Contributor

ngam commented Mar 7, 2023

> I might give it a go once they release another RC.

That was my plan too.

Btw, my general advice is: Wait until we clear the six hours on the public CI first before doing much locally unless you know precisely what you're doing. Currently, I really have no idea what's going on with the error we hit. One time in the past, I simply identified the commit that seemed to break things and kept nudging the person who committed it upstream (for Tensorflow, many important commits "just happen" directly on master/main without PRs because they're often pushed by internal engineers). After a week or so of tagging that person, they added a commit that essentially fixed the issue without notification or update (I noticed the commit because it tagged the comments in the issue).

Once there's a working version with high confidence, we rarely have a problem finding volunteers to build locally. I really wouldn't worry about that. However, we cannot ask volunteers to help if we aren't relatively confident we have something working. For now, we simply don't have a passing build and it's erroring relatively early in the build process. That's the focus. When adding a new arch, like aarch64, I think we will wait to assess that once we have the other builds working (especially linux-64).

> did emulation really not work? typically, I can spare my workstation for 1 day. 6 hours = 30 hours of emulation.

I think the aarch64 build on M1 Macs should be much faster btw, perhaps close to the regular osx build
