
XLA doesn't support Mac ARM #217

Closed
jeffreyksmithjr opened this issue Feb 17, 2021 · 27 comments · Fixed by #423 or #486
Labels
area:exla Applies to EXLA note:upstream The issue must be tackled upstream

Comments

@jeffreyksmithjr

Attempting to install EXLA (and thus XLA) on a Mac with an M1 chip (running Big Sur) runs into this issue:

cp -f /Users/jeff/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/bazel-bin/tensorflow/compiler/xla/exla/libexla.so priv/libexla.so

19:19:02.605 [warn]   The on_load function for module Elixir.EXLA.NIF returned:
{:error,
 {:load_failed,
  'Failed to load NIF library: \'dlopen(/Users/jeff/Projects/elixir/nx/exla/_build/test/lib/exla/priv/libexla.so, 2): no suitable image found.  Did find:\n\t/Users/jeff/Projects/elixir/nx/exla/_build/test/lib/exla/priv/libexla.so: mach-o, but wrong architecture\n\t/Users/jeff/Projects/elixir/nx/exla/_build/test/lib/exla/priv/libexla.so: stat() failed with errno=35\''}}

** (Mix) Could not start application exla: EXLA.Application.start(:normal, []) returned an error: shutdown: failed to start child: EXLA.Logger
    ** (EXIT) an exception was raised:
        ** (UndefinedFunctionError) function EXLA.NIF.start_log_sink/1 is undefined (module EXLA.NIF is not available)
            (exla 0.1.0-dev) EXLA.NIF.start_log_sink(#PID<0.210.0>)
            (exla 0.1.0-dev) lib/exla/logger.ex:12: EXLA.Logger.init/1
            (stdlib 3.14) gen_server.erl:417: :gen_server.init_it/2
            (stdlib 3.14) gen_server.erl:385: :gen_server.init_it/6
            (stdlib 3.14) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

19:19:02.605 [info] domain=otp file=application_controller.erl line=1943   Application exla exited: EXLA.Application.start(:normal, []) returned an error: shutdown: failed to start child: EXLA.Logger
    ** (EXIT) an exception was raised:
        ** (UndefinedFunctionError) function EXLA.NIF.start_log_sink/1 is undefined (module EXLA.NIF is not available)
            (exla 0.1.0-dev) EXLA.NIF.start_log_sink(#PID<0.210.0>)
            (exla 0.1.0-dev) lib/exla/logger.ex:12: EXLA.Logger.init/1
            (stdlib 3.14) gen_server.erl:417: :gen_server.init_it/2
            (stdlib 3.14) gen_server.erl:385: :gen_server.init_it/6
            (stdlib 3.14) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

This is really more of an XLA issue than an EXLA issue. I'm just noting it to record the negative result for Mac ARM support.

@seanmor5
Collaborator

Do you have the full Bazel build output? It might be something we can try to troubleshoot or we can open this as an issue upstream.

@josevalim On that note, perhaps we should also add some options for logging build output. It would probably make it easier for users to pass these issues up to us to debug.

@jeffreyksmithjr
Author

I'm pretty sure that Mac ARM support just doesn't yet exist for XLA but will soon enough. You can see various Jax issues and discussions about the topic (as well as for PyTorch XLA).

@josevalim josevalim added area:exla Applies to EXLA note:upstream The issue must be tackled upstream labels Feb 17, 2021
@josevalim
Collaborator

Thanks @jeffreyksmithjr! Btw, we will make the project public today, so if you get any e-mail related to your org access, that's the reason. :)

@wojtekmach
Contributor

wojtekmach commented Feb 18, 2021

Failed to load NIF library: \'dlopen(.../libexla.so, 2): no suitable image found. Did find:\n\t...libexla.so: mach-o, but wrong architecture

I'm also on an ARM Mac and got a similar error. Inspecting the libexla.so file, I get:

$ file _build/dev/lib/exla/priv/libexla.so
_build/dev/lib/exla/priv/libexla.so: Mach-O 64-bit dynamically linked shared library x86_64

and I think the reason is that my Bazel is an x86_64 binary as well:

$ file ~/.asdf/installs/bazel/3.1.0/lib/bazel/bin/bazel-real
/Users/wojtek/.asdf/installs/bazel/3.1.0/lib/bazel/bin/bazel-real: Mach-O 64-bit executable x86_64

It looks like support for ARM Macs landed very recently via bazel build --cpu=darwin_arm64 (bazelbuild/bazel#12900), so perhaps it's not an XLA problem at all, just a matter of time before there's a new Bazel release.

@wojtekmach
Contributor

I may have been able to make some progress:

asdf plugin-add bazel
asdf install bazel 4.0.0
asdf global bazel 4.0.0
git clone git@github.com:bazelbuild/bazel ~/bazel
cd ~/bazel
bazel build --cpu=darwin_arm64 -c opt //src:bazel
git clone git@github.com:elixir-nx/nx ~/nx
cd ~/nx/exla
export PATH=~/bazel/bazel-bin/src:$PATH
EXLA_TENSORFLOW_GIT_REV=045b62dc3ee2ce23ace71a39b5e433abbbbe3900 mix compile

Credit: bazelbuild/bazel#12900 (comment)

045b62dc3ee2ce23ace71a39b5e433abbbbe3900 is simply the tip of tensorflow/tensorflow at the time of writing.

This results in a different error:

{:error,
 {:load_failed,
  'Failed to load NIF library: \'dlopen(/Users/wojtek/src/nx/exla/_build/dev/lib/exla/priv/libexla.so, 2): Symbol not found: _LLVMInitializeAArch64AsmPrinter\n  Referenced from: /Users/wojtek/src/nx/exla/_build/dev/lib/exla/priv/libexla.so\n  Expected in: flat namespace\n in /Users/wojtek/src/nx/exla/_build/dev/lib/exla/priv/libexla.so\''}}

@seanmor5
Collaborator

@wojtekmach Can you try bazel clean --expunge and then build with EXLA_MODE=dbg? That will force a full rebuild, though, so if you'd prefer to try other troubleshooting steps before blowing the whole thing away, that makes sense too. That's an LLVM linking issue, but I'm not sure why it's happening.

@seanmor5
Collaborator

For anybody interested in attempting to resolve this, it looks like it might be fixable; see:
google/jax#5501
tensorflow/tensorflow#45404

I don't have a Mac to test on, but my recommendation is:

First, per the issue above:

Install x86_64 Bazel 3.7.1 through Rosetta
Install Python 3.8.2
Install Xcode 12.3

Next, change EXLA_TENSORFLOW_GIT_REV to the most recent commit on TF master. At the time of this writing that is: a2a5b86c3bd90e03151a25c52c0f6cebbd573228.

Finally, set EXLA_FLAGS:

export EXLA_FLAGS=--config=macos_arm64

@behe

behe commented Feb 22, 2021

@seanmor5 I tried that, but I don't think that version of TensorFlow likes the Makefile patches:

EXLA_FLAGS="--config=macos_arm64" EXLA_TENSORFLOW_GIT_REV=a2a5b86c3bd90e03151a25c52c0f6cebbd573228 mix compile
cd /Users/behe/.cache/exla/tf-a2a5b86c3bd90e03151a25c52c0f6cebbd573228/erts-11.1.3 && \
		bazel build --define "framework_shared_object=false" -c opt  --config=macos_arm64 //tensorflow/compiler/xla/exla:libexla.so
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /Users/behe/.cache/exla/tf-a2a5b86c3bd90e03151a25c52c0f6cebbd573228/erts-11.1.3/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /Users/behe/.cache/exla/tf-a2a5b86c3bd90e03151a25c52c0f6cebbd573228/erts-11.1.3/.bazelrc:
  'build' options: --apple_platform_type=macos --define framework_shared_object=true --java_toolchain=@org_tensorflow//third_party/toolchains/java:tf_java_toolchain --host_java_toolchain=@org_tensorflow//third_party/toolchains/java:tf_java_toolchain --define=tensorflow_enable_mlir_generated_gpu_kernels=0 --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --noincompatible_prohibit_aapt1 --enable_platform_specific_config --config=short_logs --config=v2
INFO: Found applicable config definition build:short_logs in file /Users/behe/.cache/exla/tf-a2a5b86c3bd90e03151a25c52c0f6cebbd573228/erts-11.1.3/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /Users/behe/.cache/exla/tf-a2a5b86c3bd90e03151a25c52c0f6cebbd573228/erts-11.1.3/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:macos_arm64 in file /Users/behe/.cache/exla/tf-a2a5b86c3bd90e03151a25c52c0f6cebbd573228/erts-11.1.3/.bazelrc: --config=macos --apple_platform_type=macos --cpu=darwin_arm64 --noenable_platform_specific_config
INFO: Found applicable config definition build:macos in file /Users/behe/.cache/exla/tf-a2a5b86c3bd90e03151a25c52c0f6cebbd573228/erts-11.1.3/.bazelrc: --copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14
Loading:
Loading: 0 packages loaded
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (0 packages loaded, 0 targets configured)
ERROR: While resolving toolchains for target //tensorflow/compiler/xla/exla:libexla.so: invalid registered toolchain '@local_config_python//:py_toolchain': no such target '@local_config_python//:py_toolchain': target 'py_toolchain' not declared in package '' defined by /private/var/tmp/_bazel_behe/2118788051a073b285edfafde0fd5880/external/local_config_python/BUILD
ERROR: Analysis of target '//tensorflow/compiler/xla/exla:libexla.so' failed; build aborted: invalid registered toolchain '@local_config_python//:py_toolchain': no such target '@local_config_python//:py_toolchain': target 'py_toolchain' not declared in package '' defined by /private/var/tmp/_bazel_behe/2118788051a073b285edfafde0fd5880/external/local_config_python/BUILD
INFO: Elapsed time: 0.320s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
make: *** [all] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. Try running the
commands "gcc --version" and / or "make --version". If these programs
are not installed, you will be prompted to install them.

@seanmor5
Collaborator

seanmor5 commented Feb 22, 2021

@behe I know what's going on. In #247 we "removed" the NumPy dependency by commenting out some things in the TF build script. They changed the script recently, so we'll need to update what we're doing with a new commit. What you'll want to do is remove disable_python_checks from the all target in the Makefile. That will fix your problem and at least let the build proceed. Note that you will have to completely remove your TF checkout first, so run make clean in the EXLA project directory.

@wojtekmach
Contributor

wojtekmach commented Feb 22, 2021

I got the same error as above. Interestingly, building the exact same version of TensorFlow from a local checkout allowed me to move forward:

diff --git a/exla/Makefile b/exla/Makefile
index 0db21c8..2b7864b 100644
--- a/exla/Makefile
+++ b/exla/Makefile
@@ -24,7 +24,8 @@ ERTS_SYM_DIR = $(EXLA_DIR)/erts
 BAZEL_FLAGS = --define "framework_shared_object=false" -c $(EXLA_MODE)

 TENSORFLOW_NS = tf-$(EXLA_TENSORFLOW_GIT_REV)
-TENSORFLOW_DIR = $(EXLA_CACHE)/$(TENSORFLOW_NS)/erts-$(ERTS_VERSION)
+# TENSORFLOW_DIR = $(EXLA_CACHE)/$(TENSORFLOW_NS)/erts-$(ERTS_VERSION)
+TENSORFLOW_DIR = /Users/wojtek/src/tensorflow
 TENSORFLOW_EXLA_NS = tensorflow/compiler/xla/exla
 TENSORFLOW_EXLA_DIR = $(TENSORFLOW_DIR)/$(TENSORFLOW_EXLA_NS)

@@ -48,11 +49,11 @@ PTD:
 $(TENSORFLOW_DIR):
        mkdir -p $(TENSORFLOW_DIR)

-       cd $(TENSORFLOW_DIR) && \
-               git init && \
-               git remote add origin $(EXLA_TENSORFLOW_GIT_REPO) && \
-               git fetch --depth 1 origin $(EXLA_TENSORFLOW_GIT_REV) && \
-               git checkout FETCH_HEAD
+       # cd $(TENSORFLOW_DIR) && \
+       #       git init && \
+       #       git remote add origin $(EXLA_TENSORFLOW_GIT_REPO) && \
+       #       git fetch --depth 1 origin $(EXLA_TENSORFLOW_GIT_REV) && \
+       #       git checkout FETCH_HEAD

        cd $(TENSORFLOW_DIR) && \
                sed -e '/register_toolchains("@local_config_python\/\/:py_toolchain")/ s/^#*/#/' -i.backup WORKSPACE && \
@@ -63,4 +64,4 @@ $(TENSORFLOW_DIR):
 clean:
        cd $(TENSORFLOW_DIR) && bazel clean --expunge
        rm -f $(ERTS_SYM_DIR) $(TENSORFLOW_EXLA_DIR)
-       rm -rf $(EXLA_SO) $(TENSORFLOW_DIR)
+       rm -rf $(EXLA_SO) # $(TENSORFLOW_DIR)

After this I got a different error, which I'll post soon.

@wojtekmach
Contributor

wojtekmach commented Feb 22, 2021

As I mentioned in the previous post, I was able to make further progress but eventually the compilation crashed with:

ERROR: /Users/wojtek/src/tensorflow/tensorflow/compiler/xla/exla/BUILD:82:10: Linking of rule '//tensorflow/compiler/xla/exla:libexla.so' failed (Exit 1): cc_wrapper.sh failed: error executing command external/local_config_cc/cc_wrapper.sh -lc++ -fobjc-link-runtime -shared -o bazel-out/darwin_arm64-opt/bin/tensorflow/compiler/xla/exla/libexla.so ... (remaining 1770 argument(s) skipped)
final section layout:
    __TEXT/__text addr=0x000034E0, size=0x0949D7B8, fileOffset=0x000034E0, type=1
    __TEXT/__stubs addr=0x094A0C98, size=0x00011280, fileOffset=0x094A0C98, type=29
    __TEXT/__stub_helper addr=0x094B1F18, size=0x000020D0, fileOffset=0x094B1F18, type=33
    __TEXT/__gcc_except_tab addr=0x094B3FE8, size=0x004EB47C, fileOffset=0x094B3FE8, type=0
    __TEXT/__const addr=0x0999F480, size=0x016767C6, fileOffset=0x0999F480, type=0
    __TEXT/__cstring addr=0x0B015C48, size=0x002F9299, fileOffset=0x0B015C48, type=13
    __TEXT/__ustring addr=0x0B30EEE2, size=0x0000054A, fileOffset=0x0B30EEE2, type=16
    __TEXT/text_env addr=0x0B30F42C, size=0x00002C60, fileOffset=0x0B30F42C, type=0
    __TEXT/__unwind_info addr=0x0B31208C, size=0x0014F2C0, fileOffset=0x0B31208C, type=22
    __TEXT/__eh_frame addr=0x0B461350, size=0x00006CB0, fileOffset=0x0B461350, type=19
    __DATA_CONST/__got addr=0x0B468000, size=0x0005FD38, fileOffset=0x0B468000, type=30
    __DATA_CONST/__mod_init_func addr=0x0B4C7D38, size=0x000050A0, fileOffset=0x0B4C7D38, type=34
    __DATA_CONST/__const addr=0x0B4CCDE0, size=0x0063DFE8, fileOffset=0x0B4CCDE0, type=0
    __DATA_CONST/__cfstring addr=0x0BB0ADC8, size=0x00000020, fileOffset=0x0BB0ADC8, type=17
    __DATA/__la_symbol_ptr addr=0x0BB0C000, size=0x0000B700, fileOffset=0x0BB0C000, type=28
    __DATA/__data addr=0x0BB17700, size=0x00016890, fileOffset=0x0BB17700, type=0
    __DATA/__thread_vars addr=0x0BB2DF90, size=0x000005E8, fileOffset=0x0BB2DF90, type=41
    __DATA/__thread_ptrs addr=0x0BB2E578, size=0x00000040, fileOffset=0x0BB2E578, type=45
    __DATA/__thread_data addr=0x0BB2E5B8, size=0x00000048, fileOffset=0x0BB2E5B8, type=43
    __DATA/__thread_bss addr=0x0BB2E600, size=0x000003B0, fileOffset=0x00000000, type=42
    __DATA/__bss addr=0x0BB2E9C0, size=0x00082898, fileOffset=0x00000000, type=26
    __DATA/__common addr=0x0BBB1258, size=0x00012BB4, fileOffset=0x00000000, type=26
ld: b(l) ARM64 branch out of range (178555960 max is +/-128MB): from __ZN10tensorflow10LMDBReader19OnWorkStartedLockedEv (0x008C6960) to _mdb_env_open (0x0B30F6C0) in '__ZN10tensorflow10LMDBReader19OnWorkStartedLockedEv' from bazel-out/darwin_arm64-opt/bin/tensorflow/core/kernels/liblmdb_reader_op.lo(lmdb_reader_op.o)
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Below is the full output:

~/src/nx/exla[main]% env | grep EXLA_
EXLA_TENSORFLOW_GIT_REV=a2a5b86c3bd90e03151a25c52c0f6cebbd573228
EXLA_FLAGS=--config=macos_arm64

~/src/nx/exla[main]% bazel version
Build label: 3.7.2
Build target: bazel-out/darwin-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Thu Dec 17 17:02:20 2020 (1608224540)
Build timestamp: 1608224540
Build timestamp as int: 1608224540

~/src/nx/exla[main]% python3 --version
Python 3.8.2

~/src/nx/exla[main]% xcode-select -p
/Applications/Xcode.app/Contents/Developer

~/src/nx/exla[main]% xcodebuild -version
Xcode 12.4
Build version 12D4e

~/src/nx/exla[main]% time mix
cd /Users/wojtek/src/tensorflow && \
    bazel build --define "framework_shared_object=false" -c opt  --config=macos_arm64 //tensorflow/compiler/xla/exla:libexla.so
WARNING: Running Bazel server needs to be killed, because the startup options are different.
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /Users/wojtek/src/tensorflow/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /Users/wojtek/src/tensorflow/.bazelrc:
  'build' options: --apple_platform_type=macos --define framework_shared_object=true --java_toolchain=@org_tensorflow//third_party/toolchains/java:tf_java_toolchain --host_java_toolchain=@org_tensorflow//third_party/toolchains/java:tf_java_toolchain --define=tensorflow_enable_mlir_generated_gpu_kernels=0 --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --noincompatible_prohibit_aapt1 --enable_platform_specific_config --config=short_logs --config=v2
INFO: Found applicable config definition build:short_logs in file /Users/wojtek/src/tensorflow/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /Users/wojtek/src/tensorflow/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:macos_arm64 in file /Users/wojtek/src/tensorflow/.bazelrc: --config=macos --apple_platform_type=macos --cpu=darwin_arm64 --noenable_platform_specific_config
INFO: Found applicable config definition build:macos in file /Users/wojtek/src/tensorflow/.bazelrc: --copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14
Loading:
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
    currently loading: tensorflow/compiler/xla/exla
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (1 packages loaded, 0 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (13 packages loaded, 11 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (15 packages loaded, 11 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (26 packages loaded, 147 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (46 packages loaded, 173 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (56 packages loaded, 246 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (89 packages loaded, 1123 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (110 packages loaded, 2127 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (145 packages loaded, 3106 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (159 packages loaded, 8875 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (181 packages loaded, 13172 targets configured)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (219 packages loaded, 19919 targets configured)
INFO: Analyzed target //tensorflow/compiler/xla/exla:libexla.so (219 packages loaded, 20838 targets configured).
INFO: Found 1 target...
[1 / 63] [Prepa] BazelWorkspaceStatusAction stable-status.txt
[507 / 2,211] Compiling tensorflow/core/protobuf/struct.pb.cc [for host]; 2s local ... (8 actions, 7 running)
[524 / 2,211] Compiling tensorflow/core/protobuf/meta_graph.pb.cc [for host]; 3s local ... (8 actions, 7 running)
[589 / 2,346] Compiling tensorflow/core/protobuf/meta_graph.pb.cc [for host]; 9s local ... (8 actions, 7 running)
[670 / 2,409] Compiling tensorflow/core/kernels/concat_op.cc; 4s local ... (8 actions, 7 running)
[713 / 2,409] Compiling llvm-project/llvm/lib/MC/MCAssembler.cpp [for host]; 3s local ... (8 actions, 7 running)
[1,014 / 3,031] Compiling llvm-project/mlir/tools/mlir-tblgen/SPIRVUtilsGen.cpp [for host]; 6s local ... (8 actions, 7 running)
[1,252 / 3,203] Generating code from table: lib/Target/X86/X86.td @llvm-project//llvm:X86CommonTableGen__gen_instr_info_genrule; 6s local ... (8 actions, 7 running)
[1,425 / 3,203] Compiling llvm-project/mlir/lib/IR/Dominance.cpp [for host]; 10s local ... (8 actions, 7 running)
[1,992 / 3,724] Compiling tensorflow/core/framework/tensor.cc [for host]; 5s local ... (8 actions running)
[2,116 / 3,842] Compiling tensorflow/core/framework/function.cc [for host]; 9s local ... (8 actions running)
[2,143 / 3,842] Compiling tensorflow/core/util/example_proto_fast_parsing.cc [for host]; 10s local ... (8 actions running)
[2,190 / 3,912] Compiling tensorflow/core/util/batch_util.cc [for host]; 30s local ... (8 actions running)
[2,303 / 4,009] Compiling tensorflow/core/ops/nn_ops.cc [for host]; 11s local ... (8 actions running)
[2,463 / 4,123] Compiling tensorflow/core/common_runtime/eager/core.cc; 5s local ... (8 actions running)
[2,534 / 4,176] Compiling tensorflow/compiler/xla/service/elemental_ir_emitter.cc; 21s local ... (8 actions running)
[2,616 / 4,253] Compiling tensorflow/core/kernels/list_kernels.cc; 34s local ... (8 actions running)
[2,719 / 4,309] Compiling tensorflow/compiler/xla/service/hlo_instruction.cc; 6s local ... (8 actions running)
[2,791 / 4,358] Compiling tensorflow/core/kernels/slice_op_cpu_impl_8.cc; 8s local ... (8 actions running)
[2,926 / 4,470] Compiling tensorflow/core/graph/graph.cc; 4s local ... (8 actions running)
[3,269 / 4,865] Compiling tensorflow/core/kernels/topk_op.cc; 18s local ... (8 actions running)
[3,408 / 4,896] Compiling tensorflow/core/kernels/bias_op.cc; 19s local ... (8 actions running)
[3,571 / 5,039] Compiling tensorflow/core/kernels/bias_op.cc; 100s local ... (8 actions running)
[3,755 / 5,138] Compiling tensorflow/core/kernels/linalg/matrix_square_root_op.cc; 23s local ... (8 actions running)
[3,887 / 5,419] Compiling mkl_dnn_v1/src/cpu/cpu_reorder.cpp; 62s local ... (8 actions running)
[3,927 / 5,419] Compiling mkl_dnn_v1/src/cpu/x64/gemm/f32/jit_avx_kernel_b0_sgemm_kern_autogen.cpp; 114s local ... (8 actions running)
[3,983 / 5,419] Compiling mkl_dnn_v1/src/cpu/x64/gemm/f32/jit_avx_kernel_sgemm_kern_autogen.cpp; 154s local ... (8 actions running)
[4,092 / 5,510] Compiling mkl_dnn_v1/src/cpu/x64/gemm/f32/jit_avx512_core_f32_copy_at_kern_autogen.cpp; 294s local ... (8 actions running)
[4,137 / 5,510] Compiling tensorflow/core/kernels/cwise_op_mul_1.cc; 74s local ... (8 actions running)
[4,453 / 5,907] Compiling tensorflow/core/kernels/training_ops.cc; 57s local ... (8 actions running)
[4,996 / 6,227] Compiling com_google_protobuf/src/google/protobuf/util/message_differencer.cc; 3s local ... (8 actions running)
[5,213 / 6,368] Compiling tensorflow/core/kernels/matmul_op_real.cc; 202s local ... (8 actions running)
[5,618 / 6,606] Compiling tensorflow/core/common_runtime/isolate_placer_inspection_required_ops_pass.cc; 5s local ... (8 actions, 7 running)
[6,264 / 7,131] Compiling tensorflow/core/kernels/resource_variable_ops.cc; 218s local ... (8 actions running)
[7,061 / 7,678] Compiling tensorflow/core/kernels/image/resize_nearest_neighbor_op.cc; 16s local ... (8 actions, 7 running)
[8,017 / 8,446] Compiling tensorflow/core/kernels/linalg/einsum_op_impl_int64.cc; 48s local ... (8 actions, 7 running)
[8,825 / 9,063] Compiling tensorflow/compiler/xla/service/cpu/runtime_matmul.cc; 63s local ... (8 actions, 7 running)
[9,724 / 9,752] Compiling tensorflow/core/kernels/maxpooling_op.cc; 35s local ... (8 actions, 7 running)
ERROR: /Users/wojtek/src/tensorflow/tensorflow/compiler/xla/exla/BUILD:82:10: Linking of rule '//tensorflow/compiler/xla/exla:libexla.so' failed (Exit 1): cc_wrapper.sh failed: error executing command external/local_config_cc/cc_wrapper.sh -lc++ -fobjc-link-runtime -shared -o bazel-out/darwin_arm64-opt/bin/tensorflow/compiler/xla/exla/libexla.so ... (remaining 1770 argument(s) skipped)
final section layout:
    __TEXT/__text addr=0x000034E0, size=0x0949D7B8, fileOffset=0x000034E0, type=1
    __TEXT/__stubs addr=0x094A0C98, size=0x00011280, fileOffset=0x094A0C98, type=29
    __TEXT/__stub_helper addr=0x094B1F18, size=0x000020D0, fileOffset=0x094B1F18, type=33
    __TEXT/__gcc_except_tab addr=0x094B3FE8, size=0x004EB47C, fileOffset=0x094B3FE8, type=0
    __TEXT/__const addr=0x0999F480, size=0x016767C6, fileOffset=0x0999F480, type=0
    __TEXT/__cstring addr=0x0B015C48, size=0x002F9299, fileOffset=0x0B015C48, type=13
    __TEXT/__ustring addr=0x0B30EEE2, size=0x0000054A, fileOffset=0x0B30EEE2, type=16
    __TEXT/text_env addr=0x0B30F42C, size=0x00002C60, fileOffset=0x0B30F42C, type=0
    __TEXT/__unwind_info addr=0x0B31208C, size=0x0014F2C0, fileOffset=0x0B31208C, type=22
    __TEXT/__eh_frame addr=0x0B461350, size=0x00006CB0, fileOffset=0x0B461350, type=19
    __DATA_CONST/__got addr=0x0B468000, size=0x0005FD38, fileOffset=0x0B468000, type=30
    __DATA_CONST/__mod_init_func addr=0x0B4C7D38, size=0x000050A0, fileOffset=0x0B4C7D38, type=34
    __DATA_CONST/__const addr=0x0B4CCDE0, size=0x0063DFE8, fileOffset=0x0B4CCDE0, type=0
    __DATA_CONST/__cfstring addr=0x0BB0ADC8, size=0x00000020, fileOffset=0x0BB0ADC8, type=17
    __DATA/__la_symbol_ptr addr=0x0BB0C000, size=0x0000B700, fileOffset=0x0BB0C000, type=28
    __DATA/__data addr=0x0BB17700, size=0x00016890, fileOffset=0x0BB17700, type=0
    __DATA/__thread_vars addr=0x0BB2DF90, size=0x000005E8, fileOffset=0x0BB2DF90, type=41
    __DATA/__thread_ptrs addr=0x0BB2E578, size=0x00000040, fileOffset=0x0BB2E578, type=45
    __DATA/__thread_data addr=0x0BB2E5B8, size=0x00000048, fileOffset=0x0BB2E5B8, type=43
    __DATA/__thread_bss addr=0x0BB2E600, size=0x000003B0, fileOffset=0x00000000, type=42
    __DATA/__bss addr=0x0BB2E9C0, size=0x00082898, fileOffset=0x00000000, type=26
    __DATA/__common addr=0x0BBB1258, size=0x00012BB4, fileOffset=0x00000000, type=26
ld: b(l) ARM64 branch out of range (178555960 max is +/-128MB): from __ZN10tensorflow10LMDBReader19OnWorkStartedLockedEv (0x008C6960) to _mdb_env_open (0x0B30F6C0) in '__ZN10tensorflow10LMDBReader19OnWorkStartedLockedEv' from bazel-out/darwin_arm64-opt/bin/tensorflow/core/kernels/liblmdb_reader_op.lo(lmdb_reader_op.o)
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Target //tensorflow/compiler/xla/exla:libexla.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 5128.160s, Critical Path: 437.20s
INFO: 8391 processes: 232 internal, 8159 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
make: *** [all] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. Try running the
commands "gcc --version" and / or "make --version". If these programs
are not installed, you will be prompted to install them.

mix  0.39s user 0.51s system 0% cpu 1:25:29.20 total

@jotsif

jotsif commented Mar 1, 2021

@seanmor5 @wojtekmach Did you manage to fix the LLVM linking issue? I am having the same problem building XLA for Jax and it seems like Bazel strips out some of the LLVM dependencies.

@wojtekmach
Contributor

I'm stuck at the ARM64 branch out of range error.

@jotsif

jotsif commented Mar 1, 2021

@wojtekmach Although the above issue looks like a compiler/linker bug, it also looks like you are linking in some TensorFlow core functionality (some TF kernels). Do you need those? For the Python XLA extension used in JAX, there is some discussion about that here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/python/BUILD#L539

@jotsif

jotsif commented Mar 6, 2021

@wojtekmach
Contributor

@jotsif Thanks! I tried that and got the following. Might this be an EXLA issue?

% time EXLA_FLAGS=--config=macos_arm64 EXLA_TENSORFLOW_GIT_REPO=https://github.com/freedomtan/tensorflow EXLA_TENSORFLOW_GIT_REV=bazel_native_build_on_m1 mix
ERROR: /Users/wojtek/.cache/exla/tf-bazel_native_build_on_m1/erts-12.0/tensorflow/compiler/xla/exla/BUILD:82:10: C++ compilation of rule '//tensorflow/compiler/xla/exla:libexla.so' failed (Exit 1): wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 433 argument(s) skipped)
tensorflow/compiler/xla/exla/exla.cc:1559:10: error: no member named 'QRDecomposition' in namespace 'xla'
    xla::QRDecomposition(*operand, full_matrices, 128, precision), env);
    ~~~~~^
./tensorflow/compiler/xla/exla/exla_nif_util.h:345:42: note: expanded from macro 'EXLA_ASSIGN_OR_RETURN_NIF'
                                    lhs, rexpr, env)
                                         ^~~~~
./tensorflow/compiler/xla/exla/exla_nif_util.h:348:20: note: expanded from macro 'EXLA_ASSIGN_OR_RETURN_NIF_IMPL'
  auto statusor = (rexpr);                                                   \
                   ^~~~~
tensorflow/compiler/xla/exla/exla.cc:1558:34: error: no type named 'QRDecompositionResult' in namespace 'xla'; did you mean 'LuDecompositionResult'?
  EXLA_ASSIGN_OR_RETURN_NIF(xla::QRDecompositionResult qr_result,
                            ~~~~~^~~~~~~~~~~~~~~~~~~~~
                                 LuDecompositionResult
./tensorflow/compiler/xla/exla/exla_nif_util.h:345:37: note: expanded from macro 'EXLA_ASSIGN_OR_RETURN_NIF'
                                    lhs, rexpr, env)
                                    ^~~
./tensorflow/compiler/xla/exla/exla_nif_util.h:352:3: note: expanded from macro 'EXLA_ASSIGN_OR_RETURN_NIF_IMPL'
  lhs = std::move(statusor.ValueOrDie());
  ^~~
./tensorflow/compiler/xla/client/lib/lu_decomposition.h:46:8: note: 'LuDecompositionResult' declared here
struct LuDecompositionResult {
       ^
tensorflow/compiler/xla/exla/exla.cc:1561:63: error: no member named 'q' in 'xla::LuDecompositionResult'
  ERL_NIF_TERM q = exla::nif::make<xla::XlaOp>(env, qr_result.q);
                                                    ~~~~~~~~~ ^
tensorflow/compiler/xla/exla/exla.cc:1562:63: error: no member named 'r' in 'xla::LuDecompositionResult'
  ERL_NIF_TERM r = exla::nif::make<xla::XlaOp>(env, qr_result.r);
                                                    ~~~~~~~~~ ^
4 errors generated.
Target //tensorflow/compiler/xla/exla:libexla.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 3702.984s, Critical Path: 241.72s
INFO: 9249 processes: 372 internal, 8877 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
make: *** [all] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. Try running the
commands "gcc --version" and / or "make --version". If these programs
are not installed, you will be prompted to install them.

EXLA_FLAGS=--config=macos_arm64 EXLA_TENSORFLOW_GIT_REPO= = mix  0.39s user 0.48s system 0% cpu 1:01:44.03 total

@seanmor5
Collaborator

seanmor5 commented Mar 6, 2021

@wojtekmach That is an EXLA issue. If you change qr to this:

ERL_NIF_TERM qr(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
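  // Arity stays at 3 so the NIF still matches the Elixir stub. argv[2]
  // (presumably the precision config previously read into an unused
  // config_int) is not needed by xla::QrExplicit, so it is ignored here.
  if (argc != 3) {
    return exla::nif::error(env, "Bad argument count.");
  }

  xla::XlaOp* operand;
  bool full_matrices;

  if (!exla::nif::get<xla::XlaOp>(env, argv[0], operand)) {
    return exla::nif::error(env, "Unable to get operand.");
  }
  if (!exla::nif::get(env, argv[1], &full_matrices)) {
    return exla::nif::error(env, "Unable to get full matrices flag.");
  }

  // QrExplicit writes Q and R through its reference out-parameters instead
  // of returning a result struct with .q and .r members.
  xla::XlaOp q, r;
  xla::QrExplicit(*operand, full_matrices, q, r);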

  ERL_NIF_TERM q_term = exla::nif::make<xla::XlaOp>(env, q);
  ERL_NIF_TERM r_term = exla::nif::make<xla::XlaOp>(env, r);

  return exla::nif::ok(env, enif_make_tuple2(env, q_term, r_term));
}

That should fix it. QR will need to be updated in some other places as well, but this is a quick fix for the build on Mac.

@wojtekmach
Contributor

Yeah, that allowed me to move forward, but I ended up stuck on the same "ld: b(l) ARM64 branch out of range (159775812 max is +/-128MB)" error as before. I'll try a new clean build soon.

[640 / 683] Compiling tensorflow/core/kernels/maxpooling_op.cc; 27s local ... (8 actions running)
ERROR: /Users/wojtek/.cache/exla/tf-bazel_native_build_on_m1/erts-12.0/tensorflow/compiler/xla/exla/BUILD:82:10: Linking of rule '//tensorflow/compiler/xla/exla:libexla.so' failed (Exit 1): cc_wrapper.sh failed: error executing command external/local_config_cc/cc_wrapper.sh -lc++ -fobjc-link-runtime -shared -o bazel-out/darwin_arm64-opt/bin/tensorflow/compiler/xla/exla/libexla.so ... (remaining 1770 argument(s) skipped)
final section layout:
    __TEXT/__text addr=0x00006700, size=0x081C82B4, fileOffset=0x00006700, type=1
    __TEXT/__stubs addr=0x081CE9B4, size=0x0000C180, fileOffset=0x081CE9B4, type=29
    __TEXT/__stub_helper addr=0x081DAB34, size=0x0000207C, fileOffset=0x081DAB34, type=33
    __TEXT/__gcc_except_tab addr=0x081DCBB0, size=0x0039B4EC, fileOffset=0x081DCBB0, type=0
    __TEXT/__const addr=0x085780A0, size=0x018AA006, fileOffset=0x085780A0, type=0
    __TEXT/__cstring addr=0x09E220A8, size=0x00305F9A, fileOffset=0x09E220A8, type=13
    __TEXT/__ustring addr=0x0A128042, size=0x0000054A, fileOffset=0x0A128042, type=16
    __TEXT/text_env addr=0x0A12858C, size=0x00002C60, fileOffset=0x0A12858C, type=0
    __TEXT/__unwind_info addr=0x0A12B1EC, size=0x00140EF0, fileOffset=0x0A12B1EC, type=22
    __TEXT/__eh_frame addr=0x0A26C0E0, size=0x00003F0C, fileOffset=0x0A26C0E0, type=19
    __DATA_CONST/__got addr=0x0A270000, size=0x0005D570, fileOffset=0x0A270000, type=30
    __DATA_CONST/__mod_init_func addr=0x0A2CD570, size=0x00005128, fileOffset=0x0A2CD570, type=34
    __DATA_CONST/__const addr=0x0A2D26A0, size=0x0066B6E8, fileOffset=0x0A2D26A0, type=0
    __DATA_CONST/__cfstring addr=0x0A93DD88, size=0x00000020, fileOffset=0x0A93DD88, type=17
    __DATA/__la_symbol_ptr addr=0x0A940000, size=0x00008100, fileOffset=0x0A940000, type=28
    __DATA/__data addr=0x0A948100, size=0x00016340, fileOffset=0x0A948100, type=0
    __DATA/__thread_vars addr=0x0A95E440, size=0x000005A0, fileOffset=0x0A95E440, type=41
    __DATA/__thread_ptrs addr=0x0A95E9E0, size=0x00000040, fileOffset=0x0A95E9E0, type=45
    __DATA/__thread_data addr=0x0A95EA20, size=0x00000048, fileOffset=0x0A95EA20, type=43
    __DATA/__thread_bss addr=0x0A95EA68, size=0x00000398, fileOffset=0x00000000, type=42
    __DATA/__bss addr=0x0A95EE00, size=0x00083BD8, fileOffset=0x00000000, type=26
    __DATA/__common addr=0x0A9E29D8, size=0x0001306C, fileOffset=0x00000000, type=26
ld: b(l) ARM64 branch out of range (159775812 max is +/-128MB): from __ZN10tensorflow10LMDBReader19OnWorkStartedLockedEv (0x008C8AB4) to _mdb_env_open (0x0A128820) in '__ZN10tensorflow10LMDBReader19OnWorkStartedLockedEv' from bazel-out/darwin_arm64-opt/bin/tensorflow/core/kernels/liblmdb_reader_op.lo(lmdb_reader_op.o)
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Target //tensorflow/compiler/xla/exla:libexla.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 486.777s, Critical Path: 149.13s
INFO: 728 processes: 2 internal, 726 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
make: *** [all] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. Try running the
commands "gcc --version" and / or "make --version". If these programs
are not installed, you will be prompted to install them.

EXLA_FLAGS=--config=macos_arm64 EXLA_TENSORFLOW_GIT_REPO= = mix  0.34s user 0.38s system 0% cpu 8:07.67 total
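For anyone puzzled by the linker error above: on AArch64, B and BL encode a signed 26-bit immediate counted in 4-byte instructions, so a direct branch reaches at most about +/-128 MiB. The sketch below checks that arithmetic at compile time against two numbers taken from the log: the displacement ld reported, and the size of __TEXT/__text in the "final section layout" dump. The names (BranchReaches, kTextSectionBytes, and so on) are illustrative only and are not part of EXLA, XLA, or ld64. The numbers suggest the __text section of this monolithic libexla.so is by itself larger than one BL can span. Note that the failing call targets _mdb_env_open, which the layout places in __TEXT/text_env, after all of __text and the constant/string sections.

```cpp
#include <cstdint>

// AArch64 B/BL: signed 26-bit immediate, scaled by 4 (one instruction = 4 bytes).
constexpr std::int64_t kImm26Min = -(std::int64_t{1} << 25);     // most negative imm26
constexpr std::int64_t kImm26Max = (std::int64_t{1} << 25) - 1;  // most positive imm26
constexpr std::int64_t kBranchMinBytes = kImm26Min * 4;  // -134217728 bytes (-128 MiB)
constexpr std::int64_t kBranchMaxBytes = kImm26Max * 4;  // +134217724 bytes (128 MiB - 4)

// True if a B/BL whose target is `displacement` bytes away
// (target address minus branch address) can be encoded directly.
constexpr bool BranchReaches(std::int64_t displacement) {
  return displacement % 4 == 0 &&          // targets must be instruction-aligned
         displacement >= kBranchMinBytes &&
         displacement <= kBranchMaxBytes;
}

// Size of __TEXT/__text from the section layout printed by ld above.
constexpr std::int64_t kTextSectionBytes = 0x081C82B4;  // 136086196 bytes

// Displacement quoted in the "branch out of range" message above.
constexpr std::int64_t kReportedDisplacement = 159775812;

static_assert(!BranchReaches(kReportedDisplacement),
              "the reported call cannot be encoded as a direct BL");
static_assert(kTextSectionBytes > kBranchMaxBytes,
              "__text alone is larger than a single BL can span");
```

Nothing here runs at link time; the static_asserts just make the compiler double-check the numbers.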

@andrewphillipo

I'd be interested to know if anyone has gotten this to work yet.

@seanmor5
Collaborator

Hi @andrewphillipo, I have been tracking some developments in JAX suggesting that this may now be possible on TF master: google/jax#5501 and google/jax#6701

I am willing to put a branch together that upgrades our TF version and adds the flags necessary for building on Mac ARM, but I unfortunately don't have a machine to test on.

@josevalim
Collaborator

@seanmor5 I think we can upgrade the TF version and then let users play with flags. :) Upgrading TF would already be a huge help.

@aphillipo

If it helps, I have a Mac mini here you can remote into if you like. I'm not sure what the best setup for that would be, or whether you even have time. I'm also happy to test anything.

@lenileiro

@wojtekmach, could you have a crack at it one more time?

@aphillipo

Does this mean anything to you guys? ;-) https://developer.apple.com/metal/tensorflow-plugin/

@seanmor5
Collaborator

For those tracking this issue, #423 uses a newer version of TensorFlow that should support Mac ARM. I believe you might need Bazel 4.1. If anybody would like to take a shot at building off that branch, it would be really appreciated.

@seanmor5
Collaborator

Didn't mean to close this.

@seanmor5 seanmor5 reopened this Jun 19, 2021
@wojtekmach
Contributor

I tried the following:

asdf global bazel 4.1.0
echo 4.1.0 > /Users/wojtek/.cache/exla/tf-master/erts-12.0.2/.bazelversion
time EXLA_FLAGS=--config=macos_arm64 EXLA_TENSORFLOW_GIT_REV=master mix compile
(…)
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (64 packages loaded, 314 targets configured)
ERROR: /private/var/tmp/_bazel_wojtek/fb4626d6ccd81b40dedec6d2506d2275/external/local_config_cc/BUILD:48:19: in cc_toolchain_suite rule @local_config_cc//:toolchain: cc_toolchain_suite '@local_config_cc//:toolchain' does not contain a toolchain for cpu 'darwin_arm64'
INFO: Repository llvm-project instantiated at:
  /Users/wojtek/.cache/exla/tf-master/erts-12.0.2/WORKSPACE:15:14: in <toplevel>
  /Users/wojtek/.cache/exla/tf-master/erts-12.0.2/tensorflow/workspace2.bzl:1098:21: in workspace
  /Users/wojtek/.cache/exla/tf-master/erts-12.0.2/tensorflow/workspace2.bzl:657:9: in _tf_repositories
  /Users/wojtek/.cache/exla/tf-master/erts-12.0.2/third_party/llvm/workspace.bzl:10:20: in repo
  /Users/wojtek/.cache/exla/tf-master/erts-12.0.2/third_party/repo.bzl:112:21: in tf_http_archive
Repository rule _tf_http_archive defined at:
  /Users/wojtek/.cache/exla/tf-master/erts-12.0.2/third_party/repo.bzl:65:35: in <toplevel>
ERROR: Analysis of target '//tensorflow/compiler/xla/exla:libexla.so' failed; build aborted: Analysis of target '@local_config_cc//:toolchain' failed
INFO: Elapsed time: 15.224s
INFO: 0 processes.

9 participants