{tools}[gfbf/2023a] jax v0.4.25, ml_dtypes v0.3.2 w/ CUDA 12.1.1 #20119
Test report by @ThomasHoffmann77 |
Test report by @ThomasHoffmann77 |
Test report by @ThomasHoffmann77 |
Test report by @branfosj Same three failures as #19841 (comment) |
Test report by @branfosj Same three failures as #19841 (comment) |
Test report by @ThomasHoffmann77 |
I don't have a build node setup to upload test reports. Did see this test error:
|
I see you're all building with |
|
Test report by @ThomasHoffmann77 |
Test report by @ThomasHoffmann77 |
Test report by @Flamefire |
Test report by @Flamefire |
In both cases the failure is:
Due to XLA comes with even more dependencies ( |
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
Test report by @Flamefire This is caused by a crash. It isn't really clear why it fails or in which test, as when I run the crashing test file manually it works. Attaching GDB shows |
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
easybuild/easyconfigs/m/ml_dtypes/ml_dtypes-0.3.2-foss-2023a.eb
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
Test report by @akesandgren |
Test report by @Flamefire |
That is the same I see: #20119 (comment) |
Test report by @VRehnberg |
Test report by @VRehnberg |
@ThomasHoffmann77 and @Flamefire if you ignore the single failing test ( |
@VRehnberg I cannot reproduce this test failure on my system. Maybe it helps in particular to add lax_numpy_test.py::NumpyUfuncTests::testUfuncInputTypes763 to the list of isolated tests. |
add tests/lax_numpy_test.py::NumpyUfuncTests::testUfuncInputTypes763 to isolated tests
Very widespread. I tried to |
OMP_NUM_THREADS=2 ? |
That doesn't affect the thread pools created by jax/xla. I found @verdurin How many cores does |
@Flamefire |
Ok, maybe it isn't the number of threads after all. I tried with PJRT_NPROC=2 to only create a small number of threads, well below the ones on the working 64/20 core systems. But still the same issue. Running out of ideas... Still testing a few different combinations and versions. |
Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
Fix usage of system Pybind11
I found that there is a difference when running the tests on a machine with or without GPUs. I have a 96 core machine with GPUs and the build succeeds. The 208 core machine without GPUs fails.
|
both machines I used for the test reports are equipped with GPUs (RTX A4000 and RTX 3090) |
fix checksum
fix style; add PyBind11 builddep
Test report by @akesandgren |
@Flamefire What is your current opinion on this one? |
I have a variation of the tests currently running which might work. Can say more tomorrow. "Workaround" would be to use a GPU machine to install this. Some tests fail anyway on CPU-only (I made a patch for that already and can upload it if it runs through) |
Some of the Bazel downloads might need a further check. I have not found the time yet |
@ThomasHoffmann77 had time to look at this yet? I have a user waiting for this... |
@akesandgren no, sorry. @Flamefire suggested to use some script. However, we have been using jax v0.4.25 with AlphaFold 2.3.2 at our site for a while. |
I didn't get the tests working on a machine without GPUs, so that might be worth a comment in the EC.
As for downloads: I found the option --experimental_repository_disable_download for Bazel (the "experimental" prefix was removed in Bazel 7).
As for a script to gather those: that will be hard and will likely require manual work. For example, one failure is "llvm-raw", which leads to third_party/llvm/workspace.bzl in the xla download, so we would need to download xla first to get that. And then we have:
def repo(name):
    """Imports LLVM."""
    LLVM_COMMIT = "e630a451b457e4d8d071a2b4f102b342bbea2d02"
    LLVM_SHA256 = "184e7622a47609d960295e5e363466e9e60e6d9dbc20d554b3e1118ffd9f1bfb"
    tf_http_archive(
        name = name,
        sha256 = LLVM_SHA256,
        strip_prefix = "llvm-project-{commit}".format(commit = LLVM_COMMIT),
        urls = [
            "https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/{commit}.tar.gz".format(commit = LLVM_COMMIT),
            "https://github.com/llvm/llvm-project/archive/{commit}.tar.gz".format(commit = LLVM_COMMIT),
        ],
        build_file = "//third_party/llvm:llvm.BUILD",
        patch_file = [
            "//third_party/llvm:generated.patch",  # Autogenerated, don't remove.
            "//third_party/llvm:build.patch",
            "//third_party/llvm:mathextras.patch",
            "//third_party/llvm:toolchains.patch",
            "//third_party/llvm:zstd.patch",
        ],
        link_files = {"//third_party/llvm:run_lit.sh": "mlir/run_lit.sh"},
    )
There are many instances of FOO_COMMIT = "hash", and some are seemingly meant for tools in/of Google.
For XLA we have:
XLA_COMMIT = "4ccfe33c71665ddcbca5b127fefe8baa3ed632d4"
XLA_SHA256 = "8a59b9af7d0850059d7043f7043c780066d61538f3af536e8a10d3d717f35089"
def repo():
    tf_http_archive(
        name = "xla",
        sha256 = XLA_SHA256,
        strip_prefix = "xla-{commit}".format(commit = XLA_COMMIT),
        urls = tf_mirror_urls("https://github.com/openxla/xla/archive/{commit}.tar.gz".format(commit = XLA_COMMIT)),
    )
A bit easier but not perfect for automation.
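A rough sketch of what the automatable part could look like, assuming the pins always follow the FOO_COMMIT = "hash" / FOO_SHA256 = "hash" pattern seen above. The helper names extract_pins and github_archive_url are made up for illustration, and mapping a prefix such as LLVM to its GitHub org/repo would still be manual work:

```python
import re

# Matches pin lines such as: LLVM_COMMIT = "e630a4..." or XLA_SHA256 = "8a59b9..."
PIN_RE = re.compile(
    r'^\s*([A-Z0-9_]+)_(COMMIT|SHA256)\s*=\s*"([0-9a-fA-F]+)"',
    re.MULTILINE,
)


def extract_pins(bzl_text):
    """Return {prefix: {"commit": ..., "sha256": ...}} for each pin found in the text."""
    pins = {}
    for prefix, kind, value in PIN_RE.findall(bzl_text):
        pins.setdefault(prefix, {})[kind.lower()] = value
    return pins


def github_archive_url(org_repo, commit):
    """Build the GitHub archive URL in the form used by the rules quoted above."""
    return "https://github.com/{}/archive/{}.tar.gz".format(org_repo, commit)
```

The org/repo argument (e.g. "openxla/xla") has to be supplied by hand, since it only appears inside the urls list of each rule and not next to the pin itself.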
I found --bazel_options="--distdir=%(builddir)/archives" to be a better option than --override_repository. We can put the downloaded archives there under their original names and let Bazel do the extraction. This also solves issues like the one with llvm-raw, where the Bazel rule includes patches. Bazel also checks that we put the correct archives there: e.g. it reported the mistake with tf_runtime (both the wrong archive name and the wrong repository name), and it will also report if we downloaded the wrong archive or the checksum doesn't match. But we lose the option to easily patch them (we can still patch the Bazel rules to inject our patches, see TensorFlow).
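To illustrate the distdir idea, here is a minimal sketch of staging one already-downloaded archive under its original name after checking its checksum. The function names sha256_of and stage_archive are hypothetical and not part of EasyBuild or Bazel; Bazel would still do its own verification when it picks the file up:

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_archive(src, distdir, original_name, expected_sha256):
    """Copy src into distdir as original_name, refusing files with the wrong checksum."""
    actual = sha256_of(src)
    if actual != expected_sha256:
        raise ValueError(
            "checksum mismatch for {}: expected {}, got {}".format(src, expected_sha256, actual)
        )
    distdir = Path(distdir)
    distdir.mkdir(parents=True, exist_ok=True)
    target = distdir / original_name
    shutil.copyfile(src, target)
    return target
```

For a GitHub commit archive the original name would be the last URL component, e.g. "<commit>.tar.gz", which is an assumption based on the URLs in the rules quoted above.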
The output for failing archives is a bit obscure though, so it would need some manual work to fix them all up instead of rerunning the whole EB process. Example:
INFO: Repository io_bazel_rules_closure instantiated at:
<builddir>/jax-jaxlib-v0.4.25/WORKSPACE:9:15: in <toplevel>
<builddir>/tmp9rt4d2l2-bazel/df8f91ab4b3f7374d96037ea828fca4d/external/xla/workspace3.bzl:12:17: in workspace
Repository rule http_archive defined at:
<builddir>/tmp9rt4d2l2-bazel/df8f91ab4b3f7374d96037ea828fca4d/external/bazel_tools/tools/build_defs/repo/http.bzl:372:31: in <toplevel>
ERROR: An error occurred during the fetch of repository 'io_bazel_rules_closure':
Traceback (most recent call last):
File "<builddir>/tmp9rt4d2l2-bazel/df8f91ab4b3f7374d96037ea828fca4d/external/bazel_tools/tools/build_defs/repo/http.bzl", line 132, column 45, in _http_archive_impl
download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Failed to download repository @io_bazel_rules_closure: download is disabled.
ERROR: <builddir>/jax-jaxlib-v0.4.25/WORKSPACE:9:15: fetching http_archive rule //external:io_bazel_rules_closure: Traceback (most recent call last):
File "<builddir>/tmp9rt4d2l2-bazel/df8f91ab4b3f7374d96037ea828fca4d/external/bazel_tools/tools/build_defs/repo/http.bzl", line 132, column 45, in _http_archive_impl
download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Failed to download repository @io_bazel_rules_closure: download is disabled.
So not sure if this is worth the effort, but at least it would allow offline installations.
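To reduce the manual work a little, the names of the blocked repositories could be pulled out of the Bazel output. This is only a sketch that assumes the message format shown in the example above; blocked_repositories is a made-up helper name:

```python
import re

# Matches the message from the log above:
#   Failed to download repository @io_bazel_rules_closure: download is disabled.
FAILED_RE = re.compile(r"Failed to download repository @([\w.+~-]+): download is disabled")


def blocked_repositories(log_text):
    """Return repository names that Bazel could not fetch, in order of first appearance."""
    names = []
    for name in FAILED_RE.findall(log_text):
        if name not in names:
            names.append(name)
    return names
```

Bazel stops at the first failing fetch, so this would likely still need one run per missing archive.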
Note that all archives Bazel downloads are in <build-user_root>/cache/repos/v1/content_addressable/sha256/*/file, stored by their sha256 sum (the *).
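Given that layout, looking up an archive by its pinned checksum could be sketched as follows. The cache_root argument is assumed to be the .../cache/repos/v1 directory from the path above, and both function names are hypothetical:

```python
from pathlib import Path


def list_cached_archives(cache_root):
    """Map each sha256 directory name to the 'file' stored inside it."""
    sha_dir = Path(cache_root) / "content_addressable" / "sha256"
    return {entry.parent.name: entry for entry in sorted(sha_dir.glob("*/file"))}


def find_cached(cache_root, sha256):
    """Return the cached file for a given sha256 sum, or None if it is not in the cache."""
    return list_cached_archives(cache_root).get(sha256)
```

The cached files carry no original name, so the archive name would still have to come from the corresponding Bazel rule.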
easybuild/easyconfigs/j/jax/jax-0.4.25-gfbf-2023a-CUDA-12.1.1.eb
Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
(created using eb --new-pr)