{tools}[gfbf/2023a] jax v0.4.25, ml_dtypes v0.3.2 w/ CUDA 12.1.1 #20119

Open
wants to merge 17 commits into
base: develop

ThomasHoffmann77
Contributor

(created using eb --new-pr)

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/43d87811306655a013126860c0bb6777 for a full test report.

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
FAILED
Build succeeded (with --ignore-test-failure) for 1 out of 2 (2 easyconfigs in total)
srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/c51c43986eae5a7afe56f715d7c5c38c for a full test report.

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/b2b075b38d9d9d5c6fe4b0503dab7279 for a full test report.

@branfosj
Member

branfosj commented Mar 14, 2024

Test report by @branfosj
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
bear-pg0208u15a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8
See https://gist.github.com/branfosj/83b07adf11f9a9eea619d5b7e45eddb5 for a full test report.

Same three failures as #19841 (comment)

@branfosj
Member

branfosj commented Mar 15, 2024

Test report by @branfosj
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
bear-pg0208u31a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 4 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8
See https://gist.github.com/branfosj/bec290f9c00aa6309ee649e8ff185675 for a full test report.

Same three failures as #19841 (comment)

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/59e7a52712f520a524e93b5b5210551b for a full test report.

@verdurin
Member

I don't have a build node set up to upload test reports, but I did see this test error:

tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_1x4_float32_float64 PASSED [ 55%]
tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 Fatal Python error: Aborted

@verdurin
Member

I see you're all building with --ignore-test-failure - is that expected with jax?

@ThomasHoffmann77
Contributor Author

> I don't have a build node set up to upload test reports, but I did see this test error:
>
> tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_1x4_float32_float64 PASSED [ 55%]
> tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 Fatal Python error: Aborted

#16:38 thoffman@srv-mahamid-01#NVIDIA_TF32_OVERRIDE=0 CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_ALLOCATOR=platform JAX_ENABLE_X64=true pytest -vv tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32
============================= test session starts ==============================
platform linux -- Python 3.11.3, pytest-7.4.2, pluggy-1.2.0 -- /g/easybuild/x86_64/Rocky/8/rome/software/Python/3.11.3-GCCcore-12.3.0/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/tmp/jax-jax-v0.4.25/.hypothesis/examples'))
rootdir: /tmp/jax-jax-v0.4.25
configfile: pyproject.toml
plugins: xdist-3.3.1, hypothesis-6.88.1
collected 1 item                                                               

tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 PASSED [100%]
============================== 1 passed in 3.21s ===============================
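
The rerun above can be scripted. What follows is a minimal sketch, not part of the easyconfig: it only assembles the command line and environment taken from the shell session shown above, and the helper name `isolated_pytest_command` is hypothetical.

```python
import os
import shlex

# Environment overrides copied from the command line in the session above.
TEST_ENV = {
    "NVIDIA_TF32_OVERRIDE": "0",
    "CUDA_VISIBLE_DEVICES": "0",
    "XLA_PYTHON_CLIENT_ALLOCATOR": "platform",
    "JAX_ENABLE_X64": "true",
}


def isolated_pytest_command(test_id):
    """Return (argv, env) for running a single pytest node ID in a fresh process."""
    env = dict(os.environ)
    env.update(TEST_ENV)
    argv = ["pytest", "-vv", test_id]
    return argv, env


argv, env = isolated_pytest_command(
    "tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest"
    "::testScipySpecialFun_gammainc_s_2x1x4_float32_float32"
)
# Print the command instead of running it; pass argv and env to
# subprocess.run(argv, env=env) from the jax source directory to execute it.
print(shlex.join(argv))
```

Running the test in its own process this way separates failures that depend on what ran earlier in the same session from failures of the test itself.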

@boegel added the update label Mar 17, 2024
@boegel added this to the 4.x milestone Mar 17, 2024
@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/37e79d6b1006b4e8bee5438a97ef2ccd for a full test report.

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/61812c0e50d74c911c9d72e03155eac6 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (2 easyconfigs in total)
n1438 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/ee41d9059916ce8b1f93b9267d0c847f for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
i8002 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/327109d42642f3d3ed5c28565c08f20b for a full test report.

@Flamefire
Contributor

In both cases the failure is:

external/upb/upb/table.c: In function upb_inttable_pop:
external/upb/upb/table.c:588:10: error: val.val may be used uninitialized [-Werror=maybe-uninitialized]
  588 |   return val;
      |          ^~~
external/upb/upb/table.c:585:13: note: val.val was declared here
  585 |   upb_value val;
      |             ^~~

Due to -Werror added here

XLA comes with even more dependencies (workspace*.bzl). Can we add them as local repositories too? Maybe we could even auto-generate those lists with a Python script, similar to findPythonDeps, which outputs a list of Python packages for use in an EC. That script is bundled with EasyBuild, so it is readily available.

Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
@Flamefire
Contributor

Flamefire commented Mar 26, 2024

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
n1265 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/8cbb16221ab8da073cee85c97c0dd911 for a full test report.

This is caused by a crash. It isn't clear why it fails or in which test, because the crashing test file passes when I run it manually. Attaching GDB shows ~LogMessageFatal() as the cause. This needs more investigation into what the fatal error actually is, but it looks serious...

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12
See https://gist.github.com/akesandgren/d08ad604ee84517b32551db64fb98aef for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 83 out of 84 (2 easyconfigs in total)
i7006 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.8.13
See https://gist.github.com/Flamefire/cd74ee4cfc219de5e77ef36ac511001e for a full test report.

@Flamefire
Contributor

> I don't have a build node set up to upload test reports, but I did see this test error:
>
> tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 Fatal Python error: Aborted

That is the same I see: #20119 (comment)

@VRehnberg
Contributor

VRehnberg commented Apr 8, 2024

Test report by @VRehnberg
SUCCESS ml_dtypes
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-05 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/51fb53ea9ab9613ab516f2582dd2cd0d for a full test report.

@VRehnberg
Contributor

VRehnberg commented Apr 8, 2024

Test report by @VRehnberg
SUCCESS jax
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-05 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/38ee48d43360ec267bd780e39120e84e for a full test report.

@VRehnberg
Contributor

@ThomasHoffmann77 and @Flamefire if you ignore the single failing test (pytest -k "not testUfuncInputTypes") does the rest work then? How widespread are the issues?

@ThomasHoffmann77
Contributor Author

> @ThomasHoffmann77 and @Flamefire if you ignore the single failing test (pytest -k "not testUfuncInputTypes") does the rest work then? How widespread are the issues?

@VRehnberg I cannot reproduce this test failure on my system. It might help to add lax_numpy_test.py::NumpyUfuncTests::testUfuncInputTypes763 specifically to the list of isolated tests.
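
As a rough illustration of what isolating a test means here, the sketch below splits one pytest invocation into a main run that deselects the isolated tests and one separate run per isolated test. It is an assumption-laden sketch, not the mechanism the easyconfig uses; `split_test_runs` is a hypothetical helper, and only the pytest `--deselect` option is taken as given.

```python
def split_test_runs(test_dir, isolated):
    """Return a list of pytest argv lists.

    The first entry runs everything under test_dir except the isolated
    node IDs; each following entry runs one isolated node ID on its own.
    """
    main_run = ["pytest"]
    for node_id in isolated:
        # --deselect skips the given node ID in the main run.
        main_run.append(f"--deselect={node_id}")
    main_run.append(test_dir)

    runs = [main_run]
    for node_id in isolated:
        runs.append(["pytest", node_id])
    return runs


for argv in split_test_runs(
    "tests",
    ["tests/lax_numpy_test.py::NumpyUfuncTests::testUfuncInputTypes763"],
):
    print(" ".join(argv))
```

Each argv list would be run as a separate process, so a crash in an isolated test cannot abort the main run.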

add tests/lax_numpy_test.py::NumpyUfuncTests::testUfuncInputTypes763 to isolated tests
@Flamefire
Contributor

> @ThomasHoffmann77 and @Flamefire if you ignore the single failing test (pytest -k "not testUfuncInputTypes") does the rest work then? How widespread are the issues?

Very widespread. I tried to --deselect each failing test file(!) but then it just fails later on the next. The issue seems to be too many threads being created so the system runs out of resources. We have 208 cores (HT) so each ThreadPool it creates has 208 threads.

@akesandgren
Contributor

OMP_NUM_THREADS=2 ?

@Flamefire
Contributor

> OMP_NUM_THREADS=2 ?

That doesn't affect the thread pools created by jax/xla. I found PJRT_NPROC for that but setting PJRT_NPROC=32 in local_test_exports also failed. Currently experimenting with both...

@verdurin How many cores does nproc report on your system?
@ThomasHoffmann77 As it works for you, how many is it on yours?
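
For comparing machines, the core count can also be read from Python. This is a small sketch under the assumption that the relevant number is either the total number of logical CPUs or the subset the current process may run on (by default `nproc` reports the latter). Which of the two jax/xla uses to size its thread pools is not established in this thread.

```python
import os


def visible_cpu_counts():
    """Return (total logical CPUs, CPUs usable by this process)."""
    total = os.cpu_count()
    try:
        # Linux only: respects taskset/cgroup CPU affinity, like `nproc`.
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the total.
        usable = total
    return total, usable


total, usable = visible_cpu_counts()
print(f"logical CPUs: {total}, usable by this process: {usable}")
```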

@ThomasHoffmann77
Contributor Author

ThomasHoffmann77 commented Apr 9, 2024

> @verdurin How many cores does nproc report on your system?
> @ThomasHoffmann77 As it works for you, how many is it on yours?

@Flamefire
srv-mahamid-01.embl.de: 64
proline.embl.de: 20

@Flamefire
Contributor

Ok, maybe it isn't the number of threads after all. I tried with PJRT_NPROC=2 to only create a small number of threads, well below the ones on the working 64/20 core systems. But still the same issue. Running out of ideas... Still testing a few different combinations and versions.

@Flamefire
Contributor

I found that there is a difference when running the tests on a machine with or without GPUs. I have a 96 core machine with GPUs and the build succeeds. The 208 core machine without GPUs fails.

@ThomasHoffmann77
Contributor Author

> I found that there is a difference when running the tests on a machine with or without GPUs. I have a 96 core machine with GPUs and the build succeeds. The 208 core machine without GPUs fails.

Both machines I used for the test reports are equipped with GPUs (RTX A4000 and RTX 3090).

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12
See https://gist.github.com/akesandgren/c9f403751f88333bebc3058e3b6d7dd7 for a full test report.

@akesandgren
Contributor

@Flamefire What is your current opinion on this one?
Should we merge as is or do we need more work?
I have a user who needs a package that needs this newer jax version...

@Flamefire
Contributor

I have a variation of the tests currently running which might work. Can say more tomorrow.

"Workaround" would be to use a GPU machine to install this. Some tests fail anyway on CPU-only (I made a patch for that already and can upload it if it runs through)

@ThomasHoffmann77
Contributor Author

> @Flamefire What is your current opinion on this one? Should we merge as is or do we need more work? I have a user who needs a package that needs this newer jax version...

Some of the Bazel downloads might need a further check. I have not found the time yet.

@akesandgren
Contributor

@ThomasHoffmann77 had time to look at this yet?

I have a user waiting for this...

@ThomasHoffmann77
Contributor Author

> @ThomasHoffmann77 had time to look at this yet?
>
> I have a user waiting for this...

@akesandgren No, sorry. @Flamefire suggested using a script.
I definitely won't find the time this week.

However, we have been using jax v0.4.25 with AlphaFold 2.3.2 at our site for a while.

Contributor

@Flamefire left a comment

I didn't get the tests working on a machine without GPUs, so that might be worth a comment in the EC.

As for downloads: I found the option --experimental_repository_disable_download for bazel ("experimental" removed in Bazel 7)

As for a script to gather those: it will be hard and will require manual work. For example, one failure is "llvm-raw". This leads to third_party/llvm/workspace.bzl in the xla download, so we would need to download xla first to get that. And then we have:

def repo(name):
    """Imports LLVM."""
    LLVM_COMMIT = "e630a451b457e4d8d071a2b4f102b342bbea2d02"
    LLVM_SHA256 = "184e7622a47609d960295e5e363466e9e60e6d9dbc20d554b3e1118ffd9f1bfb"

    tf_http_archive(
        name = name,
        sha256 = LLVM_SHA256,
        strip_prefix = "llvm-project-{commit}".format(commit = LLVM_COMMIT),
        urls = [
            "https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/{commit}.tar.gz".format(commit = LLVM_COMMIT),
            "https://github.com/llvm/llvm-project/archive/{commit}.tar.gz".format(commit = LLVM_COMMIT),
        ],
        build_file = "//third_party/llvm:llvm.BUILD",
        patch_file = [
            "//third_party/llvm:generated.patch",  # Autogenerated, don't remove.
            "//third_party/llvm:build.patch",
            "//third_party/llvm:mathextras.patch",
            "//third_party/llvm:toolchains.patch",
            "//third_party/llvm:zstd.patch",
        ],
        link_files = {"//third_party/llvm:run_lit.sh": "mlir/run_lit.sh"},
    )

There are many instances of FOO_COMMIT = "hash", and some seem to be meant for tools used inside Google.

For XLA we have:

XLA_COMMIT = "4ccfe33c71665ddcbca5b127fefe8baa3ed632d4"
XLA_SHA256 = "8a59b9af7d0850059d7043f7043c780066d61538f3af536e8a10d3d717f35089"

def repo():
    tf_http_archive(
        name = "xla",
        sha256 = XLA_SHA256,
        strip_prefix = "xla-{commit}".format(commit = XLA_COMMIT),
        urls = tf_mirror_urls("https://github.com/openxla/xla/archive/{commit}.tar.gz".format(commit = XLA_COMMIT)),
    )

A bit easier but not perfect for automation.
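
To illustrate how far a script could get, here is a minimal sketch that pulls the FOO_COMMIT / FOO_SHA256 pairs out of workspace.bzl text with a regular expression. It assumes the naming pattern seen in the two snippets above holds; `pinned_archives` is a hypothetical helper, and it does not derive the download URLs, which differ per rule and would still need manual work.

```python
import re

# Matches assignments such as: XLA_COMMIT = "4ccfe3..." or LLVM_SHA256 = "184e76..."
_ASSIGN = re.compile(
    r'^\s*([A-Z0-9_]+)_(COMMIT|SHA256)\s*=\s*"([0-9a-f]+)"', re.MULTILINE
)


def pinned_archives(bzl_text):
    """Map each prefix (e.g. "XLA") to the COMMIT and SHA256 values found."""
    pins = {}
    for prefix, kind, value in _ASSIGN.findall(bzl_text):
        pins.setdefault(prefix, {})[kind] = value
    return pins


# Sample input: the two XLA assignments quoted above.
SAMPLE = '''
XLA_COMMIT = "4ccfe33c71665ddcbca5b127fefe8baa3ed632d4"
XLA_SHA256 = "8a59b9af7d0850059d7043f7043c780066d61538f3af536e8a10d3d717f35089"
'''

for name, pin in pinned_archives(SAMPLE).items():
    print(name, pin["COMMIT"], pin["SHA256"])
```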

I found --bazel_options="--distdir=%(builddir)/archives" to be a better option than --override_repository. We can put the downloaded archives there under their original names and let Bazel do the extraction. This would also solve issues like the one with llvm-raw, where the Bazel rule includes patches. Bazel will also check that we put the correct archives there: it reported the mistake with tf_runtime (both the wrong archive name and the wrong repository name), and it will report if we downloaded the wrong archive or the checksum doesn't match. However, we lose the option to easily patch them (we can still patch the Bazel rules to inject our patches, see TensorFlow).

The output for failing archives is a bit obscure though, so it would need some manual work to fix them all up instead of rerunning the whole EB process. Example:

INFO: Repository io_bazel_rules_closure instantiated at:
  <builddir>/jax-jaxlib-v0.4.25/WORKSPACE:9:15: in <toplevel>
  <builddir>/tmp9rt4d2l2-bazel/df8f91ab4b3f7374d96037ea828fca4d/external/xla/workspace3.bzl:12:17: in workspace
Repository rule http_archive defined at:
  <builddir>/tmp9rt4d2l2-bazel/df8f91ab4b3f7374d96037ea828fca4d/external/bazel_tools/tools/build_defs/repo/http.bzl:372:31: in <toplevel>
ERROR: An error occurred during the fetch of repository 'io_bazel_rules_closure':
   Traceback (most recent call last):
	File "<builddir>/tmp9rt4d2l2-bazel/df8f91ab4b3f7374d96037ea828fca4d/external/bazel_tools/tools/build_defs/repo/http.bzl", line 132, column 45, in _http_archive_impl
		download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Failed to download repository @io_bazel_rules_closure: download is disabled.
ERROR: <builddir>/jax-jaxlib-v0.4.25/WORKSPACE:9:15: fetching http_archive rule //external:io_bazel_rules_closure: Traceback (most recent call last):
	File "<builddir>/tmp9rt4d2l2-bazel/df8f91ab4b3f7374d96037ea828fca4d/external/bazel_tools/tools/build_defs/repo/http.bzl", line 132, column 45, in _http_archive_impl
		download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Failed to download repository @io_bazel_rules_closure: download is disabled.

So not sure if this is worth the effort, but at least it would allow offline installations.
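
Collecting the missing repositories from such output can at least be semi-automated. The sketch below only extracts repository names from the "download is disabled" message shown in the log above; `blocked_repositories` is a hypothetical helper, and whether Bazel reports more than one missing repository per run is not something this thread establishes.

```python
import re

# Pattern taken from the error text in the log above.
_BLOCKED = re.compile(
    r"Failed to download repository @([\w.+-]+): download is disabled"
)


def blocked_repositories(log_text):
    """Return the repository names Bazel refused to download, in first-seen order."""
    seen = []
    for name in _BLOCKED.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


LOG_LINE = (
    "Error in download_and_extract: java.io.IOException: Failed to download "
    "repository @io_bazel_rules_closure: download is disabled."
)
print(blocked_repositories(LOG_LINE))
```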

Note that all archives Bazel downloads are stored in <build-user_root>/cache/repos/v1/content_addressable/sha256/*/file, keyed by their sha256 sum (the *).
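
Given that cache layout, archives from an online build could be copied into a directory usable with --distdir. The following is a sketch only: `export_to_distdir` is a hypothetical helper, the mapping from checksum to original file name has to be supplied by hand, and the cache path is whatever <build-user_root> resolves to on the build host.

```python
import hashlib
import shutil
from pathlib import Path


def export_to_distdir(cache_sha256_dir, distdir, names_by_sha256):
    """Copy cached archives into distdir under their original file names.

    cache_sha256_dir: the .../content_addressable/sha256 directory
    names_by_sha256:  dict mapping sha256 hex digest -> original archive name
    """
    cache = Path(cache_sha256_dir)
    dist = Path(distdir)
    dist.mkdir(parents=True, exist_ok=True)
    copied = []
    for digest, filename in names_by_sha256.items():
        src = cache / digest / "file"
        # Verify the cached file really has the checksum it is stored under.
        actual = hashlib.sha256(src.read_bytes()).hexdigest()
        if actual != digest:
            raise ValueError(f"{src}: checksum {actual} does not match {digest}")
        target = dist / filename
        shutil.copyfile(src, target)
        copied.append(target)
    return copied
```

A call would look like `export_to_distdir(cache_dir, "archives", {XLA_SHA256: "<commit>.tar.gz"})`, with the file name taken from the last component of the URL in the corresponding Bazel rule.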

Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
7 participants