{bio}[foss/2023a] dorado v0.6.1, kineto v0.4.0 w/ CUDA 12.1.1 #20444

Open
wants to merge 2 commits into
base: develop

boegel
Member

@boegel boegel commented Apr 30, 2024

(created using eb --new-pr)

@boegel boegel added the update label Apr 30, 2024
@boegel boegel modified the milestones: release after 4.9.1, 4.x Apr 30, 2024
@boegel
Member Author

boegel commented Apr 30, 2024

@boegelbot please test @ jsc-zen3

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20444 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20444 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4047

Test results coming soon (I hope)...

- notification for comment with ID 2086042557 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/28464236459f71623d895fd2b8e66e94 for a full test report.

@boegel
Member Author

boegel commented Apr 30, 2024

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=20444 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20444 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13376

Test results coming soon (I hope)...

- notification for comment with ID 2086186429 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member Author

boegel commented Apr 30, 2024

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node4015.donphan.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), 1 x NVIDIA NVIDIA A2, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/f79e75cdce42d707473db8155c15b7d3 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
cns2 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/4708ced08d2f6e65c510c4d31c7856e2 for a full test report.

@verdurin
Member

Test report by @verdurin
FAILED
Build succeeded for 2 out of 3 (2 easyconfigs in total)
easybuild-c7.novalocal - Linux CentOS Linux 7.9.2009, x86_64, Intel Xeon Processor (Skylake, IBRS), Python 3.6.8
See https://gist.github.com/verdurin/66f5416e8217e50466a41c79cbb1c5d3 for a full test report.

@verdurin
Member

Test report by @verdurin
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
easybuild-c7.novalocal - Linux CentOS Linux 7.9.2009, x86_64, Intel Xeon Processor (Skylake, IBRS), Python 3.6.8
See https://gist.github.com/verdurin/473780dcfbceaaa9f8d1564871bd1ae8 for a full test report.

@verdurin
Member

Cloning into 'dorado/3rdparty/zstd'...
Submodule path 'dorado/3rdparty/zstd': checked out '97a3da1df009d4dc67251de0c4b1c9d7fe286fc1'
Unable to checkout '88544847c51397004a31cf2817c334624c6350cf' in submodule path 'dorado/3rdparty/indicators'
 (at easybuild/tools/run.py:682 in parse_cmd_output)
== 2024-05-24 14:57:28,384 build_log.py:267 INFO ... (took 39 secs)
== 2024-05-24 14:57:28,388 config.py:699 DEBUG software install path as specified by 'installpath' and 'subdir_software': /eb/maint/software
== 2024-05-24 14:57:28,388 filetools.py:2013 INFO Removing lock /eb/maint/software/.locks/_eb_maint_software_dorado_0.6.1-foss-2023a-CUDA-12.1.1.lock...
== 2024-05-24 14:57:28,388 filetools.py:383 INFO Path /eb/maint/software/.locks/_eb_maint_software_dorado_0.6.1-foss-2023a-CUDA-12.1.1.lock successfully removed.
== 2024-05-24 14:57:28,388 filetools.py:2017 INFO Lock removed: /eb/maint/software/.locks/_eb_maint_software_dorado_0.6.1-foss-2023a-CUDA-12.1.1.lock
== 2024-05-24 14:57:28,388 easyblock.py:4291 WARNING build failed (first 300 chars): cmd "git clone --depth 1 --branch v0.6.1 --recursive https://github.com/nanoporetech/dorado.git" exited with exit code 1 and output:
Cloning into 'dorado'...
Note: checking out '79b5da50f86cdd59f24aedfeb48fd97fd9149233'.

You are in 'detached HEAD' state. You can look around, make experimental
chang

@boegel
Member Author

boegel commented Jul 6, 2024

Looks like the actual error is:

Unable to checkout '88544847c51397004a31cf2817c334624c6350cf' in submodule path 'dorado/3rdparty/indicators'

Is anyone else seeing this?

@boegel boegel mentioned this pull request Jul 6, 2024
3 tasks
@boegel
Member Author

boegel commented Jul 6, 2024

eb --from-pr 20444 --fetch --sourcepath /tmp (still) works fine for me...

@verdurin Is this a consistent problem for you?

@Micket
Contributor

Micket commented Jul 6, 2024

Test report by @Micket
SUCCESS
Build succeeded for 3 out of 3 (2 easyconfigs in total)
vera-skylake-build - Linux Rocky Linux 8.9, x86_64, Intel Xeon Processor (Skylake, IBRS, no TSX), Python 3.6.8
See https://gist.github.com/Micket/63d93f8fd17aaca4fb2decb6eec9eea5 for a full test report.

@zao
Contributor

zao commented Jul 6, 2024

Test report by @zao
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
eb-mix.zao.se - Linux Ubuntu 24.04 LTS (Noble Numbat), x86_64, AMD Ryzen 9 3900X 12-Core Processor (zen2), Python 3.12.3
See https://gist.github.com/zao/f233c125bb235fd539601aac71c55cf0 for a full test report.

@bedroge
Contributor

bedroge commented Jul 6, 2024

Test report by @bedroge
FAILED
Build succeeded for 3 out of 4 (2 easyconfigs in total)
gpu1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz (skylake_avx512), 1 x NVIDIA GRID V100D-32Q, 535.161.07, Python 3.6.8
See https://gist.github.com/bedroge/f180ed2652af0b0af2a7a4dce956eee5 for a full test report.

Hmm, killed, may have run out of memory.

@Micket
Contributor

Micket commented Jul 6, 2024

I think

  1. @verdurin needs to ditch CentOS 7, which is now past EOL
  2. @zao needs to get a GPU
  3. @bedroge probably needs more RAM (g++: fatal error: Killed signal terminated program cc1plus)
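
For anyone triaging similar reports, here is a minimal sketch that scans a build log for the two failure messages quoted so far in this thread. The signature strings are copied from the logs above; the cause labels and the `triage` helper are hypothetical, an interpretation rather than anything EasyBuild itself reports.

```python
# Signatures are copied from log lines quoted in this thread; the cause labels
# are an interpretation, not something EasyBuild reports.
SIGNATURES = [
    ("Unable to checkout", "git submodule checkout failed during source fetch"),
    ("Killed signal terminated program cc1plus", "compiler was killed, possibly out of memory"),
]


def triage(log_text):
    """Return the cause labels whose signature appears in the log text."""
    return [label for needle, label in SIGNATURES if needle in log_text]
```

A killed `cc1plus` is only a hint of memory pressure, so the label is deliberately worded as "possibly".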

@bedroge
Contributor

bedroge commented Jul 6, 2024

Test report by @bedroge
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
v100gpu31 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz (skylake_avx512), 1 x NVIDIA GRID V100D-32Q, 535.161.07, Python 3.6.8
See https://gist.github.com/bedroge/7bb286c9fab3cbb939eb94ea11e65b2b for a full test report.

Now I get:

== 2024-07-06 19:39:51,330 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): Sanity check failed: Library libtorch.so not found for /home1/p251204/easybuildinstall/software/dorado/0.6.1-foss-2023a-CUDA-12.1.1/bin/dorado
Library libtorch_cpu.so not found for /home1/p251204/easybuildinstall/software/dorado/0.6.1-foss-2023a-CUDA-12.1.1/bin/dorado
Library libtorch_cuda.so not found for /home1/p251204/easybuildinstall/software/dorado/0.6.1-foss-2023a-CUDA-12.1.1/bin/dorado
Library libc10_cuda.so not found for /home1/p251204/easybuildinstall/software/dorado/0.6.1-foss-2023a-CUDA-12.1.1/bin/dorado
Library libc10.so not found for /home1/p251204/easybuildinstall/software/dorado/0.6.1-foss-2023a-CUDA-12.1.1/bin/dorado
No '(RPATH)' found in 'readelf -d' output for /home1/p251204/easybuildinstall/software/dorado/0.6.1-foss-2023a-CUDA-12.1.1/lib/libcublasLt.so.12.1.3.1:
and a whole bunch of similar RPATH errors

Let me try without RPATH support...
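
To inspect a single file by hand, the two kinds of message above can be approximated by parsing `readelf -d` and `ldd` output. This is a rough sketch, not EasyBuild's actual sanity check: the helper names are hypothetical, and accepting `(RUNPATH)` in addition to `(RPATH)` is an assumption (the log message above only mentions `(RPATH)`).

```python
import re

# Matches dynamic-section entries such as:
#  0x000000000000000f (RPATH)    Library rpath: [/some/dir]
#  0x000000000000001d (RUNPATH)  Library runpath: [/some/dir]
_TAG_RE = re.compile(r"\((RPATH|RUNPATH)\)\s+Library r(?:un)?path: \[(.*?)\]")


def rpath_entries(readelf_output):
    """Return {tag: [dirs]} for RPATH/RUNPATH entries in `readelf -d` text."""
    found = {}
    for tag, paths in _TAG_RE.findall(readelf_output):
        found.setdefault(tag, []).extend(p for p in paths.split(":") if p)
    return found


def missing_libraries(ldd_output):
    """Return library names that `ldd` reports as 'not found'."""
    return re.findall(r"^\s*(\S+) => not found", ldd_output, flags=re.MULTILINE)
```

Feed it the text captured from `readelf -d <file>` or `ldd <file>`. An empty result from `rpath_entries` corresponds to the "No '(RPATH)' found" case.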

@bedroge
Contributor

bedroge commented Jul 6, 2024

Test report by @bedroge
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
v100gpu31 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz (skylake_avx512), 1 x NVIDIA GRID V100D-32Q, 535.161.07, Python 3.6.8
See https://gist.github.com/bedroge/6bd962cbe9e499feac29998d55f9ef0d for a full test report.

So, apparently this doesn't work with RPATH enabled.

@branfosj
Member

branfosj commented Jul 6, 2024

> So, apparently this doesn't work with RPATH enabled.

Those are NVIDIA libraries that are failing the RPATH check. What is being pulled in or packaged inside the install?

@zao
Contributor

zao commented Jul 6, 2024

@Micket Seems like I don't quite need a GPU. The build passed in a CentOS 7 container without RPATH; the failure seen on Ubuntu 24.04 was with RPATH enabled.

@boegel I can repro that CentOS 7 fails fetching the dorado sources, while Rocky8 succeeds.
My C7 git version is 1.8.3.1 while the R8 git version is 2.39.3.

@zao
Contributor

zao commented Jul 6, 2024

I would reckon that there are some fancy Git/GitHub features at play here, where an ancient Git doesn't understand enough of them to obtain the object at population time.
If I load an EB-built git/2.28.0-nodocs I can successfully check out the repo on CentOS 7 so the OS itself is "fine".
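
A quick way to check whether a build host falls on the failing side is to compare its `git --version` output against the versions reported here. The sketch below assumes 2.28.0 as the threshold only because it is the oldest version seen to work in this thread; the real minimum is unknown, and the helper names are hypothetical.

```python
import re
import subprocess

# Versions reported in this thread: 1.8.3.1 failed, 2.28.0 and 2.39.3 worked.
# The exact minimum is unknown; 2.28.0 is only the oldest version seen to work.
OLDEST_KNOWN_GOOD = (2, 28, 0)  # assumption, not a documented requirement


def parse_git_version(text):
    """Extract the numeric version tuple from `git --version` output."""
    match = re.search(r"git version (\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"unrecognised git version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_known_good(version):
    """True if the version is at least the oldest one seen to work here."""
    return version >= OLDEST_KNOWN_GOOD


def installed_git_version():
    """Query the git found on PATH (requires git to be installed)."""
    out = subprocess.run(["git", "--version"], capture_output=True, text=True, check=True)
    return parse_git_version(out.stdout)
```

For example, `is_known_good(installed_git_version())` returns `False` on a host with the CentOS 7 system git. A version between 1.8.3.1 and 2.28.0 is also flagged, although nobody in this thread has tested one.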

@verdurin
Member

Test report by @verdurin
FAILED
Build succeeded for 130 out of 132 (2 easyconfigs in total)
easybuild-el8.cloud.in.bmrc.ox.ac.uk - Linux Rocky Linux 8.10, x86_64, Intel Xeon Processor (Skylake, IBRS), Python 3.6.8
See https://gist.github.com/verdurin/4a60e3c76ccb8f08ee52e822f7fa8b46 for a full test report.

@verdurin
Member

Ugh, will re-try.


7 participants