
add patches to fix TensorFlow 2.7.1 on POWER #16795

Merged

Conversation

@Flamefire (Contributor)

(created using eb --new-pr)

@Flamefire Flamefire marked this pull request as draft December 6, 2022 08:04
@boegel boegel added the bug fix label Dec 7, 2022
@boegel boegel modified the milestones: 4.x, next release (4.7.0) Dec 7, 2022
@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa12 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/976d5f02eaa3a8626476c2dda3ea0ce6 for a full test report.

@Flamefire Flamefire marked this pull request as ready for review December 19, 2022 16:20
@Flamefire Flamefire force-pushed the 20221205171338_new_pr_TensorFlow271 branch from 5ac6fb6 to 62a8e13 on December 19, 2022 16:20
@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi8006 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/c868d5c8a5c2b3a9702414c404bc859e for a full test report.

@Flamefire (Contributor, Author)

@boegel I'd really love to get this into the next release, as a lot of time went into it and this is now the latest version of TensorFlow that works on PPC. 2.8.4 doesn't have a CUDA version and doesn't work on PPC, and 2.9 isn't ready yet.

Test report coming up; I need to install to /tmp as our cluster's filesystem got full, hence the (now deleted) failed test reports. But I already tested it manually, so I expect no failures.

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/04a931eca748b60e64c185062db0f388 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml5 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/dca0a36498f60e13fc50c62ee493894c for a full test report.

@boegelbot

This comment was marked as outdated.

@Flamefire (Contributor, Author)

@boegel What is missing here that keeps this getting postponed? #16795 (comment) shows it is working on PPC, which is the purpose of this PR, and the patches only affect the PPC build.

@branfosj (Member)

Test report by @branfosj
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
bear-pg0203u29a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/75b2224e9359a85b84d3424f6fa94c47 for a full test report.

@Flamefire (Contributor, Author)

Test report by @branfosj FAILED

ERROR: /dev/shm/branfosj/build-up-EL8/TensorFlow/2.7.1/foss-2021b/TensorFlow/bazel-root/2b861e6b2b884d30743b7f211eb4a8d3/external/rules_cuda/cuda/BUILD:130:20: every rule of type cuda_toolchain_info implicitly depends upon the target '@local_cuda//:cuda/bin/nvcc', but this target could not be found because of: no such target '@local_cuda//:cuda/bin/nvcc': target 'cuda/bin/nvcc' not declared in package '' defined by /dev/shm/branfosj/build-up-EL8/TensorFlow/2.7.1/foss-2021b/TensorFlow/bazel-root/2b861e6b2b884d30743b7f211eb4a8d3/external/local_cuda/BUILD

This doesn't make sense for the non-CUDA build. Leftovers from another build, or some other temporary failure? Or a space issue in /dev/shm from building 2 TFs?

@branfosj (Member) left a comment

I've tested the CPU-only version again and it failed on a GPU node (fresh login to the node). I also have a build running on a CPU-only node, which has not hit the same issue. Testing further, I found that this fault also occurs without the changes in this PR, which is not a surprise. So, I'm happy to approve this.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/fb7b88fb634c21ab235d24d31e3b883e for a full test report.

@branfosj (Member)

Going in, thanks @Flamefire!

@branfosj branfosj merged commit d5b6d93 into easybuilders:develop Jun 30, 2023
@Flamefire Flamefire deleted the 20221205171338_new_pr_TensorFlow271 branch June 30, 2023 11:02
@boegel boegel changed the title from "Fix TensorFlow 2.7.1 on POWER" to "add patches to fix TensorFlow 2.7.1 on POWER" on Jul 4, 2023