
add patches to fix TensorFlow 2.7.1 on POWER #16795

Merged

Conversation

@Flamefire (Contributor)

(created using eb --new-pr)

@Flamefire Flamefire marked this pull request as draft December 6, 2022 08:04
@boegel boegel added the bug fix label Dec 7, 2022
@boegel boegel modified the milestones: 4.x, next release (4.7.0) Dec 7, 2022
@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa12 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/976d5f02eaa3a8626476c2dda3ea0ce6 for a full test report.

@Flamefire Flamefire marked this pull request as ready for review December 19, 2022 16:20
@Flamefire Flamefire force-pushed the 20221205171338_new_pr_TensorFlow271 branch from 5ac6fb6 to 62a8e13 on December 19, 2022 16:20
@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi8006 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/c868d5c8a5c2b3a9702414c404bc859e for a full test report.

@Flamefire (Contributor, Author)

@boegel I'd really love to get this into the next release, as a lot of time went into it and this is now the latest version of TensorFlow that works on PPC. 2.8.4 doesn't have a CUDA version and doesn't work on PPC, and 2.9 isn't ready yet.

Test report coming up; I need to install to /tmp as our cluster's filesystem got full, hence the (now deleted) failed test reports. But I already tested it manually, so I expect no failures.

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/04a931eca748b60e64c185062db0f388 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml5 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/dca0a36498f60e13fc50c62ee493894c for a full test report.

@boegelbot

This comment was marked as outdated.

@Flamefire (Contributor, Author)

@boegel What is missing here that keeps this getting postponed? #16795 (comment) shows it is working on PPC, which is the purpose of this PR, and the patches only affect the PPC build.

@branfosj (Member)

Test report by @branfosj
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
bear-pg0203u29a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/75b2224e9359a85b84d3424f6fa94c47 for a full test report.

@Flamefire (Contributor, Author)

Test report by @branfosj FAILED

ERROR: /dev/shm/branfosj/build-up-EL8/TensorFlow/2.7.1/foss-2021b/TensorFlow/bazel-root/2b861e6b2b884d30743b7f211eb4a8d3/external/rules_cuda/cuda/BUILD:130:20: every rule of type cuda_toolchain_info implicitly depends upon the target '@local_cuda//:cuda/bin/nvcc', but this target could not be found because of: no such target '@local_cuda//:cuda/bin/nvcc': target 'cuda/bin/nvcc' not declared in package '' defined by /dev/shm/branfosj/build-up-EL8/TensorFlow/2.7.1/foss-2021b/TensorFlow/bazel-root/2b861e6b2b884d30743b7f211eb4a8d3/external/local_cuda/BUILD

This doesn't make sense for the non-CUDA build. Leftovers from another build, or some other temporary failure? Or a space issue in /dev/shm from building 2 TFs?

@branfosj (Member) left a comment

I've tested the CPU-only version again and it failed on a GPU node (fresh login to the node). I also have a build running on a CPU-only node, which has not hit the same issue. Testing further, I found that this fault also occurs without the changes in this PR, which is not a surprise. So, I'm happy to approve this.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/fb7b88fb634c21ab235d24d31e3b883e for a full test report.

@branfosj (Member)

Going in, thanks @Flamefire!

@branfosj branfosj merged commit d5b6d93 into easybuilders:develop Jun 30, 2023
@Flamefire Flamefire deleted the 20221205171338_new_pr_TensorFlow271 branch June 30, 2023 11:02
@boegel boegel changed the title from "Fix TensorFlow 2.7.1 on POWER" to "add patches to fix TensorFlow 2.7.1 on POWER" on Jul 4, 2023