{lib}[foss/2023a] TensorFlow v2.13.0 w/ CUDA 12.1.1 #19182

Closed
@@ -0,0 +1,250 @@
easyblock = 'PythonBundle'

name = 'TensorFlow'
version = '2.13.0'
versionsuffix = '-CUDA-%(cudaver)s'

homepage = 'https://www.tensorflow.org/'
description = "An open-source software library for Machine Intelligence"

toolchain = {'name': 'foss', 'version': '2023a'}
toolchainopts = {'pic': True}

builddependencies = [
    ('Bazel', '6.3.1'),
    # git 2.x required, see also https://github.com/tensorflow/tensorflow/issues/29053
    ('git', '2.41.0', '-nodocs'),
    ('pybind11', '2.11.1'),
    ('UnZip', '6.0'),
    # Required to build some of the extensions
    ('poetry', '1.5.1'),
    # System protobuf doesn't seem to work: https://github.com/tensorflow/tensorflow/issues/61593
    # So don't add it here
]
dependencies = [
    ('CUDA', '12.1.1', '', SYSTEM),
    ('cuDNN', '8.9.2.26', versionsuffix, SYSTEM),
Member

@VRehnberg It seems like this cuDNN version is causing trouble, I'm getting:

In file included from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:37,
                 from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_OperationGraph.h:36,
                 from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Heuristics.h:31,
                 from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend.h:101,
                 from tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:56:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h: In member function int64_t cudnn_frontend::PointWiseDesc_v8::getPortCount() const:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h:69:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
   69 |         switch (mode) {
      |                ^
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h: In member function cudnn_frontend::Operation_v8&& cudnn_frontend::OperationBuilder_v8::build_pointwise_op():
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:413:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
  413 |         switch (m_operation.pointwise_mode) {
      |                ^
cc1plus: some warnings being treated as errors

See also tensorflow/tensorflow#60832, where they suggest downgrading to an older cuDNN (ugh...)

Member

There are no easyconfigs yet that use a 2023a toolchain and have a cuDNN dependency, so we still have the freedom to stick to cuDNN 8.6.* here...

Member

We would also need to stick to CUDA 11.8 though, since cuDNN 8.6 only seems to be paired with CUDA 10.2 and 11.8, see https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/

Member

And CUDA 11.8 is a problem with GCC 12.x. I'm hitting this when installing NCCL on top of CUDA 11.8.0 with GCCcore/12.3.0:

unsupported GNU version! gcc versions later than 11 are not supported!

So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?
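
The constraints collected in this thread can be written down as a small lookup, which makes it easier to see why nothing fits a GCC 12.x toolchain. This is only a sketch: the tables hold just the pairings mentioned in the comments above (not NVIDIA's full support matrix), and the names `CUDNN_CUDA`, `CUDA_MAX_GCC` and `combo_ok` are made up for illustration.

```python
# Pairings as reported in this thread only; NOT an authoritative support matrix.
CUDNN_CUDA = {'8.6': ('11.8',)}   # cuDNN series -> CUDA 11.x release named in the thread
CUDA_MAX_GCC = {'11.8': 11}       # CUDA release -> newest GCC major version nvcc accepts


def combo_ok(cudnn, cuda, gcc_major):
    """Return (ok, reason) for a cuDNN / CUDA / GCC combination."""
    if cuda not in CUDNN_CUDA.get(cudnn, ()):
        return False, "cuDNN %s is not paired with CUDA %s" % (cudnn, cuda)
    max_gcc = CUDA_MAX_GCC.get(cuda)
    if max_gcc is not None and gcc_major > max_gcc:
        return False, "CUDA %s rejects GCC %d (newer than %d)" % (cuda, gcc_major, max_gcc)
    return True, "no known conflict"


# GCCcore/12.3.0 is the compiler named in the comment above; 11 stands for a GCC 11.x toolchain.
print(combo_ok('8.6', '11.8', 12))
print(combo_ok('8.6', '11.8', 11))
```

The check is deliberately conservative: a CUDA release missing from `CUDA_MAX_GCC` is treated as having no known GCC limit, which matches the "can't find anything" situation described for newer CUDA versions.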

Contributor Author

Meh, I'll close this one then.

Contributor Author

@VRehnberg Nov 17, 2023

Hmm, or go with another CUDA I suppose. That's what the CUDA version suffix is for, I guess. I can't find anything about which GCC versions are compatible with CUDA 12.3, but extrapolating from what I could find, it will probably work. CUDA 12.3 isn't listed for cuDNN 8.9.6, but it could possibly work.
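
For reference, the version suffix in this easyconfig follows the CUDA dependency through the `%(cudaver)s` template. The snippet below mimics that substitution with plain Python `%` formatting. It is an approximation for illustration, not EasyBuild's actual template code, and `resolve_suffix` is a hypothetical helper.

```python
versionsuffix = '-CUDA-%(cudaver)s'


def resolve_suffix(template, cuda_version):
    """Fill in the %(cudaver)s placeholder with a concrete CUDA version."""
    return template % {'cudaver': cuda_version}


# Swapping the CUDA dependency changes the suffix, so variants can coexist.
for cuda in ('12.1.1', '11.8.0'):
    print('TensorFlow-2.13.0-foss-2023a' + resolve_suffix(versionsuffix, cuda))
```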

Contributor

@Flamefire Nov 21, 2023

> unsupported GNU version! gcc versions later than 11 are not supported!
>
> So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?

We can work around this by forcing NVCC to accept the "incompatible" compiler: https://github.com/easybuilders/easybuild-easyconfigs/pull/18853/files#diff-c0833191974a98d7eddf20cecac9d27ec670e369f43f75f3a4bafb2261b1135fR27
Of course there is a risk that the compiler really is incompatible...
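
A rough sketch of what such an override could look like in an easyconfig. `-allow-unsupported-compiler` is an nvcc option and `NVCC_APPEND_FLAGS` is an environment variable read by recent nvcc releases, but whether the linked PR does it exactly this way is an assumption, so check the PR itself. The `prebuildopts` value and the `make` command are illustrative only.

```python
# Hypothetical easyconfig fragment (assumption: not copied from the linked PR).
# Ask nvcc to skip its host-compiler version check on every invocation.
nvcc_flags = '-allow-unsupported-compiler'
prebuildopts = "export NVCC_APPEND_FLAGS='%s' && " % nvcc_flags

# Show the resulting shell prefix in front of an example build command.
print(prebuildopts + 'make')
```

This only silences the version check. It does not make an incompatible compiler compatible, so the resulting build would still need testing.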

    ('NCCL', '2.18.3', versionsuffix),
    ('Python', '3.11.3'),
    ('h5py', '3.9.0'),
    ('cURL', '8.0.1'),
    ('dill', '0.3.7'),
    ('double-conversion', '3.3.0'),
    ('flatbuffers', '23.5.26'),
    ('flatbuffers-python', '23.5.26'),
    ('giflib', '5.2.1'),
    ('hwloc', '2.9.1'),
    ('ICU', '73.2'),
    ('JsonCpp', '1.9.5'),
    ('libjpeg-turbo', '2.1.5.1'),
    ('NASM', '2.16.01'),
    ('nsync', '1.26.0'),
    ('SQLite', '3.42.0'),
    ('patchelf', '0.18.0'),
    ('protobuf-python', '4.24.0'),
    ('libpng', '1.6.39'),
    ('snappy', '1.1.10'),
    ('zlib', '1.2.13'),
    # Dependencies of grpcio
    ('OpenSSL', '1.1', '', SYSTEM),
    ('RE2', '2023-08-01'),
]

use_pip = True
sanity_pip_check = True

# Dependencies created and updated using findPythonDeps.sh:
# https://gist.github.com/Flamefire/49426e502cd8983757bd01a08a10ae0d
exts_list = [
    ('wrapt', '1.15.0', {
        'checksums': ['d06730c6aed78cee4126234cf2d071e01b44b915e725a6cb439a879ec9754a3a'],
    }),
    ('termcolor', '2.3.0', {
        'source_tmpl': SOURCE_PY3_WHL,
        'checksums': ['3afb05607b89aed0ffe25202399ee0867ad4d3cb4180d98aaf8eefa6a5f7d475'],
    }),
    ('tensorflow-estimator', version, {
        'source_tmpl': 'tensorflow_estimator-%(version)s-py2.py3-none-any.whl',
        'checksums': ['6f868284eaa654ae3aa7cacdbef2175d0909df9fcf11374f5166f8bf475952aa'],
    }),
    ('Werkzeug', '2.3.7', {
        'source_tmpl': SOURCELOWER_TAR_GZ,
        'checksums': ['2b8c0e447b4b9dbcc85dd97b6eeb4dcbaf6c8b6c3be0bd654e25553e0a2157d8'],
    }),
    ('tensorboard-plugin-wit', '1.8.1', {
        'source_tmpl': 'tensorboard_plugin_wit-%(version)s-py3-none-any.whl',
        'checksums': ['ff26bdd583d155aa951ee3b152b3d0cffae8005dc697f72b44a8e8c2a77a8cbe'],
    }),
    ('tensorboard-data-server', '0.7.1', {
        'source_tmpl': 'tensorboard_data_server-%(version)s-py3-none-any.whl',
        'checksums': ['9938bd39f5041797b33921066fba0eab03a0dd10d1887a05e62ae58841ad4c3f'],
    }),
    ('Markdown', '3.4.4', {
        'checksums': ['225c6123522495d4119a90b3a3ba31a1e87a70369e03f14799ea9c0d7183a3d6'],
    }),
    ('grpcio', '1.57.0', {
        'modulename': 'grpc',
        'preinstallopts': "GRPC_PYTHON_BUILD_EXT_COMPILER_JOBS=%(parallel)s " +
        # Required to avoid building with non-default C++ standard but keep other flags,
        # see https://github.com/grpc/grpc/issues/34256
        "GRPC_PYTHON_CFLAGS='-fvisibility=hidden -fno-wrapv -fno-exceptions' " +
        " ".join(["GRPC_PYTHON_BUILD_SYSTEM_%s=True" % i for i in
                  (
                      'OPENSSL',
                      'ZLIB',
                      'RE2',
                      # 'ABSL',
                  )]),
        'checksums': ['4b089f7ad1eb00a104078bab8015b0ed0ebcb3b589e527ab009c53893fd4e613'],
    }),
    ('oauthlib', '3.2.2', {
        'checksums': ['9859c40929662bec5d64f34d01c99e093149682a3f38915dc0655d5a633dd918'],
    }),
    ('requests-oauthlib', '1.3.1', {
        'checksums': ['75beac4a47881eeb94d5ea5d6ad31ef88856affe2332b9aafb52c6452ccf0d7a'],
    }),
    ('rsa', '4.9', {
        'checksums': ['e38464a49c6c85d7f1351b0126661487a7e0a14a50f1675ec50eb34d4f20ef21'],
    }),
    ('pyasn1-modules', '0.3.0', {
        'source_tmpl': 'pyasn1_modules-%(version)s.tar.gz',
        'checksums': ['5bd01446b736eb9d31512a30d46c1ac3395d676c6f3cafa4c03eb54b9925631c'],
    }),
    ('cachetools', '5.3.1', {
        'checksums': ['dce83f2d9b4e1f732a8cd44af8e8fab2dbe46201467fc98b3ef8f269092bf62b'],
    }),
    ('google-auth', '2.22.0', {
        'modulename': 'google.auth',
        'checksums': ['164cba9af4e6e4e40c3a4f90a1a6c12ee56f14c0b4868d1ca91b32826ab334ce'],
    }),
    ('google-auth-oauthlib', '1.0.0', {
        'checksums': ['e375064964820b47221a7e1b7ee1fd77051b6323c3f9e3e19785f78ab67ecfc5'],
    }),
    ('absl-py', '1.4.0', {
        'modulename': 'absl',
        'checksums': ['d2c244d01048ba476e7c080bd2c6df5e141d211de80223460d5b3b8a2a58433d'],
    }),
    ('tensorboard', version, {
        'source_tmpl': SOURCE_PY3_WHL,
        'checksums': ['ab69961ebddbddc83f5fa2ff9233572bdad5b883778c35e4fe94bf1798bd8481'],
    }),
    ('opt-einsum', '3.3.0', {
        'source_tmpl': 'opt_einsum-%(version)s.tar.gz',
        'checksums': ['59f6475f77bbc37dcf7cd748519c0ec60722e91e63ca114e68821c0c54a46549'],
    }),
    ('keras', '2.13.1', {
        'source_tmpl': SOURCE_PY3_WHL,
        'checksums': ['5ce5f706f779fa7330e63632f327b75ce38144a120376b2ae1917c00fa6136af'],
    }),
    ('google-pasta', '0.2.0', {
        'modulename': 'pasta',
        'checksums': ['c9f2c8dfc8f96d0d5808299920721be30c9eec37f2389f28904f454565c8a16e'],
    }),
    ('astunparse', '1.6.3', {
        'checksums': ['5ad93a8456f0d084c3456d059fd9a92cce667963232cbf763eac3bc5b7940872'],
    }),
    # Required by tests
    ('portpicker', '1.5.2', {
        'checksums': ['c55683ad725f5c00a41bc7db0225223e8be024b1fa564d039ed3390e4fd48fb3'],
    }),
    # System dependencies
    ('tblib', '2.0.0', {
        'checksums': ['a6df30f272c08bf8be66e0775fad862005d950a6b8449b94f7c788731d70ecd7'],
    }),
    ('astor', '0.8.1', {
        'checksums': ['6a6effda93f4e1ce9f618779b2dd1d9d84f1e32812c23a29b3fff6fd7f63fa5e'],
    }),
    # Optional profile plugin + dependency
    ('gviz-api', '1.10.0', {
        'source_tmpl': 'gviz_api-%(version)s.tar.gz',
        'checksums': ['846692dd8cc73224fc31b18e41589bd934e1cc05090c6576af4b4b26c2e71b90'],
    }),
    ('tensorboard-plugin-profile', '2.13.1', {
        'source_tmpl': 'tensorboard_plugin_profile-%(version)s.tar.gz',
        'checksums': ['472d1cb85d7087c5294131eb640bd771f5515ecc4867030c7904718be7fc19c1'],
    }),
    (name, version, {
        'source_tmpl': 'v%(version)s.tar.gz',
        'source_urls': ['https://github.com/tensorflow/tensorflow/archive/'],
        'patches': [
            'TensorFlow-2.1.0_fix-cuda-build.patch',
            'TensorFlow-2.4.0_dont-use-var-lock.patch',
            'TensorFlow-2.9.1_remove-duplicate-gpu-tests.patch',
            'TensorFlow-2.11.0_disable-avx512-extensions.patch',
            'TensorFlow-2.13.0_add-default-shell-env.patch',
            'TensorFlow-2.13.0_add-missing-system-absl-py-target.patch',
            'TensorFlow-2.13.0_add-missing-system-protobuf-targets.patch',
            'TensorFlow-2.13.0_exclude-xnnpack-on-ppc.patch',
            'TensorFlow-2.13.0_fix-protobuf-compatibility.patch',
            'TensorFlow-2.13.0_remove-io-gcs-filesystem-dep.patch',
            'TensorFlow-2.13.0_remove-libclang-dep.patch',
            'TensorFlow-2.13.0_fix-numpy-2.15.compat.patch',
            'TensorFlow-2.13.0_remove-typing_extensions-upper-bound.patch',
            'TensorFlow-2.13.0_revert-to-flatbuffers-2.0.6.patch',
            'TensorFlow-2.13.0_unpin-gast-version.patch',
        ],
        'checksums': [
            {'v2.13.0.tar.gz': 'e58c939079588623e6fa1d054aec2f90f95018266e0a970fd353a5244f5173dc'},
            {'TensorFlow-2.1.0_fix-cuda-build.patch':
             '78c20aeaa7784b8ceb46238a81e8c2461137d28e0b576deeba8357d23fbe1f5a'},
            {'TensorFlow-2.4.0_dont-use-var-lock.patch':
             'b14f2493fd2edf79abd1c4f2dde6c98a3e7d5cb9c25ab9386df874d5f072d6b5'},
            {'TensorFlow-2.9.1_remove-duplicate-gpu-tests.patch':
             '6fe50faab28387c622c68dc3fc0cbfb2a51000cd750c1a82f8420b54fcd2509f'},
            {'TensorFlow-2.11.0_disable-avx512-extensions.patch':
             'fb8e7694b5d2377cc44e6674ff85a7c50dc725f2f507cbcfda65f129f534b1cc'},
            {'TensorFlow-2.13.0_add-default-shell-env.patch':
             'a94b2e007bff5a08ec4e6ec3043985907a69e9eeaea69dc4fe2aa15d15b75aef'},
            {'TensorFlow-2.13.0_add-missing-system-absl-py-target.patch':
             '94bc3b155840af942437d06c43830dabf41d94391daf61e1d0add0a7bf20a538'},
            {'TensorFlow-2.13.0_add-missing-system-protobuf-targets.patch':
             '77d8c8a5627493fc7c38b4de79d49e60ff6628b05ff969f4cd3ff9857176c459'},
            {'TensorFlow-2.13.0_exclude-xnnpack-on-ppc.patch':
             'd0818206846911d946666ded7d3216c0546e37cee1890a2f48dc1a9d71047cad'},
            {'TensorFlow-2.13.0_fix-protobuf-compatibility.patch':
             'a9658c035b663da1b7d1983a8e37883cc40c1c0cfa22132bb7fe19c4cbc9712a'},
            {'TensorFlow-2.13.0_remove-io-gcs-filesystem-dep.patch':
             '39f1cbecad4b3723481b30f18f16363ab1837c8749ee197ec88b92b493e9df67'},
            {'TensorFlow-2.13.0_remove-libclang-dep.patch':
             'f0d067d129e817b0d371c4e48a4a1ac08f80a2c137d52b05a3c7c4370dcbd1e5'},
            {'TensorFlow-2.13.0_fix-numpy-2.15.compat.patch':
             '4023be57bc8e33ae55ccac54b51d6532fea7ac4a32cb1125e3e42da0dec1669a'},
            {'TensorFlow-2.13.0_remove-typing_extensions-upper-bound.patch':
             'ed48464ed6f4cdbd0dde93ffc413c394d363278039502d77540ff7206c2048ae'},
            {'TensorFlow-2.13.0_revert-to-flatbuffers-2.0.6.patch':
             'f22757250181b6165e4b2ef1e199bd4cb344a9429be5a1086638f25bcbf650fc'},
            {'TensorFlow-2.13.0_unpin-gast-version.patch':
             '61e0c9b67aa6c48176fcbb429bf6aa36c4fdde604c82c02f58a043412fecf285'},
        ],
        'test_script': 'TensorFlow-2.x_mnist-test.py',
        'test_tag_filters_cpu': '-gpu,-tpu,-no_cuda_on_cpu_tap,'
                                '-no_pip,-no_oss,-oss_serial,-benchmark-test,-v1only',
        'test_tag_filters_gpu': 'gpu,-no_gpu,-nogpu,-gpu_cupti,-no_cuda11,'
                                '-no_pip,-no_oss,-oss_serial,-benchmark-test,-v1only',
        'test_targets': [
            '//tensorflow/core/...',
            '-//tensorflow/core:example_java_proto',
            '-//tensorflow/core/example:example_protos_closure',
            '//tensorflow/cc/...',
            '//tensorflow/c/...',
            '//tensorflow/python/...',
            '-//tensorflow/c/eager:c_api_test_gpu',
            '-//tensorflow/c/eager:c_api_distributed_test',
            '-//tensorflow/c/eager:c_api_distributed_test_gpu',
            '-//tensorflow/c/eager:c_api_cluster_test_gpu',
            '-//tensorflow/c/eager:c_api_remote_function_test_gpu',
            '-//tensorflow/c/eager:c_api_remote_test_gpu',
            '-//tensorflow/core/common_runtime:collective_param_resolver_local_test',
            '-//tensorflow/core/kernels/mkl:mkl_fused_ops_test',
            '-//tensorflow/core/kernels/mkl:mkl_fused_batch_norm_op_test',
            '-//tensorflow/core/ir/importexport/tests/roundtrip/...',
        ],
        # Need to have $HOME set for tests on PPC: https://github.com/tensorflow/tensorflow/issues/61814
        'testopts': "--test_env=HOME=/tmp --test_timeout=3600 --test_size_filters=small",
        'testopts_gpu': "--test_env=HOME=/tmp --test_timeout=3600 --test_size_filters=small "
                        "--run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute",
        'with_xla': True,
    }),
]

moduleclass = 'lib'
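
The `preinstallopts` expression for grpcio in this easyconfig is easier to review once evaluated. The standalone snippet below repeats the same string concatenation outside EasyBuild, so `%(parallel)s` stays an unresolved template placeholder here. It is meant only to show the resulting environment assignments, not how EasyBuild processes them.

```python
# Same expression as in the grpcio extension, minus the comments and the disabled ABSL entry.
preinstallopts = (
    "GRPC_PYTHON_BUILD_EXT_COMPILER_JOBS=%(parallel)s " +
    "GRPC_PYTHON_CFLAGS='-fvisibility=hidden -fno-wrapv -fno-exceptions' " +
    " ".join(["GRPC_PYTHON_BUILD_SYSTEM_%s=True" % i for i in ('OPENSSL', 'ZLIB', 'RE2')])
)

# One "NAME=value" assignment per system library that grpcio should link against.
print(preinstallopts)
```

The three `GRPC_PYTHON_BUILD_SYSTEM_*` variables line up with the OpenSSL, zlib and RE2 entries in `dependencies`.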