
Update Tensorflow to version 2.11.0 #8258

Closed
wants to merge 19 commits into from


smuzaffar
Contributor

This PR updates TensorFlow to version 2.11.0, along with other tools needed by TF 2.11.0, e.g.:

  • eigen
  • bazel
  • flatbuffer
  • gitlib
  • java-env version 11 (picked up from system)
  • opencv
  • cython

As the above changes required rebuilding some python3 packages, this PR also moves those packages to their latest versions.

@cmsbuild
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_13_0_X/master.

@cmsbuild, @smuzaffar, @aandvalenzuela, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.
cms-bot commands are listed here

@smuzaffar
Contributor Author

@fwyzard, TensorFlow 2.11.0 needs a newer Eigen, so I have updated it to https://gitlab.com/libeigen/eigen/-/tree/3bb6a48d8c171cf20b5f8e48bfb4e424fbd4f79e and added the CMS-related changes on top of it. Can you please review these changes?

Just for reference, the CMS changes for existing eigen version are here

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/30121/summary.html
COMMIT: a77204d
CMSSW: CMSSW_13_0_X_2023-01-22-2300/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8258/30121/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

+ '[' -f /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc11/./etc/profile.d/init.sh ']'
+ '[' '!' -e /usr/lib/jvm/java-11/bin/javac ']'
+ echo '/usr/lib/jvm/java-11/bin/javac path is not available'
/usr/lib/jvm/java-11/bin/javac path is not available
+ exit 1
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.DSmWpM (%install)


RPM build errors:
line 35: It's not recommended to have unversioned Obsoletes: Obsoletes: external+java-env+11.0-8a469bf4e4386211c5f34a43cdca7c47
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.DSmWpM (%install)


@cmsbuild
Contributor

Pull request #8258 was updated.

@fwyzard
Contributor

fwyzard commented Jan 23, 2023


OK, I will have a look.

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/30134/summary.html
COMMIT: ac806b8
CMSSW: CMSSW_13_0_X_2023-01-23-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8258/30134/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'hatchling'

error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.IvFzhd (%build)


RPM build errors:
line 37: It's not recommended to have unversioned Obsoletes: Obsoletes: external+py3-vector+0.11.0-f0ad4c02f003a41f027ffa1f6dd8c17f
Macro expanded in comment on line 342: %{pkginstroot}/bin/*
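The %build failure above is a ModuleNotFoundError for hatchling. Assuming the package being rebuilt is py3-vector (as the Obsoletes warning suggests) and that it declares hatchling as its PEP 517 build backend in pyproject.toml, the build environment simply did not provide that backend. A minimal sketch of a pre-build check follows; the function name backend_importable is hypothetical, and it only verifies that the backend's top-level module can be found, not that its version is suitable.

```python
import importlib.util


def backend_importable(build_backend: str) -> bool:
    """Return True if the top-level module of a PEP 517 build backend can be found.

    `build_backend` is the value of [build-system].build-backend in
    pyproject.toml, e.g. "hatchling.build". An optional ":object" suffix
    (allowed by PEP 517) is stripped before the lookup.
    """
    module_path = build_backend.split(":", 1)[0]
    top_level = module_path.split(".", 1)[0]
    # find_spec on a top-level name returns None (instead of raising)
    # when the module is not installed.
    return importlib.util.find_spec(top_level) is not None


if __name__ == "__main__":
    # "hatchling.build" is the backend string hatchling-based projects use.
    for backend in ("hatchling.build", "setuptools.build_meta"):
        status = "ok" if backend_importable(backend) else "MISSING"
        print(f"{backend}: {status}")
```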


@cmsbuild
Contributor

Pull request #8258 was updated.

@cmsbuild
Contributor

Pull request #8258 was updated.

@smuzaffar
Contributor Author

please test

@smuzaffar
Contributor Author

please test for el8_ppc64le_gcc11

@cmsbuild
Contributor

-1

Failed Tests: UnitTests RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31944/summary.html
COMMIT: f47a0d1
CMSSW: CMSSW_13_1_X_2023-04-11-2300/el8_ppc64le_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8258/31944/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31944/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31944/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testONNXRuntime had ERRORS
---> test testTFGraphLoadingCUDA had ERRORS
---> test testTFConstSessionCUDA had ERRORS
---> test testEigenGPUNoFit_t had ERRORS
and more ...

RelVals

  • 11634.0: 11634.0_TTbar_14TeV+2021/step1_TTbar_14TeV+2021.log
  • 11634.7: 11634.7_TTbar_14TeV+2021_trackingMkFit/step1_TTbar_14TeV+2021_trackingMkFit.log
  • 11634.914: 11634.914_TTbar_14TeV+2021_DDDDB/step1_TTbar_14TeV+2021_DDDDB.log
and more ...

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31943/summary.html
COMMIT: f47a0d1
CMSSW: CMSSW_13_1_X_2023-04-12-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8258/31943/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31943/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31943/git-merge-result

Comparison Summary

Summary:

  • You potentially added 60 lines to the logs
  • Reco comparison results: 3767 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3459609
  • DQMHistoTests: Total failures: 4607
  • DQMHistoTests: Total nulls: 58
  • DQMHistoTests: Total successes: 3454922
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: found differences in 1 / 46 workflows
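As a quick consistency check on the summary above: assuming the success, failure, null and skipped categories are disjoint and together cover every compared histogram, they should add up to the total, and they do: 3454922 + 4607 + 58 + 22 = 3459609. A small sketch of that check, using a hypothetical DQMHistoSummary type that holds the numbers reported above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DQMHistoSummary:
    """Hypothetical container for the DQMHistoTests counts in a bot comparison summary."""
    total: int
    successes: int
    failures: int
    nulls: int
    skipped: int

    def is_consistent(self) -> bool:
        # Assumes the four categories are disjoint and cover every compared histogram.
        return self.successes + self.failures + self.nulls + self.skipped == self.total

    def failure_fraction(self) -> float:
        # Fraction of compared histograms reported as failures.
        return self.failures / self.total


# Counts copied from the el8_amd64_gcc11 comparison summary above (test 31943).
RUN_31943 = DQMHistoSummary(
    total=3459609, successes=3454922, failures=4607, nulls=58, skipped=22
)

if __name__ == "__main__":
    print(RUN_31943.is_consistent())  # True
    print(f"{RUN_31943.failure_fraction():.3%}")
```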

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31960/summary.html
COMMIT: f47a0d1
CMSSW: CMSSW_13_1_X_2023-04-12-2300/el8_ppc64le_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8258/31960/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31960/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31960/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testTFConstSessionCUDA had ERRORS
---> test testTFGraphLoadingCUDA had ERRORS
---> test testTFHelloWorldCUDA had ERRORS
---> test testONNXRuntime had ERRORS
and more ...

@smuzaffar
Contributor Author

please test for el8_ppc64le_gcc11

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31966/summary.html
COMMIT: f47a0d1
CMSSW: CMSSW_13_1_X_2023-04-12-2300/el8_ppc64le_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8258/31966/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31966/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31966/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test HcalPFCuts_unittest had ERRORS
---> test testTFConstSessionCUDA had ERRORS
---> test testTFGraphLoadingCUDA had ERRORS
---> test testTFHelloWorldCUDA had ERRORS
and more ...

@makortel
Contributor

Could we try what happens on x86 with GPU (since the failing PPC tests are run on a machine with GPUs)?

@makortel
Contributor

The crash stack trace is anyway quite peculiar

Thread 1 (Thread 0x10003125d890 (LWP 15351) "testTFHelloWorl"):
#0  0x000010002fb35510 in waitpid () from /lib64/libc.so.6
#1  0x000010002fa9b38c in do_system () from /lib64/libc.so.6
#2  0x000010002fa09d58 in system_compat () from /lib64/libpthread.so.0
#3  0x0000100001c1db18 in TUnixSystem::StackTrace() () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libCore.so
#4  0x0000100001c19964 in TUnixSystem::DispatchSignals(ESignals) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libCore.so
#5  0x0000100001c19a40 in SigHandler(ESignals) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libCore.so
#6  0x0000100001c12ce0 in sighandler(int) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libCore.so
#7  <signal handler called>
#8  0x000010002b428508 in EVP_get_digestbyname () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libtensorflow_framework.so.2
#9  0x000010002fd0abb0 in ssl_load_ciphers () from /lib64/libssl.so.1.1
#10 0x000010002fd0ffd0 in ossl_init_ssl_base_ossl_ () from /lib64/libssl.so.1.1
#11 0x000010002fa05174 in __pthread_once_slow () from /lib64/libpthread.so.0
#12 0x000010002ffd38b8 in CRYPTO_THREAD_run_once () from /lib64/libcrypto.so.1.1
#13 0x000010002fd10208 in OPENSSL_init_ssl () from /lib64/libssl.so.1.1
#14 0x00001000308af080 in ?? () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libcurl.so.4
#15 0x00001000308b6488 in ?? () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libcurl.so.4
#16 0x00001000308520d0 in curl_global_init () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libcurl.so.4
#17 0x000010003da169c8 in _GLOBAL__sub_I_scitokens_internal.cpp () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libSciTokens.so.0
#18 0x0000100000005220 in call_init (env=0x3fffd1b7b248, argv=0x3fffd1b7b238, argc=1, l=<optimized out>) at dl-init.c:72
#19 call_init (l=<optimized out>, argc=<optimized out>, argv=0x3fffd1b7b238, env=0x3fffd1b7b248) at dl-init.c:28
#20 0x000010000000537c in _dl_init (main_map=0x1003c1a4800, argc=<optimized out>, argv=0x3fffd1b7b238, env=0x3fffd1b7b248) at dl-init.c:119
#21 0x000010000000e26c in call_dl_init (closure=<optimized out>) at dl-open.c:483
#22 0x000010002fbd106c in _dl_catch_exception () from /lib64/libc.so.6
#23 0x000010000000e7d8 in dl_open_worker (a=<optimized out>) at dl-open.c:794
#24 dl_open_worker (a=<optimized out>) at dl-open.c:757
#25 0x000010002fbd0ff0 in _dl_catch_exception () from /lib64/libc.so.6
#26 0x000010000000ea84 in _dl_open (file=0x1003c179670 "/scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/lib/el8_ppc64le_gcc11/pluginFWCoreServicesPlugins.so", mode=<optimized out>, caller_dlopen=0x100000af3d20 <edmplugin::SharedLibrary::SharedLibrary(std::filesystem::__cxx11::path const&)+96>, nsid=-2, argc=<optimized out>, argv=0x3fffd1b7b238, env=0x3fffd1b7b248) at dl-open.c:875
#27 0x000010002f511108 in dlopen_doit () from /lib64/libdl.so.2
#28 0x000010002fbd0ff0 in _dl_catch_exception () from /lib64/libc.so.6
#29 0x000010002fbd1108 in _dl_catch_error () from /lib64/libc.so.6
#30 0x000010000001d528 in _rtld_catch_error (objname=<optimized out>, errstring=<optimized out>, mallocedp=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:260
#31 0x000010002f511afc in _dlerror_run () from /lib64/libdl.so.2
#32 0x000010002f5111dc in dlopen GLIBC_2.17 () from /lib64/libdl.so.2
#33 0x0000100000af3d20 in edmplugin::SharedLibrary::SharedLibrary(std::filesystem::__cxx11::path const&) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/lib/el8_ppc64le_gcc11/libFWCorePluginManager.so
#34 0x0000100000af01a4 in edmplugin::PluginManager::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/lib/el8_ppc64le_gcc11/libFWCorePluginManager.so
#35 0x0000100000ae81dc in edmplugin::PluginFactoryBase::findPMaker(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/lib/el8_ppc64le_gcc11/libFWCorePluginManager.so
#36 0x0000100000723e10 in edm::serviceregistry::ServicesManager::fillListOfMakers(std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/lib/el8_ppc64le_gcc11/libFWCoreServiceRegistry.so
#37 0x0000100000724d28 in edm::serviceregistry::ServicesManager::ServicesManager(std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/lib/el8_ppc64le_gcc11/libFWCoreServiceRegistry.so
#38 0x000010000071f660 in edm::ServiceRegistry::createSet(std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/lib/el8_ppc64le_gcc11/libFWCoreServiceRegistry.so
#39 0x000010000071f980 in edm::ServiceRegistry::createServicesFromConfig(std::unique_ptr<edm::ParameterSet, std::default_delete<edm::ParameterSet> >) () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/lib/el8_ppc64le_gcc11/libFWCoreServiceRegistry.so
#40 0x000000001000a2dc in testHelloWorldCUDA::test() ()
#41 0x000000001000d11c in std::_Function_handler<void (), std::_Bind<void (testHelloWorldCUDA::*(testHelloWorldCUDA*))()> >::_M_invoke(std::_Any_data const&) ()
#42 0x000000001000cdc4 in CppUnit::TestCaller<testHelloWorldCUDA>::runTest() ()
#43 0x000010002c4ead98 in CppUnit::TestCaseMethodFunctor::operator() (this=<optimized out>) at TestCase.cpp:32
#44 0x000010002c4de2ec in CppUnit::DefaultProtector::protect (this=0x10039b197e0, functor=..., context=...) at DefaultProtector.cpp:15
#45 0x000010002c4e7450 in CppUnit::ProtectorChain::ProtectFunctor::operator() (this=0x10039b17b10) at ProtectorChain.cpp:20
#46 CppUnit::ProtectorChain::protect (this=0x10039b19520, functor=..., context=...) at ProtectorChain.cpp:86
#47 0x000010002c4f4f8c in CppUnit::TestResult::protect (this=0x3fffd1b7acf8, functor=..., test=<optimized out>, shortDescription=...) at TestResult.cpp:182
#48 0x000010002c4eaa18 in CppUnit::TestCase::run (this=0x10039b10020, result=0x3fffd1b7acf8) at TestCase.cpp:91
#49 0x000010002c4eb4ec in CppUnit::TestComposite::doRunChildTests (this=0x10039b19dc0, controller=0x3fffd1b7acf8) at TestComposite.cpp:64
#50 0x000010002c4eb764 in CppUnit::TestComposite::run (this=0x10039b19dc0, result=0x3fffd1b7acf8) at TestComposite.cpp:23
#51 0x000010002c4eb4ec in CppUnit::TestComposite::doRunChildTests (this=0x10039b19d30, controller=0x3fffd1b7acf8) at TestComposite.cpp:64
#52 0x000010002c4eb764 in CppUnit::TestComposite::run (this=0x10039b19d30, result=0x3fffd1b7acf8) at TestComposite.cpp:23
#53 0x000010002c4f7ff8 in CppUnit::TestRunner::WrappingSuite::run (result=0x3fffd1b7acf8, this=0x10039b19ce0) at TestRunner.cpp:47
#54 CppUnit::TestRunner::WrappingSuite::run (this=0x10039b19ce0, result=0x3fffd1b7acf8) at TestRunner.cpp:44
#55 0x000010002c4f3e14 in CppUnit::TestResult::runTest (this=0x3fffd1b7acf8, test=0x10039b19ce0) at TestResult.cpp:149
#56 0x000010002c4f80cc in CppUnit::TestRunner::run (this=<optimized out>, controller=..., testPath=...) at TestRunner.cpp:96
#57 0x0000000010007f34 in main ()
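What makes this trace peculiar is frames #8 and #9. While dlopen() loads pluginFWCoreServicesPlugins.so, the static initializer of libSciTokens calls curl_global_init (frames #17 and #16), which initializes the system OpenSSL (frames #13 to #9); ssl_load_ciphers in /lib64/libssl.so.1.1 then calls EVP_get_digestbyname, and that call lands in libtensorflow_framework.so.2 rather than in the system libcrypto. One reading consistent with this, not confirmed in the thread, is that the TensorFlow library exports a symbol with that OpenSSL name which the dynamic linker binds ahead of libcrypto's. The sketch below scans a gdb backtrace for this pattern; the function name is hypothetical, and the OpenSSL name prefixes it checks are an assumption based on OpenSSL's naming convention.

```python
import re
from typing import List, Tuple

# One gdb frame that names the shared object it comes from, e.g.
#   "#8  0x000010002b428508 in EVP_get_digestbyname () from /path/libfoo.so.2"
# Frames without "from <library>" (signal frames, frames with source info) are ignored.
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+0x[0-9a-fA-F]+\s+in\s+(?P<func>[^\s(]+).*\sfrom\s+(?P<lib>\S+)\s*$"
)

# Assumption: OpenSSL's public symbols use these prefixes.
OPENSSL_PREFIXES = ("EVP_", "SSL_", "CRYPTO_", "OPENSSL_")
# Libraries where such symbols are expected to be defined.
OPENSSL_LIBS = ("libssl.so", "libcrypto.so")


def suspicious_openssl_frames(backtrace: str) -> List[Tuple[int, str, str]]:
    """Return (frame number, function, library file name) for OpenSSL-named
    functions that were resolved from a library other than libssl/libcrypto."""
    hits = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue
        func = match.group("func")
        lib_name = match.group("lib").rsplit("/", 1)[-1]
        if func.startswith(OPENSSL_PREFIXES) and not lib_name.startswith(OPENSSL_LIBS):
            hits.append((int(match.group("num")), func, lib_name))
    return hits


if __name__ == "__main__":
    # Frames #8, #9, #12 and #13 copied from the trace above.
    excerpt = "\n".join([
        "#8  0x000010002b428508 in EVP_get_digestbyname () from /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_1_X_2023-04-12-2300/external/el8_ppc64le_gcc11/lib/libtensorflow_framework.so.2",
        "#9  0x000010002fd0abb0 in ssl_load_ciphers () from /lib64/libssl.so.1.1",
        "#12 0x000010002ffd38b8 in CRYPTO_THREAD_run_once () from /lib64/libcrypto.so.1.1",
        "#13 0x000010002fd10208 in OPENSSL_init_ssl () from /lib64/libssl.so.1.1",
    ])
    print(suspicious_openssl_frames(excerpt))
    # [(8, 'EVP_get_digestbyname', 'libtensorflow_framework.so.2')]
```

A scan like this only narrows down where to look; confirming which libraries actually define the symbol would still need a tool such as nm -D run on the libraries involved.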

@valsdav
Contributor

valsdav commented Apr 13, 2023

Could we try what happens on x86 with GPU (since the failing PPC tests are run on a machine with GPUs)?

I'm testing on hlttest-dell-01 with el8_amd64_gcc11

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31968/summary.html
COMMIT: f47a0d1
CMSSW: CMSSW_13_1_X_2023-04-13-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8258/31968/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31968/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39ea0b/31968/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test HcalPFCuts_unittest had ERRORS

Comparison Summary

Summary:

  • You potentially added 53 lines to the logs
  • Reco comparison results: 12 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3459609
  • DQMHistoTests: Total failures: 100
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3459487
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

smuzaffar changed the base branch from IB/CMSSW_13_1_X/master to IB/CMSSW_13_2_X/master on May 4, 2023 07:41
@iarspider
Contributor

@smuzaffar could you please merge master and retrigger the tests?

@smuzaffar
Contributor Author

Closing this; we are looking into moving to TF 2.12.

smuzaffar closed this on Jun 22, 2023
smuzaffar deleted the tf2.11 branch on June 22, 2023 18:54

7 participants