
XrdAdaptor: Fix race condition leading to a crash when doing Prepare … #32763

Closed
wants to merge 2 commits

Conversation

esindril

PR description:

This pull request fixes a crash in the XrdAdaptor class where the async Prepare request can lead to a SEGV if the corresponding XrdCl::FileSystem object is destroyed before the response is processed. This can easily be reproduced in every version of CMSSW.

This race condition is triggered when the response to the Prepare request is slow and the memory occupied by the XrdCl::FileSystem object is reused, leading to a crash when the XrdCl::FileSystem destructor is called. The XrdCl::FileSystem object goes out of scope in the stagein method and is therefore destroyed, but the Prepare response handler still needs a valid XrdCl::FileSystem object.

An example of a fully reproducible crash:

[esindril@lxplus790 CMSSW_11_2_0]> cmsRun miniaodsim_9xx_1.py 

Thu Jan 28 09:50:02 CET 2021
Thread 7 (Thread 0x7f93ef40b700 (LWP 14128)):
#0  0x00007f941e4deb3b in do_futex_wait.constprop () from /usr/lib64/libpthread.so.0
#1  0x00007f941e4debcf in __new_sem_wait_slow.constprop.0 () from /usr/lib64/libpthread.so.0
#2  0x00007f941e4dec6b in sem_wait@@GLIBC_2.2.5 () from /usr/lib64/libpthread.so.0
#3  0x00007f9415ffd816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f9415ffd8c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007f941e4d8ea5 in start_thread () from /usr/lib64/libpthread.so.0
#6  0x00007f941e20096d in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x7f93efc0c700 (LWP 14127)):
#0  0x00007f941e1f5c3d in poll () from /usr/lib64/libc.so.6
#1  0x00007f9415b32007 in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007f9415b3271c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007f9415b33444 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f941e138387 in raise () from /usr/lib64/libc.so.6
#6  0x00007f941e139a78 in abort () from /usr/lib64/libc.so.6
#7  0x00007f941eaa4683 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#8  0x00007f941eab00a6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007f941eab0111 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#10 0x00007f941eab0365 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7f941ebd4f00 <typeinfo for std::bad_alloc>, dest=0x7f941eaae830 <std::bad_alloc::~bad_alloc()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:95
#11 0x00007f941eaa6b69 in std::__throw_bad_alloc () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_1_0_pre6-slc7_amd64_gcc900/build/CMSSW_11_1_0_pre6-build/BUILD/slc7_amd64_gcc900/external/gcc/9.3.0/gcc-9.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/exception.h:63
#12 0x00007f941f799e2e in handleOOM (size=216172799293652993, nothrow=<optimized out>) at src/jemalloc_cpp.cpp:70
#13 0x00007f9415fa0e3f in void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .constprop.0] () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#14 0x00007f9415fa7deb in XrdCl::FileSystem::AssignLoadBalancer(XrdCl::URL const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#15 0x00007f9415fad7ce in XrdCl::AssignLBHandler::HandleResponseWithHosts(XrdCl::XRootDStatus*, XrdCl::AnyObject*, std::vector<XrdCl::HostInfo, std::allocator<XrdCl::HostInfo> >*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#16 0x00007f9415fb26f9 in XrdCl::XRootDMsgHandler::HandleResponse() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#17 0x00007f9415fb6b6f in XrdCl::XRootDMsgHandler::Process(XrdCl::Message*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#18 0x00007f9415f9549a in XrdCl::Stream::HandleIncMsgJob::Run(void*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#19 0x00007f9415ffd86d in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#20 0x00007f9415ffd8c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_2_0/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#21 0x00007f941e4d8ea5 in start_thread () from /usr/lib64/libpthread.so.0
#22 0x00007f941e20096d in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x7f93f040d700 (LWP 14126)):

The crash is also easier to reproduce if the Prepare request needs to be redirected to a different server, as is the case with EOS, or if the Prepare response is slow.

Given that one needs to wait for the Prepare response in the stagein part of the code, there is no reason to use the async Prepare interface from XrdCl, so I've modified it to use the synchronous Prepare call.

PR validation:

A simple rerun of the test confirms that the fix works properly.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32763/20919

  • This PR adds an extra 12KB to repository

Code check has found code style and quality issues which can be resolved by applying the following patch(es)

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32763/20920

  • This PR adds an extra 12KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @esindril (Elvin Sindrilaru) for master.

It involves the following packages:

Utilities/XrdAdaptor

@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks.
@makortel, @wddgit this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor

@cmsbuild, please test

@makortel
Contributor

@Dr15Jones @bbockelm Could you take a look too?

@Dr15Jones
Contributor

I can't make any reasonable comments on this since I have no knowledge of how xrootd handles its communication.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5894ef/12586/summary.html
COMMIT: 1b23289
CMSSW: CMSSW_11_3_X_2021-01-28-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/32763/12586/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2716596
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2716573
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 36 files compared)
  • Checked 156 log files, 37 edm output root files, 37 DQM output files

@esindril
Author

esindril commented Aug 3, 2021

@bbockelm I've pushed a new commit to use a smart pointer for handling the raw response buffer.

Logically, the file system object for which an async response handler is registered should not be deleted until the async response handler is called (i.e. the response arrives). This is the gist of the problem here, and the patch fixes this race condition.

@cmsbuild
Contributor

cmsbuild commented Aug 3, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32763/24391

  • This PR adds an extra 16KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Contributor

cmsbuild commented Aug 3, 2021

Pull request #32763 was updated. @makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please check and sign again.

@perrotta
Contributor

perrotta commented Aug 3, 2021

please test

@cmsbuild
Contributor

cmsbuild commented Aug 3, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5894ef/17459/summary.html
COMMIT: 797cdbf
CMSSW: CMSSW_12_1_X_2021-08-02-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/32763/17459/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1299 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 2998564
  • DQMHistoTests: Total failures: 3672
  • DQMHistoTests: Total nulls: 19
  • DQMHistoTests: Total successes: 2994851
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -45.703 KiB( 38 files compared)
  • DQMHistoSizes: changed ( 140.53 ): -44.531 KiB Hcal/DigiRunHarvesting
  • DQMHistoSizes: changed ( 140.53 ): -1.172 KiB RPC/DCSInfo
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@osschar
Contributor

osschar commented Aug 6, 2021

I spotted this problem while working on bumping xrootd to 5.3.x and fixed it in basically the same way (without removing the handler class and the unique_ptr); see #34700.

I propose to close this one and go on with the other one ... I will apply the same additional changes that were done here: removal of the handler class and the data member.

The unique_ptr change in this PR is not correct ... one would need to do:

unique_ptr<XrdCl::Buffer> buff;
Prepare(..., buff.get());

and then one doesn't need to do anything later ... I can fix it in the other PR if people agree.

@esindril
Author

esindril commented Aug 6, 2021

@osschar your suggested code will not compile, as you are trying to bind the lvalue reference expected by the Prepare API to the rvalue returned by get() on the unique pointer. This is why you need the extra code ... I think it's correct.

@osschar
Contributor

osschar commented Aug 7, 2021

Oh, sorry, there is no pointer& variant of get() ... then I don't know what Brian meant. Isn't what's in there now equivalent to just calling delete?

If all this is correct, one would have to catch the exceptions, delete the buffer and rethrow ... unless Prepare() also had a signature with a unique_ptr argument (which would be possible with 5.3, as it uses C++14).

Is it even possible for Prepare() to throw an exception? It wouldn't be a major leak in any case :)

@osschar
Contributor

osschar commented Aug 18, 2021

As discussed in the Aug 17 core meeting and later confirmed by Brian, an async call that discards potential errors is the correct thing to do for Prepare, since its most important effect is to warm up redirector caches (and potentially XCaches in the future).

This will be implemented properly in #34700 ... please close this PR.

@makortel
Contributor

-1

Thanks.

@perrotta
Contributor

perrotta commented Nov 4, 2021

Superseded by the already merged #34700

@perrotta perrotta closed this Nov 4, 2021