art exits with segmentation fault (sigsegv) while claiming to exit with signal 1 when xrootd fails on secondary input #112
Comments
Comment by @knoepfel on 2021-09-13 21:28:44 Dave, can you provide setup instructions for a Mu2e environment in which I can use the [...]
Comment by @knoepfel on 2021-09-14 21:01:37 The problem has been reproduced. Initial analysis indicates faulty framework scheduling (in the context of an exception) of calling [...] Dave, do you know where I can locate the [...]
Comment by @knoepfel on 2021-09-15 20:28:42 I am able to reproduce this issue with the attached trimmed FHiCL file, which configures art to use only 1 schedule and 1 thread.
Comment by @knoepfel on 2021-09-15 21:31:07 Contrary to my initial analysis, further investigation indicates this is not a problem with art's scheduling of calling [...] Because this is no longer an art issue, I am recategorizing this as a Support issue.
Comment by @kutschke on 2021-09-23 20:34:42 Dave: when you did this, were your Kerberos tickets and VOMS proxy both still alive? See https://mu2ewiki.fnal.gov/wiki/DataTransfer#xrootd for details. I can reproduce the issue if I don't have a VOMS proxy, but it goes away if I get a proxy. (Autocorrect insists that voms is spelled moms or vows; it took 3 tries to convince it otherwise.)
Comment by @goodenou on 2021-09-23 21:05:45 Thank you, Rob, for noticing the file access using xrootd. That was indeed the issue that was producing the exception shown above. However, after switching to NFS for file access and running the non-MT job, I do sometimes get a seg fault. So, we have an intermittent problem on our hands. I will try running with the MT module, but I assume that the issue will be present there as well.
Comment by @kutschke on 2021-09-27 20:36:43 Dave replied to this off the ticket; I am adding his reply here. The bottom line is that I completely misunderstood: the problem is not the initial error. The problem is that art's return code is confusing. Kyle, did you understand that? I missed it. Dave's reply: Hi Rob, yes the problem accessing the file goes away with a voms proxy. Running without a proxy was just a way of precipitating an error in the 2ndary input stream, as noted in the ticket
Comment by @knoepfel on 2021-09-27 21:07:03 Yes, I understand the issue--art reports completing with status 1 due to an exception throw, and a segfault then occurs, resulting in a "status" that is different from 1.

The problem is that the segmentation violation occurs at static destruction time--i.e. 'int main()' has already completed with a return value of 1, but because of inadequate cleanup of thread-local statics in G4 (in particular, the case where the Mu2eG4MT module processes no events but still sets up the G4 MT infrastructure), you get the segfault after the program (proper) has completed. There's not a whole lot we can do there in terms of the return code--art has already completed by that time. The thread-local statics come from G4 code (through Mu2eG4MT), which is why I changed this from a bug report to a support request.

Of course, this is already a broken workflow (due to the exception), and the segfault-vs-return-code issue is primarily a bookkeeping problem--annoying, yes, but fixing this will not turn a broken workflow into a functional one.

In principle, any error during static destruction can conflict with the return code of 'int main()'. This is an argument for not printing out the return code at the end of a job and just relying on the actual return code of the executable. That's a discussion with the stakeholders, though.

Bottom line: the segfault is causing the problem and it should be fixed. The Mu2eG4MT module and its G4 connections need to be made robust against situations where an exception throw can lead to the module processing zero events.
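The mismatch described above can be reproduced in isolation. The following is a minimal POSIX sketch, not art or Geant4 code: `CrashesOnDestruction`, `ChildOutcome`, and `run_child` are hypothetical names invented for illustration. A forked child requests exit status 1 via `std::exit(1)`, which runs static destructors just as returning from `main` would; a static object whose destructor raises SIGSEGV then replaces that status with a signal termination, which is what the parent observes.

```cpp
#include <cassert>
#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a library object with faulty teardown.
struct CrashesOnDestruction {
  ~CrashesOnDestruction() {
    std::signal(SIGSEGV, SIG_DFL);  // make sure the default action (terminate) applies
    std::raise(SIGSEGV);            // simulate a segfault during static destruction
  }
};

// What the parent (e.g. a shell or batch system) actually sees.
struct ChildOutcome {
  bool exited_normally;  // true: exit_code is valid; false: signal_number is valid
  int exit_code;
  int signal_number;
};

ChildOutcome run_child(bool crash_during_static_destruction) {
  std::fflush(nullptr);  // avoid duplicated buffered output in the child
  pid_t pid = fork();
  if (pid < 0) {
    throw std::runtime_error("fork failed");
  }
  if (pid == 0) {
    if (crash_during_static_destruction) {
      // Constructed only in the child; destroyed when std::exit runs.
      static CrashesOnDestruction guard;
      (void)guard;
    }
    std::exit(1);  // "the program proper" finishes with status 1
  }
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) {
    throw std::runtime_error("waitpid failed");
  }
  if (WIFEXITED(status)) {
    return ChildOutcome{true, WEXITSTATUS(status), 0};
  }
  assert(WIFSIGNALED(status));
  return ChildOutcome{false, 0, WTERMSIG(status)};
}
```

Calling `run_child(false)` yields a normal exit with code 1, while `run_child(true)` yields termination by SIGSEGV even though the child asked for status 1. Any message printed before static destruction claiming "status 1" would therefore disagree with what the parent process sees.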
This problem has now been resolved in the Mu2e Offline code. See PR Mu2e/Offline#700 for full details. At a very basic level, the problem occurred because, when a failure shuts the job down before the G4MT code calls produce for the first time, the main thread, rather than the Master Run Manager thread, calls the destructor for all of the geometry objects. There is some instance data associated with each G4VPlacement object that is accessible only by the Master Run Manager thread, not by the main thread.
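Why the calling thread matters can be shown with a toy example. This is a sketch under stated assumptions and does not reproduce Geant4 internals: `Volume`, `per_thread_instance_data`, `Visibility`, and `check_visibility` are hypothetical names. It assumes only that some per-object data lives in thread-local storage of the thread that created the object, so another thread touching the object cannot find that data.

```cpp
#include <cassert>
#include <thread>
#include <unordered_map>

// Hypothetical per-thread side table: each thread has its own copy.
thread_local std::unordered_map<int, int> per_thread_instance_data;

// Hypothetical geometry-like object whose instance data is stored
// in the creating thread's thread-local table.
struct Volume {
  int id;
  explicit Volume(int id_) : id(id_) { per_thread_instance_data[id] = 42; }

  // True only when called on a thread whose table holds this object's data.
  bool instance_data_visible() const {
    return per_thread_instance_data.count(id) != 0;
  }
};

struct Visibility {
  bool from_owner;  // checked on the thread that created the object
  bool from_other;  // checked on a different thread
};

Visibility check_visibility() {
  Visibility result{false, false};
  Volume* volume = nullptr;

  // The "owner" thread plays the role of the thread that set up the geometry.
  std::thread owner([&] {
    volume = new Volume(7);
    result.from_owner = volume->instance_data_visible();
  });
  owner.join();
  assert(volume != nullptr);

  // The calling thread never registered data for this object.
  result.from_other = volume->instance_data_visible();
  delete volume;
  return result;
}
```

Here the lookup succeeds on the owning thread and fails on the calling thread. A destructor that dereferenced such data without checking would be at risk of exactly the kind of crash described above, which is why teardown needs to happen on the owning thread or tolerate missing data.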
Comment by @knoepfel on 2022-02-08 Thanks, @goodenou. We will then close this issue.
Great, thanks!
This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/26250 (FNAL account required)
Originally created by @brownd1978 on 2021-09-09 16:04:26
Mu2e observes that art exits with sigsegv when there is a problem accessing a secondary input stream file via xrootd. While the xrootd problem has nothing to do with art, the erroneous return code makes it difficult to distinguish IO problems from program bugs.

One can reproduce the problem using the SimJob MDC2020k musing, running /mu2e/app/users/brownd/MDC2020j/debug2.fcl on the mu2ebuild01 machine. Since the secondary input file is specified to be accessed via xrootd, and mu2ebuild01 doesn't have xrootd access, the execution fails as listed below. However, while art claims to exit with status 1, the process actually exits with status sigsegv.

While the setup of running an xrootd job on mu2ebuild01 is artificial, we believe we are seeing the same effect in grid jobs when access to the secondary file via xrootd fails during the job (presumably due to a network issue). The symptom is that some jobs return status sigsegv, but run to completion when resubmitted or run interactively. We also searched for memory errors in the job as an alternate explanation for the irreproducible behavior but didn't see any.

Note that this job is running in multi-threaded mode, with 2 threads. Note too that this was observed running art v03_09_03, but that wasn't an option in the pulldown menu.
To set up the environment (additional information from Dave)