Improve Framework behavior after exceptions in begin/end transitions (Job, Stream, ProcessBlock) #45434

Merged

wddgit
Contributor

@wddgit wddgit commented Jul 11, 2024

PR description:

Improve the behavior of the Framework after exceptions in beginJob, endJob, beginStream, endStream, beginProcessBlock and endProcessBlock transitions. This is the fourth and final PR in a series of PRs modifying the behavior of the Framework after exceptions so that it more consistently handles exceptions in all the begin/end transitions. The first PR handled stream lumi exceptions (PR #44624). The second PR handled global lumi exceptions (PR #44840). The third PR handled run transitions (PR #45017). The comments at the head of the first PR state the design for the new behavior we are implementing.

The intent is that nothing in the output will change if there are not any exceptions. In some cases, ordering of operations may change where that ordering is not supposed to matter. There are some minor differences related to signals.

  • PreBeginStream, PostBeginStream, PreEndStream, and PostEndStream signals are newly added. Probably they were inadvertently omitted before.
  • The PreBeginJob, PostBeginJob, PreEndJob, and PostEndJob signals enclose less than before, to be more like the run and lumi signals. Some input source, looper, and subprocess signals are no longer enclosed between the higher-level Pre and Post signals, but the same module-level signals are still enclosed.
  • There are functions used by the looper called replaceModule (rarely or possibly never actually used). The module level signals are not enclosed by transition level signals when called from the replaceModule function. beginStream and beginJob are the module level transitions that are run there when a module is replaced.
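
To make the word "enclosed" in the list above concrete, here is a minimal standalone sketch. It is not Framework code: `SignalLog`, `Module`, `runModule`, `runTransition`, and `runReplacedModule` are hypothetical names invented for illustration, and the signal strings only mimic the naming pattern used above. Two behaviors are assumptions of the sketch rather than statements about the Framework: that a Post signal is still emitted when a module throws, and that remaining modules are skipped after the first exception. The actual design is described in the comments at the head of PR #44624.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a signal registry: records signal names in emission order.
using SignalLog = std::vector<std::string>;

// Hypothetical module: a label plus the work it does in one transition.
struct Module {
  std::string label;
  std::function<void()> transition;
};

// Run one module's transition between module-level Pre/Post signals.
// Assumption of this sketch: the Post signal is emitted even if the module throws.
void runModule(const Module& m, const std::string& name, SignalLog& log) {
  log.push_back("PreModule" + name + ":" + m.label);
  std::exception_ptr err;
  try {
    m.transition();
  } catch (...) {
    err = std::current_exception();
  }
  log.push_back("PostModule" + name + ":" + m.label);
  if (err) {
    std::rethrow_exception(err);
  }
}

// Run a transition over all modules, enclosed by transition-level Pre/Post signals.
// Assumption of this sketch: modules after the first failure are skipped,
// the Post signal is still emitted, and the exception is then rethrown.
void runTransition(const std::vector<Module>& modules, const std::string& name, SignalLog& log) {
  log.push_back("Pre" + name);
  std::exception_ptr err;
  try {
    for (const auto& m : modules) {
      runModule(m, name, log);
    }
  } catch (...) {
    err = std::current_exception();
  }
  log.push_back("Post" + name);
  if (err) {
    std::rethrow_exception(err);
  }
}

// replaceModule-like path: only module-level signals are emitted, with no
// enclosing transition-level signals. The beginJob-then-beginStream order
// here is an assumption of the sketch.
void runReplacedModule(const Module& m, SignalLog& log) {
  runModule(m, "BeginJob", log);
  runModule(m, "BeginStream", log);
}
```

Calling `runTransition(modules, "BeginStream", log)` brackets every module-level signal between `PreBeginStream` and `PostBeginStream`, whereas `runReplacedModule` emits the module-level signals alone, which is the distinction the third bullet draws.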

This work was motivated by discussions related to Issues #43831 and #42501.

PR validation:

An existing unit test covering exceptions in different transitions is extended to cover the most salient cases. Additional manual testing of many different cases was also done. Existing unit tests pass.

@cmsbuild
Contributor

cmsbuild commented Jul 11, 2024

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45434/40882

@cmsbuild
Contributor

A new Pull Request was created by @wddgit for master.

It involves the following packages:

  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • FWCore/ServiceRegistry (core)
  • FWCore/Services (core)
  • FWCore/TestProcessor (core)
  • FWCore/Utilities (core)
  • IOMC/RandomEngine (core)
  • Mixing/Base (simulation)

@Dr15Jones, @civanch, @cmsbuild, @makortel, @mdhildreth, @smuzaffar can you please review it and eventually sign? Thanks.
@fabiocos, @felicepantaleo, @fwyzard, @makortel, @missirol this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@wddgit
Contributor Author

wddgit commented Jul 11, 2024

enable threading

@wddgit
Contributor Author

wddgit commented Jul 11, 2024

please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals RelVals-THREADING
Size: This PR adds an extra 240KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-158789/40348/summary.html
COMMIT: 0a92808
CMSSW: CMSSW_14_1_X_2024-07-11-1100/el8_amd64_gcc12
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45434/40348/install.sh to create a dev area with all the needed externals and cmssw changes.

  • DAS Queries: The DAS query tests failed, see the summary page for details.

RelVals

  • 29634.911: A fatal system signal has occurred: segmentation violation

RelVals-THREADING

  • 29696.0: DAS Error
  • 29700.0: DAS Error

@makortel
Contributor

  • 29634.911: A fatal system signal has occurred: segmentation violation

This failure is a known issue: #41927

@wddgit
Contributor Author

wddgit commented Jul 12, 2024

please test

The failures seem unrelated to the changes in the PR and possibly not reproducible. Try again and see if we can get them to pass.

@wddgit
Contributor Author

wddgit commented Jul 12, 2024

I added some TWIKI documentation related to this PR here:

https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideExceptionsInBeginEndTransitions

And a link to the new TWIKI page here:

https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideFrameWork#Error_handling

@cmsbuild
Contributor

-1

Failed Tests: UnitTests RelVals-THREADING
Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-158789/40363/summary.html
COMMIT: 0a92808
CMSSW: CMSSW_14_1_X_2024-07-11-2300/el8_amd64_gcc12
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45434/40363/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test test-particleLevel_fromMiniAod had ERRORS

RelVals-THREADING

  • 29696.0: DAS Error
  • 29700.0: DAS Error

Comparison Summary

There are some workflows for which there are errors in the baseline:
29634.911 step 2
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

  • You potentially removed 14 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 47
  • DQMHistoTests: Total histograms compared: 3246798
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3246775
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 46 files compared)
  • Checked 199 log files, 163 edm output root files, 47 DQM output files
  • TriggerResults: no differences found

@civanch
Contributor

civanch commented Jul 13, 2024

please test

let us test once more

@wddgit
Contributor Author

wddgit commented Jul 19, 2024

please test

I believe all the review comments received so far are dealt with now. Let me know if there are more.

I just recently noticed there is a link in the PR to compare after a force push is made, so I went ahead and just squashed everything.

@cmsbuild
Contributor

-1

Failed Tests: RelVals-THREADING
Size: This PR adds an extra 92KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-158789/40501/summary.html
COMMIT: af8027d
CMSSW: CMSSW_14_1_X_2024-07-19-1100/el8_amd64_gcc12
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/45434/40501/install.sh to create a dev area with all the needed externals and cmssw changes.

  • DAS Queries: The DAS query tests failed, see the summary page for details.

RelVals-THREADING

  • 29696.0: DAS Error
  • 29700.0: DAS Error

Comparison Summary

Summary:

@wddgit
Contributor Author

wddgit commented Jul 22, 2024

Same as before. All tests pass except for 2 DAS Errors in the additional tests added when threading tests are enabled. There is nothing in this PR that could cause a DAS Error and these failures are also occurring in the IBs. The failures are unrelated to the PR.

@makortel
Contributor

Comparison differences show #39803

@makortel
Contributor

+core

@makortel
Contributor

@cmsbuild, ignore tests-rejected ib-failure

@civanch
Contributor

civanch commented Jul 23, 2024

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @rappoccio, @antoniovilela, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit fd19e2b into cms-sw:master Jul 26, 2024
11 of 12 checks passed
@wddgit wddgit deleted the exceptionBehaviorBeginEndJobAndStream branch October 28, 2024 15:29