
Online HCAL DQM adding a new application #13805

Merged
merged 19 commits on May 11, 2016

Conversation

vkhristenko
Contributor

This is a test PR based on #13565
to test the addition of a new application to the Online HCAL DQM Workflow

VK

@cmsbuild
Contributor

A new Pull Request was created by @vkhristenko (Viktor Khristenko) for CMSSW_8_1_X.

It involves the following packages:

DQM/HcalCommon
DQM/HcalTasks
DQM/Integration
DQMOffline/Configuration

@cmsbuild, @vanbesien, @deguio, @davidlange6 can you please review it and eventually sign? Thanks.
@threus, @batinkov, @rociovilar this is something you requested to watch as well.
@slava77, @Degano, @smuzaffar you are the release managers for this.

cms-bot commands are listed here #13028

@deguio
Contributor

deguio commented Mar 23, 2016

please test

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/12015/console

@deguio
Contributor

deguio commented Mar 23, 2016

please test

@deguio
Contributor

deguio commented Mar 23, 2016

please submit to 80x as well

@cmsbuild
Contributor

Pull request #13805 was updated. @cmsbuild, @vanbesien, @deguio, @davidlange6 can you please check and sign again.

@cmsbuild
Contributor

-1
Tested at: 91913d5
When I ran the RelVals I found an error in the following workflows:
134.911 step1

DAS Error

1000.0 step1

DAS Error

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-13805/12015/summary.html

@deguio
Contributor

deguio commented Mar 23, 2016

please test

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/12022/console

@deguio
Contributor

deguio commented Apr 26, 2016

please test

@deguio
Contributor

deguio commented Apr 26, 2016

+1

@cmsbuild
Contributor

cmsbuild commented Apr 26, 2016

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/12648/console

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_1_X IBs after it passes the integration tests. This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @Degano, @smuzaffar

@deguio
Contributor

deguio commented May 4, 2016

@davidlange6
could we have this one merged please?
thanks,
F.

@dmitrijus @vanbesien

@abdoulline

abdoulline commented May 9, 2016

I second Federico's request to David @davidlange6
Then we could hopefully proceed with (back-ported to 80X) #13813

@davidlange6
Contributor

+1

@cmsbuild cmsbuild merged commit d89f2c1 into cms-sw:CMSSW_8_1_X May 11, 2016
@Dr15Jones
Contributor

It looks like this pull request is causing the crashes in the IB Release Validation Jobs: https://cms-sw.github.io/relvalLogDetail.html#slc6_amd64_gcc493;CMSSW_8_1_X_2016-05-11-2300

The traceback from the crash is

#5  0x00007f93a367bf7d in hcaldqm::RawRunSummary::endJob(DQMStore::IBooker&, DQMStore::IGetter&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_1_X_2016-05-11-2300/lib/slc6_amd64_gcc493/libDQMHcalTasks.so
#6  0x00007f93a36f6aff in HcalOfflineHarvesting::_dqmEndJob(DQMStore::IBooker&, DQMStore::IGetter&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_1_X_2016-05-11-2300/lib/slc6_amd64_gcc493/pluginDQMHcalTasksAuto.so
#7  0x00007f93f97ac0d7 in DQMEDHarvester::endJob() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_1_X_2016-05-11-2300/lib/slc6_amd64_gcc493/libDQMServicesCore.so
#8  0x00007f9400904df8 in edm::Worker::endJob() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_1_X_2016-05-11-2300/lib/slc6_amd64_gcc493/libFWCoreFramework.so
#9  0x00007f9400906790 in edm::WorkerManager::endJob(edm::ExceptionCollector&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_1_X_2016-05-11-2300/lib/slc6_amd64_gcc493/libFWCoreFramework.so
#10 0x00007f94008bf755 in edm::Schedule::endJob(edm::ExceptionCollector&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_1_X_2016-05-11-2300/lib/slc6_amd64_gcc493/libFWCoreFramework.so
#11 0x00007f9400836f91 in edm::EventProcessor::endJob() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_1_X_2016-05-11-2300/lib/slc6_amd64_gcc493/libFWCoreFramework.so
#12 0x000000000040ceea in main::{lambda()#1}::operator()() const ()
#13 0x000000000040b406 in main ()

@dmitrijus
Contributor

@vkhristenko can you please take care of this?

@vkhristenko
Contributor Author

vkhristenko commented May 12, 2016

@deguio , @dmitrijus , @Dr15Jones

There are 2 separate issues here.

  • Workflows 134.7** and 134.7**: already fixed and pushed; one statement should have been included...
  • Workflow 4.77
    This workflow contains harvesting over 2 runs; you can see that from the logs of step3, for instance, where the DQMEDAnalyzers are run, and harvesting is then applied to that output in a similar manner. Even though I misread https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkOneModuleInterface slightly... it was guaranteed that the HARVESTING step is always done over only 1 run for the lifetime of the harvester [construction of the harvester, destruction of the harvester]. I'm not sure how that is imposed, because from the description of edm::one you can have harvesting done over many runs (if you have WatchRuns), just not in parallel...

Since this was taken as a guarantee, all of my harvesters are designed to do harvesting over only one run, from construction of the object until destruction! I've pushed an update as a temporary solution, and it does pass 4.77: the per-run vectors are simply cleared at each beginRun transition...

my apologies

VK

@dmitrijus
Contributor

I saw the fix for 8_0_X; please create another PR for 8_1_X, based on the IB containing this one.
I will accept it and hopefully make the IBs valid again.

@deguio
Contributor

deguio commented May 13, 2016

let me try to clarify:

  • in prod at the moment there are no WFs that run the harvesting step on multiple runs (neither in relvals nor in prompt)
  • in the future it would be nice to have this possibility supported (multi-run harvesting), so I would propose to extend the use case to cover multi-run processing as well
  • the skims are a special case due to the way in which they are processed in the RECO-DQM step:
    • each file could contain multiple runs
    • in prod, at the harvesting step, N separate jobs are potentially run on the same file, where N is the number of runs contained in that file
    • the LS from a given run are selected at the source, where a filter is configured by WMAgent
    • even if only single runs are processed in each harvesting job, endRun/beginRun transitions could be triggered anyway during the processing.

In summary, a design which assumes that the beginRun transition is called only once per harvesting job is not safe.

I'm including a few experts who might want to comment/correct me if I am wrong.
@cerminar @franzoni @fabozzi

@Dr15Jones
Contributor

From the framework perspective, modules should always be able to handle multiple Runs or LuminosityBlocks.
