Replay test with CMSSW_11_3_1 #4574
Conversation
run replay please
Container Tests
Host name : vocms047.cern.ch
Hi @qliphy, at the moment we are experiencing an issue with an Oracle account after the recent upgrade, and we are unable to run replays. It should be fixed during the day, at which point I'll start the replay. I'm sorry for the delay.
I relaunched the replay and it's running right now. I'll let you know how it goes.
!!! Couldn't read commit file !!! |
Hi @germanfgv, has the replay test failed?
@silviodonato Hi Silvio. The test was interrupted due to an access issue. Also, yesterday I manually started this exact configuration file (with your proposed GTs) on another machine. That test is about to finish.
Here you can find a detailed description of the error: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223.html
Hi @germanfgv, can you confirm the GTs used in your manual test? The HN post you linked here is the wrong (old) one. The correct (new) post is: https://hypernews.cern.ch/HyperNews/CMS/get/calibrations/4408.html
Thanks @silviodonato! The GT seems to be the correct one!
Yes @francescobrivio, we used the new GTs. I provided the wrong link in my comment; I apologize. The HN thread has the correct link.
alca: @francescobrivio @christopheralanwest @malbouis @pohsun @yuanchao @tlampen Reproducing the replay test, I got this failure related to
More details at https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223/1/1.html
I am running a few tests:
I'm now running all the possible combinations of options 1 and 2... but any input from ECAL or DQM is welcome!
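For illustration only, that kind of combination scan can be enumerated with `itertools.product`; the option values below are placeholders, not the actual replay configuration flags:

```python
from itertools import product

# Hypothetical option groups: the real "options 1 and 2" from the
# thread are not spelled out, so these names are stand-ins.
OPTION_1 = ["old_GT", "new_GT"]
OPTION_2 = ["with_EcalDQM", "without_EcalDQM"]

# Enumerate every combination so each configuration variant gets tested.
for opt1, opt2 in product(OPTION_1, OPTION_2):
    print(f"testing combination: {opt1} + {opt2}")
```

With two values per option this yields four test configurations.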
BTW, just to give everyone a full picture, after the stack trace (related to ECAL I think) I see:
Thanks @francescobrivio. Indeed I've just tried to remove the EcalDQMonitorClient overwriting
@silviodonato I didn't touch that file in the past 3 years; @pieterdavid touched it fairly recently, though.
@silviodonato @francescobrivio
Thanks @mmusich! Let me also point to the PR cms-sw/cmssw#32677 to keep track of the issue.
Here is the recipe:
My replay, using the GTs in this PR, finished with a similar error. More details can be found in HN: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223/1/1/1.html
Hi German, thanks for the additional info!
With the recipe from #4574 (comment) I could reproduce the problem in #4574 (comment); I'm investigating.
placing some random
I tried to remove
and I've got
@Dr15Jones @smuzaffar @makortel do you have any idea about this?
@jfernan2 @kmaeshima @rvenditti @andrius-k @ErnestaP @ahmad3213
I do not know if it is normal, since we have not monitored memory at production level for Cosmics Harvesting. All I can say is that, despite being Cosmics, almost all DPGs/POGs (except Trigger, due to the FakeHLT menu) are involved in this Harvesting process. We can do a check with igprof to see if there is any module out of control.
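As a lightweight complement to a full igprof run, the peak memory of a process can be sampled from Python's standard `resource` module; this is only an illustrative sketch for a quick sanity check, not part of the Tier-0 or DQM tooling:

```python
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident set size in MB.

    Quick check for runaway memory use; a real per-module
    investigation would use igprof, as suggested in the thread.
    ru_maxrss is reported in kB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024

print(f"peak RSS: {peak_rss_mb():.1f} MB")
```

Calling this at the end of a job (or periodically) gives a coarse upper bound on memory use without any external profiler.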
I made some tests reducing the number of files. I see that running on all files fails. However, if you run on a subset of ~10 files, the job is ok. If you run on ~50 files, the job sometimes succeeds and sometimes crashes. @germanfgv is there the possibility to split the job into ~5 parts?
To be honest, that's not something I have ever done with our regular workflows. I guess I'll have to mess with our JobSplitter logic. I don't think it's something we can modify that easily. Let me check with the rest of the team to see what we can do.
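For illustration only, the splitting being discussed could look like the sketch below; `split_into_parts` is a hypothetical helper, not the actual Tier-0 JobSplitter logic:

```python
def split_into_parts(files, n_parts=5):
    """Split a list of input files into n_parts roughly equal chunks.

    Hypothetical sketch of the "split the job into ~5 parts" idea:
    each chunk would become its own harvesting job. Earlier chunks
    absorb the remainder so sizes differ by at most one file.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    size, extra = divmod(len(files), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        if start < end:  # skip empty chunks when files < n_parts
            chunks.append(files[start:end])
        start = end
    return chunks
```

For example, 11 files split into 5 parts gives chunks of 3, 2, 2, 2 and 2 files.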
Thanks German, this would be only an "emergency" solution. Of course we want to understand and fix the bug.
@silviodonato could this be an issue with EOS? We have recently had problems reading files. Are we sure that's not what's happening here?
Yes, these seem to be problems with EOS, but I thought it was ok now. I'll try again with a local test.
@germanfgv and all: Yes, I can confirm this is a problem with EOS.
to pick the local file instead of the files on EOS. I'll send a message to the HN. Thanks everybody for the help!
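A minimal sketch of that idea, assuming the input files have first been copied to a local directory; `localize` and the directory path are hypothetical, not the actual configuration change used here:

```python
import os

def localize(file_names, local_dir="/tmp/replay_inputs"):
    """Rewrite input file paths to local 'file:' paths.

    CMSSW's PoolSource accepts 'file:/path' entries, so pointing the
    source at pre-staged local copies bypasses EOS reads entirely.
    The local_dir default is an assumption for illustration.
    """
    out = []
    for name in file_names:
        base = os.path.basename(name.replace("file:", ""))
        out.append("file:" + os.path.join(local_dir, base))
    return out
```

In a cmsRun configuration this list would then be assigned to the source's `fileNames` parameter in place of the EOS paths.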
@silviodonato We have an open ticket with EOS discussing problems that we had during MWGR#3. I'm reporting this in that ticket. The configuration of the T2 backfill area was changed recently, and that may be the cause of the problem. I'll retry the replay using our new T0 storage site, where this should not be a problem.
All the fatal exception messages are saying that the PoolSource asked ROOT to read some more of the file and what ROOT got back was basically 'broken'. That completely fits an EOS problem where remote file reading was giving back garbage.
I'll close this PR as testing for MWGR has moved to #4577
Test of CMSSW_11_3_1 for the MWGR#4 (2-4 June).
Configuration followed from #4573
cms-sw/cmssw#33867 should fix the issue reported in:
https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2213/1/2/1/1/1/1/1.html