Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to allow the parallel use of ORA db and new CondDB #1831

Merged
merged 12 commits into from Jan 14, 2014

Conversation

ggovi
Copy link
Contributor

@ggovi ggovi commented Dec 16, 2013

The Condition Database implementation (write/read) has been modified to allow the access to both the old format (ORA/Pool) and the new schema (CondDB).

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @ggovi for CMSSW_7_0_X.

Fixes to allow the parallel use of ORA db and new CondDB

It involves the following packages:

CondCore/CSCPlugins
CondCore/CalibPlugins
CondCore/CondDB
CondCore/DBCommon
CondCore/DBOutputService
CondCore/DTPlugins
CondCore/ESSources
CondCore/HcalPlugins
CondCore/IOVService
CondCore/PhysicsToolsPlugins
CondCore/PopCon
CondCore/Utilities
CondFormats/Calibration
CondFormats/GeometryObjects
CondTools/DQM
CondTools/DT
CondTools/L1Trigger

@apfeiffer1, @nclopezo, @demattia, @danduggan, @rovere, @cmsbuild, @rcastello, @deguio, @ggovi, @eliasron, @mulhearn can you please review it and eventually sign? Thanks.
@ghellwig this is something you requested to watch as well.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
@ktf you are the release manager for this.

@apfeiffer1
Copy link
Contributor

+1

On Mon, Dec 16, 2013 at 4:28 PM, cmsbuild notifications@github.com wrote:

A new Pull Request was created by @ggovi https://github.com/ggovi for
CMSSW_7_0_X.

Fixes to allow the parallel use of ORA db and new CondDB

It involves the following packages:

CondCore/CSCPlugins
CondCore/CalibPlugins
CondCore/CondDB
CondCore/DBCommon
CondCore/DBOutputService
CondCore/DTPlugins
CondCore/ESSources
CondCore/HcalPlugins
CondCore/IOVService
CondCore/PhysicsToolsPlugins
CondCore/PopCon
CondCore/Utilities
CondFormats/Calibration
CondFormats/GeometryObjects
CondTools/DQM
CondTools/DT
CondTools/L1Trigger

@apfeiffer1 https://github.com/apfeiffer1, @nclopezohttps://github.com/nclopezo,
@demattia https://github.com/demattia, @dandugganhttps://github.com/danduggan,
@rovere https://github.com/rovere, @cmsbuildhttps://github.com/cmsbuild,
@rcastello https://github.com/rcastello, @deguiohttps://github.com/deguio,
@ggovi https://github.com/ggovi, @eliasron https://github.com/eliasron,
@mulhearn https://github.com/mulhearn can you please review it and
eventually sign? Thanks.
@ghellwig https://github.com/ghellwig this is something you requested
to watch as well.
You can sign-off by replying to this message having '+1' in the first line
of your reply.
You can reject by replying to this message having '-1' in the first line
of your reply.
@ktf https://github.com/ktf you are the release manager for this.


Reply to this email directly or view it on GitHubhttps://github.com//pull/1831#issuecomment-30668965
.

Thanks,
cheers, andreas

@cmsbuild
Copy link
Contributor

-1
I ran the usual tests and I found errors in the following unit tests:

---> test runtestTqafTopJetCombination had ERRORS
---> test runtestTqafTopEventSelection had ERRORS

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1831/1808/summary.html

@cmsbuild
Copy link
Contributor

Pull request #1831 was updated. @apfeiffer1, @nclopezo, @demattia, @danduggan, @rovere, @cmsbuild, @rcastello, @deguio, @ggovi, @eliasron, @mulhearn can you please check and sign again.

@deguio
Copy link
Contributor

deguio commented Dec 17, 2013

+1
confident that the fix solves the problem :)

@ggovi
Copy link
Contributor Author

ggovi commented Dec 17, 2013

Ok, I have produced a fix to support the old convention in GT naming (postfix "::All" ). With this fix, the two tests are reading correctly the GT, but failing with the following error:

----- Begin Fatal Exception 17-Dec-2013 17:33:51 CET-----------------------
An exception of category 'ConditionDatabase' occurred while
[0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
Exception Message:
No valid iov found for run 1, lumisection 0 in tag lhcscan_1 from BasePayloadProxy::setIntervalFor

The target IOV starts effectively from 1/1:

================================================================
Tag: lhcscan_1
================================================================
OID: 0003-0000000E
Scope: Unknown
Description:  
TimeType: lumiid
Since                 Since (runn)          Since (lumi)          Till                  Till (runn)           Till (lumi)           Payload OID    Payload Class        
--------------------  --------------------  --------------------  --------------------  --------------------  --------------------  -------------  ---------------------
          4294967297                     1                     1  18446744073709551615            4294967295            4294967295  0007-00000000  lumi::LumiSectionData

The code previous to my changes was somewhat 'tolerating' this invalid requests.

This brings us to a question for Framework people: is the request for run 1, lumi 0 expected/correct?
If it is the case, the starting validity for some tags has to be backward extended.

Giacomo

@Dr15Jones
Copy link
Contributor

This brings us to a question for Framework people: is the request for run 1, lumi 0 expected/correct?

It is both expected and correct. Basically it is the begin run.

@apfeiffer1
Copy link
Contributor

+1

On Di., Dez. 17, 2013 at 5:54 nachm., Chris Jones <notifications@github.com="mailto:notifications@github.com">> wrote:

This brings us to a question for Framework people: is the request for run 1, lumi 0 expected/correct?

It is both expected and correct. Basically it is the begin run.

Hmmm ... while I would expect this to be only correct for run- and time- based conditions, for lumi-based ones this implies that any IOV has to start at lumi=0 instead of lumi=1 as it is so far (at least in the ones used in these tests, potentially for many more). 

If this is the case, we will need to check and update all lumi-based conditions to make sure they contain lumi=0.

Thanks,

    Cheers, andreas

@wmtan
Copy link
Contributor

wmtan commented Dec 17, 2013

Andreas, I'm not sure I understand your last comment, but to clarify, 0 is not a valid luminosity section number.
1:0 means run 1. It is not a lumi. There is no lumi 0. lumi 0 is used to indicate a run.
1:1 means run 1, lumi 1. It is a lumi.

@ggovi
Copy link
Contributor Author

ggovi commented Dec 17, 2013

Two possible fixes for the tests then:

  1. Provide a valid condition for lumi 0 in ALL tags
  2. Make the code 'tolerant' like it was before

On Dec 17, 2013, at 5:54 PM, Chris Jones wrote:

This brings us to a question for Framework people: is the request for run 1, lumi 0 expected/correct?

It is both expected and correct. Basically it is the begin run.


Reply to this email directly or view it on GitHub.

@ggovi
Copy link
Contributor Author

ggovi commented Dec 17, 2013

The target 'time' which is being looked up in the iov is 4294967296 . Not 1. This should mean run 1, lumi 0.
Am I missing something?

@cmsbuild
Copy link
Contributor

Pull request #1831 was updated. @apfeiffer1, @nclopezo, @demattia, @danduggan, @rovere, @cmsbuild, @rcastello, @deguio, @ggovi, @eliasron, @mulhearn can you please check and sign again.

@apfeiffer1
Copy link
Contributor

Bill, thanks for the clarification - this is what I thought it should be. :)

In this case, as there is no valid lumi number for beginRun, I think the
FWK should not try to retrieve conditions for any lumi-based tag, and do
this only for run- and time- based conditions. If this is too much work for
the FWK to implement, we can put a fix in and return "nothing" (for some
definition of "nothing") for any request with lumi==0 -- please let us know.

Thanks,
cheers, andreas

@apfeiffer1
Copy link
Contributor

+1

@Dr15Jones
Copy link
Contributor

Andreas,
lumi/time based is only a concept internal to the PoolDBESSource and is unknown to the EventSetup system. However, the EventSetup system is designed to handle the case where a source says it can't deliver data for a given synchronization value. There are two ways you can do it

  1. You can say a given Record has not content for a given IOVSyncValue. The header file for EventSetupRecordIntervalFinder has the following
      /**returns the 'default constructed' ValidityInterval if no valid interval.
       If upperbound is not known, it should be set to IOVSyncValue::invalidIOVSyncValue()
      */
      const ValidityInterval& findIntervalFor(const eventsetup::EventSetupRecordKey&,
                                            const IOVSyncValue&);
  1. You can also tell the system that a particular data item is not available for the given IOV. This is done by having your edm::eventsetup::DataProxyTemplate<>::make method return a nullptr.
  2. a) instead of return a nullptr you could throw an cms::Exception with useful information stating they attempted to read the data on the wrong boundary.

In either case, if a user attempts to read the data at a Run boundary they will get an exception.

@ggovi
Copy link
Contributor Author

ggovi commented Dec 17, 2013

Hi Chris,

Based on your clarification - and given the current Framework behavior at begin run - the IOV lookup should never fail. It is the actual payload access that might eventually fail with an exeception. So, I will go for 1) in PoolDBESSource and 2a)
for any access to the payload in case of an invalid iov.

Thanks,

Giacomo

On Dec 17, 2013, at 10:28 PM, Chris Jones wrote:

Andreas,
lumi/time based is only a concept internal to the PoolDBESSource and is unknown to the EventSetup system. However, the EventSetup system is designed to handle the case where a source says it can't deliver data for a given synchronization value. There are two ways you can do it

  1. You can say a given Record has not content for a given IOVSyncValue. The header file for EventSetupRecordIntervalFinder has the following
  /**returns the 'default constructed' ValidityInterval if no valid interval.
   If upperbound is not known, it should be set to IOVSyncValue::invalidIOVSyncValue()
  */
  const ValidityInterval& findIntervalFor(const eventsetup::EventSetupRecordKey&,
                                        const IOVSyncValue&);
  1. You can also tell the system that a particular data item is not available for the given IOV. This is done by having your edm::eventsetup::DataProxyTemplate<>::make method return a nullptr.
  2. a) instead of return a nullptr you could throw an cms::Exception with useful information stating they attempted to read the data on the wrong boundary.

In either case, if a user attempts to read the data at a Run boundary they will get an exception.


Reply to this email directly or view it on GitHub.

@apfeiffer1
Copy link
Contributor

Hi Chris, all,

lumi/time based is only a concept internal to the PoolDBESSource and is
unknown to the EventSetup system. However, the EventSetup system is
designed to handle the case where a source says it can't deliver data for a
given synchronization value. There are two ways you can do it

  1. You can say a given Record has not content for a given IOVSyncValue.
    The header file for EventSetupRecordIntervalFinder has the following

/**returns the 'default constructed' ValidityInterval if no valid
interval.
If upperbound is not known, it should be set to
IOVSyncValue::invalidIOVSyncValue()
*/
const ValidityInterval& findIntervalFor(const
eventsetup::EventSetupRecordKey&,
const IOVSyncValue&);
2) You can also tell the system that a particular data item is not
available for the given IOV. This is done by having your
edm::eventsetup::DataProxyTemplate<>::make method return a nullptr.
2) a) instead of return a nullptr you could throw an cms::Exception with
useful information stating they attempted to read the data on the wrong
boundary.

In either case, if a user attempts to read the data at a Run boundary
they will get an exception.

one question which I still have is why are the conditions read at beginRun
? I understood from our discussion earlier this year that you would
read/update conditions only at beginLumi (and then with a valid lumi
number) ?

We were planning to provide an optimisation to be able to read a whole GT
in one go - but that assumes that we are called with valid run/lumi numbers
so we can't do this if this is called at beginRun with invalid lumi and
then again in beginLumi with the correct one. I'd have thought that no
event is processed between beginRun and the first beginLumi - am I wrong
here ? If not, what is the use-case to load conditions in beginRun ??

Thanks,
cheers, andreas

@cmsbuild
Copy link
Contributor

-1
I ran the usual tests and I found errors in the following unit tests:

---> test runtestTqafTopEventSelection had ERRORS
---> test runtestTqafTopJetCombination had ERRORS

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1831/1835/summary.html

@apfeiffer1
Copy link
Contributor

+1

Well, that is now due to the fact that we do a more strict checking of
what's asked to us and we're asked for something which doesn't exist. Is
that - which is in fact an issue in another part of CMSSW - a reason to not
merge the PR ? ;)

[or, asked differently: how many +1's do I have to send to make the -1 from
testing positive ? :)]

Thanks,
cheers, andreas

@ktf
Copy link
Contributor

ktf commented Dec 18, 2013

The rationale has always been: "if you break something, you fix it (or you follow up to have it fixed)". In this particular case you have changed the behaviour so either you resume the old behaviour and call for its deprecation or you adapt everyone else to follow it.

@apfeiffer1
Copy link
Contributor

@demattia, @rcastello, @eliasron, @mulhearn we'd need to get this in pre11 (and 700) so could you please sign again ? The last commits were only in CondCore packages - though you may want to double check again :)

Thanks - and seasonal greetings ! :)

cheers, andreas

@deguio
Copy link
Contributor

deguio commented Dec 19, 2013

for DQM and Validation I have signed it already.
no need to wait for elias.

seasonal greetings to all!
F.

@rcastello
Copy link

+1
Done!
Have nice holidays :-)

@ktf
Copy link
Contributor

ktf commented Dec 19, 2013

@nclopezo can you please restart the tests?

@Dr15Jones
Copy link
Contributor

@apfeiffer1 I guess I'm still not being successful in my communication. I'll take a different tack: I'll describe what I think you mean by requesting conditions for illegal lumi (0) and you can tell me if I'm right or wrong.

First, as I said before, the framework has always had run(N), lumi(0) as a valid synch point and it is used to denote the framework is beginning a new Run. Having appropriate conditions be available at begin Run has been a feature of the framework since day one and much code depends on this. For example, I believe we can agree that the geometry of detector is stable of the Run, i.e. we won't add a new tracking layer during a Run.

So I assume that when you say ``requesting conditions for illegal lumi (0)` you mean there is a (possibly small) subset of conditions which are only defined within a lumi and do not have a legitimate value at begin run. An example would be the integrated luminosity for a given lumi block. The framework gives two ways to accommodate that behavior. However, first I need to review how the framework deals with new transitions.

On each new Run or LuminosityBlock (and formerly Event) transition the framework needs to know if data in the EventSetup needs to be updated. It does that by seeing if the IOVSyncValue of the new transition lies within the IOV of each EventSetup Record being used. If the IOVSyncValue lies outside of a particular Record's last IOV, then the framework will call the ESSource's setIntervalFor method so it can get a new IOV. This is not a request for the data in the IOV, but only a request for the new begin/end range of the IOV.

So in the case where a source knows it has no valid data for a particular Record for a given IOVSyncValue, the source can either

  1. The source can return a new edm::ValidityInterval from setIntervalFor which starts at the ‘invalid’ IOVSyncValue and ends just before a known good IOVSyncValue. For the case of a begin run it would just be [RunNumber, 0], [RunNumber,0] since edm::ValidityInterval ranges are inclusive. In addition, the source would have to tell all its Proxies which reside in the given Record to not return any data if called and instead do
    a) return a nullptr which will make the framework thrown a generic exception or
    b) throw a cms::Exception explaining that the data is not available for that given sync value
  2. The source can return a defaultly constructed edm::ValidityInterval from setIntervalFor. When the framework sees the default, it does not add the given Record to the edm::EventSetup for that given IOVSyncValue and any code attempting to obtain that Record will get an exception thrown.

I personally prefer 1) in this case since it is likely you know exactly when the next good IOV is (presumable lumi(1)) and you can therefore tell the framework the exact IOV over which you do not have actual data for that Record. Essentially you have an IOV for which you happen to not have data which is a concept the framework can handle.

As an aside, a quick cursory look at your code change sees to show that you threw an exception during the setIntervalFor call. That is not the way the framework expects to deal with this situation since the framework was merely asking your Source if it had a valid IOV which is a legitimate question for it to ask.

@ktf
Copy link
Contributor

ktf commented Dec 20, 2013

@nclopezo any idea what happened of the tests?

@ktf
Copy link
Contributor

ktf commented Dec 20, 2013

@ktf
Copy link
Contributor

ktf commented Dec 20, 2013

Testing again to see if there was some infrastructural problem.

@ktf
Copy link
Contributor

ktf commented Dec 20, 2013

still fails a bunch of workflows (e.g. 100X). Moving this to pre12 / 71X.

@ktf
Copy link
Contributor

ktf commented Jan 9, 2014

@nclopezo can you restart these tests? Thanks.

@nclopezo
Copy link
Contributor

nclopezo commented Jan 9, 2014

ok, I restarted them with CMSSW_7_0_X_2014-01-09-0200

@nclopezo
Copy link
Contributor

nclopezo commented Jan 9, 2014

-1
When I ran the RelVals I got a segmentation violation error in the following workflows:
4.53 step3
5.1 step1
8.0 step2
25.0 step2
401.0 step1
1000.0 step2
1001.0 step2
1003.0 step2
1306.0 step2

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1831/1930/summary.html

@ggovi
Copy link
Contributor Author

ggovi commented Jan 13, 2014

Hi, I have successfully tested the failing workflows. Can you please re-start the merging procedures? Otherwise, let me know if you prefer a new pull request.

@cmsbuild
Copy link
Contributor

Pull request #1831 was updated. @apfeiffer1, @ojeda, @danduggan, @rovere, @cmsbuild, @diguida, @rcastello, @deguio, @ggovi, @Degano, @mulhearn, @nclopezo can you please check and sign again.

@ktf
Copy link
Contributor

ktf commented Jan 13, 2014

Va ancora bene 70X...

On 10 Jan 2014, at 22:04, Giacomo Govi wrote:

Ciao,

ho un fix per questo. Come procedo: la pull request e' gia updatata
(il mio branch ha il fix), ma forse meglio una nuova pull request su
7_1_X?

Grazie,

Giacomo
On Jan 9, 2014, at 11:56 AM, Giulio Eulisse wrote:

Ho chiesto che vengano fatti ripartire i test.

On 9 Jan 2014, at 11:54, Giacomo Govi wrote:

Ciao Giulio,

c'e' un modo di recuperare questo log? Altrimenti, come vengono
girati i test per i workflow - devo provare a riprodurre il problema.

Grazie,

G.

On Dec 20, 2013, at 1:25 PM, Giulio Eulisse wrote:

Mmm… a bunch of workflows failed.

https://cmssdt.cern.ch/jenkins/job/Pull-Request-Integration/ARCHITECTURE=slc5\_amd64\_gcc481/1879/console

any idea?


Reply to this email directly or view it on GitHub.

On 13 Jan 2014, at 12:03, cmsbuild wrote:

Pull request #1831 was updated. @apfeiffer1, @ojeda, @danduggan,
@rovere, @cmsbuild, @diguida, @rcastello, @deguio, @ggovi, @Degano,
@mulhearn, @nclopezo can you please check and sign again.


Reply to this email directly or view it on GitHub:
#1831 (comment)

@cmsbuild
Copy link
Contributor

@mulhearn
Copy link
Contributor

+1

@apfeiffer1
Copy link
Contributor

+1
as apparently the same on the last commit only is not seen by the bot ... :(

@rcastello
Copy link

+1

1 similar comment
@deguio
Copy link
Contributor

deguio commented Jan 14, 2014

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next IBs unless changes (tests are also fine). @ktf can you please take care of it?

ktf added a commit that referenced this pull request Jan 14, 2014
DB improvements -- Fixes to allow the parallel use of ORA db and new CondDB
@ktf ktf merged commit dd64de8 into cms-sw:CMSSW_7_0_X Jan 14, 2014
@Martin-Grunewald
Copy link
Contributor

Hi,

I believe it it this pull request which lets the HLT crash in all AddOn tests such as these:
https://cmssdt.cern.ch/SDT/cgi-bin//showAddOnLogs.py/slc5_amd64_gcc481/www/wed/7.0-wed-02/CMSSW_7_0_X_2014-01-15-0200/addOnTests/

Please have a look and fix for pre12.

@Martin-Grunewald
Copy link
Contributor

Some more info:
To reproduce: re-run one of the above addon tests, step1 snd (failing) step2.

The crash occurs in: void IOVProxy::load( const std::string& tag, bool full )
when evaluating the bool in the condition of the if statement:

  if(!m_session->iovSchema().tagTable().select( tag, m_data->timeType, m_data->payloadType,
                                                m_data->endOfValidity, dummy, m_data->lastValidatedTime ) ){
    throwException( "Tag \""+tag+"\" has not been found in the database.","IOVProxy::load");
  }

Note it is NOT the exception thrown in the body of the if statement, rather, the evaluation of the
Boolean result aborts.

The values of the parameters are:

tag=DTVdrift_V02_mc
data=0x2af3ce32e4e0
session=0x2af3ce238730
timetype=0
payloadtype=DTMtime
endofValidity=4294967295
lastValidatedTime=1

timetype=0 seems fine as it is used w/o problems in other calls.

@ggovi
Copy link
Contributor Author

ggovi commented Jan 15, 2014

Yes I'm already producing the fix.

Thanks

Giacomo
On Jan 15, 2014, at 2:24 PM, Martin Grunewald wrote:

Some more info:
To reproduce: re-run one of the above addon tests, step1 snd (failing) step2.

The crash occurs in: void IOVProxy::load( const std::string& tag, bool full )
when evaluating the bool in the condition of the if statement:

if(!m_session->iovSchema().tagTable().select( tag, m_data->timeType, m_data->payloadType,
m_data->endOfValidity, dummy, m_data->lastValidatedTime ) ){
throwException( "Tag ""+tag+"" has not been found in the database.","IOVProxy::load");
}
Note it is NOT the exception thrown in the body of the if statement, rather, the evaluation of the
Boolean result aborts.

The values of the parameters are:

tag=DTVdrift_V02_mc
data=0x2af3ce32e4e0
session=0x2af3ce238730
timetype=0
payloadtype=DTMtime
endofValidity=4294967295
lastValidatedTime=1

timetype=0 seems fine as it is used w/o problems in other calls.


Reply to this email directly or view it on GitHub.

@VinInn
Copy link
Contributor

VinInn commented Jan 24, 2014

This pull request removed what introduced by 1665. I am not sure how this happen.
please restore the associated commit. Other (newer) pull request depends on what existed till pre11

@ggovi
Copy link
Contributor Author

ggovi commented Jan 24, 2014

Ok, i'll create a pull request with the removed changes.

On Jan 24, 2014, at 12:04 PM, Vincenzo Innocente wrote:

This pull request removed what introduced by 1665. I am not sure how this happen.
please restore the associated commit. Other (newer) pull request depends on what existed till pre11


Reply to this email directly or view it on GitHub.

@VinInn
Copy link
Contributor

VinInn commented Jan 24, 2014

On 24 Jan, 2014, at 12:44 PM, ggovi notifications@github.com wrote:

Ok, i'll create a pull request with the removed changes.
Giuilo just did it.
is pull request #2158.
please verify that.
v.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet