New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Das client fails with interface conversion #37052
Comments
A new Issue was created by @smuzaffar Malik Shahzad Muzaffar. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
The fix is here cms-sw/cmsdist#7646 |
The issue pop-up due to change in DBS server we performed yesterday. The new DBS server does not provide aggregated records and as a consequence lumi values changed their data-type, i.e. from results from DBS no longer contains list of lumis but rather individual lumi numbers. |
@vkuznet , I am afraid this DBS server side change is going to break all our IB. The https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/scripts/das-selected-lumis.py script is used by all IBs and cmssw releases to select the lumi numbers. e.g. previously das query
was returning https://raw.githubusercontent.com/cms-sw/cms-sw.github.io/master/das_queries/8b/8b5f64ce639191e4db57cc88035da15cbdc57a3faaa60ea1eab0ca4377221249.json which can be parsed by das-selected-lumis.py (in cmssw env) e.g.
but now new dasgoclient returns a different format which breaks all IBs and existing releases
|
@vkuznet , In new format , I see that for each file there are multiple records with just the lumi number changed [a]. What was the motivation behind this change? Previously we had one entry per file with list of lumi numbers [a]
|
@vkuznet , das queries like
|
@smuzaffar , we advertised all changes to DBS server up-front for two months. In short, the changes to DBS removed aggregation layer which contributed to instability of DBS itself. Instead, new DBS server just yield records from DB without any aggregation. We adjusted official DBSClient to handle aggregation on a client side. All details about new DBS server I presented on C&O meeting in early Jan. You can find all details here: Now, I was not aware of It basically aggregated lumis, but I only look at specific use-case which AlCaDB people pointed out. I suggest you to try it out and we can adjust this script to perform the aggregation you require. Also, regarding For, instance, if I need only uniq files I can do
or if I need files for specific lumis I can do:
and, then again use I'll be happy to discuss and help with your use-cases once I'll know all details. If you want we may schedule zoom meeting to discuss this. |
@vkuznet , problem is that We run hundreds of unique das queries for each IB during the IB validation tests ( see the |
by the way all the suggestion you mentioned in #37052 (comment) are not going to fix existing cmssw releases. |
@vkuznet , I went through your slides https://indico.cern.ch/event/1096200/contributions/4611461/attachments/2356881/4022206/DBS_migration.pdf . As you mentioned |
@smuzaffar you touched few independent issues and I don't know why we did not communicate properly on that. To move forward and properly fix them I suggest to clarify independent issues:
It is unfortunate that you did not raise these issues before we made DBS upgrade since I explicitly asked all groups to do their home work and tests changes. But, I'll do my best and prioritize work on DAS go client to resolve this asap. |
@vkuznet , das client instabilities are not new, these are there since the start of das client (may be 8-10 years). In old days (pre dasgoclient era) it was due to das not returning results ( due to various reasons, network issue, service not reachable etc.). Now a days it is mostly due to failure of various services which das tries to communicate. In order to protect our IBs and PR tests against these instabilities , we use github caches for das results. |
list of all queries for CMSSW are in https://github.com/cms-sw/cms-sw.github.io/tree/master/das_queries ( search for |
@smuzaffar there are 3760 query files, and I need to know which one should be addressed first. My understanding that main failures coming from file,lumi queries where you need to aggregate results by lumi sections, is this correct? If so, I'll start with this query, provide aggregation within dasgoclient and we can move forward from there. How does it sound? |
All of these queries are run by cmssw IBs ( cmssw 5.3.X up to 12.3.X) and all of these are important otherwise IBs and PR tests will fail. I do not exactly know how many types of queries are there but I guess you can We need to fix the results so that existign cmssw release do not fail and to make sure that we not do not dump 50 times more data in to github caches |
@smuzaffar I made first change to make file-lumis aggregation. Could you please try out this executable:
So, it only produces 2 LFns with range of lumis. This is what das_queries/41/4100c4315028e2807e9860d22f84218aa7e86b77871bef89198d0e0523ede0db has. Therefore, we need to decide on default behavior of dasgoclient. Should we explicitly use There are only few APIs (and therefore DAS queries) which require aggregation, and once I'll know the default behavior I'll provide fixes one by one. |
@vkuznet , I tried your change but it still fails. Can you please change lumi format and make it consistent with old one
in old format, lumi is list of one element |
by the way https://cmssdt.cern.ch/jenkins/job/das-query is jenkins job which I run for caching das results and https://cmssdt.cern.ch/jenkins/job/das-query/4142/console used your new dasgoclient and it failed due to mismatch of lumi format |
please try now the same executable, I just updated with what you requested. |
@vkuznet , it failed again with error
see details https://cmssdt.cern.ch/jenkins/job/das-query/4146/console . Can you please also fix output for file which is also a list e.g. |
@smuzaffar , it already has file as list of dict type. And bus error most likely comes from expired proxy. Please run it manually on lxplus to confirm that. This is exactly what I do:
If I use
|
about
|
@vkuznet , think look good now. I ran all the active queries and only two failed [a]. We need to make
is this a correct change? [a]
|
Please check again the executable, now I added aggregation by default for file-lumis, so you can query it as usual:
and, new executable correctly returns file query output, see output of
|
thanks, I am testing it now |
let's do one issue at a time. Once we have file,lumis in place, then I can correct other deviations in data types. |
looks good, both lumi,file and file queries worked. Also |
Regarding auto_cross_section value. The new DBS server does not perform any data manipulation (as Python server did) and only stream data directly from ORACLE DB. Even though auto_cross_section is declared in ORACLE schema as float, the ORACLE itself return 0 , eg.
This is why you get integer value instead of float in DAS output. I will not change anything in DBS server to correct this since 0 is proper value for float data type. |
ok, I'm glad that new executable works. Please confirm how many queries (if any are failing). If all tests pass I will push my changes and create new dasgoclient release via PR. |
all active queries ( 2130 in total ) look good, please go ahead and push the changes |
ok , I made new PR here cms-sw/cmsdist#7650 |
But I also noticed that something has changed in build procedure as I no longer able to build dasgoclient using my version of build script:
where should I get |
|
The pkgtools is V00-30-XX and for cmsdist I use IB/CMSSW_12_3_X/master as starting point. |
you need pkgtools V00-34-XX |
thanks, I confirm that with V00-34-XX I can build now again. I think we may close this ticket. |
This has been fixed via cms-sw/cmsdist#7650 also there is now a unit tests to #37072 check checks that |
One unit tests https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc10/CMSSW_12_3_X_2022-02-23-2300/unitTestLogs/PhysicsTools/Utilities#/1576-1576 is failing in all IBs and problem is
dasgoclient
which fails to run the following command@vkuznet can you please look in to this?
The text was updated successfully, but these errors were encountered: