Successful (no job failure) workflows w/ missing output on DBS #11358

Closed
haozturk opened this issue Nov 4, 2022 · 76 comments · Fixed by #11361
Comments

@haozturk

haozturk commented Nov 4, 2022

Impact of the bug
Workflow announcement

Describe the bug
We see lots of successful workflows with missing output on DBS. I'm not sure whether this is due to a delay or failure in DBS injection, or a problem in job failure accounting.

How to reproduce it
Here are some affected workflows:

  1. https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-RunIISummer20UL17NanoAODv9-02666
  2. https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-RunIISummer20UL16NanoAODAPVv9-02101
  3. https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-RunIISummer20UL18NanoAODv9-02582
  4. https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-RunIISummer20UL16NanoAODAPVv9-05498

If you need a complete list, please let me know. I can provide it.

Expected behavior
If there is no job failure, we expect to see 100% output on DBS.

Additional context and error message
None

@amaltaro
Contributor

amaltaro commented Nov 4, 2022

@vkuznet Valentin, can you please investigate this issue?

A short summary of steps to debug it is:

  1. Hasan provided the workflow PrepId, so we need to open one of those links to find out the precise workflow name. If we do so for the first link, here is the workflow name: pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960
  2. then open WMStats and filter the output for the workflow above. Click on the "all requests" button to bring up a table of workflow(s)
  3. next, click on the first column D (grey/red box). It gives you a short summary of the workflow; the information we are looking for is which agents worked on this workflow. For the workflow above, this is what I see: agent url: cmsgwms-submit6.fnal.gov
    i) this same summary information provides us with a list of the output datasets. In this case, here it is: /LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM
  4. now that we know which agent processed this workflow and which output(s) are expected, you can ssh to submit6 and debug it.
  5. I suggest starting with the DBS3Upload ComponentLog and seeing if we can find blocks failing to be inserted
  6. another check is to verify the state of such blocks in the local database, mainly looking at the dbsbuffer_block table (see the example query after this list).
  7. of course, make sure DBS3Upload is up and running.

I can help you once you get to that point; this is just a brief set of instructions to get started, but it might be enough to find the problem.
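For step 6, a minimal sketch of what such a check could look like against the agent's local (Oracle/MariaDB) database; the column names below (blockname, status, create_time) are assumptions about the dbsbuffer_block schema, not verified here:

-- hypothetical query: inspect the state of a suspect block in the agent database
SELECT blockname, status, create_time
  FROM dbsbuffer_block
 WHERE blockname LIKE '%<block-uuid>%';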

@vkuznet
Contributor

vkuznet commented Nov 4, 2022

I can start looking into this issue next week, as I'm busy today and early next week with doctor appointments. Meanwhile, it would be useful if @amaltaro could specify exactly which URL calls DBS3Upload makes and whether it logs these calls to DBS. From what I read, I see two possibilities here:

  • we either have a failed call to DBS from DBS3Upload, or
  • the block already exists and something goes wrong internally in DBS3Upload

The problem will be much easier to debug if, for a given workflow, we identify the URL call DBS3Upload makes to DBS and see its parameters, i.e. if we know which file/block is supplied via the HTTP call, we can easily check whether it is an issue with DBS or not.

@amaltaro
Contributor

amaltaro commented Nov 4, 2022

@vkuznet DBS3Upload uses only 1 API, which is insertBulkBlock, see:
https://github.com/dmwm/WMCore/blob/2.1.2.patch2/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L92

you can also see what is logged around this code and check the logs of the relevant agent (once you figure out which agent it is; please see the instructions above).

As you know, the block information can be extremely large, so we definitely do not log and/or dump it anywhere. There is a way to dump it though, but I'd suggest taking the first debugging steps without the full payload that is passed to the server (other than the block name).

Given that you decided to be more verbose and explicit on the exception hit in the server, it might be that an analysis of the error message is enough to figure out what the problem is.

@todor-ivanov
Contributor

Hi @vkuznet @amaltaro I'll take a look at this as well, as we discussed during the meeting yesterday.

@vkuznet
Contributor

vkuznet commented Nov 8, 2022

From submit6 agent ComponentLog:

2022-09-20 20:32:57,233:139722709800704:ERROR:DBSUploadPoller:Error trying to process block /NMSSM_XToYHTo2G2WTo2G2L2Nu_MX-2600_MY-60_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM#ae6aeb11-6f20-42e6-9e7f-5ac952f3f118 through DBS. Error: DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set
Traceback (most recent call last):
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.2.patch1/lib/python3.8/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 92, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.10/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 630, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.10/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 464, in __callServer
    self.__parseForException(http_error)
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.10/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 508, in __parseForException
    raise http_error
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.10/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 461, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp2/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp2/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
    raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: Bad Request

What this actually means is that the provided JSON has insufficient information for the insert. The insertBulkBlock API expects a JSON payload containing dataset/block/file configuration information, release information, etc. The error we see, Error: sql: no rows in result set, suggests that some auxiliary info is not found in the DBS database, e.g. the dataset/block configuration, and without it present in the DB the bulk block insert will fail.

In order to answer what is missing, we need the actual JSON of a specific failed block. My question is: how do we get this JSON?

The most recent error I see in the ComponentLog is dated

2022-11-08 08:54:24,174:139796326827776:ERROR:DBSUploadPoller:Error found in multiprocess during process of block /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2

but I do not see a block with ID 397d4e07-e918-4c6d-ab5b-b074f1b83ee2 anywhere in the dbs2go-global-w logs on that date.

So, there are several questions here:

  • how to get the JSON of the failed block
  • which DBS instance DBS3Upload is using, so that we can try to find the request in the DBS logs

@vkuznet
Contributor

vkuznet commented Nov 8, 2022

From the DBS logs (from cmsweb-prod, today), I see a few datasets which failed to be injected via the bulkblocks API, and all of them point to the same issue in the code:

  • /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM
  • /NMSSM_XToYHTo2B2WTo2B2L2Nu_MX-300_MY-60_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM
  • /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM

The code line that fails to insert them is here: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L917 and it tells me that the supplied JSON contains a parent which is not found in the DBS database; please see the entire block logic: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L911-L926

Therefore, if it is the same issue, I bet that whoever constructed the JSON input did not check that the parents of these datasets/blocks are present in DBS. Again, without the actual JSON it is hard to pin-point further which parents are not present. But I think we have a clear clue of what is going on with these injections.

I still feel weak and can't spend more time on this, but I hope @todor-ivanov can dig more deeply into this issue.

For the record, DBS logs can be found on vocms0750 and today's logs are:

/cephfs/product/dbs-logs/dbs2go-global-w-684697bc8b-4crnd.log-20221108
/cephfs/product/dbs-logs/dbs2go-global-w-684697bc8b-wlnhc.log-20221108
/cephfs/product/dbs-logs/dbs2go-global-w-684697bc8b-x25rj.log-20221108

@amaltaro
Contributor

@vkuznet Valentin, the DBS3Upload component provides a way to dump the block information to a file (JSON), but it required further work to become useful.
I created this PR:
#11361

which changes how we decide to dump the block data for a given block name, such that we can now dump the information for a single block and inspect exactly what gets posted to the DBS server.

I have already applied that patch to submit6, changed the component configuration to the block name you mentioned above (pasted here as well):

/NMSSM_XToYHTo2G2WTo2G2L2Nu_MX-2600_MY-60_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM#ae6aeb11-6f20-42e6-9e7f-5ac952f3f118

and restarted DBS3Upload. Once you have the data you need, please revert that config back to an empty string and restart the component.
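For reference, a minimal sketch of what that agent-side change looks like; the configuration parameter name comes from #11361 (it is also quoted further down in this thread), while the config file location and the restart command are assumptions about the usual WMAgent setup:

# in the agent's WMAgent configuration (hypothetical path: config/wmagentpy3/config.py)
config.DBS3Upload.dumpBlockJsonFor = "/NMSSM_XToYHTo2G2WTo2G2L2Nu_MX-2600_MY-60_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM#ae6aeb11-6f20-42e6-9e7f-5ac952f3f118"

# then restart the component so it picks up the new configuration (assumed command)
# $manage execute-agent wmcoreD --restart --component DBS3Upload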

@vkuznet
Contributor

vkuznet commented Nov 11, 2022

Alan, thanks for providing this, but so far I do not see anything on the submit6 agent. If I read your changes and the config correctly, the data should go to /storage/local/data1/cmsdataops/srv/wmagent/v2.1.2.patch1/install/wmagentpy3/DBS3Upload, right? So far this area does not have a dbsuploader_block.json (I read this file name from the PR). Either the agent has not yet retried the given block, or it may be something else. I think I still need you to look at this agent to see if everything is properly set.

@amaltaro
Contributor

Yes, that agent seems to have an unusually high number of open blocks (or maybe not that unusual; vocms0253 has 15919 blocks!!!):

2022-11-11 03:54:08,603:139930855028480:INFO:DBSUploadPoller:Starting the DBSUpload Polling Cycle
2022-11-11 03:54:15,012:139930855028480:INFO:DBSUploadPoller:Found 9356 open blocks.

that's why it's taking a while to associate files with their respective blocks. The first cycle is usually longer than the subsequent ones.

You are correct, that's the place where you should see the JSON file.

@amaltaro
Contributor

@vkuznet I just figured out why the JSON dump never showed up for the block above: that block went through the system long ago (it had only a transient failure), sorry about that.

It looks like we have another block that has been failing over the last few cycles of DBS3Upload, so I updated the component config to get a dump for:

config.DBS3Upload.dumpBlockJsonFor = "/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2"

@vkuznet
Contributor

vkuznet commented Nov 14, 2022

@amaltaro , I got the JSON file for /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2 and, upon inspection, as I was expecting it has a problem with the following section:

  "dataset_parent_list": [
    "/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM"
  ]

I just checked the DBS global database directly and it has no such dataset registered, see:

select * from cms_dbs3_prod_global_owner.datasets where dataset='/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM';

no rows selected

So the problem, at least with this JSON and this block, is the missing parent dataset in DBS. Since I do not know how this JSON is constructed I do not have any further comments, but I suggest figuring out why we create a block JSON with parent info which is not present in the DBS DB.
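For a quick existence check of a suspect parent dataset, one can also query the DBSReader datasets API directly; a minimal example using the same scurl shorthand (curl with grid-certificate options) used elsewhere in this thread:

scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/datasets?dataset=/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM"
# an empty list [] in the response means the dataset is not registered in DBS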

@amaltaro
Contributor

Thank you for pin-pointing this issue, Valentin.

Based on this information, I investigated it a little further and it looks like we lost the ability to spot such problems in DBS3Upload with the migration of DBS to the Golang implementation, as can be seen in these two lines of code:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L102-L103

Now, on why and how it happened, we can look this up in WMStats:

in short, StepChain workflows produce the unmerged files at one point, and the merge is triggered per dataset + output module + datatier, asynchronously (it actually depends on many other constraints like size, events, etc).

So it's indeed possible for a dataset/block to be injected into the DBS server without its parent data being there.

I don't know how to properly resolve it, given that an ACDC workflow can be executed to recover the MINIAOD data, eventually allowing the insertion of this NANOAOD block into the DBS server to succeed.
At the moment though, I can think of two actions that would need to be taken:

  • @haozturk to investigate why ACDC workflows are not executed. There are job failures up in WMStats, so ideally we should run ACDC(s) and recover those lumis.
  • regarding the DBS server, is there any way we can instrument it such that we can clearly see that the parent information is missing, and hence that the current operation cannot be completed?

@vkuznet
Contributor

vkuznet commented Nov 15, 2022

@amaltaro , this seems to be an identical issue to the one I reported in #11106. In other words, the order in which data is injected into DBS matters, since dataset/block/file info depends on other information such as configs, parentage, etc. Its resolution requires careful inspection of the data injection procedure in WMCore.

That said, indeed, these lines https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L102-L103 rely on the DBS Python exception logic rather than checking the actual DBS server error. Therefore, our effort to properly handle DBSErrors (#10962, #11173) also matters here.

What currently can be done is the following:

  • we need to patch DBSUploadPoller to include an additional block of code that checks for DBSError in the exception message.
    • if you need precise knowledge of when parentage fails, we may scan the returned exception (JSON) for the phrase unable to find dataset_id for; here is an example from a dbs2go log (a sketch of such a check follows this list):
DBSError Code:109 Description:DBS DB ID error for provided entity, e.g. there is no record in DB for provided value Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:110b358e8d7a9864961e2b9f1ba069f7c314fd4e76863206b783bde269d84658 unable to find dataset_id for /NMSSM_XToYHTo2B2WTo2B2Q1L1Nu_MX-1600_MY-700_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM, error DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set Stacktrace`
  • you may decide to raise the priority of the DBSError issue and incorporate these changes into WMA
  • you may add an additional check to DBSUploadPoller that queries DBS for the dataset parents; in fact this is a much better solution, since DBSUploadPoller has the JSON, it can extract the dataset parents and query DBS to see whether such datasets exist. This step can be added in addition to the aforementioned exception parsing.
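A minimal sketch of what such an exception-message check could look like; the error phrasing is taken from the dbs2go log line above, and matching on it is intentionally presented as a fragile, best-effort illustration rather than the actual DBSUploadPoller code:

import re

def missingParentDataset(excMessage):
    """Illustrative check: return the dataset name the DBS server could not
    resolve, or None if the message does not look like a missing-parent failure.
    Relies on the 'unable to find dataset_id for <dataset>' phrase seen in the
    dbs2go logs, so it is only a best-effort string match."""
    match = re.search(r"unable to find dataset_id for ([^,\s]+)", str(excMessage))
    return match.group(1) if match else None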

I think the most reliable solution is to add to DBSUploadPoller code which will query DBS for the dataset parents. The reason is simple: it is the most explicit action. The code should not rely on matching a specific exception string from the DBS server, as the server implementation and its exception messages can change. Instead, the poller should explicitly check whether the data is allowed to be inserted. For that it should check the following (a sketch of these checks is given after the list below):

  • check the dataset parents, via the /datasets?dataset=<dataset_name> DBS API
  • check the dataset and file configuration, via /outputconfigs (providing either dataset or logical_file_name as input parameter), to ensure that the output configuration has been injected into DBS too.
    After that, DBSUploadPoller can inject the block info. This will eliminate the possibility of the data race reported in Eliminate racing conditions in DBS data injection procedure #11106.
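A minimal sketch of such pre-injection ("ask permission") checks, using the RequestHandler from WMCore.Services.pycurl_manager that also appears later in this thread; the helper names, the payload keys and the exact set of checks are illustrative assumptions, not the actual DBSUploadPoller code:

import os
from WMCore.Services.pycurl_manager import RequestHandler

DBS_URL = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'

def dbsHasData(url):
    """Return True if the DBSReader API call behind `url` returns a non-empty list."""
    mgr = RequestHandler()
    data = mgr.getdata(url, params={}, headers={'Accept': 'application/json'},
                       ckey=os.getenv('X509_USER_KEY'),
                       cert=os.getenv('X509_USER_CERT'), decode=True)
    return bool(data)

def blockCanBeInjected(blockDump, datasetName):
    """Illustrative pre-injection checks before calling insertBulkBlock."""
    # every parent dataset listed in the block dump must already exist in DBS
    for parent in blockDump.get('dataset_parent_list', []):
        if not dbsHasData('{}/datasets?dataset={}'.format(DBS_URL, parent)):
            return False
    # the output configuration for the dataset must already be registered too
    if not dbsHasData('{}/outputconfigs?dataset={}'.format(DBS_URL, datasetName)):
        return False
    return True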

@amaltaro
Contributor

We can implement it in a "permission vs forgiveness" model; up to now we have adopted the forgiveness one (deal with exceptions). The reason for that is what the permission model would require:

  • before we inject a block, we need to ensure that the dataset exists in DBS
  • if the block has a parent dataset/block, we also need to check whether the parent dataset exists in DBS
  • the polling cycle of DBS3Upload is 3min, so we would have to repeat all of it every cycle, for every block. Luckily, we are clever enough to:
    • group dataset and parent dataset
    • use some sort of in-memory cache to avoid unneeded RESTful calls; or
    • apply a database schema change such that we have a flag specifying whether a dataset has been injected into DBS or not.

We could look into this, but it becomes a considerably more expensive development.
A similar model is used in RucioInjector, btw, and given the problems that we have been getting there, IMO the best and most robust way forward would be a database schema change to properly store the state of dataset injections (both in DBS and Rucio).
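Purely as an illustration of that schema-change idea (the table and column names below are hypothetical, not the actual WMCore DBS3Buffer schema):

-- hypothetical: persist the injection state per dataset in the agent's local database
ALTER TABLE dbsbuffer_dataset ADD (dbs_injected NUMBER(1) DEFAULT 0);
ALTER TABLE dbsbuffer_dataset ADD (rucio_injected NUMBER(1) DEFAULT 0);
-- DBS3Upload / RucioInjector would then flip these flags once a dataset-level
-- insertion succeeds, instead of re-checking the remote services every cycle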

@vkuznet
Contributor

vkuznet commented Nov 15, 2022

@amaltaro , everything really depends on the implementation. If the code makes the DBS calls for configs, datasets, blocks, etc. sequentially, of course it will be slow. But if the code makes these calls in parallel, then at most you will add 1-2 seconds to the overall time, not more. The checks which I suggested doing before the data injection are very simple DBS queries, since each query contains a specific value: e.g. /datasets?dataset=/a/b/c is a very fast query, and similarly /outputconfigs?dataset=/a/b/c. So if you make N calls in parallel, the overall overhead is the longest query, which I doubt would be more than 1 second. Also, in this scenario there is no need for a client-side cache, since caching will happen on the ORACLE side, which will prepare an index, and if someone requests the same query over and over again it will be served from the ORACLE cache. I certainly do not know the Rucio implementation, but I doubt they do parallel calls either. Since our pycurl_manager supports this, what is required is to prepare a series of URLs and fetch them together. You should start taking advantage of the fast Go server and apply the concurrency model more often in the WMCore codebase.
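As an illustration of that point, a minimal sketch that issues the existence checks concurrently with a thread pool around the same RequestHandler used elsewhere in this thread; the helper names are illustrative, and whether pycurl_manager's own multi-URL helpers would be preferable is left open here:

import os
from concurrent.futures import ThreadPoolExecutor
from WMCore.Services.pycurl_manager import RequestHandler

def existsInDBS(url):
    """Return True if the DBSReader API call behind `url` returns a non-empty list."""
    mgr = RequestHandler()
    data = mgr.getdata(url, params={}, headers={'Accept': 'application/json'},
                       ckey=os.getenv('X509_USER_KEY'),
                       cert=os.getenv('X509_USER_CERT'), decode=True)
    return bool(data)

def runChecksConcurrently(urls, maxWorkers=10):
    """Fire all existence checks in parallel; total latency is roughly the slowest query."""
    with ThreadPoolExecutor(max_workers=maxWorkers) as pool:
        return dict(zip(urls, pool.map(existsInDBS, urls)))

# example usage (hypothetical dataset names)
urls = [
    'https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datasets?dataset=/a/b/MINIAODSIM',
    'https://cmsweb.cern.ch/dbs/prod/global/DBSReader/outputconfigs?dataset=/a/b/NANOAODSIM',
]
print(runChecksConcurrently(urls))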

@amaltaro
Contributor

What I meant by a more expensive development was that we move away from a simple block dump provided to the server towards much more control and checks on the client side:

  • thus creating more HTTP requests (I am very much in favor of NOT making them if not really needed). Relying on the server cache is a bad practice in my opinion, as it generates more network traffic, more frontend and backend threads, more backend resources, etc.
  • creating more checks in the component
  • when running concurrent HTTP calls, we need proper error handling and maybe some retry logic
  • more lines of code, thus eventually more maintenance and complexity in debugging (and places to have a bug).

If we take this path, then I think the best would be to make this stateful (dependent on the local database information).

@vkuznet
Contributor

vkuznet commented Nov 15, 2022

if you are serious about eliminating data race conditions, then this is the only way, i.e. the client should check everything before injection. The implementation details can vary, though. If you need a cache you can use one, but that implies you need to maintain it and dedicate resources to it. Making HTTP calls is cheap in my view, but of course it has its own drawbacks (everything depends on data volume, rate, etc.). Whether to retry or not is again up to the client. The bottom line is that the client should perform all checks before trying to inject the new data. If this is implemented, then we will eliminate the data race issue which is present right now, as well as the failures due to data not existing in the DB (this issue). The only failures that may happen in this scenario are due to database and/or DBS server issues, but in that case we'll have a strong separation of responsibilities between the client (which performed all necessary checks) and the server (which is provided with fully valid input and should insert the data).

@amaltaro
Contributor

Yes, given that we found this to actually be a problem with the workflow itself, in the sense that some merge jobs failed and caused one of the parent datasets not to be created (now pending ACDC creation and execution), I would suggest closing this one out (once we hear back from Hasan). At some point those blocks should succeed, once the ACDC goes through and the (parent) data gets properly produced.

On what concerns the long-term and major refactoring of DBS3Upload, I would suggest updating GH issue #11106 with some of the comments made in this issue (my preference at this very moment is to include a database schema change) and planning it for future quarters.

@vkuznet
Contributor

vkuznet commented Nov 15, 2022

fine with me.

@haozturk
Author

Hi all, apologies for the late reply. We check the following endpoint [1], which I think should include the recovery info that ACDC will use, but it's empty for these workflows. That's why we didn't ACDC them.

[1] https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960%22&include_docs=true&reduce=false

@amaltaro
Contributor

@haozturk I think you looked into the wrong workflow. A few replies above, I investigated the following workflow
https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22cmsunified_task_SMP-RunIISummer20UL16MiniAODv2-00275__v1_T_221103_122546_3035%22&include_docs=true&reduce=false

where we can actually find ACDC entries for the merge task (among others).

@haozturk
Author

haozturk commented Nov 16, 2022

@amaltaro hmm, that's not one of the workflows for which we opened this issue. It's not a "Successful (no job failure) workflow w/ missing output on DBS". If you check the workflows that I provided in the issue body, they don't have ACDC entries.

Regarding that workflow in particular, it has a filemismatch problem where an output file [1] is missing on DBS while it's on Rucio. We don't create ACDCs before fixing the mismatch. Do you think these issues are related?

I haven't read the whole thread. I might be missing some context.

[1] /store/mc/RunIISummer20UL16NanoAODv9/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v2/60000/E9D7E442-299E-FC4B-BA82-7D87E3A82B7D.root

@amaltaro
Contributor

Reopening as there are still ongoing discussions.

In order to answer this question, the best we can do is to check the Global WorkQueue logs (reqmgrInteraction CP thread) and try to match all the cases where extra work has been added to the same workflow.

I don't think it's impossible that we count the same block stats multiple times (when iterating over a workflow that has already been split in GQ); of course it should not happen, and this is just a wild guess at what could have gone wrong in these workflows.

@vkuznet can you please try to scan the global workqueue logs for one of these workflows?

@amaltaro amaltaro reopened this Dec 13, 2022
@vkuznet
Contributor

vkuznet commented Dec 13, 2022

@amaltaro , you put too much faith in my abilities. At the very least I need to know where the GWQ log is: is it on the production server (cmsweb.cern.ch)? Is it available on vocms0750? Is it called workqueue*.log? It would also be useful to know which WMCore code generates the specific log entries and how the nlumis are calculated.

I'm asking because I assume that GWQ is on cmsweb, that its log is on vocms0750, and that the log is called workqueue*.log; if so, then there is nothing in there:

# from vocms0750
grep pdmvserv_Run2022E_DisplacedJet_PromptNanoAODv10_v1_221017_083709_9356 /cephfs/product/dmwm-logs/workqueue*.log

returns nothing.

At the very least I need more information about GWQ, which patterns to look for, and in which log(s).

@vkuznet
Contributor

vkuznet commented Dec 14, 2022

I made an effort to look up the given workflow in the workqueue logs. I made several assumptions which may or may not be true:

  • I assumed that the global workqueue logs are called workflow*.log and that they are located on vocms0750
  • I looked up the pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 workflow in reqmgr2 and found that it was processed from 2022-10-03 till 2022-10-06, see here
  • I found the workflow*.log files for these days on vocms0750, in the /cephfs/product/dmwm-logs/old-logs-20221001-0359.zip archive
  • I unpacked the zip archive to extract the workflow logs and scanned those dates
  • what I found are the following entries:
grep pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 wq/*.log
wq/workqueue-20221006-workqueue-545bc88678-6j7hh.log:INFO:cleanUpTask:Going to delete 1 documents in *workqueue* db for workflow: pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960. Doc IDs: ['a3e6145deeaaf42fb300af39ac515a3c']
wq/workqueue-20221006-workqueue-545bc88678-6j7hh.log:INFO:cleanUpTask:Going to delete 2 documents in *workqueue_inbox* db for workflow: pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960. Doc IDs: ['a3e6145deeaaf42fb300af39ac515a3c', 'pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960']

There is no other information about the given workflow and I have no other clue where to look.

@amaltaro , please provide further instructions on how to deal with this. Better yet, as I requested, it would be nice to point me to the specific codebase which does the initial estimate of the nlumis numbers, to see the logic as well as whether it logs anything.

@amaltaro
Contributor

Valentin, see further comments and instructions below:

I assumed that global workqueue logs are called workflow*.log and they can be located on vocms0750

yes, logs are available in vocms0750:/ceph/production/dmwm-logs. Given that we only keep the last 4 or 5 days' worth of logs and older logs get zipped, you will actually find these logs inside the non-deterministically named zip files, e.g.:
old-logs-20221001-0359.zip --> reqmgrInteractionTask-workqueue-545bc88678-6j7hh-20221005.log
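A minimal example of how one might scan such a zipped log for a workflow without fully unpacking the archive; the file names are the ones quoted above, while the exact member pattern is an assumption:

# stream matching members of the zip to stdout and grep for the workflow name
unzip -p old-logs-20221001-0359.zip 'reqmgrInteractionTask-*' \
    | grep pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960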

I look up pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 workflow in reqmgr2 and found that it was processed from 2022-10-03 till 2022-10-06, see here

this is the way to move forward, checking when the status transitions happened. Given that this one is about global workqueue, we don't really care about the whole lifetime of the workflow, but solely about the transition from staging to staged, which is the moment this workflow is processed in global workqueue (the exception being growing workflows, which can keep acquiring data for a longer period of time). In short, for this workflow we only care about the date 20221005.

Expanding on the content of that reqmgrInteractionTask log file (mentioned above):

2022-10-05 21:13:43,230:INFO:WorkQueueReqMgrInterface:Processing request pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 at https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960/spec
2022-10-05 21:13:43,230:INFO:WorkQueue:queueWork() begin queueing "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960/spec"
2022-10-05 21:13:43,450:INFO:WorkQueue:Executing processInboundWork with 1 inbound_work, throw: True and continuous: False
2022-10-05 21:13:43,542:INFO:WorkQueue:Splitting /pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960/EXO-RunIISummer20UL17NanoAODv9-02666_0 with policy name Dataset and policy params {'name': 'Dataset', 'args': {}}
2022-10-05 21:13:43,962:INFO:WorkQueue:Work splitting completed with 1 units, 0 rejectedWork and 0 badWork
2022-10-05 21:13:43,962:INFO:WorkQueue:Queuing element a3e6145deeaaf42fb300af39ac515a3c for /pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960/EXO-RunIISummer20UL17NanoAODv9-02666_0 with policy Dataset, with 2 job(s) and 179 lumis on /LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM
2022-10-05 21:13:49,576:INFO:WorkQueue:Split work for request(s): "pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960"
2022-10-05 21:13:49,598:INFO:WorkQueueReqMgrInterface:1 units(s) queued for "pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960"
...
2022-10-05 23:43:31,471:INFO:WorkQueue:Workflow pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 has no OpenRunningTimeout. Queuing to be closed.

The last line of this log represents the moment the workflow gets closed for further input data, so no more blocks/stats can be added to it.

In these logs, we can see that indeed 179 lumis were found/calculated. I had a quick look at the DBS entries, and my script reports 0 files marked as invalid (while filesummaries indeed says 100 lumis), so something is very wrong with this workflow.

In terms of source code, this code is quite complex, but it's performed by a Global Workqueue CherryPy thread that starts from this module:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueReqMgrInterface.py
and after loading the workflow spec and figuring out some parameters like the WorkQueue start policy, it will execute one of these modules (again, it depends on the workflow construction: no input, with input, with input MINIAOD, harvesting workflow, ACDC):
https://github.com/dmwm/WMCore/tree/master/src/python/WMCore/WorkQueue/Policy/Start

Maybe the next step is to clone this workflow into one of our dev setups and see if global workqueue again finds the 179 lumis.

@vkuznet
Contributor

vkuznet commented Dec 15, 2022

@amaltaro, I looked up the code and can reproduce the 179 nlumis number. The code in Start/Policy/Dataset.py calls the validBlocks function, which itself calls the getDBSSummaryInfo function for the provided block name. So, our dataset is

/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM

it has 4 blocks:

./dasgoclient -query="block dataset=$d"
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#471e5596-af04-4423-a850-5ef9091f154f
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#6eb03689-167a-472f-8b09-f4bfadad6a8a
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#b8cdec8f-b664-49a6-ab2d-bb2a89893581
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#ff78bb73-0e8c-41cb-9e51-381cfbdf15e2

and getDBSSummaryInfo calls the filesummaries DBS API:

vk@vkair(10:30:56)$ blk1=/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM%23471e5596-af04-4423-a850-5ef9091f154f
[~/CMS/DMWM/GIT/wflow-dbs, main+1]
vk@vkair(10:30:59)$ blk3=/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM%23b8cdec8f-b664-49a6-ab2d-bb2a89893581
[~/CMS/DMWM/GIT/wflow-dbs, main+1]
vk@vkair(10:31:19)$ blk4=/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM%23ff78bb73-0e8c-41cb-9e51-381cfbdf15e2
[~/CMS/DMWM/GIT/wflow-dbs, main+1]
vk@vkair(10:31:37)$ scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?block_name=$blk1"
[
{"file_size":9407933969,"max_ldate":1664655121,"median_cdate":1664655121,"median_ldate":1664655121,"num_block":1,"num_event":111000,"num_file":4,"num_lumi":97}
]
[~/CMS/DMWM/GIT/wflow-dbs, main+1, 1s]
vk@vkair(10:31:42)$ scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?block_name=$blk2"
[
{"file_size":5638876233,"max_ldate":1662652610,"median_cdate":1662652610,"median_ldate":1662652610,"num_block":1,"num_event":72000,"num_file":1,"num_lumi":72}
]
[~/CMS/DMWM/GIT/wflow-dbs, main+1, 1s]
vk@vkair(10:31:44)$ scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?block_name=$blk3"
[
{"file_size":726640358,"max_ldate":1664151901,"median_cdate":1664151901,"median_ldate":1664151901,"num_block":1,"num_event":9000,"num_file":1,"num_lumi":9}
]
[~/CMS/DMWM/GIT/wflow-dbs, main+1, 1s]
vk@vkair(10:31:47)$ scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?block_name=$blk4"
[
{"file_size":105745614,"max_ldate":1664655121,"median_cdate":1664655121,"median_ldate":1664655121,"num_block":1,"num_event":1000,"num_file":1,"num_lumi":1}
]

If you sum up num_lumi across all blocks you'll get 179 :)

>>> 97+72+9+1
179

Meanwhile, the filesummaries API for the dataset returns 100:

scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=$d"
[
{"file_size":15879196174,"max_ldate":1664655121,"median_cdate":1664655121,"median_ldate":1664655121,"num_block":4,"num_event":193000,"num_file":7,"num_lumi":100}
]

So, the issue here is that the filesummaries API provides different results for the dataset and for its blocks. I looked at the DBS queries and they differ as follows:

  • dataset nlumis lookup
(select count(*) from (select distinct l.lumi_section_num, l.run_num from {{.Owner}}.files f
 join {{.Owner}}.file_lumis l on l.file_id=f.file_id
 join {{.Owner}}.datasets d on d.DATASET_ID = f.dataset_id
{{if .Valid}}
  JOIN {{.Owner}}.DATASET_ACCESS_TYPES DT ON  DT.DATASET_ACCESS_TYPE_ID = D.DATASET_ACCESS_TYPE_ID
{{end}}
 where d.dataset=:dataset wheresql_isFileValid)
) as num_lumi
  • block nlumis lookup
(select count(*) from (select distinct l.lumi_section_num, l.run_num from {{.Owner}}.files f
 join {{.Owner}}.file_lumis l on l.file_id=f.file_id
 join {{.Owner}}.blocks b on b.BLOCK_ID = f.block_id
{{if .Valid}}
  JOIN {{.Owner}}.DATASETS D ON  D.DATASET_ID = F.DATASET_ID JOIN {{.Owner}}.DATASET_ACCESS_TYPES DT ON  DT.DATASET_ACCESS_TYPE_ID = D.DATASET_ACCESS_TYPE_ID
{{end}}
 where b.BLOCK_NAME=:block_name wheresql_isFileValid)
) as num_lumi

The difference is an additional join on the DATASET_ACCESS_TYPES table in the dataset query. I need time to investigate the DBS queries further, as I inherited them from the Python-based code. I'll report later with more results on how the DBS queries differ for datasets and blocks, but obviously this is the root of the problem reported here.
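To make the arithmetic above concrete, here is a toy illustration (made-up numbers, not the actual block contents) of why summing per-block distinct (run, lumi) counts can exceed the dataset-level distinct count when blocks share lumis:

# per-block sets of (run, lumi) pairs; the same pair can appear in several blocks
block_a = {(1, 1), (1, 2), (1, 3)}
block_b = {(1, 3), (1, 4)}          # (1, 3) also lives in block_a

per_block_sum = len(block_a) + len(block_b)   # 5, what summing per-block counts gives
dataset_distinct = len(block_a | block_b)     # 4, what the dataset-level query gives
print(per_block_sum, dataset_distinct)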

@amaltaro
Contributor

Updating the comment above that tagged the "wrong" Alan (sorry about that!)

@amaltaro
Contributor

amaltaro commented Dec 15, 2022

@vkuznet without looking carefully into this, my hypothesis is that filesummaries is actually returning UNIQUE tuples of run/lumi:

distinct l.lumi_section_num, l.run_num 

which means that there are files in different blocks (and maybe even in the same block) that have exactly the same run/lumi tuple. Hence DBS returns a smaller number of lumis when queried by dataset.

Given that this dataset is pretty small, I would suggest retrieving all the run/lumis for all files in this dataset, ordering them and checking how many duplicates we have (if any).

@vkuznet
Contributor

vkuznet commented Dec 15, 2022

yes, this is the case: different blocks have the same run/lumi tuples. And sorting them and taking the unique set gives me 100 unique lumis. So the mystery is solved.

That said, the remedy in WMCore should be the following:

  • fix the validBlocks call in Start/Policy/Dataset.py so that getDBSSummaryInfo, instead of the filesummaries DBS API, uses the filelumis DBS API, which provides the list of files and their run/lumis
  • collect all results and extract the unique number of run/lumi pairs

For example:

# for block b1, get this output
scurl "https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader/filelumis?block_name=$b1"
# it will provide this JSON
[
{"event_count":1000,"logical_file_name":"/store/mc/RunIISummer20UL17MiniAODv2/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/60000/99CB45DF-2F92-4249-B57C-81E777C33EEB.root","lumi_section_num":34,"run_num":1}
,{"event_count":1000,"logical_file_name":"/store/mc/RunIISummer20UL17MiniAODv2/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/60000/F2C290E2-1D56-4F46-A590-2453E052FF52.root","lumi_section_num":1,"run_num":1}
...
]

Now, repeat this for all blocks and extract the run/lumi pairs. Then make a set of this list and take its size.

Here is a simple Python script which does exactly that; it returns 100 as expected:

#!/usr/bin/env python3
import os
from WMCore.Services.pycurl_manager import RequestHandler

def blockLumis(blocks):
    """Return the number of unique (lumi, run) pairs across the given blocks."""
    mgr = RequestHandler()
    pairs = set()
    for blk in blocks:
        # URL-encode the '#' that separates the dataset name from the block UUID
        blk = blk.replace('#', '%23')
        url = 'https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader/filelumis?block_name={}'.format(blk)
        ckey = os.getenv('X509_USER_KEY')
        cert = os.getenv('X509_USER_CERT')
        data = mgr.getdata(url, params={}, headers={'Accept': 'application/json'}, ckey=ckey, cert=cert, decode=True)
        for row in data:
            pair = (row['lumi_section_num'], row['run_num'])
            pairs.add(pair)
    return len(pairs)

blocks = [
    '/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#471e5596-af04-4423-a850-5ef9091f154f',
    '/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#6eb03689-167a-472f-8b09-f4bfadad6a8a',
    '/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#b8cdec8f-b664-49a6-ab2d-bb2a89893581',
    '/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#ff78bb73-0e8c-41cb-9e51-381cfbdf15e2'
]

res = blockLumis(blocks)
print(res)

@amaltaro
Contributor

Thank you for this investigation, Valentin.

Could you please check if the output dataset:
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM

also has those 179 lumis? If a merge job has multiple files with the same run/lumi, then the output would carry the unique information. But if the unmerged run/lumis are scattered across different merge jobs, then I think it's possible that the output dataset would have duplicate run/lumis. In that case, using filesummaries for input discovery isn't wrong.

@vkuznet
Contributor

vkuznet commented Dec 16, 2022

In this issue #11403 (comment) I provided two examples of Python functions, blockLumis and concurrentBlockLumis, which can be used to check the number of lumis and avoid duplicates. I tested both functions; concurrentBlockLumis only requires the time of a single block call and will therefore be much more efficient for a dataset with a large number of blocks. If necessary, both functions can be added to Start/Policy/Dataset.py, but that will require the fix in pycurl_manager I provided in #11404.

@vkuznet
Contributor

vkuznet commented Dec 16, 2022

@amaltaro , regarding NANOAODSIM we also have a discrepancy. The nlumis count for the dataset is 100, the dataset has two blocks, and the sum of their nlumis is 32+99=131. But using my code from #11403 (comment), both functions blockLumis and concurrentBlockLumis properly report 100 lumis for the provided blocks.

How would you like to move forward with this? My suggestion is to add both the blockLumis and concurrentBlockLumis functions to Start/Policy/Dataset.py and either switch to one of them, or add a UniqueNumLumis attribute to the outgoing JSON to avoid the mess of counting unique lumis for a list of blocks.

@vkuznet
Contributor

vkuznet commented Jan 6, 2023

@amaltaro , is there anything left for this issue? My understanding is that we have fully debugged it and now understand its cause. We provided tools (either the WMCore python script or the wflow-dbs service) to data-ops, and I wonder if we need to keep this issue open. If so, it would be nice to list the action items required to move forward with it. Thanks.

@amaltaro
Contributor

amaltaro commented Jan 6, 2023

Yes, I think we can declare this issue resolved. In the future, we still have to think of a more sustainable way to find how many total and how many unique lumis were expected to be processed (which might differ between the beginning and the end of the workflow lifetime as well).

@haozturk please reopen it in case there is anything else missing.

@amaltaro amaltaro closed this as completed Jan 6, 2023
@amaltaro amaltaro assigned amaltaro and unassigned todor-ivanov Jan 6, 2023
@amaltaro amaltaro added the QPrio: High quarter priority label Jan 6, 2023
@amaltaro
Contributor

amaltaro commented Mar 2, 2023

Qier was asking about the following workflow:
pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871

which keeps coming back in operations as "noRecoveryDoc", thus without any ACDC documents to be recovered.

I decided to run it over the service that Valentin deployed in cmsweb-testbed and here is the output:

$ curl -k --cert $X509_USER_CERT --key $X509_USER_KEY --cacert $X509_USER_CERT "https://cmsweb-testbed.cern.ch/wflow-dbs/stats?workflow=pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871"
[
   {
      "Workflow": "pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871",
      "TotalInputLumis": 27653,
      "InputDataset": "/MET/Run2018C-15Feb2022_UL2018-v1/AOD",
      "OutputDataset": "/MET/Run2018C-UL2018_MiniAODv2_GT36-v1/MINIAOD",
      "InputStats": {
         "num_lumi": 27653,
         "num_file": 1188,
         "num_event": 31219922,
         "num_block": 29,
         "num_file_lumis": 20957,
         "unique_file_lumis": 20957,
         "filesummaries_lumis": 27653,
         "num_invalid_files": 1187
      },
      "OutputStats": {
         "num_lumi": 27605,
         "num_file": 432,
         "num_event": 31144738,
         "num_block": 14,
         "num_file_lumis": 27503,
         "unique_file_lumis": 27503,
         "filesummaries_lumis": 27605,
         "num_invalid_files": 431
      },
      "Status": "WARNING: number of lumis differ 27653 != 27605, number of events differ 31219922 != 31144738",
      "ElapsedTime": 6.685214328
   }
]

from the report above, it looks like the input data contains thousands of duplicate lumis (based on unique_file_lumis).

Actually, having a second look at these input metrics:

         "num_file": 1188,
         "num_invalid_files": 1187

it looks like the input dataset has only 1 valid file(!). I did a spot check and I think this is actually wrong, given that all 6 files in this block are actually valid:
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/files?block_name=/MET/Run2018C-15Feb2022_UL2018-v1/AOD%23130e6704-0fa9-4675-848a-e80345d94640&detail=true

@vkuznet could you please review how you count those (output stats seem to be miscounted as well)?

@vkuznet
Contributor

vkuznet commented Mar 3, 2023

Alan, yes, there was an error (a mix-up of valid vs invalid in the DBS API query). The server is now fixed and reports:

scurl "https://cmsweb-testbed.cern.ch/wflow-dbs/stats?workflow=pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871"
[
   {
      "Workflow": "pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871",
      "TotalInputLumis": 27653,
      "InputDataset": "/MET/Run2018C-15Feb2022_UL2018-v1/AOD",
      "OutputDataset": "/MET/Run2018C-UL2018_MiniAODv2_GT36-v1/MINIAOD",
      "InputStats": {
         "num_lumi": 27653,
         "num_file": 1188,
         "num_event": 31219922,
         "num_block": 29,
         "num_file_lumis": 27118,
         "unique_file_lumis": 27118,
         "filesummaries_lumis": 27653,
         "num_invalid_files": 0
      },
      "OutputStats": {
         "num_lumi": 27605,
         "num_file": 432,
         "num_event": 31144738,
         "num_block": 14,
         "num_file_lumis": 27541,
         "unique_file_lumis": 27541,
         "filesummaries_lumis": 27605,
         "num_invalid_files": 0
      },
      "Status": "WARNING: number of lumis differ 27653 != 27605, number of events differ 31219922 != 31144738",
      "ElapsedTime": 8.060713354
   }
]

@z4027163

z4027163 commented Mar 6, 2023

Do you know what the problem with this workflow is? The num_invalid_files is 0, so it doesn't look like the invalidation issue.

@amaltaro
Contributor

amaltaro commented Mar 6, 2023

@z4027163 the issue with that workflow is not related to invalid files, but to the number of unique (or duplicate) run/lumis in the input, see:

         "num_lumi": 27653,
         "unique_file_lumis": 27118,

from the report above. Does that answer the remaining question that was reported at the CompOps meeting?

@z4027163

z4027163 commented Mar 6, 2023

@vkuznet Can you give more details on the meaning of "unique_file_lumis"? I am a bit surprised that the output has a higher value than the input dataset.

@vkuznet
Contributor

vkuznet commented Mar 6, 2023

@z4027163 , it is set here: https://github.com/vkuznet/wflow-dbs/blob/main/dbs.go#L37 and calculated in this function: https://github.com/vkuznet/wflow-dbs/blob/main/dbs.go#L194 In plain English, it is the number of lumis returned by the filelumis DBS API for a given block name (I resolve a dataset into block names, then query filelumis for every block and calculate the unique number of run-lumi pairs).

@amaltaro
Contributor

amaltaro commented Mar 6, 2023

@vkuznet according to the code, is it correct to say that unique_file_lumis and num_file_lumis contain the same information?

@vkuznet
Contributor

vkuznet commented Mar 7, 2023

@amaltaro , num_file_lumis represents the total number of run-lumi pairs from the filelumis API for all blocks in a dataset, while unique_file_lumis represents the unique number of run-lumi pairs from the filelumis API for all blocks in a dataset. They may be the same or may differ, but yes, they contain similar information. Please see their assignments in the code:
