Increasing Number of workflows with Duplicates Lumis #11956
Comments
@hassan11196 Hassan, I am not saying I am going to debug it :) But just to help WM on debugging it, would you have further details for one of these workflows? Which dataset? Which run/lumi and file have dups? |
@hassan11196 Hassan, I was looking for a workflow with very few stats, and I found this cmsunified_task_EXO-Run3Summer22MiniAODv4-00662__v1_T_240213_134751_3377, which has been sitting in
I implemented a lumi check at 3 levels: dataset, block and file; and indeed they differ, as can be seen in the following table, where we can see 4 duplicate lumis in the input dataset. In case you want to check on your side, here is the duplicate report for the input dataset (which can potentially cause duplicate lumis in the output, and it did!):
Lastly, this dataset above was (potentially) produced by 3 workflows, according to the ReqMgr2 request API. Just for your information, the distinction between those 3 different ways to calculate lumis in the dataset is:
If you want, I can polish my python notebook and share it with you tomorrow, such that you can check the other workflows. PS: the workflow that produced the dataset above did not have any input dataset, so the duplication originated in the direct parent workflow. |
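For anyone who wants to reproduce this kind of check before the notebook is shared, below is a minimal sketch of a three-level lumi comparison against DBS. It is not the actual notebook; it assumes the dbs3-client package, a valid grid proxy, and the usual DBS output fields (num_lumi, run_num, lumi_section_num), and the dataset name is a placeholder.

```python
# Minimal sketch (not the actual notebook): compare lumi counts at dataset, block
# and file level in DBS, and flag duplicated run/lumi pairs.
# Assumes: dbs3-client installed and a valid X509 proxy; dataset name is a placeholder.
from collections import Counter
from dbs.apis.dbsClient import DbsApi

DBS_URL = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader"
dataset = "/SomePD/SomeEra-SomeProcStr-v1/MINIAODSIM"  # placeholder dataset

dbs = DbsApi(url=DBS_URL)

# Dataset-level lumi count (num_lumi counts lumi sections, not unique run/lumi pairs)
dsLumis = dbs.listFileSummaries(dataset=dataset, validFileOnly=1)[0]["num_lumi"]

# Block-level lumi count, summed over all blocks of the dataset
blocks = [b["block_name"] for b in dbs.listBlocks(dataset=dataset)]
blkLumis = sum(dbs.listFileSummaries(block_name=b, validFileOnly=1)[0]["num_lumi"]
               for b in blocks)

# File-level run/lumi pairs, so actual duplicates can be identified
pairs = Counter()
for block in blocks:
    for fLumi in dbs.listFileLumis(block_name=block):
        for lumi in fLumi["lumi_section_num"]:
            pairs[(fLumi["run_num"], lumi)] += 1

print("dataset num_lumi:", dsLumis)
print("sum of block num_lumi:", blkLumis)
print("unique run/lumi pairs:", len(pairs))
print("duplicated run/lumi pairs:", {p: n for p, n in pairs.items() if n > 1})
```

When the three numbers disagree, the per-file run/lumi map is the one that tells which files actually share a lumi section.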
Hi @amaltaro, Thank you for this investigation. I did not update the ticket, but I was able to find the files that had duplicated lumis using our scripts; I was not able to find the lumis yet. Please do share your notebook (even in an unpolished state), it will help me a lot. We now have this tool in our dashboard which we use to find the files to invalidate for duplicate lumis. I will provide you with an updated list of workflows that had duplicated lumis, along with the list of files and run/lumi numbers.
You mentioned the same query for block and dataset. So inside the filesummaries for a dataset, there are filesummaries for each block? Another question I have is: if multiple workflows are writing to a dataset, what are the parameters that control which lumi numbers each workflow outputs? I assume
Lastly, I want to minimize your time spent on trivial things given WMCore's priorities this quarter; you can just give me pointers to find the stuff, and I will gather all logs and relevant info for you in a single place and make it easy for you to narrow down the issue. Thanks a lot Alan. |
@hassan11196 Hassan, I think we covered most of this over zoom today, but please let me know if anything needs follow up.
If you planned on having multiple workflows writing to the same output dataset, then yes, you need to use
However, most - if not all - of these duplicate lumis happen unintentionally, and you can either invalidate the given files in DBS or recreate the output dataset in a v++ setup. We briefly discussed the notebook today, and here it is:
I am moving this issue to waiting; once we confirm with another workflow or two that there is no problem on the WM side, I would suggest getting it closed. Please let us know how it goes. Thanks! |
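For reference, a minimal sketch of invalidating a single file in DBS with the dbs3-client writer API is shown below. The LFN is a placeholder, and the updateFileStatus call and its signature are assumed from the dbs3-client documentation; in practice this is normally done through the standard invalidation tools rather than by hand.

```python
# Sketch only: mark one file invalid in DBS so it is excluded from further processing.
# Assumes dbs3-client, a proxy with DBS write permission, and that updateFileStatus
# behaves as in the dbs3-client docs; the LFN below is a placeholder.
from dbs.apis.dbsClient import DbsApi

writer = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSWriter")
lfn = "/store/mc/SomeEra/SomePD/MINIAODSIM/.../file-with-duplicate-lumis.root"

# is_file_valid=0 invalidates the file; is_file_valid=1 would revalidate it
writer.updateFileStatus(logical_file_name=lfn, is_file_valid=0)
```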
Hello @amaltaro https://docs.google.com/document/d/1bH6etTBucsw5F_wUKiqHBHuN1fRAlLUuGEbAaGAwd2g/edit?usp=sharing Thank you. |
Hi @amaltaro, for background context: we reproduced this dataset as v2, but there is a discrepancy between MiniAOD and NanoAOD events and lumis.
Our initial suspicion was that it might be caused by duplicated lumis. However, our duplication check in Unified did not detect any issues. I tried the
Here is a complete list of duplicated lumis:
/Muon1/Run2023C-22Sep2023_v4-v2/MINIAOD
/Muon1/Run2023C-22Sep2023_v4-v2/NANOAOD
Total duplicated lumis MiniAOD: 372 unique lumis that are duplicated -> 865 duplicated lumi entries in total. @amaltaro can you tell me how to verify the duplicates from the ROOT file? I have downloaded one file. |
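One way to cross-check directly on a downloaded file is sketched below, under the assumption that it is a NanoAOD file, whose LuminosityBlocks tree carries run and luminosityBlock branches; for MiniAOD/EDM files a CMSSW utility such as edmLumisInFiles.py would be the natural tool instead. The file name is a placeholder.

```python
# Sketch: dump the (run, lumi) pairs contained in a locally downloaded NanoAOD file.
# Assumes uproot is available and the file has a "LuminosityBlocks" tree with
# "run" and "luminosityBlock" branches; the file name is a placeholder.
from collections import Counter
import uproot

fname = "downloaded_nanoaod_file.root"  # placeholder for the locally downloaded file

with uproot.open(fname) as rootfile:
    lumiTree = rootfile["LuminosityBlocks"]
    runs = lumiTree["run"].array(library="np")
    lumis = lumiTree["luminosityBlock"].array(library="np")

pairs = Counter(zip(runs.tolist(), lumis.tolist()))
print("run/lumi pairs in this file:", len(pairs))
# Cross-file duplicates are the real concern, so the same dump from a second file
# can be intersected with this one to confirm a duplicate reported by DBS.
for (run, lumi), count in pairs.items():
    if count > 1:
        print("pair repeated within the same file:", run, lumi, "x", count)
```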
@hassan11196 Ahmed, I have not yet checked the list you provided above. |
Hi @amaltaro, I was reviewing one of the files [1] mentioned above using the DBS API and noticed that it had duplicate lumis but different run numbers. It seems that the
Before:
After:
So can you confirm that this was a false alarm? [1] |
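To illustrate the point raised above: keying a duplicate check on the lumi number alone gives false positives when the same lumi number appears under different runs. A sketch of the comparison, keyed on (run, lumi) pairs and assuming the dbs3-client listFileLumis output format, follows; the LFN is a placeholder.

```python
# Sketch: lumi-only keys vs run/lumi keys when checking a file for duplicates in DBS.
# Assumes dbs3-client; the LFN is a placeholder.
from collections import Counter
from dbs.apis.dbsClient import DbsApi

dbs = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")
lfn = "/store/data/Run2023C/Muon1/MINIAOD/22Sep2023_v4-v2/.../some-file.root"  # placeholder

lumiOnly = Counter()
runLumi = Counter()
for entry in dbs.listFileLumis(logical_file_name=lfn):
    for lumi in entry["lumi_section_num"]:
        lumiOnly[lumi] += 1                     # misleading key: ignores the run number
        runLumi[(entry["run_num"], lumi)] += 1  # correct key: run/lumi pair

print("lumis repeated (possibly a false alarm):", [l for l, n in lumiOnly.items() if n > 1])
print("real duplicate run/lumi pairs:", [p for p, n in runLumi.items() if n > 1])
```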
@hassan11196 Ahmed, if those same lumis belong to different run numbers, then it is definitely NOT a duplication. Thank you very much for spotting that. I will take this opportunity to update the python notebook in my repository (but if you prefer, feel free to share your current code and I can push that in as well). Maybe we should re-do such tests with the previous 22 (?) workflows that you reported with dup lumis? |
Hi @amaltaro, Thank you |
Hi @amaltaro @hassan11196 @haozturk, I confirm what is reported in the last message from Ahmed. I think we can proceed with understanding the source of the duplicated lumis across different files. |
I am afraid that these workflows are now old enough that we are no longer able to find their logs anywhere. In addition, it looks like we are not collecting logs for production/processing tasks in StepChain, because I could only find logs for merge jobs under CERN EOS:
and these 2 tarballs contain Merge logs for the NANOAODSIM output with duplicate lumi number 7:
These are the contents of the FJRs (framework job reports) from the relevant merge jobs:
and
My suggestion is then to either invalidate things and move forward, or start these workflows from scratch. @hassan11196
@hassan11196 This is a small workflow though (cmsunified_task_EXO-Run3Summer22MiniAODv4-00662__v1_T_240213_134751_3377) and it would be interesting to clone it to see if the problem reoccurs. Hassan, Ahmed, is it something that you could do? |
From the Google document, I see that the last workflow reported with duplicate lumis is:
I would suggest you follow up on that one (either invalidate the relevant files, or produce that dataset all over again). Input data is
@hassan11196 Ahmed, can you please also share your modified version of the initial jupyter notebook? |
@amaltaro shared with you on mattermost. |
Thanks Ahmed!
and there are no duplicate lumis in the output data, as can be seen below:
Based on that, I am inclined to say that there is nothing particular to the workflow and/or job splitting that triggers this duplicate run/lumi, so I don't think it is an edge case in the job splitting. More debugging is needed though. |
@amaltaro @hassan11196 maybe we can consider running a backfill workflow with the same config, to confirm this result? IMU, backfill agents are closer to the production agents |
Andrea provided me with a list of duplicated AODSIM run/lumis for the following dataset:
Zooming in on the following files with duplicate run/lumi:
we managed to find the merge job log for the "1:6" file, while the second file
@@@@@@@@@@@
While for the second merged file
@@@@@@@@@@@
For the second merged file, if we search for the unmerged file
For the record, the script used for searching for a given file id name inside the tarballs is: https://raw.githubusercontent.com/amaltaro/ProductionTools/master/untarLogArchive.py |
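The linked untarLogArchive.py is the actual tool; the snippet below is just a minimal sketch of the same idea, walking a directory of logArchive tarballs and grepping their members for a string, with the directory and search string as placeholders.

```python
# Sketch (not the linked script): search logArchive tarballs for a given file id/name.
import os
import tarfile

logDir = "/eos/cms/store/logs/prod/..."   # placeholder: directory holding the tarballs
needle = "unmerged-file-id-or-lfn"        # placeholder: string to search for

for entry in sorted(os.listdir(logDir)):
    if not entry.endswith((".tar.gz", ".tgz", ".tar.bz2", ".tar")):
        continue
    tarPath = os.path.join(logDir, entry)
    with tarfile.open(tarPath, "r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            payload = tar.extractfile(member).read()
            if needle.encode() in payload:
                print(f"{tarPath}: match in {member.name}")
```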
Now, if you only want to read about the actual culprit (at least for this workflow): the problem seems to be with WorkQueueManager, which actually acquired the same WorkQueue element (WQE) in multiple agents! Log from submit7:
Log from vocms0252:
There is an offset of about 5 min between these agents. Actually, even if it was an offset of half a second, this should NEVER have happened. This is a serious bug that needs high-priority attention! I am going to create a new ticket to properly reflect this problem; then we can decide whether we close this one or not (or whether we close it together with the to-be-created ticket). UPDATE: here is the ticket #12041 |
Thank you for narrowing down the issue @amaltaro. Do let me know if I can help in replicating the WQE being picked by multiple agents. |
Hi @amaltaro, As I mentioned in today's WMCore Dev meeting, I found a recent workflow, acquired after 2 August 2024, to have duplicated lumis in its output. Here are the details; workflow name:
Do let me know if you need anything else. Thank you. |
I still don't have a final answer on what happened with the workflow above, but I can confirm that it is not the same issue that we were having (multiple agents pulling the same WQE), as I couldn't see anything in the other agents; only vocms0281 worked on this workflow. I confirmed that the duplicate lumi
and
Using a modified version of
and
As can be seen, this script does not find any occurrence of
The only explanation I can give for that is that the job producing that file was actually retried somehow, hence overwriting the wmagentJob.log with the content of the new retry. But, if it was retried, it cannot be successful, and that file cannot get assigned to a merge job... so this hypothesis is pretty weak! |
I decided to revisit this workflow above and try to make sense of the duplicate lumis. Here are new important findings. I looked again into the Production job previously mentioned, copying it here:
and noticed it was retried once. Given that
and
If the job was retried, then it must have been seen as failed by JobAccountant before being accepted as a successful job. This can be confirmed in the component logs:
Now we can look into the merge jobs that produced these 2 files (as reported by Ahmed above):
and those logs are reported in my post above, which I also copy here:
where the merge job |
As shown in the comment above, the output files of a successful job that was marked as failed in JobAccountant (hence, an actual failed job) were actually fed as input to merge jobs. This is a bug and it should never have happened! Looking into the JobAccountant log, I see this:
Based on this log, I am inclined to say that we have a bug in the way JobAccountant deals with these ill-behaved cases where an output file is reported as not having any location. Thanks to git blame, I can see that this issue was likely caused by a pull request I provided last year:
@hassan11196 given that the reason for this duplicate lumi is completely different from what we were debugging, I would like to close this issue out and open a new bug ticket to address the aforementioned problem in WMAgent. |
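Purely as an illustration of the failure mode, with entirely hypothetical names and not actual WMAgent/JobAccountant code: a job report whose output file carries no location should be treated as a failure, and none of its files should be registered for merging; otherwise a retry of the same job can re-produce the same run/lumis and yield exactly this kind of duplication.

```python
# Hypothetical illustration only; NOT WMAgent/JobAccountant code, all names made up.
def handle_report(report):
    """report: {"files": [{"lfn": str, "locations": list}, ...]} (hypothetical shape)."""
    for outfile in report["files"]:
        if not outfile.get("locations"):
            # No location reported: fail the whole report and register nothing for
            # merging; otherwise a retry of the same job can produce the same lumis twice.
            return {"status": "failed", "filesForMerge": []}
    return {"status": "success", "filesForMerge": [f["lfn"] for f in report["files"]]}


# Mimicking the problematic case: the output file comes with an empty location list
badReport = {"files": [{"lfn": "/store/unmerged/.../output.root", "locations": []}]}
print(handle_report(badReport))  # {'status': 'failed', 'filesForMerge': []}
```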
@hassan11196 I have created the following issue to address the recent duplicate lumi problem, whose root cause is the mishandling of an exceptional job report in JobAccountant. I am closing this issue out, but please let me know if there is anything else missing. Thank you for all your help so far! |
Impact of the bug
Duplicate lumis in output files affect the output datasets. Workflows with duplicate lumis are not announced automatically and require manual operations from P&R to remove the files with duplicate lumis.
Describe the bug
There has been an increase in workflows with duplicate lumis in their outputs over the past few weeks.
Monitoring Link: https://monit-grafana.cern.ch/goto/W0feXBxSR?orgId=11
It has also affected RelVal workflows, as described in this ticket:
https://its.cern.ch/jira/browse/CMSPROD-165
A recent example of a workflow with duplicates:
For this workflow, I have invalidated the files with duplicate lumis in DBS.
How to reproduce it
I can try submitting one of the above workflows as a backfill and see if its output also has duplicates.
Expected behavior
Output datasets should not have files with duplicate lumis.