attempt to streamline parent file discovery #8352

Merged 3 commits on Apr 25, 2024


davidlange6
Contributor

After a bit of discussion with @belforte, this is some untested code that should considerably reduce the time spent checking for overlapping lumis when looking for secondary files.

@cmsdmwmbot

Can one of the admins verify this patch?

@belforte self-assigned this on Apr 24, 2024
@belforte
Member

To get a baseline, I timed the use case that prompted this, https://cms-talk.web.cern.ch/t/crab-task-queued-on-command-submit-for-a-while/39714 i.e.
primary dataset: /WtoLNu-4Jets_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v2/MINIAODSIM
secondary dataset: /WtoLNu-4Jets_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Summer22EEDRPremix-124X_mcRun3_2022_realistic_postEE_v1-v2/AODSIM

The primary dataset has 6833 files and the secondary dataset has 31962 files.
The code is like:

for pfile in primary:  # loop 1
  for sfile in secondary:  # loop 2
    find parents of pfile by matching lumis with sfile

I found that each iteration of loop 1 takes 15~20 seconds. Of course I killed it after a few tens of iterations, but all files should be pretty similar.

For a total on the order of 30 hours.
(Maybe I should have let that task run 😁!)
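
The estimate above is plain arithmetic: 6833 iterations of loop 1 at 15 to 20 seconds each. A minimal sketch of the calculation, assuming every iteration costs about the same; `estimate_hours` is a hypothetical helper name used only for illustration and is not part of CRAB:

```python
def estimate_hours(n_primary_files, seconds_per_iteration):
    """Total wall time in hours if every loop-1 iteration costs the same."""
    return n_primary_files * seconds_per_iteration / 3600.0

N_PRIMARY = 6833  # primary dataset file count quoted in the comment above

low = estimate_hours(N_PRIMARY, 15)   # lower bound, 15 s per iteration
high = estimate_hours(N_PRIMARY, 20)  # upper bound, 20 s per iteration
print(f"{low:.1f} to {high:.1f} hours")
```

This gives roughly 28 to 38 hours, consistent with the quoted order of 30 hours.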

@belforte
Member

With the new code (from this PR), the time for each iteration went down to 0.15~0.20 seconds. A neat 100x improvement.
Looking forward to a "reasonable" 20 minutes for the whole match.

🙇‍♂️

Onward to validation
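
For readers wondering how a nested scan of this kind can get so much faster: one common technique is to index the secondary files by (run, lumi) once, then look up each primary file's lumis in that index instead of rescanning every secondary file. The sketch below shows that general idea on toy data. It is an assumption for illustration only, not the code merged in this PR; the function names and the dict-of-sets data layout are hypothetical.

```python
from collections import defaultdict

def find_parents_nested(primary, secondary):
    """Baseline: for each primary file, scan every secondary file."""
    parents = {}
    for pname, plumis in primary.items():        # loop 1
        matched = set()
        for sname, slumis in secondary.items():  # loop 2
            if plumis & slumis:                  # any shared (run, lumi)
                matched.add(sname)
        parents[pname] = matched
    return parents

def build_lumi_index(secondary):
    """Map each (run, lumi) to the secondary files that contain it."""
    index = defaultdict(set)
    for sname, slumis in secondary.items():
        for lumi in slumis:
            index[lumi].add(sname)
    return index

def find_parents_indexed(primary, secondary):
    """Same result as the baseline, using one pass to build an index."""
    index = build_lumi_index(secondary)
    parents = {}
    for pname, plumis in primary.items():
        matched = set()
        for lumi in plumis:
            matched |= index.get(lumi, set())    # .get avoids adding keys
        parents[pname] = matched
    return parents

# Toy data (hypothetical file names): file name -> set of (run, lumi) pairs.
primary = {"p1.root": {(1, 1), (1, 2)}, "p2.root": {(1, 5)}}
secondary = {
    "s1.root": {(1, 1)},
    "s2.root": {(1, 2), (1, 3)},
    "s3.root": {(1, 5), (1, 6)},
    "s4.root": {(2, 1)},
}

nested = find_parents_nested(primary, secondary)
indexed = find_parents_indexed(primary, secondary)
```

Both functions return the same mapping here. The indexed version visits each secondary file once while building the index, so its cost grows with the total number of lumis rather than with the product of the two file counts.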

Three resolved (now outdated) review threads on src/python/TaskWorker/Actions/DBSDataDiscovery.py
@belforte
Member

I tested on
primary: /DoubleMuon/Run2018B-02Apr2020-v1/NANOAOD
secondary: /DoubleMuon/Run2018B-17Sep2018-v1/MINIAOD

and got identical results

@belforte merged commit 616470a into dmwm:master on Apr 25, 2024