WMAgent: continuous update of the pileup availability #11619

Closed
todor-ivanov opened this issue Jun 27, 2023 · 3 comments · Fixed by #11884

@todor-ivanov
Contributor

todor-ivanov commented Jun 27, 2023

Impact of the new feature
MSPileup

Is your feature request related to a problem? Please describe.
The library implemented in this ticket will have to be incorporated into this new WMComponent: #11733

Due to the huge size of the Premix pileup datasets held on disk, we are forced to start supporting a mechanism for partial data placement for them. Currently they are held mostly at CERN and FNAL. A typical size is of the order of 0.5-1 PB.

In order to implement this, we need to allow MSPileup to be configured to place only a fraction of the full dataset on disk. That means removing the constraint that a full container is locked under the same RSE, and supporting Rucio dataset-based data placement. This fraction depends only on the minimum allowed ratio between the number of unique events in the Premix pileup and the number of events processed by the workflow (R). Currently the commonly accepted ratio is 0.5, and this number should be provided by PPD.

This number should also be dynamic, meaning we should allow it to be altered during a campaign's lifetime.

Describe the solution you'd like
UPDATE: The new WMAgent component (WorkflowUpdater) will contact MSPileup to fetch an up-to-date location for each pileup that is currently active in the agent (i.e., one that is requested by an active workflow).
If there are differences between the pileup configuration in the sandbox and the pileup configuration/location in MSPileup (which is live-ish data), then this component is supposed to update the workflow sandbox (including the sandbox tarball).

More in-depth details are provided in the following comment: #11619 (comment)

OLD description: Create a module/function that will resolve the pileup data location based on the rucio dataset names. The high level logic is:

  • fetch any rules for the container name under a given Rucio account
  • fetch all of the Rucio dataset names in a given Rucio container and retrieve the rules for every single dataset, given a Rucio account.
  • rules that are not in state OK need to be logged
  • the output needs to be a JSON dump in exactly the same format as the one made for pileup_config.json
  • see THIS module for further details. Apparently there is a cross check against DBS, potentially for the file name which needs to be marked as VALID.

Potential input parameters: pileup name (str), rucio account (str), rucio auth url (str), rucio url (str)

Given that this will have to resolve every single Rucio dataset in a container, we need to be able to run concurrent requests in a timely manner.
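A minimal sketch of how the per-dataset rule lookups could be run concurrently, using a thread pool from the Python standard library. The `fetch_rules` callable is hypothetical: it stands in for whatever Rucio client call ends up being used and is assumed to return a list of rule dictionaries with at least a `state` key. The function name and `max_workers` default are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def resolve_dataset_rules(dataset_names, fetch_rules, max_workers=8):
    """Fetch rules for every dataset concurrently.

    fetch_rules is a HYPOTHETICAL callable(dataset_name) -> list of dicts,
    each assumed to carry at least a "state" key.
    Returns (rules_by_dataset, not_ok), where not_ok lists the
    (dataset_name, rule) pairs whose state is not "OK" so they can be logged.
    """
    rules_by_dataset = {}
    not_ok = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the dataset it was submitted for.
        futures = {pool.submit(fetch_rules, name): name for name in dataset_names}
        for future in as_completed(futures):
            name = futures[future]
            rules = future.result()  # re-raises any exception from the worker
            rules_by_dataset[name] = rules
            not_ok.extend((name, rule) for rule in rules if rule.get("state") != "OK")
    return rules_by_dataset, not_ok
```

Threads are a reasonable fit here because the work is dominated by waiting on HTTP responses rather than by CPU time; passing the fetch function in as an argument also keeps the sketch testable without a Rucio server.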

Describe alternatives you've considered
None

Additional context
This is part of the meta issue: #11537

@amaltaro amaltaro changed the title Support partial pileup data placement for PREMIX type WMAgent: continuous update of the pileup availability Sep 21, 2023
@d-ylee
Contributor

d-ylee commented Dec 11, 2023

@amaltaro I am looking at working on this issue, but I might need more information about MSPileup. I looked at the wiki page on MSPileup, but I don't think I really understand what MSPileup does. Is there another page where I can get more information?

Would implementing this be another MSPileup task?

@d-ylee d-ylee assigned d-ylee and unassigned d-ylee Dec 11, 2023
@amaltaro
Contributor

I am inclined to say that this ticket is no longer relevant. Instead of writing a new algorithm to perform the data location through Rucio, I think we should solely rely on MSPileup information in order to keep up-to-date pileup location across the WM system.

@amaltaro
Contributor

amaltaro commented Jan 23, 2024

The more detailed tasks to be performed and/or implemented in this component include:

  1. Find a reliable way to identify where workflow sandboxes are stored.
  2. Find the pileupconf.json files within a given workflow sandbox.
  3. Check whether there are actually any pileup changes; if not, move to the next workflow.
     • Same blocks? Same block location? If yes for both, there are no changes.
  4. Update the pileupconf.json in place.
     • Is the block present in the new container? If not, pop it out; if yes, update its location.
  5. Update the sandbox tarball (as an atomic operation, if possible), named after the workflow with the suffix "-Sandbox.tar.bz2".
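The change check and the in-place update described in the list above could be sketched roughly as follows. This assumes that the blocks of a single pileup are available as a dictionary keyed by block name, and that each block entry keeps its locations in a list under the key `PhEDExNodeNames`; both the key name and the shape of `current_locations` are assumptions for illustration, not something stated in this ticket.

```python
import copy

# ASSUMED key under which each block entry stores its list of locations.
LOCATION_KEY = "PhEDExNodeNames"


def update_pileup_blocks(sandbox_blocks, current_locations):
    """Return (updated_blocks, changed) for the blocks of one pileup.

    sandbox_blocks:    {block_name: {LOCATION_KEY: [rse, ...], ...}} from the sandbox
    current_locations: {block_name: [rse, ...]} as currently reported (hypothetical shape)
    """
    updated = copy.deepcopy(sandbox_blocks)  # leave the caller's data untouched
    changed = False
    for block_name in list(updated):
        if block_name not in current_locations:
            # Block is no longer part of the container: pop it out.
            updated.pop(block_name)
            changed = True
            continue
        new_location = sorted(current_locations[block_name])
        if sorted(updated[block_name].get(LOCATION_KEY, [])) != new_location:
            # Same block, different location: update it.
            updated[block_name][LOCATION_KEY] = new_location
            changed = True
    return updated, changed
```

When `changed` is false, the workflow can be skipped without touching the sandbox. Note that, following the list above, this sketch only drops or relocates blocks already present in the sandbox; it does not add blocks that are new to the container.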

Observation:
Each workflow can have 0 or more pileups, BUT there cannot be more than 1 pileup within a single Task/Step/cmsRun. Example:

cmst1@vocms0193:/data/srv/wmagent/current $ find  install/wmagentpy3/WorkQueueManager/cache/amaltaro_SC_MultiPU_Agent227_Val_240110_215719_7133 -name pileupconf.json
install/wmagentpy3/WorkQueueManager/cache/amaltaro_SC_MultiPU_Agent227_Val_240110_215719_7133/WMSandbox/GenSimFull/cmsRun2/pileupconf.json
install/wmagentpy3/WorkQueueManager/cache/amaltaro_SC_MultiPU_Agent227_Val_240110_215719_7133/WMSandbox/GenSimFull/cmsRun1/pileupconf.json
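A Python equivalent of the `find` command above could look like this; it is only a sketch, and the function name is made up for illustration.

```python
from pathlib import Path


def find_pileup_confs(sandbox_dir):
    """Return the sorted paths of every pileupconf.json below sandbox_dir."""
    return sorted(Path(sandbox_dir).rglob("pileupconf.json"))
```

Each returned path sits inside a Task/Step directory (for example `WMSandbox/GenSimFull/cmsRun1`), so the parent directory names identify which step a given pileup configuration belongs to.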

Combinatorics of this process is a function of: one mspileup call + number of pileups

UPDATE: other questions that were asked/answered during our conversation (Valentin and Alan):
Example of pileup JSON dump in a workflow sandbox:

vocms0193:/data/srv/wmagent/current $ cat install/wmagentpy3/WorkQueueManager/cache/amaltaro_TC_6Tasks_PU_Agent227_Val_240110_213013_3577/WMSandbox/HIG_RunIISummer20UL16DIGIPremixAPV_02791_0/cmsRun1/pileupconf.json  | jq | more

The location of the workflow sandboxes is defined by the WorkQueueManager configuration, e.g.:

config.WorkQueueManager.componentDir = '/data/srv/wmagent/v2.2.6rc7/install/wmagentpy3/WorkQueueManager'

Module that creates the sandbox:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMRuntime/SandboxCreator.py#L69

Apparently, the sandbox location can be defined as:
WorkQueueManager.componentDir + "/cache/" + workflow_name
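Putting the path rule above together with the tarball update, a rough sketch might look like the following. It assumes the tarball sits inside the workflow cache directory and that the updated pileupconf.json contents are passed in as bytes keyed by their member name inside the archive; the function names are hypothetical. Atomicity is approximated by writing a temporary file in the same directory and swapping it in with `os.replace`.

```python
import io
import os
import shutil
import tarfile
import tempfile


def sandbox_cache_dir(component_dir, workflow_name):
    """WorkQueueManager.componentDir + "/cache/" + workflow_name."""
    return os.path.join(component_dir, "cache", workflow_name)


def replace_in_tarball(tarball, replacements):
    """Rewrite a .tar.bz2, swapping in new bytes for the given member names.

    replacements: {member_name: bytes}. All other members are copied as-is.
    The new archive is built in a temporary file next to the original and
    then moved over it, so readers see either the old or the new tarball.
    """
    target_dir = os.path.dirname(os.path.abspath(tarball))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    os.close(fd)
    try:
        with tarfile.open(tarball, "r:bz2") as src, tarfile.open(tmp_path, "w:bz2") as dst:
            for member in src.getmembers():
                if member.name in replacements:
                    data = replacements[member.name]
                    member.size = len(data)  # header must match the new payload
                    dst.addfile(member, io.BytesIO(data))
                elif member.isfile():
                    dst.addfile(member, src.extractfile(member))
                else:
                    dst.addfile(member)  # directories, links: header only
        shutil.copymode(tarball, tmp_path)  # keep the original permissions
        os.replace(tmp_path, tarball)  # atomic when on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Copying every untouched member from the existing archive avoids having to know what else the sandbox tarball contains, at the cost of recompressing the whole file on each update.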
