Support partial data placement/location for PREMIX pileup #11537

Closed · 15 of 16 tasks
amaltaro opened this issue Apr 6, 2023 · 3 comments
Comments

amaltaro (Contributor) commented Apr 6, 2023

Impact of the new feature
MSPileup (but will require changes at Global WorkQueue and WMAgent)

Is your feature request related to a problem? Please describe.
This topic has been discussed a few times already, and Dima also started gathering the feature requirements in a Google document named "Premix data management for Production", which I won't link here to avoid unwanted access.

The main goal is to keep only a fraction of a PREMIX container on storage, with the ability to shrink or expand that fraction as needed.
Further details and WMCore dependencies are still to be discussed and investigated.

In a private chat today, Hasan also said that, once MSPileup is deployed and fully functional in production, this is the most important issue for P&R.

Describe the solution you'd like
The expected solution is as follows:

  • support partial pileup data placement for the PREMIX type (i.e., remove the constraint that the full container is locked at a single RSE, and support Rucio dataset-based data placement; see the sketch after this list)
  • support partial pileup data location in the global workqueue (i.e., look at both container-level and dataset-level rules)
  • support partial pileup data location in the local workqueue (i.e., look at both container-level and dataset-level rules)
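As a rough illustration of the first bullet, dataset-level placement with the Rucio Python client could look like the sketch below. The container DID, the RSE expression and the 50% fraction are hypothetical, and MSPileup's actual bookkeeping (rule ids, lifetimes, accounts) is not shown.

```python
# Minimal sketch, assuming the Rucio Python client: lock only a fraction of a
# PREMIX container by creating one rule per dataset (CMS block) instead of a
# single container-level rule. All names below are hypothetical.
from rucio.client import Client

client = Client()
scope, container = "cms", "/SomePremixDataset/Campaign-Version/PREMIX"  # hypothetical DID

# CMS blocks are Rucio datasets attached to the container.
datasets = [did["name"] for did in client.list_content(scope, container)
            if did["type"].upper() == "DATASET"]

fraction = 0.5                                       # desired on-disk fraction
subset = datasets[: int(len(datasets) * fraction)]   # naive block selection

for name in subset:
    client.add_replication_rule(dids=[{"scope": scope, "name": name}],
                                copies=1,
                                rse_expression="T1_US_FNAL_Disk",
                                comment="partial PREMIX pileup placement")
```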

The original requirements described in the document say (a sketch of the fraction-change logic follows the quote):
"""

  • Every pileup should have a desired fraction value defined in its configuration.
    • When this number is reduced, workflow management should free some Rucio rules and ensure the new fraction on disk.
    • When this number is increased, workflow management should create new rules to trigger tape recall and ensure the new fraction on disk.
    • Fraction increases and decreases should be applied evenly at every defined RSE, e.g. if we reduce the fraction from 100% to 50% for a pileup whose locations are defined as CERN and FNAL, we expect the fraction to drop to 50% at both sites.
  • Workflows which cannot proceed due to insufficient pileup on disk should trigger an alert to P&R and PDMV, and these two groups should decide which action to take (increase the fraction, resubmit workflows with fewer events, etc.).
"""

Sub-tasks for this meta-issue are:

Extra sub-tasks created along the way:

Not a sub-task but potentially related: WMAgent: refactor pileup json location in the sandbox #11735

Describe alternatives you've considered
Keep placing the whole container and enforcing that it is fully available at any given RSE.

Update from a discussion in the weekly O&C meeting:

  1. whenever the pileup fraction goes below 1.0, it would be beneficial to keep a different set of pileup blocks at each RSE, maximizing the overall pileup statistics (see the sketch after this list).
  2. this feature is mostly relevant for PREMIX pileup, as classical mix is supposed to stay on disk only for a very short period of time.
  3. safety guards are not yet considered. One day we could have a mechanism that scales pileup availability on disk according to the number of requested events in incoming workflows; a policy between CompOps and PPD would need to be established first, though.
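For point 1, a naive way to spread different blocks across RSEs while keeping roughly the desired fraction at each site could look like the sketch below (illustrative only, not the algorithm adopted in MSPileup):

```python
# Sketch: assign blocks in contiguous round-robin chunks so that, when the
# fraction is below 1.0, each RSE holds a *different* subset and the union
# of all RSEs covers as many blocks as possible.
def spread_blocks(blocks, rses, fraction):
    per_rse = int(len(blocks) * fraction)
    placement = {rse: [] for rse in rses}
    idx = 0
    for rse in rses:
        for _ in range(per_rse):
            placement[rse].append(blocks[idx % len(blocks)])
            idx += 1
    return placement

# e.g. spread_blocks([f"block{i}" for i in range(10)],
#                    ["T2_CH_CERN", "T1_US_FNAL_Disk"], 0.5)
# -> CERN gets block0..block4, FNAL gets block5..block9,
#    jointly covering all ten blocks at 50% per site.
```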

Additional context
It requires further investigation, but I do think that, in order to support this, we should also support live updates to the pileup JSON map that is shipped with every single job. In other words, we cannot create a pileup location JSON file at the beginning of the workflow and keep using it, unchanged, during the lifetime of the workflow.
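To illustrate the "live update" idea, the job (or a thin runtime wrapper) could fetch the current block-to-RSE map at execution time instead of relying on a JSON baked into the sandbox. The endpoint URL and the JSON layout below are assumptions made for the sake of the example.

```python
# Hedged sketch: runtime lookup of pileup block locations. The URL and the
# JSON structure are hypothetical; they stand in for whatever service ends
# up serving the up-to-date pileup location map.
import json
import urllib.request

PILEUP_MAP_URL = "https://cmsweb.cern.ch/ms-pileup/data/pileup"  # hypothetical endpoint

def fetch_pileup_locations(pileup_name):
    """Return a mapping like {"blockA": ["T1_US_FNAL_Disk"], "blockB": [...]}."""
    url = f"{PILEUP_MAP_URL}?name={pileup_name}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```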

amaltaro (Contributor, Author) commented Apr 6, 2023

This is likely a meta-issue, and we will need to think it through and break it down into more manageable/actionable issues.

@amaltaro amaltaro added QPrio: Low quarter priority and removed QPrio: Low quarter priority labels Apr 23, 2023
todor-ivanov (Contributor) commented Jun 27, 2023

hi @amaltaro

Dima also started gathering the feature requirements in a google document named "Premix data management for Production", which I won't link here to avoid unwanted access.

Does this document contain any sensitive data? It does not seem so to me. If @drkovalskyi does not object, I am in favor of linking it here, since the document contains important information and is a key reference both for this issue and for how we currently handle pileup data.

amaltaro (Contributor, Author) commented:

As we only have one enhancement issue open against this project, and the full functionality has been implemented and deployed to production, I think we can declare this project completed.
The optional/enhancement issue remains open, but at least this major milestone can be scratched off our ToDo list.

Thank you very much, Valentin, for carrying out most of this development, and to everyone else who helped with discussions, suggestions and so on. Closing this one out.
