Support partial data placement/location for PREMIX pileup #11537

Closed · 15 of 16 tasks
amaltaro opened this issue Apr 6, 2023 · 3 comments
Comments

amaltaro (Contributor) commented Apr 6, 2023

Impact of the new feature
MSPileup (but will require changes at Global WorkQueue and WMAgent)

Is your feature request related to a problem? Please describe.
This topic has been discussed a few times already, and Dima also started gathering the feature requirements in a Google document named "Premix data management for Production", which I won't link here to avoid unwanted access.

The main goal is to keep only a fraction of a PREMIX container on storage, with the ability to shrink or expand that fraction as needed.
Further details and WMCore dependencies are still to be discussed and investigated.

In a private chat today, Hasan also said that, once MSPileup is deployed and fully functional in production, this is the most important issue for P&R.

Describe the solution you'd like
The expected solution is as follows:

  • support partial pileup data placement for the PREMIX type (i.e., remove the constraint that the full container is locked at a single RSE, and support Rucio dataset-based data placement; see the sketch after this list)
  • support partial pileup data location in the global workqueue (i.e., look at both container-level and dataset-level rules)
  • support partial pileup data location in the local workqueue (i.e., look at both container-level and dataset-level rules)
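As a rough illustration of the first bullet, dataset-level placement with the Rucio Python client could look like the sketch below. The container DID, the RSE expression and the 50% fraction are hypothetical, and MSPileup's actual bookkeeping (rule ids, lifetimes, accounts) is not shown.

```python
# Minimal sketch, assuming the Rucio Python client: lock only a fraction of a
# PREMIX container by creating one rule per dataset (CMS block) instead of a
# single container-level rule. All names below are hypothetical.
from rucio.client import Client

client = Client()
scope, container = "cms", "/SomePremixDataset/Campaign-Version/PREMIX"  # hypothetical DID

# CMS blocks are Rucio datasets attached to the container.
datasets = [did["name"] for did in client.list_content(scope, container)
            if did["type"].upper() == "DATASET"]

fraction = 0.5                                       # desired on-disk fraction
subset = datasets[: int(len(datasets) * fraction)]   # naive block selection

for name in subset:
    client.add_replication_rule(dids=[{"scope": scope, "name": name}],
                                copies=1,
                                rse_expression="T1_US_FNAL_Disk",
                                comment="partial PREMIX pileup placement")
```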

The original requirements described in the document say (a sketch of the fraction-change logic follows the quote):
"""

  • Every pileup should have a desired fraction value defined in its configuration.
    • When this number is reduced, workflow management should free some Rucio rules and ensure the new fraction on disk.
    • When this number is increased, workflow management should create new rules to trigger tape recall and ensure the new fraction on disk.
    • Fraction increases and decreases should be applied evenly at every defined RSE, e.g. if we reduce the fraction from 100% to 50% for a pileup whose locations are defined as CERN and FNAL, we expect the fraction to drop to 50% at both sites.
  • Workflows which cannot proceed due to insufficient pileup on disk should trigger an alert to P&R and PDMV, and these two groups should decide which action to take (increase the fraction, resubmit workflows with fewer events, etc.).
"""

Sub-tasks for this meta-issue are:

Extra sub-tasks created along the way:

Not a sub-task but potentially related: WMAgent: refactor pileup json location in the sandbox #11735

Describe alternatives you've considered
Keep placing the whole container and enforcing that it is fully available at any given RSE.

Update from a discussion in the weekly O&C meeting:

  1. whenever the pileup fraction goes below 1.0, it would be beneficial to keep a different set of pileup blocks at each RSE, maximizing the overall pileup statistics (see the sketch after this list).
  2. this feature is mostly relevant for PREMIX pileup, as classical mix is supposed to stay on disk only for a very short period of time.
  3. safety guards are not yet considered. One day we could have a mechanism that scales pileup availability on disk according to the number of requested events in incoming workflows; a policy between CompOps and PPD would need to be established first, though.
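For point 1, a naive way to spread different blocks across RSEs while keeping roughly the desired fraction at each site could look like the sketch below (illustrative only, not the algorithm adopted in MSPileup):

```python
# Sketch: assign blocks in contiguous round-robin chunks so that, when the
# fraction is below 1.0, each RSE holds a *different* subset and the union
# of all RSEs covers as many blocks as possible.
def spread_blocks(blocks, rses, fraction):
    per_rse = int(len(blocks) * fraction)
    placement = {rse: [] for rse in rses}
    idx = 0
    for rse in rses:
        for _ in range(per_rse):
            placement[rse].append(blocks[idx % len(blocks)])
            idx += 1
    return placement

# e.g. spread_blocks([f"block{i}" for i in range(10)],
#                    ["T2_CH_CERN", "T1_US_FNAL_Disk"], 0.5)
# -> CERN gets block0..block4, FNAL gets block5..block9,
#    jointly covering all ten blocks at 50% per site.
```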

Additional context
It requires further investigation, but I do think that, in order to support this, we should also support live updates to the pileup JSON map that is shipped with every single job. In other words, we cannot create a pileup location JSON file at the beginning of the workflow and keep using it, unchanged, during the lifetime of the workflow.
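To illustrate the "live update" idea, the job (or a thin runtime wrapper) could fetch the current block-to-RSE map at execution time instead of relying on a JSON baked into the sandbox. The endpoint URL and the JSON layout below are assumptions made for the sake of the example.

```python
# Hedged sketch: runtime lookup of pileup block locations. The URL and the
# JSON structure are hypothetical; they stand in for whatever service ends
# up serving the up-to-date pileup location map.
import json
import urllib.request

PILEUP_MAP_URL = "https://cmsweb.cern.ch/ms-pileup/data/pileup"  # hypothetical endpoint

def fetch_pileup_locations(pileup_name):
    """Return a mapping like {"blockA": ["T1_US_FNAL_Disk"], "blockB": [...]}."""
    url = f"{PILEUP_MAP_URL}?name={pileup_name}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```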

amaltaro (Contributor, Author) commented Apr 6, 2023

This is likely a meta-issue, and we will need to think it through and break it down into more manageable/actionable issues.

@amaltaro amaltaro added QPrio: Low quarter priority and removed QPrio: Low quarter priority labels Apr 23, 2023
todor-ivanov (Contributor) commented Jun 27, 2023

hi @amaltaro

Dima also started gathering the feature requirements in a google document named "Premix data management for Production", which I won't link here to avoid unwanted access.

Does this document contain any sensitive data? It does not seem so to me. If @drkovalskyi does not object, I am in favor of linking it here, since the document contains important information and is a key reference both for this issue and for how we currently handle pileup data.

amaltaro (Contributor, Author) commented:

As we only have one enhancement issue open against this project, and the full functionality has been implemented and deployed to production, I think we can declare this project completed.
The optional/enhancement issue remains open, but at least this major milestone can be scratched off our ToDo list.

Thank you very much, Valentin, for carrying out most of this development, and to everyone else who helped with discussions, suggestions and so on. Closing this one out.
