Add partial pileup placement logic #11807

Merged: 9 commits, Jan 18, 2024

@vkuznet vkuznet commented Nov 30, 2023

Fixes #11621

Status

ready

Description

Introduce the partial pileup placement logic described in #11621. It involves the following:

  • the user places a request in the form {containerFraction: <float>, 'pileupName': <string>}
  • the request is placed via an HTTP PUT call to the MSPileup data service
  • the MSPileup data service calls the updatePileup API, which in turn passes the call to the MSPileupData manager and calls its updatePileup API
  • the updatePileup API performs the necessary checks, creates a customDID and creates a new transition record with the following structure:
{containerFraction: float,
 customDID: string,
 updateTime: GMTimeInSeconds,
 DN: string}
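
A minimal sketch of how such a transition record might be assembled. The helper name makeTransitionRecord is hypothetical, the range check on the fraction is an assumption, and updateTime is assumed to be an integer number of seconds since the epoch (GMT); only the four field names come from the structure above.

```python
import time


def makeTransitionRecord(containerFraction, customDID, userDN):
    """Build a transition record with the four fields listed in the description."""
    fraction = float(containerFraction)
    # assumption: a valid container fraction lies in the interval (0, 1]
    if not 0 < fraction <= 1:
        raise ValueError("containerFraction must be in (0, 1], got %s" % fraction)
    return {
        "containerFraction": fraction,
        "customDID": customDID,
        "updateTime": int(time.time()),  # seconds since epoch, GMT
        "DN": userDN,
    }
```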

In addition, we provide partialPileupTask in MSPileupTask class which is used during each polling cycle. This method performs transition logic outlined in the following gist

  • it fetches pileup documents from MongoDB
  • it identifies the slope of the container fraction (whether it is increasing or decreasing)
  • it gets the Rucio DIDs for the pileup document
  • it takes a portion of the DIDs based on the ceil(containerFraction * num_rucio_datasets) formula
  • it creates a new container DID in Rucio
  • it calls the attachDIDs Rucio wrapper API with that set of DIDs and the RSEs from the pileup document
  • it creates new rules for the custom DID
  • it sets the expiration date (24h) for the already existing ruleIds from the pileup document
  • it adds the new ruleIds to the pileup document
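
The portion step in the list above can be sketched as follows. This is only an illustration of the ceil(containerFraction * num_rucio_datasets) formula: the function name selectPortion is hypothetical, and taking the first N entries is an assumption, since the description does not say how the subset is chosen.

```python
import math


def selectPortion(dids, containerFraction):
    """Return ceil(containerFraction * len(dids)) DIDs from the given list."""
    if not 0 < containerFraction <= 1:
        raise ValueError("containerFraction must be in (0, 1]")
    count = math.ceil(containerFraction * len(dids))
    # assumption: keep the leading entries; the real code may pick differently
    return dids[:count]
```

Because of the ceil, any non-empty list yields at least one DID, even for a very small fraction.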

TODO:

Further actions:

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

External dependencies / deployment changes

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 13 tests deleted
    • 2 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 2 warnings
    • 12 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14681/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 15 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14682/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 3 tests deleted
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 21 warnings
    • 35 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 90 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14684/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 23 warnings
    • 41 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 17 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14685/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 23 warnings
    • 41 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14686/artifact/artifacts/PullRequestReport.html

vkuznet commented Dec 4, 2023

Alan, it would be nice if you could review it before you leave. I implemented the initial logic but still need to do some testing, etc. Since the changes touch different parts of WMCore, e.g. Rucio, MSAuth, etc., it would be nice to have an initial review of these changes.

@vkuznet vkuznet requested a review from amaltaro December 4, 2023 19:27

@amaltaro amaltaro left a comment

@vkuznet Valentin, these developments are on the right track, thanks.

I did leave a bunch of comments along the code though. I also wanted to make another comment here about the partialPileupTask function. One of the very first pre-processing steps in that function should be identifying whether the containerFraction has changed or not. The rest of the function should only be executed when there is a real change of container fraction.

@@ -89,6 +89,26 @@ def __init__(self, msConfig, **kwargs):
            with open(authFile, 'rb') as istream:
                self.authzKey = istream.read()

    def userDN(self):

Contributor

I have the impression that this whole method could be replaced by a call to https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/REST/Auth.py#L17C5-L17C27. On the caller, just access the dn field.

Contributor Author

No, I prefer to have a proper wrapper function for two reasons:

  1. we should be able to test without a user DN (this is what the function provides)
  2. we already have the MSAuth module as a wrapper used in the MSPileup code, and bypassing it to bring in REST/Auth.py feels wrong since MSAuth depends on it. In other words, we should depend on either REST/Auth.py or MSAuth.py, but not on both at the same time within the code base.

Contributor

You can still test it without a user DN, you just need to move the DN check to a different place, of course.

About the dependency on REST/Auth.py, the MSAuth module already depends on it. That means there is no difference at this level whether MSPileup depends on (imports) MSAuth or REST/Auth, given that REST/Auth will be imported regardless of our choice.

We can stick to this development though, I am just trying not to inflate the codebase with - apparently - unnecessary extra developments/functions/methods.

src/python/WMCore/MicroService/MSPileup/MSPileupData.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupData.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupData.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
src/python/WMCore/Services/Rucio/Rucio.py (outdated, resolved)

vkuznet commented Dec 5, 2023

@amaltaro , I need additional input from you on how we should handle PUT requests. We have two options here:

  1. User will send {"containerFraction":<float>, "pileupName": <name of pileup document>} JSON with HTTP PUT
  2. User will first fetch the whole JSON document from MSPileup, then change its containerFraction and send an HTTP PUT request

In the first case, we should perform an additional look-up of the corresponding pileup document, and therefore a few changes should be made to the code:

  • adjust the MSPileup service to handle different PUT request schemas; so far we assume that the user sends a valid pileup document, while in case 1 it would be a small update doc
  • add a look-up of the MSPileup document itself and only then perform the updatePileup API call

Please clarify which scenario to adopt.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 23 warnings and errors that must be fixed
    • 32 warnings
    • 68 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14694/artifact/artifacts/PullRequestReport.html

vkuznet commented Dec 11, 2023

@amaltaro , I'm awaiting your reply to my comment #11807 (comment). Please provide your feedback as I need to know how to implement this in code.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 23 warnings and errors that must be fixed
    • 32 warnings
    • 68 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14713/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

@amaltaro , I need additional input from you how we should handle PUT requests? We have two options here:

@vkuznet Valentin, apologies for missing this question.
My opinion on this is that we should not change the current behavior. As far as I can tell, we are using the updatePileup function:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSPileup/MSPileup.py#L97C9-L97C21

which ends up calling MSPileupData.updatePileup function:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSPileup/MSPileupData.py#L132

which itself already has a document lookup.

That said, your option 1 seems to me to be the way to go, which I copy below as well:

  1. User will send {"containerFraction": <float>, "pileupName": <name of pileup document>} JSON with HTTP put

@amaltaro amaltaro left a comment

@vkuznet Valentin, I left a few comments in place. However, I also wanted to clarify the following:
a) if the container fraction is decreasing, instead of using random blocks from the standard pileup dataset (pileupName), we should use it from the custom dataset (customName) - if existent. This will ensure that no data will have to be staged from Tape, instead it will all be already available on Disk.
b) if the container fraction is increasing, we should use EVERY single block defined in the custom dataset (customName) - if existent - plus the required additional blocks from the standard pileup dataset (pileupName). This way we minimize the amount of blocks that will have to be staged from Tape.

In other words:

  • when decreasing: new container is a subset of customName
  • when increasing: new container is a superset of customName + subset of pileupName
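
The two cases above can be sketched as a single selection function. This is a hedged illustration, not the implementation from this PR: the name newContainerBlocks is hypothetical, blocks are assumed to be plain lists of names, and the target size is assumed to follow the ceil(containerFraction * number of pileup blocks) formula from the PR description.

```python
import math


def newContainerBlocks(pileupBlocks, customBlocks, containerFraction):
    """
    Pick the blocks for the new custom container.

    pileupBlocks: all blocks of the standard pileup dataset (pileupName)
    customBlocks: blocks already in the custom dataset (customName), may be empty
    """
    target = math.ceil(containerFraction * len(pileupBlocks))
    if target <= len(customBlocks):
        # decreasing: the new container is a subset of customName
        return customBlocks[:target]
    # increasing: keep every custom block, then top up from pileupName
    present = set(customBlocks)
    extra = [blk for blk in pileupBlocks if blk not in present]
    return customBlocks + extra[:target - len(customBlocks)]
```

When no custom dataset exists yet (an empty customBlocks list), the increasing branch simply takes the first blocks of the standard pileup dataset.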

"""Test the customDID function"""
pname = "/abc/xyz/MINIAOD"
did = customDID(pname)
self.assertTrue(did.endswith('-V1'))

Contributor

please test the whole string. Same comment for 2 lines below
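
A sketch of what testing the whole string could look like. The customDID below is a hypothetical stand-in that appends a "-V1" style suffix; the real function in this PR may build the name differently, so only the assertEqual pattern is the point here.

```python
import unittest


def customDID(pname, version=1):
    """Hypothetical stand-in: append a version suffix to the pileup name."""
    return "%s-V%d" % (pname, version)


class CustomDIDTest(unittest.TestCase):
    """Compare the full DID string rather than only its suffix"""

    def testCustomDID(self):
        """Test the customDID function against the full expected string"""
        pname = "/abc/xyz/MINIAOD"
        self.assertEqual(customDID(pname), "/abc/xyz/MINIAOD-V1")
```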

- create new custom Name as pileup+extention
- add new transition record and update MSPileup document
- we call attachDIDs Rucio wrapper API with our set of DIDs and rses from pileup document
- create new rules for custom DID using either FNAL disk or T2_CERN sites

Contributor

We cannot assume the location will be FNAL disk or CERN. Instead, we might want to use either currentRSEs or expectedRSEs. Each of those carries its own risk, but I would stick with the former and let Rucio manage data accordingly.

src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 9 new failures
    • 41 tests deleted
    • 584 tests no longer failing
    • 163 tests added
    • 17 changes in unstable tests
  • Python3 Pylint check: failed
    • 24 warnings and errors that must be fixed
    • 32 warnings
    • 65 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 37 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14726/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 41 tests deleted
    • 586 tests no longer failing
    • 168 tests added
    • 17 changes in unstable tests
  • Python3 Pylint check: failed
    • 23 warnings and errors that must be fixed
    • 32 warnings
    • 66 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 18 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14727/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 8 new failures
    • 41 tests deleted
    • 576 tests no longer failing
    • 168 tests added
    • 17 changes in unstable tests
  • Python3 Pylint check: failed
    • 16 warnings and errors that must be fixed
    • 32 warnings
    • 65 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 18 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14729/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 8 new failures
    • 41 tests deleted
    • 585 tests no longer failing
    • 168 tests added
    • 17 changes in unstable tests
  • Python3 Pylint check: failed
    • 16 warnings and errors that must be fixed
    • 32 warnings
    • 65 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 10 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14730/artifact/artifacts/PullRequestReport.html

vkuznet commented Dec 18, 2023

@amaltaro , I addressed your feedback. A few comments though:

  • I refactored MSAuth to better handle your concern about userDN. Now there is no such method but rather userInfo, which is used within this codebase and which we can now use in MSPileup. I still think it is the proper place to keep it
  • I added logic to handle the incoming fraction request
  • I addressed your comment about the usage of pileupName vs customName based on the slope of the container fraction
  • I addressed pylint concerns and squashed changes
  • the unit tests are also improved based on your request

I think this PR is ready for final review. Please note that @klannon is interested in merging this PR within this calendar year, as I'm heading to the holiday break starting this Thursday and will not be available till Jan 4th. Therefore, please speed up the review, and if new comments pop up I can try to resolve them by Wed.

That said, I checked the Jenkins report and the failed tests do not seem related to this PR (I also noticed something weird in the Jenkins tests, like "you are not authorized" messages, which leads me to conclude that some expired certificate may be around and has an avalanche effect).

@amaltaro amaltaro left a comment

Valentin, please find comments/requests/suggestions along the code.
For future changes, please keep them separated in their own commits for the moment, until we can have another review pass and squash them accordingly. Thanks

src/python/WMCore/MicroService/MSCore/MSAuth.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileup.py (outdated, resolved)
:return: results of MSPileup data layer (list of dicts)
"""
self.authMgr.authorizeApiAccess('ms-pileup', 'update')
keys = sorted(pdict.keys())
if keys == ['containerFraction', 'pileupName']:

Contributor

I fear this list comparison might not be reliable across different python versions. Wouldn't a test like if "containerFraction" in pdict: be safer and as good as this one?

Contributor Author

Why would it be different in different python versions? Please provide an example; list is a basic data type and never changes. That said, we can't check it with if "containerFraction" in pdict: either, since containerFraction is present in both the pileup document and the partial pileup spec. Unless you provide a better argument here, I'm leaving the code as is.

Contributor

I don't have any concrete examples. It's just that more complex data types are sometimes harder to check for equality. If you have no concerns, then let it be.
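
For reference, a small sketch of the check discussed in this thread. In Python 3, sorted() over dict keys returns a new list of strings and list equality compares element by element, so the result does not depend on key insertion order. The helper name isPartialPileupSpec and the sample documents are hypothetical.

```python
PARTIAL_SPEC_KEYS = ["containerFraction", "pileupName"]  # already in sorted order


def isPartialPileupSpec(pdict):
    """True only when pdict has exactly the two partial-spec keys."""
    return sorted(pdict.keys()) == PARTIAL_SPEC_KEYS


# a partial pileup spec, keys given in a different order on purpose
spec = {"pileupName": "/abc/xyz/MINIAOD", "containerFraction": 0.5}
# hypothetical full document: it also carries containerFraction, plus other fields
fullDoc = {"pileupName": "/abc/xyz/MINIAOD", "containerFraction": 1.0, "ruleIds": []}
```

A plain membership test would be true for both spec and fullDoc, which is why it cannot tell the two schemas apart; comparing set(pdict) with set(PARTIAL_SPEC_KEYS) gives the same answer as the sorted-list comparison.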

src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
src/python/WMCore/Services/Rucio/Rucio.py (outdated, resolved)
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 2 tests added
    • 4 changes in unstable tests
  • Python3 Pylint check: failed
    • 19 warnings and errors that must be fixed
    • 32 warnings
    • 66 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14732/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 19 warnings and errors that must be fixed
    • 32 warnings
    • 66 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14733/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 2 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 32 warnings
    • 62 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14734/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 4 new failures
    • 2 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 32 warnings
    • 62 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14735/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests added
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 33 warnings
    • 64 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14762/artifact/artifacts/PullRequestReport.html

vkuznet commented Jan 9, 2024

@amaltaro , this PR is ready for review. Please note the following:

  • there are lots of changes to many files, including the Rucio mock, Authz and the MSPileup codebase
  • I made Authz work with local tests using the localhost-user DN
  • I ran all MSPileup tests locally; without this I doubt they would be finished at all. This is a particularly challenging PR and relying on Jenkins was not an option (or it would take years to run the logic of the unit tests). I would like to stress, particularly for this use-case, that we should be able to run unit tests locally.
  • I implemented the full logic from your gist, which we discussed extensively, and provided unit tests listing every step of the iteration
  • I found and filled in all missing parts of the Rucio mock APIs; without them testing was impossible
  • Even though I implemented the whole logic, I found it very complex (code-wise). In my view we have stepped away from the Microservice nature and all our services are too big to be called Microservices. That said, the challenging logic comes from the separation of the MSPileup APIs from MSPileupTasks, i.e. the asynchronous nature of processing documents. It would be much simpler if we had everything under a single service and made all operations atomic. I fear that further changes will be very difficult to implement and to adapt to all possible use-cases.

@amaltaro amaltaro left a comment

Valentin, apologies for the belated review. Please find a bunch of comments/requests/concerns along the code.

Regarding your previous comment:

I ran all MSPileup tests locally; without this I doubt they would be finished at all. This is a particularly challenging PR and relying on Jenkins was not an option (or it would take years to run the logic of the unit tests). I would like to stress, particularly for this use-case, that we should be able to run unit tests locally.

Why do you say we cannot run unit tests locally? What are the problems that you are encountering? Have you seen that we have a docker container which provides the FULL environment for unit tests (though, of course, not any certificates in case we need to reach external services)?

That said, the challenging logic comes from the separation of the MSPileup APIs from MSPileupTasks, i.e. the asynchronous nature of processing documents. It would be much simpler if we had everything under a single service and made all operations atomic.

Given how expensive some of these operations can be, I think it is not feasible to have atomic operations in this service, as this would require blocking user calls.

src/python/WMCore/MicroService/MSPileup/MSPileupData.py (outdated, resolved)

# find first document with custom name
results = []
if doc.get('customName', '') != '':

Contributor

Based on what I said above, I think this code can be made simpler.

Contributor Author

ok, I made it simple, i.e. check for pileup name and prohibit custom name.

Contributor

This is not what I see implemented. You are actually using customName to look up for a document in MongoDB.

According to step 4 in the gist, end user will update the pileup fraction by providing 2 keys: pileupName and containerFraction. Hence, the mongodb document look up needs to be performed with the pileupName.


# perform check of input doc, is it partial pileup spec or not
partialPileupSpec = False
if len(doc.keys()) == 2:

Contributor

This is a fragile check. I would suggest to check for the exact keys that we expect, so something like:

if set(doc.keys()) == set(["pileupName", "containerFraction"])

Contributor

I don't know if you already perform the following, but I would suggest to also make sure that the new containerFraction is different than the actual fraction. Just so we can avoid unnecessary operations and potentially unnecessary partial pileup creation.

Contributor Author

ok, updated.

Contributor

I suspect you missed this update(?)
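
The suggestion in this thread to skip requests whose containerFraction equals the stored one could be sketched like this. The function name fractionChanged and the tolerance are assumptions; comparing floats with a small absolute tolerance avoids treating rounding noise as a real change.

```python
import math


def fractionChanged(currentFraction, newFraction, tol=1e-9):
    """Return True when the requested fraction really differs from the stored one."""
    # assumption: differences below tol are rounding noise, not a user request
    return not math.isclose(currentFraction, newFraction, rel_tol=0.0, abs_tol=tol)
```

A caller would then return early (no new custom container, no new rules) whenever fractionChanged(...) is False.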

    obj = MSPileupObj(dbDoc, validRSEs=rseList)
else:
    self.logger.info("#### full pileup obj")
    obj = MSPileupObj(dbDoc, validRSEs=rseList)

Contributor

If the pileup object is updated for something else, this rseList would be an empty list. Why can't we leave it with the same behavior for partial pileup (a few lines above)?

Contributor Author

This is due to the logic of MSPileupObj: it validates the document RSEs against validRSEs. Therefore, if we get a document from the DB which does have RSEs, we should provide a new set of RSEs which should be in that list. Please see the logic of MSPileupObj https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSPileup/DataStructs/MSPileupObj.py#L176C22-L176C22

Contributor

I think I wasn't clear in my previous message. What I am asking you here is whether we really need to have this if/else statement? My understanding is that the pileup object is created in the same way, regardless whether it's full or partial pileup.

self.rucioClient.attachDIDs(rse, doc['customName'], portion, scope=self.customRucioScope)

# create new rule for custom DID using pileup document rse
newRules += self.rucioClient.createReplicationRules(portion, rse)

Contributor

Replication rule needs to be created for the custom pileup, not for all the block names. So please replace portion by doc['customName'].

BTW, where is createReplicationRules defined? I think you mixed up the method name here.

Contributor Author

ok, done. createReplicationRule is defined here https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L444 I fixed the function name though, i.e. createReplicationRules to createReplicationRule.

Contributor

See my comment above, you are still creating rules for the wrong DID object.
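
A sketch of the change requested in this thread: create the rule for the custom container (doc['customName']) rather than for the list of blocks. FakeRucioClient is a hypothetical stand-in that only records calls; its two-argument createReplicationRule signature and its list-of-rule-ids return value are assumptions, not the documented wrapper API.

```python
class FakeRucioClient:
    """Hypothetical stand-in for the WMCore Rucio wrapper; it only records calls."""

    def __init__(self):
        self.calls = []

    def createReplicationRule(self, names, rseExpression):
        # assumption: the wrapper returns a list with the new rule ids
        self.calls.append((names, rseExpression))
        return ["rule-%d" % len(self.calls)]


def createCustomRules(rucioClient, doc, rses):
    """Create one rule per RSE for the custom container, not for its blocks."""
    newRules = []
    for rse in rses:
        newRules += rucioClient.createReplicationRule(doc['customName'], rse)
    return newRules
```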

src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
for rid in doc['ruleIds']:
    # set expiration date to be 24h ahead of right now
    opts = {'lifetime': 24 * 60 * 60}
    self.rucioClient.updateRule(rid, opts)

Contributor

I would log that we are updating rule id XXX with the new lifetime.

Contributor Author

ok, done

Contributor

Actually, these rules that have been updated are not associated with doc['customName']. Said that, I would suggest to refactor the log message to something like: "Rule id: %s has been updated with lifetime: %s"
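
A sketch of the loop with the suggested log message. The updateRule(rid, opts) call and the 24h lifetime come from the snippet under review; FakeRucioClient and the function name expireRules are hypothetical and exist only to make the example self-contained.

```python
import logging


class FakeRucioClient:
    """Hypothetical stand-in for the WMCore Rucio wrapper; it only records calls."""

    def __init__(self):
        self.updated = {}

    def updateRule(self, ruleId, opts):
        self.updated[ruleId] = dict(opts)
        return True


def expireRules(rucioClient, ruleIds, logger, lifetime=24 * 60 * 60):
    """Set a lifetime (in seconds) on each existing rule and log every update."""
    opts = {'lifetime': lifetime}
    for rid in ruleIds:
        rucioClient.updateRule(rid, opts)
        logger.info("Rule id: %s has been updated with lifetime: %s", rid, lifetime)
```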

src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py (outdated, resolved)
self.logger.info("update pileup document %s", doc)

# update MSPileup document in MongoDB
self.mgr.updatePileup(doc, rseList=newRules)

Contributor

bug? providing rules for rseList param(?)

Contributor Author

yes, it should be doc['expectedRSEs'] or doc['currentRSEs'], since my default rseList is None and the passed document may have either expectedRSEs or currentRSEs as non-empty lists, and during validation we check for them. Please clarify which list should be used.

Contributor

you probably forgot pushing your changes in.

vkuznet commented Jan 15, 2024

@amaltaro, I addressed all the issues you reported in your review except for a few items. Please reply to my observations/questions in place. In particular:

  • the custom name pattern does not align with the CMS dataset one
  • MSPileupObj validates the RSE list of the object, and therefore the code provides it
  • creating a pileup should create a transition record per the logic in the gist, which disagrees with your comments
  • the portion logic seems correct to me, yet you still insist that it is not valid. I followed the one reported earlier in this issue and tried to outline it step by step. To avoid further back-and-forth comments, I suggest that you write this snippet as you see it
  • I need further instructions on what to do with the Rucio logic you pointed out, as it is unclear from your review
  • I corrected the createReplicationRule API name (and its mock-up counterpart)
  • Due to the validation of the RSE list in the MSPileup object, we still need to pass it to the updatePileup API, and there are two lists around: expectedRSEs and currentRSEs. Please clarify which one to use
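
For reference, the portion formula under discussion, ceil(containerFraction * num_rucio_datasets), can be sketched in isolation. This is only an illustration: `select_portion` is a hypothetical helper, the block names are made up, and the check that the fraction lies in (0, 1] is an assumption rather than something stated in this thread.

```python
import math


def select_portion(block_names, fraction):
    """Return the first ceil(fraction * N) entries of block_names."""
    # Assumption: a valid container fraction lies in the interval (0, 1].
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = math.ceil(fraction * len(block_names))
    return block_names[:count]


# Made-up block names, for illustration only.
blocks = ["/Pileup/Sample-v1/PREMIX#%d" % i for i in range(10)]
print(len(select_portion(blocks, 0.25)))  # ceil(2.5) gives 3 blocks
```

Because of the ceiling, any non-zero fraction of a non-empty list selects at least one block.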

@vkuznet vkuznet requested a review from amaltaro January 15, 2024 14:45
Contributor

@amaltaro amaltaro left a comment

Valentin, I suspect you forgot to push your changes in, as I see many of your comments saying that the code was updated, while nothing changed compared to my previous review.

For now, I would suggest that you provide additional commits, which might make the review experience easier and cleaner. Before we merge it, we can come back to this and properly squash the commits.

As mentioned along the code, I updated the gist with further instructions for decreasing/increasing the container, see: https://gist.github.com/amaltaro/b4f9bafc0b58c10092a0735c635538b5#logic-for-increasingdecreasing-container-fraction

Lastly, I just wanted to raise a concern that we are likely fetching the document from MongoDB multiple times within the same REST operation. Just a concern for the future though, as there are already tons of changes in this PR.


# find first document with custom name
results = []
if doc.get('customName', '') != '':
Contributor

This is not what I see implemented. You are actually using customName to look up a document in MongoDB.

According to step 4 in the gist, the end user will update the pileup fraction by providing 2 keys: pileupName and containerFraction. Hence, the MongoDB document lookup needs to be performed with the pileupName.
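
A minimal sketch of that request shape and lookup, assuming an in-memory list stands in for the MongoDB collection. Both helper names are hypothetical, and the stored documents are made up for illustration.

```python
def is_partial_pileup_spec(doc):
    """True when doc holds exactly the two keys of a partial pileup request."""
    return set(doc) == {"pileupName", "containerFraction"}


def find_by_pileup_name(stored_docs, spec):
    """Look up stored documents by pileupName (in-memory stand-in for MongoDB)."""
    return [d for d in stored_docs if d["pileupName"] == spec["pileupName"]]


# Made-up stored documents, for illustration only.
stored = [
    {"pileupName": "/A/B/PREMIX", "customName": "", "containerFraction": 1.0},
    {"pileupName": "/C/D/PREMIX", "customName": "", "containerFraction": 1.0},
]
request = {"pileupName": "/A/B/PREMIX", "containerFraction": 0.5}

if is_partial_pileup_spec(request):
    print(find_by_pileup_name(stored, request))
```

Comparing the key set, rather than only counting keys, also rejects two-key documents that carry unrelated fields.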


# perform check of input doc, is it partial pileup spec or not
partialPileupSpec = False
if len(doc.keys()) == 2:
Contributor

I suspect you missed this update(?)

obj = MSPileupObj(dbDoc, validRSEs=rseList)
else:
self.logger.info("#### full pileup obj")
obj = MSPileupObj(dbDoc, validRSEs=rseList)
Contributor

I think I wasn't clear in my previous message. What I am asking here is whether we really need this if/else statement. My understanding is that the pileup object is created in the same way, regardless of whether it is a full or a partial pileup.

# use EVERY single block defined in the custom dataset plus blocks from the standard pileup
cname = doc.get('customName', '')
if cname:
    blockNames = self.rucioClient.getBlocksInContainer(cname) + blockNames
Contributor

Valentin, I think the "remaining" word was ambiguous in my explanation, sorry. By remaining, I meant the remaining fraction of blocks, not the remaining blocks from the original dataset.

I updated the gist to reflect details of the container increase/decrease algorithm:
https://gist.github.com/amaltaro/b4f9bafc0b58c10092a0735c635538b5#logic-for-increasingdecreasing-container-fraction

However, if there is anything not yet clear, I am happy to go through these over Zoom.

blockNames += self.rucioClient.getBlocksInContainer(pname)

# get portion of DIDs based on ceil(containerFraction * num_rucio_datasets)
portion = blockNames[:math.ceil(fraction * len(blockNames))]

# we call attachDIDs Rucio wrapper API with our set of DIDs
newRules = []
for rse in doc['currentRSEs']:
    self.rucioClient.attachDIDs(rse, doc['customName'], portion, scope=self.customRucioScope)
Contributor

According to Stefano Belforte, CRAB creates a custom container and does not provide any RSEs when attaching DIDs to it. Based on that, I would suggest passing None as the RSE name (in practice, not setting any RSE when attaching a Rucio dataset to a Rucio container).

I still don't know how in practice that is going to be reflected in Rucio server, and what latency it's going to cause. But I believe it to be the safest approach at the moment.
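
A minimal sketch of this suggestion, assuming the wrapper accepts None for the RSE argument. `RecordingClient` is a hypothetical stand-in that only records calls; its `attachDIDs` argument order mirrors the call shown in the diff, while the container, block, RSE, and scope names are made up.

```python
class RecordingClient:
    """Hypothetical stand-in for the Rucio wrapper; it only records calls."""

    def __init__(self):
        self.calls = []

    def attachDIDs(self, rse, container, dids, scope="cms"):
        # Record the arguments instead of talking to a Rucio server.
        self.calls.append((rse, container, tuple(dids), scope))
        return True


def attach_portion(client, doc, portion, scope):
    """Attach the selected blocks once, without naming any RSE."""
    return client.attachDIDs(None, doc["customName"], portion, scope=scope)


client = RecordingClient()
doc = {"customName": "/Custom/Pileup-v1/PREMIX", "currentRSEs": ["T1_A", "T2_B"]}
attach_portion(client, doc, ["/P/U/PREMIX#1", "/P/U/PREMIX#2"], scope="hypothetical.scope")
print(client.calls)
```

With no RSE involved, the attachment happens once per container instead of once per entry in currentRSEs.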

self.rucioClient.attachDIDs(rse, doc['customName'], portion, scope=self.customRucioScope)

# create new rule for custom DID using pileup document rse
newRules += self.rucioClient.createReplicationRules(portion, rse)
Contributor

See my comment above, you are still creating rules for the wrong DID object.
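
One reading of this comment, consistent with the PR description ("create new rules for custom DID"), is that the rule should target the custom container rather than the individual blocks. The sketch below illustrates that reading only. `RuleRecorder` is a hypothetical stand-in, the `createReplicationRule(names, rseExpression, scope)` signature and the returned rule ids are assumptions, and all names are made up.

```python
class RuleRecorder:
    """Hypothetical stand-in that records rule requests and returns fake ids."""

    def __init__(self):
        self.rules = []

    def createReplicationRule(self, names, rseExpression, scope="cms"):
        ruleId = "rule-%d" % (len(self.rules) + 1)
        self.rules.append((ruleId, names, rseExpression, scope))
        return [ruleId]


def create_rules_for_container(client, doc, scope):
    """Create one rule per RSE, each targeting the custom container DID."""
    newRules = []
    for rse in doc["currentRSEs"]:
        newRules += client.createReplicationRule(doc["customName"], rse, scope=scope)
    return newRules


recorder = RuleRecorder()
doc = {"customName": "/Custom/Pileup-v1/PREMIX", "currentRSEs": ["T1_A", "T2_B"]}
print(create_rules_for_container(recorder, doc, scope="hypothetical.scope"))
```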

self.logger.info("update pileup document %s", doc)

# update MSPileup document in MongoDB
self.mgr.updatePileup(doc, rseList=newRules)
Contributor

You probably forgot to push your changes in.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 33 warnings
    • 64 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14769/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 33 warnings
    • 64 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14770/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 16 warnings and errors that must be fixed
    • 33 warnings
    • 64 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14771/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 3 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 33 warnings
    • 63 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14772/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jan 17, 2024

@amaltaro, I implemented the new logic for increasing/decreasing the pileup fraction via separate functions. You can find the corresponding commit in this PR. I also opened issue #11863 to address the separation of the validation step from the constructor of MSPileupObj, and I updated the PR description with a TODO to adjust the MongoDB records. All unit tests pass in my local setup and in Jenkins (though there is an unrelated failing unit test from workqueue). Please have another look; hopefully we can converge and merge this PR.

Contributor

@amaltaro amaltaro left a comment

Valentin, these changes are getting closer to the final product. I left a few comments along the code though and I would like to mention the following 3 points as well:

  • Please update the PR description with: correction of typos, containerFraction is a float not int, lines from the template should be either removed or replaced.
  • Perhaps we should have a script to update the pileup documents with a new transition record(?) Should we create a new GH issue and link it to the meta issue?
  • Even though we are passing the actual pileup document to the lower layers, I am almost sure that we retrieve pileup document multiple times from MongoDB (e.g. MSPileup.updatePileup, then MSPileupData.updatePileup, etc). If that is the case and it only affects efficiency of the service, we can get back to it at a later stage. However, if it can affect the smooth operation of the service, we should revisit it now.

src/python/WMCore/MicroService/MSPileup/MSPileup.py Outdated
for rid in doc['ruleIds']:
    # set expiration date to be 24h ahead of right now
    opts = {'lifetime': 24 * 60 * 60}
    self.rucioClient.updateRule(rid, opts)
Contributor

Actually, the rules that have been updated here are not associated with doc['customName']. That said, I would suggest refactoring the log message to something like: "Rule id: %s has been updated with lifetime: %s"
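
A minimal sketch of the expiration loop with the suggested log message. `FakeRucioClient` is a hypothetical stand-in whose `updateRule(ruleId, opts)` signature mirrors the call in the diff; its boolean return value, the helper name `expire_rules`, and the rule ids are assumptions made for illustration.

```python
import logging

DAY_IN_SECONDS = 24 * 60 * 60


class FakeRucioClient:
    """Hypothetical stand-in that remembers the options set on each rule."""

    def __init__(self):
        self.updated = {}

    def updateRule(self, ruleId, opts):
        self.updated[ruleId] = dict(opts)
        return True


def expire_rules(client, rule_ids, logger, lifetime=DAY_IN_SECONDS):
    """Set a lifetime on every existing rule and log each successful update."""
    opts = {"lifetime": lifetime}
    for rid in rule_ids:
        if client.updateRule(rid, opts):
            logger.info("Rule id: %s has been updated with lifetime: %s", rid, lifetime)


logging.basicConfig(level=logging.INFO)
client = FakeRucioClient()
expire_rules(client, ["rule-a", "rule-b"], logging.getLogger("MSPileupTasks"))
```

The message names only the rule id and the lifetime, so it does not suggest any link to doc['customName'].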

src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py Outdated
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py Outdated
src/python/WMCore/MicroService/MSPileup/MSPileupTasks.py Outdated
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 16 warnings and errors that must be fixed
    • 33 warnings
    • 63 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14773/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jan 18, 2024

Alan, thanks for the review. I applied all the changes you requested. I also edited the description to include the new issue for the requested script to update records in MongoDB. I added that issue and another one (separation of the validation step, which I marked as optional) to the meta-issue. Feel free to make a new review.

@vkuznet vkuznet requested a review from amaltaro January 18, 2024 13:23
Contributor

@amaltaro amaltaro left a comment

Thanks for the quick turnaround with the fixes, Valentin.
These changes look good to me and I think we should get those in and run some more realistic tests once these changes get deployed in testbed. Thanks!

@amaltaro amaltaro merged commit ccfc286 into dmwm:master Jan 18, 2024
3 of 4 checks passed
@amaltaro
Contributor

And I just noticed we merged this PR with 9 commits, instead of having them squashed before merge, sorry.

Development

Successfully merging this pull request may close these issues.

MSPileup: complete support to fraction of pileup container
3 participants