Add partial pileup placement logic #11807
Conversation
Alan, it would be nice if you could review this before you leave. I implemented the initial logic but still need to do some testing, etc. Since the changes touch different parts of WMCore, e.g. Rucio, MSAuth, etc., it would be nice to have an initial review of them.
@vkuznet Valentin, these developments are on the right track, thanks.
I did leave a bunch of comments along the code though, and I wanted to make another comment here about the partialPileupTask function. One of the very first pre-processing steps in that function should be identifying whether the containerFraction has changed at all; the rest of the function should only be executed when there is a real change of container fraction.
@@ -89,6 +89,26 @@ def __init__(self, msConfig, **kwargs):
        with open(authFile, 'rb') as istream:
            self.authzKey = istream.read()

    def userDN(self):
I have the impression that this whole method could be replaced by a call to https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/REST/Auth.py#L17C5-L17C27. On the caller side, just access the dn field.
No, I prefer to have a proper wrapper function for two reasons:
- we should be able to test without a user DN (this is what the function provides)
- we already have the MSAuth module as a wrapper used in the MSPileup code, and bypassing this module to bring in REST/Auth.py feels wrong since MSAuth depends on it. In other words, we should depend on either REST/Auth.py or MSAuth.py, but not on both at the same time within the code base.
You can still test it without a user DN; you just need to move the DN check to a different place, of course.
About the dependency on REST/Auth.py: the MSAuth module already depends on it. That means there is no difference at this level whether MSPileup depends on (imports) MSAuth or REST/Auth, given that REST/Auth will be imported regardless of our choice.
We can stick to this development though; I am just trying not to inflate the codebase with, apparently, unnecessary extra developments/functions/methods.
@amaltaro , I need additional input from you on how we should handle PUT requests. We have two options here:
In the first case, we should perform an additional look-up of the corresponding pileup document, and therefore a few changes should be made to the code:
Please clarify which scenario to adopt.
@amaltaro , I'm awaiting your reply to my comment #11807 (comment). Please provide your feedback, as I need to know how to implement this in code.
@vkuznet Valentin, apologies for missing this question. This ends up calling the MSPileupData.updatePileup function, which itself already has a document lookup. Said that, your option 1 seems to me to be the way to go, which I copy below as well.
@vkuznet Valentin, I left a few comments in place. However, I also wanted to clarify the following:
a) if the container fraction is decreasing, instead of using random blocks from the standard pileup dataset (pileupName), we should use them from the custom dataset (customName), if one exists. This ensures that no data has to be staged from Tape; instead it will all be already available on Disk.
b) if the container fraction is increasing, we should use EVERY single block defined in the custom dataset (customName), if one exists, plus the required additional blocks from the standard pileup dataset (pileupName). This way we minimize the number of blocks that will have to be staged from Tape.
In other words (see the sketch after this list):
- when decreasing: new container is a subset of customName
- when increasing: new container is a superset of customName + subset of pileupName
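A minimal sketch of that selection rule, with illustrative names (the actual code would work on block lists returned by the Rucio wrapper's getBlocksInContainer):

import math

def selectBlocks(standardBlocks, customBlocks, fraction):
    """Pick blocks for the new custom container per the rule above."""
    wanted = math.ceil(fraction * len(standardBlocks))
    if wanted <= len(customBlocks):
        # decreasing: take a subset of the custom container, all already on Disk
        return customBlocks[:wanted]
    # increasing: keep every custom block and top up from the standard pileup
    extras = [blk for blk in standardBlocks if blk not in customBlocks]
    return customBlocks + extras[:wanted - len(customBlocks)]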
"""Test the customDID function""" | ||
pname = "/abc/xyz/MINIAOD" | ||
did = customDID(pname) | ||
self.assertTrue(did.endswith('-V1')) |
Please test the whole string. Same comment for the 2 lines below.
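For example, assuming customDID simply appends a version suffix to the container name (the expected value here is hypothetical), the test could compare the full string:

pname = "/abc/xyz/MINIAOD"
did = customDID(pname)
self.assertEqual(did, "/abc/xyz/MINIAOD-V1")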
- create new custom name as pileup+extension
- add new transition record and update the MSPileup document
- we call the attachDIDs Rucio wrapper API with our set of DIDs and RSEs from the pileup document
- create new rules for the custom DID using either FNAL disk or T2_CERN sites
We cannot assume the location will be FNAL disk or CERN. Instead, we might want to use either currentRSEs or expectedRSEs. Each of those carries its own risk, but I would stick with the former and let Rucio manage data accordingly.
@amaltaro , I addressed your feedback. A few comments though:
I think this PR is ready for final review. Please note that @klannon is interested in merging this PR within this calendar year; I'm heading into the holiday break starting this Thursday and will not be available until Jan 4th. Therefore, please speed up the review, and if new comments pop up I can try to resolve them by Wednesday. Said that, I checked the Jenkins report and the failed tests seem unrelated to this PR (I also noticed something weird in the Jenkins tests, like "you are not authorized" messages, which leads me to conclude that there may be an expired certificate around causing an avalanche effect).
Valentin, please find comments/requests/suggestions along the code.
For future changes, please keep them separated in their own commits for the moment, until we can have another review pass and squash them accordingly. Thanks.
    :return: results of MSPileup data layer (list of dicts)
    """
    self.authMgr.authorizeApiAccess('ms-pileup', 'update')
    keys = sorted(pdict.keys())
    if keys == ['containerFraction', 'pileupName']:
I fear this list comparison might not be reliable across different Python versions. Wouldn't a test like if "containerFraction" in pdict: be safer and as good as this one?
Why would it be different across Python versions? Please provide an example; a list is a basic data type and never changes. Said that, we can't check it with if "containerFraction" in pdict: either, since containerFraction is present in both the pileup document and the partial pileup spec. Unless you provide a better argument here, I'm leaving the code as is.
I don't have any concrete examples. It's just that more complex data types are sometimes harder to check for equality. If you have no concerns, then let it be.
@amaltaro , this PR is ready for review. Please note the following:
Valentin, apologies for the belated review. Please find a bunch of comments/requests/concerns along the code.
Regarding your previous comment:
"I ran all the MSPileup tests locally; without this I doubt they would ever finish. This is a particularly challenging PR, and relying on Jenkins was not an option (or it would take ages to run the unit-test logic). I would like to stress, particularly for this use case, that we should be able to run unit tests locally."
Why do you say we cannot run unit tests locally? What are the problems that you are encountering? Have you ever seen that we have a docker container which provides the FULL environment for unit tests (of course, without any certificates in case we need to reach external services)?
"Said that, the challenging logic comes from the separation of the MSPileup APIs from MSPileupTasks, i.e. the asynchronous nature of processing documents. It would be much simpler if we had everything under a single service and made all operations atomic."
Given how expensive some of these operations can be, I think it is not feasible to have atomic operations in this service, as that would require blocking user calls.
# find first document with custom name
results = []
if doc.get('customName', '') != '':
Based on what I said above, I think this code can be made simpler.
ok, I made it simpler, i.e. check for the pileup name and prohibit the custom name.
This is not what I see implemented. You are actually using customName to look up a document in MongoDB. According to step 4 in the gist, the end user will update the pileup fraction by providing 2 keys: pileupName and containerFraction. Hence, the MongoDB document lookup needs to be performed with the pileupName.
# perform check of input doc, is it partial pileup spec or not
partialPileupSpec = False
if len(doc.keys()) == 2:
This is a fragile check. I would suggest checking for the exact keys that we expect, so something like: if set(doc.keys()) == set(["pileupName", "containerFraction"])
I don't know if you already perform the following, but I would suggest also making sure that the new containerFraction is different from the current fraction, just so we can avoid unnecessary operations and potentially unnecessary partial pileup creation.
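Putting both suggestions together, a minimal sketch (dbDoc stands for the document already stored in MongoDB):

partialPileupSpec = set(doc.keys()) == {"pileupName", "containerFraction"}
if partialPileupSpec and doc["containerFraction"] == dbDoc["containerFraction"]:
    # no real change of fraction, skip the partial pileup workflow
    partialPileupSpec = False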
ok, updated.
I suspect you missed this update(?)
    obj = MSPileupObj(dbDoc, validRSEs=rseList)
else:
    self.logger.info("#### full pileup obj")
    obj = MSPileupObj(dbDoc, validRSEs=rseList)
If the pileup object is updated for something else, this rseList would be an empty list. Why can't we leave it with the same behavior for partial pileup (a few lines above)?
Due to the logic of MSPileupObj: it validates the document RSEs against validRSEs. Therefore, if the document we got from the DB does have RSEs, we should provide a new set of RSEs which should be in that list. Please see the logic of MSPileupObj: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSPileup/DataStructs/MSPileupObj.py#L176C22-L176C22
I think I wasn't clear in my previous message. What I am asking here is whether we really need this if/else statement. My understanding is that the pileup object is created in the same way, regardless of whether it is a full or partial pileup.
self.rucioClient.attachDIDs(rse, doc['customName'], portion, scope=self.customRucioScope)

# create new rule for custom DID using pileup document rse
newRules += self.rucioClient.createReplicationRules(portion, rse)
The replication rule needs to be created for the custom pileup, not for all the block names. So please replace portion with doc['customName']. BTW, where is createReplicationRules defined? I think you mixed up the method name here.
ok, done. createReplicationRule is defined here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L444 I fixed the function name though, i.e. createReplicationRules to createReplicationRule.
See my comment above, you are still creating rules for the wrong DID object.
for rid in doc['ruleIds']:
    # set expiration date to be 24h ahead of right now
    opts = {'lifetime': 24 * 60 * 60}
    self.rucioClient.updateRule(rid, opts)
I would log that we are updating rule id XXX with the new lifetime.
ok, done
Actually, these updated rules are not associated with doc['customName']. Said that, I would suggest refactoring the log message to something like: "Rule id: %s has been updated with lifetime: %s"
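A sketch of the suggested logging:

for rid in doc['ruleIds']:
    # extend the rule lifetime by 24 hours (value in seconds) and log it
    opts = {'lifetime': 24 * 60 * 60}
    self.rucioClient.updateRule(rid, opts)
    self.logger.info("Rule id: %s has been updated with lifetime: %s", rid, opts['lifetime'])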
self.logger.info("update pileup document %s", doc)

# update MSPileup document in MongoDB
self.mgr.updatePileup(doc, rseList=newRules)
bug? providing rules for rseList param(?)
Yes, it should be doc['expectedRSEs'] or doc['currentRSEs'], since my default rseList is None and the passed document may have either expectedRSEs or currentRSEs as non-empty lists, and during validation we check for them. Please clarify which list should be used.
You probably forgot to push your changes in.
@amaltaro , I addressed all the issues you reported in your review except a few items. Please reply to my observations/questions in place. In particular:
Valentin, I suspect you forgot to push your changes in, as I see many of your comments saying that the code was updated, while nothing changed compared to my previous review.
For now, I would suggest you provide additional commits, which might make the review experience easier and cleaner. Before we merge it, we can come back to this and properly squash the commits.
As mentioned along the code, I updated the gist with further instructions for decreasing/increasing the container, see: https://gist.github.com/amaltaro/b4f9bafc0b58c10092a0735c635538b5#logic-for-increasingdecreasing-container-fraction
Lastly, I just wanted to raise a concern that we are likely fetching the document from MongoDB multiple times within the same REST operation. Just a concern for the future though, as there are already tons of changes in this PR.
# use EVERY single block defined in the custom dataset plus blocks from the standard pileup
cname = doc.get('customName', '')
if cname:
    blockNames = self.rucioClient.getBlocksInContainer(cname) + blockNames
Valentin, I think the "remaining" word was ambiguous in my explanation, sorry. By remaining, I meant the remaining fraction of blocks, not the remaining blocks from the original dataset.
I updated the gist to reflect details of the container increase/decrease algorithm:
https://gist.github.com/amaltaro/b4f9bafc0b58c10092a0735c635538b5#logic-for-increasingdecreasing-container-fraction
However, if there is anything not yet clear, I am happy to go through these over Zoom.
blockNames += self.rucioClient.getBlocksInContainer(pname)

# get portion of DIDs based on ceil(containerFraction * num_rucio_datasets)
portion = blockNames[:math.ceil(fraction * len(blockNames))]
My gist has been updated:
https://gist.github.com/amaltaro/b4f9bafc0b58c10092a0735c635538b5#logic-for-increasingdecreasing-container-fraction
in case you need to apply corrections here.
# we call attachDIDs Rucio wrapper API with our set of DIDs
newRules = []
for rse in doc['currentRSEs']:
    self.rucioClient.attachDIDs(rse, doc['customName'], portion, scope=self.customRucioScope)
According to Stefano Belforte, he/CRAB creates custom containers and does not provide any RSEs when attaching DIDs to a container. Based on that, I would suggest passing None as the RSE name (in practice, not setting any RSE when attaching a Rucio dataset to a Rucio container). I still don't know how that is going to be reflected in the Rucio server in practice, and what latency it will cause, but I believe it to be the safest approach at the moment.
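In code, that would roughly amount to (same attachDIDs wrapper as in the diff above):

# attach the selected blocks to the custom container without pinning an RSE
self.rucioClient.attachDIDs(None, doc['customName'], portion, scope=self.customRucioScope)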
@amaltaro , I implemented the new logic for increasing/decreasing the pileup fraction via separate functions; you can find the appropriate commit in this PR. I also added issue #11863 to address separating the validation step from the MSPileupObj constructor, and I updated the PR description with a TODO to adjust the MongoDB records. All unit tests pass in my local setup and in Jenkins (though there is an unrelated failed unit test from workqueue). Please have another look; hopefully we can converge and merge this PR.
Valentin, these changes are getting closer to the final product. I left a few comments along the code though, and I would like to mention the following 3 points as well:
- Please update the PR description: correct the typos, note that containerFraction is a float, not an int, and either remove or replace the lines from the template.
- Perhaps we should have a script to update the pileup documents with a new transition record(?) Should we create a new GH issue and link it to the meta issue?
- Even though we are passing the actual pileup document to the lower layers, I am almost sure that we retrieve the pileup document multiple times from MongoDB (e.g. MSPileup.updatePileup, then MSPileupData.updatePileup, etc). If that is the case and it only affects the efficiency of the service, we can get back to it at a later stage. However, if it can affect the smooth operation of the service, we should revisit it now.
Alan, thanks for the review. I applied all the changes you requested. I also edited and adjusted the description, where I included the new issue for the requested script to update records in MongoDB. I also added it, together with another one (separation of the validation step, which I made optional though), to the meta-issue. Feel free to start a new review.
Thanks for the quick turnover with fixes, Valentin.
These changes look good to me and I think we should get those in and run some more realistic tests once these changes get deployed in testbed. Thanks!
And I just noticed we merged this PR with 9 commits, instead of having them squashed before merge, sorry.
Fixes #11621
Status
ready
Description
Introduce the partial pileup placement logic described in #11621. It involves the following:
- the partial pileup spec {containerFraction: <float>, 'pileupName': <string>} is passed to the updatePileup API, which by itself passes the call to the MSPileupData manager and calls its updatePileup API
- the updatePileup API performs the necessary checks, creates the customDID and creates a new transition record with the following structure:
In addition, we provide partialPileupTask in the MSPileupTask class, which is used during each polling cycle. This method performs the transition logic outlined in the following gist; the portion of blocks is selected via the ceil(containerFraction * num_rucio_datasets) formula (a small worked example is shown below).
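For instance, with a hypothetical standard container of 52 blocks and containerFraction = 0.4:

import math

numRucioDatasets = 52    # blocks in the standard pileup container
containerFraction = 0.4
portionSize = math.ceil(containerFraction * numRucioDatasets)  # -> 21 blocks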
TODO:
Further actions:
- since we introduce the transition record at creation of the pileup document, and the update API requires the transition record to be present, we should update all existing pileup documents in MongoDB (both production and testbed) to introduce transition records in the existing documents; see "Create new script for adjusting Pileup documents in MongoDB" #11867
Is it backward compatible (if not, which system does it affect?)
YES
Related PRs
External dependencies / deployment changes