Add partial pileup placement logic #11807
Conversation
Alan, it would be nice if you could review this before you leave. I implemented the initial logic but still need to do some testing, etc. Since the changes touch different parts of WMCore, e.g. Rucio, MSAuth, etc., it would be nice to have an initial review of them.
@vkuznet Valentin, these developments are on the right track, thanks.
I did leave a bunch of comments along the code though, and I wanted to make another comment here about the partialPileupTask function. One of the very first pre-processing steps in that function should be identifying whether the containerFraction has changed at all; the rest of the function should only be executed when there is a real change of container fraction.
@@ -89,6 +89,26 @@ def __init__(self, msConfig, **kwargs):
        with open(authFile, 'rb') as istream:
            self.authzKey = istream.read()

    def userDN(self):
I have the impression that this whole method could be replaced by a call to https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/REST/Auth.py#L17C5-L17C27. On the caller side, just access the dn field.
No, I prefer to have a proper wrapper function for two reasons:
- we should be able to test without a user DN (this is what the function provides)
- we already have the MSAuth module as a wrapper used in the MSPileup code, and bypassing this module to bring in REST/Auth.py feels wrong since MSAuth depends on it. In other words, we should depend on either REST/Auth.py or MSAuth.py, but not on both at the same time within the code base.
You can still test it without a user DN; you just need to move the DN check to a different place, of course.
About the dependency on REST/Auth.py: the MSAuth module already depends on it. That means there is no difference at this level whether MSPileup depends on (imports) MSAuth or REST/Auth, given that REST/Auth will be imported regardless of our choice.
We can stick to this development though; I am just trying not to inflate the codebase with, apparently, unnecessary extra developments/functions/methods.
@amaltaro , I need additional input from you on how we should handle PUT requests. We have two options here:
In the first case, we should perform an additional look-up of the corresponding pileup document, and therefore a few changes should be made to the code:
Please clarify which scenario to adopt.
@amaltaro , I'm awaiting your reply to my comment #11807 (comment). Please provide your feedback, as I need to know how to implement this in code.
@vkuznet Valentin, apologies for missing this question. This ends up calling the MSPileupData.updatePileup function, which itself already has a document lookup. Said that, your option 1 seems to me to be the way to go, which I copy below as well.
@vkuznet Valentin, I left a few comments in place. However, I also wanted to clarify the following:
a) if the container fraction is decreasing, instead of using random blocks from the standard pileup dataset (pileupName), we should use them from the custom dataset (customName), if one exists. This ensures that no data has to be staged from Tape; instead it will all be already available on Disk.
b) if the container fraction is increasing, we should use EVERY single block defined in the custom dataset (customName), if one exists, plus the required additional blocks from the standard pileup dataset (pileupName). This way we minimize the number of blocks that will have to be staged from Tape.
In other words (see the sketch after this list):
- when decreasing: new container is a subset of customName
- when increasing: new container is a superset of customName + subset of pileupName
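A minimal sketch of that selection rule, with illustrative names (the actual code would work on block lists returned by the Rucio wrapper's getBlocksInContainer):

import math

def selectBlocks(standardBlocks, customBlocks, fraction):
    """Pick blocks for the new custom container per the rule above."""
    wanted = math.ceil(fraction * len(standardBlocks))
    if wanted <= len(customBlocks):
        # decreasing: take a subset of the custom container, all already on Disk
        return customBlocks[:wanted]
    # increasing: keep every custom block and top up from the standard pileup
    extras = [blk for blk in standardBlocks if blk not in customBlocks]
    return customBlocks + extras[:wanted - len(customBlocks)]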
"""Test the customDID function""" | ||
pname = "/abc/xyz/MINIAOD" | ||
did = customDID(pname) | ||
self.assertTrue(did.endswith('-V1')) |
Please test the whole string. Same comment for the 2 lines below.
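For example, assuming customDID simply appends a version suffix to the container name (the expected value here is hypothetical), the test could compare the full string:

pname = "/abc/xyz/MINIAOD"
did = customDID(pname)
self.assertEqual(did, "/abc/xyz/MINIAOD-V1")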
- create new custom name as pileup+extension
- add new transition record and update the MSPileup document
- we call the attachDIDs Rucio wrapper API with our set of DIDs and RSEs from the pileup document
- create new rules for the custom DID using either FNAL disk or T2_CERN sites
We cannot assume the location will be FNAL disk or CERN. Instead, we might want to use either currentRSEs or expectedRSEs. Each of those carries its own risk, but I would stick with the former and let Rucio manage data accordingly.
@amaltaro , I addressed your feedback. A few comments though:
I think this PR is ready for final review. Please note that @klannon is interested in merging this PR within this calendar year; I'm heading into the holiday break starting this Thursday and will not be available until Jan 4th. Therefore, please speed up the review, and if new comments pop up I can try to resolve them by Wednesday. Said that, I checked the Jenkins report and the failed tests seem unrelated to this PR (I also noticed something weird in the Jenkins tests, like "you are not authorized" messages, which leads me to conclude that there may be an expired certificate around causing an avalanche effect).
Valentin, please find comments/requests/suggestions along the code.
For future changes, please keep them separated in their own commits for the moment, until we can have another review pass and squash them accordingly. Thanks.
    :return: results of MSPileup data layer (list of dicts)
    """
    self.authMgr.authorizeApiAccess('ms-pileup', 'update')
    keys = sorted(pdict.keys())
    if keys == ['containerFraction', 'pileupName']:
I fear this list comparison might not be reliable across different Python versions. Wouldn't a test like if "containerFraction" in pdict: be safer and as good as this one?
Why would it be different across Python versions? Please provide an example; a list is a basic data type and never changes. Said that, we can't check it with if "containerFraction" in pdict: either, since containerFraction is present in both the pileup document and the partial pileup spec. Unless you provide a better argument here, I'm leaving the code as is.
I don't have any concrete examples. It's just that more complex data types are sometimes harder to check for equality. If you have no concerns, then let it be.
@amaltaro , this PR is ready for review. Please note the following:
Valentin, apologies for the belated review. Please find a bunch of comments/requests/concerns along the code.
Regarding your previous comment:
"I ran all the MSPileup tests locally; without this I doubt they would ever finish. This is a particularly challenging PR, and relying on Jenkins was not an option (or it would take ages to run the unit-test logic). I would like to stress, particularly for this use case, that we should be able to run unit tests locally."
Why do you say we cannot run unit tests locally? What are the problems that you are encountering? Have you ever seen that we have a docker container which provides the FULL environment for unit tests (of course, without any certificates in case we need to reach external services)?
"Said that, the challenging logic comes from the separation of the MSPileup APIs from MSPileupTasks, i.e. the asynchronous nature of processing documents. It would be much simpler if we had everything under a single service and made all operations atomic."
Given how expensive some of these operations can be, I think it is not feasible to have atomic operations in this service, as that would require blocking user calls.
# find first document with custom name
results = []
if doc.get('customName', '') != '':
Based on what I said above, I think this code can be made simpler.
ok, I made it simpler, i.e. check for the pileup name and prohibit the custom name.
This is not what I see implemented. You are actually using customName to look up a document in MongoDB. According to step 4 in the gist, the end user will update the pileup fraction by providing 2 keys: pileupName and containerFraction. Hence, the MongoDB document lookup needs to be performed with the pileupName.
# perform check of input doc, is it partial pileup spec or not
partialPileupSpec = False
if len(doc.keys()) == 2:
This is a fragile check. I would suggest checking for the exact keys that we expect, so something like: if set(doc.keys()) == set(["pileupName", "containerFraction"])
I don't know if you already perform the following, but I would suggest also making sure that the new containerFraction is different from the current fraction, just so we can avoid unnecessary operations and potentially unnecessary partial pileup creation.
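Putting both suggestions together, a minimal sketch (dbDoc stands for the document already stored in MongoDB):

partialPileupSpec = set(doc.keys()) == {"pileupName", "containerFraction"}
if partialPileupSpec and doc["containerFraction"] == dbDoc["containerFraction"]:
    # no real change of fraction, skip the partial pileup workflow
    partialPileupSpec = False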
ok, updated.
I suspect you missed this update(?)
    obj = MSPileupObj(dbDoc, validRSEs=rseList)
else:
    self.logger.info("#### full pileup obj")
    obj = MSPileupObj(dbDoc, validRSEs=rseList)
If the pileup object is updated for something else, this rseList would be an empty list. Why can't we leave it with the same behavior for partial pileup (a few lines above)?
Due to the logic of MSPileupObj: it validates the document RSEs against validRSEs. Therefore, if the document we got from the DB does have RSEs, we should provide a new set of RSEs which should be in that list. Please see the logic of MSPileupObj: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSPileup/DataStructs/MSPileupObj.py#L176C22-L176C22
I think I wasn't clear in my previous message. What I am asking here is whether we really need this if/else statement. My understanding is that the pileup object is created in the same way, regardless of whether it is a full or partial pileup.
self.rucioClient.attachDIDs(rse, doc['customName'], portion, scope=self.customRucioScope)

# create new rule for custom DID using pileup document rse
newRules += self.rucioClient.createReplicationRules(portion, rse)
The replication rule needs to be created for the custom pileup, not for all the block names. So please replace portion with doc['customName']. BTW, where is createReplicationRules defined? I think you mixed up the method name here.
ok, done. createReplicationRule is defined here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L444 I fixed the function name though, i.e. createReplicationRules to createReplicationRule.
See my comment above, you are still creating rules for the wrong DID object.
for rid in doc['ruleIds']:
    # set expiration date to be 24h ahead of right now
    opts = {'lifetime': 24 * 60 * 60}
    self.rucioClient.updateRule(rid, opts)
I would log that we are updating rule id XXX with the new lifetime.
ok, done
Actually, these updated rules are not associated with doc['customName']. Said that, I would suggest refactoring the log message to something like: "Rule id: %s has been updated with lifetime: %s"
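A sketch of the suggested logging:

for rid in doc['ruleIds']:
    # extend the rule lifetime by 24 hours (value in seconds) and log it
    opts = {'lifetime': 24 * 60 * 60}
    self.rucioClient.updateRule(rid, opts)
    self.logger.info("Rule id: %s has been updated with lifetime: %s", rid, opts['lifetime'])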
self.logger.info("update pileup document %s", doc)

# update MSPileup document in MongoDB
self.mgr.updatePileup(doc, rseList=newRules)
bug? providing rules for rseList param(?)
Yes, it should be doc['expectedRSEs'] or doc['currentRSEs'], since my default rseList is None and the passed document may have either expectedRSEs or currentRSEs as non-empty lists, and during validation we check for them. Please clarify which list should be used.
You probably forgot to push your changes in.
@amaltaro , I addressed all the issues you reported in your review except a few items. Please reply to my observations/questions in place. In particular:
Valentin, I suspect you forgot to push your changes in, as I see many of your comments saying that the code was updated, while nothing changed compared to my previous review.
For now, I would suggest you provide additional commits, which might make the review experience easier and cleaner. Before we merge it, we can come back to this and properly squash the commits.
As mentioned along the code, I updated the gist with further instructions for decreasing/increasing the container, see: https://gist.github.com/amaltaro/b4f9bafc0b58c10092a0735c635538b5#logic-for-increasingdecreasing-container-fraction
Lastly, I just wanted to raise a concern that we are likely fetching the document from MongoDB multiple times within the same REST operation. Just a concern for the future though, as there are already tons of changes in this PR.
# use EVERY single block defined in the custom dataset plus blocks from the standard pileup
cname = doc.get('customName', '')
if cname:
    blockNames = self.rucioClient.getBlocksInContainer(cname) + blockNames
Valentin, I think the "remaining" word was ambiguous in my explanation, sorry. By remaining, I meant the remaining fraction of blocks, not the remaining blocks from the original dataset.
I updated the gist to reflect details of the container increase/decrease algorithm:
https://gist.github.com/amaltaro/b4f9bafc0b58c10092a0735c635538b5#logic-for-increasingdecreasing-container-fraction
However, if there is anything not yet clear, I am happy to go through these over Zoom.
blockNames += self.rucioClient.getBlocksInContainer(pname)

# get portion of DIDs based on ceil(containerFraction * num_rucio_datasets)
portion = blockNames[:math.ceil(fraction * len(blockNames))]
My gist has been updated:
https://gist.github.com/amaltaro/b4f9bafc0b58c10092a0735c635538b5#logic-for-increasingdecreasing-container-fraction
in case you need to apply corrections here.
# we call attachDIDs Rucio wrapper API with our set of DIDs
newRules = []
for rse in doc['currentRSEs']:
    self.rucioClient.attachDIDs(rse, doc['customName'], portion, scope=self.customRucioScope)
According to Stefano Belforte, he/CRAB creates custom containers and does not provide any RSEs when attaching DIDs to a container. Based on that, I would suggest passing None as the RSE name (in practice, not setting any RSE when attaching a Rucio dataset to a Rucio container). I still don't know how that is going to be reflected in the Rucio server in practice, and what latency it will cause, but I believe it to be the safest approach at the moment.
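In code, that would roughly amount to (same attachDIDs wrapper as in the diff above):

# attach the selected blocks to the custom container without pinning an RSE
self.rucioClient.attachDIDs(None, doc['customName'], portion, scope=self.customRucioScope)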
@amaltaro , I implemented the new logic for increasing/decreasing the pileup fraction via separate functions; you can find the appropriate commit in this PR. I also added issue #11863 to address separating the validation step from the MSPileupObj constructor, and I updated the PR description with a TODO to adjust the MongoDB records. All unit tests pass in my local setup and in Jenkins (though there is an unrelated failed unit test from workqueue). Please have another look; hopefully we can converge and merge this PR.
Valentin, these changes are getting closer to the final product. I left a few comments along the code though, and I would like to mention the following 3 points as well:
- Please update the PR description: correct the typos, note that containerFraction is a float, not an int, and either remove or replace the lines from the template.
- Perhaps we should have a script to update the pileup documents with a new transition record(?) Should we create a new GH issue and link it to the meta issue?
- Even though we are passing the actual pileup document to the lower layers, I am almost sure that we retrieve the pileup document multiple times from MongoDB (e.g. MSPileup.updatePileup, then MSPileupData.updatePileup, etc). If that is the case and it only affects the efficiency of the service, we can get back to it at a later stage. However, if it can affect the smooth operation of the service, we should revisit it now.
Alan, thanks for the review. I applied all the changes you requested. I also edited and adjusted the description, where I included the new issue for the requested script to update records in MongoDB. I also added it, together with another one (separation of the validation step, which I made optional though), to the meta-issue. Feel free to start a new review.
Thanks for the quick turnover with fixes, Valentin.
These changes look good to me and I think we should get those in and run some more realistic tests once these changes get deployed in testbed. Thanks!
And I just noticed we merged this PR with 9 commits, instead of having them squashed before merge, sorry.
Fixes #11621
Status
ready
Description
Introduce the partial pileup placement logic described in #11621. It involves the following:
- the partial pileup spec {containerFraction: <float>, 'pileupName': <string>} is passed to the updatePileup API, which by itself passes the call to the MSPileupData manager and calls its updatePileup API
- the updatePileup API performs the necessary checks, creates the customDID and creates a new transition record with the following structure:
In addition, we provide partialPileupTask in the MSPileupTask class, which is used during each polling cycle. This method performs the transition logic outlined in the following gist; the portion of blocks is selected via the ceil(containerFraction * num_rucio_datasets) formula (a small worked example is shown below).
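For instance, with a hypothetical standard container of 52 blocks and containerFraction = 0.4:

import math

numRucioDatasets = 52    # blocks in the standard pileup container
containerFraction = 0.4
portionSize = math.ceil(containerFraction * numRucioDatasets)  # -> 21 blocks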
TODO:
Further actions:
- since we introduce the transition record at creation of the pileup document, and the update API requires the transition record to be present, we should update all existing pileup documents in MongoDB (both production and testbed) to introduce transition records in the existing documents; see "Create new script for adjusting Pileup documents in MongoDB" #11867
Is it backward compatible (if not, which system does it affect?)
YES
Related PRs
External dependencies / deployment changes