Update the spec parameters (MaxMemory, etc) for running workflow. #8622
Comments
See #8646 for further details.
Given that we initially thought about these changes mostly in terms of resource requirements (they can later be extended to site lists, etc.), it would be interesting to know how Unified tweaks workflows/jobs to make better use of grid resources. @vlimant can you give us a brief explanation of how the live resource requirements update is done in Unified? Which services/APIs are used, and are all workflows under this monitoring, or only those it is configured for?
@vlimant I'm planning to work on this ticket in the coming weeks, unless you think the Unified mechanism is good enough and we don't need it. Your input and answers to the questions I asked above would be highly appreciated.
@amaltaro there are several other candidates for Unified integration already (#8914, #8921, #8920, #8324, ...); I believe those are the ones we put together as the first items of the integration. The mechanism for classad tweaking in Unified is all in https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/equalizor.py and depends on gwmsmon (although everything can be retrieved from ES directly). It will likely require further documentation of what exactly is done.
OK, none of the issues you pointed out is straightforward. But eventually we have to get them started...
@amaltaro, according to a short discussion with James this morning, these computations (memory tuning) are somewhat expensive and increase the load on the schedd. He mentioned that @todor-ivanov tried implementing something like this in the CRAB3 schedds, which made the schedds slower. So the best place could be to implement this directly at the condor level (probably in a schedd attached to the negotiator), and it could be a feature request to the condor developers. Maybe we can ask about this in the next condor developers meeting. @dpiparo FYI
@sharad1126 - I’m not sure that comment makes much sense. Without knowing the exact thing Alan is planning, it could be almost no load - or very expensive. In fact, if done right, this could be much more efficient than the current system because one could affect all idle jobs in a single transaction instead of doing it one-by-one (like Unified does today).
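To illustrate the point about a single transaction versus one-by-one edits, here is a minimal sketch. `FakeSchedd`, its `edit` method, and the job fields (`id`, `workflow`, `status`, `memory`) are hypothetical stand-ins invented for this example; they are not the HTCondor or WMCore API, and the call counter only approximates the cost of a round trip to the schedd.

```python
class FakeSchedd:
    """Hypothetical stand-in for a scheduler client; it only counts edit calls."""

    def __init__(self, jobs):
        # Each job is a dict with "id", "workflow", "status" and "memory" keys.
        self.jobs = jobs
        self.calls = 0

    def edit(self, predicate, attr, value):
        """Set attr=value on every job matching predicate, as one call."""
        self.calls += 1
        matched = 0
        for job in self.jobs:
            if predicate(job):
                job[attr] = value
                matched += 1
        return matched


def update_one_by_one(schedd, workflow, memory):
    """One edit call per idle job of the workflow."""
    idle_ids = [j["id"] for j in schedd.jobs
                if j["workflow"] == workflow and j["status"] == "idle"]
    for job_id in idle_ids:
        schedd.edit(lambda j, job_id=job_id: j["id"] == job_id, "memory", memory)


def update_in_bulk(schedd, workflow, memory):
    """A single edit call covering every idle job of the workflow."""
    return schedd.edit(
        lambda j: j["workflow"] == workflow and j["status"] == "idle",
        "memory", memory)
```

Both functions leave the jobs in the same final state; the difference is that the bulk variant issues one call regardless of how many idle jobs match, while the one-by-one variant issues one call per job.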
What I have in mind is actually an update of the workflow spec file, such that jobs still waiting in the global workqueue (or waiting for the agent job splitting) could use the up-to-date parameters, so we could stop using JobRouter. In the next phase of this tuning, we could also update jobs pending in the local condor queue (basically the same process as done for RequestPriority/JobPrio). I believe those 2 approaches are not tightly coupled and can be delivered in different stages.
As discussed with Alan,
if the reqmgr2 couchdb record is newer, update the specs on disk (JobCache, sandbox),
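The "update only if the record is newer" rule above could be sketched roughly as follows. Everything here is an assumption made for illustration: the JSON file is a stand-in for the real on-disk spec, and the `timestamp` field, the `tunable` tuple, and the function name are hypothetical. The actual spec format and the JobCache/sandbox layout are not covered.

```python
import json
from pathlib import Path


def refresh_spec_on_disk(record, spec_path, tunable=("MaxMemory",)):
    """Copy tunable parameters from a newer record into the spec file.

    record    -- dict standing in for the reqmgr2 couchdb document (hypothetical shape)
    spec_path -- path to a JSON file standing in for the on-disk spec
    Returns True if the file was rewritten, False if it was already up to date.
    """
    spec_path = Path(spec_path)
    spec = json.loads(spec_path.read_text())

    # Skip the update unless the remote record is strictly newer.
    if record["timestamp"] <= spec["timestamp"]:
        return False

    # Only overwrite the parameters that are meant to be tunable at runtime.
    for key in tunable:
        if key in record:
            spec[key] = record[key]
    spec["timestamp"] = record["timestamp"]

    spec_path.write_text(json.dumps(spec))
    return True
```

Restricting the copy to an explicit list of tunable keys keeps the rest of the spec untouched, which seems the safer default when a workflow is already running.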