Update the spec parameters (MaxMemory, etc) for running workflow. #8622

Open
ticoann opened this issue May 18, 2018 · 9 comments

Comments

@ticoann
Contributor

ticoann commented May 18, 2018

As discussed with Alan,

  1. Add a timestamp in the ReqMgr CouchDB record when the spec file is updated.
  2. Add a new WMBS table with the workflow id and the update timestamp.
  3. Have the job updater compare the timestamp in WMBS (the table above) with the one in ReqMgr2 CouchDB;
    if the ReqMgr2 CouchDB record is newer, update the specs on disk (JobCache, sandbox) (see the sketch after this list).
  4. Update the WMBS table.
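
A minimal sketch of that timestamp comparison, with plain dicts standing in for the ReqMgr2 CouchDB record and the new WMBS table (none of the names below are existing WMCore code):

```python
# Self-contained sketch of the updater logic above; the dicts stand in for
# the ReqMgr2 CouchDB record (step 1) and the new WMBS table (step 2).
# All names and values are made up for illustration.

reqmgr_spec_timestamp = {"pdmvserv_task_EXAMPLE": 1526630000}  # ReqMgr2 CouchDB (step 1)
wmbs_spec_timestamp = {"pdmvserv_task_EXAMPLE": 1526540000}    # new WMBS table (step 2)


def refresh_local_spec(workflow):
    """Placeholder for re-fetching and unpacking the spec on disk (JobCache, sandbox)."""
    print("Refreshing on-disk spec for %s" % workflow)


def sync_workflow_spec(workflow):
    """Steps 3 and 4: refresh the local spec if ReqMgr2 has a newer one."""
    couch_ts = reqmgr_spec_timestamp.get(workflow)
    wmbs_ts = wmbs_spec_timestamp.get(workflow)
    if couch_ts is not None and (wmbs_ts is None or couch_ts > wmbs_ts):
        refresh_local_spec(workflow)              # step 3
        wmbs_spec_timestamp[workflow] = couch_ts  # step 4


sync_workflow_spec("pdmvserv_task_EXAMPLE")
```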
@amaltaro
Contributor

See #8646 for further details

@amaltaro amaltaro modified the milestones: WMAgent1806, WMAgent1807 Jun 14, 2018
@ticoann ticoann modified the milestones: WMAgent1807, WMAgent1809 Aug 28, 2018
@amaltaro amaltaro modified the milestones: WMAgent1809, WMAgent1810 Oct 1, 2018
@amaltaro amaltaro modified the milestones: WMAgent1810, WMAgent1902 Jan 7, 2019
@amaltaro
Contributor

Given that we initially thought about these changes more on the resource-requirements side (which can then be extended to site lists, etc.), it would be interesting to know how Unified does the workflow/job tweaking in order to better use grid resources.

@vlimant can you give us a brief explanation of how it's done in Unified (live resource-requirements updates)? Which services/APIs are used, and are all workflows under this monitoring, or only those explicitly configured for it?
Trying to evaluate how much we'd gain by implementing this in WMCore...

@amaltaro
Contributor

amaltaro commented Feb 5, 2019

@vlimant I'm planning to work on this ticket in the coming weeks, unless you think the Unified mechanism is good enough and we don't need it. So your input and answers to the questions I asked above would be highly appreciated.

@vlimant
Contributor

vlimant commented Feb 5, 2019

@amaltaro there are several other candidates for Unified integration already (#8914, #8921, #8920, #8324, ...); I believe those are the ones we put together as the first step of the integration.

The mechanism for ClassAd tweaking in Unified is all in https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/equalizor.py and depends on gwmsmon (although everything can be retrieved from ES directly). It will likely require further documentation of what exactly is done.
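
For reference, the kind of per-job ClassAd tweak such a script can apply through the HTCondor Python bindings is sketched below; the constraint, attribute and value are illustrative, not taken from equalizor.py:

```python
# Illustrative only: per-job ClassAd editing via the HTCondor Python bindings.
# The workflow name, attribute and new value are examples, not real settings.
import htcondor

schedd = htcondor.Schedd()  # local schedd; use Collector().locate() for a remote one

# Find idle jobs (JobStatus == 1) of one workflow.
idle_jobs = schedd.query(
    'JobStatus == 1 && WMAgent_RequestName == "pdmvserv_task_EXAMPLE"',
    ["ClusterId", "ProcId", "RequestMemory"],
)

# Edit them one at a time, which is how per-job tweaking works today.
for job in idle_jobs:
    job_id = "%d.%d" % (job["ClusterId"], job["ProcId"])
    schedd.edit([job_id], "RequestMemory", "4000")
```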

@amaltaro
Contributor

amaltaro commented Feb 5, 2019

Ok, none of the issues you pointed out are straightforward, but eventually we have to get them started...
If you can get this equalizor properly documented, it will certainly be helpful in the near future.

@amaltaro amaltaro modified the milestones: WMAgent1902, WMAgent1905 Feb 5, 2019
@sharad1126

sharad1126 commented Feb 5, 2020

@amaltaro, according to a short discussion with James this morning, these computations (memory tuning) are a little expensive and increase the load on the schedd. He mentioned that @todor-ivanov tried implementing something like this on the CRAB3 schedds and it made them slower. So the best place to implement this might be directly at the condor level (probably in a schedd attached to the negotiator), which could be a feature request to the condor developers. Maybe we can ask about this in the next condor developers meeting. @dpiparo FYI

@bbockelm
Contributor

bbockelm commented Feb 5, 2020

@sharad1126 - I’m not sure that comment makes much sense. Without knowing the exact thing Alan is planning, it could be almost no load - or very expensive.

In fact, if done right, this could be much more efficient than the current system because one could affect all idle jobs in a single transaction instead of doing it one-by-one (like Unified does today).
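
To illustrate the difference, with the Python bindings the bulk update would be a single constraint-based edit rather than a per-job loop (workflow name, attribute and value are again just examples):

```python
# Sketch of a single constraint-based edit: the schedd updates every matching
# idle job itself, instead of the client looping over them one by one.
import htcondor

schedd = htcondor.Schedd()
schedd.edit(
    'JobStatus == 1 && WMAgent_RequestName == "pdmvserv_task_EXAMPLE"',
    "RequestMemory",
    "4000",
)
```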

@sharad1126

@bbockelm I discussed this with @amaltaro and then with James Letts, and James told me exactly what I mentioned in the comment above. Of course it is a good idea to get this done, as it would help us make the system more efficient.

@amaltaro
Contributor

amaltaro commented Feb 7, 2020

What I have in mind is actually an update of the workflow spec file, such that jobs still waiting in the global workqueue (or waiting for the agent job splitting) could use the up-to-date parameters, thus stopping the usage of the JobRouter.

In the next phase of this tuning, we could also update jobs pending in the local condor queue (basically the same process as done for RequestPriority/JobPrio).

I believe those two approaches are not tightly coupled and can be delivered in different stages.
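
For concreteness, here is a very rough sketch of what the first phase could touch on the ReqMgr2/CouchDB side (the host, database and field names are assumptions, and the spec file itself is a WMWorkload attachment that would need its own update step):

```python
# Hypothetical sketch of phase one: bump a resource parameter and record the
# update time in the request document in CouchDB. Host, database and field
# names are assumptions, not actual ReqMgr2 schema.
import time
import requests

COUCH_URL = "http://localhost:5984/reqmgr_workload_cache"  # assumed database
workflow = "pdmvserv_task_EXAMPLE"

doc = requests.get("%s/%s" % (COUCH_URL, workflow)).json()  # includes current _rev
doc["Memory"] = 4000                                        # new memory requirement
doc["SpecUpdatedTimestamp"] = int(time.time())              # hypothetical field from the plan above

resp = requests.put("%s/%s" % (COUCH_URL, workflow), json=doc)
resp.raise_for_status()
```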
