
YUNIKORN-462: Streamline core to shim update on allocation change #58

Closed · wants to merge 2 commits

Conversation

@manirajv06 (Contributor)

What is this PR for?

SI changes to remove the ReSyncSchedulerCache plugin and add new fields to the AllocationRelease message.

What type of PR is it?

  • Improvement

Todos

  • Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-462

How should this be tested?

Screenshots (if appropriate)

Questions:

  • The license files need updating.
  • There are breaking changes for older versions.
  • It needs documentation.

@manirajv06 (Contributor, Author)

In addition to the changes discussed offline, I have also made changes to release allocations synchronously in a separate flow, as I did not want to disturb the other use cases (trigger points) that are fine with async communication. This kind of synchronous allocation release is required only for the processAllocationReleases() method, because the call to ForgetPod() happens only in this flow, through the ReSyncSchedulerCache plugin.

Once the overall approach makes sense, I will work on unit tests.

@wilfred-s (Contributor) left a comment:


Not sure about the AllocationRelease message changes.

Comment on lines 564 to 568
// AllocationKey from AllocationAsk
string allocationkey = 6;
Contributor:

This should already be covered in the message as the UUID. Otherwise we currently would not have the possibility to release a pod that is not allocated yet (which is what an ask is).

Contributor (Author):

AssumePod/ForgetPod expect an allocationKey to do their cache add/remove operations. Currently, the ReSyncSchedulerCache plugin wraps the AllocationKey and a few more fields in an AssumedAllocation/ForgotAllocation message on the core side and sends it to the shim, which eventually passes it to the AssumePod/ForgetPod methods. Now, with the removal of the ReSyncSchedulerCache plugin in this PR, I added AllocationKey to the AllocationRelease message, since AllocationRelease is passed to the shim (as AllocationResponse) through event processing.

Comment on lines 566 to 570
// update cache, default is false
bool updateCache = 7;
Contributor:

Can you explain why we would ever not want to update the cache? Would that not cause issues with an out-of-sync cache?
The cache should be idempotent and update when needed (a sketch follows the list):

  • delete can never fail: if the entry does not exist, the result is the same as a successful delete
  • for creates and updates we should do a create-or-update call, with the end result being a consistent cache representing the correct object
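
A minimal Go sketch of these idempotent semantics; the types and names here are hypothetical, not the actual shim cache code:

package cache

import "sync"

type podInfo struct {
    allocationKey string
    nodeID        string
}

type podCache struct {
    sync.Mutex
    assumed map[string]*podInfo // keyed by allocationKey
}

// Forget never fails: deleting a missing entry leaves the cache in the
// same end state as deleting an existing one.
func (c *podCache) Forget(allocationKey string) {
    c.Lock()
    defer c.Unlock()
    delete(c.assumed, allocationKey) // no-op if the key is absent
}

// Assume is a create-or-update: the end result is always one entry that
// represents the correct object, whether or not one existed before.
func (c *podCache) Assume(info *podInfo) {
    c.Lock()
    defer c.Unlock()
    c.assumed[info.allocationKey] = info // overwrites any stale entry
}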

Contributor (Author):

context#processAllocationReleases is the only place where the ReSyncSchedulerCache plugin is used to call ForgetPod on the shim side for cache updates, in addition to sending the RMReleaseAllocationEvent. The RMReleaseAllocationEvent uses the AllocationRelease message, and that message is used in a few more places on the core side (for example, context#handleRMUpdateApplicationEvent, node decommissioning, etc.) in addition to context#processAllocationReleases. Hence, to tell the shim whether or not to do the cache operation while processing the UpdateAllocation callback method, I used the "updateCache" field.

@wilfred-s (Contributor)

Sorry that it has taken so long to get back to you :-(
Way too long an update, but it sets the direction for the other two repos involved.

We have one change in mind: remove the ReSyncSchedulerCache call. The call is made twice from the core to the shim:

  • for new allocations
  • for removed allocations

First, new allocations: the message we send is an RMNewAllocationsEvent with all allocations that are new. That event is currently async, which is why we first call a sync cache update with partial info. The change is that the event will become a sync call. We have enough information in the current Allocations array: it is part of the event we send, and we can also call the assume of the pod inside the cache on the shim side by pulling that key from the allocation. We always call assume pod for every new allocation.

Simple change on the SI side: we can remove the AssumedAllocation message and leverage the existing information for the AllocationKey.
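
As a hedged sketch of what the shim side could look like after this change (illustrative types, not the real generated SI code):

package shim

// illustrative subset of the allocation fields carried by the event
type Allocation struct {
    AllocationKey string
    NodeID        string
}

// the new-allocations event, delivered synchronously after the change
type NewAllocationsEvent struct {
    Allocations []*Allocation
}

type assumer interface {
    Assume(allocationKey, nodeID string)
}

// assume the pod for every new allocation by pulling the key straight
// from the allocation in the event; no separate AssumedAllocation needed
func onNewAllocations(event *NewAllocationsEvent, cache assumer) {
    for _, alloc := range event.Allocations {
        cache.Assume(alloc.AllocationKey, alloc.NodeID)
    }
}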

On the remove side we have 4 locations where we send an RMReleaseAllocationEvent. Only in one location, as you pointed out, do we also call the sync of the cache. The sync of the cache triggers the forget of an assumed pod. Looking only at the paths that send events back, the cache sync is part of one of these calls:

  1. handleRMUpdateApplicationEvent handles the removal of an application. Does not call the cache sync.
  2. updateNode handles the node removal. Does not call the cache sync.
  3. schedule triggers the release of a placeholder. Does not call the cache sync.
  4. processAllocationReleases processes release requests sent by the shim. This calls the cache sync.

The termination type for calls 1, 2 and 4 is STOPPED_BY_RM; for call 3 it is PLACEHOLDER_REPLACED.
Every single allocation is assumed, as per the description above. So we should also forget a pod (remove it from the assumed-pod list) when we remove the pod, without exception. If we do not, we could leak the entry in the assumed-pod cache structure. There should be no difference in the communication between core and shim for any of these cases.

Simple change on the SI side: we can remove the ForgotAllocation message and add the allocationKey to the AllocationRelease message (check the case for the new field!).
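
An illustrative Go shape of the message after this change; the actual generated SI code and full field set may differ:

package si

// termination types named in the discussion
type TerminationType int32

const (
    StoppedByRM         TerminationType = iota // STOPPED_BY_RM
    PlaceholderReplaced                        // PLACEHOLDER_REPLACED
)

type AllocationRelease struct {
    PartitionName   string
    ApplicationID   string
    UUID            string
    TerminationType TerminationType
    Message         string
    // new field: lets the shim forget the assumed pod directly on release
    AllocationKey string
}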

The core sends the events synchronously. The shim collapses the assume call into the event processing for new allocations and returns as soon as possible, forking off the long-running tasks. The shim collapses the forget call into the event processing for the remove. If there are any special cases where the assumption of a removed pod should not be forgotten, the shim must implement them; the core should not be the one that decides this.
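
A rough sketch of that pattern on the release path, again with hypothetical names:

package shim

type release struct {
    allocationKey string
}

type forgetter interface {
    Forget(allocationKey string)
}

type shimContext struct {
    cache forgetter
}

// the forget is collapsed into the event processing; the long-running
// cleanup is forked so the core's synchronous call returns quickly
func (s *shimContext) onReleaseAllocations(releases []*release) {
    for _, rel := range releases {
        // forget without exception, so the assumed-pod cache cannot leak
        s.cache.Forget(rel.allocationKey)
    }
    go s.cleanupReleasedPods(releases) // long-running work, off the hot path
}

func (s *shimContext) cleanupReleasedPods(releases []*release) {
    // delete the pods, return resources, etc.
}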

After this we also need to completely remove the ReSyncSchedulerCacheArgs message and the ReSyncSchedulerCache call from the interface.
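
A before/after sketch of that interface change; the exact shape here is illustrative, not the real SI definition:

package api

type AllocationResponse struct{}
type ReSyncSchedulerCacheArgs struct{}

// before: the core drives shim cache updates through a dedicated call
type callbackBefore interface {
    UpdateAllocation(response *AllocationResponse) error
    ReSyncSchedulerCache(args *ReSyncSchedulerCacheArgs) error
}

// after: the dedicated call and its args message are gone; cache updates
// ride along with the now-synchronous allocation events
type callbackAfter interface {
    UpdateAllocation(response *AllocationResponse) error
}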

@manirajv06 (Contributor, Author)

Thanks for your detailed explanation and for giving me a clear picture.

I have taken care of SI and Core accordingly. As discussed, once YUNIKORN-876 goes in, we can work on the shim. In summary: we should clear the cache on the shim side for all these 4 calls. Since we need to clear the cache in all 4 calls, should we then make all these calls sync?

@wilfred-s (Contributor) left a comment:

LGTM

@wilfred-s closed this in e19a1b0 on Feb 18, 2022.