Rewrite allocator #2615

Open · wants to merge 4 commits into master from rewrite-allocator
Conversation

@dperny (Collaborator) commented Apr 27, 2018

- What I did

Added all of the lower-level allocator subcomponents, paving the way for the final bits of allocator rewrite.

Includes and supersedes #2579 and #2605. Does not integrate the new components, which still requires a rewrite of the top-level Allocator object (which is in progress on a local branch).

WIP because it lacks tests, but all code is ready for review.

- How I did it

Painfully.

- How to review it

This PR is massive, and will be difficult to adequately review. Do not let this discourage you; the design should be clearly delineated, and the components are loosely coupled. There are many comments. If you do not understand why a particular thing was done a particular way during your review, that's not a sign that you can't review it; it's a sign that it hasn't been commented adequately to explain it.

This PR should be reviewable by anyone with any level of familiarity with the code. The goal is to create a maintainable, easy-to-understand component that even new contributors to the project can easily understand. Effort has been made to ensure that nothing in the code assumes itself to be obvious.

To make reviewing easier, you should follow these guidelines:

  • Start by reading doc.go, which should explain the purpose and use of each package. Then, read the doc comments of the public interfaces. Make sure they make sense to you, and adequately explain the intended behavior of the components. If they do not, request changes.
  • Look over the design and method signatures. Do the signatures make sense? Are the methods easy to use? Do the concerns of each component seem well-separated?
  • Review the lower-level components first. The port package has previously been subject to review in Add new PortAllocator #2579, but has received changes. The ipam package has been partially reviewed in [WIP] Add new IPAM subcomponent #2605, but has received substantial changes. The driver and top-level network packages have not been reviewed yet.
  • Review the Allocator component. Does it fit the sub-components together well? Does it adequately respect the separation of concerns?
  • Review the tests. Do they cover the cases adequately?

- How to test it

This PR is marked WIP primarily because it currently lacks complete testing. The tests it does have are written with the Ginkgo framework.

- Description for the changelog

@dperny (Collaborator Author) commented Apr 27, 2018

TODO list for the stuff we need to do to ship this

  • Write new top-level allocator
  • Ensure manager/allocator/allocator_test.go runs against the new allocator
  • Go through old allocator code, and match old logic to new logic. Make sure all of the logic in the old allocator is covered in the new one
  • Write and run performance tests, optimize where appropriate.

@codecov bot commented Apr 27, 2018

Codecov Report

Merging #2615 into master will increase coverage by 0.61%.
The diff coverage is 72.46%.

@@            Coverage Diff             @@
##           master    #2615      +/-   ##
==========================================
+ Coverage   61.76%   62.37%   +0.61%     
==========================================
  Files         134      142       +8     
  Lines       21760    23134    +1374     
==========================================
+ Hits        13440    14430     +990     
- Misses       6871     7196     +325     
- Partials     1449     1508      +59

@dperny (Collaborator Author) commented Apr 28, 2018

@aaronlehmann you wrote some of my favorite code in swarmkit, so I'm really interested in your high-level opinion of this new code, even if you don't have time for a review.

@aaronlehmann (Collaborator) left a comment:

I mostly looked at the port allocator because it's most interesting and approachable to me.

// from having to worry about the overlying storage method, and reduces the
// already rather bloated surface are of this component
//
// The network allocator provideds the interface for the top-level API objets,
Collaborator:

provides

// caller will provide all of the objects, and be in charge of committing those
// objects to the raft store. Instead, it works directly on the api objects,
// and gives them back to the caller to figure out how to store. This frees us
// from having to worry about the overlying storage method, and reduces the
Collaborator:

underlying?

// objects to the raft store. Instead, it works directly on the api objects,
// and gives them back to the caller to figure out how to store. This frees us
// from having to worry about the overlying storage method, and reduces the
// already rather bloated surface are of this component
Collaborator:

surface area

"github.com/docker/swarmkit/manager/allocator/network/errors"
)

// DrvRegistry is an interface defining
Collaborator:

Comment seems incomplete

if oldService, ok := a.services[service.ID]; ok {
var oldVersion, newVersion uint64
// we need to do this dumb dance because for some crazy reason
// SpecVersion is nullable
Collaborator:

SpecVersion did not always exist. If it was not nullable, we would see a SpecVersion with default values on services that predate the field.

Collaborator Author:

yeah this comment isn't super constructive.

//
// Because I'm reluctant to alter what I have that already works, we will not
// return ErrAlreadyAllocated if the endpoint is already allocated. Instead,
// callers can use the AlreadyAllocated function from this package.
Collaborator:

I think doing this might be a net simplification. In the past, there's been a big problem with port allocation logic being spread across a number of places. I really like the idea of centralizing it.


// IsNoop returns true if the ports in use before this proposal are the same as
// the ports in use after.
func (prop *proposal) IsNoop() bool {
Collaborator:

I have a slight preference for Allocate figuring out whether any state changes are necessary. It could return a proposal with empty allocate and deallocate sets when there is nothing to do, or a separate type with a Commit method that does nothing (so proposals that change state and don't change state are distinguished by type, not with a method call).

Collaborator Author:

this is almost entirely for the benefit of the tests, and i think the "correct" answer may actually be to change the tests to not use the port_test package and instead allow them to look at an unexported method or field providing this functionality.

Contributor:

I made a PR to your forked repository with a couple optimizations to the Allocate loop. The second one is based on this comment from @aaronlehmann.

P.S: If the optimizations get accepted, IsNoop results in checking that both allocate and deallocate are empty, so it may not even need a method if you opt to not use the port_test package.
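For illustration, here is a minimal Go sketch of what that could look like. All names here (portKey, proposal, the field names) are hypothetical, not the PR's actual definitions:

    package port

    // portKey identifies a port by protocol and number (illustrative only).
    type portKey struct {
        protocol string
        number   uint32
    }

    // proposal records the delta between the ports in use before and after
    // an Allocate call; nothing is applied until Commit is called.
    type proposal struct {
        allocate   map[portKey]struct{} // ports newly claimed by this proposal
        deallocate map[portKey]struct{} // ports this proposal releases
    }

    // IsNoop reports whether committing the proposal would change any state.
    // With the proposed optimizations, it reduces to both sets being empty.
    func (p *proposal) IsNoop() bool {
        return len(p.allocate) == 0 && len(p.deallocate) == 0
    }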

// necessary information to commit the change. Calling the `Commit` method on
// the Proposal will commit the changes to the Allocator. If the caller decides
// to abandon the changes, the user can simply abandon the Proposal without
// calling Commit.
Collaborator:

This seems like a good approach. My only hesitation is that it bakes the single-thread assumption into the API. If we ever wanted to support concurrency, it would have to change so that the state is updated immediately, with the possibility of rolling back an allocation. I don't think this is a big deal because it wouldn't be a difficult change.

Collaborator Author:

we can't support concurrency and rolling back. if we support concurrency, you may not be able to roll back a port allocation, because one of the ports you released may have been claimed. you'd have to split out allocation and de-allocation into separate steps.

if you look in the manager/allocator/network.go AllocateService method, we allocate ports, get a proposal, and then try allocating IP addresses. If the IP address allocation fails, we discard the proposal. in the ipam allocator, when allocating IP addresses, you allocate first, and then if allocation is successful, you free addresses. at that point, no further errors can occur, so you can safely commit the proposal. if you allocated ports after IP addresses, you might find that IP allocation succeeds but port allocation fails, and then you'd have to roll back IP address allocation in the same way and risk a freed resource being taken.

i think that this actually makes me realize there is one use for the IsNoop method on proposals (or some similar construct). concurrently allocating IP addresses would be a big win, but is blocked because a lock on the portallocator would have to be acquired before allocating and released after committing, and the IPAM call is inside that synchronized path. however, if you found the proposal was a noop and did not alter port state (which i suspect to be a more common case), you could release that lock immediately and allow other service allocations to proceed calling into the ip allocator concurrently.

on the other hand, i strongly suspect network updates to services are going to be infrequent enough to not substantially impact the performance of the allocator even if that whole code path is single-threaded.

Contributor:

As stated by dperny, the commit immediately and roll back later approach is not thread-safe for deallocations. The prepare a proposal and commit later approach could be made thread-safe by keeping a logical clock in the allocator that gets updated every time a proposal is committed, but this would mean that any concurrent proposal would have to test its validity again. If a proposal is not valid due to a concurrent change, then all the benefit of concurrency would be lost. And any deallocation in a proposal would make it invalid, as you cannot be sure whether the committed proposal that triggered this validity check has deallocated and reallocated that resource.

Would a hybrid solution work? Allocating immediately with rollback support, but not deallocating until commit. This would force proposals to be either committed or reversed; dropping a proposal would result in some resources being allocated forever until the swarm was rebuilt. I think this approach may be thread-safe, but it has to be taken into account that proposals hold both the resources they want to allocate and the ones they want to deallocate, so a call to Allocate may appear to lack enough resources due to a concurrent proposal when it actually has them.

Contributor:

Discussed this with @dperny offline. The problem with making this multi-threaded is not deallocs, since if there is an error with deallocs, we just roll forward and commit. So, in the face of failures, deallocs are effectively best-effort.

So I think the commit immediately and roll back later approach should work for supporting multiple concurrent proposals, correct ? @dperny @Adirio

Contributor:

I'm not sure if I understand what you mean by roll forward, but I don't think deallocs are safe in the commit immediately and roll back later approach. Let me describe the proposed approaches to make sure we understand the same thing:

Resources need to be allocated and deallocated. To be thread-safe, these operations need to be done atomically; that is, all of them need to be committed simultaneously, or you cannot guarantee that the later ones will not fail. Three approaches were mentioned:

  1. Prepare a proposal and commit later: when preparing a proposal you can ensure its validity, but as the allocator state can change before committing, the validity of the proposal cannot be ensured at commit time. To ensure both, we could use a logical clock on the state that would be updated with every commit; if the clock has a different value at commit time than when the proposal was created, the proposal would need to be checked again, and if there was any conflict, not only would a new proposal be needed, but all the steps between proposal generation and commit time would need to be executed again. Deallocs in this case will never result in a conflict, as deallocating is idempotent; in fact, conflicts should never happen because, although the allocator does not know who owns which resource, each resource has a single owner, so only that owner will try to deallocate it. On the other hand, allocs can result in conflicts if the same resource was already allocated (and committed) concurrently.
  2. Commit immediately and roll back later: the opposite approach, committing at proposal preparation and rolling back if unsuccessful. A key difference from the previous method is that now unsuccessful proposals are the ones that need to be rolled back while successful ones can be discarded, the reverse of the first approach. Validity can be ensured at commit time but cannot be ensured at rollback time, as a deallocated resource can have been fully allocated elsewhere (committed, with the proposal discarded as it was considered valid without rollback), so the rollback would fail to re-allocate that resource and thus be unable to return to a valid state. A logical clock could also be used here, but with the same implications as in the first case.
  3. Hybrid (allocate immediately and roll back later, prepare deallocations and commit later): this approach combines the previous two. It allocates the new resources at proposal preparation but doesn't release already-in-use resources until the commit phase, when it either releases the deallocated resources if successful or the pre-allocated resources if not. As you can guess, both successful and unsuccessful proposals would need to be committed/rolled back and none could be discarded, but the validity of the proposal can be ensured at commit time.

Summing up (which operations are safe and which proposals can be dropped):

  1. Prepare a proposal and commit later: unsafe allocs and safe deallocs; commit successful and drop unsuccessful.
  2. Commit immediately and roll back later: safe allocs and unsafe deallocs; drop successful and roll back unsuccessful.
  3. Hybrid (allocate immediately and roll back later, prepare deallocations and commit later): safe allocs and deallocs; commit successful and roll back unsuccessful.

The only method that is concurrency-proof is the third one. Its drawbacks are being unable to drop any proposal, and reserving resources that stay locked while the proposal is active.

You claim that deallocs are not a problem since we can just roll forward, so the second approach can be used. What do you mean by roll forward? The resource that we deallocated at proposal preparation may have been allocated by another proposal, and thus we would be unable to get it back. Do you mean that we should allocate another resource and swap it for the one we cannot reallocate?
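To make the discussion concrete, here is a rough Go sketch of the hybrid approach (number 3). Every name is hypothetical; the PR's real proposal type differs:

    package port

    import "sync"

    type portKey struct {
        protocol string
        number   uint32
    }

    // Allocator tracks the ports in use; the mutex makes Commit and Abort
    // atomic with respect to concurrent proposals.
    type Allocator struct {
        mu    sync.Mutex
        inUse map[portKey]struct{}
    }

    // hybridProposal claims new ports eagerly but defers releases to Commit.
    type hybridProposal struct {
        alloc     *Allocator
        allocated []portKey // claimed when the proposal was built
        toRelease []portKey // released only when Commit is called
    }

    // Commit performs the deferred deallocations; the eager claims stand.
    func (p *hybridProposal) Commit() {
        p.alloc.mu.Lock()
        defer p.alloc.mu.Unlock()
        for _, pk := range p.toRelease {
            delete(p.alloc.inUse, pk)
        }
    }

    // Abort rolls back the eager claims. Because nothing was released before
    // Commit, no freed port can have been taken concurrently, so rollback
    // cannot fail.
    func (p *hybridProposal) Abort() {
        p.alloc.mu.Lock()
        defer p.alloc.mu.Unlock()
        for _, pk := range p.allocated {
            delete(p.alloc.inUse, pk)
        }
    }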

portObj := port{p.Protocol, DynamicPortStart}
// loop through the whole range of dynamic ports and select the first
// one available
for i := DynamicPortStart; i <= DynamicPortEnd; i++ {
Collaborator:

This seems like it would not scale well with many dynamically assigned ports in play. Maybe it's not a big concern right now, especially since there is an upper bound on how many ports can be assigned.

Collaborator Author:

it's worst case O(n) with the number of ports, because everything is done off map lookups. There are more efficient ways to dynamically assign ports (@stevvooe has mentioned things like bit-twiddling) but I don't think they'll be worth optimizing unless we see that this behavior is bottlenecking performance.

@Adirio (Contributor) commented May 2, 2018:

By not always starting at DynamicPortStart (keeping track of where the previous iteration of the ports loop stopped), you can speed this up even though the theoretical complexity remains O(n).

If you have 3 ports inside finalPorts that require a dynamically assigned port, and the first has already found that the first 50% of the dynamic port range is full, the second port should start iterating where the first stopped, as it knows that the range before the first port's dynamic assignment is full.
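A Go sketch of the cursor idea, simplified to a single protocol; the function and its shape are illustrative, not the PR's actual loop:

    package port

    const (
        // Assumed bounds; the PR defines the real DynamicPortStart/End.
        DynamicPortStart uint32 = 30000
        DynamicPortEnd   uint32 = 32767
    )

    // assignDynamic picks ports for `needed` dynamic assignments. The cursor
    // never moves backward: everything before it is known to be taken, so
    // each assignment resumes where the previous scan stopped.
    func assignDynamic(needed int, inUse map[uint32]bool) ([]uint32, bool) {
        assigned := make([]uint32, 0, needed)
        next := DynamicPortStart
        for i := 0; i < needed; i++ {
            found := false
            for ; next <= DynamicPortEnd; next++ {
                if !inUse[next] {
                    inUse[next] = true
                    assigned = append(assigned, next)
                    next++ // resume scanning after the port just taken
                    found = true
                    break
                }
            }
            if !found {
                return nil, false // dynamic range exhausted
            }
        }
        return assigned, true
    }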

Collaborator Author:

that would be a useful optimization.

Contributor:

Made a PR to your rewrite_allocator branch with this optimization and another one focused on the loops that re-allocate deallocated ports. It was quite straightforward.

@dperny (Collaborator Author) commented Apr 30, 2018

Ugh, I just realized that the allocator doesn't handle the case of node-local networks, because I've segregated allocating the driver into the driver package and allocating IP addresses into the ipam package, but you need the driver capabilities to allocate network attachments correctly because you take no action on node-local networks.

@dperny (Collaborator Author) commented May 3, 2018

When I plug the new allocator into the main code base and run the integration tests, they fail. I believe this is due to the new allocator being substantially too slow. Most of the optimization that needs to happen should happen in the top-level allocator.Allocator. Specifically, I believe it holds store locks open for entirely too long.

}
if spec.TargetPort > masterPortEnd {
return nil, errors.ErrInvalidSpec("target port %v isn't in the valid port range", spec.TargetPort)
}
Contributor:

This probably needs a check for masterPortStart too. For PublishedPort, it's not actually a problem (unless someone changes the constant) because the only value less than 1 means dynamic allocation. But I guess we shouldn't allow TargetPort to be 0?
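A sketch of what the combined check might look like, reusing the PR's ErrInvalidSpec helper; the bound values here are assumptions, not the PR's constants:

    package port

    import "github.com/docker/swarmkit/manager/allocator/network/errors"

    const (
        masterPortStart uint32 = 1     // assumed lower bound
        masterPortEnd   uint32 = 65535 // assumed upper bound
    )

    // validateTargetPort rejects target ports outside the valid range,
    // including 0, which the upper-bound check alone would accept.
    func validateTargetPort(target uint32) error {
        if target < masterPortStart || target > masterPortEnd {
            return errors.ErrInvalidSpec(
                "target port %v isn't in the valid port range", target)
        }
        return nil
    }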

Contributor:

doc.go says

The allocator does not keep track of host-mode ports in any capacity. It silently adds them to the returned Ports assignment, without checking their validity or availability.

However, we do seem to be doing some validity checks here, even for host-mode ports.

Collaborator Author:

this doesn't count. it's an undocumented freebie feature.

@talex5 (Contributor) left a comment:

I started making a TLA+ model of the port allocator, and realised I hadn't quite understood some things about how host-mode ports work. This probably just needs a bit more documentation...


// no PublishedPort. The allocator does _not_ keep track of host-mode ports in
// any capacity. It silently adds them to the returned Ports assignment,
// without checking their validity or availability. Port assignments for
// host-mode ports are handled on the worker.
Contributor:

Is the scheduler involved too, for host-mode ports? Otherwise, they might get scheduled on a host which is already using that port.

Contributor:

Also, what about this case:

docker service create --publish published=30000,target=80,mode=host nginx
docker service create --publish published=0,target=80               nginx

Doesn't the allocator need to know that a host is using port 30000?

(I just tried it with 18.03.1-ce and it let me use the same port for two services. Requests always seemed to go to the ingress service until I killed that service, then they went to the host service.)

Collaborator Author:

the scheduler is not involved in host mode ports. it was a very rushed feature, if i recall correctly, and it's sensitive to collisions.

the problem with handling host-mode ports at the cluster level is that it involves an interaction between the allocator and the scheduler. the scheduler won't know until the task passes through the allocator what ports it's using, but the allocator won't know until it passes through the scheduler what node it's on. plus, you can dynamically assign ports, which means you really actually can't know what port will be chosen on the target node.

so, the solution now is to pass host ports as-is, and let the worker fail the task if there is a collision with a local port (thus pushing the scheduler to reschedule the task, hopefully on a different node with the requested port available).

Contributor:

But if a host port isn't available on a node because it's already used as an ingress port, won't we have the same problem on all other nodes of the cluster too? Should a worker reject a request to forward an ingress port if it is already using it as a host-mode port? (I'm not too clear about how overlay networks work currently)

Contributor:

This interaction between the scheduler and the allocator has been discussed in other Issues and PRs. It has been suggested to extract the allocators from the pipeline:

Current:

                    MANAGER NODE (LEADER)                           WORKER NODE
_______________ ______________ ______________ ______________     _______________
\              \              \              \              \    \              \
 | Orchestrator |  Allocator   |  Scheduler   |  Dispatcher  |    |    Agent     |
/______________/______________/______________/______________/    /______________/

Proposed:

             MANAGER NODE (LEADER)                   WORKER NODE
_______________ ______________ ______________     _______________
\              \              \              \    \              \
 | Orchestrator |  Scheduler   |  Dispatcher  |    |    Agent     |
/______________/______________/______________/    /______________/
                       |
                 +-----+-----+
                 | Allocator |
                 +-----------+

This way the port allocator could be called before (or at the start of) the scheduler, but other allocators could be called later (resources could be considered allocatable items, for example).


// DynamicPortEnd is the end of the dynamic port range from which node
// ports will be allocated when the user did not specify a port.
DynamicPortEnd uint32 = 32767
Contributor:

Should this be configurable?

Collaborator Author:

it is configurable, just at compile time instead of run time.

Contributor:

Any specific use case for having these configurable at runtime? @talex5

Contributor:

No. I just thought a customer might hit this limit and it would be nice if we could tell them to change a setting rather than waiting for a new engine. But maybe they never get near to this limit anyway.

@dperny (Collaborator Author) commented May 8, 2018

However, we do seem to be doing some validity checks here, even for host-mode ports.

yes, we are. there is no good reason for this. i am a liar.

@dperny (Collaborator Author) commented May 8, 2018

i'm unsure how useful TLA+ is going to be for the allocator, which is not distributed and is not yet even concurrent.

@dperny (Collaborator Author) commented May 9, 2018

I've updated the PR to include some major differences:

  1. The new allocator can be used in swarmkit by setting the environment variable SWARMKIT_USE_NEW_ALLOCATOR="iknowtherisk". The new allocator is gated behind this setting. This is a temporary measure.
  2. Added a second integration test run to CI, in addition to a regular make target, which runs the integration tests with the new allocator as opposed to the old one.
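For reference, the gate boils down to an environment check along these lines (the helper name is made up; the variable and value are as described above):

    package manager

    import "os"

    // useNewAllocator reports whether the operator has explicitly opted in
    // to the new allocator via the temporary escape-hatch environment variable.
    func useNewAllocator() bool {
        return os.Getenv("SWARMKIT_USE_NEW_ALLOCATOR") == "iknowtherisk"
    }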

@talex5 (Contributor) commented May 10, 2018

i'm unsure how useful TLA+ is going to be for the allocator, which is not distributed and is not yet even concurrent.

Yeah, I wasn't sure whether to bother, but I thought I might as well record what I'd learnt from reading the code, and then it found the ingress/host collision problem. I've now finished the spec, and configured it to check these properties:

  1. All allocations are consistent:
    • All ports marked as used in the allocator are, in fact, used by some endpoint.
    • All ports used by an endpoint are marked as allocated.
    • No two services have conflicting allocations, except for host/host collisions (which the allocator can't control, since it depends on the scheduler).
  2. If the allocator rejects a spec, then the spec is invalid, requests ports in use elsewhere, or requires more free dynamic ports for some protocol than are available.
  3. Allocation is idempotent.

Here is the spec: PortAllocator.pdf -- the source is at https://github.com/talex5/swarmkit/compare/tla...talex5:tla-alloc?expand=1

I marked with XXX places where I had to deviate from the Go code to make the tests pass.
In summary, the model checker found the following corner cases:

Reusing a previous assignment when we need it elsewhere

Example:

  1. [name="foo", dynamic] (initial configuration)
  2. [name="foo", dynamic], [name="bar", published=30000] (updated)

Here we will reject the update because we try to reuse port 30000 for "foo", even though we now need it for "bar". Perhaps this is desired behaviour, but it does mean that you can't update to a state even though you could have specified the target state initially without a problem.

Duplicate similar ports

Example:

  1. [name="foo", dynamic]
  2. [name="foo", dynamic], [name="foo", dynamic]

We will try to use port number 30000 for both ports and reject the allocation.

Shadowing an existing host port

As noted previously, the allocator ignores host ports completely. It may allocate a dynamic ingress port using a network address that is already in use on some host. In this case, the original service becomes unreachable.

Accepting a host port that is hidden

The allocator may accept a host-port allocation that it knows will be unreachable on any host because that address is already in use as an ingress port.

I confirmed the first two by adding some tests for them to the Go tests:
https://github.com/dperny/swarmkit-1/compare/rewrite-allocator...talex5:allocator?expand=1

These issues might not be too important, but I think it's interesting to see what potential problems there could be, and what the model checker can find.

@Adirio (Contributor) commented May 10, 2018

@talex5

No two services have conflicting allocations, except for host/host collisions (which the allocator can't control, since it depends on the scheduler).

I believe it is the Agent and not the Scheduler that handles host ports. The scheduler basically ensures deployment constraints, orders possible target nodes according to deployment preferences and the number of running tasks, and assigns the task to the first target node.

Corner cases:

Reusing a previous assignment when we need it elsewhere

That is correct: if you manually assign a port that was already dynamically assigned, you will get a conflict. When you give a dynamic port range, you are kind of "reserving" these ports for dynamic use only.
There are two solutions:

  • Consider the dynamic port range a soft reservation (current behaviour): a manual assignment in this range may or may not succeed depending on the cluster state. Users should use ports in this range at their own risk. IMO, this should get documented if it is the desired behaviour.
  • Consider the dynamic port range a strict reservation: disallow manual assignments in the dynamic port range programmatically. This should probably be checked when issuing docker stack deploy or similar API/CLI calls and return an Error.

The advantages of the second solution are being self-documenting and probably simplifying the Allocate method. I am not sure that these small advantages are worth the change.

Duplicate similar ports

That will not happen in the allocator; the first would get 30000 and the second 30001. At least the last time I checked, the code verified that the pre-selected port was not assigned and discarded it if it was.

Shadowing an existing host port & Accepting a host port that is hidden

Either you or I got the host-port part wrong. In my understanding, host ports are from a different virtual network and thus a completely new set of ports is available, so there is no way to shadow/hide a port.

@talex5 (Contributor) commented May 10, 2018

That will not happen in the allocator; the first would get 30000 and the second 30001. At least the last time I checked, the code verified that the pre-selected port was not assigned and discarded it if it was.

Perhaps there is an error in my tests then. Do you see a problem with this code:

https://github.com/talex5/swarmkit/blob/allocator/manager/allocator/network/port/port_test.go#L867

For me, this test fails with:

        Expected error:
            <errors.errInvalidSpec>: {
                cause: "published port 30000/TCP is assigned to more than 1 port config",
            }
            spec is invalid: published port 30000/TCP is assigned to more than 1 port config
        not to have occurred

Either you or I got the host-port part wrong. In my understanding, host ports are from a different virtual network and thus a completely new set of ports is available, so there is no way to shadow/hide a port.

This is what I see with the current allocator, which seems to have the same behaviour:

docker service create --publish published=30001,target=80,mode=host nginx
docker service create --publish published=0,target=80 wordpress

Then accessing port 30001 gives me wordpress:

$ curl -v http://35.205.55.240:30001/
[...]
Location: http://35.205.55.240:30001/wp-admin/setup-config.php

After removing the wordpress service (docker service rm mszqf4ey1nem) I get:

$ curl -v http://35.205.55.240:30001/
[...]
<title>Welcome to nginx!</title>

If they are on different virtual networks, how would I specify which one I meant?

@Adirio (Contributor) commented May 10, 2018

Perhaps there is an error in my tests then. Do you see a problem with this code:

I am not familiar with Ginkgo, so I will be of little help here, but you should probably not have a value assigned in the spec and then change it to 30000 in the endpoint.Ports slice. Use 0 for the spec and 30000 for endpoint.Ports if you want to simulate a dynamically assigned port, or set both to 30000 if you want to simulate a manually assigned port in the dynamic range.

This is what I see with the current allocator, which seems to have the same behaviour:

docker service create --publish published=30001,target=80,mode=host nginx
docker service create --publish published=0,target=80 wordpress

Then accessing port 30001 gives me wordpress:

$ curl -v http://35.205.55.240:30001/
[...]
Location: http://35.205.55.240:30001/wp-admin/setup-config.php

After removing the wordpress service (docker service rm mszqf4ey1nem) I get:

$ curl -v http://35.205.55.240:30001/
[...]
<title>Welcome to nginx!</title>

If they are on different virtual networks, how would I specify which one I meant?

That behaviour seems to be in line with your interpretation of host ports. Let's see what a member/collaborator says about it.

@dperny (Collaborator Author) commented May 10, 2018

The problem with two identical ports is that we can't know which ports on the spec map to ports on the object, which we need to know in order to correctly re-use dynamic port allocations across service updates. So, in your failing test, the port allocator is looking for a spec on the endpoint which matches the config in the new spec, seeing that one is found and has no port allocation, and then looking in the object for the port allocation matching that spec, which it finds because the old object has it. That corner case, of identical ports, needs to be blocked from even occurring, in order to avoid "undefined" behavior.

@Adirio (Contributor) commented May 11, 2018

@dperny Could that situation happen in real life? I mean, can a spec have two identical ports, including the TargetPort-Protocol pair? You are always using 80/TCP as the target because you are not doing any check against them at this point, but where would a query to port 80/TCP be redirected if such a spec were found?

@talex5 using TargetPort=81 for the new spec port should fix the test

@talex5 (Contributor) left a comment:

I had a go at reviewing the other subcomponents (ipam and driver), but I don't know much about libnetwork. Some overview documentation would be useful.

// that i'm willing to just check it and panic if it occurs.
reg, err := drvregistry.New(nil, nil, nil, nil, pg)
if err != nil {
panic("drvregistry.New returned an error... it's not supposed to do that")
Contributor:

Would be nice to include err in the panic message.

Collaborator Author:

yeah, good catch.
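Something along these lines would do it; a sketch, with import paths assumed from the libnetwork vendoring of that era:

    package driver

    import (
        "fmt"

        "github.com/docker/docker/pkg/plugingetter"
        "github.com/docker/libnetwork/drvregistry"
    )

    // newRegistry includes the underlying error in the panic message, so the
    // "impossible" failure is at least diagnosable if it ever happens.
    func newRegistry(pg plugingetter.PluginGetter) *drvregistry.DrvRegistry {
        reg, err := drvregistry.New(nil, nil, nil, nil, pg)
        if err != nil {
            panic(fmt.Sprintf("drvregistry.New returned an error... it's not supposed to do that: %v", err))
        }
        return reg
    }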


// Restore takes slices of the object types managed by the network allocator
// and syncs the local state of the Allocator to match the state of the objects
// provided. It also initializes the default drivers to the reg.
Contributor:

It also initializes the default drivers to the reg.

Doesn't look like it does anything with the reg.

Collaborator Author:

This is a holdover comment from an earlier version of the code, sorry about that.

Collaborator Author:

Removed

@@ -0,0 +1,11 @@
package driver

// TODO(dperny) write
Contributor:

It's difficult to review this module without this information.

Collaborator Author:

Yeah, whoops, sorry. Will turn this into a TODONE.

_, ok := nw.Spec.Annotations.Labels["com.docker.swarm.internal"]
if ok && nw.Spec.Annotations.Name == "ingress" {
a.ingressID = nw.ID
}
Contributor:

This ingress-detection logic is duplicated in AllocateNetwork. Maybe make a helper function for that?
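Something like this hypothetical helper would centralize the check (the label and name tests are exactly the ones quoted above):

    package network

    import "github.com/docker/swarmkit/api"

    // isIngress reports whether nw is the cluster's ingress network, using
    // the internal-label-plus-name test currently duplicated in two places.
    func isIngress(nw *api.Network) bool {
        _, internal := nw.Spec.Annotations.Labels["com.docker.swarm.internal"]
        return internal && nw.Spec.Annotations.Name == "ingress"
    }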

if a.ingressID != "" {
// NOTE(dperny): if i recall correctly, append should not modify the
// original slice here, which means this is safe to do without
// accidentally modifying the spec.
Contributor:

I don't know much Go, but I think it depends on how the slice was constructed, e.g.

        foo := make([]int, 0, 5)
        bar := append(foo, 1)
        baz := append(foo, 2)
        fmt.Println(foo)
        fmt.Println(bar)
        fmt.Println(baz)

prints

[]
[2]
[2]

So the second append changed the bar slice. I think this is safe only if the underlying array is already at capacity or no other user will append to it.

@Adirio (Contributor) commented May 16, 2018:

Append modifies the underlying array, but doesn't modify the original slice:

When you do foo := make([]int, 0, 5), you are creating an array at memory address 0xXXXXXXXX and a slice at memory address 0xAAAAAAAA:

0xXXXXXXXX: 0 0 0 0 0       <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5  <- foo

bar := append(foo, 1) creates a new slice bar at 0xBBBBBBBB using the same underlying array as foo and inserts a 1. As foo has no elements, the insert is done at position 0.

0xXXXXXXXX: 1 0 0 0 0       <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5  <- foo
0xBBBBBBBB: 0xXXXXXXXX 1 5  <- bar

baz := append(foo, 2) creates a new slice baz at 0xCCCCCCCC using the same underlying array as foo and inserts a 2. As foo has no elements, the insert is done at position 0.

0xXXXXXXXX: 2 0 0 0 0       <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5  <- foo
0xBBBBBBBB: 0xXXXXXXXX 1 5  <- bar
0xCCCCCCCC: 0xXXXXXXXX 1 5  <- baz

So printing foo gives 0 elements from the underlying array, while printing bar and baz each gives 1 element of the underlying array; as the last operation set that element to 2, that is what you get for both.

Collaborator Author:

addressed this by using a different slice.
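A sketch of that fix: copy into a fresh backing array before appending, so the caller's slice can never be mutated through a shared array (the helper itself is hypothetical):

    package network

    import "github.com/docker/swarmkit/api"

    // withIngress returns a new attachment list that includes the ingress
    // network, without ever appending to the caller's backing array.
    func withIngress(attachments []*api.NetworkAttachmentConfig, ingressID string) []*api.NetworkAttachmentConfig {
        out := make([]*api.NetworkAttachmentConfig, len(attachments), len(attachments)+1)
        copy(out, attachments)
        return append(out, &api.NetworkAttachmentConfig{Target: ingressID})
    }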

finalErr = err
}
// remove the addresses after we've deallocated every attachment
// attachment.Addresses = nil
Contributor:

Why is this commented out?

Collaborator Author:

triggers the race detector. i will go back and change it, but, basically, the object you get back from an event stream is shared, and mutating it in one place mutates it in all places.

"no ip addresses remain in any pool for network %v",
local.nw.ID,
)
}
Contributor:

What if address != ""? Isn't that also an error (the requested address wasn't available)?

Collaborator Author:

hm... i think this might be broken, actually. i'm going to write a test for it real quick.


// The allocator/network package is the subcomponent in charge of allocating
// network resources for swarmkit resources. Give it objects, and it gives you
// resources.
Contributor:

Would be good to say something about thread-safety here. The port allocator says the PortAllocator is *not* concurrency-safe, and this module uses that one without locks, so presumably this module isn't concurrency-safe either. It should probably mention that.

Contributor:

Seconded. Maybe it makes sense to point this out on a per-API level.

Collaborator Author:

Addressed

// ipam is a package for interfacing with the libnetwork IPAM drivers.
// It performs the work of initializing and bootstrapping the IPAM state for
// networks, keeping track of IPAM pools, and allocating new addresses for VIPs
// and network attachments.
Contributor:

Would be good to provide a brief overview of what IPAM is and does here.

@talex5 (Contributor) left a comment:

Some comments on allocator_new.go.

// has been allocated. If it is true, then regardless of which nodes are
// currently pending, all nodes will need to be reallocated with the new
// network. we will do this after network allocation, but before task
// allocation, to avoid a case where
Contributor:

Incomplete comment.

var err error
nodes, err = store.FindNodes(tx, store.All)
if err != nil {
// i don't know what to do here
Contributor:

panic?

t := store.GetTask(tx, taskid)
// if the task is not nil, is in state NEW, and is desired to
// be running, then it still needs allocation
if t != nil && t.Status.State == api.TaskStateNew && t.DesiredState == api.TaskStateRunning {
Contributor:

What about tasks with a desired state of READY?

Collaborator Author:

good catch.

// to the pending map. they will never succeed. additionally,
// remove these tasks from the map
log.G(ctx).WithError(err).Error("service cannot be allocated")
delete(tasks, service.ID)
Contributor:

Should we attach the error to the task objects somehow, e.g. in task.Status.Err? Otherwise, I assume the user won't be able to see why it failed. Same applies in other places.

Collaborator Author:

perhaps, yes.

Collaborator Author:

calling this out of scope for this initial rewrite, but dropped a TODO comment.

// free some state when we deallocate a second
// time! a classic double-free case!
log.G(ctx).WithField("node.id", node.ID).Debug("node was deleted after allocation but before committing")
a.network.DeallocateNode(node)
Contributor:

Why do we need to deallocate here? Won't the event monitor handle this case too?

Collaborator Author:

the event stream will deallocate the node in its state before we performed re-allocation. however, we may have locally allocated some new resources for the node, which have not been committed, but which must be released in our local state.

i'm unsure what the correct approach to working around this problem is. probably making Deallocate idempotent. but i would really appreciate suggestions.

Contributor:

Ah, so the node object in the event isn't the same as the node object we change during allocation? If I understand correctly then, the problem case is:

  1. The batch thread starts allocating things for an updated node.
  2. The node is deleted. The event handler gets an event with the old state of the node. It waits for the lock.
  3. The batch thread notices that the node is gone and deallocates it, from the correct (newly-allocated) state. It releases the lock.
  4. The event handler gets the lock and deallocates again, from the old state.

If we get here then:

  1. The node hadn't been deleted when we looked it up earlier in processPendingAllocations, which must have happened after we took pendingMu.
  2. It is deleted now. So, a delete event must be on its way.
  3. The event handler can't have deallocated yet, because it needs pendingMu.

So I guess we could keep a set of "already-deallocated" objects. We add the object to the set here to indicate that we already freed it. When the event thread later gets the lock, it will see that the deallocation was already done and not do it again. It will remove it from the "already-deallocated" set. A bit messy, but it should work.

Other solutions:

  1. Ensure that we only have one object in memory at once representing the state of the node. Have the event thread deallocate in all cases. It will always have up-to-date information.
  2. Use a single database transaction for processPendingAllocations. Then the user can't delete the node until we're done. Would this be a performance problem?
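A sketch of the "already-deallocated" set described above, with hypothetical names; the real fix in the PR may differ:

    package allocator

    import "sync"

    // deallocTracker remembers object IDs the batch thread has already freed,
    // so the event thread can skip (and then forget) them.
    type deallocTracker struct {
        mu   sync.Mutex
        done map[string]struct{}
    }

    func newDeallocTracker() *deallocTracker {
        return &deallocTracker{done: make(map[string]struct{})}
    }

    // markDone is called by the batch thread after deallocating an object.
    func (t *deallocTracker) markDone(id string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.done[id] = struct{}{}
    }

    // shouldDealloc is called by the event thread; it returns false exactly
    // once per markDone call, consuming the record so the set stays small.
    func (t *deallocTracker) shouldDealloc(id string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        if _, ok := t.done[id]; ok {
            delete(t.done, id)
            return false
        }
        return true
    }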

}
}

func (a *NewAllocator) allocateTask(tx store.Tx, taskid string) error {
Contributor:

What is this for?

Collaborator Author:

dead code that i forgot to delete. i'm sorry about the dead code and half-done comments, i lose track of things frequently, especially in this huge project.

@Adirio (Contributor) commented May 17, 2018

CircleCI is detecting a change in a proto file without a make generate. The change is in the comment, but the test is unable to know that. I'm not sure how hard it would be to make the checkproto test aware of comments, or if it would really be that useful.

@dperny (Collaborator Author) commented May 17, 2018 via email

@Adirio (Contributor) commented May 17, 2018

It's fine, I was just checking CircleCI output because I'm getting a fail in another PR (different error) and I figured I would just comment in case you did not notice.

Can I ask you if you have any feedback on the two loop optimizations I proposed as a PR in your fork? I've noticed that you didn't implement them and I'm not sure if you do not like any of the changes or if the PR in your fork wasn't the way to go.

networks, _ = store.FindNetworks(tx, store.All)
services, _ = store.FindServices(tx, store.All)
tasks, _ = store.FindTasks(tx, store.All)
nodes, _ = store.FindNodes(tx, store.All)
Contributor:

If that's true, perhaps we should change the return type of Find* so they can't return errors, and make them panic instead. It doesn't seem like a good idea to allow them to return errors that we just ignore. Even if my swarm is hosed, I'd still like to know about it.

Collaborator Author:

find can return errors if you call it with a store.By* filter that is invalid. Though we could change the method signature to check that at compile time, I suppose.


Seconded that I don't think we should plainly ignore these errors. If the underlying code changes to sometimes return an actual error and an empty list, we surely want to know. If the error is "not possible", I'd prefer the allocator not to start - better safe than sorry.
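One way to surface these "impossible" errors, as a sketch; the helper is hypothetical, while store.FindNetworks and store.All are the existing swarmkit APIs:

    package allocator

    import (
        "github.com/docker/swarmkit/api"
        "github.com/docker/swarmkit/manager/state/store"
    )

    // mustFindNetworks panics instead of ignoring the error: with the All
    // filter, a failure indicates a programming error or a corrupted store,
    // and the allocator should not limp along silently.
    func mustFindNetworks(tx store.ReadTx) []*api.Network {
        networks, err := store.FindNetworks(tx, store.All)
        if err != nil {
            panic("store.FindNetworks failed with the All filter: " + err.Error())
        }
        return networks
    }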


@cyli (Contributor) left a comment:

Apologies, reviewing in chunks.

@@ -83,6 +84,8 @@ func New(store *store.MemoryStore, pg plugingetter.PluginGetter) (*Allocator, er
func (a *Allocator) Run(ctx context.Context) error {
// Setup cancel context for all goroutines to use.
ctx, cancel := context.WithCancel(ctx)
log.WithModule(ctx, "allocator")
Contributor:

I think log.WithModule(...) returns a context that has the logging context updated. So you might have to assign ctx = log.WithModule(ctx, "allocator")

// possibility for errors lies deep within memdb, and seems to be a
// consequence of some kind of misconfiguration of some sort of the
// indexes or tables. Those errors should not be possible, so we
// ignore them to simplify the code. If the do occur, then your
Contributor:

spelling nitpick: "If the do occur" -> "If they do occur"

api.EventDeleteNode{},
)
if err != nil {
// TODO(dperny): error handling
Contributor:

Could you clarify this TODO? Is returning the error the incorrect behavior?

// now restore the local state
log.G(ctx).Info("restoring local network allocator state")
if err := a.network.Restore(networks, services, tasks, nodes); err != nil {
// TODO(dperny): handle errors
Contributor:

Could you clarify this TODO? Is it just that you don't know if the entire run sequence should error?

api.EventCreateNetwork{},
api.EventUpdateNetwork{},
api.EventDeleteNetwork{},
api.EventCreateService{},
Contributor:

The comment below (https://github.com/docker/swarmkit/pull/2615/files#diff-b775c4fdb79a3a4b943132a95b2f6ef3R337) says that we don't actually need to do anything for the service create and service update events- should we just not subscribe to these two events?

api.EventUpdateTask{},
api.EventDeleteTask{},
api.EventCreateNode{},
api.EventUpdateNode{},
Contributor:

Similarly should we unsubscribe from EventUpdateNode since we don't handle that case?

// has been allocated. If it is true, then regardless of which nodes are
// currently pending, all nodes will need to be reallocated with the new
// network. we will do this after network allocation, but before task
// allocation, to avoid a case where nodes use up all of the available
Contributor:

nit: isn't this the case we want to favor, as opposed to avoid? We want the nodes to use up all the available network addresses as opposed to tasks using up all of them.

log.G(ctx).WithField("network.id", network.ID).Debugf("network was deleted after allocation but before commit")
// if the current network is nil, then the network was
// deleted and we should deallocate it.
a.network.DeallocateNetwork(network)
Contributor:

I guess DeallocateNetwork is idempotent and hence would not experience a similar race to DeallocateNode? (looks like we check for the network ID and don't do anything if it doesn't exist). It probably doesn't make sense to keep track of node IDs in the network allocator, since the node ID is not needed, but would that be one way of making the node deallocations idempotent?
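
For illustration, the idempotency pattern that observation describes, reduced to a standalone sketch (the real allocator's fields differ):

    package main

    import "fmt"

    type allocator struct {
        networks map[string]struct{} // IDs that currently hold allocations
    }

    func (a *allocator) DeallocateNetwork(id string) {
        if _, ok := a.networks[id]; !ok {
            return // never allocated, or already freed: repeat calls are no-ops
        }
        // ... release pools and addresses for id ...
        delete(a.networks, id)
    }

    func main() {
        a := &allocator{networks: map[string]struct{}{"nw1": {}}}
        a.DeallocateNetwork("nw1")
        a.DeallocateNetwork("nw1") // safe second call
        fmt.Println(len(a.networks)) // 0
    }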

Collaborator Author:

this is addressed, i think? i think this review comment happens to refer to an unchanged section of code.

Contributor:

Yes indeed :) Thanks!

@dperny force-pushed the rewrite-allocator branch 2 times, most recently from c9425aa to 79f469d (May 23, 2018 20:30)
@dperny (Collaborator Author) commented May 31, 2018:

Added basic integration test with ginkgo for the full allocator.

@cyli (Contributor) left a comment:

Still working on IPAM

// bootstrapped before IP address allocation can proceed correctly.
// Bootstrapping, in this case, means that when a fresh allocator comes up, we
// must first populate it with all of the addresses we know to be in use before
// allocating new ones.
Contributor:

Could you elaborate on the lifecycle of an IPAM driver? For instance, is the IPAM local state ever cleared? What happens if a manager is a leader, so the IPAM driver has some local state, then it loses leadership and re-acquires?

Is manager/allocator/network.NewAllocator the one that clears the state by initializing the IPAM drivers from scratch?

Contributor:

Thank you! This comment has been addressed.
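
For later readers, a rough sketch of the bootstrapping that doc comment describes: on restore, replay each network's recorded subnets and gateways into the IPAM driver before serving any new request. The allocator shape, the empty address space, and the nil options are assumptions; only the ipamapi signatures come from libnetwork:

    package allocator

    import (
        "net"

        "github.com/docker/libnetwork/ipamapi"
        "github.com/docker/swarmkit/api"
    )

    type allocator struct {
        ipam ipamapi.Ipam
    }

    func (a *allocator) restoreNetwork(nw *api.Network) error {
        for _, cfg := range nw.IPAM.Configs {
            // re-acquire the pool for each known subnet
            poolID, _, _, err := a.ipam.RequestPool("", cfg.Subnet, cfg.Range, nil, false)
            if err != nil {
                return err
            }
            // re-request the gateway so the driver marks it as in use
            if _, _, err := a.ipam.RequestAddress(poolID, net.ParseIP(cfg.Gateway), nil); err != nil {
                return err
            }
        }
        return nil
    }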

// The drivers package contains the Allocator in charge of setting a Network's
// Driver state. Before a network can be used, if it depends on a global-scoped
// driver, it must be allocated on that driver. This package provides the code
// that handles that state allocation.
Contributor:

Also, could you please elaborate on the lifecycle of the network drivers - what do you mean global-scoped? Does the state on the drivers ever need to be reset? (e.g. when leadership is lost?)

// driver that's the kind of address we're requesting
ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway
// and now, if we have a Gateway address for this network, we need
// to allocate it. This condition should never happen; a network
Contributor:

Should this say "this condition should always be true", the condition being "we have a gateway address for this network"?


// most addresses need to be added to the map of
// address -> poolid, but we don't need to do this with the
// gateway address because it's store in the ipam config with
Contributor:

nitpick: "it's store in" -> "it's stored in"

// most addresses need to be added to the map of
// address -> poolid, but we don't need to do this with the
// gateway address because it's store in the ipam config with
// the subnet, which is the key for the pools map.
Contributor:

So subnet is the key for local.pools? So that's what poolIP is? Would it make sense to include that as a comment up above?

Collaborator Author:

I updated the comment for the pools map in the network struct to elaborate on this, as well as the endpoints map.

// the subnet, which is the key for the pools map.
}
// delete the Gateway option from the IPAM options when we're done
delete(ipamOpts, ipamapi.RequestAddressType)
Contributor:

We are mutating the IPAM options map here - is it guaranteed that there was no previous value for ipamapi.RequestAddressType in the map? Otherwise editing and deleting it would be mutating the map, and I wasn't sure if it ever gets written back to the store.

Collaborator Author:

I'm 80% sure that this map entry should never be present usually... but 80% isn't really good enough, especially when the API here is so ill-defined, so I've changed the code to save the previous entry if one exists, restore it when we're done, and delete the key if there was no previous entry.
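
A sketch of what that save-and-restore could look like as a helper; withGatewayOpt is a hypothetical name, and only the two constants come from libnetwork:

    package allocator

    import (
        "github.com/docker/libnetwork/ipamapi"
        "github.com/docker/libnetwork/netlabel"
    )

    // withGatewayOpt sets the request-address-type option for the duration of
    // fn, then restores any pre-existing value instead of blindly deleting it.
    func withGatewayOpt(ipamOpts map[string]string, fn func()) {
        prev, hadPrev := ipamOpts[ipamapi.RequestAddressType]
        ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway
        defer func() {
            if hadPrev {
                ipamOpts[ipamapi.RequestAddressType] = prev
            } else {
                delete(ipamOpts, ipamapi.RequestAddressType)
            }
        }()
        fn()
    }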

@dperny force-pushed the rewrite-allocator branch 2 times, most recently from 8a818d1 to 6a749ac (June 1, 2018 20:15)
@dperny (Collaborator Author) commented Jun 1, 2018:

I have removed the long-deprecated ServiceSpec.Networks field.

@cyli (Contributor) left a comment:

IPAM is the last one I think. I've reviewed the port allocator in a previous PR (#2579)

I haven't looked too hard at the tests except for port-allocator, but I think one or two of my review comments had to do with rollback of allocations, so just wanted to call it out in case you already knew if/where there should be more coverage of those.

Update: Nm, ignore my stupid comment below about deallocation.

if n.Spec.IPAM != nil && n.Spec.IPAM.Driver != nil && n.Spec.IPAM.Driver.Name != "" {
    ipamName = n.Spec.IPAM.Driver.Name
}
ipam, _ := a.drvRegistry.IPAM(ipamName)
Contributor:

Should we be checking the capabilities here to ensure that it's not a node-local network?

Collaborator Author:

i think not actually. we should check in the caller, and not pass node-local networks into the ipam allocator (because they don't need IPAM state allocated), which we already do.

Collaborator Author:

this is specified in the doc comment already, as well.

Contributor:

Got it - just wanted to make sure we didn't need to enforce it in this function.

// add the option indicating that we're gonna request a gateway, and
// remove it before we exit this function
ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway
defer delete(ipamOpts, ipamapi.RequestAddressType)
Contributor:

I know ipamOpts is a local variable (copied from the actual map above), but it seems to be then assigned back into the network object in the allocator network map. Does this get written back to the store? The comment for the function says that the provided network is mutated.

If it does get updated in the store, do we need to check that we aren't overwriting an existing value in the map?

Collaborator Author:

i fixed this in restore, actually, but not here. i'll do so now.

ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway
defer delete(ipamOpts, ipamapi.RequestAddressType)
if config.Gateway != "" || gwIP == nil {
    gwIP, _, err = ipam.RequestAddress(poolID, net.ParseIP(config.Gateway), ipamOpts)
Contributor:

Out of curiosity does ipam.RequestAddress return a gateway address even if config.Gateway is empty? (which could happen if gwIP is nil)

Collaborator Author:

Yes, because the result of net.ParseIP will be nil, which will cause RequestAddress to choose an IP for us.
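
A tiny runnable check of the net.ParseIP half of that claim (the RequestAddress behavior is paraphrased in the comments):

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        fmt.Println(net.ParseIP(""))         // <nil>: RequestAddress then picks any free IP
        fmt.Println(net.ParseIP("10.0.0.1")) // 10.0.0.1: RequestAddress tries that exact IP
    }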

// now figure out which new network IDs we've added, which is the same loop
// above but swapped around to check network IDs against VIPs
newvips:
for nwid := range networkIDs {
Contributor:

Non-blocking: Since we're already looping through the VIPs up above, can we just create a map of vip network IDs while looping there? That way we don't have to do this double-loop here with the continue?

Collaborator Author:

No, because this vip loop is part of a quadratic check for which networks have vips already allocated and which do not. We can probably optimize this by combining this logic with the allocation logic, but I think that optimization would be out of scope for this revision.

@cyli (Contributor), Jun 5, 2018:

Hm... I'm not entirely sure I understand, apologies. What I mean is that this inner loop is going through every VIP to find a network ID that matches the network ID in the outer loop.

However, the loop above is also looping through every VIP, to see which ones to keep and deallocate. In the upper loop, if we build up a mapping of network IDs to struct{}{}, we'll have a set of which network IDs are specified by the VIPS that we can use to look up here, instead of having an inner loop, if that makes sense.

e.g.:

    nwidsInVIPS := make(map[string]struct{})
    for _, vip := range endpoint.VirtualIPs {
        if _, ok := networkIDs[vip.NetworkID]; ok {
            keep = append(keep, vip)
        } else {
            deallocate = append(deallocate, vip)
        }
        nwidsInVIPS[vip.NetworkID] = struct{}{}
    }

    for nwid := range networkIDs {
        if _, ok := nwidsInVIPS[nwid]; !ok {
            allocate = append(allocate, nwid)
        }
    }

I think this is equivalent to the code that is there.

Collaborator Author:

ahhhh... i see. let me put this in and run it through the wash.

Collaborator Author:

Done this.

// network, and each network in turn has non-overlapping subnets, there is
// no chance of IPs we're releasing to be reused in the allocation of new
// VIPs. So, instead, we deallocate last, so that if any allocation fails,
// we only have to roll back incomplete allocation, not re-allocation a
Contributor:

nitpick: "re-allocation a release" -> "re-allocate a release"

defer func() {
    if rerr != nil {
        for _, addr := range attach.Addresses {
            poolID := local.endpoints[addr]
Contributor:

Do we need to delete it from local.endpoints[addr] as well?

Collaborator Author:

good catch.

Contributor:

This comment has been addressed.
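
With that fix applied, the rollback fragment quoted above would read roughly as follows (the enclosing function and field names are assumed):

    defer func() {
        if rerr != nil {
            for _, addr := range attach.Addresses {
                poolID := local.endpoints[addr]
                ipam.ReleaseAddress(poolID, net.ParseIP(addr))
                delete(local.endpoints, addr) // drop the stale mapping too
            }
        }
    }()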

if err != nil {
    requestIP = net.ParseIP(address)
    if requestIP == nil {
        return nil, errors.ErrInvalidSpec("address %v is not valid", address)
@cyli (Contributor), Jun 5, 2018:

Will this correctly trigger the defer function that deallocates/rolls back? This does not set rerr and I'm not sure the deferred function will pick up on that. https://play.golang.org/p/rcNPmYSE4yr

Similar with the other error returns below.

Update: Nm, I'm dumb - passing the function a value is different than referring to that value in the function, which should get it when it executes.
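
For anyone tripped up by the same thing, a runnable distillation of the playground link's point: a deferred closure that refers to a named return value sees its final value, while an argument evaluated at defer time captures the value immediately:

    package main

    import "fmt"

    func f() (rerr error) {
        defer func() { fmt.Println("closure sees:", rerr) }()     // final value: boom
        defer func(e error) { fmt.Println("arg sees:", e) }(rerr) // captured now: <nil>
        return fmt.Errorf("boom")
    }

    func main() { f() }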

// Allocator is the interface for the port allocator, which chooses and keeps
// track of assigned and available port resources.
type Allocator interface {
    Restore([]*api.Endpoint)
Contributor:

I would think the descriptions should be on the interface, because that's what the caller writes their code to. But it's not a big deal.
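
Concretely, that suggestion amounts to something like this; the wording is illustrative, not the PR's actual doc comment:

    // Allocator is the interface for the port allocator, which chooses and
    // keeps track of assigned and available port resources.
    type Allocator interface {
        // Restore seeds the allocator with endpoints whose ports are already
        // assigned, so that previously allocated ports are never reissued.
        Restore([]*api.Endpoint)
    }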

// user's spec wants a new (dynamically allocated) port, just like if we
// had dynamically assigned the port.
//
// Luckily for use, we keep a copy of the spec on the Endpoint object,
Contributor:

use => us


// So, basically, here's what we're going to do: we're going to create a
// new list of PortConfigs. This will be the final list that gets put into
// the object endpoint object
Contributor:

not sure if "object endpoint object" is what you were trying to say

@dperny (Collaborator Author) commented Jun 7, 2018:

Updated some comments; noticed a bug where we weren't tracking node-local networks properly, and fixed it.

@dperny (Collaborator Author) commented Jun 7, 2018:

Addressed comment errors.

@dperny force-pushed the rewrite-allocator branch 2 times, most recently from 0da78f0 to 6b5fbc3 (June 11, 2018 18:12)
@cyli (Contributor) left a comment:

I think I just had a couple of minor nits about comments, but other than that I think this PR LGTM. I'm sure I've missed some stuff, but I spot nothing else in the logic on another re-review.

@dperny changed the title from "[WIP] Rewrite allocator" to "Rewrite allocator" (Jun 18, 2018)
@dperny force-pushed the rewrite-allocator branch 2 times, most recently from 3ab272c to f0ed299 (June 25, 2018 22:27)
Adds the ginkgo BDD testing framework.

Signed-off-by: Drew Erny <drew.erny@docker.com>
Rewrites the manager/allocator component to be more readable,
documented, tested, and correct. The new allocator is tested with the
Ginkgo BDD testing framework.

Allocator shares one set of errors across the whole implementation,
defined in the manager/allocator/network/errors package

Adds a new, more robust implementation of assigning ports. Tested with
the ginkgo testing framework.

Adds a subcomponent to the network allocator with the purpose of
interfacing with the IPAM drivers and allocating IP addresses for
swarmkit objects.

Adds a subcomponent for the network allocator that has the
responsibility of allocating network resources and driver state from
network drivers.

Adds the highest-level component charged with allocating network
resources from the underlying libraries. The network allocator
provides an interface between the higher-level swarmkit objects
(services, tasks, and nodes) and the lower-level swarmkit objects
(endpoints, attachments) that the lower-level allocator components
operate on.

Adds mocks, and the makefile rules to generate them. Re-adds a vendoring
of the mocking library.

Adds the top-level allocator component, (called NewAllocator while this
PR is still WIP), which is the component that interacts with the store
and event stream. Copies the tests from the old allocator, and starts
adapting them to the new one. Changes some tests from the old allocator
that fail because of assumptions that no longer hold.

Adds prometheus metrics collection to the new allocator.

Signed-off-by: Drew Erny <drew.erny@docker.com>
This field has, in various forms, been deprecated since at least 2016,
and we should be well clear to remove it.

Signed-off-by: Drew Erny <drew.erny@docker.com>
Continues improving the new allocator, after having removed the Networks
field from the Service object

Signed-off-by: Drew Erny <drew.erny@docker.com>
@mefyl left a comment:

This looks solid, probably a good improvement over the old allocator. A solid base to build upon from now on.

  • I did not review *_test.go. LMK if you want me to look at it in a separate review.
  • As I am no expert in networking, this is mostly focused on architecture and the code itself.

Overall LGTM!

// have resources allocated for the object that are not committed to the
// store, and which we cannot commit to the store because the object is
// gone. Therefore, we deallocate the object when we see that it has
// failed.
Reviewer:

I was a bit confused by the phrasing there in the first few reads, as the meaning of "fail" is not clear. I feel like something along the lines of "Therefore, we deallocate the resources we created directly when we see an object was deleted in the meantime" would be better.

Collaborator Author:

Good call, altered.

// allocator's restore function will add all allocated objects to the local
// state so we can proceed with new allocations.
//
// It thens adds all objects in the store to the working set, so that any
Reviewer:

Nit: s/thens/then

networks, _ = store.FindNetworks(tx, store.All)
services, _ = store.FindServices(tx, store.All)
tasks, _ = store.FindTasks(tx, store.All)
nodes, _ = store.FindNodes(tx, store.All)
Reviewer:

Seconded that I don't think we should plainly ignore these errors. If the underlying code changes to someday return an actual error and an empty list, we surely want to know. If the error is "not possible", I'd prefer the allocator not to start - better safe than sorry.
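
One way to act on that, sketched against the quoted snippet: MemoryStore.View takes a callback with no return value, so the error has to be captured in the enclosing scope (the error wrapping is an assumption):

    var err error
    a.store.View(func(tx store.ReadTx) {
        if networks, err = store.FindNetworks(tx, store.All); err != nil {
            return
        }
        if services, err = store.FindServices(tx, store.All); err != nil {
            return
        }
        if tasks, err = store.FindTasks(tx, store.All); err != nil {
            return
        }
        nodes, err = store.FindNodes(tx, store.All)
    })
    if err != nil {
        return errors.Wrap(err, "error listing store objects during restore")
    }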

case <-batchTimer.C:
a.pendingMu.Lock()
total := len(a.pendingNetworks) + len(a.pendingNodes) + len(a.pendingTasks)
a.pendingMu.Unlock()
Reviewer:

Here we lock pendingMu to check if we need to enter processPendingAllocations, but as you state just below, processPendingAllocations will jump straight through anyway if it's zero. Couldn't we move this check and simply exit from processPendingAllocations just after locking, if the total is zero? It would be simpler, and we are reducing concurrency by locking twice (if someone acquires the mutex in this small time gap, we will be put to sleep).

Collaborator Author:

so, basically, as a future improvement, we will want to make processPendingAllocations more intelligent. for example, right now, if allocation fails because resources are exhausted, we will start spinning and trying to allocate. ideally we would in that case want to check if something has been deallocated before launching into another attempt to allocate. (this was actually the case in the old allocator, which I chose to ignore in the interest of time and simplicity for this PR). the check for this would happen here in this function, and NOT in processPendingAllocations, which should have the sole responsibility of doing the actual work, not deciding if it should do the work.

does that make sense?

Reviewer:

If we want to check whether something has been deallocated, we might have to lock here too, indeed; however, I'm not sure it's worth locking twice for now, since we don't know what we'll actually do in the future, or when. Also, that check in processPendingAllocations can't hurt, I think, and it would get rid of this lock for now. All in all I think the PR should focus on making it sensible as is, not leave a placeholder for later - but I'll let you judge.
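
For concreteness, the shape being suggested; the field names mirror the quoted snippet, while the signature and body are assumptions:

    func (a *Allocator) processPendingAllocations(ctx context.Context) {
        a.pendingMu.Lock()
        defer a.pendingMu.Unlock()
        if len(a.pendingNetworks)+len(a.pendingNodes)+len(a.pendingTasks) == 0 {
            return // nothing pending; the caller can drop its own locked check
        }
        // ... do the actual allocation work ...
    }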

select {
case <-ctx.Done():
    log.G(ctx).Debug("context done, canceling the event stream")
    cancel()
Reviewer:

I could have missed something, but can't we just cancel() directly from this function a bit later, after wg.Wait()? Maybe we can even just defer cancel() right here? (or a lambda if you want to keep the log)

Collaborator Author:

i can't remember why i did this. let me stew on it for a bit and get back to you.

// because only allocated networks have the DriverState
// allocated, we will only attach allocated networks. if for
// some reason an unallocated network slips through, we'll get
// ErrDepedencyNotAllocated when we try to allocate the nodes.
Reviewer:

nit: s/ErrDepedencyNotAllocated/ErrDependencyNotAllocated

// remove these tasks from the map
log.G(ctx).WithError(err).Error("task cannot be allocated")
}
}
Reviewer:

It looks like the task allocation code is the exact same for services tasks and service-less tasks, a local lambda factoring this would probably be better.
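
Roughly, the factoring being suggested; allocateTask, a.network.AllocateTask, and the two slices are hypothetical names, with only the log call taken from the snippet:

    allocateTask := func(task *api.Task) {
        if err := a.network.AllocateTask(task); err != nil {
            // remove the task from the pending map and log the failure
            log.G(ctx).WithError(err).Error("task cannot be allocated")
            return
        }
        allocated = append(allocated, task)
    }
    for _, task := range serviceTasks {
        allocateTask(task)
    }
    for _, task := range standaloneTasks {
        allocateTask(task)
    }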

// successfully freed, and there is no reason to roll back.
//
// Therefore, because the operations can fail can be undone (allocation),
// and the operations cannot be undone cannot fail (deallocation) then we
Reviewer:

nit:

  • the operations that can fail can be undone
  • the operations that cannot fail can be undone

I don't mean to nitpick too much, but I had to reread it a few times to make sense of it :)

// to keep two different distributed states in sync. There is one sole source
// of truth about the network state in the cluster. However, if the local state
// of the allocator is not completely consistent with the distributed state,
// then errors such as double allocations or use-after-frees may result.
Reviewer:

I'm a bit confused by the comment. If both libnetwork and the allocator maintain a local state, I'm a bit surprised by the "not needing to keep two different distributed states in sync" and "There is one sole source of truth". My understanding is the opposite: we choose to duplicate libnetwork's state, thus needing to keep it in sync and duplicating the potential source of truth, which is certainly a trade-off for something else (I suppose reading libnetwork's distributed state is prohibitively expensive). So I would rather expect a comment of the form "while this requires careful syncing since we duplicate the source of truth, it is preferable because …".

// with this same Allocator, the caller must later call Commit with the
// returned proposal. If the allocation is abandoned, meaning it's not going to
// be committed to the store and otherwise would need to be rolled back, then
// the proposal can just be ignored.
Reviewer:

It was surprising to me that Allocate does not alter the allocator state, as if we have two Allocate calls before a Commit we may pick the same port twice. I understand it works because we don't interleave those operations, but then:

  • Let's have a comment stating it, as otherwise it's screaming BUG to the innocent reader.
  • I feel like it's better to have that Rollback method, even if it's a no-op for now, as it will help if/when we want to make it more parallel, or give more leeway if switching implementation, but I'll leave that to your appreciation.

Reviewer:

After offline discussion, if we consider this an internal package, I agree we don't need Rollback as one should have decent knowledge of the interactions and make adequate changes if needed. Let's just go with a short comment.
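
The agreed-upon short comment might read along these lines; the signatures are guesses reconstructed from the discussion, not the PR's exact API:

    type Allocator interface {
        // Allocate computes assignments for the endpoint but does not reserve
        // them: allocator state only changes on Commit. Two Allocate calls
        // without an intervening Commit may therefore propose the same dynamic
        // port, so callers must Commit (or discard) each proposal before
        // allocating again; today this is safe because callers never
        // interleave these operations.
        Allocate(endpoint *api.Endpoint) (Proposal, error)
        // Commit applies a proposal previously returned by Allocate.
        Commit(Proposal)
    }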


@dperny mentioned this pull request Aug 21, 2018
@olljanat mentioned this pull request Oct 9, 2018
@mefyl removed their assignment May 11, 2020