Rewrite allocator #2615
Conversation
TODO list for the stuff we need to do to ship this
Codecov Report
@@ Coverage Diff @@
## master #2615 +/- ##
==========================================
+ Coverage 61.76% 62.37% +0.61%
==========================================
Files 134 142 +8
Lines 21760 23134 +1374
==========================================
+ Hits 13440 14430 +990
- Misses 6871 7196 +325
- Partials 1449 1508 +59
@aaronlehmann you wrote some of my favorite code in swarmkit, so I'm really interested in your high-level opinion of this new code, even if you don't have time for a review.
I mostly looked at the port allocator because it's most interesting and approachable to me.
manager/allocator/network/doc.go
Outdated
// from having to worry about the overlying storage method, and reduces the
// already rather bloated surface are of this component
//
// The network allocator provideds the interface for the top-level API objets,
provides
manager/allocator/network/doc.go
Outdated
// caller will provide all of the objects, and be in charge of committing those
// objects to the raft store. Instead, it works directly on the api objects,
// and gives them back to the caller to figure out how to store. This frees us
// from having to worry about the overlying storage method, and reduces the
underlying?
manager/allocator/network/doc.go
Outdated
// objects to the raft store. Instead, it works directly on the api objects,
// and gives them back to the caller to figure out how to store. This frees us
// from having to worry about the overlying storage method, and reduces the
// already rather bloated surface are of this component
surface area
"github.com/docker/swarmkit/manager/allocator/network/errors"
)

// DrvRegistry is an interface defining
Comment seems incomplete
if oldService, ok := a.services[service.ID]; ok {
	var oldVersion, newVersion uint64
	// we need to do this dumb dance because for some crazy reason
	// SpecVersion is nullable
SpecVersion did not always exist. If it were not nullable, we would see a SpecVersion with default values on services that predate the field.
yeah this comment isn't super constructive.
//
// Because I'm reluctant to alter what I have that already works, we will not
// return ErrAlreadyAllocated if the endpoint is already allocated. Instead,
// callers can use the AlreadyAllocated function from this package.
I think doing this might be a net simplification. In the past, there's been a big problem with port allocation logic being spread across a number of places. I really like the idea of centralizing it.
// IsNoop returns true if the ports in use before this proposal are the same as
// the ports in use after.
func (prop *proposal) IsNoop() bool {
I have a slight preference for Allocate figuring out whether any state changes are necessary. It could return a proposal with empty allocate and deallocate sets when there is nothing to do, or a separate type with a Commit method that does nothing (so proposals that change state and don't change state are distinguished by type, not with a method call).
this is almost entirely for the benefit of the tests, and i think the "correct" answer may actually be to change the tests to not use the port_test package and instead allow them to look at an unexported method or field providing this functionality.
I made a PR to your forked repository with a couple optimizations to the Allocate loop. The second one is based on this comment from @aaronlehmann.

P.S.: If the optimizations get accepted, IsNoop results in checking that both allocate and deallocate are empty, so it may not even need a method if you opt not to use the port_test package.
// necessary information to commit the change. Calling the `Commit` method on
// the Proposal will commit the changes to the Allocator. If the caller decides
// to abandon the changes, the user can simply abandon the Proposal without
// calling Commit.
This seems like a good approach. My only hesitation is that it bakes the single-thread assumption into the API. If we ever wanted to support concurrency, it would have to change so that the state is updated immediately, with the possibility of rolling back an allocation. I don't think this is a big deal because it wouldn't be a difficult change.
we can't support concurrency and rolling back. if we support concurrency, you may not be able to roll back a port allocation, because one of the ports you released may have been claimed. you'd have to split out allocation and de-allocation into separate steps.
if you look in the manager/allocator/network.go AllocateService method, we allocate ports, get a proposal, and then try allocating IP addresses. If the IP address allocation fails, we discard the proposal. in the ipam allocator, when allocating IP addresses, you allocate first, and then if allocation is successful, you free addresses. at that point, no further errors can occur, so you can safely commit the proposal. if you allocated ports after IP addresses, you might find that IP allocation succeeds but port allocation fails, and then you'd have to roll back IP address allocation in the same way and risk a freed resource being taken.
i think that this actually makes me realize there is one use for the IsNoop method on proposals (or some similar construct). concurrently allocating IP addresses would be a big win, but is blocked because a lock on the portallocator would have to be acquired before allocating and released after committing, and the IPAM call is inside that synchronized path. however, if you found the proposal was a noop and did not alter port state (which i suspect to be a more common case), you could release that lock immediately and allow other service allocations to proceed calling into the ip allocator concurrently.
on the other hand, i strongly suspect network updates to services are going to be infrequent enough to not substantially impact the performance of the allocator even if that whole code path is single-threaded.
As stated by dperny, the commit immediately and roll back later approach is not thread-safe for deallocations. The prepare a proposal and commit later approach could be made thread-safe by keeping a logical clock in the allocator that gets updated every time a proposal is committed, but this would mean that any concurrent proposal would have to test its validity again. If a proposal is not valid due to a concurrent change then all the optimization of concurrency would be lost. And any deallocation in a proposal would make it non-valid, as you cannot be sure whether the committed proposal that triggered this validity check has deallocated and reallocated this resource.
Would a hybrid solution work? Allocating immediately with rollback support, but not deallocating until committed. This would force proposals to be either committed or reversed; dropping a proposal would result in some resources staying allocated forever until the swarm was rebuilt. I think this approach may be thread-safe, but it has to be taken into account that proposals hold both the resources they want to allocate and the ones they want to deallocate, so a call to Allocate may seem not to have enough resources due to a concurrent proposal when it actually has.
Discussed this with @dperny offline. The problem with making this multi-threaded is not deallocs, since if there is an error with deallocs, we just roll forward and commit. So, in the face of failures, deallocs are effectively best-effort.
So I think the commit immediately and roll back later approach should work for supporting multiple concurrent proposals, correct? @dperny @Adirio
I'm not sure if I understand what you mean by roll forward and commit, but I don't think deallocs are safe in the commit immediately and roll back later approach. Let me describe the proposed approaches to make sure we understand the same thing:
Resources need to be allocated and deallocated. To be thread-safe these operations need to be done in an atomic way, that is, all of them need to be committed simultaneously or you cannot guarantee that the later ones will not fail. Three approaches were mentioned:
- Prepare a proposal and commit later: when preparing a proposal you can ensure its validity, but as the allocator state can change before committing, the validity of the proposal cannot be ensured at commit time. To ensure both we could use a logical clock on the state that would be updated with every commit; if the clock has a different value at commit time than when the proposal was created, the proposal would need to be checked again, and if there was any conflict, not only would a new proposal be needed, but all the steps between proposal generation and commit time would need to be executed again. Deallocs in this case will never result in a conflict as deallocating is idempotent, and conflicts actually should never happen because, despite the allocator not knowing who owns which resource, each resource has a single owner, so only that owner will try to deallocate it. On the other hand, allocs can result in conflicts if the same resource was already allocated (and committed) concurrently.
- Commit immediately and roll back later: the opposite approach, commit at proposal preparation and roll back if unsuccessful. A clear difference from the previous method is that now unsuccessful proposals are the ones that need to be rolled back while successful ones can be discarded; it was the other way around in the first approach. Validity can be ensured at commit time but cannot be ensured at rollback time, as a deallocated resource may have been fully allocated in the meantime (committed, with that proposal discarded since it was considered valid without rollback), so the rollback would fail to re-allocate the resource and thus be unable to get back to a valid state. A logical clock could also be used here, with the same implications as in the first case.
- Hybrid (allocate immediately and roll back later, prepare deallocate and commit later): this approach combines the previous two. It allocates the new resources at proposal preparation but doesn't release already-in-use resources until the commit phase, when it either releases the deallocated resources if successful or the pre-allocated resources if not. As you can guess, both successful and unsuccessful proposals would need to be committed/rolled back and none could be discarded, but the validity of the proposal can be ensured at commit time.
Summing up (which operations are safe and which proposals can be dropped):
- Prepare a proposal and commit later: unsafe allocs and safe deallocs; commit successful and drop unsuccessful.
- Commit immediately and roll back later: safe allocs and unsafe deallocs; drop successful and roll back unsuccessful.
- Hybrid (allocate immediately and roll back later, prepare deallocate and commit later): safe allocs and deallocs; commit successful and roll back unsuccessful.
The only method that is concurrency-proof is the third one. Its drawbacks are being unable to drop any proposal and reserving resources that stay locked while the proposal is active.
You claim that deallocs are not a problem since we can just roll forward, so the second approach can be used. What do you mean by roll forward? The resource that we deallocated at proposal preparation may have been allocated by another proposal and thus we would be unable to get it back. Do you mean that we should allocate another resource and swap it for the one we cannot reallocate?
portObj := port{p.Protocol, DynamicPortStart}
// loop through the whole range of dynamic ports and select the first
// one available
for i := DynamicPortStart; i <= DynamicPortEnd; i++ {
This seems like it would not scale well with many dynamically assigned ports in play. Maybe it's not a big concern right now, especially since there is an upper bound on how many ports can be assigned.
it's worst case O(n) with the number of ports, because everything is done off map lookups. There are more efficient ways to dynamically assign ports (@stevvooe has mentioned things like bit-twiddling) but I don't think they'll be worth optimizing unless we see that this behavior is bottlenecking performance.
By not always starting at DynamicPortStart (keeping track of the last i from the previous iteration of the ports loop) you will optimize it even if the theoretical complexity remains O(n).
If you have 3 ports inside finalPorts that require a dynamically assigned port and the first has already found that the first 50% of the dynamic port range is full, the second port should start iterating where the first stopped, as it knows that the range before the first port's dynamic assignment is full.
that would be a useful optimization.
Made a PR to your rewrite_allocator branch with this optimization and another one focused on the loops that re-allocate deallocated ports. Was quite straightforward.
Ugh, I just realized that the allocator doesn't handle the case of node-local networks, because I've segregated allocating the driver into the
When I plug the new allocator into the main code base and run the integration tests, they fail. I believe this is due to the new allocator being substantially too slow. Most of the optimization that needs to happen should happen in the top-level
}
if spec.TargetPort > masterPortEnd {
	return nil, errors.ErrInvalidSpec("target port %v isn't in the valid port range", spec.TargetPort)
}
This probably needs a check for masterPortStart too. For PublishedPort, it's not actually a problem (unless someone changes the constant) because the only value less than 1 means dynamic allocation. But I guess we shouldn't allow TargetPort to be 0?
doc.go says: "The allocator does not keep track of host-mode ports in any capacity. It silently adds them to the returned Ports assignment, without checking their validity or availability."
However, we do seem to be doing some validity checks here, even for host-mode ports.
this doesn't count. it's an undocumented freebie feature.
I started making a TLA+ model of the port allocator, and realised I hadn't quite understood some things about how host-mode ports work. This probably just needs a bit more documentation...
// no PublishedPort. The allocator does _not_ keep track of host-mode ports in
// any capacity. It silently adds them to the returned Ports assignment,
// without checking their validity or availability. Port assignments for
// host-mode ports are handled on the worker.
Is the scheduler involved too, for host-mode ports? Otherwise, they might get scheduled on a host which is already using that port.
Also, what about this case:
docker service create --publish published=30000,target=80,mode=host nginx
docker service create --publish published=0,target=80 nginx
Doesn't the allocator need to know that a host is using port 30000?
(I just tried it with 18.03.1-ce and it let me use the same port for two services. Requests always seemed to go to the ingress service until I killed that service, then they went to the host service.)
the scheduler is not involved in host mode ports. it was a very rushed feature, if i recall correctly, and it's sensitive to collisions.
the problem with handling host-mode ports at the cluster level is that it involves an interaction between the allocator and the scheduler. the scheduler won't know until the task passes through the allocator what ports it's using, but the allocator won't know until it passes through the scheduler what node it's on. plus, you can dynamically assign ports, which means you really actually can't know what port will be chosen on the target node.
so, the solution now is to pass host ports as-is, and let the worker fail the task if there is a collision with a local port (thus pushing the scheduler to reschedule the task, hopefully on a different node with the requested port available).
But if a host port isn't available on a node because it's already used as an ingress port, won't we have the same problem on all other nodes of the cluster too? Should a worker reject a request to forward an ingress port if it is already using it as a host-mode port? (I'm not too clear about how overlay networks work currently)
This interaction between the scheduler and the allocator has been discussed in other Issues and PRs. It has been suggested to extract the allocators from the pipeline:
Current:
MANAGER NODE (LEADER) WORKER NODE
_______________ ______________ ______________ ______________ _______________
\ \ \ \ \ \ \
| Orchestrator | Allocator | Scheduler | Dispatcher | | Agent |
/______________/______________/______________/______________/ /______________/
Proposed:
MANAGER NODE (LEADER) WORKER NODE
_______________ ______________ ______________ _______________
\ \ \ \ \ \
| Orchestrator | Scheduler | Dispatcher | | Agent |
/______________/______________/______________/ /______________/
|
+-----+-----+
| Allocator |
+-----------+
This way the port allocator could be called before (or at the start of) the scheduler, but other allocators could be called later (resources could be considered allocatable items, for example).
// DynamicPortEnd is the end of the dynamic port range from which node
// ports will be allocated when the user did not specify a port.
DynamicPortEnd uint32 = 32767
Should this be configurable?
it is configurable, just at compile time instead of run time.
Any specific use case for having these configurable at runtime? @talex5
No. I just thought a customer might hit this limit and it would be nice if we could tell them to change a setting rather than waiting for a new engine. But maybe they never get near to this limit anyway.
yes, we are. there is no good reason for this. i am a liar.
i'm unsure how useful TLA+ is going to be for the allocator, which is not distributed and is not yet even concurrent.
I've updated the PR to include some major differences:
Yeah, I wasn't sure whether to bother, but I thought I might as well record what I'd learnt from reading the code, and then it found the ingress/host collision problem. I've now finished the spec, and configured it to check these properties:
Here is the spec: PortAllocator.pdf -- the source is at https://github.com/talex5/swarmkit/compare/tla...talex5:tla-alloc?expand=1 I marked with XXX places where I had to deviate from the Go code to make the tests pass.

Reusing a previous assignment when we need it elsewhere

Example:
Here we will reject the update because we try to reuse port 30000 for "foo", even though we now need it for "bar". Perhaps this is desired behaviour, but it does mean that you can't update to a state even though you could have specified the target state initially without a problem.

Duplicate similar ports

Example:
We will try to use port number 30000 for both ports and reject the allocation.

Shadowing an existing host port

As noted previously, the allocator ignores host ports completely. It may allocate a dynamic ingress port using a network address that is already in use on some host. In this case, the original service becomes unreachable.

Accepting a host port that is hidden

The allocator may accept a host-port allocation that it knows will be unreachable on any host because that address is already in use as an ingress port.

I confirmed the first two by adding some tests for them to the Go tests. These issues might not be too important, but I think it's interesting to see what potential problems there could be, and what the model checker can find.
I believe it is the Agent and not the Scheduler who handles host ports. The scheduler basically ensures deployment constraints, orders possible target nodes according to deployment preferences and number of running tasks, and assigns the task to the first target node.

Corner cases:

Reusing a previous assignment when we need it elsewhere

That is correct, if you manually assign a port that was already dynamically assigned you will get a conflict. When you give a dynamic port range you are kind of "reserving" these ports for dynamic use only.
The advantages of the second solution are being self-documenting and probably simplifying the Allocate method. I am not sure that these small advantages are worth the change.

Duplicate similar ports

That will not happen in the allocator; the first would get 30000 and the second 30001. At least the last time I checked the code, it was checking that the pre-selected port was not assigned and discarding it if it was.

Shadowing an existing host port & Accepting a host port that is hidden

Either you or I got the host-port part wrong. In my understanding, host-ports are from a different virtual network and thus a completely new set of ports is available, so there is no way to shadow/hide a port.
Perhaps there is an error in my tests then. Do you see a problem with this code: https://github.com/talex5/swarmkit/blob/allocator/manager/allocator/network/port/port_test.go#L867 For me, this test fails with:
This is what I see with the current allocator, which seems to have the same behaviour:
Then accessing port 30001 gives me wordpress:
After removing the wordpress service (
If they are on different virtual networks, how would I specify which one I meant?
I am not familiar with
That behaviour seems to be in line with your interpretation of host-ports. Let's see what a member/collaborator says about it.
The problem with two identical ports is that we can't know which ports on the spec map to ports on the object, which we need to know in order to correctly re-use dynamic port allocations across service updates. So, in your failing test, the port allocator is looking for a spec on the endpoint which matches the config in the new spec, seeing that one is found and has no port allocation, and then looking in the object for the port allocation matching that spec, which it finds because the old object has it. That corner case, of identical ports, needs to be blocked from even occurring, in order to avoid "undefined" behavior.
@dperny Could that situation happen in real life? I mean, can a spec have two identical ports including the TargetPort-Protocol pair? You are always using 80/TCP as the target because you are not doing any checks against them at this point, but where would a query to port 80/TCP be redirected if such a spec was found? @talex5 using TargetPort=81 for the new spec port should fix the test.
I had a go at reviewing the other subcomponents (ipam and driver), but I don't know much about libnetwork. Some overview documentation would be useful.
// that i'm willing to just check it and panic if it occurs.
reg, err := drvregistry.New(nil, nil, nil, nil, pg)
if err != nil {
	panic("drvregistry.New returned an error... it's not supposed to do that")
Would be nice to include err in the panic message.
yeah, good catch.
// Restore takes slices of the object types managed by the network allocator
// and syncs the local state of the Allocator to match the state of the objects
// provided. It also initializes the default drivers to the reg.
It also initializes the default drivers to the reg.
Doesn't look like it does anything with the reg.
This is a holdover comment from an earlier version of the code, sorry about that.
Removed
@@ -0,0 +1,11 @@
package driver

// TODO(dperny) write
It's difficult to review this module without this information.
Yeah, whoops, sorry. Will turn this into a TODONE.
_, ok := nw.Spec.Annotations.Labels["com.docker.swarm.internal"]
if ok && nw.Spec.Annotations.Name == "ingress" {
	a.ingressID = nw.ID
}
This ingress-detection logic is duplicated in AllocateNetwork. Maybe make a helper function for that?
if a.ingressID != "" {
	// NOTE(dperny): if i recall correctly, append should not modify the
	// original slice here, which means this is safe to do without
	// accidentally modifying the spec.
I don't know much Go, but I think it depends how the slice was constructed. e.g.
foo := make([]int, 0, 5)
bar := append(foo, 1)
baz := append(foo, 2)
fmt.Println(foo)
fmt.Println(bar)
fmt.Println(baz)
prints
[]
[2]
[2]
So the second append changed the bar slice. I think this is safe only if the underlying array is already at capacity or no other user will append to it.
Append modifies the underlying array, but doesn't modify the original slice:
When you do foo := make([]int, 0, 5) you are creating an array at memory address 0xXXXXXXXX and a slice at memory address 0xAAAAAAAA:
0xXXXXXXXX: 0 0 0 0 0 <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5 <- foo
bar := append(foo, 1) creates a new slice bar at 0xBBBBBBBB using the same underlying array as foo and inserts a 1. As foo has no elements, the insert is done at position 0.
0xXXXXXXXX: 1 0 0 0 0 <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5 <- foo
0xBBBBBBBB: 0xXXXXXXXX 1 5 <- bar
baz := append(foo, 2) creates a new slice baz at 0xCCCCCCCC using the same underlying array as foo and inserts a 2. As foo has no elements, the insert is done at position 0.
0xXXXXXXXX: 2 0 0 0 0 <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5 <- foo
0xBBBBBBBB: 0xXXXXXXXX 1 5 <- bar
0xCCCCCCCC: 0xXXXXXXXX 1 5 <- baz
So printing foo gives 0 elements from the underlying array; printing bar and baz both give 1 element of the underlying array, and as the last operation set it to a 2, that is what you get for both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed this by using a different slice.
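To make the aliasing behavior discussed above concrete, here is a runnable sketch showing both the hazard and the copy-first pattern that avoids it:

```go
package main

import "fmt"

func main() {
	foo := make([]int, 0, 5) // len 0, cap 5: appends share the backing array
	bar := append(foo, 1)
	baz := append(foo, 2) // overwrites the element bar sees

	fmt.Println(foo, bar, baz) // [] [2] [2]

	// Safer pattern: copy into a slice with no spare capacity first, so a
	// subsequent append is forced to allocate a fresh backing array.
	src := []int{1, 2, 3}
	dst := make([]int, len(src))
	copy(dst, src)
	dst = append(dst, 4) // cannot touch src's array
	fmt.Println(src, dst) // [1 2 3] [1 2 3 4]
}
```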
finalErr = err | ||
} | ||
// remove the addresses after we've deallocated every attachment | ||
// attachment.Addresses = nil |
Why is this commented out?
triggers the race detector. i will go back and change it, but, basically, the object you get back from an event stream is shared, and mutating it in one place mutates it in all places.
"no ip addresses remain in any pool for network %v", | ||
local.nw.ID, | ||
) | ||
} |
What if address != ""? Isn't that also an error (the requested address wasn't available)?
hm... i think this might be broken, actually. i'm going to write a test for it real quick.
// The allocator/network package is the subcomponent in charge of allocating | ||
// network resources for swarmkit resources. Give it objects, and it gives you | ||
// resources. |
Would be good to say something about thread-safety here. The port allocator says the PortAllocator is *not* concurrency-safe, and this module uses that one without locks, so presumably this module isn't concurrency-safe either. It should probably mention that.
Seconded. Maybe it makes sense to point this out on a per-API level.
Addressed
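If callers ever did need to share the non-concurrency-safe allocator, one conventional option is a mutex-guarded wrapper. This is only an illustrative sketch with hypothetical names, not code from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

// portAllocator stands in for a component that is not concurrency-safe.
type portAllocator struct {
	next int
}

func (p *portAllocator) Allocate() int {
	port := p.next
	p.next++
	return port
}

// lockedAllocator serializes access so callers need not coordinate.
type lockedAllocator struct {
	mu    sync.Mutex
	inner *portAllocator
}

func (l *lockedAllocator) Allocate() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.inner.Allocate()
}

func main() {
	a := &lockedAllocator{inner: &portAllocator{next: 30000}}
	fmt.Println(a.Allocate(), a.Allocate()) // 30000 30001
}
```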
// ipam is a package for interfacing with the libnetwork IPAM drivers. | ||
// It performs the work of initializing and bootstrapping the IPAM state for | ||
// networks, keeping track of IPAM pools, and allocating new addresses for VIPs | ||
// and network attachments. |
Would be good to provide a brief overview of what IPAM is and does here.
Some comments on allocator_new.go
.
manager/allocator/allocator_new.go
Outdated
// has been allocated. If it is true, then regardless of which nodes are | ||
// currently pending, all nodes will need to be reallocated with the new | ||
// network. we will do this after network allocation, but before task | ||
// allocation, to avoid a case where |
Incomplete comment.
manager/allocator/allocator_new.go
Outdated
var err error | ||
nodes, err = store.FindNodes(tx, store.All) | ||
if err != nil { | ||
// i don't know what to do here |
panic?
manager/allocator/allocator_new.go
Outdated
t := store.GetTask(tx, taskid) | ||
// if the task is not nil, is in state NEW, and is desired to | ||
// be running, then it still needs allocation | ||
if t != nil && t.Status.State == api.TaskStateNew && t.DesiredState == api.TaskStateRunning { |
What about tasks with a desired state of READY?
good catch.
// to the pending map. they will never succeed. additionally, | ||
// remove these tasks from the map | ||
log.G(ctx).WithError(err).Error("service cannot be allocated") | ||
delete(tasks, service.ID) |
Should we attach the error to the task objects somehow, e.g. in task.Status.Err? Otherwise, I assume the user won't be able to see why it failed. Same applies in other places.
perhaps, yes.
calling this out of scope for this initial rewrite, but dropped a TODO comment.
manager/allocator/allocator_new.go
Outdated
// free some state when we deallocate a second | ||
// time! a classic double-free case! | ||
log.G(ctx).WithField("node.id", node.ID).Debug("node was deleted after allocation but before committing") | ||
a.network.DeallocateNode(node) |
Why do we need to deallocate here? Won't the event monitor handle this case too?
the event stream will deallocate the node in its state before we performed re-allocation. however, we may have locally allocated some new resources for the node, which have not been committed, but which must be released in our local state.
i'm unsure what the correct approach to working around this problem is. probably making Deallocate idempotent. but i would really appreciate suggestions.
Ah, so the node object in the event isn't the same as the node object we change during allocation? If I understand correctly then, the problem case is:
- The batch thread starts allocating things for an updated node.
- The node is deleted. The event handler gets an event with the old state of the node. It waits for the lock.
- The batch thread notices that the node is gone and deallocates it, from the correct (newly-allocated) state. It releases the lock.
- The event handler gets the lock and deallocates again, from the old state.
If we get here then:
- The node hadn't been deleted when we looked it up earlier in processPendingAllocations, which must have happened after we took pendingMu.
- It is deleted now. So, a delete event must be on its way.
- The event handler can't have deallocated yet, because it needs pendingMu.
So I guess we could keep a set of "already-deallocated" objects. We add the object to the set here to indicate that we already freed it. When the event thread later gets the lock, it will see that the deallocation was already done and not do it again. It will remove it from the "already-deallocated" set. A bit messy, but it should work.
Other solutions:
- Ensure that we only have one object in memory at once representing the state of the node. Have the event thread deallocate in all cases. It will always have up-to-date information.
- Use a single database transaction for processPendingAllocations. Then the user can't delete the node until we're done. Would this be a performance problem?
manager/allocator/allocator_new.go
Outdated
} | ||
} | ||
func (a *NewAllocator) allocateTask(tx store.Tx, taskid string) error { |
What is this for?
dead code that i forgot to delete. i'm sorry about the dead code and half-done comments, i lose track of things frequently, especially in this huge project.
CircleCI is detecting a change in a proto file without a |
Yeah, I forgot to regenerate. I was going to just put off doing so until my
next push
… |
It's fine, I was just checking CircleCI output because I'm getting a fail in another PR (different error) and I figured I would just comment in case you did not notice. Can I ask you if you have any feedback on the two loop optimizations I proposed as a PR in your fork? I've noticed that you didn't implement them and I'm not sure if you do not like any of the changes or if the PR in your fork wasn't the way to go. |
networks, _ = store.FindNetworks(tx, store.All) | ||
services, _ = store.FindServices(tx, store.All) | ||
tasks, _ = store.FindTasks(tx, store.All) | ||
nodes, _ = store.FindNodes(tx, store.All) |
If that's true, perhaps we should change the return type of Find* so they can't return errors, and make them panic instead. It doesn't seem like a good idea to allow them to return errors that we just ignore. Even if my swarm is hosed, I'd still like to know about it.
find can return errors if you call it with a store.By* filter that is invalid. Though we could change the method signature to check that at compile time, I suppose.
Seconded that I don't think we should plain ignore these errors. If the underlying code changes to sometime return an actual errors and an empty list, we surely want to know. If the error is "not possible", I'd prefer the allocator not to start - better safe than sorry.
Apologies, reviewing in chunks.
manager/allocator/allocator.go
Outdated
@@ -83,6 +84,8 @@ func New(store *store.MemoryStore, pg plugingetter.PluginGetter) (*Allocator, er | |||
func (a *Allocator) Run(ctx context.Context) error { | |||
// Setup cancel context for all goroutines to use. | |||
ctx, cancel := context.WithCancel(ctx) | |||
log.WithModule(ctx, "allocator") |
I think log.WithModule(...) returns a context that has the logging context updated. So you might have to assign ctx = log.WithModule(ctx, "allocator")
manager/allocator/allocator_new.go
Outdated
// possibility for errors lies deep within memdb, and seems to be a | ||
// consequence of some kind of misconfiguration of some sort of the | ||
// indexes or tables. Those errors should not be possible, so we | ||
// ignore them to simplify the code. If the do occur, then your |
spelling nitpick: "If the do occur" -> "If they do occur"
manager/allocator/allocator_new.go
Outdated
api.EventDeleteNode{}, | ||
) | ||
if err != nil { | ||
// TODO(dperny): error handling |
Could you clarify this TODO? Is returning the error the incorrect behavior?
manager/allocator/allocator_new.go
Outdated
// now restore the local state | ||
log.G(ctx).Info("restoring local network allocator state") | ||
if err := a.network.Restore(networks, services, tasks, nodes); err != nil { | ||
// TODO(dperny): handle errors |
Could you clarify this TODO? Is it just that you don't know if the entire run sequence should error?
manager/allocator/allocator_new.go
Outdated
api.EventCreateNetwork{}, | ||
api.EventUpdateNetwork{}, | ||
api.EventDeleteNetwork{}, | ||
api.EventCreateService{}, |
The comment below (https://github.com/docker/swarmkit/pull/2615/files#diff-b775c4fdb79a3a4b943132a95b2f6ef3R337) says that we don't actually need to do anything for the service create and service update events- should we just not subscribe to these two events?
manager/allocator/allocator_new.go
Outdated
api.EventUpdateTask{}, | ||
api.EventDeleteTask{}, | ||
api.EventCreateNode{}, | ||
api.EventUpdateNode{}, |
Similarly should we unsubscribe from EventUpdateNode since we don't handle that case?
manager/allocator/allocator_new.go
Outdated
// has been allocated. If it is true, then regardless of which nodes are | ||
// currently pending, all nodes will need to be reallocated with the new | ||
// network. we will do this after network allocation, but before task | ||
// allocation, to avoid a case where nodes use up all of the available |
nit: isn't this the case we want to favor, as opposed to avoid? We want the nodes to use up all the available network addresses as opposed to tasks using up all of them.
log.G(ctx).WithField("network.id", network.ID).Debugf("network was deleted after allocation but before commit") | ||
// if the current network is nil, then the network was | ||
// deleted and we should deallocate it. | ||
a.network.DeallocateNetwork(network) |
I guess DeallocateNetwork is idempotent and hence would not experience a similar race to DeallocateNode? (looks like we check for the network ID and don't do anything if it doesn't exist). It probably doesn't make sense to keep track of node IDs in the network allocator, since the node ID is not needed, but would that be one way of making the node deallocations idempotent?
this is addressed, i think? i think this review comment happens to refer to an unchanged section of code.
Yes indeed :) Thanks!
c9425aa
to
79f469d
Compare
Added basic integration test with ginkgo for the full allocator. |
Still working on IPAM
// bootstrapped before IP address allocation can proceed correctly. | ||
// Bootstrapping, in this case, means that when a fresh allocator comes up, we | ||
// must first populate it with all of the addresses we know to be in use before | ||
// allocating new ones. |
Could you elaborate on the lifecycle of an IPAM driver? For instance, is the IPAM local state ever cleared? What happens if a manager is a leader, so the IPAM driver has some local state, then it loses leadership and re-acquires?
Is manager/allocator/network.NewAllocator the one that clears the state by initializing the IPAM drivers from scratch?
Thank you! This comment has been addressed.
// The drivers package contains the Allocator in charge of setting a Network's | ||
// Driver state. Before a network can be used, if it depends on a global-scoped | ||
// driver, it must be allocated on that driver. This package provides the code | ||
// that handles that state allocation. |
Also, could you please elaborate on the lifecycle of the network drivers - what do you mean global-scoped? Does the state on the drivers ever need to be reset? (e.g. when leadership is lost?)
// driver that's the kind of address we're requesting | ||
ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway | ||
// and now, if we have a Gateway address for this network, we need | ||
// to allocate it. This condition should never happen; a network |
Should this say "this condition should always be true", the condition being "we have a gateway address for this network"?
// most addresses need to be added to the map of | ||
// address -> poolid, but we don't need to do this with the | ||
// gateway address because it's store in the ipam config with |
nitpick: "it's store in" -> "it's stored in"
// most addresses need to be added to the map of | ||
// address -> poolid, but we don't need to do this with the | ||
// gateway address because it's store in the ipam config with | ||
// the subnet, which is the key for the pools map. |
So subnet is the key for local.pools? So that's what poolIP is? Would it make sense to include that as a comment up above?
I updated the comment for the pools map in the network struct to elaborate on this, as well as the endpoints map.
// the subnet, which is the key for the pools map. | ||
} | ||
// delete the Gateway option from the IPAM options when we're done | ||
delete(ipamOpts, ipamapi.RequestAddressType) |
We are mutating the IPAM options map here - is it guaranteed that there was no previous value for ipamapi.RequestAddressType in the map? Otherwise editing and deleting it would be mutating the map, and I wasn't sure if it ever gets written back to the store.
I'm 80% sure that this map entry should never be present usually... but 80% isn't really good enough, especially when the API here is so ill-defined, so I've changed the code to save the previous entry if one exists and restore it if one exists (and delete it if one doesn't).
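The save-and-restore approach described above can be sketched as follows; the option key is a stand-in for `ipamapi.RequestAddressType`:

```go
package main

import "fmt"

func main() {
	const requestAddressType = "RequestAddressType" // stand-in for ipamapi.RequestAddressType
	ipamOpts := map[string]string{"existing": "value"}

	// Save any caller-provided entry before overwriting it.
	prev, had := ipamOpts[requestAddressType]
	ipamOpts[requestAddressType] = "Gateway"

	// ... the gateway address would be requested here ...

	// Restore the map to exactly its previous state: put back the saved
	// value if there was one, otherwise delete the key we added.
	if had {
		ipamOpts[requestAddressType] = prev
	} else {
		delete(ipamOpts, requestAddressType)
	}
	fmt.Println(ipamOpts) // map[existing:value]
}
```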
8a818d1
to
6a749ac
Compare
I have removed the long-deprecated |
IPAM is the last one I think. I've reviewed the port allocator in a previous PR (#2579)
I haven't looked too hard at the tests except for port-allocator, but I think one or two of my review comments had to do with rollback of allocations, so just wanted to call it out in case you already knew if/where there should be more coverage of those.
Update: Nm, ignore my stupid comment below about deallocation.
if n.Spec.IPAM != nil && n.Spec.IPAM.Driver != nil && n.Spec.IPAM.Driver.Name != "" { | ||
ipamName = n.Spec.IPAM.Driver.Name | ||
} | ||
ipam, _ := a.drvRegistry.IPAM(ipamName) |
Should we be checking the capabilities here to ensure that it's not a node-local network?
i think not actually. we should check in the caller, and not pass node-local networks into the ipam allocator (because they don't need IPAM state allocated), which we already do.
this is specified in the doc comment already, as well.
Got it - just wanted to make sure we didn't need to enforce it in this function.
// add the option indicating that we're gonna request a gateway, and | ||
// remove it before we exit this function | ||
ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway | ||
defer delete(ipamOpts, ipamapi.RequestAddressType) |
I know ipamOpts is a local variable (copied from the actual map above), but it seems to be then assigned back into the network object in the allocator network map. Does this get written back to the store? The comment for the function says that the provided network is mutated.
If it does get updated in the store, do we need to check that we aren't overwriting an existing value in the map?
i fixed this in restore, actually, but not here. i'll do so now.
ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway | ||
defer delete(ipamOpts, ipamapi.RequestAddressType) | ||
if config.Gateway != "" || gwIP == nil { | ||
gwIP, _, err = ipam.RequestAddress(poolID, net.ParseIP(config.Gateway), ipamOpts) |
Out of curiosity does ipam.RequestAddress return a gateway address even if config.Gateway is empty? (which could happen if gwIP is nil)
Yes, because the result of net.ParseIP will be nil, which will cause RequestAddress to choose an IP for us.
// now figure out which new network IDs we've added, which is the same loop | ||
// above but swapped around to check network IDs against VIPs | ||
newvips: | ||
for nwid := range networkIDs { |
Non-blocking: Since we're already looping through the VIPs up above, can we just create a map of vip network IDs while looping there? That way we don't have to do this double-loop here with the continue?
No, because this vip loop is part of a quadratic check for which networks have vips already allocated and which do not. We can probably optimize this by combining this logic with the allocation logic, but I think that optimization would be out of scope for this revision.
Hm... I'm not entirely sure I understand, apologies. What I mean is that this inner loop is going through every VIP to find a network ID that matches the network ID in the outer loop.
However, the loop above is also looping through every VIP, to see which ones to keep and deallocate. In the upper loop, if we build up a mapping of network IDs to struct{}{}, we'll have a set of which network IDs are specified by the VIPs that we can use to look up here, instead of having an inner loop, if that makes sense.
e.g:
nwidsInVIPS := make(map[string]struct{})
for _, vip := range endpoint.VirtualIPs {
if _, ok := networkIDs[vip.NetworkID]; ok {
keep = append(keep, vip)
} else {
deallocate = append(deallocate, vip)
}
nwidsInVIPS[vip.NetworkID] = struct{}{}
}
for nwid := range networkIDs {
if _, ok := nwidsInVIPS[nwid]; !ok {
allocate = append(allocate, nwid)
}
}
I think this is equivalent to the code that is there.
ahhhh... i see. let me put this in and run it through the wash.
Done this.
// network, and each network in turn has non-overlapping subnets, there is | ||
// no chance of IPs we're releasing to be reused in the allocation of new | ||
// VIPs. So, instead, we deallocate last, so that if any allocation fails, | ||
// we only have to roll back incomplete allocation, not re-allocation a |
nitpick: "re-allocation a release" -> "re-allocate a release"
defer func() { | ||
if rerr != nil { | ||
for _, addr := range attach.Addresses { | ||
poolID := local.endpoints[addr] |
Do we need to delete it from local.endpoints[addr] as well?
good catch.
This comment has been addressed.
if err != nil { | ||
requestIP = net.ParseIP(address) | ||
if requestIP == nil { | ||
return nil, errors.ErrInvalidSpec("address %v is not valid", address) |
Will this correctly trigger the defer function that deallocates/rolls back? This does not set rerr
and I'm not sure the deferred function will pick up on that. https://play.golang.org/p/rcNPmYSE4yr
Similar with the other error returns below.
Update: Nm, I'm dumb - passing the function a value is different than referring to that value in the function, which should get it when it executes.
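The conclusion reached here can be demonstrated with a small sketch (function and message are hypothetical): a deferred closure that refers to a named result parameter sees whatever value the `return` statement assigned, which is why the rollback fires even though `rerr` is never set explicitly in the body.

```go
package main

import (
	"errors"
	"fmt"
)

// allocateDemo mimics the pattern discussed above: the deferred closure
// reads the named result rerr, so every `return err` path triggers the
// rollback branch even though rerr is never assigned by hand.
func allocateDemo(fail bool) (rerr error) {
	defer func() {
		if rerr != nil {
			fmt.Println("rolling back")
		}
	}()
	if fail {
		return errors.New("address is not valid") // sets rerr; defer sees it
	}
	return nil
}

func main() {
	allocateDemo(true)  // prints "rolling back"
	allocateDemo(false) // prints nothing
}
```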
// Allocator is the interface for the port allocator, which chooses and keeps
// track of assigned and available port resources.
type Allocator interface {
	Restore([]*api.Endpoint)
I would think the descriptions should be on the interface, because that's what the caller writes their code against. But it's not a big deal.
// user's spec wants a new (dynamically allocated) port, just like if we
// had dynamically assigned the port.
//
// Luckily for use, we keep a copy of the spec on the Endpoint object,
use => us
// So, basically, here's what we're going to do: we're going to create a
// new list of PortConfigs. This will be the final list that gets put into
// the object endpoint object |
not sure if "object endpoint object" is what you were trying to say
Updated some comments, noticed a bug in us not tracking node-local networks properly, fixed it. |
Addressed comment errors. |
I think I just had a couple of minor nits about comments, but other than that I think this PR LGTM. I'm sure I've missed some stuff, but I spot nothing else in the logic on another re-review.
Adds the ginkgo BDD testing framework. Signed-off-by: Drew Erny <drew.erny@docker.com>
Rewrites the manager/allocator component to be more readable, documented, tested, and correct. The new allocator is tested with the Ginkgo BDD testing framework.
Allocator shares one set of errors across the whole implementation, defined in the manager/allocator/network/errors package.
Adds a new, more robust implementation of assigning ports. Tested with the ginkgo testing framework.
Adds a subcomponent to the network allocator with the purpose of interfacing with the IPAM drivers and allocating IP addresses for swarmkit objects.
Adds a subcomponent for the network allocator that has the responsibility of allocating network resources and driver state from network drivers.
Adds the highest-level component charged with allocating network resources from the underlying libraries. The network allocator provides an interface between the higher-level swarmkit objects (services, tasks and nodes), and the lower-level swarmkit objects (endpoints, attachments) that the lower-level allocator components operate on.
Adds mocks, and the makefile rules to generate them. Re-adds a vendoring of the mocking library.
Adds the top-level allocator component (called NewAllocator while this PR is still WIP), which is the component that interacts with the store and event stream.
Copies the tests from the old allocator, and starts adapting them to the new one. Changes some tests from the old allocator that fail because of assumptions that no longer hold.
Adds prometheus metrics collection to the new allocator.
Signed-off-by: Drew Erny <drew.erny@docker.com>
This field has, in various forms, been deprecated since at least 2016, and we should be well clear to remove it. Signed-off-by: Drew Erny <drew.erny@docker.com>
Continues improving the new allocator, after having removed the Networks field from the Service object Signed-off-by: Drew Erny <drew.erny@docker.com>
This looks solid, probably a good improvement over the old allocator. A solid base to build upon from now on.
- I did not review *_test.go. LMK if you want me to look at it in a separate review.
- As I am no expert in networking, this is mostly focused on architecture and the code itself.

Overall LGTM!
// have resources allocated for the object that are not committed to the
// store, and which we cannot commit to the store because the object is
// gone. Therefore, we deallocate the object when we see that it has
// failed.
I was a bit confused by the phrasing there in the first few reads, as the meaning of "fail" is not clear. I feel like something along the lines of "Therefore, we deallocate the resources we created directly when we see an object was deleted in the meantime" would be better.
Good call, altered.
// allocator's restore function will add all allocated objects to the local
// state so we can proceed with new allocations.
//
// It thens adds all objects in the store to the working set, so that any
Nit : s/thens/then
networks, _ = store.FindNetworks(tx, store.All)
services, _ = store.FindServices(tx, store.All)
tasks, _ = store.FindTasks(tx, store.All)
nodes, _ = store.FindNodes(tx, store.All)
Seconded that I don't think we should plainly ignore these errors. If the underlying code changes to someday return an actual error and an empty list, we surely want to know. If the error is "not possible", I'd prefer the allocator not to start - better safe than sorry.
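The fail-fast restore being asked for might look like the following sketch. The `findNetworks`/`findServices` functions are hypothetical stand-ins for the `store.Find*` calls whose errors are currently discarded; the point is that a lookup error aborts startup instead of silently proceeding with empty lists.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the store lookups whose errors are
// currently discarded in the quoted diff.
func findNetworks() ([]string, error) { return []string{"overlay1"}, nil }
func findServices() ([]string, error) { return nil, errors.New("store index unavailable") }

// restore fails fast on the first lookup error, which is the behavior
// this comment argues for: better to refuse to start than run with a
// silently incomplete working set.
func restore() error {
	networks, err := findNetworks()
	if err != nil {
		return fmt.Errorf("restoring networks: %v", err)
	}
	services, err := findServices()
	if err != nil {
		return fmt.Errorf("restoring services: %v", err)
	}
	fmt.Println("restored", len(networks), "networks and", len(services), "services")
	return nil
}

func main() {
	if err := restore(); err != nil {
		fmt.Println("allocator refused to start:", err)
	}
}
```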
case <-batchTimer.C:
	a.pendingMu.Lock()
	total := len(a.pendingNetworks) + len(a.pendingNodes) + len(a.pendingTasks)
	a.pendingMu.Unlock()
Here we lock pendingMu to check if we need to enter processPendingAllocations, but as you state just below, processPendingAllocations will jump straight through anyway if it's zero. Couldn't we move this check and simply exit from processPendingAllocations just after locking if the total is zero? It would be simpler, and we are reducing concurrency by locking twice (if someone acquires the mutex in this small time gap we will be put to sleep).
so, basically, as a future improvement, we will want to make processPendingAllocations more intelligent. for example, right now, if allocation fails because resources are exhausted, we will start spinning and trying to allocate. ideally we would in that case want to check if something has been deallocated before launching into another attempt to allocate. (this was actually the case in the old allocator, which I chose to ignore in the interest of time and simplicity for this PR). the check for this would happen here in this function, and NOT in processPendingAllocations, which should have the sole responsibility of doing the actual work, not deciding if it should do the work.
does that make sense?
If we want to check whether something has been deallocated, we might have to lock here too indeed; however I'm not sure it's worth locking twice for now, we don't know what we'll actually do in the future and when. Also that check in processPendingAllocations can't hurt I think, and it would get rid of this lock for now. All in all I think the PR should focus on making it sensible as is, not leave a placeholder for later - but I'll let you judge.
select {
case <-ctx.Done():
	log.G(ctx).Debug("context done, canceling the event stream")
	cancel()
I could have missed something, but can't we just cancel() directly from this function a bit later after wg.Wait()? Maybe we can even just defer cancel() right here? (or a lambda if you want to keep the log)
i can't remember why i did this. let me stew on it for a bit and get back to you.
// because only allocated networks have the DriverState
// allocated, we will only attach allocated networks. if for
// some reason an unallocated network slips through, we'll get
// ErrDepedencyNotAllocated when we try to allocate the nodes.
nit : s/ErrDepedencyNotAllocated/ErrDependencyNotAllocated
// remove these tasks from the map
log.G(ctx).WithError(err).Error("task cannot be allocated")
	}
}
It looks like the task allocation code is the exact same for service tasks and service-less tasks; a local lambda factoring this out would probably be better.
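The local-lambda factoring suggested here might look like the following sketch (the `task` type and helper are hypothetical): both task lists run through the same allocation closure, so the error-logging branch exists once.

```go
package main

import "fmt"

type task struct{ id, serviceID string }

// allocateBatch applies one shared allocation closure to any slice of
// tasks, so service tasks and service-less tasks reuse the same logic
// instead of duplicating it in two loops.
func allocateBatch(tasks []task, allocate func(task) error) (ok int) {
	for _, t := range tasks {
		if err := allocate(t); err != nil {
			fmt.Println("task cannot be allocated:", err)
			continue
		}
		ok++
	}
	return ok
}

func main() {
	allocate := func(t task) error {
		fmt.Println("allocating task", t.id)
		return nil
	}
	serviceTasks := []task{{id: "t1", serviceID: "svc1"}}
	standalone := []task{{id: "t2"}}
	total := allocateBatch(serviceTasks, allocate) + allocateBatch(standalone, allocate)
	fmt.Println("allocated:", total)
}
```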
// successfully freed, and there is no reason to roll back.
//
// Therefore, because the operations can fail can be undone (allocation),
// and the operations cannot be undone cannot fail (deallocation) then we
nit :
- the operations that can fail can be undone
- the operations that cannot fail can be undone
I don't mean to nitpick too much, but I had to reread it a few times to make sense of it :)
// to keep two different distributed states in sync. There is one sole source
// of truth about the network state in the cluster. However, if the local state
// of the allocator is not completely consistent with the distributed state,
// then errors such as double allocations or use-after-frees may result.
I'm a bit confused by the comment. If both libnetwork and the allocator maintain a local state, I'm a bit surprised by the "not needing to keep two different distributed states in sync" and "There is one sole source of truth". My understanding is the opposite: we choose to duplicate libnetwork's state, thus needing to keep it in sync and duplicating the potential source of truth, which is certainly a trade-off for something else (I suppose reading libnetwork's distributed state is prohibitively expensive). So I would rather expect a comment of the form "while this requires careful syncing since we duplicate the source of truth, it is preferable because ".
// with this same Allocator, the caller must later call Commit with the
// returned proposal. If the allocation is abandoned, meaning it's not going to
// be committed to the store and otherwise would need to be rolled back, then
// the proposal can just be ignored.
It was surprising to me that Allocate does not alter the allocator state: if we have two Allocate calls before a Commit, we may pick the same port twice. I understand it works because we don't interleave those operations, but then:
- Let's have a comment stating it, as otherwise it's screaming BUG to the innocent reader.
- I feel like it's better to have that Rollback method, even if it's a no-op for now, as it will help if/when we want to make it more parallel, or give more leeway if switching implementations, but I'll leave that to your appreciation.
After offline discussion, if we consider this an internal package, I agree we don't need Rollback, as one should have decent knowledge of the interactions and make adequate changes if needed. Let's just go with a short comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- What I did
Added all of the lower-level allocator subcomponents, paving the way for the final bits of allocator rewrite.
Includes and supersedes #2579 and #2605. Does not integrate the new components, which still requires a rewrite of the top-level Allocator object (which is in progress on a local branch).
WIP because it lacks tests, but all code is ready for review.
- How I did it
Painfully.
- How to review it
This PR is massive, and will be difficult to adequately review. Do not let this discourage you; the design should be clearly delineated, and the components are loosely coupled. There are many comments. If you do not understand why a particular thing was done a particular way during your review, that's not a sign that you can't review it; that's a sign that it hasn't been commented adequately to explain it.
This PR should be review-able by anyone with any level of familiarity with the code. The goal is to create a maintainable, easy-to-understand component, that even new contributors to the project can easily understand. Effort has been made to ensure that nothing in the code assumes itself to be obvious.
To make reviewing easier, you should follow these guidelines:
- Begin with each package's doc.go, which should explain the purpose and use of each package. Then, read the doc comments of the public interfaces. Make sure they make sense to you, and adequately explain the intended behavior of the components. If they do not, request changes.
- The port package has previously been subject to review in Add new PortAllocator #2579, but has received changes. The ipam package has been partially reviewed in [WIP] Add new IPAM subcomponent #2605, but has received substantial changes. The driver and top-level network packages have not been reviewed yet.
- How to test it
This PR is marked WIP primarily because it currently lacks complete testing. The tests it does have are written with the Ginkgo framework.
- Description for the changelog