Rewrite allocator #2615

Open · wants to merge 4 commits into master from rewrite-allocator
Conversation

@dperny (Collaborator) commented Apr 27, 2018

- What I did

Added all of the lower-level allocator subcomponents, paving the way for the final bits of allocator rewrite.

Includes and supersedes #2579 and #2605. Does not integrate the new components, which still requires a rewrite of the top-level Allocator object (which is in progress on a local branch).

WIP because it lacks tests, but all code is ready for review.

- How I did it

Painfully.

- How to review it

This PR is massive, and will be difficult to adequately review. Do not let this discourage you; the design should be clearly delineated, and the components are loosely coupled. There are many comments. If you do not understand why a particular thing was done a particular way during your review, that's not a sign that you can't review it; it's a sign that it hasn't been commented adequately to explain it.

This PR should be reviewable by anyone with any level of familiarity with the code. The goal is to create a maintainable, easy-to-understand component that even new contributors to the project can easily understand. Effort has been made to ensure that nothing in the code assumes itself to be obvious.

To make reviewing easier, you should follow these guidelines:

  • Start by reading doc.go, which should explain the purpose and use of each package. Then, read the doc comments of the public interfaces. Make sure they make sense to you, and adequately explain the intended behavior of the components. If they do not, request changes.
  • Look over the design and method signatures. Do the signatures make sense? Are the methods easy to use? Do the concerns of each component seem well-separated?
  • Review the lower-level components first. The port package has previously been subject to review in Add new PortAllocator #2579, but has received changes. The ipam package has been partially reviewed in [WIP] Add new IPAM subcomponent #2605, but has received substantial changes. The driver and top-level network packages have not been reviewed yet.
  • Review the Allocator component. Does it fit the sub-components together well? Does it adequately respect the separation of concerns?
  • Review the tests. Do they cover the cases adequately?

- How to test it

This PR is marked WIP primarily because it currently lacks complete testing. The tests it does have are written with the Ginkgo framework.

- Description for the changelog

@dperny (Collaborator Author) commented Apr 27, 2018

TODO list for the stuff we need to do to ship this

  • Write new top-level allocator
  • Ensure manager/allocator/allocator_test.go runs against the new allocator
  • Go through old allocator code, and match old logic to new logic. Make sure all of the logic in the old allocator is covered in the new one
  • Write and run performance tests, optimize where appropriate.

@codecov bot commented Apr 27, 2018

Codecov Report

Merging #2615 into master will increase coverage by 0.61%.
The diff coverage is 72.46%.

@@            Coverage Diff             @@
##           master    #2615      +/-   ##
==========================================
+ Coverage   61.76%   62.37%   +0.61%     
==========================================
  Files         134      142       +8     
  Lines       21760    23134    +1374     
==========================================
+ Hits        13440    14430     +990     
- Misses       6871     7196     +325     
- Partials     1449     1508      +59

@dperny (Collaborator Author) commented Apr 28, 2018

@aaronlehmann you wrote some of my favorite code in swarmkit, so I'm really interested in your high-level opinion of this new code, even if you don't have time for a review.

@aaronlehmann (Collaborator) left a comment:

I mostly looked at the port allocator because it's most interesting and approachable to me.

// from having to worry about the overlying storage method, and reduces the
// already rather bloated surface are of this component
//
// The network allocator provideds the interface for the top-level API objets,
Collaborator:

provides

// caller will provide all of the objects, and be in charge of committing those
// objects to the raft store. Instead, it works directly on the api objects,
// and gives them back to the caller to figure out how to store. This frees us
// from having to worry about the overlying storage method, and reduces the
Collaborator:

underlying?

// objects to the raft store. Instead, it works directly on the api objects,
// and gives them back to the caller to figure out how to store. This frees us
// from having to worry about the overlying storage method, and reduces the
// already rather bloated surface are of this component
Collaborator:

surface area

"github.com/docker/swarmkit/manager/allocator/network/errors"
)

// DrvRegistry is an interface defining
Collaborator:

Comment seems incomplete

if oldService, ok := a.services[service.ID]; ok {
var oldVersion, newVersion uint64
// we need to do this dumb dance because for some crazy reason
// SpecVersion is nullable
Collaborator:

SpecVersion did not always exist. If it was not nullable, we would see a SpecVersion with default values on services that predate the field.

Collaborator Author:

yeah this comment isn't super constructive.

//
// Because I'm reluctant to alter what I have that already works, we will not
// return ErrAlreadyAllocated if the endpoint is already allocated. Instead,
// callers can use the AlreadyAllocated function from this package.
Collaborator:

I think doing this might be a net simplification. In the past, there's been a big problem with port allocation logic being spread across a number of places. I really like the idea of centralizing it.


// IsNoop returns true if the ports in use before this proposal are the same as
// the ports in use after.
func (prop *proposal) IsNoop() bool {
Collaborator:

I have a slight preference for Allocate figuring out whether any state changes are necessary. It could return a proposal with empty allocate and deallocate sets when there is nothing to do, or a separate type with a Commit method that does nothing (so proposals that change state and don't change state are distinguished by type, not with a method call).

Collaborator Author:

this is almost entirely for the benefit of the tests, and i think the "correct" answer may actually be to change the tests to not use the port_test package and instead allow them to look at an unexported method or field providing this functionality.

Contributor:

I made a PR to your forked repository with a couple optimizations to the Allocate loop. The second one is based on this comment from @aaronlehmann.

P.S: If the optimizations get accepted, IsNoop results in checking that both allocate and deallocate are empty, so it may not even need a method if you opt to not use the port_test package.
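For illustration, here is a minimal Go sketch of what that could look like. All names here (portKey, proposal, the field names) are hypothetical, not the PR's actual definitions:

    package port

    // portKey identifies a port by protocol and number (illustrative only).
    type portKey struct {
        protocol string
        number   uint32
    }

    // proposal records the delta between the ports in use before and after
    // an Allocate call; nothing is applied until Commit is called.
    type proposal struct {
        allocate   map[portKey]struct{} // ports newly claimed by this proposal
        deallocate map[portKey]struct{} // ports this proposal releases
    }

    // IsNoop reports whether committing the proposal would change any state.
    // With the proposed optimizations, it reduces to both sets being empty.
    func (p *proposal) IsNoop() bool {
        return len(p.allocate) == 0 && len(p.deallocate) == 0
    }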

// necessary information to commit the change. Calling the `Commit` method on
// the Proposal will commit the changes to the Allocator. If the caller decides
// to abandon the changes, the user can simply abandon the Proposal without
// calling Commit.
Collaborator:

This seems like a good approach. My only hesitation is that it bakes the single-thread assumption into the API. If we ever wanted to support concurrency, it would have to change so that the state is updated immediately, with the possibility of rolling back an allocation. I don't think this is a big deal because it wouldn't be a difficult change.

Collaborator Author:

we can't support concurrency and rolling back. if we support concurrency, you may not be able to roll back a port allocation, because one of the ports you released may have been claimed. you'd have to split out allocation and de-allocation into separate steps.

if you look in the manager/allocator/network.go AllocateService method, we allocate ports, get a proposal, and then try allocating IP addresses. If the IP address allocation fails, we discard the proposal. in the ipam allocator, when allocating IP addresses, you allocate first, and then if allocation is successful, you free addresses. at that point, no further errors can occur, so you can safely commit the proposal. if you allocated ports after IP addresses, you might find that IP allocation succeeds but port allocation fails, and then you'd have to roll back IP address allocation in the same way and risk a freed resource being taken.

i think that this actually makes me realize there is one use for the IsNoop method on proposals (or some similar construct). concurrently allocating IP addresses would be a big win, but is blocked because a lock on the portallocator would have to be acquired before allocating and released after committing, and the IPAM call is inside that synchronized path. however, if you found the proposal was a noop and did not alter port state (which i suspect to be a more common case), you could release that lock immediately and allow other service allocations to proceed calling into the ip allocator concurrently.

on the other hand, i strongly suspect network updates to services are going to be infrequent enough to not substantially impact the performance of the allocator even if that whole code path is single-threaded.

Contributor:

As stated by dperny, the commit immediately and roll back later approach is not thread-safe for deallocations. The prepare a proposal and commit later approach could be made thread-safe by keeping a logical clock in the allocator that gets updated every time a proposal is committed, but this would mean that any concurrent proposal would have to test its validity again. If a proposal is not valid due to a concurrent change, then all the benefit of concurrency would be lost. And any deallocation in a proposal would make it invalid, as you cannot be sure whether the committed proposal that triggered this validity check has deallocated and reallocated that resource.

Would a hybrid solution work? Allocating immediately with rollback support, but not deallocating until commit. This would force proposals to be either committed or reversed; dropping a proposal would result in some resources being allocated forever until the swarm was rebuilt. I think this approach may be thread-safe, but it has to be taken into account that proposals hold both the resources they want to allocate and the ones they want to deallocate, so a call to Allocate may appear to lack enough resources due to a concurrent proposal when it actually has them.

Contributor:

Discussed this with @dperny offline. The problem with making this multi-threaded is not deallocs, since if there is an error with deallocs, we just roll forward and commit. So, in the face of failures, deallocs are effectively best-effort.

So I think the commit immediately and roll back later approach should work for supporting multiple concurrent proposals, correct ? @dperny @Adirio

Contributor:

I'm not sure if I understand what you mean by roll forward, but I don't think deallocs are safe in the commit immediately and roll back later approach. Let me describe the proposed approaches to make sure we understand the same thing:

Resources need to be allocated and deallocated. To be thread-safe, these operations need to be done atomically; that is, all of them need to be committed simultaneously, or you cannot guarantee that the later ones will not fail. Three approaches were mentioned:

  1. Prepare a proposal and commit later: when preparing a proposal you can ensure its validity, but as the allocator state can change before committing, the validity of the proposal cannot be ensured at commit time. To ensure both, we could use a logical clock on the state that would be updated with every commit; if the clock has a different value at commit time than when the proposal was created, the proposal would need to be checked again, and if there was any conflict, not only would a new proposal be needed, but all the steps between proposal generation and commit time would need to be executed again. Deallocs in this case will never result in a conflict, as deallocating is idempotent; in fact, conflicts should never happen because, although the allocator does not know who owns which resource, each resource has a single owner, so only that owner will try to deallocate it. On the other hand, allocs can result in conflicts if the same resource was already allocated (and committed) concurrently.
  2. Commit immediately and roll back later: the opposite approach, committing at proposal preparation and rolling back if unsuccessful. A key difference from the previous method is that now unsuccessful proposals are the ones that need to be rolled back while successful ones can be discarded, the reverse of the first approach. Validity can be ensured at commit time but cannot be ensured at rollback time, as a deallocated resource can have been fully allocated elsewhere (committed, with the proposal discarded as it was considered valid without rollback), so the rollback would fail to re-allocate that resource and thus be unable to return to a valid state. A logical clock could also be used here, but with the same implications as in the first case.
  3. Hybrid (allocate immediately and roll back later, prepare deallocations and commit later): this approach combines the previous two. It allocates the new resources at proposal preparation but doesn't release already-in-use resources until the commit phase, when it either releases the deallocated resources if successful or the pre-allocated resources if not. As you can guess, both successful and unsuccessful proposals would need to be committed/rolled back and none could be discarded, but the validity of the proposal can be ensured at commit time.

Summing up (which operations are safe and which proposals can be dropped):

  1. Prepare a proposal and commit later: unsafe allocs and safe deallocs; commit successful and drop unsuccessful.
  2. Commit immediately and roll back later: safe allocs and unsafe deallocs; drop successful and roll back unsuccessful.
  3. Hybrid (allocate immediately and roll back later, prepare deallocations and commit later): safe allocs and deallocs; commit successful and roll back unsuccessful.

The only method that is concurrency-proof is the third one. Its drawbacks are being unable to drop any proposal, and reserving resources that stay locked while the proposal is active.

You claim that deallocs are not a problem since we can just roll forward, so the second approach can be used. What do you mean by roll forward? The resource that we deallocated at proposal preparation may have been allocated by another proposal, and thus we would be unable to get it back. Do you mean that we should allocate another resource and swap it for the one we cannot reallocate?
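To make the discussion concrete, here is a rough Go sketch of the hybrid approach (number 3). Every name is hypothetical; the PR's real proposal type differs:

    package port

    import "sync"

    type portKey struct {
        protocol string
        number   uint32
    }

    // Allocator tracks the ports in use; the mutex makes Commit and Abort
    // atomic with respect to concurrent proposals.
    type Allocator struct {
        mu    sync.Mutex
        inUse map[portKey]struct{}
    }

    // hybridProposal claims new ports eagerly but defers releases to Commit.
    type hybridProposal struct {
        alloc     *Allocator
        allocated []portKey // claimed when the proposal was built
        toRelease []portKey // released only when Commit is called
    }

    // Commit performs the deferred deallocations; the eager claims stand.
    func (p *hybridProposal) Commit() {
        p.alloc.mu.Lock()
        defer p.alloc.mu.Unlock()
        for _, pk := range p.toRelease {
            delete(p.alloc.inUse, pk)
        }
    }

    // Abort rolls back the eager claims. Because nothing was released before
    // Commit, no freed port can have been taken concurrently, so rollback
    // cannot fail.
    func (p *hybridProposal) Abort() {
        p.alloc.mu.Lock()
        defer p.alloc.mu.Unlock()
        for _, pk := range p.allocated {
            delete(p.alloc.inUse, pk)
        }
    }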

portObj := port{p.Protocol, DynamicPortStart}
// loop through the whole range of dynamic ports and select the first
// one available
for i := DynamicPortStart; i <= DynamicPortEnd; i++ {
Collaborator:

This seems like it would not scale well with many dynamically assigned ports in play. Maybe it's not a big concern right now, especially since there is an upper bound on how many ports can be assigned.

Collaborator Author:

it's worst case O(n) with the number of ports, because everything is done off map lookups. There are more efficient ways to dynamically assign ports (@stevvooe has mentioned things like bit-twiddling) but I don't think they'll be worth optimizing unless we see that this behavior is bottlenecking performance.

@Adirio (Contributor) commented May 2, 2018:

By not always starting at DynamicPortStart (keeping track of where the previous iteration of the ports loop stopped), you can speed this up even though the theoretical complexity remains O(n).

If you have 3 ports inside finalPorts that require a dynamically assigned port, and the first has already found that the first 50% of the dynamic port range is full, the second port should start iterating where the first stopped, as it knows that the range before the first port's dynamic assignment is full.
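A Go sketch of the cursor idea, simplified to a single protocol; the function and its shape are illustrative, not the PR's actual loop:

    package port

    const (
        // Assumed bounds; the PR defines the real DynamicPortStart/End.
        DynamicPortStart uint32 = 30000
        DynamicPortEnd   uint32 = 32767
    )

    // assignDynamic picks ports for `needed` dynamic assignments. The cursor
    // never moves backward: everything before it is known to be taken, so
    // each assignment resumes where the previous scan stopped.
    func assignDynamic(needed int, inUse map[uint32]bool) ([]uint32, bool) {
        assigned := make([]uint32, 0, needed)
        next := DynamicPortStart
        for i := 0; i < needed; i++ {
            found := false
            for ; next <= DynamicPortEnd; next++ {
                if !inUse[next] {
                    inUse[next] = true
                    assigned = append(assigned, next)
                    next++ // resume scanning after the port just taken
                    found = true
                    break
                }
            }
            if !found {
                return nil, false // dynamic range exhausted
            }
        }
        return assigned, true
    }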

Collaborator Author:

that would be a useful optimization.

Contributor:

Made a PR to your rewrite_allocator branch with this optimization and another one focused on the loops that re-allocate deallocated ports. It was quite straightforward.

@dperny (Collaborator Author) commented Apr 30, 2018

Ugh, I just realized that the allocator doesn't handle the case of node-local networks, because I've segregated allocating the driver into the driver package and allocating IP addresses into the ipam package, but you need the driver capabilities to allocate network attachments correctly because you take no action on node-local networks.

@dperny (Collaborator Author) commented May 3, 2018

When I plug the new allocator into the main code base and run the integration tests, they fail. I believe this is due to the new allocator being substantially too slow. Most of the optimization that needs to happen should happen in the top-level allocator.Allocator. Specifically, I believe it holds store locks open for entirely too long.

}
if spec.TargetPort > masterPortEnd {
return nil, errors.ErrInvalidSpec("target port %v isn't in the valid port range", spec.TargetPort)
}
Contributor:

This probably needs a check for masterPortStart too. For PublishedPort, it's not actually a problem (unless someone changes the constant) because the only value less than 1 means dynamic allocation. But I guess we shouldn't allow TargetPort to be 0?
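A sketch of what the combined check might look like, reusing the PR's ErrInvalidSpec helper; the bound values here are assumptions, not the PR's constants:

    package port

    import "github.com/docker/swarmkit/manager/allocator/network/errors"

    const (
        masterPortStart uint32 = 1     // assumed lower bound
        masterPortEnd   uint32 = 65535 // assumed upper bound
    )

    // validateTargetPort rejects target ports outside the valid range,
    // including 0, which the upper-bound check alone would accept.
    func validateTargetPort(target uint32) error {
        if target < masterPortStart || target > masterPortEnd {
            return errors.ErrInvalidSpec(
                "target port %v isn't in the valid port range", target)
        }
        return nil
    }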

Contributor:

doc.go says

The allocator does not keep track of host-mode ports in any capacity. It silently adds them to the returned Ports assignment, without checking their validity or availability.

However, we do seem to be doing some validity checks here, even for host-mode ports.

Collaborator Author:

this doesn't count. it's an undocumented freebie feature.

@talex5 (Contributor) left a comment:

I started making a TLA+ model of the port allocator, and realised I hadn't quite understood some things about how host-mode ports work. This probably just needs a bit more documentation...


// no PublishedPort. The allocator does _not_ keep track of host-mode ports in
// any capacity. It silently adds them to the returned Ports assignment,
// without checking their validity or availability. Port assignments for
// host-mode ports are handled on the worker.
Contributor:

Is the scheduler involved too, for host-mode ports? Otherwise, they might get scheduled on a host which is already using that port.

Contributor:

Also, what about this case:

docker service create --publish published=30000,target=80,mode=host nginx
docker service create --publish published=0,target=80               nginx

Doesn't the allocator need to know that a host is using port 30000?

(I just tried it with 18.03.1-ce and it let me use the same port for two services. Requests always seemed to go to the ingress service until I killed that service, then they went to the host service.)

Collaborator Author:

the scheduler is not involved in host mode ports. it was a very rushed feature, if i recall correctly, and it's sensitive to collisions.

the problem with handling host-mode ports at the cluster level is that it involves an interaction between the allocator and the scheduler. the scheduler won't know until the task passes through the allocator what ports it's using, but the allocator won't know until it passes through the scheduler what node it's on. plus, you can dynamically assign ports, which means you really actually can't know what port will be chosen on the target node.

so, the solution now is to pass host ports as-is, and let the worker fail the task if there is a collision with a local port (thus pushing the scheduler to reschedule the task, hopefully on a different node with the requested port available).

Contributor:

But if a host port isn't available on a node because it's already used as an ingress port, won't we have the same problem on all other nodes of the cluster too? Should a worker reject a request to forward an ingress port if it is already using it as a host-mode port? (I'm not too clear about how overlay networks work currently)

Contributor:

This interaction between the scheduler and the allocator has been discussed in other Issues and PRs. It has been suggested to extract the allocators from the pipeline:

Current:

                    MANAGER NODE (LEADER)                           WORKER NODE
_______________ ______________ ______________ ______________     _______________
\              \              \              \              \    \              \
 | Orchestrator |  Allocator   |  Scheduler   |  Dispatcher  |    |    Agent     |
/______________/______________/______________/______________/    /______________/

Proposed:

             MANAGER NODE (LEADER)                   WORKER NODE
_______________ ______________ ______________     _______________
\              \              \              \    \              \
 | Orchestrator |  Scheduler   |  Dispatcher  |    |    Agent     |
/______________/______________/______________/    /______________/
                       |
                 +-----+-----+
                 | Allocator |
                 +-----------+

This way the port allocator could be called before (or at the start of) the scheduler, but other allocators could be called later (resources could be considered allocatable items, for example).


// DynamicPortEnd is the end of the dynamic port range from which node
// ports will be allocated when the user did not specify a port.
DynamicPortEnd uint32 = 32767
Contributor:

Should this be configurable?

Collaborator Author:

it is configurable, just at compile time instead of run time.

Contributor:

Any specific use case for having these configurable at runtime? @talex5

Contributor:

No. I just thought a customer might hit this limit and it would be nice if we could tell them to change a setting rather than waiting for a new engine. But maybe they never get near to this limit anyway.

@dperny (Collaborator Author) commented May 8, 2018

However, we do seem to be doing some validity checks here, even for host-mode ports.

yes, we are. there is no good reason for this. i am a liar.

@dperny (Collaborator Author) commented May 8, 2018

i'm unsure how useful TLA+ is going to be for the allocator, which is not distributed and is not yet even concurrent.

@dperny (Collaborator Author) commented May 9, 2018

I've updated the PR to include some major differences:

  1. The new allocator can be used in swarmkit by setting the environment variable SWARMKIT_USE_NEW_ALLOCATOR="iknowtherisk". The new allocator is gated behind this setting. This is a temporary measure.
  2. Added a second integration test run to CI, in addition to a regular make target, which runs the integration tests with the new allocator as opposed to the old one.
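For reference, the gate boils down to an environment check along these lines (the helper name is made up; the variable and value are as described above):

    package manager

    import "os"

    // useNewAllocator reports whether the operator has explicitly opted in
    // to the new allocator via the temporary escape-hatch environment variable.
    func useNewAllocator() bool {
        return os.Getenv("SWARMKIT_USE_NEW_ALLOCATOR") == "iknowtherisk"
    }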

@talex5 (Contributor) commented May 10, 2018

i'm unsure how useful TLA+ is going to be for the allocator, which is not distributed and is not yet even concurrent.

Yeah, I wasn't sure whether to bother, but I thought I might as well record what I'd learnt from reading the code, and then it found the ingress/host collision problem. I've now finished the spec, and configured it to check these properties:

  1. All allocations are consistent:
    • All ports marked as used in the allocator are, in fact, used by some endpoint.
    • All ports used by an endpoint are marked as allocated.
    • No two services have conflicting allocations, except for host/host collisions (which the allocator can't control, since it depends on the scheduler).
  2. If the allocator rejects a spec, then the spec is invalid, requests ports in use elsewhere, or requires more free dynamic ports for some protocol than are available.
  3. Allocation is idempotent.

Here is the spec: PortAllocator.pdf -- the source is at https://github.com/talex5/swarmkit/compare/tla...talex5:tla-alloc?expand=1

I marked with XXX places where I had to deviate from the Go code to make the tests pass.
In summary, the model checker found the following corner cases:

Reusing a previous assignment when we need it elsewhere

Example:

  1. [name="foo", dynamic] (initial configuration)
  2. [name="foo", dynamic], [name="bar", published=30000] (updated)

Here we will reject the update because we try to reuse port 30000 for "foo", even though we now need it for "bar". Perhaps this is desired behaviour, but it does mean that you can't update to a state even though you could have specified the target state initially without a problem.

Duplicate similar ports

Example:

  1. [name="foo", dynamic]
  2. [name="foo", dynamic], [name="foo", dynamic]

We will try to use port number 30000 for both ports and reject the allocation.

Shadowing an existing host port

As noted previously, the allocator ignores host ports completely. It may allocate a dynamic ingress port using a network address that is already in use on some host. In this case, the original service becomes unreachable.

Accepting a host port that is hidden

The allocator may accept a host-port allocation that it knows will be unreachable on any host because that address is already in use as an ingress port.

I confirmed the first two by adding some tests for them to the Go tests:
https://github.com/dperny/swarmkit-1/compare/rewrite-allocator...talex5:allocator?expand=1

These issues might not be too important, but I think it's interesting to see what potential problems there could be, and what the model checker can find.

@Adirio (Contributor) commented May 10, 2018

@talex5

No two services have conflicting allocations, except for host/host collisions (which the allocator can't control, since it depends on the scheduler).

I believe it is the Agent and not the Scheduler that handles host ports. The scheduler basically ensures deployment constraints, orders possible target nodes according to deployment preferences and the number of running tasks, and assigns the task to the first target node.

Corner cases:

Reusing a previous assignment when we need it elsewhere

That is correct: if you manually assign a port that was already dynamically assigned, you will get a conflict. When you give a dynamic port range, you are kind of "reserving" these ports for dynamic use only.
There are two solutions:

  • Consider the dynamic port range a soft reservation (current behaviour): a manual assignment in this range may or may not succeed depending on the cluster state. Users should use ports in this range at their own risk. IMO, this should get documented if it is the desired behaviour.
  • Consider the dynamic port range a strict reservation: disallow manual assignments in the dynamic port range programmatically. This should probably be checked when issuing docker stack deploy or similar API/CLI calls and return an Error.

The advantages of the second solution are being self-documenting and probably simplifying the Allocate method. I am not sure that these small advantages are worth the change.

Duplicate similar ports

That will not happen in the allocator; the first would get 30000 and the second 30001. At least the last time I checked, the code verified that the pre-selected port was not assigned and discarded it if it was.

Shadowing an existing host port & Accepting a host port that is hidden

Either you or I got the host-port part wrong. In my understanding, host ports are from a different virtual network and thus a completely new set of ports is available, so there is no way to shadow/hide a port.

@talex5 (Contributor) commented May 10, 2018

That will not happen in the allocator; the first would get 30000 and the second 30001. At least the last time I checked, the code verified that the pre-selected port was not assigned and discarded it if it was.

Perhaps there is an error in my tests then. Do you see a problem with this code:

https://github.com/talex5/swarmkit/blob/allocator/manager/allocator/network/port/port_test.go#L867

For me, this test fails with:

        Expected error:
            <errors.errInvalidSpec>: {
                cause: "published port 30000/TCP is assigned to more than 1 port config",
            }
            spec is invalid: published port 30000/TCP is assigned to more than 1 port config
        not to have occurred

Either you or I got the host-port part wrong. In my understanding, host ports are from a different virtual network and thus a completely new set of ports is available, so there is no way to shadow/hide a port.

This is what I see with the current allocator, which seems to have the same behaviour:

docker service create --publish published=30001,target=80,mode=host nginx
docker service create --publish published=0,target=80 wordpress

Then accessing port 30001 gives me wordpress:

$ curl -v http://35.205.55.240:30001/
[...]
Location: http://35.205.55.240:30001/wp-admin/setup-config.php

After removing the wordpress service (docker service rm mszqf4ey1nem) I get:

$ curl -v http://35.205.55.240:30001/
[...]
<title>Welcome to nginx!</title>

If they are on different virtual networks, how would I specify which one I meant?

@Adirio (Contributor) commented May 10, 2018

Perhaps there is an error in my tests then. Do you see a problem with this code:

I am not familiar with Ginkgo, so I will be of little help here, but you should probably not have a value assigned in the spec and then change it to 30000 in the endpoint.Ports slice. Use 0 for the spec and 30000 for endpoint.Ports if you want to simulate a dynamically assigned port, or set both to 30000 if you want to simulate a manually assigned port in the dynamic range.

This is what I see with the current allocator, which seems to have the same behaviour:

docker service create --publish published=30001,target=80,mode=host nginx
docker service create --publish published=0,target=80 wordpress

Then accessing port 30001 gives me wordpress:

$ curl -v http://35.205.55.240:30001/
[...]
Location: http://35.205.55.240:30001/wp-admin/setup-config.php

After removing the wordpress service (docker service rm mszqf4ey1nem) I get:

$ curl -v http://35.205.55.240:30001/
[...]
<title>Welcome to nginx!</title>

If they are on different virtual networks, how would I specify which one I meant?

That behaviour seems to be in line with your interpretation of host ports. Let's see what a member/collaborator says about it.

@dperny (Collaborator Author) commented May 10, 2018

The problem with two identical ports is that we can't know which ports on the spec map to ports on the object, which we need to know in order to correctly re-use dynamic port allocations across service updates. So, in your failing test, the port allocator is looking for a spec on the endpoint which matches the config in the new spec, seeing that one is found and has no port allocation, and then looking in the object for the port allocation matching that spec, which it finds because the old object has it. That corner case, of identical ports, needs to be blocked from even occurring, in order to avoid "undefined" behavior.

@Adirio (Contributor) commented May 11, 2018

@dperny Could that situation happen in real life? I mean, can a spec have two identical ports, including the TargetPort-Protocol pair? You are always using 80/TCP as the target because you are not doing any check against them at this point, but where would a query to port 80/TCP be redirected if such a spec were found?

@talex5 using TargetPort=81 for the new spec port should fix the test

@talex5 (Contributor) left a comment:

I had a go at reviewing the other subcomponents (ipam and driver), but I don't know much about libnetwork. Some overview documentation would be useful.

// that i'm willing to just check it and panic if it occurs.
reg, err := drvregistry.New(nil, nil, nil, nil, pg)
if err != nil {
panic("drvregistry.New returned an error... it's not supposed to do that")
Contributor:

Would be nice to include err in the panic message.

Collaborator Author:

yeah, good catch.
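Something along these lines would do it; a sketch, with import paths assumed from the libnetwork vendoring of that era:

    package driver

    import (
        "fmt"

        "github.com/docker/docker/pkg/plugingetter"
        "github.com/docker/libnetwork/drvregistry"
    )

    // newRegistry includes the underlying error in the panic message, so the
    // "impossible" failure is at least diagnosable if it ever happens.
    func newRegistry(pg plugingetter.PluginGetter) *drvregistry.DrvRegistry {
        reg, err := drvregistry.New(nil, nil, nil, nil, pg)
        if err != nil {
            panic(fmt.Sprintf("drvregistry.New returned an error... it's not supposed to do that: %v", err))
        }
        return reg
    }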


// Restore takes slices of the object types managed by the network allocator
// and syncs the local state of the Allocator to match the state of the objects
// provided. It also initializes the default drivers to the reg.
Contributor:

It also initializes the default drivers to the reg.

Doesn't look like it does anything with the reg.

Collaborator Author:

This is a holdover comment from an earlier version of the code, sorry about that.

Collaborator Author:

Removed

@@ -0,0 +1,11 @@
package driver

// TODO(dperny) write
Contributor:

It's difficult to review this module without this information.

Collaborator Author:

Yeah, whoops, sorry. Will turn this into a TODONE.

_, ok := nw.Spec.Annotations.Labels["com.docker.swarm.internal"]
if ok && nw.Spec.Annotations.Name == "ingress" {
a.ingressID = nw.ID
}
Contributor:

This ingress-detection logic is duplicated in AllocateNetwork. Maybe make a helper function for that?
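Something like this hypothetical helper would centralize the check (the label and name tests are exactly the ones quoted above):

    package network

    import "github.com/docker/swarmkit/api"

    // isIngress reports whether nw is the cluster's ingress network, using
    // the internal-label-plus-name test currently duplicated in two places.
    func isIngress(nw *api.Network) bool {
        _, internal := nw.Spec.Annotations.Labels["com.docker.swarm.internal"]
        return internal && nw.Spec.Annotations.Name == "ingress"
    }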

if a.ingressID != "" {
// NOTE(dperny): if i recall correctly, append should not modify the
// original slice here, which means this is safe to do without
// accidentally modifying the spec.
Contributor:

I don't know much Go, but I think it depends on how the slice was constructed, e.g.

        foo := make([]int, 0, 5)
        bar := append(foo, 1)
        baz := append(foo, 2)
        fmt.Println(foo)
        fmt.Println(bar)
        fmt.Println(baz)

prints

[]
[2]
[2]

So the second append changed the bar slice. I think this is safe only if the underlying array is already at capacity or no other user will append to it.

@Adirio (Contributor) commented May 16, 2018:

Append modifies the underlying array, but doesn't modify the original slice:

When you do foo := make([]int, 0, 5), you are creating an array at memory address 0xXXXXXXXX and a slice at memory address 0xAAAAAAAA:

0xXXXXXXXX: 0 0 0 0 0       <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5  <- foo

bar := append(foo, 1) creates a new slice bar at 0xBBBBBBBB using the same underlying array as foo and inserts a 1. As foo has no elements, the insert is done at position 0.

0xXXXXXXXX: 1 0 0 0 0       <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5  <- foo
0xBBBBBBBB: 0xXXXXXXXX 1 5  <- bar

baz := append(foo, 2) creates a new slice baz at 0xCCCCCCCC using the same underlying array as foo and inserts a 2. As foo has no elements, the insert is done at position 0.

0xXXXXXXXX: 2 0 0 0 0       <- underlying array
0xAAAAAAAA: 0xXXXXXXXX 0 5  <- foo
0xBBBBBBBB: 0xXXXXXXXX 1 5  <- bar
0xCCCCCCCC: 0xXXXXXXXX 1 5  <- baz

So printing foo gives 0 elements from the underlying array, while printing bar and baz each gives 1 element of the underlying array; as the last operation set that element to 2, that is what you get for both.

Collaborator Author:

addressed this by using a different slice.
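A sketch of that fix: copy into a fresh backing array before appending, so the caller's slice can never be mutated through a shared array (the helper itself is hypothetical):

    package network

    import "github.com/docker/swarmkit/api"

    // withIngress returns a new attachment list that includes the ingress
    // network, without ever appending to the caller's backing array.
    func withIngress(attachments []*api.NetworkAttachmentConfig, ingressID string) []*api.NetworkAttachmentConfig {
        out := make([]*api.NetworkAttachmentConfig, len(attachments), len(attachments)+1)
        copy(out, attachments)
        return append(out, &api.NetworkAttachmentConfig{Target: ingressID})
    }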

finalErr = err
}
// remove the addresses after we've deallocated every attachment
// attachment.Addresses = nil
Contributor:

Why is this commented out?

Collaborator Author:

triggers the race detector. i will go back and change it, but, basically, the object you get back from an event stream is shared, and mutating it in one place mutates it in all places.

"no ip addresses remain in any pool for network %v",
local.nw.ID,
)
}
Contributor:

What if address != ""? Isn't that also an error (the requested address wasn't available)?

Collaborator Author:

hm... i think this might be broken, actually. i'm going to write a test for it real quick.


// The allocator/network package is the subcomponent in charge of allocating
// network resources for swarmkit resources. Give it objects, and it gives you
// resources.
Contributor:

Would be good to say something about thread-safety here. The port allocator says the PortAllocator is *not* concurrency-safe, and this module uses that one without locks, so presumably this module isn't concurrency-safe either. It should probably mention that.

Contributor:

Seconded. Maybe it makes sense to point this out on a per-API level.

Collaborator Author:

Addressed

// ipam is a package for interfacing with the libnetwork IPAM drivers.
// It performs the work of initializing and bootstrapping the IPAM state for
// networks, keeping track of IPAM pools, and allocating new addresses for VIPs
// and network attachments.
Contributor:

Would be good to provide a brief overview of what IPAM is and does here.

@talex5 (Contributor) left a comment:

Some comments on allocator_new.go.

// has been allocated. If it is true, then regardless of which nodes are
// currently pending, all nodes will need to be reallocated with the new
// network. we will do this after network allocation, but before task
// allocation, to avoid a case where
Contributor:

Incomplete comment.

var err error
nodes, err = store.FindNodes(tx, store.All)
if err != nil {
// i don't know what to do here
Contributor:

panic?

t := store.GetTask(tx, taskid)
// if the task is not nil, is in state NEW, and is desired to
// be running, then it still needs allocation
if t != nil && t.Status.State == api.TaskStateNew && t.DesiredState == api.TaskStateRunning {
Contributor:

What about tasks with a desired state of READY?

Collaborator Author:

good catch.

// to the pending map. they will never succeed. additionally,
// remove these tasks from the map
log.G(ctx).WithError(err).Error("service cannot be allocated")
delete(tasks, service.ID)
Contributor:

Should we attach the error to the task objects somehow, e.g. in task.Status.Err? Otherwise, I assume the user won't be able to see why it failed. Same applies in other places.

Collaborator Author:

perhaps, yes.

Collaborator Author:

calling this out of scope for this initial rewrite, but dropped a TODO comment.

// free some state when we deallocate a second
// time! a classic double-free case!
log.G(ctx).WithField("node.id", node.ID).Debug("node was deleted after allocation but before committing")
a.network.DeallocateNode(node)
Contributor:

Why do we need to deallocate here? Won't the event monitor handle this case too?

Collaborator Author:

the event stream will deallocate the node in its state before we performed re-allocation. however, we may have locally allocated some new resources for the node, which have not been committed, but which must be released in our local state.

i'm unsure what the correct approach to working around this problem is. probably making Deallocate idempotent. but i would really appreciate suggestions.

Contributor:

Ah, so the node object in the event isn't the same as the node object we change during allocation? If I understand correctly then, the problem case is:

  1. The batch thread starts allocating things for an updated node.
  2. The node is deleted. The event handler gets an event with the old state of the node. It waits for the lock.
  3. The batch thread notices that the node is gone and deallocates it, from the correct (newly-allocated) state. It releases the lock.
  4. The event handler gets the lock and deallocates again, from the old state.

If we get here then:

  1. The node hadn't been deleted when we looked it up earlier in processPendingAllocations, which must have happened after we took pendingMu.
  2. It is deleted now. So, a delete event must be on its way.
  3. The event handler can't have deallocated yet, because it needs pendingMu.

So I guess we could keep a set of "already-deallocated" objects. We add the object to the set here to indicate that we already freed it. When the event thread later gets the lock, it will see that the deallocation was already done and not do it again. It will remove it from the "already-deallocated" set. A bit messy, but it should work.

Other solutions:

  1. Ensure that we only have one object in memory at once representing the state of the node. Have the event thread deallocate in all cases. It will always have up-to-date information.
  2. Use a single database transaction for processPendingAllocations. Then the user can't delete the node until we're done. Would this be a performance problem?
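A sketch of the "already-deallocated" set described above, with hypothetical names; the real fix in the PR may differ:

    package allocator

    import "sync"

    // deallocTracker remembers object IDs the batch thread has already freed,
    // so the event thread can skip (and then forget) them.
    type deallocTracker struct {
        mu   sync.Mutex
        done map[string]struct{}
    }

    func newDeallocTracker() *deallocTracker {
        return &deallocTracker{done: make(map[string]struct{})}
    }

    // markDone is called by the batch thread after deallocating an object.
    func (t *deallocTracker) markDone(id string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.done[id] = struct{}{}
    }

    // shouldDealloc is called by the event thread; it returns false exactly
    // once per markDone call, consuming the record so the set stays small.
    func (t *deallocTracker) shouldDealloc(id string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        if _, ok := t.done[id]; ok {
            delete(t.done, id)
            return false
        }
        return true
    }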

}
}

func (a *NewAllocator) allocateTask(tx store.Tx, taskid string) error {
Contributor:

What is this for?

Collaborator Author:

dead code that i forgot to delete. i'm sorry about the dead code and half-done comments, i lose track of things frequently, especially in this huge project.

@Adirio (Contributor) commented May 17, 2018

CircleCI is detecting a change in a proto file without a make generate. The change is in the comment, but the test is unable to know that. I'm not sure how hard it would be to make the checkproto test aware of comments, or if it would really be that useful.

@dperny (Collaborator Author) commented May 17, 2018 via email

@Adirio (Contributor) commented May 17, 2018

It's fine, I was just checking CircleCI output because I'm getting a fail in another PR (different error) and I figured I would just comment in case you did not notice.

Can I ask you if you have any feedback on the two loop optimizations I proposed as a PR in your fork? I've noticed that you didn't implement them and I'm not sure if you do not like any of the changes or if the PR in your fork wasn't the way to go.

networks, _ = store.FindNetworks(tx, store.All)
services, _ = store.FindServices(tx, store.All)
tasks, _ = store.FindTasks(tx, store.All)
nodes, _ = store.FindNodes(tx, store.All)
Contributor:

If that's true, perhaps we should change the return type of Find* so they can't return errors, and make them panic instead. It doesn't seem like a good idea to allow them to return errors that we just ignore. Even if my swarm is hosed, I'd still like to know about it.

Collaborator Author:

find can return errors if you call it with a store.By* filter that is invalid. Though we could change the method signature to check that at compile time, I suppose.


Seconded that I don't think we should plainly ignore these errors. If the underlying code changes to sometimes return an actual error and an empty list, we surely want to know. If the error is "not possible", I'd prefer the allocator not to start - better safe than sorry.
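One way to surface these "impossible" errors, as a sketch; the helper is hypothetical, while store.FindNetworks and store.All are the existing swarmkit APIs:

    package allocator

    import (
        "github.com/docker/swarmkit/api"
        "github.com/docker/swarmkit/manager/state/store"
    )

    // mustFindNetworks panics instead of ignoring the error: with the All
    // filter, a failure indicates a programming error or a corrupted store,
    // and the allocator should not limp along silently.
    func mustFindNetworks(tx store.ReadTx) []*api.Network {
        networks, err := store.FindNetworks(tx, store.All)
        if err != nil {
            panic("store.FindNetworks failed with the All filter: " + err.Error())
        }
        return networks
    }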


@cyli (Contributor) left a comment:

Apologies, reviewing in chunks.

@@ -83,6 +84,8 @@ func New(store *store.MemoryStore, pg plugingetter.PluginGetter) (*Allocator, er
func (a *Allocator) Run(ctx context.Context) error {
// Setup cancel context for all goroutines to use.
ctx, cancel := context.WithCancel(ctx)
log.WithModule(ctx, "allocator")
Contributor:

I think log.WithModule(...) returns a context that has the logging context updated. So you might have to assign ctx = log.WithModule(ctx, "allocator")

// possibility for errors lies deep within memdb, and seems to be a
// consequence of some kind of misconfiguration of some sort of the
// indexes or tables. Those errors should not be possible, so we
// ignore them to simplify the code. If the do occur, then your
Contributor:

spelling nitpick: "If the do occur" -> "If they do occur"

api.EventDeleteNode{},
)
if err != nil {
// TODO(dperny): error handling
Contributor:

Could you clarify this TODO? Is returning the error the incorrect behavior?

// now restore the local state
log.G(ctx).Info("restoring local network allocator state")
if err := a.network.Restore(networks, services, tasks, nodes); err != nil {
// TODO(dperny): handle errors
Contributor:

Could you clarify this TODO? Is it just that you don't know if the entire run sequence should error?

api.EventCreateNetwork{},
api.EventUpdateNetwork{},
api.EventDeleteNetwork{},
api.EventCreateService{},
Contributor:

The comment below (https://github.com/docker/swarmkit/pull/2615/files#diff-b775c4fdb79a3a4b943132a95b2f6ef3R337) says that we don't actually need to do anything for the service create and service update events- should we just not subscribe to these two events?

api.EventUpdateTask{},
api.EventDeleteTask{},
api.EventCreateNode{},
api.EventUpdateNode{},
Contributor:

Similarly should we unsubscribe from EventUpdateNode since we don't handle that case?

// has been allocated. If it is true, then regardless of which nodes are
// currently pending, all nodes will need to be reallocated with the new
// network. we will do this after network allocation, but before task
// allocation, to avoid a case where nodes use up all of the available
Contributor:

nit: isn't this the case we want to favor, as opposed to avoid? We want the nodes to use up all the available network addresses as opposed to tasks using up all of them.

log.G(ctx).WithField("network.id", network.ID).Debugf("network was deleted after allocation but before commit")
// if the current network is nil, then the network was
// deleted and we should deallocate it.
a.network.DeallocateNetwork(network)
Contributor:

I guess DeallocateNetwork is idempotent and hence would not experience a similar race to DeallocateNode? (looks like we check for the network ID and don't do anything if it doesn't exist). It probably doesn't make sense to keep track of node IDs in the network allocator, since the node ID is not needed, but would that be one way of making the node deallocations idempotent?
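
For illustration, the idempotency pattern that observation describes, reduced to a standalone sketch (the real allocator's fields differ):

    package main

    import "fmt"

    type allocator struct {
        networks map[string]struct{} // IDs that currently hold allocations
    }

    func (a *allocator) DeallocateNetwork(id string) {
        if _, ok := a.networks[id]; !ok {
            return // never allocated, or already freed: repeat calls are no-ops
        }
        // ... release pools and addresses for id ...
        delete(a.networks, id)
    }

    func main() {
        a := &allocator{networks: map[string]struct{}{"nw1": {}}}
        a.DeallocateNetwork("nw1")
        a.DeallocateNetwork("nw1") // safe second call
        fmt.Println(len(a.networks)) // 0
    }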

Collaborator Author:

this is addressed, i think? i think this review comment happens to refer to an unchanged section of code.

Contributor:

Yes indeed :) Thanks!

@dperny force-pushed the rewrite-allocator branch 2 times, most recently from c9425aa to 79f469d (May 23, 2018 20:30)
@dperny (Collaborator Author) commented May 31, 2018:

Added basic integration test with ginkgo for the full allocator.

@cyli (Contributor) left a comment:

Still working on IPAM

// bootstrapped before IP address allocation can proceed correctly.
// Bootstrapping, in this case, means that when a fresh allocator comes up, we
// must first populate it with all of the addresses we know to be in use before
// allocating new ones.
Contributor:

Could you elaborate on the lifecycle of an IPAM driver? For instance, is the IPAM local state ever cleared? What happens if a manager is a leader, so the IPAM driver has some local state, then it loses leadership and re-acquires?

Is manager/allocator/network.NewAllocator the one that clears the state by initializing the IPAM drivers from scratch?

Contributor:

Thank you! This comment has been addressed.
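
For later readers, a rough sketch of the bootstrapping that doc comment describes: on restore, replay each network's recorded subnets and gateways into the IPAM driver before serving any new request. The allocator shape, the empty address space, and the nil options are assumptions; only the ipamapi signatures come from libnetwork:

    package allocator

    import (
        "net"

        "github.com/docker/libnetwork/ipamapi"
        "github.com/docker/swarmkit/api"
    )

    type allocator struct {
        ipam ipamapi.Ipam
    }

    func (a *allocator) restoreNetwork(nw *api.Network) error {
        for _, cfg := range nw.IPAM.Configs {
            // re-acquire the pool for each known subnet
            poolID, _, _, err := a.ipam.RequestPool("", cfg.Subnet, cfg.Range, nil, false)
            if err != nil {
                return err
            }
            // re-request the gateway so the driver marks it as in use
            if _, _, err := a.ipam.RequestAddress(poolID, net.ParseIP(cfg.Gateway), nil); err != nil {
                return err
            }
        }
        return nil
    }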

// The drivers package contains the Allocator in charge of setting a Network's
// Driver state. Before a network can be used, if it depends on a global-scoped
// driver, it must be allocated on that driver. This package provides the code
// that handles that state allocation.
Contributor:

Also, could you please elaborate on the lifecycle of the network drivers - what do you mean global-scoped? Does the state on the drivers ever need to be reset? (e.g. when leadership is lost?)

// driver that's the kind of address we're requesting
ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway
// and now, if we have a Gateway address for this network, we need
// to allocate it. This condition should never happen; a network
Contributor:

Should this say "this condition should always be true", the condition being "we have a gateway address for this network"?


// most addresses need to be added to the map of
// address -> poolid, but we don't need to do this with the
// gateway address because it's store in the ipam config with
Contributor:

nitpick: "it's store in" -> "it's stored in"

// most addresses need to be added to the map of
// address -> poolid, but we don't need to do this with the
// gateway address because it's store in the ipam config with
// the subnet, which is the key for the pools map.
Contributor:

So subnet is the key for local.pools? So that's what poolIP is? Would it make sense to include that as a comment up above?

Collaborator Author:

I updated the comment for the pools map in the network struct to elaborate on this, as well as the endpoints map.

// the subnet, which is the key for the pools map.
}
// delete the Gateway option from the IPAM options when we're done
delete(ipamOpts, ipamapi.RequestAddressType)
Contributor:

We are mutating the IPAM options map here - is it guaranteed that there was no previous value for ipamapi.RequestAddressType in the map? Otherwise editing and deleting it would be mutating the map, and I wasn't sure if it ever gets written back to the store.

Collaborator Author:

I'm 80% sure that this map entry should never be present usually... but 80% isn't really good enough, especially when the API here is so ill-defined, so I've changed the code to save the previous entry if one exists, restore it when we're done, and delete the key if there was no previous entry.
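
A sketch of what that save-and-restore could look like as a helper; withGatewayOpt is a hypothetical name, and only the two constants come from libnetwork:

    package allocator

    import (
        "github.com/docker/libnetwork/ipamapi"
        "github.com/docker/libnetwork/netlabel"
    )

    // withGatewayOpt sets the request-address-type option for the duration of
    // fn, then restores any pre-existing value instead of blindly deleting it.
    func withGatewayOpt(ipamOpts map[string]string, fn func()) {
        prev, hadPrev := ipamOpts[ipamapi.RequestAddressType]
        ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway
        defer func() {
            if hadPrev {
                ipamOpts[ipamapi.RequestAddressType] = prev
            } else {
                delete(ipamOpts, ipamapi.RequestAddressType)
            }
        }()
        fn()
    }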

@dperny force-pushed the rewrite-allocator branch 2 times, most recently from 8a818d1 to 6a749ac (June 1, 2018 20:15)
@dperny (Collaborator Author) commented Jun 1, 2018:

I have removed the long-deprecated ServiceSpec.Networks field.

@cyli (Contributor) left a comment:

IPAM is the last one I think. I've reviewed the port allocator in a previous PR (#2579)

I haven't looked too hard at the tests except for port-allocator, but I think one or two of my review comments had to do with rollback of allocations, so just wanted to call it out in case you already knew if/where there should be more coverage of those.

Update: Nm, ignore my stupid comment below about deallocation.

if n.Spec.IPAM != nil && n.Spec.IPAM.Driver != nil && n.Spec.IPAM.Driver.Name != "" {
    ipamName = n.Spec.IPAM.Driver.Name
}
ipam, _ := a.drvRegistry.IPAM(ipamName)
Contributor:

Should we be checking the capabilities here to ensure that it's not a node-local network?

Collaborator Author:

i think not actually. we should check in the caller, and not pass node-local networks into the ipam allocator (because they don't need IPAM state allocated), which we already do.

Collaborator Author:

this is specified in the doc comment already, as well.

Contributor:

Got it - just wanted to make sure we didn't need to enforce it in this function.

// add the option indicating that we're gonna request a gateway, and
// remove it before we exit this function
ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway
defer delete(ipamOpts, ipamapi.RequestAddressType)
Contributor:

I know ipamOpts is a local variable (copied from the actual map above), but it seems to be then assigned back into the network object in the allocator network map. Does this get written back to the store? The comment for the function says that the provided network is mutated.

If it does get updated in the store, do we need to check that we aren't overwriting an existing value in the map?

Collaborator Author:

i fixed this in restore, actually, but not here. i'll do so now.

ipamOpts[ipamapi.RequestAddressType] = netlabel.Gateway
defer delete(ipamOpts, ipamapi.RequestAddressType)
if config.Gateway != "" || gwIP == nil {
    gwIP, _, err = ipam.RequestAddress(poolID, net.ParseIP(config.Gateway), ipamOpts)
Contributor:

Out of curiosity does ipam.RequestAddress return a gateway address even if config.Gateway is empty? (which could happen if gwIP is nil)

Collaborator Author:

Yes, because the result of net.ParseIP will be nil, which will cause RequestAddress to choose an IP for us.
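
A tiny runnable check of the net.ParseIP half of that claim (the RequestAddress behavior is paraphrased in the comments):

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        fmt.Println(net.ParseIP(""))         // <nil>: RequestAddress then picks any free IP
        fmt.Println(net.ParseIP("10.0.0.1")) // 10.0.0.1: RequestAddress tries that exact IP
    }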

// now figure out which new network IDs we've added, which is the same loop
// above but swapped around to check network IDs against VIPs
newvips:
for nwid := range networkIDs {
Contributor:

Non-blocking: Since we're already looping through the VIPs up above, can we just create a map of vip network IDs while looping there? That way we don't have to do this double-loop here with the continue?

Collaborator Author:

No, because this vip loop is part of a quadratic check for which networks have vips already allocated and which do not. We can probably optimize this by combining this logic with the allocation logic, but I think that optimization would be out of scope for this revision.

@cyli (Contributor), Jun 5, 2018:

Hm... I'm not entirely sure I understand, apologies. What I mean is that this inner loop is going through every VIP to find a network ID that matches the network ID in the outer loop.

However, the loop above is also looping through every VIP, to see which ones to keep and deallocate. In the upper loop, if we build up a mapping of network IDs to struct{}{}, we'll have a set of which network IDs are specified by the VIPS that we can use to look up here, instead of having an inner loop, if that makes sense.

e.g.:

    nwidsInVIPS := make(map[string]struct{})
    for _, vip := range endpoint.VirtualIPs {
        if _, ok := networkIDs[vip.NetworkID]; ok {
            keep = append(keep, vip)
        } else {
            deallocate = append(deallocate, vip)
        }
        nwidsInVIPS[vip.NetworkID] = struct{}{}
    }

    for nwid := range networkIDs {
        if _, ok := nwidsInVIPS[nwid]; !ok {
            allocate = append(allocate, nwid)
        }
    }

I think this is equivalent to the code that is there.

Collaborator Author:

ahhhh... i see. let me put this in and run it through the wash.

Collaborator Author:

Done this.

// network, and each network in turn has non-overlapping subnets, there is
// no chance of IPs we're releasing to be reused in the allocation of new
// VIPs. So, instead, we deallocate last, so that if any allocation fails,
// we only have to roll back incomplete allocation, not re-allocation a
Contributor:

nitpick: "re-allocation a release" -> "re-allocate a release"

defer func() {
    if rerr != nil {
        for _, addr := range attach.Addresses {
            poolID := local.endpoints[addr]
Contributor:

Do we need to delete it from local.endpoints[addr] as well?

Collaborator Author:

good catch.

Contributor:

This comment has been addressed.
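
With that fix applied, the rollback fragment quoted above would read roughly as follows (the enclosing function and field names are assumed):

    defer func() {
        if rerr != nil {
            for _, addr := range attach.Addresses {
                poolID := local.endpoints[addr]
                ipam.ReleaseAddress(poolID, net.ParseIP(addr))
                delete(local.endpoints, addr) // drop the stale mapping too
            }
        }
    }()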

if err != nil {
    requestIP = net.ParseIP(address)
    if requestIP == nil {
        return nil, errors.ErrInvalidSpec("address %v is not valid", address)
@cyli (Contributor), Jun 5, 2018:

Will this correctly trigger the defer function that deallocates/rolls back? This does not set rerr and I'm not sure the deferred function will pick up on that. https://play.golang.org/p/rcNPmYSE4yr

Similar with the other error returns below.

Update: Nm, I'm dumb - passing the function a value is different than referring to that value in the function, which should get it when it executes.
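
For anyone tripped up by the same thing, a runnable distillation of the playground link's point: a deferred closure that refers to a named return value sees its final value, while an argument evaluated at defer time captures the value immediately:

    package main

    import "fmt"

    func f() (rerr error) {
        defer func() { fmt.Println("closure sees:", rerr) }()     // final value: boom
        defer func(e error) { fmt.Println("arg sees:", e) }(rerr) // captured now: <nil>
        return fmt.Errorf("boom")
    }

    func main() { f() }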

// Allocator is the interface for the port allocator, which chooses and keeps
// track of assigned and available port resources.
type Allocator interface {
    Restore([]*api.Endpoint)
Contributor:

I would think the descriptions should be on the interface, because that's what the caller writes their code to. But it's not a big deal.
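
Concretely, that suggestion amounts to something like this; the wording is illustrative, not the PR's actual doc comment:

    // Allocator is the interface for the port allocator, which chooses and
    // keeps track of assigned and available port resources.
    type Allocator interface {
        // Restore seeds the allocator with endpoints whose ports are already
        // assigned, so that previously allocated ports are never reissued.
        Restore([]*api.Endpoint)
    }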

// user's spec wants a new (dynamically allocated) port, just like if we
// had dynamically assigned the port.
//
// Luckily for use, we keep a copy of the spec on the Endpoint object,
Contributor:

use => us


// So, basically, here's what we're going to do: we're going to create a
// new list of PortConfigs. This will be the final list that gets put into
// the object endpoint object
Contributor:

not sure if "object endpoint object" is what you were trying to say

@dperny (Collaborator Author) commented Jun 7, 2018:

Updated some comments; noticed a bug where we weren't tracking node-local networks properly, and fixed it.

@dperny (Collaborator Author) commented Jun 7, 2018:

Addressed comment errors.

@dperny force-pushed the rewrite-allocator branch 2 times, most recently from 0da78f0 to 6b5fbc3 (June 11, 2018 18:12)
@cyli (Contributor) left a comment:

I think I just had a couple of minor nits about comments, but other than that I think this PR LGTM. I'm sure I've missed some stuff, but I spot nothing else in the logic on another re-review.

@dperny changed the title from "[WIP] Rewrite allocator" to "Rewrite allocator" (Jun 18, 2018)
@dperny force-pushed the rewrite-allocator branch 2 times, most recently from 3ab272c to f0ed299 (June 25, 2018 22:27)
Adds the ginkgo BDD testing framework.

Signed-off-by: Drew Erny <drew.erny@docker.com>
Rewrites the manager/allocator component to be more readable,
documented, tested, and correct. The new allocator is tested with the
Ginkgo BDD testing framework.

Allocator shares one set of errors across the whole implementation,
defined in the manager/allocator/network/errors package

Adds a new, more robust implementation of assigning ports. Tested with
the ginkgo testing framework.

Adds a subcomponent to the network allocator with the purpose of
interfacing with the IPAM drivers and allocating IP addresses for
swarmkit objects.

Adds a subcomponent for the network allocator that has the
responsibility of allocating network resources and driver state from
network drivers.

Adds the highest-level component charged with allocating network
resources from the underlying libraries. The network allocator
provides an interface between the higher-level swarmkit objects
(services, tasks, and nodes) and the lower-level swarmkit objects
(endpoints, attachments) that the lower-level allocator components
operate on.

Adds mocks, and the makefile rules to generate them. Re-adds a vendoring
of the mocking library.

Adds the top-level allocator component, (called NewAllocator while this
PR is still WIP), which is the component that interacts with the store
and event stream. Copies the tests from the old allocator, and starts
adapting them to the new one. Changes some tests from the old allocator
that fail because of assumptions that no longer hold.

Adds prometheus metrics collection to the new allocator.

Signed-off-by: Drew Erny <drew.erny@docker.com>
This field has, in various forms, been deprecated since at least 2016,
and we should be well clear to remove it.

Signed-off-by: Drew Erny <drew.erny@docker.com>
Continues improving the new allocator, after having removed the Networks
field from the Service object

Signed-off-by: Drew Erny <drew.erny@docker.com>
@mefyl left a comment:

This looks solid, probably a good improvement over the old allocator. A solid base to build upon from now on.

  • I did not review *_test.go. LMK if you want me to look at it in a separate review.
  • As I am no expert in networking, this is mostly focused on architecture and the code itself.

Overall LGTM!

// have resources allocated for the object that are not committed to the
// store, and which we cannot commit to the store because the object is
// gone. Therefore, we deallocate the object when we see that it has
// failed.
Reviewer:

I was a bit confused by the phrasing there in the first few reads, as the meaning of "fail" is not clear. I feel like something along the lines of "Therefore, we deallocate the resources we created directly when we see an object was deleted in the meantime" would be better.

Collaborator Author:

Good call, altered.

// allocator's restore function will add all allocated objects to the local
// state so we can proceed with new allocations.
//
// It thens adds all objects in the store to the working set, so that any
Reviewer:

Nit: s/thens/then

networks, _ = store.FindNetworks(tx, store.All)
services, _ = store.FindServices(tx, store.All)
tasks, _ = store.FindTasks(tx, store.All)
nodes, _ = store.FindNodes(tx, store.All)
Reviewer:

Seconded that I don't think we should plainly ignore these errors. If the underlying code changes to someday return an actual error and an empty list, we surely want to know. If the error is "not possible", I'd prefer the allocator not to start - better safe than sorry.
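
One way to act on that, sketched against the quoted snippet: MemoryStore.View takes a callback with no return value, so the error has to be captured in the enclosing scope (the error wrapping is an assumption):

    var err error
    a.store.View(func(tx store.ReadTx) {
        if networks, err = store.FindNetworks(tx, store.All); err != nil {
            return
        }
        if services, err = store.FindServices(tx, store.All); err != nil {
            return
        }
        if tasks, err = store.FindTasks(tx, store.All); err != nil {
            return
        }
        nodes, err = store.FindNodes(tx, store.All)
    })
    if err != nil {
        return errors.Wrap(err, "error listing store objects during restore")
    }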

case <-batchTimer.C:
a.pendingMu.Lock()
total := len(a.pendingNetworks) + len(a.pendingNodes) + len(a.pendingTasks)
a.pendingMu.Unlock()
Reviewer:

Here we lock pendingMu to check if we need to enter processPendingAllocations, but as you state just below, processPendingAllocations will jump straight through anyway if it's zero. Couldn't we move this check and simply exit from processPendingAllocations just after locking, if the total is zero? It would be simpler, and we are reducing concurrency by locking twice (if someone acquires the mutex in this small time gap, we will be put to sleep).

Collaborator Author:

so, basically, as a future improvement, we will want to make processPendingAllocations more intelligent. for example, right now, if allocation fails because resources are exhausted, we will start spinning and trying to allocate. ideally we would in that case want to check if something has been deallocated before launching into another attempt to allocate. (this was actually the case in the old allocator, which I chose to ignore in the interest of time and simplicity for this PR). the check for this would happen here in this function, and NOT in processPendingAllocations, which should have the sole responsibility of doing the actual work, not deciding if it should do the work.

does that make sense?

Reviewer:

If we want to check whether something has been deallocated, we might have to lock here too, indeed; however, I'm not sure it's worth locking twice for now, since we don't know what we'll actually do in the future, or when. Also, that check in processPendingAllocations can't hurt, I think, and it would get rid of this lock for now. All in all I think the PR should focus on making it sensible as is, not leave a placeholder for later - but I'll let you judge.
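
For concreteness, the shape being suggested; the field names mirror the quoted snippet, while the signature and body are assumptions:

    func (a *Allocator) processPendingAllocations(ctx context.Context) {
        a.pendingMu.Lock()
        defer a.pendingMu.Unlock()
        if len(a.pendingNetworks)+len(a.pendingNodes)+len(a.pendingTasks) == 0 {
            return // nothing pending; the caller can drop its own locked check
        }
        // ... do the actual allocation work ...
    }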

select {
case <-ctx.Done():
    log.G(ctx).Debug("context done, canceling the event stream")
    cancel()
Reviewer:

I could have missed something, but can't we just cancel() directly from this function a bit later, after wg.Wait()? Maybe we can even just defer cancel() right here? (or a lambda if you want to keep the log)

Collaborator Author:

i can't remember why i did this. let me stew on it for a bit and get back to you.

// because only allocated networks have the DriverState
// allocated, we will only attach allocated networks. if for
// some reason an unallocated network slips through, we'll get
// ErrDepedencyNotAllocated when we try to allocate the nodes.
Reviewer:

nit: s/ErrDepedencyNotAllocated/ErrDependencyNotAllocated

// remove these tasks from the map
log.G(ctx).WithError(err).Error("task cannot be allocated")
}
}
Reviewer:

It looks like the task allocation code is the exact same for services tasks and service-less tasks, a local lambda factoring this would probably be better.
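
Roughly, the factoring being suggested; allocateTask, a.network.AllocateTask, and the two slices are hypothetical names, with only the log call taken from the snippet:

    allocateTask := func(task *api.Task) {
        if err := a.network.AllocateTask(task); err != nil {
            // remove the task from the pending map and log the failure
            log.G(ctx).WithError(err).Error("task cannot be allocated")
            return
        }
        allocated = append(allocated, task)
    }
    for _, task := range serviceTasks {
        allocateTask(task)
    }
    for _, task := range standaloneTasks {
        allocateTask(task)
    }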

// successfully freed, and there is no reason to roll back.
//
// Therefore, because the operations can fail can be undone (allocation),
// and the operations cannot be undone cannot fail (deallocation) then we
Reviewer:

nit:

  • the operations that can fail can be undone
  • the operations that cannot fail can be undone

I don't mean to nitpick too much, but I had to reread it a few times to make sense of it :)

// to keep two different distributed states in sync. There is one sole source
// of truth about the network state in the cluster. However, if the local state
// of the allocator is not completely consistent with the distributed state,
// then errors such as double allocations or use-after-frees may result.
Reviewer:

I'm a bit confused by the comment. If both libnetwork and the allocator maintain a local state, I'm a bit surprised by the "not needing to keep two different distributed states in sync" and "There is one sole source of truth". My understanding is the opposite: we choose to duplicate libnetwork's state, thus needing to keep it in sync and duplicating the potential source of truth, which is certainly a trade-off for something else (I suppose reading libnetwork's distributed state is prohibitively expensive). So I would rather expect a comment of the form "while this requires careful syncing since we duplicate the source of truth, it is preferable because …".

// with this same Allocator, the caller must later call Commit with the
// returned proposal. If the allocation is abandoned, meaning it's not going to
// be committed to the store and otherwise would need to be rolled back, then
// the proposal can just be ignored.
Reviewer:

It was surprising to me that Allocate does not alter the allocator state, as if we have two Allocate calls before a Commit we may pick the same port twice. I understand it works because we don't interleave those operations, but then:

  • Let's have a comment stating it, as otherwise it's screaming BUG to the innocent reader.
  • I feel like it's better to have that Rollback method, even if it's a no-op for now, as it will help if/when we want to make it more parallel, or give more leeway if switching implementation, but I'll leave that to your appreciation.

Reviewer:

After offline discussion, if we consider this an internal package, I agree we don't need Rollback as one should have decent knowledge of the interactions and make adequate changes if needed. Let's just go with a short comment.
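
The agreed-upon short comment might read along these lines; the signatures are guesses reconstructed from the discussion, not the PR's exact API:

    type Allocator interface {
        // Allocate computes assignments for the endpoint but does not reserve
        // them: allocator state only changes on Commit. Two Allocate calls
        // without an intervening Commit may therefore propose the same dynamic
        // port, so callers must Commit (or discard) each proposal before
        // allocating again; today this is safe because callers never
        // interleave these operations.
        Allocate(endpoint *api.Endpoint) (Proposal, error)
        // Commit applies a proposal previously returned by Allocate.
        Commit(Proposal)
    }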


@dperny mentioned this pull request Aug 21, 2018
@olljanat mentioned this pull request Oct 9, 2018
@mefyl removed their assignment May 11, 2020