Service discovery hardening #1796

fcrisciani · 2017-06-08T19:54:29Z

This patch addresses several issues found in the Service Discovery feature.

SetMatrix is a simple matrix of sets. Added tests.

This data structure will be used in a following commit to handle transient states where the same key can momentarily be associated with more than one value.

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
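As a rough illustration of the idea (not the code from this PR), a set-valued map could look like the sketch below. The type name `setMatrix` and the simplified return values are assumptions made for this example, and values are assumed to be comparable so that they can serve as Go map keys.

```go
package main

import "fmt"

// setMatrix is a hypothetical, simplified stand-in for the SetMatrix described
// in the commit message: each key maps to a set of values, not a single value.
type setMatrix map[string]map[interface{}]struct{}

// Insert adds value to the set stored under key and returns the set size.
func (m setMatrix) Insert(key string, value interface{}) int {
	if m[key] == nil {
		m[key] = make(map[interface{}]struct{})
	}
	m[key][value] = struct{}{}
	return len(m[key])
}

// Remove deletes value from the set under key. The key itself disappears
// only when its set becomes empty. It returns the remaining set size.
func (m setMatrix) Remove(key string, value interface{}) int {
	set, ok := m[key]
	if !ok {
		return 0
	}
	delete(set, value)
	if len(set) == 0 {
		delete(m, key)
	}
	return len(set)
}

// Contains reports whether value is currently associated with key.
func (m setMatrix) Contains(key string, value interface{}) bool {
	_, ok := m[key][value]
	return ok
}

func main() {
	m := setMatrix{}
	m.Insert("10.0.0.2", "web.1")
	m.Insert("10.0.0.2", "web.2") // transient state: two values for one key

	fmt.Println(m.Remove("10.0.0.2", "web.1"))   // 1: the key survives
	fmt.Println(m.Contains("10.0.0.2", "web.2")) // true
}
```

The point of the design is that removing one value never discards another value that happens to share the same key.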

fcrisciani · 2017-06-08T21:34:03Z

endpoint.go

@@ -515,6 +515,11 @@ func (ep *endpoint) sbJoin(sb *sandbox, options ...EndpointOption) (err error) {
 		return err
 	}

+	if err = ep.addServiceInfoToCluster(); err != nil {


Move it back to EnableService (reintroduce the sb lock).

fcrisciani · 2017-06-08T22:13:03Z

sandbox.go

@@ -669,24 +669,14 @@ func (sb *sandbox) SetKey(basePath string) error {

 func (sb *sandbox) EnableService() error {
 	for _, ep := range sb.getConnectedEndpoints() {
-		if ep.enableService(true) {
-			if err := ep.addServiceInfoToCluster(); err != nil {


Restore the EnableService logic (reintroducing the sb lock).

mavenugo

Just pushing the initial set of review comments. I have a few more areas to cover.

mavenugo · 2017-06-09T22:48:00Z

agent.go

+	// but in any case the deleteServiceInfoToCluster will follow doing the cleanup if needed.
+	// In case the deleteServiceInfoToCluster arrives first, this one is happening after the endpoint is
+	// removed from the list, in this situation the delete will bail out not finding any data to cleanup
+	// and the add will bail out not finding the endpoint on the sandbox.


Thanks for the excellent summary of the problem.

mavenugo · 2017-06-09T23:37:29Z

service.go

@@ -46,19 +48,51 @@ type service struct {
 	ingressPorts portConfigs

 	// Service aliases
-	aliases []string
+	serviceAliases []string


It's okay for this to be called aliases, since it belongs to the service object.

OK, will restore.

mavenugo · 2017-06-10T02:59:38Z

network.go

-		ipInfo.extResolver = true
+	if ok, _ := sr.ipMap.Contains(ipStr, ipInfo{name: name, extResolver: true}); !ok {
+		sr.ipMap.Remove(ipStr, ipInfo{name: name})
+		sr.ipMap.Insert(ipStr, ipInfo{name: name, extResolver: true})


@fcrisciani I don't quite understand this logic. Can you please explain?
@sanimej could you PTAL at this change?

The original logic fetches the element from the map and sets its extResolver to true.
The new logic checks whether there is already an element with extResolver set to true; if there is not, it removes the entry that has extResolver set to false and inserts one with extResolver set to true.

Looking closer, the logic is slightly different from the original: I always insert an entry if it is not there, whereas the original logic changes nothing when the entry is missing. I will add a check that the entry exists with extResolver set to false.
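The corrected behaviour described in this reply (upgrade an entry only if it already exists with extResolver set to false) might be sketched as follows. This is only an illustration: `markExternal` is a hypothetical helper, and a plain Go map keyed by `ipInfo` stands in for one row of the SetMatrix. Because set elements are values, "changing a field" means removing the old element and inserting a new one.

```go
package main

import "fmt"

// ipInfo mirrors the two fields visible in the diff above.
type ipInfo struct {
	name        string
	extResolver bool
}

// markExternal (hypothetical) flips extResolver to true for name, but only
// when an entry with extResolver == false is already present in the set.
// It reports whether anything was changed.
func markExternal(set map[ipInfo]struct{}, name string) bool {
	old := ipInfo{name: name}
	if _, ok := set[old]; !ok {
		return false // entry absent, or already marked as external
	}
	delete(set, old)
	set[ipInfo{name: name, extResolver: true}] = struct{}{}
	return true
}

func main() {
	set := map[ipInfo]struct{}{ipInfo{name: "db"}: {}}

	fmt.Println(markExternal(set, "db"))  // true: upgraded in place
	fmt.Println(markExternal(set, "db"))  // false: already external
	fmt.Println(markExternal(set, "web")) // false: never inserted
	fmt.Println(len(set))                 // 1
}
```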

mavenugo · 2017-06-10T03:06:45Z

network.go

+	ipInfo, ok := elemSet[0].(ipInfo)
+	if !ok {
+		setStr, b := sr.ipMap.String(ip)
+		logrus.Warnf("Something is wrong with the ipInfo set for key %s set:%t %s", ip, b, setStr)


A better warning message, maybe?

Any suggestion? It will be !ok only if the inserted element is not of type ipInfo, which would be strange and a major bug.

mavenugo · 2017-06-10T03:07:47Z

network.go

+	// network db notifications)
+	// In such cases the resolution will be based on the first element of the set, and can vary
+	// during the system stabilitation
+	ipInfo, ok := elemSet[0].(ipInfo)


Is the check for extResolver intentionally ignored here?

Nope, will add that as well.

mavenugo · 2017-06-10T03:08:28Z

networkdb/networkdb.go

@@ -285,7 +285,6 @@ func (nDB *NetworkDB) CreateEntry(tname, nid, key string, value []byte) error {
 	nDB.indexes[byNetwork].Insert(fmt.Sprintf("/%s/%s/%s", nid, tname, key), entry)
 	nDB.Unlock()

-	nDB.broadcaster.Write(makeEvent(opCreate, tname, nid, key, value))


Can you please explain why it is okay to remove the broadcast event?

This triggers a handleEpTableEvent notification for local endpoints. That is not necessary, since local endpoints have already been configured by earlier logic. It can also be risky, because local events can now get out of sync with remote events: remote events take longer to arrive than locally dispatched ones.

I am not against this change, but I would like to understand it a bit more.

broadcaster.Write ultimately calls the appropriate Watch routines, such as handleEpTableEvent or handleNodeTableEvent, or those for tables managed by other components (such as the overlay driver).

CreateEntry in networkdb is a generic API to create an event of any type. Changing this generic function for the specific case of endpoint management seems risky without considering how it impacts other events.
What if there is another table (not endpoint) with an event created by one component and consumed not just by remote peers, but also by another local component that uses networkdb as its communication mechanism?

Also, if we remove this call, can you point me to another code path through which handleEpTableEvent is called for local endpoints?

Based on a review of all current usages of CreateEntry and Watch, there is no case that will break, so I am fine with not broadcasting this event from networkDB.

fcrisciani · 2017-06-10T21:28:38Z

I don't see this message on GitHub, so I will reply by email. There are several reasons:

1) Events happen in parallel. Imagine this scenario with 2 workers: Worker A allocates IPa to a container; that container then exits, so IPa is released. Before the notification from network DB reaches Worker B, a new container on Worker B starts using IPa. When the notification from Worker A arrives at Worker B, without the SetMatrix the entry in the ipMap would be removed. This breaks resolution for the container that is still running on Worker B.
2) Events from network DB are sometimes replayed for endpoints that were deleted in the past. I saw events delayed by up to 6 minutes. In this case, if the IP is reused by any other worker, the arriving notification will wipe out the entry from the ipMap.

For all the reasons above, the SetMatrix guarantees that a specific IP is removed only when no other endpoint is associated with it.

On Sat, Jun 10, 2017 at 1:14 PM, Madhu Venugopal wrote:

In network.go:

@@ -97,7 +98,7 @@ type ipInfo struct {
 type svcInfo struct {
 	svcMap     map[string][]net.IP
 	svcIPv6Map map[string][]net.IP
-	ipMap      map[string]*ipInfo
+	ipMap      common.SetMatrix

I don't quite understand the need for this ipMap structure to contain more than one ipInfo. Hence I don't see a reason to change it to use SetMatrix.
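The IP-reuse race from the reply above can be replayed with a small simulation. This is a sketch under stated assumptions, not libnetwork code: the `event` type, the two `apply` helpers, the address, and the task names are all made up for illustration. It feeds the same event sequence, as seen by Worker B, to a single-valued map and to a set-valued map.

```go
package main

import "fmt"

// event is a hypothetical add/delete notification for an (IP, endpoint) pair.
type event struct {
	add  bool
	ip   string
	name string
}

// applyPlain keeps one name per IP, so a late delete wipes the entry
// no matter which endpoint currently owns the address.
func applyPlain(events []event) map[string]string {
	m := map[string]string{}
	for _, e := range events {
		if e.add {
			m[e.ip] = e.name
		} else {
			delete(m, e.ip)
		}
	}
	return m
}

// applySets keeps a set of names per IP, so a delete only removes the
// endpoint it refers to; the IP disappears when the set becomes empty.
func applySets(events []event) map[string]map[string]bool {
	m := map[string]map[string]bool{}
	for _, e := range events {
		if e.add {
			if m[e.ip] == nil {
				m[e.ip] = map[string]bool{}
			}
			m[e.ip][e.name] = true
			continue
		}
		delete(m[e.ip], e.name) // deleting from a nil map is a no-op
		if len(m[e.ip]) == 0 {
			delete(m, e.ip)
		}
	}
	return m
}

func main() {
	// task-A's delete arrives after task-B has already reused the address.
	events := []event{
		{add: true, ip: "10.0.0.5", name: "task-A"},
		{add: true, ip: "10.0.0.5", name: "task-B"},
		{add: false, ip: "10.0.0.5", name: "task-A"},
	}

	fmt.Println(len(applyPlain(events)))                // 0: task-B lost its entry
	fmt.Println(applySets(events)["10.0.0.5"]["task-B"]) // true: task-B still resolves
}
```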

mavenugo · 2017-06-10T20:28:17Z

network.go

+	// In such cases the resolution will be based on the first element of the set, and can vary
+	// during the system stabilitation
+	elem, ok := elemSet[0].(ipInfo)
+	if !ok {


What is the purpose of adding this defensive check?
ipMap is completely controlled by libnetwork core, so why check for the ipInfo type?

The SetMatrix is a generic data structure that can store potentially any type. As far as I know, this is the only way to cast back to the original type.

OK. This is a defensive check and I am okay with keeping it in; I am hoping never to see the error message :)
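For readers less familiar with Go, the check under discussion is the "comma, ok" form of a type assertion, which recovers a concrete type from an `interface{}` without panicking on a mismatch. A minimal sketch follows; `firstIPInfo` is a hypothetical helper written for this example, and `ipInfo` copies only the two fields visible in the diffs above.

```go
package main

import "fmt"

// ipInfo mirrors the two fields visible in the diffs above.
type ipInfo struct {
	name        string
	extResolver bool
}

// firstIPInfo (hypothetical) returns the first element of a generic set
// as an ipInfo. ok is false when the set is empty or holds another type.
func firstIPInfo(elems []interface{}) (ipInfo, bool) {
	if len(elems) == 0 {
		return ipInfo{}, false
	}
	// The two-value form yields ok == false instead of panicking
	// when the dynamic type is not ipInfo.
	info, ok := elems[0].(ipInfo)
	return info, ok
}

func main() {
	good := []interface{}{ipInfo{name: "web", extResolver: true}}
	bad := []interface{}{"not an ipInfo"}

	info, ok := firstIPInfo(good)
	fmt.Println(info.name, ok) // web true

	_, ok = firstIPInfo(bad)
	fmt.Println(ok) // false
}
```

The single-value form, `elems[0].(ipInfo)`, would panic on the second input, which is why the two-value form is the usual choice when the container is untyped.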

mavenugo · 2017-06-10T20:30:32Z

network.go

+	elem, ok := elemSet[0].(ipInfo)
+	if !ok {
+		setStr, b := sr.ipMap.String(ip)
+		logrus.Warnf("BUG expected set of ipInfo type for key %s set:%t %s", ip, b, setStr)


A warning with BUG in it seems odd. It will certainly raise an alarm with users. As mentioned above, when do we expect such a bug? And if it is a concern, shouldn't it be resolved rather than logged as a BUG?

It has to raise the alarm: it would mean that a wrong type was inserted into that map.

mavenugo · 2017-06-10T21:21:15Z

sandbox.go

 	for _, ep := range sb.getConnectedEndpoints() {
-		if ep.enableService(false) {
-			if err := ep.deleteServiceInfoFromCluster(); err != nil {


Thinking more about this, isn't it better to have deleteServiceInfoFromCluster called from DisableService but removed from sbLeave, to be consistent with the way EnableService and DisableService are handled? It also brings consistency between Join and Leave. WDYT?

That would be great as a future addition, but we have to investigate some corner cases. For example, I validated that if you run docker kill <container>, DisableService is not called. This will of course leave stale entries behind.

mavenugo · 2017-06-10T21:28:06Z

service.go

+type loadBalancerBackend struct {
+	ip            net.IP
+	containerName string
+	taskAliases   []string


It is weird to see taskAliases and containerName carried in loadBalancerBackend, since this structure has so far been used only for the VIP-based IPVS programming. Are we now changing this for other purposes (such as DNS-RR as well)?

I can see why we have to carry these additional details. They are required so that cleanupServiceBindings can call the lower-level SD functions that expect containerName and taskAliases. But still, this seems out of place.

@fcrisciani Please consider picking up this change mavenugo@92820b9

This is a minor rework of your changes that enables us to avoid the need to change this structure and also helps scope this PR for fixing the race issues and minimize the code-reorg for a later activity if there is a need.

mavenugo · 2017-06-10T22:46:17Z

agent.go

-		n.addSvcRecords(name, ip, nil, true)
-		for _, alias := range taskaliases {
-			n.addSvcRecords(alias, ip, nil, true)
-		}


I can see how you have removed this here and consolidated all the addSvcRecords under addEndpointNameResolution.

But removing this will result in unmanaged containers in attachable networks losing SD.

Hence, we have to call addEndpointNameResolution from here and addEndpointNameResolution must be changed to also handle unmanaged container scenario.

mavenugo · 2017-06-10T22:46:35Z

agent.go

-		n.deleteSvcRecords(name, ip, nil, true)
-		for _, alias := range taskaliases {
-			n.deleteSvcRecords(alias, ip, nil, true)
-		}


mavenugo · 2017-06-10T22:48:12Z

agent.go

@@ -602,8 +622,7 @@ func (ep *endpoint) addServiceInfoToCluster() error {
 		if n.ingress {
 			ingressPorts = ep.ingressPorts
 		}
-
-		if err := c.addServiceBinding(ep.svcName, ep.svcID, n.ID(), ep.ID(), ep.virtualIP, ingressPorts, ep.svcAliases, ep.Iface().Address().IP); err != nil {
+		if err := c.addServiceBinding(ep.svcName, ep.svcID, n.ID(), ep.ID(), ep.Name(), ep.virtualIP, ingressPorts, ep.svcAliases, ep.myAliases, ep.Iface().Address().IP, "addServiceInfoToCluster"); err != nil {
 			return err
 		}
 	}


I think you should also call addEndpointNameResolution from here for the case of unmanaged container and its name resolution.

mavenugo · 2017-06-10T22:48:35Z

agent.go

@@ -655,8 +680,7 @@ func (ep *endpoint) deleteServiceInfoFromCluster() error {
 		if n.ingress {
 			ingressPorts = ep.ingressPorts
 		}
-
-		if err := c.rmServiceBinding(ep.svcName, ep.svcID, n.ID(), ep.ID(), ep.virtualIP, ingressPorts, ep.svcAliases, ep.Iface().Address().IP); err != nil {
+		if err := c.rmServiceBinding(ep.svcName, ep.svcID, n.ID(), ep.ID(), ep.Name(), ep.virtualIP, ingressPorts, ep.svcAliases, ep.myAliases, ep.Iface().Address().IP, "deleteServiceInfoFromCluster"); err != nil {
 			return err
 		}
 	}


I think you should also call deleteEndpointNameResolution from here for the case of unmanaged container and its name resolution.

mavenugo · 2017-06-11T04:26:18Z

service_common.go

+		}
+	}
+
+	if addService && len(vip) != 0 {


It's hard for me to judge whether the addService variable is introduced here just as an optimization to avoid calling addSvcRecords multiple times for the same svcName <-> vip combination.
Or is there a real problem if we call addSvcRecords multiple times for the same svcName <-> vip combination?

No real problem; it is only an optimization, and it is also more symmetric with delSvcRecords, which has rmService (for removal, of course, the logic is mandatory).

fcrisciani · 2017-06-11T21:48:05Z

agent.go

 	c := n.getController()
 	agent := c.getAgent()

+	name := ep.Name()


@mavenugo I guess that, to properly handle the case of an anonymous container, this has to be brought up.

fcrisciani · 2017-06-11T21:48:23Z

agent.go

-		if n.ingress {
-			ingressPorts = ep.ingressPorts
-		}
+	name := ep.Name()


@mavenugo same here: for the delete, this name is needed.

Changed the ipMap to SetMatrix to allow transient states
Compacted the addSvc and deleteSvc into one single method
Updated the data structure for backends to allow storing all the information needed to clean up properly during cleanupServiceBindings
Removed the enable/disable Service logic that was racing with the sbLeave/sbJoin logic
Added some debug logs to track further race conditions

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>

mavenugo

@fcrisciani Thanks for addressing all the comments and fixing the hard to reproduce race test-cases.

LGTM

Contains Service Discovery hardening fixes via moby/libnetwork#1796 Fixes multiple issues such as moby#32830 Signed-off-by: Madhu Venugopal <madhu@docker.com>

This is a cherry-pick of moby/moby#33634 that brings in moby/libnetwork#1796. Signed-off-by: Madhu Venugopal <madhu@docker.com>

Contains Service Discovery hardening fixes via moby/libnetwork#1796 Fixes multiple issues such as #32830 Signed-off-by: Madhu Venugopal <madhu@docker.com> Upstream-commit: 6868b8e Component: engine

fcrisciani force-pushed the name-resolution-race branch from 49f2a16 to bda77e3 Compare June 8, 2017 20:34

fcrisciani changed the title ~~Service discovery rewrite~~ Service discovery hardening Jun 8, 2017

fcrisciani commented Jun 8, 2017

fcrisciani force-pushed the name-resolution-race branch 7 times, most recently from 0a8bf0c to 94372a6 Compare June 9, 2017 17:21

mavenugo reviewed Jun 10, 2017

fcrisciani force-pushed the name-resolution-race branch from 94372a6 to 41278c3 Compare June 10, 2017 04:49

mavenugo reviewed Jun 10, 2017

fcrisciani force-pushed the name-resolution-race branch 2 times, most recently from bbd63a7 to f2da09f Compare June 10, 2017 22:19

mavenugo reviewed Jun 10, 2017

fcrisciani force-pushed the name-resolution-race branch 2 times, most recently from d6a81a6 to 47842a4 Compare June 11, 2017 00:46

fcrisciani mentioned this pull request Jun 11, 2017

Docker connectivity issues between the services in the swarm moby/moby#32830

Open

mavenugo reviewed Jun 11, 2017

fcrisciani force-pushed the name-resolution-race branch from 47842a4 to 1a67846 Compare June 11, 2017 21:47

fcrisciani commented Jun 11, 2017

fcrisciani force-pushed the name-resolution-race branch from 1a67846 to f6e2ee8 Compare June 12, 2017 03:21

fcrisciani force-pushed the name-resolution-race branch from f6e2ee8 to d64e71e Compare June 12, 2017 03:49

mavenugo approved these changes Jun 12, 2017

mavenugo merged commit 86ae3cf into moby:master Jun 12, 2017

fcrisciani deleted the name-resolution-race branch June 12, 2017 05:34

This was referenced Jun 12, 2017

Vendoring libnetwork f4a15a0890383619ad797b3bd2481cc6f46a978d moby/moby#33634

Merged

Vendoring libnetwork 4f5310be349d9299f6cab6d5822312f00cfa965c docker-archive/docker-ce#66

Merged

thaJeztah mentioned this pull request Jun 14, 2017

Docker Swarm Mode service discovery is totally unstable moby/moby#33589

Closed

sanimej mentioned this pull request Jun 23, 2017

Docker swarm DNS periodically fails moby/moby#33721

Closed

This was referenced Aug 9, 2017

Why do controller call addServiceBinding() twice? #1588

Closed

internal DNS has wrong ip address mappings #1295

Closed

olljanat mentioned this pull request Nov 10, 2018

Do not add IP to name records for aliases #2299

Merged
