Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hubble: implement peer service, enable it locally #10969

Merged
merged 6 commits into from Apr 23, 2020
Merged

Conversation

rolinh
Copy link
Member

@rolinh rolinh commented Apr 14, 2020

Please, see individual commits for details.

@rolinh rolinh added kind/feature This introduces new functionality. release-note/major This PR introduces major new functionality to Cilium. sig/hubble Impacts hubble server or relay labels Apr 14, 2020
@rolinh rolinh requested a review from a team April 14, 2020 13:55
@rolinh rolinh requested a review from a team as a code owner April 14, 2020 13:55
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Apr 14, 2020
@coveralls
Copy link

coveralls commented Apr 14, 2020

Coverage Status

Coverage increased (+0.1%) to 44.74% when pulling 0c5a9e9 on pr/rolinh/hubble-peer into 531e61f on master.

Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change looks good, but it wasn't clear to me how i can test this in my cluster. i guess nobody is actually calling rpc Notify(NotifyRequest) yet?

pkg/hubble/peer/service.go Outdated Show resolved Hide resolved
pkg/hubble/peer/peer.go Show resolved Hide resolved
@rolinh
Copy link
Member Author

rolinh commented Apr 15, 2020

the change looks good, but it wasn't clear to me how i can test this in my cluster.

You can use a gRPC client. For example, with grpcurl, you can invoke the service like this from the root of the git repository:

$ grpcurl -import-path api/v1/peer -proto peer.proto -unix -plaintext /var/run/cilium/hubble.sock peer.Peer.Notify
{
  "name": "runtime1",
  "address": "10.0.2.15",
  "type": "PEER_ADDED"
}

Note that this call "blocks". So one way to test this is to invoke this command then add/remove nodes from your cluster. Of course, the cilium-agent needs the --enable-hubble flag.

i guess nobody is actually calling rpc Notify(NotifyRequest) yet?

Not yet. I have an in-development version of hubble-proxy hubble-relay that is using this service but this will be for a follow-up PR once this one is merged.

pkg/hubble/peer/peer.go Show resolved Hide resolved
pkg/hubble/peer/service.go Outdated Show resolved Hide resolved
// interface so that it can be mocked for the purpose of testing.
type Service struct {
stop chan (struct{})
mgr *manager.Manager
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, consider using an interface instead of implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this is required for testing, like mentioned in the struct comment just above. If it's fine with you, I'll add an interface in pkg/node/manager/manager.go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface could be defined here, not in pkg/node/manager/manager.go, this allows you to not need to import pkg/node/manager in this package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the interface with this commit faae24149afbb9da49688945123c09a79716f903 already and implemented tests of the service.
I think that importing the interface from pkg/node/manager is not so bad because it's otherwise really hard to understand who implements it if you're not familiar enough with cilium's codebase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rolinh there are tools built for that purpose [0] so there's no point in keeping the interfaces in the same place as 1 of the implementations. What happens if in the future we will have a different node manager that implements the same interface? Where will the interface live?

[0] https://docs.google.com/document/d/1_Y9xCEMj5S-7rv2ooHpZNH15JgRT5iM742gJkw5LtmQ/edit#heading=h.f6hyappclfor

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind moving the interface to pkg/hubble/peer if you have strong feelings about it; I don't think it matters much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface could be defined here, not in pkg/node/manager/manager.go, this allows you to not need to import pkg/node/manager in this package.

BTW, by moving the interface to pkg/hubble/peer I could save on importing pkg/node/manager but then I'll have to import pkg/datapath as it's required for the interface definition. I'm not "saving" on import anyway.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I do see an issue with blocking the Node Manager which I think must be addressed, but all my other comments are just comments and don't require any action.

pkg/hubble/server/serveroption/option.go Outdated Show resolved Hide resolved
pkg/hubble/defaults/defaults.go Show resolved Hide resolved
case <-s.stop:
return nil
case cn := <-h.C:
if err := stream.Send(cn); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stream.Send() can block (it does I/O). Which means the unbuffered h.C channel causes the NodeManager to be blocked as well in h.Node{Add,Update,Delete}.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't a context.Context part of the Send parameters?

Copy link
Member Author

@rolinh rolinh Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the ServerStream interface doesn't use a context.Context. SendMsg() is used by Send(), as can be seen in the generated code in peer.pb.go:

type Peer_NotifyServer interface {
	Send(*ChangeNotification) error
	grpc.ServerStream
}

type peerNotifyServer struct {
	grpc.ServerStream
}

func (x *peerNotifyServer) Send(m *ChangeNotification) error {
	return x.ServerStream.SendMsg(m)
}

The interface defines a context but it's global to ServerStream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandro I have added a test that simulates a blocked send (TestService_NotifyWithBlockedSend) and it failed (which is good as it tests what we want).

To fix the problem and have the test pass, I've added small buffer (of size 10 by default but this is configurable through service options) that, when full, triggers the server to stop steaming to the client.
This eventually pushes the problem back to the client which will have to call Notify() again in such a case. I don't think it's a problem overall because upon calling Notify(), the client receives the list of all peers before receiving continuous updates so maintaining a local cache client-side is easy.

Please, have a look again at the Notify() implementation and tell me if you think this way of handling the problem is OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, yes, the code itself looks good! 🎉 We have to get a bit of experience with this, because I honestly have no clue how big the buffer size should be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to get a bit of experience with this, because I honestly have no clue how big the buffer size should be.

Neither do I. I assumed that since the connection is over a local unix domain socket, a buffer of 10 should be sufficient. I also don't want to make it unnecessarily large because if the caller can't keep up then there's probably a problem elsewhere.

pkg/hubble/peer/service.go Outdated Show resolved Hide resolved
@michi-covalent
Copy link
Contributor

@rolinh ok it passed my rigorous testing of scaling nodepool size from 1 to 3 and then scaling it back down to 1:

# ./grpcurl -import-path api/v1/peer -proto peer.proto -unix -plaintext /var/run/cilium/hubble.sock peer.Peer.Notify
{
  "name": "gke-michi-default-pool-87b89101-ljbs",
  "address": "10.138.0.53",
  "type": "PEER_ADDED"
}
{
  "name": "gke-michi-default-pool-87b89101-4r14",
  "address": "10.138.0.55",
  "type": "PEER_ADDED"
}
{
  "name": "gke-michi-default-pool-87b89101-lwrc",
  "address": "10.138.0.56",
  "type": "PEER_ADDED"
}
{
  "name": "gke-michi-default-pool-87b89101-4r14",
  "address": "10.138.0.55",
  "type": "PEER_DELETED"
}
{
  "name": "gke-michi-default-pool-87b89101-lwrc",
  "address": "10.138.0.56",
  "type": "PEER_DELETED"
}

// interface so that it can be mocked for the purpose of testing.
type Service struct {
stop chan (struct{})
mgr *manager.Manager
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface could be defined here, not in pkg/node/manager/manager.go, this allows you to not need to import pkg/node/manager in this package.

pkg/hubble/peer/handler.go Outdated Show resolved Hide resolved
case <-s.stop:
return nil
case cn := <-h.C:
if err := stream.Send(cn); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't a context.Context part of the Send parameters?

@rolinh rolinh requested a review from a team as a code owner April 16, 2020 09:17
@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch 2 times, most recently from c301be8 to 01611db Compare April 17, 2020 13:03
@rolinh rolinh requested a review from gandro April 17, 2020 13:20
@rolinh
Copy link
Member Author

rolinh commented Apr 18, 2020

test-me-please

EDIT: test flake

@rolinh
Copy link
Member Author

rolinh commented Apr 20, 2020

restart-ginkgo

EDIT: hit flake again

case <-s.stop:
return nil
case cn := <-h.C:
if err := stream.Send(cn); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, yes, the code itself looks good! 🎉 We have to get a bit of experience with this, because I honestly have no clue how big the buffer size should be.

@rolinh
Copy link
Member Author

rolinh commented Apr 21, 2020

restart-ginkgo

EDIT: yet another flake :(
EDIT2: just rebased against latest master

@rolinh
Copy link
Member Author

rolinh commented Apr 21, 2020

test-me-please

EDIT: update to adapt to changes from e99049f

@rolinh
Copy link
Member Author

rolinh commented Apr 21, 2020

test-me-please

EDIT: rebased against master again to fix conflict introduced by f332fb5

@rolinh
Copy link
Member Author

rolinh commented Apr 21, 2020

test-me-please

@nebril
Copy link
Member

nebril commented Apr 21, 2020

test-gke

@nebril
Copy link
Member

nebril commented Apr 21, 2020

test-gke

@rolinh
Copy link
Member Author

rolinh commented Apr 22, 2020

Rebased against master to fix conflict with fef319a.

@rolinh
Copy link
Member Author

rolinh commented Apr 22, 2020

test-me-please

@nebril
Copy link
Member

nebril commented Apr 22, 2020

test-gke

@rolinh
Copy link
Member Author

rolinh commented Apr 22, 2020

Found a bug in the peer service implementation where the wrong context.Background() was used instead of the existing ctx.

@rolinh
Copy link
Member Author

rolinh commented Apr 22, 2020

test-me-please

This commits removes the global variable DefaultOptions which contained
no defaults but was just initializing the listeners map.
The map initialization is done in the constructor instead in order to
avoid problems with shallow copy of default options (a 2nd call to
NewServer would use modified defaults).

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Make sure to only require the grpc interface definition for service
options. The server is a simple grpc server which serves 1 or more
services and it doesn't need to know about the internal of each service
implementation. This change makes this explicit.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is useful for any package that needs to implement gRPC tests, thus
move it to a more generic location.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This interface will make it easier to implement tests, notably for the
peer service that uses the node manager.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit adds a new peer module for hubble that implements the peer
gRPC server interface.

This new gRPC service notifies of nodes being added, deleted or
modified. This information is obtained by subscribing to the cilium node
manager via a new datapath.NodeHandler interface implementation
(peer.handler).

This new service will be consumed by hubble proxy in order to connect to
hubble peers. It is intended to only be served over a local unix domain
socket.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commits enables the peer service for the hubble gRPC server that
listens on the local unix domain socket. This will allow node discovery
for the peer proxy.

For all other hubble listen addresses, the peer service is not enabled.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
@rolinh
Copy link
Member Author

rolinh commented Apr 22, 2020

rebased against master, fixed test failure uncovered by the context fix.

@rolinh
Copy link
Member Author

rolinh commented Apr 22, 2020

test-me-please

@nebril
Copy link
Member

nebril commented Apr 22, 2020

test-gke

@nebril
Copy link
Member

nebril commented Apr 22, 2020

test-gke

Edit: looks like it timed out: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/546/execution/node/112/log/

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@christarazi
Copy link
Member

test-gke

@rolinh
Copy link
Member Author

rolinh commented Apr 23, 2020

The GKE test was running for 10+ hours and stuck at this step (CI issue):

[Pipeline] Start of Pipeline
[Pipeline] node
Still waiting to schedule task
‘jenkins-node-gke-fixed0’ is offline

@rolinh
Copy link
Member Author

rolinh commented Apr 23, 2020

test-gke

@nebril nebril merged commit 2d0ae5b into master Apr 23, 2020
1.8.0 automation moved this from In progress to Merged Apr 23, 2020
@nebril nebril deleted the pr/rolinh/hubble-peer branch April 23, 2020 12:07
@tgraf tgraf mentioned this pull request May 6, 2020
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature This introduces new functionality. release-note/major This PR introduces major new functionality to Cilium. sig/hubble Impacts hubble server or relay
Projects
No open projects
1.8.0
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

7 participants