Introduce "cilium-health", a new tool for investigating cluster connectivity issues. #2052

Merged (16 commits) on Nov 30, 2017

Conversation

@joestringer joestringer commented Nov 16, 2017

This series extends the Cilium API to provide access to known nodes in the cluster and their IPs via /healthz (the same as #2139), and adds a new program to the tree: cilium-health.

When run as a daemon, cilium-health periodically probes the other nodes in the cluster for connectivity, starting with basic ICMP ping and HTTP probes. By default, cilium launches cilium-health locally when it starts. Probe results are cached and exposed over a REST API. cilium-health can also be run as a command-line client for the daemon, to fetch the connectivity status and format it. Currently, cilium-health exposes:

  • GET /hello
    • Returns success code with empty body, used for probing
  • GET /healthz
    • Daemon uptime
    • Node load (1min/5min/15min)
    • Cilium /healthz
  • GET /status
    • Timestamp of probe
    • Local node name
    • List of nodes
      • Name
      • IP
      • Latency to IPs
      • Errors, if any, attempting to reach IPs
  • PUT /status/probe
    • Returns the same result as GET /status, but triggers a probe first and returns only when that probe completes.
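
For illustration, the difference between the cached GET /status and the synchronous PUT /status/probe could be sketched roughly as follows. This is a minimal sketch and not the code in this PR: the `server`, `statusResponse` and `nodeStatus` types, the `probe` callback and the `do` helper are all hypothetical, and /healthz is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// nodeStatus and statusResponse are hypothetical shapes for illustration;
// they are not the models generated from the OpenAPI spec.
type nodeStatus struct {
	Name  string `json:"name"`
	IP    string `json:"ip"`
	Error string `json:"error,omitempty"`
}

type statusResponse struct {
	Timestamp string       `json:"timestamp"`
	Local     string       `json:"local"`
	Nodes     []nodeStatus `json:"nodes"`
}

// server caches the most recent probe result.
type server struct {
	mu     sync.Mutex
	cached statusResponse
	probe  func() statusResponse // hypothetical probe implementation
}

// runProbe runs one probe, stores the result in the cache and returns it.
func (s *server) runProbe() statusResponse {
	res := s.probe()
	s.mu.Lock()
	s.cached = res
	s.mu.Unlock()
	return res
}

// handler wires up the three resources sketched here.
func (s *server) handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // success code, empty body
	})
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		s.mu.Lock()
		res := s.cached
		s.mu.Unlock()
		json.NewEncoder(w).Encode(res) // cached result, no new probe
	})
	mux.HandleFunc("/status/probe", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		json.NewEncoder(w).Encode(s.runProbe()) // returns only after the probe
	})
	return mux
}

// do issues one in-memory request against h and returns status code and body.
func do(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	s := &server{probe: func() statusResponse {
		return statusResponse{
			Timestamp: time.Now().Format(time.RFC3339),
			Local:     "node-a",
			Nodes:     []nodeStatus{{Name: "node-b", IP: "192.0.2.10"}},
		}
	}}
	h := s.handler()
	reqs := [][2]string{{"GET", "/hello"}, {"GET", "/status"}, {"PUT", "/status/probe"}, {"GET", "/status"}}
	for _, r := range reqs {
		code, body := do(h, r[0], r[1])
		fmt.Printf("%s %s -> %d %s\n", r[0], r[1], code, body)
	}
}
```

In this sketch the first GET /status returns an empty cached result, while the GET /status issued after the PUT reflects the probe that the PUT triggered.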

Sample output

$ cilium-health status
Probe time:   2017-11-27T13:08:40-08:00
Nodes:
  cilium-master (localhost):
    Host connectivity to 10.0.2.15:
      ICMP:          OK, RTT=167.551µs
      HTTP via L3:   OK, RTT=780.628µs

Commandline options

$ ./cilium-health
Agent for hosting and querying the Cilium health status API

Usage:
  cilium-health [flags]
  cilium-health [command]

Available Commands:
  get         Display local cilium-health status
  ping        Check whether the cilium-health API is up
  status      Display cilium connectivity to other nodes

Flags:
      --admin string    Expose resources over 'unix' socket, 'any' socket (default "any")
  -c, --cilium string   URI to Cilium server API
  -d, --daemon          Run as a daemon
  -D, --debug           Enable debug messages
  -H, --host string     URI to cilum-health server API
  -i, --interval int    Interval (in seconds) for periodic connectivity probes (default 60)
  -p, --passive         Only respond to HTTP health checks

Use "cilium-health [command] --help" for more information about a command.

Next Tasks:

  • Launch from Cilium, similar to monitor server
  • Serve over TCP
    • Provide separate API path for connectivity checks
  • Prettify commandline output (non-JSON output)
  • Use RFC3339 formatting for timestamp
  • Extend cilium status to print the new fields exposed over the Cilium API
  • Provide access to raw json on commandline via "--json"
  • Provide current node name in output if available
  • Split IPs in API into primary / secondary
  • Add --passive option for only responding to probes
  • Force delete unix socket path on startup, similar to monitor
  • Add --admin option to toggle serving all APIs / unix socket

The scope of this PR is primarily to establish the groundwork for determining connectivity to other nodes. Endpoint connectivity and policy connectivity are likely to be addressed in a followup PR.

For more future tasks, see Issue #1947.

@joestringer joestringer added the kind/feature (This introduces new functionality.) and wip labels Nov 16, 2017
@joestringer joestringer requested a review from a team November 16, 2017 07:56
@joestringer joestringer requested review from a team as code owners November 16, 2017 07:56
if err := pinger.Run(); err != nil {
	log.WithError(err).Info("Failed to run ping")
	return nil, err
} else {

if block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary)
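
The shape this lint comment asks for can be sketched as below. It is only an illustration of the early-return pattern: `runPing` is a hypothetical stand-in for `pinger.Run()`, and `probe` is not a function from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// runPing is a hypothetical stand-in for pinger.Run() in the snippet under review.
func runPing(fail bool) error {
	if fail {
		return errors.New("ping failed")
	}
	return nil
}

// probe returns early on error, so the success path needs no else branch.
func probe(fail bool) (string, error) {
	if err := runPing(fail); err != nil {
		return "", err
	}
	// The code that used to live in the else block continues here, outdented.
	return "reachable", nil
}

func main() {
	fmt.Println(probe(false))
	fmt.Println(probe(true))
}
```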

}
}

// FetchStatusResponse() updates the cluster with the latest set of nodes,

comment on exported method Server.FetchStatusResponse should be of the form "FetchStatusResponse ..."

@joestringer joestringer added the release-note/major (This PR introduces major new functionality to Cilium.) label Nov 16, 2017
@joestringer joestringer changed the title from 'RFC: Add cilium-health client/daemon' to 'RFC: Introduce "cilium-health", a new tool for investigating cluster connectivity issues.' Nov 16, 2017
@aanm aanm left a comment

Looks good so far!

"github.com/go-openapi/strfmt"
)

type Client struct {

exported type Client should have comment or be unexported

@@ -101,3 +103,54 @@ func Hint(err error) error {
	}
	return fmt.Errorf("%s", e)
}

func FormatStatusResponse(w io.Writer, sr *models.StatusResponse) {

exported function FormatStatusResponse should have comment or be unexported

@cilium cilium deleted 15 comments from houndci-bot Nov 17, 2017
@joestringer joestringer force-pushed the submit/cilium-health branch 4 times, most recently from cfd9d94 to 8858645 November 29, 2017 22:18
"$ref": "../openapi.yaml#/definitions/Error"
"/status/probe":
put:
summary: Get connectivity status of the Cilium cluster
Member

The summary should already indicate that it is synchronous

Run synchronous connectivity probe to determine status of Cilium cluster

Member Author

Ack, will fix.

properties:
  cilium:
    description: Status of Cilium daemon
    "$ref": "../openapi.yaml#/definitions/StatusResponse"
Member

So we have a StatusResponse in both yaml files? Can we rename one of them?

Member Author

Sure, renamed the health status response.

@ianvernon ianvernon left a comment

Looks great overall - the size made it hard to fully grok some parts, but that is the nature of bootstrapping new components :)

Some small nits. You don't have to request another review from me.

}
}

// Run sends a single probes out to all of the other cilium nodes to gather
Member

... a single probe...

cilium --> Cilium

launcher.setStdout(stdout)
}

// Restart stops the daemon whilauncher will trigger a rerun.
Member

fix typo "whilauncher"

Member Author

Thanks for looking it over. I tried to split it up into reasonable separate patches for better review, but it is still quite large. I'll fix this up.

@joestringer joestringer force-pushed the submit/cilium-health branch 2 times, most recently from 302d36b to a7b512c November 30, 2017 01:36
// HealthStatusResponse Connectivity status to other daemons
// swagger:model HealthStatusResponse

type HealthStatusResponse struct {

exported type HealthStatusResponse should have comment or be unexported

This should live in pkg/node.

Signed-off-by: Joe Stringer <joe@covalent.io>
This makes it reusable from cilium-health client.

Signed-off-by: Joe Stringer <joe@covalent.io>
Export a list of known cluster nodes, including names, IPs and CIDR
allocations in the health status requested on a local node.

Signed-off-by: Joe Stringer <joe@covalent.io>
This is only used by some commands right now, so leave it hidden (since
it's a bit inconsistent). When each command implements json-formatting
the responses, we can make this not hidden.

Signed-off-by: Joe Stringer <joe@covalent.io>
Previously, the client package created the client by passing the entire
"unix://foo" or "http://foo" host string into the hostname field of the
client, however when creating a client using an "http://" host string
this would result in HTTP requests attempting to reach
"http://http://hostname".

Signed-off-by: Joe Stringer <joe@covalent.io>
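
To illustrate the host-string problem described in the commit message above, here is a minimal sketch of splitting the scheme from the address before building a client. `splitHost` is a hypothetical helper, not the function in the client package, and treating a bare string as a unix socket path is an assumption made only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitHost separates "scheme://address" into its two parts so that the
// scheme is never passed along as part of the hostname.
// Hypothetical helper; the default for bare strings is an assumption.
func splitHost(host string) (scheme, addr string) {
	if i := strings.Index(host, "://"); i >= 0 {
		return host[:i], host[i+len("://"):]
	}
	return "unix", host
}

func main() {
	for _, h := range []string{"unix:///var/run/foo.sock", "http://hostname"} {
		scheme, addr := splitHost(h)
		// Rebuilding the URL from the parts yields a single scheme prefix.
		fmt.Printf("%-26s scheme=%s addr=%s\n", h, scheme, addr)
	}
}
```

Passing only `addr` as the hostname and `scheme` separately avoids ending up with a request URL such as "http://http://hostname".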
goprocinfo provides a wrapper around /proc on linux systems, which will
be used to collect system load in an upcoming commit.

Signed-off-by: Joe Stringer <joe@covalent.io>
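
As a rough idea of what collecting system load from /proc involves, the sketch below parses the 1, 5 and 15 minute averages from the contents of /proc/loadavg using only the standard library. It does not use or mirror goprocinfo's API; `parseLoadAvg` is a hypothetical function written for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseLoadAvg extracts the 1, 5 and 15 minute load averages from the
// contents of /proc/loadavg, e.g. "0.42 0.36 0.30 1/123 4567".
// Hypothetical helper, not part of goprocinfo.
func parseLoadAvg(contents string) (load [3]float64, err error) {
	fields := strings.Fields(contents)
	if len(fields) < 3 {
		return load, fmt.Errorf("unexpected loadavg format: %q", contents)
	}
	for i := 0; i < 3; i++ {
		load[i], err = strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return load, err
		}
	}
	return load, nil
}

func main() {
	data, err := os.ReadFile("/proc/loadavg") // only present on Linux
	if err != nil {
		fmt.Println("cannot read load average:", err)
		return
	}
	load, err := parseLoadAvg(string(data))
	if err != nil {
		fmt.Println("cannot parse load average:", err)
		return
	}
	fmt.Printf("load 1min=%.2f 5min=%.2f 15min=%.2f\n", load[0], load[1], load[2])
}
```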
GET /status:
  Returns the connectivity status to all other cilium-health instances
  using interval-based probing.
GET /status/probe:
  Runs a synchronous probe to all other cilium-health instances and
  returns the connectivity status.
GET /healthz:
  Returns current health:
  * uptime
  * system load
  * local agent status

See also Issue cilium#1947

Signed-off-by: Joe Stringer <joe@covalent.io>
Signed-off-by: Joe Stringer <joe@covalent.io>
cilium-health is a new client and daemon which allows users to probe the
status of the local node, and also of the connectivity between the local
node and remote nodes in the cluster.

Signed-off-by: Joe Stringer <joe@covalent.io>
Also add the transitive dependencies on golang.org ICMP/IPv4/IPv6.

Signed-off-by: Joe Stringer <joe@covalent.io>
Signed-off-by: Joe Stringer <joe@covalent.io>
Add support to the cilium-health client and server to host a new /hello
API call over TCP socket 4240, and be able to make requests to it.

Pinging this port on all known cluster nodes is done periodically, just
like how regular ICMP ping polling is done.

Signed-off-by: Joe Stringer <joe@covalent.io>
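
The /hello probe described in the commit message above amounts to timing an HTTP GET against each node. A minimal sketch, assuming a plain `net/http` client: `probeHello` and `demoProbe` are hypothetical names, and the demo talks to a local test server rather than to port 4240 on a real node.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeHello issues GET <baseURL>/hello and returns the round-trip time.
// In a cluster, baseURL would look like "http://<node-ip>:4240".
func probeHello(client *http.Client, baseURL string) (time.Duration, error) {
	start := time.Now()
	resp, err := client.Get(baseURL + "/hello")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return time.Since(start), nil
}

// demoProbe starts a throwaway local server that answers /hello with the
// given status code and probes it once.
func demoProbe(status int) (time.Duration, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	client := srv.Client()
	client.Timeout = 2 * time.Second // avoid hanging on unreachable nodes
	return probeHello(client, srv.URL)
}

func main() {
	if rtt, err := demoProbe(http.StatusOK); err != nil {
		fmt.Println("HTTP probe failed:", err)
	} else {
		fmt.Printf("HTTP via L3:   OK, RTT=%s\n", rtt)
	}
}
```

A periodic prober would call something like `probeHello` for every known node on each interval and store the RTT or error per node.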
Signed-off-by: Joe Stringer <joe@covalent.io>
A generic launcher will be useful for an upcoming commit where the
Cilium daemon also launches the cilium-health daemon.

Signed-off-by: Joe Stringer <joe@covalent.io>
Signed-off-by: Joe Stringer <joe@covalent.io>
The new '--admin' option restricts which protocols the /healthz and
/status resources are served over. If specified as 'unix', it will only
serve these resources over the unix socket (if not otherwise disabled by
--passive); if specified as 'any' (default), these resources will also
be served over the HTTP sockets.

Signed-off-by: Joe Stringer <joe@covalent.io>
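
One way to read the '--admin' and '--passive' description above is as a small decision table per listener. The sketch below encodes that reading and should not be taken as the daemon's actual logic: `servedResources` is a hypothetical function, and it assumes that /hello is served on the TCP listener only and that '--passive' disables /healthz and /status on every listener.

```go
package main

import "fmt"

// servedResources returns the resources one listener would expose.
// listener is "unix" or "tcp"; admin is the --admin value ("unix" or "any").
// Hypothetical helper encoding the assumptions stated in the lead-in.
func servedResources(listener, admin string, passive bool) []string {
	var out []string
	if listener == "tcp" {
		out = append(out, "/hello") // probe target for other nodes
	}
	// Admin resources are always allowed on the unix socket, and on TCP
	// only when --admin=any.
	adminAllowed := listener == "unix" || admin == "any"
	if adminAllowed && !passive {
		out = append(out, "/healthz", "/status")
	}
	return out
}

func main() {
	for _, admin := range []string{"unix", "any"} {
		for _, listener := range []string{"unix", "tcp"} {
			fmt.Printf("--admin=%-4s %-4s listener: %v\n",
				admin, listener, servedResources(listener, admin, false))
		}
	}
}
```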
@tgraf tgraf merged commit 62b9d61 into cilium:master Nov 30, 2017
@joestringer joestringer deleted the submit/cilium-health branch December 1, 2017 18:25
Labels
  • area/monitor: Impacts monitoring, access logging, flow logging, visibility of datapath traffic.
  • kind/feature: This introduces new functionality.
  • release-note/major: This PR introduces major new functionality to Cilium.
6 participants