Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add agent notifications & access log records to monitor #2667

Merged
merged 3 commits into from
Jan 30, 2018
Merged

Conversation

tgraf
Copy link
Member

@tgraf tgraf commented Jan 28, 2018

This heavily extends the notifications sent via the monitor protocol to include:

  • Agent notifications for events such as agent start, endpoint regeneration, and so on.
  • L7 log records for API call access notifications

Examples:

Endpoint to endpoint:

<- endpoint 56326 flow 0x98e163f identity 264->0 state new ifindex 0: 10.11.129.91:36188 -> 10.0.2.3:53 udp
-> 0: 10.11.129.91:36188 -> 10.0.2.3:53 udp
<- host flow 0x2dd24244 identity 1->0 state new ifindex enp0s3: 10.0.2.3:53 -> 10.11.129.91:36188 udp
-> endpoint 56326 flow 0x2dd24244 identity 1->264 state reply ifindex lxca3056: 10.0.2.3:53 -> 10.11.129.91:36188 udp
<- endpoint 56326 flow 0xe0f8d00c identity 268->0 state new ifindex 0: 10.11.129.91 -> 213.144.137.178 EchoRequest
-> 0: 10.11.129.91 -> 213.144.137.178 EchoRequest
<- host flow 0xe0f8d00c identity 1->0 state new ifindex enp0s3: 213.144.137.178 -> 10.11.129.91 EchoReply
-> endpoint 56326 flow 0xe0f8d00c identity 1->268 state reply ifindex lxca3056: 213.144.137.178 -> 10.11.129.91 EchoReply

Proxy redirection

<- endpoint 56326 flow 0xd4316c25 identity 268->0 state new ifindex 0: 10.11.129.91:55334 -> 172.217.19.14:80 tcp SYN
-> proxy flow 0xd4316c25 identity 268->0 state new ifindex cilium_net: 10.11.129.91:55334 -> 172.217.19.14:80 tcp SYN
-> proxy port 10000: 10.11.129.91:55334 -> 10.11.28.238:10000 tcp SYN
-> cilium_net: 10.11.129.91:55334 -> 10.11.28.238:10000 tcp SYN
<- host flow 0x68f94f8f identity 1->0 state new ifindex 0: 10.11.28.238:10000 -> 10.11.129.91:55334 tcp SYN, ACK
-> proxy port 0: 172.217.19.14:80 -> 10.11.129.91:55334 tcp SYN, ACK
-> endpoint 56326 flow 0x68f94f8f identity 1->268 state reply ifindex lxca3056: 172.217.19.14:80 -> 10.11.129.91:55334 tcp SYN, ACK

Endpoint regeneration messages:

-> endpoint 29381 flow 0x2a545f24 identity 1->322 state reply ifindex lxcc8154: 10.0.2.3:53 -> 10.11.193.100:56446 udp
>> Endpoint regenerated: 29381 (container:id.foo)
<- endpoint 29381 flow 0xb0cb1cf6 identity 322->0 state new ifindex 0: 10.11.193.100:48454 -> 10.0.2.3:53 udp

L7 access log:

-> endpoint 29381 flow 0x37a5c90b identity 1->322 state reply ifindex lxcc8154: 172.217.19.14:80 -> 10.11.193.100:41252 tcp ACK
-> http from 29381 ([container:id.foo]) to 0 ([reserved:world]), identity 322->2, verdict Forwarded GET http://google.com/ => 0
<- proxy flow 0x8922ff53 identity 322->0 state new ifindex 0: 10.11.28.238:43064 -> 172.217.19.14:80 tcp SYN

Fixes: #2379

Add agent notifications & access log records to monitor

@tgraf tgraf added kind/bug This is a bug in the Cilium logic. kind/enhancement This would improve or streamline existing functionality. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/monitor Impacts monitoring, access logging, flow logging, visibility of datapath traffic. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Jan 28, 2018
@tgraf tgraf requested a review from a team January 28, 2018 15:42
@tgraf tgraf requested a review from a team as a code owner January 28, 2018 15:42
@tgraf
Copy link
Member Author

tgraf commented Jan 28, 2018

test-me-please

@tgraf
Copy link
Member Author

tgraf commented Jan 28, 2018

test-me-please

@tgraf tgraf requested a review from a team January 29, 2018 13:29
@tgraf tgraf requested review from a team as code owners January 29, 2018 13:29

// Must be synchronized with <bpf/lib/common.h>
const (
// 0-128 are reserved for BPF datapath events

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported const MessageTypeUnspec should be of the form "MessageTypeUnspec ..."

type AgentNotification uint32

const (
AgentNotifyUnspec AgentNotification = iota

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported const AgentNotifyUnspec should have comment (or a comment on this block) or be unexported

@@ -37,9 +47,25 @@ func (m *NodeMonitor) GetPid() int {
}

// Run starts the node monitor.
func (nm *NodeMonitor) Run() {
func (nm *NodeMonitor) Run(sockPath string) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

receiver name nm should be consistent with previous receiver name m for NodeMonitor

@@ -1345,3 +1349,62 @@ func (d *Daemon) GetServiceList() []*models.Service {
}
return list
}

func (d *Daemon) SendNotification(typ monitor.AgentNotification, text string) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported method Daemon.SendNotification should have comment or be unexported

@tgraf
Copy link
Member Author

tgraf commented Jan 29, 2018

test-me-please

@tgraf tgraf added release-note/major This PR introduces major new functionality to Cilium. and removed release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Jan 29, 2018
@tgraf tgraf changed the title Improved trace messages for the monitor Add agent notifications & access log records to monitor Jan 29, 2018
@tgraf
Copy link
Member Author

tgraf commented Jan 29, 2018

test-me-please

@tgraf
Copy link
Member Author

tgraf commented Jan 29, 2018

test-me-please

@tgraf
Copy link
Member Author

tgraf commented Jan 29, 2018

test-me-please

mutex.Unlock()

sleep:
time.Sleep(time.Duration(15) * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be easier to have a function to add/remove interfaces whenever a new endpoint is added or removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows resolving of all interface names including cilium_host, cilium_net, cilium_vxlan, and native interfaces.

n.traceSummary(), n.Hash, n.SrcLabel, n.DstLabel,
connState(n.Reason), GetConnectionSummary(data[TraceNotifyLen:]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's going to be a lock/unlock for every single packet now. Won't this make the monitor slower and the connections slower as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bottleneck will be the gob decode/encode. The lock is absolutely fine here.

labels = e.SecLabel.Labels.String()
}

return fmt.Sprintf("%d (%s)", e.ID, labels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the labels are not important. I would replace it with the security ID + container ID (or perhaps name) to be less verbose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the labels are absolutely ideal as they typically contain the application name. Happy to add the container ID but I think it will matter less and less in a cluster enviroment where such IDs are meaningless as they have a scope of a node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, we should try to hide runtime specific identification such as the container ID as we are very likely to handle non containerised endpoints in the future.

* TRACE_TO_HOST was only emitted if policy enforcement was disabled
* Add TRACE_FROM_* traces to enable monitor to show any packet being processed
  and to provide the initial processing context. This guarantees that ever
  packet being processed is being shown in the monitor.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
The monitor will now notify on endpoint regenerations:
>> Endpoint regenerated: 29898 (reserved:health)
>> Endpoint regenerated: 29381 (container:id.foo)

and accesslog entries:
-> http from 29381 ([container:id.foo]) to 0 ([reserved:world]), identity 322->2, verdict Forwarded GET http://google.com/ => 0

Signed-off-by: Thomas Graf <thomas@cilium.io>
Copy link
Member

@ianvernon ianvernon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLI / CI / API changes look fine. Have a few questions, nothing blocking though.

@@ -766,9 +767,12 @@ func (e *Endpoint) Regenerate(owner Owner, reason string) <-chan bool {
buildSuccess = false
scopedLog.WithError(err).Warn("Regeneration of endpoint program failed")
e.LogStatus(BPF, Failure, "Error regenerating endpoint: "+err.Error())
owner.SendNotification(monitor.AgentNotifyEndpointRegenerateFail,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return value of this function is being ignored - should we log at debug level if it fails? Potential downside of clogging the logs, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it and initially decided against it. This is still low volume so I have nothing against logging as debug. Will add it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

} else {
buildSuccess = true
e.LogStatusOK(BPF, "Successfully regenerated endpoint program due to "+reason)
owner.SendNotification(monitor.AgentNotifyEndpointRegenerateSuccess, e.getIDandLabels())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above:
The return value of this function is being ignored - should we log at debug level if it fails? Potential downside of clogging the logs, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@manalibhutiyani
Copy link

@tgraf I believe this will also fix #2379 ?

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor IPv4/IPv6 tracing discrepancy introduced by rearranging POLICY_EGRESS.

send_trace_notify(skb, TRACE_TO_HOST, SECLABEL, HOST_ID, 0, HOST_IFINDEX,
forwarding_reason);

#ifndef POLICY_EGRESS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think previously these POLICY_EGRESS ifdefs avoided calling send_trace_notify() because it was expected that the cilium_reserved_policy tail call would do the appropriate tracing.

If we want the tracing like this patch has changed it, we should apply the same to IPv4; otherwise we should probably drop these changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joestringer I assume you are talking about the following:

        if (policy_can_access(&POLICY_MAP, skb, src_label, 0, 0, 0, NULL) != TC_ACT_OK) {
                return send_drop_notify(skb, src_label, SECLABEL, 0, ifindex,
                                        DROP_POLICY, TC_ACT_SHOT);
        } else {
                cilium_dbg_capture(skb, DBG_CAPTURE_DELIVERY, ifindex);

The main issue I have with the current code is that the TO_HOST trace is only emitted if policy is disabled. The policy code will later emit a debug capture but it is subject to debugging being enabled whereas tracing is not.

I will add a debug capture just before the redirect() in the policy disabled case so we have symmetry and also emit a debug capture for the policy disabled case.

Regarding IPv6, good catch!

bpf/bpf_netdev.c Outdated
@@ -427,6 +448,7 @@ int from_netdev(struct __sk_buff *skb)

#ifdef ENABLE_IPV4
case bpf_htons(ETH_P_IP):
skb->cb[0] = proxy_identity;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skb->cb[CB_SRC_LABEL], perhaps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

bpf/bpf_netdev.c Outdated
@@ -395,7 +398,8 @@ static inline int handle_ipv4(struct __sk_buff *skb)

__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV4) int tail_handle_ipv4(struct __sk_buff *skb)
{
int ret = handle_ipv4(skb);
__u32 proxy_identity = skb->cb[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skb->cb[CB_SRC_LABEL], perhaps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@tgraf
Copy link
Member Author

tgraf commented Jan 29, 2018

@tgraf I believe this will also fix #2379 ?

Correct. Will add Fixes tag.

@manalibhutiyani
Copy link

Adding release notes as well

@tgraf
Copy link
Member Author

tgraf commented Jan 29, 2018

test-me-please

@tgraf
Copy link
Member Author

tgraf commented Jan 29, 2018

@manalibhutiyani Label is present, I'm using the title of the PR. Do you think it needs to be more verbose?

@joestringer Feedback addressed

@aanm Let me know if you are OK as-is.

@manalibhutiyani
Copy link

@tgraf :I actually meant "I am adding the label". It wasn't there before, hence added it. Looks good 👍

@@ -393,9 +396,12 @@ static inline int handle_ipv4(struct __sk_buff *skb)
return TC_ACT_OK;
}

#define CB_SRC_IDENTITY 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to organise this next to the bpf/lib/common.h CB* declarations just so it's easier to trace them through the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually want to move away from that as the single declaration doesn't make it clear what it is used where. I'd rather list it above the tail call to make it super clear which cb fields are being used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yep, cool. We can move towards that.

(Not sure how we can try to make this verifiable that we don't accidentally reuse some field for an unintended purpose.. something to think about)

@tgraf tgraf merged commit a1b5404 into master Jan 30, 2018
@tgraf tgraf deleted the fix-monitor branch January 30, 2018 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/monitor Impacts monitoring, access logging, flow logging, visibility of datapath traffic. kind/bug This is a bug in the Cilium logic. kind/enhancement This would improve or streamline existing functionality. release-note/major This PR introduces major new functionality to Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants