-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add agent notifications & access log records to monitor #2667
Conversation
test-me-please |
test-me-please |
|
||
// Must be synchronized with <bpf/lib/common.h> | ||
const ( | ||
// 0-128 are reserved for BPF datapath events |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on exported const MessageTypeUnspec should be of the form "MessageTypeUnspec ..."
type AgentNotification uint32 | ||
|
||
const ( | ||
AgentNotifyUnspec AgentNotification = iota |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported const AgentNotifyUnspec should have comment (or a comment on this block) or be unexported
@@ -37,9 +47,25 @@ func (m *NodeMonitor) GetPid() int { | |||
} | |||
|
|||
// Run starts the node monitor. | |||
func (nm *NodeMonitor) Run() { | |||
func (nm *NodeMonitor) Run(sockPath string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
receiver name nm should be consistent with previous receiver name m for NodeMonitor
@@ -1345,3 +1349,62 @@ func (d *Daemon) GetServiceList() []*models.Service { | |||
} | |||
return list | |||
} | |||
|
|||
func (d *Daemon) SendNotification(typ monitor.AgentNotification, text string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported method Daemon.SendNotification should have comment or be unexported
test-me-please |
test-me-please |
test-me-please |
test-me-please |
pkg/bpfdebug/ifindex.go
Outdated
mutex.Unlock() | ||
|
||
sleep: | ||
time.Sleep(time.Duration(15) * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't be easier to have a function to add/remove interfaces whenever a new endpoint is added or removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows resolving of all interface names including cilium_host, cilium_net, cilium_vxlan, and native interfaces.
pkg/bpfdebug/trace.go
Outdated
n.traceSummary(), n.Hash, n.SrcLabel, n.DstLabel, | ||
connState(n.Reason), GetConnectionSummary(data[TraceNotifyLen:])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's going to be a lock/unlock for every single packet now. Won't this make the monitor slower and the connections slower as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bottleneck will be the gob decode/encode. The lock is absolutely fine here.
labels = e.SecLabel.Labels.String() | ||
} | ||
|
||
return fmt.Sprintf("%d (%s)", e.ID, labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the labels are not important. I would replace it with the security ID + container ID (or perhaps name) to be less verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the labels are absolutely ideal as they typically contain the application name. Happy to add the container ID but I think it will matter less and less in a cluster enviroment where such IDs are meaningless as they have a scope of a node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, we should try to hide runtime specific identification such as the container ID as we are very likely to handle non containerised endpoints in the future.
* TRACE_TO_HOST was only emitted if policy enforcement was disabled * Add TRACE_FROM_* traces to enable monitor to show any packet being processed and to provide the initial processing context. This guarantees that ever packet being processed is being shown in the monitor. Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
The monitor will now notify on endpoint regenerations: >> Endpoint regenerated: 29898 (reserved:health) >> Endpoint regenerated: 29381 (container:id.foo) and accesslog entries: -> http from 29381 ([container:id.foo]) to 0 ([reserved:world]), identity 322->2, verdict Forwarded GET http://google.com/ => 0 Signed-off-by: Thomas Graf <thomas@cilium.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CLI / CI / API changes look fine. Have a few questions, nothing blocking though.
@@ -766,9 +767,12 @@ func (e *Endpoint) Regenerate(owner Owner, reason string) <-chan bool { | |||
buildSuccess = false | |||
scopedLog.WithError(err).Warn("Regeneration of endpoint program failed") | |||
e.LogStatus(BPF, Failure, "Error regenerating endpoint: "+err.Error()) | |||
owner.SendNotification(monitor.AgentNotifyEndpointRegenerateFail, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return value of this function is being ignored - should we log at debug level if it fails? Potential downside of clogging the logs, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it and initially decided against it. This is still low volume so I have nothing against logging as debug. Will add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
} else { | ||
buildSuccess = true | ||
e.LogStatusOK(BPF, "Successfully regenerated endpoint program due to "+reason) | ||
owner.SendNotification(monitor.AgentNotifyEndpointRegenerateSuccess, e.getIDandLabels()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above:
The return value of this function is being ignored - should we log at debug level if it fails? Potential downside of clogging the logs, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor IPv4/IPv6 tracing discrepancy introduced by rearranging POLICY_EGRESS.
send_trace_notify(skb, TRACE_TO_HOST, SECLABEL, HOST_ID, 0, HOST_IFINDEX, | ||
forwarding_reason); | ||
|
||
#ifndef POLICY_EGRESS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think previously these POLICY_EGRESS
ifdefs avoided calling send_trace_notify()
because it was expected that the cilium_reserved_policy
tail call would do the appropriate tracing.
If we want the tracing like this patch has changed it, we should apply the same to IPv4; otherwise we should probably drop these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joestringer I assume you are talking about the following:
if (policy_can_access(&POLICY_MAP, skb, src_label, 0, 0, 0, NULL) != TC_ACT_OK) {
return send_drop_notify(skb, src_label, SECLABEL, 0, ifindex,
DROP_POLICY, TC_ACT_SHOT);
} else {
cilium_dbg_capture(skb, DBG_CAPTURE_DELIVERY, ifindex);
The main issue I have with the current code is that the TO_HOST
trace is only emitted if policy is disabled. The policy code will later emit a debug capture but it is subject to debugging being enabled whereas tracing is not.
I will add a debug capture just before the redirect()
in the policy disabled case so we have symmetry and also emit a debug capture for the policy disabled case.
Regarding IPv6, good catch!
bpf/bpf_netdev.c
Outdated
@@ -427,6 +448,7 @@ int from_netdev(struct __sk_buff *skb) | |||
|
|||
#ifdef ENABLE_IPV4 | |||
case bpf_htons(ETH_P_IP): | |||
skb->cb[0] = proxy_identity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skb->cb[CB_SRC_LABEL]
, perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
bpf/bpf_netdev.c
Outdated
@@ -395,7 +398,8 @@ static inline int handle_ipv4(struct __sk_buff *skb) | |||
|
|||
__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV4) int tail_handle_ipv4(struct __sk_buff *skb) | |||
{ | |||
int ret = handle_ipv4(skb); | |||
__u32 proxy_identity = skb->cb[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skb->cb[CB_SRC_LABEL]
, perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Adding release notes as well |
test-me-please |
@manalibhutiyani Label is present, I'm using the title of the PR. Do you think it needs to be more verbose? @joestringer Feedback addressed @aanm Let me know if you are OK as-is. |
@tgraf :I actually meant "I am adding the label". It wasn't there before, hence added it. Looks good 👍 |
@@ -393,9 +396,12 @@ static inline int handle_ipv4(struct __sk_buff *skb) | |||
return TC_ACT_OK; | |||
} | |||
|
|||
#define CB_SRC_IDENTITY 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to organise this next to the bpf/lib/common.h
CB* declarations just so it's easier to trace them through the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually want to move away from that as the single declaration doesn't make it clear what it is used where. I'd rather list it above the tail call to make it super clear which cb fields are being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yep, cool. We can move towards that.
(Not sure how we can try to make this verifiable that we don't accidentally reuse some field for an unintended purpose.. something to think about)
This heavily extends the notifications sent via the monitor protocol to include:
Examples:
Endpoint to endpoint:
Proxy redirection
Endpoint regeneration messages:
L7 access log:
Fixes: #2379