
health: Update Cilium agent to listen on nodeip #26845

Merged: 1 commit merged into cilium:main from the tamilmani/fixOpenPort branch on Aug 8, 2023

Conversation

@tamilmani1989 (Contributor) commented on Jul 14, 2023

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

This PR updates the agent health server to listen on the node IP instead of on all interfaces.
Fixes: #23353
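
For context, a minimal standalone sketch of the idea (illustrative only, not the actual agent code; the node IP is hard-coded and the /hello handler is a stand-in for the real health endpoint on port 4240):

    package main

    import (
        "fmt"
        "net"
        "net/http"
    )

    func main() {
        // Placeholder node IP; the agent discovers the real one at runtime.
        nodeIP := net.ParseIP("192.0.2.10")

        // Before: the health endpoint effectively listened on ":4240",
        // i.e. on all interfaces. After: bind only to the node IP.
        addr := net.JoinHostPort(nodeIP.String(), "4240")

        mux := http.NewServeMux()
        mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "ok")
        })

        if err := http.ListenAndServe(addr, mux); err != nil {
            fmt.Println("health listener failed:", err)
        }
    }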

@tamilmani1989 requested a review from a team as a code owner on July 14, 2023 at 23:05
@maintainer-s-little-helper (bot) added the dont-merge/needs-release-note-label label (the author needs to describe the release impact of these changes) on Jul 14, 2023
@github-actions (bot) added the kind/community-contribution label (this was a contribution made by a community member) on Jul 14, 2023
@pippolo84 (Member) left a review comment

Thanks for your PR @tamilmani1989. Changes look good on the agent side, but I think we should leave the cilium-health-responder behavior as is.

Review comment on cilium-health/responder/main.go (outdated, resolved)
@pippolo84 added the area/daemon (impacts operation of the Cilium daemon) and release-note/minor (changes functionality that users may find relevant to operating Cilium) labels on Jul 17, 2023
@maintainer-s-little-helper (bot) removed the dont-merge/needs-release-note-label label on Jul 17, 2023
@pippolo84 added the release-note/bug (fixes an issue in a previous release of Cilium) and sig/agent (Cilium agent related) labels and removed the release-note/minor label on Jul 17, 2023
@pippolo84 (Member) left a review comment

Changes look good, thanks!

One last thing: could you please change the commit message to adhere to the guide here? Maybe a brief description of the solved issue and a reference to it in a Fixes: ... line. For the title, I suggest something like health: Update daemon to listen on node ip.
Thanks! 🙏

@tamilmani1989 changed the title from "[Fix] Update agenthealth to listen on node IP" to "health: Update Cilium agent to listen on nodeip" on Jul 18, 2023
@pippolo84 (Member) commented

Hi @tamilmani1989, what I meant was to change the commit message like this:

health: Update daemon to listen on node ip

Update agenthealth to listen on internal node ip instead of listening on
all interfaces.

Fixes: #23353
    
Signed-off-by: Tamilmani <tamanoha@microsoft.com>

I was not referring to the PR title, but to the first line of the commit message. Sorry for the confusion.
Can you please update the commit message one last time so I can approve the PR? Thanks in advance! 🙏

@pippolo84 (Member) left a review comment

Nice! 🚀

@pippolo84 (Member) commented

@tamilmani1989 the CI failures are unrelated to this PR. The underlying issue has already been fixed in main; could you please rebase?

@pippolo84 (Member) commented

/test

@tamilmani1989 (Contributor, Author) commented

@pippolo84 anything else required for this PR?

@pippolo84 (Member) commented

The CI failures seem legitimate; are you up for investigating them? I think they all share the same root cause.

@maintainer-s-little-helper (bot) commented

Commit 31aaf55006f1772c0e2af222a824fcb78c4e0cbd does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper (bot) added the dont-merge/needs-sign-off label (the author needs to add signoff to their commits before merge) on Jul 26, 2023
@maintainer-s-little-helper (bot) removed the dont-merge/needs-sign-off label on Jul 26, 2023
@tamilmani1989 (Contributor, Author) commented

/test

@tamilmani1989 (Contributor, Author) commented

/test

1 similar comment
@tamilmani1989 (Contributor, Author) commented

/test

@tamilmani1989 (Contributor, Author) commented on Jul 31, 2023

@pippolo84 I'm not sure why the external workloads CI is failing. It doesn't look like a regular Cilium agent running on a node. Is there any way I can reproduce it locally? My guess is that Cilium is running as a non-host-networking pod on that node. If that's the case, is there any way to detect this in the Cilium agent code? The rest of the failures look flaky.

This is the failure:

cilium-health validation failed: "cilium-agent 'kube-system/cilium-w7lzk': connectivity to path 'cilium-cilium-5707680440-1-vm/cilium-cilium-5707680440-1-vm-6.host.primary-address.http.status' is unhealthy: 'Get "http://10.202.0.30:4240/hello": dial tcp 10.202.0.30:4240: connect: connection refused'", retrying...
NAME                              CILIUM ID   IP
cilium-cilium-5707680440-1-vm-6   21875       10.202.0.30

@pippolo84 (Member) left a review comment

Thanks for this next iteration! 🙏
I've left some comments, mainly around the style.

To set up a local cluster with external workloads you can refer to the doc here.
I think you can check option.Config.JoinCluster to determine whether Cilium is running in a VM, since the feature is enabled based on that config flag.
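
For illustration, a small self-contained sketch of that suggestion (the helper name and hard-coded inputs are hypothetical; in the real code base the values would come from option.Config and the node package):

    package main

    import (
        "fmt"
        "net"
    )

    // listenHosts sketches the suggested check: if the agent joined the
    // cluster as an external workload (a VM, where no node InternalIP
    // exists), keep the old wildcard behavior; otherwise bind to the node IP.
    func listenHosts(joinCluster bool, nodeIP net.IP) []string {
        if joinCluster || nodeIP == nil {
            return []string{"0.0.0.0", "::"}
        }
        return []string{nodeIP.String()}
    }

    func main() {
        fmt.Println(listenHosts(true, nil))                        // [0.0.0.0 ::]
        fmt.Println(listenHosts(false, net.ParseIP("192.0.2.10"))) // [192.0.2.10]
    }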

Review comments (all outdated, resolved) on:
  • pkg/health/probe/responder/responder.go (3 comments)
  • pkg/health/probe/responder/responder_test.go
  • pkg/health/server/server.go
  • pkg/node/types/node.go
@tamilmani1989 (Contributor, Author) commented

/test

@tamilmani1989 (Contributor, Author) commented on Aug 1, 2023

> Thanks for this next iteration! 🙏 I've left some comments, mainly around the style.
>
> To set up a local cluster with external workloads you can refer to the doc here. I think you can check option.Config.JoinCluster to determine whether Cilium is running in a VM, since the feature is enabled based on that config flag.

I spent some time trying to create a repro but couldn't. When I tried creating an external workload following that doc, I lost the connection to the VM every time I ran the ./install-external-workload.sh script, and I couldn't access the VM afterwards. I do see the cew resource created for that external workload, though:

$ k get cew
NAME     CILIUM ID   IP
extvm3   29819       10.10.0.7

I skipped this change for the external node case for now; if needed, I can open a separate issue for that.

@tamilmani1989 (Contributor, Author) commented

These CI failures look flaky, as the tests succeeded in the previous run.

@pippolo84 (Member) commented

> These CI failures look flaky, as the tests succeeded in the previous run.

Agreed. Let's try a re-run, and if the flakiness is confirmed I'll open an issue.

@pippolo84 (Member) commented

CI is green now, awesome! 💯
Apart from the last review feedback to address, there is one minor thing left: could you please add a comment where we check option.Config.JoinCluster to explain why it is needed? 🙏 Thanks!

@tamilmani1989 (Contributor, Author) commented

/test

@pippolo84 (Member) commented

Hey @tamilmani1989, I took a closer look at the latest changes and I think we can avoid introducing the helpers GetInternalIPv4 and GetInternalIPv6, relying on the already existing node.GetIPv4 and node.GetIPv6 instead.
Regarding the external workloads, I think what you observed is due to the agent running on a VM. In that case we just have to keep the old logic, since there is no such thing as an InternalIP. To do that, I suggest introducing a helper to get the right addresses based on the global options. I did it here and it is working fine (don't mind the failures, they are just flakes; the health-related tests are passing).
So, I suggest the following:

  • remove the helpers GetInternalIPv4 and GetInternalIPv6
  • refactor the code to introduce a helper for the server addresses, as done here
  • change the comment to remove the reference to the issue and state that in the external workload case we listen on all interfaces and all address families
  • squash the commits and fix all their messages to be compliant with the guidelines here (please take a careful look at point 5)

Thanks in advance!

@tamilmani1989 (Contributor, Author) commented on Aug 4, 2023

> Hey @tamilmani1989, I took a closer look at the latest changes and I think we can avoid introducing the helpers GetInternalIPv4 and GetInternalIPv6, relying on the already existing node.GetIPv4 and node.GetIPv6 instead. Regarding the external workloads, I think what you observed is due to the agent running on a VM. In that case we just have to keep the old logic, since there is no such thing as an InternalIP. To do that, I suggest introducing a helper to get the right addresses based on the global options. I did it here and it is working fine (don't mind the failures, they are just flakes; the health-related tests are passing). So, I suggest the following:
>
>   • remove the helpers GetInternalIPv4 and GetInternalIPv6
>   • refactor the code to introduce a helper for the server addresses, as done here
>   • change the comment to remove the reference to the issue and state that in the external workload case we listen on all interfaces and all address families
>   • squash the commits and fix all their messages to be compliant with the guidelines here (please take a careful look at point 5)
>
> Thanks in advance!

The reason I added the new API GetInternalIPv4 is that the existing node.GetIPv4 may return the node's External IP if the Internal IP doesn't exist, and Cilium would then be listening on <external_ip>:4240. Is that the desired behavior?

The comment for that function reads:

// GetIPv4 returns one of the IPv4 node address available with the following
// priority:
// - NodeInternalIP
// - NodeExternalIP
// - other IP address type.
// It must be reachable on the network.
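
To illustrate the concern, a hedged sketch (the type and helper names below are made up for the example and do not match Cilium's pkg/node internals): a priority-based getter can silently fall back to the ExternalIP, while an internal-only getter returns nil instead.

    package main

    import (
        "fmt"
        "net"
    )

    // nodeAddress is a stand-in for the node address list kept by the agent.
    type nodeAddress struct {
        Type string // "InternalIP", "ExternalIP", ...
        IP   net.IP
    }

    // getIPv4 mirrors the documented priority above: prefer the InternalIP,
    // then fall back to the ExternalIP, then to any other IPv4 address.
    func getIPv4(addrs []nodeAddress) net.IP {
        for _, want := range []string{"InternalIP", "ExternalIP", ""} {
            for _, a := range addrs {
                if a.IP.To4() == nil {
                    continue
                }
                if want == "" || a.Type == want {
                    return a.IP
                }
            }
        }
        return nil
    }

    // getInternalIPv4 sketches the stricter helper: it returns only an
    // InternalIP (or nil), so the health server never binds to an external
    // address by accident.
    func getInternalIPv4(addrs []nodeAddress) net.IP {
        for _, a := range addrs {
            if a.Type == "InternalIP" && a.IP.To4() != nil {
                return a.IP
            }
        }
        return nil
    }

    func main() {
        addrs := []nodeAddress{{Type: "ExternalIP", IP: net.ParseIP("198.51.100.7")}}
        fmt.Println(getIPv4(addrs))         // 198.51.100.7
        fmt.Println(getInternalIPv4(addrs)) // <nil>
    }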

@pippolo84 (Member) commented

> The reason I added the new API GetInternalIPv4 is that the existing node.GetIPv4 may return the node's External IP if the Internal IP doesn't exist, and Cilium would then be listening on <external_ip>:4240. Is that the desired behavior?
>
> The comment for that function reads:
>
> // GetIPv4 returns one of the IPv4 node address available with the following
> // priority:
> // - NodeInternalIP
> // - NodeExternalIP
> // - other IP address type.
> // It must be reachable on the network.

The only case I can think of where the InternalIP won't be available is the external workload one, but we already take care of that separately. Anyway, there's no harm in being extra careful, so let's keep your helpers as you suggested 👍

@tamilmani1989 (Contributor, Author) commented

If GetIPv4 or GetIPv6 fails for some reason, the Cilium agent would not listen on that IP family, since the address is not appended to the address list. So we may need this piece of code as well to address that (it was removed in your version):

	// If at least one address was found, fall back to the wildcard address
	// for any enabled IP family whose node IP lookup returned nil, so that
	// the health server still serves that family.
	if len(addresses) != 0 {
		if option.Config.EnableIPv4 && ipv4 == nil {
			addresses = append(addresses, "0.0.0.0")
		}
		if option.Config.EnableIPv6 && ipv6 == nil {
			addresses = append(addresses, "::")
		}
	}

@tamilmani1989 (Contributor, Author) commented on Aug 4, 2023

@pippolo84 I updated the PR based on the comments above and what I felt was right. I'm open to hearing any alternative suggestions and can update accordingly. Thanks for taking the time to review this, I appreciate it. 👍

@tamilmani1989 (Contributor, Author) commented

/test

@pippolo84 (Member) left a review comment

Thank you for the further iteration 💯
I agree with your analysis regarding GetInternalIPv4 and GetInternalIPv6 returning nil. I've left a suggestion to improve the readability of the code when handling those cases.

Review comments on pkg/health/server/server.go (2 threads, resolved)
@tamilmani1989 force-pushed the tamilmani/fixOpenPort branch 2 times, most recently from a095428 to 1242c13 on August 7, 2023 at 18:56
health: Update Cilium agent to listen on nodeip

Update agenthealth to listen on the internal node IP instead of listening on
all interfaces, except for the external workload scenario, where there is no
internal IP concept. Listen on IPv4, IPv6, or both, based on the Cilium config
flags EnableIPv4 and EnableIPv6. If for some reason the node internal IP
cannot be obtained, listen on all addresses.

Fixes: cilium#23353

Signed-off-by: Tamilmani <tamanoha@microsoft.com>
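
Putting the pieces from this thread together, a hedged, self-contained sketch of the behavior the commit message describes (the helper name and its plain parameters are illustrative; the merged code reads these values from option.Config and the node package):

    package main

    import (
        "fmt"
        "net"
    )

    // healthServerAddresses sketches the final behavior: listen on the node's
    // internal IPs per enabled address family, fall back to the wildcard
    // address for a family whose internal IP could not be determined, and
    // listen on everything in the external workload (JoinCluster) case.
    func healthServerAddresses(joinCluster, enableIPv4, enableIPv6 bool, internalIPv4, internalIPv6 net.IP) []string {
        if joinCluster {
            // External workload (VM): there is no internal node IP concept.
            return []string{"0.0.0.0", "::"}
        }
        var addrs []string
        if enableIPv4 {
            if internalIPv4 != nil {
                addrs = append(addrs, internalIPv4.String())
            } else {
                addrs = append(addrs, "0.0.0.0")
            }
        }
        if enableIPv6 {
            if internalIPv6 != nil {
                addrs = append(addrs, internalIPv6.String())
            } else {
                addrs = append(addrs, "::")
            }
        }
        return addrs
    }

    func main() {
        fmt.Println(healthServerAddresses(false, true, true, net.ParseIP("192.0.2.10"), nil))
        // Output: [192.0.2.10 ::]
    }
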
@pippolo84 (Member) left a review comment

Thanks! This turned out to be trickier than expected, great job! 🚀

@pippolo84 (Member) commented

/test

@maintainer-s-little-helper (bot) added the ready-to-merge label (this PR has passed all tests and received consensus from code owners to merge) on Aug 8, 2023
@ldelossa (Contributor) commented on Aug 8, 2023

All required tests are passing. Merging.

@ldelossa merged commit ef50f3d into cilium:main on Aug 8, 2023 (57 of 59 checks passed)
Labels
  • area/daemon: Impacts operation of the Cilium daemon.
  • kind/community-contribution: This was a contribution made by a community member.
  • ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
  • release-note/bug: This PR fixes an issue in a previous release of Cilium.
  • sig/agent: Cilium agent related.