Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clustermesh: silence misleading logs about service resolution #26614

Merged
merged 2 commits into from Jul 7, 2023

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Jul 4, 2023

9f5a82a ("clustermesh: use custom dialer for service resolution") configured a custom dialer to perform service resolution when the etcd address refers to a local service (mainly in the kvstoremesh case).

Yet, the custom dialer function outputs quite verbose log messages, which can be misleading when the address does not point to a local service (as it is correct that the address cannot be parsed). One example being:

level=error msg="Unable to parse etcd service URL" error="parse \"172.19.0.2:2379\": first path segment in URL cannot contain colon"

Hence, let's silence them, as they do more harm than good in this specific situation. We still print (at debug level) the outcome of the translation, so that we know to whom we are actually connecting.

Silence misleading log messages about service resolution in clustermesh 

@giorio94 giorio94 added area/clustermesh Relates to multi-cluster routing functionality in Cilium. release-note/misc This PR makes changes that have no direct user impact. needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 4, 2023
@giorio94 giorio94 requested review from a team as code owners July 4, 2023 09:19
@giorio94
Copy link
Member Author

giorio94 commented Jul 4, 2023

/test

@squeed
Copy link
Contributor

squeed commented Jul 4, 2023

Is there any chance that this will hide useful error messages as well? Should we re-enable logging if debug mode is on?

@giorio94
Copy link
Member Author

giorio94 commented Jul 4, 2023

Is there any chance that this will hide useful error messages as well? Should we re-enable logging if debug mode is on?

I'm a bit doubtful as well. That function currently emits four log messages depending on the conditions:

  • Unable to parse etcd service URL (error level): this is definitely misleading and should never be emitted in this context, given that for instance IP addresses are valid target addresses.
  • Unable to parse etcd service URL into a service ID (debug level): could not convert the address into a service name (it should never happen unless the address has no dots).
  • Service not found in the service IP getter (debug level): could not find the given service (this can again happen in many legitimate cases if the address looked like a service name but it was not);
  • custom dialer based on k8s service backend is dialing to %q (debug level): shows the actual address which is being dialed.

An alternative could be to add a flag to silence the first three types of messages and keep only the last, to know the actual target and spot possible issues. WDYT?

@squeed
Copy link
Contributor

squeed commented Jul 4, 2023

It would be ideal to log - somewhere - the actual IP to which we're connecting, especially since there is some amount of magic going on. But I wouldn't consider that a blocker -- your call.

9f5a82a ("clustermesh: use custom dialer for service resolution")
configured a custom dialer to perform service resolution when the etcd
address refers to a local service (mainly in the kvstoremesh case).

Yet, the custom dialer function outputs quite verbose log messages, which
can be misleading when the address does not point to a local service (as
it is correct that the address cannot be parsed). One example being:

level=error msg="Unable to parse etcd service URL"
  error="parse \"172.19.0.2:2379\": first path segment in URL cannot contain colon"

Hence, let's silence them, as they do more harm than good in this specific
situation. We still print (at debug level) the outcome of the translation,
so that we know to whom we are actually connecting.

Fixes: 9f5a82a ("clustermesh: use custom dialer for service resolution")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the custom dialer ignores the context it is passed to. Let's
switch to using DialContext, so that we properly propagate it. This
mimics also the default behavior when no custom dialer is configured [1].

[1] vendor/google.golang.org/grpc/internal/transport/http2_client.go:179

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/clustermesh-dialer-silence-log branch from 0c56289 to cf33a87 Compare July 4, 2023 16:41
@giorio94 giorio94 requested review from a team as code owners July 4, 2023 16:41
@giorio94 giorio94 changed the title clustermesh: hide misleading logs about service resolution clustermesh: silence misleading logs about service resolution Jul 4, 2023
@giorio94
Copy link
Member Author

giorio94 commented Jul 4, 2023

It would be ideal to log - somewhere - the actual IP to which we're connecting, especially since there is some amount of magic going on. But I wouldn't consider that a blocker -- your call.

I updated that function to allow silencing the error messages, while preserving the one which shows the target address (at debug level). While being there, I also updated the custom dialer to propagate the given context. @squeed PTAL.

@giorio94 giorio94 requested a review from squeed July 4, 2023 16:44
@giorio94
Copy link
Member Author

giorio94 commented Jul 4, 2023

/test

@asauber
Copy link
Member

asauber commented Jul 5, 2023

I'd like to propose an alternative solution. We should be using valid URLs in any cases where this error message is popping up. Adding a scheme (http://, https://) to the string will allow these to parse as expected, and this is the type of value which is used in the etcd docs.

Go Playground example: https://go.dev/play/p/OJB_oM5PQGO

@giorio94
Copy link
Member Author

giorio94 commented Jul 6, 2023

I'd like to propose an alternative solution. We should be using valid URLs in any cases where this error message is popping up.

Yeah, the dialer is passed the address without the scheme, although it was originally present. We could probably prepend a fake scheme while doing the translation to silence it (although it wouldn't change anything in terms of the final result). I'm still personally in favor of this change, as also the other log messages pointing out failures (although at debug level) might be misleading (as also possibly repeated multiple times). WDYT?

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 6, 2023
@asauber
Copy link
Member

asauber commented Jul 6, 2023

@giorio94 considering the service IP lookup, it makes sense to me

@borkmann borkmann merged commit ad76862 into cilium:main Jul 7, 2023
65 checks passed
@jibi jibi mentioned this pull request Jul 10, 2023
19 tasks
@jibi jibi added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 10, 2023
@julianwiedmann julianwiedmann added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants