util: log slow GRPC calls #4847

gman0 · 2024-09-17T15:16:49Z

Describe what this PR does

This PR adds logs for slow GRPC calls. It does so with a gRPC middleware that checks whether a call's handler is still running after its context deadline; if so, a "slow call" message is logged.

The duration is passed in through a new MiddlewareServerOptionConfig struct, so a couple of functions had to be modified to accept a new argument.

User-facing changes:

New cmdline arg: --logslowopinterval
New Helm chart value logSlowOperationInterval

Logs with this PR included:

I0917 15:11:29.233484       1 utils.go:266] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 GRPC call: /csi.v1.Controller/CreateVolume
I0917 15:11:29.235487       1 utils.go:267] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0917 15:11:29.236381       1 rbd_util.go:1339] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0917 15:11:32.236436       1 utils.go:298] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 Still processing GRPC call /csi.v1.Controller/CreateVolume (3s)
I0917 15:11:32.237029       1 utils.go:300] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}

Madhu-1

@gman0 Thanks for the PR, some nits.

Madhu-1 · 2024-09-18T07:46:16Z

charts/ceph-csi-cephfs/README.md

@@ -118,6 +118,7 @@ charts and their default values.
 | `commonLabels`                                 | Labels to apply to all resources                                                               | `{}`                                                  |
 | `logLevel`                                     | Set logging level for csi containers. Supported values from 0 to 5. 0 for general useful logs, 5 for trace level verbosity.                          | `5`                                                |
 | `sidecarLogLevel`                              | Set logging level for csi sidecar containers. Supported values from 0 to 5. 0 for general useful logs, 5 for trace level verbosity.                  | `1`                                                |
+| `logSlowOperationsAfter`                       | Operations running longer than the specified time duration will be logged as slow. Setting the value to zero disables the feature.                   | `15s`                                              |


Can we keep this value the same as the GRPC timeout, which we have at 60 seconds?

nixpanic · 2024-09-18

Having the same timeout probably defeats the purpose? It would be useful to know if an operation is still running before the timeout hits.

Madhu-1 · 2024-09-18T07:47:48Z

cmd/cephcsi.go

+	flag.DurationVar(
+		&conf.LogSlowOpsAfter,
+		"logslowopsafter",
+		time.Second*15,


Do we need to enable it by default? How about disabling it by default, so that users can enable it if they need it?

nixpanic · 2024-09-18

I'm fine with enabling it by default; it can help with troubleshooting issues.

Madhu-1 · 2024-09-18T07:51:03Z

internal/csi-common/utils.go

+	info *grpc.UnaryServerInfo,
+	handler grpc.UnaryHandler,
+) (interface{}, error) {
+	ticker := time.NewTicker(timeout)


Let's add a defer ticker.Stop() to avoid a resource leak.

Madhu-1 · 2024-09-18T07:55:04Z

internal/csi-common/utils.go

+		for {
+			select {
+			case <-callFinished:
+				break waitForCallFinished


If we return from here, we don't need the break, right?
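
To make the two options concrete, here is a sketch with hypothetical function names (not the PR's code). Inside a for/select, a bare break only leaves the select, so exiting the loop needs either a labeled break or a return. Both variants below behave the same; the return form is enough when nothing has to run after the loop.

```go
package main

import (
	"fmt"
	"time"
)

// waitWithLabel leaves the loop through a labeled break, then falls through
// to the code after the loop.
func waitWithLabel(done <-chan struct{}, tick <-chan time.Time) int {
	ticks := 0
waitForDone:
	for {
		select {
		case <-done:
			// A bare "break" here would only exit the select, not the for loop.
			break waitForDone
		case <-tick:
			ticks++
		}
	}
	return ticks
}

// waitWithReturn behaves the same; returning from the case removes the need
// for a label when nothing has to run after the loop.
func waitWithReturn(done <-chan struct{}, tick <-chan time.Time) int {
	ticks := 0
	for {
		select {
		case <-done:
			return ticks
		case <-tick:
			ticks++
		}
	}
}

func main() {
	done := make(chan struct{})
	close(done)
	// A nil tick channel never fires, so both calls return immediately.
	fmt.Println(waitWithLabel(done, nil), waitWithReturn(done, nil))
}
```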

Madhu-1 · 2024-09-18T07:59:11Z

charts/ceph-csi-cephfs/values.yaml

@@ -40,6 +40,9 @@ commonLabels: {}
 logLevel: 5
 # sidecarLogLevel is the variable for Kubernetes sidecar container's log level
 sidecarLogLevel: 1
+# Operations running longer than the specified time duration will be logged
+# as slow. Setting the value to zero disables the feature.
+logSlowOperationsAfter: 15s


Can we comment it out and not enable it by default?

gman0 · 2024-09-18T16:15:37Z

@nixpanic, actually I like @Madhu-1's idea about waiting for the context timeout. What happens afterwards, when a call is retried while the older call is still in flight, is exactly what I wanted to log and get better visibility into.

I0918 15:33:48.756232       1 utils.go:298] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Still processing GRPC call /csi.v1.Controller/CreateVolume (45s)
I0918 15:33:48.756658       1 utils.go:300] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0918 15:34:03.756752       1 utils.go:298] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Still processing GRPC call /csi.v1.Controller/CreateVolume (1m0s)
I0918 15:34:03.758209       1 utils.go:300] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0918 15:34:04.270362       1 utils.go:266] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 GRPC call: /csi.v1.Controller/CreateVolume
I0918 15:34:04.270850       1 utils.go:267] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0918 15:34:04.274015       1 rbd_util.go:1339] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0918 15:34:11.608149       1 utils.go:266] ID: 28 GRPC call: /csi.v1.Identity/Probe
I0918 15:34:11.608281       1 utils.go:267] ID: 28 GRPC request: {}
I0918 15:34:11.608368       1 utils.go:273] ID: 28 GRPC response: {}
I0918 15:34:18.756588       1 utils.go:298] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Still processing GRPC call /csi.v1.Controller/CreateVolume (1m15s)
I0918 15:34:18.757242       1 utils.go:300] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0918 15:34:19.271399       1 utils.go:298] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Still processing GRPC call /csi.v1.Controller/CreateVolume (15s)
I0918 15:34:19.272034       1 utils.go:300] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}

Here in the log you can see this exact situation:

CreateVolume ID 25 is still in progress,
at the 1-minute mark CreateVolume is retried with ID 27,
ID 25 continues on.

Normally 27 would be blocked by a try-lock, but in any case it's great that we can immediately see when a call is stuck. I propose we change this PR so that the "slow call" logs start only after a call exceeds its context deadline, and we keep this enabled by default. This way we can keep an eye on runaway GRPC calls.
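
For context on the try-lock remark, this is a rough sketch of a per-request try-lock. The volumeLocks type and its methods are hypothetical and only illustrate the idea; they are not ceph-csi's locking code. A retried call for the same request name fails to acquire the lock immediately instead of queueing behind the call that is still in flight.

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLocks is a hypothetical per-ID try-lock used only for illustration.
type volumeLocks struct {
	mu    sync.Mutex
	inUse map[string]struct{}
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{inUse: make(map[string]struct{})}
}

// TryAcquire returns false immediately if id is already locked.
func (l *volumeLocks) TryAcquire(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, held := l.inUse[id]; held {
		return false
	}
	l.inUse[id] = struct{}{}
	return true
}

// Release frees id so that a later call can acquire it.
func (l *volumeLocks) Release(id string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.inUse, id)
}

func main() {
	locks := newVolumeLocks()
	const reqName = "pvc-example" // placeholder request name

	fmt.Println("first call acquires:", locks.TryAcquire(reqName))
	fmt.Println("retry acquires:", locks.TryAcquire(reqName))
	locks.Release(reqName)
	fmt.Println("after release:", locks.TryAcquire(reqName))
}
```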

gman0 · 2024-09-19T11:14:30Z

I've made the changes described in the post above. I wonder if it makes sense to make the log interval configurable or just have it hardcoded to some sane value (30s might be ok)?

gman0 · 2024-09-19T11:17:56Z

internal/util/util.go

+	MetricsPort       int           // TCP port for liveness/grpc metrics requests
+	PollTime          time.Duration // time interval in seconds between each poll
+	PoolTimeout       time.Duration // probe timeout in seconds
+	LogSlowOpInterval time.Duration // GRPC calls running longer than this will be logged


The comment here needs an update.

nixpanic

LGTM, thanks!

nixpanic · 2024-09-19T11:24:35Z

You probably should consider adding a note to PendingReleaseNotes.md so that the new feature will be included in the release notes for the next version.

Pull request has been modified.

nixpanic · 2024-09-19T13:23:37Z

Two minor issues:

The last commit is missing your Signed-off-by line.
markdownlint complains about the line length

./PendingReleaseNotes.md:15: MD013 Line length

Further documentation is available for these failures:
 - MD013: https://github.com/markdownlint/markdownlint/blob/main/docs/RULES.md#md013---line-length

gman0 · 2024-09-19T15:11:16Z

CI is passing now, thanks! PTAL

Madhu-1 · 2024-09-20T06:48:37Z

@Mergifyio queue

mergify · 2024-09-20T06:48:59Z

queue

🛑 The pull request has been removed from the queue `default`

The pull request can't be updated.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

Madhu-1 · 2024-09-20T06:55:20Z

@Mergifyio rebase

mergify · 2024-09-20T06:55:32Z

rebase

✅ Branch has been successfully rebased

Madhu-1 · 2024-09-20T06:55:46Z

@Mergifyio refresh

mergify · 2024-09-20T06:55:51Z

refresh

✅ Pull request refreshed

Madhu-1 · 2024-09-20T06:56:58Z

@Mergifyio queue

mergify · 2024-09-20T06:57:01Z

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at c76338c

ceph-csi-bot · 2024-09-20T06:57:15Z

/test ci/centos/upgrade-tests-cephfs

ceph-csi-bot · 2024-09-20T06:57:16Z

/test ci/centos/upgrade-tests-rbd

ceph-csi-bot · 2024-09-20T06:57:16Z

/test ci/centos/k8s-e2e-external-storage/1.29

ceph-csi-bot · 2024-09-20T06:57:17Z

/test ci/centos/k8s-e2e-external-storage/1.30

ceph-csi-bot · 2024-09-20T06:57:17Z

/test ci/centos/mini-e2e-helm/k8s-1.29

ceph-csi-bot · 2024-09-20T06:57:17Z

/test ci/centos/mini-e2e-helm/k8s-1.30

ceph-csi-bot · 2024-09-20T06:57:17Z

/test ci/centos/mini-e2e/k8s-1.29

ceph-csi-bot · 2024-09-20T06:57:18Z

/test ci/centos/mini-e2e/k8s-1.30

ceph-csi-bot · 2024-09-20T06:57:20Z

/test ci/centos/k8s-e2e-external-storage/1.31

ceph-csi-bot · 2024-09-20T06:57:20Z

/test ci/centos/mini-e2e-helm/k8s-1.31

ceph-csi-bot · 2024-09-20T06:57:21Z

/test ci/centos/mini-e2e/k8s-1.31

gman0 force-pushed the logslowops branch from cb5dd27 to d6059e8 Compare September 17, 2024 15:49

Madhu-1 reviewed Sep 18, 2024

Madhu-1 requested a review from a team September 18, 2024 07:59

gman0 force-pushed the logslowops branch from d6059e8 to eaf4b24 Compare September 19, 2024 11:08

gman0 commented Sep 19, 2024

gman0 force-pushed the logslowops branch from eaf4b24 to 373398a Compare September 19, 2024 11:20

nixpanic previously approved these changes Sep 19, 2024

gman0 force-pushed the logslowops branch from 699b558 to 7b8568d Compare September 19, 2024 14:13

nixpanic approved these changes Sep 19, 2024

gman0 requested a review from Madhu-1 September 20, 2024 05:42

Madhu-1 approved these changes Sep 20, 2024

Robert Vasek added 4 commits September 20, 2024 06:55

util: added logs for slow gRPC calls

521fcef

This commit adds a gRPC middleware that logs calls that keep running after their deadline. Adds --logslowopinterval cmdline argument to pass the log rate.

Signed-off-by: Robert Vasek <robert.vasek@clyso.com>

helm: added logSlowOperationInterval value to cephfs and rbd charts

1990807

Signed-off-by: Robert Vasek <robert.vasek@clyso.com>

doc: added notes about --logslowopinterval cmd arg

dda0889

Signed-off-by: Robert Vasek <robert.vasek@clyso.com>

doc: add a release note about "Slow GRPC" logs

318c143

Signed-off-by: Robert Vasek <robert.vasek@clyso.com>

Madhu-1 force-pushed the logslowops branch from 7b8568d to 318c143 Compare September 20, 2024 06:55

mergify bot added the ok-to-test label (Label to trigger E2E tests) Sep 20, 2024

ceph-csi-bot removed the ok-to-test label (Label to trigger E2E tests) Sep 20, 2024

mergify bot merged commit c76338c into ceph:devel Sep 20, 2024
36 of 37 checks passed
