
rbd: fix rbd-nbd io-timeout to never abort #2394

Merged: 4 commits merged on Aug 24, 2021

Conversation

@pkalever pkalever commented Aug 10, 2021

Describe what this PR does

  • Fix rbd-nbd io-timeout to match reattach-timeout

From the tests in CI, it looks like the IO is timing out after 30 seconds (the rbd-nbd default). Since we have tweaked reattach-timeout to 300 seconds in ceph-csi, we need to explicitly set io-timeout on the device as well; it doesn't make any sense to keep io-timeout < reattach-timeout.

Hence we set io-timeout for rbd-nbd to 0. Specifying io-timeout 0 tells the nbd driver not to abort the request and instead see whether it can be restarted on another socket.


  • At our e2e, use io-timeout conditionally based on kernel version

We need https://www.mail-archive.com/linux-block@vger.kernel.org/msg38060.html in order to use --io-timeout=0. This patch is part of kernel v5.4. Since minikube doesn't have a v5.4 kernel yet, let's use the io-timeout value conditionally based on the kernel version in our e2e.
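
A minimal sketch of how such a conditional check could look in the e2e code; the helper names, the version parsing, and the 330-second fallback below are illustrative assumptions rather than the exact ceph-csi helpers:

    package e2e

    import (
        "strconv"
        "strings"
    )

    // kernelSupportsNbdIOTimeoutZero reports whether the kernel is v5.4 or
    // newer, i.e. whether the nbd driver honours --io-timeout=0 as
    // "never time out".
    func kernelSupportsNbdIOTimeoutZero(release string) bool {
        // release looks like "5.4.0-80-generic" or "4.18.0-305.12.1.el8_4.x86_64".
        parts := strings.SplitN(release, ".", 3)
        if len(parts) < 2 {
            return false
        }
        major, errMajor := strconv.Atoi(parts[0])
        minor, errMinor := strconv.Atoi(parts[1])
        if errMajor != nil || errMinor != nil {
            return false
        }
        return major > 5 || (major == 5 && minor >= 4)
    }

    // nbdIOTimeoutOption picks the io-timeout mapping option for the test run.
    func nbdIOTimeoutOption(release string) string {
        if kernelSupportsNbdIOTimeoutZero(release) {
            return "io-timeout=0" // never abort; the nbd driver requeues instead
        }
        // Older kernels: keep io-timeout at least as large as reattach-timeout
        // (the 330 here follows the reattach-timeout+30 idea discussed below).
        return "io-timeout=330"
    }

The kernel release string should come from `uname -r` inside the csi-rbdplugin pod (as one of the commits in this PR does), not from the machine running the e2e binary.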


Thanks @idryomov for the timeout suggestion.

Updates: #2204

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Suggested-by: Ilya Dryomov <idryomov@redhat.com>


Available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)
  • /retest all: run this in case the CentOS CI failed to start/report any test
    progress or results

@mergify mergify bot added labels component/rbd (Issues related to RBD) and bug (Something isn't working) on Aug 10, 2021
@pkalever (Author)

Thanks, @trociny for highlighting it.

@pkalever (Author)

/retest all

@pkalever (Author) commented Aug 11, 2021

Most of the tests are stuck after the 'return bare-metal machine' state.

@@ -239,6 +241,9 @@ func appendDeviceTypeAndOptions(cmdArgs []string, isNbd, isThick bool, userOptio
    if !strings.Contains(userOptions, setNbdReattach) {
        cmdArgs = append(cmdArgs, "--options", fmt.Sprintf("%s=%d", setNbdReattach, defaultNbdReAttachTimeout))
    }
    if !strings.Contains(userOptions, setNbdIOTimeout) {
        cmdArgs = append(cmdArgs, "--options", fmt.Sprintf("%s=%d", setNbdIOTimeout, defaultNbdIOTimeout))
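
For context, here is a hedged sketch of the pieces this hunk relies on. The constant names appear in the hunk itself, but the option keys and values shown below are assumptions based on this PR's description, not a verbatim copy of internal/rbd/rbd_attach.go:

    package rbd

    const (
        setNbdReattach            = "reattach-timeout"
        setNbdIOTimeout           = "io-timeout"
        defaultNbdReAttachTimeout = 300 // seconds, as configured by ceph-csi
        defaultNbdIOTimeout       = 0   // never time out; let the nbd driver requeue
    )

With these defaults the mapping command ends up with `--options reattach-timeout=300 --options io-timeout=0` appended, unless the user already supplied either option.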
Madhu-1 (Collaborator)

Question:
@pkalever will there be any impact on graceful or forced pod termination because of this change?

pkalever (Author)

@Madhu-1 Nope, this shouldn't hurt the termination sequence of the pod lifecycle. This timeout is supplied to the nbd driver via the rbd-nbd process; it controls how long commands should wait before timing out (for example, when the rbd-nbd process is down, how long to wait for it to come back and serve IO).

But it would be good to get confirmation from @idryomov on the values we are using, i.e. reattach-timeout=300 and io-timeout=reattach-timeout+30, in case there are any further objections.

cc: @lxbsz

@idryomov (Contributor) commented Aug 11, 2021

This seems wrong to me. I think this would cause in-flight requests to not be requeued when rbd-nbd is respawned, see torvalds/linux@2c27254. By setting io-timeout to the value of reattach-timeout or higher we would effectively disable that nbd_requeue_cmd() call.

pkalever (Author)

I have tested different io-timeout settings (greater than, less than, and equal to the reattach-timeout value, keeping reattach-timeout fixed at 300), and honestly, with the latest nbd driver available in master, I did not see any difference in the IO timeout/failure behaviour.

Here are the observations captured: https://hackmd.io/@FFELd2RJT76Uqp-3M-e4zw/Sys8bLfeY

I'm planning to test on the minikube kernel version to understand how the behaviour differs between kernels.


I hope this makes the need for ceph/ceph#42609 clearer.

Thanks!

CC: @trociny @lxbsz

idryomov (Contributor)

On the contrary, so far I don't see the need for ceph/ceph#42609 at all.

I don't understand what you were testing though. If you kill rbd-nbd with reattach-timeout set, in-flight I/O would eventually be failed no matter what the value of io-timeout is. I think the test should have included rbd device attach at e.g. the 3-minute mark, with the expectation that nothing gets timed out, the filesystem stays read-write, and dd continues to execute after reattach. I would expect io-timeout=30 to be disqualified immediately, and with enough test rounds you should be able to see the difference between io-timeout=300/io-timeout=330 and io-timeout=0.
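
A rough sketch of the test sequence being suggested here, assuming the image is already mapped via rbd-nbd on /dev/nbd0 and a dd workload is writing to the mounted filesystem; the pool/image name is a placeholder and the exact `rbd device attach` flags are assumptions that depend on the installed rbd version:

    package main

    import (
        "log"
        "os/exec"
        "time"
    )

    // run executes a command and logs its combined output for inspection.
    func run(name string, args ...string) {
        out, err := exec.Command(name, args...).CombinedOutput()
        log.Printf("%s %v: %s (err: %v)", name, args, out, err)
    }

    func main() {
        // Kill rbd-nbd while dd keeps writing to the mapped device.
        run("pkill", "-9", "-x", "rbd-nbd")

        // Reattach around the 3-minute mark, well within reattach-timeout=300.
        time.Sleep(3 * time.Minute)
        run("rbd", "device", "attach", "--device-type", "nbd",
            "--device", "/dev/nbd0",
            "--options", "reattach-timeout=300,io-timeout=0",
            "replicapool/test-image")

        // Expectation with io-timeout=0: no I/O errors are reported, the
        // filesystem stays read-write, and dd resumes once reattach completes.
    }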

Member

We would not need fsfreeze in case the block-layer does not report errors but just queues the I/O. This should work for both filesystem- and block-mode volumes.

Quiescing takes time (flushing outstanding I/O) and delays restarting the node-plugin Pod. I am not convinced there is a good reason for flushing I/O, as we do expect the pod to restart. Assuming there will be issues with restarting or re-attaching doesn't give me confidence in the solution...

pkalever (Author)

The scenario where the process is killed (TERM) and brought back to life while the device is still held is a bit special; it would not be a big surprise to arm rbd-nbd with additional safety.

IMHO, we should avoid, at any cost, ending up in a place that leads to corruption. Data loss is secondary to me; in the worst case it cannot be avoided.

pkalever (Author)

Yesterday there was a discussion about this in the CSI standup meeting; here is the summary:

  1. If we do not quiesce on detach, then there is a chance of data corruption, hence we should provide quiesce.
  2. For setups where writes are slow and hence heavily cached by the filesystem (a few MBs or GBs), the flushing involved in fsfreeze might take time and might lead us to hit reattach-timeout (note: this can also be reconfigured from the current 300 sec to something higher). Hence we should have the option to disable quiesce-on-detach in Ceph-CSI.

The pod termination sequence should be improved with the preHooks; the hook should ideally wait for the rbd-nbd processes.

nixpanic (Member)

  1. If we do not quiesce on detach, then there is a chance of data corruption, hence we should provide quiesce.

Note that for applications there is probably little difference with or without quiescing. The application does not get informed about the need to flush writes; it can still be interrupted (blocked/paused) while it is writing data.

When the DaemonSet for the rbd node-plugin is updated, all RBD volumes will start to flush their data if quiescing is done. This will cause a load spike on the Ceph cluster and the network.

pkalever (Author)

Agreed @nixpanic. This should be almost the same load that takes place if we have to snapshot all the PVs in one go!

nixpanic previously approved these changes Aug 11, 2021

@nixpanic (Member) left a comment

makes sense to me, thanks!

@Madhu-1 (Collaborator) left a comment

LGTM. Will wait for a response to https://github.com/ceph/ceph-csi/pull/2394/files#r686495332 and then merge it.

@idryomov (Contributor) left a comment

I haven't tested yet but I think we should always set io-timeout to 0. This should keep the requeueing logic in action and work around the timeout handling bugs present in all kernels (where the old value just hangs around and the first mapping is different from subsequent mappings).

@pkalever (Author)

/retest ci/centos/upgrade-tests-rbd


X Exiting due to GUEST_PROVISION: Failed to start host: creating host: create: Error creating machine: Error in driver during machine creation: IP not available after waiting: machine minikube didn't return IP after 1 minute

nixpanic previously approved these changes Aug 19, 2021
@nixpanic (Member)

The lint-extras GitHub action fails (just like make containerized-test), which will be fixed once #2421 is merged and this PR is rebased.

@nixpanic (Member)

@Mergifyio rebase

mergify bot (Contributor) commented Aug 19, 2021

Command rebase: success

Branch has been successfully rebased

@pkalever (Author)

STEP: checking provisioner deployment is running
STEP: checking nodeplugin deamonset pods are running
Aug 19 11:56:10.916: INFO: Waiting up to 10m0s for all daemonsets in namespace 'cephcsi-e2e-6856a6b083be' to start
Aug 19 11:56:10.919: INFO: 1 / 1 pods ready in namespace 'cephcsi-e2e-6856a6b083be' in daemonset 'csi-cephfsplugin' (0 seconds elapsed)
STEP: verify PVC and app binding on helm installation
Aug 19 11:56:10.929: INFO: Waiting up to &PersistentVolumeClaim{ObjectMeta:{csi-cephfs-pvc  cephfs-3274    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Spec:PersistentVolumeClaimSpec{AccessModes:[ReadWriteMany],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},},VolumeName:,Selector:nil,StorageClassName:*csi-cephfs-sc,VolumeMode:nil,DataSource:nil,DataSourceRef:nil,},Status:PersistentVolumeClaimStatus{Phase:,AccessModes:[],Capacity:ResourceList{},Conditions:[]PersistentVolumeClaimCondition{},},} to be in Bound state
Aug 19 11:56:10.929: INFO: waiting for PVC csi-cephfs-pvc (0 seconds elapsed)
Aug 19 11:56:12.944: INFO: waiting for PVC csi-cephfs-pvc (2 seconds elapsed)
Aug 19 11:56:12.951: INFO: Waiting for PV pvc-8f823be1-4b73-4575-bb13-cf4eb66fad91 to bind to PVC csi-cephfs-pvc
Aug 19 11:56:12.951: INFO: Waiting up to timeout=10m0s for PersistentVolumeClaims [csi-cephfs-pvc] to have phase Bound
Aug 19 11:56:12.954: INFO: PersistentVolumeClaim csi-cephfs-pvc found and phase=Bound (2.92652ms)
Aug 19 11:56:12.954: INFO: Waiting up to 10m0s for PersistentVolume pvc-8f823be1-4b73-4575-bb13-cf4eb66fad91 to have phase Bound
Aug 19 11:56:12.957: INFO: PersistentVolume pvc-8f823be1-4b73-4575-bb13-cf4eb66fad91 found and phase=Bound (3.080234ms)
Aug 19 11:56:12.971: INFO: Waiting up to csi-cephfs-demo-pod to be in Running state
Aug 19 12:01:56.989: INFO: Waiting for pod csi-cephfs-demo-pod to be deleted
Aug 19 12:11:57.001: FAIL: failed to validate CephFS pvc and application binding with error timed out waiting for the condition

https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e-helm_k8s-1.22/runs/94/nodes/101/log/?start=0

STEP: Resize rbd-nbd PVC and check application directory size
E0819 10:57:22.628963   83435 util.go:245] kernel 4.18.0-305.12.1.el8_4.x86_64 does not support required features
STEP: perform IO on rbd-nbd volume after nodeplugin restart
Aug 19 10:57:22.629: INFO: waiting for kubectl (delete -f [] args %!s(MISSING)) to finish
Aug 19 10:57:22.629: INFO: Running '/usr/bin/kubectl --server=https://192.168.39.112:8443 --kubeconfig=/root/.kube/config --namespace=cephcsi-e2e-7b8169a8 delete -f -'
Aug 19 10:57:22.762: INFO: stderr: "warning: deleting cluster-scoped resources, not scoped to the provided namespace\n"
Aug 19 10:57:22.762: INFO: stdout: "storageclass.storage.k8s.io \"csi-rbd-sc\" deleted\n"
Aug 19 10:57:22.769: INFO: ExecWithOptions {Command:[/bin/sh -c ceph fsid] Namespace:rook-ceph PodName:rook-ceph-tools-7467d8bf8-kp6qz ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 19 10:57:22.769: INFO: >>> kubeConfig: /root/.kube/config
Aug 19 10:57:24.919: INFO: Waiting up to &PersistentVolumeClaim{ObjectMeta:{rbd-pvc  rbd-1318    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Spec:PersistentVolumeClaimSpec{AccessModes:[ReadWriteOnce],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},},VolumeName:,Selector:nil,StorageClassName:*csi-rbd-sc,VolumeMode:nil,DataSource:nil,DataSourceRef:nil,},Status:PersistentVolumeClaimStatus{Phase:,AccessModes:[],Capacity:ResourceList{},Conditions:[]PersistentVolumeClaimCondition{},},} to be in Bound state
Aug 19 10:57:24.919: INFO: waiting for PVC rbd-pvc (0 seconds elapsed)
Aug 19 10:57:26.926: INFO: waiting for PVC rbd-pvc (2 seconds elapsed)
Aug 19 10:57:26.933: INFO: Waiting for PV pvc-091a6bbf-7b9e-41c2-86a2-ea7d74b4e95e to bind to PVC rbd-pvc
Aug 19 10:57:26.933: INFO: Waiting up to timeout=10m0s for PersistentVolumeClaims [rbd-pvc] to have phase Bound
Aug 19 10:57:26.935: INFO: PersistentVolumeClaim rbd-pvc found and phase=Bound (2.387417ms)
Aug 19 10:57:26.935: INFO: Waiting up to 10m0s for PersistentVolume pvc-091a6bbf-7b9e-41c2-86a2-ea7d74b4e95e to have phase Bound
Aug 19 10:57:26.937: INFO: PersistentVolume pvc-091a6bbf-7b9e-41c2-86a2-ea7d74b4e95e found and phase=Bound (2.394599ms)
Aug 19 10:57:26.949: INFO: Waiting up to csi-rbd-demo-pod to be in Running state
Aug 19 10:57:46.991: INFO: ExecWithOptions {Command:[/bin/sh -c sync /var/lib/www/html] Namespace:rbd-1318 PodName:csi-rbd-demo-pod ContainerName:web-server Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 19 10:57:46.991: INFO: >>> kubeConfig: /root/.kube/config
Aug 19 10:57:47.184: INFO: stdErr occurred: sync: error syncing '/var/lib/www/html': Input/output error

Aug 19 10:57:47.184: FAIL: failed to sync, err: command terminated with exit code 1, stdErr: sync: error syncing '/var/lib/www/html': Input/output error

https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e_k8s-1.20/runs/2161/nodes/87/log/?start=0

@pkalever (Author)

/retest ci/centos/mini-e2e-helm/k8s-1.22

@pkalever (Author)

/retest ci/centos/mini-e2e/k8s-1.20

@pkalever (Author)

/retest ci/centos/mini-e2e/k8s-1.20



X Exiting due to GUEST_PROVISION: Failed to start host: creating host: create: Error creating machine: Error in driver during machine creation: IP not available after waiting: machine minikube didn't return IP after 1 minute

Madhu-1 previously approved these changes Aug 23, 2021
Resolved review threads: internal/rbd/rbd_attach.go (outdated), e2e/rbd.go, e2e/rbd_helper.go (outdated)
Madhu-1 previously approved these changes Aug 24, 2021
Resolved review thread: e2e/rbd_helper.go (outdated)
@pkalever pkalever requested a review from Madhu-1 August 24, 2021 07:06
@mergify mergify bot dismissed Madhu-1’s stale review August 24, 2021 07:06

Pull request has been modified.

Resolved review threads: e2e/pod.go (outdated), e2e/rbd.go (outdated, two threads), e2e/rbd_helper.go
@pkalever (Author)

@Madhu-1 @nixpanic hope I addressed everything. PTAL. Thanks!

Prasanna Kumar Kalever added 4 commits August 24, 2021 11:39
Adding some code comments to make the code readable and easy to understand.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
From the tests in CI, it looks like the IO is timing out after 30 seconds
(the rbd-nbd default). Since we have tweaked reattach-timeout to 300 seconds
in ceph-csi, we need to explicitly set io-timeout on the device too, as it
doesn't make any sense to keep io-timeout < reattach-timeout.

Hence we set io-timeout for rbd-nbd to 0. Specifying io-timeout 0 tells
the nbd driver not to abort the request and instead see whether it can be
restarted on another socket.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Suggested-by: Ilya Dryomov <idryomov@redhat.com>
Currently, we get the kernel version where the e2e (client) executable runs,
not the kernel version that is used by the csi-rbdplugin pod.

Add a function that runs the `uname -r` command in the specified container
and returns the kernel version.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Suggested-by: Niels de Vos <ndevos@redhat.com>
We need
https://www.mail-archive.com/linux-block@vger.kernel.org/msg38060.html
in order to use `--io-timeout=0`. This patch is part of kernel v5.4.

Since minikube doesn't have a v5.4 kernel yet, let's use the io-timeout value
conditionally based on the kernel version in our e2e.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
@idryomov (Contributor)

Nit: change the title of the PR to "rbd: fix rbd-nbd io-timeout to never abort".

@pkalever changed the title from "rbd: fix rbd-nbd io-timeout to match reattach-timeout" to "rbd: fix rbd-nbd io-timeout to never abort" on Aug 24, 2021
@pkalever (Author)

Nit: change the title of the PR to "rbd: fix rbd-nbd io-timeout to never abort".

Thanks @idryomov, done now :-)

@idryomov (Contributor)

I want to solve a case where the attach doesn't happen in reattach-timeout, so that even if the application pod has to restart, it will have a corruption-free backend image.

I assume by backend image you mean filesystem image? The filesystem image should always be corruption-free because filesystems are built to withstand sudden power loss or kernel panics. Unless you foresee reattach not completing in time continuously and on a regular basis, there is nothing wrong with occasionally losing in-flight I/O and going through journal recovery on the next mount.

The reasons for not reattaching in the specified timeout could be anything beyond our control.

* Upgrade failures (cannot download image)

Does the download of the new image start after detach (i.e. "hanging" in-flight I/O)? That seems wrong even if everything in between (container registry, network, etc) were 100% reliable.

* Container crash could be a bug in ceph-csi or ...

How is it different from a crash with no reattach in sight?

* Network issue downloading the image

Already mentioned above.

* Resource highly used on the node

If the node is so oversubscribed that the rbd-nbd process fails to start, the user workload is already impacted.

* docker/podman bug?

Again, how is it different from a "regular" crash?

We can go on with practical reasons.

Do we have any better solution or suggestion to handle this case other than ceph/ceph#42609 (quiesce on detach)?

I guess I just don't see why this case needs handling. I'm working off of the assumption that reattach failures would be (or at least are expected to be) as rare as rbd-nbd crashes or kernel panics. Even if the download of the new image happens after detach (instead of before as it should), how often do you expect the upgrade to fail?

Labels: bug (Something isn't working), component/rbd (Issues related to RBD)
Projects: None yet

6 participants