rbd: fix rbd-nbd io-timeout to never abort #2394
Conversation
Thanks, @trociny for highlighting it.
/retest all
Most of the tests are stuck after the 'return bare-metal machine' state.
@@ -239,6 +241,9 @@ func appendDeviceTypeAndOptions(cmdArgs []string, isNbd, isThick bool, userOptions
 	if !strings.Contains(userOptions, setNbdReattach) {
 		cmdArgs = append(cmdArgs, "--options", fmt.Sprintf("%s=%d", setNbdReattach, defaultNbdReAttachTimeout))
 	}
+	if !strings.Contains(userOptions, setNbdIOTimeout) {
+		cmdArgs = append(cmdArgs, "--options", fmt.Sprintf("%s=%d", setNbdIOTimeout, defaultNbdIOTimeout))
+	}
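For context, here is a minimal, self-contained sketch of where these options end up: the constant names mirror the diff above, but the literal option strings and the values (reattach-timeout=300, io-timeout=0, which is what this PR eventually settles on) are assumptions for illustration, not the exact ceph-csi source.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed values for illustration; the real definitions live elsewhere in ceph-csi.
const (
	setNbdReattach            = "reattach-timeout" // rbd-nbd map option key (assumed)
	setNbdIOTimeout           = "io-timeout"       // rbd-nbd map option key (assumed)
	defaultNbdReAttachTimeout = 300                // seconds, as discussed in this PR
	defaultNbdIOTimeout       = 0                  // 0 = never abort, the approach this PR ends up with
)

// appendNbdDefaults adds the default timeouts unless the user already supplied them.
func appendNbdDefaults(cmdArgs []string, userOptions string) []string {
	if !strings.Contains(userOptions, setNbdReattach) {
		cmdArgs = append(cmdArgs, "--options",
			fmt.Sprintf("%s=%d", setNbdReattach, defaultNbdReAttachTimeout))
	}
	if !strings.Contains(userOptions, setNbdIOTimeout) {
		cmdArgs = append(cmdArgs, "--options",
			fmt.Sprintf("%s=%d", setNbdIOTimeout, defaultNbdIOTimeout))
	}
	return cmdArgs
}

func main() {
	args := appendNbdDefaults([]string{"map", "pool/image", "--device-type", "nbd"}, "")
	fmt.Println(strings.Join(args, " "))
	// Prints: map pool/image --device-type nbd --options reattach-timeout=300 --options io-timeout=0
}
```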
Question:
@pkalever will there be any impact on graceful or forced pod termination because of this change?
@Madhu-1 Nope, this shouldn't hurt the termination sequence of the pod lifecycle. This timeout is supplied to the nbd driver via the rbd-nbd process; it controls how long commands should wait before timing out (for example, when the rbd-nbd process is down, how long to wait for it to come back and serve IO).
But it would be good to get confirmation from @idryomov on the values we are using, i.e. reattach-timeout=300 and io-timeout=reattach-timeout + 30, in case there are any further objections.
cc: @lxbsz
This seems wrong to me. I think this would cause in-flight requests to not be requeued when rbd-nbd is respawned, see torvalds/linux@2c27254. By setting `io-timeout` to the value of `reattach-timeout` or higher we would effectively disable that `nbd_requeue_cmd()` call.
I have tested different io-timeout settings (greater than, less than, and equal to the reattach-timeout value, keeping reattach-timeout fixed at 300), and honestly, with the latest nbd driver at master I have not seen any difference in the IO timeout/failure behaviour.
Here are the observations captured: https://hackmd.io/@FFELd2RJT76Uqp-3M-e4zw/Sys8bLfeY
I'm planning to test the behaviour on the minikube kernel version to understand the differences between kernels.
I hope you can now see the need for ceph/ceph#42609
Thanks!
On the contrary, so far I don't see the need for ceph/ceph#42609 at all.
I don't understand what you were testing though. If you kill rbd-nbd with `reattach-timeout` set, eventually in-flight I/O would be failed no matter what the value of `io-timeout` is. I think the test should have included `rbd device attach` at e.g. the 3 minute mark with the expectation that nothing gets timed out, the filesystem stays read-write and dd continues to execute after reattach. I would expect `io-timeout=30` to be disqualified immediately, and with enough test rounds you should be able to see the difference between `io-timeout=300`/`io-timeout=330` and `io-timeout=0`.
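A rough sketch of one such test round, assuming direct access to the node; the mount path, pool/image name, /dev/nbd0 device and the way rbd-nbd is killed are illustrative placeholders, not the ceph-csi e2e code:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// run executes a command and aborts the sketch on failure.
func run(name string, args ...string) {
	if out, err := exec.Command(name, args...).CombinedOutput(); err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	// 1. Start a long-running writer against the mapped filesystem (path assumed).
	dd := exec.Command("dd", "if=/dev/zero", "of=/mnt/test/file", "bs=4M", "count=10000", "oflag=direct")
	if err := dd.Start(); err != nil {
		log.Fatal(err)
	}

	// 2. Kill rbd-nbd while I/O is in flight; the device stays around thanks to reattach-timeout=300.
	run("pkill", "-TERM", "rbd-nbd")

	// 3. Wait ~3 minutes: well within reattach-timeout=300, but far past io-timeout=30.
	time.Sleep(3 * time.Minute)

	// 4. Reattach the image to the same device (pool/image and /dev/nbd0 are placeholders).
	run("rbd", "device", "attach", "--device-type", "nbd", "--device", "/dev/nbd0", "pool/image")

	// 5. Expectation: no I/O errors, the filesystem stays read-write, and dd finishes.
	if err := dd.Wait(); err != nil {
		log.Fatalf("dd failed after reattach: %v", err)
	}
	log.Println("dd completed after reattach without I/O errors")
}
```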
We would not need `fsfreeze` in case the block layer does not report errors but just queues the I/O. This should work for both filesystem- and block-mode volumes.
Quiescing takes time (flushing outstanding I/O) and delays restarting the node-plugin Pod. I am not convinced there is a good reason to flush I/O, as we do expect the pod to restart. Assuming there are issues with restarting or re-attaching doesn't give me confidence in the solution...
The scenario where the process is killed (TERM) and brought back to life while the device is held is a bit special; it would not be a big surprise to arm rbd-nbd with additional safety.
IMHO, we shouldn't, at any cost, end up in a place that leads to corruption. Data loss is secondary to me; in the worst case it cannot be avoided.
Yesterday there was a discussion around this in the CSI standup meeting, and here is the summary:
- If we do not quiesce on detach, then there is a chance of data corruption, hence we should provide quiesce.
- For setups where the writes are slow and hence heavily cached on the fs (a few MBs or GBs), the flushing involved as part of fsfreeze might take time and might lead us to hit `reattach-timeout` (note: this can also be reconfigured from the current 300 sec to something higher). Hence we should have an option to disable quiesce on detach in Ceph-CSI.
- The pod termination sequence should be improved with preStop hooks; the hook should ideally wait for the `rbd-nbd` processes to exit (a rough sketch follows below).
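For the last point, a hypothetical sketch (not part of this PR) of the wait logic such a preStop hook could run, polling until no rbd-nbd processes remain:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// waitForRbdNbd polls until no rbd-nbd processes remain or the deadline expires.
// pgrep exits non-zero when nothing matches, which is our "all gone" signal.
func waitForRbdNbd(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if err := exec.Command("pgrep", "-x", "rbd-nbd").Run(); err != nil {
			return true // no rbd-nbd processes left
		}
		time.Sleep(time.Second)
	}
	return false
}

func main() {
	if !waitForRbdNbd(5 * time.Minute) {
		log.Println("rbd-nbd processes still running after timeout")
	}
}
```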
- If we do not quiesce on detach, then there is a chance of data corruption, hence we should provide quiesce.
Note that for applications there probably is little difference with or without quiescing. The application does not get informed about the need to flush writes; it can still be interrupted (blocked/paused) while it is writing data.
When updating the DaemonSet for the rbd node-plugin, all RBD volumes will start to flush their data when quiescing is done. This will cause a load spike on the Ceph cluster and network.
Agreed @nixpanic. This should be almost the same load as when we have to snapshot all the PVs in one go!
makes sense to me, thanks!
LGTM, will wait for a response to https://github.com/ceph/ceph-csi/pull/2394/files#r686495332 and then merge it
I haven't tested yet but I think we should always set `io-timeout` to 0. This should keep the requeueing logic in action and work around the timeout handling bugs present in all kernels (where the old value just hangs around and the first mapping is different from subsequent mappings).
/retest ci/centos/upgrade-tests-rbd
The lint-extras GitHub action fails (just like
@Mergifyio rebase
/retest ci/centos/mini-e2e-helm/k8s-1.22
/retest ci/centos/mini-e2e/k8s-1.20
Add some code comments to make the code readable and easy to understand. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
With the tests at CI, it looks like the IO is timing out after 30 seconds (the default with rbd-nbd). Since we have tweaked reattach-timeout to 300 seconds in ceph-csi, we need to explicitly set io-timeout on the device too, as it doesn't make any sense to keep io-timeout < reattach-timeout. Hence we set io-timeout for rbd-nbd to 0. Specifying io-timeout 0 tells the nbd driver to not abort the request and instead see if it can be restarted on another socket. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com> Suggested-by: Ilya Dryomov <idryomov@redhat.com>
Currently, we get the kernel version where the e2e (client) executable runs, not the kernel version that is used by the csi-rbdplugin pod. Add a function that runs the `uname -r` command in the specified container and returns the kernel version. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com> Suggested-by: Niels de Vos <ndevos@redhat.com>
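A sketch of what such a helper could look like; this version shells out to kubectl instead of using the e2e framework helpers, and the namespace/pod/container names in main are hypothetical:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// getKernelVersionInContainer returns the kernel release seen inside the given
// container by running `uname -r` there via kubectl exec.
func getKernelVersionInContainer(namespace, pod, container string) (string, error) {
	out, err := exec.Command("kubectl", "exec", "-n", namespace, pod,
		"-c", container, "--", "uname", "-r").Output()
	if err != nil {
		return "", fmt.Errorf("uname -r in %s/%s (%s): %w", namespace, pod, container, err)
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	// Placeholder names for illustration only.
	version, err := getKernelVersionInContainer("ceph-csi", "csi-rbdplugin-xxxxx", "csi-rbdplugin")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("kernel version:", version)
}
```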
We need https://www.mail-archive.com/linux-block@vger.kernel.org/msg38060.html in order to use `--io-timeout=0`. This patch is part of kernel 5.4. Since minikube doesn't have a v5.4 kernel yet, let's use the io-timeout value conditionally based on the kernel version in our e2e. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Nit: change the title of the PR to "rbd: fix rbd-nbd io-timeout to never abort".
Thanks @idryomov, done now :-)
I assume by backend image you mean filesystem image? The filesystem image should always be corruption-free because filesystems are built to withstand sudden power loss or kernel panics. Unless you foresee reattach not completing in time continuously and on a regular basis, there is nothing wrong with occasionally losing in-flight I/O and going through journal recovery on the next mount.
Does the download of the new image start after detach (i.e. "hanging" in-flight I/O)? That seems wrong even if everything in between (container registry, network, etc) were 100% reliable.
How is it different from a crash with no reattach in sight?
Already mentioned above.
If the node is so oversubscribed that rbd-nbd process fails to start, the user workload is already impacted.
Again, how is it different from a "regular" crash?
I guess I just don't see why this case needs handling. I'm working off of the assumption that reattach failures would be (or at least are expected to be) as rare as rbd-nbd crashes or kernel panics. Even if the download of the new image happens after detach (instead of before as it should), how often do you expect the upgrade to fail?
Describe what this PR does
With the tests at CI, it looks like the IO is timing out after 30 seconds (the default with rbd-nbd). Since we have tweaked reattach-timeout to 300 seconds in ceph-csi, we need to explicitly set io-timeout on the device too, as it doesn't make any sense to keep `io-timeout < reattach-timeout`.
Hence we set io-timeout for rbd-nbd to 0. Specifying io-timeout 0 tells the nbd driver to not abort the request and instead see if it can be restarted on another socket.
We need https://www.mail-archive.com/linux-block@vger.kernel.org/msg38060.html in order to use `--io-timeout=0`. This patch is part of kernel v5.4. Since minikube doesn't have a v5.4 kernel yet, let's use the io-timeout value conditionally based on the kernel version in our e2e.
Thanks @idryomov for the timeout suggestion.
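A minimal sketch of the kind of kernel-version gate the e2e can apply; the version parsing and the fallback io-timeout value for older kernels are assumptions, not the exact e2e code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelSupportsNbdIOTimeoutZero reports whether the given kernel release is at
// least v5.4, which carries the nbd fix needed for `--io-timeout=0`.
// Only major.minor are inspected; real parsing would need to be more careful.
func kernelSupportsNbdIOTimeoutZero(release string) bool {
	parts := strings.SplitN(release, ".", 3) // e.g. "5.4.0-80-generic"
	if len(parts) < 2 {
		return false
	}
	major, errMajor := strconv.Atoi(parts[0])
	minor, errMinor := strconv.Atoi(parts[1])
	if errMajor != nil || errMinor != nil {
		return false
	}
	return major > 5 || (major == 5 && minor >= 4)
}

func main() {
	ioTimeout := 30 // hypothetical fallback for kernels older than v5.4
	if kernelSupportsNbdIOTimeoutZero("5.4.0-80-generic") {
		ioTimeout = 0
	}
	fmt.Printf("--options io-timeout=%d\n", ioTimeout)
}
```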
Updates: #2204
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Suggested-by: Ilya Dryomov <idryomov@redhat.com>
Show available bot commands
These commands are normally not required, but in case of issues, leave any of the following bot commands in an otherwise empty comment in this PR:
- `/retest ci/centos/<job-name>`: retest the `<job-name>` after an unrelated failure (please report the failure too!)
- `/retest all`: run this in case the CentOS CI failed to start/report any test progress or results