Detect and remove stale connections #126

Merged: 4 commits merged into cloudbase:main on Jun 7, 2023

petrutlucian94 (Member)

Storport resets the LUN after hitting request timeouts. However,
it never actually removes the disk. A stale disk can be
troublesome, leading to an unresponsive host in certain situations
(e.g. cache deadlocks or hanging persistent reservation
requests).

For this reason, we'll detect stale connections and disconnect
the disk. This feature, along with its timeouts, is configurable.
By default, we'll consider a connection to be stale if at least one
request older than 15s was aborted and no IO reply was received
in the last minute.
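The default staleness rule can be sketched in C as follows. This is a minimal illustration, not WNBD's implementation: the struct and function names are hypothetical, timestamps are assumed to be monotonic milliseconds, and the "old request aborted" flag is assumed to stay set once raised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tunables; 15000/60000 would mirror the described defaults. */
typedef struct {
    uint64_t aborted_req_age_ms; /* an aborted request counts only if older than this */
    uint64_t reply_timeout_ms;   /* stale if no IO reply arrived within this window */
} stale_config;

/* Hypothetical per-connection state. All timestamps are in milliseconds. */
typedef struct {
    uint64_t last_reply_ms;   /* when the last IO reply was received */
    bool old_request_aborted; /* set once a sufficiently old request is aborted (assumed sticky) */
} conn_state;

/* Call when a request is aborted, e.g. after a Storport timeout. */
void on_request_aborted(conn_state *c, const stale_config *cfg,
                        uint64_t queued_ms, uint64_t now_ms)
{
    assert(c != NULL && cfg != NULL && now_ms >= queued_ms);
    if (now_ms - queued_ms > cfg->aborted_req_age_ms)
        c->old_request_aborted = true;
}

/* Call whenever an IO reply is received. */
void on_reply_received(conn_state *c, uint64_t now_ms)
{
    assert(c != NULL);
    c->last_reply_ms = now_ms;
}

/* Both conditions must hold: an old request was aborted AND replies stopped. */
bool is_connection_stale(const conn_state *c, const stale_config *cfg,
                         uint64_t now_ms)
{
    assert(c != NULL && cfg != NULL && now_ms >= c->last_reply_ms);
    return c->old_request_aborted &&
           now_ms - c->last_reply_ms >= cfg->reply_timeout_ms;
}
```

Requiring both conditions means a single slow request does not trigger a disconnect while the daemon is still replying to other IO.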

At the same time, we'll include the following timestamps in the
wnbd-client.exe stats output:

  • last received request
  • last submitted request
  • last received reply

Signed-off-by: Lucian Petrut <lpetrut@cloudbasesolutions.com>

In order to detect stale daemons, we'll record the timestamp of
certain IO events:

* last received request
* last submitted request
* last received reply
* queue timestamp for each async IO request
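A minimal C sketch of this bookkeeping follows. It assumes hypothetical names (`io_timestamps`, `pending_req`, the `record_*` helpers and `oldest_pending_age_ms`) and millisecond timestamps; WNBD's actual structures may differ. The per-request queue timestamp is what makes it possible to tell how long the oldest outstanding request has been waiting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device IO event timestamps, in milliseconds. */
typedef struct {
    uint64_t last_req_received_ms;   /* last request received */
    uint64_t last_req_submitted_ms;  /* last request submitted */
    uint64_t last_reply_received_ms; /* last reply received */
} io_timestamps;

/* Hypothetical async request descriptor carrying its queue timestamp. */
typedef struct {
    uint64_t tag;       /* request identifier */
    uint64_t queued_ms; /* when the request was queued */
} pending_req;

void record_request_received(io_timestamps *t, pending_req *r,
                             uint64_t tag, uint64_t now_ms)
{
    assert(t != NULL && r != NULL);
    t->last_req_received_ms = now_ms;
    r->tag = tag;
    r->queued_ms = now_ms; /* per-request queue timestamp */
}

void record_request_submitted(io_timestamps *t, uint64_t now_ms)
{
    assert(t != NULL);
    t->last_req_submitted_ms = now_ms;
}

void record_reply_received(io_timestamps *t, uint64_t now_ms)
{
    assert(t != NULL);
    t->last_reply_received_ms = now_ms;
}

/* Age of the oldest pending request, or 0 when nothing is pending. */
uint64_t oldest_pending_age_ms(const pending_req *reqs, size_t count,
                               uint64_t now_ms)
{
    uint64_t oldest = 0;
    for (size_t i = 0; i < count; i++) {
        uint64_t age = now_ms - reqs[i].queued_ms;
        if (age > oldest)
            oldest = age;
    }
    return oldest;
}
```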

Signed-off-by: Lucian Petrut <lpetrut@cloudbasesolutions.com>

We'll include the following timestamps in the wnbd-client stats
output:

* last received request
* last submitted request
* last received reply

The PR test job installs qemu in order to run NBD tests.

We're hitting checksum errors with the latest QEMU Chocolatey
package (released one week ago), so we'll switch to the previous
version.

Signed-off-by: Lucian Petrut <lpetrut@cloudbasesolutions.com>
petrutlucian94 merged commit 8f8663e into cloudbase:main on Jun 7, 2023
3 checks passed