Ignore BIO_FLUSH command if there is no volatile cache to flush #710

Closed


santhoshkumar-mani (Contributor) opened this pull request:

When a storage device reports that it does not support cache flush, the GEOM
disk layer by default returns EOPNOTSUPP in response to a BIO_FLUSH command.

On AWS, local volumes do not advertise themselves as having a write cache
enabled. When they are selected for L3 on all-HDD nodes, the L3 subsystem may
inadvertently kick out these L3 devices if a BIO_FLUSH command fails with an
EOPNOTSUPP return code. The fix is to make GEOM disk return success (0) when
this condition occurs, and to add a sysctl that makes this error handling
configurable.
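
In code terms, the change would look roughly like the following in the
BIO_FLUSH path of sys/geom/geom_disk.c (a sketch rather than the literal
diff; the knob name flush_notsup_succeed is the one settled on later in the
review):

    /* In g_disk_start(), sketch of the BIO_FLUSH case: */
    case BIO_FLUSH:
        if (!(dp->d_flags & DISKFLAG_CANFLUSHCACHE)) {
            /*
             * The device has no volatile write cache to flush.
             * Default behavior: fail with EOPNOTSUPP.  With the
             * new sysctl enabled, report success instead, since
             * the flush would be a no-op anyway.
             */
            error = sc->flush_notsup_succeed ? 0 : EOPNOTSUPP;
            break;
        }
        /* ... otherwise pass the flush down to the driver ... */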

bsdimp (Member) commented Apr 5, 2023

What is the L3 subsystem? It kinda sounds like a bug there at first blush... there are several block devices without caches to flush that cope with this situation correctly... I'm not sure I understand here, so please help me fill in the gaps...

danielwryan commented

This was in the context of a filesystem above GEOM built on FreeBSD, with L3 being one path that was submitting the cache flush requests.

The general idea is that some consumers might prefer that a request to flush the cache of a drive without one be treated as success, rather than having to track the device capability themselves and use it to handle EOPNOTSUPP (or to avoid issuing flushes). As I recall, in the hardware world a number of drives would let you issue cache flush commands even if they had no such cache; in this case GEOM has collected that information from the drive's NVMe capability data and rejects cache flush bios if the drive reports no volatile cache. That is accurate and not wrong, since the flush would be a no-op, but for some consumers treating it as success may simplify their own implementation.

If this isn't something upstream would take, we would probably head down the path you suggested of modifying the consumer behavior :)
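
A rough sketch of that consumer-side alternative, for contrast (all names
here are hypothetical, not code from this PR):

    /* Hypothetical consumer-side handling of a BIO_FLUSH completion. */
    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/errno.h>

    struct my_consumer_softc {       /* made-up per-device state */
        bool no_volatile_cache;      /* set once EOPNOTSUPP is seen */
    };

    static int
    my_flush_done(struct my_consumer_softc *sc, struct bio *bp)
    {
        if (bp->bio_cmd == BIO_FLUSH && bp->bio_error == EOPNOTSUPP) {
            /*
             * No volatile cache, so the flush was a no-op.  Remember
             * that, stop issuing BIO_FLUSH from now on, and report
             * success instead of kicking the device.
             */
            sc->no_volatile_cache = true;
            return (0);
        }
        return (bp->bio_error);
    }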

    SYSCTL_ADD_BOOL(&sc->sysctl_ctx,
        SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, "ignore_flush",
        CTLFLAG_RWTUN, &sc->ignore_flush, sizeof(sc->ignore_flush),
        "Do not return EOPNOTSUPP if there is no cache to flush");
bsdimp (Member) commented on this diff, Apr 5, 2023

Oh, never mind... ENOTSUP == EOPNOTSUPP

bsdimp (Member) commented Apr 5, 2023

Some producers of BIO_FLUSH use ENOTSUP as a hint to never send BIO_FLUSH again. Others completely ignore the error. Others just pass the error through to the next layer. UFS ignores all errors (it just uses the flush to enforce ordering in one place that's hard to do with I/O scheduling). ZFS ignores it and also never sends another one, but reacts to other errors. UFS schedules it with BIO_ORDERED, and it's more important that that be honored.

So converting ENOTSUP to 0 feels wrong to me, but the current code in the tree would at most have an optimization bypassed. So it's not terrible to cope with other upper layers that assign more malice to this failure than is really there (e.g., L3 dropping it).

It's a terrible name, though. flush_notsup_succeed is better, but still not great.

It also needs to be documented... but there's no good man page to document this weird quirk.

santhoshkumar-mani (Contributor, Author) commented Apr 6, 2023

> Some producers of BIO_FLUSH use ENOTSUP as a hint to never send BIO_FLUSH again. Others completely ignore the error. Others just pass the error through to the next layer. UFS ignores all errors (it just uses the flush to enforce ordering in one place that's hard to do with I/O scheduling). ZFS ignores it and also never sends another one, but reacts to other errors. UFS schedules it with BIO_ORDERED, and it's more important that that be honored.
>
> So converting ENOTSUP to 0 feels wrong to me, but the current code in the tree would at most have an optimization bypassed. So it's not terrible to cope with other upper layers that assign more malice to this failure than is really there (e.g., L3 dropping it).
>
> It's a terrible name, though. flush_notsup_succeed is better, but still not great.
>
> It also needs to be documented... but there's no good man page to document this weird quirk.

Thanks for your comments. It appears to me that the suggestion is to not return 0, since there are other upper layers/consumers relying on this error code, and for our own consumer, to probably drop the handling in L3 for such an error code.

Is my understanding right? If yes, maybe close/drop this PR?

bsdimp (Member) commented Apr 7, 2023

No. I think that we should do this PR. While normally this functionality would be counterproductive, the sysctl that you have to enable makes it a reasonable compromise for people who need to cope with an upper layer / filesystem not honoring the ENOTSUP protocol correctly. My only real objection is the name.

santhoshkumar-mani (Contributor, Author) commented

> No. I think that we should do this PR. While normally this functionality would be counterproductive, the sysctl that you have to enable makes it a reasonable compromise for people who need to cope with an upper layer / filesystem not honoring the ENOTSUP protocol correctly. My only real objection is the name.

Got it. flush_notsup_succeed looks good to me. Updated the diff with the same.
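
With the rename, the declaration from the hunk above would read roughly:

    SYSCTL_ADD_BOOL(&sc->sysctl_ctx,
        SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, "flush_notsup_succeed",
        CTLFLAG_RWTUN, &sc->flush_notsup_succeed, sizeof(sc->flush_notsup_succeed),
        "Do not return EOPNOTSUPP if there is no cache to flush");

Since it is CTLFLAG_RWTUN, it should be settable at runtime or as a loader
tunable, presumably per disk under the kern.geom.disk.<name> tree, e.g.
sysctl kern.geom.disk.nda0.flush_notsup_succeed=1 (device name illustrative).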

santhoshkumar-mani (Contributor, Author) commented

Please let me know if there is anything else to be done from my end to close this one.

bsdimp (Member) commented Jul 1, 2023

Landed... Sorry for the delay. d3eb9d3

bsdimp closed this Jul 1, 2023
bsdimp added the merged label Jul 1, 2023
freebsd-git pushed a commit that referenced this pull request Jul 1, 2023
When a storage device reports that it does not support cache flush, the
GEOM disk layer by default returns EOPNOTSUPP in response to a BIO_FLUSH
command.

On AWS, local volumes do not advertise themselves as having a write cache
enabled.  When they are selected for L3 on all-HDD nodes, the L3
subsystem may inadvertently kick out these L3 devices if a BIO_FLUSH
command fails with an EOPNOTSUPP return code.  The fix is to make GEOM
disk return success (0) when this condition occurs, and to add a sysctl
that makes this error handling configurable.

Reviewed by: imp
Pull Request: #710
bsdjhb pushed a commit to bsdjhb/cheribsd that referenced this pull request Sep 2, 2023