Ignore BIO_FLUSH command if there is no volatile cache to flush #710

Closed


santhoshkumar-mani (Contributor) opened this pull request:

When a storage device reports that it does not support cache flush, the GEOM
disk layer by default returns EOPNOTSUPP in response to a BIO_FLUSH command.

On AWS, local volumes do not advertise themselves as having a write cache
enabled. When they are selected for L3 on all-HDD nodes, the L3 subsystem may
inadvertently kick out these L3 devices if a BIO_FLUSH command fails with an
EOPNOTSUPP return code. The fix is to make GEOM disk return success (0) when
this condition occurs, and to add a sysctl that makes this error handling
configurable.
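
In code terms, the change would look roughly like the following in the
BIO_FLUSH path of sys/geom/geom_disk.c (a sketch rather than the literal
diff; the knob name flush_notsup_succeed is the one settled on later in the
review):

    /* In g_disk_start(), sketch of the BIO_FLUSH case: */
    case BIO_FLUSH:
        if (!(dp->d_flags & DISKFLAG_CANFLUSHCACHE)) {
            /*
             * The device has no volatile write cache to flush.
             * Default behavior: fail with EOPNOTSUPP.  With the
             * new sysctl enabled, report success instead, since
             * the flush would be a no-op anyway.
             */
            error = sc->flush_notsup_succeed ? 0 : EOPNOTSUPP;
            break;
        }
        /* ... otherwise pass the flush down to the driver ... */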

bsdimp (Member) commented Apr 5, 2023

What is the L3 subsystem? It kinda sounds like a bug there at first blush... there are several block devices without caches to flush that cope with this situation correctly... I'm not sure I understand here, so please help me fill in the gaps...

danielwryan commented

This was in the context of a filesystem above GEOM built on FreeBSD, with L3 being one path that was submitting the cache flush requests.

The general idea is that some consumers might prefer that a request to flush the cache of a drive without one be treated as success, rather than having to track the device capability themselves and use it to handle EOPNOTSUPP (or to avoid issuing flushes). As I recall, in the hardware world a number of drives would let you issue cache flush commands even if they had no such cache; in this case GEOM has collected that information from the drive's NVMe capability data and rejects cache flush bios if the drive reports no volatile cache. That is accurate and not wrong, since the flush would be a no-op, but for some consumers treating it as success may simplify their own implementation.

If this isn't something upstream would take, we would probably head down the path you suggested of modifying the consumer behavior :)
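
A rough sketch of that consumer-side alternative, for contrast (all names
here are hypothetical, not code from this PR):

    /* Hypothetical consumer-side handling of a BIO_FLUSH completion. */
    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/errno.h>

    struct my_consumer_softc {       /* made-up per-device state */
        bool no_volatile_cache;      /* set once EOPNOTSUPP is seen */
    };

    static int
    my_flush_done(struct my_consumer_softc *sc, struct bio *bp)
    {
        if (bp->bio_cmd == BIO_FLUSH && bp->bio_error == EOPNOTSUPP) {
            /*
             * No volatile cache, so the flush was a no-op.  Remember
             * that, stop issuing BIO_FLUSH from now on, and report
             * success instead of kicking the device.
             */
            sc->no_volatile_cache = true;
            return (0);
        }
        return (bp->bio_error);
    }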

    SYSCTL_ADD_BOOL(&sc->sysctl_ctx,
        SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, "ignore_flush",
        CTLFLAG_RWTUN, &sc->ignore_flush, sizeof(sc->ignore_flush),
        "Do not return EOPNOTSUPP if there is no cache to flush");
bsdimp (Member) commented on this diff, Apr 5, 2023

Oh, never mind... ENOTSUP == EOPNOTSUPP

bsdimp (Member) commented Apr 5, 2023

Some producers of BIO_FLUSH use ENOTSUP as a hint to never send BIO_FLUSH again. Others completely ignore the error. Others just pass the error through to the next layer. UFS ignores all errors (it just uses the flush to enforce ordering in one place that's hard to do with I/O scheduling). ZFS ignores it and also never sends another one, but reacts to other errors. UFS schedules it with BIO_ORDERED, and it's more important that that be honored.

So converting ENOTSUP to 0 feels wrong to me, but the current code in the tree would at most have an optimization bypassed. So it's not terrible to cope with other upper layers that assign more malice to this failure than is really there (e.g., L3 dropping it).

It's a terrible name, though. flush_notsup_succeed is better, but still not great.

It also needs to be documented... but there's no good man page to document this weird quirk.

santhoshkumar-mani (Contributor, Author) commented Apr 6, 2023

> Some producers of BIO_FLUSH use ENOTSUP as a hint to never send BIO_FLUSH again. Others completely ignore the error. Others just pass the error through to the next layer. UFS ignores all errors (it just uses the flush to enforce ordering in one place that's hard to do with I/O scheduling). ZFS ignores it and also never sends another one, but reacts to other errors. UFS schedules it with BIO_ORDERED, and it's more important that that be honored.
>
> So converting ENOTSUP to 0 feels wrong to me, but the current code in the tree would at most have an optimization bypassed. So it's not terrible to cope with other upper layers that assign more malice to this failure than is really there (e.g., L3 dropping it).
>
> It's a terrible name, though. flush_notsup_succeed is better, but still not great.
>
> It also needs to be documented... but there's no good man page to document this weird quirk.

Thanks for your comments. It appears to me that the suggestion is to not return 0, since there are other upper layers/consumers relying on this error code, and for our own consumer, to probably drop the handling in L3 for such an error code.

Is my understanding right? If yes, maybe close/drop this PR?

bsdimp (Member) commented Apr 7, 2023

No. I think that we should do this PR. While normally this functionality would be counterproductive, the sysctl that you have to enable makes it a reasonable compromise for people who need to cope with an upper layer / filesystem not honoring the ENOTSUP protocol correctly. My only real objection is the name.

santhoshkumar-mani (Contributor, Author) commented

> No. I think that we should do this PR. While normally this functionality would be counterproductive, the sysctl that you have to enable makes it a reasonable compromise for people who need to cope with an upper layer / filesystem not honoring the ENOTSUP protocol correctly. My only real objection is the name.

Got it. flush_notsup_succeed looks good to me. Updated the diff with the same.
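
With the rename, the declaration from the hunk above would read roughly:

    SYSCTL_ADD_BOOL(&sc->sysctl_ctx,
        SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, "flush_notsup_succeed",
        CTLFLAG_RWTUN, &sc->flush_notsup_succeed, sizeof(sc->flush_notsup_succeed),
        "Do not return EOPNOTSUPP if there is no cache to flush");

Since it is CTLFLAG_RWTUN, it should be settable at runtime or as a loader
tunable, presumably per disk under the kern.geom.disk.<name> tree, e.g.
sysctl kern.geom.disk.nda0.flush_notsup_succeed=1 (device name illustrative).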

santhoshkumar-mani (Contributor, Author) commented

Please let me know if there is anything else to be done from my end to close this one.

bsdimp (Member) commented Jul 1, 2023

Landed... Sorry for the delay. d3eb9d3

bsdimp closed this Jul 1, 2023
bsdimp added the merged label Jul 1, 2023
freebsd-git pushed a commit that referenced this pull request Jul 1, 2023
When a storage device reports that it does not support cache flush, the
GEOM disk layer by default returns EOPNOTSUPP in response to a BIO_FLUSH
command.

On AWS, local volumes do not advertise themselves as having a write cache
enabled.  When they are selected for L3 on all-HDD nodes, the L3
subsystem may inadvertently kick out these L3 devices if a BIO_FLUSH
command fails with an EOPNOTSUPP return code.  The fix is to make GEOM
disk return success (0) when this condition occurs, and to add a sysctl
that makes this error handling configurable.

Reviewed by: imp
Pull Request: #710
bsdjhb pushed a commit to bsdjhb/cheribsd that referenced this pull request Sep 2, 2023