Inclusion into upstream kernel plans? #1

Closed
KlavsKlavsen opened this issue May 13, 2020 · 50 comments

Comments

@KlavsKlavsen

Hi guys,

This is a fantastic optimization of a very important part of the system.. Who doesn't use LUKS today :)
I haven't been able to find any mention of this patch on any of the kernel mailing lists :(

What are the plans for including this in the upstream kernel, so everyone can benefit?

@ignatk
Contributor

ignatk commented May 13, 2020

Hi,

As you can see, the current approach is "all-or-nothing". That is: you either get complete inline behaviour or not. While it greatly improved our IO cache workload (with smaller IO block sizes), other synthetic benchmarks with larger IO block sizes (~>2MB) actually sometimes show worse results.

Secondly, the current approach is rather x86-specific, but the upstream kernel needs something more generic, which would at least run on different architectures.

The plan going forward is to profile specific parts of the device-mapper/dm-crypt stack in more detail and get a better understanding of what introduces the most overhead, so we can provide more meaningful commit messages for kernel maintainers with some backing data.

I would not be surprised if the force_inline flag proposed here turned into several flags, which tune different parts of the stack independently.

That said - if you found the patches useful, we would appreciate it if you could share some details about your IO workload.

@KlavsKlavsen
Author

I'll have to get back to you on that once I actually try it out.. But my most important workload would be Ceph disks encrypted with LUKS/dm-crypt, so each pod in k8s can choose whether it needs encryption at rest for its data (given the huge performance impact it has).

@ignatk
Contributor

ignatk commented Jun 20, 2020

I have initiated the discussion with upstream. You may follow https://lkml.org/lkml/2020/6/19/1934

@ignatk closed this as completed Jun 20, 2020
@MrM40

MrM40 commented Jun 20, 2020

Thank you for the efforts to improve the performance upstream, it's very valuable for all of us :-) :-) :-)

@EnderShadow

I don't mean to resurrect this issue, but this is for anyone who stumbles across this in the future. As of kernel version 5.9, the changes have been upstreamed. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.9-rc1&id=39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877

@KlavsKlavsen
Author

It's sad it's not a run-time option (via a /proc knob) - one has to recompile the kernel.. which becomes annoying with all the kernel upgrades that happen all the time :(

@ignatk
Contributor

ignatk commented Aug 21, 2020

Well, even with proc support you would have to recompile at least once to get the code in the first place. Once you have a kernel with the above flags, you can enable/disable them at runtime by manipulating the table with dmsetup table ... | dmsetup reload and doing dmsetup suspend && dmsetup resume. See https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ for examples.
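A minimal sketch of that reload cycle (the mapping name cryptdisk is hypothetical; the optional-argument count at the end of the table line must match the number of flags you append):

dmsetup table --showkeys cryptdisk > table.txt
# edit table.txt: append the optional-argument count and the flags, e.g.
#   ... 2 no_read_workqueue no_write_workqueue
dmsetup reload cryptdisk < table.txt
dmsetup suspend cryptdisk && dmsetup resume cryptdisk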

@neitsab

neitsab commented Nov 18, 2020

Hi, thank you very much for all your work. Is there a step-by-step howto available to make use of these changes? As an end user I'm having a bit of a hard time understanding what needs to be done to benefit from them. For example, how does one specify the new flags while recompiling the kernel? And should we also compile the xtsproxy module from the repo, since the dmsetup table examples rely on it?
Cheers

@ignatk
Contributor

ignatk commented Nov 18, 2020

What distribution are you using? Maybe it makes sense to just wait for the upstream changes to get propagated?

@neitsab

neitsab commented Nov 18, 2020

Thanks for your quick answer! I'm using Arch Linux with kernel 5.9.2, and I ask because I'm provisioning a new machine with NVMe storage where the encryption speedup would be most appreciated, especially since I'll be transferring a lot of user data for the initial setup.

Maybe it makes sense just to wait for the upstream changes to get propagated?

So you mean that once we have kernel 5.9 we don't need to recompile, as indicated in a previous comment? Nothing else is needed but the dmsetup table ... | dmsetup reload operation? The examples in the article used capi:xts-aes-xtsproxy-plain64, which wasn't part of the merged changes, if I understand correctly.

What I am mainly trying to grasp is how to make use of the added flags (DM_CRYPT_NO_READ_WORKQUEUE and DM_CRYPT_NO_WRITE_WORKQUEUE, if I got that right). Is it during kernel compilation, module loading, a dmsetup change, or something else? And what is the equivalent of the 1 force_inline used in the article for dmsetup table? Are they to be used in conjunction with the xtsproxy module, or can they be used independently?

Thanks

@ignatk
Contributor

ignatk commented Nov 18, 2020

So you mean that once we have kernel 5.9 we don't need to recompile as indicated in a previous comment?

I think so. xts-aes-xtsproxy-plain64 was not merged, but we don't have enough evidence to say it is actually useful. Many other people tested the change without it, and it seems the main performance contributor is the new flags, not the xts module. In the end, if you do want to try it, I think there is no need to recompile the whole kernel - just compile the xts module against your current 5.9 kernel.
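For reference, a rough out-of-tree build sketch (the repository's own Makefile and module name may differ; this assumes kernel headers for the running kernel are installed):

# from a checkout of this repository
make -C /lib/modules/$(uname -r)/build M=$PWD modules
insmod ./xtsproxy.ko
grep -B 1 -A 10 xtsproxy /proc/crypto   # check that the cipher registered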

Are they to be used in conjunction with xtsproxy module or can they be used independently?

Yes, you can use the flags independently.

It seems you don't need to mess with dmsetup table ... anymore. The latest release of cryptsetup includes support for these flags:

...
* Added support for new no_read/write_workqueue dm-crypt options (kernel 5.9).

  These performance options, introduced in kernel 5.9, configure dm-crypt
  to bypass read or write workqueues and run encryption synchronously.

  Use --perf-no_read_workqueue or --perf-no_write_workqueue cryptsetup arguments
  to use these dm-crypt flags.

  These options are available only for low-level dm-crypt performance tuning,
  use only if you need a change to default dm-crypt behavior.

  For LUKS2, these flags can be persistently stored in metadata with
  the --persistent option.
...

If you're on Arch Linux there is a good chance you already have v2.3.4 available (yes, it is), so just use the standard cryptsetup commands with the --perf-no_read_workqueue and/or --perf-no_write_workqueue flags when you LUKS-format your new NVMe drives.
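For example, a sketch with a hypothetical partition /dev/nvme0n1p2 and mapping name cryptnvme (cryptsetup >= 2.3.4):

cryptsetup luksFormat /dev/nvme0n1p2
cryptsetup open --perf-no_read_workqueue --perf-no_write_workqueue /dev/nvme0n1p2 cryptnvme
# for LUKS2, store the flags persistently in the header so future opens pick them up:
cryptsetup refresh --perf-no_read_workqueue --perf-no_write_workqueue --persistent cryptnvme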

@neitsab

neitsab commented Nov 18, 2020

Thank you very much, this is exactly the information I was looking for! I'll try that and report the results. Much appreciated.

@neitsab

neitsab commented Nov 19, 2020

Preliminary results with ramfs confirm the speed gains:

modprobe brd rd_nr=1 rd_size=4194304
echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | dmsetup create eram0
echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 no_read_workqueue no_write_workqueue' | dmsetup create inline-eram0
dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 8.67908 s, 495 MB/s
875eee50d50ba51a3a31f7e275bdc04c1615844785c374cfe7ae6fc0c8014287  -
dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 9.65908 s, 445 MB/s
875eee50d50ba51a3a31f7e275bdc04c1615844785c374cfe7ae6fc0c8014287  -
dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 8.70909 s, 493 MB/s
875eee50d50ba51a3a31f7e275bdc04c1615844785c374cfe7ae6fc0c8014287  -

Now I need to find a good IO benchmark example involving NVMe storage to see what the effect is.

Note: at first I tried to do the testing with:

cryptsetup luksFormat /dev/ram0
<dd test ram0>
cryptsetup open /dev/ram0 eram0
<dd test eram0>
cryptsetup refresh --perf-no_read_workqueue --perf-no_write_workqueue eram0
<dd test refreshed eram0>

The speed gains were exactly the same as in the previous examples; however, the two sha256sums of eram0 did not match the one from ram0 (though they were consistent with each other before/after refreshing with the perf parameters).

Edit: maybe this results from not using a detached header and overwriting it? From the article:

Those who are familiar with LUKS/dm-crypt might have noticed we used a LUKS detached header here. Normally, LUKS stores the password-encrypted disk encryption key on the same disk as the data, but since we want to compare read/write performance between encrypted and unencrypted devices, we might accidentally overwrite the encrypted key during our benchmarking later. Keeping the encrypted key in a separate file avoids this problem for the purposes of this post.
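For comparison, a sketch of the detached-header variant (file names are illustrative; LUKS2 headers are 16MiB by default):

truncate -s 16M /tmp/luks-header.img
cryptsetup luksFormat /dev/ram0 --header /tmp/luks-header.img
cryptsetup open --header /tmp/luks-header.img /dev/ram0 eram0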

@ignatk
Contributor

ignatk commented Nov 19, 2020

Edit: maybe this results from not using a detached header and overwriting it? From the article:

This, and the fact that with cryptsetup you're probably not using capi:ecb(cipher_null) anymore, but some proper encryption, most likely aes-xts-plain64.

So by hashing ram0 you're hashing raw (encrypted) data, but by hashing eram0 you're hashing plaintext (unencrypted) data.
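A quick sanity check (sketch): hash a small prefix of each device - under real encryption the two digests should differ.

dd if=/dev/ram0 bs=1M count=1 2>/dev/null | sha256sum          # ciphertext
dd if=/dev/mapper/eram0 bs=1M count=1 2>/dev/null | sha256sum  # plaintext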

@neitsab

neitsab commented Nov 19, 2020

Ha, of course, that makes sense. Thanks!

@darena67

darena67 commented Dec 1, 2020

Hi, I have tried running sudo cryptsetup refresh --perf-no_read_workqueue --perf-no_write_workqueue /dev/mapper/root with a 5.9.x kernel and my system locks up with the attached messages. Am I doing this wrong?

[attached photos of the console errors: IMG_20201201_080947, IMG_20201201_080854]

@ignatk
Contributor

ignatk commented Dec 1, 2020

No, you're doing it right. Which exact kernel version are you using? Is it a distro or a self-compiled kernel?

@linuxrrze

Hi, same error here:

I'm running Ubuntu 20.04 with Kernel 5.9.8 and 5.9.10 (from here https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.9.8/amd64/).

One system crashed tonight with those scheduling errors. The cause seems to be high I/O load from a btrfs scrub job (scheduled monthly). I have reproduced this multiple times since then by forcing some extra I/O load on the block devices. The system had been running fine (with this kernel) for some days.
A second system (similar setup) just crashed during a backup job - same errors.

I never observed this behavior when using the default Ubuntu kernel with the two updated modules from Cloudflare added.
I'm currently building them against Ubuntu's 5.4.0-56 kernel and will stay with that on production systems.

Further testing only on non-production machines / VMs.

Some details about my setup (maybe that's relevant): I'm running LVM + dm-crypt + btrfs

@ignatk
Contributor

ignatk commented Dec 1, 2020

It would be very helpful if you could also rebuild the dm-crypt module for your kernel with this patch applied (torvalds/linux@cd74693) and let me know if the issue persists.
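A rough sketch of such a rebuild (steps are illustrative and vary by distro; the tree must be configured to match the running kernel):

cd linux                      # kernel source matching your running kernel
git cherry-pick cd74693       # apply the fix
make M=drivers/md modules     # rebuild only the md modules
cp drivers/md/dm-crypt.ko /lib/modules/$(uname -r)/kernel/drivers/md/
depmod -a                     # then regenerate your initramfs and reboot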

@evilbulgarian

Happens on 5.9.9 and on 5.10.0-rc6 with a custom-built kernel on Gentoo; here is the config:
kconfig-5.10.txt

@evilbulgarian

[Tue Dec 1 10:22:59 2020] BTRFS info (device dm-0): devid 2 device path /dev/mapper/root_1 changed to /dev/dm-1 scanned by udevd (22281)
[Tue Dec 1 10:22:59 2020] BTRFS info (device dm-0): devid 2 device path /dev/dm-1 changed to /dev/mapper/root_1 scanned by udevd (22281)
[Tue Dec 1 10:23:06 2020] BTRFS info (device dm-0): devid 1 device path /dev/mapper/root changed to /dev/dm-0 scanned by udevd (22311)
[Tue Dec 1 10:23:06 2020] BTRFS info (device dm-0): devid 1 device path /dev/dm-0 changed to /dev/mapper/root scanned by udevd (22311)
[Tue Dec 1 10:23:36 2020] BUG: scheduling while atomic: fio/22422/0x00000102
[Tue Dec 1 10:23:36 2020] Modules linked in: md4 md5 sha512_ssse3 sha512_generic cmac hmac nls_utf8 cifs libdes libarc4 cfg80211 autofs4 ipv6 tcp_bic binfmt_misc btusb btrtl btbcm btintel bluetooth snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio rfkill snd_hda_intel ecdh_generic snd_intel_dspcfg ecc snd_usb_audio snd_hda_codec snd_hwdep snd_hda_core snd_usbmidi_lib amdgpu cdc_acm snd_pcm mousedev input_leds led_class snd_seq_midi iommu_v2 snd_seq_midi_event gpu_sched evdev ttm snd_rawmidi i2c_algo_bit drm_kms_helper snd_seq syscopyarea sysfillrect sysimgblt snd_seq_device fb_sys_fops snd_timer k10temp ccp drm snd sha1_generic soundcore button efivarfs dm_crypt encrypted_keys dm_mod r8169 realtek mdio_devres libphy usbhid
[Tue Dec 1 10:23:36 2020] CPU: 5 PID: 22422 Comm: fio Tainted: G W 5.10.0-rc6 #51
[Tue Dec 1 10:23:36 2020] Hardware name: Micro-Star International Co., Ltd. MS-7B89/B450M MORTAR (MS-7B89), BIOS 1.D0 01/06/2020
[Tue Dec 1 10:23:36 2020] Call Trace:
[Tue Dec 1 10:23:36 2020] <IRQ>
[Tue Dec 1 10:23:36 2020] dump_stack+0x57/0x6a
[Tue Dec 1 10:23:36 2020] __schedule_bug.cold+0x47/0x53
[Tue Dec 1 10:23:36 2020] __schedule+0x5ca/0x690
[Tue Dec 1 10:23:36 2020] ? try_to_wake_up+0x1ff/0x390
[Tue Dec 1 10:23:36 2020] schedule+0x56/0xc0
[Tue Dec 1 10:23:36 2020] schedule_timeout+0xc3/0x100
[Tue Dec 1 10:23:36 2020] wait_for_completion+0x7f/0xe0
[Tue Dec 1 10:23:36 2020] crypt_convert.constprop.0+0x779/0xf40 [dm_crypt]
[Tue Dec 1 10:23:36 2020] ? blk_update_request+0x1ff/0x350
[Tue Dec 1 10:23:36 2020] ? __iommu_dma_unmap+0xa3/0x110
[Tue Dec 1 10:23:36 2020] ? sbitmap_queue_clear+0x3b/0x60
[Tue Dec 1 10:23:36 2020] ? __blk_mq_free_request+0x9b/0xe0
[Tue Dec 1 10:23:36 2020] kcryptd_crypt_read_convert+0xb9/0xf0 [dm_crypt]
[Tue Dec 1 10:23:36 2020] tasklet_action_common.constprop.0+0x9b/0xa0
[Tue Dec 1 10:23:36 2020] __do_softirq+0xb8/0x1f2
[Tue Dec 1 10:23:36 2020] asm_call_irq_on_stack+0x12/0x20
[Tue Dec 1 10:23:36 2020] </IRQ>
[Tue Dec 1 10:23:36 2020] do_softirq_own_stack+0x31/0x40
[Tue Dec 1 10:23:36 2020] irq_exit_rcu+0xbe/0x100
[Tue Dec 1 10:23:36 2020] common_interrupt+0x8f/0x160
[Tue Dec 1 10:23:36 2020] asm_common_interrupt+0x1e/0x40
[Tue Dec 1 10:23:36 2020] RIP: 0010:_aesni_enc4+0xfb/0x190
[Tue Dec 1 10:23:36 2020] Code: dc ea 66 0f 38 dc f2 41 0f 28 52 10 66 0f 38 dc c2 66 0f 38 dc e2 66 0f 38 dc ea 66 0f 38 dc f2 41 0f 28 52 20 66 0f 38 dc c2 <66> 0f 38 dc e2 66 0f 38 dc ea 66 0f 38 dc f2 41 0f 28 52 30 66 0f
[Tue Dec 1 10:23:36 2020] RSP: 0018:ffffae768358b3d8 EFLAGS: 00000282
[Tue Dec 1 10:23:36 2020] RAX: ffffffffa80652a0 RBX: ffffffffa8e0ab98 RCX: 0000000000000000
[Tue Dec 1 10:23:36 2020] RDX: ffff946f4dc15b80 RSI: ffff9471bf52bb80 RDI: ffff946fa7c5a210
[Tue Dec 1 10:23:36 2020] RBP: 0000000000000080 R08: ffff9471bfd793c0 R09: 0000000000000020
[Tue Dec 1 10:23:36 2020] R10: ffff946fa7c5a280 R11: ffffffffa8065040 R12: 0000000000000080
[Tue Dec 1 10:23:36 2020] R13: ffff9471bf52bc00 R14: ffffae768358b458 R15: ffff946f4dc15c00
[Tue Dec 1 10:23:36 2020] ? _aesni_enc1+0xb0/0xb0
[Tue Dec 1 10:23:36 2020] ? _aesni_dec1+0xb0/0xb0
[Tue Dec 1 10:23:36 2020] ? aesni_xts_crypt8+0x1fa/0x240
[Tue Dec 1 10:23:36 2020] ? __glue_xts_req_128bit+0x8a/0xb0
[Tue Dec 1 10:23:36 2020] ? glue_xts_req_128bit+0xb8/0x420
[Tue Dec 1 10:23:36 2020] ? aesni_set_key+0x1e0/0x1e0
[Tue Dec 1 10:23:36 2020] ? prep_new_page+0x72/0x80
[Tue Dec 1 10:23:36 2020] ? get_page_from_freelist+0xbc7/0xee0
[Tue Dec 1 10:23:36 2020] ? get_page_from_freelist+0xd08/0xee0
[Tue Dec 1 10:23:36 2020] ? crypt_convert.constprop.0+0xa03/0xf40 [dm_crypt]
[Tue Dec 1 10:23:36 2020] ? crypt_page_alloc+0x5a/0x80 [dm_crypt]
[Tue Dec 1 10:23:36 2020] ? mempool_alloc+0x6f/0x190
[Tue Dec 1 10:23:36 2020] ? kcryptd_crypt_write_convert+0x203/0x330 [dm_crypt]
[Tue Dec 1 10:23:36 2020] ? crypt_map+0x136/0x270 [dm_crypt]
[Tue Dec 1 10:23:36 2020] ? __map_bio.isra.0+0x3d/0xe0 [dm_mod]
[Tue Dec 1 10:23:36 2020] ? __split_and_process_non_flush+0x259/0x2e0 [dm_mod]
[Tue Dec 1 10:23:36 2020] ? dm_submit_bio+0x16d/0x420 [dm_mod]
[Tue Dec 1 10:23:36 2020] ? ktime_get+0x33/0x90
[Tue Dec 1 10:23:36 2020] ? submit_bio_noacct+0x111/0x4e0
[Tue Dec 1 10:23:36 2020] ? btrfs_map_bio+0x18d/0x320
[Tue Dec 1 10:23:36 2020] ? btrfs_submit_direct+0x24d/0x4e0
[Tue Dec 1 10:23:36 2020] ? iov_iter_get_pages+0xe3/0x3b0
[Tue Dec 1 10:23:36 2020] ? bio_iov_iter_get_pages+0x301/0x420
[Tue Dec 1 10:23:36 2020] ? iomap_dio_submit_bio+0x6c/0x80
[Tue Dec 1 10:23:36 2020] ? iomap_dio_bio_actor+0x189/0x3c0
[Tue Dec 1 10:23:36 2020] ? iomap_apply+0x104/0x200
[Tue Dec 1 10:23:36 2020] ? iomap_dio_rw+0x30/0x30
[Tue Dec 1 10:23:36 2020] ? __iomap_dio_rw+0x1f4/0x480
[Tue Dec 1 10:23:36 2020] ? iomap_dio_rw+0x30/0x30
[Tue Dec 1 10:23:36 2020] ? iomap_dio_rw+0x9/0x30
[Tue Dec 1 10:23:36 2020] ? btrfs_direct_IO+0x18b/0x1b0
[Tue Dec 1 10:23:36 2020] ? btrfs_file_write_iter+0x2d3/0x6d0
[Tue Dec 1 10:23:36 2020] ? aio_write+0x11a/0x260
[Tue Dec 1 10:23:36 2020] ? kmem_cache_free+0x38f/0x3e0
[Tue Dec 1 10:23:36 2020] ? _copy_to_user+0x1c/0x30
[Tue Dec 1 10:23:36 2020] ? aio_read_events_ring+0x1b8/0x200
[Tue Dec 1 10:23:36 2020] ? io_submit_one+0x472/0x7c0
[Tue Dec 1 10:23:36 2020] ? blk_queue_exit+0x9/0x40
[Tue Dec 1 10:23:36 2020] ? __x64_sys_io_submit+0x7b/0x130
[Tue Dec 1 10:23:36 2020] ? __x64_sys_io_getevents+0x61/0xc0
[Tue Dec 1 10:23:36 2020] ? do_syscall_64+0x33/0x40
[Tue Dec 1 10:23:36 2020] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Tue Dec 1 10:23:36 2020] BUG: scheduling while atomic: kworker/5:1/148/0x00000102

@neitsab

neitsab commented Dec 1, 2020

For the record, no issue so far here on Arch Linux with the standard linux kernel 5.9.10-arch1-1 and a straight XFS-over-LUKS config on an Intel 660p NVMe drive. However, I haven't had any IO-intensive load so far.

@ignatk
Contributor

ignatk commented Dec 1, 2020

@evilbulgarian can you post your fio command/config?

@evilbulgarian

fio --bs=4096 --rw=randrw --name=generic_job_name --filename=./test.fio --numjobs=1 --size=50g --thread --direct=1 --runtime=300 --rwmixread=50 --iodepth=256 --ioengine=libaio

@ignatk
Contributor

ignatk commented Dec 8, 2020

If I understand the problem right, the patch below should probably fix it. We will test it on our end, but we would appreciate it if folks here who reported the problem could try it out as well, if possible/feasible.

---
 drivers/md/dm-crypt.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 392337f16ecf..1ea98b6b816f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1455,7 +1455,7 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 			       int error);

 static void crypt_alloc_req_skcipher(struct crypt_config *cc,
-				     struct convert_context *ctx)
+				     struct convert_context *ctx, bool may_backlog)
 {
 	unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);

@@ -1469,12 +1469,12 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
 	 * requests if driver request queue is full.
 	 */
 	skcipher_request_set_callback(ctx->r.req,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
+	    may_backlog ? CRYPTO_TFM_REQ_MAY_BACKLOG : 0,
 	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
 }

 static void crypt_alloc_req_aead(struct crypt_config *cc,
-				 struct convert_context *ctx)
+				 struct convert_context *ctx, bool may_backlog)
 {
 	if (!ctx->r.req_aead)
 		ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
@@ -1486,17 +1486,17 @@ static void crypt_alloc_req_aead(struct crypt_config *cc,
 	 * requests if driver request queue is full.
 	 */
 	aead_request_set_callback(ctx->r.req_aead,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
+	    may_backlog ? CRYPTO_TFM_REQ_MAY_BACKLOG : 0,
 	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead));
 }

 static void crypt_alloc_req(struct crypt_config *cc,
-			    struct convert_context *ctx)
+			    struct convert_context *ctx, bool may_backlog)
 {
 	if (crypt_integrity_aead(cc))
-		crypt_alloc_req_aead(cc, ctx);
+		crypt_alloc_req_aead(cc, ctx, may_backlog);
 	else
-		crypt_alloc_req_skcipher(cc, ctx);
+		crypt_alloc_req_skcipher(cc, ctx, may_backlog);
 }

 static void crypt_free_req_skcipher(struct crypt_config *cc,
@@ -1539,7 +1539,7 @@ static blk_status_t crypt_convert(struct crypt_config *cc,

 	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {

-		crypt_alloc_req(cc, ctx);
+		crypt_alloc_req(cc, ctx, !atomic);
 		atomic_inc(&ctx->cc_pending);

 		if (crypt_integrity_aead(cc))
--
2.20.1

@evilbulgarian

Hi @ignatk,

Thanks for the patch. It seems there is no lockup now, but there are lots of these btrfs errors in the output:

[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x34dd1610 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] btrfs_dev_stat_print_on_error: 54 callbacks suppressed
[Tue Dec 8 08:44:51 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 331, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x340d57a0 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 332, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x3746c020 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 333, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x3557f418 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] repair_io_failure: 54 callbacks suppressed
[Tue Dec 8 08:44:51 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 28309405696 (dev /dev/mapper/root sector 873289632)
[Tue Dec 8 08:44:51 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 23296749568 (dev /dev/mapper/root sector 894956568)
[Tue Dec 8 08:44:51 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 266, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:51 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 267, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x14cd3a98 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x156b5fa8 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 334, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x344ae6c8 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 268, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x20b56320 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 979709952 (dev /dev/mapper/root_1 sector 348994200)
[Tue Dec 8 08:44:51 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 2039894016 (dev /dev/mapper/root_1 sector 359358376)
[Tue Dec 8 08:44:51 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 30374989824 (dev /dev/mapper/root sector 877323976)
[Tue Dec 8 08:44:51 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 11546734592 (dev /dev/mapper/root_1 sector 548758304)
[Tue Dec 8 08:44:51 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 335, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:51 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x37cf3f50 len 4096 err no 10
[Tue Dec 8 08:44:51 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 50921906176 (dev /dev/mapper/root sector 936329040)
[Tue Dec 8 08:44:53 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 336, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:53 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x344a4770 len 4096 err no 10
[Tue Dec 8 08:44:53 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 337, flush 0, corrupt 0, gen 0
[Tue Dec 8 08:44:53 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x359dc480 len 4096 err no 10
[Tue Dec 8 08:44:53 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x379031f8 len 4096 err no 10
[Tue Dec 8 08:44:53 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x35ed2c38 len 4096 err no 10
[Tue Dec 8 08:44:53 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x353fb700 len 4096 err no 10
[Tue Dec 8 08:44:53 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x14c48258 len 4096 err no 10
[Tue Dec 8 08:44:53 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x3296aaa0 len 4096 err no 10
[Tue Dec 8 08:44:53 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 687124480 (dev /dev/mapper/root_1 sector 348422744)
[Tue Dec 8 08:44:53 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 34746167296 (dev /dev/mapper/root sector 904735800)
[Tue Dec 8 08:44:53 2020] BTRFS info (device dm-0): read error corrected: ino 3185621 off 17709400064 (dev /dev/mapper/root sector 848734880)
[Tue Dec 8 08:44:53 2020] BTRFS warning (device dm-0): direct IO failed ino 3185621 rw 0,0 sector 0x1b15a958 len 4096 err no 10

@evilbulgarian

evilbulgarian commented Dec 8, 2020

Also, the performance is worse, but the issue seems to be isolated to the fio job - I do not see errors when not running fio so far (I tried recreating the fio file as well):

before flags:
❯ fio --bs=4096 --rw=randrw --name=generic_job_name --filename=./test.fio --numjobs=1 --size=50g --thread --direct=1 --runtime=300 --rwmixread=50 --iodepth=256 --ioengine=libaio
generic_job_name: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
fio-3.19
Starting 1 thread
Jobs: 1 (f=1): [m(1)][100.0%][r=163MiB/s,w=162MiB/s][r=41.6k,w=41.5k IOPS][eta 00m:00s]
generic_job_name: (groupid=0, jobs=1): err= 0: pid=5033: Tue Dec 8 08:43:09 2020
read: IOPS=32.7k, BW=128MiB/s (134MB/s)(25.0GiB/200340msec)
slat (usec): min=2, max=14568, avg= 9.25, stdev=20.46
clat (usec): min=55, max=3105.0k, avg=3836.76, stdev=15063.12
lat (usec): min=60, max=3105.1k, avg=3846.08, stdev=15065.24
clat percentiles (usec):
| 1.00th=[ 1074], 5.00th=[ 2442], 10.00th=[ 2573], 20.00th=[ 2704],
| 30.00th=[ 2802], 40.00th=[ 2868], 50.00th=[ 2966], 60.00th=[ 3032],
| 70.00th=[ 3163], 80.00th=[ 3425], 90.00th=[ 4359], 95.00th=[ 5276],
| 99.00th=[ 23725], 99.50th=[ 33162], 99.90th=[ 61604], 99.95th=[ 82314],
| 99.99th=[354419]
bw ( KiB/s): min= 72, max=189816, per=100.00%, avg=132364.74, stdev=53441.25, samples=395
iops : min= 18, max=47454, avg=33091.19, stdev=13360.32, samples=395
write: IOPS=32.7k, BW=128MiB/s (134MB/s)(24.0GiB/200340msec); 0 zone resets
slat (usec): min=5, max=320451, avg=15.64, stdev=246.87
clat (usec): min=97, max=3740.0k, avg=3960.91, stdev=23383.46
lat (usec): min=137, max=3741.0k, avg=3976.64, stdev=23386.20
clat percentiles (usec):
| 1.00th=[ 1893], 5.00th=[ 2245], 10.00th=[ 2409],
| 20.00th=[ 2573], 30.00th=[ 2671], 40.00th=[ 2737],
| 50.00th=[ 2802], 60.00th=[ 2868], 70.00th=[ 2966],
| 80.00th=[ 3064], 90.00th=[ 3851], 95.00th=[ 7504],
| 99.00th=[ 25297], 99.50th=[ 35914], 99.90th=[ 80217],
| 99.95th=[ 147850], 99.99th=[1010828]
bw ( KiB/s): min= 16, max=190240, per=100.00%, avg=132966.51, stdev=52680.84, samples=393
iops : min= 4, max=47560, avg=33241.62, stdev=13170.21, samples=393
lat (usec) : 100=0.03%, 250=0.10%, 500=0.13%, 750=0.11%, 1000=0.11%
lat (msec) : 2=1.46%, 4=86.93%, 10=7.87%, 20=1.95%, 50=1.10%
lat (msec) : 100=0.16%, 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%, >=2000=0.01%
cpu : usr=5.51%, sys=65.66%, ctx=280393, majf=0, minf=1
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=6555326,6551874,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=25.0GiB (26.8GB), run=200340-200340msec
WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=24.0GiB (26.8GB), run=200340-200340msec

after flags:
❯ fio --bs=4096 --rw=randrw --name=generic_job_name --filename=./test.fio --numjobs=1 --size=50g --thread --direct=1 --runtime=300 --rwmixread=50 --iodepth=256 --ioengine=libaio
generic_job_name: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
fio-3.19
Starting 1 thread
Jobs: 1 (f=1): [m(1)][100.0%][r=99.6MiB/s,w=98.6MiB/s][r=25.5k,w=25.2k IOPS][eta 00m:00s]
generic_job_name: (groupid=0, jobs=1): err= 0: pid=5586: Tue Dec 8 08:48:47 2020
read: IOPS=25.0k, BW=97.8MiB/s (103MB/s)(25.0GiB/261725msec)
slat (usec): min=2, max=2365, avg= 8.00, stdev= 8.42
clat (usec): min=51, max=1132.0k, avg=5155.91, stdev=6401.53
lat (usec): min=60, max=1132.0k, avg=5164.01, stdev=6401.70
clat percentiles (usec):
| 1.00th=[ 1270], 5.00th=[ 3261], 10.00th=[ 3490], 20.00th=[ 3752],
| 30.00th=[ 3982], 40.00th=[ 4178], 50.00th=[ 4359], 60.00th=[ 4555],
| 70.00th=[ 4883], 80.00th=[ 5211], 90.00th=[ 6194], 95.00th=[ 7898],
| 99.00th=[ 27132], 99.50th=[ 35390], 99.90th=[ 71828], 99.95th=[ 90702],
| 99.99th=[147850]
bw ( KiB/s): min= 640, max=146432, per=100.00%, avg=100336.84, stdev=22338.55, samples=522
iops : min= 160, max=36608, avg=25084.21, stdev=5584.64, samples=522
write: IOPS=25.0k, BW=97.8MiB/s (103MB/s)(24.0GiB/261725msec); 0 zone resets
slat (usec): min=11, max=163111, avg=26.74, stdev=303.18
clat (usec): min=50, max=971238, avg=5031.52, stdev=6116.94
lat (usec): min=74, max=971260, avg=5058.35, stdev=6128.24
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
| 30.00th=[ 4], 40.00th=[ 4], 50.00th=[ 5], 60.00th=[ 5],
| 70.00th=[ 5], 80.00th=[ 5], 90.00th=[ 6], 95.00th=[ 9],
| 99.00th=[ 29], 99.50th=[ 39], 99.90th=[ 72], 99.95th=[ 100],
| 99.99th=[ 148]
bw ( KiB/s): min= 600, max=146416, per=100.00%, avg=100283.94, stdev=22281.35, samples=522
iops : min= 150, max=36604, avg=25070.99, stdev=5570.34, samples=522
lat (usec) : 100=0.02%, 250=0.07%, 500=0.10%, 750=0.10%, 1000=0.10%
lat (msec) : 2=0.49%, 4=35.27%, 10=60.19%, 20=2.14%, 50=1.28%
lat (msec) : 100=0.18%, 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%
cpu : usr=3.46%, sys=65.95%, ctx=4852934, majf=0, minf=2
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=6555326,6551874,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=97.8MiB/s (103MB/s), 97.8MiB/s-97.8MiB/s (103MB/s-103MB/s), io=25.0GiB (26.8GB), run=261725-261725msec
WRITE: bw=97.8MiB/s (103MB/s), 97.8MiB/s-97.8MiB/s (103MB/s-103MB/s), io=24.0GiB (26.8GB), run=261725-261725msec

@ignatk
Contributor

ignatk commented Dec 8, 2020

What about this one? It should help with the errors; I'm not sure if it helps with performance.

---
 drivers/md/dm-crypt.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 392337f16ecf..1f4e948090b9 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1455,7 +1455,7 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 			       int error);

 static void crypt_alloc_req_skcipher(struct crypt_config *cc,
-				     struct convert_context *ctx)
+				     struct convert_context *ctx, bool may_backlog)
 {
 	unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);

@@ -1469,12 +1469,12 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
 	 * requests if driver request queue is full.
 	 */
 	skcipher_request_set_callback(ctx->r.req,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
+	    may_backlog ? CRYPTO_TFM_REQ_MAY_BACKLOG : 0,
 	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
 }

 static void crypt_alloc_req_aead(struct crypt_config *cc,
-				 struct convert_context *ctx)
+				 struct convert_context *ctx, bool may_backlog)
 {
 	if (!ctx->r.req_aead)
 		ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
@@ -1486,17 +1486,17 @@ static void crypt_alloc_req_aead(struct crypt_config *cc,
 	 * requests if driver request queue is full.
 	 */
 	aead_request_set_callback(ctx->r.req_aead,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
+	    may_backlog ? CRYPTO_TFM_REQ_MAY_BACKLOG : 0,
 	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead));
 }

 static void crypt_alloc_req(struct crypt_config *cc,
-			    struct convert_context *ctx)
+			    struct convert_context *ctx, bool may_backlog)
 {
 	if (crypt_integrity_aead(cc))
-		crypt_alloc_req_aead(cc, ctx);
+		crypt_alloc_req_aead(cc, ctx, may_backlog);
 	else
-		crypt_alloc_req_skcipher(cc, ctx);
+		crypt_alloc_req_skcipher(cc, ctx, may_backlog);
 }

 static void crypt_free_req_skcipher(struct crypt_config *cc,
@@ -1539,7 +1539,7 @@ static blk_status_t crypt_convert(struct crypt_config *cc,

 	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {

-		crypt_alloc_req(cc, ctx);
+		crypt_alloc_req(cc, ctx, !atomic);
 		atomic_inc(&ctx->cc_pending);

 		if (crypt_integrity_aead(cc))
@@ -1548,6 +1548,13 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 			r = crypt_convert_block_skcipher(cc, ctx, ctx->r.req, tag_offset);

 		switch (r) {
+		/*
+		 * The driver request queue is full and the request was
+		 * not queued, ask device mapper to resubmit the IO.
+		 */
+		case -ENOSPC:
+			atomic_dec(&ctx->cc_pending);
+			return BLK_STS_DM_REQUEUE;
 		/*
 		 * The request was queued by a crypto driver
 		 * but the driver request queue is full, let's wait.
--
2.20.1

@evilbulgarian

hmm,

❯ patch -p1 < ~/ignat2.patch
patching file drivers/md/dm-crypt.c
Hunk #1 FAILED at 1455.
Hunk #2 FAILED at 1469.
patch: **** malformed patch at line 30: ,

@ignatk
Contributor

ignatk commented Dec 8, 2020

Can you try again? (I updated the above comment with the patch - it was probably a copy-paste error.)

@evilbulgarian

evilbulgarian commented Dec 8, 2020

@ignatk, ok, it seems like the same errors..

[Tue Dec 8 13:04:24 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x3660bb80 len 4096 err no 10
[Tue Dec 8 13:04:24 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0xa5f17e0 len 4096 err no 10
[Tue Dec 8 13:04:24 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0xa0191f8 len 4096 err no 10
[Tue Dec 8 13:04:24 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x374db110 len 4096 err no 10
[Tue Dec 8 13:04:24 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x38816d28 len 4096 err no 10
[Tue Dec 8 13:04:24 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x399818a0 len 4096 err no 10
[Tue Dec 8 13:04:24 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x375285a8 len 4096 err no 10
[Tue Dec 8 13:04:25 2020] BTRFS info (device dm-0): the free space cache file (615276085248) is invalid, skip it
[Tue Dec 8 13:04:28 2020] btrfs_dev_stat_print_on_error: 774 callbacks suppressed
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3535, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3536, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3537, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] btrfs_printk: 684 callbacks suppressed
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3538, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x39750880 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x1a529ca0 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x336a7f88 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3539, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x33df7bd0 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3540, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x366b2240 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3541, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x37f78bb8 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] repair_io_failure: 774 callbacks suppressed
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 40320430080 (dev /dev/mapper/root_1 sector 945990776)
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 15886913536 (dev /dev/mapper/root_1 sector 862617480)
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 19812294656 (dev /dev/mapper/root_1 sector 870284240)
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 23424434176 (dev /dev/mapper/root_1 sector 912990784)
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 49526407168 (dev /dev/mapper/root_1 sector 963971200)
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 8676524032 (dev /dev/mapper/root_1 sector 441621664)
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 36725813248 (dev /dev/mapper/root_1 sector 938970040)
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3542, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3543, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x384e66c0 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x38978808 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS error (device dm-0): bdev /dev/mapper/root_1 errs: wr 0, rd 3544, flush 0, corrupt 0, gen 0
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x33c36360 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x369f9658 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x38eb1c98 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x36862740 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 39640203264 (dev /dev/mapper/root_1 sector 944662208)
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 42094039040 (dev /dev/mapper/root_1 sector 949454856)
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x39115208 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x37aa5970 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS info (device dm-0): read error corrected: ino 4077816 off 18869567488 (dev /dev/mapper/root_1 sector 868442976)
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x366173b8 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x371c98c0 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x2fe71ee8 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x1a5698c0 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x10616550 len 4096 err no 10
[Tue Dec 8 13:04:28 2020] BTRFS warning (device dm-0): direct IO failed ino 4077816 rw 0,0 sector 0x333d8ba0 len 4096 err no 10

@ignatk
Contributor

ignatk commented Dec 8, 2020

Thanks for the follow-up - I will try to reproduce on my side then...

@ignatk
Contributor

ignatk commented Dec 17, 2020

Some updates: I've been trying hard to reproduce something like the above in a QEMU setup using a ramdisk. I've modified the Linux ramdisk driver to complete read requests in softirq context, emulating some storage-driver behaviour. I also tried both running the fio load directly on the dm-crypt device and creating btrfs on top. Somehow I still can't reproduce the issues you're reporting.

For me to dig further, is it possible for you to provide:

  • your kernel configuration file
  • number of cpus/cores (nproc command)
  • storage type (and driver)
  • the output of # dmsetup table command
  • the output of $ grep -A 10 xts /proc/crypto

Thanks.

@evilbulgarian

evilbulgarian commented Dec 18, 2020

Thanks for troubleshooting this @ignatk,

  • I attached my config in above post as a txt file
  • nproc => 16, but it also happens on my laptop, which has 4
  • nvme / driver not sure

sudo dmsetup table /dev/mapper/root
0 1023406080 crypt aes-xts-plain64 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0 259:8 4096 1 allow_discards
sudo dmsetup table /dev/mapper/root_1
0 974843904 crypt aes-xts-plain64 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0 259:6 4096 1 allow_discards

grep -A 10 xts /proc/crypto
name : __xts(aes)
driver : cryptd(__xts-aes-aesni)
module : kernel
priority : 451
refcnt : 4
selftest : passed
internal : yes
type : skcipher
async : yes
blocksize : 16
min keysize : 32
max keysize : 64

name : xts(aes)
driver : xts-aes-aesni
module : kernel
priority : 401
refcnt : 4
selftest : passed
internal : no
type : skcipher
async : yes
blocksize : 16
min keysize : 32
max keysize : 64

name : __xts(aes)
driver : __xts-aes-aesni
module : kernel
priority : 401
refcnt : 4
selftest : passed
internal : yes
type : skcipher
async : no
blocksize : 16
min keysize : 32
max keysize : 64

@ignatk
Contributor

ignatk commented Dec 23, 2020

I was able to reliably reproduce this in QEMU with only 1 CPU by scaling the cryptd cryptd_max_cpu_qlen parameter down to 1 (the default is 1000). But I don't see any simple fix so far, as returning any kind of error (even asking device mapper to just requeue the bio) produces some weird side effects (probably because the error-handling path in this case is not very well worked out).

While we figure this out, as a workaround you may try to increase cryptd_max_cpu_qlen, or compile and use the xtsproxy crypto module from this repository: since that module is fully synchronous, it should never trigger the async processing path in dm-crypt.
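A sketch of the first workaround (the value 10000 is an arbitrary illustration; the parameter is read when cryptd initializes, so it must be set at module load or boot time):

# if cryptd is built as a module, reload it with a larger queue
# (unloading may fail while the cipher is in use)
modprobe -r cryptd && modprobe cryptd cryptd_max_cpu_qlen=10000
# if cryptd is built in, set it on the kernel command line instead:
#   cryptd.cryptd_max_cpu_qlen=10000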

@ignatk
Contributor

ignatk commented Dec 24, 2020

Hi all,

I think I've come up with a patch which at least fixes the above in my environment. If you could test it on your side and let me know whether it fixes the issues, I would be very grateful.

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 5f9f9b3a226d..12ffab3d3100 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1368,8 +1368,11 @@ static int crypt_convert_block_aead(struct crypt_config *cc,
 	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
 		r = cc->iv_gen_ops->post(cc, org_iv, dmreq);

-	bio_advance_iter(ctx->bio_in, &ctx->iter_in, cc->sector_size);
-	bio_advance_iter(ctx->bio_out, &ctx->iter_out, cc->sector_size);
+	/* do not advance the bio, when Crypto API did nothing in case we want to retry */
+	if (r != -ENOSPC) {
+		bio_advance_iter(ctx->bio_in, &ctx->iter_in, cc->sector_size);
+		bio_advance_iter(ctx->bio_out, &ctx->iter_out, cc->sector_size);
+	}

 	return r;
 }
@@ -1445,8 +1448,11 @@ static int crypt_convert_block_skcipher(struct crypt_config *cc,
 	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
 		r = cc->iv_gen_ops->post(cc, org_iv, dmreq);

-	bio_advance_iter(ctx->bio_in, &ctx->iter_in, cc->sector_size);
-	bio_advance_iter(ctx->bio_out, &ctx->iter_out, cc->sector_size);
+	/* do not advance the bio, when Crypto API did nothing in case we want to retry */
+	if (r != ENOSPC) {
+		bio_advance_iter(ctx->bio_in, &ctx->iter_in, cc->sector_size);
+		bio_advance_iter(ctx->bio_out, &ctx->iter_out, cc->sector_size);
+	}

 	return r;
 }
@@ -1457,10 +1463,11 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 static void crypt_alloc_req_skcipher(struct crypt_config *cc,
 				     struct convert_context *ctx)
 {
+	bool atomic = in_interrupt();
 	unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);

 	if (!ctx->r.req)
-		ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO);
+		ctx->r.req = mempool_alloc(&cc->req_pool, atomic ? GFP_ATOMIC : GFP_NOIO);

 	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);

@@ -1469,15 +1476,17 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
 	 * requests if driver request queue is full.
 	 */
 	skcipher_request_set_callback(ctx->r.req,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
+	    atomic ? 0 : CRYPTO_TFM_REQ_MAY_BACKLOG,
 	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
 }

 static void crypt_alloc_req_aead(struct crypt_config *cc,
 				 struct convert_context *ctx)
 {
+	bool atomic = in_interrupt();
+
 	if (!ctx->r.req_aead)
-		ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
+		ctx->r.req_aead = mempool_alloc(&cc->req_pool, atomic ? GFP_ATOMIC : GFP_NOIO);

 	aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]);

@@ -1486,7 +1495,7 @@ static void crypt_alloc_req_aead(struct crypt_config *cc,
 	 * requests if driver request queue is full.
 	 */
 	aead_request_set_callback(ctx->r.req_aead,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
+	    atomic ? 0 : CRYPTO_TFM_REQ_MAY_BACKLOG,
 	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead));
 }

@@ -1548,6 +1557,13 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 			r = crypt_convert_block_skcipher(cc, ctx, ctx->r.req, tag_offset);

 		switch (r) {
+		/*
+		 * The driver request queue is full and the request was
+		 * not queued.
+		 */
+		case -ENOSPC:
+			atomic_dec(&ctx->cc_pending);
+			return BLK_STS_DEV_RESOURCE;
 		/*
 		 * The request was queued by a crypto driver
 		 * but the driver request queue is full, let's wait.
@@ -1945,6 +1961,34 @@ static bool kcryptd_crypt_write_inline(struct crypt_config *cc,
 	}
 }

+static void kcryptd_crypt_write_continue(struct work_struct *work)
+{
+	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	struct crypt_config *cc = io->cc;
+	struct convert_context *ctx = &io->ctx;
+	int crypt_finished;
+	sector_t sector = io->sector;
+	blk_status_t r;
+
+	r = crypt_convert(cc, &io->ctx, false);
+	if (r)
+		io->error = r;
+	crypt_finished = atomic_dec_and_test(&ctx->cc_pending);
+	if (!crypt_finished && kcryptd_crypt_write_inline(cc, ctx)) {
+		/* Wait for completion signaled by kcryptd_async_done() */
+		wait_for_completion(&ctx->restart);
+		crypt_finished = 1;
+	}
+
+	/* Encryption was already finished, submit io now */
+	if (crypt_finished) {
+		kcryptd_crypt_write_io_submit(io, 0);
+		io->sector = sector;
+	}
+
+	crypt_dec_pending(io);
+}
+
 static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
@@ -1974,6 +2018,16 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 	crypt_inc_pending(io);
 	r = crypt_convert(cc, ctx,
 			  test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags));
+	/*
+	 * Crypto API could not process the request, because its queue was full
+	 * and we're in softirq context, so retry from a workqueue
+	 * (TODO: is it actually possible to be in softirq in the write path?)
+	 */
+	if (r == BLK_STS_DEV_RESOURCE) {
+		INIT_WORK(&io->work, kcryptd_crypt_write_continue);
+		queue_work(cc->crypt_queue, &io->work);
+		return;
+	}
 	if (r)
 		io->error = r;
 	crypt_finished = atomic_dec_and_test(&ctx->cc_pending);
@@ -1998,6 +2052,22 @@ static void kcryptd_crypt_read_done(struct dm_crypt_io *io)
 	crypt_dec_pending(io);
 }

+static void kcryptd_crypt_read_continue(struct work_struct *work)
+{
+	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	struct crypt_config *cc = io->cc;
+	blk_status_t r;
+
+	r = crypt_convert(cc, &io->ctx, false);
+	if (r)
+		io->error = r;
+
+	if (atomic_dec_and_test(&io->ctx.cc_pending))
+		kcryptd_crypt_read_done(io);
+
+	crypt_dec_pending(io);
+}
+
 static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
@@ -2010,6 +2080,15 @@ static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)

 	r = crypt_convert(cc, &io->ctx,
 			  test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags));
+	/*
+	 * Crypto API could not process the request, because its queue was full
+	 * and we're in softirq context, so retry from a workqueue
+	 */
+	if (r == BLK_STS_DEV_RESOURCE) {
+		INIT_WORK(&io->work, kcryptd_crypt_read_continue);
+		queue_work(cc->crypt_queue, &io->work);
+		return;
+	}
 	if (r)
 		io->error = r;

@evilbulgarian

@ignatk thanks for keeping at this! Here is the result (it looked good at first, then I got the IO error and fio stopped):

[Wed Dec 23 16:55:41 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[Wed Dec 23 16:55:41 2020] BTRFS warning (device dm-0): csum failed root 260 ino 17349418 off 46156853248 csum 0xb79d51c7 expected csum 0xef38e286 mirror 1
[Wed Dec 23 16:55:41 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[Wed Dec 23 16:55:41 2020] BTRFS warning (device dm-0): csum failed root 260 ino 17349418 off 53374042112 csum 0x5c95246c expected csum 0x64e6fe4e mirror 1
[Wed Dec 23 16:55:41 2020] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 0, flush 0, corrupt 3, gen 0

fio --bs=4096 --rw=randrw --name=generic_job_name --filename=./test.fio --numjobs=1 --size=50g --thread --direct=1 --runtime=300 --rwmixread=50 --iodepth=256 --ioengine=libaio
generic_job_name: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
fio-3.19
Starting 1 thread
fio: io_u error on file ./test.fio: Input/output error: read offset=3122044928, buflen=4096
fio: pid=3169, err=5/file:io_u.c:1787, func=io_u error, error=Input/output error

generic_job_name: (groupid=0, jobs=1): err= 5 (file:io_u.c:1787, func=io_u error, error=Input/output error): pid=3169: Wed Dec 23 16:55:42 2020
read: IOPS=3633, BW=14.2MiB/s (14.9MB/s)(450MiB/31751msec)
slat (usec): min=6, max=4913, avg=108.30, stdev=163.97
clat (msec): min=6, max=1749, avg=35.27, stdev=67.32
lat (msec): min=6, max=1749, avg=35.38, stdev=67.37
clat percentiles (msec):
| 1.00th=[ 9], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 11],
| 30.00th=[ 12], 40.00th=[ 13], 50.00th=[ 19], 60.00th=[ 32],
| 70.00th=[ 43], 80.00th=[ 54], 90.00th=[ 75], 95.00th=[ 99],
| 99.00th=[ 123], 99.50th=[ 129], 99.90th=[ 1737], 99.95th=[ 1737],
| 99.99th=[ 1754]
bw ( KiB/s): min= 2881, max=54184, per=100.00%, avg=15227.48, stdev=14200.90, samples=60
iops : min= 720, max=13546, avg=3806.90, stdev=3550.26, samples=60
write: IOPS=3626, BW=14.1MiB/s (14.8MB/s)(449MiB/31751msec); 0 zone resets
slat (usec): min=19, max=1677.6k, avg=150.81, stdev=5074.32
clat (msec): min=5, max=1748, avg=34.84, stdev=63.64
lat (msec): min=5, max=1749, avg=35.00, stdev=63.90
clat percentiles (msec):
| 1.00th=[ 9], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 11],
| 30.00th=[ 11], 40.00th=[ 13], 50.00th=[ 19], 60.00th=[ 32],
| 70.00th=[ 43], 80.00th=[ 54], 90.00th=[ 75], 95.00th=[ 99],
| 99.00th=[ 124], 99.50th=[ 129], 99.90th=[ 1737], 99.95th=[ 1737],
| 99.99th=[ 1754]
bw ( KiB/s): min= 3081, max=54072, per=100.00%, avg=15190.82, stdev=14195.43, samples=60
iops : min= 770, max=13518, avg=3797.70, stdev=3548.86, samples=60
lat (msec) : 10=18.72%, 20=32.29%, 50=27.65%, 100=16.49%, 250=4.52%
lat (msec) : 500=0.11%, 2000=0.11%
cpu : usr=4.01%, sys=38.11%, ctx=212940, majf=0, minf=1
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=115364,115132,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=14.2MiB/s (14.9MB/s), 14.2MiB/s-14.2MiB/s (14.9MB/s-14.9MB/s), io=450MiB (472MB), run=31751-31751msec
WRITE: bw=14.1MiB/s (14.8MB/s), 14.1MiB/s-14.1MiB/s (14.8MB/s-14.8MB/s), io=449MiB (471MB), run=31751-31751msec

@ignatk
Contributor

ignatk commented Dec 24, 2020

Not sure if it is related - but I found a typo in my patch: a missing - before ENOSPC in crypt_convert_block_skcipher. Here is the updated version:

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 5f9f9b3a226d..12ffab3d3100 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1368,8 +1368,11 @@ static int crypt_convert_block_aead(struct crypt_config *cc,
 	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
 		r = cc->iv_gen_ops->post(cc, org_iv, dmreq);

-	bio_advance_iter(ctx->bio_in, &ctx->iter_in, cc->sector_size);
-	bio_advance_iter(ctx->bio_out, &ctx->iter_out, cc->sector_size);
+	/* do not advance the bio, when Crypto API did nothing in case we want to retry */
+	if (r != -ENOSPC) {
+		bio_advance_iter(ctx->bio_in, &ctx->iter_in, cc->sector_size);
+		bio_advance_iter(ctx->bio_out, &ctx->iter_out, cc->sector_size);
+	}

 	return r;
 }
@@ -1445,8 +1448,11 @@ static int crypt_convert_block_skcipher(struct crypt_config *cc,
 	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
 		r = cc->iv_gen_ops->post(cc, org_iv, dmreq);

-	bio_advance_iter(ctx->bio_in, &ctx->iter_in, cc->sector_size);
-	bio_advance_iter(ctx->bio_out, &ctx->iter_out, cc->sector_size);
+	/* do not advance the bio, when Crypto API did nothing in case we want to retry */
+	if (r != -ENOSPC) {
+		bio_advance_iter(ctx->bio_in, &ctx->iter_in, cc->sector_size);
+		bio_advance_iter(ctx->bio_out, &ctx->iter_out, cc->sector_size);
+	}

 	return r;
 }
@@ -1457,10 +1463,11 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 static void crypt_alloc_req_skcipher(struct crypt_config *cc,
 				     struct convert_context *ctx)
 {
+	bool atomic = in_interrupt();
 	unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);

 	if (!ctx->r.req)
-		ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO);
+		ctx->r.req = mempool_alloc(&cc->req_pool, atomic ? GFP_ATOMIC : GFP_NOIO);

 	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);

@@ -1469,15 +1476,17 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
 	 * requests if driver request queue is full.
 	 */
 	skcipher_request_set_callback(ctx->r.req,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
+	    atomic ? 0 : CRYPTO_TFM_REQ_MAY_BACKLOG,
 	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
 }

 static void crypt_alloc_req_aead(struct crypt_config *cc,
 				 struct convert_context *ctx)
 {
+	bool atomic = in_interrupt();
+
 	if (!ctx->r.req_aead)
-		ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
+		ctx->r.req_aead = mempool_alloc(&cc->req_pool, atomic ? GFP_ATOMIC : GFP_NOIO);

 	aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]);

@@ -1486,7 +1495,7 @@ static void crypt_alloc_req_aead(struct crypt_config *cc,
 	 * requests if driver request queue is full.
 	 */
 	aead_request_set_callback(ctx->r.req_aead,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
+	    atomic ? 0 : CRYPTO_TFM_REQ_MAY_BACKLOG,
 	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead));
 }

@@ -1548,6 +1557,13 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 			r = crypt_convert_block_skcipher(cc, ctx, ctx->r.req, tag_offset);

 		switch (r) {
+		/*
+		 * The driver request queue is full and the request was
+		 * not queued.
+		 */
+		case -ENOSPC:
+			atomic_dec(&ctx->cc_pending);
+			return BLK_STS_DEV_RESOURCE;
 		/*
 		 * The request was queued by a crypto driver
 		 * but the driver request queue is full, let's wait.
@@ -1945,6 +1961,34 @@ static bool kcryptd_crypt_write_inline(struct crypt_config *cc,
 	}
 }

+static void kcryptd_crypt_write_continue(struct work_struct *work)
+{
+	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	struct crypt_config *cc = io->cc;
+	struct convert_context *ctx = &io->ctx;
+	int crypt_finished;
+	sector_t sector = io->sector;
+	blk_status_t r;
+
+	r = crypt_convert(cc, &io->ctx, false);
+	if (r)
+		io->error = r;
+	crypt_finished = atomic_dec_and_test(&ctx->cc_pending);
+	if (!crypt_finished && kcryptd_crypt_write_inline(cc, ctx)) {
+		/* Wait for completion signaled by kcryptd_async_done() */
+		wait_for_completion(&ctx->restart);
+		crypt_finished = 1;
+	}
+
+	/* Encryption was already finished, submit io now */
+	if (crypt_finished) {
+		kcryptd_crypt_write_io_submit(io, 0);
+		io->sector = sector;
+	}
+
+	crypt_dec_pending(io);
+}
+
 static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
@@ -1974,6 +2018,16 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 	crypt_inc_pending(io);
 	r = crypt_convert(cc, ctx,
 			  test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags));
+	/*
+	 * Crypto API could not process the request, because its queue was full
+	 * and we're in softirq context, so retry from a workqueue
+	 * (TODO: is it actually possible to be in softirq in the write path?)
+	 */
+	if (r == BLK_STS_DEV_RESOURCE) {
+		INIT_WORK(&io->work, kcryptd_crypt_write_continue);
+		queue_work(cc->crypt_queue, &io->work);
+		return;
+	}
 	if (r)
 		io->error = r;
 	crypt_finished = atomic_dec_and_test(&ctx->cc_pending);
@@ -1998,6 +2052,22 @@ static void kcryptd_crypt_read_done(struct dm_crypt_io *io)
 	crypt_dec_pending(io);
 }

+static void kcryptd_crypt_read_continue(struct work_struct *work)
+{
+	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	struct crypt_config *cc = io->cc;
+	blk_status_t r;
+
+	r = crypt_convert(cc, &io->ctx, false);
+	if (r)
+		io->error = r;
+
+	if (atomic_dec_and_test(&io->ctx.cc_pending))
+		kcryptd_crypt_read_done(io);
+
+	crypt_dec_pending(io);
+}
+
 static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
@@ -2010,6 +2080,15 @@ static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)

 	r = crypt_convert(cc, &io->ctx,
 			  test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags));
+	/*
+	 * Crypto API could not process the request, because its queue was full
+	 * and we're in softirq context, so retry from a workqueue
+	 */
+	if (r == BLK_STS_DEV_RESOURCE) {
+		INIT_WORK(&io->work, kcryptd_crypt_read_continue);
+		queue_work(cc->crypt_queue, &io->work);
+		return;
+	}
 	if (r)
 		io->error = r;

@evilbulgarian

@ignatk I re-applied the updated patch, but I get the same error and the fio command still stops.

@ignatk
Contributor

ignatk commented Dec 29, 2020

I changed the approach a bit and came up with this:

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 392337f16ecf..d6ea5a9eaee4 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1460,7 +1460,7 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
 	unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);

 	if (!ctx->r.req)
-		ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO);
+		ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO);

 	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);

@@ -1477,7 +1477,7 @@ static void crypt_alloc_req_aead(struct crypt_config *cc,
 				 struct convert_context *ctx)
 {
 	if (!ctx->r.req_aead)
-		ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
+		ctx->r.req_aead = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO);

 	aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]);

@@ -1529,13 +1529,19 @@ static void crypt_free_req(struct crypt_config *cc, void *req, struct bio *base_
  * Encrypt / decrypt data from one bio to another one (can be the same one)
  */
 static blk_status_t crypt_convert(struct crypt_config *cc,
-			 struct convert_context *ctx, bool atomic)
+			 struct convert_context *ctx, bool atomic, bool reset_pending)
 {
 	unsigned int tag_offset = 0;
 	unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
 	int r;

-	atomic_set(&ctx->cc_pending, 1);
+	/*
+	 * if reset_pending is set we are dealing with the bio for the first time,
+	 * else we're continuing to work on the previous bio, so don't mess with
+	 * the cc_pending counter
+	 */
+	if (reset_pending)
+		atomic_set(&ctx->cc_pending, 1);

 	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {

@@ -1553,7 +1559,24 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 		 * but the driver request queue is full, let's wait.
 		 */
 		case -EBUSY:
-			wait_for_completion(&ctx->restart);
+			if (in_interrupt()) {
+				if (try_wait_for_completion(&ctx->restart)) {
+					/*
+					 * we don't have to block to wait for completion,
+					 * so proceed
+					 */
+				} else {
+					/*
+					 * we can't wait for completion without blocking
+					 * exit and continue processing in a workqueue
+					 */
+					ctx->r.req = NULL;
+					ctx->cc_sector += sector_step;
+					tag_offset++;
+					return BLK_STS_DEV_RESOURCE;
+				}
+			} else
+				wait_for_completion(&ctx->restart);
 			reinit_completion(&ctx->restart);
 			fallthrough;
 		/*
@@ -1945,6 +1968,37 @@ static bool kcryptd_crypt_write_inline(struct crypt_config *cc,
 	}
 }

+static void kcryptd_crypt_write_continue(struct work_struct *work)
+{
+	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	struct crypt_config *cc = io->cc;
+	struct convert_context *ctx = &io->ctx;
+	int crypt_finished;
+	sector_t sector = io->sector;
+	blk_status_t r;
+
+	wait_for_completion(&ctx->restart);
+	reinit_completion(&ctx->restart);
+
+	r = crypt_convert(cc, &io->ctx, true, false);
+	if (r)
+		io->error = r;
+	crypt_finished = atomic_dec_and_test(&ctx->cc_pending);
+	if (!crypt_finished && kcryptd_crypt_write_inline(cc, ctx)) {
+		/* Wait for completion signaled by kcryptd_async_done() */
+		wait_for_completion(&ctx->restart);
+		crypt_finished = 1;
+	}
+
+	/* Encryption was already finished, submit io now */
+	if (crypt_finished) {
+		kcryptd_crypt_write_io_submit(io, 0);
+		io->sector = sector;
+	}
+
+	crypt_dec_pending(io);
+}
+
 static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
@@ -1973,7 +2027,17 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)

 	crypt_inc_pending(io);
 	r = crypt_convert(cc, ctx,
-			  test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags));
+			  test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags), true);
+	/*
+	 * Crypto API backlogged the request, because its queue was full
+	 * and we're in softirq context, so continue from a workqueue
+	 * (TODO: is it actually possible to be in softirq in the write path?)
+	 */
+	if (r == BLK_STS_DEV_RESOURCE) {
+		INIT_WORK(&io->work, kcryptd_crypt_write_continue);
+		queue_work(cc->crypt_queue, &io->work);
+		return;
+	}
 	if (r)
 		io->error = r;
 	crypt_finished = atomic_dec_and_test(&ctx->cc_pending);
@@ -1998,6 +2062,25 @@ static void kcryptd_crypt_read_done(struct dm_crypt_io *io)
 	crypt_dec_pending(io);
 }

+static void kcryptd_crypt_read_continue(struct work_struct *work)
+{
+	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	struct crypt_config *cc = io->cc;
+	blk_status_t r;
+
+	wait_for_completion(&io->ctx.restart);
+	reinit_completion(&io->ctx.restart);
+
+	r = crypt_convert(cc, &io->ctx, true, false);
+	if (r)
+		io->error = r;
+
+	if (atomic_dec_and_test(&io->ctx.cc_pending))
+		kcryptd_crypt_read_done(io);
+
+	crypt_dec_pending(io);
+}
+
 static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
@@ -2009,7 +2092,16 @@ static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
 			   io->sector);

 	r = crypt_convert(cc, &io->ctx,
-			  test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags));
+			  test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags), true);
+	/*
+	 * Crypto API backlogged the request, because its queue was full
+	 * and we're in softirq context, so continue from a workqueue
+	 */
+	if (r == BLK_STS_DEV_RESOURCE) {
+		INIT_WORK(&io->work, kcryptd_crypt_read_continue);
+		queue_work(cc->crypt_queue, &io->work);
+		return;
+	}
 	if (r)
 		io->error = r;

I've been running this for a while in my repro environment, and so far there are no warnings in dmesg and no fio errors.
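
To make the control flow easier to follow, here is a minimal, hypothetical sketch of the pattern the patch uses (names like my_io, my_continue and my_process are made up): in softirq context only the non-blocking try_wait_for_completion() is allowed, and if the wait would block, the remaining work is handed off to a workqueue, where sleeping is fine.

#include <linux/completion.h>
#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct my_io {
	struct work_struct work;
	struct completion restart;
	/* ... per-request state ... */
};

/* Runs in process context, so a blocking wait is allowed here. */
static void my_continue(struct work_struct *work)
{
	struct my_io *io = container_of(work, struct my_io, work);

	wait_for_completion(&io->restart);
	reinit_completion(&io->restart);
	/* ... resume processing where the softirq path left off ... */
}

static int my_process(struct my_io *io, struct workqueue_struct *wq)
{
	if (in_interrupt() && !try_wait_for_completion(&io->restart)) {
		/* Softirq context and the wait would block: defer to a workqueue. */
		INIT_WORK(&io->work, my_continue);
		queue_work(wq, &io->work);
		return -EINPROGRESS;
	}
	if (!in_interrupt())
		wait_for_completion(&io->restart);
	reinit_completion(&io->restart);
	/* ... continue processing inline ... */
	return 0;
}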

@evilbulgarian

@ignatk nice job! No more errors in dmesg (but fio is slower, probably due to something with btrfs, not sure):

❯ fio --bs=4096 --rw=randrw --name=generic_job_name --filename=./test.fio --numjobs=1 --size=50g --thread --direct=1 --runtime=300 --rwmixread=50 --iodepth=256 --ioengine=libaio
generic_job_name: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
fio-3.19
Starting 1 thread
Jobs: 1 (f=1): [m(1)][100.0%][r=67.3MiB/s,w=67.6MiB/s][r=17.2k,w=17.3k IOPS][eta 00m:00s]
generic_job_name: (groupid=0, jobs=1): err= 0: pid=2999: Wed Dec 30 09:23:17 2020
read: IOPS=18.5k, BW=72.4MiB/s (75.9MB/s)(21.2GiB/300007msec)
slat (usec): min=3, max=7271, avg= 9.50, stdev=12.37
clat (usec): min=310, max=148240, avg=7850.28, stdev=3723.28
lat (usec): min=317, max=148268, avg=7859.83, stdev=3724.42
clat percentiles (usec):
| 1.00th=[ 2900], 5.00th=[ 3195], 10.00th=[ 3687], 20.00th=[ 4555],
| 30.00th=[ 5604], 40.00th=[ 6915], 50.00th=[ 7570], 60.00th=[ 8094],
| 70.00th=[ 8979], 80.00th=[10290], 90.00th=[12256], 95.00th=[13960],
| 99.00th=[20579], 99.50th=[23462], 99.90th=[32113], 99.95th=[36439],
| 99.99th=[47449]
bw ( KiB/s): min=13993, max=179776, per=100.00%, avg=74212.82, stdev=23094.32, samples=599
iops : min= 3498, max=44944, avg=18553.21, stdev=5773.58, samples=599
write: IOPS=18.5k, BW=72.3MiB/s (75.8MB/s)(21.2GiB/300007msec); 0 zone resets
slat (usec): min=5, max=46567, avg=30.31, stdev=43.59
clat (usec): min=112, max=171485, avg=5929.64, stdev=2932.40
lat (usec): min=127, max=171496, avg=5960.06, stdev=2941.59
clat percentiles (usec):
| 1.00th=[ 2671], 5.00th=[ 2999], 10.00th=[ 3195], 20.00th=[ 3851],
| 30.00th=[ 4424], 40.00th=[ 5080], 50.00th=[ 5669], 60.00th=[ 6194],
| 70.00th=[ 6849], 80.00th=[ 7439], 90.00th=[ 8291], 95.00th=[ 9765],
| 99.00th=[14746], 99.50th=[18220], 99.90th=[28705], 99.95th=[34866],
| 99.99th=[92799]
bw ( KiB/s): min=14721, max=178200, per=100.00%, avg=74152.19, stdev=22988.08, samples=599
iops : min= 3680, max=44550, avg=18538.05, stdev=5747.02, samples=599
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.11%, 4=16.90%, 10=69.86%, 20=12.37%, 50=0.73%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=7.11%, sys=69.51%, ctx=64261, majf=0, minf=1
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=5557820,5553288,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=72.4MiB/s (75.9MB/s), 72.4MiB/s-72.4MiB/s (75.9MB/s-75.9MB/s), io=21.2GiB (22.8GB), run=300007-300007msec
WRITE: bw=72.3MiB/s (75.8MB/s), 72.3MiB/s-72.3MiB/s (75.8MB/s-75.8MB/s), io=21.2GiB (22.7GB), run=300007-300007msec

~ 5m 2s
❯ sudo cryptsetup refresh --perf-no_read_workqueue --perf-no_write_workqueue /dev/mapper/root
Password:
Enter passphrase for /dev/nvme0n1p3:

~ 14s
❯ fio --bs=4096 --rw=randrw --name=generic_job_name --filename=./test.fio --numjobs=1 --size=50g --thread --direct=1 --runtime=300 --rwmixread=50 --iodepth=256 --ioengine=libaio
generic_job_name: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
fio-3.19
Starting 1 thread
Jobs: 1 (f=1): [m(1)][100.0%][r=50.6MiB/s,w=50.7MiB/s][r=12.0k,w=12.0k IOPS][eta 00m:00s]
generic_job_name: (groupid=0, jobs=1): err= 0: pid=3670: Wed Dec 30 09:29:12 2020
read: IOPS=16.3k, BW=63.7MiB/s (66.8MB/s)(18.7GiB/300001msec)
slat (usec): min=3, max=745, avg=10.39, stdev= 5.39
clat (usec): min=128, max=150590, avg=7896.38, stdev=3628.75
lat (usec): min=140, max=150596, avg=7906.85, stdev=3628.58
clat percentiles (msec):
| 1.00th=[ 5], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
| 30.00th=[ 7], 40.00th=[ 8], 50.00th=[ 9], 60.00th=[ 9],
| 70.00th=[ 10], 80.00th=[ 10], 90.00th=[ 11], 95.00th=[ 11],
| 99.00th=[ 12], 99.50th=[ 21], 99.90th=[ 46], 99.95th=[ 67],
| 99.99th=[ 133]
bw ( KiB/s): min=37352, max=119280, per=100.00%, avg=65303.46, stdev=19785.02, samples=599
iops : min= 9338, max=29820, avg=16325.86, stdev=4946.25, samples=599
write: IOPS=16.3k, BW=63.7MiB/s (66.8MB/s)(18.7GiB/300001msec); 0 zone resets
slat (usec): min=10, max=136498, avg=47.56, stdev=235.68
clat (usec): min=141, max=147414, avg=7747.14, stdev=3617.46
lat (usec): min=200, max=147488, avg=7794.79, stdev=3636.70
clat percentiles (msec):
| 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
| 30.00th=[ 7], 40.00th=[ 8], 50.00th=[ 9], 60.00th=[ 9],
| 70.00th=[ 10], 80.00th=[ 10], 90.00th=[ 11], 95.00th=[ 11],
| 99.00th=[ 11], 99.50th=[ 15], 99.90th=[ 44], 99.95th=[ 66],
| 99.99th=[ 133]
bw ( KiB/s): min=36896, max=119664, per=100.00%, avg=65267.25, stdev=19783.49, samples=599
iops : min= 9224, max=29916, avg=16316.81, stdev=4945.88, samples=599
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.97%, 10=88.58%, 20=9.96%, 50=0.42%
lat (msec) : 100=0.05%, 250=0.03%
cpu : usr=4.45%, sys=78.75%, ctx=5012467, majf=0, minf=1
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=4891978,4889279,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=63.7MiB/s (66.8MB/s), 63.7MiB/s-63.7MiB/s (66.8MB/s-66.8MB/s), io=18.7GiB (20.0GB), run=300001-300001msec
WRITE: bw=63.7MiB/s (66.8MB/s), 63.7MiB/s-63.7MiB/s (66.8MB/s-66.8MB/s), io=18.7GiB (20.0GB), run=300001-300001msec

@ignatk
Contributor

ignatk commented Dec 30, 2020

Thanks for testing! I've submitted the patches to the dm-devel mailing list: https://www.redhat.com/archives/dm-devel/2020-December/msg00452.html

@jonathanunderwood

I see a second updated set of patches has been submitted here: https://www.redhat.com/archives/dm-devel/2020-December/msg00468.html

@ignatk it would be great to get these backported to 5.10 if possible, given that it's a stable release.

@ignatk
Contributor

ignatk commented Jan 1, 2021

The patches have CC: stable@vger.kernel.org, which should get them automatically backported to stable releases, once they are merged to the mainline branch.
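
For reference, that tag is just a trailer line in a patch's commit message; per the kernel's stable rules it can also pin the oldest applicable version, along these lines (the version range shown here is illustrative):

Cc: stable@vger.kernel.org # v5.9+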

@jonathanunderwood

The patches have CC: stable@vger.kernel.org, which should get them automatically backported to stable releases, once they are merged to the mainline branch.

Brilliant, thank you :)

@ignatk
Contributor

ignatk commented Jan 19, 2021

FYI: these patches, and even a few more, were merged upstream and backported to 5.10 as of the latest 5.10.9. 5.9 is EOL now, so they are not there.
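
To check a given machine, one way is to confirm the kernel version and then look for the optional parameters in the device-mapper table - a sketch assuming a dm-crypt device named root (adjust the name for your setup):

uname -r   # 5.9+ introduced the flags; 5.10.9+ carries these follow-up fixes
sudo dmsetup table root | grep -E 'no_(read|write)_workqueue'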

@Arcitec

Arcitec commented Feb 26, 2021

@ignatk Hi, I just want to throw in a quick THANK YOU for everything you're doing to speed up Linux disk encryption! You're AWESOME! <3

@geovex

geovex commented Mar 14, 2021

Interesting. With these flags enabled, guest OSes in VirtualBox became extremely unstable. Maybe we are still missing something, as with the btrfs bug.

@maciejsszmigiero

Might be, since I recently got a hard lockup when a machine with these flags enabled started heavily swapping.
I don't know whether that was related to swap in general or to zswap specifically.

I am waiting to see whether it occurs again.

@geovex

geovex commented Mar 14, 2021

A long time ago I noticed strange VBox behavior where it tried to write to a memory buffer while that buffer was still being written. But I may be mistaken. I got several direct-io errors about data changing "in-flight" during a write. Maybe this is connected.
