
fio hangs when doing randwrite with io_uring #1195

Closed
karlatec opened this issue Mar 2, 2021 · 13 comments
Labels: needreporterinfo (Waiting on information from the issue reporter)

karlatec commented Mar 2, 2021

Description of the bug:
Fio hangs / fails to finish a randwrite job when running with the io_uring engine.

After starting fio, the "ETA" line is displayed as it should be:
Jobs: 1 (f=1): [w(1)][27.3%][w=524MiB/s][w=134k IOPS][eta 00m:08s]
At some point the bandwidth and IOPS stats disappear:
Jobs: 1 (f=1): [w(1)][72.7%][eta 00m:05s]
After reaching 100%, the progress counter resets back to 0%, the "eta" timer shows an abnormally high value, and fio fails to end. SIGINT and SIGTERM are ignored; SIGKILL must be used.
Not reproducible 100% of the time.

I've tried reproducing this with --debug=io,file, but no luck. There are no related messages in dmesg.
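
Roughly what I ran for the debug attempt (a sketch; the log file name and the second terminal watching dmesg are just my way of capturing things, not anything fio requires):

# terminal 1: run the job with io/file debug tracing and keep the output
sudo fio --debug=io,file config.fio 2>&1 | tee fio-debug.log

# terminal 2: follow the kernel log while the job runs
sudo dmesg -w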

Environment:
Fedora 33 5.10.19-200.fc33.x86_64
gcc (GCC) 10.2.1 20201125 (Red Hat 10.2.1-9)
fio 3.25
liburing from DNF repo - 0.7-3.fc33 (also tested with https://github.com/axboe/liburing/releases/tag/liburing-0.7, reproduces as well)

fio version: 3.25

Reproduction steps
Run fio with the following config in a loop, as the reproducibility is not 100%. Something like for i in $(seq 1 25); do sudo fio config.fio; done should do the trick; a slightly more defensive variant is sketched after the config.

[global]
direct=1
thread=1
norandommap=1
group_reporting=1
time_based=1
ioengine=io_uring

rw=randwrite
bs=4096
runtime=20
numjobs=1
fixedbufs=1
hipri=1
registerfiles=1
sqthread_poll=1

[filename0]
iodepth=1
cpus_allowed=20
filename=/dev/nvme18n1
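
A slightly more defensive version of that loop (just a sketch; it assumes the job file above is saved as config.fio) uses timeout to flag any run that blows well past runtime=20, since a hung fio only goes away with SIGKILL:

for i in $(seq 1 25); do
    # 60s is well past runtime=20; timeout kills fio and exits non-zero if it is still stuck
    sudo timeout -s KILL 60 fio config.fio || echo "run $i hung or failed"
done
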
axboe (Owner) commented Mar 2, 2021

I'm pretty sure the sqpoll handling is broken in fio. Does it work if you remove sqthread_poll=1?

karlatec (Author) commented Mar 2, 2021

Yes. 100/100 pass rate without this option.

axboe (Owner) commented Mar 2, 2021

Can't reproduce this, but I'm running a newer kernel, so who knows...

When it's stuck, can you try and do:

cat /proc/<pid>/stack

for each fio PID, where <pid> is the PID of a fio task on the system.
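
Something along these lines should grab all of them in one go (a sketch; pgrep -x only matches tasks literally named fio, and per-thread stacks live under /proc/<pid>/task/ if you need those too):

for pid in $(pgrep -x fio); do
    echo "== fio pid $pid =="
    sudo cat /proc/$pid/stack
done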

If you check top, is fio spinning 100%?

If you check top, do you see any io_uring-sq threads?

karlatec (Author) commented Mar 3, 2021

sudo cat /proc/1492442/stack
[<0>] hrtimer_nanosleep+0x8b/0x100
[<0>] common_nsleep+0x40/0x50
[<0>] __x64_sys_clock_nanosleep+0xb0/0x110
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

If you check top, is fio spinning 100%?

Yes

If you check top, do you see any io_uring-sq threads?

Not visible in top or htop, but I can see it in ps
root 1492526 0.2 0.0 0 0 ? S 08:26 0:00 [io_uring-sq]
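
(htop hides kernel threads by default, which is probably why it doesn't show up there; a plain ps filter finds it:)

ps -e -o pid,stat,comm | grep io_uring-sq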

karlatec (Author) commented Mar 9, 2021

@axboe any suggestions on how to fix or work around this? I could really use some help here, and I'd rather not give up on the sqthread_poll option in the config file, because it had quite an impact on performance if I recall correctly.

karlatec (Author) commented Mar 12, 2021

@axboe I saw your fix fc2dc21 and retested the issue with current fio master (014ab48a, "Merge branch 'dev_luye_github' of https://github.com/louisluSCU/fio"), but unfortunately the issue still persists.
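
The retest was roughly just pulling master and rebuilding, something like:

cd fio
git pull origin master
git checkout 014ab48a
make clean && ./configure && make -j$(nproc)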

axboe (Owner) commented Mar 12, 2021

Those changes were specific to t/io_uring, which isn't used by fio at all, so they definitely won't make a difference :-)

It might be a kernel issue; I haven't been able to reproduce it, but I'm also on a drastically newer kernel than you are...

karlatec (Author) commented:

Hey @axboe, how about reproducing it like this? It reproduces as well in a virtual machine whose kernel is very similar to mine: fedora32.localdomain 5.10.20-100.fc32.x86_64

I'm assuming you use Vagrant for work. If not, let me know and I'll prepare steps with plain QEMU.

Put a Vagrantfile somewhere in your FS (it doesn't have to be on NVMe; I'm using a Crucial MX500 for this repro):

Vagrant.configure(2) do |config|
  config.vm.box = "generic/fedora32"
  config.vm.box_check_update = false

  if Vagrant.has_plugin?("vagrant-proxyconf")
    config.proxy.http     = ENV['http_proxy']
    config.proxy.https    = ENV['https_proxy']
    config.proxy.no_proxy = "localhost,127.0.0.1"
  end

  config.vm.provider "libvirt" do |libvirt, override|
    libvirt.random_hostname = "1"
    libvirt.driver = "kvm"
    libvirt.graphics_type = "vnc"
    libvirt.memory = 8192
    libvirt.cpus = 12
    libvirt.video_type = "cirrus"
    libvirt.qemuargs :value => "-drive"
    libvirt.qemuargs :value => "format=raw,file=/tmp/fio-repro/nvme.img,if=none,id=nvme01"
    libvirt.qemuargs :value => "-device"
    libvirt.qemuargs :value => "nvme,drive=nvme01,serial=1234"
  end
end

Create the backing file for the emulated NVMe drive (the Vagrantfile above expects it at /tmp/fio-repro/nvme.img):

mkdir -p /tmp/fio-repro
dd if=/dev/zero of=/tmp/fio-repro/nvme.img bs=1M count=3000
chmod 777 /tmp/fio-repro/nvme.img

Run the VM

vagrant up

SSH into the VM with vagrant ssh. Then:

sudo dnf check-update
git clone https://github.com/axboe/fio.git
sudo dnf builddep -y fio
cd fio
./configure
make -j100
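
If any of the above fails because git, the compiler, or the dnf builddep plugin isn't on the box yet (an assumption about the generic/fedora32 image; it may already have them), install those first:

sudo dnf install -y git gcc make dnf-plugins-core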

Create the job config file (fio.conf):

[global]
direct=1
thread=1
norandommap=1
group_reporting=1
time_based=1
ioengine=io_uring

rw=randwrite
bs=4096
runtime=20
numjobs=1
fixedbufs=1
hipri=1
registerfiles=1
sqthread_poll=1

[filename0]
iodepth=1
cpus_allowed=0
filename=/dev/nvme0n1

And run the test:

sudo ./fio fio.conf --ioengine=io_uring

In my case it hung on the second attempt, with the same /proc stack as in my previous comment.

sitsofe (Collaborator) commented Apr 10, 2021

@karlatec I don't think Fedora 32 is supported any more... can you reproduce this with Fedora 34's kernel?

sitsofe added the needreporterinfo (Waiting on information from the issue reporter) label on Apr 10, 2021
karlatec (Author) commented:

@sitsofe Fedora 34 is only available in beta at the moment, and Fedora 32 is still supported. I will try later with F33.

sitsofe (Collaborator) commented Apr 12, 2021

It's probably no big deal as 32/33/34 all have a 5.11 kernel (https://bodhi.fedoraproject.org/updates/?packages=kernel). Could you try with that?
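
Inside the Vagrant VM that should be roughly (a sketch):

sudo dnf upgrade -y kernel
sudo reboot
# after reconnecting, confirm a 5.11.x kernel
uname -r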

karlatec (Author) commented:

@sitsofe I tested with VMs running:

Linux fedora32.localdomain 5.11.11-100.fc32.x86_64 #1 SMP Tue Mar 30 16:53:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux fedora33.localdomain 5.11.11-200.fc33.x86_64 #1 SMP Tue Mar 30 16:53:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

and wasn't able to reproduce the issue with the previously provided reproduction steps. It must have been something in the kernel. I guess we can close this issue. Thanks!

axboe closed this as completed on Apr 12, 2021
sitsofe (Collaborator) commented Apr 13, 2021

Thanks for following up, @karlatec.
