How to optimize FIO testing? #579

Closed
jjpcat opened this issue Apr 11, 2018 · 13 comments

@jjpcat

jjpcat commented Apr 11, 2018

This is my first time posting here, so please excuse me if this is the wrong place.

I have 4 very fast SSDs (Intel P4600, Micron 9200 MAX) in my i7-7700 PC. I am using fio to measure the aggregate 4 kB random read performance, so I start one fio process per SSD, e.g.,

sudo fio --name=dummy --size=50G --runtime=600 --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=randread -bs=4k** --iodepth=64 --numjobs=8 --group_reporting

The CPU load shoots to 100%, and it's still above 90% when I cut down to testing 3 SSDs. Reducing numjobs helps lower the CPU load, but I notice that it also brings down the IOPS. So I am wondering whether there are any optimizations that would reduce fio's CPU requirements.

Thanks.

@axboe
Owner

axboe commented Apr 12, 2018 via email

@szaydel
Contributor

szaydel commented Apr 12, 2018

@jjpcat: Latency should be examined closely; this sounds very much like I/O wait. A single fio process can drive hundreds of drives with rather low CPU load, so I suspect the problem is elsewhere. For the sake of sanity, are you using a fairly recent version of fio?
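
For illustration, a single fio process can cover several drives with one job file that has one section per device. This is only a minimal sketch (the device paths and parameters are placeholders, mirroring the command in the first post):

; shared defaults for all jobs
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=64
runtime=600
group_reporting

; one job section per drive
[nvme0]
filename=/dev/nvme0n1

[nvme1]
filename=/dev/nvme1n1

Save it as, say, multi-ssd.fio (the file name is arbitrary) and run it with sudo fio multi-ssd.fio; fio starts one job per section and, because of group_reporting, reports the aggregate.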

@jjpcat
Copy link
Author

jjpcat commented Apr 12, 2018

Thanks for the responses.

@szaydel Could you show me how to use a single fio process to drive multiple SSDs?

I am attaching a screenshot from when I was running 3 fio processes at the same time. The 3 SSDs used in this case were 1 Intel P4600 (the one in the lower-left window) and 2 other somewhat slower NVMe SSDs. The average latency is large (3.9 ms to 5.8 ms), but that's expected when iodepth is set to 512. If I set iodepth to 1, the average latency for these 3 SSDs is 93 us to 126 us, which is typical.

So for this test, because two of the SSDs are rather slow, the CPU usage is only 84%.

I am using fio 2.16 running on Ubuntu with kernel 4.13.

My observation with these SSDs is that, to hit max IOPS, I need to set numjobs to 8. IOPS will be about 2% lower if I set numjobs to 4 and significantly lower if numjobs is set to 1.

[screenshot: fio running against 3 SSDs]

@jjpcat
Author

jjpcat commented Apr 12, 2018

@szaydel Is this the right way to drive multiple SSDs with 1 fio? --filename=/dev/nvme0n1:/dev/nvme1n1

I tried this. The IOPS is higher than any individual drive's IOPS, but it's 10-30% lower than the sum of the individual IOPS. Increasing numjobs or iodepth doesn't help. There is an article on the internet saying that the aggregate is limited by the slowest SSD in the group.

With a single fio process driving multiple SSDs, the CPU utilization also shot up, so it doesn't improve CPU usage per aggregate IOPS, which is the problem I am trying to solve.

Thanks.

@szaydel
Contributor

szaydel commented Apr 13, 2018

@jjpcat: I was being fairly generic with my statement, which assumed a filesystem over the drives as opposed to just the individual drives. With regard to IO depth, I am not sure 512 is really sane, but you may have a very specific reason for that number. You might also be hitting throughput limits of the bus. Have you done just a very basic sequential IO test to see what throughput you top out at?
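
For reference, a basic large-block sequential read test along those lines might look something like this (the block size, queue depth, and runtime are only illustrative, and the device path is just an example):

sudo fio --name=seqread --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=read --bs=1M --iodepth=32 --runtime=60 --time_based

The bandwidth reported by a run like this gives a rough ceiling to compare the random-read numbers against.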

@szaydel
Contributor

szaydel commented Apr 13, 2018

@jjpcat: Install a later version of fio, because 2.16 is quite dated at this point.

@jjpcat
Author

jjpcat commented Apr 16, 2018

@szaydel Updated to FIO 3.5. Still the same.

@sitsofe
Collaborator

sitsofe commented Apr 18, 2018

@jjpcat As this isn't so much an issue in fio as a "How do I?" question, it is better aimed at the fio mailing list. I'll note there have been occasional "What are the go-faster options for fio?" questions on the mailing list in the past (e.g. https://www.spinics.net/lists/fio/msg05451.html ) and there are examples of the jobs people used to reach high IOPS in various places (e.g. https://marc.info/?l=linux-kernel&m=140313968523237&w=2 ).
Some of those options increase IOPS at the expense of CPU, but some will reduce overhead while increasing latency (e.g. the batching options in http://fio.readthedocs.io/en/latest/fio_doc.html#i-o-depth ).
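
As a rough sketch of what those batching options look like in practice (the batch sizes below are purely illustrative, not recommendations):

sudo fio --name=batched --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --iodepth_batch_complete_max=32 --runtime=60 --time_based

Submitting and reaping I/O in batches cuts the number of system calls per I/O, trading a little latency for lower CPU overhead.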

Is this the right way to drive multiple SSDs with 1 fio? --filename=/dev/nvme0n1:/dev/nvme1n1

Sort of. That's more for doing round robin between multiple disks, but this has already veered into a discussion topic rather than an issue.

Here are some hints:

  1. Max IOPS is a tricky business because it rarely models real world workloads. In real life you don't actually want to send the tiniest block sizes all the time because those have the biggest overhead...
  2. Run only one fio with numjobs=1 against one SSD for now (a baseline command is sketched after this list). Doing anything else is just going to confuse matters and create potential confounding factors. Once you've maxed that out, THEN we can move on.
  3. Since your disks are NVMe, do you know what maximum queue depth they support? You may be able to get a hint by looking in /sys/block/<dev>/device/queue_depth. You may also find it in spec sheets, but be aware your controller may limit things too.
  4. When setting options on the command line you need to use -- not just - (e.g. see what you did with -bs=4k**).
  5. Why do you have two ** after 4k?
  6. In the screenshot you attached a huge amount of time was spent in the kernel (50.2 sys). You might want to investigate why.
  7. It's easier for us if you copy/paste your terminal's text into a text box rather than sending a screenshot ;-) .
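
To make hints 2 and 3 concrete, a single-job baseline run and a quick queue-depth check might look like this (the device name, queue depth, and runtime are just examples, and the sysfs attribute may not be present on every device):

sudo fio --name=baseline --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=randread --bs=4k --numjobs=1 --iodepth=32 --runtime=60 --time_based
cat /sys/block/nvme0n1/device/queue_depth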

As I said I'd strongly recommend taking this to the mailing list...

@sitsofe
Collaborator

sitsofe commented Apr 24, 2018

@jjpcat Any follow up on this?

@jjpcat
Author

jjpcat commented Apr 24, 2018

@sitsofe Thanks for providing that info. I think you have a good point regarding the time spent in the kernel, but that's somewhat beyond my control since I am using the standard Linux NVMe driver. I may try a user-space driver and test again.

Sorry, but I cannot do numjobs=1. None of our competitors do this, and if I use only 1 job, our IOPS numbers will look much worse than theirs.

jjpcat closed this as completed on Apr 24, 2018
@szaydel
Contributor

szaydel commented Apr 25, 2018

@jjpcat, are you trying to represent the real world, or are you really just chasing numbers? If numbers, I think queue depth is quite important. Did you do any straight sequential IO where you try to push as much through as possible? At least that should tell you how far you can push the hardware.

Kernel time is likely due to small IO and the resulting large number of syscalls to get the IO done. And, as far as we can tell, the system seems to be spending a lot of time getting this IO done, which means we are waiting in the kernel.

Just a few thoughts about how I would approach this. First, do straight sequential IO, just reads with a large block size. Next, run the same test with both reads and writes, keeping watch on CPU and kernel time. Start out with effectively QD=1 and increase from there, trying to figure out at which point QD stops mattering. CPU utilization should keep going up with QD. Once you hit a bottleneck, toss another CPU into the mix. If there is no difference, your problem is something else in the system. I am quite certain fio won't be the root cause; something else is going to be your bottleneck.
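
As a sketch of that queue-depth sweep (the block size, depths, runtime, and device path are all just examples):

for qd in 1 2 4 8 16 32 64; do sudo fio --name=seqread-qd$qd --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=read --bs=128k --iodepth=$qd --runtime=30 --time_based; done

The depth at which bandwidth stops increasing is roughly where the drive (or the bus) saturates for that block size.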

@sitsofe
Collaborator

sitsofe commented Apr 25, 2018

@jjpcat Just for the record, I wasn't saying only do numjobs=1 and never go any further, but rather tune the speed at numjobs=1 and only move on to numjobs=2 etc. once you (and everyone else) are sure that is maxed out. When you're able to submit I/O asynchronously, one job can keep most single disks totally busy, and the fewer threads/processes you have, the less overhead you waste on things like context switching.

Don't forget to look over https://github.com/axboe/fio/blob/master/MORAL-LICENSE if you're going to publish statistics using fio.

@axboe
Owner

axboe commented Apr 25, 2018

I agree with both of you. As a rule of thumb, you need enough threads to get the max performance, and no more. For NVMe on modern boxes, a round figure of ~450K IOPS per core is feasible, so you'll usually find your best performance in the 2-4 thread case. Keep QD as low as possible while still reaching the peak, no more.
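
As a starting point along those lines (the numbers and device path are purely illustrative, to be tuned from there):

sudo fio --name=randread --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=randread --bs=4k --numjobs=2 --iodepth=32 --runtime=60 --time_based --group_reporting

Increase numjobs or iodepth only while IOPS keeps going up; once it flattens, extra threads and queue depth just burn CPU.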
