numjobs >1, groupreporting, json output metrics inconsistency #519
Comments
This stems from where the values are calculated. If you look at the summary line, you have a write: row whose IOPS and BW values are the totals divided by time (which currently assumes all jobs start and finish at the same time). Further down you have bw (KiB/s) and iops, which are calculated by periodic sampling and then averaging the samples (each job collects its own samples). I'm guessing we are seeing the sample-based data for only the first job. The question is: does it make sense to aggregate sample data across jobs, given that it is not per-I/O information?
In the json output for fio with group reporting enabled: as per our understanding, the values we get in the json output as bw_min, bw_max, bw_mean and bw_agg are per-thread/job results, and bw, iops etc. are the aggregated result of all threads/jobs. But how can we achieve the aggregated result of all threads/jobs for the max, min and mean values of bw and iops?

aggregated mean_bw = (bw_mean * 100) / bw_agg
aggregated mean_iops = (iops_mean * 100) / iops_agg

Is this the correct way to achieve those values? And is the latency output (lat_ns) per thread or for all threads?
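A minimal sketch of the arithmetic in that question (an assumption based on the formula above: that bw_agg is one job's percentage share of the aggregate bandwidth, so scaling bw_mean by 100 / bw_agg estimates the group-wide mean; the function name and the dict layout are illustrative, not fio's API):

```python
# Hypothetical sketch of the aggregation formula above: if <metric>_agg is a
# job's percentage share of the aggregate, scaling <metric>_mean back up by
# 100 / <metric>_agg estimates the group-wide mean. Works for bw or iops.
def estimated_group_mean(stats, metric="bw"):
    return stats[f"{metric}_mean"] * 100.0 / stats[f"{metric}_agg"]

# Made-up numbers: a job averaging 5000 KiB/s that accounted for 25% of the
# aggregate implies an estimated group mean of 20000 KiB/s.
write_stats = {"bw_mean": 5000.0, "bw_agg": 25.0}
print(estimated_group_mean(write_stats, "bw"))  # 20000.0
```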
@jo-ale35 you'll have to check, but I think the first set are the totals divided by time and the second set are due to periodic sampling on just one of the jobs. fio would need various bits of code rewritten if we want it to be able to do periodic sampling of all the jobs in a group, and it raises the question of what to do if jobs have unequal numbers of samples / their sample periods don't overlap exactly. I'm half tempted to say periodic sampling should just be turned off when group reporting is turned on.
@sitsofe: I 100% agree with your assessment about periodic sampling. I also think that with group reporting enabled, periodic sampling should be off.
@sitsofe @szaydel Thanks for your inputs. I am developing a graphical representation of fio results, and I want three values (minimum, maximum, and mean) as the total of all the jobs run. The first part gives me only one value, the mean bandwidth, so I am using the above maths to get the three values; approximate values will be OK for me. Should I do the same for latency values? Are the latency values from periodic sampling on just one of the jobs, or the total of all the jobs?
@jo-ale35, I think you can look at this from two different angles: statistical correctness and granularity. Statistically, it is hard to derive meaningful numbers by taking interval data from more than one job and treating each job equally; there is enough variability that the data will not necessarily be meaningful. In my mind each job is an individual distribution, so if you have, say, three jobs, each giving you a mean, a max, and a min for some variable, you should consider working with groups instead. Assuming you have multiple jobs all grouped into a single group, you should have no trouble using only summary data, not interval data, as your source for visualization.

This answer also depends on whether you care to report on individual jobs for some reason. In other words, are all jobs going to be identical and all running at the same time, or is there something more complex going on, like partial overlap in jobs (some finishing sooner than others), varying amounts of IO, or different IO patterns?

My knee-jerk reaction is to say that you are better off in most cases working with only the final data, which is processed for you and to some degree smoothed over each runtime. If you need multiple samples, run the same job definitions over and over; from those results you can probably pull out averages or medians for each datapoint meaningful to you, like mean/min/max bandwidth, iops, etc.
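That last suggestion, repeating the run and aggregating only the summary numbers, could be sketched roughly like this (the `jobs[0]` indexing assumes group_reporting has collapsed all jobs into a single entry in fio's JSON output; the function and variable names are illustrative):

```python
import statistics

# Sketch, per the suggestion above: given the parsed JSON from several
# identical fio runs, pull the grouped summary value for one metric and
# report mean/min/max across runs. jobs[0] assumes group_reporting merged
# all jobs into one entry in each run's output.
def summarize_runs(runs, direction="write", metric="bw"):
    values = [run["jobs"][0][direction][metric] for run in runs]
    return {"mean": statistics.mean(values),
            "min": min(values),
            "max": max(values)}

# Made-up data standing in for three runs' parsed --output-format=json files.
runs = [{"jobs": [{"write": {"bw": 100}}]},
        {"jobs": [{"write": {"bw": 200}}]},
        {"jobs": [{"write": {"bw": 300}}]}]
print(summarize_runs(runs))  # mean=200, min=100, max=300
```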
The points from @szaydel really echo my thoughts too. I'm not a statistician but you have to be careful with how you start mixing sampled data together because if they are drawn from different distributions it's hard to use them together meaningfully. I see two approaches:
Sam's earlier comment has encouraged me to look at whipping up a patch to just turn sampling off when using group reporting. When I've done it we'll see what Jens thinks. |
@sitsofe: thank you! :) I would also add that the shorter the jobs, the fewer samples are collected, and the more likely there are to be statistically impactful outliers in the interval data, which could further muddy the waters. Conversely, the longer the jobs are, the more the noise subsides. Length was not really mentioned here, so I am just raising it as another potential point to consider.
I got the same problem - see the IOPS and iops values below. With group_reporting specified, the sampling numbers are really confusing - any plans to fix it?

rand-read-J4Q64: (groupid=26, jobs=4): err= 0: pid=22926: Thu Nov 29 19:50:03 2018
@universalbodge As just another member of the community I've no plans to tackle this any time soon, and as fio is a community project it doesn't really have a roadmap... If I were to "fix" this I would likely just remove the sample data when group reporting is turned on, but if you've created a good fix for this please submit a patch for review!
I can see how that would be confusing; it's a quick fix, which I've now committed.
Issue: It's not clear which metrics are per job and which are for a group of jobs.

Using `numjobs` > 1 with `group_reporting` set and `output-format=json`, I've noticed that some metrics report a value for the job group, while other values are divided by the number of jobs. For example, for a purely write workload with `numjobs=4`, my `write["iops"]` is 25345 but my `iops_mean` is 6334. For this specific workload these numbers should be the same, since it is a write-only workload. 25345/6334 is approximately 4, which is the value of `numjobs`, so I can see where the discrepancy lies. This issue also applies to `iops_stddev`.

cli-args: `fio reporting-issue.txt --output-format=json+ --output=out.out`

Attachments: reporting-issues.txt (jobfile), out.out
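A quick sanity check of the numbers reported above (the 25345 and 6334 values come from this issue; the function name is illustrative, and the ratio landing near numjobs is what suggests iops_mean is per job while iops is the group total):

```python
# Sketch checking the discrepancy described above: with group_reporting,
# write["iops"] appears to be the group total while iops_mean appears to be
# per job, so their ratio should come out close to numjobs (4 here).
def discrepancy_ratio(write_stats):
    return write_stats["iops"] / write_stats["iops_mean"]

# Numbers taken from the report above.
print(round(discrepancy_ratio({"iops": 25345, "iops_mean": 6334}), 2))  # 4.0
```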