Use the uniform distribution by default #329

Closed
theTibi opened this issue Nov 10, 2019 · 5 comments

Comments

@theTibi

theTibi commented Nov 10, 2019

Hi,

I was running a series of tests and noticed that the get_id() function is not really random:

local function get_id()
   return sysbench.rand.default(1, sysbench.opt.table_size)
end

It should generate numbers between 1 and the table size. In my test the table size was 1000, so it should return random numbers between 1 and 1000.

To keep it simple, I used only one function, execute_index_updates, and just printed the ids: print (id)

I logged the output in a file:

cat get_id_default_int.log| wc -l
142777
cat  get_id_default_int.log | sort | uniq -c | sort -rn | head -n 20
  13106 500
  12901 505
  12879 501
  12726 504
  12636 502
  12632 499
  12604 503
   9714 498
   6419 506
   1971
    184 494
    184 493
    180 536
    176 534
    176 517
    175 485
    174 551
    174 488
    173 532
    173 496

13106+12901+12879+12726+12636+12632+12604=89484

There are 142777 lines in the file, and only 7 numbers account for 89484 of them, which is more than 60% of all the lines. So basically, when I am running MySQL benchmarks, sysbench creates hotspots in the workload.
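The share quoted above can be re-derived from the counts in the log. The following is only a minimal sketch; the helper name `top_share_percent` is hypothetical and not part of sysbench.

```c
#include <stddef.h>

/* Hypothetical helper (not part of sysbench): percentage of `total`
   log lines accounted for by the given per-value counts. */
double top_share_percent(const unsigned long *counts, size_t n,
                         unsigned long total)
{
  unsigned long sum = 0;

  for (size_t i = 0; i < n; i++)
    sum += counts[i];

  /* Guard against an empty log to avoid dividing by zero. */
  return total ? 100.0 * (double) sum / (double) total : 0.0;
}
```

Fed the seven counts summed above and a total of 142777, this gives roughly 62.7%. Under a uniform distribution over 1000 ids, any seven values would be expected to account for about 0.7% of the lines.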

By digging the code a bit:

/*
  Return random number in the specified range with distribution specified
  with the --rand-type command line option
*/

uint32_t sb_rand_default(uint32_t a, uint32_t b)
{
  return rand_func(a,b);
}
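To illustrate why the result depends on --rand-type, here is a minimal sketch of this kind of dispatch. It assumes rand_func is a function pointer chosen once at startup from the option value. All `demo_*` names and the two placeholder "distributions" are hypothetical and deterministic on purpose; they are not the sysbench implementations.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t (*rand_fn_t)(uint32_t a, uint32_t b);

/* Placeholder "distributions" for this sketch only. They are deterministic
   so the dispatch is easy to follow; real generators draw random values. */
static uint32_t demo_lowest(uint32_t a, uint32_t b) { (void) b; return a; }
static uint32_t demo_middle(uint32_t a, uint32_t b) { return a + (b - a) / 2; }

/* Chosen once at startup. Starts on the middle-heavy placeholder to mirror
   a non-uniform default. */
static rand_fn_t demo_rand_func = demo_middle;

/* Hypothetical option handler: map a name to an implementation.
   Returns 0 on success, -1 for an unknown name (selection unchanged). */
int demo_set_rand_type(const char *name)
{
  if (strcmp(name, "lowest") == 0)
    demo_rand_func = demo_lowest;
  else if (strcmp(name, "middle") == 0)
    demo_rand_func = demo_middle;
  else
    return -1;
  return 0;
}

/* Same shape as sb_rand_default(): forwards to whatever was selected. */
uint32_t demo_rand_default(uint32_t a, uint32_t b)
{
  return demo_rand_func(a, b);
}
```

The point of the sketch is that a caller such as get_id() never names a distribution itself, so whatever the option defaults to silently shapes every id it returns.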

I retested with --rand-type=uniform and was able to generate really random numbers:

cat get_id_default_uniform.log | sort | uniq -c | sort -rn  | head -n 20
   2399
    185 833
    184 314
    179 959
    179 555
    179 437
    177 15
    176 815
    175 896
    174 894
    174 224
    174 215
    173 901
    173 727
    173 361
    172 428
    172 319
    171 721
    171 394
    170 78

I also noticed that sometimes the get_id() function does not produce any number, and sometimes it produces numbers bigger than 1000, which is very weird.

In the log file I could see lines like this:

503
499500

503
503503

505
502500

503
506

So it looks like something is going wrong under the hood.

If this is a feature for testing hotspots, it should be clearly documented. Still, I would recommend changing get_id from default to uniform, because I think most people do not realise that the default will generate hotspots in their tests, which could make many tests give misleading results.

@akopytov
Owner

Hi,

The behavior of sysbench.rand.default is controlled by the --rand-type command line option, which defaults to special. The special distribution is a little unscientific, but it was supposed to be an approximation to real-life workloads. It will likely be replaced by the Zipfian distribution in the next version of sysbench. I may also change the default of --rand-type to uniform, as was requested by a comment in that post.

It is also easy to visualize different distributions using the histograms API.

I used the following Lua script:

function thread_init()
   h = sysbench.histogram.new(1000, 1, 10)
end

function event()
   h:update(sysbench.rand.default(1, 10))
end

function thread_done()
   h:print()
end

These are the results that I got:

$ sysbench /tmp/random.lua --time=1 --rand-type=zipfian --verbosity=0 run
       value  ------------- distribution ------------- count
       1.000 |**************************************** 690507
       2.001 |***********************                  396307
       3.002 |*****************                        286235
       3.996 |*************                            228276
       4.997 |***********                              190612
       5.995 |**********                               164206
       6.996 |********                                 145487
       7.997 |********                                 130939
       8.994 |*******                                  118629
      10.000 |******                                   109564

$ sysbench /tmp/random.lua --time=1 --rand-type=pareto --verbosity=0 run
       value  ------------- distribution ------------- count
       1.000 |**************************************** 2027752
       2.001 |****                                     204580
       3.002 |***                                      128894
       3.996 |**                                       95608
       4.997 |**                                       76857
       5.995 |*                                        65191
       6.996 |*                                        55967
       7.997 |*                                        49721
       8.994 |*                                        44620
      10.000 |*                                        40663
$ sysbench /tmp/random.lua --time=1 --rand-type=special --verbosity=0 run
       value  ------------- distribution ------------- count
       1.000 |                                         1
       2.001 |                                         186
       3.002 |                                         16526
       3.996 |****                                     173084
       4.997 |***************************              1106586
       5.995 |**************************************** 1658067
       6.996 |                                         16528
       7.997 |                                         165
$ sysbench /tmp/random.lua --time=1 --rand-type=zipfian --rand-zipfian-exp=0 --verbosity=0 run
       value  ------------- distribution ------------- count
       1.000 |**************************************** 239561
       2.001 |**************************************** 239582
       3.002 |**************************************** 239817
       3.996 |**************************************** 239263
       4.997 |**************************************** 239868
       5.995 |**************************************** 239665
       6.996 |**************************************** 239625
       7.997 |**************************************** 239775
       8.994 |**************************************** 239745
      10.000 |**************************************** 239116
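The last run can be read against the textbook Zipf distribution. This is a sketch under an assumption: that --rand-zipfian-exp plays the role of the exponent $s$ in the standard formula. The thread does not confirm how sysbench implements it internally.

```latex
% Textbook Zipf probability mass function over the values 1..N with exponent s
P(k) = \frac{k^{-s}}{\sum_{j=1}^{N} j^{-s}}, \qquad k = 1, \dots, N

% With s = 0 every term k^{-s} equals 1, so the distribution is uniform
P(k)\big|_{s=0} = \frac{1}{\sum_{j=1}^{N} 1} = \frac{1}{N}
```

For $N = 10$ this gives $P(k) = 0.1$ for every value, which matches the histogram above, where each of the ten buckets holds roughly one tenth of the samples.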

With all that in mind, I'm a little confused: what exactly is being requested in this issue?

@theTibi
Author

theTibi commented Nov 11, 2019

Hi,

Thanks for the quick answer.
Regarding your question about what is requested here: first of all, I just wanted to confirm that with the default settings random is not truly random. If that's true, I would recommend/request changing the default to be truly random and/or documenting this clearly in the manual. Currently the manual lists the different random type options but does not say or show how they differ.

@akopytov
Owner

akopytov commented Nov 11, 2019

@theTibi well, saying that a non-uniform random number is not "truly random" would be wrong and misleading in my opinion. The "probability distribution" term is scientifically correct and that's precisely what I use in sysbench docs.

Which distribution people actually expect by default is another question. I asked that question explicitly in the previously mentioned blog post and got only one response, saying that uniform is preferable.

I'm fine with leaving this issue as a feature request to make uniform distribution the new default in the next major release. But I'm going to change the title to make the request more explicit and less confusing in the changelog.

@akopytov changed the title from "MySQL - get_id() is not random, it creates hotspots" to "Use the uniform distribution by default" on Nov 11, 2019
@Tusamarco

Tusamarco commented Feb 26, 2020

A simple change to the default, from special to uniform, will suffice and gives a uniform random distribution out of the box:

index 0539148..a9a43f4 100644
--- a/src/sb_rand.c
+++ b/src/sb_rand.c
@@ -67,7 +67,7 @@ static sb_arg_t rand_args[] =
 {
   SB_OPT("rand-type",
          "random numbers distribution {uniform, gaussian, special, pareto, "
-         "zipfian} to use by default", "special", STRING),
+         "zipfian} to use by default", "uniform", STRING),
   SB_OPT("rand-seed",
          "seed for random number generator. When 0, the current time is "
          "used as an RNG seed.", "0", INT),
