Normal distribution standard deviation does not seem correct #374

Open
joelhoro opened this Issue Apr 3, 2016 · 13 comments

Projects

None yet

4 participants

@joelhoro
joelhoro commented Apr 3, 2016

When running 'Normal distribution' with an average of 5000 and a standard deviation of 1000 I get a distribution of data that is far off. I measure a sample stdev of ~ 260 on a sample size of 100. Happy to fix the analytics - couldn't find where it is in the codebase...

@conradhagemans
Contributor

Hi Joel,
Any help is apreciated.
You can find the code:
/ROOT FOLDER/plugins/dataTypes/NormalDistribution/

@TMSiefers

conradhagemans,

I believe the problem is in the script for the file NormalDistribution.class.php.

joelhoro's stdev is approximately 1/4th the requested value. I used your online generator to create 10 sets with a 100 points each and the stddev is approximately 1/4th the requested values for each of those as well (requested mean of 1.5 with stddev of 0.3). Looking in the script for the file listed above, there is a line at the very end that commands:

return $this->gauss() * ($stddev / 4) + $mean;

There are also preceding notes that state

// Adjust our gaussian random to fit the mean and standard deviation
// The division by 4 is an arbitrary value to help fit the distribution
// within our required range, and gives a best fit for $stddev=1.0

I'm not familiar with this particular language nor am I using the appropriate software to read the script, but I'm pretty sure this "arbitrary" reduction by 1/4th is the problem.

Admittedly, it took an embarassingly long time to consider calculating the standard deviation of the exported data to find out why it wasn't matching the limiting distribution. None-the-less, thanks for creating the data generator. Hopefully you get the bugs worked out quickly.

Kind regards,
Tim Siefers

@benkeen
Owner
benkeen commented Apr 16, 2016

Hey @TMSiefers, thanks for the info, and sorry for the bug in the code. Not being a mathematician, when I wrote this plugin I relied on someone else's code for the generation part - it would appear that that wasn't correct (or very possibly, I misinterpreted it somehow).

I've just been wading through some wikipedia articles on this. Yikes, my math is rusty.

I'd love to fix this, but I think I need a layman-friendly version of how the generation code should actually work. I tried just removing the / 4 part you mentioned but the generated data with a mean of 0 and a S.D. of 1 is no longer between the range of -1 to 1 (which is incorrect, no?).

@joelhoro

Well if stdev is 1 that means that roughly 68% of the population lies
between mean-stdev and mean+stdev. Stdev is by no means a measure of the
maximum by which samples can deviate from the mean, its a measure of how
much they deviate from the mean 'on average'.

Clearly the 1/4 should go away.
On Apr 16, 2016 14:27, "Benjamin Keen" notifications@github.com wrote:

Hey @TMSiefers https://github.com/TMSiefers, thanks for the info, and
sorry for the bug in the code. Not being a mathematician, when I wrote this
plugin I relied on someone else's code for the generation part - it would
appear that that wasn't correct (or very possibly, I misinterpreted it
somehow).

I've just been wading through some wikipedia articles on this. Yikes, my
math is rusty.

I'd love to fix this, but I think I need a layman-friendly version of how
the generation code should actually work. I tried just removing the / 4
part you mentioned but the generated data with a mean of 0 and a S.D. of 1
is no longer between the range of -1 to 1 (which is incorrect, no?).


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#374 (comment)

@joelhoro

Btw you can try for yourself in excel, just write the formula
=normsinv(Rand()), and copy it 100 times. If you then ask for stdev on the
entire range it will be roughly 1 (and you'll notice many values are
outside of [-1,1]
On Apr 16, 2016 15:21, "Joel Horowitz" joel@horowitz.com wrote:

Well if stdev is 1 that means that roughly 68% of the population lies
between mean-stdev and mean+stdev. Stdev is by no means a measure of the
maximum by which samples can deviate from the mean, its a measure of how
much they deviate from the mean 'on average'.

Clearly the 1/4 should go away.
On Apr 16, 2016 14:27, "Benjamin Keen" notifications@github.com wrote:

Hey @TMSiefers https://github.com/TMSiefers, thanks for the info, and
sorry for the bug in the code. Not being a mathematician, when I wrote this
plugin I relied on someone else's code for the generation part - it would
appear that that wasn't correct (or very possibly, I misinterpreted it
somehow).

I've just been wading through some wikipedia articles on this. Yikes, my
math is rusty.

I'd love to fix this, but I think I need a layman-friendly version of how
the generation code should actually work. I tried just removing the / 4
part you mentioned but the generated data with a mean of 0 and a S.D. of 1
is no longer between the range of -1 to 1 (which is incorrect, no?).


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#374 (comment)

@TMSiefers

benkeen,

Could you show how gauss() is generated (if data) or the script (if a function)? If I knew more about the source perhaps I could offer a solution.

@benkeen
Owner
benkeen commented Apr 20, 2016

Sorry for the wait - that'd be great! I'll post it here tonight.

@benkeen
Owner
benkeen commented Apr 21, 2016 edited

Hey @TMSiefers, here's how the logic goes. Let me know if I'm not clear on anything - it's kind of dense to even describe.

Step 1: compute “gauss” (the note in the code says it’s the “Polar form of the Box-Muller transformation”)

  • set A to equal 2;
  • while A is greater 1.0 or A is equal to 0.0, repeat the following:
    • set B to a random number between -1 and 1
    • set C to a random number between -1 and 1
    • now update A to equal B^2 + C^2

At this point, A will contain a random value <1.0 or 0.0.

  • now set D to equal the square root of ((-2.0 * log(A)) / A);
  • lastly, E (“gauss”) is set to B * A.

Step 2: the rest of the math

  • set F = E * (STANDARD DEVIATION / 4) + MEAN

^^ "F" is the final value that gets outputted.

@TMSiefers

@benkeen Your second bullet implies a while loop for which the result upon exit should be both that of A<1.0 AND A not-equal-to 0.0. However, my interpretation seems to be at at odds with your statement starting with "At this point A will..." Am I missing something?

Please advise,
Tim S

@joelhoro

I don't know much about php but there shouldn't be a need to implement your
own things. This is all very standard and according to this page you can
just use this function

stats_rand_gen_normal
http://php.net/manual/en/function.stats-rand-gen-normal.php

2016-04-28 16:58 GMT-04:00 TMSiefers notifications@github.com:

@benkeen https://github.com/benkeen Your second bullet implies a while
loop for which the result upon exit should be both that of A<1.0 AND A
not-equal-to 0.0. However, my interpretation seems to be at at odds with
your statement starting with "At this point A will..." Am I missing
something?

Please advise,
Tim S


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#374 (comment)

@benkeen
Owner
benkeen commented Apr 28, 2016 edited

Hey guys -

Tim, sorry about that. You're quite right, I'll update the "At this point" bit.

there shouldn't be a need to implement your own things.

:) That looks very promising, but it's a PECL function. To my knowledge, PECL doesn't come bundled with standard PHP distributions, so I wouldn't want to rely on it. Basically I don't want to add any additional steps for users to have to configure their server environment just to run this script - a little too prohibitive for some folk!

@TMSiefers

@benkeen So according to the Wikipedia article on the Box-Muller transformation, all you need to do is drop the /4 in your code. The notes indicate the one-quarter reduction was necessary to "fit the distribution within our required range." I don't know why that was necessary, but you won't have data that matches the users' requirements until you get rid of that. I have not been able to connect the mathematical dots between the Gaussian limiting distribution and the transformation you're using, but I can roughly see the matching relationship between the two; so I'll assume the Wikipedia information is correct. Wikipedia article: https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform

Regards,
Tim S

@TMSiefers

@benkeen Also, you wrote that you take log(A) in your code, that should be for a base of e, so it must be the natural-log. Hopefully that is how the algorithm is written. (I usually assume log(A) to use a base of 10 by default).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment