
ROPE sensitivity to scales? #144

Closed · mattansb opened this issue May 21, 2019 · 48 comments

@mattansb
Member

From ROPE article:

However, keep in mind that unlike indices of effect existence (such as the pd), indices of effect size, such as the ROPE percentage, depend on the unit of its parameter, and can thus be easily changed via the scaling of the predictors.

  1. I'm far from a ROPE expert, but shouldn't %ROPE in theory be insensitive to scale? Shouldn't the doc suggest that the ROPE range be adjusted according to the IV's scale?

  2. Also, I'm not sure how %ROPE is a measure of effect size - it only indicates the degree to which the effect is not small, but gives no indication of the size of the effect (whether it is medium or large).

(I've only now started reading Kruschke's book, after reading some of his papers, so sorry if these are dumb Qs)

@mattansb
Member Author

  1. I guess my question is: if the ROPE range is adjusted for the scale of the DV, why not for the IV? (Maybe this could also be implemented with something like what is used for parameters::standardize_parameters(x, method = "full")?)

@strengejacke
Member

Re 1: The rope range is not adjusted for the range of the DV, it is based on that range (at least for linear models). If you standardize your predictors, you'll change the results from the rope, because the rope range does not change (since it's based on the unchanged DV), but the posterior from the scaled IV does. So "everything is working as intended". For p_direction, it doesn't matter whether an IV is scaled or not.
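
A minimal sketch of that logic (the helper below is made up for illustration, not the package internals): for a Gaussian model the default ROPE is ±0.1 * SD of the response, so it only "sees" the DV's scale.

rope_from_response <- function(y) {
  # default ROPE limits derived from the response only
  0.1 * sd(y) * c(-1, 1)
}

rope_from_response(mtcars$mpg)
# ~ c(-0.60, 0.60): matches the [-0.60 0.60] ROPE shown for the mpg model
# further down; rescaling a predictor leaves these limits untouched and only
# shifts/stretches that predictor's posterior.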

@strengejacke
Member

Also, I'm not sure how %ROPE is a measure of effect size

True, maybe we should revise the wording here?

@DominiqueMakowski
Member

Also, I'm not sure how %ROPE is a measure of effect size

I might have been unclear, let me try to rephrase:

the rope percentage is an index of significance (in the sense of "practical importance", rather than some esoteric and conceptual definition); thus it tries to discriminate between significant and non-significant effects. It is not an index of effect size per se. However, it uses effect size to define what significance is: it defines a region assimilable to a null (or negligible) effect, based solely on the effect's magnitude (i.e., size). Hence it is not directly an index of effect size, but it uses effect size to define significance. Due to this close relationship, both the arguments for and against the focus on effect size apply.

what do you think?

but yes that is worth reformulating and detailing in the docs.

@mattansb
Member Author

If I understand both of you, the %ROPE is a measure of significance whose criterion is based on effect size.

It still seems to me like the ROPE range should be based on the scale of both the DV and the IV. Otherwise we can end up in a situation like this, where the posterior falls completely within the ROPE, even though the effect is anything but practically equivalent to 0:

library(bayestestR)
library(rstanarm)

X <- rnorm(200)
Y <- X + rnorm(200, sd = .1)

# rescale the predictor by 100: the true slope becomes 0.01
df <- data.frame(X = X * 100, Y)

junk <- capture.output(
  stan_model <- stan_glm(Y ~ X, data = df, family = gaussian())
)

equivalence_test(stan_model)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.10 0.10]
#> 
#>    Parameter       H0 inside ROPE      89% HDI
#>  (Intercept) accepted    100.00 % [-0.01 0.01]
#>            X accepted    100.00 % [ 0.01 0.01]
performance::r2(stan_model)
#> # Bayesian R2 with Standard Error
#> 
#>   Conditional R2: 0.991 [0.000]

Created on 2019-05-22 by the reprex package (v0.2.1)

@mattansb
Member Author

At the very least, I suggest adding a "warning" in the text of the vignette and the .rds along the lines of:

However, keep in mind that unlike other indices of effect existence (such as the pd), the ROPE percentage depends on the unit of its parameter; thus, correct interpretation of the ROPE as representing a region of practical equivalence to zero depends on the scaling of the predictors.

In the vignette, the current example of a non-significant effect becoming significant could be accompanied by an explanation that even though the result is significant after scaling, the effect is still practically 0 (or something along those lines). Maybe also add the example I provided above of the opposite happening?

(I'm thinking of the layperson who might use rope or equivalence_test without giving too much thought...)

@strengejacke
Member

It still seems to me like the ROPE range should be based on the scale of both the DV and the IV.

The ROPE is (relatively) clearly defined, so we can't change this. The idea behind the ROPE is similar to the "clinical" (not statistical) difference of effects: if effects differ by at least 1/2 SD, they're probably also practically relevant: https://www.ncbi.nlm.nih.gov/pubmed/19807551

In our case, we don't have .5 SD but .1 SD, and we're not comparing point estimates that are .5 SD away from zero, but the (almost) whole posterior distribution, which should not cover the ROPE (±.1 SD around zero). So indeed, it seems like this "test" is sensitive to scaling - here, frequentist methods may have the advantage of being more stable.

We could probably check the SD of the response, and if it's approx. 1, we might show a warning that scaled DVs (and scaled IVs) may bias the results. At least, we should also add the text from the vignette to the docs.

@DominiqueMakowski
Member

DominiqueMakowski commented May 22, 2019

Yes, I am not convinced by throwing too many warnings either. After all, it is not this package's role to judge if what the user does makes sense or not. I would prefer to properly document things and leave warnings to a minimum :)

@strengejacke
Member

agree.

@strengejacke
Member

else, one would end up with a warning for every single function (lm() - don't trust p-values!; confint() - not precise if you assume a normal distribution... etc.)

@mattansb
Member Author

Right, I also don't think the functions themselves should give a warning (because as @strengejacke says, where would this end???). But I do think that this sensitivity to scale (and how to account for it) should be documented in the vignette and the .rds.

@DominiqueMakowski
Member

But I do think that this sensitivity to scale (and how to account for it) should be documented in the vignette and the .rds.

Yes, as a matter of fact it would be good to have the documentation, the docstrings (the .rd) and the README up to date with one another :) But it's a thing we will revise and improve over and over.

@DominiqueMakowski
Member

I think we can close this?

@mattansb
Member Author

Has any of the suggested clarification been made to the doc/vignette?

@DominiqueMakowski
Member

@mattansb
Member Author

mattansb commented May 23, 2019 via email

@DominiqueMakowski
Member

Yes yes please go ahead 😅

mattansb added a commit that referenced this issue May 23, 2019
@mattansb
Member Author

@DominiqueMakowski take a look.
I was trying to channel (from the rope doc):

Kruschke (2018) suggests that such null value could be set, by default, to the -0.1 to 0.1 range of a standardized parameter

In the sense that a standardized parameter is standardized based both on the DV and the predictors.

@strengejacke
Member

Standardizing the DV makes no sense, because a standardized variable has an SD of 1, and then you don't need the formula 0.1 * SD.

@strengejacke
Member

For parameters that have the same scale as the data, it is relatively straightforward to think about a ROPE. For example, in the case of IQ scores with a normal distribution, the mean, μ, is on the IQ scale, and its ROPE limits are in IQ points. Other models may have parameters that are less directly related to the scales of the data, and therefore ROPE limits may need to be derived more indirectly. Consider linear regression. We might want to say that a regression coefficient, βx, is practically equivalent to zero if a change across the "main range of x" produces only a negligible change in the predicted value, ŷ. Suppose we specify a negligible change in ŷ as ±0.1 Sy, where Sy is the standard deviation of y (a range that may be motivated by the convention that 0.1 S is half of a "small" effect), and we specify the "main range of x" as Mx ± 2 Sx (because if x were normally distributed, this range would cover just over 95% of the distribution). Given these specifications, a regression coefficient is practically equivalent to zero when a change of x from Mx − 2 Sx to Mx + 2 Sx yields a change of ŷ only from My − 0.1 Sy to My + 0.1 Sy, which implies ROPE limits of βx = ±0.05 for standardized variables.

Kruschke 2018, p.277
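
A quick worked version of that arithmetic, using only the definitions in the quote:

Sy <- 1; Sx <- 1                  # standardized variables
negligible_dy <- 2 * 0.1 * Sy     # negligible change in y-hat: from -0.1*Sy to +0.1*Sy
main_range_x  <- 4 * Sx           # "main range of x": from Mx - 2*Sx to Mx + 2*Sx
negligible_dy / main_range_x      # largest |beta_x| that is still negligible: 0.05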

strengejacke mentioned this issue May 23, 2019 (Merged)
@mattansb
Member Author

This quote from Kruschke is exactly what I was trying to say - the ROPE range needs to also account for the scale of the predictor... or you first standardize the coefficients, or something - any way you put it, it's two sides of the same coin: you can either standardize the DV and predictor and use a ROPE of [-0.1 0.1], or not, and use a ROPE of (Sy/Sx) * [-0.1 0.1].

@strengejacke
Member

So, why haven't you said this clearly before? :-D

@strengejacke
Member

Kruschke mentions .05 for standardized variables, so we can probably check different models that have unstandardized variables, compare the same models with standardized variables and ROPE-limit +/-.05 and see if the percentage coverage is similar.

If so, we just need to check the model for standardized variables... a quick check would indeed be checking the SD for being ~1.

@strengejacke
Member

library(rstanarm)
library(bayestestR)

data(mtcars)
mtcars_z <- parameters::standardize(mtcars)

m1a <- stan_glm(disp ~ mpg + hp + qsec + drat, data = mtcars, seed = 123)
m1b <- stan_glm(disp ~ mpg + hp + qsec + drat, data = mtcars_z, seed = 123)

m2a <- stan_glm(mpg ~ hp + wt + drat, data = mtcars, seed = 123)
m2b <- stan_glm(mpg ~ hp + wt + drat, data = mtcars_z, seed = 123)


equivalence_test(m1a)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-12.39 12.39]
#> 
#>    Parameter        H0 inside ROPE          89% HDI
#>  (Intercept)  rejected      0.00 % [  60.87 848.51]
#>          mpg  accepted    100.00 % [ -12.13  -1.07]
#>           hp  accepted    100.00 % [   0.20   1.26]
#>         qsec undecided     91.27 % [ -13.57  15.38]
#>         drat  rejected      0.00 % [-113.89 -22.76]

equivalence_test(m1b)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.10 0.10]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     85.71 % [-0.14  0.13]
#>          mpg undecided      5.34 % [-0.59 -0.03]
#>           hp  rejected      0.00 % [ 0.14  0.72]
#>         qsec undecided     59.73 % [-0.19  0.23]
#>         drat undecided      0.98 % [-0.50 -0.09]

equivalence_test(m1b, range = c(-.05, .05))
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.05 0.05]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     49.76 % [-0.14  0.13]
#>          mpg undecided      1.74 % [-0.59 -0.03]
#>           hp  rejected      0.00 % [ 0.14  0.72]
#>         qsec undecided     33.98 % [-0.19  0.23]
#>         drat  rejected      0.00 % [-0.50 -0.09]



equivalence_test(m2a)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.60 0.60]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept)  rejected      0.00 % [19.32 39.68]
#>           hp  accepted    100.00 % [-0.05 -0.02]
#>           wt  rejected      0.00 % [-4.46 -1.86]
#>         drat undecided     17.24 % [-0.48  3.60]

equivalence_test(m2b)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.10 0.10]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     91.15 % [-0.13  0.11]
#>           hp  rejected      0.00 % [-0.53 -0.19]
#>           wt  rejected      0.00 % [-0.73 -0.32]
#>         drat undecided     33.39 % [-0.03  0.33]

equivalence_test(m2b, range = c(-.05, .05))
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.05 0.05]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     53.97 % [-0.13  0.11]
#>           hp  rejected      0.00 % [-0.53 -0.19]
#>           wt  rejected      0.00 % [-0.73 -0.32]
#>         drat undecided     15.87 % [-0.03  0.33]

Created on 2019-05-23 by the reprex package (v0.3.0)

@strengejacke
Member

Somehow, this sheds doubt on the usefulness of the equivalence testing...

@strengejacke
Member

On the other hand, these results may indicate that multicollinearity or disregarding "joint" distributions are a problem, and probably the "standardized" model gives more reliable results?

@mattansb
Member Author

I guess I really should finish reading Kruschke's book ^_^

If a small (standardized) effect size is 0.2, and half of that is 0.1, why isn't the ROPE range defined as (Sy/Sx) * [-0.1 0.1]?? How then is the standardized ROPE range [-0.05 0.05]? This section seems off to me:

...“main range of x” as Mx ± 2Sx (because if x were normally distributed, this range would cover just over 95% of the distribution).

Is he defining the posterior range of the parameter based on Sx? That seems wrong - those two things are unrelated...

@DominiqueMakowski
Member

On the other hand, these results may indicate that multicollinearity or disregarding "joint" distributions are a problem, and probably the "standardized" model gives more reliable results?

I am not sure in your example that multicollinearity is the problem involved (and I am not sure that standardization helps address multicollinearity).

After reading the previous comments and quotes, let us try to reformulate here. The ROPE is a region of negligible effect. That means that adding 1 to the (linear) predictor (which is what regression coefficients represent) results in a negligible effect, i.e., a negligible change in the response variable. Thus, the negligible region must refer to the response and its scale, as is currently the case.

But does the predictor need to be scaled? Well, I am a big fan of standardization and standardized parameters. Acknowledging their limits, I still find them very useful and informative, and that's why I recommend computing them by default in parameters. As a general rule, yes, of course, to me it makes more sense when the "adding 1 to the (linear) predictor" mentioned above means "moving by 1 SD". This way, you're comparing a statistically meaningful change (a change in the variance of the predictor) to a meaningful effect (a change in the variance of the response). Otherwise, people might just feed in raw reaction times in milliseconds as predictors and be surprised that their equivalence test is never significant... Well indeed, moving by 1 ms is hardly related to non-negligible effects. So I definitely think that standardizing variables should be considered whenever using the ROPE.

With that being said, I do not think we should enforce this rule and make it automatic or the default. First of all, standardization should be avoided "behind the scenes", as it is not always appropriate or meaningful (e.g., in the case of non-normal distributions etc.). Moreover, many people do not need standardization, as the raw units of their parameters make sense to them and are appropriate. I don't know, if you're studying pizza-eating behaviour and you have number of pizzas / week as a predictor, then the raw parameter, corresponding to one more pizza per week, probably makes sense (to the researcher of that topic). So automatically standardizing this under the carpet would 1) not be easy to implement (as correct standardization is not that straightforward to do, which is why it is one of the main features of parameters) and 2) feel too imposing on the user.

Thus, the course of action I would suggest is not to change the current behaviour of the ROPE and, at the same time, to emphasize its close relationship with the scale of the parameter in the docs (maybe expanding or detailing the current paragraph). This way, the user can understand what the values mean and make an informed decision about what is appropriate in their particular case.

@DominiqueMakowski
Member

Nevertheless, I could envision in parameters::model_parameters() an argument like rope_standardized = TRUE that would return the ROPE / equivalence test results based on the standardized parameters, which the function computes anyway. But I don't think doing this in bayestestR is a good idea, as its scope is more to provide tools and information rather than to implement and enforce "good practices", as is becoming the case a bit more in other, more directive, packages.

This would require cleaning up the code for other types of standardization (method = "full"), as it is currently quite messy and only allows standardizing point estimates, not whole posteriors. But that could be done.
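
A rough sketch of what standardizing whole posteriors could look like for a simple linear coefficient (hypothetical helper, not the current parameters code):

standardize_draws <- function(draws, x, y) {
  # rescale every posterior draw of a coefficient, not just its point estimate
  draws * sd(x) / sd(y)
}
# e.g. standardize_draws(as.data.frame(stan_model)$X, df$X, df$Y)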

@DominiqueMakowski
Member

Eventually, although I would probably vote against it, we could just check whether the SD of the predictor is far from 1. If it is (let's say more than 10 or less than 0.1, or something), we could throw a warning like "The scale of your predictor is somethingsomething. The ROPE results might be unreliable. Consider standardizing your data."
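
A minimal sketch of such a check (the helper name and thresholds are made up, not the bayestestR API):

check_predictor_scale <- function(x, lower = 0.1, upper = 10) {
  s <- sd(x, na.rm = TRUE)
  if (s < lower || s > upper) {
    warning(sprintf(
      "The SD of your predictor is %.2f. The ROPE results might be unreliable. Consider standardizing your data.",
      s
    ), call. = FALSE)
  }
  invisible(s)
}

check_predictor_scale(rnorm(100, sd = 100))  # triggers the warning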

@mattansb
Member Author

But does the predictor need to be scaled?

I think something needs to be done - we don't have to scale the predictor, we can (reverse-)scale the rope range (so rope_range would return a different range for each parameter).
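
A rough sketch of that "(reverse-)scaled" range (hypothetical helper, not the current rope_range() behaviour): the limits reflect the scales of both the response and each particular predictor.

rope_range_per_parameter <- function(y, x) {
  # +/- 0.1 * SD(y) / SD(x): a narrower ROPE for wide-scaled predictors
  0.1 * sd(y) / sd(x) * c(-1, 1)
}

rope_range_per_parameter(mtcars$mpg, mtcars$hp)  # ~ c(-0.009, 0.009)
rope_range_per_parameter(mtcars$mpg, mtcars$wt)  # ~ c(-0.62, 0.62)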

...emphasize its close relationship with the scale of the parameter in the docs ...

Absolutely!

I do not think we should enforce this rule and make it automatic or the default.

I disagree here - if we don't do this, we set our users up for wrong inferences, as the default rope range is only appropriate if Sx = 1, which is unlikely (improbable? Ha!) to be the case...

@strengejacke
Member

I think we could email John Kruschke. I can do this. What do you think?

@mattansb
Member Author

I think this is an excellent idea! Ask him what he suggests as a default behavior for the rope range / scaling etc...

@DominiqueMakowski
Member

I think we could email John Kruschke. I can do this. What do you think?

Sounds fair :)

@strengejacke
Member

Thus, the course of action I would suggest is not to change the current behaviour of the ROPE and, at the same time, to emphasize its close relationship with the scale of the parameter in the docs (maybe expanding or detailing the current paragraph).

I agree, I think, no matter what reply we will get, keep the current behaviour and explain in the docs.

@strengejacke
Member

"The scale of your predictor is somethingsomething. The ROPE results might be unreliable. Consider standardizing your data."

My impression is that the opposite is true, i.e. after standardizing, results might be somewhat strange?

@DominiqueMakowski
Member

DominiqueMakowski commented May 24, 2019

I agree, I think, no matter what reply we will get, keep the current behaviour and explain in the docs.

Yup

As they say in Singapore, "see how" ☺️ (we'll see how it goes)

@mattansb
Member Author

How about "The ROPE range might not be appropriate for the scale of your predictor. The ROPE results might be unreliable. Consider standardizing your data. See ?rope for more details."


@strengejacke
Member

This is probably also a good read:
http://www.indiana.edu/~kruschke/KruschkeFreqAndBayesAppTutorial.html

@mattansb
Member Author

I'm sorry - I must be missing something... Unless all your predictors are on the same scale, how does it make sense to use the same rope range for all of them? A change of one unit in X has a totally different meaning depending on the scale of X...

Also is this:

95% of the distribution

Referring to the HDI? Why is this defined as point + 2S?

@strengejacke
Member

strengejacke commented May 24, 2019

Unless all your predictors are on the same scale, how does it make sense to use the same rope range for all of them?

Big confusion - I centered variables, not standardized them...

@DominiqueMakowski
Member

DominiqueMakowski commented May 24, 2019

I'm sorry - I must be missing something... Unless all your predictors are on the same scale, how does it make sense to use the same rope range for all of them? A change of one unit in X has a totally different meaning depending on the scale of X...

I think that @mattansb was asking about the scale of X, since the posterior represents a given change (here, in QoL) related to a shift of 1 unit of the (continuous) predictor. For example, assuming that length of stay is a continuous predictor expressed in days, your posterior relates to the effect of 1 more day of length of stay. However, if you express your length of stay in, let's say, years, then your percentage in ROPE will be strongly shrunk, and the posterior might shift completely outside of the ROPE. Conversely, if you convert the unit to seconds, then the percentage in ROPE will be inflated, because 1 more second of length of stay is not the same as 1 more year of length of stay.
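
A toy numeric version of that unit change (the effect size itself is made up):

b_per_day    <- 0.02               # hypothetical effect of +1 day of length of stay
b_per_year   <- b_per_day * 365    # +1 year  -> 7.3: easily shifts outside a ROPE
b_per_second <- b_per_day / 86400  # +1 second -> ~2e-7: swallowed by any ROPE
c(day = b_per_day, year = b_per_year, second = b_per_second)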

@mattansb
Member Author

The main point of contention is this: if the recommended smallest (standardized) effect size (/slope) is |0.1|, it seems like an unstandardized rope range should be (Sy/Sx) * [-0.1 0.1].
This is equivalent to how - with frequentist / point estimates - determining whether an effect is big always depends on the scales of Y and X. In other words, we cannot determine whether a parameter is clinically significant without accounting for X and Y's scales (as we've seen in all the examples we've shown here + in the vignette).
One final example: say we have a parameter of 2.5, HDI [1.0 4.0], and rope range Sy * [-0.1 0.1] = [-0.4 0.4]; then it would seem that the HDI is fully outside the rope, i.e., a change of 1 unit in X produces a non-negligible change in Y.
But if we learn that Sx is 0.01, then we would never observe a change of 1 unit in X (1 unit = 100 * Sx).
Thus an appropriate rope range should consider not only the scale of Y but also that of X (does a clinically significant change in X correspond to a clinically significant change in Y?).
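
Putting numbers on that final example (Sy = 4 is implied by the [-0.4 0.4] ROPE above):

hdi <- c(1.0, 4.0)             # the example's 89% HDI
Sy  <- 4; Sx <- 0.01

all(hdi > 0.1 * Sy)            # TRUE: HDI fully outside the DV-only ROPE [-0.4, 0.4]
all(abs(hdi) < 0.1 * Sy / Sx)  # TRUE: HDI fully inside a scale-aware ROPE [-40, 40]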

Okay, I think that's all I have to say about this topic.

Sorry for all the trouble - I've been sick at home these last couple of days and have had probably too much time to think...

@DominiqueMakowski
Member

I agree that the unit/scale of the parameter is super important and should be taken into consideration.

But I think it should be taken into consideration by the user (or eventually in wrappers that also implement some intelligent standardization) and not at bayestestR's level.

The reason for that is conceptual: the ROPE is basically a threshold on the response's scale under which an effect is considered negligible and above which it is considered relevant. Thus it makes conceptual sense that the ROPE is defined solely based on that scale, which is the scale of the response. Then it should be the responsibility of the user to make sure that the parameters in their model make sense. Because this concern (that parameters are scaled in a way that makes their results not optimally interpretable) can be extended to many analyses, we would have to add tons of warnings to every model (again we fall back on the "where do we stop" argument)...

@mattansb
Member Author

But I think it should be taken into consideration by the user

I definitely agree! But also, I think that we should give a warning simply because we have internal defaults, which in most cases will be inappropriate (i.e., in all cases where the predictors don't have an Sx=1).

As to "where do we stop" - I would say we "owe" our users a warning if our defaults don't give appropriate results (like when in the BF functions when a prior is not provided - the default will nearly never give an appropriate result, and so we say "hey, this is what we did. You probably don't want this, you should do X".)

This was referenced May 25, 2019
@DominiqueMakowski
Member

I think we can close this for now, as we have discussed this issue in docs and vignettes.

@strengejacke
Member

I still have the email draft to Kruschke in Outlook, but I won't forget it - I'll probably send it to him in the next day or so.
