ROPE sensitivity to scales? #144
The ROPE range is not adjusted for the range of the DV; it is based on that range (at least for linear models). If you standardize your predictors, you'll change the ROPE results: the ROPE range does not change (since it's based on the unchanged DV), but the posterior of the scaled IV does. So "everything is working as intended". For p_direction, it doesn't matter whether an IV is scaled or not. |
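As a minimal sketch of this behaviour (made-up data and a toy rstanarm model, shown only to illustrate the point above): for linear models the default ROPE limits depend only on the response, so rescaling a predictor leaves them unchanged while shrinking the posterior of that coefficient accordingly.

``` r
# Sketch with made-up data: multiplying a predictor by 100 leaves the default
# ROPE limits untouched (they come from the response), but the posterior of
# that coefficient shrinks by a factor of 100.
library(rstanarm)
library(bayestestR)

set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

m_raw    <- stan_glm(y ~ x, data = data.frame(x = x, y = y), refresh = 0)
m_scaled <- stan_glm(y ~ x, data = data.frame(x = x * 100, y = y), refresh = 0)

rope_range(m_raw)     # roughly c(-0.1, 0.1) * sd(y)
rope_range(m_scaled)  # same limits: the response did not change

p_direction(m_raw)    # p_direction is unaffected by the rescaling
p_direction(m_scaled)
```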
True, maybe we should revise the wording here? |
I might have been unclear, let me try to rephrase: the ROPE percentage is an index of significance (in the sense of "practical importance", rather than some esoteric and conceptual definition); thus it tries to discriminate between significant and non-significant effects. It is not an index of effect size per se, but it uses effect size to define what significance is: it defines a region assimilable to a null (or negligible) effect based solely on the effect's magnitude (i.e., its size). Hence it is not directly an index of effect size, but it uses effect size to define significance. Due to this close relationship, both the arguments for and against the focus on effect size apply. What do you think? But yes, that is worth reformulating and detailing in the docs. |
If I understand both of you, the %ROPE is a measure of significance whose criterion is based on effect size. It still seems to me like the ROPE range should be based on the scale of both the DV and the IV. Else we can have a situation like this, where the posterior falls completely in the ROPE, but the effect is anything but practically equivalent to 0:

``` r
library(bayestestR)
library(rstanarm)

X <- rnorm(200)
Y <- X + rnorm(200, sd = .1)
df <- data.frame(X = X * 100, Y)

junk <- capture.output(
  stan_model <- stan_glm(Y ~ X, df, family = gaussian())
)

equivalence_test(stan_model)
#> # Test for Practical Equivalence
#>
#>   ROPE: [-0.10 0.10]
#>
#>    Parameter       H0 inside ROPE      89% HDI
#>  (Intercept) accepted    100.00 % [-0.01 0.01]
#>            X accepted    100.00 % [ 0.01 0.01]

performance::r2(stan_model)
#> # Bayesian R2 with Standard Error
#>
#>   Conditional R2: 0.991 [0.000]
```

Created on 2019-05-22 by the reprex package (v0.2.1) |
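As a quick sketch of one possible fix, continuing the reprex above and using the existing range argument of equivalence_test(): a ROPE rescaled by sd(Y)/sd(X), the adjustment floated later in this thread, would no longer "accept" the slope as practically zero.

``` r
# Continues the reprex above. With sd(df$X) ~ 100 and sd(df$Y) ~ 1, the
# adjusted ROPE is roughly [-0.001, 0.001]; the slope posterior (~0.01)
# then falls entirely outside it, so its H0 should be rejected, not accepted.
adjusted_rope <- (sd(df$Y) / sd(df$X)) * c(-0.1, 0.1)
equivalence_test(stan_model, range = adjusted_rope)
```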
At the very least, I suggest adding a "warning" in the text of the vignette and the docs.
In the vignette, the current example of not-sig to sig could be accompanied by an explanation that even though after the scaling the result is sig, the effect is still practically 0 (or something). Maybe also add the example I provided above of the opposite happening? (I'm thinking of the layperson who might use these functions without digging into the details.) |
The ROPE is (relatively) clearly defined, so we can't change this. The idea behind the ROPE is similar to the "clinical" (not statistical) difference of effects: if effects differ by at least 1/2 SD, they're probably also practically relevant: https://www.ncbi.nlm.nih.gov/pubmed/19807551 In our case we use .1 SD rather than .5 SD, and we're not comparing point estimates that are .5 SD away from zero, but the (almost) whole posterior distribution, which should not cover the ROPE (+/- .1 SD around zero). So indeed, it seems like this "test" is sensitive to scaling - here, frequentist methods may have the advantage of being more stable. We could probably check the SD of the response, and if it's approx. 1, we might indicate a warning that scaled DVs (and scaled IVs) may bias the results. At least, we should also add the text from the vignette to the docs. |
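As a rough sketch of the decision rule described here (a paraphrase of the generic HDI-vs-ROPE logic with a made-up posterior, not the package internals):

``` r
# Compare an 89% HDI against a +/- 0.1 * SD(y) ROPE (SD(y) assumed to be 1 here).
library(bayestestR)

posterior   <- rnorm(4000, mean = 0.3, sd = 0.05)  # made-up posterior of a slope
rope_limits <- c(-0.1, 0.1)

h <- hdi(posterior, ci = 0.89)
if (h$CI_high < rope_limits[1] || h$CI_low > rope_limits[2]) {
  "rejected"    # HDI completely outside the ROPE
} else if (h$CI_low >= rope_limits[1] && h$CI_high <= rope_limits[2]) {
  "accepted"    # HDI completely inside the ROPE
} else {
  "undecided"   # HDI partially overlaps the ROPE
}
# -> "rejected" for this made-up posterior
```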
Yes, I am not convinced by throwing too many warnings either. After all, it is not this package's role to judge if what the user does makes sense or not. I would prefer to properly document things and leave warnings to a minimum :) |
agree. |
Else, one would end up with a warning for every single function. |
Right, I also don't think the functions themselves should give a warning (because, as @strengejacke says, where would this end???). But I do think that this sensitivity to scale (and how to account for it) should be documented in the vignette and the docs. |
Yes, as a matter of fact it would be good to have the documentation, the docstrings (the .Rd files), and the README up to date with one another :) But it's a thing we will revise and improve over and over. |
I think we can close this? |
Has any of the suggested clarification been made to the doc/vignette? It still says ROPE is a measure of effect size.
Do you mind if I make some minor tweaks and additions? You can then remove them if you don't think they're appropriate.
On Thu, May 23, 2019, Dominique Makowski wrote:
I added a bit here https://easystats.github.io/bayestestR/articles/region_of_practical_equivalence.html
|
Yes yes please go ahead 😅 |
@DominiqueMakowski take a look.
In the sense that a standardized parameter is standardized based on both the DV and the predictors. |
Standardizing the DV makes no sense, because a standardized variable has an SD of 1, and then you don't need the formula 0.1 * SD. |
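As a quick numeric illustration, using mtcars$disp (which also happens to be the response of m1a in the reprex further down):

``` r
# 0.1 * SD of the raw response vs. of a standardized response.
y <- mtcars$disp
0.1 * sd(y)               # ~12.39 -- matches the ROPE of m1a below

y_std <- as.numeric(scale(y))
0.1 * sd(y_std)           # 0.1 -- the generic [-0.1, 0.1] ROPE
```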
Kruschke 2018, p.277 |
This quote from Kruschke is exactly what I was trying to say - the ROPE range needs to also account for the scale of the predictor... or first standardize the coefficients, or something - any way you put it, it's two sides of the same coin: you can either standardize the DV and predictor and have a ROPE of [-0.1 0.1], or don't, and have a ROPE of (Sy/Sx) * [-0.1 0.1]. |
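As a small numeric check of this claim (made-up data; OLS point estimates stand in for posterior samples just to keep it light): the standardized slope equals the raw slope times Sx/Sy, so testing it against [-0.1, 0.1] on the standardized scale is the same as testing the raw slope against (Sy/Sx) * [-0.1, 0.1].

``` r
# Simple regression: standardized slope = raw slope * sd(x) / sd(y).
set.seed(42)
x <- rnorm(200, sd = 5)
y <- 0.02 * x + rnorm(200)

b_raw <- unname(coef(lm(y ~ x))["x"])

xs <- as.numeric(scale(x))
ys <- as.numeric(scale(y))
b_std <- unname(coef(lm(ys ~ xs))["xs"])

all.equal(b_std, b_raw * sd(x) / sd(y))   # TRUE

abs(b_std) < 0.1                    # the test on the standardized scale ...
abs(b_raw) < (sd(y) / sd(x)) * 0.1  # ... gives the same answer on the raw scale
```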
So, why haven't you said this clearly before? :-D |
Kruschke mentions .05 for standardized variables, so we could probably take different models with unstandardized variables, compare them to the same models with standardized variables and a ROPE limit of +/- .05, and see if the percentage coverage is similar. If so, we would just need to check whether the model's variables are standardized... a quick check would indeed be checking whether the SD is ~1. |
``` r
library(rstanarm)
library(bayestestR)

data(mtcars)
mtcars_z <- parameters::standardize(mtcars)

m1a <- stan_glm(disp ~ mpg + hp + qsec + drat, data = mtcars,   seed = 123)
m1b <- stan_glm(disp ~ mpg + hp + qsec + drat, data = mtcars_z, seed = 123)
m2a <- stan_glm(mpg ~ hp + wt + drat, data = mtcars,   seed = 123)
m2b <- stan_glm(mpg ~ hp + wt + drat, data = mtcars_z, seed = 123)

equivalence_test(m1a)
#> # Test for Practical Equivalence
#>
#>   ROPE: [-12.39 12.39]
#>
#>    Parameter        H0 inside ROPE          89% HDI
#>  (Intercept)  rejected      0.00 % [  60.87 848.51]
#>          mpg  accepted    100.00 % [ -12.13  -1.07]
#>           hp  accepted    100.00 % [   0.20   1.26]
#>         qsec undecided     91.27 % [ -13.57  15.38]
#>         drat  rejected      0.00 % [-113.89 -22.76]

equivalence_test(m1b)
#> # Test for Practical Equivalence
#>
#>   ROPE: [-0.10 0.10]
#>
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     85.71 % [-0.14  0.13]
#>          mpg undecided      5.34 % [-0.59 -0.03]
#>           hp  rejected      0.00 % [ 0.14  0.72]
#>         qsec undecided     59.73 % [-0.19  0.23]
#>         drat undecided      0.98 % [-0.50 -0.09]

equivalence_test(m1b, range = c(-.05, .05))
#> # Test for Practical Equivalence
#>
#>   ROPE: [-0.05 0.05]
#>
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     49.76 % [-0.14  0.13]
#>          mpg undecided      1.74 % [-0.59 -0.03]
#>           hp  rejected      0.00 % [ 0.14  0.72]
#>         qsec undecided     33.98 % [-0.19  0.23]
#>         drat  rejected      0.00 % [-0.50 -0.09]

equivalence_test(m2a)
#> # Test for Practical Equivalence
#>
#>   ROPE: [-0.60 0.60]
#>
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept)  rejected      0.00 % [19.32 39.68]
#>           hp  accepted    100.00 % [-0.05 -0.02]
#>           wt  rejected      0.00 % [-4.46 -1.86]
#>         drat undecided     17.24 % [-0.48  3.60]

equivalence_test(m2b)
#> # Test for Practical Equivalence
#>
#>   ROPE: [-0.10 0.10]
#>
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     91.15 % [-0.13  0.11]
#>           hp  rejected      0.00 % [-0.53 -0.19]
#>           wt  rejected      0.00 % [-0.73 -0.32]
#>         drat undecided     33.39 % [-0.03  0.33]

equivalence_test(m2b, range = c(-.05, .05))
#> # Test for Practical Equivalence
#>
#>   ROPE: [-0.05 0.05]
#>
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     53.97 % [-0.13  0.11]
#>           hp  rejected      0.00 % [-0.53 -0.19]
#>           wt  rejected      0.00 % [-0.73 -0.32]
#>         drat undecided     15.87 % [-0.03  0.33]
```

Created on 2019-05-23 by the reprex package (v0.3.0) |
Somehow, this casts doubt on the usefulness of equivalence testing... |
On the other hand, these results may indicate that multicollinearity or disregarding "joint" distributions are a problem, and probably the "standardized" model gives more reliable results? |
I guess I really should finish reading Kruschke's book ^_^ If a small (standardized) effect size is 0.2, and half of that is 0.1, why isn't the ROPE range defined as (Sy/Sx) * [-0.1 0.1]?? How then is the standardized ROPE range [-0.05 0.05]? This section seems off to me:
Is he defining the posterior range of the parameter based on Sx? That seems wrong - those two things are unrelated... |
I am not sure that multicollinearity is the problem involved in your example (and I am not sure that standardization helps address multicollinearity). After reading the previous comments and quotes, let us try to reformulate here. The ROPE is a region of negligible effect: a coefficient falling within it corresponds to a change in the response too small to be of practical relevance.

But does the predictor need to be scaled? Well, I am a big fan of standardization and standardized parameters. Acknowledging their limits, I still find them very useful and informative, and that's why I recommend computing them by default. With that being said, I do not think we should enforce this rule and make it automatic or the default. First of all, standardization should be avoided "behind the scenes", as it is not always appropriate nor meaningful (e.g., in the case of non-normal distributions etc.). Moreover, many people do not need standardization, as the raw units of their parameters make sense to them and are appropriate (if you're studying pizza-eating behaviour, the number of pizzas is already a meaningful unit).

Thus, the course of action I would suggest is not to change the current behaviour of ROPE and, at the same time, to emphasize its close relationship with the scale of the parameter in the docs (maybe expanding or detailing the current paragraph). This way, the user can understand what the values mean and can make an informed decision about what is appropriate in his particular case. |
Nevertheless, I could envision an option to run the test on standardized parameters. This would need some cleaning up of the code for the other types of standardization. |
Alternatively, although I would probably vote against it, we could just check whether the SD of the predictor is far from 1. If it is indeed (let's say more than 10 / less than 0.1 or something), we could throw a warning like "The scale of your predictor is somethingsomething. The ROPE results might be unreliable. Consider standardizing your data." |
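As a sketch of what such a check could look like (check_predictor_scales is a hypothetical helper, not an existing bayestestR function; the thresholds are placeholders):

``` r
# Hypothetical sketch of the proposed check: warn when a predictor's SD is far
# from 1, since the default ROPE limits implicitly assume roughly unit-scaled
# predictors. Not implemented in bayestestR.
check_predictor_scales <- function(data, predictors, lower = 0.1, upper = 10) {
  sds <- vapply(data[predictors], sd, numeric(1), na.rm = TRUE)
  odd <- sds < lower | sds > upper
  if (any(odd)) {
    warning(
      "The scale of predictor(s) ",
      paste0(predictors[odd], " (SD = ", round(sds[odd], 2), ")", collapse = ", "),
      " may make the ROPE results unreliable. Consider standardizing your data.",
      call. = FALSE
    )
  }
  invisible(sds)
}

check_predictor_scales(mtcars, c("hp", "wt", "drat"))
# hp (SD ~ 68.6) would trigger the warning; wt and drat would not
```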
I think something needs to be done - we don't have to scale the predictor, we can (reverse-)scale the ROPE range (so it becomes (Sy/Sx) * [-0.1 0.1]).
Absolutely!
I disagree here - if we don't do this, we set up our users for a wrong inference, as the default rope range is only appropriate if Sx=1, which is unlikely (improbable? Ha!) to be the case... |
I think we could email John Kruschke. I can do this. What do you think? |
I think this is an excellent idea! Ask him what he suggests as a default behaviour for the ROPE range / scaling etc... |
Sounds fair :) |
I agree. I think that, no matter what reply we get, we should keep the current behaviour and explain it in the docs. |
My impression is that the opposite is true, i.e. after standardizing, results might be somewhat strange? |
Yup. As they say in Singapore, "see how". |
How about "The ROPE range might not be appropriate for the scale of your predictor. The ROPE results might be unreliable. Consider standardizing your data. See ?rope for more details." |
This is probably also a good read: |
I'm sorry - I must be missing something... Unless all your predictors are on the same scale, how does it make sense to use the same ROPE range for all of them? A change of one unit in X has a totally different meaning depending on the scale of X... Also, is this:
Referring to the HDI? Why is this defined as point + 2S? |
Big confusion - I centered variables, not standardized them... |
I think that @mattansb was asking about the scale of X, since the posterior represents a given change (here, in QoL) related to a shift of one unit of X. |
The main point of contention is this: if the recommended smallest (standardized) effect size (/slope) is |0.1|, it seems like an un-standardized ROPE range should be (Sy/Sx) * [-0.1 0.1]. Okay, I think that's all I have to say about this topic. Sorry for all the trouble - I've been sick at home these last couple of days and have had probably too much time to think... |
I agree that the unit/scale of the parameter is super important and should be taken into consideration. But I think it should be taken into consideration by the user (or eventually in wrappers that also implement some intelligent standardization), and not at bayestestR's level. The reason for that is conceptual: the ROPE is basically a threshold on the response's scale, under which an effect is considered negligible and above which it is considered relevant. Thus it makes conceptual sense that the ROPE is defined solely based on its scale, which is the scale of the response. Then it should be the responsibility of the user to make sure that the parameters in his model make sense. Because this concern (that parameters are scaled in a way that makes their results not optimally interpretable) extends to many analyses, we would otherwise have to add tons of warnings to every model (again, we fall back on the "where do we stop" argument)... |
I definitely agree! But also, I think that we should give a warning simply because we have internal defaults, which in most cases will be inappropriate (i.e., in all cases where the predictors don't have Sx = 1). As to "where do we stop" - I would say we "owe" our users a warning if our defaults don't give appropriate results (like in the BF functions when a prior is not provided - the default will nearly never give an appropriate result, and so we say "hey, this is what we did. You probably don't want this, you should do X"). |
I think we can close this for now, as we have discussed this issue in docs and vignettes. |
I still have the email draft to Kruschke in Outlook, and I won't forget it. I'll probably send it to him in the next few days. |
From ROPE article:
I'm far from a ROPE expert, but shouldn't %ROPE in theory be insensitive to scale? Shouldn't the doc suggest that the ROPE range be adjusted according to the IV's scale?
Also, I'm not sure how %ROPE is a measure of effect size - it only indicates the degree to which the effect is not small, but gives no indication of the size of the effect (whether it is medium or large).
(I've only now started reading Kruschke's book, after reading some of his papers, so sorry if these are dumb Qs)