
Problem with difference smooths and identifiability constraint #108

Closed
dinga92 opened this issue Jun 18, 2021 · 4 comments
Labels: enhancement, not a bug

Comments

dinga92 commented Jun 18, 2021

I think I found a problem/bug with the difference smooths computation, although maybe I am just doing something wrong.

I was trying to create difference smooths, and I noticed that they might be quite biased due to the identifiability constraint that forces smooths to be centred at 0. In your blog post, you wrote that this could be mitigated by including additional intercept terms in the model; however, this does not seem to solve the problem. Using the factor smooth basis also didn't work.

I think that including the parametric term in the model does indeed take the different group averages into account, but this is then not accounted for when calculating the difference.

Is this a known issue, or a bug, or am I modeling it wrong?

See the code below, I hope it makes sense.

### example code

library(psych)    # for the logistic function
library(mgcv)
library(gratia)
library(ggplot2)  # needed for ggplot() below
library(patchwork)

x <- seq(-30, 30, length.out = 1000)
df <- data.frame(x=c(x, x),
                 y_false_positive=c(logistic(x), logistic(x+15)),
                 y_false_negative=c(sin(pi*x/30), sin(pi*x/30)+5),
                 group = as.factor(c(rep('A', length(x)), rep('B', length(x)))))

# showing the ground-truth
# 1: the difference between smooths should be 0 at the edges
p1 <- ggplot(df, aes(x, y_false_positive, color=group)) +
  geom_line() +
  ggtitle("case 1")

# 2: the difference should not be 0 everywhere
p2 <- ggplot(df, aes(x, y_false_negative, color=group)) +
  geom_line() +
  ggtitle("case 2")

p1 + p2 + plot_annotation(title = "simulated smooths")

gam_fit_false_positive <- gam(y_false_positive ~ s(x, by=group) + group, data=df)
gam_fit_false_negative <- gam(y_false_negative ~ s(x, by=group) + group, data=df)

draw(gam_fit_false_positive, scales='fixed', select=c(1,2)) + 
  plot_annotation(title = "estimated smooths case 1")
draw(difference_smooths(gam_fit_false_positive, 's(x)')) +
  labs(title = "Estimated difference between smooths",
       subtitle = "The difference should be 0 at the edges")

draw(gam_fit_false_negative, scales='fixed', select=c(1,2)) +
  plot_annotation(title = "estimated smooths case 2")
draw(difference_smooths(gam_fit_false_negative, 's(x)')) +
  labs(title = "Estimated difference between smooths",
       subtitle = "Difference should not be 0")

[Five plots attached: the simulated smooths, plus the estimated smooths and estimated differences for cases 1 and 2.]

gavinsimpson (Owner) commented:

@dinga92 I need to look at case 1 more closely, but case 2 looks like a misunderstanding of what we mean by a difference of smooth functions. In case 2 your functions have exactly the same shape and differ only in their group means. As those group means are not part of the smooths, we don't consider them part of the difference. Indeed, for more complex models there is no other way to think of a difference of smooth functions: for models with multiple smooth and parametric terms, the best we could do is compare differences of fitted values from the model while holding the other covariates at fixed values, which would condition the differences on the chosen combination of factor covariates.

So, we just focus on pairs of factor-by smooths and ignore the differences in means. If you want to test that you need to focus on parametric group means, not the smooths.
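For example, the group-mean difference appears directly in the parametric coefficient table. Here is a minimal self-contained sketch mirroring case 2 above (the small noise term is my addition so the fit is well behaved):

```r
library(mgcv)

# Simulate case 2: identical smooth shapes, group means differing by 5
set.seed(1)
x <- seq(-30, 30, length.out = 200)
df <- data.frame(x = c(x, x),
                 y = c(sin(pi * x / 30), sin(pi * x / 30) + 5) +
                   rnorm(400, sd = 0.1),
                 group = factor(rep(c("A", "B"), each = 200)))

m <- gam(y ~ group + s(x, by = group), data = df)

# The coefficient "groupB" estimates the difference in group means (~5),
# with a Wald test, in the parametric coefficient table
summary(m)$p.table
```

Packages like {emmeans} build on exactly these parametric coefficients for more elaborate contrasts.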

dinga92 (Author) commented Jun 20, 2021

Thanks for your reply. Case 1 is also just a direct consequence of the mean being removed from the smooth function. Since the function is a bit shifted, the average is also different.

I think what I expected is a difference between the factor-by smooths that also includes the factor intercepts, which is of course different from a difference between the mean-centred smooths alone.

I know I can test the main effect, but that is a slightly different question. E.g.,

  • testing main effects would be like asking whether global temperature is higher;
  • testing the smooths + factors would be asking where in the world the temperature is higher;
  • testing difference smooths would be asking where in the world the temperature is higher after correcting for the average global temperature, i.e. where the deviation from the global average differs.

gavinsimpson (Owner) commented:

I'm not sure I follow your bullet 3.

Consider the model gam(y ~ f + s(x, by = f)), where f is a factor and s(x, by = f) is a smooth of x for each level of the factor f. At this point we can test three things:

  1. tests on f, the parametric term. We don't have much functionality for this in {mgcv} or {gratia}, but {emmeans} has good capabilities for working with parametric terms. If you don't mind doing a bit of coding, you could also do any testing you want yourself.
  2. tests on functions of fitted values of the model, or
  3. we can focus on the smooths specified by s(x, by = f). There is nothing in these smooths that encodes the means of the response for the levels of f, so we are talking about differences of the fitted smooth functions only, not differences that include the group means.
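Option 2 can be sketched by hand with the lpmatrix: differencing the prediction matrices for the two groups on a common grid gives a difference of fitted values that does include the group means (a hypothetical illustration on toy data, not gratia's implementation):

```r
library(mgcv)

# Toy data: same smooth shape in both groups, means differ by 5
set.seed(1)
x <- seq(-30, 30, length.out = 200)
df <- data.frame(x = c(x, x),
                 y = c(sin(pi * x / 30), sin(pi * x / 30) + 5) +
                   rnorm(400, sd = 0.1),
                 group = factor(rep(c("A", "B"), each = 200)))
m <- gam(y ~ group + s(x, by = group), data = df)

# Prediction matrices for each group on the same grid of x
new_x <- seq(-30, 30, length.out = 100)
lv <- levels(df$group)
Xa <- predict(m, newdata = data.frame(x = new_x, group = factor("A", levels = lv)),
              type = "lpmatrix")
Xb <- predict(m, newdata = data.frame(x = new_x, group = factor("B", levels = lv)),
              type = "lpmatrix")

# Difference of fitted values, group means included, with pointwise SEs
# that account for the covariance between the two predictions
Xd <- Xa - Xb
diff_fit <- drop(Xd %*% coef(m))
diff_se  <- sqrt(rowSums((Xd %*% vcov(m)) * Xd))
```

Here diff_fit recovers the group-mean shift of about -5, which a difference of the centred smooths alone would miss.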

I don't follow the point about correcting for the average temperature. What is doing the correcting? By default, and in the way you are fitting these models, there is no such thing as the average temperature: the model intercept is coded as the mean of Y for the reference level of the factor, with deviations between this reference level and each other level coded by the other parametric terms for the factor. You can change the meaning of the coefficients by changing the contrasts used, but I don't know of any that would correct for the average of the response.

Anyway, I'm not saying that what you want to do is not possible or desirable; it's just not what was implemented, as the function was designed to focus on s(x, by = f) and nothing else. Several people have asked for the ability to include the group means and I have said that I will add this, but I continue to get hung up on what users would want for models such as

gam(y ~ f1 + f2 + s(x, by = f1))

Now there is nothing in this model's model matrix that codes for the levels of f1 alone; the intercept is the estimated mean of the samples in the group formed by the reference levels of f1 and f2. In this model there's no simple way to add on the mean for each level of f1 only without going back to the original data and averaging y over the levels of f2 for each level of f1 (and propagating the uncertainty correctly). But if I'm going to implement code that can do the required things generally, then I'd be reimplementing a chunk of what {emmeans} already does, and I'm not confident I should be doing that. I should look at what {itsadug} does, as it will compute difference smooths that include the group means for factor-by terms, so I could be missing something obvious.
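The averaging over f2 described above can be done by hand with the lpmatrix, which is essentially what {emmeans} does: build prediction rows for every level of f2, average them within each level of f1, and difference the averaged rows. This is only a sketch with made-up two-factor data, not existing gratia functionality:

```r
library(mgcv)

# Made-up data with two factors: f1 shifts the mean by 5, f2 by 2
set.seed(1)
n <- 400
df <- data.frame(x  = runif(n, -30, 30),
                 f1 = factor(rep(c("A", "B"), each = n / 2)),
                 f2 = factor(rep(c("u", "v"), times = n / 2)))
df$y <- sin(pi * df$x / 30) + 5 * (df$f1 == "B") + 2 * (df$f2 == "v") +
  rnorm(n, sd = 0.1)
m <- gam(y ~ f1 + f2 + s(x, by = f1), data = df)

# Rows of the lpmatrix for each (f1, f2) combination at a fixed x
grid <- expand.grid(x = 0, f1 = levels(df$f1), f2 = levels(df$f2))
Xp <- predict(m, newdata = grid, type = "lpmatrix")

# Average over f2 within each f1 level, then difference A - B;
# the uncertainty is propagated through the model's covariance matrix
Xd <- colMeans(Xp[grid$f1 == "A", , drop = FALSE]) -
  colMeans(Xp[grid$f1 == "B", , drop = FALSE])
est <- sum(Xd * coef(m))
se  <- sqrt(drop(t(Xd) %*% vcov(m) %*% Xd))
```

At x = 0 the estimate recovers the f1 mean difference of about -5, marginal over the levels of f2.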

gavinsimpson added the enhancement and not a bug labels on Jun 25, 2021
gavinsimpson (Owner) commented:

Turns out I was massively over-thinking this, at least for the case of including group means (the complexity of multiple factors just cancels for any reasonable comparison, because the level of the second factor you condition on is the same for both groups being differenced).

From version 0.7.3.12, difference_smooths() can include the group means in the comparison.
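For reference, a sketch of the new usage against a refit of case 2 from the example code above (assuming the new argument is named group_means, per the change referenced here; check ?difference_smooths in your installed version):

```r
library(mgcv)
library(gratia)

# Refit case 2: identical smooth shapes, group means differing by 5
set.seed(1)
x <- seq(-30, 30, length.out = 200)
df <- data.frame(x = c(x, x),
                 y = c(sin(pi * x / 30), sin(pi * x / 30) + 5) +
                   rnorm(400, sd = 0.1),
                 group = factor(rep(c("A", "B"), each = 200)))
m <- gam(y ~ group + s(x, by = group), data = df)

# Differences of the by-smooths including the parametric group means
d <- difference_smooths(m, smooth = "s(x)", group_means = TRUE)
draw(d)
```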
