
Problem with difference smooths and identifiability constraint #108

Closed
dinga92 opened this issue Jun 18, 2021 · 4 comments
Labels: enhancement, not a bug

Comments

dinga92 commented Jun 18, 2021

I think I found a problem/bug with the difference smooths computation, although maybe I am just doing something wrong.

I was trying to create difference smooths, and I noticed that they might be quite biased due to the identifiability constraint that forces smooths to be centred at 0. In your blog post, you wrote that this could be mitigated by including additional intercept terms in the model; however, this does not seem to solve the problem. Using the factor smooth basis also didn't work.

I think that including the parametric term in the model does indeed take the different group averages into account, but this is then not accounted for when calculating the difference.

Is this a known issue, or a bug, or am I modeling it wrong?

See the code below, I hope it makes sense.

### example code

library(psych)    # for the logistic function
library(mgcv)
library(gratia)
library(ggplot2)  # needed for ggplot() below
library(patchwork)

x <- seq(-30, 30, length.out = 1000)
df <- data.frame(x=c(x, x),
                 y_false_positive=c(logistic(x), logistic(x+15)),
                 y_false_negative=c(sin(pi*x/30), sin(pi*x/30)+5),
                 group = as.factor(c(rep('A', length(x)), rep('B', length(x)))))

# showing the ground-truth
# 1: the difference between smooths should be 0 at the edges
p1 <- ggplot(df, aes(x, y_false_positive, color=group)) +
  geom_line() +
  ggtitle("case 1")

# 2: the difference should not be 0 everywhere
p2 <- ggplot(df, aes(x, y_false_negative, color=group)) +
  geom_line() +
  ggtitle("case 2")

p1 + p2 + plot_annotation(title = "simulated smooths")

gam_fit_false_positive <- gam(y_false_positive ~ s(x, by=group) + group, data=df)
gam_fit_false_negative <- gam(y_false_negative ~ s(x, by=group) + group, data=df)

draw(gam_fit_false_positive, scales='fixed', select=c(1,2)) + 
  plot_annotation(title = "estimated smooths case 1")
draw(difference_smooths(gam_fit_false_positive, 's(x)')) +
  labs(title = "Estimated difference between smooths",
       subtitle = "The difference should be 0 at the edges")

draw(gam_fit_false_negative, scales='fixed', select=c(1,2)) +
  plot_annotation(title = "estimated smooths case 2")
draw(difference_smooths(gam_fit_false_negative, 's(x)')) +
  labs(title = "Estimated difference between smooths",
       subtitle = "Difference should not be 0")

[Five plots attached: the simulated smooths, plus the estimated smooths and estimated differences for cases 1 and 2.]

gavinsimpson (Owner) commented:

@dinga92 I need to look at case 1 more closely, but case 2 looks like a misunderstanding of what we mean by a difference of smooth functions. In case 2 your functions have exactly the same shape and differ only in their group means. As those group means are not part of the smooths, we don't consider them part of the difference. Indeed, for more complex models there is no other way to think of a difference of smooth functions: for models with multiple smooth and parametric terms, the best we could do is compare differences of fitted values from the model while holding the other covariates at fixed values, which would condition the differences on the chosen combination of factor covariates.

So, we just focus on pairs of factor-by smooths and ignore the differences in means. If you want to test that you need to focus on parametric group means, not the smooths.
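For example, the group-mean difference appears directly in the parametric coefficient table. Here is a minimal self-contained sketch mirroring case 2 above (the small noise term is my addition so the fit is well behaved):

```r
library(mgcv)

# Simulate case 2: identical smooth shapes, group means differing by 5
set.seed(1)
x <- seq(-30, 30, length.out = 200)
df <- data.frame(x = c(x, x),
                 y = c(sin(pi * x / 30), sin(pi * x / 30) + 5) +
                   rnorm(400, sd = 0.1),
                 group = factor(rep(c("A", "B"), each = 200)))

m <- gam(y ~ group + s(x, by = group), data = df)

# The coefficient "groupB" estimates the difference in group means (~5),
# with a Wald test, in the parametric coefficient table
summary(m)$p.table
```

Packages like {emmeans} build on exactly these parametric coefficients for more elaborate contrasts.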

dinga92 (Author) commented Jun 20, 2021

Thanks for your reply. Case 1 is also just a direct consequence of the mean being removed from the smooth function. Since the function is a bit shifted, the average is also different.

I think what I expected is a difference between the factor-by smooths that also includes the factor intercepts, which is of course different from a difference between the mean-centred smooths alone.

I know I can test the main effect, but that is a slightly different question. E.g.,

  • testing main effects would be like asking whether global temperature is higher;
  • testing the smooths + factors would be asking where in the world the temperature is higher;
  • testing difference smooths would be asking where in the world the temperature is higher after correcting for the average global temperature, i.e. where the deviation from the global average differs.

gavinsimpson (Owner) commented:

I'm not sure I follow your bullet 3.

Consider the model gam(y ~ f + s(x, by = f)), where f is a factor and s(x, by = f) is a smooth of x for each level of the factor f. At this point we can test three things:

  1. tests on f, the parametric term. We don't have much functionality for this in {mgcv} or {gratia}, but {emmeans} has good capabilities for working with parametric terms. If you don't mind doing a bit of coding, you could also do any testing you want yourself.
  2. tests on functions of fitted values of the model, or
  3. we can focus on the smooths specified by s(x, by = f). There is nothing in these smooths that encodes the means of the response for the levels of f, so we are talking about differences of the fitted smooth functions only, not differences that include the group means.
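Option 2 can be sketched by hand with the lpmatrix: differencing the prediction matrices for the two groups on a common grid gives a difference of fitted values that does include the group means (a hypothetical illustration on toy data, not gratia's implementation):

```r
library(mgcv)

# Toy data: same smooth shape in both groups, means differ by 5
set.seed(1)
x <- seq(-30, 30, length.out = 200)
df <- data.frame(x = c(x, x),
                 y = c(sin(pi * x / 30), sin(pi * x / 30) + 5) +
                   rnorm(400, sd = 0.1),
                 group = factor(rep(c("A", "B"), each = 200)))
m <- gam(y ~ group + s(x, by = group), data = df)

# Prediction matrices for each group on the same grid of x
new_x <- seq(-30, 30, length.out = 100)
lv <- levels(df$group)
Xa <- predict(m, newdata = data.frame(x = new_x, group = factor("A", levels = lv)),
              type = "lpmatrix")
Xb <- predict(m, newdata = data.frame(x = new_x, group = factor("B", levels = lv)),
              type = "lpmatrix")

# Difference of fitted values, group means included, with pointwise SEs
# that account for the covariance between the two predictions
Xd <- Xa - Xb
diff_fit <- drop(Xd %*% coef(m))
diff_se  <- sqrt(rowSums((Xd %*% vcov(m)) * Xd))
```

Here diff_fit recovers the group-mean shift of about -5, which a difference of the centred smooths alone would miss.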

I don't follow the point about correcting for the average temperature. What is doing the correcting? By default, and in the way you are fitting these models, there is no such thing as the average temperature: the model intercept is coded as the mean of Y for the reference level of the factor, with deviations between this reference level and each other level coded by the other parametric terms for the factor. You can change the meaning of the coefficients by changing the contrasts used, but I don't know of any that would correct for the average of the response.

Anyway, I'm not saying that what you want to do is not possible or desirable; it's just not what was implemented, as the function was designed to focus on s(x, by = f) and nothing else. Several people have asked for the ability to include the group means and I have said that I will add this, but I continue to get hung up on what users would want for models such as

gam(y ~ f1 + f2 + s(x, by = f1))

Now there is nothing in this model's model matrix that codes for the levels of f1 alone; the intercept is the estimated mean of the samples in the group formed by the reference levels of f1 and f2. In this model there's no simple way to add on the mean for each level of f1 only without going back to the original data and averaging y over the levels of f2 for each level of f1 (and propagating the uncertainty correctly). But if I'm going to implement code that can do the required things generally, then I'd be reimplementing a chunk of what {emmeans} already does, and I'm not confident I should be doing that. I should look at what {itsadug} does, as it will compute difference smooths that include the group means for factor-by terms, so I could be missing something obvious.
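The averaging over f2 described above can be done by hand with the lpmatrix, which is essentially what {emmeans} does: build prediction rows for every level of f2, average them within each level of f1, and difference the averaged rows. This is only a sketch with made-up two-factor data, not existing gratia functionality:

```r
library(mgcv)

# Made-up data with two factors: f1 shifts the mean by 5, f2 by 2
set.seed(1)
n <- 400
df <- data.frame(x  = runif(n, -30, 30),
                 f1 = factor(rep(c("A", "B"), each = n / 2)),
                 f2 = factor(rep(c("u", "v"), times = n / 2)))
df$y <- sin(pi * df$x / 30) + 5 * (df$f1 == "B") + 2 * (df$f2 == "v") +
  rnorm(n, sd = 0.1)
m <- gam(y ~ f1 + f2 + s(x, by = f1), data = df)

# Rows of the lpmatrix for each (f1, f2) combination at a fixed x
grid <- expand.grid(x = 0, f1 = levels(df$f1), f2 = levels(df$f2))
Xp <- predict(m, newdata = grid, type = "lpmatrix")

# Average over f2 within each f1 level, then difference A - B;
# the uncertainty is propagated through the model's covariance matrix
Xd <- colMeans(Xp[grid$f1 == "A", , drop = FALSE]) -
  colMeans(Xp[grid$f1 == "B", , drop = FALSE])
est <- sum(Xd * coef(m))
se  <- sqrt(drop(t(Xd) %*% vcov(m) %*% Xd))
```

At x = 0 the estimate recovers the f1 mean difference of about -5, marginal over the levels of f2.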

gavinsimpson added the enhancement and not a bug labels on Jun 25, 2021
gavinsimpson (Owner) commented:

Turns out I was massively over-thinking this, at least for the case of including group means (the complexity of multiple factors just cancels for any reasonable comparison, because the level of the second factor you condition on is the same for both groups being differenced).

From version 0.7.3.12, difference_smooths() can include the group means in the comparison.
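For reference, a sketch of the new usage against a refit of case 2 from the example code above (assuming the new argument is named group_means, per the change referenced here; check ?difference_smooths in your installed version):

```r
library(mgcv)
library(gratia)

# Refit case 2: identical smooth shapes, group means differing by 5
set.seed(1)
x <- seq(-30, 30, length.out = 200)
df <- data.frame(x = c(x, x),
                 y = c(sin(pi * x / 30), sin(pi * x / 30) + 5) +
                   rnorm(400, sd = 0.1),
                 group = factor(rep(c("A", "B"), each = 200)))
m <- gam(y ~ group + s(x, by = group), data = df)

# Differences of the by-smooths including the parametric group means
d <- difference_smooths(m, smooth = "s(x)", group_means = TRUE)
draw(d)
```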
