Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return standard error of coefficient estimates #12

Closed
const-ae opened this issue Nov 24, 2020 · 12 comments
Closed

Return standard error of coefficient estimates #12

const-ae opened this issue Nov 24, 2020 · 12 comments

Comments

@const-ae
Copy link
Owner

In thelovelab/DESeq2#29 @mschubert raised the point that glmGamPoi, unlike DESeq2 does not return the standard error for the LFC estimates.

His use-case for them is change point detection, where he plugs-in the LFC and the standard error into the ecp package to detect structural variances (e.g. stretches of the genome that are duplicated).

I have not returned the standard error, because they are mainly important for the Wald test, where-as I however, do a likelihood ratio test.

If I want to return the covariance matrix of the coefficients, I would either

  1. need to rewrite the inner fitting routines (fisher_scoring_qr_step()) to make sure they return the covariance matrix or
  2. Calculate the covariance matrix on demand. The formula would probably be something like: t(design_matrix) %*% design_matrix / sqrt(n)

The second option seems the easier way, but I would need to think through the math and properly test the estimator.

@const-ae
Copy link
Owner Author

Last week, I implemented a predict() function for glmGamPoi. It is mainly for predicting the mu and the corresponding standard error, but you can also use it to get the standard error of the coefficients:

library(glmGamPoi)
data <- data.frame(fav_food = sample(c("apple", "banana", "cherry"), size = 50, replace = TRUE),
                   city = sample(c("heidelberg", "paris", "new york"), size = 50, replace = TRUE),
                   age = rnorm(n = 50, mean = 40, sd = 15))
y <- rnbinom(n = 50, mu = 3, size = 1/3.1)
fit <- glm_gp(y, design = ~ fav_food + city + age, col_data = data)
# The custom design matrix makes sure we get the standard errorr of the coefficients
identity_design_matrix <-  diag(nrow = ncol(fit$Beta))
pred <- predict(fit, se.fit = TRUE, newdata = identity_design_matrix)
pred
#> $fit
#>            [,1]      [,2]       [,3]      [,4]     [,5]       [,6]
#> [1,] -0.2709198 -1.747597 -0.9085034 0.8920867 1.191104 0.02410875
#> 
#> $se.fit
#>          [,1]      [,2]     [,3]      [,4]      [,5]       [,6]
#> [1,] 0.765845 0.5959845 0.540548 0.6145076 0.5942666 0.01663141
#> 
#> $residual.scale
#> [1] 1

Created on 2021-01-11 by the reprex package (v0.3.0)

This does not yet provide the full covariance matrix, but only its diagonal, but I hope this can still be useful.

@mschubert
Copy link

This looks great, thank you so much!

I'll try it out and see how well it works for my use case 👍

@willtownes
Copy link

I am also interested in this! I would like to be able to do Wald tests with glmGamPoi. Currently I figured out a hack which others may find useful. Note that in below "i" indicates a particular gene.

  1. Fit glmGamPoi to the data.
  2. Extract overdispersion parameters and store the vector of thetas as 1/overdispersion.
  3. fam<-MASS::negative.binomial(theta=thetas[i])
  4. Assuming genes in rows: fit<-glm(Y[i,]~X+0, family=fam, offset=log(size_factors))
  5. summary(fit,dispersion=1)
    Note this is NOT calling glm.nb from MASS. It is the base glm where we force the dispersion parameter to not change during fitting. Also, it's essential to set dispersion=1 in calling summary since otherwise the base R glm thinks this is a linear regression fit (don't ask me why!) and tries to estimate a value for sigma2 on top of the theta that's already in there. It will also weirdly return t-statistics instead of the correct z-statistic unless dispersion is explicitly set in the call to summary.

Needless to say being able to just compute Wald statistics directly from glmGamPoi would be much better.

@const-ae
Copy link
Owner Author

const-ae commented Feb 8, 2021

Currently I figured out a hack which others may find useful.

That's quite clever :)

Can you say a bit more, why you want to do a Wald test instead of a likelihood ratio? Is it speed concerns?

@willtownes
Copy link

@const-ae thanks! So first of all I am actually doing this on neuroscience data not genes but I noticed many times on my data the F-statistic from the test_de function was negative and the p-value was just truncated to 1.000. Also, I like the convenience of pulling lots of Wald test results out of a single model fit rather than having to re-evaluate test_de with a bunch of separate reduced model matrices. But these are all pretty minor considerations, the main benefit I am getting from glmGamPoi is the stable estimation of the dispersion parameters which is awesome! Oh and speed is not a limitation for me at the moment, I'm not working with a very big dataset.

@const-ae
Copy link
Owner Author

const-ae commented Feb 8, 2021

So first of all I am actually doing this on neuroscience data not genes but I noticed many times on my data the F-statistic from the test_de function was negative and the p-value was just truncated to 1.000.

Oh, that doesn't sound good. The last time, I noticed negative F values, there were some convergence problems for the full model, which meant that the reduced model got a larger likelihood. Do you think you could provide an example that produces the negative F-statistic? (If yes, please open a new issue, so that I can look into this).

Also, I like the convenience of pulling lots of Wald test results out of a single model fit rather than having to re-evaluate test_de with a bunch of separate reduced model matrices.

Yes, that sounds reasonable. I'll try to implement a Wald test, but to be honest, I am not sure, when I will get around to do this.

@willtownes
Copy link

Sorry I can't provide the example right now, it's not public data (yet) and my coauthors aren't comfortable with me sharing it until after the publication unfortunately. But I can say, when I re-ran glm_gp on one of the individual rows that had a negative F value in the original "fit to whole matrix" run, it no longer had a negative F value (although the df2 was Inf). This makes me wonder if the problem is something to do with shrinkage across the different rows/ genes?

@ajaynadig
Copy link

Hello,
Thank you for your great improvements to the DESeq2 package over the years. I just wanted to add to this thread that another rationale for a standard error for glmGamPoi beyond Wald tests would be to run shrinkage using ashr or apeglm, as implemented in lfcShrink. These shrinkage functions requires point estimates and standard errors, and currently return an error when run on output from DESeq2 using glmGamPoi. Seeing as ashr and apeglm are now recommended over normal shrinkage within DESeq2, it would be great to be able to use these functions on the output from glmGamPoi.

@const-ae
Copy link
Owner Author

const-ae commented Feb 6, 2023

Hi @ajaynadig,
thanks for reaching out. Taking the output from glmGamPoi and running it through apeglm/ashr is an interesting use case that I haven't thought about so far. Would it be possible to call predict with se.fit = TRUE as described in the comment above (#12 (comment)) to produce the necessary input? (I guess one would have to replace the identity_design_matrix with the contrast vector).

@jackkamm
Copy link
Contributor

I wanted to point out a tricky footgun that arises when using the predict() method to return the standard errors.

Namely, you actually need to divide by log(2) to get the standard errors for the lfc returned by test_de().

The reason is that test_de() returns log2-fold-changes, however the GLM link function is on natural log scale.

You can verify this by checking that the values returned by predict(offset=0)$fit are off by a factor of log(2) from the lfc column returned by test_de().

This is a pretty easy error for the user to make; so it would be nice if test_de() could also return the SE on the appropriate scale.

const-ae added a commit that referenced this issue May 27, 2024
@const-ae
Copy link
Owner Author

Thanks. You are right that it a subtle bug that is easy to miss. Thank you for commenting here and highlighting it!

I amended the documentation of predict.glmGamPoi to warn others.

This is a pretty easy error for the user to make; so it would be nice if test_de() could also return the SE on the appropriate scale.

Would you be interested in drafting a PR that adds the functionality to test_de? Maybe we could make it optional for now (i.e., add an argument to test_de that is called compute_lfc_se and set it to FALSE by default).

@const-ae
Copy link
Owner Author

Closed by #63.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants