Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding the use of "X" in divnet() function #53

Closed
gregpoore opened this issue Mar 19, 2020 · 4 comments
Closed

Understanding the use of "X" in divnet() function #53

gregpoore opened this issue Mar 19, 2020 · 4 comments

Comments

@gregpoore
Copy link

gregpoore commented Mar 19, 2020

Hi Amy,

I'm in Rob Knight's lab and excited to see (and use) your DivNet package — congrats on the in press acceptance! I have a question regarding the use of the covariate matrix ("X") in the divnet() function. The example given in your tutorial uses the Lee dataset, picking "char" (basalts of different characteristics) as the covariate. I understand that this is useful when sampling from different environments, but I'm trying to figure out if it would include samples from the same environment that are affected by external factors. A few examples:

  • Plots of soil sampled in winter versus plots of soil sampled in summer in the same region (X = season)
  • A series of gut microbiome samples from patients before and after the influence of antibiotics (X = antibiotic status)
  • Liver tissue samples of tumors that originate in the liver versus those that metastasized to the liver among several patients (X = tumor site of origin) (cf. Bullman et al., 2017. Science.)

Is it permissible to use these differences as covariates ("X") for the divnet() function (e.g. antibiotic vs. non-antibiotic gut samples) even though they're technically sampled from the same environment?

Thanks a ton!

@adw96
Copy link
Owner

adw96 commented Mar 20, 2020

Hi Greg,

Thanks for bringing up this question. Choosing what variables to include in your model is not trivial and depends on your scientific question as well as your experimental design. I generally recommend taking at least a couple of statistics classes (up to multiple regression) to give some context for model fitting.

In short, the model that you would fit for DivNet is what you would include if you were fitting a simple or multiple regression on a univariate outcome. While I don't know the design of your experiments that you describe, here are some things that I would think about including in your models.

  • Plots of soil sampled in winter versus plots of soil sampled in summer in the same region: X = season + plot; plot is called a blocking variable here
    • When you go to use betta for hypothesis testing, you probably will care about the hypothesis test for season (we don't often care about testing for blocking variables)
  • A series of gut microbiome samples from patients before and after the influence of antibiotics: X = antibiotic_status + patient_ID, since observations within patients will be correlated
    • When you go to use betta for hypothesis testing, you probably will care about the hypothesis test for antibiotic status
  • Liver tissue samples of tumors that originate in the liver versus those that metastasized to the liver among several patients: X = tumor site of origin + patient
    • When you go to use betta for hypothesis testing, you probably will care about the hypothesis test for tumor site

(Note that here I'm using R's + notation for adding variables to a model -- I'm not saying sum up the variables!)

I'm sorry that there isn't a more concrete answer for you, and that it depends so much on the setting that you're in. I'm happy to try to answer any more detailed questions you have, though. Statistical consultants at your institution can often advise on what variables to include in the model, too.

Amy

@gregpoore
Copy link
Author

gregpoore commented Mar 21, 2020

Hi Amy,

Thank you so much for the helpful and thorough reply! The only issue is that it doesn't seem possible to add a blocking variable (or any multivariate model) in the divnet() function. Looking at the code underlying the divnet() function (see below; emphasis added to parts with ** and arrows) seems to suggest that it will only accept a NULL or single character variable input (i.e. not a model.matrix() input required to add the blocking variable).

Code:

> divnet
function (W, X = NULL, fitted_model = NULL, tuning = NULL, perturbation = NULL, 
    network = NULL, base = NULL, ncores = NULL, variance = "parametric", 
    B = 5, nsub = NULL, ...) 
{
    if ("phyloseq" %in% class(W)) {
        input_data <- W
        W <- input_data %>% otu_table %>% as.matrix
        suppressWarnings({
            class(W) <- "matrix"
        })
        if (phyloseq::taxa_are_rows(input_data)) 
            W <- W %>% t
        samples_names <- input_data %>% sample_names

        ------> **if (is.character(X))** { <------

            X <- breakaway::make_design_matrix(input_data, X)
        }

       ------> **else if (is.null(X))** { <------

            xx <- input_data %>% sample_data %>% rownames
            X <- model.matrix(~xx)
        }
    }

Also, it appears that the breakaway::make_design_matrix(input_data, X) function will also only accept character variables for X. Do you know if there is any way to modify it so that a model.matrix can be used? I know that it already can be used in the betta() function.

Regarding the regression model: In my case, I have about ~20 metastases in the same tissue that originated from primary tumors of two different cancer types (half for each cancer type) across different patients. The data are shotgun sequencing for microbes with ~7000 features for the relatively small number of samples. I'm inclined, per your suggestion, to use X = tumor site of origin + patient but am curious if patient should be a blocking factor if all the patients are different. Could you please confirm this?

Thank you so much again! (Apologies if I misunderstood anything)

Sincerely,
Greg

@adw96
Copy link
Owner

adw96 commented Mar 23, 2020

Hi Greg! I had completely forgotten that I hadn't implemented formula specification for Divnet. I'm going to ask my co-author @bryandmartin to implement this ASAP.

@bryandmartin -- can you please make sure that DivNet maintains backwards compatibility but also allows for a formula? I think the corncob interface is great so perhaps we could just copy that. If you can write some testthat tests to make sure that it maintains backwards compatibility that would be great!

Greg -- if every observation comes from a different patient then definitely don't include patient as a variable -- otherwise your model will be way overparameterised! However, if you want to add more than one variable to the model, you can adapt the following code. It employs model.matrix to build a covariate matrix, and then passes it to DivNet. Feel free to ping me with questions about how it works! We will have the formula notation ready for you soon, too!

# Load the data set
data(Lee)
lee_phylum <- speedyseq::tax_glom(Lee, taxrank="Phylum")

## Get only the samples for which temperature is available.
## Unfortunately temperature is a factor, which is silly,
## so we have to do it like this.
lee_complete_ps <- lee_phylum %>%
  subset_samples(temp %in% c("12.7", "13.5", "13.7", "2.0", "7.3", "8.6"))

# build the covariate matrix
temps <- lee_complete_ps %>% sample_data %>% pull(temp) %>% as.character %>% as.numeric
types <- lee_complete_ps %>% sample_data %>% pull(type)
my_multivariate_x <- model.matrix(~ temp + type, data = data.frame(temp = temps, type = types))

# fit model and estimate diversity
lee_complete_dv <- lee_complete_ps %>%
  divnet(X=my_multivariate_x, ncores = 4)

# visualise
lee_complete_dv$shannon %>% summary %>%
  bind_cols("temp" = temps, "type" = types) %>%
  ggplot(aes(x = temp, y = estimate, col = type)) +
  geom_point() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  ylab("Estimated Shannon diversity \n& confidence intervals") +
  xlab("Temperature (C)") +
  theme_bw()

@paulinetrinh
Copy link
Collaborator

We've updated the documentation for Getting Started to reflect the various ways to specify covariates in the divnet() function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants