Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in pgx.computePGX when working on small datasets #46

Closed
phisanti opened this issue Jul 24, 2023 · 6 comments
Closed

Error in pgx.computePGX when working on small datasets #46

phisanti opened this issue Jul 24, 2023 · 6 comments
Assignees

Comments

@phisanti
Copy link
Contributor

Using a reduced version of the example dataset, the base pgx pipeline fails. See code below:

get_mini_example_data <- function() {
  counts <- playbase::COUNTS
  samples <- playbase::SAMPLES
  contrast <- playbase::CONTRASTS

  n_genes <- round(seq(1, nrow(counts), length.out = 100))
  
  # Subset each data frame to facilitate testing
  mini_counts <- counts[n_genes, c(1:3, 8:9, 11:12)]
  mini_samples <- samples[colnames(mini_counts),]
  mini_contrast <- contrast[1:3, 1:3]

  mini_data <- list(counts = mini_counts, samples = mini_samples, contrast = mini_contrast)

  return(mini_data)
}

d <- get_mini_example_data()

pgx2 <- playbase::pgx.createPGX(
 samples = d$samples,
 counts = d$counts,
 contrasts = d$contrast
)

x <- playbase::pgx.computePGX(pgx2)

The error is:

x <- playbase::pgx.computePGX(pgx2)
[pgx.computePGX] testing genes...
>>> computing gene tests for SINGLE-OMICS
[compute_testGenesSingleOmics] detecting stat groups...
[compute_testGenesSingleOmics] contrasts on groups (use design)
replacing contrast matrix...
[compute_testGenesSingleOmics] pruning unused contrasts
[compute_testGenesSingleOmics] normalizing contrasts
[compute_testGenesSingleOmics] 6 : creating model design matrix
[compute_testGenesSingleOmics] WARNING:: low total counts =  11795.79 
[compute_testGenesSingleOmics] applying global mean scaling to 1e6...
filtering for low-expressed genes: > 1 CPM in >= 2 samples
filtering out 6 low-expressed genes
keeping 85 expressed genes
>>> Testing differential expressed genes (DEG) with methods: ttest.welch trend.limma edger.qlf
[compute_testGenesSingleOmics] 12 : start fitting...
[ngs.fitContrastsWithAllMethods] using input log-expression matrix X...
[ngs.fitContrastsWithAllMethods] fitting using Welch t-test
[ngs.fitContrastsWithAllMethods] fitting using LIMMA trend
[ngs.fitContrastsWithLIMMA] fitting LIMMA contrasts using design matrix
[ngs.fitContrastsWithAllMethods] fitting edgeR using QL F-test 
[ngs.fitContrastsWithEDGER] fitting EDGER contrasts using design matrix
[ngs.fitContrastsWithAllMethods] correcting AveExpr values...
[ngs.fitContrastsWithAllMethods] calculating statistics...
[ngs.fitContrastsWithAllMethods] reshape matrices...
[ngs.fitContrastsWithAllMethods] aggregating p-values...
[compute_testGenesSingleOmics] 13 : fitting done!
            user.self sys.self elapsed user.child sys.child
ttest.welch      0.00     0.00    0.01         NA        NA
trend.limma      0.01     0.02    0.03         NA        NA
edger.qlf        0.34     0.01    0.38         NA        NA
[compute_testGenesSingleOmics] done!
[pgx.computePGX] testing genesets...
Filtering gene sets on size...
Matching gene set matrix...
Reducing gene set matrix...
Error in order(colnames(G)[jj]) : argument 1 is not a vector
In addition: Warning message:
2 very small variances detected, have been offset away from zero

This, therefore, means that x does not exist.

> x
Error: object 'x' not found
@phisanti
Copy link
Contributor Author

The error seems to come from the function compute_testGenesets in compute2-geneset.R. Apparently, there is a hard limit to gene set size. If a gene set has lower than 15 genes, then it won't select any column for further analysis. Here is the code:

  ## filter gene sets on size
  cat("Filtering gene sets on size...\n")
  gmt.size <- Matrix::colSums(G != 0)
  size.ok <- (gmt.size >= 15 & gmt.size <= 400)
  G <- G[, which(size.ok)]

Here we have two options:

  1. We can reduce the limit or making it relative to the input data.
  2. Try some kind of try-catch method so that the script can continue but the geneset will be skipped.

Any suggestion @ivokwee @ncullen93?

@ncullen93
Copy link
Contributor

Yeah a try-catch method would be best. I'm not sure if this error occurs in the omicsplayground platform or if this issue is caught before this code gets run. This is a good example of why all data catching code should be in playbase and not in omicsplayground. Looping in @ivokwee on this.

@ivokwee
Copy link
Member

ivokwee commented Jul 26, 2023

I would debug the error. Sometimes people have just 500-1000 genes. So it is a good edge case.

@phisanti
Copy link
Contributor Author

phisanti commented Jul 29, 2023

Ok, taking on account that last comment, I came up with the following solution:

  # If dataset is too small that size.ok == 0, then select top 100
  if (sum(size.ok) == 0) {
    top_100gs <- head(sort(gmt.size,decreasing = TRUE), 100)
    size.ok <- names(gmt.size) %in% names(top_100gs)
  }
  

Injecting this code in compute_testGenesets after calculating size.ok should solve the issue. This way, there will always be at least 100 genesets. Let me know what you think, @ncullen93 @ivokwee.

@ivokwee
Copy link
Member

ivokwee commented Jul 29, 2023

Yes that's a good solution. But please test.

@phisanti
Copy link
Contributor Author

I can confirm the edit works in the Playbase tests as well as in the omicsplayground

@phisanti phisanti closed this as completed Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants