Error in pgx.computePGX when working on small datasets #46

phisanti · 2023-07-24T09:25:52Z

Using a reduced version of the example dataset, the base pgx pipeline fails. See code below:

get_mini_example_data <- function() {
  counts <- playbase::COUNTS
  samples <- playbase::SAMPLES
  contrast <- playbase::CONTRASTS

  n_genes <- round(seq(1, nrow(counts), length.out = 100))
  
  # Subset each data frame to facilitate testing
  mini_counts <- counts[n_genes, c(1:3, 8:9, 11:12)]
  mini_samples <- samples[colnames(mini_counts),]
  mini_contrast <- contrast[1:3, 1:3]

  mini_data <- list(counts = mini_counts, samples = mini_samples, contrast = mini_contrast)

  return(mini_data)
}

d <- get_mini_example_data()

pgx2 <- playbase::pgx.createPGX(
 samples = d$samples,
 counts = d$counts,
 contrasts = d$contrast
)

x <- playbase::pgx.computePGX(pgx2)

The error is:

x <- playbase::pgx.computePGX(pgx2)
[pgx.computePGX] testing genes...
>>> computing gene tests for SINGLE-OMICS
[compute_testGenesSingleOmics] detecting stat groups...
[compute_testGenesSingleOmics] contrasts on groups (use design)
replacing contrast matrix...
[compute_testGenesSingleOmics] pruning unused contrasts
[compute_testGenesSingleOmics] normalizing contrasts
[compute_testGenesSingleOmics] 6 : creating model design matrix
[compute_testGenesSingleOmics] WARNING:: low total counts =  11795.79 
[compute_testGenesSingleOmics] applying global mean scaling to 1e6...
filtering for low-expressed genes: > 1 CPM in >= 2 samples
filtering out 6 low-expressed genes
keeping 85 expressed genes
>>> Testing differential expressed genes (DEG) with methods: ttest.welch trend.limma edger.qlf
[compute_testGenesSingleOmics] 12 : start fitting...
[ngs.fitContrastsWithAllMethods] using input log-expression matrix X...
[ngs.fitContrastsWithAllMethods] fitting using Welch t-test
[ngs.fitContrastsWithAllMethods] fitting using LIMMA trend
[ngs.fitContrastsWithLIMMA] fitting LIMMA contrasts using design matrix
[ngs.fitContrastsWithAllMethods] fitting edgeR using QL F-test 
[ngs.fitContrastsWithEDGER] fitting EDGER contrasts using design matrix
[ngs.fitContrastsWithAllMethods] correcting AveExpr values...
[ngs.fitContrastsWithAllMethods] calculating statistics...
[ngs.fitContrastsWithAllMethods] reshape matrices...
[ngs.fitContrastsWithAllMethods] aggregating p-values...
[compute_testGenesSingleOmics] 13 : fitting done!
            user.self sys.self elapsed user.child sys.child
ttest.welch      0.00     0.00    0.01         NA        NA
trend.limma      0.01     0.02    0.03         NA        NA
edger.qlf        0.34     0.01    0.38         NA        NA
[compute_testGenesSingleOmics] done!
[pgx.computePGX] testing genesets...
Filtering gene sets on size...
Matching gene set matrix...
Reducing gene set matrix...
Error in order(colnames(G)[jj]) : argument 1 is not a vector
In addition: Warning message:
2 very small variances detected, have been offset away from zero

This, therefore, means that x does not exist.

> x
Error: object 'x' not found

The text was updated successfully, but these errors were encountered:

phisanti · 2023-07-24T12:39:31Z

The error seems to come from the function compute_testGenesets in compute2-geneset.R. Apparently, there is a hard limit to gene set size. If a gene set has lower than 15 genes, then it won't select any column for further analysis. Here is the code:

  ## filter gene sets on size
  cat("Filtering gene sets on size...\n")
  gmt.size <- Matrix::colSums(G != 0)
  size.ok <- (gmt.size >= 15 & gmt.size <= 400)
  G <- G[, which(size.ok)]

Here we have two options:

We can reduce the limit or making it relative to the input data.
Try some kind of try-catch method so that the script can continue but the geneset will be skipped.

Any suggestion @ivokwee @ncullen93?

ncullen93 · 2023-07-24T13:13:11Z

Yeah a try-catch method would be best. I'm not sure if this error occurs in the omicsplayground platform or if this issue is caught before this code gets run. This is a good example of why all data catching code should be in playbase and not in omicsplayground. Looping in @ivokwee on this.

ivokwee · 2023-07-26T16:08:08Z

I would debug the error. Sometimes people have just 500-1000 genes. So it is a good edge case.

phisanti · 2023-07-29T14:41:24Z

Ok, taking on account that last comment, I came up with the following solution:

  # If dataset is too small that size.ok == 0, then select top 100
  if (sum(size.ok) == 0) {
    top_100gs <- head(sort(gmt.size,decreasing = TRUE), 100)
    size.ok <- names(gmt.size) %in% names(top_100gs)
  }

Injecting this code in compute_testGenesets after calculating size.ok should solve the issue. This way, there will always be at least 100 genesets. Let me know what you think, @ncullen93 @ivokwee.

ivokwee · 2023-07-29T15:56:26Z

Yes that's a good solution. But please test.

phisanti · 2023-07-30T08:52:33Z

I can confirm the edit works in the Playbase tests as well as in the omicsplayground

ncullen93 assigned ivokwee Jul 24, 2023

phisanti closed this as completed Aug 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error in pgx.computePGX when working on small datasets #46

Error in pgx.computePGX when working on small datasets #46

phisanti commented Jul 24, 2023

phisanti commented Jul 24, 2023

ncullen93 commented Jul 24, 2023

ivokwee commented Jul 26, 2023

phisanti commented Jul 29, 2023 •

edited

Loading

ivokwee commented Jul 29, 2023 •

edited

Loading

phisanti commented Jul 30, 2023

Error in pgx.computePGX when working on small datasets #46

Error in pgx.computePGX when working on small datasets #46

Comments

phisanti commented Jul 24, 2023

phisanti commented Jul 24, 2023

ncullen93 commented Jul 24, 2023

ivokwee commented Jul 26, 2023

phisanti commented Jul 29, 2023 • edited Loading

ivokwee commented Jul 29, 2023 • edited Loading

phisanti commented Jul 30, 2023

phisanti commented Jul 29, 2023 •

edited

Loading

ivokwee commented Jul 29, 2023 •

edited

Loading