fgsea hangs forever #151

Open
guidohooiveld opened this issue Apr 4, 2024 · 2 comments

guidohooiveld commented Apr 4, 2024

Hi Alex,

A (reproducible) issue ("GSEA hangs") was posted on the clusterProfiler GitHub.
See: YuLab-SMU/clusterProfiler#659 (comment), and posts below that one.

Since clusterProfiler uses fgsea under the hood for gene set enrichment analysis, I checked whether the reported issue originates from the way clusterProfiler processes the input/output data, or from fgsea itself. It turns out that I could reproduce the issue when using fgsea directly, hence this post.

Please note that the OP reported this issue when using R-4.2.2, but I could also reproduce it with current versions of R (R-4.3.0 on my Windows machine, R-4.3.3 on my Linux machine) and fgsea.

Also note that the issue occurs when minSize is set to 10; when minSize=11 is used, fgsea runs as expected...
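
One way to narrow this down, sketched here with made-up toy data: if the hang appears only when `minSize` drops from 11 to 10, the extra gene sets entering the analysis should be those with exactly 10 genes present in the ranked list. The helper name `effective_sizes` and the toy objects are hypothetical, and the sketch assumes that the size filter counts only genes that occur in `names(stats)`.

```r
## Toy data standing in for 'input.genes' and 'term2gene.go' (made up for illustration)
set.seed(1)
toy.stats <- setNames(rnorm(30), paste0("g", 1:30))
toy.sets <- list(
  setA = paste0("g", 1:10),            # exactly 10 genes in the ranked list
  setB = paste0("g", 5:20),            # 16 genes in the ranked list
  setC = c(paste0("g", 21:29), "x1")   # 10 IDs, but only 9 are ranked
)

## Hypothetical helper: number of genes per set that occur in names(stats)
effective_sizes <- function(pathways, stats) {
  vapply(pathways,
         function(p) length(intersect(unique(p), names(stats))),
         integer(1))
}

sizes <- effective_sizes(toy.sets, toy.stats)

## Sets that would pass minSize = 10 but not minSize = 11
names(sizes)[sizes == 10]
```

With the real objects, running fgsea on only the sets selected this way might show which of them triggers the hang.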

For your convenience I have attached the 2 input objects to this post as an RData file (which I compressed into a ZIP archive in order to be able to upload it). See below how these objects were generated, in case you would like to generate them yourself.

I would appreciate it if you could have a look at this to see whether it can be fixed.
G

> ## load libraries
> library(clusterProfiler)
> library(fgsea)
> library(org.Hs.eg.db)
> 
> ## import input genes (human ENSEMBL) and GO-BP gene sets
> load("fgsea.input.Rdata")
> 
> ######
> ## if preferred, code to generate input
> 
> ## copy/paste list of input genes ('hgene_list') from:
> ## https://github.com/YuLab-SMU/clusterProfiler/issues/659#issuecomment-2027820878
> 
> 
> ## create GO-based gene sets; limit to BP
> ## 'ont' should be one of "BP", "CC", "MF" or "ALL"
> library(GO.db)
> ont <- "BP" 
> 
> goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
> if (ont != "ALL") {goterms <- goterms[goterms == ont]}
> 
> term2gene.go <- AnnotationDbi::mapIds(org.Hs.eg.db,
+                                       keys=names(goterms),
+                                       column="ENTREZID",
+                                       keytype="GOALL",
+                                       multiVals='list')
'select()' returned 1:many mapping between keys and columns
> 
> ## end code to generate input.
> ######
> 
> ## manually convert ENSEMBL into ENTREZID using function bitr from clusterProfiler.
> ## when using the function gseGO from clusterProfiler, this is being done on the fly;
> ## see for gseGO function call: https://github.com/YuLab-SMU/clusterProfiler/issues/659#issuecomment-2027820878
> 
> ensembl.2.eg <- bitr( names(hgene_list),
+                       fromType="ENSEMBL",
+                       toType="ENTREZID",
+                       OrgDb="org.Hs.eg.db",
+                       drop = TRUE)
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(names(hgene_list), fromType = "ENSEMBL", toType = "ENTREZID",  :
  0.05% of input gene IDs are fail to map...
> 
> 
> input.genes <- hgene_list[ensembl.2.eg$ENSEMBL]
> names(input.genes) <- ensembl.2.eg$ENTREZID
> ## perform GSEA
> ## with minSize = 11; works fine!
> 
> system.time({
+ 
+ res <- fgseaMultilevel(
+   pathways = term2gene.go,
+   stats = input.genes,
+   minSize = 11,
+   maxSize = 500,
+   eps = 0,
+   scoreType = c("std") )
+ 
+   })
   user  system elapsed 
   3.47    0.87   20.19 
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (2.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  There were 8 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
3: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  For some of the pathways the P-values were likely overestimated. For such pathways log2err is set to NA.
> 

> ## perform GSEA
> ## now with minSize = 10; run was aborted after 5 mins since it wasn't finished by then...
> 
> system.time({
+ 
+ res <- fgseaMultilevel(
+   pathways = term2gene.go,
+   stats = input.genes,
+   minSize = 10,
+   maxSize = 500,
+   eps = 0,
+   scoreType = c("std") )
+ 
+   })

Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (2.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  There were 4 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)

Timing stopped at: 3.07 0.91 592.6
> 
>
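
The ties warning that appears in both runs can be inspected before calling fgsea. The following is a minimal sketch with toy data; `tie_summary` is a hypothetical helper, and the fraction it reports is only a rough check that may be computed differently from the percentage in the warning. One possible source of ties in the workflow above is the 1:many ENSEMBL-to-ENTREZID mapping, where several Entrez IDs inherit the same score.

```r
## Toy named score vector; e2 and e3 share a score, as if one ENSEMBL ID
## had mapped to two Entrez IDs (values are made up)
toy.stats <- c(e1 = 2.5, e2 = 1.2, e3 = 1.2, e4 = -0.7, e5 = -3.1)

## Hypothetical helper: summarise tied values and duplicated names
tie_summary <- function(stats) {
  tied <- duplicated(stats) | duplicated(stats, fromLast = TRUE)
  list(
    n_tied          = sum(tied),
    fraction_tied   = mean(tied),
    tied_names      = names(stats)[tied],
    duplicate_names = unique(names(stats)[duplicated(names(stats))])
  )
}

str(tie_summary(toy.stats))
```

Applied to the real `input.genes`, the `tied_names` element would list the genes whose relative order is arbitrary.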

sessionInfo() Windows machine:

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.18.0    AnnotationDbi_1.64.1   IRanges_2.36.0        
[4] S4Vectors_0.40.2       Biobase_2.62.0         BiocGenerics_0.48.1   
[7] fgsea_1.28.0           clusterProfiler_4.10.1

loaded via a namespace (and not attached):
 [1] DBI_1.2.2               bitops_1.0-7            shadowtext_0.1.3       
 [4] gson_0.1.0              gridExtra_2.3           rlang_1.1.3            
 [7] magrittr_2.0.3          DOSE_3.28.2             compiler_4.3.0         
[10] RSQLite_2.3.6           png_0.1-8               vctrs_0.6.5            
[13] reshape2_1.4.4          stringr_1.5.1           pkgconfig_2.0.3        
[16] crayon_1.5.2            fastmap_1.1.1           XVector_0.42.0         
[19] ggraph_2.2.1            utf8_1.2.4              HDO.db_0.99.1          
[22] enrichplot_1.23.1.992   purrr_1.0.2             bit_4.0.5              
[25] zlibbioc_1.48.2         cachem_1.0.8            aplot_0.2.2            
[28] GenomeInfoDb_1.38.8     jsonlite_1.8.8          blob_1.2.4             
[31] BiocParallel_1.36.0     tweenr_2.0.3            parallel_4.3.0         
[34] R6_2.5.1                stringi_1.8.3           RColorBrewer_1.1-3     
[37] GOSemSim_2.29.1.001     Rcpp_1.0.12             snow_0.4-4             
[40] Matrix_1.6-5            splines_4.3.0           igraph_2.0.3           
[43] tidyselect_1.2.1        qvalue_2.34.0           viridis_0.6.5          
[46] codetools_0.2-20        lattice_0.22-6          tibble_3.2.1           
[49] plyr_1.8.9              treeio_1.26.0           withr_3.0.0            
[52] KEGGREST_1.42.0         gridGraphics_0.5-1      scatterpie_0.2.1       
[55] polyclip_1.10-6         Biostrings_2.70.3       pillar_1.9.0           
[58] ggtree_3.10.1           ggfun_0.1.4             generics_0.1.3         
[61] RCurl_1.98-1.14         ggplot2_3.5.0           munsell_0.5.1          
[64] scales_1.3.0            tidytree_0.4.6          glue_1.7.0             
[67] lazyeval_0.2.2          tools_4.3.0             data.table_1.15.4      
[70] fs_1.6.3                graphlayouts_1.1.1      fastmatch_1.1-4        
[73] tidygraph_1.3.1         cowplot_1.1.3           grid_4.3.0             
[76] tidyr_1.3.1             ape_5.7-1               colorspace_2.1-0       
[79] nlme_3.1-164            GenomeInfoDbData_1.2.11 patchwork_1.2.0        
[82] ggforce_0.4.2           cli_3.6.2               fansi_1.0.6            
[85] viridisLite_0.4.2       dplyr_1.1.4             gtable_0.3.4           
[88] yulab.utils_0.1.4       digest_0.6.35           ggrepel_0.9.5          
[91] ggplotify_0.1.2         farver_2.1.1            memoise_2.0.1          
[94] lifecycle_1.0.4         httr_1.4.7              GO.db_3.18.0           
[97] bit64_4.0.5             MASS_7.3-60.0.1        
> 

sessionInfo() Linux machine:

> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 39 (Thirty Nine)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.18.0    AnnotationDbi_1.64.1   IRanges_2.36.0        
[4] S4Vectors_0.40.2       Biobase_2.62.0         BiocGenerics_0.48.1   
[7] fgsea_1.28.0           clusterProfiler_4.10.1

loaded via a namespace (and not attached):
 [1] DBI_1.2.2               bitops_1.0-7            shadowtext_0.1.3       
 [4] gson_0.1.0              gridExtra_2.3           rlang_1.1.3            
 [7] magrittr_2.0.3          DOSE_3.28.2             compiler_4.3.3         
[10] RSQLite_2.3.6           png_0.1-8               vctrs_0.6.5            
[13] reshape2_1.4.4          stringr_1.5.1           pkgconfig_2.0.3        
[16] crayon_1.5.2            fastmap_1.1.1           XVector_0.42.0         
[19] ggraph_2.2.1            utf8_1.2.4              HDO.db_0.99.1          
[22] enrichplot_1.22.0       purrr_1.0.2             bit_4.0.5              
[25] zlibbioc_1.48.2         cachem_1.0.8            aplot_0.2.2            
[28] GenomeInfoDb_1.38.8     jsonlite_1.8.8          blob_1.2.4             
[31] BiocParallel_1.36.0     tweenr_2.0.3            parallel_4.3.3         
[34] R6_2.5.1                stringi_1.8.3           RColorBrewer_1.1-3     
[37] GOSemSim_2.28.1         Rcpp_1.0.12             Matrix_1.6-5           
[40] splines_4.3.3           igraph_2.0.3            tidyselect_1.2.1       
[43] qvalue_2.34.0           viridis_0.6.5           codetools_0.2-20       
[46] lattice_0.22-6          tibble_3.2.1            plyr_1.8.9             
[49] treeio_1.26.0           withr_3.0.0             KEGGREST_1.42.0        
[52] gridGraphics_0.5-1      scatterpie_0.2.2        polyclip_1.10-6        
[55] Biostrings_2.70.3       pillar_1.9.0            ggtree_3.10.1          
[58] ggfun_0.1.4             generics_0.1.3          RCurl_1.98-1.14        
[61] ggplot2_3.5.0           munsell_0.5.1           scales_1.3.0           
[64] tidytree_0.4.6          glue_1.7.0              lazyeval_0.2.2         
[67] tools_4.3.3             data.table_1.15.4       fs_1.6.3               
[70] graphlayouts_1.1.1      fastmatch_1.1-4         tidygraph_1.3.1        
[73] cowplot_1.1.3           grid_4.3.3              tidyr_1.3.1            
[76] ape_5.7-1               colorspace_2.1-0        nlme_3.1-164           
[79] GenomeInfoDbData_1.2.11 patchwork_1.2.0         ggforce_0.4.2          
[82] cli_3.6.2               fansi_1.0.6             viridisLite_0.4.2      
[85] dplyr_1.1.4             gtable_0.3.4            yulab.utils_0.1.4      
[88] digest_0.6.35           ggrepel_0.9.5           ggplotify_0.1.2        
[91] farver_2.1.1            memoise_2.0.1           lifecycle_1.0.4        
[94] httr_1.4.7              GO.db_3.18.0            bit64_4.0.5            
[97] MASS_7.3-60.0.1        
> 

fgsea.input.zip

assaron (Member) commented Apr 4, 2024

@guidohooiveld thanks for the report. I can reproduce the problem. I'll check later what's going on.

assaron (Member) commented May 1, 2024

To keep you updated: this turned out to be an issue with the algorithm that we were generally aware of, although not in this setting. We recently developed an approach to fix it properly. Hopefully we'll be able to integrate that fix into fgsea in the not-so-distant future, but it's not trivial, so I can't give an ETA. As a workaround for now, one can add random noise to the input scores, and everything should start working fine:

res <- fgseaMultilevel(
    pathways = term2gene.go,
    stats = input.genes+rnorm(length(input.genes), sd=0.001),
    minSize = 10,
    maxSize = 500,
    eps = 0,
    scoreType = c("std") )
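
Because the noise is random, results of the workaround will differ slightly between runs. A minimal sketch of a reproducible variant, using toy data: `add_noise` is a hypothetical helper (not part of fgsea) that fixes the random seed and checks that the perturbation removed the ties. The default `sd = 0.001` is taken from the call above and may need adjusting to the scale of the scores.

```r
## Hypothetical helper: add small Gaussian noise to break ties reproducibly.
## Note that set.seed() resets the global random number generator state.
add_noise <- function(stats, sd = 0.001, seed = 42) {
  set.seed(seed)                                   # same noise on every run
  noisy <- stats + rnorm(length(stats), sd = sd)   # names of 'stats' are kept
  if (anyDuplicated(noisy) > 0) {
    warning("ties remain after adding noise")
  }
  noisy
}

## Toy scores with a tie (values are made up)
toy.stats <- c(e1 = 2.5, e2 = 1.2, e3 = 1.2, e4 = -0.7)
noisy.stats <- add_noise(toy.stats)

anyDuplicated(noisy.stats) == 0            # TRUE: the tie is broken
max(abs(noisy.stats - toy.stats)) < 0.01   # the perturbation stays small
```

The result would then be passed as `stats = add_noise(input.genes)` in the call above.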
