Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to combine different compareCluster results? #678

Open
Kyeongmi-Lee opened this issue Mar 28, 2024 · 1 comment
Open

How to combine different compareCluster results? #678

Kyeongmi-Lee opened this issue Mar 28, 2024 · 1 comment

Comments

@Kyeongmi-Lee
Copy link

i have compareCluster enrichKEGG and enrichGO results with same input.
and i want to show 2 different results in one dotplot.
how to merge multiple compareCluster results?

@guidohooiveld
Copy link

guidohooiveld commented Mar 29, 2024

Maybe you had found the function merge_result from clusterProfiler, but unfortunately that function doesn't work (yet?) with the results from compareCluster.

One way of achieving you goal is using compareCluster with the generic function enricher on the combined gene sets from KEGG and GO.
Note that under the hood both enrichKEGG and enrichGO also use this generic function enricher, but with these 2 functions the inputs are the conveniently automagically generated on-the-fly, so there is less preparatory 'work' needed from the user.

Please also note that the result of this combined approach is almost identical to the outcomes of the separate enrichKEGG and enrichGO analyses. Since a different number of genes and gene sets are being analyzed, the number of background genes resp. the number of multiple tests performed are different, resulting in slightly different (raw) pvalue and BH-corrected p-values (p.adjust). This is mainly the case for the results of the KEGG-based analysis, since in KEGG itselves only 8672 human genes have been annotated (and are used as background genes), and in the combined TERM2GENE list there are 19189 (background) genes present (for GO-BP the difference in background genes is less; 18870).

Below some code.

> ## load libraries
> library(clusterProfiler)
> library(enrichplot)
> library(org.Hs.eg.db)
> library(GO.db)
> 
> ## load sample data
> data(gcSample)
> 
> 
> #### A) separate analysis
> 
> ## KEGG analysis
> res.kegg <- compareCluster(gcSample,
+                fun = "enrichKEGG",
+                pvalueCutoff=0.05,
+                pAdjustMethod = "BH",
+                minGSSize = 10,
+                maxGSSize = 500)
> res.kegg <- setReadable(res.kegg, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
> 
> res.kegg
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enrichKEGG 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   76 obs. of  12 variables:
 $ Cluster    : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
 $ category   : chr  "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
 $ subcategory: chr  "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
 $ ID         : chr  "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
 $ Description: chr  "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
 $ GeneRatio  : chr  "23/406" "8/406" "18/406" "9/193" ...
 $ BgRatio    : chr  "202/8672" "38/8672" "157/8672" "89/8672" ...
 $ pvalue     : num  6.71e-05 3.05e-04 3.78e-04 1.52e-04 3.25e-04 ...
 $ p.adjust   : num  0.0211 0.0395 0.0395 0.0357 0.0357 ...
 $ qvalue     : num  0.0196 0.0368 0.0368 0.0331 0.0331 ...
 $ geneID     : chr  "LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1" "ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C" "CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4" "THBS1/HSPG2/COL9A3/ITGB7/CHAD/ITGA7/LAMA4/ITGB8/ITGB5" ...
 $ Count      : int  23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 0 
#..   X2: 3 
#..   X3: 3 
#..   X4: 22 
#..   X5: 10 
#..   X6: 1 
#..   X7: 17 
#..   X8: 20 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> 
> ## Gene Ontology, Biological Process
> 
> res.go <- compareCluster(gcSample,
+                fun = "enrichGO",
+                OrgDb = 'org.Hs.eg.db',
+                ont = "BP",
+                readable = TRUE,
+                pvalueCutoff=0.05,
+                pAdjustMethod = "BH",
+                minGSSize = 10,
+                maxGSSize = 500)
> res.go <- pairwise_termsim(res.go)
> 
> res.go
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enrichGO 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   1002 obs. of  10 variables:
 $ Cluster    : Factor w/ 8 levels "X1","X2","X3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID         : chr  "GO:0021978" "GO:0086005" "GO:0086091" "GO:1990266" ...
 $ Description: chr  "telencephalon regionalization" "ventricular cardiac muscle cell action potential" "regulation of heart rate by cardiac conduction" "neutrophil migration" ...
 $ GeneRatio  : chr  "4/198" "5/198" "5/198" "8/198" ...
 $ BgRatio    : chr  "13/18870" "35/18870" "38/18870" "129/18870" ...
 $ pvalue     : num  7.81e-06 3.04e-05 4.58e-05 6.58e-05 6.59e-05 ...
 $ p.adjust   : num  0.0207 0.0322 0.0322 0.0322 0.0322 ...
 $ qvalue     : num  0.0191 0.0297 0.0297 0.0297 0.0297 ...
 $ geneID     : chr  "PAX6/LHX2/EMX1/EMX2" "SCN3B/KCNH2/KCND3/KCNE5/CTNNA3" "SCN3B/KCNH2/KCND3/KCNE5/CTNNA3" "CCL20/PLA2G1B/CXCL3/FUT7/CXCL8/GP2/S100A8/CXCL5" ...
 $ Count      : int  4 5 5 8 9 7 4 3 12 9 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 15 
#..   X2: 290 
#..   X3: 84 
#..   X4: 97 
#..   X5: 111 
#..   X6: 97 
#..   X7: 138 
#..   X8: 170 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> 
> 
> 
> 
> #### B) combined analysis
> 
> ## create TERM2GENE and TERM2NAME for KEGG
> kegg.data <- download_KEGG(species="hsa", keggType = "KEGG", keyType = "kegg")
> term2gene.kegg <- kegg.data$KEGGPATHID2EXTID
> term2name.kegg <- kegg.data$KEGGPATHID2NAME
> 
> head(term2gene.kegg)
      from    to
1 hsa00010 10327
2 hsa00010   124
3 hsa00010   125
4 hsa00010   126
5 hsa00010   127
6 hsa00010   128
> head(term2name.kegg)
      from                              to
1 hsa01100              Metabolic pathways
2 hsa01200               Carbon metabolism
3 hsa01210 2-Oxocarboxylic acid metabolism
4 hsa01212           Fatty acid metabolism
5 hsa01230     Biosynthesis of amino acids
6 hsa01232           Nucleotide metabolism
> 
> ## create TERM2GENE and TERM2NAME for GO
> ## code is based on clusterProfiler code
> ## https://github.com/YuLab-SMU/clusterProfiler/blob/b24be8c372f947ed79f8f42c1a7f6624c14fb125/R/enrichGO.R#L156
> 
> 
> ## set GO category to be analyzed,
> ##i.e. 'BP', 'CC', 'MF', or 'ALL'
> 
> ont <- "BP"  ##limit sets to GO-BP
> 
> goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
> if (ont != "ALL") {goterms <- goterms[goterms == ont]}
> 
> term2gene.go <- AnnotationDbi::mapIds(org.Hs.eg.db, keys=names(goterms),
+                                       column="ENTREZID", keytype="GOALL", multiVals='list')
'select()' returned 1:many mapping between keys and columns
> term2gene.go <- stack(term2gene.go)
> term2gene.go <- term2gene.go[!is.na(term2gene.go[,"values"]),] [,c(2,1)]
> colnames(term2gene.go) <- c("from","to")
> 
> term2name.go <- Term(GOTERM)[ names(goterms) ]
> term2name.go <- data.frame("from"=names(term2name.go), "to"=term2name.go)
> 
> head(term2gene.go)
        from   to
2 GO:0000002  142
3 GO:0000002  291
4 GO:0000002 1763
5 GO:0000002 1890
6 GO:0000002 2021
7 GO:0000002 3980
> head(term2name.go)
                 from                               to
GO:0000001 GO:0000001        mitochondrion inheritance
GO:0000002 GO:0000002 mitochondrial genome maintenance
GO:0000003 GO:0000003                     reproduction
GO:0000011 GO:0000011              vacuole inheritance
GO:0000012 GO:0000012       single strand break repair
GO:0000017 GO:0000017        alpha-glucoside transport
> 
> 
> ## combine the KEGG and GO TERM2GENE and TERM2NAME
> TERM2GENE <- rbind(term2gene.kegg, term2gene.go)
> TERM2NAME <- rbind(term2name.kegg, term2name.go)
> 
> 
> 
> ## run compareCluster with generic function enricher, and the combined TERM2GENE and TERM2NAME
> res.combined <- compareCluster(geneClusters = gcSample,
+                      fun = "enricher",
+                      minGSSize = 10,
+                      maxGSSize = 500,
+                      pvalueCutoff = 0.05,
+                      pAdjustMethod = "BH",
+                      TERM2GENE = TERM2GENE,
+                      TERM2NAME = TERM2NAME
+                      )
> res.combined <- setReadable(res.combined, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
> 
> ## check
> res.combined
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enricher 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   1225 obs. of  10 variables:
 $ Cluster    : Factor w/ 8 levels "X1","X2","X3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID         : chr  "GO:0021978" "GO:0086005" "GO:0086091" "GO:0061351" ...
 $ Description: chr  "telencephalon regionalization" "ventricular cardiac muscle cell action potential" "regulation of heart rate by cardiac conduction" "neural precursor cell proliferation" ...
 $ GeneRatio  : chr  "4/198" "5/198" "5/198" "9/198" ...
 $ BgRatio    : chr  "13/19189" "35/19189" "38/19189" "166/19189" ...
 $ pvalue     : num  7.31e-06 2.81e-05 4.23e-05 5.80e-05 5.86e-05 ...
 $ p.adjust   : num  0.0208 0.0312 0.0312 0.0312 0.0312 ...
 $ qvalue     : num  0.0191 0.0286 0.0286 0.0286 0.0286 ...
 $ geneID     : chr  "PAX6/LHX2/EMX1/EMX2" "SCN3B/KCNH2/KCND3/KCNE5/CTNNA3" "SCN3B/KCNH2/KCND3/KCNE5/CTNNA3" "NF2/PAX6/LHX2/FZD9/LHX5/EPHB1/EMX1/LEF1/EMX2" ...
 $ Count      : int  4 5 5 9 8 7 4 3 12 9 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 15 
#..   X2: 317 
#..   X3: 98 
#..   X4: 164 
#..   X5: 145 
#..   X6: 102 
#..   X7: 175 
#..   X8: 209 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> 
> 
> dotplot(res.combined, font.size=8)
> 

image

> ## compare results separate and combined runs
> 
> as.data.frame(res.kegg)[as.data.frame(res.kegg)$ID == "hsa05169",]
  Cluster       category               subcategory       ID
1      X2 Human Diseases Infectious disease: viral hsa05169
                   Description GeneRatio  BgRatio       pvalue  p.adjust
1 Epstein-Barr virus infection    23/406 202/8672 6.707389e-05 0.0210612
      qvalue
1 0.01962794
                                                                                                                         geneID
1 LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1
  Count
1    23
> as.data.frame(res.go)[as.data.frame(res.go)$ID == "GO:0021978",]
  Cluster         ID                   Description GeneRatio  BgRatio
1      X1 GO:0021978 telencephalon regionalization     4/198 13/18870
        pvalue   p.adjust     qvalue              geneID Count
1 7.807836e-06 0.02071419 0.01910043 PAX6/LHX2/EMX1/EMX2     4
> 
> as.data.frame(res.combined)[as.data.frame(res.combined)$ID == "hsa05169",]
    Cluster       ID                  Description GeneRatio   BgRatio
33       X2 hsa05169 Epstein-Barr virus infection    23/760 202/19189
735      X5 hsa05169 Epstein-Barr virus infection    20/891 202/19189
952      X7 hsa05169 Epstein-Barr virus infection    16/563 202/19189
          pvalue    p.adjust      qvalue
33  5.530688e-06 0.001653983 0.001372322
735 1.193721e-03 0.046538197 0.039817444
952 3.191288e-04 0.013958290 0.012162907
                                                                                                                           geneID
33  LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1
735              LYN/ICAM1/RB1/SKP2/E2F3/PIK3CD/NFKBIE/CASP9/EIF2AK2/CDK2/PLCG2/FAS/IFNAR2/SAP30/STAT1/MAPK13/AKT3/CD3D/MAVS/CD40
952                              PSMD8/PSMD2/NFKBIA/IRAK1/HDAC2/CDK4/SEM1/HES1/CXCL10/ISG15/NCOR2/CCND1/PSMD11/HLA-G/PSMD14/STAT1
    Count
33     23
735    20
952    16
> as.data.frame(res.combined)[as.data.frame(res.combined)$ID == "GO:0021978",]
  Cluster         ID                   Description GeneRatio  BgRatio
1      X1 GO:0021978 telencephalon regionalization     4/198 13/19189
        pvalue   p.adjust     qvalue              geneID Count
1 7.310454e-06 0.02082748 0.01911491 PAX6/LHX2/EMX1/EMX2     4
> 
> 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants