Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgeneSim with Resnik measure produces incorrect results #23

Open
deniseduma opened this issue Mar 20, 2019 · 3 comments
Open

mgeneSim with Resnik measure produces incorrect results #23

deniseduma opened this issue Mar 20, 2019 · 3 comments

Comments

@deniseduma
Copy link

Hi,

I'm trying to use GOSemSim to compute pairwise similarity scores for a large set of genes and it seems that the Resnik measure at least is incorrect because the self-similarities on the diagonal which should all be 1 are not 1 actually!

Here is the command I run

d = godata('org.Hs.eg.db', ont="BP", computeIC=TRUE)
gsimmat3 = mgeneSim(my_gene_list, semData=d, drop="NULL", measure="Resnik", combine="BMA")
print(gsimmat3)

and here is the output

[1] "gsimmat3"
150 151 152 1814 1815 1816
150 0.615 0.502 0.503 0.407 0.388 0.364
151 0.502 0.623 0.594 0.360 0.359 0.379
152 0.503 0.594 0.626 0.357 0.353 0.364
1814 0.407 0.360 0.357 0.678 0.509 0.451
1815 0.388 0.359 0.353 0.509 0.669 0.398
1816 0.364 0.379 0.364 0.451 0.398 0.680

As you can see the diagonal is not 1 as it should be!

Could you please fix this?

Thank you,
Denise

@llrs
Copy link

llrs commented Jun 7, 2019

I can reproduce this:

library("GOSemSim")
#> 
#> GOSemSim v2.10.0  For help: https://guangchuangyu.github.io/GOSemSim
#> 
#> If you use GOSemSim in published research, please cite:
#> Guangchuang Yu, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, Shengqi Wang. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics 2010, 26(7):976-978
my_gene_list <- c("150", "151", "152", "1814", "1815", "1816")
d = godata('org.Hs.eg.db', ont="BP", computeIC=TRUE)
#> Loading required package: org.Hs.eg.db
#> Loading required package: AnnotationDbi
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter,
#>     Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
#>     setdiff, sort, table, tapply, union, unique, unsplit, which,
#>     which.max, which.min
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: IRanges
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> 
#> preparing gene to GO mapping data...
#> preparing IC data...
gsimmat3 = mgeneSim(my_gene_list, semData=d, drop="NULL", measure="Resnik", combine="BMA")
#> |=================================================================| 100%
gsimmat3
#>        150   151   152  1814  1815  1816
#> 150  0.638 0.502 0.502 0.376 0.379 0.352
#> 151  0.502 0.641 0.605 0.311 0.322 0.349
#> 152  0.502 0.605 0.645 0.311 0.320 0.335
#> 1814 0.376 0.311 0.311 0.667 0.507 0.424
#> 1815 0.379 0.322 0.320 0.507 0.673 0.408
#> 1816 0.352 0.349 0.335 0.424 0.408 0.701

Created on 2019-06-07 by the reprex package (v0.3.0)

I though it could be an error on mgeneSim but it turns that it happens the same on geneSim:

lr <- geneSim("150", "150", semData = d, measure = "Resnik", combine = "BMA")
lr$geneSim
#> [1] 0.639

The problem seems to come from the Resnik similarity as calculating the similarity between the same GOs with Resnik reports a value different from 0:

mgoSim(lr$GO1[1:5], lr$GO2[1:5], semData = d, measure = "Resnik", combine = NULL)
#>            GO:0001819 GO:0007186 GO:0007189 GO:0007193 GO:0007265
#> GO:0001819      0.415      0.075      0.075      0.075      0.075
#> GO:0007186      0.075      0.367      0.367      0.367      0.181
#> GO:0007189      0.075      0.367      0.594      0.542      0.276
#> GO:0007193      0.075      0.367      0.542      0.627      0.181
#> GO:0007265      0.075      0.181      0.276      0.181      0.495

However I'm not sure if the Resnik similarity between the same element must be 1. I couldn't understand what happens in that case from the original article.

@serbulent
Copy link

I also can reproduce this with Rel also.

hsGO <- godata('org.Hs.eg.db', ont="BP", keytype = "UNIPROT")
mgeneSim(genes=c("A1E959") , semData=hsGO, measure="Rel",verbose=TRUE, drop = "IEA" )

|======================================================================| 100%
              A1E959
A1E959  0.983

@Pmaj7
Copy link

Pmaj7 commented Aug 1, 2023

I'd like to confirm that I'm having a similar issue with Resnik scores not being calculated correctly.

library(GOSemSim)
library(org.Hs.eg.db)

GO.org <- org.Hs.eg.db
GO.anno <- godata(GO.org, ont = "BP")

# Term 1 vs. Term 2
> goSim("GO:0042060", "GO:0061041", semData = GO.anno, measure = "Resnik")
[1] 0.469

# Term 1 vs. Term 1
> goSim("GO:0042060", "GO:0042060", semData = GO.anno, measure = "Resnik")
[1] 0.469

# Term 2 vs. Term 2
> goSim("GO:0061041", "GO:0061041", semData = GO.anno, measure = "Resnik")
[1] 0.569

As you can see, these values make no sense, and should be much greater, especially if one term is being compared to itself. When testing Resnik similarity using NaviGO's web portal, it accurately determines Term 1 (wound healing) and Term 2 (regulation of wound healing) are highly similar with a Resnik score of 2.4.

I'm surprised there aren't more people raising concerns about this, but perhaps Resnik isn't used by many and most just use the default Wang method. This just goes to show, never blindly believe the results you get from one tool/package. Always make sure you verify results using tools from multiple devs to be sure what you get from any one tool is accurate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants