In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize, : There are duplicate gene names, fgsea may produce unexpected results #189

Closed
sanhe374 opened this issue Mar 4, 2019 · 11 comments

Comments

@sanhe374

sanhe374 commented Mar 4, 2019

I have been following the clusterProfiler guide at https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html and have created a ranked gene list according to the instructions in the wiki.

head(genekegg_6)
   400746    114794     90204 104355220     29881     55515
 5.700247  4.955668  4.846596  4.550389  4.490711  4.188874

When I run gseKEGG or gsePathway I get the same type of error.

gsea_6_kegg <- gseKEGG(geneList      = genekegg_6,
                       organism      = 'hsa',
                       nPerm         = 1000,
                       maxGSSize     = 500,
                       minGSSize     = 20,
                       pAdjustMethod = "BH",
                       pvalueCutoff  = 0.05,
                       verbose       = TRUE)

preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize, :
There are duplicate gene names, fgsea may produce unexpected results

The warning says there are duplicate gene names, but I cannot find any duplicates in the list.

anyDuplicated(genekegg_6)
[1] 0

Why do I get this warning message?

@llrs

llrs commented Mar 4, 2019

Try anyDuplicated(names(genekegg_6)). The problem is not only duplicated values but also duplicated names: if a gene has two values, the test doesn't work well (which one should be used?).
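
To illustrate the difference between checking values and checking names, here is a minimal sketch in base R. The vector gene_list and its scores are made up for this example; only the Entrez-style IDs echo the output shown earlier in the thread.

```r
# Toy ranked list (hypothetical scores): the ID "114794" appears twice
gene_list <- c("400746" = 5.70, "114794" = 4.96, "90204" = 4.85, "114794" = 4.55)

# anyDuplicated() on the vector itself compares the scores, which all differ
anyDuplicated(gene_list)         # 0

# On the names it returns the index of the first repeated name
anyDuplicated(names(gene_list))  # 4
```

A nonzero result from the second call is the situation that the fgsea warning refers to.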

@sanhe374
Author

sanhe374 commented Mar 4, 2019

If I use the command, I get the following
anyDuplicated(names(genekegg_6))
[1] 9

However if I do:

unique_GSEA<-unique(genekegg_6)

to get all the unique entries I get the same number of genes as in the original genekegg_6
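
A short sketch of why unique() leaves the length unchanged here, assuming a toy vector (toy_list, with invented scores) in place of genekegg_6: unique() compares the values only, and in base R it also returns the result without names. The last lines show one possible way to list the entries that share a name.

```r
# Toy ranked list (hypothetical scores): the ID "114794" appears twice
toy_list <- c("400746" = 5.70, "114794" = 4.96, "90204" = 4.85, "114794" = 4.55)

# unique() looks at the scores only; all four differ, so nothing is removed
by_value <- unique(toy_list)
length(by_value) == length(toy_list)  # TRUE

# The names are not carried over to the result
is.null(names(by_value))              # TRUE

# Listing the entries whose name occurs more than once
dup_names <- unique(names(toy_list)[duplicated(names(toy_list))])
toy_list[names(toy_list) %in% dup_names]
```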

@sanhe374 sanhe374 closed this as completed Mar 4, 2019
@sanhe374 sanhe374 reopened this Mar 4, 2019
@llrs

llrs commented Mar 4, 2019

Did you remove those duplicated entries, or somehow correct the names of the values? Unless the names are unique, it won't work.

@sanhe374
Author

sanhe374 commented Mar 4, 2019 via email

@llrs

llrs commented Mar 4, 2019

If you use unique(genekegg_6), you are only removing duplicated values, not duplicated names.
I would do something like genekegg_6 <- genekegg_6[!duplicated(names(genekegg_6))]. Note that this approach looks at the names.
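
Because duplicated() flags every occurrence after the first, the order of the vector decides which entry survives. The sketch below, on a made-up vector (ranked, with invented scores), sorts by decreasing score first so that the highest-scoring entry per gene is kept. Keeping the highest score is an assumption made for illustration, not a recommendation from this thread.

```r
# Toy list (hypothetical scores), unsorted, with "114794" appearing twice
ranked <- c("90204" = 4.85, "114794" = 4.55, "400746" = 5.70, "114794" = 4.96)

# Sort by decreasing score; sort() keeps the names attached to the values
ranked <- sort(ranked, decreasing = TRUE)

# Keep only the first occurrence of each name (here: the highest score)
ranked <- ranked[!duplicated(names(ranked))]

ranked
anyDuplicated(names(ranked))  # 0
```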

@sanhe374
Author

sanhe374 commented Mar 5, 2019

Thank you so much! The error message is now gone 👍

@sanhe374 sanhe374 closed this as completed Mar 5, 2019
@rodela71

rodela71 commented May 3, 2021

Would you please tell me how to prepare geneSet for fgsea?

@llrs

llrs commented May 3, 2021

@rodela71 If you have a question, you could try asking at https://support.bioconductor.org. This tracker is for issues with the package itself, such as it not working properly or other problems. To get help, I would suggest showing what you have tried and explaining which step you have trouble with.

@AStubbusch

Hi @llrs,

apologies for commenting in this closed issue, but my question is very related:
Could you explain what the clusterProfiler::gseKEGG(geneList, ...) function does when it receives a geneList containing duplicated names? Does it simply ignore those duplicated genes, or does it include them in some way that you could describe?
(It produces the warning There are duplicate gene names, fgsea may produce unexpected results. and I would like to know what exactly 'unexpected results' refers to. Are they included but the size of the gene set is not adjusted?)

Ideally, I would like to keep gene duplicates and treat them as individual entries of the same gene set. Is there a mode for this, or could you point me to the code snippet where this step happens?

Many thanks!

@llrs

llrs commented Jan 18, 2022

@AStubbusch this warning comes from fgsea, so it has nothing to do with clusterProfiler (except that clusterProfiler doesn't check for duplicates before passing the data to fgsea). The problem in fgsea is that, with duplicated names, the bootstrapping and selection of genes might not follow the underlying mathematical assumptions of the test.
If I remember correctly, the gene set size in fgsea is not adjusted when duplicate genes are present (it simply checks the length of the vector).

For GSEA to make sense, you can't keep duplicated entries separate while treating them as the same gene. I would merge them into a single entity. If these are transcripts, you can use gene expression instead of transcript expression, and similarly for other entities.
An alternative would be to use GSVA to summarize all the entries into a single value and test that gene set as a single entity, but that would answer a slightly different question.
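
One possible way to merge duplicated entries into a single entity is to collapse them to one value per name, sketched here in base R. Taking the mean is an assumption made for illustration; the thread does not prescribe a summary statistic, and the vector scores and its values are invented.

```r
# Toy list (hypothetical scores): the ID "114794" appears twice
scores <- c("400746" = 5.70, "114794" = 4.96, "90204" = 4.85, "114794" = 4.55)

# Collapse to one value per name (the mean here, an illustrative choice)
per_gene <- tapply(scores, names(scores), mean)

# Convert back to a plain named vector and restore the decreasing order
merged <- sort(setNames(as.numeric(per_gene), names(per_gene)), decreasing = TRUE)

merged
anyDuplicated(names(merged))  # 0
```

Other summaries (for example the maximum, or the value with the largest absolute score) fit the same pattern by swapping the function passed to tapply().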


If people have more questions, please post on support.bioconductor.org site so that other people (and maintainers) can help. I will stop helping here as this is not the right place to ask questions and I don't want to encourage them here.

@AStubbusch

Thanks a lot for your answer, @llrs, and apologies. Further questions will go to support.bioconductor.org!
