Database update? #99

Open
cmkobel opened this issue Mar 12, 2024 · 2 comments

Comments

@cmkobel

cmkobel commented Mar 12, 2024

How old is the UniRef database that CheckM2 currently uses? I see a reference to 3 June 2018 in the main publication, but I am not sure whether it has been updated since.

Am I correct in assuming that you downloaded UniRef100 and the ID mappings (https://www.uniprot.org/help/downloads), and then kept only the proteins that have a KEGG Orthology mapping?

Cheers.

@chklovski
Owner

Hi,

Yes, that's correct. We used a 2018 database with KEGG-UniRef ID mappings during CheckM2 development, but UniRef has since decided not to include KEGG ID mappings in future updates, so CheckM2 currently uses the last available database, from 2018. Because CheckM2 relies on fast DIAMOND-based protein annotation, we haven't switched to KEGG HMM searches. We are exploring an alternative annotation system based on DRAM (or other annotation tools, e.g. STRING/eggNOG) applied to the full GTDB protein database, but that is still at the benchmarking stage.
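For readers trying to reproduce the filtering step confirmed above (keeping only UniRef100 entries that have a KEGG Orthology mapping), here is a minimal sketch. It is not the CheckM2 authors' actual pipeline. It assumes a three-column, tab-separated ID mapping file (accession, ID type, ID value) and that the type labels are `UniRef100` and `KO`; both labels are assumptions and are exposed as parameters. The function name `uniref_to_ko` is hypothetical.

```python
from collections import defaultdict


def uniref_to_ko(lines, uniref_type="UniRef100", ko_type="KO"):
    """Map UniRef100 cluster IDs to the KEGG Orthology IDs of their members.

    `lines` is any iterable of tab-separated records:
    accession <TAB> id_type <TAB> id_value
    The type labels are assumptions; adjust them to match your file.
    """
    acc_to_uniref = {}
    acc_to_kos = defaultdict(set)

    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed or empty records
        acc, id_type, value = parts
        if id_type == uniref_type:
            acc_to_uniref[acc] = value
        elif id_type == ko_type:
            acc_to_kos[acc].add(value)

    # Keep only clusters whose member accessions carry at least one KO ID.
    mapping = defaultdict(set)
    for acc, kos in acc_to_kos.items():
        cluster = acc_to_uniref.get(acc)
        if cluster is not None:
            mapping[cluster].update(kos)
    return dict(mapping)


# Made-up records for illustration only.
sample = [
    "P12345\tUniRef100\tUniRef100_P12345",
    "P12345\tKO\tK00001",
    "Q99999\tUniRef100\tUniRef100_Q99999",
    "Q99999\tGene_Name\tfoo",
]
kept = uniref_to_ko(sample)
```

In this toy input only `UniRef100_P12345` is kept, because `Q99999` has no KO record. The sketch holds both lookup tables in memory, so a full-size mapping file would likely need a streaming or on-disk approach instead.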

Nevertheless, although the annotation database is a bit old, we'll be using newly added publicly available genomes to update CheckM2 (the newest CheckM2 update, incorporating GTDB R214, should hopefully be out by the end of the month).

@cmkobel
Author

cmkobel commented Mar 12, 2024

Thanks for the quick answer!
Okay, that explains why I have such a hard time finding a mapping between UniRef and KEGG Orthology. Looking forward to testing the new protein setup. :)

Btw, I know you've been looking into using KEGG pathways as part of the completeness scoring (correct me if I'm wrong). Do you think that Gene Ontology (GO) might be a better fit for pathway lookups, generally speaking?
