Provide MGnify sequences available in ESM Metagenomic Atlas #366
Comments
Thanks! Those downloaded in 1 hour and are just what I need. It would be helpful to mention those files in the ESM Metagenomic Atlas GitHub README, where the other downloadable files are described: https://github.com/facebookresearch/esm/blob/main/scripts/atlas/README.md
The mgnify90.fasta you provided is useful, but it is not quite what I want. It contains 623796864 sequences, while the ESM Metagenomic Atlas only contains 577944949 sequences according to the stats.parquet file you provide. I want a fasta file of exactly the sequences for which the ESM Metagenomic Atlas has predicted structures. Of course I can filter mgnify90.fasta down to the desired set of sequences, but I think it would help the community if the sequences for atlas predictions were available for download. Thanks!
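For anyone who does want to filter mgnify90.fasta themselves, here is a minimal sketch of the idea in Python. It assumes that each FASTA header starts with the MGnify id as its first whitespace-delimited token, and that the set of atlas ids has already been loaded from stats.parquet; neither the header layout nor the column name is stated in this thread, so check both against the real files. The function name `filter_fasta` and the example records are made up for illustration.

```python
from typing import Iterable, Iterator, Set


def filter_fasta(lines: Iterable[str], keep_ids: Set[str]) -> Iterator[str]:
    """Yield only the lines of FASTA records whose id is in keep_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Assumption: the record id is the first whitespace-delimited
            # token of the header line.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line


# Small in-memory demonstration with made-up records.
example = [
    ">MGYP000000000001 example\n",
    "MKTAYIAK\n",
    ">MGYP000000000002 example\n",
    "MSLLTEVE\n",
    "TYVLSIIP\n",
]
kept = list(filter_fasta(example, {"MGYP000000000002"}))
```

With the real data one would pass an open file handle instead of the list and write the yielded lines to an output file. Holding several hundred million ids in a Python set is likely to need tens of gigabytes of memory, so a sorted-merge or on-disk lookup may be more practical.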
Documenting the new files in #367.
By the way, the stats.parquet file contains duplicate MGnify ids. That file has 577944949 rows but only 577424953 unique MGnify ids in those rows, so about 520,000 duplicates. No idea why.
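A rough sketch of how such a duplicate count can be reproduced, in Python. The helper `duplicate_summary` is hypothetical and works on any iterable of ids; loading the id column from stats.parquet (for example with `pandas.read_parquet`) is left out because the column name is not given in this thread.

```python
from collections import Counter
from typing import Iterable, Tuple


def duplicate_summary(ids: Iterable[str]) -> Tuple[int, int, int]:
    """Return (total rows, unique ids, surplus rows caused by repeated ids)."""
    counts = Counter(ids)
    total = sum(counts.values())
    unique = len(counts)
    return total, unique, total - unique


# Made-up ids for illustration: five rows, three distinct ids.
summary = duplicate_summary(["a", "b", "a", "c", "a"])

# The figures quoted above: rows minus unique ids.
reported_surplus = 577944949 - 577424953  # 519996
```

The subtraction confirms the "about 520,000" figure: there are 519996 more rows than unique ids.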
We now provide |
Note that the Let me know if you notice anything else missing - and again thank you for bringing all of this to our attention!
Thanks for quickly solving this problem! The atlas.fasta file is exactly what I needed. |
It would be useful if you could offer a download of the 500 million sequences of the atlas as a fasta file. I want this because it allows me to do fast local searches. The online search capability provided by the atlas takes many minutes to do a single search and we have much faster (but less sensitive) search methods. I'd like to allow users of our ChimeraX visualization software to quickly search the atlas.
I'm aware that I can get the sequences from the EBI MGnify database (2.5 billion sequences), then filter them using the stats.parquet file for the atlas to just the ones used in the atlas. I am pursuing that currently, but there appear to be no mirrors of the MGnify database, and downloading its 250 Gbytes from EBI to the United States will take about 10 days, apparently bottlenecked by EBI providing only ~1 Mbit/sec. A download directly from Meta would be 1/5 the size, and if you use better compression than what EBI is using (e.g. bzip2 or xz instead of gzip) it could be 1/10th the size, ~25 Gbytes, and on a much faster connection could be downloaded in hours instead of 10 days.
I realize I could scrape the sequences from the 15 Tbytes of structure prediction files that the Atlas provides -- but I'd prefer to not download 500 times more data than I actually need.
Thanks!
Tom Goddard
University of California, San Francisco
ChimeraX molecular visualization software developer