Provide MGnify sequences available in ESM Metagenomic Atlas #366

Closed
tomgoddard opened this issue Nov 14, 2022 · 8 comments

@tomgoddard

It would be useful if you could offer a download of the 500 million sequences of the atlas as a fasta file. I want this because it allows me to do fast local searches. The online search capability provided by the atlas takes many minutes to do a single search and we have much faster (but less sensitive) search methods. I'd like to allow users of our ChimeraX visualization software to quickly search the atlas.

I'm aware that I can get the sequences from the EBI MGnify database (2.5 billion sequences), then filter them using the stats.parquet file for the atlas to just the ones used in the atlas. I am pursuing that currently, but there appear to be no mirrors of the MGnify database, and downloading its 250 Gbytes from EBI to the United States will take about 10 days, apparently bottlenecked by EBI providing only ~1 Mbit/sec. A download directly from Meta would be 1/5 the size, and if you use better compression than the gzip that EBI uses (e.g. bzip2 or xz) it could be 1/10th the size, ~25 Gbytes, which on a much faster connection could be downloaded in hours instead of 10 days.
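
As a rough sanity check of the transfer figures above, here is a minimal sketch of the arithmetic. It assumes decimal units (1 Gbyte = 10^9 bytes, 1 Mbit = 10^6 bits) and a perfectly sustained rate; the function names are illustrative and not part of any tool mentioned here.

```python
SECONDS_PER_DAY = 86400


def transfer_days(size_gbytes: float, rate_mbit_per_s: float) -> float:
    """Days needed to move size_gbytes at a sustained rate_mbit_per_s."""
    bits = size_gbytes * 1e9 * 8          # Gbytes -> bits (decimal units)
    seconds = bits / (rate_mbit_per_s * 1e6)
    return seconds / SECONDS_PER_DAY


def implied_rate_mbit_per_s(size_gbytes: float, days: float) -> float:
    """Sustained rate (Mbit/s) implied by moving size_gbytes in `days` days."""
    bits = size_gbytes * 1e9 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    # 250 Gbytes in 10 days implies a rate of a few Mbit/s.
    print(f"{implied_rate_mbit_per_s(250, 10):.2f} Mbit/s")
    # Days needed for 250 Gbytes at exactly 1 Mbit/s.
    print(f"{transfer_days(250, 1):.1f} days")
```

Under these assumptions, 250 Gbytes in 10 days works out to about 2.3 Mbit/s sustained, the same order of magnitude as the ~1 Mbit/sec estimate.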

I realize I could scrape the sequences from the 15 Tbytes of structure prediction files that the Atlas provides -- but I'd prefer to not download 500 times more data than I actually need.

Thanks!

Tom Goddard
University of California, San Francisco
ChimeraX molecular visualization software developer

@tomsercu
Contributor

tomsercu commented Nov 15, 2022

In #341 we just provided the fasta file: s3://dl.fbaipublicfiles.com/esmatlas/v0/full/mgnify90.fasta
Can you try downloading from S3 via aria2c or s5cmd? If that's too slow we could look at providing a compressed version.
cc @ebetica we should also document this in the README

@tomgoddard
Author

Thanks! Those downloaded in 1 hour and are just what I need. It would be helpful to mention those files in the ESM Metagenomic Atlas GitHub README, where the other downloadable files are described:

https://github.com/facebookresearch/esm/blob/main/scripts/atlas/README.md

@tomgoddard
Author

The mgnify90.fasta you provided is useful, but it is not quite what I want. It contains 623796864 sequences, while the ESM Metagenomic Atlas only contains 577944949 sequences according to the stats.parquet file you provide. I want a fasta file of exactly the sequences for which the ESM Metagenomic Atlas has predicted structures. Of course I can filter mgnify90.fasta down to the desired set of sequences, but I think it would help the community if the sequences for atlas predictions were available for download. Thanks!
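
The filtering step mentioned above could be sketched roughly as follows. This is a minimal illustration, not the method used by the Atlas maintainers: it assumes the record id is the first whitespace-delimited token after `>` on each header line, and that the set of wanted ids has already been loaded from stats.parquet by some other means. The function name and the sample ids are hypothetical.

```python
from typing import Iterable, Iterator, Set


def filter_fasta(lines: Iterable[str], keep_ids: Set[str]) -> Iterator[str]:
    """Yield only the lines of FASTA records whose id is in keep_ids.

    Assumption: the id is the first whitespace-delimited token after '>'.
    Sequence lines are passed through unchanged, so multi-line records work.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in keep_ids
        if keep:
            yield line


if __name__ == "__main__":
    # Tiny in-memory demo with made-up ids and sequences.
    demo = [
        ">MGYP000000000001 example\n", "MKV\n",
        ">MGYP000000000002\n", "AAA\n", "CCC\n",
    ]
    print("".join(filter_fasta(demo, {"MGYP000000000002"})), end="")
```

Because the function streams line by line, it can be applied to an open file handle for mgnify90.fasta and written out with `writelines`. Holding several hundred million ids in a Python set needs a lot of memory, so a more compact id representation may be needed at full scale.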

@tomsercu
Contributor

Documenting the new files in #367.
It looks like stats.parquet is the old version; we should update it and provide the corresponding fasta.

@tomgoddard
Author

By the way, the stats.parquet file contains duplicate MGnify ids. That file has 577944949 rows but only 577424953 unique MGnify ids in those rows, so about 520,000 duplicates. No idea why.
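
A duplicate check like the one described above can be sketched as follows. It assumes the MGnify ids have already been read from stats.parquet into an iterable of strings (the column name is not given in this thread, so loading is left out); `duplicate_summary` is an illustrative name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def duplicate_summary(ids: Iterable[str]) -> Tuple[int, int, Dict[str, int]]:
    """Return (total rows, unique ids, {id: count} for ids seen more than once)."""
    counts = Counter(ids)
    total = sum(counts.values())
    unique = len(counts)
    repeated = {i: n for i, n in counts.items() if n > 1}
    return total, unique, repeated


if __name__ == "__main__":
    total, unique, repeated = duplicate_summary(["a", "b", "a", "c"])
    print(total, unique, total - unique, repeated)
```

For the counts reported above, 577944949 rows minus 577424953 unique ids gives 519996 surplus rows.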

@tomsercu
Contributor

We now provide s3://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta with 617051007 records precisely matching stats.parquet.

@tomsercu
Contributor

Note that mgnify90.fasta stays in place and contains the full, raw MGnify90 sequence set, including sequences for which we didn't make predictions.

Let me know if you notice anything else missing - and again thank you for bringing all of this to our attention!

tomsercu added a commit that referenced this issue Nov 22, 2022
Documenting stats.parquet and atlas.fasta, see #366 and #376
@tomgoddard
Author

Thanks for quickly solving this problem! The atlas.fasta file is exactly what I needed.
