Provide MGnify sequences available in ESM Metagenomic Atlas #366

tomgoddard · 2022-11-14T20:22:02Z

It would be useful if you could offer a download of the 500 million sequences of the atlas as a fasta file. I want this because it allows me to do fast local searches. The online search capability provided by the atlas takes many minutes to do a single search and we have much faster (but less sensitive) search methods. I'd like to allow users of our ChimeraX visualization software to quickly search the atlas.

I'm aware that I can get the sequences from the EBI MGnify database (2.5 billion sequences) and then use the stats.parquet file for the atlas to filter them down to just the ones used in the atlas. I am pursuing that currently, but there appear to be no mirrors of the MGnify database, and downloading its 250 Gbytes from EBI to the United States will take about 10 days, apparently bottlenecked by EBI providing only ~1 Mbit/sec. A download directly from Meta would be 1/5 the size, and if you use better compression than EBI does (e.g. bzip2 or xz instead of gzip) it could be 1/10th the size, ~25 Gbytes, which on a much faster connection could be downloaded in hours instead of 10 days.
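As a rough check on the transfer-time figures above, here is a minimal sketch of the arithmetic. The 250 Gbyte size, the ~1 Mbit/sec rate, and the 10-day estimate come from the comment. Decimal units (1 Gbyte = 10^9 bytes) and a perfectly sustained rate are assumptions, and the function names are made up for illustration.

```python
def transfer_days(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Days needed to move size_bytes at a sustained rate in bits per second."""
    seconds = size_bytes * 8 / rate_bits_per_sec
    return seconds / 86_400


def required_rate_mbit(size_bytes: float, days: float) -> float:
    """Sustained rate in Mbit/s needed to move size_bytes in the given number of days."""
    return size_bytes * 8 / (days * 86_400) / 1e6


GB = 1e9  # assumption: decimal gigabytes

# 250 Gbytes at exactly 1 Mbit/s
print(round(transfer_days(250 * GB, 1e6), 1))       # 23.1
# Rate that would finish 250 Gbytes in 10 days
print(round(required_rate_mbit(250 * GB, 10), 2))   # 2.31
```

Under these assumptions a 10-day transfer corresponds to a little over 2 Mbit/sec, so the "~1 Mbit/sec" figure is best read as an order-of-magnitude estimate.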

I realize I could scrape the sequences from the 15 Tbytes of structure prediction files that the Atlas provides -- but I'd prefer to not download 500 times more data than I actually need.

Thanks!

Tom Goddard
University of California, San Francisco
ChimeraX molecular visualization software developer

tomsercu · 2022-11-15T03:15:58Z

In #341 we just provided the fasta file: s3://dl.fbaipublicfiles.com/esmatlas/v0/full/mgnify90.fasta
Can you try downloading from S3 via aria2c or s5cmd? If that's too slow we could look at providing a compressed version.
cc @ebetica we should also document this in the README

tomgoddard · 2022-11-15T09:38:32Z

Thanks! Those downloaded in 1 hour and are just what I need. It would be helpful to mention those files in the ESM Metagenomic Atlas GitHub README, where the other downloadable files are described:

https://github.com/facebookresearch/esm/blob/main/scripts/atlas/README.md

tomgoddard · 2022-11-15T19:22:53Z

The mgnify90.fasta you provided is useful, but it is not quite what I want. It contains 623796864 sequences, while the ESM Metagenomic Atlas only contains 577944949 sequences according to the stats.parquet file you provide. I want a fasta file of exactly the sequences for which the ESM Metagenomic Atlas has predicted structures. Of course I can filter mgnify90.fasta down to the desired set of sequences, but I think it would help the community if the sequences for atlas predictions were available for download. Thanks!
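The filtering step mentioned above could be sketched as follows. This is a minimal streaming filter, not the method anyone in this thread actually used. It assumes the sequence id is the first whitespace-delimited token of each FASTA header, and the ids in the toy data are hypothetical placeholders.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines: Iterable[str], keep_ids: Set[str]) -> Iterator[Tuple[str, str]]:
    """Yield only the records whose id (assumed: first header token) is in keep_ids."""
    for header, seq in read_fasta(lines):
        fields = header.split()
        if fields and fields[0] in keep_ids:
            yield header, seq


# Toy input with hypothetical ids; a real run would iterate over an open file.
toy = [">MGYP_A extra=1\n", "MKV\n", "LLA\n", ">MGYP_B\n", "GGS\n", ">MGYP_C\n", "PPT\n"]
for header, seq in filter_fasta(toy, {"MGYP_A", "MGYP_C"}):
    print(f">{header}\n{seq}")
```

Because the file is read one line at a time, memory use is dominated by `keep_ids`. A Python set holding several hundred million id strings is likely to need tens of gigabytes, so a more compact id representation may be required at full atlas scale.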

tomsercu · 2022-11-16T23:49:27Z

Documenting the new files in #367.
It looks like stats.parquet is the old version; we should update it and provide the corresponding fasta.

tomgoddard · 2022-11-16T23:58:25Z

By the way, the stats.parquet file contains duplicate MGnify ids. That file has 577944949 rows but only 577424953 unique MGnify ids in those rows, so about 520,000 duplicates. No idea why.
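A duplicate check like the one described above can be sketched with the standard library. Reading the ids out of stats.parquet would need a Parquet reader (for example pandas or pyarrow), and the name of the id column is not given in this thread, so the sketch takes any iterable of ids. The function name and toy ids are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def duplicate_summary(ids: Iterable[str]) -> Tuple[int, int, Dict[str, int]]:
    """Return (total rows, unique ids, {id: count} for ids seen more than once)."""
    counts = Counter(ids)
    total = sum(counts.values())
    unique = len(counts)
    duplicated = {key: n for key, n in counts.items() if n > 1}
    return total, unique, duplicated


# Toy ids standing in for the id column of the table.
total, unique, duplicated = duplicate_summary(["a", "b", "a", "c", "b", "a"])
print(total, unique, total - unique)  # 6 3 3
print(duplicated)                     # {'a': 3, 'b': 2}

# The figures quoted in the comment: rows minus unique ids.
print(577944949 - 577424953)          # 519996
```

The difference between rows and unique ids counts the surplus rows, which matches the "about 520,000" quoted above.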

tomsercu · 2022-11-22T16:43:16Z

We now provide s3://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta with 617051007 records precisely matching stats.parquet.
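One way to check that a FASTA file and an id table agree is to compare their id sets in both directions. The sketch below is illustrative only and is not how the file above was validated. It assumes the id is the first whitespace-delimited token of each header, and the function names and toy ids are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def fasta_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield the id (assumed: first header token) of every FASTA record."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def compare_ids(fasta_lines: Iterable[str], table_ids: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (ids only in the table, ids only in the FASTA)."""
    in_fasta = set(fasta_ids(fasta_lines))
    in_table = set(table_ids)
    return in_table - in_fasta, in_fasta - in_table


missing, extra = compare_ids([">A note\n", "MKV\n", ">B\n", "GGS\n"], ["A", "C"])
print(missing, extra)  # {'C'} {'B'}
```

Two empty sets mean the ids match exactly. Comparing sets ignores repeated ids, so record and row counts should be checked separately.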

tomsercu · 2022-11-22T16:46:16Z

Note that mgnify90.fasta stays in place and contains the full, raw MGnify90 sequence set, including sequences for which we didn't make predictions.

Let me know if you notice anything else missing - and again thank you for bringing all of this to our attention!

Documenting stats.parquet and atlas.fasta, see #366 and #376

tomgoddard · 2022-11-22T19:46:46Z

Thanks for quickly solving this problem! The atlas.fasta file is exactly what I needed.


tomsercu assigned ebetica Nov 16, 2022

tomsercu assigned tomsercu and unassigned ebetica Nov 22, 2022

tomsercu mentioned this issue Nov 22, 2022

stats.parquet file for ESM Metagenomic Atlas has duplicate entries #376

Closed

tomsercu added a commit that referenced this issue Nov 22, 2022

documenting stats and fasta, see #366 and #376

5e2d2c5

tomsercu closed this as completed Nov 22, 2022

tomsercu mentioned this issue Nov 22, 2022

How to search sequences in bulk？ #341

Closed

tomsercu added a commit that referenced this issue Nov 22, 2022

Merge pull request #383 from facebookresearch/atlas_stats_fasta

4f126ca

Documenting stats.parquet and atlas.fasta, see #366 and #376

tomgoddard mentioned this issue Nov 23, 2022

Is the current ESM Metagenomic Atlas called version 0 or version 1? #384

Closed

