DeepClust data files
Files pertaining to
- Buchfink BJ, Barbé É, Ashkenazy H, Reuter K, Kennedy JA, Drost HG, "Clustering the protein universe of life using DIAMOND DeepClust", Nature Methods 23, 724-727 (2026). doi:10.1038/s41592-026-03030-z
Download links: https://deepclust.objectstore.hpccloud.mpcdf.mpg.de/index.html
Make sure you have enough disk space available (2 TB for downloading, up to 5 TB while decompressing all files) and a stable high-bandwidth Internet connection. The use of a download manager is recommended on desktop systems. In addition, the following options are available:
An S3 client (e.g. Cyberduck, minio-client, rclone) can be used with the URL s3://deepclust.objectstore.hpccloud.mpcdf.mpg.de.
To download via https on the command line, the following wget command can be used:
wget --recursive --level=1 --execute robots=off --no-parent --no-host-directories --cut-dirs=2 https://deepclust.objectstore.hpccloud.mpcdf.mpg.de/index.html
The total compressed size of this download resource is 1,906,886 MB (about 1.9 TB).
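Given the disk space figures above, it can help to check free space before starting the download. The following is a minimal sketch, not part of the DeepClust tooling: `avail_gib` and `has_space` are hypothetical helper names, and the 5000 GiB threshold is an assumed, slightly conservative stand-in for the 5 TB stated above.

```shell
#!/bin/sh
# Hypothetical helper: print the free space (in GiB) of the filesystem
# that holds directory $1.
avail_gib() {
    # -P forces the one-line-per-filesystem POSIX format, -k reports
    # 1024-byte blocks; column 4 of the data line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Hypothetical helper: succeed if directory $1 has at least $2 GiB free.
has_space() {
    [ "$(avail_gib "$1")" -ge "$2" ]
}

# Assumed threshold: 5000 GiB, a conservative reading of "up to 5 TB".
if has_space . 5000; then
    echo "Enough free space in the current directory."
else
    echo "Only $(avail_gib .) GiB free in the current directory."
fi
```

Run the check from the directory that will receive the download, or pass the target directory instead of `.`.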
File description
joined_with_index_RowGroupFinal.parquet: Parquet file containing all sequences clustered with DIAMOND DeepClust mentioned in the publication.
clust_index_RowGroup.parquet: Index file indicating where in joined_with_index_RowGroupFinal.parquet a cluster can be found.
persistent: DuckDB Database created from clust_index_RowGroup.parquet.
SeqIdMapClustId.parquet: Parquet file that maps every sequence ID in the DeepClust database to the cluster to which it has been assigned.
(These files are contained within the archive DeepClustParquet.tar.zst.) For more information see: https://github.com/drostlab/deepclust_dataretrieval
clust_bigg_2.mmseqs: MMseqs2-formatted database containing all clusters from the DeepClust database with more than two members, for use in protein structure prediction and ColabFold.
clust_bigg2.fa: FASTA file containing all centroids representing clusters with more than two members.
(These files are contained within the archive clust_bigg2_mmseqs_db.tar.zst.)
For more information see: https://github.com/drostlab/deepclust_colabfold
To decompress the data from the split tar archives, use:
cat DeepClustParquet.tar.zst.?? | tar --zstd -xvf -
DeepClustParquet
DeepClustParquet/persistent
DeepClustParquet/clust_index_RowGroup.parquet
DeepClustParquet/MD5SUM.txt
DeepClustParquet/SeqIdMapClustId.parquet
DeepClustParquet/joined_with_index_RowGroupFinal.parquet
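The `??` glob in the command above matches every file whose name ends in the archive prefix plus a two-character suffix, and the shell expands the matches in sorted order, so the parts are concatenated in sequence. Since a missing part only shows up as an error late in a long extraction, counting the parts first can be worthwhile. The sketch below assumes the parts sit in the current directory; `count_parts` is a hypothetical helper, and the expected number of parts has to be taken from the download index page.

```shell
#!/bin/sh
# Hypothetical helper: print how many split parts exist for the archive
# prefix given as $1 (for example DeepClustParquet.tar.zst).
count_parts() {
    # Replace the positional parameters with the glob matches.
    set -- "$1".??
    # If nothing matched, the pattern is left unexpanded and names no file.
    if [ -e "$1" ]; then
        echo "$#"
    else
        echo 0
    fi
}

echo "DeepClustParquet parts found: $(count_parts DeepClustParquet.tar.zst)"
```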
After extracting the archives run
md5sum -c MD5SUM.txt
to check the integrity of the files based on their MD5 hash.
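Because MD5SUM.txt is extracted into the archive's own directory, the check is most easily run from inside that directory. The sketch below assumes that the paths listed in MD5SUM.txt are relative to the directory that contains it, which has not been verified here; `verify_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: verify the checksums listed in $1/MD5SUM.txt.
# Assumption: the paths inside MD5SUM.txt are relative to directory $1.
verify_dir() {
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$1" && md5sum -c MD5SUM.txt )
}

# Usage: verify_dir DeepClustParquet
# The exit status is 0 only if every listed file matches its checksum.
```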