
DeepClust data files

Benjamin J. Buchfink edited this page Jun 10, 2026 · 2 revisions

Files pertaining to

  • Buchfink BJ, Barbé É, Ashkenazy H, Reuter K, Kennedy JA, Drost HG, "Clustering the protein universe of life using DIAMOND DeepClust", Nature Methods 23, 724-727 (2026). doi:10.1038/s41592-026-03030-z

Download links: https://deepclust.objectstore.hpccloud.mpcdf.mpg.de/index.html

Download instructions

Make sure you have enough disk space available (2 TB for downloading, up to 5 TB while decompressing all files) and a stable high-bandwidth Internet connection. The use of a download manager is recommended on desktop systems. In addition, the following options are available:

An S3 client (e.g. Cyberduck, minio-client, rclone) can be used with the URL s3://deepclust.objectstore.hpccloud.mpcdf.mpg.de.

To download via HTTPS on the command line, the following wget command can be used:

wget --recursive --level=1 --execute robots=off --no-parent --no-host-directories --cut-dirs=2 https://deepclust.objectstore.hpccloud.mpcdf.mpg.de/index.html

The total compressed size of this download is 1,906,886 MB (about 1.9 TB).
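Given the disk space requirements stated above (2 TB for downloading, up to 5 TB while decompressing), it can help to check the free space on the target filesystem before starting. The following is a minimal sketch, not part of the DeepClust tooling: the function names `free_gib` and `has_free_space` are hypothetical, and the sketch assumes a POSIX `df` and a mount whose device name contains no spaces.

```shell
# Print the free space, in whole GiB, on the filesystem holding the given directory.
free_gib() {
    # -P selects the stable POSIX column layout; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space in KiB.
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if directory $1 has at least $2 GiB free.
has_free_space() {
    local dir="$1" required_gib="$2" available_gib
    available_gib=$(free_gib "$dir")
    [ "$available_gib" -ge "$required_gib" ]
}
```

For example, `has_free_space /path/to/target 5000 || echo "not enough space"` checks for roughly 5 TB before unpacking; the path and the threshold are placeholders to adapt.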

File Description

joined_with_index_RowGroupFinal.parquet: Parquet file containing all sequences clustered with DIAMOND DeepClust, as described in the publication.

clust_index_RowGroup.parquet: Index file indicating where in joined_with_index_RowGroupFinal.parquet a cluster can be found.

persistent: DuckDB database created from clust_index_RowGroup.parquet.

SeqIdMapClustId.parquet: Parquet file containing all sequence IDs from the DeepClust database, which can be used to map each sequence to the cluster it was assigned to.

(These files are contained within the archive DeepClustParquet.tar.zst.)

For more information see: https://github.com/drostlab/deepclust_dataretrieval

clust_bigg_2.mmseqs: MMseqs2-formatted database containing all clusters from the DeepClust database with more than two members, for use in protein structure prediction and ColabFold.

clust_bigg2.fa: FASTA file containing all centroids representing clusters with more than two members.

(These files are contained within the archive clust_bigg2_mmseqs_db.tar.zst.)

For more information see: https://github.com/drostlab/deepclust_colabfold

DeepClustParquet

How to unpack the archives

To decompress the data from the split tar archives, use:

cat DeepClustParquet.tar.zst.?? | tar --zstd -xv
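If the installed tar does not support the `--zstd` option, the same result can be obtained by piping through the `zstd` command-line tool. The sketch below wraps this in a helper that also reports a missing download instead of failing inside tar. The function name `extract_split_archive` is hypothetical, and the sketch assumes that `zstd` is installed and that the parts carry two-character suffixes, as in the glob above.

```shell
# Reassemble and unpack a split .tar.zst archive.
# $1 is the archive name without the part suffix, e.g. DeepClustParquet.tar.zst
extract_split_archive() {
    local prefix="$1"
    # Expand the two-character part suffixes into the positional parameters.
    set -- "$prefix".??
    if [ ! -e "$1" ]; then
        echo "no parts found matching $prefix.??" >&2
        return 1
    fi
    echo "unpacking $# part(s) of $prefix" >&2
    # The shell sorts glob matches, so the parts are concatenated in suffix order.
    cat "$@" | zstd -dc | tar -xvf -
}
```

Calling `extract_split_archive DeepClustParquet.tar.zst` in the download directory is then equivalent to the command above.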

List of files

DeepClustParquet

DeepClustParquet/persistent

DeepClustParquet/clust_index_RowGroup.parquet

DeepClustParquet/MD5SUM.txt

DeepClustParquet/SeqIdMapClustId.parquet

DeepClustParquet/joined_with_index_RowGroupFinal.parquet

Data integrity check

After extracting the archives, run

md5sum -c MD5SUM.txt

to check the integrity of the files based on their MD5 hash.
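Since MD5SUM.txt is located inside the extracted DeepClustParquet directory, the check may need to be run from within that directory. The sketch below does so in a subshell. It assumes that the paths listed in MD5SUM.txt are relative to the directory containing it (not confirmed here) and that GNU coreutils `md5sum` is available; the function name `verify_md5` is hypothetical.

```shell
# Verify the checksums listed in MD5SUM.txt inside an extracted directory.
# $1 is the directory to check (defaults to DeepClustParquet).
verify_md5() {
    local dir="${1:-DeepClustParquet}"
    if [ ! -f "$dir/MD5SUM.txt" ]; then
        echo "missing $dir/MD5SUM.txt" >&2
        return 1
    fi
    # The subshell keeps the caller's working directory unchanged.
    # --quiet suppresses the OK lines, so only failing files are printed.
    ( cd "$dir" && md5sum --quiet -c MD5SUM.txt )
}
```

A zero exit status from `verify_md5 DeepClustParquet` indicates that every listed file matched its recorded hash.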
