Chordclust implements similarity clustering using rust-bio.
The algorithm is a greedy search, similar to the UCLUST algorithm described at https://www.drive5.com/usearch/manual/uclust_algo.html. For now, it scores matches by similarity rather than identity:
- Sort the sequences by length (longest first)
- For each sequence, compare it with the database of centroids:
  - If the similarity with the best match is > T: add it to the cluster of the best match
  - Else: form a new cluster
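
The steps above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's actual implementation or API: the function names `similarity` and `greedy_cluster` are hypothetical, and the similarity measure (fraction of matching positions over the longer length) is an assumed stand-in for an alignment-based score such as one computed with rust-bio.

```rust
/// Hypothetical similarity measure (an assumption for this sketch):
/// fraction of positions that match, normalised by the longer sequence.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 1.0; // two empty sequences are treated as identical
    }
    let matches = a.bytes().zip(b.bytes()).filter(|(x, y)| x == y).count();
    matches as f64 / longest as f64
}

/// Greedy clustering sketch. Returns clusters as lists of indices into
/// `seqs`; the first index of each cluster is its centroid.
fn greedy_cluster(seqs: &[&str], threshold: f64) -> Vec<Vec<usize>> {
    // Visit sequences from longest to shortest (stable sort keeps input order on ties).
    let mut order: Vec<usize> = (0..seqs.len()).collect();
    order.sort_by_key(|&i| std::cmp::Reverse(seqs[i].len()));

    let mut clusters: Vec<Vec<usize>> = Vec::new();
    for i in order {
        // Find the centroid with the highest similarity to sequence `i`.
        let mut best: Option<(usize, f64)> = None;
        for (c, cluster) in clusters.iter().enumerate() {
            let s = similarity(seqs[i], seqs[cluster[0]]);
            if best.map_or(true, |(_, best_s)| s > best_s) {
                best = Some((c, s));
            }
        }
        match best {
            // Similar enough: join the cluster of the best match.
            Some((c, s)) if s > threshold => clusters[c].push(i),
            // Otherwise this sequence becomes the centroid of a new cluster.
            _ => clusters.push(vec![i]),
        }
    }
    clusters
}
```

Because sequences are visited longest first, each centroid is at least as long as every member later added to its cluster.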
With this kind of heuristic clustering, a hierarchical approach is recommended:
- Given the sequences to cluster `seqs` and a descending array of similarity thresholds `[T]`.
- For each similarity threshold `T` in `[T]`:
  - Apply clustering with `T` to `seqs`
  - `seqs` <- current centroids
- The final structure is built by expanding the lower similarity clusters with the members of their corresponding higher clusters.
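
The hierarchical procedure can be sketched as follows. As before, this is an illustrative sketch under stated assumptions rather than the crate's API: `similarity`, `greedy_cluster`, and `hierarchical_cluster` are hypothetical names, and the position-matching similarity is an assumed placeholder for an alignment-based score.

```rust
/// Hypothetical similarity (assumption): matching positions over the longer length.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 1.0;
    }
    let matches = a.bytes().zip(b.bytes()).filter(|(x, y)| x == y).count();
    matches as f64 / longest as f64
}

/// Greedy clustering; the first index of each returned cluster is its centroid.
fn greedy_cluster(seqs: &[&str], threshold: f64) -> Vec<Vec<usize>> {
    let mut order: Vec<usize> = (0..seqs.len()).collect();
    order.sort_by_key(|&i| std::cmp::Reverse(seqs[i].len()));
    let mut clusters: Vec<Vec<usize>> = Vec::new();
    for i in order {
        let mut best: Option<(usize, f64)> = None;
        for (c, cluster) in clusters.iter().enumerate() {
            let s = similarity(seqs[i], seqs[cluster[0]]);
            if best.map_or(true, |(_, best_s)| s > best_s) {
                best = Some((c, s));
            }
        }
        match best {
            Some((c, s)) if s > threshold => clusters[c].push(i),
            _ => clusters.push(vec![i]),
        }
    }
    clusters
}

/// Runs greedy clustering once per threshold (expected in descending order),
/// feeding the centroids of one level into the next. Each returned level holds
/// the threshold and its clusters, expressed as indices into the original `seqs`.
fn hierarchical_cluster(seqs: &[&str], thresholds: &[f64]) -> Vec<(f64, Vec<Vec<usize>>)> {
    // `groups[k]` lists every original index represented by centroid `k`;
    // `groups[k][0]` is the centroid itself.
    let mut groups: Vec<Vec<usize>> = (0..seqs.len()).map(|i| vec![i]).collect();
    let mut levels = Vec::new();
    for &t in thresholds {
        // Only the current centroids are clustered at this level.
        let centroids: Vec<&str> = groups.iter().map(|g| seqs[g[0]]).collect();
        let clustered = greedy_cluster(&centroids, t);
        // Expand each cluster of centroids with the members of the
        // higher-similarity clusters those centroids stand for.
        let mut next: Vec<Vec<usize>> = Vec::new();
        for cluster in &clustered {
            let mut members = Vec::new();
            for &k in cluster {
                members.extend_from_slice(&groups[k]);
            }
            next.push(members);
        }
        groups = next;
        levels.push((t, groups.clone()));
    }
    levels
}
```

Calling `hierarchical_cluster(&seqs, &[0.8, 0.4])` would return one entry per threshold, with the clusters at 0.4 containing the full membership of the 0.8 clusters they absorbed.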
Licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
README.md is automatically generated on CI using cargo-readme. Please modify README.tpl or lib.rs instead (check the GitHub workflow for more details).