Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add seqhash v2 #398

Closed
wants to merge 9 commits into from
Closed

Add seqhash v2 #398

wants to merge 9 commits into from

Conversation

Koeng101
Copy link
Contributor

@Koeng101 Koeng101 commented Nov 10, 2023

Changes in this PR

Fixes #397 . This PR makes seqhash v2, which is both compressed (~3x smaller than seqhash v1) and has compatibility for encoding fragments as seqhashes.

This PR has 100% test coverage.

Why are you making these changes?

Seqhash v1 produces hashes that are too large, and cannot be used to encode genetic parts / fragments.

Are any changes breaking? (IMPORTANT)

No

Pre-merge checklist

All of these must be satisfied before this PR is considered
ready for merging. Mergeable PRs will be prioritized for review.

  • New packages/exported functions have docstrings.
  • New/changed functionality is thoroughly tested.
  • New/changed functionality has a function giving an example of its usage in the associated test file. See primers/primers_test.go for what this might look like.
  • Changes are documented in CHANGELOG.md in the [Unreleased] section.
  • All code is properly formatted and linted.
  • The PR template is filled out.

@Koeng101
Copy link
Contributor Author

Don't merge this yet: I am unsatisfied with the fragment algorithm, and want to fix it up.

@Koeng101
Copy link
Contributor Author

Ready for check @TimothyStiles

Copy link
Collaborator

@TimothyStiles TimothyStiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only general comment is how do you feel about renaming Hash2 to HashV2 @Koeng101?

@Koeng101
Copy link
Contributor Author

@TimothyStiles Feel good about it. Changed.

Copy link
Collaborator

@TimothyStiles TimothyStiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Koeng101 I'm a little on the fence on wanting to support multiple versions of seqhash.

Least Rotation is super applicable and allows people to hash plasmids reliably.

Seqhash V1 and V2 don't feel as generally applicable or stable. At the time seqhash V1 came out I wasn't very familiar with sketching, bwt, and other non-cryptographic, non-alignment based methods of searching and comparing sequences. Now I'm wondering if V2 should have a little more consideration in what goes into the ID besides the metadata itself. Should the non-metadata string component be non-cryptographic and more similar to bwt for speed and better comparison? Should V2 just be the metadata and we let users choose their own hashing method?

Another thought. How many different hash algorithms should we intend to support and how should they be named? What's the difference between blake 1, 2, & 3? Is it just a speed thing? Same could be said for the seqhash family but in this case there are some true semantic difference in both input and output. Should seqhashV2 be called FragHash?

seqhashV1 felt more like a reference implementation and a good short hand than a standard but with the advent of seqhashV2 it starts bringing up the questions, "When should I use V1 rather than V2?", and, "how many versions will there be and how will I know which one I should be using if there's, V3, V4, etc?". V2 gives the assumption that you should always use the most up to date hash version. Is there any case where that's not true and it would be better to use V1 instead of V2?

// First, run checks and get the determistic sequence of the hash
for _, char := range sequence {
if !strings.Contains("ATUGCYRSWKMBDHVNZ", string(char)) {
return result, errors.New("Only letters ATUGCYRSWKMBDHVNZ are allowed for DNA/RNA. Got letter: " + string(char))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sequence hash for amino acids?

Copy link
Contributor Author

@Koeng101 Koeng101 Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't fragment proteins because they aren't double stranded. (ie, if you fragment them, proteins just become 2 proteins)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case then maybe this should be called something like FragHash rather than V2?

Copy link
Contributor Author

@Koeng101 Koeng101 Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's not the main V2 function? It's called HashV2Fragment to complement HashV2, which is an entirely different function

fmt.Println(sequenceSeqhash)
// Output: v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350
// Output: C_JPQCj5PgjFwjy7jaoYmwqQ==
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API usage changes between V2 examples and V2 <-> V1? Should be unified and expect similar behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API usage changes between V2 examples and V2 <-> V1? Should be unified and expect similar behavior.

Do you mean the change to [16]byte?

@Koeng101
Copy link
Contributor Author

Koeng101 commented Nov 29, 2023

Least Rotation is super applicable and allows people to hash plasmids reliably.

It doesn't though - you still have to do least comparison in order to get reverse complements.

Should the non-metadata string component be non-cryptographic and more similar to bwt for speed and better comparison?

I don't think they should be used for comparison at least, not many opinions on cryptography but it's hard to fix non cyptographic later if someone maliciously breaks things. Seqhash is used as an identifier - the kmer table or mash or blast index or minimap index or whatever can be on the other side of the index. Let's you change what your comparison algorithm is, which is much more important that your hashing algorithm.

Should V2 just be the metadata and we let users choose their own hashing method?

No, I don't think this is a good idea when trying to make relatively stable identifiers.

How many different hash algorithms should we intend to support and how should they be named?

However many are necessary! And I'm not really sure about naming - but might be good to think about.

Same could be said for the seqhash family but in this case there are some true semantic difference in both input and output. Should seqhashV2 be called FragHash?

I think I explain the use case for seqhashV2 pretty clearly in the package documentation. It shouldn't be called FragHash because it isn't just for fragments, it's for DNA/RNA/Proteins/Fragments. Fragment hashing is just a feature of v2 that v1 did not have.

"When should I use V1 rather than V2?"

This is a good question that I should add to the top of the package docs, but I do think it is answered if you read the v2 documentation. ie, V2 is much smaller but sacrifices length from 256 to 120 bits, increasing collision chance. But more people will probably have that question and not bother to read the full docs.

"how many versions will there be

"How many versions will there be" I think is literally unanswerable because you can't fully predict the future, and if it was predictable, that would be bad, because then it would be inflexible to the changing needs of users. The point of having versions is so that you can add more later, otherwise just do away with versioning at all.

how will I know which one I should be using if there's, V3, V4, etc?".

Well, presumably we can explain that to the user.

V2 gives the assumption that you should always use the most up to date hash version. Is there any case where that's not true and it would be better to use V1 instead of V2?

Rather than use 256 bits for encoding
the hash, we use 120 bits. Since seqhashes are not meant for security, this
is good enough (50% collision with 1.3x10^18 hashes), while making them
conveniently only 16 btyes long

Less collision chance. But generally speaking, I think you should probably always use V2 over V1.

TODO for me:

  • Add doc at top clearly explaining difference between v1 and v2

@Koeng101 Koeng101 closed this Dec 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compressed Seqhash
3 participants