Add seqhash v2 #398

Koeng101 · 2023-11-10T07:06:05Z

Changes in this PR

Fixes #397. This PR adds seqhash v2, which is both compressed (~3x smaller than seqhash v1) and supports encoding fragments as seqhashes.

This PR has 100% test coverage.

Why are you making these changes?

Seqhash v1 produces hashes that are too large, and cannot be used to encode genetic parts / fragments.

Are any changes breaking? (IMPORTANT)

No

Pre-merge checklist

All of these must be satisfied before this PR is considered
ready for merging. Mergeable PRs will be prioritized for review.

New packages/exported functions have docstrings.
New/changed functionality is thoroughly tested.
New/changed functionality has a function giving an example of its usage in the associated test file. See primers/primers_test.go for what this might look like.
Changes are documented in CHANGELOG.md in the [Unreleased] section.
All code is properly formatted and linted.
The PR template is filled out.

Koeng101 · 2023-11-13T16:38:45Z

Don't merge this yet: I am unsatisfied with the fragment algorithm, and want to fix it up.

seqhash/seqhash.go

Koeng101 · 2023-11-17T19:14:00Z

Ready for check @TimothyStiles

TimothyStiles

Only general comment is how do you feel about renaming Hash2 to HashV2 @Koeng101?

Koeng101 · 2023-11-29T17:56:03Z

@TimothyStiles Feel good about it. Changed.

TimothyStiles

@Koeng101 I'm a little on the fence on wanting to support multiple versions of seqhash.

Least Rotation is super applicable and allows people to hash plasmids reliably.

Seqhash V1 and V2 don't feel as generally applicable or stable. At the time seqhash V1 came out I wasn't very familiar with sketching, bwt, and other non-cryptographic, non-alignment based methods of searching and comparing sequences. Now I'm wondering if V2 should have a little more consideration in what goes into the ID besides the metadata itself. Should the non-metadata string component be non-cryptographic and more similar to bwt for speed and better comparison? Should V2 just be the metadata and we let users choose their own hashing method?

Another thought: how many different hash algorithms should we intend to support, and how should they be named? What's the difference between BLAKE 1, 2, and 3? Is it just a speed thing? The same could be asked of the seqhash family, but in this case there are some true semantic differences in both input and output. Should seqhashV2 be called FragHash?

seqhashV1 felt more like a reference implementation and a good shorthand than a standard, but the advent of seqhashV2 raises the questions "When should I use V1 rather than V2?" and "How many versions will there be, and how will I know which one I should be using if there's a V3, V4, etc.?" V2 gives the impression that you should always use the most up-to-date hash version. Is there any case where that's not true and it would be better to use V1 instead of V2?

TimothyStiles · 2023-11-29T18:34:56Z

seqhash/seqhash.go

+	// First, run checks and get the determistic sequence of the hash
+	for _, char := range sequence {
+		if !strings.Contains("ATUGCYRSWKMBDHVNZ", string(char)) {
+			return result, errors.New("Only letters ATUGCYRSWKMBDHVNZ are allowed for DNA/RNA. Got letter: " + string(char))


No sequence hash for amino acids?

You can't fragment proteins because they aren't double-stranded (i.e., if you fragment them, proteins just become two proteins).

If that's the case then maybe this should be called something like FragHash rather than V2?

But it's not the main V2 function? It's called HashV2Fragment to complement HashV2, which is an entirely different function
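For reference, the character check in the diff above can be sketched as a standalone function. This is only an illustration, not the code in this PR: `validateNucleotides` and `buildAllowed` are hypothetical names, and the lookup table is one possible alternative to calling `strings.Contains` per character. The allowed letter set is copied from the diff.

```go
package main

import "fmt"

// allowedNucleotides is the letter set from the diff above.
const allowedNucleotides = "ATUGCYRSWKMBDHVNZ"

// buildAllowed returns a lookup table indexed by byte value.
func buildAllowed(letters string) [256]bool {
	var table [256]bool
	for i := 0; i < len(letters); i++ {
		table[letters[i]] = true
	}
	return table
}

var allowed = buildAllowed(allowedNucleotides)

// validateNucleotides (hypothetical helper) returns an error naming the
// first byte that is not in the allowed DNA/RNA letter set.
func validateNucleotides(sequence string) error {
	for i := 0; i < len(sequence); i++ {
		if !allowed[sequence[i]] {
			return fmt.Errorf("only letters %s are allowed for DNA/RNA, got %q at position %d",
				allowedNucleotides, sequence[i], i)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateNucleotides("ATGCN"))
	fmt.Println(validateNucleotides("ATGXC"))
}
```

Amino acid letters such as `F` or `L` fail this check, which matches the point above that fragment hashing is restricted to DNA/RNA.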

TimothyStiles · 2023-11-29T18:41:22Z

seqhash/example_test.go

 	fmt.Println(sequenceSeqhash)
-	// Output: v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350
+	// Output: C_JPQCj5PgjFwjy7jaoYmwqQ==


API usage changes between V2 examples and V2 <-> V1? Should be unified and expect similar behavior.

> API usage changes between V2 examples and V2 <-> V1? Should be unified and expect similar behavior.

Do you mean the change to [16]byte?
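To make the size difference in the example output concrete, here is a small sketch that inspects the two strings from the diff above. It assumes the part after the underscore in the v2 string is standard padded base64, which the `==` suffix suggests but this thread does not confirm. `decodeV2Payload` is a hypothetical helper, not part of the seqhash package.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// Both strings are copied from the example_test.go diff above.
const (
	v1Example = "v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350"
	v2Example = "C_JPQCj5PgjFwjy7jaoYmwqQ=="
)

// decodeV2Payload (hypothetical) splits off the prefix before the first
// underscore and decodes the remainder as standard base64 (an assumption).
func decodeV2Payload(encoded string) (string, []byte, error) {
	prefix, payload, found := strings.Cut(encoded, "_")
	if !found {
		return "", nil, fmt.Errorf("no underscore separator in %q", encoded)
	}
	raw, err := base64.StdEncoding.DecodeString(payload)
	if err != nil {
		return "", nil, err
	}
	return prefix, raw, nil
}

func main() {
	prefix, raw, err := decodeV2Payload(v2Example)
	if err != nil {
		panic(err)
	}
	fmt.Printf("prefix=%s payload=%d bytes\n", prefix, len(raw))
	fmt.Printf("v1 string: %d chars, v2 string: %d chars\n", len(v1Example), len(v2Example))
}
```

Under that assumption the payload decodes to 16 bytes, which lines up with the `[16]byte` return type mentioned above.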

Koeng101 · 2023-11-29T19:30:34Z

> Least Rotation is super applicable and allows people to hash plasmids reliably.

It doesn't, though: you still have to do a least comparison to handle reverse complements.
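A minimal sketch of that point, assuming a circular, double-stranded sequence made only of A/C/G/T. The function names are hypothetical and the rotation search is a naive O(n^2) scan rather than whatever the package actually uses. Least rotation alone gives different answers for the two strands, so the lesser of the two rotated strands is taken as the canonical form.

```go
package main

import "fmt"

// reverseComplement handles only A, C, G and T in this sketch.
func reverseComplement(seq string) string {
	complement := map[byte]byte{'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		out[len(seq)-1-i] = complement[seq[i]]
	}
	return string(out)
}

// leastRotation returns the lexicographically smallest rotation of seq
// using a naive scan over every rotation.
func leastRotation(seq string) string {
	if seq == "" {
		return seq
	}
	doubled := seq + seq
	best := seq
	for i := 1; i < len(seq); i++ {
		if candidate := doubled[i : i+len(seq)]; candidate < best {
			best = candidate
		}
	}
	return best
}

// canonicalCircular rotates both strands and keeps the lesser one, so any
// rotation of either strand maps to the same string.
func canonicalCircular(seq string) string {
	fwd := leastRotation(seq)
	rev := leastRotation(reverseComplement(seq))
	if rev < fwd {
		return rev
	}
	return fwd
}

func main() {
	fmt.Println(leastRotation("GATTACA"), leastRotation("TGTAATC"))
	fmt.Println(canonicalCircular("GATTACA"), canonicalCircular("TGTAATC"))
}
```

Here `TGTAATC` is the reverse complement of `GATTACA`: the two least rotations differ, while the canonical forms agree.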

> Should the non-metadata string component be non-cryptographic and more similar to bwt for speed and better comparison?

I don't think they should be used for comparison, at least. I don't have many opinions on cryptography, but it's hard to fix a non-cryptographic hash later if someone maliciously breaks things. Seqhash is used as an identifier: the k-mer table, Mash, BLAST index, minimap index, or whatever can sit on the other side of the index. That lets you change your comparison algorithm, which is much more important than your hashing algorithm.

> Should V2 just be the metadata and we let users choose their own hashing method?

No, I don't think this is a good idea when trying to make relatively stable identifiers.

> How many different hash algorithms should we intend to support and how should they be named?

However many are necessary! And I'm not really sure about naming - but might be good to think about.

> Same could be said for the seqhash family but in this case there are some true semantic differences in both input and output. Should seqhashV2 be called FragHash?

I think I explain the use case for seqhashV2 pretty clearly in the package documentation. It shouldn't be called FragHash because it isn't just for fragments, it's for DNA/RNA/Proteins/Fragments. Fragment hashing is just a feature of v2 that v1 did not have.

> "When should I use V1 rather than V2?"

This is a good question that I should address at the top of the package docs, but I do think it is answered if you read the v2 documentation: V2 is much smaller, but cuts the hash length from 256 to 120 bits, which increases the chance of collision. Still, more people will probably have that question and not bother to read the full docs.

> "How many versions will there be?"

I think "How many versions will there be?" is literally unanswerable because you can't fully predict the future, and if it were predictable, that would be bad, because then it would be inflexible to the changing needs of users. The point of having versions is that you can add more later; otherwise, just do away with versioning altogether.

> "How will I know which one I should be using if there's a V3, V4, etc.?"

Well, presumably we can explain that to the user.

> V2 gives the impression that you should always use the most up-to-date hash version. Is there any case where that's not true and it would be better to use V1 instead of V2?

From the package documentation:

> Rather than use 256 bits for encoding the hash, we use 120 bits. Since seqhashes are not meant for security, this is good enough (50% collision with 1.3x10^18 hashes), while making them conveniently only 16 bytes long

V1 has a lower chance of collision. But generally speaking, I think you should probably always use V2 over V1.
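The 50% figure in the quoted docs can be checked with the standard birthday approximation, n ≈ sqrt(2 · 2^bits · ln(1/(1-p))). The sketch below assumes hashes are uniformly distributed; `hashesForCollision` is a name made up for this example.

```go
package main

import (
	"fmt"
	"math"
)

// hashesForCollision returns the approximate number of uniformly random
// hashes of the given bit length needed to reach collision probability p,
// using the birthday approximation n = sqrt(2 * 2^bits * ln(1/(1-p))).
func hashesForCollision(bits, p float64) float64 {
	return math.Sqrt(2 * math.Exp2(bits) * math.Log(1/(1-p)))
}

func main() {
	fmt.Printf("120 bits: %.2e hashes for a 50%% collision chance\n", hashesForCollision(120, 0.5))
	fmt.Printf("256 bits: %.2e hashes for a 50%% collision chance\n", hashesForCollision(256, 0.5))
}
```

For 120 bits this comes out a little above 1.3x10^18, consistent with the docs, while 256 bits needs on the order of 10^38 hashes.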

TODO for me:

Add doc at top clearly explaining difference between v1 and v2

Keoni Gandall added 3 commits November 9, 2023 22:59

Add seqhash v2

99e280d

updated changelog

a5820ef

make linter happy

36de8b8

Koeng101 requested review from TimothyStiles and carreter November 10, 2023 07:11

matiasinsaurralde reviewed Nov 15, 2023

seqhash/seqhash.go

matiasinsaurralde mentioned this pull request Nov 16, 2023

Improve Hash2Fragment by using a map to validate allowed sequence characters #402

Closed

6 tasks

Keoni Gandall added 2 commits November 17, 2023 11:10

seqhash

f716e5a

add changelog

c720a32

TimothyStiles reviewed Nov 28, 2023

Hash2->HashV2

f12dba0

TimothyStiles reviewed Nov 29, 2023

TimothyStiles and others added 3 commits November 29, 2023 12:45

renamed fwd and rev

9ae38fa

Add top level comment

44e9cce

Merge branch 'main' into seqhash_compressed

514406f

Koeng101 closed this Dec 7, 2023
