Compressed Seqhash #397

Closed
Koeng101 opened this issue Nov 8, 2023 · 0 comments
Labels
easy, enhancement, good first issue, low priority, wontfix

Comments

Koeng101 commented Nov 8, 2023

What I want

I would like a more compressed Seqhash. Here is a current seqhash: v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350

Here is the hash portion encoded in base58: HRWk6jLXJ3uvuKBnjyAhinEUsuzKbgpphDkrEcStX4AT, which is much shorter: 44 characters instead of 64. If we truncate the hash to 16 bytes instead of 32, we get X8d1qRxANHFkdQM4kqKYWb, which is shorter still. (base58 is also nicer for embedding in various applications since it doesn't contain any special characters.)
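For illustration, here is a minimal sketch of that encoding in Go, assuming the github.com/mr-tron/base58 package (not a poly dependency; any base58 implementation would work):

```go
package main

import (
	"encoding/hex"
	"fmt"

	"github.com/mr-tron/base58"
)

func main() {
	// Hash portion of the example seqhash above (32 bytes, hex-encoded).
	hexHash := "f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350"
	raw, err := hex.DecodeString(hexHash)
	if err != nil {
		panic(err)
	}

	fmt.Println(base58.Encode(raw))      // full 32 bytes -> ~44 characters
	fmt.Println(base58.Encode(raw[:16])) // truncated to 16 bytes -> ~22 characters
}
```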

Since there are 8 bits in a byte, all of the flags can fit into a single leading byte (see the sketch after this list):

3 bits: seqhash version
1 bit: 15-byte hash (vs. the default 32 bytes). 15 bytes should be good enough for the majority of purposes while being roughly half the size, and the full seqhash then fits nicely in 16 bytes.
1 bit: circularity
1 bit: double-strandedness
2 bits: DNA/RNA/protein (the fourth value is left unspecified)
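Here is a sketch of what packing those flags into a single byte could look like; the bit positions and the packFlags helper are assumptions for illustration, not a finalized layout or an existing seqhash API:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Possible values for the 2-bit sequence-type field.
const (
	SequenceTypeDNA     = 0b00
	SequenceTypeRNA     = 0b01
	SequenceTypeProtein = 0b10
)

// packFlags packs the proposed metadata into one byte:
// [version:3][truncated:1][circular:1][doubleStranded:1][seqType:2].
func packFlags(version uint8, truncated, circular, doubleStranded bool, seqType uint8) byte {
	var b byte
	b |= (version & 0b111) << 5 // 3 bits: seqhash version
	if truncated {
		b |= 1 << 4 // 1 bit: 15-byte truncated hash (vs. full 32 bytes)
	}
	if circular {
		b |= 1 << 3 // 1 bit: circularity
	}
	if doubleStranded {
		b |= 1 << 2 // 1 bit: double-strandedness
	}
	b |= seqType & 0b11 // 2 bits: DNA/RNA/protein
	return b
}

func main() {
	digest := sha256.Sum256([]byte("ATGC")) // stand-in for the real seqhash digest
	flags := packFlags(1, true, false, true, SequenceTypeDNA)

	// 1 flag byte + 15 hash bytes = 16 bytes total, ready for base58 encoding.
	compressed := append([]byte{flags}, digest[:15]...)
	fmt.Printf("%x\n", compressed)
}
```

Base58-encoding those 16 bytes then gives the ~22-character identifier described below.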

This would result in seqhashes that are 16 bytes and take up 22 characters of text rather than the current 71. That is far better than, say, ~1000 characters when it comes to encoding a full protein.

I ask for 15-byte truncation rather than 16-byte truncation (120 bits vs. 128 bits) so that the flag byte can fit while the total stays at a round 16 bytes. (And to have a 50% chance of a collision with a 120-bit hash, you would need to generate approximately 1.357×10^18 hashes.)
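That collision figure follows from the standard birthday-bound approximation: for a b-bit hash, a 50% collision probability is reached around n ≈ 1.1774 × 2^(b/2), so for b = 120 that is n ≈ 1.1774 × 2^60 ≈ 1.36 × 10^18 hashes.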

Why this is important

I am looking at building seqhashes into applications that interact with LLMs. LLMs are very bad at handling raw sequences, since they usually want to introspect the inputs and outputs of APIs and interactive code. This introspection is also very useful for most users, since the AI can automatically improve if it can look at the full input and output of its code. However, when the LLM tries to look at the sequences themselves, it gets confused. This feature would greatly compress the amount of data needed to refer to genetic sequences.

Koeng101 added the enhancement, good first issue, easy, and low priority labels and removed the needs-triage label on Nov 8, 2023
TimothyStiles added the wontfix label on Dec 8, 2023