
Add svb compression to slow5 #341

Merged: 3 commits merged into ioToBio on Sep 16, 2023

Conversation

Koeng101 (Contributor)

Replaces #328, this time merging into ioToBio to take advantage of the new parser format.

@Koeng101 (Contributor, Author) commented Sep 11, 2023

Moving #328 here to merge into ioToBio.

#328 (comment) is relevant - I think this is just looking ahead to blow5 compatibility.

To quote the importance of just the svb compression:

> I've seen a reduction in total database size (storing fastq and slow5 reads) from 12 GB to 7.1 GB using just this one weird trick (database engineers hate him).

This is without implementing the other tricks from blow5.

I also added examples of how to use the compression/decompression.

@carreter (Collaborator)

Continuing the thread from over there (#328 (comment)):

> I'm using these functions for taking the rawSignal in and out of an SQL database, so having it be a method doesn't work as well as a raw function.
>
> CompressedRead, I think, would just be a blow5 read rather than a slow5 read, which we SHOULD implement at some point. (There is a whole binary format specifically for doing those compressed reads well, with some real performance improvements.)

If this is the case, I think we should entirely hold off on merging this PR in favor of doing a blow5 PR instead. As it stands, I don't really see a use case for other clients of this package. If someone wants to perform this compression, they can just import your svb package themselves.

@Koeng101 (Contributor, Author)

> Continuing the thread from over there (#328 (comment)):
>
> I'm using these functions for taking the rawSignal in and out of an SQL database, so having it be a method doesn't work as well as a raw function.
>
> CompressedRead, I think, would just be a blow5 read rather than a slow5 read, which we SHOULD implement at some point. (There is a whole binary format specifically for doing those compressed reads well, with some real performance improvements.)
>
> If this is the case, I think we should entirely hold off on merging this PR in favor of doing a blow5 PR instead. As it stands, I don't really see a use case for other clients of this package. If someone wants to perform this compression, they can just import your svb package themselves.

Breaking it down into two parts: (1) blow5 vs. svb, and (2) svb triviality.

I strongly disagree that blow5 fulfills the same use case:

  1. blow5 uses svb, but the intended use case is completely different. These slow5 package functions exist so that you can use slow5 with different data stores (like an SQL database). blow5 cannot fulfill that use case: because you want to query on the read data, it can't just sit there as an opaque binary file. However, the one column that takes up the majority of the space and won't be queried on directly, the raw signal, can be compressed with svb to halve the needed storage, and storage is the limiting factor for these kinds of data files. blow5 is for binary storage, probably putting Nanopore runs in an object store, while slow5+svb is much better for a document-style database (see the storage sketch after this list).
  2. I am currently storing ~14 real sequencing runs in a database with this format, with more coming in every week. I have first-hand experience that this works well in practice and that blow5 does not fill this niche.
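To make the storage pattern concrete, here is a minimal sketch of what a document-style row might look like: the queryable fields stay as plain columns, and only the raw signal is stored as svb-compressed blobs plus its original length. The table layout, column names, and hard-coded mask/data bytes are illustrative assumptions, not the PR's actual schema or API.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // any database/sql driver works here
)

func main() {
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Only the raw signal is compressed; everything else stays directly queryable.
	if _, err := db.Exec(`CREATE TABLE reads (
		read_id        TEXT PRIMARY KEY,
		raw_signal_len INTEGER, -- must be kept to decompress later
		svb_mask       BLOB,
		svb_data       BLOB
	)`); err != nil {
		log.Fatal(err)
	}

	// In real code, mask and data would come from svb-compressing the read's
	// raw signal (see the encoder sketch later in this thread); they are
	// hard-coded here so the snippet stands alone.
	mask := []byte{0x15}
	data := []byte{0xAE, 0x01, 0xD8, 0x01, 0xCF, 0x01, 0x05}

	if _, err := db.Exec(
		`INSERT INTO reads (read_id, raw_signal_len, svb_mask, svb_data) VALUES (?, ?, ?, ?)`,
		"read_0001", 4, mask, data,
	); err != nil {
		log.Fatal(err)
	}

	var n int
	if err := db.QueryRow(`SELECT raw_signal_len FROM reads WHERE read_id = ?`, "read_0001").Scan(&n); err != nil {
		log.Fatal(err)
	}
	fmt.Println("stored read with", n, "raw signal samples")
}
```

The key point is that raw_signal_len, svb_mask, and svb_data together are sufficient to recover the original []int16, while the rest of the row remains a normal, queryable record.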

I also disagree that svb is trivial:

  1. The svb implementation is non-trivial (before I forked, you couldn't actually do it with pre-existing libraries), and here we have it tested in the single context where it is useful. Unlike gzip or zstd, it isn't wrapping an io.Reader/io.Writer. Trivial would be piping through an io.Reader, not two make([]xint) calls plus a type-conversion loop. What is a mask array? Do you need it? Which inputs have to be saved for decompression? Having this wrapped in a function makes an unfamiliar compression/decompression scheme far more readable (a sketch of the moving parts follows this list).
  2. There is also messaging in using svb this way and having built-in functions for it. The performance is pretty darn great compared to other compression methods, and by including it we are saying "THIS is how you should compress raw reads".
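For anyone unfamiliar with the format, here is a minimal, self-contained sketch of the Stream VByte idea the points above refer to. It illustrates the technique only, not the PR's actual implementation (the real code may differ, for example by using an optimized library or zig-zag/delta encoding), but it shows why a mask (control) array exists and why the sample count has to be stored alongside the two byte slices.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// compressRawSignal sketches a Stream VByte encoder: each int16 sample is
// widened to uint32, stored in its minimal number of bytes (1-4), and a
// 2-bit length code per value is packed into the mask (control) bytes,
// four codes per mask byte.
func compressRawSignal(signal []int16) (mask []byte, data []byte) {
	mask = make([]byte, (len(signal)+3)/4)
	data = make([]byte, 0, len(signal)*2)
	for i, s := range signal {
		v := uint32(uint16(s)) // widen without sign extension
		var buf [4]byte
		binary.LittleEndian.PutUint32(buf[:], v)
		n := 1
		switch {
		case v >= 1<<24:
			n = 4
		case v >= 1<<16:
			n = 3
		case v >= 1<<8:
			n = 2
		}
		data = append(data, buf[:n]...)
		mask[i/4] |= byte(n-1) << uint((i%4)*2)
	}
	return mask, data
}

// decompressRawSignal reverses the encoding. Note that mask and data alone
// are not enough: the caller must also keep the original sample count.
func decompressRawSignal(count int, mask, data []byte) []int16 {
	signal := make([]int16, count)
	offset := 0
	for i := 0; i < count; i++ {
		n := int(mask[i/4]>>uint((i%4)*2))&0x3 + 1
		var buf [4]byte
		copy(buf[:], data[offset:offset+n])
		signal[i] = int16(uint16(binary.LittleEndian.Uint32(buf[:])))
		offset += n
	}
	return signal
}

func main() {
	raw := []int16{430, 472, 463, 5, 0, 1022} // toy raw signal values
	mask, data := compressRawSignal(raw)
	fmt.Printf("mask=%d bytes, data=%d bytes (vs %d bytes as raw int16)\n",
		len(mask), len(data), len(raw)*2)
	fmt.Println("roundtrip:", decompressRawSignal(len(raw), mask, data))
}
```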

> As it stands, I don't really see a use case for other clients of this package.

I specifically talk about the use case for other clients in the context comment.

@carreter (Collaborator)

Thank you for the clarifications re: slow5+svb vs blow5 and svb triviality. I completely missed the context comment, so sorry!

I'm on board with adding this functionality - reviewing now.

Review threads: two on bio/slow5/slow5.go (resolved) and one on bio/slow5/slow5_test.go (outdated, resolved).
Koeng101 merged commit 34de749 into ioToBio on Sep 16, 2023.
3 checks passed.
The delete-merged-branch bot deleted the slow5StreamVByte2 branch on September 16, 2023 at 18:09.