Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run length bwt #440

Merged
merged 9 commits into from
Jan 24, 2024
Merged

Run length bwt #440

merged 9 commits into from
Jan 24, 2024

Conversation

TwFlem
Copy link
Contributor

@TwFlem TwFlem commented Jan 19, 2024

Changes in this PR

This PR is a memory optimization for the BWT. This implements the RLBWT, Run-Length Burrows Wheeler Transform. The difference between the RLBWT and the BWT is that instead of having the whole last column (the actual BWT string), we have a run length compressed version of the last column. For example, the Last column of banana is: "annb$aa". It's run length compression is "anb$a".

The RLBWT is very effective for repetitive texts like genomic data. For example, in the test data we have for the BWT, we repeat the base string 3 times. The resulting Last column has a length of 340 in the normal BWT implementation. In this implementation, it only has a length of 94 (along with some negligible auxiliary info about run start positions and character cumulative counts).

Ben would do a much better job at explaining the mechanics and O complexity than I:

Why are you making these changes?

This is a needed change for the BWT to be a viable tool for real workloads. There will be two follow ups after this PR:

  1. reducing the size of the suffix array using an r-index (this will also decrease consumed memory by an order of magnitude)
  2. serializing the BWT to disk as a cache. Calculating the BWT takes a lot of up front effort, but can save a lot of time in the long run if the workload involves searching through the BWT. We can this tool even more valuable for Poly's users by allowing them to opt into a way to serialize the BWT so they can write it to disk.

Are any changes breaking? (IMPORTANT)

No

Pre-merge checklist

All of these must be satisfied before this PR is considered
ready for merging. Mergeable PRs will be prioritized for review.

  • New packages/exported functions have docstrings.
  • New/changed functionality is thoroughly tested.
  • New/changed functionality has a function giving an example of its usage in the associated test file. See primers/primers_test.go for what this might look like.
  • Changes are documented in CHANGELOG.md in the [Unreleased] section.
  • All code is properly formatted and linted.
  • The PR template is filled out.

PS: I didn't update the changelog since the BWT is still unreleased at the time of writing.

search/bwt/bitvector.go Outdated Show resolved Hide resolved
}

func (bwt BWT) getFirstColumnStr() string {
firstColumn := strings.Builder{}
Copy link
Contributor Author

@TwFlem TwFlem Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is helpful for debugging. I recommend keeping it for the future.

I've rewritten this thing about 7 times thinking I wouldn't need it again.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments explaining why it's helpful for debugging so we know not to delete in future? A test in bwt_test.go would be a good place to keep an example since it's unexported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyStiles I added an explanation and demo in this commit: fe3b686#diff-c08e2ebed9f275352a42178cdcc0b15abc0ca64e019428e5fe8b6ff5ea3af188R636

Also, I'm not sure if we have any better way of switching on debug for Poly's internals (didn't find anything), but I added that debug field in the BWT. LMK if there's another approach that is preferred or if the explanation along with the lint ignore annotation is good enough.

search/bwt/rsa_bitvector.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@TimothyStiles TimothyStiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the debugging functions and the lint directive but without some example usage I'm not sure how they'd be used to debug and would probably end up removing them in the future if they were merged without them.

Think you could write some tiny tests just to demo their usage? You can just put each example in a simple non-example test and refer to it in-line with something like, "To see usage example check out TestDebugRelatedFunction in bwt_test.go.

}

func (bwt BWT) getFirstColumnStr() string {
firstColumn := strings.Builder{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments explaining why it's helpful for debugging so we know not to delete in future? A test in bwt_test.go would be a good place to keep an example since it's unexported.

search/bwt/bitvector.go Outdated Show resolved Hide resolved
search/bwt/rsa_bitvector.go Outdated Show resolved Hide resolved
@TimothyStiles TimothyStiles merged commit 05169d2 into bebop:main Jan 24, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants