Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Burrows-Wheeler transform #360

Closed
carreter opened this issue Sep 23, 2023 · 5 comments
Closed

Implement Burrows-Wheeler transform #360

carreter opened this issue Sep 23, 2023 · 5 comments
Assignees
Labels
good first issue Good for newcomers proposal A proposed feature or enhancement stale
Milestone

Comments

@carreter
Copy link
Collaborator

No description provided.

@carreter carreter changed the title Implement BurrowsWheeler Implement Burrows-Wheeler transform Sep 23, 2023
@carreter carreter added this to the v1.0 milestone Sep 23, 2023
@carreter carreter added the needs-triage An issue that needs to be triaged label Sep 23, 2023
@carreter
Copy link
Collaborator Author

@TimothyStiles can you elaborate on what exactly it is we need here and triage the issue here + in the roadmap?

@TwFlem
Copy link
Contributor

TwFlem commented Nov 9, 2023

I'd love to take a whack at this!

@TimothyStiles might have some other intentions in mind, but Burrows Wheeler Transform (BWT) can be used as a data structure to query whether or not a sub sequence within a sequence exists.

This can be useful for determining whether or not a sub sequence of nucleotides exist within a sequence of nucleotides.

It is very memory efficient and exact matches on reads are fast.

There is also some room for backtracking on seeming mismatches for some fuzzy sub sequence searching and maybe some other opportunities to tune the BWT like reporting the N number of locations where this sub sequence exists.

Maybe the exact case is a good first step? Do we want to report the location of the match as well- seems important?

@Koeng101
Copy link
Contributor

Koeng101 commented Nov 9, 2023

There is also some room for backtracking on seeming mismatches for some fuzzy sub sequence searching and maybe some other opportunities to tune the BWT like reporting the N number of locations where this sub sequence exists.

On the alignment side, I think this basically becomes bwa https://bio-bwa.sourceforge.net . What I think could be neat is if you could somehow make it get 98% matching working, so it could be used for #396 in auto-annotating features. Right now that is done in plannotate with BLAST, but I'm pretty sure it is done that way because BLAST is an easy tool to just import and use. This could be a good application!

For actual large-scale sequence alignment, minimap2 is probably a better route - tried and true with nanopore-type alignment, which IMO is the best upcoming DNA sequencing method. (Also it's what I use right now)

@TimothyStiles
Copy link
Collaborator

TimothyStiles commented Nov 9, 2023

@TimothyStiles might have some other intentions in mind, but Burrows Wheeler Transform (BWT) can be used as a data structure to query whether or not a sub sequence within a sequence exists.

This can be useful for determining whether or not a sub sequence of nucleotides exist within a sequence of nucleotides.

It is very memory efficient and exact matches on reads are fast.

There is also some room for backtracking on seeming mismatches for some fuzzy sub sequence searching and maybe some other opportunities to tune the BWT like reporting the N number of locations where this sub sequence exists.

Maybe the exact case is a good first step? Do we want to report the location of the match as well- seems important?

@TwFlem I'd love to see you take a crack at this. At a conference on my phone rn.

What we're ultimately looking for is a fast BWT implementation that essentially compresses our sequence to be used for search, alignment, etc.

BWT is common enough that LLMs can create a decent naive implementation that can be easily unit tested (I've actually done this).

What I'd suggest is that we get a naive implementation going that we can merge early with high test coverage and documentation then optimize that implementation with @matiasinsaurralde's or @soypat's, or @/josharian's (on discord) input. That way we don't get stuck on optimization early and different people can focus on optimization, and downstream applications at the same time without blocking each other.

@TimothyStiles TimothyStiles added good first issue Good for newcomers proposal A proposed feature or enhancement and removed needs-triage An issue that needs to be triaged labels Nov 29, 2023
@TwFlem TwFlem mentioned this issue Dec 6, 2023
6 tasks
TimothyStiles added a commit that referenced this issue Jan 18, 2024
* basic bitvector

* jacobsons start and refactor to uint for accurate machine words

* confident that jacobson rank is working

* reusing the incoming bitvector instead of copying everyithing for
jacobson rank

* access and bounds checking

* just do uint64 for simplicity. bound checking and access

* bit vector fixes, rsa good enough, wavelet start

* Simple wavelet tree with access

* wavelet fix access, add select, fix rsa bitvector select

* got count working, but had to throw out jacobsons

* rsa fixes and refactors

* bwt locate

* extract

* doc BWT, refactor, and return a possible error during construction

* add TODO about sorting and the nullChar

* bwt examples

* wavelet tree doc

* wavelet tree explanation

* doc and note for waveletTree

* add bwt high level. move wavelet tree's some rsa bv docs

* simplify bitvector, docs for bitvector and rsaBitvector

* Cite Ben Langmead.

---------

Co-authored-by: Willow Carretero Chavez <sandiegobutterflies@gmail.com>
Co-authored-by: Timothy Stiles <tim@stiles.io>
Copy link

github-actions bot commented Feb 6, 2024

This issue has had no activity in the past 2 months. Marking as stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers proposal A proposed feature or enhancement stale
Projects
None yet
Development

No branches or pull requests

4 participants