-
-
Notifications
You must be signed in to change notification settings - Fork 69
/
bwt.go
689 lines (585 loc) · 22.9 KB
/
bwt.go
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
package bwt
import (
"errors"
"fmt"
"math"
"strings"
"golang.org/x/exp/slices"
)
/*
For the BWT usage, please read the BWT methods
below. To understand what it is and how
it works for either curiosity or maintenance, then read below.
# BWT
BWT Stands for (B)urrows-(W)heeler (T)ransform. The BWT aids in
text compression and acts as a search index for any arbitrary
sequence of characters. With the BWT and some auxiliary data
structures, we can analyze a sequence in a memory and run time
efficient manner.
## BWT Transform
The first step to build the BWT is to get the BWT itself.
This is done by:
1. Appending a null terminating character to the end of a sequence
2. Rotate the sequence so that the last character is now the first
3. Repeat 2. N times where N is the length of the sequence
4. Lexicographically sort the NxN matrix of rotated sequences where
the null termination character is always the least-valued
5. Store the first and last column of the matrix. The last column
is the output of the BWT. The first column is needed to run queries
on the BWT of the original sequence.
Lets use banana as an example.
banana$ $banana
$banana a$banan
a$banan ana$ban
na$bana => anana$b
ana$ban banana$
nana$ba na$bana
anana$b nana$ba
Output:
Last Column (BWT): annb$aa
First Column: $aaabnn
## LF Mapping Properties
From now on we will refer to the Last Column as L and the First as F
There are a few special properties here to be aware of. First, notice
how the characters of the same rank show up in the same order for each
column:
L: a0 n0 n1 b0 $0 a1 a2
F: $0 a0 a1 a2 b0 n0 n1
That is to say the characters' rank for each column appear in ascending
order. For example: a0 < a1 < a2. This is true for all BWTs
The other important property to observe is that since the BWT is the
result of rotating each string, each character in the L column precedes
the corresponding character in the F column.
To best way to show this is to rebuild the original sequence
using the F and L columns. We do this by rebuilding the original string
in reverse order starting with the nullChar.
Original string: ______$0
F($0) -> L(a0) -> _____a0$0
F(a0) -> L(n0) -> ____n0a0$0
F(n0) -> L(a1) -> ___a1n0a0$0
F(a1) -> L(n1) -> __n1a1n0a0$0
F(n1) -> L(a2) -> _a2n1a1n0a0$0
F(a2) -> L(b0) -> b0a2n1a1n0a0$0
F(b0) -> L($0) -> Complete
If we take the rank subscripts away from: b0a2n1a1n0a0$0
We get... "banana$" !
## LF Mapping Usage
From these properties, the most important concept emerges- the LF Mapping.
The LF mapping is what enables us to query and analyze the BWT to gain
insight about the original sequence.
For example, let's say we wanted to count the number of occurrences of the
pattern "ana" in "banana". We can do this by:
1. Lookup the last char of the sequence, a, in the F column
2. Find that range of a's, [1, 4)
3. Take the next previous character in the pattern, n
4. Find the rank of n before the range from 2. [0, 1) = 0
5. Find the rank of n in the range from 2. [1, 4) = 2
6. Look up the start range of the n's in the F column, 5
7. Add the result from 4 and 5 respectively to form the next
L search range: [5+0, 5+2) = [5, 7)
8. Take next previous character in the pattern, a
9. Take the rank of "a" before position 5, which is 1
10. Take the rank of "a" before position 7, which is 3
11. Lookup the a's in the F column again, but add the results
from 9 and 10 to the start range to get the next search
range = [1+1, 1+3) = [2, 4)
12. That is beginning of our pattern, we sub subtract the end and start
of the search range to get our count, 4-2=2
Another way to look at this is that we are constantly refining our search
range for each character of the pattern we are searching for. Once we
reach the end of the pattern, our final range represents the a's which
start our pattern. If the range < 0, then at some point our search ranged
has collapsed and we can conclude that there is no matching pattern.
## Suffix Array
For other operations such as Locate and Extract, we need another auxiliary
data structure, the suffix array. Since rows of the BWT can map to any
position within the original sequence, we need some kind of reference as to
which BWT rows map to which positions in the original sequence. We can do this by storing
the positions of each character from the original sequence to each of the corresponding
rows in the BWT column. With our banana example:
F: $0 a0 a1 a2 b0 n0 n1
SA: [6 5 3 1 0 4 2]
If we take our count example for the pattern "ana" above, you'll remember
that our final search range was [2, 4). You'll also remember that we counted
2 occurrences of "ana" by subtracting the end of the range from the start, 4-2=2.
If iterate from 2 to 4, we can lookup the corresponding SA entry for the BWT rows 2 and 3.
If we look up 2 in the SA, we'll find that our first offset is at position 3 in the original sequence ban"ana"
If we look up 3 in the SA, we'll find that our second offset is at position 1 in the original sequence b"ana"na
## Notes on Performance
The explanation above leads to a very naive implementation. For example,
having the full SA would take way more memory than the BWT itself. Assuming
int64, that would 8 times the amount of memory of the BWT in its plain text
representation! In the implementation below, we may instead sample the SA
and do additional look ups as needed to find the offsets we need.
Similarly, storing both the F and L column as plain text would take double the
amount of memory to store the original sequence... BWT is used for text
compression, not expansion! That's why in the below implementation, you
will see other data structures that lower the amount of memory
needed. You will also notice that we can make huge improvements by
compressing sequences by runs of characters like with the F column.
Instead of:
F: $0 a0 a1 a2 b0 n0 n1
Since F is lexicographically sorted, we can have:
F: {$: [0, 1)}, {a: [1, 4)}, {b: [4, 5)} {n: [5, 7)}
Although these performance enhancements may lead to a different implementation to what is
described above, any implementation will just be an LF mapping- just with a few more steps.
NOTE: The above is just to explain what is happening at a high level. Please
reference the implementation below to see how the BWT is actually currently
working
Many of the Ideas come from Ben Langmead.
He has a whole YouTube playlist about BWT Indexing: https://www.youtube.com/watch?v=5G2Db41pSHE&list=PL2mpR0RYFQsADmYpW2YWBrXJZ_6EL_3nu
*/
const nullChar = "$"
// BWT Burrows-Wheeler Transform
// Compresses and Indexes a given sequence so that it can be
// be used for search, alignment, and text extraction. This is
// useful for sequences so large that it would be beneficial
// to reduce its memory footprint while also maintaining a way
// to analyze and work with the sequence.
type BWT struct {
// firstColumnSkipList is the first column of the BWT. It is
// represented as a list of skipEntries because the first column of
// the BWT is always lexicographically ordered. This saves time and memory.
firstColumnSkipList []skipEntry
// suffixArray an array that allows us to map a position in the first
// column to a position in the original sequence. This is needed to be
// able to extract text from the BWT.
suffixArray []int
// runLengthCompressedBWT is the compressed version of the BWT. The compression
// is for each run. For Example:
// the sequence "banana" has BWT "annb$aa"
// the run length compression of "annb$aa" is "anb$a"
// This helps us save a lot of memory while still having a search index we can
// use to align the original sequence. This allows us to understand how many
// runs of a certain character there are and where a run of a certain rank exists.
runBWTCompression waveletTree
// runStartPositions are the starting position of each run in the original sequence
// For example:
// "annb$aa" will have the runStartPositions [0, 1, 3, 4, 5]
// This helps us map our search range from "uncompressed BWT Space" to its
// "compressed BWT Run Space". With this, we can understand which runs we need
// to consider during LF mapping.
runStartPositions runInfo
// runCumulativeCounts is the cumulative count of characters for each run.
// This helps us efficiently lookup the number of occurrences of a given
// character before a given offset in "uncompressed BWT Space"
// For Example:
// "annb$aa" will have the runCumulativeCounts:
// "a": [0, 1, 3],
// "n": [0, 2],
// "b": [0, 1],
// "$": [0, 1],
runCumulativeCounts map[string]runInfo
// flag for turning on BWT debugging
debug bool
}
// Count represents the number of times the provided pattern
// shows up in the original sequence.
func (bwt BWT) Count(pattern string) (count int, err error) {
defer bwtRecovery("Count", &err)
err = isValidPattern(pattern)
if err != nil {
return 0, err
}
searchRange := bwt.lfSearch(pattern)
return searchRange.end - searchRange.start, nil
}
// Locate returns a list of offsets at which the beginning
// of the provided pattern occurs in the original
// sequence.
func (bwt BWT) Locate(pattern string) (offsets []int, err error) {
defer bwtRecovery("Locate", &err)
err = isValidPattern(pattern)
if err != nil {
return nil, err
}
searchRange := bwt.lfSearch(pattern)
if searchRange.start >= searchRange.end {
return nil, nil
}
numOfOffsets := searchRange.end - searchRange.start
offsets = make([]int, numOfOffsets)
for i := 0; i < numOfOffsets; i++ {
offsets[i] = bwt.suffixArray[searchRange.start+i]
}
return offsets, nil
}
// Extract this allows us to extract parts of the original
// sequence from the BWT.
// start is the beginning of the range of text to extract inclusive.
// end is the end of the range of text to extract exclusive.
// If either start or end are out of bounds, Extract will panic.
func (bwt BWT) Extract(start, end int) (extracted string, err error) {
defer bwtRecovery("Extract", &err)
err = validateRange(start, end)
if err != nil {
return "", err
}
if end > bwt.getLenOfOriginalStringWithNullChar()-1 {
return "", fmt.Errorf("end [%d] exceeds the max range of the BWT [%d]", end, bwt.getLenOfOriginalStringWithNullChar()-1)
}
if start < 0 {
return "", fmt.Errorf("start [%d] exceeds the min range of the BWT [0]", start)
}
strB := strings.Builder{}
for i := start; i < end; i++ {
fPos := bwt.getFCharPosFromOriginalSequenceCharPos(i)
skip := bwt.lookupSkipByOffset(fPos)
strB.WriteByte(skip.char)
}
return strB.String(), nil
}
// Len return the length of the sequence used to build the BWT
func (bwt BWT) Len() int {
return bwt.getLenOfOriginalStringWithNullChar() - 1
}
// GetTransform returns the last column of the BWT transform of the original sequence.
func (bwt BWT) GetTransform() string {
lastColumn := strings.Builder{}
lastColumn.Grow(bwt.getLenOfOriginalStringWithNullChar())
for i := 0; i < bwt.runBWTCompression.length; i++ {
currChar := bwt.runBWTCompression.Access(i)
var currCharEnd int
if i+1 >= len(bwt.runStartPositions) {
currCharEnd = bwt.getLenOfOriginalStringWithNullChar()
} else {
currCharEnd = bwt.runStartPositions[i+1]
}
for lastColumn.Len() < currCharEnd {
lastColumn.WriteByte(currChar)
}
}
return lastColumn.String()
}
//lint:ignore U1000 Ignore unused function. This is valuable for future debugging
func (bwt BWT) getFirstColumnStr() string {
firstColumn := strings.Builder{}
firstColumn.Grow(bwt.getLenOfOriginalStringWithNullChar())
for i := 0; i < len(bwt.firstColumnSkipList); i++ {
e := bwt.firstColumnSkipList[i]
for j := e.openEndedInterval.start; j < e.openEndedInterval.end; j++ {
firstColumn.WriteByte(e.char)
}
}
return firstColumn.String()
}
// getFCharPosFromOriginalSequenceCharPos looks up mapping from the original position
// of the sequence to its corresponding position in the First Column of the BWT
// NOTE: This clearly isn't ideal. Instead of improving this implementation, this will be replaced with
// something like r-index in the near future.
func (bwt BWT) getFCharPosFromOriginalSequenceCharPos(originalPos int) int {
for i := range bwt.suffixArray {
if bwt.suffixArray[i] == originalPos {
return i
}
}
panic("Unable to find the corresponding original position for a character in the original sequence in the suffix array. This should not be possible and indicates a malformed BWT.")
}
// lfSearch LF Search- Last First Search.
// Finds the valid range within the BWT index where the provided pattern is possible.
// If the final range is <= 0, then the pattern does not exist in the original sequence.
func (bwt BWT) lfSearch(pattern string) interval {
searchRange := interval{start: 0, end: bwt.getLenOfOriginalStringWithNullChar()}
for i := 0; i < len(pattern); i++ {
if bwt.debug {
printLFDebug(bwt, searchRange, i)
}
if searchRange.end-searchRange.start <= 0 {
return interval{}
}
c := pattern[len(pattern)-1-i]
nextStart := bwt.getNextLfSearchOffset(c, searchRange.start)
nextEnd := bwt.getNextLfSearchOffset(c, searchRange.end)
searchRange.start = nextStart
searchRange.end = nextEnd
}
return searchRange
}
func (bwt BWT) getNextLfSearchOffset(c byte, offset int) int {
nearestRunStart := bwt.runStartPositions.FindNearestRunStartPosition(offset + 1)
maxRunInCompressedSpace := bwt.runBWTCompression.Rank(c, nearestRunStart)
skip, ok := bwt.lookupSkipByChar(c)
if !ok {
return 0
}
cumulativeCounts, ok := bwt.runCumulativeCounts[string(c)]
if !ok {
return 0
}
cumulativeCountBeforeMaxRun := cumulativeCounts[maxRunInCompressedSpace]
currRunStart := bwt.runStartPositions.FindNearestRunStartPosition(offset)
currentRunChar := string(bwt.runBWTCompression.Access(currRunStart))
extraOffset := 0
// It is possible that an offset currently lies within a run of the same
// character we are inspecting. In this case, cumulativeCountBeforeMaxRun
// is not enough since the Max Run in this case does not include the run
// the offset is currently in. To adjust for this, we must count the number
// of character occurrences since the beginning of the run that the offset
// is currently in.
if c == currentRunChar[0] {
o := bwt.runStartPositions[nearestRunStart]
extraOffset += offset - o
}
return skip.openEndedInterval.start + cumulativeCountBeforeMaxRun + extraOffset
}
// lookupSkipByChar looks up a skipEntry by its character in the First Column
func (bwt BWT) lookupSkipByChar(c byte) (entry skipEntry, ok bool) {
for i := range bwt.firstColumnSkipList {
if bwt.firstColumnSkipList[i].char == c {
return bwt.firstColumnSkipList[i], true
}
}
return skipEntry{}, false
}
// lookupSkipByOffset looks up a skipEntry based off of an
// offset of the Fist Column of the BWT.
func (bwt BWT) lookupSkipByOffset(offset int) skipEntry {
if offset > bwt.getLenOfOriginalStringWithNullChar()-1 {
msg := fmt.Sprintf("offset [%d] exceeds the max bound of the BWT [%d]", offset, bwt.getLenOfOriginalStringWithNullChar()-1)
panic(msg)
}
if offset < 0 {
msg := fmt.Sprintf("offset [%d] exceeds the min bound of the BWT [0]", offset)
panic(msg)
}
for skipIndex := range bwt.firstColumnSkipList {
if bwt.firstColumnSkipList[skipIndex].openEndedInterval.start <= offset && offset < bwt.firstColumnSkipList[skipIndex].openEndedInterval.end {
return bwt.firstColumnSkipList[skipIndex]
}
}
msg := fmt.Sprintf("could not find the skip entry that falls within the range of the skip column at a given offset. range: [0, %d) offset: %d", bwt.getLenOfOriginalStringWithNullChar(), offset)
panic(msg)
}
func (bwt BWT) getLenOfOriginalStringWithNullChar() int {
return bwt.firstColumnSkipList[len(bwt.firstColumnSkipList)-1].openEndedInterval.end
}
type interval struct {
start int
end int
}
type skipEntry struct {
char byte
// openEndedInterval start is inclusive and end is exclusive
openEndedInterval interval
}
// New returns a BWT of the provided sequence
// The provided sequence must not contain the nullChar
// defined in this package. If it does, New will return
// an error.
func New(sequence string) (BWT, error) {
err := validateSequenceBeforeTransforming(&sequence)
if err != nil {
return BWT{}, err
}
sequence += nullChar
prefixArray := make([]string, len(sequence))
for i := 0; i < len(sequence); i++ {
prefixArray[i] = sequence[len(sequence)-i-1:]
}
sortPrefixArray(prefixArray)
suffixArray := make([]int, len(sequence))
charCount := 0
runBWTCompressionBuilder := strings.Builder{}
var runStartPositions runInfo
runCumulativeCounts := make(map[string]runInfo)
var prevChar *byte
for i := 0; i < len(prefixArray); i++ {
currChar := sequence[getBWTIndex(len(sequence), len(prefixArray[i]))]
if prevChar == nil {
prevChar = &currChar
}
if currChar != *prevChar {
runBWTCompressionBuilder.WriteByte(*prevChar)
runStartPositions = append(runStartPositions, i-charCount)
addRunCumulativeCountEntry(runCumulativeCounts, *prevChar, charCount)
charCount = 0
prevChar = &currChar
}
charCount++
suffixArray[i] = len(sequence) - len(prefixArray[i])
}
runBWTCompressionBuilder.WriteByte(*prevChar)
runStartPositions = append(runStartPositions, len(prefixArray)-charCount)
addRunCumulativeCountEntry(runCumulativeCounts, *prevChar, charCount)
fb := strings.Builder{}
for i := 0; i < len(prefixArray); i++ {
fb.WriteByte(prefixArray[i][0])
}
skipList := buildSkipList(prefixArray)
wt, err := newWaveletTreeFromString(runBWTCompressionBuilder.String())
if err != nil {
return BWT{}, err
}
return BWT{
firstColumnSkipList: skipList,
suffixArray: suffixArray,
runBWTCompression: wt,
runStartPositions: runStartPositions,
runCumulativeCounts: runCumulativeCounts,
}, nil
}
func addRunCumulativeCountEntry(rumCumulativeCounts map[string]runInfo, char byte, charCount int) {
cumulativeCountsOfChar, ok := rumCumulativeCounts[string(char)]
if ok {
cumulativeCountsOfChar = append(cumulativeCountsOfChar, charCount+cumulativeCountsOfChar[len(cumulativeCountsOfChar)-1])
} else {
cumulativeCountsOfChar = runInfo{0, charCount}
}
rumCumulativeCounts[string(char)] = cumulativeCountsOfChar
}
// buildSkipList compressed the First Column of the BWT into a skip list
func buildSkipList(prefixArray []string) []skipEntry {
prevChar := prefixArray[0][0]
skipList := []skipEntry{{char: prevChar, openEndedInterval: interval{start: 0}}}
for i := 1; i < len(prefixArray); i++ {
currChar := prefixArray[i][0]
if currChar != prevChar {
skipList[len(skipList)-1].openEndedInterval.end = i
skipList = append(skipList, skipEntry{
char: currChar,
openEndedInterval: interval{start: i},
})
prevChar = currChar
}
}
skipList[len(skipList)-1].openEndedInterval.end = len(prefixArray)
return skipList
}
// getBWTIndex helps us calculate the corresponding character that would
// be in the L column without having to rotate the full string.
// For example:
// Original string: banana$
// Rotation: ana$___
// Position: 7-4-1= 2
// Original[3]: n
func getBWTIndex(lenOfSequenceBeingBuilt, lenOfSuffixArrayVisited int) int {
bwtCharIndex := lenOfSequenceBeingBuilt - lenOfSuffixArrayVisited - 1
if bwtCharIndex == -1 {
bwtCharIndex = lenOfSequenceBeingBuilt - 1
}
return bwtCharIndex
}
func sortPrefixArray(prefixArray []string) {
slices.SortFunc(prefixArray, func(a, b string) bool {
minLen := int(math.Min(float64(len(a)), float64(len(b))))
for i := 0; i < minLen; i++ {
if a[i] == b[i] {
continue
}
if a[i] == nullChar[0] {
return true
}
if b[i] == nullChar[0] {
return false
}
return a[i] < b[i]
}
return len(a) < len(b)
})
}
func bwtRecovery(operation string, err *error) {
if r := recover(); r != nil {
rErr := fmt.Errorf("BWT %s InternalError=%s", operation, r)
*err = rErr
}
}
// runInfo each element of runInfo should represent an offset i where i
// corresponds to the start of a run in a given sequence. For example,
// aaaabbccc would have the run info [0, 4, 6]
type runInfo []int
// FindNearestRunStartPosition given some offset, find the nearest starting position for the.
// beginning of a run. Another way of saying this is give me the max i where runStartPositions[i] <= offset.
// This is needed so we can understand which run an offset is a part of.
func (r runInfo) FindNearestRunStartPosition(offset int) int {
start := 0
end := len(r) - 1
for start < end {
mid := start + (end-start)/2
if r[mid] < offset {
start = mid + 1
continue
}
if r[mid] > offset {
end = mid - 1
continue
}
return mid
}
if r[start] > offset {
return start - 1
}
return start
}
func isValidPattern(s string) (err error) {
if len(s) == 0 {
return errors.New("Pattern can not be empty")
}
return nil
}
func validateRange(start, end int) (err error) {
if start >= end {
return errors.New("Start must be strictly less than end")
}
return nil
}
func validateSequenceBeforeTransforming(sequence *string) (err error) {
if len(*sequence) == 0 {
return fmt.Errorf("Provided sequence must not by empty. BWT cannot be constructed")
}
if strings.Contains(*sequence, nullChar) {
return fmt.Errorf("Provided sequence contains the nullChar %s. BWT cannot be constructed", nullChar)
}
return nil
}
// printLFDebug this will print the first column and last column of the BWT along with some ascii visualizations.
// This is very helpful for debugging the LF mapping. For example, lets say you're in the middle of making some changes to the LF
// mapping and the test for counting starts to fail. To understand where the LF search is going wrong, you
// can do something like the below to outline which parts of the BWT are being searched some given iteration.
//
// For Example, if you had the BWT of:
// "rowrowrowyourboat"
// and wanted to Count the number of occurrences of "row"
// Then the iterations of the LF search would look something like:
//
// BWT Debug Begin Iteration: 0
// torbyrrru$wwaoooow
// $abooooorrrrtuwwwy
// ^^^^^^^^^^^^^^^^^^X
//
// BWT Debug Begin Iteration: 1
// torbyrrru$wwaoooow
// $abooooorrrrtuwwwy
// ______________^^^X
//
// BWT Debug Begin Iteration: 2
// torbyrrru$wwaoooow
// $abooooorrrrtuwwwy
// _____^^^X
//
// Where:
// * '^' denotes the active search range
// * 'X' denotes one character after the end of the active search searchRange
// * '_' is visual padding to help align the active search range
//
// NOTE: It can also be helpful to include the other auxiliary data structures. For example, it can be very helpful to include
// a similar visualization for the run length compression to help debug and understand which run were used to compute the active
// search window during each iteration.
func printLFDebug(bwt BWT, searchRange interval, iteration int) {
first := bwt.getFirstColumnStr()
last := bwt.GetTransform()
lastRunCompression := bwt.runBWTCompression.reconstruct()
fullASCIIRange := strings.Builder{}
fullASCIIRange.Grow(searchRange.end + 1)
for i := 0; i < searchRange.start; i++ {
fullASCIIRange.WriteRune('_')
}
for i := searchRange.start; i < searchRange.end; i++ {
fullASCIIRange.WriteRune('^')
}
fullASCIIRange.WriteRune('X')
fmt.Println("BWT Debug Begin Iteration:", iteration)
fmt.Println(last)
fmt.Println(first)
fmt.Println(fullASCIIRange.String())
fmt.Println(lastRunCompression)
}