Possible false negatives in query #5

hmusta · 2019-10-24T16:00:12Z

Hi,

I've been using COBS in a pipeline I'm working on, but I've noticed what appear to be false negatives in COBS' querying results.

I've used the following script to build compressed COBS and Mantis indexes for the attached input sequences:

PREFIX=ERR1218773
PREFIX2=ERR1217061
NAME=small

# COBS
cobs compact-construct -k 31 -f 0.01 -T 1 --num-hashes 7 inputs/ $NAME.cobs_compact

# Mantis
squeakr count -e -k 31 -t 1 -o inputs/$PREFIX.ser inputs/$PREFIX.unitigs.fq
squeakr count -e -k 31 -t 1 -o inputs/$PREFIX2.ser inputs/$PREFIX2.unitigs.fq
echo inputs/$PREFIX.ser > inputs.txt
echo inputs/$PREFIX2.ser >> inputs.txt
mantis build -s 25 -i inputs.txt -o mantis

# query
cobs query -i $NAME.cobs_compact -f queries/query.fa -T 1 -t 0.0 --load-complete
mantis query -1 -p mantis/ -o /dev/stdout queries/query.seq.txt

When I query with Mantis, I get

[2019-10-24 17:46:48.174] [mantis_console] [info] Reading colored dbg from disk.
[2019-10-24 17:46:48.297] [mantis_console] [info] Read colored dbg with 9205186 k-mers and 3 color classes
[2019-10-24 17:46:48.297] [mantis_console] [info] Reading query kmers from disk.
[2019-10-24 17:46:48.298] [mantis_console] [info] Total k-mers to query: 831
[2019-10-24 17:46:48.298] [mantis_console] [info] Querying the colored dbg.
-----------------------------------------
| Query time  | Time = 277.25 us
-----------------------------------------
0       831
inputs/ERR1218773.ser   642
inputs/ERR1217061.ser   4
[2019-10-24 17:46:48.298] [mantis_console] [info] Writing done.

Whereas when I query the file with COBS, I get

Reading complete index
Read 33.331 MiB / 33.331 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 2
ERR1217061      10
ERR1218773      1
TIMER info=search hashes=0.000268999 io=0.000221758 and rows=2.1032e-05 sort results=1.396e-06 total=0.000513185

If I exclude PREFIX2 (the second input file), I get the following result in COBS

Reading complete index
Read 20.586 MiB / 20.586 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 1
ERR1218773      5
TIMER info=search hashes=0.000157673 io=0.000260737 and rows=2.2831e-05 sort results=1.326e-06 total=0.000442567

So it seems like the addition of extra samples leads to a reduction in the number of reported matches. I observe the same behavior if I construct a classic index as well. I've also done some tests with larger data sets where no matches are reported in cases where Mantis reports several.

Overall, the reported numbers are much lower than those reported by Mantis, so I'm not sure how to interpret these results.

inputs.tar.gz
queries.tar.gz

Please let me know if there's any other info I can provide to help look into this.

Best,
Harun

bingmann · 2019-10-28T14:53:34Z

Thanks for the detailed bug report. I found two important errors:
The FastQ reader didn't work with your lines longer than 64K (commit 2e79c30).
I also fixed the DNA canonicalization (commit b1720a1).

bingmann · 2019-10-28T14:54:35Z

Hope the new version works.

hmusta · 2019-10-28T15:00:36Z

Thanks for your help! I'll let you know how things go

hmusta · 2019-10-28T15:01:42Z

Since commit b1720a1 was only applied to the classic index, can I assume that my compact indexes are correct?

bingmann · 2019-10-28T15:02:30Z

Yes, compacts are built out of classic indexes.

hmusta · 2019-10-29T13:44:54Z

I reran my script and it seems like the results are much closer, but still a bit off. Now, when I query, I get the following

Reading complete index
Read 384.157 MiB / 384.157 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 2
ERR1218773      546
ERR1217061      7
TIMER info=search hashes=0.000107381 io=0.000480277 and rows=2.0661e-05 sort results=1.128e-06 total=0.000609447

The greater number of matches for ERR1217061 (7 instead of 4) can probably be explained by false positives, but I'm still not sure why there are only 546 matches to ERR1218773 instead of 642.
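
As a rough sanity check on the false-positive explanation, the expected number of spurious hits can be estimated with a back-of-the-envelope sketch. It assumes that the -f 0.01 passed at construction is the per-k-mer false-positive rate and that only query k-mers absent from the sample can produce false positives. The function name is hypothetical.

```python
def expected_false_positives(query_kmers: int, true_hits: int, fpr: float) -> float:
    """Expected spurious hits: each absent k-mer is falsely reported with probability fpr."""
    absent = query_kmers - true_hits
    return absent * fpr


# Numbers from this thread: 831 query k-mers, 4 hits reported by Mantis
# for ERR1217061, and an assumed false-positive rate of 0.01.
expected_extra = expected_false_positives(831, 4, 0.01)
observed_extra = 7 - 4

print(f"expected extra hits: {expected_extra:.2f}, observed extra hits: {observed_extra}")
```

Under these assumptions, the 3 extra hits are below the roughly 8 expected, so false positives are a plausible explanation for that sample. They cannot explain a lower count, which is why the 546 versus 642 gap needs a different cause.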

bingmann · 2019-10-29T13:47:27Z

Did you add the --canonicalize flag for cobs compact-construct?

I saw that Mantis mirrors lexicographically larger k-mers. COBS doesn't do this by default at the moment.
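
To illustrate why a missing canonicalization step can lower the match count: if k-mers are indexed exactly as read, a query taken from the opposite strand only matches where a k-mer and its reverse complement happen to coincide. The following is a minimal Python sketch of that idea, not COBS or Mantis code. The function names, the toy sequences, and k = 3 are made up for illustration, and choosing the lexicographically smaller of a k-mer and its reverse complement is an assumed convention.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmers(seq: str, k: int) -> list[str]:
    """List all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_hits(sample: str, query: str, k: int, canonicalize: bool) -> int:
    """Count query k-mers found in an exact set index built from sample."""
    norm = canonical if canonicalize else (lambda x: x)
    index = {norm(x) for x in kmers(sample, k)}
    return sum(norm(x) in index for x in kmers(query, k))


sample = "AACCGG"
query = reverse_complement(sample)  # same molecule, opposite strand

plain_hits = count_hits(sample, query, k=3, canonicalize=False)
canonical_hits = count_hits(sample, query, k=3, canonicalize=True)
print(plain_hits, canonical_hits)
```

In this toy case the plain index finds only part of the query's k-mers, while the canonicalized index finds all of them. Both the index and the query must use the same convention for the counts to be comparable.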

hmusta · 2019-10-29T13:52:20Z

OK, that does indeed fix the problem. I had forgotten to re-enable it during my testing. Thanks for your help!

I'll close this issue then!

…ingmann#5) When combining classic indices, the combined rows from each constituent index are written to an output block for each batch, and the output block is reused for the next batch. Because rows from the constituent indices are combined with a bitwise OR, the output block must be reset to all 0s before it is reused. Otherwise, previously set bits carry over to the next batch and false positives accumulate until the end of the batch-processing loop.
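
The buffer-reuse problem described in that commit message can be reproduced in a few lines. This is a simplified Python sketch of the pattern only, not the COBS implementation. The function name, the `reset_block` switch, and the two-byte rows are hypothetical.

```python
def combine_batches(batches, block_size, reset_block=True):
    """OR together the rows of each batch into a reused output block.

    batches: list of batches; each batch is a list of byte rows,
             one row per constituent index.
    Returns one combined block (as bytes) per batch.
    """
    block = bytearray(block_size)  # allocated once and reused across batches
    combined = []
    for batch in batches:
        if reset_block:
            # Clear bits left over from the previous batch before OR-ing.
            block[:] = bytes(block_size)
        for row in batch:
            for i, value in enumerate(row):
                block[i] |= value
        combined.append(bytes(block))
    return combined


batches = [
    [b"\x01\x00", b"\x00\x02"],  # batch 1: rows from two constituent indices
    [b"\x04\x00"],               # batch 2: a single row
]

with_reset = combine_batches(batches, block_size=2, reset_block=True)
without_reset = combine_batches(batches, block_size=2, reset_block=False)
print(with_reset, without_reset)
```

Without the reset, the second block keeps the bits set by the first batch, so it reports bits that no row in the second batch set. Because OR only ever sets bits, a stale buffer can add false positives but never removes true ones.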

hmusta closed this as completed Oct 29, 2019
