Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible false negatives in query #5

Closed
hmusta opened this issue Oct 24, 2019 · 8 comments
Closed

Possible false negatives in query #5

hmusta opened this issue Oct 24, 2019 · 8 comments

Comments

@hmusta
Copy link

hmusta commented Oct 24, 2019

Hi,

I've been using COBS in a pipeline I'm working on, but I've noticed what appear to be false negatives in COBS' querying results.

I've used the following script to build compressed COBS and Mantis indexes for the attached input sequences

PREFIX=ERR1218773
PREFIX2=ERR1217061
NAME=small

# COBS
cobs compact-construct -k 31 -f 0.01 -T 1 --num-hashes 7 inputs/ $NAME.cobs_compact

# Mantis
squeakr count -e -k 31 -t 1 -o inputs/$PREFIX.ser inputs/$PREFIX.unitigs.fq
squeakr count -e -k 31 -t 1 -o inputs/$PREFIX2.ser inputs/$PREFIX2.unitigs.fq
echo inputs/$PREFIX.ser > inputs.txt
echo inputs/$PREFIX2.ser >> inputs.txt
mantis build -s 25 -i inputs.txt -o mantis

# query
cobs query -i $NAME.cobs_compact -f queries/query.fa -T 1 -t 0.0 --load-complete
mantis query -1 -p mantis/ -o /dev/stdout queries/query.seq.txt

When I query with Mantis, I get

[2019-10-24 17:46:48.174] [mantis_console] [info] Reading colored dbg from disk.
[2019-10-24 17:46:48.297] [mantis_console] [info] Read colored dbg with 9205186 k-mers and 3 color classes
[2019-10-24 17:46:48.297] [mantis_console] [info] Reading query kmers from disk.
[2019-10-24 17:46:48.298] [mantis_console] [info] Total k-mers to query: 831
[2019-10-24 17:46:48.298] [mantis_console] [info] Querying the colored dbg.
-----------------------------------------
| Query time  | Time = 277.25 us
-----------------------------------------
0       831
inputs/ERR1218773.ser   642
inputs/ERR1217061.ser   4
[2019-10-24 17:46:48.298] [mantis_console] [info] Writing done.

Whereas when I query the file with COBS, I get

Reading complete index
Read 33.331 MiB / 33.331 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 2
ERR1217061      10
ERR1218773      1
TIMER info=search hashes=0.000268999 io=0.000221758 and rows=2.1032e-05 sort results=1.396e-06 total=0.000513185

If I exclude PREFIX2 (the second input file), I get the following result in COBS

Reading complete index
Read 20.586 MiB / 20.586 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 1
ERR1218773      5
TIMER info=search hashes=0.000157673 io=0.000260737 and rows=2.2831e-05 sort results=1.326e-06 total=0.000442567

So it seems like the addition of extra samples leads to a reduction in the number of reported matches. I observe the same behavior if I construct a classic index as well. I've also done some tests with larger data sets where no matches are reported in cases where Mantis reports several.

Overall, the reported numbers are much lower than those reported by Mantis, so I'm not sure how to interpret these results.

inputs.tar.gz
queries.tar.gz

Please let me know if there's any other info I can provide to help look into this.

Best,
Harun

@bingmann
Copy link
Owner

thanks for the long bug report, I did find two important errors:
the FastQ reader didnt work with your lines > 64K length. commit 2e79c30
And I fixed the DNA canonicalization. in commit b1720a1

@bingmann
Copy link
Owner

Hope the new version works.

@hmusta
Copy link
Author

hmusta commented Oct 28, 2019

Thanks for your help! I'll let you know how things go

@hmusta
Copy link
Author

hmusta commented Oct 28, 2019

Since commit b1720a1 was only applied to the classic index, can I assume that my compact indexes are correct?

@bingmann
Copy link
Owner

Yes, compacts are built out of classic indexes.

@hmusta
Copy link
Author

hmusta commented Oct 29, 2019

I reran my script and it seems like the results are much closer, but still a bit off. Now, when I query, I get the following

Reading complete index
Read 384.157 MiB / 384.157 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 2
ERR1218773      546
ERR1217061      7
TIMER info=search hashes=0.000107381 io=0.000480277 and rows=2.0661e-05 sort results=1.128e-06 total=0.000609447

The greater number of matches for ERR1217061 (7 instead of 4) can probably be explained by false positives, but I'm still not sure why there are only 546 matches to ERR1218773 instead of 642.

@bingmann
Copy link
Owner

bingmann commented Oct 29, 2019

Did you add the --canonicalize flag for cobs compact-construct?

I saw mantis mirrors lexicographically larger k-mers. COBS doesnt by default atm.

@hmusta
Copy link
Author

hmusta commented Oct 29, 2019

Ok, that does indeed fix the problem, I had forgotten to re-enable it during my testing. Thanks for your help!

I'll close this issue then!

@hmusta hmusta closed this as completed Oct 29, 2019
Zhicheng-Liu added a commit to Zhicheng-Liu/cobs that referenced this issue Jul 22, 2021
…ingmann#5)

When combining classic indices, for each batch the combinations of
rows from each constituent index are written to an output block.
The output block is reused for next batch.

As we use bitwise OR operation to combine rows from the constituent
indices, the output block should be reset to all 0s before being
reused. Otherwise, previous set bits will be carried over to next
batch and accumulating false positives till the end of the batch
processing loop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants