Degraded performance of k-mer filter from FASTQ files #45
Comments
This is clearly related to this change: #20
Older versions used a countmin filter for everything. The change added a bloom filter for singleton k-mers, so the countmin filter is now only used for k-mers seen more than once. Things to try:
@rderelle it would be good if we could get a test case set up where we can check the number of k-mers. Would you be happy to send me the example read files you are using?
Hi John, here are the FASTQ files I used (simulated 150 bp paired-end reads at 60x coverage from 2 M. tuberculosis genomes): Thanks!
Using a dictionary, which is exact (branch v0.3.1 This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)
v0.2.2: giving 8 FN, 1681 FP. NB: there should be no false negatives; perhaps this is due to singletons or another change in later versions.
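For reference, counting FP and FN against the exact dictionary amounts to two set differences. This is a minimal sketch, not code from ska: `fp_fn_counts` and the use of plain string sets for k-mers are hypothetical choices made for illustration.

```rust
use std::collections::HashSet;

/// Compare an approximate filter's k-mer set against an exact one.
/// Returns (false positives, false negatives).
/// Hypothetical helper for illustration; not part of ska.
fn fp_fn_counts(exact: &HashSet<String>, approx: &HashSet<String>) -> (usize, usize) {
    // FP: passed by the approximate filter, but absent from the exact result.
    let false_pos = approx.difference(exact).count();
    // FN: present in the exact result, but dropped by the approximate filter.
    let false_neg = exact.difference(approx).count();
    (false_pos, false_neg)
}
```

The change in the total number of k-mers is FP minus FN, so large FP and FN counts can partly cancel in the total.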
@rderelle you might want to give the version on the |
Increasing sizes in v0.3.1: 1188 FPs; 1593 FNs. This looks better, but I'm not sure why there are FNs. I need to think about why this might be.
I'm not sure how we could know the numbers of FP and FN (the 2 genomes were different). But I might have missed some information. I'll give it a try with version v0.3.1.
Sorry @rderelle, this isn't very clear above (these are mostly notes to self while I debug). On the |
I believe I have found the bug causing an elevated FN rate. Adding in some logging for an FN k-mer:
So here it looks like the bloom filter is working fine (although note its FP rate is around 1% by design), but the problem is that the k-mer hits a false positive in the countmin filter. K-mers are only added when the count is exactly equal to the minimum, to avoid doing multiple lookups once already added. An inflated count therefore skips over this criterion, and the k-mer is never added. It might be hard to pick sizes for the CM and bloom filter that give good error rates in every case, but I will try to optimise them a bit here. I don't really want to expose these as options for the user, as it's too hard to set good values. Perhaps, as in sketchlib, I could add an exact k-mer counter as an option. Trading ~2x runtime and memory will probably be worth it for many.
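The failure mode described above can be reproduced in a toy model. The sketch below is not ska's implementation: it omits the bloom filter, uses a single-row count table, and all names (`ToyCountMin`, `passing_kmers_approx`, `passing_kmers_exact`) are hypothetical. It only shows how an `estimate == min_count` test misses a k-mer whose shared counter has already been pushed past the threshold by a collision.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy single-row count table (hypothetical; a real countmin filter
/// uses several rows and takes the minimum across them).
struct ToyCountMin {
    counts: Vec<u32>,
}

impl ToyCountMin {
    fn new(width: usize) -> Self {
        Self { counts: vec![0; width] }
    }

    /// Increment the counter for `kmer` and return the new estimate,
    /// which may be inflated by other k-mers sharing the same slot.
    fn add(&mut self, kmer: &str) -> u32 {
        let mut hasher = DefaultHasher::new();
        kmer.hash(&mut hasher);
        let idx = (hasher.finish() as usize) % self.counts.len();
        self.counts[idx] += 1;
        self.counts[idx]
    }
}

/// K-mers passed when the test is `estimate == min_count`.
fn passing_kmers_approx(kmers: &[&str], min_count: u32, width: usize) -> Vec<String> {
    let mut cm = ToyCountMin::new(width);
    let mut passed = Vec::new();
    for &kmer in kmers {
        // If a collision has already pushed the counter past min_count,
        // this equality never holds and the k-mer is silently dropped.
        if cm.add(kmer) == min_count {
            passed.push(kmer.to_string());
        }
    }
    passed
}

/// Same test, but with exact per-k-mer counts.
fn passing_kmers_exact(kmers: &[&str], min_count: u32) -> Vec<String> {
    let mut counts: HashMap<&str, u32> = HashMap::new();
    let mut passed = Vec::new();
    for &kmer in kmers {
        let count = counts.entry(kmer).or_insert(0);
        *count += 1;
        if *count == min_count {
            passed.push(kmer.to_string());
        }
    }
    passed
}
```

With a table width of 1 every k-mer collides, so for the stream A, A, A, B, B, B and a minimum count of 3 the approximate version passes only A (B's counter reads 4, 5, 6), while the exact version passes both.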
Performance of combined filters (cf. exact: 99 s, 2.4 Gb, 0 FN, 0 FP). The table is a WIP, which I'll update as I run more parameter combinations.
So far: the bloom filter width looks well set. The countmin tradeoff could be better; some increase in both width and height is probably best. Also worth trying: bloom filter + exact dict. Edit: replacing the countmin filter with a dict is looking like the best option here. One final thing to try is a double bloom filter.
Adding a second bloom filter for the two-counts doesn't help, so I will go ahead with a single bloom filter + hashmap.
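The single bloom filter + hashmap design can be sketched as follows. This is an illustration under stated assumptions, not the code released in ska: `ToyBloom`, `BloomPlusMap`, the hashing scheme and the parameters are all hypothetical, and the sketch assumes `min_count >= 2`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy bloom filter (hypothetical; one bool per bit for readability).
struct ToyBloom {
    bits: Vec<bool>,
    n_hashes: u64,
}

impl ToyBloom {
    fn new(width: usize, n_hashes: u64) -> Self {
        Self { bits: vec![false; width], n_hashes }
    }

    /// Set the k-mer's bits; return true if all were already set,
    /// i.e. the k-mer was (apparently) seen before.
    fn check_and_insert(&mut self, kmer: &str) -> bool {
        let mut present = true;
        for seed in 0..self.n_hashes {
            let mut hasher = DefaultHasher::new();
            seed.hash(&mut hasher);
            kmer.hash(&mut hasher);
            let idx = (hasher.finish() % self.bits.len() as u64) as usize;
            if !self.bits[idx] {
                present = false;
                self.bits[idx] = true;
            }
        }
        present
    }
}

/// Bloom filter absorbs first sightings; exact counts for everything else.
struct BloomPlusMap {
    bloom: ToyBloom,
    counts: HashMap<String, u32>,
    min_count: u32,
}

impl BloomPlusMap {
    fn new(width: usize, n_hashes: u64, min_count: u32) -> Self {
        assert!(min_count >= 2, "sketch assumes min_count >= 2");
        Self { bloom: ToyBloom::new(width, n_hashes), counts: HashMap::new(), min_count }
    }

    /// Returns true on the observation where the tracked count reaches min_count.
    fn observe(&mut self, kmer: &str) -> bool {
        if !self.bloom.check_and_insert(kmer) {
            return false; // first sighting: stored in the bloom filter only
        }
        // Seen before (or a bloom false positive): count exactly from here on.
        let count = self.counts.entry(kmer.to_string()).or_insert(1);
        *count += 1;
        *count == self.min_count
    }
}
```

In this sketch a bloom false positive can only make a count start one too high, which may let through a k-mer one observation early, but the tracked count still rises by one per observation and so cannot skip over `min_count`.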
@rderelle this should be fixed in v0.3.1, which has just been released.
Works great. Thanks!
Reported by @rderelle
I generated 2 simulated sets of 60x reads from 2 genomes. When analysed using ska 0.3.0, I obtained a total of 4,348,071 k-mers. But with the 'old' ska 0.2.1, I obtained 4,361,745 k-mers. It seems to me that, between versions 0.2.1 and 0.3.0, too many k-mers may be filtered out. I then tried different versions of ska:
v0.3.0 4,348,071 k-mers
v0.2.4 4,348,071 k-mers
v0.2.3 4,348,071 k-mers
v0.2.2 4,361,745 k-mers
v0.2.1 4,361,745 k-mers
v0.2.0 4,361,745 k-mers
For all these versions, my command lines were:
ska build --threads 2 -k 41 --min-count 5 -f list_files.txt -o out
ska nk --full-info out.skf > out.txt
The potential issue seems to be related to the bloom filter implemented in "count_min_filter.rs".
I could increase the number of kmers observed by ska by increasing the size of the bloom filter** in v0.2.3.
** I'm not familiar with bloom filters, so I increased everything:
const BLOOM_WIDTH: usize = 1 << 28; (previously 1 << 27)
const BITS_PER_ENTRY: usize = 14; (previously 12)
const CM_WIDTH: usize = 1 << 28; (previously 1 << 24)
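As a rough guide to what widening a bloom filter buys, the textbook estimate for a standard bloom filter with m bits, k hash functions and n inserted items is p ≈ (1 - e^(-kn/m))^k. The sketch below only evaluates that formula. It assumes a standard bloom filter; how `BLOOM_WIDTH` and `BITS_PER_ENTRY` map onto m and k in ska is not established here, so `bloom_fp_rate` and its arguments are hypothetical placeholders.

```rust
/// Textbook false-positive estimate for a standard bloom filter:
/// p ≈ (1 - exp(-k * n / m))^k
/// m_bits: filter size in bits, n_items: distinct items inserted,
/// k_hashes: number of hash functions. Hypothetical helper, not from ska.
fn bloom_fp_rate(m_bits: u64, n_items: u64, k_hashes: u32) -> f64 {
    let m = m_bits as f64;
    let n = n_items as f64;
    let k = k_hashes as f64;
    // Fraction of bits expected to be set after n insertions.
    let fill = 1.0 - (-k * n / m).exp();
    // A lookup is a false positive when all k probed bits are set.
    fill.powi(k_hashes as i32)
}
```

For a fixed number of inserted k-mers and hash functions, doubling the width lowers the estimated false-positive rate, which is consistent with more k-mers being recovered after the sizes were increased.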