Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Degraded performance of k-mer filter from FASTQ files #45

Closed
johnlees opened this issue Jul 7, 2023 · 14 comments · Fixed by #46
Closed

Degraded performance of k-mer filter from FASTQ files #45

johnlees opened this issue Jul 7, 2023 · 14 comments · Fixed by #46
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@johnlees
Copy link
Member

johnlees commented Jul 7, 2023

Reported by @rderelle

I generated 2 simulated set of 60x reads from 2 genomes. When analysed using ska 0.3.0, I obtained a total of 4.348.071 kmers. But with the 'old' ska 0.2.1, I obtained 4.361.745 kmers. It seems to me that, between versions 0.2.1 and 0.3.0, perhaps too many kmers are filtered out. I then tried different version of ska:

v0.3.0 4.348.071 kmers
v0.2.4 4.348.071 kmers
v0.2.3 4.348.071 kmers
v0.2.2 4.361.745 kmers
v0.2.1 4.361.745 kmers
v0.2.0 4.361.745 kmers

For all these versions, my command lines were:
ska build --threads 2 -k 41 --min-count 5 -f list_files.txt -o out
ska nk --full-info out.skf > out.txt

the potential issue seems to be related to the bloom filter implemented in "count_min_filter.rs".

I could increase the number of kmers observed by ska by increasing the size of the bloom filter** in v0.2.3.

** I'm not familiar with bloom filters, so I increased everything:

const BLOOM_WIDTH: usize = 1 << 28; (previously 1 << 27)
const BITS_PER_ENTRY: usize = 14; (previously 12)
const CM_WIDTH: usize = 1 << 28; (previously 1 << 24)

@johnlees
Copy link
Member Author

johnlees commented Jul 7, 2023

This is clearly related to this change: #20

@johnlees
Copy link
Member Author

johnlees commented Jul 7, 2023

The older versions used a count min filter for everything, the change then added a bloom filter for singleton k-mers, then the countmin filter is used with everything seen >1 time.

Things to try:

  • Replace both with a dictionary (which is exact). Check runtime and memory use.
  • Increase the bloom filter size, also checking memory use.

@rderelle it would be good if we could get a test case set up where we can check number of k-mers. Would you be happy to send me the example read files you are using?

@johnlees johnlees self-assigned this Jul 7, 2023
@johnlees johnlees added bug Something isn't working enhancement New feature or request labels Jul 7, 2023
@rderelle
Copy link

rderelle commented Jul 7, 2023

Hi John,

here are the fastq files I used (simulated 150bp paired-end reads at 60x coverage from 2 M. tuberculosis genomes):
https://drive.google.com/drive/folders/1goHqRI2SWiXIoqgWw_HfdRmD2yjzbHc3?usp=sharing

Thanks!

@johnlees
Copy link
Member Author

johnlees commented Jul 9, 2023

Using a dictionary, which is exact (branch exact_dict)
1m39s; 2.4Gb
k-mers=4360072
sample_kmers=[4352822, 4356260]

v0.3.1
Bloom + countmin
bloom params: 2^27; 12. cm params 2^24; 3.
47.41s; 1.4Gb
k-mers=4348071
sample_kmers=[4324446, 4327670]

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

@johnlees
Copy link
Member Author

johnlees commented Jul 9, 2023

v0.2.2
countmin only
cm params 2^27; 3.
1m56s; 1.9Gb (I think also without the only enter dictionary once optimisation. My memory is that time should be more like 1m)
k-mers=4361745
sample_kmers=[4353667, 4357112]

Giving 8 FN, 1681 FP

NB: There should be no false negatives, perhaps this is singletons or another change in later versions

@johnlees
Copy link
Member Author

johnlees commented Jul 9, 2023

@rderelle you might want to give the version on the exact_dict branch a go, and see if that recovers all the variants you expect in your downstream pipeline

@johnlees
Copy link
Member Author

johnlees commented Jul 9, 2023

increasing sizes in v0.3.1
bloom params: 2^28; 12. cm params 2^25; 4.
53s; 1.7Gb
k-mers=4359667
sample_kmers=[4351774, 4355311]

1188 FPs; 1593 FNs

This looks better, but I'm not sure why there are FNs. I need to think about why this might be

@rderelle
Copy link

rderelle commented Jul 9, 2023

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

I'm not sure we could know what are the numbers of FP and FN (the 2 genomes were different). But I might have missed some information.

I'll give it a try with version v0.3.1.
Thanks!

@johnlees
Copy link
Member Author

johnlees commented Jul 9, 2023

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

I'm not sure we could know what are the numbers of FP and FN (the 2 genomes were different). But I might have missed some information.

Sorry @rderelle, this isn't very clear above (these are mostly notes to self while I debug). On the exact_dist branch I am counting exactly so no counting errors, therefore with the filters (which have errors) I can quantify FN and FP by comparing the output of ska nk.

@johnlees
Copy link
Member Author

johnlees commented Jul 9, 2023

I believe I have found the bug causing an elevated FN rate. Adding in some logging for an FN k-mer:

2023-07-09T20:39:03.902Z WARN  [ska::ska_dict] Sample reads/lin_1.11.fq.gz
2023-07-09T20:39:06.540Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Count 2
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Count 3
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Count 9
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Count 10

So here it looks like the bloom filter is working fine (although note its FP rate is around 1% by design), but the problem is the k-mer hits a false positive in the countmin filter. K-mers are only added when the count is exactly the same as the minimum, to avoid doing multiple lookups once already added. So this skips over this criteria, and the k-mer isn't added.

It might be hard to pick sizes for the CM and bloom filter that give good error rates in every case, but I will try and optimise them a bit here. I don't really want to give these as options for the user as it's too hard to set good values. Perhaps, as in sketchlib, I could add an exact k-mer counter as an option. Probably trading ~2x runtime and memory will be worth it for many.

@johnlees
Copy link
Member Author

johnlees commented Jul 9, 2023

Performance of combined filters (c.f. exact 99s, 2.4 Gb, 0 FN, 0 FP)

table is a WIP which I'll update as I run more parameter combinations
(don't take exact CPU time too seriously, compiler flags and other stuff I was running varied)

cm width cm height bloom width time memory FP FN
2^24 3 2^27 47.41s 1.4Gb 33031 45032
2^25 4 2^28 53s 1.7Gb 1188 1593
2^28 3 2^27 57s 2.7Gb 26 36
2^27 4 2^27 60s 2.2Gb 34 47
2^24 3 2^29 67s 1.5Gb 33011 45021
2^24 8 2^27 85s 1.7Gb 1255 1690
2^26 4 2^27 68s 1.7Gb 213 281
NA (dict) NA (dict) 2^27 46s 1.4Gb 0 0

So far: bloom filter width looking like it's well set. Countmin tradeoff could be better, some increase in both width and height probably best.

Also worth trying: bloom filter + exact dict.

edit: Replacing the countmin filter with a dict is looking like the best option here. One final thing to try is a double bloom filter.

@johnlees
Copy link
Member Author

Adding a second bloom filter for the two-counts doesn't help, so will go ahead with single bloom + hashmap

@johnlees
Copy link
Member Author

@rderelle this should be fixed in v0.3.1, which has just been released

@rderelle
Copy link

Works great. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants