Degraded performance of k-mer filter from FASTQ files #45

johnlees · 2023-07-07T10:43:27Z

I generated 2 simulated set of 60x reads from 2 genomes. When analysed using ska 0.3.0, I obtained a total of 4.348.071 kmers. But with the 'old' ska 0.2.1, I obtained 4.361.745 kmers. It seems to me that, between versions 0.2.1 and 0.3.0, perhaps too many kmers are filtered out. I then tried different version of ska:

v0.3.0 4.348.071 kmers
v0.2.4 4.348.071 kmers
v0.2.3 4.348.071 kmers
v0.2.2 4.361.745 kmers
v0.2.1 4.361.745 kmers
v0.2.0 4.361.745 kmers

For all these versions, my command lines were:
ska build --threads 2 -k 41 --min-count 5 -f list_files.txt -o out
ska nk --full-info out.skf > out.txt

the potential issue seems to be related to the bloom filter implemented in "count_min_filter.rs".

I could increase the number of kmers observed by ska by increasing the size of the bloom filter** in v0.2.3.

** I'm not familiar with bloom filters, so I increased everything:

const BLOOM_WIDTH: usize = 1 << 28; (previously 1 << 27)
const BITS_PER_ENTRY: usize = 14; (previously 12)
const CM_WIDTH: usize = 1 << 28; (previously 1 << 24)

johnlees · 2023-07-07T10:44:21Z

This is clearly related to this change: #20

johnlees · 2023-07-07T10:47:56Z

The older versions used a count min filter for everything, the change then added a bloom filter for singleton k-mers, then the countmin filter is used with everything seen >1 time.

Things to try:

Replace both with a dictionary (which is exact). Check runtime and memory use.
Increase the bloom filter size, also checking memory use.

@rderelle it would be good if we could get a test case set up where we can check number of k-mers. Would you be happy to send me the example read files you are using?

rderelle · 2023-07-07T11:06:34Z

Hi John,

here are the fastq files I used (simulated 150bp paired-end reads at 60x coverage from 2 M. tuberculosis genomes):
https://drive.google.com/drive/folders/1goHqRI2SWiXIoqgWw_HfdRmD2yjzbHc3?usp=sharing

Thanks!

johnlees · 2023-07-09T16:07:50Z

Using a dictionary, which is exact (branch exact_dict)
1m39s; 2.4Gb
k-mers=4360072
sample_kmers=[4352822, 4356260]

v0.3.1
Bloom + countmin
bloom params: 2^27; 12. cm params 2^24; 3.
47.41s; 1.4Gb
k-mers=4348071
sample_kmers=[4324446, 4327670]

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

johnlees · 2023-07-09T16:27:28Z

v0.2.2
countmin only
cm params 2^27; 3.
1m56s; 1.9Gb (I think also without the only enter dictionary once optimisation. My memory is that time should be more like 1m)
k-mers=4361745
sample_kmers=[4353667, 4357112]

Giving 8 FN, 1681 FP

NB: There should be no false negatives, perhaps this is singletons or another change in later versions

johnlees · 2023-07-09T16:28:34Z

@rderelle you might want to give the version on the exact_dict branch a go, and see if that recovers all the variants you expect in your downstream pipeline

johnlees · 2023-07-09T16:38:23Z

increasing sizes in v0.3.1
bloom params: 2^28; 12. cm params 2^25; 4.
53s; 1.7Gb
k-mers=4359667
sample_kmers=[4351774, 4355311]

1188 FPs; 1593 FNs

This looks better, but I'm not sure why there are FNs. I need to think about why this might be

rderelle · 2023-07-09T16:52:24Z

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

I'm not sure we could know what are the numbers of FP and FN (the 2 genomes were different). But I might have missed some information.

I'll give it a try with version v0.3.1.
Thanks!

johnlees · 2023-07-09T20:43:09Z

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

I'm not sure we could know what are the numbers of FP and FN (the 2 genomes were different). But I might have missed some information.

Sorry @rderelle, this isn't very clear above (these are mostly notes to self while I debug). On the exact_dist branch I am counting exactly so no counting errors, therefore with the filters (which have errors) I can quantify FN and FP by comparing the output of ska nk.

johnlees · 2023-07-09T20:48:45Z

I believe I have found the bug causing an elevated FN rate. Adding in some logging for an FN k-mer:

2023-07-09T20:39:03.902Z WARN  [ska::ska_dict] Sample reads/lin_1.11.fq.gz
2023-07-09T20:39:06.540Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Count 2
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Count 3
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Count 9
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Count 10

So here it looks like the bloom filter is working fine (although note its FP rate is around 1% by design), but the problem is the k-mer hits a false positive in the countmin filter. K-mers are only added when the count is exactly the same as the minimum, to avoid doing multiple lookups once already added. So this skips over this criteria, and the k-mer isn't added.

It might be hard to pick sizes for the CM and bloom filter that give good error rates in every case, but I will try and optimise them a bit here. I don't really want to give these as options for the user as it's too hard to set good values. Perhaps, as in sketchlib, I could add an exact k-mer counter as an option. Probably trading ~2x runtime and memory will be worth it for many.

johnlees · 2023-07-09T21:05:15Z

Performance of combined filters (c.f. exact 99s, 2.4 Gb, 0 FN, 0 FP)

table is a WIP which I'll update as I run more parameter combinations
(don't take exact CPU time too seriously, compiler flags and other stuff I was running varied)

cm width	cm height	bloom width	time	memory	FP	FN
2^24	3	2^27	47.41s	1.4Gb	33031	45032
2^25	4	2^28	53s	1.7Gb	1188	1593
2^28	3	2^27	57s	2.7Gb	26	36
2^27	4	2^27	60s	2.2Gb	34	47
2^24	3	2^29	67s	1.5Gb	33011	45021
2^24	8	2^27	85s	1.7Gb	1255	1690
2^26	4	2^27	68s	1.7Gb	213	281
NA (dict)	NA (dict)	2^27	46s	1.4Gb	0	0

So far: bloom filter width looking like it's well set. Countmin tradeoff could be better, some increase in both width and height probably best.

Also worth trying: bloom filter + exact dict.

edit: Replacing the countmin filter with a dict is looking like the best option here. One final thing to try is a double bloom filter.

johnlees · 2023-07-10T09:46:46Z

Adding a second bloom filter for the two-counts doesn't help, so will go ahead with single bloom + hashmap

johnlees · 2023-07-10T10:30:35Z

@rderelle this should be fixed in v0.3.1, which has just been released

rderelle · 2023-07-10T15:08:33Z

Works great. Thanks!

johnlees self-assigned this Jul 7, 2023

johnlees added bug Something isn't working enhancement New feature or request labels Jul 7, 2023

johnlees mentioned this issue Jul 10, 2023

Restore filter accuracy for fastq input #46

Merged

johnlees closed this as completed in #46 Jul 10, 2023

johnlees mentioned this issue Jul 20, 2023

Feature requests: different mixture model; combined ska build & cov #50

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Degraded performance of k-mer filter from FASTQ files #45

Degraded performance of k-mer filter from FASTQ files #45

johnlees commented Jul 7, 2023 •

edited

Loading

johnlees commented Jul 7, 2023

johnlees commented Jul 7, 2023

rderelle commented Jul 7, 2023

johnlees commented Jul 9, 2023

johnlees commented Jul 9, 2023

johnlees commented Jul 9, 2023

johnlees commented Jul 9, 2023

rderelle commented Jul 9, 2023 •

edited

Loading

johnlees commented Jul 9, 2023 •

edited

Loading

johnlees commented Jul 9, 2023

johnlees commented Jul 9, 2023 •

edited

Loading

johnlees commented Jul 10, 2023

johnlees commented Jul 10, 2023

rderelle commented Jul 10, 2023

Degraded performance of k-mer filter from FASTQ files #45

Degraded performance of k-mer filter from FASTQ files #45

Comments

johnlees commented Jul 7, 2023 • edited Loading

johnlees commented Jul 7, 2023

johnlees commented Jul 7, 2023

rderelle commented Jul 7, 2023

johnlees commented Jul 9, 2023

johnlees commented Jul 9, 2023

johnlees commented Jul 9, 2023

johnlees commented Jul 9, 2023

rderelle commented Jul 9, 2023 • edited Loading

johnlees commented Jul 9, 2023 • edited Loading

johnlees commented Jul 9, 2023

johnlees commented Jul 9, 2023 • edited Loading

johnlees commented Jul 10, 2023

johnlees commented Jul 10, 2023

rderelle commented Jul 10, 2023

johnlees commented Jul 7, 2023 •

edited

Loading

rderelle commented Jul 9, 2023 •

edited

Loading

johnlees commented Jul 9, 2023 •

edited

Loading

johnlees commented Jul 9, 2023 •

edited

Loading