Feature requests: different mixture model; combined ska build & cov #50

rderelle · 2023-07-20T07:15:04Z

Hi John,

Thanks a lot for implementing the calculation of the optimal kmer count threshold (ska cov).
I tested it using ska 0.3.1 and a paired-end sample (fastq files available here: https://drive.google.com/drive/folders/1HVO-6mOd7bh7CPOjXhA3lWAT0GA_8SC8?usp=sharing). My command line was:

./ska_0.3.1 cov reads_kraken/ERR8158003_1.fastq.gz reads_kraken/ERR8158003_2.fastq.gz -k 31 > cov.txt

Given the distribution of kmer counts, the cutoff should be around 4-6, but ska outputs a value of 17. I think this high treeshold is related to the "Error" message below obtained for low kmer sizes during the model fittng.

This is the beginning of the file 'cov.txt':

Count K_mers Mixture_density Component
1 25838523 3.2408096206707826e-1 Error
2 1993278 1.620404810335391e-1 Error
3 129884 5.4013493677846365e-2 Error
4 97141 1.3503373419461602e-2 Error
5 28053 2.7006746838923196e-3 Error
6 35628 4.5011244731538646e-4 Error
7 24702 6.430177818791246e-5 Error
8 35486 8.03772227348904e-6 Error
9 30571 8.93080252609934e-7 Error
10 39312 8.930802526126602e-8 Error
11 40110 8.118911389065957e-9 Error
12 44332 6.765759585478362e-10 Error
13 43658 5.20443537193471e-11 Error
14 49357 3.7176916177683685e-12 Error
15 47325 2.4891833425171915e-13 Error
16 51653 2.009020801325787e-14 Error
17 52514 1.921693705685228e-14 Coverage
18 53472 6.883935794424811e-14 Coverage
19 52873 2.4488922538072723e-13 Coverage
20 53588 8.282019181080119e-13 Coverage
21 54322 2.667583813189892e-12 Coverage
22 54876 8.201562459958256e-12 Coverage
23 55197 2.411959247393554e-11 Coverage
24 54703 6.797667675764525e-11 Coverage
25 54701 1.8391668285808207e-10 Coverage
26 54189 4.784636868075234e-10 Coverage
27 52711 1.1986335327845915e-9 Coverage
28 52430 2.895540187815414e-9 Coverage
29 51131 6.753560645868026e-9 Coverage
30 51195 1.522694413626287e-8 Coverage
31 49749 3.322402658582442e-8 Coverage
32 48342 7.02268990941827e-8 Coverage
33 48280 1.4394306882794722e-7 Coverage
34 47080 2.863604561107574e-7 Coverage
35 46010 5.534089852758182e-7 Coverage
36 45616 1.039788261971188e-6 Coverage
37 45274 1.9008348747272694e-6 Coverage
38 44701 3.383467426815247e-6 Coverage
39 44463 5.868114750185855e-6 Coverage
40 43372 9.922927345843403e-6 Coverage
41 43026 1.637031965870808e-5 Coverage

Also, would it be possible to perform the calculation of the estimated cutoff 'on the fly' while extracting the kmers from fastq files? As it is, we have to run ska twice (first to estimate the cutoff and then to build the kmer dictionary), which kind of defeat the purpose (twice the runtimes).

Thanks and best,
Romain

rderelle · 2023-07-20T07:41:49Z

Here is the coverage histogram

johnlees · 2023-07-20T08:24:49Z

Thanks for the testing and files. The cov.txt file is actually fine, the coverage/error column is actually just to assist with the plotting.

It looks like the default mixture model just doesn't fit this data that well. I'll have a play around with some other types of mixture for the second component to see if we can offer an alternative in these cases.

I'll also think about adding an auto option rather than running ska cov separately, that might be easier after fixing #45. Of course you do end up counting k-mers twice this way if you run on every read set, but it's still only about ~60s per sample.

johnlees self-assigned this Jul 20, 2023

johnlees added the enhancement New feature or request label Jul 20, 2023

johnlees changed the title ~~Error with ska cov, and possible improvement~~ Feature requests: different mixture model; combined ska build & cov Jul 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature requests: different mixture model; combined ska build & cov #50

Feature requests: different mixture model; combined ska build & cov #50

rderelle commented Jul 20, 2023 •

edited

Loading

rderelle commented Jul 20, 2023

johnlees commented Jul 20, 2023

Feature requests: different mixture model; combined ska build & cov #50

Feature requests: different mixture model; combined ska build & cov #50

Comments

rderelle commented Jul 20, 2023 • edited Loading

rderelle commented Jul 20, 2023

johnlees commented Jul 20, 2023

rderelle commented Jul 20, 2023 •

edited

Loading