Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature requests: different mixture model; combined ska build & cov #50

Open
rderelle opened this issue Jul 20, 2023 · 2 comments
Open
Assignees
Labels
enhancement New feature or request

Comments

@rderelle
Copy link

rderelle commented Jul 20, 2023

Hi John,

Thanks a lot for implementing the calculation of the optimal kmer count threshold (ska cov).
I tested it using ska 0.3.1 and a paired-end sample (fastq files available here: https://drive.google.com/drive/folders/1HVO-6mOd7bh7CPOjXhA3lWAT0GA_8SC8?usp=sharing). My command line was:

./ska_0.3.1 cov reads_kraken/ERR8158003_1.fastq.gz reads_kraken/ERR8158003_2.fastq.gz -k 31 > cov.txt

Given the distribution of kmer counts, the cutoff should be around 4-6, but ska outputs a value of 17. I think this high treeshold is related to the "Error" message below obtained for low kmer sizes during the model fittng.

This is the beginning of the file 'cov.txt':

Count K_mers Mixture_density Component
1 25838523 3.2408096206707826e-1 Error
2 1993278 1.620404810335391e-1 Error
3 129884 5.4013493677846365e-2 Error
4 97141 1.3503373419461602e-2 Error
5 28053 2.7006746838923196e-3 Error
6 35628 4.5011244731538646e-4 Error
7 24702 6.430177818791246e-5 Error
8 35486 8.03772227348904e-6 Error
9 30571 8.93080252609934e-7 Error
10 39312 8.930802526126602e-8 Error
11 40110 8.118911389065957e-9 Error
12 44332 6.765759585478362e-10 Error
13 43658 5.20443537193471e-11 Error
14 49357 3.7176916177683685e-12 Error
15 47325 2.4891833425171915e-13 Error
16 51653 2.009020801325787e-14 Error
17 52514 1.921693705685228e-14 Coverage
18 53472 6.883935794424811e-14 Coverage
19 52873 2.4488922538072723e-13 Coverage
20 53588 8.282019181080119e-13 Coverage
21 54322 2.667583813189892e-12 Coverage
22 54876 8.201562459958256e-12 Coverage
23 55197 2.411959247393554e-11 Coverage
24 54703 6.797667675764525e-11 Coverage
25 54701 1.8391668285808207e-10 Coverage
26 54189 4.784636868075234e-10 Coverage
27 52711 1.1986335327845915e-9 Coverage
28 52430 2.895540187815414e-9 Coverage
29 51131 6.753560645868026e-9 Coverage
30 51195 1.522694413626287e-8 Coverage
31 49749 3.322402658582442e-8 Coverage
32 48342 7.02268990941827e-8 Coverage
33 48280 1.4394306882794722e-7 Coverage
34 47080 2.863604561107574e-7 Coverage
35 46010 5.534089852758182e-7 Coverage
36 45616 1.039788261971188e-6 Coverage
37 45274 1.9008348747272694e-6 Coverage
38 44701 3.383467426815247e-6 Coverage
39 44463 5.868114750185855e-6 Coverage
40 43372 9.922927345843403e-6 Coverage
41 43026 1.637031965870808e-5 Coverage

Also, would it be possible to perform the calculation of the estimated cutoff 'on the fly' while extracting the kmers from fastq files? As it is, we have to run ska twice (first to estimate the cutoff and then to build the kmer dictionary), which kind of defeat the purpose (twice the runtimes).

Thanks and best,
Romain

@rderelle
Copy link
Author

Here is the coverage histogram
coverage_histogram

@johnlees
Copy link
Member

Thanks for the testing and files. The cov.txt file is actually fine, the coverage/error column is actually just to assist with the plotting.

It looks like the default mixture model just doesn't fit this data that well. I'll have a play around with some other types of mixture for the second component to see if we can offer an alternative in these cases.

I'll also think about adding an auto option rather than running ska cov separately, that might be easier after fixing #45. Of course you do end up counting k-mers twice this way if you run on every read set, but it's still only about ~60s per sample.

@johnlees johnlees self-assigned this Jul 20, 2023
@johnlees johnlees added the enhancement New feature or request label Jul 20, 2023
@johnlees johnlees changed the title Error with ska cov, and possible improvement Feature requests: different mixture model; combined ska build & cov Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants