Possible changes to fastq input #52
Comments
Another thought: as there will be no error correction, it makes sense to have a counter for each of the four middle bases and then use a simple model to generate the call (e.g. [1, 0, 0, 5] -> G; [1, 3, 4, 0] -> Y). This probably looks like the bcftools model; it feels multinomial to me. Alternatively, just keep the BBF to filter singletons (including the middle base in the hash).
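For illustration, here is a minimal sketch of the counter idea. It is not the bcftools or multinomial model: it swaps in a plain threshold rule, and `call_middle_base`, `min_count` and `min_frac` are hypothetical names. The count order A, C, T, G is an assumption inferred from the two examples above.

```rust
/// Map a set of observed bases to an IUPAC code.
/// Bit 0 = A, bit 1 = C, bit 2 = T, bit 3 = G (assumed order, see above).
fn iupac(mask: u8) -> char {
    match mask {
        0b0001 => 'A',
        0b0010 => 'C',
        0b0100 => 'T',
        0b1000 => 'G',
        0b0011 => 'M', // A/C
        0b0101 => 'W', // A/T
        0b1001 => 'R', // A/G
        0b0110 => 'Y', // C/T
        0b1010 => 'S', // C/G
        0b1100 => 'K', // T/G
        0b0111 => 'H', // A/C/T
        0b1011 => 'V', // A/C/G
        0b1101 => 'D', // A/T/G
        0b1110 => 'B', // C/T/G
        0b1111 => 'N', // all four bases
        _ => '-',      // no base passed the thresholds
    }
}

/// Call the middle base from per-base counts (order A, C, T, G).
/// A base is kept if it has at least `min_count` observations and makes up
/// at least `min_frac` of all observations. This is a simple threshold rule,
/// standing in for a proper likelihood model.
fn call_middle_base(counts: [u32; 4], min_count: u32, min_frac: f64) -> char {
    let total: u32 = counts.iter().sum();
    if total == 0 {
        return '-';
    }
    let mut mask = 0u8;
    for (i, &c) in counts.iter().enumerate() {
        if c >= min_count && f64::from(c) / f64::from(total) >= min_frac {
            mask |= 1u8 << i;
        }
    }
    iupac(mask)
}
```

With `min_count = 2` and `min_frac = 0.2`, the counts [1, 0, 0, 5] give G and [1, 3, 4, 0] give Y, matching the examples. Setting `min_count = 2` also drops singletons, which is the job the BBF would otherwise do.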
Important point from @rderelle from his work on fastlin: many fastq files in SRA come from BAMs, so they are sorted, and you cannot take the first n reads as a random subsample. Checking for subsequent reads having the same minimiser would probably catch this in most cases.
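A rough sketch of that check, under some assumptions: the minimiser here is simply the lexicographically smallest k-mer of the read (real implementations usually hash canonical k-mers), and `minimiser` and `shared_minimiser_fraction` are hypothetical helper names.

```rust
/// Lexicographically smallest k-mer of a read, or None if the read is
/// shorter than k (or k is zero).
fn minimiser(seq: &[u8], k: usize) -> Option<&[u8]> {
    if k == 0 || seq.len() < k {
        return None;
    }
    seq.windows(k).min()
}

/// Fraction of adjacent read pairs that share a minimiser.
/// Reads too short to have a minimiser never count as shared.
fn shared_minimiser_fraction(reads: &[&[u8]], k: usize) -> f64 {
    let mins: Vec<Option<&[u8]>> = reads.iter().map(|&r| minimiser(r, k)).collect();
    let pairs = mins.len().saturating_sub(1);
    if pairs == 0 {
        return 0.0;
    }
    let shared = mins
        .windows(2)
        .filter(|w| w[0].is_some() && w[0] == w[1])
        .count();
    shared as f64 / pairs as f64
}
```

In a coordinate-sorted file, neighbouring reads overlap on the reference, so this fraction over the first few thousand reads should be much higher than in a shuffled file. The cutoff for calling a file sorted would need to be chosen empirically.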
I might have a suggestion regarding the fast mode (codename: 'flyover'): (1) parse the first fastq file and only extract k-mers from 1 read in every 20. This approach would be able to analyse all types of fastq files (sorted or not). However, it would only make sense if the algorithm used to parse the fastq files is ultra-fast (e.g. 'seq_io', as in fastlin), as it would require two successive parsing passes in most cases. Edit: the examples mentioned above are probably not realistic (samples with very high coverage), since this approach should only divide the runtime by a factor of 3-4 at most (depending on the cutoff).
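To make the 1-in-n idea concrete, here is a minimal sketch using only the standard library rather than seq_io. It assumes unwrapped four-line FASTQ records, and `subsample_sequences` and `step` are hypothetical names.

```rust
use std::io::BufRead;

/// Return the sequence line of every `step`-th FASTQ record
/// (records 0, step, 2 * step, ...). Assumes four lines per record.
fn subsample_sequences<R: BufRead>(reader: R, step: usize) -> std::io::Result<Vec<String>> {
    assert!(step > 0, "step must be at least 1");
    let mut kept = Vec::new();
    for (i, line) in reader.lines().enumerate() {
        let line = line?;
        // The second line of each four-line record holds the sequence.
        if i % 4 == 1 && (i / 4) % step == 0 {
            kept.push(line);
        }
    }
    Ok(kept)
}
```

Every line is still read, so the saving comes only from skipping k-mer extraction on the other reads. That is why the speed of the parser itself matters for this approach.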
The minimisers approach seems simpler to implement but, for the sake of the discussion (I'm not advocating against it), it might suffer from two caveats:
May I ask which profiling program you used to generate these profiles? :)
It's flamegraph, which is really nice.
For mapping to a reference with reads, using a binary fuse filter for the ref knots could be a good choice. See https://docs.rs/xorf/latest/xorf/
Two possible additions
An equivalent to the -m option in ska1: we can filter on ambiguous bases, but not on the frequency of ambiguous bases.
Early stop/fast mode
This could probably be made to go much faster by:
For now, the first of these could be a useful addition.