Possible changes to fastq input #52

johnlees · 2023-08-14T11:57:27Z

Two possible additions

An equivalent to the `-m` option in ska1:

Finally, the base call for the middle base in the split kmer is filtered to remove any bases where the minor alleles are found in more than 20% of the kmers. This is a strategy often used in read mapping to account for sequencing error, and can be adjusted using the -m option.

We can filter on ambiguous, but not frequency of ambig.

Early stop/fast mode

This could probably made to go much faster by:

Stopping reading the input after n reads (set by e.g. 15x coverage)
For 'online' or ref-based analyses, only including k-mers in the ref and turning off filters (may need minimiser based bins to improve caching).

For now, the first of these could be a useful addition

The text was updated successfully, but these errors were encountered:

johnlees · 2023-08-24T15:08:32Z

Another thought, a fastcall mode. Give a ref (.fa or .skf) and call SNPs against it given either a coverage or CPU time threshold.

As there will be no error correction makes sense to have a counter for each of four middle bases and then do a simple model to generate the call (e.g. [1, 0, 0, 5] -> G; [1, 3, 4, 0] -> Y. Probably looks like the bcftools model, feels multinomial to me). Or just keep the BBF to filter singletons (include middle base in hash)

johnlees · 2023-08-29T09:36:21Z

Important point from @rderelle from his work on fastlin: many fastq files in SRA come from BAMs, so are sorted and you cannot take the first n reads as a random subsample. Checking for subsequent reads having the same minimiser would probably catch this in most cases.

rderelle · 2023-08-29T21:05:21Z

I might have a suggestion regarding the fast mode (codename: 'flyover'):

(1) to parse the first fastq file and only extract kmers from 1 read every 20 reads.
(2) at the end of this first pass, if the average coverage of the split-kmers is above the cutoff (e.g. 80X), then SKA can end the analysis ... the runtime has been roughly divided by 20.
(3) if not, SKA could calculate how many more 20th are needed to reach the coverage cutoff (if 2 fastq files, SKA could reasonably assume that the number of reads in both files are the same) -> second pass with for instance extracting kmers from the 2nd, 3rd and 4th reads every 20 reads if 15% more reads are needed (in this case the runtime has been roughly divided by 5)

This approach would be able to analyse all types of fastq files (sorted or not). However, it would only make sense if the algorithm used to parse the fastq files is ultra-fast as it would require 2 successive parsing in most cases (e.g., 'seq_io' as in fastlin).

Edit: the examples mentioned above are probably not realistic (samples with very high coverages) since this approach should only divide the runtimes by a factor 3-4 at the most (depending on the cutoff).

johnlees · 2023-08-30T08:47:39Z

Thanks for the suggestion! Here's some profiling I did with fastq input (with the older countmin filter, which is a bit slower than the current implementation):

The bloom filter takes around ~75% of the time when reading a file in, file reading is about 8% of the time (mostly spent on decompression with .fq.gz input, but when I've tested previously decompressing doesn't usually make sense because the disk access takes proportionally longer), the k-mer processing is around 7% of the time, dictionary adds are minimal.

So the reading is actually very fast if we skip the parsing, probably around 10-15x faster than current run, which might make this viable.

Finally, I think the best solution would be to check for sorted reads with minimisers, in which case you can just do one run, but if sorted then subsample

johnlees · 2023-08-30T08:48:18Z

Full profile if you are interested (open in firefox)

rderelle · 2023-08-30T15:20:33Z

The minimisers approach seems simpler to implement but, for the sake of the discussion (I'm not advocating against it), it might suffer from two caveats:
_ you might need to introduce another cutoff for the comparison of the minimisers and decide if reads are sorted or not (not entirely sure about this point as I have never used minimisers myself)
_ if SKA concludes that reads are sorted, it would not be able to uniformly subsample along the fastq files without knowing a priori how many reads there are.

May I ask what profiling program did you use to generate these profiles? :)

johnlees · 2023-08-30T19:51:25Z

May I ask what profiling program did you use to generate these profiles? :)

It's flamegraph, which is really nice

johnlees · 2024-06-15T11:44:32Z

For mapping to a reference with reads, using a binary fuse filter for the ref knots could be a good choice. See https://docs.rs/xorf/latest/xorf/

johnlees added the enhancement New feature or request label Aug 14, 2023

johnlees self-assigned this Aug 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible changes to fastq input #52

Possible changes to fastq input #52

johnlees commented Aug 14, 2023

johnlees commented Aug 24, 2023

johnlees commented Aug 29, 2023

rderelle commented Aug 29, 2023 •

edited

Loading

johnlees commented Aug 30, 2023

johnlees commented Aug 30, 2023 •

edited

Loading

rderelle commented Aug 30, 2023

johnlees commented Aug 30, 2023

johnlees commented Jun 15, 2024

Possible changes to fastq input #52

Possible changes to fastq input #52

Comments

johnlees commented Aug 14, 2023

An equivalent to the -m option in ska1:

Early stop/fast mode

johnlees commented Aug 24, 2023

johnlees commented Aug 29, 2023

rderelle commented Aug 29, 2023 • edited Loading

johnlees commented Aug 30, 2023

johnlees commented Aug 30, 2023 • edited Loading

rderelle commented Aug 30, 2023

johnlees commented Aug 30, 2023

johnlees commented Jun 15, 2024

An equivalent to the `-m` option in ska1:

rderelle commented Aug 29, 2023 •

edited

Loading

johnlees commented Aug 30, 2023 •

edited

Loading