This component is intended to test the integrity of the provided FastQ files. It does so by attempting to parse uncompressed or compressed (gz, bz2 or zip) FastQ files (paired-end or single-end). During this parse, if the FastQ files are not corrupt, it retrieves the following information:
- sequence encoding: Estimates the sequence encoding based on the quality scores. This information can then be passed to other components that might require it.
- estimated coverage: Provides a rough coverage estimate for each sample based on a user-provided genome size, expressed in megabases (see Parameters). The estimate is essentially
$$\frac{\text{number of base pairs}}{\text{genome size} \times 10^{6}}$$
This information is written to the reports directory (see Published reports).
- maximum read length: Retrieves the maximum read length for each sample.
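The checks listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the component's actual template: the function names (`open_fastq`, `scan_fastq`, `guess_encoding`, `estimate_coverage`) are hypothetical, compression is detected from magic bytes, a zip archive is assumed to hold a single FastQ file, and the encoding guess is a simplified heuristic based on the range of quality characters.

```python
import bz2
import gzip
import io
import zipfile

# Magic bytes for each supported compression format (assumed detection strategy).
MAGIC = {b"\x1f\x8b": "gz", b"BZh": "bz2", b"PK\x03\x04": "zip"}


def open_fastq(path):
    """Return a text handle for a plain, gz, bz2 or zip FastQ file."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    kind = next((k for m, k in MAGIC.items() if head.startswith(m)), None)
    if kind == "gz":
        return gzip.open(path, "rt")
    if kind == "bz2":
        return bz2.open(path, "rt")
    if kind == "zip":
        archive = zipfile.ZipFile(path)
        # Assumption: the archive contains exactly one FastQ file.
        return io.TextIOWrapper(archive.open(archive.namelist()[0]))
    return open(path, "rt")


def scan_fastq(path):
    """Parse every record; raise ValueError on a malformed record.

    Truncated archives surface as the decompressor's own exceptions
    (for example EOFError or OSError), which also indicate corruption.
    """
    bases = reads = max_len = 0
    qmin, qmax = 255, 0
    with open_fastq(path) as fh:
        while True:
            header = fh.readline().rstrip("\r\n")
            if not header:
                break  # end of file
            seq = fh.readline().rstrip("\r\n")
            plus = fh.readline().rstrip("\r\n")
            qual = fh.readline().rstrip("\r\n")
            ok = (header.startswith("@") and plus.startswith("+")
                  and seq and len(seq) == len(qual))
            if not ok:
                raise ValueError(f"malformed record near read {reads + 1}")
            reads += 1
            bases += len(seq)
            max_len = max(max_len, len(seq))
            codes = [ord(c) for c in qual]
            qmin, qmax = min(qmin, min(codes)), max(qmax, max(codes))
    return {"bases": bases, "reads": reads, "max_len": max_len,
            "qmin": qmin, "qmax": qmax}


def guess_encoding(qmin, qmax):
    """Simplified heuristic: return a Phred offset label, or None if unclear."""
    if qmin < 59:          # characters below ';' only occur with offset 33
        return "Phred+33"
    if qmin >= 64 and qmax > 74:
        return "Phred+64"
    return None            # cannot be inferred from the observed range


def estimate_coverage(total_bases, genome_size_mb):
    """Number of base pairs divided by the genome size in base pairs."""
    return total_bases / (genome_size_mb * 1e6)
```

For paired-end data, the base counts of both files would be summed before calling `estimate_coverage`. A `None` result from `guess_encoding` corresponds to the case where the encoding cannot be inferred.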
Important
If the minCoverage parameter is set to a value higher than 0, this component will filter out samples with an estimated coverage below that threshold.
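The filtering rule can be pictured with a minimal sketch. The function name `filter_by_coverage` and the sample names are hypothetical, and the sketch assumes that a sample whose estimated coverage equals the threshold is kept, since only samples below it are filtered.

```python
def filter_by_coverage(coverages, min_coverage):
    """Split {sample: estimated coverage} into kept and filtered samples.

    Samples with coverage strictly below min_coverage are filtered;
    a threshold of 0 therefore keeps every sample.
    """
    kept = {s: c for s, c in coverages.items() if c >= min_coverage}
    filtered = {s: c for s, c in coverages.items() if c < min_coverage}
    return kept, filtered


# Hypothetical estimates for two samples.
estimates = {"sampleA": 50.0, "sampleB": 4.2}
kept, filtered = filter_by_coverage(estimates, min_coverage=15)
```

With a threshold of 15, only `sampleA` proceeds; with a threshold of 0, both samples do.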
- Input type: FastQ
- Output type: FastQ
Note
The default input parameter for FastQ data is --fastq. You can change the default pattern of the --fastq parameter (fastq/*_{1,2}.*) according to your input file names (e.g.: --fastq "path/to/fastq/*R{1,2}.*").
- genomeSize: Genome size estimate for the samples. It is used to estimate the coverage and other assembly parameters and checks.
- minCoverage: Minimum coverage for a sample to proceed. Can be set to 0 to allow any coverage.
Note
You can use these parameters as in the following example: --genomeSize 3.
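As a worked illustration of this setting, assume that the genome size is given in megabases (as the factor of one million in the coverage estimate suggests) and take a hypothetical sample with 150,000,000 sequenced base pairs. With --genomeSize 3, the estimate would be

$$\frac{150\,000\,000}{3 \times 10^{6}} = 50$$

that is, an estimated coverage of roughly 50x.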
None.
- reports/coverage: CSV table with estimated sequencing coverage for each sample.
- reports/corrupted: Text file with the list of corrupted samples.
None.
flowcraft.templates.integrity_coverage
- tableRow:
  - Raw BP: Number of nucleotides.
  - Reads: Number of reads.
  - Coverage: Estimated coverage.
- plotData:
  - sparkline: Number of nucleotides.
- warnings:
  - When the encoding and/or Phred score cannot be inferred from the FastQ files.
- fail:
  - When the estimated coverage is below the provided threshold.