Merge pull request #219 from ctmrbio/improve-host-removal
Improve host removal
boulund committed Jun 8, 2023
2 parents 587778e + 047fe3c commit e3ed492
Showing 7 changed files with 324 additions and 107 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -16,11 +16,15 @@ situations.

## [0.7.0] Unreleased
### Added
+- Host removal: Bowtie2 now available as an option for host removal.

### Fixed

### Changed

+- Preprocessing summary: Preprocessing summary script can now output a table of
+  read counts regardless of which combination of read QC and host removal is
+  used.

### Deprecated

### Removed
23 changes: 15 additions & 8 deletions config/config.yaml
@@ -33,7 +35,9 @@ keep_local: False # Keep local copies of remote input files,
# Pipeline steps included
#########################
qc_reads: True
-host_removal: True
+host_removal:
+    kraken2: True
+    bowtie2: False
multiqc_report: True
naive:
assess_depth: False
@@ -61,13 +63,18 @@ fastp:
extra: ""
keep_output: False # StaG deletes fastp output files after host removal, set to True to keep them.
remove_host:
-    db_path: "" # [Required] Path to folder containing a Kraken2 database with host sequences (taxo.k2d, etc.)
-    confidence: 0.1 # Kraken2 confidence parameter, normally set to 0.1
-    extra: "--quick" # Additional command line arguments to kraken2
-    keep_kraken: False # StaG deletes the kraken and kreport output files by default, set to True to keep them.
-    keep_kreport: False # Keep the kreport files for host removal
-    keep_fastq: True # Keep the host removed fastq files, set to False to remove them automatically.
-    keep_host_fastq: False # Keep the host-containing fastq files, set to False to remove them automatically.
+    kraken2:
+        db_path: "" # [Required] Path to folder containing a Kraken2 database with host sequences (taxo.k2d, etc.)
+        confidence: 0.1 # Kraken2 confidence score, float in [0,1]
+        extra: "--quick" # Additional command line arguments to kraken2
+        keep_kraken: False # Keep the kraken files for host removal, set to False to remove them automatically.
+        keep_kreport: False # Keep the kreport files for host removal, set to False to remove them automatically.
+        keep_fastq: True # Keep the host-removed fastq files, set to False to remove them automatically.
+        keep_host_fastq: False # Keep the host-containing fastq files, set to False to remove them automatically.
+    bowtie2:
+        db_path: "" # [Required] Path to bowtie2 database to use for host removal, including database filename prefix (no .1.bt2 etc)
+        extra: "--sensitive"
+        keep_fastq: True # Keep the host-removed fastq files, set to False to remove them automatically.
multiqc:
extra: ""
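
Reviewer sketch (not part of the commit): the new layout splits `host_removal` into one switch per tool and nests each tool's settings under `remove_host`. The snippet below is a minimal illustration of how such a config could be sanity-checked once it has been loaded into a Python dict. The function names `enabled_host_removal_tools` and `missing_db_paths` are hypothetical, and how StaG itself reads these keys is not shown in this diff.

```python
from typing import Any

# Tool names taken from the keys introduced in config/config.yaml.
HOST_REMOVAL_TOOLS = ("kraken2", "bowtie2")


def enabled_host_removal_tools(config: dict[str, Any]) -> list[str]:
    """Return the tools switched on under the `host_removal` key."""
    switches = config.get("host_removal", {})
    if not isinstance(switches, dict):
        # Assumption: only the new nested layout is supported in this sketch.
        raise TypeError("host_removal must be a mapping of tool name to bool")
    return [tool for tool in HOST_REMOVAL_TOOLS if switches.get(tool, False)]


def missing_db_paths(config: dict[str, Any]) -> list[str]:
    """List enabled tools whose [Required] `db_path` is empty or absent."""
    settings = config.get("remove_host", {})
    return [
        tool
        for tool in enabled_host_removal_tools(config)
        if not settings.get(tool, {}).get("db_path")
    ]


# Mirrors the defaults shown in the diff: Kraken2 on, Bowtie2 off, paths unset.
example_config = {
    "host_removal": {"kraken2": True, "bowtie2": False},
    "remove_host": {
        "kraken2": {"db_path": "", "confidence": 0.1, "extra": "--quick"},
        "bowtie2": {"db_path": "", "extra": "--sensitive"},
    },
}
```

With the defaults above, only Kraken2 is enabled, so only its empty `db_path` would be flagged; the Bowtie2 path is ignored until that tool is switched on.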

42 changes: 27 additions & 15 deletions docs/source/modules.rst
@@ -1,5 +1,6 @@
.. _BBCountUnique: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/calcuniqueness-guide/
.. _FastP: https://github.com/OpenGene/fastp
+.. _Bowtie2: https://bowtie-bio.sourceforge.net/bowtie2/index.shtml
.. _BBMap: https://sourceforge.net/projects/bbmap/
.. _FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
.. _Kaiju: http://kaiju.binf.ku.dk/
@@ -21,8 +22,8 @@ Modules
|full_name| is a workflow framework that connects several other tools. The
basic assumption is that all analyses start with a quality control of the
sequencing reads (using `FastP`_), followed by host sequence removal (using
-`Kraken2`_). This section of the documentation aims to describe useful details
-about the separate tools that are used in |full_name|.
+`Kraken2`_ or `Bowtie2`_). This section of the documentation aims to describe
+useful details about the separate tools that are used in |full_name|.

The following subsections describe the function of each module included in
|full_name|. Each tool produces output in a separate subfolder inside the
@@ -52,35 +53,46 @@ reads are output into ``fastp``. Output filenames are::

remove_host
--------------
+The ``remove_host`` module can use either `Kraken2`_ or `Bowtie2`_ to classify
+reads against a database of host sequences to remove reads matching to
+non-desired host genomes.
+
+.. note::
+
+    It is possible to skip host removal. StaG then replaces the output files
+    with symlinks to the fastp output files.
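
Reviewer sketch (not part of the commit): the note above says that skipped host removal leaves symlinks to the fastp output in place of the host-removed files. A minimal illustration of that idea follows. The function name `link_fastp_outputs` is hypothetical, and the fastp filename pattern is an assumption, since this diff only shows the `host_removal` output names; the actual workflow rule may differ.

```python
from pathlib import Path


def link_fastp_outputs(sample: str, fastp_dir: Path, host_removal_dir: Path) -> list[Path]:
    """Create host_removal/<sample>_{1,2}.fq.gz as symlinks to fastp output."""
    host_removal_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for mate in (1, 2):
        # Assumption: fastp output uses the same <sample>_{1,2}.fq.gz naming.
        target = fastp_dir / f"{sample}_{mate}.fq.gz"
        link = host_removal_dir / f"{sample}_{mate}.fq.gz"
        # Replace any stale file or link so the function can be re-run safely.
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target.resolve())
        links.append(link)
    return links
```

Downstream steps can then read from ``host_removal`` whether or not host removal actually ran, which is presumably why the workflow keeps the output paths stable.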


:Tool: `Kraken2`_
:Output folder: ``host_removal``

-The ``remove_host`` module uses `Kraken2`_ to classify reads against a database
-of host sequences to remove reads matching to non-desired host genomes. The
-output are two sets of pairs of paired-end FASTQ files, and optionally one
-Kraken2 classification file and one Kraken2 summary report. In addition, two
-PDF files with 1) a basic histogram plot of the proportion of host reads
-detected in each sample, and 2) a barplot of the same. A TSV table with the raw
-proportion data is also provided::
+The output from Kraken2 are two sets of pairs of paired-end FASTQ files, and
+optionally one Kraken2 classification file and one Kraken2 summary report. In
+addition, two PDF files with 1) a basic histogram plot of the proportion of
+host reads detected in each sample, and 2) a barplot of the same. A TSV table
+with the raw proportion data is also provided::

<sample>_{1,2}.fq.gz
<sample>.host_{1,2}.fq.gz
host_barplot.pdf
host_histogram.pdf
host_proportions.txt

-.. note::

-    It is possible to skip host removal. StaG then replaces the output files
-    with symlinks to the fastp output files.
+:Tool: `Bowtie2`_
+:Output folder: ``host_removal``
+
+The output from Bowtie2 is a set of paired-end FASTQ files::
+
+    <sample>_{1,2}.fq.gz



preprocessing_summary
---------------------
This module summarize the number of reads passing through each preprocessing
-step and produces a summary table and a basic line plot showing the proportions
-of reads after each step. For more detailed information about read QC please
-refer to the MulitQC report.
+step and produces a summary table showing the number of reads after each step.
+For more detailed information about read QC please refer to the MulitQC report.


multiqc
9 changes: 6 additions & 3 deletions profiles/ctmr_gandalf/config.yaml
@@ -49,7 +49,8 @@ default-resources:
- mem_mb=10240
set-threads:
- fastp=20
-- remove_host=20
+- kraken2_remove_host=20
+- bowtie2_remove_host=20
- bbcountunique=4
- sketch=8
- kaiju=32
@@ -69,8 +70,10 @@ set-threads:
set-resources:
- fastp:mem_mb=20240
- fastp:time="02:00:00"
-- remove_host:mem_mb=10240
-- remove_host:time="02:00:00"
+- kraken2_remove_host:mem_mb=10240
+- kraken2_remove_host:time="02:00:00"
+- bowtie2_remove_host:mem_mb=10240
+- bowtie2_remove_host:time="06:00:00"
- kaiju:mem_mb=10240
- kaiju:time="10:00:00"
- kraken2:mem_mb=10240