kneaddata
KneadData is a tool designed to perform quality control on metagenomic sequencing data, especially data from microbiome experiments. This tool aims to perform principled in silico separation of bacterial reads from these "contaminant" reads, be they from the host, from bacterial 16S sequences, or other user-defined sources.
For more information, see here.
Please direct your questions to KneadData bioBakery Support Forum.
KneadData can be installed with Conda or PyPI, run from a Docker image, or installed manually. Please note that if you are using bioBakery (Vagrant VM or Google Cloud), you do not need to install KneadData because the tool and its dependencies are already installed.
Install with PyPI:
$ pip install kneaddata
Install with Conda:
conda install -c biobakery kneaddata
Run from a Docker image:
docker run -it biobakery/kneaddata bash
Install from source:
STEP 1 : $ git clone https://github.com/biobakery/kneaddata.git
STEP 2 : $ cd kneaddata
STEP 3 : $ python setup.py install
Please make sure that you have the latest version of KneadData installed before proceeding.
kneaddata --version
To view all options, type kneaddata --help | less. Use the arrow keys to move up and down. Type q to quit back to the prompt.
usage: kneaddata [-h] [--version] [-v] [-i1 INPUT1] [-i2 INPUT2]
[-un UNPAIRED] -o OUTPUT_DIR
[-db REFERENCE_DB] [--bypass-trim]
[--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>]
[-q {phred33,phred64}] [--run-bmtagger] [--run-trf]
[--run-fastqc-start] [--run-fastqc-end] [--store-temp-output]
[--remove-intermediate-output] [--cat-final-output]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log LOG]
[--trimmomatic TRIMMOMATIC_PATH] [--max-memory MAX_MEMORY]
[--trimmomatic-options TRIMMOMATIC_OPTIONS]
[--bowtie2 BOWTIE2_PATH] [--bowtie2-options BOWTIE2_OPTIONS]
[--no-discordant] [--reorder] [--serial]
[--bmtagger BMTAGGER_PATH] [--trf TRF_PATH] [--match MATCH]
[--mismatch MISMATCH] [--delta DELTA] [--pm PM] [--pi PI]
[--minscore MINSCORE] [--maxperiod MAXPERIOD]
[--fastqc FASTQC_PATH]
Step 1: Create a new working directory on your Desktop.
Open a new terminal and type:
cd Desktop
mkdir kneaddata_analysis
Step 2: Downloading demo input files
As input, KneadData accepts FASTQ files. Please click here to download the six demo input fastq files and save the archive in the folder we just created. Alternatively, you can navigate to the folder and use wget or curl -LO to download it.
cd kneaddata_analysis
wget https://github.com/biobakery/kneaddata/files/4703820/input.zip
OR
curl -LO https://github.com/biobakery/kneaddata/files/4703820/input.zip
Use ls to check that input.zip is present in the working directory.
Step 3: Unzip the folder and list its contents
unzip input.zip
ls input
You should see
demo_db.1.bt2 demo_db.2.bt2 demo_db.3.bt2 demo_db.4.bt2 demo_db.rev.1.bt2 demo_db.rev.2.bt2 SE_extra.fastq seq1.fastq seq2.fastq singleEnd.fastq
There are several fastq files. The singleEnd.fastq file contains single-end reads, while seq1.fastq and seq2.fastq contain paired-end reads. The bt2 files are the Bowtie2 database files.
Let us run KneadData on both types of reads.
kneaddata --unpaired input/singleEnd.fastq --reference-db input/demo_db --output kneaddataOutputSingleEnd
The progress is displayed on the terminal. Also, peek at the log file in the output folder to see KneadData's default parameters, values, and workflow.
Initial number of reads ( /home/hutlab_public/Desktop/kneaddata_analysis/input/singleEnd.fastq ): 16902.0
Running Trimmomatic ...
Total reads after trimming ( /home/hutlab_public/Desktop/kneaddata_analysis/kneaddataOutputSingleEnd/singleEnd_kneaddata.trimmed.fastq ): 16457.0
Decontaminating ...
Running bowtie2 ...
Total reads after removing those found in reference database ( /home/hutlab_public/Desktop/kneaddata_analysis/kneaddataOutputSingleEnd/singleEnd_kneaddata_demo_db_bowtie2_clean.fastq ): 15907.0
Total reads after merging results from multiple databases ( /home/hutlab_public/Desktop/kneaddata_analysis/kneaddataOutputSingleEnd/singleEnd_kneaddata.fastq ): 15907.0
Final output file created:
/home/hutlab_public/Desktop/kneaddata_analysis/kneaddataOutputSingleEnd/singleEnd_kneaddata.fastq
The above command takes three arguments (--unpaired, --reference-db, --output), all of which are required. The output from this command is stored in the kneaddataOutputSingleEnd folder, which is created automatically.
Output:
KneadData produces at minimum four primary output files when running a single-end dataset (this number increases for a paired-end dataset). Use ls kneaddataOutputSingleEnd to list the files. What are these files?
| Output File Name | Description |
|---|---|
| singleEnd_kneaddata.fastq | Final output of KneadData after running Trimmomatic + Bowtie2 + optional tools |
| singleEnd_kneaddata_demo_db_bowtie2_contam.fastq | FASTQ file containing reads that were identified as contaminants from the database |
| singleEnd_kneaddata.log | Log file of the KneadData run |
| singleEnd_kneaddata.trimmed.fastq | Trimmed reads after running Trimmomatic |
The kneaddata_read_count_table utility generates a summary report of all the fastq files in the output folder. Note that the input to this utility is the output folder from the kneaddata command, where all the output fastqs live.
kneaddata_read_count_table --input kneaddataOutputSingleEnd --output kneaddata_read_count_table.tsv
Open this file with LibreOffice Calc, or type:
column -t -s $'\t' kneaddata_read_count_table.tsv | less -S
Output:
Sample raw single trimmed single decontaminated demo_db single final single
singleEnd_kneaddata 16902.0 16457.0 15907.0 15907.0
The input file contained 16902 raw reads, of which 445 were removed by trimming, leaving 16457 trimmed reads. Of these, 550 were contaminant reads (check this using $ wc -l kneaddataOutputSingleEnd/singleEnd_kneaddata_demo_db_bowtie2_contam.fastq; remember that fastq files have 4 lines per read), leaving 15907 final high-quality reads which passed contaminant filtering.
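Because every FASTQ record is exactly four lines, line counts convert to read counts by dividing by 4. The sketch below demonstrates this with a throwaway two-read FASTQ file (the file name and contents are made up for illustration):

```shell
# Create a tiny two-read FASTQ for demonstration (4 lines per read)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > /tmp/demo.fastq
# Convert the line count to a read count
lines=$(wc -l < /tmp/demo.fastq)
echo $((lines / 4))   # prints 2
```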
It is a good idea to take a look at the QC scores for reads before and after clean-up i.e. trimming. In general, this practice informs us of the quality of our sequences, how much data we may stand to lose, and how to proceed with downstream analyses. We will use FastQC for this. You can follow the steps below to download and install this software.
Downloading FastQC
- Please click on the link here to get the command line compatible FastQC tool.
Installing FastQC
-
Unzip the downloaded fastqc_v0.11.9.zip using
unzip fastqc_v0.11.9.zip
-
Change your directory to FastQC. You can either double-click the FastQC icon to run the graphical interface, or use KneadData to run FastQC from the command line.
-
Please make sure that the downloaded fastqc is executable. If you get an error, you can change permissions for all files in the FastQC folder with sudo chmod -R 777 FastQC
Using FastQC
- We can invoke FastQC from within KneadData in two ways: --run-fastqc-start and --run-fastqc-end, which check the quality of our reads before and after KneadData has trimmed them. You can also use just one of the two, whichever interests you. We will use another single-end read file for this, called SE_extra.fastq.
kneaddata --unpaired input/SE_extra.fastq --reference-db input/demo_db --output kneaddataOutputFastQC --run-fastqc-start --run-fastqc-end
Output
- The results are in kneaddataOutputFastQC/fastqc. There are two .html files: one for the raw reads (SE_extra_fastqc.html) and another for the processed reads (SE_extra_kneaddata_fastqc.html). We can open these files in a browser. You will see a significant difference in the quality of the reads before and after trimming.
Before:
After:
Additional important arguments
-
[--fastqc <path>]
where <path> is the path to the folder containing the fastqc executable.
For paired-end reads, we have to provide the forward and reverse reads for all sequences, which are stored in two files.
kneaddata --input1 input/seq1.fastq --input2 input/seq2.fastq --reference-db input/demo_db --output kneaddataOutputPairedEnd
Note that there is twice as much output because we now have information for two files. You can get a consolidated report with kneaddata_read_count_table as we saw earlier. The final files (in bold) are the files with high-quality decontaminated reads.
Output:
| Output Files (listed in the order created) | Description |
|---|---|
| seq1_kneaddata.log | Log file of the KneadData run |
| seq1_kneaddata.trimmed.1.fastq | Trimmed reads (Mate 1) output from the paired-end Trimmomatic run |
| seq1_kneaddata.trimmed.2.fastq | Trimmed reads (Mate 2) output from the paired-end Trimmomatic run |
| seq1_kneaddata.trimmed.single.1.fastq | Trimmed reads (only Mate 1 survived) after running Trimmomatic |
| seq1_kneaddata.trimmed.single.2.fastq | Trimmed reads (only Mate 2 survived) after running Trimmomatic |
| seq1_kneaddata_demo_db_bowtie2_paired_contam_1.fastq | Paired reads (Mate 1) identified as contaminants from the database |
| seq1_kneaddata_demo_db_bowtie2_paired_contam_2.fastq | Paired reads (Mate 2) identified as contaminants from the database |
| seq1_kneaddata_demo_db_bowtie2_unmatched_1_contam.fastq | Unmatched reads (Mate 1) identified as contaminants from the database |
| seq1_kneaddata_demo_db_bowtie2_unmatched_2_contam.fastq | Unmatched reads (Mate 2) identified as contaminants from the database |
| **seq1_kneaddata_paired_1.fastq** | Final output of KneadData after running Trimmomatic + Bowtie2 (Mate 1) |
| **seq1_kneaddata_paired_2.fastq** | Final output of KneadData after running Trimmomatic + Bowtie2 (Mate 2) |
| **seq1_kneaddata_unmatched_1.fastq** | Final output of KneadData after running Trimmomatic + Bowtie2 (only Mate 1 survived) |
| **seq1_kneaddata_unmatched_2.fastq** | Final output of KneadData after running Trimmomatic + Bowtie2 (only Mate 2 survived) |
KneadData uses third-party software for trimming and decontamination. The default values for several parameters in this software are carefully picked by our developers so that they work for most real-world sequence data. However, you can change them if you deem it necessary for your data. Let us explore some of these.
KneadData runs Trimmomatic by default for the first quality-control step in the workflow, i.e. trimming. Below are the default values for the Trimmomatic parameters (you will find these when you run kneaddata --help):
--trimmomatic TRIMMOMATIC_PATH
path to trimmomatic
[ DEFAULT : $PATH ]
--max-memory MAX_MEMORY
max amount of memory
[ DEFAULT : 500m ]
--trimmomatic-options TRIMMOMATIC_OPTIONS
options for trimmomatic
[ DEFAULT : ILLUMINACLIP:/TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50 ]
MINLEN is set to 50 bases
Let us change the MINLEN parameter to 90 using --trimmomatic-options and observe how it changes our results.
kneaddata --unpaired input/singleEnd.fastq --output kneaddataOutputTrimmomatic --reference-db input/demo_db --trimmomatic-options="MINLEN:90"
kneaddata_read_count_table --input kneaddataOutputTrimmomatic/ --output kneaddata_read_count_table2.tsv
column -t -s $'\t' kneaddata_read_count_table2.tsv | less -S
Before(MINLEN:50):
Sample raw single trimmed single decontaminated demo_db single final single
singleEnd_kneaddata 16902.0 16457.0 15907.0 15907.0
After:
Sample raw single trimmed single decontaminated demo_db single final single
singleEnd_kneaddata 16902.0 16902.0 16304.0 16304.0
Using the 'MINLEN:90' parameter lets us change the minimum length of reads passing QC from the default 50 bases to 90 bases. However, this more stringent parameter didn't result in a more stringent QC result, as we can see by all '16902' of the reads passing the trimming step. Let's try again, this time keeping the default KneadData parameters with only the 'MINLEN' value changed:
kneaddata --unpaired input/singleEnd.fastq --output kneaddataOutputTrimmomatic2 --reference-db input/demo_db --trimmomatic-options="ILLUMINACLIP:/TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:90"
kneaddata_read_count_table --input kneaddataOutputTrimmomatic2/ --output kneaddata_read_count_table3.tsv
column -t -s $'\t' kneaddata_read_count_table3.tsv | less -S
That looks better:
Sample raw single trimmed single decontaminated demo_db single final single
singleEnd_kneaddata 16902.0 15819.0 15331.0 15331.0
Originally, inclusion of the --trimmomatic-options="MINLEN:90" argument overrode the default --trimmomatic-options, which included the information needed to trim adapters correctly. By including that information along with our changed MINLEN, we make the QC more stringent, as we originally intended: we remove 1083 reads at the QC step. In the previous attempt, without the Illumina adapters provided for Trimmomatic to trim correctly, all of the reads were longer than our specified minimum of 90 bases, so all reads incorrectly passed the QC filter step.
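The per-step losses can be read straight off the read-count table. As a sketch, the snippet below fabricates a stand-in TSV with the counts from this run (a made-up file, not one produced by KneadData) and uses awk to compute the reads removed by trimming and by decontamination:

```shell
# Build a stand-in read-count table with the counts from this run
printf 'Sample\traw single\ttrimmed single\tdecontaminated demo_db single\tfinal single\n' > /tmp/counts.tsv
printf 'singleEnd_kneaddata\t16902\t15819\t15331\t15331\n' >> /tmp/counts.tsv
# Reads removed by trimming ($2-$3) and by decontamination ($3-$4)
awk -F'\t' 'NR==2 { print $2-$3, $3-$4 }' /tmp/counts.tsv   # prints: 1083 488
```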
The full list and description of Trimmomatic parameters is given in the Trimmomatic Manual.
A tandem repeat (TR) in DNA is two or more adjacent, approximate copies of a pattern of nucleotides. KneadData uses TRF to locate and remove reads that look like tandem repeats. The default values for the TRF parameters are as follows:
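To make the idea concrete, here is a crude illustration of what a tandem repeat looks like. This is not how TRF works (TRF scores candidate repeats with a probabilistic alignment model, hence the parameters below); the pattern here simply greps a made-up file for five or more adjacent copies of "AC":

```shell
# Two sequences: one with a tandem repeat of "AC", one without
printf 'ACACACACACAC\nGATTACAGATCC\n' > /tmp/tr_demo.txt
# Match lines containing at least five adjacent copies of the pattern
grep -E '(AC){5,}' /tmp/tr_demo.txt   # prints ACACACACACAC
```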
trf arguments:
--trf TRF_PATH path to TRF
[ DEFAULT : $PATH ]
--match MATCH matching weight
[ DEFAULT : 2 ]
--mismatch MISMATCH mismatching penalty
[ DEFAULT : 7 ]
--delta DELTA indel penalty
[ DEFAULT : 7 ]
--pm PM match probability
[ DEFAULT : 80 ]
--pi PI indel probability
[ DEFAULT : 10 ]
--minscore MINSCORE minimum alignment score to report
[ DEFAULT : 50 ]
--maxperiod MAXPERIOD
maximum period size to report
[ DEFAULT : 500 ]
In KneadData 0.12.0, TRF is run automatically. Inclusion of the flag --bypass-trf
will turn off TRF.
A full list and description of TRF parameters can be found in the TRF Manual.
- Quality Scores:
--quality-scores {phred33,phred64} [ DEFAULT : phred33 ]
Example:
kneaddata --unpaired input/singleEnd.fastq --reference-db input/demo_db --output kneaddataOutput --quality-scores phred64
- Produce a Single Output File: --cat-final-output
Downstream analysis steps with bioBakery tools use a single input fastq file. This can be created by manually combining reads from the forward and reverse fastq files with cat. Alternatively, the flag --cat-final-output will create a single output file for all results. This file will be named seq1.fastq.gz and contain the final paired forward and reverse reads as well as the forward and reverse QC'd orphan reads (seq1_kneaddata_paired_1.fastq, seq1_kneaddata_paired_2.fastq, seq1_kneaddata_unmatched_1.fastq, seq1_kneaddata_unmatched_2.fastq).
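The manual cat route looks like the sketch below. The directory and file names follow the paired-end example above, but here we fabricate tiny stand-in files so the commands are runnable; with real data you would point cat at your own KneadData output folder:

```shell
# Fabricate tiny stand-in outputs (one read each) for demonstration
mkdir -p kneaddataOutputPairedEnd
for f in paired_1 paired_2 unmatched_1 unmatched_2; do
    printf '@%s\nACGT\n+\nIIII\n' "$f" > "kneaddataOutputPairedEnd/seq1_kneaddata_${f}.fastq"
done
# Combine the four final outputs into one FASTQ for downstream tools
cat kneaddataOutputPairedEnd/seq1_kneaddata_paired_1.fastq \
    kneaddataOutputPairedEnd/seq1_kneaddata_paired_2.fastq \
    kneaddataOutputPairedEnd/seq1_kneaddata_unmatched_1.fastq \
    kneaddataOutputPairedEnd/seq1_kneaddata_unmatched_2.fastq \
    > seq1_combined.fastq
echo $(( $(wc -l < seq1_combined.fastq) / 4 ))   # prints 4 (reads)
```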
- Threads and Processes:
--threads <1> [ DEFAULT : 1 ] ; --processes <1> [ DEFAULT : 1 ]
Example:
kneaddata --unpaired input/singleEnd.fastq --reference-db input/demo_db --output kneaddataOutput --threads 4 --processes 4