Skip to content

kneaddata

jjensen44 edited this page Jun 12, 2023 · 41 revisions

KneadData Tutorial Badge

KneadData is a tool designed to perform quality control on metagenomic sequencing data, especially data from microbiome experiments. This tool aims to perform principled in silico separation of bacterial reads from these "contaminant" reads, be they from the host, from bacterial 16S sequences, or other user-defined sources.

For more information, see here.

Please direct your questions to KneadData bioBakery Support Forum.

NOTE: This tutorial corresponds to Kneaddata v0.12.0


Table of contents


Overview

Install

KneadData can be installed with Conda, PyPi,run from a Docker image or can be manually installed. Please note that if you are using bioBakery (Vagrant VM or Google Cloud), you do not need to install KneadData because the tool and its dependencies are already installed.


Install with PyPi:

$ pip install kneaddata

Requirements:


Install with Conda:

conda install -c biobakery kneaddata

Requirements :


Install with Docker:

docker run -it biobakery/kneaddata bash

Requirements :


Requirements :

Install from source:

STEP 1 : $ git clone https://github.com/biobakery/kneaddata.git
STEP 2 : $ cd kneaddata
STEP 3 : $ python setup.py install


Running KneadData

Please make sure that you have the latest version of KneadData installed before proceeding. kneaddata --version.

To view all options, type kneaddata --help | less. Use the arrow key to move up and down. Type q to quit back to the prompt.

usage: kneaddata [-h] [--version] [-v] [-i1 INPUT1] [-i2 INPUT2]
                 [-un UNPAIRED] -o OUTPUT_DIR
                 [-db REFERENCE_DB] [--bypass-trim]
                 [--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>]
                 [-q {phred33,phred64}] [--run-bmtagger] [--run-trf]
                 [--run-fastqc-start] [--run-fastqc-end] [--store-temp-output]
                 [--remove-intermediate-output] [--cat-final-output]
                 [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log LOG]
                 [--trimmomatic TRIMMOMATIC_PATH] [--max-memory MAX_MEMORY]
                 [--trimmomatic-options TRIMMOMATIC_OPTIONS]
                 [--bowtie2 BOWTIE2_PATH] [--bowtie2-options BOWTIE2_OPTIONS]
                 [--no-discordant] [--reorder] [--serial]
                 [--bmtagger BMTAGGER_PATH] [--trf TRF_PATH] [--match MATCH]
                 [--mismatch MISMATCH] [--delta DELTA] [--pm PM] [--pi PI]
                 [--minscore MINSCORE] [--maxperiod MAXPERIOD]
                 [--fastqc FASTQC_PATH]

Input Files

Step 1: Create a new working directory on your Desktop. Open a new terminal and type:

cd Desktop
mkdir kneaddata_analysis

Step 2: Downloading demo input files As input, KneadData can use FASTQ files. Please click here to download the 6 demo input fastq files. Save this at the correct location i.e. the folder we just created. Alternatively you can navigate to the folder and use wget or curl -LO to download.

cd kneaddata_analysis 
wget https://github.com/biobakery/kneaddata/files/4703820/input.zip

OR 

curl -LO https://github.com/biobakery/kneaddata/files/4703820/input.zip

Use ls to check if input.zip is present in the working directory.

Step 2: Unzip the folder and list its contents

unzip input.zip
ls input

You should see

demo_db.1.bt2  demo_db.2.bt2  demo_db.3.bt2  demo_db.4.bt2  demo_db.rev.1.bt2  demo_db.rev.2.bt2  SE_extra.fastq  seq1.fastq  seq2.fastq  singleEnd.fastq

There are several fastq files. The singleEnd.fastq file is a single-end reads file while the seq1.fastq and seq2.fastq files are the paired-end reads files. The bt2 files are the Bowtie2 database files.

Let us run KneadData on both types of reads.

Single End Reads

kneaddata --unpaired input/singleEnd.fastq --reference-db input/demo_db --output kneaddataOutputSingleEnd

The progress is displayed on the terminal. Also, peek the log file in the output folder to see KneadData default parameters, values and workflow.

Initial number of reads ( /home/hutlab_public/Desktop/kneaddata_analysis/input/singleEnd.fastq ): 16902.0
Running Trimmomatic ... 
Total reads after trimming ( /home/hutlab_public/Desktop/kneaddata_analysis/kneaddataOutputSingleEnd/singleEnd_kneaddata.trimmed.fastq ): 16457.0
Decontaminating ...
Running bowtie2 ... 
Total reads after removing those found in reference database ( /home/hutlab_public/Desktop/kneaddata_analysis/kneaddataOutputSingleEnd/singleEnd_kneaddata_demo_db_bowtie2_clean.fastq ): 15907.0
Total reads after merging results from multiple databases ( /home/hutlab_public/Desktop/kneaddata_analysis/kneaddataOutputSingleEnd/singleEnd_kneaddata.fastq ): 15907.0
Final output file created: 
/home/hutlab_public/Desktop/kneaddata_analysis/kneaddataOutputSingleEnd/singleEnd_kneaddata.fastq

The above command takes three arguments (--unpaired, --reference-db, --output). Please note that all of the three arguments are required. The output from this command is stored in the kneaddataOutputSingleEnd which is automatically created.

Output: KneadData produces at mininimum four primary output files when running a single-ended dataset (this number will increase when dealing with a paired-end dataset). Use ls kneaddataOutputSingleEnd to list the files. What are these files?

Output File Name Description
singleEnd_kneaddata.fastq Final output of KneadData after running Trimmomatic + Bowtie2+ optional tools
singleEnd_kneaddata_singleEnd_db_bowtie2_contam.fastq FASTQ file containing reads that were identified as contaminants from the database
singleEnd_kneaddata.log Log file of the kneaddata run
singleEnd_kneaddata.trimmed.fastq This file has trimmed reads after running Trimmomatic

Kneaddata Count Table

The kneaddata_read_count_table utility generates the summary report of all the fastq files in the output folder. Notice that the input to this utility is the output folder from the kneaddata command where all output fastqs live.

kneaddata_read_count_table --input kneaddataOutputSingleEnd --output kneaddata_read_count_table.tsv

Open this file with Libre Office Calc or type

column -t -s $'\t' kneaddata_read_count_table.tsv | less -S

Output:

Sample               raw single  trimmed single  decontaminated demo_db single  final single
singleEnd_kneaddata  16902.0       16457.0           15907.0                          15907.0

The input file contained 16902 raw reads of which 455 were removed post-trimming, hence 16457 trimmed reads. Of these 550 were contaminant reads (check this using $ wc -l kneaddataOutputSingleEnd/singleEnd_kneaddata_singleEnd_db_bowtie2_contam.fastq; remember that fastq files have 4 lines per read) and hence we have 15907 final high-quality reads which passed contamination.

Checking with FastQC

It is a good idea to take a look at the QC scores for reads before and after clean-up i.e. trimming. In general, this practice informs us of the quality of our sequences, how much data we may stand to lose, and how to proceed with downstream analyses. We will use FastQC for this. You can follow the steps below to download and install this software.

Downloading FastQC

  • Please click on the link here to get the command line compatible FastQC tool.

Installing FastQC

  • Unzip the downloaded fastqc_v0.11.9.zip using unzip fastqc_v0.11.9.zip

  • Change your directory to FastQC and either you can double click on FastQC icon to run the graphical interface or use Kneaddata to run FastQC using command line.

  • Please make sure that the downloaded fastqc is executable. If you get an error, you can change permissions for all files in the FastQC folder with

    sudo chmod -R 777 FastQC
    

Using FastQC

  • We can invoke FastQC from within KneadData in two ways: --run-fastqc-start and --run-fastqc-end to check the quality of our reads before and after KneadData has trimmed them. You can even do any one of those which is of interest to you. We will use another single-end read file for this called SE_extra.fastq.
    kneaddata --unpaired input/SE_extra.fastq --reference-db input/demo_db --output kneaddataOutputFastQC --run-fastqc-start --run-fastqc-end

Output

  • The results are at kneaddataOutputFastQC/fastqc. There are 2 .html files. One for the raw:SE_extra_fastqc.html and another for the processed SE_extra_kneaddata_fastqc.html reads. We can open these files with the browser. You will see that there is a significant difference in the quality of reads before and after trimming.

Before:

After:

Additional important arguments

  • [--fastqc <path>] where is the path to the folder where the fastqc executable is.

Paired End Reads

For paired end reads, we have to provide the forward and reverse reads for all sequences which are stored in 2 files.

kneaddata --input1 input/seq1.fastq --input2 input/seq2.fastq --reference-db input/demo_db --output kneaddataOutputPairedEnd 

Note that the output is twice as much because we now have information for 2 files. You can get a consolidated report with kneaddata_read_count_table like we saw earlier. The final files (in bold) are the files with high-quality decontaminated reads.

Output:

Output Files (Listed in the order created) Description
seq1_kneaddata.log Log file of the kneaddata run
seq1_kneaddata.trimmed.1.fastq This file has trimmed reads (Mate 1) as a output of Paired Ends run in Trimmomatic
seq1_kneaddata.trimmed.2.fastq This file has trimmed reads (Mate 2) as a output of Paired Ends run in Trimmomatic
seq1_kneaddata.trimmed.single.1.fastq This file has trimmed reads (only Mate 1 survived) after running Trimmomatic
seq1_kneaddata.trimmed.single.2.fastq This file has trimmed reads (only Mate 2 survived) after running Trimmomatic
seq1_kneaddata_demo_db_bowtie2_paired_contam_1.fastq FASTQ file containing reads that were identified as contaminants from the database.
seq1_kneaddata_demo_db_bowtie2_paired_contam_2.fastq FASTQ file containing reads that were identified as contaminants from the database.
seq1_kneaddata_demo_db_bowtie2_unmatched_1_contam.fastq This file includes reads (Mate 1) that were identified as contaminants from the database
seq1_kneaddata_demo_db_bowtie2_unmatched_2_contam.fastq This file includes reads (Mate 2) that were identified as contaminants from the database
seq1_kneaddata_paired_1.fastq Final output of KneadData after running Trimmomatic + Bowtie2 for seq1
seq1_kneaddata_paired_2.fastq Final output of KneadData after running Trimmomatic + Bowtie2 for seq1
seq1_kneaddata_unmatched_1.fastq Final output of KneadData after running Trimmomatic + Bowtie2 for seq1 (only Mate 1 survived)
seq1_kneaddata_unmatched_2.fastq Final output of KneadData after running Trimmomatic + Bowtie2 for seq1 (only Mate 2 survived)

Changing defaults

KneadData uses third party software for trimming and decontamination. The default values for several parameters in these software (to be used by KneadData) are carefully picked by our developers so that they will apply/work for most real world sequence data. However, there are provisions to change these if you deem it necessary for your data. Let us explore some of these.


Trimmomatic

Kneaddata runs Trimmomatic as default for the first quality control step in the workflow i.e. trimming. See below the default values for Trimmomatic parameters (you will find these when you $kneaddata --help.

  --trimmomatic TRIMMOMATIC_PATH
                        path to trimmomatic
                        [ DEFAULT : $PATH ]
  --max-memory MAX_MEMORY
                        max amount of memory
                        [ DEFAULT : 500m ]
  --trimmomatic-options TRIMMOMATIC_OPTIONS
                        options for trimmomatic
                        [ DEFAULT : ILLUMINACLIP:/TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50 ]
                        MINLEN is set to 50 bases

Let us change the MINLEN parameter to 90 using the --trimmomatic-options= and observe how it changes our results.

kneaddata --unpaired input/singleEnd.fastq --output kneaddataOutputTrimmomatic --reference-db input/demo_db --trimmomatic-options="MINLEN:90"
kneaddata_read_count_table --input kneaddataOutputTrimmomatic/ --output kneaddata_read_count_table2.tsv
column -t -s $'\t' kneaddata_read_count_table2.tsv | less -S

Before(MINLEN:50):

Sample               raw single  trimmed single  decontaminated demo_db single  final single
singleEnd_kneaddata  16902.0       16457.0           15907.0                          15907.0

After:

Sample               raw single  trimmed single  decontaminated demo_db single  final single
singleEnd_kneaddata  16902.0       16902.0           16304.0                          16304.0                        

Using the 'MINLEN:90' parameter let us change the minimum length of reads passing QC from the default 50 bases to 90 bases. However, this more stringent parameter didn't result in a more stringent QC result, as we can see by all '16902' of the reads passing QC. Let's try again, this time keeping the default KneadData parameters with only the 'MINLEN' value changed:

kneaddata --unpaired input/singleEnd.fastq --output kneaddataOutputTrimmomatic2 --reference-db input/demo_db --trimmomatic-options="ILLUMINACLIP:/TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:90"
kneaddata_read_count_table --input kneaddataOutputTrimmomatic2/ --output kneaddata_read_count_table3.tsv
column -t -s $'\t' kneaddata_read_count_table3.tsv | less -S

That looks better:

Sample               raw single  trimmed single  decontaminated demo_db single  final single
singleEnd_kneaddata  16902.0       15819.0           15331.0                          15331.0

Originally, inclusion of the --trimmomatic-options"MINLEN:90" argument overrode the default --trimmomatic-options, which included information to trim adapters correctly. By including this info with our changed MINLEN, we make the QC more stringent as we originally intended: we remove 1083 reads at the QC step. Without providing the Illumina adaptors to Trimmomatic to trim correctly, as in the previous attempt, all of the reads were longer than our specified minimum of 90 bases and all reads passed the QC filter step, incorrectly.

The full list and description of Trimmomatic parameters is given in the Trimmomatic Manual.


Tandem Repeats Finder

A tandem repeat (TR) in DNA is two or more adjacent, approximate copies of a pattern of nucleotides. Kneaddata uses TRF to locate and remove reads which look like TR. The default values for TRF parameters are as follows:

trf arguments:
  --trf TRF_PATH        path to TRF
                        [ DEFAULT : $PATH ]
  --match MATCH         matching weight
                        [ DEFAULT : 2 ]
  --mismatch MISMATCH   mismatching penalty
                        [ DEFAULT : 7 ]
  --delta DELTA         indel penalty
                        [ DEFAULT : 7 ]
  --pm PM               match probability
                        [ DEFAULT : 80 ]
  --pi PI               indel probability
                        [ DEFAULT : 10 ]
  --minscore MINSCORE   minimum alignment score to report
                        [ DEFAULT : 50 ]
  --maxperiod MAXPERIOD
                        maximum period size to report
                        [ DEFAULT : 500 ]

In KneadData 0.12.0, TRF is run automatically. Inclusion of the flag --bypass-trf will turn off TRF. A full list and description of TRF parameters can be found in the TRF Manual.


Other KneadData options

  • Quality Scores: --quality-scores {phred33,phred64} [ DEFAULT : phred33 ]

Example:

  kneaddata --unpaired input/singleEnd.fastq --reference-db input/demo_db --output kneaddataOutput --quality-scores phred64
  • Produce a Single Output File:
--cat-final-output

Downstream analysis steps with bioBakery tools use a single input fastq file. This can be created by combining reads from forward and reverse fastq files together manually with cat. Alternatively, the flag --cat-final-output will create a single output for all results. This file will be named seq1.fastq.gz and contain final paired forward and reverse reads as well as forward and reverse QC'd orphan reads. (seq1_kneaddata_paired_1.fastq, seq1_kneaddata_paired_2.fastq, seq1_kneaddata_unmatched_1.fastq, seq1_kneaddata_unmatched_2.fastq)

  • Threads and Processes: --threads <1> [ Default : 1 ];--processes <1> [ Default : 1 ]

Example:

kneaddata --unpaired input/singleEnd.fastq --reference-db input/demo_db --output kneaddataOutput --threads 4 --processes 4

Clone this wiki locally