
Fungal ITS2 Pipeline Using Both Reads


Here is a typical processing pipeline for the fungal ITS2 region. I have used a small dataset as an example. The example was sequenced at the University of Minnesota Genomic Center (UMGC), using their dual index protocol. With the advance of longer reads on Illumina's HiSeq and MiSeq, it is now possible to assemble the full-length ITS1 (or ITS2) region from both reads generated by the sequencer. Using the assembled full-length ITS region gives us many advantages compared to using only the Read1 sequences.

I have highlighted several check points in this pipeline. These check points are where I would stop and check the output of the current command. Because the quality of sequencing data is somewhat variable, you may need to adjust the parameters to suit your own data.

The only difference of this pipeline from the ITS1 pipeline is that we are now trimming off the 5.8S head and LSU tail, instead of the SSU head and 5.8S tail, from all sequences. The LSU tail is somewhat variable among sequences, so instead of using cutadapt, we just slice the last 60 bp off each sequence. How to make use of the LSU tail would be an interesting topic, but that is another story. Read this page for more information about the structure of the amplicons.

0. The example dataset

You can download my example dataset for testing the pipeline. This data set contains six soil samples from Cedar Creek, MN, USA, each containing 20,000 sequences randomly picked from the raw files. These samples were sequenced for both the ITS1 and ITS2 regions using MiSeq 300 bp PE runs.

wget https://github.com/ZeweiSong/FAST_example/archive/v1.2.tar.gz
tar xf v1.2.tar.gz
mv FAST_example-1.2/ITS2/* ./

The Read1 and Read2 files are in the two folders (read1 and read2).

You can take a look at the shell script its2_pipeline.sh for all the commands needed (download it here). You can also run the script directly, but I highly recommend running each command separately the first time. Instructions for installing all required programs can be found in its2_pipeline.sh, but you can also follow this page to set up the FAST package, cutadapt, VSEARCH, and PEAR in your working folder.

1. Label and merge all sequencing files.

Generate sample mapping files for both reads:

bin/FAST/fast.py -generate_mapping -i read1 -o read1_map.txt
bin/FAST/fast.py -generate_mapping -i read2 -o read2_map.txt

Check point 0:

Open the two mapping files and check the sample names. If you have used an underscore ("_") to name your samples, the sample names will not be detected automatically. You can change the first column to name your samples. Take a look at this page on how samples are labeled in FAST.

Add labels to all sequences. The label style is "LabelName_Number", where Number is a consecutive number starting from 0 within each sample (file). This style is compatible with QIIME. You can use the less command to take a look at the new files.

bin/FAST/fast.py -add_labels -m read1_map.txt -i read1 -o read1_labeled -t 4
bin/FAST/fast.py -add_labels -m read2_map.txt -i read2 -o read2_labeled -t 4

Merge all labeled sequences:

bin/FAST/fast.py -merge_seqs -i read1_labeled -o read1.fastq
bin/FAST/fast.py -merge_seqs -i read2_labeled -o read2.fastq
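If you want a quick sanity check beyond less, the short Python sketch below counts reads per sample in the merged file. It assumes the labels look like "SampleName_Number" at the start of each header, as described above; adjust it if your labels differ.

from collections import Counter

counts = Counter()
with open('read1.fastq') as handle:
    for i, line in enumerate(handle):
        if i % 4 == 0:                              # FASTQ header line
            label = line[1:].split()[0]             # e.g. "A1_0"
            counts[label.rsplit('_', 1)[0]] += 1    # sample name, e.g. "A1"

for sample, n in sorted(counts.items()):
    print(sample, n)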

Some pipelines prefer to process each file individually before merging them into one. I prefer the other way around, so we don't have to spawn many commands.

2. Trim off sequencing primers.

The two sequences used here are the universal sequencing primers from Illumina; remember to change them if you are using a different protocol (check with your sequencing provider). You can check UMGC's detailed protocol here. We are trying to detect and remove the Read2 sequencing primer from all Read1 sequences, and vice versa. Here we ask cutadapt to keep only sequences longer than 50 bp in both reads; otherwise it will keep all sequences, even those trimmed down to zero length. This is important for PEAR, our paired-end merger, since PEAR cannot handle zero-length sequences.

cutadapt -a CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -A CTGTCTCTTATACACATCTGACGCTGCCGACGA -o read1.cut.fastq -p read2.cut.fastq read1.fastq read2.fastq -m 50

Check point 1:

Take a look at the cutadapt output. Only 1.8% of Read1 sequences contain part of the Read2 sequencing primer, and about 2.2% of Read2 sequences contain part of the Read1 sequencing primer. This is different from what we saw in the ITS1 data, where more than half of the Read1 sequences contained adapter. This indicates that the ITS2 region in our samples is longer than the ITS1 region.

Total read pairs processed:            120,000
  Read 1 with adapter:                   2,114 (1.8%)
  Read 2 with adapter:                   2,682 (2.2%)
Pairs written (passing filters):       120,000 (100.0%)

3. Merge Read1 and Read2 sequences.

Now we can merge both reads together using PEAR:

pear -f read1.cut.fastq -r read2.cut.fastq -o merge.pear.its2 -k -j 4

The -k option tells PEAR to output the original Read2 sequences to the unassembled and discarded files; otherwise they are written in reverse complement form. You can use the -j option to specify the number of CPUs, which is useful for handling a large dataset.

Check point 2:

Take a look at the output of PEAR. We kept 98.5% of the sequences (this is even better than our ITS1 data set). I have tested PEAR on multiple MiSeq 300 bp and HiSeq 250 bp data sets for both the ITS1 and ITS2 regions and always got more than 90% of the sequences assembled. If a large portion of your reads fail to assemble, it is a good idea to check the quality of your data.

Assembled reads ...................: 118,198 / 120,000 (98.498%)
Discarded reads ...................: 0 / 120,000 (0.000%)
Not assembled reads ...............: 1,802 / 120,000 (1.502%)

Why are we using PEAR? Take a look at this page for my tests of paired-end merging programs.

4. Remove 5.8S head and LSU tail.

The PCR primers we used for this data set are 5.8SR and ITS4. Because these primers anneal in the 5.8S and LSU, our ITS2 amplicons carry part of the 5.8S at the head and part of the LSU at the tail. It is important to trim off these two regions since they are highly conserved and will influence our OTU clustering. Also, the reference database, UNITE, trims its records using ITSx, which means UNITE does not contain any SSU or LSU regions. So for ITS2 it is reasonable to at least trim off the LSU tail, since the LSU will never align to the UNITE database during taxonomy assignment.

We can first get the consensus nucleotide frequencies at the head and tail of all sequences:

bin/FAST/fast.py -nucl_freq -i merge.pear.its2.assembled.fastq -o merge.pear.head.txt
bin/FAST/fast.py -nucl_freq -i merge.pear.its2.assembled.fastq -o merge.pear.tail.txt -tail

Check point 3:

Open these two files in Excel and check the nucleotide frequencies. Unlike what we found in the ITS1 data set, it is hard to tell where the conserved region is in our ITS2 data set. I fed these data into ITSx, and it turns out that most of our sequences contain a 125 bp 5.8S head and a 60 bp LSU tail. But the sequences close to ITS2 are also variable.

This leaves us several choices:

  1. We don't trim the sequences. The 5.8S head will still align to UNITE, but the LSU will not.
  2. We trim off 125 bp from the head and 60 bp from the tail, but we will lose the variable region of the 5.8S.
  3. We trim off the 5.8SR primer from the head, and the last 60 bp from the tail, of all sequences.

In this pipeline, I will use the third choice, but it is not the only way to do it. You can read this page for more details about the 5.8S head and LSU tail in ITS2.

We first trim off the head with cutadapt using a 5' anchored adapter, the 5.8SR primer:

cutadapt -g ^TCGATGAAGAACGCAGCG -o merge.pear.its2.cut_f.fastq merge.pear.its2.assembled.fastq --discard-untrimmed

Check point 4:

We have 98.6% of the sequences left. If you are discarding too many sequences, try increasing the error tolerance by setting -e 0.2 (the default is 0.1, or 10%), or use a regular, non-anchored 5' adapter, -g TCGATGAAGAACGCAGCG (by removing the ^).

Total reads processed:                 118,198
Reads with adapters:                   116,529 (98.6%)
Reads written (passing filters):       116,529 (98.6%)

We then trim 60 bp off the tail of every sequence in the previous file; these bases are likely to be LSU.

bin/FAST/fast.py -truncate_seqs -i merge.pear.its2.cut_f.fastq -slice 0,60 -o merge.pear.its2.cut_fr.fastq
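If you are curious what this does to a single record, slicing 60 bp off the tail is the same operation as Python's negative slicing. This is only an illustration of the idea (with a made-up sequence), not how fast.py implements it:

seq = 'ACGT' * 40                # a made-up 160 bp sequence
trimmed = seq[:-60]              # drop the last 60 bases, the putative LSU tail
print(len(seq), len(trimmed))    # 160 100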

5. Discard low quality sequences.

We can now filter the FASTQ file with max expected error (<1):

bin/vsearch --fastq_filter merge.pear.its2.cut_fr.fastq --fastq_maxee 1 --fastaout merge.pear.its2.maxee1.fasta --fasta_width 0
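For reference, the expected error of a read is the sum of the per-base error probabilities implied by its quality scores, E = sum of 10^(-Q/10), and --fastq_maxee 1 keeps reads whose expected number of errors is at most one. A minimal sketch of the calculation, assuming Phred+33 encoded qualities:

def expected_errors(quality_string, offset=33):
    # Sum of per-base error probabilities from a Phred-encoded quality string.
    return sum(10 ** (-(ord(ch) - offset) / 10.0) for ch in quality_string)

print(expected_errors('IIIIIIIIII'))    # ten Q40 bases -> 0.001
print(expected_errors('++++++++++'))    # ten Q10 bases -> 1.0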

Check point 5:

We have kept 109,435 out of 120,000 raw reads at this point.

HiSeq data with a shorter read length (250 bp versus 300 bp on MiSeq) may contain more low quality bases. You can try changing --fastq_maxee to 2 or 3 to keep more sequences.

109435 sequences kept (of which 0 truncated), 7092 sequences discarded.

6. Dereplication and OTU clustering.

Let's dereplicate the sequences so that only unique sequences are kept:

bin/FAST/fast.py -dereplicate -i merge.pear.its2.maxee1.fasta -o raw.qc.derep -t 4

Two files will be generated:

  • raw.qc.derep.txt is an OTU map in which each line lists the labels of all sequences that are identical. This map is in the same style as used by QIIME (see the short parsing sketch below). Remember that dereplication is equivalent to de novo OTU clustering at 100% similarity.
  • raw.qc.derep.fasta is the dereplicated FASTA file, in which all sequences are unique.

These two files together contain exactly the same information as the cleaned data in merge.pear.its2.maxee1.fasta, but sometimes it is hard to keep track of both. That is why, at the end of this pipeline, I merge the OTU map and the FASTA file into a single FAST map. At this point, though, we will still use the two files separately.
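Since the derep map is a plain tab-delimited text file, it is easy to inspect with a few lines of Python. The sketch below assumes each line starts with the unit name followed by the labels of its member sequences (SampleName_Number), as in QIIME's OTU map format; check the first line of your own file to confirm the layout.

from collections import Counter

with open('raw.qc.derep.txt') as handle:
    fields = handle.readline().rstrip('\n').split('\t')

unit, members = fields[0], fields[1:]
per_sample = Counter(label.rsplit('_', 1)[0] for label in members)
print(unit, 'size =', len(members), dict(per_sample))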

From the 109,435 sequences, we found that there are only 24,442 unique ones. For example, the largest dereplicated unit contains 4,815 identical sequences:

Sequences dereplicated, clasped from 109435 into 24442 sequences.
Dereplicated OTU size: Max=4815, Min=1, Average=4.

Usually we will remove singletons, since they are likely to be artifacts:

bin/FAST/fast.py -filter_otu_map -i raw.qc.derep.txt -o raw.qc.derep.size2.txt -min_size 2

Check point 6:

Out of the 24,442 dereplicated units, only 3,960 contain at least two sequences; the others are singletons. The total number of sequences we can use for OTU clustering is 88,953.

Original OTU map:
         OTU=24442 (Total Sequences=109435, Max=4815, Min=1, Ave=4)
Filtered OTU map:
         OTU=3960 (Total Sequences=88953, Max=4815, Min=2, Ave=22)

Now we can use the new dereplicated map to pick a new sequence file, and add a size annotation for VSEARCH:

bin/FAST/fast.py -pick_seqs -i raw.qc.derep.fasta -map raw.qc.derep.size2.txt -o raw.qc.derep.size2.fasta -sizeout

One last step before clustering OTUs: let's check for chimeras. I prefer reference-based chimera checking since it can be done using multiple CPUs. For a small dataset, de novo chimera checking is also possible, but as far as I have tested, we don't have many chimeras left at this point thanks to the previous quality controls:

bin/vsearch --uchime_ref raw.qc.derep.size2.fasta --nonchimeras raw.qc.derep.size2.uchime.fasta --db database/sh_general_release_dynamic_22.08.2016.fasta --sizeout --fasta_width 0 --thread 4

Here the -sizeout option adds a size annotation to each sequence (;size=XXX). This is required by VSEARCH for chimera checking and OTU clustering.
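The headers of the FASTA file now carry that annotation (something like >A1_0;size=4815). If you want to double check, here is a small sketch that lists the five largest dereplicated units; the exact header layout can vary between versions, so treat the pattern as an assumption.

import re

sizes = []
with open('raw.qc.derep.size2.fasta') as handle:
    for line in handle:
        if line.startswith('>'):
            match = re.search(r';size=(\d+)', line)
            if match:
                sizes.append((int(match.group(1)), line[1:].strip()))

for size, header in sorted(sizes, reverse=True)[:5]:
    print(size, header)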

We can now cluster OTUs; -id sets the similarity threshold to 97%:

bin/vsearch --cluster_size raw.qc.derep.size2.uchime.fasta --centroids raw.qc.vsearch.fasta --fasta_width 0 -id 0.97 --sizein --uc raw.qc.uc.txt --threads 4

7. Generate OTU table and representative sequences.

VSEARCH does not output an OTU map, so we will use its UC-style output to generate one:

bin/FAST/fast.py -parse_uc_cluster -i raw.qc.uc.txt -o raw.qc.vsearch.txt

We now have two sets of OTU maps with their corresponding sequences:

Dereplicate map:

  • raw.qc.derep.size2.txt <--- This is the dereplicated map
  • raw.qc.derep.size2.uchime.fasta <--- This is the dereplicated sequences

OTU map:

  • raw.qc.vsearch.txt <--- This is the OTU map
  • raw.qc.vsearch.fasta <--- This is the OTU sequences

The OTU map and its sequences contain information on how the dereplicated sequences cluster together, while the dereplicate map and its sequences contain information on the abundance (count) of each dereplicated unit in every sample. Combining them gives us the OTU table.
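To make the combination concrete, here is a toy sketch of the logic; all names and counts below are invented for illustration, and FAST does this internally through the hybrid map generated in the next commands.

from collections import Counter

derep_map = {                                     # dereplicate map: unit -> sequence labels
    'derep_1': ['C1_0', 'C1_8', 'C2_3'],
    'derep_6': ['B1_2', 'B2_5'],
}
otu_map = {'OTU_1': ['derep_1', 'derep_6']}       # OTU map: OTU -> member units

otu_table = {}
for otu, units in otu_map.items():
    counts = Counter()
    for unit in units:
        for label in derep_map[unit]:
            counts[label.rsplit('_', 1)[0]] += 1  # sample name from the label
    otu_table[otu] = dict(counts)

print(otu_table)    # {'OTU_1': {'C1': 2, 'C2': 1, 'B1': 1, 'B2': 1}}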

In the FAST pipeline, I use a new style that combines an OTU map and its sequences into a single file. It is in JSON format, so it can be readily read into a Python dictionary. I'll have a page explaining this later.

Combine the Dereplicate map and sequences:

bin/FAST/fast.py -generate_fast_map -map raw.qc.derep.size2.txt -seq raw.qc.derep.size2.uchime.fasta -o fast.derep.txt -derep

Combine the OTU map and sequences:

bin/FAST/fast.py -generate_fast_map -map raw.qc.vsearch.txt -seq raw.qc.vsearch.fasta -o fast.otu.txt -otu

Combine the FAST derep map and OTU map into a single hybrid map:

bin/FAST/fast.py -combine_fast_map -derep_map fast.derep.txt -otu_map fast.otu.txt -o fast.hybrid.txt

Rename the OTUs so that their names start with OTU_:

bin/FAST/fast.py -rename_otu_map -fast_map fast.hybrid.txt -o fast.hybrid.otu.txt

Generate the OTU table from the FAST hybrid map, along with the representative sequences:

bin/FAST/fast.py -make_otu_table -fast_map fast.hybrid.otu.txt -o otu_table.txt -rep rep_seq.fasta

Let's take a look at the sequencing depth and OTU count for each sample:

bin/FAST/fast.py -summary_otu_table -otu otu_table.txt -o otu_report.txt
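If you prefer to compute these numbers yourself, the OTU table can also be loaded with pandas. The sketch below assumes a tab-delimited table with OTUs as rows and samples as columns, which is the usual layout; check the header of otu_table.txt first, since non-numeric columns (for example taxonomy) are simply ignored here.

import pandas as pd

table = pd.read_csv('otu_table.txt', sep='\t', index_col=0)
numeric = table.select_dtypes('number')
print(numeric.sum(axis=0))          # sequencing depth per sample
print((numeric > 0).sum(axis=0))    # OTU count per sample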

8. Assign taxonomy and filter.

Check point 7:

This is where you can use any method you prefer to assign taxonomy to the OTUs. Many people use BLAST (including me), but many other alignment programs are equally good or better (such as VSEARCH, which I use here). Bayesian classifiers such as RDP and SINTAX in USEARCH are also good alternatives. I am just providing an example here:

We can now align our OTU sequences to the UNITE database using VSEARCH. What I output using the --userout option is "Name of OTU + Aligned UNITE record + Length of OTU sequence + Length of OTU sequence aligned to reference + Similarity".

bin/vsearch --usearch_global rep_seq.fasta -db database/sh_general_release_dynamic_22.08.2016.fasta --userout taxa.vsearch.txt --userfields query+target+ql+pairs+id --id 0.6

Using these alignment results, I usually filter on the percentage of the query length that aligned and on the similarity, to keep OTUs that are likely to belong to the fungal kingdom. With the following command, any OTU that aligns over more than 70% of its length to a reference with a similarity higher than 75% is kept. I picked these thresholds based on Tedersoo et al. 2015.

bin/FAST/fast.py -filter_taxonomy -i taxa.vsearch.txt -op taxa.vsearch.fungi.txt -match_length 0.7 -pident 75
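For clarity, this is roughly what the filter does, applied by hand to the VSEARCH user output. The columns follow the --userfields order above (query, target, ql, pairs, id); this is only a sketch, and whether FAST treats the thresholds as inclusive is an assumption here.

kept = []
with open('taxa.vsearch.txt') as handle:
    for line in handle:
        query, target, ql, pairs, pident = line.rstrip('\n').split('\t')
        if float(pairs) / float(ql) >= 0.7 and float(pident) >= 75:
            kept.append((query, target, pident))

print(len(kept), 'OTUs pass the fungal filter')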

We can now discard all OTUs that are likely not fungi from our OTU table:

bin/FAST/fast.py -assign_taxonomy -otu otu_table.txt -tax taxa.vsearch.fungi.txt -o otu_table.fungi.txt -remove_unassigned

From all fungal OTUs, keep the taxonomy assignments with similarity higher than 97%. This is very stringent filtering; you can lower the value if you want to include more assignments.

bin/FAST/fast.py -filter_taxonomy -i taxa.vsearch.fungi.txt -op taxa.vsearch.species.txt -match_length 0 -pident 97

Assign taxa to the OTUs that pass the similarity threshold for species; the others will be labeled "no_hit". You can also attach all matching scores to the OTU table by using the -scores option.

bin/FAST/fast.py -assign_taxonomy -otu otu_table.fungi.txt -tax taxa.vsearch.species.txt -o otu_table.taxa.txt

9. Rarefy the OTU table.

Let's take a look at the sequencing depth and OTU count again for each sample:

bin/FAST/fast.py -summary_otu_table -otu otu_table.taxa.txt -o otu_report2.txt

If you sum up all samples, we get 91,058 sequences in the unrarefied OTU table, which is 75.8% of our raw sequences. It seems that 14,000 is a good depth to rarefy our samples.

Usually people will then rarefy the OTU table down to a certain depth (often the lowest sample depth). In this pipeline, the rarefaction can be iterated many times for each sample (here 1,000 times), and the iteration with the median OTU richness is picked for the final OTU table. Rarefy the OTU table to a given depth with 1,000 iterations:

bin/FAST/fast.py -rarefy_otu_table -otu otu_table.taxa.txt -o otu.table.taxa.rare.txt -d 11000 -iter 1000 -t 4

The file otu.table.taxa.rare.txt is the final product of our data set.
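For intuition, here is a rough sketch of the median-richness idea for a single sample. It is a toy illustration only, not FAST's implementation; the function name, the random seed, and the toy counts are all made up.

import numpy as np

rng = np.random.default_rng(42)

def rarefy_median(counts, depth, iterations=1000):
    # Subsample one sample's OTU counts to a fixed depth many times and
    # return the iteration whose OTU richness is the median.
    counts = np.asarray(counts)
    pool = np.repeat(np.arange(counts.size), counts)   # one entry per sequence
    draws = []
    for _ in range(iterations):
        picked = rng.choice(pool, size=depth, replace=False)
        draws.append(np.bincount(picked, minlength=counts.size))
    richness = [(d > 0).sum() for d in draws]
    return draws[int(np.argsort(richness)[len(richness) // 2])]

print(rarefy_median([50, 30, 15, 5], depth=10, iterations=100))    # e.g. [5 3 1 1]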

10. The structure of OTUs

Similar to the ITS1 data, we can further deconstruct each OTU into the abundance distribution of its unique member sequences.

Since each OTU is actually an aggregation of many dereplicated sequences, we can take a look at the structure of each individual OTU. If we assume good sequencing quality, any difference among these dereplicated sequences is a nucleotide polymorphism. Let's take a look at an example:

bin/FAST/fast.py -otu_deconstruct -map fast.hybrid.otu.txt -o otu_deconstruct

Open the file OTU_1.txt and we can see that samples B1 and B2 are dominated by "derep_6" and "derep_11", while samples C1 and C2 are dominated by derep_1:

If we align these sequences, we can see there are several SNPs among them:

CLUSTAL O(1.2.3) multiple sequence alignment

derep_11      AAATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATA
derep_1       AAATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATA
derep_6       AAATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATA
              ************************************************************

derep_11      TTGCGCTCTCTGGTATTCCGGAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAACTC
derep_1       TTGCGCTCTCTGGTATTCCGGAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAACTC
derep_6       TTGCGCTCTCTGGTATTCCGGAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAACTC
              ************************************************************

derep_11      CCCTTTCTTTTTTGAAATGGGAGCTGGACTTGAGTGATCCCAACGCTTTTCTCACCGAAA
derep_1       CCTTTTCTTTTTTGAAATGGGAGCTGGACTTGAGTGATCCCAACGCTTCTCTCACCGAGA
derep_6       CCCTTTCTTTTTTGAAATGGGAGCTGGACTTGAGTGATCCCAACGCTTTTCTCACCGAAA
              ** ********************************************* *********.*

derep_11      AGCGGCGGGTTACTTGAAATGCAGGTGCAGCTGGACTTTTCTCTGAGCTATAAGCATATC
derep_1       AGTGGCGGGTTACTTGAAATGCAGGTGCAGCTGGACTTTTCTCTGAGCTATAAGCATATC
derep_6       AGTGGCGGGTTACTTGAAATGCAGGTGCAGCTGGACTTTTCTCTGAGCTAAAAGCATATC
              ** ***********************************************:*********

derep_11      TATTTAGTCTGCCTAAAAAACAGATTATTACCTTTGCTGCAGCTAACATAAAGGAGATTA
derep_1       TATTTAGTCTGCCTAAAAAACAGATTATTACCTTTGCTGCAGCTAACATAAAGGAGATTA
derep_6       TATTTAGTCTGCCTAAAAAACAGATTATTACCTTTGCTGCAGCTAACATAAAGGAGATTA
              ************************************************************

derep_11      GTTCTTGTGCTGACTGATGCAGGATTCACAGAGACAGCTTCGGCTGGCCTTTGTAA
derep_1       GTTCTTATGCTGACTGATGCAGGATTCACAGAGACAGCTTCGGCTGACTTTGTAA-
derep_6       GTTCTTGTGCTGACTGATGCAGGATTCACAGAGACAGCTTCGGCTGGCTTTGTAA-
              ******.***************************************.* **  :* 

Are these real polymorphisms or just artifacts of sequencing error? And if they are real, how can we use this information? I don't have a good answer for this and would like to hear what you think!