
Fungal ITS1 Pipeline Using Read1 Sequences


This page is about a typical processing pipeline using only the Read1 sequences of the fungal ITS1 region. The example was sequenced at the University of Minnesota Genomics Center (UMGC), using their dual-index protocol.

I have highlighted several check points in this pipeline. These check points are where I would stop and check the output of the current command. Because the quality of sequencing data is somewhat variable, you may need to adjust the parameters to suit your own data.

Follow this pipeline if you want to use both Read1 and Read2 for paired-end merging. Paired-end merging usually generates better quality data than using Read1 alone.

0. The example dataset

You can download my example dataset for testing. This data set contains six soil samples from Cedar Creek, MN, USA, each containing 20,000 sequences randomly picked from the raw files. These samples were sequenced for both the ITS1 and ITS2 regions using MiSeq 300 bp PE runs.

wget https://github.com/ZeweiSong/FAST_example/archive/v1.1.tar.gz
tar xf v1.1.tar.gz
mv FAST_example-1.1/ITS1/* ./

The Read1 and Read2 files are in the two folders.
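If you want to make sure the files are in place, a quick listing of the Read1 folder (the exact file names depend on the sequencing center's naming) should show one fastq file per sample:

ls read1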

1. Label and merge all sequencing files.

Generate a sample mapping file for the Read1 folder:

bin/FAST/fast.py -generate_mapping -i read1 -o read1_map.txt

Check point 0:

Open the mapping file and check the sample names. If you used underscores ("_") to name your samples, the sample names will not be detected automatically. You can change the first column to rename your samples. Take a look at this page on how samples are labeled in FAST.
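A quick way to eyeball the mapping file before editing it (the first column holds the sample names that FAST will use as labels, so anything you change there becomes the label):

column -t read1_map.txt | less -S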

Add labels to all sequences. The label style is "LabelName_Number", where Number is a consecutive count starting from 0 within each sample (file). This style is compatible with QIIME. You can use the less command to take a look at the new files.

bin/FAST/fast.py -add_labels -m read1_map.txt -i read1 -o read1_labeled -t 4

Merge all labeled sequences:

bin/FAST/fast.py -merge_seqs -i read1_labeled -o read1.fastq
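To confirm that the labels were added, print a few sequence headers from the merged file; they should now contain your sample names followed by an underscore and a running number:

awk 'NR % 4 == 1' read1.fastq | head -5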

2. Trim off sequencing primers.

We will remove the Read2 sequencing primer from all Read1 sequences:

cutadapt -a CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -o read1.cut.fastq read1.fastq

Check point 1. Here is the report:

Total reads processed:                 120,000
Reads with adapters:                    75,198 (62.7%)
Reads written (passing filters):       120,000 (100.0%)
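As a rough sanity check (grep only catches exact matches, so reads with sequencing errors in the adapter are missed), you can count how many reads still contain the adapter before and after trimming; the second number should be close to zero:

grep -c CTGTCTCTTATACACATCTCCGAGCCCACGAGAC read1.fastq
grep -c CTGTCTCTTATACACATCTCCGAGCCCACGAGAC read1.cut.fastq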

3. Remove SSU and 5.8S.

Similar to the paired-end merged data, we can trim off the 67 bp SSU head. Go to this page for more details on why we trim this head off.

cutadapt -g ^CTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATT -o read1.its1.cut_f.fastq read1.cut.fastq --discard-untrimmed

Total reads processed:                 120,000
Reads with adapters:                   113,869 (94.9%)
Reads written (passing filters):       113,869 (94.9%)

We can also trim off the 5.8S tail, but this time we will keep all sequences, since some ITS1 regions may be too long for Read1 to reach the 5.8S:

cutadapt -a AACTTTCAACAACGGATCTCTTGGYTCTSGCATCGATGAAGAACGCAGC -o read1.its1.cut_fr.fastq read1.its1.cut_f.fastq

Here is the report:

Total reads processed:                 113,869
Reads with adapters:                    99,317 (87.2%)
Reads written (passing filters):       113,869 (100.0%)

4. Discard low quality sequences.

For quality filtering, we will trim each sequence from the 3' end until its maximum expected error is no more than 1. This is actually a feature I asked for!

bin/vsearch --fastq_filter read1.its1.cut_fr.fastq --fastq_truncee 1 --fastaout read1.its1.maxee1.fasta --fasta_width 0

Let's get an estimate on the length distribution of our data:

bin/FAST/fast.py -stat_seqs -i read1.its1.maxee1.fasta -o read1.its1.maxee1.report.txt

It seems that if we truncate our sequences at 150 bp, we will keep about 80% of the sequences. Truncating them shorter will give us more sequences, but we lose variation in the ITS region. Truncating them longer will give us fewer sequences. This is also a good example of why we should use paired-end merging for the full-length ITS1 or ITS2 region. The USEARCH author has a [good page](Unpaired reads that cover partial amplicons) explaining whether or not to truncate sequences.
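If you want to double check the report, a one-liner can count how many of the quality-trimmed sequences are at least 150 bp long (this assumes one sequence per line, which is what --fasta_width 0 gives us):

awk 'NR % 2 == 0 && length($0) >= 150 {n++} END {print n, "of", NR/2, "sequences are at least 150 bp"}' read1.its1.maxee1.fasta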

Let's truncate the sequences anyway:

bin/FAST/fast.py -truncate_seqs -i read1.its1.maxee1.fasta -fixed_length 150 -o read1.its1.maxee1.L150.fasta

Here is the report:

read1.its1.maxee1.fasta contains 113869 records.
Cutting sequences to a fixed length: 150 ...
84839 sequences were cut to 150 and save in read1.its1.maxee1.L150.fasta.

5. Dereplication and OTU clustering.

All of the following commands are the same as in the paired-end merge pipeline.

Dereplication:

bin/FAST/fast.py -dereplicate -i read1.its1.maxee1.L150.fasta -o raw.qc.derep -t 4

Remove singletons:

bin/FAST/fast.py -filter_otu_map -i raw.qc.derep.txt -o raw.qc.derep.size2.txt -min_size 2
bin/FAST/fast.py -pick_seqs -i raw.qc.derep.fasta -map raw.qc.derep.size2.txt -o raw.qc.derep.size2.fasta -sizeout

Remove chimeras:

bin/vsearch --uchime_denovo raw.qc.derep.size2.fasta --nonchimeras raw.qc.derep.size2.uchime.fasta --sizeout --fasta_width 0

Cluster OTU using VSEARCH:

bin/vsearch --cluster_size raw.qc.derep.size2.uchime.fasta --centroids raw.qc.vsearch.fasta --fasta_width 0 -id 0.97 --sizein --uc raw.qc.uc.txt --threads 4
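Counting the FASTA headers after each step is a cheap way to watch how many sequences survive dereplication, singleton removal, chimera checking, and clustering:

grep -c ">" raw.qc.derep.fasta raw.qc.derep.size2.fasta raw.qc.derep.size2.uchime.fasta raw.qc.vsearch.fasta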

Parse the UC output for an OTU map:

bin/FAST/fast.py -parse_uc_cluster -i raw.qc.uc.txt -o raw.qc.vsearch.txt

Generate the FAST maps:

bin/FAST/fast.py -generate_fast_map -map raw.qc.derep.size2.txt -seq raw.qc.derep.size2.uchime.fasta -o fast.derep.txt -derep
bin/FAST/fast.py -generate_fast_map -map raw.qc.vsearch.txt -seq raw.qc.vsearch.fasta -o fast.otu.txt -otu
bin/FAST/fast.py -combine_fast_map -derep_map fast.derep.txt -otu_map fast.otu.txt -o fast.hybrid.txt

Rename all OTUs:

bin/FAST/fast.py -rename_otu_map -fast_map fast.hybrid.txt -o fast.hybrid.otu.txt

Make an OTU table:

bin/FAST/fast.py -make_otu_table -fast_map fast.hybrid.otu.txt -o otu_table.txt -rep rep_seq.fasta

6. Assign taxonomy and filter.

This is where you can use any method you prefer to assign taxonomy to the OTUs. Many people use BLAST (including me), but many other alignment programs are equally good or better than BLAST (such as VSEARCH, which I use here). Bayesian classifiers such as RDP and SINTAX in USEARCH are also good alternatives. I am just providing an example here:

We can now align our OTU sequences to the UNITE database using VSEARCH. What I output using the --userout option is "Name of OTU + aligned UNITE record + length of the OTU sequence + length of the OTU sequence aligned to the reference + similarity".

bin/vsearch --usearch_global rep_seq.fasta -db database/sh_general_release_dynamic_22.08.2016.fasta --userout taxa.vsearch.txt --userfields query+target+ql+pairs+id --id 0.6
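The --userout file is plain tab-separated text, so we can peek at a few hits, and at the lowest similarities, before settling on the filtering thresholds (column 5 is the percent identity, following the --userfields order above):

head -3 taxa.vsearch.txt
cut -f5 taxa.vsearch.txt | sort -n | head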

Using the alignment scores, I usually filter on the percentage of the query length that aligned and on the similarity, to keep OTUs that are likely to belong to the fungal kingdom. With the following command, any OTU that aligns more than 70% of its sequence to a reference with a similarity higher than 75% is kept. I picked these thresholds based on the paper of Tedersoo et al. 2015.

bin/FAST/fast.py -filter_taxonomy -i taxa.vsearch.txt -op taxa.vsearch.fungi.txt -match_length 0.7 -pident 75

We can now discard all OTUs that are likely not fungi from our OTU table:

bin/FAST/fast.py -assign_taxonomy -otu otu_table.txt -tax taxa.vsearch.fungi.txt -o otu_table.fungi.txt -remove_unassigned

From all fungal OTUs, keep the taxonomy assignments with similarity higher than 97%. This is a very stringent filter; you can lower the value if you want to include more assignments.

bin/FAST/fast.py -filter_taxonomy -i taxa.vsearch.fungi.txt -op taxa.vsearch.species.txt -match_length 0 -pident 97

Assign taxa to the OTUs that pass the species-level similarity threshold; the others will be labeled "no_hit". You can also attach all matching scores to the OTU table by using the -scores option.

bin/FAST/fast.py -assign_taxonomy -otu otu_table.fungi.txt -tax taxa.vsearch.species.txt -o otu_table.taxa.txt

7. Rarefy the OTU table.

Let's take a look at the sequencing depth and OTU count for each sample:

bin/FAST/fast.py -summary_otu_table -otu otu_table.taxa.txt -o otu_report2.txt

If we sum up all samples, we get 91,058 sequences in the unrarefied OTU table, which is 75.8% of our raw sequences. It seems that 14,000 is a good depth to rarefy our samples.

Usually people will then rarefy the OTU table down to a certain depth (often the lowest sample depth). In this pipeline, the rarefaction can be iterated multiple times for each sample (1,000 times here), and the iteration with the median OTU richness will be picked for the final OTU table. Rarefy the OTU table to the chosen depth with 1,000 iterations:

bin/FAST/fast.py -rarefy_otu_table -otu otu_table.taxa.txt -o otu.table.taxa.rare.txt -d 9000 -iter 1000 -t 4

The file otu.table.taxa.rare.txt is the final product of our data set.