Intro to QIIME
#Intro to QIIME for amplicon analysis
Authored by Ashley Shade, with contributions by Sang-Hoon Lee and Siobhan Cusack
EDAMAME tutorials have a CC-BY license. Share, adapt, and attribute please!
- This tutorial will contribute towards an understanding of microbial amplicon analysis
- Install auxillary software on the QIIME EC2 image
- Subsample a large amplicon dataset for workflow development and tutorial testing
- Assemble paired-end reads
- Execute a shell script to automate a process
- Explore input and output files for QIIME workflows and scripts
- Understand the structure and components of a good mapping file
- Move sequences into the QIIME environment from an outside tool using "add_qiime_labels.py"
- Obtain summary information about sequence files (fasta, fna, fastq)
- Define operational taxaonomic units (OTUs)
- Execute a QIIME workflow, and understand the separate steps in the workflow
- Align sequences, assign taxonomy, and build a tree with representative sequences from OTU definitions
###Handout of workflow:
##1.1 Getting started For this tutorial, we will be using the 16S sequencing data that we previously downloaded and unzipped. Let's connect to our EC2 instance, and then wget our data.
wget https://s3.amazonaws.com/edamame/EDAMAME_16S.tar.gz tar -zxvf EDAMAME_16S.tar.gz
One we've done that, we'll navigate to the directory containing those files:
You should see 54 files, all ending in .fastq.
##1.2 Subsampling We initially had 54 samples with ~200,000 reads each, which is way too big to handle for a workshop, or for troubleshooting and developing an analysis workflow. In order to efficiently process these data, we had to subsample them to 10,000 reads per sample. We already did this for you, but let's practice subsampling on this smaller dataset so that you know how to do it.
We used Seqtk for subsampling, so first we'll have to install Seqtk.
git clone https://github.com/lh3/seqtk.git
Navigate to the new seqtk directory to execute the next commands.
sudo make sudo cp seqtk /usr/local/bin
Now that we've installed Seqtk, we'll run this code to randomly pick 500 reads from each of our samples. Navigate to the EDAMAME_16S/Fastq folder, then run seqtk:
seqtk sample -s 100 C01D01F_sub.fastq 500 > C01D01F_sub500.fastq
This command is saying "pull 500 sequences from the fastq file specified." The -s 100 flag specifies using a random seed (100). For paired-end reads that are going to be merged, it is important to specify the exact same seed so that the reads can be matched correctly later. We should have a new file with subsampled sequences in it called C01D01_sub500.fastq.
Sanity check: how do we know that this new file actually has 500 sequences in it? If we insepct the header of the fastq file, we'll see that every sequences begins with a character string "@HWI." Thus, we can use grep to count the number of @HWI's in the file to count the number of sequences.
grep -c @HWI C01D01F_sub500.fastq
Our sequences are 16S rRNA amplicons sequenced with MiSeq. We will use PANDAseq to assemble the forward and reverse reads.
###1.3.1. Install pandaseq onto the QIIME AMI instance First we will need to install pandaseq as it is not included in the QIIME environment. Move back to your home directory before installing.
This should not return anything since we have not yet installed it.
git clone http://github.com/neufeld/pandaseq.git/ cd pandaseq ./autogen.sh && ./configure && make && sudo make install sudo ldconfig which pandaseq
If pandaseq has installed properly, this will return "/usr/local/bin/pandaseq"
###1.3.2. Join paired-ends with PANDAseq
Now, navigate back to the EDAMAME 16S subsampled dataset directory, and use
mkdir to create a new directory called "pandaseq_merged_reads"
Then, execute the command to join paired-end reads with PANDAseq:
pandaseq -f Fastq/C01D01F_sub.fastq -r Fastq/C01D01R_sub.fastq -w pandaseq_merged_reads/C01D01_merged.fasta -g pandaseq_merged_reads/C01D01_merged.log -B -A simple_bayesian -l 253 -L 253 -o 47 -O 47 -t 0.9
Let's look carefully at the anatomy of this command.
pandaseqcalls the package of pandaseq scripts.
-fC01D01F_sub.fastq tells the script where to find the forward read.
-rtells the script where to find its matching reverse read.
-w pandaseq_merged_reads/C01D01_merged.fastadirects the script to make a new fasta file of the assembled reads and to put it in the "pandaseq_merged_reads" directory.
-g C01D01_merged.logselects an option of creating a log file.
-Bmeans that the input sequences do not have a barcode.
-Ais the algorithm used for assembly.
-Lspecifies the maximum length of the assembled reads, which, in truth, should be 253 bp. This is a very important option to specify, otherwise PANDAseq will assemble a lot of crazy-long sequences.
-Ospecifies the amount of overlap allowed between the forward and reverse sequences.
-tis a quality score between 0 and 1 that each sequence must meet to be kept in the final output. In our experience, this is a very important parameter - we did some testing of the threshold and found that 0.9 was optimal.
All of the above information, and more options, are fully described in the PANDAseq Manual.. The log file includes details as to how well the merging went.
###1.3.3. Sanity check and file inspection.
There are some questions you may be having: What does pandaseq return? How do I know that it returned sequences the size I expect? Are the primers still attached?
It turns out that we had primers/barcodes removed before we started merging, but let's pretend for a minute that we had no idea about that. How could we check? Say we know that one of the barcode sequences is: GTCTAATTCCGA
grep GTCTAATTCCGA C01D01_merged.fasta
When you execute the above command, the terminal does not return anything. This means that the primer sequence was not found in the file.
We can double check our sanity by using a positive control. Let's use
grep to find a character string that we know is there, for instance the "M03127" string identifying the first sequence.
grep M03127 C01D01_merged.fasta
Whoa! That is hard to read all of those lines. Let's put the results into a list by appending
> list.txt to the command. The ">" symbol means to output the results to a new file, which is specified next.
grep M03127 C01D01_merged.fasta > list.txt
This creates a new file called "list.txt", in which all instances of the character string "M03127" are provided. Let's look at the head.
Our positive control worked, and we should be convinced and joyous that we executed
grep correctly AND that the primers were trimmed by PANDAseq. We can now remove the list file.
Let's also remove the merged file and its log; we're going to make a new one using AUTOMATION in a second.
rm C01D01_merged.fasta rm C01D01_merged.log
###1.3.4. Automate paired-end merging with a shell script. This shell script was written by Joshua Herr for EDAMAME2014
We would have to execute an iteration of the PANDAseq command for every pair of reads that need to be assembled. This could take a long time. So, we'll use a shell script to automate the task. You'll also need this list of file names.
To automatically download the script and list onto the AMI, first navigate to the "Fastq" directory, use
curl to get the raw files, and make a new pandaseq_merged_reads directory to put the merged reads into.
curl -O https://raw.githubusercontent.com/edamame-course/2015-tutorials/master/QIIME_files/list2.txt
curl -O https://raw.githubusercontent.com/edamame-course/2015-tutorials/master/QIIME_files/Cen_pandaseq_merge.sh mkdir pandaseq_merged_reads
Change permissions on the script to make it executable:
chmod 755 Cen_pandaseq_merge.sh
Execute the script from the Fastq Directory. Bonus: A great opportunity to try tmux!
###1.3.5 Sanity check #2.
How many files were we expecting from the assembly? There were 54 pairs to be assembled, and we are generating one assembled fasta and one log for each. Thus, the pandaseq_merged_reads directory should contain 108 files. Navigate up one directory, and then use the
wc (word count) command to check the number of files in the directory.
ls -1 pandaseq_merged_reads | wc -l
The terminal should return the number "108." Let's move this whole directory up one level so that we can access more easily with QIIME:
mv pandaseq_merged_reads/ ..
Congratulations, you lucky duck! You've assembled paired-end reads!
##1.4 Understanding the QIIME mapping file
QIIME requires a mapping file for most analyses. This file is important because it links the sample IDs with their metadata (and, with their primers/barcodes if using QIIME for quality-control).
Let's spend few moments getting to know the mapping file. Navigate to the MappingFiles subdirectory in the EDAMAME_16S/MappingFiles directory.
A clear and comprehensive mapping file should contain all of the information that will be used in downstream analyses. The mapping file includes both categorical (qualitative) and numeric (quantitative) contextual information about a sample. This could include, for example, information about the subject (sex, weight), the experimental treatment, time or spatial location, and all other measured variables (e.g., pH, oxygen, glucose levels). Creating a clear mapping file will provide direction as to appropriate analyses needed to test hypotheses. Basically, all information for all anticipated analyses should be in the mapping file.
Hint: Mapping files are also a great way to organize all of the data for posterity in the research group, and can provide a clear framework for making a database. New lab members interested in repeating the analysis should have all of the required information in the mapping file. PIs should ask their students to curate and deposit both mapping files and raw data files.
Guidelines for formatting map files:
- Mapping files should be tab-delimited
- The first column must be "#SampleIDs" (commented out using the
- SampleIDs are VERY IMPORTANT. Choose wisely! Ideally, a user who did not design the experiment should be able to distinguish the samples easily. In QIIME, SampleIDs must be alphanumeric characters or periods. They cannot have underscores.
- The last column must be "Description".
- There can be as many in-between columns of contextual data as needed.
- If you plan to use QIIME for quality control (which we do not need because the PANDAseq merger included QC), the BarcodeSequence and LinkerPrimer sequence columns are also needed, as the second and third columns, respectively.
- Excel can cause formatting heartache. See more details here.
##1.5 Getting assembled reads into the one big ol' data file, and extracting summary information
QIIME expects all of the data to be in one file, and, currently, we have one separate fastq file for each assembled read. We will add labels to each sample and merge into one fasta file using the
add_qiime_labels.py script. Documentation is here.
Navigate back to the EDAMAME_16S/ directory, then execute the following command:
add_qiime_labels.py -i pandaseq_merged_reads/ -m MappingFiles/Centralia_Full_Map.txt -c InputFastaFileName -n 1
Inspect the new file "combined_seqs.fna."
Observe that QIIME has added the SampleIDs from the mapping file to the start of each sequence. This allows QIIME to quickly link each sequence to its sampleID and metadata.
While we are inspecting the combined_seqs.fna file, let's confirm how many sequences we have in the dataset.
count_seqs.py -i combined_seqs.fna
This is a nice QIIME command to call frequently, because it provides the total number of sequences in a file, as well as some information about the lengths of those sequences. But, suppose we wanted to know more than the median/mean of these sequences?
Another trick with QIIME is that you can call all the mothur commands within the QIIME environment, which is very handy. mothur offers a very useful command called
summary.seqs, which operates on a fasta/fna file to give summary statistics about its contents. We will cover mothur in all its glory later, but for now, execute the command:
Note that both summary.seqs and count_seqs.py have returned the same total number of seqs in the .fna file. Use the following command to quit the mothur environment and return to QIIME.
##1.6 Picking Operational Taxonomic Units, OTUs. ###1.6.1. Preamble Picking OTUs is sometimes called "clustering," as sequences with some threshold of identity are "clustered" together to into an OTU.
Important decision: Should I use a de-novo method of picking OTUs or a reference-based method, or some combination? (Or not at all?). The answer to this will depend, in part, on what is known about the community a priori. For instance, a human or mouse gut bacterial community will have lots of representatives in well-curated 16S databases, simply because these communities are relatively well-studied. Therefore, a reference-based method may be preferred. The limitation is that any taxa that are unknown or previously unidentified will be omitted from the community. As another example, a community from a lesser-known environment, like Mars or a cave, or a community from a relatively less-explored environment would have fewer representative taxa in even the best databases. Therefore, one would miss a lot of taxa if using a reference-based method. The third option is to use a reference database but to set aside any sequences that do not have good matches to that database, and then to cluster these de novo.
For this tutorial, we are going to use an OTU-picking approach that uses a reference to identify as many OTUs as possible, but also includes any "new" sequences that do not hit the database. It is called "open reference" OTU picking, and you can read more about it in this paper by Rideout et al. Discussion of the workflow by the QIIME developers is here
We use the QIIME workflow command:
pick_open_reference_otus.py for this step. Documentation is here.
The default QIIME 1.9.1 method for OTU picking is uclust, but we will use the usearch61 algorithm because it incorporates a chimera-check and other quality filtering. However, we encourage you to explore different OTU clustering algorithms to understand how they perform. They are not created equal.
###1.6.2 Installing usearch61 While in the home directory, get the usearch and usearch61 files:
cd .. curl -O https://raw.githubusercontent.com/edamame-course/2015-tutorials/master/QIIME_files/usearch5.2.236_i86linux32 curl -O https://raw.githubusercontent.com/edamame-course/2015-tutorials/master/QIIME_files/usearch6.1.544_i86linux32
Now we need to put them in the correct path and change the permissions:
sudo cp usearch5.2.236_i86linux32 /usr/local/bin/usearch sudo chmod +x /usr/local/bin/usearch sudo cp usearch6.1.544_i86linux32 /usr/local/bin/usearch61 sudo chmod +x /usr/local/bin/usearch61 print_qiime_config.py -tf
This should show that the install did not have any failures.
###1.6.3. OTU picking
This next step will take about 45 minutes. Bonus: A great opportunity to try tmux!
Before continuing, make sure you are in the "EDAMAME_16S" directory.
pick_open_reference_otus.py -i combined_seqs.fna -m usearch61 -o usearch61_openref/ -f
In the above script:
- We tell QIIME to look for the input file
- We chose the clustering method usearch61
- We specify that output files should go in a new folder, usearch61_openref/
- We tell the program to overwrite already-existing files in the folder if we are running this program more than once (-f)
Other default parameters of interest:
- Singletons are removed from the OTU table (default flag --min_otu_size)
- Alignment is performed with PyNAST
- Taxonomy is assigned with uclust
- Default workflow values can be changed using a parameter file
- We do not perform prefiltering, as per the recommendations of Rideout et al.
###1.6.4. Exploring results from OTU picking workflow #####"Steps" from open reference picking
In usearch61_openref/, we can see several new directories and files. Let's explore them, starting with the "step1.." directories. As the documentation for pick_open_reference_otus.py explains:
- Step 1 picks OTUs based on a reference database, producing a file of successfully clustered OTUs and a file of sequences that failed to cluster based on the reference database.
- Step 2 performs computationally expensive de novo clustering for a subset of the failed sequences from step 1, and picks a representative sequence from each new cluster to add to the original database.
- Step 3 picks OTUs from all of the failed sequences, not just the subset used in step 2, based on the new database generated (of ref+ de novos) in step 2.
- Step 4 performs de novo clustering on all remaining OTUs.
An great overview of the steps of the open-reference process is provided by Figure 1 of Rideout et al. 2014.
If you navigate into one of the "step" directories, wou will see a series of output files, including representative sequences of OTU clusters ("rep_set"). Take your time to explore these files using the
more commands. Then, navigate back to the usearch61_openrefs directory.
What are the other directories? The open reference OTU picking also automatically takes the next steps towards building the OTU table. The pynast_aligned_seqs directory and the uclust_assigned_taxonomy each have outputs and logs from alignments and taxonomic assignment, respectively. Notice that the directories are named so that the algorithm/tool used to perform the task is provided (e.g., pynast was used for alignment, uclust was used for taxonomy). Very smart!
#####Alignment output Navigate into the pynast directory. There are three files waiting there: one file of sequences that failed to align, one of sequences that did align, one of "pre-alignment" files, and a log file. Inspect each.
If you want to build a tree in with some other out-of-QIIME software, this is where you would find the rep_set alignment. The log file provides a rep-sequence by rep-sequence report. If you needed align sequences as an independent step, you would use
align_seqs.py; documentation here.
How many failed alignments were there?
count_seqs.py -i rep_set_failures.fasta
We see that there were 32 rep. sequences that failed to align, and approximately 24831 that did. (Also, notice what short-read alignments generally look like...not amazing).
Sanity check? If you like, BLAST the top sequence that failed to align to convince yourself that it is, indeed, a pitiful failure.
If, in the future, you ever have a large proportion of rep seqs that fail to align, it could be due to:
- Hooray! These are novel organisms! (But, think about the novelty of the habitat before jumping to conclusions.)
- The alignment parameters were too stringent for short reads, causing "real" sequences to fail alignment.
- The paired-end merger algorithm (e.g., pandaseq) did not do a perfect job, and concatenated ends that do not belong together.
- Some combination of the above, as well as some other scenarios.
The failed-to-align sequences are filtered automatically with this QIIME otu-picking workflow (really, the removing the entire OTU cluster that they represent); the filtered results are in the no_pynast_failures.biom file.
In the "taxonomy" directory, you will find a log file and the specific taxonomic assignments given to each representative sequence, linked to the OTU ID of that representative sequence
#####Other Very Important Files in usearch_openref/
- OTU map
Explore this file. It links the exact sequences in each sample to its OTU ID. You should see an OTU ID (starting with the number of the first OTU that was picked) in the the left most column. After that number, there is a list of Sequence IDs that have been clustered into that OTU ID. The first part of the sequence ID is the SampleID from which it came, and the second part is the sequence number within that sample.
You will notice that some files have "mc2" appended to them. "mc2" designates that the nminimum count of sequences for any particular OTU was 2. In other words, that file has singleton OTUs already filtered out.
Sanity check: How can you compare the OTUs in the full dataset versus the singletons-omitted dataset?
- Biom-formatted OTU tables
These tables have the extension ".biom" There are lots of important resources for understanding and working with the "biome" formatted tables, which were developed to deal with very large, sparse datasets, like those for microbial communities. There are several versions - some omitting singletons (mc2), some containing taxonomic assignment of otus (w_tax), some omitting alignment failures (_no_pynast_failures).
- Representative sequences (one from each OTU)
This is not an alignment, but the list of representative sequences used to assign taxonomy to the OTU, to make the alignment, and to build the tree.
- Phylogenetic tree m
You can import this tree into any tree-visualization software that accepts the .tre extension (Newick format). This is made from from an alignment of representative sequences (in the pynast directory). The OTU ID is given first, and then the branch length to the next node. This format is generally compatible with other tree-building and viewing software. For example, I have used it to input into the Interactive Tree of Life to build visually appealing figures. Topiary Explorer is another options for visualization, and is a QIIME add-on.
- Log files Open them up! You will be delighted! It has all of the information you ever wanted to know about the parameters and tools you've just used for your workflow analysis! Hint: most of this information is needed when writing the methods sections of manuscripts using sequencing data.
##Congratulations! You just had the QIIME of Your Life!
#Resources and help
- QIIME offers a suite of developer-designed tutorials.
- Documentation for all QIIME scripts.
- There is a very active QIIME Forum on Google Groups. This is a great place to troubleshoot problems, responses often are returned in a few hours!
- The QIIME Blog provides updates like bug fixes, new features, and new releases.
- QIIME development is on GitHub.
- Remember that QIIME is a workflow environment, and the original algorithms/software that are compiled into QIIME must be referenced individually (e.g., PyNAST, RDP classifier, FastTree etc...)