Skip to content
Switch branches/tags
Go to file
Cannot retrieve contributors at this time
146 lines (108 sloc) 4.88 KB

Estimating gene/read abundances in metagenomes

Authored by Jin Choi for EDAMAME2016


Overarching Goal

  • This tutorial will contribute towards an understanding of quantitative analyses of metagenome data
  • It focuses on estimating abundances of reads to an assembled reference.

Learning Objectives

  • Understanding how to estimate abundances of reads in a representative gene reference
  • Understanding read mapping
  • Understanding mapping file formats
  • Understanding how to use a mapping program (Bowtie2, samtools, bcftools)
  • Apply reference mapping to assess read abundances and quantify gene presence

Reference mapping is useful...

  • If you want to detect SNPs
  • If you want to estimate abundance of genes in metagenomic or metatranscriptomic data


Install mapping software for this tutorial, Bowtie2 and SamTools. BT2_HOME is the default name where Bowtie2 is installed.

Bowtie2 is a read mapping software.

mv bowtie2-2.2.9 BT2_HOME

Set up the path to installed software (you need to set up path again if you are logged in later):

export PATH

Install SamTools. SamTools is a software that we use to work with files that are outputted by Bowtie2. It is a software that is often used with mapping tools. Mapping files are generally very big and get unwieldy because of their size, SamTools helps us deal with these large files in a memory efficient approaches but sometimes it adds a lot of steps at a cost of speeding up analysis.

sudo apt-get -y install samtools

Download data

cd ~/metagenome
tar -zxvf infant_gut.sub.tar.gz

Do the mapping

Now let’s map all of the reads to the reference. Start by indexing the reference genome. Indexing a reference stores in a memory efficient way on your computer:

cd ~/metagenome
bowtie2-build megahit_out/final.contigs.fa reference

Now, do the mapping of the raw reads to the reference genome (the -1 and -2 indicate the paired-end reads):

for x in SRR*_1.sub.fastq.gz;
  do bowtie2 -x reference -1 $x -2 ${x%_1*}_2.sub.fastq.gz -S ${x%_1*}.sam 2> ${x%_1*}.out;

This file contains all of the information about where each read hits our reference assembly contigs.

Next, index the reference genome with samtools. Another indexing step for memory efficiency for a different tool. In the mapping world, get used to indexing since the files are huge:

samtools faidx megahit_out/final.contigs.fa

Convert the SAM into a BAM file (What is the SAM/BAM?):

To reduce the size of a SAM file, you can convert it to a BAM file (SAM to BAM!) - put simply, this compresses your giant SAM file.

for x in *.sam;
  do samtools import megahit_out/final.contigs.fa.fai $x $x.bam;

Sort the BAM file - again this is a memory saving and sometimes required step, we sort the data so its easy to query with our questions:

for x in *.bam;
  do samtools sort $x $x.sorted;

And index the sorted BAM file:

for x in *.sorted.bam;
  do samtools index $x;

Counting alignments

This command:

samtools view -c -f 4 SRR492065.sam.bam.sorted.bam

-c Instead of printing the alignments, only count them and print the total number. All filter options, such as -f, -F, and -q, are taken into account. -f INT Only output alignments with all bits set in INT present in the FLAG field. INT can be specified in hex by beginning with 0x (i.e. /^0x[0-9A-F]+/) or in octal by beginning with 0 (i.e. /^0[0-7]+/) [0].

will count how many reads DID NOT align to the reference (77608).

This command:

samtools view -c -F 4 SRR492065.sam.bam.sorted.bam

-F INT Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified in hex by beginning with 0x (i.e. /^0x[0-9A-F]+/) or in octal by beginning with 0 (i.e. /^0[0-7]+/) [0].

will count how many reads DID align to the reference (122392).

And this command:

gunzip -c SRR492065_1.sub.fastq.gz | wc

will tell you how many lines there are in the FASTQ file (100,000). Reminder: there are four lines for each sequence.

This number give you idea how many sequences are mapped (%)

You can find this information in .out file also.

Make a summary of the counts

for x in *.sorted.bam
do samtools idxstats $x > $x.idxstats.txt

We have some scripts that we use to process this file.

git clone
python mapping_tools/ *.idxstats.txt > counts.txt
less counts.txt

And there you are - you've created an abundance table. Like an OTU count table, you can now use this file for statistical analyses when you do the mapping for multiple samples.