This notebook shows how to map paired-end metagenome reads to a reference genome using `bbmap` in Bash.

Sample data from ENA:

Arctic Ocean metagenomes sampled aboard CGC Healy during the 2015 GEOTRACES Arctic research cruise

- Secondary Study Accession: ERP015773
- Study Title: Arctic Ocean metagenomes from HLY1502
- Center Name: UNIVERSITY OF ALASKA FAIRBANKS
- Study Name: Arctic Ocean metagenomes
- ENA-FIRST-PUBLIC: 2016-05-27
- ENA-LAST-UPDATE: 2016-05-25

The data can be found at: https://www.ebi.ac.uk/ena/browser/view/PRJEB14154?show=reads

I used the first five pairs of generated FASTQ files.

For the sake of brevity, I will use just one reference genome for mapping in this exercise, [E. coli K-12 substr. MG1655](https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2). Other reference genomes can be found at https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/

Move into the content folder:

In [1]:
cd example_content

Examine the contents. Notice that all of the .fastq.gz files are in separate subfolders, one per run accession.

In [2]:
ls

ERR1424899	ERR1424900	ERR1424901	ERR1424902	ERR1424903


Move up one directory and create a new subdirectory to collect all of the .fastq.gz files in one place, then check with `ls` that the directory was made.

In [3]:
cd ..

Before moving anything, make a copy of the original data.

In [4]:
cp -R example_content example_content_copy

In [5]:
mkdir all_data

In [6]:
ls

Mapping_simple.ipynb			example_content_copy
Untitled.ipynb				references
all_data				trimming_classification_basic.ipynb
example_content				trimming_classification_moreqc.ipynb
example_content_2


In [7]:
cd example_content

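Locate all files ending in .gz in the subfolders of this directory and move them to `all_data`. The `*` wildcard means that any characters can precede .gz. The `-mindepth 2` option tells `find` to skip anything shallower than two levels below the starting point (depth 0 is the starting directory `.` itself, depth 1 is its immediate contents, depth 2 is the files inside each subfolder). `-type f` restricts the match to regular files. In the `-exec` action, `{}` is replaced by the path of each matching file, so `mv` runs once per file, and `\;` marks the end of the command. The `-print` action prints each path so you can monitor which files are moved.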

In [8]:
find . -mindepth 2 -type f -name '*.gz' -print -exec mv {} ../all_data \;

./ERR1424899/ERR1424899_1.fastq.gz
./ERR1424899/ERR1424899_2.fastq.gz
./ERR1424900/ERR1424900_1.fastq.gz
./ERR1424900/ERR1424900_2.fastq.gz
./ERR1424901/ERR1424901_1.fastq.gz
./ERR1424901/ERR1424901_2.fastq.gz
./ERR1424902/ERR1424902_1.fastq.gz
./ERR1424902/ERR1424902_2.fastq.gz
./ERR1424903/ERR1424903_1.fastq.gz
./ERR1424903/ERR1424903_2.fastq.gz
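
To see what `-mindepth 2` does without touching real data, here is a minimal sketch that rebuilds a similar layout in a temporary directory. The sandbox, the `S1`/`S2` folders, and the file names are hypothetical placeholders, not part of the dataset.

```shell
# Build a throwaway layout that mimics example_content (all names are made up).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/example_content/S1" "$sandbox/example_content/S2" "$sandbox/all_data"
touch "$sandbox/example_content/S1/S1_1.fastq.gz" \
      "$sandbox/example_content/S2/S2_1.fastq.gz"
touch "$sandbox/example_content/top_level.gz"   # depth 1: -mindepth 2 should skip it

# Dry run: only print the paths that would be moved.
find "$sandbox/example_content" -mindepth 2 -type f -name '*.gz' -print

# Actual move: {} is replaced by each matching path, one mv call per file.
find "$sandbox/example_content" -mindepth 2 -type f -name '*.gz' \
     -exec mv {} "$sandbox/all_data" \;

# Count what ended up in all_data (tr strips the padding some wc versions add).
moved=$(find "$sandbox/all_data" -type f | wc -l | tr -d ' ')
echo "moved: $moved"
```

Running the `find` command with `-print` alone first is a cheap way to confirm the match list before adding `-exec mv`.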


In [9]:
cd ../all_data

In [10]:
ls

ERR1424899_1.fastq.gz	ERR1424901_1.fastq.gz	ERR1424903_1.fastq.gz
ERR1424899_2.fastq.gz	ERR1424901_2.fastq.gz	ERR1424903_2.fastq.gz
ERR1424900_1.fastq.gz	ERR1424902_1.fastq.gz
ERR1424900_2.fastq.gz	ERR1424902_2.fastq.gz


In [11]:
rm -r ../example_content
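
Since `rm -r` cannot be undone, one option is to confirm that no regular files are left behind before deleting. The following is only a sketch of that check, run in a temporary directory with hypothetical folder names standing in for the emptied `example_content`.

```shell
# Hypothetical stand-in for example_content after the files were moved out.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/example_content/RUN_A" "$sandbox/example_content/RUN_B"

# Count regular files that are still inside the directory tree.
leftover=$(find "$sandbox/example_content" -type f | wc -l | tr -d ' ')

if [ "$leftover" -eq 0 ]; then
    rm -r "$sandbox/example_content"    # only empty subfolders remain
    status="removed"
else
    status="kept"                       # something was not moved; inspect first
fi
echo "$status ($leftover files left)"
```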

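Define the path to the reference. The path is relative to the parent directory, so the mapping command further down prefixes it with `../`.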

In [13]:
ref_path="references/genome_assemblies_genome_fasta_ecoli_K12_MG1655/GCF_000005845.2_ASM584v2_genomic.fna.gz"

Create the output directories:

In [16]:
mkdir ../reports

In [17]:
mkdir ../bams

In [24]:
mkdir ../indices # Location for the index/genome files (passed to bbmap as path=), so they are not written to the current working directory.
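
The three `mkdir` calls above can also be written as one call with `-p`, which does not fail if a directory already exists, so the cell can be re-run safely. A minimal sketch, where `$base` is a hypothetical temporary stand-in for the parent directory used in this notebook:

```shell
# "$base" is a throwaway directory standing in for the notebook's parent folder.
base=$(mktemp -d)

# Create all three output directories at once.
mkdir -p "$base/reports" "$base/bams" "$base/indices"

# Running it a second time succeeds silently instead of raising "File exists".
mkdir -p "$base/reports" "$base/bams" "$base/indices"
rerun_status=$?

ls "$base"
```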

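Let's run a basic `bbmap.sh` command to map one pair of reads to the reference genome. Here `outm` writes only the mapped reads to the BAM file, `minid=0.9` sets the approximate minimum alignment identity, `path` is where the index is written, `covstats`, `covhist`, and `basecov` write coverage reports, and `-Xmx3200m` caps the Java heap at 3200 MB.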

In [25]:
bbmap.sh ref=../${ref_path} in1=ERR1424899_1.fastq.gz in2=ERR1424899_2.fastq.gz outm=../bams/ecoli_ERR1424899.bam minid=0.9 path=../indices covstats=../reports/ecoli_ERR1424899_stats.txt covhist=../reports/ecoli_ERR1424899_covhist.txt  basecov=../reports/ecoli_ERR1424899_basecov.txt -Xmx3200m

/usr/local/Cellar/bbtools/38.95/libexec//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx3200m -Xms3200m -cp /usr/local/Cellar/bbtools/38.95/libexec/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=../references/genome_assemblies_genome_fasta_ecoli_K12_MG1655/GCF_000005845.2_ASM584v2_genomic.fna.gz in1=ERR1424899_1.fastq.gz in2=ERR1424899_2.fastq.gz outm=../bams/ecoli_ERR1424899.bam minid=0.9 path=../indices covstats=../reports/ecoli_ERR1424899_stats.txt covhist=../reports/ecoli_ERR1424899_covhist.txt basecov=../reports/ecoli_ERR1424899_basecov.txt -Xmx3200m
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=../references/genome_assemblies_genome_fasta_ecoli_K12_MG1655/GCF_000005845.2_ASM584v2_genomic.fna.gz, in1=ERR1424899_1.fastq.gz, in2=ERR1424899_2.fastq.gz, outm=../bams/ecoli_ERR1424899.bam, minid=0.9, path=../indices, covstats=../reports/ecoli_ERR1424899_stats.txt, covhist=../reports/ecoli_ERR1424899_covhist.txt, basecov=../re
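
The command above maps a single pair. To cover all five pairs, the same command could be wrapped in a loop. The sketch below is a dry run: it only prints one `bbmap.sh` command per sample rather than executing it, and it uses a temporary directory with empty placeholder files (an assumption made so the example runs without the data). All `bbmap.sh` parameters are copied from the command above.

```shell
# Hypothetical sandbox with empty placeholder files for two of the samples.
workdir=$(mktemp -d)
for s in ERR1424899 ERR1424900; do
    touch "$workdir/${s}_1.fastq.gz" "$workdir/${s}_2.fastq.gz"
done
ref_path="references/genome_assemblies_genome_fasta_ecoli_K12_MG1655/GCF_000005845.2_ASM584v2_genomic.fna.gz"

cmds=""
for r1 in "$workdir"/*_1.fastq.gz; do
    sample=$(basename "$r1" _1.fastq.gz)     # e.g. ERR1424899
    r2="${r1%_1.fastq.gz}_2.fastq.gz"        # matching reverse-read file
    # Skip samples whose mate file is missing.
    [ -f "$r2" ] || { echo "missing mate for $sample" >&2; continue; }

    cmd="bbmap.sh ref=../${ref_path} in1=${sample}_1.fastq.gz in2=${sample}_2.fastq.gz outm=../bams/ecoli_${sample}.bam minid=0.9 path=../indices covstats=../reports/ecoli_${sample}_stats.txt covhist=../reports/ecoli_${sample}_covhist.txt basecov=../reports/ecoli_${sample}_basecov.txt -Xmx3200m"
    echo "$cmd"                              # dry run: print, do not execute
    cmds="${cmds}${cmd}
"
done
```

In the notebook itself, the loop would run from inside `all_data` over `*_1.fastq.gz`, and the `echo "$cmd"` line would be replaced by the actual `bbmap.sh` call once the printed commands look right.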

In [23]:
bbmap.sh --help


BBMap
Written by Brian Bushnell, from Dec. 2010 - present
Last modified February 13, 2020

Description:  Fast and accurate splice-aware read aligner.
Please read bbmap/docs/guides/BBMapGuide.txt for more information.

To index:     bbmap.sh ref=<reference fasta>
To map:       bbmap.sh in=<reads> out=<output sam>
To map without writing an index:
    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk

in=stdin will accept reads from standard in, and out=stdout will write to 
standard out, but file extensions are still needed to specify the format of the 
input and output files e.g. in=stdin.fa.gz will read gzipped fasta from 
standard in; out=stdout.sam.gz will write gzipped sam.

Indexing Parameters (required when building the index):
nodisk=f                Set to true to build index in memory and write nothing 
                        to disk except output.
ref=<file>              Specify the reference sequence.  Only do this ONCE, 
                        when buildin