This notebook shows how to map paired-end metagenome reads to a reference genome using `bbmap` in Bash.

Sample data from ENA:

Arctic Ocean metagenomes sampled aboard CGC Healy during the 2015 GEOTRACES Arctic research cruise. Secondary Study Accession: ERP015773. Study Title: Arctic Ocean metagenomes from HLY1502. Center Name: UNIVERSITY OF ALASKA FAIRBANKS. Study Name: Arctic Ocean metagenomes. ENA-FIRST-PUBLIC: 2016-05-27. ENA-LAST-UPDATE: 2016-05-25.

Can be found at: https://www.ebi.ac.uk/ena/browser/view/PRJEB14154?show=reads

I have used the first 5 pairs of Generated FASTQ files.

For the sake of brevity, I will use just one reference genome for mapping in this exercise, [E.coli K-12 substr. MG1655](https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2). Other reference genomes can be found at https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/

move into content folder:

In [1]:
cd example_content

Examine the contents. Notice that the .fastq.gz files of each sample are in a separate subfolder.

In [2]:
ls

ERR1424899	ERR1424900	ERR1424901	ERR1424902	ERR1424903


Move up one directory and create a new subdirectory, `all_data`, to collect all of the .fastq.gz files in one place. Then check with `ls` that the directory was made.

In [3]:
cd ..

Before moving anything, make a copy of the original data.

In [4]:
cp -R example_content example_content_copy

In [5]:
mkdir all_data

In [6]:
ls

Mapping_simple.ipynb			example_content_copy
Untitled.ipynb				references
all_data				trimming_classification_basic.ipynb
example_content				trimming_classification_moreqc.ipynb
example_content_2


In [7]:
cd example_content

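Locate all files ending in .gz in the subfolders of the current directory and move them to `all_data`. The `*` wildcard in `'*.gz'` matches any characters preceding `.gz`. The `-mindepth 2` option restricts matches to entries at least two levels below the starting point (depth 0 is the starting directory `.` itself and depth 1 is its direct children), so only files inside the sample subfolders are matched. `-type f` keeps regular files only. In the `-exec` action, `{}` is a placeholder that `find` replaces with the path of each matching file, and `\;` ends the command. The `-print` action prints each path so the user can monitor which files are moved.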

In [8]:
find . -mindepth 2 -type f -name '*.gz' -print -exec mv {} ../all_data \;

./ERR1424899/ERR1424899_1.fastq.gz
./ERR1424899/ERR1424899_2.fastq.gz
./ERR1424900/ERR1424900_1.fastq.gz
./ERR1424900/ERR1424900_2.fastq.gz
./ERR1424901/ERR1424901_1.fastq.gz
./ERR1424901/ERR1424901_2.fastq.gz
./ERR1424902/ERR1424902_1.fastq.gz
./ERR1424902/ERR1424902_2.fastq.gz
./ERR1424903/ERR1424903_1.fastq.gz
./ERR1424903/ERR1424903_2.fastq.gz
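Before moving files, it can help to preview what `find` will match. The following is a minimal sketch, not part of the original workflow: it uses the same tests as the command above but leaves out `-exec`, so nothing is moved. The function name `preview_gz` is hypothetical.

```bash
# List the .gz files that sit at least one subfolder below a starting
# directory, without moving them. "preview_gz" is a hypothetical helper name.
preview_gz() {
    local start_dir="${1:-.}"
    find "$start_dir" -mindepth 2 -type f -name '*.gz' -print | sort
}

# Example (run from inside example_content): preview_gz .
```

If the list looks right, the `-exec mv {} ../all_data \;` action can be added back.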


In [9]:
cd ../all_data

In [10]:
ls

ERR1424899_1.fastq.gz	ERR1424901_1.fastq.gz	ERR1424903_1.fastq.gz
ERR1424899_2.fastq.gz	ERR1424901_2.fastq.gz	ERR1424903_2.fastq.gz
ERR1424900_1.fastq.gz	ERR1424902_1.fastq.gz
ERR1424900_2.fastq.gz	ERR1424902_2.fastq.gz
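With all files in one folder, a quick check that every forward file has its reverse mate can catch a failed move or an incomplete download. This is a sketch under the assumption that files follow the `_1.fastq.gz` / `_2.fastq.gz` naming shown above; `check_pairs` is a hypothetical helper name, not part of bbtools.

```bash
# Report any sample whose forward (_1) file lacks a matching reverse (_2) file.
# Returns 0 if every pair is complete, 1 otherwise.
check_pairs() {
    local dir="${1:-.}" fwd rev missing=0
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue          # glob matched nothing
        rev="${fwd%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$rev" ]; then
            echo "missing mate for $fwd"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run from inside all_data): check_pairs .
```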


In [11]:
rm -r ../example_content

Define the path to the reference

In [3]:
ref_path="references/genome_assemblies_genome_fasta_ecoli_K12_MG1655/GCF_000005845.2_ASM584v2_genomic.fna.gz"

create output directories

In [16]:
mkdir ../reports

In [17]:
mkdir ../bams

In [24]:
mkdir ../indices #Specify the location to write the index/genome files, if you don't want it in the current working directory.
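The three `mkdir` calls above fail if a directory already exists, which is easy to hit when re-running the notebook. One possible alternative is sketched below using `mkdir -p`; the function name `make_output_dirs` is hypothetical.

```bash
# -p creates missing parent directories and does not fail if a directory
# already exists, so this can be re-run safely.
# The default base ".." assumes the working directory is all_data.
make_output_dirs() {
    local base="${1:-..}"
    mkdir -p "$base/reports" "$base/bams" "$base/indices"
}

# Example: make_output_dirs ..
```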

Let's run some of the most basic commands in bbtools to map our read pairs to the reference genome. Here, I am calling `bbmap.sh`, defining the reference genome with `ref`, supplying the forward and reverse reads of one sample as `in1` and `in2`, writing the resulting sam or bam file with `outm`, and writing several text reports (coverage statistics with `covstats`, coverage histogram with `covhist`, per-base coverage with `basecov`). The `minid` argument sets the approximate minimum alignment identity with the reference genome that a read needs in order to be retained. By using `outm` instead of `out`, we write only mapped reads to the alignment file; `out` writes mapped and unmapped reads, and `outu` writes only unmapped reads. By default, `bbmap` sets the `ambiguous` argument to `best`, meaning that a read with several equally good sites is mapped to the first best site. The default kmer length (`k`) is 13. `path` sets where the index is written, and `-Xmx3200m` caps the Java heap at 3200 MB. bbmap writes sam files unless samtools is installed, in which case bam files can be written.

There are several on-screen outputs. At the top, there are basic statistics about read pairing. Then, there are individual statistics for the forward and reverse reads.

In [4]:
bbmap.sh ref=../${ref_path} in1=ERR1424899_1.fastq.gz in2=ERR1424899_2.fastq.gz outm=../bams/ecoli_ERR1424899.bam minid=0.9 path=../indices covstats=../reports/ecoli_ERR1424899_stats.txt covhist=../reports/ecoli_ERR1424899_covhist.txt  basecov=../reports/ecoli_ERR1424899_basecov.txt -Xmx3200m

/usr/local/Cellar/bbtools/38.95/libexec//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx3200m -Xms3200m -cp /usr/local/Cellar/bbtools/38.95/libexec/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=../references/genome_assemblies_genome_fasta_ecoli_K12_MG1655/GCF_000005845.2_ASM584v2_genomic.fna.gz in1=ERR1424899_1.fastq.gz in2=ERR1424899_2.fastq.gz outm=../bams/ecoli_ERR1424899.bam minid=0.9 path=../indices covstats=../reports/ecoli_ERR1424899_stats.txt covhist=../reports/ecoli_ERR1424899_covhist.txt basecov=../reports/ecoli_ERR1424899_basecov.txt -Xmx3200m
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=../references/genome_assemblies_genome_fasta_ecoli_K12_MG1655/GCF_000005845.2_ASM584v2_genomic.fna.gz, in1=ERR1424899_1.fastq.gz, in2=ERR1424899_2.fastq.gz, outm=../bams/ecoli_ERR1424899.bam, minid=0.9, path=../indices, covstats=../reports/ecoli_ERR1424899_stats.txt, covhist=../reports/ecoli_ERR1424899_covhist.txt, basecov=../re
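The command above maps a single sample. To apply the same settings to every read pair in `all_data`, a loop along the following lines could be used. This is only a sketch: `map_all` and the `DRY_RUN` variable are hypothetical names, `ref_path` is assumed to be set as earlier in this notebook, and the working directory is assumed to be `all_data`. By default the function only prints the commands it would run.

```bash
# Print (dry run) or execute one bbmap.sh call per read pair in a directory.
# Set DRY_RUN=0 to actually run the commands.
map_all() {
    local dir="${1:-.}" fwd sample
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue          # glob matched nothing
        sample=$(basename "$fwd" _1.fastq.gz)
        local cmd=(bbmap.sh "ref=../${ref_path}"
            "in1=${dir}/${sample}_1.fastq.gz" "in2=${dir}/${sample}_2.fastq.gz"
            "outm=../bams/ecoli_${sample}.bam" minid=0.9 path=../indices
            "covstats=../reports/ecoli_${sample}_stats.txt"
            "covhist=../reports/ecoli_${sample}_covhist.txt"
            "basecov=../reports/ecoli_${sample}_basecov.txt" -Xmx3200m)
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "${cmd[*]}"               # show the command only
        else
            "${cmd[@]}"                    # run bbmap.sh
        fi
    done
}

# Example: map_all .            (prints the commands)
#          DRY_RUN=0 map_all .  (runs them)
```

Building the command as an array keeps each argument intact even if a path contains spaces.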

Let's take a look at some of the reports, starting with the basic statistics

In [27]:
cat ../reports/ecoli_ERR1424899_stats.txt

#ID	Avg_fold	Length	Ref_GC	Covered_percent	Covered_bases	Plus_reads	Minus_reads	Read_GC	Median_fold	Std_Dev
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	0.1603	4641652	0.5079	1.8876	87617	2195	2194	0.5130	0	3.65


Above is the basic statistics report. `ID` is the name of the reference sequence, `Avg_fold` is the average coverage depth, `Length` is the length of the reference genome, `Ref_GC` is the GC content of the reference genome, `Covered_percent` is the percent of reference bases covered, and `Covered_bases` is the number of reference bases covered. `Plus_reads` and `Minus_reads` are the numbers of reads that mapped to the plus and minus strands, `Read_GC` is the GC content of the mapped reads, and `Median_fold` and `Std_Dev` are the median and standard deviation of the coverage depth. Separately, in bbmap's read-scoring algorithm, each base matching the reference genome is scored +100 points, the first mismatch is scored -127 points, and consecutive mismatching bases are scored negative points that shrink as the run of mismatching bases lengthens.

As a reminder, the coverage of a genome is calculated as `N × L / G`, where `G` is the length of the original genome, `N` is the number of reads, and `L` is the average read length.
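The coverage formula can be evaluated directly in the shell. The sketch below uses `awk` for the floating-point arithmetic; `expected_coverage` is a hypothetical helper name, and the read count and read length in the example call are illustrative values, not taken from this dataset. Only the genome length (4641652) comes from the report above.

```bash
# Expected average coverage = N * L / G.
# Arguments: N (number of reads), L (average read length in bp),
#            G (genome length in bp).
expected_coverage() {
    awk -v n="$1" -v l="$2" -v g="$3" 'BEGIN { printf "%.4f\n", n * l / g }'
}

# Illustrative numbers only: 1000 reads of 150 bp against a 4641652 bp genome.
expected_coverage 1000 150 4641652
```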

Now, the first bit of the coverage histogram report:

In [3]:
head ../reports/ecoli_ERR1424899_covhist.txt

#Coverage	numBases
0	4554035
1	35865
2	25433
3	3434
4	5019
5	1948
6	2427
7	989
8	840


This report is convenient for plotting. 
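The histogram can also be summarized without plotting. In the output above, subtracting the zero-coverage row (4554035) from the genome length (4641652) gives 87617, which matches `Covered_bases` in the statistics report. A minimal sketch of that calculation for any depth threshold follows; `covered_fraction` is a hypothetical helper name, and the file is assumed to have the two tab-separated columns shown above.

```bash
# Fraction of reference bases with coverage >= a threshold, computed from a
# covhist file (columns: coverage depth, number of bases).
covered_fraction() {
    local hist="$1" min_depth="${2:-1}"
    awk -F'\t' -v min="$min_depth" '
        /^#/ { next }                                  # skip the header line
        { total += $2; if ($1 >= min) covered += $2 }
        END { if (total > 0) printf "%.4f\n", covered / total }
    ' "$hist"
}

# Example: covered_fraction ../reports/ecoli_ERR1424899_covhist.txt 1
```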

Next, let's look at the start of the base coverage report. Warning: this report type is VERY large, with one line per reference base.

In [4]:
head ../reports/ecoli_ERR1424899_basecov.txt

#RefName	Pos	Coverage
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	0	0
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	1	0
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	2	0
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	3	0
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	4	0
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	5	0
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	6	0
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	7	0
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	8	0


This report goes through coverage position by position in the reference genome.
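Because the base coverage report is so large, filtering it is usually more practical than reading it. The sketch below prints only the positions at or above a chosen depth. `high_coverage_positions` is a hypothetical helper name and the default threshold of 5 is arbitrary. The field separator is set to a tab because the reference name itself contains spaces.

```bash
# Print position and coverage for every reference base whose coverage is at
# least min_depth, from a basecov file (columns: RefName, Pos, Coverage).
high_coverage_positions() {
    local basecov="$1" min_depth="${2:-5}"
    awk -F'\t' -v min="$min_depth" \
        '!/^#/ && $3 >= min { print $2 "\t" $3 }' "$basecov"
}

# Example: high_coverage_positions ../reports/ecoli_ERR1424899_basecov.txt 5 | head
```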

Finally, let's look at the alignment file. If samtools is installed and the output file name ends in `.bam`, bbmap writes the much smaller binary version (a `bam` instead of a `sam`). This is desirable to save storage space. The `bam` file is binary and not human-readable, as shown below.

In [3]:
head ../bams/ecoli_ERR1424899.bam

     � BC ����N�0���)� ��n��H�JtW�����,'��F�Sl' ^��I*Xi%N{�<��o�3c�/�'�O=o��������.n����b4�<��4N��	��k��A�|/��R"c5F�sʐi���^ҋ0<sru(�ځ�+�o�8��F�����*�����ߧG�U0��(���h:��T�FH�Yf�D)w��>
���櫿R�_���yw8�E�������2N��O��W�Ϣ8��P�E�Ћ�Z��������?�S����v�^�-}dz>����������x��t��| Ͼ�x ���       � BC �N��}�4IZ�3��u��;�LfgwgOTgVfo��L��{��VQҖ->�`�D��  tB6�l�������am/l$-h�$���Z��a�u���9�1�����k���SUY�U��=�q���NMgWU�<����y2�m������Z�}���>��G�P�?��߄����Oϗ��77_vyv3�ꓳ�듇�Ӈ'����u�ӧ���qqqq	�__�<������O�g��3�_��/5^�`�����_o���V����O����������-�i^���9~��*N��G)�~D�SB��%���`a&�D� �*
L����-���?  �V^�Z��"^�4��g�"i拔��c�O�L2%�L�(�"I�G�( �P�#�b�"LF,�
�gW�[ ���������"��O��gݠ��*^o���[�Ҡ��BG]^����h��{�UGӦ�٫���w�י�gV��fϺ�~���|�D�p�PEQD�/#*΄B�z~���rङH�H�,`4\�4V��,�C2	B/"y��I���Pi2�T�B���f�J"�P/�d��	UK��Q"}�s兂�G�����o}�lX[Ր��q3��Q3W�g�)�o���嬋3�t:�>:��8z�p=wzt�rgz��ިs���;�q�r��r�[�\9X���j5����X��W�Z�7�
�;���R9���<�w��[���|?�<�gLH

To view the file in `sam` format, we can call on `samtools`. Here, I am using the `view` command, which reads the `bam` file and writes `sam` text to standard output, and using `>` to redirect that output into a new `.sam` file. The `-h` flag includes the header in the output.

In [2]:
samtools view -h ../bams/ecoli_ERR1424899.bam > ../bams/ecoli_ERR1424899.sam

Let's take a look at the `sam` file, which provides useful, human-readable information about the alignment. Header lines begin with `@`. The `@HD` line holds file-level metadata. Each `@SQ` line is an entry in the reference sequence dictionary. The `@PG` line is the program record. Within `@PG`, `ID` is the program record identifier, `PN` is the program name (in this case, both are BBMap), `VN` is the program version, and `CL` is the command line used. After the header come the alignment records, which follow a standardized format described in the [SAM specification](https://samtools.github.io/hts-specs/SAMv1.pdf). I will only bring attention to a few things.

First, note the fifth column in any alignment, the `MAPQ` (mapping quality). It encodes the probability `p` that the mapping position is wrong as `-10 × log10(p)`, rounded to the nearest integer, so higher values mean more confident mappings. A value of 255 indicates that the MAPQ is not available. A "good" MAPQ really depends on the application. But, to put it into perspective, a score of 10 means that there is a 90% probability that the mapping position is correct, a score of 3 corresponds to about 50%, and a score of 30 corresponds to 99.9%. Careful though! The "goodness" of MAPQ scores can vary between programs (see [this useful review](https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/)).
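The relationship between MAPQ and mapping confidence can be checked numerically. Under the SAM definition, a MAPQ of Q corresponds to an error probability of 10^(-Q/10). The sketch below converts a score to the probability that the position is correct; `mapq_to_prob` is a hypothetical helper name.

```bash
# Convert a MAPQ value Q to the probability that the mapping position is
# correct: 1 - 10^(-Q/10).
mapq_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.3f\n", 1 - 10 ^ (-q / 10) }'
}

# Print the probabilities for the three scores discussed above.
for q in 3 10 30; do
    echo "MAPQ $q -> $(mapq_to_prob "$q")"
done
```

As the review linked above points out, aligners do not all follow this definition strictly, so the conversion is best read as a rough guide.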

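Also, note the sixth column in any alignment, the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string. It is a series of operations, each written as a count followed by a symbol, so `2X6=` means two substitutions followed by six matching bases. In `bbmap` output, `=` indicates a matching base, `X` indicates a substitution, and `M` indicates a degenerate base (in the SAM specification, `M` more generally means an alignment match that may be either a sequence match or a mismatch).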
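To get a feel for how a CIGAR string is read, it can be split into its operations and the counts summed per symbol. This is a minimal sketch using `grep` and `awk`; `cigar_counts` is a hypothetical helper name, and the example string is the first four operations of the alignment record shown below.

```bash
# Sum the lengths of each CIGAR operation in a SAM 1.4 style string.
# Prints one line per operation symbol: "<symbol> <total length>".
cigar_counts() {
    echo "$1" | grep -oE '[0-9]+[MIDNSHP=X]' | awk '
        {
            op = substr($0, length($0))                 # last character
            n[op] += substr($0, 1, length($0) - 1)      # leading count
        }
        END { for (op in n) print op, n[op] }
    ' | LC_ALL=C sort
}

# First four operations of the first alignment: 2X6=1X18=
cigar_counts "2X6=1X18="
```

Dividing the `=` total by the sum of the `=` and `X` totals gives a rough per-read identity, which can be compared against the `minid` threshold used for mapping.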

In [3]:
head ../bams/ecoli_ERR1424899.sam

@HD	VN:1.4	SO:unsorted
@SQ	SN:NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	LN:4641652
@PG	ID:BBMap	PN:BBMap	VN:38.95	CL:java -ea -Xmx3200m -Xms3200m align2.BBMap build=1 overwrite=true fastareadlen=500 ref=../references/genome_assemblies_genome_fasta_ecoli_K12_MG1655/GCF_000005845.2_ASM584v2_genomic.fna.gz in1=ERR1424899_1.fastq.gz in2=ERR1424899_2.fastq.gz outm=../bams/ecoli_ERR1424899.bam minid=0.9 path=../indices covstats=../reports/ecoli_ERR1424899_stats.txt covhist=../reports/ecoli_ERR1424899_covhist.txt basecov=../reports/ecoli_ERR1424899_basecov.txt -Xmx3200m
ERR1424899.639 M03580:21:000000000-APRY1:1:1101:15556:2188	99	NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome	226988	2	2X6=1X18=1X6=1X9=1X2=3X8=1X2=3X6=1X2=1X21=1X6=1X11=1X2=1X14=1X6=1X37=2X1=1X1=1X7=1X4=1X11=1X2=1X1=1X1=1X2=2X1=	=	226991	223	GAGCTGGACGTATCAGAAGTGCGAATGTTGACATGAGTAACGATCAAAGAGGTGAAAAACCTCTTCGCCGAAAAACCAAGGGTTCCTGTCCAACGCTAATCGTGGCAGGGTGAGGCGGCCCCTAAGGCGAGGGCG