# Aligning the query sequence

Now that we have built the reference index using BWA, we can align the query sequence (input.fq) to this reference.

Let us take a look again at the files:

In [1]:
ls -lh

total 481M
-rw-r--r-- 1 root root  15K Jul 24 00:35 '01 - Preparations for Finding a Disease Mutation.ipynb'
-rw-r--r-- 1 root root  16K Jul 24 00:19 '02 - Aligning the FASTQ File.ipynb'
-rw-r--r-- 1 root root  61K Jul 24 00:19 '03 - Variant Calling.ipynb'
-rw-r--r-- 1 root root  11K Jul 24 00:19 '04 - Annotation of Variants.ipynb'
-rw-r--r-- 1 root root 177M Jul 24 00:20  chr5.fa
-rw-r--r-- 1 root root  588 Jul 24 00:32  chr5.fa.amb
-rw-r--r-- 1 root root   44 Jul 24 00:32  chr5.fa.ann
-rw-r--r-- 1 root root 174M Jul 24 00:32  chr5.fa.bwt
-rw-r--r-- 1 root root   23 Jul 24 03:13  chr5.fa.fai
-rw-r--r-- 1 root root  44M Jul 24 00:32  chr5.fa.pac
-rw-r--r-- 1 root root  87M Jul 24 00:33  chr5.fa.sa
-rw-r--r-- 1 root root 820K Jul 24 00:19  input.fq
-rw-r--r-- 1 root root 225K Jul 24 00:22  input_fastqc.html
-rw-r--r-- 1 root root 235K Jul 24 00:22  input_fastqc.zip
-rw-r--r-- 1 root root    0 Jul 24 03:13  mapped.bam
-rw-r--r-- 1 root root   92 Jul 24 03:13  mapped.sort.bam
-rw-r--r-- 1

### Taking a peek at the input FASTQ file

In [2]:
head input.fq

@SRRQ866988.19885082
CCAAGTAAGATTGAGCTTGAAGGCTGTTCTCATTTTGTAAAAACATAAGCTCAGGAAGTGTTGAAGATATTTTAACTCTACACTGAGACTT
+SRRQ866988.19885082
GIIGIIIIIIIIHIIIIIIIIIIIIIIIIIIGIIIIIIIIIIHIIIIIGIIIIEHBGGEGIIHIHIIIFIIIIHIIBHIIGEHIE<EII<G
@SRRQ866988.19885085
GAATAACCCTGTGCCCTCCAGAACTGGTGCCCTGTGAACACCCAAAAGCAAAGAGAAGTGACTCTTGTTCCTAATGTGGAAAGAGCAGAAC
+SRRQ866988.19885085
GDGEDBGGDG<DGGGGFGGGDG?G8EDC>FFF>FCDGGG:GEGBGGG>GD==8??66:2:FCECED<CCCGHHHBGEDGEAAB<8ACC8C?
@SRRQ866988.19885086
GGAGTCTCAGAGAGAGGTGTGACCTGGACCCTGCCTGCCTCTCCAGCTGCACTCACAGCATCCTCACCATCTTCACTCTGCTTGGTCCCAC
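
Each FASTQ record spans four lines: a header starting with @, the sequence, a separator starting with +, and one quality character per base. Since head prints ten lines, the third record above is cut off after its sequence. As a minimal sketch, the Python below reads such records and converts the quality characters to Phred scores. It assumes the Phred+33 encoding, which the quality characters below '@' in the output above suggest. The helper names and the toy record are made up for illustration.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    # Toy record for illustration only; it is not taken from input.fq.
    toy = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for record in read_fastq(toy):
        print(record.name, record.sequence, phred_scores(record.quality))
```

To run it on the real data, open input.fq with `open("input.fq")` and pass the file handle to `read_fastq`.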


### Aligning the sequence file using BWA

We will again load the BWA module and run bwa without arguments to list the available commands.

In [3]:
bwa


Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are


BWA provides several alignment commands (aln, mem, bwasw) that are optimized for sequences of different lengths. For most purposes, BWA mem will give good results.

Let us take a look at the options for alignment using the mem command.

In [4]:
bwa mem 


Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

Algorithm options:

       -t INT        number of threads [1]
       -k INT        minimum seed length [19]
       -w INT        band width for banded alignment [100]
       -d INT        off-diagonal X-dropoff [100]
       -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
       -y INT        seed occurrence for the 3rd round seeding [20]
       -c INT        skip seeds with more than INT occurrences [500]
       -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
       -W INT        discard a chain if seeded bases shorter than INT [0]
       -m INT        perform at most INT rounds of mate rescues for each read [50]
       -S            skip mate rescue
       -P            skip pairing; mate rescue performed unless -S also in use

Scoring options:

       -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
     


For a simple alignment, we only need to specify two things:

- reference index (we will use the prefix name)
- the input query fastq file

BWA writes the alignments to standard output in the SAM format (we will look at this shortly). To save them to a file, we redirect standard output using the > operator. We also pass -t 4 so that BWA uses four threads instead of the default of one.

In [5]:
bwa mem -t 4 chr5.fa input.fq > mapped.sam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 3714 sequences (337974 bp)...
[M::mem_process_seqs] Processed 3714 reads in 0.130 CPU sec, 0.037 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 chr5.fa input.fq
[main] Real time: 0.206 sec; CPU: 0.268 sec
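
The command above only works if the index files built earlier sit next to the reference, because the prefix chr5.fa stands for chr5.fa.amb, chr5.fa.ann, chr5.fa.bwt, chr5.fa.pac and chr5.fa.sa (all visible in the directory listing). As a rough sketch, the Python below checks for those files and assembles the same command line. The function names are hypothetical, and `run_bwa_mem` assumes that bwa is installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List

# Suffixes of the index files listed next to chr5.fa in the directory listing.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_index_files(prefix: str) -> List[str]:
    """Return the index files that do not exist for the given prefix."""
    return [prefix + s for s in BWA_INDEX_SUFFIXES if not Path(prefix + s).is_file()]


def bwa_mem_command(prefix: str, fastq: str, threads: int = 1) -> List[str]:
    """Build the argument list for a single-end bwa mem run."""
    return ["bwa", "mem", "-t", str(threads), prefix, fastq]


def run_bwa_mem(prefix: str, fastq: str, sam_path: str, threads: int = 1) -> None:
    """Run bwa mem and write its standard output to sam_path (like the > operator)."""
    missing = missing_index_files(prefix)
    if missing:
        raise FileNotFoundError("missing index files: " + ", ".join(missing))
    with open(sam_path, "w") as sam_file:
        subprocess.run(bwa_mem_command(prefix, fastq, threads), stdout=sam_file, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_bwa_mem(...) to actually align.
    print(" ".join(bwa_mem_command("chr5.fa", "input.fq", threads=4)))
```

The log above also allows a quick sanity check: 337974 bp over 3714 sequences is exactly 91 bp per read.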


Let us take a look at the SAM output:

In [6]:
head mapped.sam

@SQ	SN:chr5	LN:181538259
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:bwa mem -t 4 chr5.fa input.fq
SRRQ866988.19885082	0	chr5	148971889	60	91M	*	0	0	CCAAGTAAGATTGAGCTTGAAGGCTGTTCTCATTTTGTAAAAACATAAGCTCAGGAAGTGTTGAAGATATTTTAACTCTACACTGAGACTT	GIIGIIIIIIIIHIIIIIIIIIIIIIIIIIIGIIIIIIIIIIHIIIIIGIIIIEHBGGEGIIHIHIIIFIIIIHIIBHIIGEHIE<EII<G	NM:i:0	MD:Z:91	AS:i:91	XS:i:0
SRRQ866988.19885085	0	chr5	148973059	60	91M	*	0	0	GAATAACCCTGTGCCCTCCAGAACTGGTGCCCTGTGAACACCCAAAAGCAAAGAGAAGTGACTCTTGTTCCTAATGTGGAAAGAGCAGAAC	GDGEDBGGDG<DGGGGFGGGDG?G8EDC>FFF>FCDGGG:GEGBGGG>GD==8??66:2:FCECED<CCCGHHHBGEDGEAAB<8ACC8C?	NM:i:0	MD:Z:91	AS:i:91	XS:i:0
SRRQ866988.19885086	0	chr5	148973888	60	91M	*	0	0	GGAGTCTCAGAGAGAGGTGTGACCTGGACCCTGCCTGCCTCTCCAGCTGCACTCACAGCATCCTCACCATCTTCACTCTGCTTGGTCCCAC	HHHHHHHHHHGBGHHHGEHGHHHH@DGEGGH>HHHCCFCCAGGGDHHEFHEHFFFBB<FDEFCB1BBB=@AA@D=??DBE>:4*8@@9<>A	NM:i:1	MD:Z:12C78	AS:i:86	XS:i:0
SRRQ866988.19885087	0	chr5	148974777	60	91M	*	0	0	CTTACTAAAACTCACCATGTGTCAAGCGCTTCACTGACATCATCTTATTTAATCCTCACAACA

## Looking at the SAM format

The SAM format is a tab-delimited text format for storing alignments. A SAM file usually starts with a header of one or more lines, each beginning with the @ character. The header typically lists the reference sequences used in the alignment (@SQ lines) as well as the program and parameters used to produce it (@PG lines).
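
As a small illustration, header lines can be separated from alignments by their leading @ and split into TAG:VALUE pairs. The sketch below does this for the two header lines that bwa produced for our alignment. The function name is hypothetical, and free-text @CO comment lines are kept as plain strings rather than split.

```python
from typing import Dict, Iterable, List, Tuple


def parse_sam_header(lines: Iterable[str]) -> List[Tuple[str, Dict[str, str]]]:
    """Collect header lines as (record_type, {tag: value}) pairs.

    Reading stops at the first line that does not start with '@',
    which is the first alignment line.
    """
    records = []
    for line in lines:
        if not line.startswith("@"):
            break
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type == "@CO":
            # Comment lines hold free text, not TAG:VALUE pairs.
            records.append((record_type, {"comment": "\t".join(fields)}))
            continue
        # Split on the first ':' only, since values may contain ':' themselves.
        records.append((record_type, dict(field.split(":", 1) for field in fields)))
    return records


if __name__ == "__main__":
    header_lines = [
        "@SQ\tSN:chr5\tLN:181538259\n",
        "@PG\tID:bwa\tPN:bwa\tVN:0.7.17-r1188\tCL:bwa mem -t 4 chr5.fa input.fq\n",
    ]
    for record_type, tags in parse_sam_header(header_lines):
        print(record_type, tags)
```

Here SN and LN give the name and length of the reference sequence, and CL records the command line that was run.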

Following the header, each line of alignment consists of several tab-delimited columns.


<pre>QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS ISIZE SEQ QUAL [TAG:VTYPE:VALUE[...]]</pre>
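* The first 11 columns are mandatory
* Additional columns can be added using the format TAG:VTYPE:VALUE
* Newer versions of the SAM specification name columns 7 to 9 RNEXT, PNEXT and TLEN instead of MRNM, MPOS and ISIZE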
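
To make the column layout concrete, here is a minimal Python sketch that splits one alignment line into the 11 mandatory columns plus a dictionary of optional tags. The function name and the short example line are made up for illustration. Only tags of type i are converted to integers; every other tag value is left as a string.

```python
from typing import Any, Dict

SAM_COLUMNS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL")
INTEGER_COLUMNS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_alignment(line: str) -> Dict[str, Any]:
    """Split one SAM alignment line into named columns and optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    record: Dict[str, Any] = {}
    for name, value in zip(SAM_COLUMNS, fields):
        record[name] = int(value) if name in INTEGER_COLUMNS else value
    tags: Dict[str, Any] = {}
    for item in fields[len(SAM_COLUMNS):]:
        # TAG:VTYPE:VALUE; the value itself may contain ':'.
        tag, vtype, value = item.split(":", 2)
        tags[tag] = int(value) if vtype == "i" else value
    record["tags"] = tags
    return record


if __name__ == "__main__":
    # Shortened, made-up line with the same layout as the bwa output.
    example = "read1\t0\tchr5\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tMD:Z:4"
    for key, value in parse_alignment(example).items():
        print(key, "=", value)
```

The same function can be applied to each non-header line of mapped.sam.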

Let us take a look at one alignment from the SAM output:

In [7]:
head -n 3 mapped.sam

@SQ	SN:chr5	LN:181538259
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:bwa mem -t 4 chr5.fa input.fq
SRRQ866988.19885082	0	chr5	148971889	60	91M	*	0	0	CCAAGTAAGATTGAGCTTGAAGGCTGTTCTCATTTTGTAAAAACATAAGCTCAGGAAGTGTTGAAGATATTTTAACTCTACACTGAGACTT	GIIGIIIIIIIIHIIIIIIIIIIIIIIIIIIGIIIIIIIIIIHIIIIIGIIIIEHBGGEGIIHIHIIIFIIIIHIIBHIIGEHIE<EII<G	NM:i:0	MD:Z:91	AS:i:91	XS:i:0


We can break down the contents according to the tab-delimited columns:


* QNAME = SRRQ866988.19885082 
* FLAG = 0 
* RNAME = chr5
* POS = 148971889
* MAPQ = 60
* CIGAR = 91M
* MRNM = *
* MPOS = 0
* ISIZE = 0
* SEQ = CCAAGTAAGATTGAGCTTGAAGGCTGTTCTCATTTTGTAAAAACATAAGCTCAGGAAGTGTTGAAGATATTTTAACTCTACACTGAGACTT
* QUAL = GIIGIIIIIIIIHIIIIIIIIIIIIIIIIIIGIIIIIIIIIIHIIIIIGIIIIEHBGGEGIIHIHIIIFIIIIHIIBHIIGEHIE<EII<G
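
Two of these columns are encoded rather than plain values. FLAG is a bit field, and CIGAR is a run-length description of the alignment. The sketch below decodes both and derives the last reference position covered by the read. The bit meanings and CIGAR operators follow the SAM specification, while the function names and the short labels for the bits are our own.

```python
import re
from typing import List, Tuple

# FLAG bits as defined in the SAM specification (labels are informal).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []
    ops = CIGAR_OP.findall(cigar)
    if "".join(length + op for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in ops]


def reference_end(pos: int, cigar: str) -> int:
    """Last reference position (1-based, inclusive) covered by the alignment."""
    # Only M, D, N, = and X consume reference bases.
    span = sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")
    return pos + span - 1


if __name__ == "__main__":
    # Values taken from the alignment line shown above.
    print(decode_flag(0))                    # no bits set
    print(parse_cigar("91M"))                # one run of 91 aligned bases
    print(reference_end(148971889, "91M"))   # 148971889 + 91 - 1
```

For our read, FLAG 0 means that no bit is set: the read is unpaired and mapped to the forward strand. CIGAR 91M means all 91 bases align to the reference without insertions or deletions, so the read covers positions 148971889 to 148971979 on chr5.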

The tags (after column 11) form additional columns:

* NM:i:0 = Edit distance, integer type, 0
