### PairwiseAligner

The Bio.Align.PairwiseAligner class implements the Needleman-Wunsch, Smith-Waterman, Gotoh (three-state), and Waterman-Smith-Beyer global and local pairwise alignment algorithms. For an in-depth treatment of these algorithms, see Durbin et al. [1].

#### Basic usage

In [1]:
from Bio import Align
aligner = Align.PairwiseAligner()

seq1 = "GATTACA"
seq2 = "GCATGCA"

alignments = aligner.align(seq1, seq2)
for alignment in alignments:
    print(alignment)

score = aligner.score(seq1, seq2)
score

G-ATTA-CA
|-||---||
GCAT--GCA

G-ATTA-CA
|-|-|--||
GCA-T-GCA

G-ATT-ACA
|-||---||
GCAT-G-CA

G-ATT-ACA
|-|-|--||
GCA-TG-CA

G-AT-TACA
|-||---||
GCATG--CA

G-ATTACA
|-||.-||
GCATG-CA

G-ATTACA
|-||-.||
GCAT-GCA

G-ATTACA
|-|-|.||
GCA-TGCA



5.0

By default, a global pairwise alignment is performed, which finds the optimal alignment over the whole length of seq1 and seq2.
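To make the idea of a global alignment score concrete, here is a minimal pure-Python sketch of the Needleman-Wunsch recursion with a single linear gap score. It is not Biopython's implementation; the function name global_score and its defaults (match 1, mismatch 0, gap 0, mirroring the parameters printed by this aligner) are assumptions chosen for illustration, and it computes only the score, not the alignments.

```python
def global_score(a, b, match=1.0, mismatch=0.0, gap=0.0):
    """Score of the best global alignment of a and b (illustrative sketch)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] aligned with a gap
                curr[j - 1] + gap,   # b[j-1] aligned with a gap
            ))
        prev = curr
    return prev[-1]

print(global_score("GATTACA", "GCATGCA"))  # 5.0
```

With these scores, the result equals the number of matched letters in the best alignment, which agrees with the score of 5.0 reported above.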

In [7]:
print(aligner)

Pairwise sequence aligner with parameters
  wildcard: None
  match_score: 1.000000
  mismatch_score: 0.000000
  target_internal_open_gap_score: 0.000000
  target_internal_extend_gap_score: 0.000000
  target_left_open_gap_score: 0.000000
  target_left_extend_gap_score: 0.000000
  target_right_open_gap_score: 0.000000
  target_right_extend_gap_score: 0.000000
  query_internal_open_gap_score: 0.000000
  query_internal_extend_gap_score: 0.000000
  query_left_open_gap_score: 0.000000
  query_left_extend_gap_score: 0.000000
  query_right_open_gap_score: 0.000000
  query_right_extend_gap_score: 0.000000
  mode: global



Depending on the gap scoring parameters and mode, a PairwiseAligner object automatically chooses the appropriate algorithm to use for pairwise sequence alignment.

In [None]:
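# Pseudocode (not executable): the simple Needleman-Wunsch / Smith-Waterman
# recursion applies when no gap-scoring function is set and every
# open gap score equals the corresponding extend gap score.
if not (target_gap_function or query_gap_function) and (
    target_gap_open == target_gap_extend
    and query_gap_open == query_gap_extend
    and target_left_open == target_left_extend
    and target_right_open == target_right_extend
    and query_left_open == query_left_extend
    and query_right_open == query_right_extend
):
    ...  # Needleman-Wunsch (global mode) or Smith-Waterman (local mode)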

In [6]:
print(aligner.algorithm)

Needleman-Wunsch


In [9]:
aligner.mode = 'local'
print(aligner.algorithm)

Smith-Waterman
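
The selection rule can be sketched as a small standalone function. This is an illustration under stated assumptions, not Biopython code: choose_algorithm and its arguments are hypothetical names, and the returned strings are only algorithm family names, not necessarily the exact text of aligner.algorithm.

```python
def choose_algorithm(mode, open_extend_pairs, has_gap_function=False):
    """Pick an algorithm family from the gap settings (hypothetical helper).

    mode: "global" or "local".
    open_extend_pairs: iterable of (open_score, extend_score) tuples, one per
        gap category (target/query x left/internal/right).
    has_gap_function: True if a general gap-scoring function is in use.
    """
    if has_gap_function:
        # Arbitrary gap costs need the most general algorithm
        return "Waterman-Smith-Beyer"
    if all(open_score == extend_score for open_score, extend_score in open_extend_pairs):
        # Linear gap costs: the simple recursion is enough
        return "Needleman-Wunsch" if mode == "global" else "Smith-Waterman"
    # Affine gap costs (open differs from extend)
    return "Gotoh"

print(choose_algorithm("global", [(0.0, 0.0)] * 6))    # Needleman-Wunsch
print(choose_algorithm("local", [(0.0, 0.0)] * 6))     # Smith-Waterman
print(choose_algorithm("global", [(-10, -0.5)] * 6))   # Gotoh
```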


In [9]:
aligner.match_score = 5
aligner.mismatch_score = -3
aligner.extend_gap_score = -3
aligner.gap_score = -3

In [10]:
alignments = aligner.align(seq1, seq2)
for alignment in alignments:
    print(alignment)

G-ATTACA
|-||.-||
GCATG-CA

G-ATTACA
|-||-.||
GCAT-GCA

G-ATTACA
|-|-|.||
GCA-TGCA



#### Alignment objects

In [13]:
aligner = Align.PairwiseAligner()
seq1 = "GAACT"
seq2 = "GAT"
alignments = aligner.align(seq1, seq2)
alignment = alignments[0]
alignment

<Bio.Align.PairwiseAlignment at 0x7ff404226400>

In [14]:
alignment.score

3.0

In [11]:
alignment.target

'GAACT'

In [16]:
print(alignment)

GAACT
||--|
GA--T



In [17]:
alignment.shape

(2, 5)

In [18]:
aligner.mode = 'local'
local_alignments = aligner.align("TGAACT", "GAC")
local_alignment = local_alignments[0]
print(local_alignment)

TGAACT
 ||-|
 GA-C



In [19]:
local_alignment.shape

(2, 4)

In [20]:
aligner.mode = "global"
aligner.mismatch_score = -10
aligner.extend_gap_score = -10
alignments = aligner.align("AAACAAA", "AAAGAAA")
len(alignments)

2

In [21]:
print(alignments[0])

AAAC-AAA
|||--|||
AAA-GAAA



Use the substitutions attribute of an alignment to find the number of substitutions between each pair of nucleotides:

In [22]:
target = "AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT"
query = "AAAAAAACCCTCCCCGGCCGGGGTTTAGTTT"
# Set the scores first, then align once with the final parameters
aligner.mismatch_score = -1
aligner.gap_score = -1
aligner.extend_gap_score = -1
alignments = aligner.align(target, query)
len(alignments)

8

In [23]:
print(alignments[0])

AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT
|||||||-|||.||||||..|||||||..|||
AAAAAAA-CCCTCCCCGGCCGGGGTTTAGTTT



In [25]:
print(alignments[0].substitutions)

    A   C   G   T
A 7.0 0.0 0.0 0.0
C 0.0 7.0 0.0 1.0
G 0.0 2.0 6.0 0.0
T 1.0 0.0 1.0 6.0
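
As a cross-check of what this matrix counts, the following pure-Python sketch tallies aligned letter pairs directly from the two gapped rows printed above. The helper count_substitutions is a hypothetical name introduced here, not part of Biopython; columns that contain a gap are skipped.

```python
from collections import Counter

def count_substitutions(target_row, query_row):
    """Count (target letter, query letter) pairs over the gap-free columns."""
    counts = Counter()
    for t, q in zip(target_row, query_row):
        if t != "-" and q != "-":
            counts[t, q] += 1
    return counts

# The two rows of alignments[0] as printed above
target_row = "AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT"
query_row = "AAAAAAA-CCCTCCCCGGCCGGGGTTTAGTTT"

counts = count_substitutions(target_row, query_row)
for t in "ACGT":
    # One row per target letter, one column per query letter
    print(t, [counts[t, q] for q in "ACGT"])
```

Rows correspond to letters in the target and columns to letters in the query, so the single C aligned to a T in the query appears in row C, column T.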



#### Calculating a substitution matrix from a pairwise sequence alignment

In [26]:
from Bio import AlignIO
alignments = AlignIO.read("../data/opuntia.aln", "clustal")
print(alignments)

Alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191


In [27]:
from Bio.Align.substitution_matrices import Array

# Count aligned letter pairs between two rows of the multiple alignment
frequency = Array("ACGTN", dims=2)
seq1 = alignments[2].seq
seq2 = alignments[-1].seq
for c1, c2 in zip(seq1, seq2):
    if c1 in "ACGTN" and c2 in "ACGTN":
        frequency[c1, c2] += 1
print(frequency)

      A     C     G     T   N
A 369.0   1.0   2.0   1.0 0.0
C   1.0 117.0   1.0   1.0 0.0
G   1.0   0.0 120.0   1.0 0.0
T   2.0   2.0   1.0 271.0 1.0
N   0.0   0.0   0.0   0.0 0.0



In [28]:
frequency = alignments.substitutions
print(frequency)

       A      C      G   N      T
A 7814.0   15.0   19.0 2.5   22.0
C   15.0 2473.0    6.0 0.5   22.5
G   19.0    6.0 2548.0 2.0   12.0
N    2.5    0.5    2.0 0.0    4.0
T   22.0   22.5   12.0 4.0 5730.0



We normalize against the total number to find the probability of each substitution, and create a symmetric matrix:

In [30]:
import numpy as np

probabilities = frequency / np.sum(frequency)
probabilities = (probabilities + probabilities.transpose()) / 2.0
print(probabilities.format("%.4f"))

       A      C      G      N      T
A 0.4162 0.0008 0.0010 0.0001 0.0012
C 0.0008 0.1317 0.0003 0.0000 0.0012
G 0.0010 0.0003 0.1357 0.0001 0.0006
N 0.0001 0.0000 0.0001 0.0000 0.0002
T 0.0012 0.0012 0.0006 0.0002 0.3052
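
The same two steps can be written out without NumPy to show exactly what is computed. This is a minimal sketch using the counts printed above, copied by hand into a dictionary; the variable names are chosen here for illustration.

```python
letters = "ACGNT"
# Counts copied from the substitution matrix printed above (rows and columns in the order A, C, G, N, T)
counts = {
    "A": [7814.0, 15.0, 19.0, 2.5, 22.0],
    "C": [15.0, 2473.0, 6.0, 0.5, 22.5],
    "G": [19.0, 6.0, 2548.0, 2.0, 12.0],
    "N": [2.5, 0.5, 2.0, 0.0, 4.0],
    "T": [22.0, 22.5, 12.0, 4.0, 5730.0],
}

# Step 1: divide every count by the grand total
total = sum(sum(row) for row in counts.values())
prob = {x: {y: counts[x][j] / total for j, y in enumerate(letters)} for x in letters}

# Step 2: average each entry with its transpose to make the matrix symmetric
sym = {x: {y: (prob[x][y] + prob[y][x]) / 2.0 for y in letters} for x in letters}

for x in letters:
    print(x, " ".join(f"{sym[x][y]:.4f}" for y in letters))
```

The counts shown above are already symmetric, so the averaging step leaves the values unchanged in this case; it matters when the counts come from a directed comparison such as target versus query.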



In [45]:
from Bio import SeqIO
seq1 = SeqIO.read("../data/alpha.faa", "fasta")
seq2 = SeqIO.read("../data/beta.faa", "fasta")
aligner = Align.PairwiseAligner()
score = aligner.score(seq1.seq, seq2.seq)
print(score)

72.0


In [46]:
alignments = aligner.align(seq1.seq, seq2.seq)

In this example, the total number of optimal alignments is huge (more than 4 × 10^37), and calling len(alignments) will raise an OverflowError:

In [33]:
len(alignments)

OverflowError: number of optimal alignments is larger than 9223372036854775807
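
The limit in this message is the largest value Python's len() can return on a 64-bit build (sys.maxsize). The sketch below reproduces the same kind of failure with a hypothetical stand-in class, HugeCollection, whose reported length is 4 × 10^37; it is not the Biopython alignments object, and the text of the error message differs.

```python
import sys

class HugeCollection:
    """Hypothetical stand-in for a collection with an enormous number of items."""

    def __len__(self):
        return 4 * 10**37  # far larger than sys.maxsize

try:
    len(HugeCollection())
    raised = False
except OverflowError:
    # len() must return a value that fits in an index-sized integer
    raised = True

print(4 * 10**37 > sys.maxsize)
print(raised)
```

Indexing or iterating over the alignments does not require the total count, which is why alignments[0] still works in the cells that follow.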

In [34]:
alignment = alignments[0]

In [35]:
print(alignment.score)

72.0


In [40]:
# Keep long alignment lines from wrapping in the notebook output
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [38]:
print(alignment)

MV-LS-PAD--KTN--VK-AA-WGKV-----GAHAGEYGAEALE-RMFLSF----P-TTKTY--FPHF----DLSHGS---AQVK-G------HGKKV--A--DA-LTNAVAHV-DDMPNALS----A-LSD-LHAH--KLR-VDPV-NFK-LLSHC---LLVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-
||-|--|----|----|--|--||||-----|---||--|--|--|--|------|-|------|--|----|||------|-|--|------|||||--|--|--|--|--|--|---|-|-----|-||--||----||--|||--||--||------|-|---||-|-------||||--|--|------|----|--|------|----||--
MVHL-TP--EEK--SAV-TA-LWGKVNVDEVG---GE--A--L-GR--L--LVVYPWT----QRF--FESFGDLS---TPDA-V-MGNPKVKAHGKKVLGAFSD-GL--A--H-LD---N-L-KGTFATLS-ELH--CDKL-HVDP-ENF-RLL---GNVL-V-CVLA-H---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H



In [47]:
aligner.algorithm

'Needleman-Wunsch'

Better alignments are usually obtained by penalizing gaps, typically with a higher cost for opening a gap and a lower cost for extending an existing one. For amino acid sequences, match scores are usually encoded in substitution matrices such as PAM or BLOSUM. A more meaningful alignment for our example can therefore be obtained by using the BLOSUM62 matrix together with gap penalties. The cell below sets both the gap open score and the gap extension score to -10; setting extend_gap_score to -0.5 instead would give a gap extension penalty of 0.5:

In [64]:
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.open_gap_score = -10
aligner.extend_gap_score = -10
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
score = aligner.score(seq1.seq, seq2.seq)
print(score)

246.0


In [65]:
aligner.algorithm

'Needleman-Wunsch'

In [53]:
alignments = aligner.align(seq1.seq, seq2.seq)
len(alignments)

2

In [66]:
print(alignments[0])

MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
||-|.|..|..|.|.||||--...|.|.|||.|.....|.|...|..|-|||-----.|...||.|||||..|.....||.|........||.||..||.|||.||.||...|...||.|...||||.|.|...|..|.|...|..||.
MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH



#### Affine gap scores

| Opening scores                 | Extending scores                 |
|--------------------------------|----------------------------------|
| query_left_open_gap_score      | query_left_extend_gap_score      |
| query_internal_open_gap_score  | query_internal_extend_gap_score  |
| query_right_open_gap_score     | query_right_extend_gap_score     |
| target_left_open_gap_score     | target_left_extend_gap_score     |
| target_internal_open_gap_score | target_internal_extend_gap_score |
| target_right_open_gap_score    | target_right_extend_gap_score    |

| target | query | score                            |
|:------:|:-----:|----------------------------------|
|    A   |   -   | query left open gap score        |
|    C   |   -   | query left extend gap score      |
|    C   |   -   | query left extend gap score      |
|    G   |   G   | match score                      |
|    G   |   T   | mismatch score                   |
|    G   |   -   | query internal open gap score    |
|    A   |   -   | query internal extend gap score  |
|    A   |   -   | query internal extend gap score  |
|    T   |   T   | match score                      |
|    A   |   A   | match score                      |
|    G   |   -   | query internal open gap score    |
|    C   |   C   | match score                      |
|    -   |   C   | target internal open gap score   |
|    -   |   C   | target internal extend gap score |
|    C   |   C   | match score                      |
|    T   |   G   | mismatch score                   |
|    C   |   C   | match score                      |
|    -   |   C   | target internal open gap score   |
|    A   |   A   | match score                      |
|    -   |   T   | target right open gap score      |
|    -   |   A   | target right extend gap score    |
|    -   |   A   | target right extend gap score    |
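
The second table can be reproduced programmatically. The following pure-Python sketch walks the columns of the example alignment and names the score that applies to each one. The function label_columns is a hypothetical helper written for this illustration; it assumes that no column has a gap in both rows, and it uses the aligner's attribute names as labels.

```python
def label_columns(target_row, query_row):
    """Name the score that applies to each alignment column (illustrative sketch)."""
    labels = []
    for i, (t, q) in enumerate(zip(target_row, query_row)):
        if t != "-" and q != "-":
            labels.append("match_score" if t == q else "mismatch_score")
            continue
        # The gap is named after the row that contains the dash
        name, row = ("query", query_row) if q == "-" else ("target", target_row)
        if set(row[:i]) <= {"-"}:
            side = "left"       # no letter of this row appears before the gap
        elif set(row[i + 1:]) <= {"-"}:
            side = "right"      # no letter of this row appears after the gap
        else:
            side = "internal"
        # A gap column directly after another gap column in the same row extends it
        kind = "extend" if i > 0 and row[i - 1] == "-" else "open"
        labels.append(f"{name}_{side}_{kind}_gap_score")
    return labels

# The alignment from the table above, read top to bottom
target_row = "ACCGGGAATAGC--CTC-A---"
query_row = "---GT---TA-CCCCGCCATAA"

labels = label_columns(target_row, query_row)
for t, q, label in zip(target_row, query_row, labels):
    print(t, q, label)
```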

In [10]:
aligner = Align.PairwiseAligner()
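def my_gap_score_function(start, length):
    # start: index at which the gap begins; length: number of gap positions
    if start == 2:
        # Effectively forbid gaps that begin at index 2
        return -1000
    else:
        return -1 * length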

In [11]:
aligner.query_gap_score = my_gap_score_function
alignments = aligner.align("AACTT", "AATT")
for alignment in alignments:
    print(alignment)

AACTT
-|.||
-AATT

AACTT
|-.||
A-ATT

AACTT
||.-|
AAT-T

AACTT
||.|-
AATT-
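
To see why no reported alignment has a gap beginning at index 2, the sketch below scores candidate alignments by hand with the same gap function. It is a simplified illustration, not Biopython's algorithm: score_alignment is a hypothetical helper that assumes match 1 and mismatch 0 (the aligner defaults shown earlier) and gaps in the query row only, so target and query positions coincide at the start of each gap.

```python
def my_gap_score_function(start, length):
    if start == 2:
        return -1000
    return -1 * length

def score_alignment(target_row, query_row, query_gap_score, match=1.0, mismatch=0.0):
    """Score a gapped alignment whose gaps occur only in the query row."""
    score = 0.0
    position = 0  # number of query letters consumed so far
    i = 0
    while i < len(target_row):
        if query_row[i] == "-":
            # Find the full extent of this gap and score it in one call
            j = i
            while j < len(query_row) and query_row[j] == "-":
                j += 1
            score += query_gap_score(position, j - i)
            i = j
        else:
            score += match if target_row[i] == query_row[i] else mismatch
            position += 1
            i += 1
    return score

for query_row in ["-AATT", "A-ATT", "AA-TT", "AAT-T", "AATT-"]:
    print(query_row, score_alignment("AACTT", query_row, my_gap_score_function))
```

The candidate AA-TT would otherwise be the best alignment (four matches and one gap), but its gap begins at index 2 and costs -1000, so the four alignments with a score of 2.0 are reported instead.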

