# Pairwise Sequence Alignments with Biopython
Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid).

Identifying similar regions lets us infer, for example, which traits are conserved between species, how closely related different species are genetically, and how species evolve.

Pairwise sequence alignment uses dynamic programming to find the optimal alignment between the two sequences, scores it by similarity (how alike the sequences are) or distance (how different they are), and then assesses the significance of that score.
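
As a rough illustration of the dynamic programming idea, the sketch below fills a score table for aligning two sequences end to end. It is a minimal teaching example, not Biopython's implementation: the function name `global_alignment_score` and the single per-character `gap` cost are assumptions made here for brevity.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Best end-to-end alignment score (hypothetical helper, linear gap cost)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # seq_a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # seq_b prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


print(global_alignment_score("ACCGGT", "ACGT"))
```

With the default values (match 1, mismatch 0, no gap cost) the table only counts matched characters, so each cell depends on just three neighbours that were already computed.
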
## Types of alignments
1. **Global alignment:** This method finds the best alignment over the entire length of the two sequences. It answers the question: what is the maximum similarity between sequence X and sequence Y?
2. **Local alignment:** This method finds the most similar subsequences of the two sequences. It answers the question: what is the maximum similarity between a subsequence of X and a subsequence of Y?
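
To make the local case concrete, here is a small sketch in the style of the Smith-Waterman algorithm. The helper name `local_alignment_score`, the example sequences, and the scoring values are assumptions chosen for illustration. The key difference from a global alignment is that a cell may restart at 0, so poorly matching flanks do not drag the score down.

```python
def local_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of subsequences (hypothetical helper)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a new local alignment here
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            best = max(best, score[i][j])
    return best


# The shared core "ACGT" is surrounded by unrelated flanking characters.
print(local_alignment_score("GGGACGTCCC", "TTACGTAA"))
```
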

Moreover, when doing alignments, you can specify the match score and gap penalties.

1. The **match score** indicates the compatibility between an alignment of two characters in the sequences. Highly compatible characters should be given **positive scores**, and incompatible ones should be given negative scores or 0.

2. The **gap penalties** should be negative.
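
The two kinds of scores combine by simple addition over the columns of an alignment. The sketch below scores an alignment that is already given; the helper `score_alignment` and its default values are hypothetical, and it charges the same penalty for every gap character.

```python
def score_alignment(aligned_a, aligned_b, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of two gapped strings ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            total += gap       # one gap character
        elif a == b:
            total += match     # compatible characters
        else:
            total += mismatch  # incompatible characters
    return total


# 4 matching columns and 2 gap characters
print(score_alignment("ACCGGT", "AC--GT"))
```
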

## Biopython: Bio.pairwise2
Biopython includes two built-in pairwise aligners: the Bio.pairwise2 module and the PairwiseAligner class in the Bio.Align module (available since Biopython version 1.72). Both can perform global and local alignments. **This section focuses on pairwise2.**

The names of the alignment functions in this module follow the convention alignmenttypeXY, where alignmenttype is either “global” or “local” and XY is a two-character code indicating the parameters the function takes. The first character, X, selects the parameters for matches (and mismatches), and the second, Y, selects the parameters for gap penalties.

**The match parameters are:**

1. x - No parameters. Identical characters score 1, all others score 0.

2. m - Identical characters get the match score; all others get the mismatch score. Keywords: match, mismatch

3. d - A dictionary gives the score for each pair of characters. Keyword: match_dict

4. c - A callback function returns the score for a pair of characters. Keyword: match_fn

**The gap penalty parameters are:**

1. x - No gap penalties.
2. s - Both sequences share the same open and extend gap penalties. Keywords: open, extend
3. d - Each sequence has its own open and extend gap penalties. Keywords: openA, extendA, openB, extendB
4. c - A callback function returns the gap penalties. Keywords: gap_A_fn, gap_B_fn
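
The naming convention can be read mechanically. The sketch below decodes a function name into the keywords listed above; `describe_alignment_function` and the two lookup tables are hypothetical helpers written for this illustration and are not part of Biopython.

```python
# Keywords expected for each match code (first character after the type)
MATCH_KEYWORDS = {
    "x": [],
    "m": ["match", "mismatch"],
    "d": ["match_dict"],
    "c": ["match_fn"],
}
# Keywords expected for each gap code (second character after the type)
GAP_KEYWORDS = {
    "x": [],
    "s": ["open", "extend"],
    "d": ["openA", "extendA", "openB", "extendB"],
    "c": ["gap_A_fn", "gap_B_fn"],
}


def describe_alignment_function(name):
    """Split a name such as 'globalms' into (type, match keywords, gap keywords)."""
    for kind in ("global", "local"):
        if name.startswith(kind) and len(name) == len(kind) + 2:
            match_code, gap_code = name[-2], name[-1]
            if match_code in MATCH_KEYWORDS and gap_code in GAP_KEYWORDS:
                return kind, MATCH_KEYWORDS[match_code], GAP_KEYWORDS[gap_code]
    raise ValueError(f"not a pairwise2-style function name: {name!r}")


print(describe_alignment_function("globalms"))
print(describe_alignment_function("localxx"))
```
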

#### **For example: global alignment**
Local alignment follows the same procedure: only the function name changes, with local in place of global (for example, localxx instead of globalxx).

In [1]:
!pip install biopython

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting biopython
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
Installing collected packages: biopython
Successfully installed biopython-1.79


In [2]:
from Bio import pairwise2 #import module

In [6]:
# Parameters for this example:
# globalxx -> a match scores 1, a mismatch scores 0, and there is no gap penalty
alignments = pairwise2.align.globalxx('ACCGGT', 'ACGT')

for alignment in alignments:
  print(pairwise2.format_alignment(*alignment)) 

ACCGGT
| | ||
A-C-GT
  Score=4

ACCGGT
||  ||
AC--GT
  Score=4

ACCGGT
| || |
A-CG-T
  Score=4

ACCGGT
|| | |
AC-G-T
  Score=4



In [4]:
alignments

[Alignment(seqA='ACCGGT', seqB='A-C-GT', score=4.0, start=0, end=6),
 Alignment(seqA='ACCGGT', seqB='AC--GT', score=4.0, start=0, end=6),
 Alignment(seqA='ACCGGT', seqB='A-CG-T', score=4.0, start=0, end=6),
 Alignment(seqA='ACCGGT', seqB='AC-G-T', score=4.0, start=0, end=6)]

In [7]:
# Parameters for this example:
# globalmx -> a match scores 2, a mismatch scores -1, and there is no gap penalty
alignments = pairwise2.align.globalmx('ACCGGT', 'ACGT', match=2, mismatch=-1)
for alignment in alignments:
  print(pairwise2.format_alignment(*alignment)) 

ACCGGT
| | ||
A-C-GT
  Score=8

ACCGGT
||  ||
AC--GT
  Score=8

ACCGGT
| || |
A-CG-T
  Score=8

ACCGGT
|| | |
AC-G-T
  Score=8



In [17]:
# Parameters for this example:
# globalxs -> a match scores 1, a mismatch scores 0, opening a gap costs -2, and extending it costs -1
alignments = pairwise2.align.globalxs('ACCGGT', 'ACGT', open=-2, extend=-1)

for alignment in alignments:
  print(pairwise2.format_alignment(*alignment)) 

ACCGGT
||  ||
AC--GT
  Score=1
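
The printed score of 1 can be checked by hand. The sketch below assumes that the first character of a gap is charged only the open penalty and each further character the extend penalty, which is consistent with the output above; `affine_gap_penalty` is a hypothetical helper, not a Biopython function.

```python
def affine_gap_penalty(length, open_penalty, extend_penalty):
    """Penalty for one gap: open once, then extend for each further character."""
    if length <= 0:
        return 0
    return open_penalty + (length - 1) * extend_penalty


matches = 4 * 1                       # A, C, G and T each score 1
gap = affine_gap_penalty(2, -2, -1)   # the single two-character gap "--"
print(matches + gap)                  # 4 + (-3)
```

Splitting the gap into two separate one-character gaps would cost -2 twice, which is why only the alignment with one contiguous gap is reported.
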



In [15]:
alignments

[Alignment(seqA='ACCGGT', seqB='AC--GT', score=1.0, start=0, end=6)]

In [23]:
from math import log
# Callback function that returns the gap penalty
def gap_function(x, y):  # x is the gap position in the sequence, y is the gap length
  if y == 0:  # no gap
    return 0
  elif y == 1:  # gap open penalty
    return -2
  return -(2 + y/4.0 + log(y)/2.0)  # longer gaps: the penalty grows with the length

# globalmc - matches score 5, mismatches -4, gap penalty defined through function gap_function

alignments = pairwise2.align.globalmc("ACCCCCGT","ACG", match=5, mismatch=-4,
                                      gap_A_fn=gap_function, gap_B_fn=gap_function)

for alignment in alignments:
  print(pairwise2.format_alignment(*alignment))


ACCCCCGT
|    || 
A----CG-
  Score=9.30685

ACCCCCGT
||    | 
AC----G-
  Score=9.30685
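
The score of 9.30685 follows directly from the callback. The sketch below repeats the gap function (with renamed arguments) and adds up the pieces for the first alignment shown; the position arguments passed in are placeholders, since this particular function ignores the position.

```python
from math import log


def gap_function(position, length):
    """Same penalty rule as the callback above."""
    if length == 0:
        return 0
    if length == 1:
        return -2
    return -(2 + length / 4.0 + log(length) / 2.0)


match_total = 3 * 5  # A, C and G each match with a score of 5
# one four-character gap plus one single-character gap at the end
gap_total = gap_function(1, 4) + gap_function(7, 1)
score = match_total + gap_total
print(round(score, 5))
```
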



##### Protein alignment example

In [22]:
# globaldx - match/mismatch scores are looked up in the BLOSUM62 matrix; no gap penalty
from Bio.Align import substitution_matrices
matrix = substitution_matrices.load("BLOSUM62")  # BLOSUM62 substitution matrix for protein alignments
alignments = pairwise2.align.globaldx("KEVLA", "EVL", match_dict=matrix)
for alignment in alignments:
    print(pairwise2.format_alignment(*alignment))


KEVLA
 ||| 
-EVL-
  Score=13



In [21]:
alignments

[Alignment(seqA='KEVLA', seqB='-EVL-', score=13.0, start=0, end=5)]
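
The score of 13 is the sum of three BLOSUM62 entries, because the gap columns cost nothing with the x gap code. The sketch below hard-codes only the three entries needed (E/E = 5, V/V = 4, L/L = 4) in a small dictionary named `blosum62_subset`, a stand-in introduced here so the example runs without loading the full matrix.

```python
# Only the BLOSUM62 entries used by this alignment
blosum62_subset = {("E", "E"): 5, ("V", "V"): 4, ("L", "L"): 4}

aligned_a, aligned_b = "KEVLA", "-EVL-"
score = sum(
    blosum62_subset[(a, b)]
    for a, b in zip(aligned_a, aligned_b)
    if "-" not in (a, b)  # gap columns are free under the 'x' gap code
)
print(score)
```
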