# Sequence alignment in BioPython

In this lab, we will introduce BioPython facilities to build, parse, and store alignments, both pairwise and multiple. 

## Sequence alignment I/O

Before looking into how to compute an alignment, let's look at BioPython's routines for manipulating existing alignments. These come in handy when processing (multiple) sequence alignments from resources such as [Pfam](http://pfam-legacy.xfam.org/).

### Read alignment

To manipulate sequence alignments, BioPython features the [Bio.AlignIO](https://biopython.org/docs/latest/api/Bio.AlignIO.html) module with functions to read a single alignment (`Bio.AlignIO.read()`) or multiple alignments (`Bio.AlignIO.parse()`). `read` returns a single [MultipleSeqAlignment](https://biopython.org/docs/latest/api/Bio.Align.html#Bio.Align.MultipleSeqAlignment) object, while `parse` returns an iterator over `MultipleSeqAlignment` objects. Both functions accept a file handle (or path) and the [format](https://biopython.org/docs/latest/api/Bio.AlignIO.html#file-formats) of the alignment. For example, the Pfam alignments are stored in the [Stockholm file format](https://en.wikipedia.org/wiki/Stockholm_format). Let's read in the Coronavirus spike glycoprotein S1 family MSA [PF01600](https://www.ebi.ac.uk/interpro/entry/pfam/PF01600/).

In [1]:
from Bio import AlignIO
alignment = AlignIO.read("{}/PF01600_full.sto".format('data'), "stockholm")
print(alignment)

Alignment with 32 rows and 557 columns
----------------------YQVL-P---DSGEFSDNLFTVG...PK- S5YNL4_9ALPC/219-623
-------------MRANIRNSQ--------------TDVCTTIQ...PQ- H9BR08_9NIDO/50-472
----------------------YSIC-K---NCSGFPDHVFAAG...PK- A0A0U1WHD9_9ALPC/218-633
---------------------V-QNC-T--GNCEEYANNIFSTE...P-T D9J204_9ALPC/237-647
---------------------C-SNCTD---QCASYVANVFVTQ...PS- Q0PL12_9ALPC/21-443
--------------------YP---C-P---TSSPFVSGDCVIK...RQ- V5TFD8_9GAMC/303-694
----------------------YTLC-D---NCTGFPQHVFATM...P-L A0A0U1WHD7_9ALPC/215-642
----------------------YSVC-N---DCAGFPKYVFAVN...P-A B1PHI8_9ALPC/221-631
---------------YEFCEDY-----E---YCTATATNVFAPT...PS- H9TEX4_9ALPC/251-675
----------------------YSVC-D---DCDGFPKYVFAVT...P-P B1PHJ5_9ALPC/221-631
----------------------YTLC-S---NCSGFPQHVFAVG...PK- A0A0U1WHB6_9ALPC/218-633
---------------------V-SNCTD---QCASYVANVFTTQ...PS- SPIKE_CVPPU/248-668
----------------------YSVC-T---ECDGFPKHVFPVL...T-E K4JZP8_9ALPC/219-610
----------------
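
`AlignIO.parse` is mentioned above but not shown. The following is a minimal sketch of how it could be used, assuming a handle that holds more than one alignment. The two toy alignments, their ids, and the helper `make_alignment` are made up for illustration and are written to an in-memory Stockholm buffer first so that the example does not depend on any file on disk.

```python
from io import StringIO

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def make_alignment(rows):
    """Build a small MultipleSeqAlignment from (id, sequence) pairs (hypothetical helper)."""
    return MultipleSeqAlignment(
        [SeqRecord(Seq(seq), id=name, description="") for name, seq in rows]
    )


# Two toy alignments with different shapes (illustrative data only).
first = make_alignment([("seqA", "ACG-T"), ("seqB", "ACGTT")])
second = make_alignment([("seqC", "GG-A"), ("seqD", "GGCA"), ("seqE", "GGCA")])

# Write both alignments into one in-memory Stockholm "file".
handle = StringIO()
n_written = AlignIO.write([first, second], handle, "stockholm")
handle.seek(0)

# parse() yields one MultipleSeqAlignment per alignment in the handle.
parsed = list(AlignIO.parse(handle, "stockholm"))
for aln in parsed:
    print(len(aln), "rows,", aln.get_alignment_length(), "columns")
```

Calling `AlignIO.read` on such a handle would raise an error, because `read` expects exactly one alignment.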

We can iterate over the records in the alignment, obtaining the individual sequences as `SeqRecord`s (see the first lab for details on `SeqRecord`).

In [2]:
for record in alignment:
    print(record.letter_annotations)
    print(f"{record.id}\n{record.dbxrefs}\n{record.seq}")   
    print("$$$$$$$$$$$$$$$$$$$$$$")


{}
S5YNL4_9ALPC/219-623
[]
----------------------YQVL-P---DSGEFSDNLFTVGDDGSIPP-SFGFNNWFVLSNSSSIISGTVVSNQPLRLT---C--LWPIP-----SSTGALATI---YFNGTN----GA-QCN------------GFDS--NAPFDAIRFNL--NGTLSGHNFVS----GFVLHAANGATLGFSCTNSTDAPYLR-------QIPFGI-GDT-PYYCYLNV---------TTDINSTMSFVGALPLNLREIVIA-SNGDVYMNGYRYFAAGDLSSVDVELPSQQV--FGSTFWTIAFTVFETVLLEVDGTSINRMLYCD--NPL-NRVKCSHTQFDLVDGFYPLT--DVDLAVKPFTF-VTLPTFADHSFVYFNFSLMF-----DDLN----------EDFRLQSFNLTINGQL---------SYCVQSRQFTT-SGSVRTNT------------------NHQFGFYTQRAAS---------NGCPFTIDTLNNYLTFGRICFSFG-ESGAGCGVDVMVESQYNMFKVT---T---IFVSYSEGDIIAGMPK-
$$$$$$$$$$$$$$$$$$$$$$
{}
H9BR08_9NIDO/50-472
[]
-------------MRANIRNSQ--------------TDVCTTIQQGGFIPS-TFTFPQWYVLTNGSTFLQGEYTLSQPLLAN---A--HFCPR-----KNSDGYWRY---SFNNSCL-FPDH-RCQDHWYDSQNPICLGWNNT-FGLSDNIRINI--NISHDEYQSHG---GYVSLTLESGSVVNITCTNNSDPSTVTL---ATSLLPWARAIDQ-PMYCFANL---------TTGTASQLDFMGMLPPLVSELAFD-RTGGIYINGYRYYLTSALRDVDFKLKRND----TAEYFAVTWANYTDVHLSVDAGAIEKIKYCN--TPL-DRLACDMNVFNLSDGVYSYT--SLEKASVPETF-VTLP

We can also obtain column-level annotations, i.e. annotations that are defined per alignment column rather than per sequence (here, the consensus sequence and the secondary structure).

In [3]:
for key in sorted(alignment.column_annotations.keys()):
    print("{}: {}\n".format(key, alignment.column_annotations[key]))

GC:seq_cons: .........................C.s...pCsuaspplFssppsGhIPs.sFsFsNWFhLTNoSo.lsGhhsohQPLhls...C..LWslP.....shpsssthh...hFstst....ts.pCN............Ghss..ssssDslRFsL..NFTsst..ttu....slsLpossss.hshoCoNsossssts.......hlPFGh.sst.shYCassh............stsshpFlGhLPPsV+EhsIs.+.GshYlNGYphFshsslpslpFNlosss....sssFWTlAasshs-VLlplssTsIppllYCs..ssl.NplKCpQlpFsLsDGFYsho..s.tsspls+Th.VpLPpahsHohlslslslsh.....stps.......spshts..t..slshssss.s.......shCVsospFol.plphtst.....................sshshssslps.........GsCPFohsplNNYLoFsplCFShs.s.suuCphslh.tpthhtphhph..s...lYVsapcGspIsGlPp.

secondary_structure: ......................XXXX.X...XXXXXXXXXXXXXTTSXXXT.TTTSSSSXXXBSSSEESSEEEEEEEEEEEC...E..CECEC.....TTSSEEEEE...CSTSTT...-SS.SST............SCCGG.TSXXSEEEEEE..ESTSTTS-SSE....EEEEEETTCEEEEEEEETTSSCCCHH......HSXXSSX.SSS.SXEEEECC.........EETTEEEEEEEECXXSSXSEEEEE.TTSEEEETTEEEEEESSCCEEEEEEEXTT....TTXECEEEEEEEEEEEEEEETTEEEEEEETT.TSHH.HHHHHHTTSSSXXSEEEEEG..GGXBSSSEEEE.EXSXXXX-EEEEEEEEEEEE.....C--S......TTCECCCSTE

### Write alignment
To serialize a `MultipleSeqAlignment` object, we call the `Bio.AlignIO.write()` function and pass in the alignment object, the path to the file, and the format. The function returns the number of alignments written.

In [4]:
AlignIO.write(alignment, 'PF01600_serialized.faa', 'fasta')

1
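
`AlignIO.write` also accepts a file handle instead of a path. As a small sketch (the toy records are assumptions made for illustration), the alignment can be written into an in-memory buffer to inspect the serialized text without touching the disk:

```python
from io import StringIO

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy alignment; empty descriptions keep the FASTA headers short.
toy = MultipleSeqAlignment(
    [
        SeqRecord(Seq("AAAACGT"), id="Alpha", description=""),
        SeqRecord(Seq("AAA-CGT"), id="Beta", description=""),
    ]
)

buffer = StringIO()
count = AlignIO.write(toy, buffer, "fasta")  # number of alignments written
fasta_lines = buffer.getvalue().splitlines()

print(count)
print("\n".join(fasta_lines))
```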

We can also render the alignment as a string in a given format with the built-in `format` function and print the result.

In [5]:
print(format(alignment, "clustal"))

CLUSTAL X (1.81) multiple sequence alignment


S5YNL4_9ALPC/219-623                ----------------------YQVL-P---DSGEFSDNLFTVGDDGSIP
H9BR08_9NIDO/50-472                 -------------MRANIRNSQ--------------TDVCTTIQQGGFIP
A0A0U1WHD9_9ALPC/218-633            ----------------------YSIC-K---NCSGFPDHVFAAGQDGTIP
D9J204_9ALPC/237-647                ---------------------V-QNC-T--GNCEEYANNIFSTEPGGIIP
Q0PL12_9ALPC/21-443                 ---------------------C-SNCTD---QCASYVANVFVTQPGGFIP
V5TFD8_9GAMC/303-694                --------------------YP---C-P---TSSPFVSGDCVIKDWVWIR
A0A0U1WHD7_9ALPC/215-642            ----------------------YTLC-D---NCTGFPQHVFATMENGEIP
B1PHI8_9ALPC/221-631                ----------------------YSVC-N---DCAGFPKYVFAVNEGGTIP
H9TEX4_9ALPC/251-675                ---------------YEFCEDY-----E---YCTATATNVFAPTVGGYIP
B1PHJ5_9ALPC/221-631                ----------------------YSVC-D---DCDGFPKYVFAVTEGGEVP
A0A0U1WHB6_9ALPC/218-633            ----------------------YTLC-S---NCSGFPQHVFAVGPDG

### Manipulate alignment

The `MultipleSeqAlignment` class has several convenience methods for creating new alignments and manipulating existing ones.

In [11]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
a = SeqRecord(Seq("AAAACGT"), id="Alpha")
b = SeqRecord(Seq("AAA-CGT"), id="Beta")
c = SeqRecord(Seq("AAAAGGT"), id="Gamma")
align = MultipleSeqAlignment([a, b, c],
                             annotations={"tool": "demo"},
                             column_annotations={"stats": "CCCXCCC"})

In [12]:
print(align)
print(list(align))
print(len(align))
print(align.annotations)
print(align.column_annotations)

Alignment with 3 rows and 7 columns
AAAACGT Alpha
AAA-CGT Beta
AAAAGGT Gamma
[SeqRecord(seq=Seq('AAAACGT'), id='Alpha', name='<unknown name>', description='<unknown description>', dbxrefs=[]), SeqRecord(seq=Seq('AAA-CGT'), id='Beta', name='<unknown name>', description='<unknown description>', dbxrefs=[]), SeqRecord(seq=Seq('AAAAGGT'), id='Gamma', name='<unknown name>', description='<unknown description>', dbxrefs=[])]
3
{'tool': 'demo'}
{'stats': 'CCCXCCC'}


To add sequences to an existing alignment, we can use the `append` and `extend` methods (the sequence length must match the MSA length).

In [13]:
align.append(SeqRecord(Seq("--AAG-T"), id="Delta"))

In [14]:
print(align)

Alignment with 4 rows and 7 columns
AAAACGT Alpha
AAA-CGT Beta
AAAAGGT Gamma
--AAG-T Delta
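
The cell above only exercises `append`. The sketch below, on made-up toy records, shows `extend` adding several records at once, and what happens when the length requirement is violated (a `ValueError` is raised).

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

msa = MultipleSeqAlignment([SeqRecord(Seq("AAAACGT"), id="Alpha")])

# extend() takes an iterable of SeqRecord objects.
msa.extend(
    [
        SeqRecord(Seq("AAA-CGT"), id="Beta"),
        SeqRecord(Seq("AAAAGGT"), id="Gamma"),
    ]
)

# A record whose length differs from the alignment length is rejected.
try:
    msa.append(SeqRecord(Seq("AAA"), id="TooShort"))
    length_check_failed = False
except ValueError as err:
    length_check_failed = True
    print("Rejected:", err)

print(len(msa))
```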


Alignments can also be extended column-wise:

In [15]:
print(align + align)

Alignment with 4 rows and 14 columns
AAAACGTAAAACGT Alpha
AAA-CGTAAA-CGT Beta
AAAAGGTAAAAGGT Gamma
--AAG-T--AAG-T Delta


It is possible to slice the alignment both column-wise and row-wise.

In [16]:
print(align[-1], '\n')
print(align[1:3], '\n')
print(align[1:3, 1:4], '\n')
print(align[:,1:4], '\n')

ID: Delta
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('--AAG-T') 

Alignment with 2 rows and 7 columns
AAA-CGT Beta
AAAAGGT Gamma 

Alignment with 2 rows and 3 columns
AA- Beta
AAA Gamma 

Alignment with 4 rows and 3 columns
AAA Alpha
AA- Beta
AAA Gamma
-AA Delta 



Note that slicing an MSA column-wise also slices its column annotations.

In [17]:
sliced = align[:, 1:4]
print(sliced)
print(sliced.column_annotations)

Alignment with 4 rows and 3 columns
AAA Alpha
AA- Beta
AAA Gamma
-AA Delta
{'stats': 'CCX'}


Note that combining column-wise slicing with column-wise extension lets us remove columns. Here we keep only columns 1 and 4 and drop the rest.

In [18]:
print(align[:, 1:2] + align[:, 4:5])

Alignment with 4 rows and 2 columns
AC Alpha
AC Beta
AG Gamma
-G Delta
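
The same idea can be wrapped into a small helper that drops a contiguous block of columns. This is only a sketch: `drop_columns` is a hypothetical name, not part of BioPython, and the toy alignment is made up for illustration.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def drop_columns(msa, start, end):
    """Return a new alignment without columns start..end-1 (hypothetical helper)."""
    # Keep everything left of the block and everything right of it,
    # then glue the two parts together column-wise.
    return msa[:, :start] + msa[:, end:]


toy = MultipleSeqAlignment(
    [
        SeqRecord(Seq("AAAACGT"), id="Alpha"),
        SeqRecord(Seq("AAA-CGT"), id="Beta"),
        SeqRecord(Seq("AAAAGGT"), id="Gamma"),
    ]
)

# Remove column 3, the only column that contains a gap.
trimmed = drop_columns(toy, 3, 4)
print(trimmed)
```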


If more advanced array manipulation is required, the MSA can be converted to a NumPy array.

In [19]:
import numpy as np
align_array = np.array([list(rec) for rec in align])
print(align_array.shape)
print(align_array[1:2])

(4, 7)
[['A' 'A' 'A' '-' 'C' 'G' 'T']]
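
As one illustration of what the array form makes easy, the sketch below uses a boolean mask to keep only the columns in which no sequence has a gap. The alignment is rebuilt here so that the snippet is self-contained; the variable names are arbitrary.

```python
import numpy as np
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

toy = MultipleSeqAlignment(
    [
        SeqRecord(Seq("AAAACGT"), id="Alpha"),
        SeqRecord(Seq("AAA-CGT"), id="Beta"),
        SeqRecord(Seq("AAAAGGT"), id="Gamma"),
        SeqRecord(Seq("--AAG-T"), id="Delta"),
    ]
)

# One row per sequence, one single-character cell per column.
toy_array = np.array([list(str(rec.seq)) for rec in toy])

# True for every column in which at least one row holds a gap.
has_gap = (toy_array == "-").any(axis=0)

# Boolean indexing on the column axis keeps the gap-free columns only.
gap_free = toy_array[:, ~has_gap]
gap_free_rows = ["".join(row) for row in gap_free]
print(gap_free_rows)
```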


### ---- Begin Exercise ----
- Write a function that takes a Pfam family identifier, loads the corresponding MSA from disk, and computes the sum-of-pairs (SP) score for each column. To do that, the user also has to specify the scoring matrix (see below how to work with scoring matrices). For every column, the function computes the SP score together with the number of gaps in that column. The results are stored in the `column_annotations` of the MSA, and the MSA with the enriched `column_annotations` is returned to the user.
### ---- End Exercise ----

## Obtaining pairwise alignment

BioPython includes two sequence aligners - the "old" [Bio.pairwise2 module](https://biopython.org/docs/latest/api/Bio.pairwise2.html) and the new [PairwiseAligner class](https://biopython.org/docs/latest/api/Bio.Align.html#Bio.Align.PairwiseAligner) that is part of the [Bio.Align package](https://biopython.org/docs/latest/api/Bio.Align.html). It is suggested to use the `PairwiseAligner` class as it provides a faster and more efficient implementation. That said, both aligners should return the same results.

In [21]:
from Bio import Align
aligner = Align.PairwiseAligner()

In [22]:
seq1 = "GAACT"
seq2 = "GAT"

aligner = Align.PairwiseAligner(match_score=1.0)
aligner.match_score = 2.0

print(aligner.score(seq1, seq2))

6.0
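
The score can be checked by hand: three matching positions at 2.0 each give 6.0, because mismatches and gaps cost nothing in that configuration. The sketch below sets every parameter explicitly (the chosen values are arbitrary and only serve the illustration) and adds a gap penalty, so that the two unavoidable gap positions lower the score to 3 × 2 − 2 × 2 = 2.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 2.0
aligner.mismatch_score = -1.0
aligner.gap_score = -2.0  # same penalty for opening and extending a gap

# "GAT" is two letters shorter than "GAACT", so every global alignment
# contains at least two gap positions; the best one matches G, A and T.
score = aligner.score("GAACT", "GAT")
print(score)
```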


In [23]:
alignments = aligner.align(seq1, seq2)
for a in alignments:
    print(a)

target            0 GAACT 5
                  0 ||--| 5
query             0 GA--T 3

target            0 GAACT 5
                  0 |-|-| 5
query             0 G-A-T 3



In [24]:
aligner.mode

'global'

In [25]:
seq1 = "AGAACTC"
seq2 = "GAACC"
aligner.gap_score = -1
aligner.match_score = 1
for mode in ['global', 'local']:
    aligner.mode = mode
    print(aligner.algorithm)
    alignments = aligner.align(seq1, seq2)
    for a in alignments:
        print(a.score)
        print(a)

Needleman-Wunsch
3.0
target            0 AGAACTC 7
                  0 -||||-| 7
query             0 -GAAC-C 5

Smith-Waterman
4.0
target            1 GAAC 5
                  0 |||| 4
query             0 GAAC 4



The aligner offers fine-grained control over the gap penalties.

In [26]:
print(aligner)

Pairwise sequence aligner with parameters
  wildcard: None
  match_score: 1.000000
  mismatch_score: 0.000000
  target_internal_open_gap_score: -1.000000
  target_internal_extend_gap_score: -1.000000
  target_left_open_gap_score: -1.000000
  target_left_extend_gap_score: -1.000000
  target_right_open_gap_score: -1.000000
  target_right_extend_gap_score: -1.000000
  query_internal_open_gap_score: -1.000000
  query_internal_extend_gap_score: -1.000000
  query_left_open_gap_score: -1.000000
  query_left_extend_gap_score: -1.000000
  query_right_open_gap_score: -1.000000
  query_right_extend_gap_score: -1.000000
  mode: local



Opening scores|Extending scores
---|---
query_left_open_gap_score|query_left_extend_gap_score
query_internal_open_gap_score|query_internal_extend_gap_score
query_right_open_gap_score|query_right_extend_gap_score
target_left_open_gap_score|target_left_extend_gap_score
target_internal_open_gap_score|target_internal_extend_gap_score
target_right_open_gap_score|target_right_extend_gap_score

target|	query|	score|
---|---|---|
A|	-|	query left open gap score|
C|	-|	query left extend gap score|
C|	-|	query left extend gap score|
G|	G|	match score|
G|	T|	mismatch score|
G|	-|	query internal open gap score|
A|	-|	query internal extend gap score|
A|	-|	query internal extend gap score|
T|	T|	match score|
A|	A|	match score|
G|	-|	query internal open gap score|
C|	C|	match score|
-|	C|	target internal open gap score|
-|	C|	target internal extend gap score|
C|	C|	match score|
T|	G|	mismatch score|
C|	C|	match score|
-|	C|	target internal open gap score|
A|	A|	match score|
-|	T|	target right open gap score|
-|	A|	target right extend gap score|
-|	A|	target right extend gap score|


Meta-attribute |	Attributes it maps to
---|---
gap_score	|target_gap_score, query_gap_score
open_gap_score|	target_open_gap_score, query_open_gap_score
extend_gap_score|	target_extend_gap_score, query_extend_gap_score
internal_gap_score|	target_internal_gap_score, query_internal_gap_score
internal_open_gap_score|	target_internal_open_gap_score, query_internal_open_gap_score
internal_extend_gap_score|	target_internal_extend_gap_score, query_internal_extend_gap_score
end_gap_score|	target_end_gap_score, query_end_gap_score
end_open_gap_score|	target_end_open_gap_score, query_end_open_gap_score
end_extend_gap_score|	target_end_extend_gap_score, query_end_extend_gap_score
left_gap_score|	target_left_gap_score, query_left_gap_score
right_gap_score| target_right_gap_score, query_right_gap_score
left_open_gap_score|	target_left_open_gap_score, query_left_open_gap_score
left_extend_gap_score|	target_left_extend_gap_score, query_left_extend_gap_score
right_open_gap_score|	target_right_open_gap_score, query_right_open_gap_score
right_extend_gap_score|	target_right_extend_gap_score, query_right_extend_gap_score
target_open_gap_score| target_internal_open_gap_score, target_left_open_gap_score, target_right_open_gap_score
target_extend_gap_score| target_internal_extend_gap_score, target_left_extend_gap_score, target_right_extend_gap_score
target_gap_score|	target_open_gap_score, target_extend_gap_score
query_open_gap_score| query_internal_open_gap_score, query_left_open_gap_score, query_right_open_gap_score
query_extend_gap_score| query_internal_extend_gap_score, query_left_extend_gap_score, query_right_extend_gap_score
query_gap_score|	query_open_gap_score, query_extend_gap_score
target_internal_gap_score|	target_internal_open_gap_score, target_internal_extend_gap_score
target_end_gap_score|	target_end_open_gap_score, target_end_extend_gap_score
target_end_open_gap_score|	target_left_open_gap_score, target_right_open_gap_score
target_end_extend_gap_score|	target_left_extend_gap_score, target_right_extend_gap_score
target_left_gap_score|	target_left_open_gap_score, target_left_extend_gap_score
target_right_gap_score|	target_right_open_gap_score, target_right_extend_gap_score
query_end_gap_score|	query_end_open_gap_score, query_end_extend_gap_score
query_end_open_gap_score|	query_left_open_gap_score, query_right_open_gap_score
query_end_extend_gap_score|	query_left_extend_gap_score, query_right_extend_gap_score
query_internal_gap_score|	query_internal_open_gap_score, query_internal_extend_gap_score
query_left_gap_score|	query_left_open_gap_score, query_left_extend_gap_score
query_right_gap_score|	query_right_open_gap_score, query_right_extend_gap_score
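
To illustrate how the meta-attributes can be combined, here is a sketch of an affine gap penalty (opening a gap costs more than extending it) with and without penalized end gaps. The sequences and penalty values are made up for this example, and only the meta-attributes `open_gap_score`, `extend_gap_score` and `end_gap_score` from the table above are used.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2    # first position of every gap
aligner.extend_gap_score = -1  # each further position of the same gap

target = "AAACCCGGGTTT"
query = "CCCGGG"

# End gaps are penalized like internal gaps here.
penalized = aligner.score(target, query)

# Make gaps at both ends of the alignment free; the query can now
# sit anywhere inside the target at no cost.
aligner.end_gap_score = 0
free_ends = aligner.score(target, query)

print(penalized, free_ends)
```

With free end gaps the score is simply the six matching positions, while the first setting additionally pays for the overhangs on both sides.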

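It is even possible to supply a general gap-scoring function, which receives the start position and the length of a gap and returns its score. Note that the aligner is still in local mode here, as set by the loop above.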

In [27]:
def my_gap_score_function(start, length):
    if start == 2:
        return -1000
    else:
        return -1 * length

for query_gap_score in [0, my_gap_score_function]:
    print(query_gap_score)
    aligner.query_gap_score = query_gap_score
    alignments = aligner.align("AACCCTT", "AATT")
    for alignment in alignments:
        print(alignment)

0
target            0 AACCCTT 7
                  0 ||---|| 7
query             0 AA---TT 4

<function my_gap_score_function at 0x00000236E2B98E00>
target            0 AA 2
                  0 || 2
query             0 AA 2

target            5 TT 7
                  0 || 2
query             2 TT 4



For protein sequences, it is reasonable to use a substitution matrix. BioPython is distributed with plenty of substitution matrices (including PAM and BLOSUM), which are available via the [substitution_matrices](https://biopython.org/docs/latest/api/Bio.Align.substitution_matrices.html) subpackage of `Bio.Align`. To find out which matrices are available, we can call the `load` function with no argument (the matrices are stored as flat files in `Bio/Align/substitution_matrices/data`). The same function is then used to load a specific matrix.

In [28]:
from Bio.Align import substitution_matrices
substitution_matrices.load()

['BENNER22',
 'BENNER6',
 'BENNER74',
 'BLASTN',
 'BLASTP',
 'BLOSUM45',
 'BLOSUM50',
 'BLOSUM62',
 'BLOSUM80',
 'BLOSUM90',
 'DAYHOFF',
 'FENG',
 'GENETIC',
 'GONNET1992',
 'HOXD70',
 'JOHNSON',
 'JONES',
 'LEVIN',
 'MCLACHLAN',
 'MDM78',
 'MEGABLAST',
 'NUC.4.4',
 'PAM250',
 'PAM30',
 'PAM70',
 'RAO',
 'RISLER',
 'SCHNEIDER',
 'STR',
 'TRANS']

In [29]:
m = substitution_matrices.load("BLOSUM62")
print(m)
print("A->R substitution score: {}".format(m['A', 'R']))

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V    B    Z    X    *
A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 -1.0  0.0 -2.0 -1.0 -1.0 -1.0 -1.0 -2.0 -1.0  1.0  0.0 -3.0 -2.0  0.0 -2.0 -1.0  0.0 -4.0
R -1.0  5.0  0.0 -2.0 -3.0  1.0  0.0 -2.0  0.0 -3.0 -2.0  2.0 -1.0 -3.0 -2.0 -1.0 -1.0 -3.0 -2.0 -3.0 -1.0  0.0 -1.0 -4.0
N -2.0  0.0  6.0  1.0 -3.0  0.0  0.0  0.0  1.0 -3.0 -3.0  0.0 -2.0 -3.0 -2.0  1.0  0.0 -4.0 -2.0 -3.0  3.0  0.0 -1.0 -4.0
D -2.0 -2.0  1.0  6.0 -3.0  0.0  2.0 -1.0 -1.0 -3.0 -4.0 -1.0 -3.0 -3.0 -1.0  0.0 -1.0 -4.0 -3.0 -3.0  4.0  1.0 -1.0 -4.0
C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 -4.0 -3.0 -3.0 -1.0 -1.0 -3.0 -1.0 -2.0 -3.0 -1.0 -1.0 -2.0 -2.0 -1.0 -3.0 -3.0 -2.0 -4.0
Q -1.0  1.0  0.0  0.
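
A loaded matrix can be indexed by pairs of letters, which is enough to score an ungapped pair of peptides by hand. The helper `ungapped_score` and the two-letter peptides below are assumptions introduced only for this sketch.

```python
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")


def ungapped_score(seq_a, seq_b, matrix):
    """Sum the substitution scores of two equally long, ungapped sequences
    (hypothetical helper, not part of BioPython)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[a, b] for a, b in zip(seq_a, seq_b))


print(blosum62.alphabet)          # letters covered by the matrix
print(blosum62["A", "A"])         # score of keeping an alanine
print(blosum62["A", "R"])         # score of an A/R substitution
pair_score = ungapped_score("AW", "AR", blosum62)  # A/A plus W/R
print(pair_score)
```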

Let's use the BLOSUM62 matrix to find an alignment between spike glycoprotein in SARS ([P59594](https://www.uniprot.org/uniprot/P59594)) and spike glycoprotein in bat coronavirus ([R9QTA0](https://www.uniprot.org/uniprot/R9QTA0)). But first, let's see what happens if we do not specify the scoring matrix.

In [30]:
from Bio import SeqIO
s_sars = SeqIO.read('data/spike_sars_cv.faa', 'fasta')
s_bat = SeqIO.read('data/spike_bat_cv.faa', 'fasta')

In [31]:
aligner = Align.PairwiseAligner()
alignments = aligner.align(s_sars.seq, s_bat.seq)

In [32]:
print(len(alignments))

OverflowError: number of optimal alignments is larger than 9223372036854775807

We got so many alignments because, with this scoring system, a huge number of different alignments reach the same optimal score. Let's inspect the aligner's parameters.

In [33]:
print(aligner)

Pairwise sequence aligner with parameters
  wildcard: None
  match_score: 1.000000
  mismatch_score: 0.000000
  target_internal_open_gap_score: 0.000000
  target_internal_extend_gap_score: 0.000000
  target_left_open_gap_score: 0.000000
  target_left_extend_gap_score: 0.000000
  target_right_open_gap_score: 0.000000
  target_right_extend_gap_score: 0.000000
  query_internal_open_gap_score: 0.000000
  query_internal_extend_gap_score: 0.000000
  query_left_open_gap_score: 0.000000
  query_left_extend_gap_score: 0.000000
  query_right_open_gap_score: 0.000000
  query_right_extend_gap_score: 0.000000
  mode: global
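
With a match score of 1 and all other scores at 0, gaps can be placed freely, so many alignments tie for the best score. The effect can be reproduced on a tiny made-up example: the sketch below compares the number of optimal alignments under that scoring with the number obtained once gaps are penalized (all parameters are set explicitly; the penalty values are arbitrary).

```python
from Bio import Align

# Scoring as printed above: matches count, everything else is free.
free_gaps = Align.PairwiseAligner()
free_gaps.mode = "global"
free_gaps.match_score = 1
free_gaps.mismatch_score = 0
free_gaps.gap_score = 0
n_free = len(free_gaps.align("AAAA", "AA"))

# Same sequences, but opening a gap is now expensive.
penalized = Align.PairwiseAligner()
penalized.mode = "global"
penalized.match_score = 1
penalized.mismatch_score = 0
penalized.open_gap_score = -2
penalized.extend_gap_score = -0.5
n_penalized = len(penalized.align("AAAA", "AA"))

print(n_free, n_penalized)
```

Penalizing gap openings favors alignments with a single contiguous gap, which leaves fewer ties.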



In [34]:
print(alignments[0].format('fasta'))
print(alignments[1].format('fasta'))
# print(alignments[0])
# print(alignments[1])

>
MFIFLLFLTLT--S------G----SDLD----RCTTFDDVQAPNYTQHTSSM--RGVYYP-DE-IFRSDT-LY-LTQDL-FLPFYSN--VTGFHTI-----NH----T---FG-NP-VIPFK-DGIYFAATEK-SNVV-RGWV-FGSTM--NNK-SQSV-III-NNSTN---V-VIRACNFE-LCDN--P-FFA-VS-----KPMGT----QTHTMIFDNAF-NCTFEYISD-AF--S--LDVSE-KSGN---FK-HLREFV-FKNK-DGFLYVYKG-YQ----PID-VVRDL---PSGFNT--LKPIFK---LPLGINITNF--RAIL---TA-FSPA--QDIWG----TS--AAAYF-VGYLKP-TTFMLKYDE--N--GTITD--AVDCSQN-PLA-ELKCSV--KS-FE-ID-KGIYQTSNFRVVPSGD----VVRFPNITNL-CPFGE--VFNATK-FPSVYAWER-KKISN-CVADYS-VLYNST-FFSTFKCYGVSAT--KLN-DLCF-SNVYADS-F------VVKGDDVRQI-APGQ-TGVIADYNYKLPDDFM-GCVL-AWNTR-NI-DATST-GN-YNYKYRYL--RHGK--LR-PFERDI-SNVPFSPD--GKPC--TPPALNC--YWPLNDYGFYTTTGIGYQP-----Y---RVVVLSFELLNAPATVCGPKLSTD-LI-KNQCVNFNFNGLT-GTGVLTP-SSKR-FQP-FQQFGRDV-SDFTDSVRDPK-TS-EILDISPCSFGGVSVITPGTNASS-EVAVLYQDVNCTDVS-TAIH-ADQLTPAWRI-YSTG-NNVFQTQAGCLIGAEHVDT--SYECDIPIGAGICASYHTV-S-LLRSTS-QKSIVAYTMSLGAD--SSIAYS-NNT-IAIPTNFSISI-TTEVMPVSMAKTSVDCN-MYICGDST-ECA-NLLLQYGSFCTQLNRALS-GIAA-EQDR-NTR-EVFAQVKQMYKTPTL--KY-FGGFNFSQILPDP

Copy the above output into a text editor to get rid of the strange text wrapping (it might look OK depending on your notebook environment). Alternatively, we can convert the output into an MSA object and format it as a Clustal alignment. Splitting `str(alignments[0])` into lines is fragile, because the printed form carries the `target`/`query` labels, coordinates, and a match line. Instead, we index the pairwise alignment, which returns its gapped rows as strings (this assumes a BioPython version in which integer indexing of pairwise alignments is implemented, 1.80 or newer to our knowledge).

In [35]:
from Bio.Seq import Seq
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord
best = alignments[0]
# best[0] is the gapped target (SARS) row, best[1] the gapped query (bat) row
a = SeqRecord(Seq(best[0]), id="sars")
b = SeqRecord(Seq(best[1]), id="bat")
msa = MultipleSeqAlignment([a, b])
print(format(msa, "clustal"))

Now let's use the BLOSUM62 matrix.

In [36]:
aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")

In [37]:
aligner.open_gap_score = -11
aligner.extend_gap_score = -1

In [38]:
alignments = aligner.align(s_sars.seq, s_bat.seq)

In [39]:
len(alignments)

9

### ---- Begin Exercise ----

- Iterate over the alignments and print them out together with their percentage identity (you can use the `substitutions` property, which is a 2D NumPy array).
- Compare the results with what you get from UniProt's BLAST similarity search for the `SARS` protein and with what you get from using [EMBOSS Needle](https://www.ebi.ac.uk/jdispatcher/psa/emboss_needle). Are they any different?

### ---- End Exercise ----

## Obtaining MSA

<span style="color:red"> !!!!!!!! The following only works in BioPython up until version 1.76. !!!!!!!!</span>

As there is no single agreed-upon standard for how to align multiple sequences, no MSA algorithm is implemented directly in BioPython. Instead, BioPython supports running external tools (which need to be installed on the target system) and wrapping their outputs into an MSA object, which can then be further processed in BioPython. The wrappers are defined in the [Bio.Align.Applications](https://biopython.org/docs/latest/api/Bio.Align.Applications.html) module.

In [40]:
import Bio.Align.Applications
dir(Bio.Align.Applications)


Due to the on going maintenance burden of keeping command line application
wrappers up to date, we have decided to deprecate and eventually remove these
modules.

We instead now recommend building your command line and invoking it directly
with the subprocess module.


['ClustalOmegaCommandline',
 'ClustalwCommandline',
 'DialignCommandline',
 'MSAProbsCommandline',
 'MafftCommandline',
 'MuscleCommandline',
 'PrankCommandline',
 'ProbconsCommandline',
 'TCoffeeCommandline',
 '_ClustalOmega',
 '_Clustalw',
 '_Dialign',
 '_MSAProbs',
 '_Mafft',
 '_Muscle',
 '_Prank',
 '_Probcons',
 '_TCoffee',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__']

In [None]:
from Bio.Align.Applications import ClustalwCommandline
clustalw_cline = ClustalwCommandline(r"c:\Program Files (x86)\ClustalW2\clustalw2.exe", infile="data/PF01600_full_length_sequences.fasta")

In [None]:
stdout, stderr = clustalw_cline()

In [None]:
print(stdout)
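
The deprecation notice printed above recommends building the command line yourself and invoking it with the `subprocess` module. A minimal sketch of that approach follows. It assumes a ClustalW executable named `clustalw2` on the `PATH` and the same `-infile=` option that the wrapper generates; the function names `build_clustalw_command` and `run_clustalw` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_clustalw_command(infile, executable="clustalw2"):
    """Return the argument list for a ClustalW run (hypothetical helper)."""
    return [executable, f"-infile={infile}"]


def run_clustalw(infile, executable="clustalw2"):
    """Run ClustalW if both the executable and the input file exist.

    Returns the CompletedProcess, or None when nothing could be run.
    """
    if shutil.which(executable) is None or not Path(infile).is_file():
        return None
    command = build_clustalw_command(infile, executable)
    # Passing a list (not a shell string) avoids quoting problems with paths.
    return subprocess.run(command, capture_output=True, text=True, check=False)


result = run_clustalw("data/PF01600_full_length_sequences.fasta")
if result is None:
    print("ClustalW or the input file was not found; nothing was run.")
else:
    print(result.stdout)
```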

The alignment is, in the case of ClustalW, actually written into an output file so we can then read it as we would do with any MSA.

In [None]:
from Bio import AlignIO
alignment = AlignIO.read("data/PF01600_full_length_sequences.aln", "clustal")
print(alignment)

The guide tree on which the MSA is based is also available and can be visualized.

In [None]:
from Bio import Phylo
tree = Phylo.read("data/PF01600_full_length_sequences.dnd", "newick")
Phylo.draw_ascii(tree)
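
If ClustalW is not installed, the `Phylo` calls can still be tried on a tree given as a string. The Newick string below is a made-up three-leaf tree used purely for illustration.

```python
from io import StringIO

from Bio import Phylo

# Toy tree in Newick format: A and B are sisters, C is the outgroup.
newick = "((A:0.1,B:0.2):0.3,C:0.4);"
toy_tree = Phylo.read(StringIO(newick), "newick")

leaf_names = [leaf.name for leaf in toy_tree.get_terminals()]
print(leaf_names)

# Text rendering of the tree, as in the cell above.
Phylo.draw_ascii(toy_tree)
```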