## Multiple sequence alignment

The input files are two .fasta files, `human.fa` and `mouse.fa`, containing ~10k sequences each, representing the variable region of the heavy chain of human and mouse antibodies respectively.

The variable (V) region of the antibody (in both heavy and light chains) is responsible for binding the antigen, and it is subdivided into four _framework_ regions (FRs) separated by three _complementarity determining regions_ (CDRs), also called hypervariable regions. The CDRs directly contact a portion of the antigen surface, so they differ significantly from one antibody to another, since they determine the affinity for one specific epitope. The FRs, on the other hand, serve as a scaffold that holds the CDRs in position, so they are expected to be more conserved across the antibodies of a given species.

On this basis, a multiple sequence alignment (MSA) of V region sequences can be used to locate the FR segments and CDRs in the given sequences by looking at the variability of amino acid residues across sequences.
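
One common way to quantify per-position variability is the Shannon entropy of each alignment column: conserved (FR-like) columns score near zero, while variable (CDR-like) columns score higher. The sketch below is only an illustration of that idea, not the method used in this notebook; the function names and the toy sequences are made up for the example.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of one alignment column, ignoring gap characters."""
    residues = [r for r in column if r != gap]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(aligned_seqs):
    """Per-column entropy for a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*rows) transposes the alignment, yielding one tuple per column
    return [column_entropy(col) for col in zip(*aligned_seqs)]


# Toy alignment (invented): four conserved columns, two variable ones, one conserved
toy_msa = [
    "EVQLAAW",
    "EVQLCDW",
    "EVQLEFW",
    "EVQLGHW",
]
profile = alignment_entropy(toy_msa)
print([round(h, 2) for h in profile])
```

On a real alignment, runs of low-entropy columns would be candidate FR segments and runs of high-entropy columns candidate CDRs; any cut-off separating the two would have to be chosen empirically.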

MSA will be performed using [Clustal Omega](http://www.clustal.org/omega/) (chosen over ClustalW for its speed) and [MUSCLE](https://www.drive5.com/muscle/). Clustal Omega can be installed locally using conda before running the command created with the Biopython wrapper (see code). MUSCLE, by contrast, does not need any installation: it is sufficient to download the executable file.

In [31]:
#!conda install -y -c bioconda clustalo
from Bio.Align.Applications import MuscleCommandline, ClustalOmegaCommandline, ClustalwCommandline

### MSA with Clustal Omega

In [25]:
# human sequences

in_file = "data/human.fa"
out_file = 'data/alignments/human_msa_clustalO.fasta'

# generate the command line
clustalOmega_cline = ClustalOmegaCommandline(infile=in_file, outfile = out_file)
print(clustalOmega_cline) 

# perform the MSA, which will be written to the out_file path
clustalOmega_cline()
print('Alignment finished')

clustalo -i data/human.fa -o data/human_msa_clustalO.fasta
Alignment finished


In [3]:
# mouse sequences

in_file = "data/mouse.fa"
out_file = 'data/alignments/mouse_msa_clustalO.fasta'

clustalOmega_cline = ClustalOmegaCommandline(infile=in_file, outfile = out_file)
print(clustalOmega_cline) 

clustalOmega_cline()
print('Alignment finished')

clustalo -i data/mouse.fa -o data/mouse_msa_clustalo.fasta
Alignment finished
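
Recent Biopython releases are understood to deprecate the `Bio.Align.Applications` wrappers in favour of calling the tools through `subprocess`. As a hedged alternative to the cells above, the sketch below builds the same `clustalo -i ... -o ...` command as an argument list; the helper names `build_clustalo_cmd` and `run_alignment` are hypothetical, and only the `-i` and `-o` flags shown in the printed command are used.

```python
import shutil
import subprocess


def build_clustalo_cmd(in_file, out_file, executable="clustalo"):
    """Return the argument list equivalent to the command printed by the wrapper."""
    return [executable, "-i", in_file, "-o", out_file]


def run_alignment(cmd):
    """Run an aligner command; raise if the program is missing or exits with an error."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


cmd = build_clustalo_cmd("data/human.fa", "data/alignments/human_msa_clustalO.fasta")
print(" ".join(cmd))
# run_alignment(cmd)  # uncomment to actually perform the alignment
```

With `check=True`, a non-zero exit status raises `subprocess.CalledProcessError`, so a failed alignment cannot be mistaken for a finished one.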


### MSA with MUSCLE

MUSCLE can perform MSA using the default PPP algorithm (`-align` command) or the Super5 algorithm (`-super5` command), which aligns large datasets more efficiently. Since there seems to be no way to select the Super5 algorithm through the Biopython wrapper, the command-line version was used directly.
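
The choice between the two commands can be sketched as a small helper that picks `-align` or `-super5` from the number of input sequences. This is an illustration only: `build_muscle_cmd` is a hypothetical name, and the `super5_threshold` of 1000 sequences is an assumed cut-off, not a MUSCLE default.

```python
import os


def build_muscle_cmd(muscle_path, in_file, out_file, n_seqs, super5_threshold=1000):
    """Return a MUSCLE v5 argument list, choosing the algorithm by input size.

    super5_threshold is an assumed cut-off for illustration, not a MUSCLE setting.
    """
    mode = "-super5" if n_seqs > super5_threshold else "-align"
    # expanduser resolves a leading "~", which is not expanded when no shell is involved
    return [os.path.expanduser(muscle_path), mode, in_file, "-output", out_file]


large = build_muscle_cmd("~/Downloads/muscle5.1.linux_intel64",
                         "data/human.fa", "data/alignments/human_msa_muscle.fasta",
                         n_seqs=9997)
small = build_muscle_cmd("~/Downloads/muscle5.1.linux_intel64",
                         "data/subset.fa", "data/alignments/subset_msa_muscle.fasta",
                         n_seqs=200)  # "subset" paths are placeholders
print(large[1], small[1])
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; because no shell is started in that case, the explicit `os.path.expanduser` call is what makes the `~` in the executable path work.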

In [32]:
import os
muscle_path = '~/Downloads/muscle5.1.linux_intel64'

In [33]:
# human sequences

in_file = "data/human.fa"
out_file = "data/alignments/human_msa_muscle.fasta"

os.system(f'{muscle_path} -super5 {in_file} -output {out_file}')


muscle 5.1.linux64 [12f0e2]  16.3Gb RAM, 8 cores
Built Jan 13 2022 23:17:13
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 9997 seqs, length avg 122 max 139

00:00 8.2Mb   100.0% Derep 9996 uniques, 0 dupes
00:00 9.1Mb  CPU has 8 cores, running 8 threads                      
00:51 12Mb    100.0% UCLUST 9997 seqs EE<0.01, 746 centroids, 9250 members
00:54 73Mb    100.0% UCLUST 746 seqs EE<0.30, 1 centroids, 744 members    
00:54 73Mb    100.0% Make cluster MFAs                                
1 clusters pass 1                     
00:56 342Mb   100.0% UCLUST 746 seqs EE<0.10, 5 centroids, 740 members
00:56 342Mb   100.0% Make cluster MFAs                                
5 clusters pass 2                     
00:56 342Mb  
00:56 342Mb  Align cluster 1 / 5 (255 seqs)
00:56 342Mb  
01:10 565Mb   100.0% Calc posteriors 
01:18 583Mb   100.0% Consistency (1/2) 
01:25 583Mb   100.0% Consistency (2/2) 
01:25 584Mb   100.0% UPGMA5           
01:27 592Mb   100.0% Refining
0

In [34]:
# mouse sequences 

in_file = "data/mouse.fa"
out_file = "data/alignments/mouse_msa_muscle.fasta"

os.system(f'{muscle_path} -super5 {in_file} -output {out_file}')


muscle 5.1.linux64 [12f0e2]  16.3Gb RAM, 8 cores
Built Jan 13 2022 23:17:13
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 9997 seqs, length avg 119 max 130

00:00 7.9Mb   100.0% Derep 9996 uniques, 0 dupes
00:00 9.0Mb  CPU has 8 cores, running 8 threads                      
01:06 12Mb    100.0% UCLUST 9997 seqs EE<0.01, 1009 centroids, 8987 members
01:09 72Mb    100.0% UCLUST 1010 seqs EE<0.30, 1 centroids, 1008 members   
01:09 73Mb    100.0% Make cluster MFAs                                  
1 clusters pass 1                     
01:12 207Mb   100.0% UCLUST 1010 seqs EE<0.10, 3 centroids, 1006 members
01:12 207Mb   100.0% Make cluster MFAs                                  
4 clusters pass 2                     
01:12 207Mb  
01:12 207Mb  Align cluster 1 / 4 (500 seqs)
01:12 207Mb  
02:03 1.0Gb   100.0% Calc posteriors 
02:55 1.5Gb   100.0% Consistency (1/2) 
03:46 1.5Gb   100.0% Consistency (2/2) 
03:46 1.5Gb   100.0% UPGMA5           
03:57 1.5Gb   100.0% Re

0