# Use biopython to make valid CLUSTAL formatted MSAs, check sequence of manually edited alignment, and add consensus line

This is meant to represent a typical workflow where a combination of these steps would be used.

------

##  Using biopython to make valid CLUSTAL formatted MSAs

Biopython does not seem to require strict adherence to line lengths for the sequence blocks. For example, the first line can have 83 residues for each sequence while the other lines have the typical fifty or sixty residues. (CLUSTAL files produced by Biopython seem to have fifty residues per line while those produced by EMBL-EBI have sixty.) Because it can read the uneven CLUSTAL-style alignments that arise when one manually edits a multiple sequence alignment (MSA), Biopython is well suited to converting a hand-edited alignment to the more standardized CLUSTAL format. ('Semi-official' specifications of the format can be found [here](http://meme-suite.org/doc/clustalw-format.html), [here](http://scikit-bio.org/docs/0.4.2/generated/skbio.io.format.clustal.html), and [here](https://www.ebi.ac.uk/Tools/msa/clustalw2/help/faq.html#18). A good example can be viewed [here](http://wwwabi.snv.jussieu.fr/public/Clustal2Dna/clustal.html). I say 'semi-official' because I have come across many deviations; one main deviation is that many files start with 'CLUSTAL' and not 'CLUSTAL W' or 'CLUSTAL X'. A case in point is the output produced by [EMBL-EBI's MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/), an example of which is shown as the output of cell '[2]' below. Another deviation is that the line showing the degree of conservation is often not included.) 

It is probably always best to perform this step ASAP if you have sequence alignments with uneven blocks. This way the 'standardized' CLUSTAL format will be utilized by downstream steps. Not all computational tools will be written to be as lenient as biopython is about the standard format. I personally have written scripts that rely on the first set of sequence blocks to establish the number of columns. Hence, having uneven width for the sequence blocks would cause errors.
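
To see whether a file has the uneven blocks described above before handing it to a stricter tool, the width of each block can be measured. The following is only a minimal sketch in plain Python; the function name `clustal_block_widths` is hypothetical, and it assumes sequence rows look like `IDENTIFIER   RESIDUES` with blocks separated by blank lines or a consensus line.

```python
def clustal_block_widths(text):
    """Return the sequence width of each block in CLUSTAL-style text.

    Assumption: sequence rows start with an identifier in column one, and
    blank lines or consensus lines (leading whitespace) separate blocks.
    """
    widths = []
    current = None  # width of the block being read, or None between blocks
    for line in text.splitlines():
        if line.startswith("CLUSTAL"):
            continue  # skip the header line
        # A blank line or a consensus line ends the current block.
        if not line.strip() or line[0].isspace():
            if current is not None:
                widths.append(current)
                current = None
            continue
        fields = line.split()
        if len(fields) >= 2:
            current = len(fields[1])
    if current is not None:
        widths.append(current)
    return widths


example = """CLUSTAL multiple sequence alignment by MUSCLE (3.8)

STV1            -MNQEEAIF
VPH1            MAEKEEAIF
                  ::*****

STV1            RSADM
VPH1            RSAEM
"""
widths = clustal_block_widths(example)
print(widths)
```

In a standardized file, every block except possibly the last should report the same width, so more than one distinct value among `widths[:-1]` signals a hand-edited, uneven file.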


Here a hand-edited multiple sequence alignment will be converted to more standard form.  
(The process illustrated here is very reminiscent of the section entitled 'File Format Conversion' at the wiki page for ['The module for multiple sequence alignments, AlignIO'](https://biopython.org/wiki/AlignIO).)

In [1]:
# Get an alignment file
!curl -o alignment.clw https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/a3ec6fb9d5c3f558a4b8666ce1cbc20c356fe4de/unevenStv1p_Vph1p_muscle_alignment.clw

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3632  100  3632    0     0  18251      0 --:--:-- --:--:-- --:--:-- 18343


In [2]:
# Display original alignment to confirm starting state. NOTE UNEVEN LENGTH OF SEQUENCES ON LINES.
!head -12 alignment.clw

CLUSTAL multiple sequence alignment by MUSCLE (3.8)


STV1            -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLRRFDEVER
VPH1            MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIRRLDNVER
                  ::********:*: **:*** *: *: :: **::.:. . ***..: **** :**::**:*:***

STV1            MVGFLNEVVEKHAAETW-----KYILHIDDEGNDIAQPDMADLINTMEPLSLE
VPH1            QYRYFYSLLKKHDIKLYEGDTDKYL----DGSGELYVPPSGSVI---------
                   :: .:::**  : :     **:    * ..::  *  ..:*         

STV1            NVNDMVKEITDCESRARQLDESLDSLRSKLNDLLEQRQVIFECSKFIEVNPGIAGRATNP


In [3]:
# Read in original alignment
from Bio import SeqIO
from Bio import AlignIO
orig_alignment = AlignIO.read('alignment.clw', "clustal")

In [4]:
# Save original alignment as a new file
AlignIO.write(orig_alignment, 'standardized_alignment.clw', "clustal");  # based on https://biopython.org/wiki/AlignIO

In [5]:
#check produced file
!head -12 standardized_alignment.clw

CLUSTAL X (3.8) multiple sequence alignment


STV1                                -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTA
VPH1                                MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRA
                                      ::********:*: **:*** *: *: :: **::.:. . ***..: *

STV1                                FQRGYVNQLRRFDEVERMVGFLNEVVEKHAAETW-----KYILHIDDEGN
VPH1                                FQRTFVNEIRRLDNVERQYRYFYSLLKKHDIKLYEGDTDKYL----DGSG
                                    *** :**::**:*:***   :: .:::**  : :     **:    * ..

STV1                                DIAQPDMADLINTMEPLSLENVNDMVKEITDCESRARQLDESLDSLRSKL


-----

(Aside:   
When I was comparing the results directly from [EMBL-EBI's MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/) to what my script for adding the consensus symbols produces, I noticed some discrepancies.

The third line below shows that the consensus symbols in the MUSCLE alignment are defined differently than in other places I have seen:

    STV1            NVNDMVKEITDCESRARQLDESLDSLRSKLNDLLEQRQVIFECSKFIEVNPGIAGRATNP
    VPH1            --DDYVRNASYLEERLIQMEDATDQIEVQKNDLEQYRFILQSGDEFF-----LKGDNTDS
                      :* *.: :  *.*  *:::: *.:  : *** : * :: . .:*:     : *  *:.
                          ^                   ^            

I put a caret (`^`) under the two symbols that
don't match what I have seen elsewhere for conserved residues.

- Why is the K and R substitution not marked as strongly similar? It should be according to [here](https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ#BioinformaticsToolsFAQ-WhatdoconsensussymbolsrepresentinaMultipleSequenceAlignment?) and [here](https://www.genome.jp/tools/clustalw/clustalw_readme.html) and [here](https://en.wikipedia.org/wiki/Clustal).
- Why is the E and R substitution not marked as weakly similar? It should be according to [here](https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ#BioinformaticsToolsFAQ-WhatdoconsensussymbolsrepresentinaMultipleSequenceAlignment?) and [here](https://www.genome.jp/tools/clustalw/clustalw_readme.html) and [here](https://en.wikipedia.org/wiki/Clustal).

My script `calculate_cons_for_clustal_protein.py`, demonstrated in the last section of this notebook, annotates these correctly. I only noticed the discrepancies while testing `calculate_cons_for_clustal_protein.py`, when its output did not perfectly match the symbols that MUSCLE adds. )
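
The rule behind the `*`, `:`, and `.` symbols can be sketched in a few lines. This is not the code of `calculate_cons_for_clustal_protein.py`; it is a minimal illustration, and the function name `consensus_symbol` is hypothetical. The group lists are the 'strong' and 'weak' groups as given in the ClustalW documentation linked above; check them against those pages before relying on them.

```python
# Substitution groups as listed in the ClustalW documentation (assumed
# here to be complete; verify against the linked readme).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF",
                 "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
               "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def consensus_symbol(column, gap="-"):
    """Return the consensus symbol for one alignment column.

    `column` is a string holding the residue from each sequence.
    """
    residues = set(column.upper())
    if gap in residues:
        return " "  # any gap means no conservation symbol
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one 'strong' group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one 'weak' group
    return " "


for pair in ("KR", "ER", "AA", "A-"):
    print(pair, repr(consensus_symbol(pair)))
```

Under these groups, K with R falls in the strong group `QHRK` and E with R falls only in the weak group `NEQHRK`, which is the behavior the two bullet points above expect.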

----

## Check sequence of manually edited alignment

In the process of manually editing a multiple sequence alignment file, it is easy to erroneously delete sequence. This section demonstrates using `check_seq_in_MSAclustal_consistent_with_FASTA.py` to make sure the sequence in the edited file is still valid. It checks against a user-provided FASTA file. It is suggested this FASTA file come directly from an 'official' source.

First, we'll get the script by running the next cell:

In [6]:
# Get a file if not yet retrieved / check if file exists
import os
file_needed = "check_seq_in_MSAclustal_consistent_with_FASTA.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/{file_needed}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11396  100 11396    0     0  65120      0 --:--:-- --:--:-- --:--:-- 65120


The FASTA files for these proteins came from the [Saccharomyces Genome Database (SGD)](https://www.yeastgenome.org/) page for each respective encoding gene. On the 'Protein' tab (example: [STV1 protein tab](https://www.yeastgenome.org/locus/S000004658/protein)), under 'Sequence' about halfway down the page, I clicked the 'Download Sequence (.fsa)' button. The **files from following that process have already been placed in the `data` directory in this session**, assuming you launched a session using `launch binder` from [here](https://github.com/fomightez/cl_sq_demo-binder). Let's copy those files over to our working directory to make things simpler when indicating the files to use below.

In [7]:
!cp ../data/S288C_* .

Since the sequence in this case is from yeast, the FASTA file can also be obtained using my script `get_protein_seq_as_FASTA.py`, which will work in a binder session launched from where this active notebook was also launched. See about `get_protein_seq_as_FASTA.py` [here](https://github.com/fomightez/yeastmine).

With the preparation steps complete, we are now ready to check that the sequence represented in the alignment is consistent with the FASTA file from an 'official' source. First for Stv1p. Note that in the next command, after calling the script, you supply the alignment file name, followed by the identifier of the sequence in the alignment you want to compare, followed by the name (or path) of the FASTA file containing the sequence.

In [8]:
%run check_seq_in_MSAclustal_consistent_with_FASTA.py alignment.clw STV1 S288C_YMR054W_STV1_protein.fsa


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 100 (give or take a few).**
Alignment file read...STV1 sequence collected from alignment...FASTA file read...Checking...
Are the sequences the same?  ...  ...


True


Note the following would have also worked if we hadn't copied the FASTA file to the current working directory:

```python
%run check_seq_in_MSAclustal_consistent_with_FASTA.py alignment.clw STV1 ../data/S288C_YMR054W_STV1_protein.fsa
```

Importantly, the `True` returned says the sequence of Stv1p in the alignment matches the official FASTA sequence. This helps us be confident nothing got changed along the way while hand editing.

If you wanted to check for Vph1p you'd run the following:

```python
%run check_seq_in_MSAclustal_consistent_with_FASTA.py alignment.clw VPH1 S288C_YOR270C_VPH1_protein.fsa
```

You can choose to do that or not.

Yay! One of the sequences in the manually-edited multiple sequence alignment has been confirmed against the official record, verifying no errors were introduced during editing.  
It is best to repeat the process with any others as well if this were really part of a pipeline you were running.
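
The idea behind the check can be sketched in plain Python: collect every row for one identifier, drop the gap characters, and compare the result to the FASTA sequence. This is a minimal sketch and not the code of `check_seq_in_MSAclustal_consistent_with_FASTA.py`; the function names and the toy data are hypothetical, and stripping a trailing `*` (a stop symbol some sources append to protein FASTA) is an assumption.

```python
def sequence_from_clustal(text, identifier, gap="-"):
    """Join all rows for `identifier` and remove the gap characters."""
    pieces = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == identifier:
            pieces.append(fields[1])
    return "".join(pieces).replace(gap, "")


def sequence_from_fasta(text):
    """Return the sequence of the first record in FASTA-formatted text."""
    seq_lines = []
    seen_header = False
    for line in text.splitlines():
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts; stop reading
            seen_header = True
        elif seen_header:
            seq_lines.append(line.strip())
    # Assumption: a trailing '*' marks the stop and is not part of the protein.
    return "".join(seq_lines).rstrip("*")


toy_alignment = """CLUSTAL multiple sequence alignment

SEQ1            -MNQ
SEQ2            MAEK

SEQ1            EE
SEQ2            E-
"""
toy_fasta = ">SEQ1 toy record\nMNQ\nEE*\n"

matches = sequence_from_clustal(toy_alignment, "SEQ1") == sequence_from_fasta(toy_fasta)
print(matches)
```

Because gaps are removed before comparing, moving gaps around during hand editing does not affect the check, while an accidentally deleted or altered residue does.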

## Optional: Adjust width of alignment

Biopython produces alignments with 50 sequence characters per line even if you started with ones like those that come from [EMBL-EBI's MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/) or similar tools with 60 sequence characters per line. Clustal-formatted data with fifty sequence characters per line can be adjusted back to 60 (or some other width) using [MView](https://www.ebi.ac.uk/Tools/msa/mview/) and then parsing the output to actual CLUSTAL format. (Admittedly, it is a little kludgy, but it works, and relying on [MView](https://www.ebi.ac.uk/Tools/msa/mview/) for most of the heavy lifting means less software to maintain.)

This section illustrates doing that. 

#### Step 1: Use MView to get alignment with width desired.

First, the alignment with fifty sequence characters per line needs to be submitted to [MView](https://www.ebi.ac.uk/Tools/msa/mview/) and the settings adjusted to make sure the right form comes back. The following lines describe doing that. **YOU DO NOT ACTUALLY NEED TO DO THAT FOR THE EXAMPLE; a pre-made version will be retrieved in the next cell to save you from needing to actually do these steps.**
 
In the top box at [MView](https://www.ebi.ac.uk/Tools/msa/mview/), paste in the text of the alignment that was written with Biopython's `AlignIO.write()` function above. In that example, the file containing the alignment text is called `standardized_alignment.clw`. 

Set `INPUT FORMAT` to 'CLUSTAL'.

Under `STEP 3 - Set output parameters` at [MView](https://www.ebi.ac.uk/Tools/msa/mview/), click on the 'More options...' button and adjust the settings to match the image below. Adjust the 'ALIGNMENT WIDTH' in those options to what you'd like. 

![options_settings](../imgs/mview_settings_for_parse.png)

Essentially, choose 'MVIEW' as `OUTPUT FORMAT`, 'ON' for `ALIGNMENT`, and the desired width; most other settings are set to 'OFF', in particular `HTML MARKUP`, `RULER`, and `CONSENSUS`.

Submit the job and let it run.   
We need to bring the result into this Jupyter session. Click the `Download Alignment File` button just above the text output. Currently, clicking that in my browser brings up a page with just the output text, where I need to select all the text, copy it, and paste it into a file here in the Jupyter session.  Call it `mview_out.txt`. 

You can run the next cell to get the `mview_out.txt` file that would result from those steps.

In [9]:
# Get pre-made mview output
!curl -o mview_out.txt https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/5860f3b4c6aaf25d348dd9a188670cb89e68792e/uv_mview_output.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2735  100  2735    0     0  18605      0 --:--:-- --:--:-- --:--:-- 18605


To show we have that:

In [10]:
!head mview_out.txt

Reference sequence (1): STV1 Identities normalised by aligned length.
1 STV1 100.0% 100.0%  -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLR 
2 VPH1  91.5%  47.2%  MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIR 

1 STV1 100.0% 100.0%  RFDEVERMVGFLNEVVEKHAAETW-----KYILHIDDEGNDIAQPDMADLINTMEPLSLE 
2 VPH1  91.5%  47.2%  RLDNVERQYRYFYSLLKKHDIKLYEGDTDKYL----DGSGELYVPPSGSVI--------- 

1 STV1 100.0% 100.0%  NVNDMVKEITDCESRARQLDESLDSLRSKLNDLLEQRQVIFECSKFIEVNPGIAGRATNP 
2 VPH1  91.5%  47.2%  --DDYVRNASYLEERLIQMEDATDQIEVQKNDLEQYRFILQSGDEFF-----LKGDNTDS 



#### Step 2: Use my script to convert that to standardized CLUSTAL format.

From the view above, you'll see that MView got things most of the way there.  Specifically, there is the identifier and the sequence of the specified width in each sequence block. My script mainly parses those out and adjusts a few things to make it standard CLUSTAL format.  
The main source for the 'standardized' specification seems to be [here](http://meme-suite.org/doc/clustalw-format.html); however, see the top of this notebook for more about the specification.

The next cell will retrieve the script.

In [11]:
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/mview_to_CLUSTAL.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13679  100 13679    0     0  84438      0 --:--:-- --:--:-- --:--:-- 84962


Now we will point the script at that `mview_out.txt` file.

In [12]:
%run mview_to_CLUSTAL.py mview_out.txt

MView output read...collected identifiers and sequences...arranging for output...

Alignment converted from MView to CLUSTAL saved as 'mview_out_clustalized.clw'.
Finished.


To see the result:

In [13]:
!head mview_out_clustalized.clw

CLUSTAL multiple sequence alignment by mview_to_CLUSTAL (0.1.0)

STV1            -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLR
VPH1            MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIR

STV1            RFDEVERMVGFLNEVVEKHAAETW-----KYILHIDDEGNDIAQPDMADLINTMEPLSLE
VPH1            RLDNVERQYRYFYSLLKKHDIKLYEGDTDKYL----DGSGELYVPPSGSVI---------

STV1            NVNDMVKEITDCESRARQLDESLDSLRSKLNDLLEQRQVIFECSKFIEVNPGIAGRATNP
VPH1            --DDYVRNASYLEERLIQMEDATDQIEVQKNDLEQYRFILQSGDEFF-----LKGDNTDS


Let's rename that back to something more general for use below.

In [14]:
!mv mview_out_clustalized.clw aligned.clw
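
The parsing that Step 2 performs can be sketched in plain Python. This is a minimal sketch, not the code of `mview_to_CLUSTAL.py`; the function name `mview_rows_to_clustal` and the header text are hypothetical. It assumes each MView sequence row starts with a row number, has the identifier as its second field, and has the sequence as its last field, as in the `mview_out.txt` preview above.

```python
def mview_rows_to_clustal(mview_text, header="CLUSTAL multiple sequence alignment"):
    """Convert MView-style rows into CLUSTAL-style text.

    Assumption: sequence rows look like '1 STV1 100.0% 100.0%  SEQUENCE'.
    """
    blocks, current = [], []
    for line in mview_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit():
            current.append((fields[1], fields[-1]))
        elif current:
            # Any other line (such as a blank one) ends the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    out = [header, ""]
    for block in blocks:
        for identifier, seq in block:
            # Pad the identifier so the sequence starts at column 17.
            out.append(f"{identifier:<15} {seq}")
        out.append("")
    return "\n".join(out) + "\n"


sample = (
    "Reference sequence (1): STV1 Identities normalised by aligned length.\n"
    "1 STV1 100.0% 100.0%  -MNQ \n"
    "2 VPH1  91.5%  47.2%  MAEK \n"
    "\n"
    "1 STV1 100.0% 100.0%  EE \n"
    "2 VPH1  91.5%  47.2%  E- \n"
)
converted = mview_rows_to_clustal(sample)
print(converted)
```

Taking the last whitespace-separated field as the sequence works because gaps are written as `-` rather than spaces, so a sequence row never contains internal whitespace.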

## Add a consensus symbol line to an MSA

Multiple sequence alignments from various sources don't come with the consensus symbols line typically provided by [EMBL-EBI's MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/). These symbols are described [here](https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ#BioinformaticsToolsFAQ-WhatdoconsensussymbolsrepresentinaMultipleSequenceAlignment?). Or sometimes the symbols get lost, or need such substantial updating following manual editing that it is easier to remove them and start over to add them correctly. Here `calculate_cons_for_clustal_protein.py` is used to add a consensus line to a multiple sequence alignment.  I have a separate script for nucleic acids, called `calculate_cons_for_clustal_nucleic.py`; see about it [here](https://github.com/fomightez/sequencework/tree/master/alignment-utilities).

(**Note the width will be 60 and not 50 if the optional step was included.**)

In [15]:
import os
file_needed = "calculate_cons_for_clustal_protein.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/calculate_cons_for_clustal_protein.py


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24598  100 24598    0     0   155k      0 --:--:-- --:--:-- --:--:--  157k


In [16]:
%run calculate_cons_for_clustal_protein.py aligned.clw

Alignment file read...top line identifier determined as 'STV1'...bottom line identifier determined as 'VPH1'...
individual lines for each sequence identifier parsed...determining conservation of aligned sequences...

Alignment with conservation indication symbols added saved as 'aligned_plusCONS.clw'.
Finished.


In [17]:
!cat aligned_plusCONS.clw

CLUSTAL multiple sequence alignment by mview_to_CLUSTAL (0.1.0)

STV1            -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLR
VPH1            MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIR
                  ::********:*: **:*** *: *: :: **::.:. . ***..: **** :**::*

STV1            RFDEVERMVGFLNEVVEKHAAETW-----KYILHIDDEGNDIAQPDMADLINTMEPLSLE
VPH1            RLDNVERQYRYFYSLLKKHDIKLYEGDTDKYL----DGSGELYVPPSGSVI---------
                *:*:***   :: .:::**  : :     **:    * ..::  *  ..:*         

STV1            NVNDMVKEITDCESRARQLDESLDSLRSKLNDLLEQRQVIFECSKFIEVNPGIAGRATNP
VPH1            --DDYVRNASYLEERLIQMEDATDQIEVQKNDLEQYRFILQSGDEFF-----LKGDNTDS
                  :* *:: :  *.*  *:::: *.:. : *** : * :: . .:*:     : *  *:.

STV1            EIEQEERDVDEFRMTPDDISETLSDAFSFDDETPQDRGALGNDLTRNQSVEDLSFLEQGY
VPH1            TSYMDEDMIDA---NGENIAAAIGASVNY-------------------------------
                    :*  :*    . ::*: ::. :..:                        

See [here](https://www.ebi.ac.uk/seqdb/confluence/display/THD/Help+-+Clustal+Omega+FAQ#Help-ClustalOmegaFAQ-Whatdotheconsensussymbolsmeaninthealignment?) for interpreting the symbols now below each aligned sub-section of the multiple sequence alignment.

Possible subsequent use for the consensus symbols line:  

Beyond visually displaying relatedness in a multiple sequence alignment, these symbols can be used for categorizing residues to make commands for highlighting in molecular visualization. See [here for an example](https://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/cl_demo-binder%20Categorize%20conservation%20in%20a%20MSA%20and%20use%20that%20to%20generate%20molvis%20commands.ipynb) that uses the `categorize_residues_based_on_conservation_relative_consensus_line.py` script described [here](https://github.com/fomightez/sequencework/tree/master/alignment-utilities). The notebook can be launched in active form from [here](https://github.com/fomightez/cl_demo-binder) by selecting the 'Categorize conservation in a MSA and use that to generate molvis commands' page from the index. The demo was put in the structure work demo series because it was mainly developed to work towards making commands for molecular visualization.
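
The categorizing step can be sketched as follows. This is a minimal sketch and not the code of `categorize_residues_based_on_conservation_relative_consensus_line.py`; the function name `categorize_by_consensus` is hypothetical. It assumes one aligned sequence and its consensus line have each been joined into a single string of equal length, and it numbers residues by their position in the ungapped sequence.

```python
def categorize_by_consensus(aligned_seq, consensus, gap="-"):
    """Group residue numbers of one sequence by the consensus symbol below them.

    Residue numbers are 1-based positions in the ungapped sequence, which is
    the numbering a molecular viewer would typically expect.
    """
    categories = {"*": [], ":": [], ".": []}
    residue_number = 0
    for residue, symbol in zip(aligned_seq, consensus):
        if residue == gap:
            continue  # gaps do not advance the residue numbering
        residue_number += 1
        if symbol in categories:
            categories[symbol].append(residue_number)
    return categories


# Start of the STV1 row and its consensus line from the alignment above.
stv1_start = "-MNQEEAIF"
cons_start = "  ::*****"
result = categorize_by_consensus(stv1_start, cons_start)
print(result)
```

Each list of residue numbers could then be turned into a selection command for whichever molecular viewer is in use.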

-----
Enjoy!