# Use biopython to make valid CLUSTAL formatted MSAs, check sequence of manually edited alignment, and add consensus line

This is meant to represent a typical workflow where a combination of these steps would be used.

------

##  Using biopython to make valid CLUSTAL formatted MSAs

Biopython does not seem to require strict adherence to line lengths for the sequence blocks. For example, the first block can have 83 residues for each sequence while the other blocks have the typical sixty residues. Because it can read the uneven CLUSTAL-style alignments that arise when a multiple sequence alignment (MSA) is edited by hand, biopython is well suited to converting a hand-edited alignment to the more standardized CLUSTAL format. If you have sequence alignments with uneven blocks, it is probably best to perform this step as early as possible so that downstream steps work with the 'standardized' CLUSTAL format. Not all computational tools are as lenient as biopython about the format. For instance, I have written scripts that rely on the first set of sequence blocks to establish the number of columns, so sequence blocks of uneven width would cause errors.

Here a hand-edited multiple sequence alignment will be converted to a more standard form.
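As a minimal sketch of the idea (not the exact code used later in this workflow), the conversion amounts to reading the alignment with `Bio.AlignIO` and writing it straight back out in the `"clustal"` format. The function name `standardize_clustal` is hypothetical, and the sketch assumes biopython is installed and that the sequence column starts at the same position in every block.

```python
from io import StringIO


def standardize_clustal(text):
    """Parse CLUSTAL-style text and re-emit it in biopython's CLUSTAL layout.

    Returns the parsed alignment object and the rewritten text.
    `standardize_clustal` is a hypothetical helper name for illustration.
    """
    # Imported here so the third-party dependency (biopython) is explicit.
    from Bio import AlignIO

    # Reading tolerates blocks of differing widths in hand-edited input.
    alignment = AlignIO.read(StringIO(text), "clustal")

    # Writing produces blocks of uniform width.
    out = StringIO()
    AlignIO.write(alignment, out, "clustal")
    return alignment, out.getvalue()
```

For files on disk, `AlignIO.convert(in_path, "clustal", out_path, "clustal")` should do the same read-then-write in one call.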

## Check sequence of manually edited alignment

When manually editing a multiple sequence alignment file, it is easy to delete sequence by mistake. This section demonstrates using `check_seq_in_MSAclustal_consistent_with_FASTA.py` to make sure the sequence in the edited file is still valid. The script checks against a user-provided FASTA file, which should ideally come directly from an 'official' source.
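The underlying check can be illustrated with a short sketch. This is not the code of `check_seq_in_MSAclustal_consistent_with_FASTA.py`, only an assumption about the general approach: strip the gap characters from an aligned sequence and compare what remains with the reference sequence from the FASTA. All function names below are hypothetical.

```python
def read_single_fasta(text):
    """Return the sequence of the first record in FASTA-formatted text."""
    seq_parts = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # only the first record is used
            seen_header = True
            continue
        if seen_header:
            seq_parts.append(line)
    return "".join(seq_parts).upper()


def ungap(aligned_seq, gap_chars="-."):
    """Remove gap characters from one row of an alignment."""
    return "".join(ch for ch in aligned_seq if ch not in gap_chars).upper()


def first_difference(aligned_seq, reference):
    """Return None if the ungapped row equals the reference.

    Otherwise return the 0-based position (in ungapped coordinates)
    where the two sequences first differ.
    """
    candidate = ungap(aligned_seq)
    for i, (a, b) in enumerate(zip(candidate, reference)):
        if a != b:
            return i
    if len(candidate) != len(reference):
        # One sequence is a prefix of the other, e.g. a truncated end.
        return min(len(candidate), len(reference))
    return None
```

Reporting the position of the first difference, rather than only a pass/fail result, makes it easier to find where residues were accidentally removed during editing.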

## Add a consensus symbol line to an MSA

Multiple sequence alignments from various sources don't come with the consensus symbols line typically provided by [EMBL-EBI's MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/). These symbols are described [here](https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ#BioinformaticsToolsFAQ-WhatdoconsensussymbolsrepresentinaMultipleSequenceAlignment?). Sometimes the symbols get lost, or need such substantial updating after manual editing that it is easier to remove them and add them again from scratch. Here `calculate_cons_for_clustal_protein.py` is used to add a consensus line to a multiple sequence alignment. A separate script for nucleic acids, `calculate_cons_for_clustal_nucleic.py`, is described [here](https://github.com/fomightez/sequencework/tree/master/alignment-utilities).
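To make the meaning of the symbols concrete, here is a minimal sketch of how a protein consensus line could be computed. It is not the code of `calculate_cons_for_clustal_protein.py`. The residue groups are the 'strong' and 'weak' conservation groups as I understand them from Clustal and should be checked against the FAQ linked above. The function names are hypothetical, and treating any column containing a gap as unconserved is an assumption of this sketch.

```python
# Residue groups assumed from Clustal's conventions; verify before relying on them.
STRONG_GROUPS = ("STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW")
WEAK_GROUPS = ("CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY")


def column_symbol(column):
    """Return the consensus symbol for one alignment column."""
    residues = {ch.upper() for ch in column}
    if "-" in residues:
        return " "  # assumption: a gap anywhere means no symbol
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weak group
    return " "


def consensus_line(aligned_seqs):
    """Build the consensus symbols string for equal-length aligned rows."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return "".join(column_symbol(col) for col in zip(*aligned_seqs))
```

For the three rows `MKSLA-`, `MKTIGW`, and `MKAVSW`, this gives `*` for the two identical columns, `:` for the S/T/A and L/I/V columns, `.` for the A/G/S column, and a space for the column containing a gap.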

Possible subsequent use for the consensus symbols line:  

Beyond visually displaying relatedness in a multiple sequence alignment, these symbols can be used to categorize residues in order to make highlighting commands for molecular visualization. See [here for an example](https://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/cl_demo-binder%20Categorize%20conservation%20in%20a%20MSA%20and%20use%20that%20to%20generate%20molvis%20commands.ipynb) that uses the `categorize_residues_based_on_conservation_relative_consensus_line.py` script described [here](https://github.com/fomightez/sequencework/tree/master/alignment-utilities). The notebook can be launched in active form from [here](https://github.com/fomightez/cl_demo-binder) by selecting the 'Categorize conservation in a MSA and use that to generate molvis commands' page from the index. That demo was placed in the structure work demo series because it was mainly developed as a step towards making commands for molecular visualization.