# MSA

The MSA module of MIToS has utilities for working with Multiple Sequence Alignments (MSAs) of protein sequences.

## Features

- [**Read**](#Reading-MSA-files) and [**write**](#Writing-MSA-files) MSAs in `Stockholm`, `FASTA` or `Raw` format
- Handle MSA annotations
- Edit the MSA, e.g. delete columns or sequences, change the sequence order, shuffle...
- Keep track of positions and annotations after modifications to the MSA
- Describe an MSA, e.g. mean percent identity, sequence coverage, gap percentage...

In [1]:
using MIToS.MSA



## MSA IO

### Reading MSA files

The main function for reading files in MIToS is `read`, which is defined in the `Utils` module. It takes a filename or path plus further arguments, opens the file and passes those arguments on to the `parse` function. `read` decides how to open the file from the prefixes and suffixes of the file name, while `parse` does the actual parsing. You can `read` **gzipped files** if they have the `.gz` extension, and also **files from the web**.  
The second argument of `read` and `parse` is the file `Format`. The MSA formats currently supported are `Stockholm`, `FASTA` and `Raw`.  
For example, the full Stockholm MSA of the Pfam family PF07388 can be read in Julia through the Pfam RESTful interface:

In [8]:
read("http://pfam.xfam.org/family/PF07388/alignment/full", Stockholm)



4x458 MIToS.MSA.AnnotatedMultipleSequenceAlignment:
 -  -  -  -  -  -  -  -  -  -  -  -  -  …  -  -  -  -  -  -  -  -  -  -  -  -
 M  L  K  K  I  K  K  A  L  F  Q  P  K     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  K  K  L  S  G  L  M  Q  D  I  K     D  F  Q  K  Y  R  I  K  Y  L  Q  L
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -
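
The way `read` picks an opening strategy from the file name can be sketched in plain Julia (written against Julia 1.x syntax). This is only an illustration of the idea, not the MIToS implementation: `opening_strategy` and the local file names are hypothetical, and the list of web prefixes is an assumption.

```julia
# Hypothetical helper, not part of MIToS: classify how a file name would be opened.
function opening_strategy(name::AbstractString)
    if startswith(name, "http://") || startswith(name, "https://") || startswith(name, "ftp://")
        return :download   # a web prefix: fetch the file first
    elseif endswith(name, ".gz")
        return :gzip       # a .gz suffix: decompress while reading
    else
        return :plain      # anything else: open as a regular text file
    end
end

opening_strategy("http://pfam.xfam.org/family/PF07388/alignment/full")
opening_strategy("PF07388_full.stockholm.gz")   # hypothetical local file
opening_strategy("PF07388_full.stockholm")      # hypothetical local file
```

Keeping this decision separate from the parsing is what lets the same `parse` code serve local, compressed and remote files.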

The third (and optional) argument of `read` and `parse` is the output MSA type:  
  
<p>
    <dl>
    
        <dt><code>Matrix{Residue}</code></dt>
        <dd>It is the default output format for a <code>Raw</code> file.</dd>
        
        <dt><code>MultipleSequenceAlignment</code></dt>
        <dd>It contains the sequence names/identifiers.</dd>
        
        <dt><code>AnnotatedMultipleSequenceAlignment</code></dt>
        <dd>The richest MSA format of MIToS and the default for <code>FASTA</code> and <code>Stockholm</code> files. It includes sequence names and MSA annotations.</dd>
       
    </dl>
</p>


In [9]:
read("http://pfam.xfam.org/family/PF07388/alignment/full", Stockholm, Matrix{Residue})


4x458 Array{MIToS.MSA.Residue,2}:
 -  -  -  -  -  -  -  -  -  -  -  -  -  …  -  -  -  -  -  -  -  -  -  -  -  -
 M  L  K  K  I  K  K  A  L  F  Q  P  K     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  K  K  L  S  G  L  M  Q  D  I  K     D  F  Q  K  Y  R  I  K  Y  L  Q  L
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -



Given that `read` calls `parse`, you should look at the documentation of `parse` to see the available keyword arguments. The optional keyword arguments used in MSA IO are:

<p>
<dl class="dl-horizontal">

<dt><code>generatemapping</code></dt>
<dd>
If <code>generatemapping</code> is <code>true</code> (it defaults to <code>false</code>), sequence and column mappings are generated and saved in the MSA annotations. <span class="text-warning">The default is <code>false</code> so that mappings are not overwritten by mistake when you read an annotated MSA file saved with MIToS.</span>
</dd>

<dt><code>useidcoordinates</code></dt>
<dd>
If <code>useidcoordinates</code> is <code>true</code> (it defaults to <code>false</code>), MIToS uses the coordinates in sequence names of the form <i>seqname/start-end</i> to generate sequence mappings. This is safe and useful with freshly downloaded Pfam MSAs. <span class="text-warning">Please be careful if you are reading an MSA saved with MIToS. MIToS deletes unaligned insert columns, so the sequence numbering would be disrupted if the original MSA had insert columns.</span>
</dd>

<dt><code>checkalphabet</code></dt>
<dd>
The <code>parse</code> function converts each character in the sequence strings to a MIToS <code>Residue</code>. Lowercase characters, dots and degenerate or non-standard residues are converted to gaps. If <code>checkalphabet</code> is <code>true</code> (<code>false</code> by default), <code>read</code> deletes all the sequences with non-standard residues. The 20 natural residues are A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y and V.
</dd>

<dt><code>deletefullgaps</code></dt>
<dd>
Given that lowercase characters and dots are converted to gaps, unaligned insert columns in an MSA derived from an HMM profile are converted into full gap columns. <code>deletefullgaps</code> is <code>true</code> by default, which deletes full gap columns and therefore insert columns.
</dd>

</dl>
</p>
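
The effect of the conversion to gaps, `checkalphabet` and `deletefullgaps` described above can be sketched in plain Julia (Julia 1.x syntax). This is a toy model and not the MIToS code: residues are plain `Char`s rather than `Residue`s, every name defined here is hypothetical, and treating only uppercase letters as candidates for non-standard residues is an assumption of the sketch.

```julia
# The 20 natural residues listed above; anything else becomes a gap in this sketch.
const STANDARD = Set("ARNDCQEGHILKMFPSTWYV")

# Map one character to a residue character, using '-' for gaps.
to_residue(c::Char) = c in STANDARD ? c : '-'

# True if a raw sequence has an uppercase letter outside the 20 standard residues
# (the kind of sequence that `checkalphabet=true` is described as deleting).
has_nonstandard(seq::AbstractString) = any(c -> isuppercase(c) && !(c in STANDARD), seq)

# Convert aligned strings (all of the same length) to a character matrix.
function to_matrix(seqs::Vector{String})
    nseq, ncol = length(seqs), length(first(seqs))
    msa = Matrix{Char}(undef, nseq, ncol)
    for (i, s) in enumerate(seqs), (j, c) in enumerate(s)
        msa[i, j] = to_residue(c)
    end
    return msa
end

# Drop columns made only of gaps, as `deletefullgaps=true` is described as doing.
function delete_full_gaps(msa::Matrix{Char})
    keep = [any(msa[:, j] .!= '-') for j in 1:size(msa, 2)]
    return msa[:, keep]
end

raw = ["MKa.LV", "M-.gLX", "MR..LV"]      # toy alignment with two insert columns
kept = filter(!has_nonstandard, raw)      # drops "M-.gLX" because of the X
msa = delete_full_gaps(to_matrix(kept))   # the insert columns 3 and 4 disappear
```

In this toy alignment the two insert columns become full gap columns and are removed, leaving a 2x4 matrix. Without the alphabet check, the `X` would silently have been turned into a gap.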

<div class="panel panel-warning">
<div class="panel-heading">
		<strong>If you are deriving scores from gaps...</strong>
	</div>
	<div class="panel-body">
		If you are using MIToS to derive information scores from gaps, you will want to set <code>checkalphabet</code> to <code>true</code>. This prevents non-standard residues from being counted as gaps.
	</div>
</div>

When `read` returns an `AnnotatedMultipleSequenceAlignment`, it uses the MSA `Annotations` to record the modifications performed on the MSA. To access these notes, use `printmodifications`:

In [16]:
msa = read("http://pfam.xfam.org/family/PF01565/alignment/full", Stockholm, checkalphabet=true)



12364x139 MIToS.MSA.AnnotatedMultipleSequenceAlignment:
 P  S  L  I  A  R  C  K  S  A  D  D  V  …  V  V  T  A  D  G  R  Q  L  -  -  -
 P  L  V  I  V  T  A  L  N  V  A  H  I     L  I  D  V  K  G  R  I  L  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -     L  V  L  A  D  G  S  L  V  R  C  S
 P  S  Y  V  V  K  A  T  N  V  A  Q  I     V  V  T  P  D  G  R  F  V  T  A  -
 P  R  A  A  V  R  C  A  T  A  E  A  V     L  F  E  G  T  G  V  V  E  W  V  -
 P  D  V  V  V  L  P  K  N  V  G  Q  V  …  V  V  L  P  N  G  D  V  L  -  -  -
 -  A  Y  Y  I  T  P  H  N  E  T  A  L     -  V  -  -  -  -  -  -  -  -  -  -
 P  L  C  I  V  T  P  R  N  A  S  H  V     M  V  D  A  N  G  N  L  L  -  -  -
 P  S  I  V  I  A  P  G  T  E  N  D  V     I  V  L  A  N  G  D  F  -  -  -  -
 P  A  A  V  L  R  P  R  S  A  Q  D  I     V  V  T  G  T  G  E  L  V  R  C  S
 -  -  -  -  -  -  -  -  -  -  -  -  -  …  V  C  D  G  D  -  -  -  -  -  -  -
 P  P  F  V  V  N  A  T  E  P  G  H  V     V  V  T  P  T  G  E  V  V  A  -  -
 -  -  L

In [17]:
printmodifications(msa)

-------------------
2016-03-04T09:28:31

deletenotalphabetsequences!  :  Deletes 21 sequences with ambiguous or not standard residues (Alphabet: ARNDCQEGHILKMFPSTWYV-. )
filtersequences! : 21 sequences have been deleted.
deletefullgaps!  :  Deletes 621 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
filtercolumns! : 621 columns have been deleted.


### Writing MSA files

In [6]:
fieldnames(msa)



3-element Array{Symbol,1}:
 :id         
 :msa        
 :annotations


In [4]:
msa = read( "http://pfam.xfam.org/family/PF07388/alignment/full", 
            Stockholm, generatemapping=true, useidcoordinates=true, 
            checkalphabet=true)


4x458 MIToS.MSA.AnnotatedMultipleSequenceAlignment:
 -  -  -  -  -  -  -  -  -  -  -  -  -  …  -  -  -  -  -  -  -  -  -  -  -  -
 M  L  K  K  I  K  K  A  L  F  Q  P  K     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  K  K  L  S  G  L  M  Q  D  I  K     D  F  Q  K  Y  R  I  K  Y  L  Q  L
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -



In [None]:
?MIToS.MSA.@res_str

<div class="panel panel-info">
    <div class="panel-heading">
        <strong>Julia help mode</strong>
    </div>
    <div class="panel-body">
        <p>If you type <code>?</code> at the beginning of a line in the Julia REPL, you enter the Julia help mode. In this mode, Julia prints the help or <strong>documentation</strong> of the entered element. This is a convenient way of getting information about MIToS functions, types, etc. from Julia.</p>
    </div>
</div>

In [None]:
?MIToS.MSA.AbstractMultipleSequenceAlignment

In [None]:
msa = read(msa_file, Stockholm, Matrix{Residue})

In [None]:
?MIToS.MSA.MultipleSequenceAlignment

In [None]:
msa = read(msa_file, Stockholm, MultipleSequenceAlignment)

In [None]:
msa.id

In [None]:
msa["F112_SSV1/3-112"]

Similar to this, MIToS defines an `AnnotatedMultipleSequenceAlignment` that also includes annotations.

In [None]:
fieldnames(AnnotatedMultipleSequenceAlignment)

In [None]:
msa = read(msa_file, Stockholm, AnnotatedMultipleSequenceAlignment, generatemapping=true, useidcoordinates=true)

In [None]:
msa.annotations

## MSA annotations

In [None]:
?MIToS.MSA.Annotations

In [None]:
fieldnames(Annotations)

MIToS uses MSA annotations to keep track of:  
- **Modifications** of the MSA (`MIToS_...`), such as the deletion of sequences or columns.  
- Position numbers in the original MSA file (**column mapping:** `ColMap`)  
- Position of the residues in the sequence (**sequence mapping:** `SeqMap`)  
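
The two kinds of mapping listed above can be illustrated with a small plain-Julia sketch (Julia 1.x syntax). It is not how MIToS stores its annotations: `id_coordinates` and `sequence_mapping` are hypothetical helpers, the aligned sequence is made up, and using `0` for gap positions is an assumption of the sketch. Insert residues, which also count towards the numbering in a real Pfam file, are not modelled.

```julia
# Hypothetical helper: split a "seqname/start-end" identifier into its parts.
function id_coordinates(id::AbstractString)
    m = match(r"^(.+)/(\d+)-(\d+)$", id)
    m === nothing && error("no start-end coordinates in $id")
    return String(m.captures[1]), parse(Int, m.captures[2]), parse(Int, m.captures[3])
end

# Residue number for each column of one aligned sequence; 0 marks a gap
# (the 0 convention is an assumption of this sketch).
function sequence_mapping(aligned::AbstractString, start::Int)
    mapping = zeros(Int, length(aligned))
    pos = start - 1
    for (j, c) in enumerate(aligned)
        if c != '-'
            pos += 1
            mapping[j] = pos
        end
    end
    return mapping
end

seqname, first_res, last_res = id_coordinates("F112_SSV1/3-112")
seqmap = sequence_mapping("-MK-LV", first_res)   # made-up aligned fragment
colmap = collect(1:6)                            # column numbers in the original file

# Deleting columns keeps both mappings in step with the surviving columns.
keep = [2, 3, 5, 6]
seqmap[keep], colmap[keep]
```

Because both mappings are indexed by column, subsetting them with the same `keep` indices as the alignment is enough to preserve the link to the original file and to the sequence numbering.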

In [None]:
printmodifications(msa)

In [None]:
getcolumnmapping(msa)

In [None]:
getsequencemapping(msa,"F112_SSV1/3-112")