# MSA

The `MSA` module of MIToS provides utilities for working with Multiple Sequence Alignments (MSAs) of protein sequences.

In [28]:
using MIToS.MSA

- [Residues](#Residues)
- [Multiple Sequence Alignments](#Multiple-Sequence-Alignments)
- [MSA annotations](#MSA-annotations)

## Residues

This module defines the `Residue` type, which represents the 20 natural amino acids plus a `GAP` value. `GAP` stands for insertions and deletions, and also for missing data such as ambiguous residues and non-natural amino acids.  
Each residue is encoded as an integer, which allows probability or frequency matrices to be indexed quickly using `Residue`s.  

In [21]:
for residue in res"ARNDCQEGHILKMFPSTWYV-"
    println(residue, " ", Int(residue))
end

A 1
R 2
N 3
D 4
C 5
Q 6
E 7
G 8
H 9
I 10
L 11
K 12
M 13
F 14
P 15
S 16
T 17
W 18
Y 19
V 20
- 21
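
To see why an integer encoding makes indexing cheap, here is a minimal plain-Julia sketch that does not use MIToS at all. The names `ALPHABET`, `CODE`, `encode` and `residue_counts` are hypothetical stand-ins chosen for illustration; only the code order (A = 1 through V = 20, gap = 21) is taken from the output above, and sending unknown characters to the gap code is an assumption modelled on the description of `GAP`.

```julia
# Hypothetical stand-in for the encoding printed above:
# position in this string = integer code (A = 1, ..., V = 20, gap = 21).
const ALPHABET = "ARNDCQEGHILKMFPSTWYV-"

# Lookup table from character to integer code.
const CODE = Dict(c => i for (i, c) in enumerate(ALPHABET))

# Assumption: anything outside the alphabet is treated like a gap (code 21).
encode(c::Char) = get(CODE, c, 21)

# Count residues by using the integer code directly as an array index.
function residue_counts(column::AbstractString)
    counts = zeros(Int, 21)
    for c in column
        counts[encode(c)] += 1
    end
    counts
end

counts = residue_counts("AARX-")
println(counts[encode('A')])  # 2 alanines
println(counts[21])           # 2: one gap plus the unknown 'X'
```

Because the code is the index, no hashing or searching is needed inside the counting loop once a sequence has been encoded.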


In Julia, macros named `@name_str` can be applied to string literals as prefixes: `name"..."`.  
In particular, the MIToS macro `@res_str` takes a string and returns a `Vector` of `Residue`s (a sequence).

In [29]:
@res_str("ARNDCQEGHILKMFPSTWYV-") == res"ARNDCQEGHILKMFPSTWYV-"

true
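
For readers new to string macros, the sketch below defines one in plain Julia. `@codes_str` is a hypothetical macro invented for this example and is not part of MIToS. It returns plain integer codes rather than `Residue`s, and it maps unknown characters to 21 as an assumption.

```julia
# Hypothetical macro, not part of MIToS: codes"..." yields a Vector{Int}
# holding the position of each character in the alphabet below.
macro codes_str(s)
    alphabet = "ARNDCQEGHILKMFPSTWYV-"
    # This runs at macro-expansion time; the resulting vector is placed
    # directly into the expanded code.
    [something(findfirst(isequal(c), alphabet), 21) for c in s]
end

# The prefix form and the explicit macro call are equivalent.
println(codes"ARN-" == @codes_str("ARN-"))  # true
println(codes"ARN-")                        # [1, 2, 3, 21]
```

The conversion runs when the code is parsed, so a literal written this way costs nothing extra at run time.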

## Multiple Sequence Alignments

In [62]:
msa_file = MIToS.Pfam.downloadpfam("PF09645")

"PF09645.stockholm.gz"

The most basic representation of a multiple sequence alignment is a `Matrix` of `Residue`s.

In [63]:
msa = read(msa_file, Stockholm, Matrix{Residue})

2x110 Array{MIToS.MSA.Residue,2}:
 -  -  -  -  -  -  -  V  A  Q  Q  L  F  …  -  -  -  -  -  -  -  -  -  -  -  -
 Q  T  L  N  S  Y  K  M  A  E  I  M  Y     E  Q  T  D  Q  G  F  I  K  A  K  Q

The type `MultipleSequenceAlignment` also includes **sequence identifiers**.

In [69]:
fieldnames(MultipleSequenceAlignment)

2-element Array{Symbol,1}:
 :id 
 :msa

The sequence identifiers are stored in an `IndexedArray` from the [IndexedArrays package](https://github.com/garrison/IndexedArrays.jl), which allows fast indexing of the MSA by sequence name.

In [70]:
msa = read(msa_file, Stockholm, MultipleSequenceAlignment)

2x110 MIToS.MSA.MultipleSequenceAlignment:
 -  -  -  -  -  -  -  V  A  Q  Q  L  F  …  -  -  -  -  -  -  -  -  -  -  -  -
 Q  T  L  N  S  Y  K  M  A  E  I  M  Y     E  Q  T  D  Q  G  F  I  K  A  K  Q

In [71]:
msa.id

2-element IndexedArrays.IndexedArray{ASCIIString}:
 "Y070_ATV/2-70"  
 "F112_SSV1/3-112"

In [72]:
msa["F112_SSV1/3-112"]

110-element MIToS.MSA.AlignedSequence:
 Q
 T
 L
 N
 S
 Y
 K
 M
 A
 E
 I
 M
 Y
 ⋮
 E
 Q
 T
 D
 Q
 G
 F
 I
 K
 A
 K
 Q
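
The idea of pairing a residue matrix with a name lookup can be sketched in plain Julia. `NamedAlignment` below is a hypothetical type written for illustration only. It uses a `Dict` and `Char`s where MIToS uses an `IndexedArray` and `Residue`s, and the identifiers and sequences are toy data.

```julia
# Hypothetical illustration type, not the MIToS implementation.
struct NamedAlignment
    ids::Vector{String}        # sequence identifiers, in row order
    index::Dict{String,Int}    # sequence name => row number
    msa::Matrix{Char}          # rows are sequences, columns are positions
end

# Build the matrix and the name lookup from equally long aligned strings.
function NamedAlignment(ids::Vector{String}, seqs::Vector{String})
    msa = permutedims(reduce(hcat, collect.(seqs)))
    index = Dict(id => i for (i, id) in enumerate(ids))
    NamedAlignment(ids, index, msa)
end

# Indexing with a name returns the corresponding row.
Base.getindex(a::NamedAlignment, id::AbstractString) = a.msa[a.index[id], :]

aln = NamedAlignment(["seq1/1-3", "seq2/1-4"], ["-VAQ", "QTLN"])
println(size(aln.msa))             # (2, 4)
println(String(aln["seq2/1-4"]))   # QTLN
```

Finding a row by name is then a single dictionary lookup instead of a scan over all identifiers.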

Similarly, MIToS defines an `AnnotatedMultipleSequenceAlignment` type that also stores the annotations of the MSA.

In [75]:
fieldnames(AnnotatedMultipleSequenceAlignment)

3-element Array{Symbol,1}:
 :id         
 :msa        
 :annotations

In [76]:
msa = read(msa_file, Stockholm, AnnotatedMultipleSequenceAlignment, generatemapping=true, useidcoordinates=true)

2x110 MIToS.MSA.AnnotatedMultipleSequenceAlignment:
 -  -  -  -  -  -  -  V  A  Q  Q  L  F  …  -  -  -  -  -  -  -  -  -  -  -  -
 Q  T  L  N  S  Y  K  M  A  E  I  M  Y     E  Q  T  D  Q  G  F  I  K  A  K  Q

In [81]:
msa.annotations

#=GF ID   F-112
#=GF AC   PF09645.7
#=GF DE   F-112 protein
#=GF AU   Coggill P
#=GF SE   pdb_2cmx
#=GF GA   22.40 22.40;
#=GF TC   23.00 37.80;
#=GF NC   19.60 22.30;
#=GF BM   hmmbuild HMM.ann SEED.ann
#=GF SM   hmmsearch -Z 11927849 -E 1000 --cpu 4 HMM pfamseq
#=GF TP   Domain
#=GF DR   INTERPRO; IPR018601;
#=GF CC   F-112 protein is of 70-110 residues and is found in viruses. Its
#=GF CC   winged-helix structure suggests a DNA-binding function.
#=GF SQ   2
#=GF NCol   119
#=GF ColMap   6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115
#=GF MIToS_2016-03-01T19:18:01   deletefullgaps!  :  Deletes 9 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not 

## MSA annotations

In [83]:
?MIToS.MSA.Annotations

The `Annotations` type is basically a container for `Dict`s with the annotations of a multiple sequence alignment. `Annotations` was designed for storage of annotations of the **Stockholm format**.


In [88]:
fieldnames(Annotations)

4-element Array{Symbol,1}:
 :file     
 :sequences
 :columns  
 :residues 
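
The four fields mirror the four annotation kinds of the Stockholm format: `#=GF` (per file), `#=GS` (per sequence), `#=GC` (per column) and `#=GR` (per residue). The sketch below sorts annotation lines into four dictionaries in plain Julia. `ToyAnnotations` and `add_line!` are hypothetical names, the key layout is an assumption rather than the MIToS layout, and the `#=GS` line is toy data.

```julia
# Hypothetical container, loosely modelled on the four fields listed above.
struct ToyAnnotations
    file::Dict{String,String}                      # #=GF feature => text
    sequences::Dict{Tuple{String,String},String}   # #=GS (seqname, feature) => text
    columns::Dict{String,String}                   # #=GC feature => per-column text
    residues::Dict{Tuple{String,String},String}    # #=GR (seqname, feature) => per-residue text
end

ToyAnnotations() = ToyAnnotations(Dict{String,String}(),
                                  Dict{Tuple{String,String},String}(),
                                  Dict{String,String}(),
                                  Dict{Tuple{String,String},String}())

# Store one Stockholm annotation line; other lines are ignored.
function add_line!(ann::ToyAnnotations, line::AbstractString)
    m = match(r"^#=(G[FSCR])\s+(\S+)\s+(.*)$", line)
    m === nothing && return ann
    kind, key, rest = m[1], String(m[2]), String(m[3])
    if kind == "GF"
        # Repeated features (such as CC comment lines) are joined with a space.
        ann.file[key] = haskey(ann.file, key) ? ann.file[key] * " " * rest : rest
    elseif kind == "GC"
        ann.columns[key] = rest
    else
        # GS and GR lines carry a sequence name followed by a feature name.
        parts = match(r"^(\S+)\s+(.*)$", rest)
        parts === nothing && return ann
        target = kind == "GS" ? ann.sequences : ann.residues
        target[(key, String(parts[1]))] = String(parts[2])
    end
    ann
end

ann = ToyAnnotations()
for line in ["#=GF ID   F-112", "#=GF TP   Domain", "#=GS seq1/1-4 AC X00001"]
    add_line!(ann, line)
end
println(ann.file["ID"])                      # F-112
println(ann.sequences[("seq1/1-4", "AC")])   # X00001
```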

MIToS uses MSA annotations to keep track of:  
- **Modifications** of the MSA (`MIToS_...`), such as the deletion of sequences or columns.  
- Position numbers in the original MSA file (**column mapping:** `ColMap`).  
- Positions of the residues in the sequence (**sequence mapping:** `SeqMap`).  

In [85]:
printmodifications(msa)

-------------------
2016-03-01T19:18:01

deletefullgaps!  :  Deletes 9 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
filtercolumns! : 9 columns have been deleted.


In [86]:
getcolumnmapping(msa)

110-element Array{Int64,1}:
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
  16
  17
  18
   ⋮
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
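
A column mapping of this kind can be produced while filtering columns. The following plain-Julia sketch works on a toy `Matrix{Char}`: it removes columns made only of gaps and records the original number of each column that remains. The variable names are hypothetical, and this is not the MIToS `deletefullgaps!` implementation.

```julia
# Toy alignment: rows are sequences, columns are positions.
msa = ['-' 'A' '-' 'C';
       '-' 'G' '-' '-']

# A column is kept if at least one sequence has a non-gap character in it.
keep = [any(c -> c != '-', msa[:, j]) for j in 1:size(msa, 2)]

filtered = msa[:, keep]   # alignment without the full-gap columns
colmap = findall(keep)    # original column number of each retained column

println(size(filtered))   # (2, 2)
println(colmap)           # [2, 4]
```

With `colmap` stored next to the filtered alignment, any retained column can be traced back to its position in the original file.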

In [87]:
getsequencemapping(msa,"F112_SSV1/3-112")

110-element Array{Int64,1}:
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
   ⋮
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
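
As a rough illustration of how such a sequence mapping can be derived, the plain-Julia sketch below numbers the non-gap residues of an aligned sequence consecutively, starting from the coordinate found in an identifier of the form `name/start-end`. The helper names `start_coordinate` and `sequence_mapping` are hypothetical, and writing `0` at gap positions is an assumption made for this sketch, so check the MIToS documentation for the convention it actually uses.

```julia
# Parse the start coordinate from an identifier such as "F112_SSV1/3-112".
function start_coordinate(id::AbstractString)
    coords = split(id, '/')[end]        # "3-112"
    parse(Int, split(coords, '-')[1])   # 3
end

# Number the non-gap residues consecutively from `start`.
# Assumption: gap positions are given 0.
function sequence_mapping(aligned::AbstractString, start::Int)
    mapping = zeros(Int, length(aligned))
    position = start - 1
    for (i, c) in enumerate(aligned)
        if c != '-'
            position += 1
            mapping[i] = position
        end
    end
    mapping
end

start = start_coordinate("F112_SSV1/3-112")
println(sequence_mapping("QT-LN", start))   # [3, 4, 0, 5, 6]
```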