# MSA API

- [Types](#Types)
    - [Residues](#Residues)
    - [Annotations](#Annotations)
    - [Multiple Sequence Alignments](#Multiple-Sequence-Alignments)
    - [Sequences](#Sequences)
    - [MSA formats](#MSA-formats)
    - [Clustering](#Clustering)
- [Constants](#Constants)
- [Macros](#Macros)
- [Methods](#Methods)
    - [Imported from Base](#Imported-from-Base)
    - [Imported from Clustering.jl](#Imported-from-Clustering.jl)

In [1]:
using MIToS.MSA

In [2]:
?MIToS.MSA

The MSA module of MIToS has utilities for working with Multiple Sequence Alignments of protein Sequences (MSA).

**Features**

  * Read and write MSAs in `Stockholm`, `FASTA` or `Raw` format
  * Handle MSA annotations
  * Edit the MSA, e.g. delete columns or sequences, change sequence order, shuffling...
  * Keep track of positions and annotations after modifications on the MSA
  * Describe a MSA, e.g. mean percent identity, sequence coverage, gap percentage...

```julia

using MIToS.MSA
```


<div class="panel panel-info">
    <div class="panel-heading">
        <strong>Julia help mode</strong>
    </div>
    <div class="panel-body">
        <p>If you type <code>?</code> at the beginning of a Julia REPL line, you enter Julia help mode. In this mode, Julia prints the help or <strong>documentation</strong> of the entered element. This is a convenient way of getting information about MIToS functions, types, etc. from within Julia.</p>
    </div>
</div>

<a href="#"><i class="fa fa-arrow-up"></i></a>

## Types

### Residues

In [3]:
?MIToS.MSA.Residue

Most of the **MIToS** design is built around the `Residue` bitstype. It represents the 20 natural amino acids plus a GAP value, which stands for insertions and deletions but also for missing data: ambiguous residues and non-natural amino acids. Each residue is encoded as an integer, which allows fast indexing of probability or frequency matrices using `Residue`s.

**Residue creation and conversion**

Creation and conversion of `Residue`s should be treated carefully. `Residue` is encoded as an 8-bit type similar to `Int8`, to get faster indexing using `Int(x::Residue)`. Accordingly, `Int`, `Int8` and other signed integer types return the integer value encoded by the residue. Conversions to and from `Char` and `UInt8` behave differently: they use the character representation, which is what IO operations need.

```julia

julia> alanine = Residue('A')
A

julia> Int(alanine)
1

julia> Char(alanine)
'A'

julia> UInt8(alanine) # 0x41 == 65 == 'A'
0x41

julia> for residue in res"ARNDCQEGHILKMFPSTWYV-"
           println(residue, " ", Int(residue))
       end
A 1
R 2
N 3
D 4
C 5
Q 6
E 7
G 8
H 9
I 10
L 11
K 12
M 13
F 14
P 15
S 16
T 17
W 18
Y 19
V 20
- 21

```
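The encoding shown above can be mimicked without MIToS. The following is a minimal stand-alone sketch, not the MIToS implementation: `alphabet`, `gap_code` and `residue_code` are hypothetical names, and plain `Char`s and `Int`s stand in for the `Residue` type. It shows why an integer code is convenient for indexing a frequency vector.

```julia
# Hypothetical stand-alone sketch; it does not use or reproduce MIToS internals.
alphabet = "ARNDCQEGHILKMFPSTWYV"   # residue order taken from the listing above
gap_code = 21                       # integer code of the gap

# Map a character to its integer code; anything outside the alphabet
# (gaps, dots, lowercase letters, non-standard residues) becomes a gap.
function residue_code(c::Char)
    i = findfirst(isequal(c), alphabet)
    return i === nothing ? gap_code : i
end

codes = [residue_code(c) for c in "MIToS"]

# The integer codes index directly into a frequency vector.
freq = zeros(Int, 21)
for c in "AAR-"
    freq[residue_code(c)] += 1
end
```

Here `codes` holds one integer per character, and `freq[1]` counts alanines while `freq[21]` counts gaps.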


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Annotations

In [4]:
?MIToS.MSA.Annotations

The `Annotations` type is basically a container for `Dict`s with the annotations of a multiple sequence alignment. `Annotations` was designed for storage of annotations of the **Stockholm format**.

MIToS also uses MSA annotations to keep track of:

  * **Modifications** of the MSA (`MIToS_...`), such as the deletion of sequences or columns.
  * Position numbers in the original MSA file (**column mapping:** `ColMap`)
  * Position of the residues in the sequence (**sequence mapping:** `SeqMap`)


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Multiple Sequence Alignments

In [5]:
?MIToS.MSA.AbstractMultipleSequenceAlignment

MIToS MSAs are subtypes of `AbstractMatrix{Residue}`, because the most basic implementation of a MIToS MSA is a `Matrix` of `Residue`s.


In [6]:
?MIToS.MSA.MultipleSequenceAlignment

This MSA type includes the `Matrix` of `Residue`s and the sequence names. To allow fast indexing of MSAs using **sequence identifiers**, they are saved as an `IndexedArray`.


In [7]:
?MIToS.MSA.AnnotatedMultipleSequenceAlignment

...


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Sequences

In [8]:
?MIToS.MSA.AbstractAlignedSequence

MIToS sequences are subtypes of `AbstractVector{Residue}`.


In [9]:
?MIToS.MSA.AlignedSequence



No documentation found.

**Summary:**

```julia
type MIToS.MSA.AlignedSequence <: MIToS.MSA.AbstractAlignedSequence
```

**Fields:**

```julia
id       :: ASCIIString
index    :: Int64
sequence :: Array{MIToS.MSA.Residue,1}
```


In [10]:
?MIToS.MSA.AnnotatedAlignedSequence

No documentation found.

**Summary:**

```julia
type MIToS.MSA.AnnotatedAlignedSequence <: MIToS.MSA.AbstractAlignedSequence
```

**Fields:**

```julia
id          :: ASCIIString
index       :: Int64
sequence    :: Array{MIToS.MSA.Residue,1}
annotations :: MIToS.MSA.Annotations
```


<a href="#"><i class="fa fa-arrow-up"></i></a>

### MSA formats

In [11]:
?MIToS.MSA.Stockholm

No documentation found.

**Summary:**

```julia
immutable MIToS.MSA.Stockholm <: MIToS.Utils.Format
```


In [12]:
?MIToS.MSA.FASTA

No documentation found.

**Summary:**

```julia
immutable MIToS.MSA.FASTA <: MIToS.Utils.Format
```


In [13]:
?MIToS.MSA.Raw

No documentation found.

**Summary:**

```julia
immutable MIToS.MSA.Raw <: MIToS.Utils.Format
```


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Clustering

In [14]:
Docs.typesummary( MIToS.MSA.ClusteringResult ) # imported from Clustering.jl

In [15]:
?MIToS.MSA.NoClustering

No documentation found.

**Summary:**

```julia
immutable MIToS.MSA.NoClustering <: Clustering.ClusteringResult
```


In [16]:
?MIToS.MSA.SequenceClusters

Data structure to represent sequence clusters. The sequence data itself is not included.


<a href="#"><i class="fa fa-arrow-up"></i></a>

## Constants

In [17]:
?MIToS.MSA.GAP

`GAP` is the character/number representation in **MIToS** for gaps (and also for non-standard residues). Lowercase characters and dots are also encoded as `GAP` when converting from `String`s and `Char`s. This `Residue` constant is encoded as `Residue(21)`.


<a href="#"><i class="fa fa-arrow-up"></i></a>

## Macros

In [18]:
?MIToS.MSA.@res_str

The MIToS macro `@res_str` takes a string and returns a `Vector` of `Residue`s (a sequence).

```julia

julia> res"MIToS"
5-element Array{MIToS.MSA.Residue,1}:
 M
 I
 T
 -
 S

```


<a href="#"><i class="fa fa-arrow-up"></i></a>

## Methods

In [19]:
?MIToS.MSA.swap!

`swap!(ia::IndexedArray, to::Int, from::Int)` interchanges (swaps) the values at the indices `to` and `from` in the `IndexedArray`.


In [20]:
?MIToS.MSA.annotations

Returns the annotations of a MSA or a sequence.


In [21]:
?MIToS.MSA.filtersequences!

Filters the sequences of an MSA using an `AbstractVector{Bool}` mask (sequences marked `false` are removed). For `AnnotatedMultipleSequenceAlignment`s the annotations are updated.

`filtersequences!(data::Annotations, ids::IndexedArray, mask::AbstractArray{Bool,1})` is useful for deleting annotations for a group of sequences. `ids` should be an `IndexedArray` with the `seqname`s of the annotated sequences and `mask` should be a logical vector.


In [22]:
?MIToS.MSA.filtercolumns!

Filters the columns/positions of an MSA using an `AbstractVector{Bool}` mask. For `AnnotatedMultipleSequenceAlignment`s or `AnnotatedAlignedSequence`s the annotations are updated.

`filtercolumns!(data::Annotations, mask)` is useful for deleting annotations for a group of columns (creating a subset in place).
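The masking semantics of `filtersequences!` and `filtercolumns!` can be pictured with ordinary logical indexing. This is a minimal sketch on a plain `Matrix{Char}` (an assumption made to keep it self-contained); it does not call MIToS and does not update any annotations.

```julia
# Plain-Julia sketch of boolean-mask filtering; rows are sequences, columns are positions.
msa = ['A' 'R' 'N';
       'A' '-' 'N';
       '-' '-' 'D']

seqmask = [true, false, true]   # false marks a sequence to remove
colmask = [true, false, true]   # false marks a column to remove

kept_sequences = msa[seqmask, :]   # drops the second sequence
kept_columns   = msa[:, colmask]   # drops the second column
```

Unlike the in-place MIToS functions, logical indexing returns new matrices and leaves `msa` untouched.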


In [23]:
?MIToS.MSA.empty

Creates empty MSA `Annotations` of length 0 using `sizehint!`


In [24]:
?MIToS.MSA.getannotfile

`getannotfile(ann[, feature[,default]])` returns per file annotation for `feature`


In [25]:
?MIToS.MSA.getannotcolumn

`getannotcolumn(ann[, feature[,default]])` returns per column annotation for `feature`


In [26]:
?MIToS.MSA.getannotsequence

`getannotsequence(ann[, seqname, feature[,default]])` returns per sequence annotation for `(seqname, feature)`


In [27]:
?MIToS.MSA.getannotresidue

`getannotresidue(ann[, seqname, feature[,default]])` returns per residue annotation for `(seqname, feature)`


In [28]:
?MIToS.MSA.setannotfile!

`setannotfile!(ann, feature, annotation)` stores per file `annotation` for `feature`


In [29]:
?MIToS.MSA.setannotcolumn!

`setannotcolumn!(ann, feature, annotation)` stores per column `annotation` (1 char per column) for `feature`


In [30]:
?MIToS.MSA.setannotsequence!

`setannotsequence!(ann, seqname, feature, annotation)` stores per sequence `annotation` for `(seqname, feature)`


In [31]:
?MIToS.MSA.setannotresidue!

`setannotresidue!(ann, seqname, feature, annotation)` stores per residue `annotation` (1 char per residue) for `(seqname, feature)`
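The four getter/setter pairs above mirror the four Stockholm markup kinds (`#=GF`, `#=GC`, `#=GS`, `#=GR`). A rough way to picture the container is four dictionaries, two keyed by `feature` and two keyed by `(seqname, feature)`. The sketch below is a hypothetical stand-in, not the actual `Annotations` layout: the type name `AnnotationsSketch`, its field names, and all feature names and values are made up for illustration.

```julia
# Hypothetical stand-in for an annotations container; field names are assumptions.
struct AnnotationsSketch
    file::Dict{String,String}                     # per-file: feature => annotation
    column::Dict{String,String}                   # per-column: feature => one char per column
    sequence::Dict{Tuple{String,String},String}   # per-sequence: (seqname, feature) => annotation
    residue::Dict{Tuple{String,String},String}    # per-residue: (seqname, feature) => one char per residue
end

AnnotationsSketch() = AnnotationsSketch(Dict{String,String}(),
                                        Dict{String,String}(),
                                        Dict{Tuple{String,String},String}(),
                                        Dict{Tuple{String,String},String}())

ann = AnnotationsSketch()
ann.file["NOTE"] = "illustrative alignment"          # made-up feature and value
ann.column["MARK"] = "xx.x"                          # one character per column
ann.sequence[("seq1", "SRC")] = "made-up source"
ann.residue[("seq1", "MARK")] = "x..x"               # one character per residue

# A getter with a default corresponds to `get` on the matching dictionary.
note    = get(ann.file, "NOTE", "")
missing_feature = get(ann.file, "OTHER", "none")
```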


In [32]:
?MIToS.MSA.annotate_modification!

Records, as file annotations, the modifications made by MIToS to the MSA.


In [33]:
?MIToS.MSA.delete_annotated_modifications!

Deletes all the MIToS annotated modifications


In [34]:
?MIToS.MSA.printmodifications

Prints MIToS annotated modifications


In [35]:
?MIToS.MSA.getresidues

Allows you to access the residues as a `Matrix{Residue}`/`Vector{Residue}` without annotations.


In [36]:
?MIToS.MSA.getsequence

Returns an `AlignedSequence` from the `MultipleSequenceAlignment`

Returns an `AnnotatedAlignedSequence` with all annotations of sequence from the `AnnotatedMultipleSequenceAlignment`

Gives you the annotations of the Sequence


In [37]:
?MIToS.MSA.getresiduesequences

Gives you a `Vector{Vector{Residue}}` with all the sequences of the MSA without Annotations


In [38]:
?MIToS.MSA.nsequences

Gives you the number of sequences on the `MultipleSequenceAlignment`


In [39]:
?MIToS.MSA.ncolumns

Gives you the number of columns/positions on the MSA or aligned sequence

`ncolumns(ann::Annotations)` returns the number of columns/residues with annotations. This function returns `-1` if there are no annotations per column/residue.


In [40]:
?MIToS.MSA.gapfraction

Calculates the fraction of gaps in the `Array` (alignment, sequence, column, etc.). This function can take an extra `dim` argument to calculate the gap fraction over the given dimension.


In [41]:
?MIToS.MSA.residuefraction

Calculates the fraction of residues (non-gap positions) in the `Array` (alignment, sequence, column, etc.). This function can take an extra `dim` argument to calculate the residue fraction over the given dimension.


In [42]:
?MIToS.MSA.coverage

Coverage of the sequences with respect to the number of positions in the MSA.


In [43]:
?MIToS.MSA.columngapfraction

Fraction of gaps per column/position on the MSA
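The quantities described by `gapfraction`, `residuefraction`, `coverage` and `columngapfraction` are all means of a gap mask taken over different dimensions. The sketch below computes them on a plain `Matrix{Char}` with the `Statistics` standard library; it is an illustration of the definitions under that assumption, not a call into MIToS.

```julia
using Statistics

# Rows are sequences, columns are alignment positions.
msa = ['A' 'R' '-' 'D';
       'A' '-' '-' 'D';
       'A' 'R' 'N' '-']

isgap = msa .== '-'

gap_fraction        = mean(isgap)                    # over the whole alignment
residue_fraction    = 1 - gap_fraction               # complement of the gap fraction
column_gap_fraction = vec(mean(isgap, dims=1))       # one value per column
sequence_coverage   = vec(mean(.!isgap, dims=2))     # non-gap fraction per sequence
```

With this toy alignment, 4 of the 12 cells are gaps, the third column is two-thirds gaps, and the second sequence covers half of the positions.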


In [44]:
?MIToS.MSA.setreference!

Sets the sequence `i` as the reference (the first sequence) of the MSA by swapping sequences 1 and `i`. An `id` can also be used to select the sequence.


In [45]:
?MIToS.MSA.gapstrip!

This function deletes/filters sequences and columns/positions of the MSA in the following order:

  * Removes all the columns/positions of the MSA with gaps in the reference sequence (first sequence)
  * Removes all the sequences whose coverage, with respect to the number of columns/positions of the MSA, is **less** than `coveragelimit` (defaults to `0.75`, i.e. sequences with more than 25% gaps are removed)
  * Removes all the columns/positions of the MSA with **more** than `gaplimit` gaps (defaults to `0.5`, i.e. 50% gaps)
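The order of the three steps matters, because coverage and column gap fractions are computed on whatever survived the previous step. The following is a minimal sketch of that pipeline on a plain `Matrix{Char}`; `gapstrip_sketch` is a hypothetical name, it returns a new matrix instead of working in place, and it ignores annotations entirely.

```julia
using Statistics

# Hypothetical re-implementation of the three documented steps on a Matrix{Char}.
function gapstrip_sketch(msa::Matrix{Char}; coveragelimit=0.75, gaplimit=0.5)
    # 1. Drop columns where the reference (first) sequence has a gap.
    msa = msa[:, msa[1, :] .!= '-']
    # 2. Drop sequences whose coverage is less than coveragelimit.
    coverage = vec(mean(msa .!= '-', dims=2))
    msa = msa[coverage .>= coveragelimit, :]
    # 3. Drop columns with more than gaplimit gaps.
    colgaps = vec(mean(msa .== '-', dims=1))
    return msa[:, colgaps .<= gaplimit]
end

msa = ['A' 'R' '-' 'D' 'E';
       'A' '-' 'N' 'D' 'E';
       '-' '-' 'N' '-' '-';
       'A' '-' 'N' 'D' 'E']

stripped = gapstrip_sketch(msa)
```

In this example the third column goes first (gap in the reference), then the all-gap third sequence, and finally the second column, which is two-thirds gaps among the remaining sequences.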


In [46]:
?MIToS.MSA.adjustreference!

Removes positions/columns of the MSA with gaps in the reference (first) sequence


In [47]:
?MIToS.MSA.asciisequence

Gives an ASCIIString with the sequence number `seq` of the MSA


In [48]:
?MIToS.MSA.gapstrip

Creates a new `Matrix{Residue}` from the MSA, with sequences and columns/positions deleted:

  * Removes all the columns/positions of the MSA with gaps in the reference sequence (first sequence)
  * Removes all the sequences whose coverage, with respect to the number of columns/positions of the MSA, is **less** than `coveragelimit` (defaults to `0.75`, i.e. sequences with more than 25% gaps are removed)
  * Removes all the columns/positions of the MSA with **more** than `gaplimit` gaps (defaults to `0.5`, i.e. 50% gaps)


In [49]:
?MIToS.MSA.adjustreference

Creates a new `Matrix{Residue}`. This function deletes positions/columns of the MSA with gaps in the reference (first) sequence.


In [50]:
?MIToS.MSA.filtersequences

No documentation found.

`MIToS.MSA.filtersequences` is a generic `Function`.

```julia
# 1 method for generic function "filtersequences":
filtersequences(msa::Array{MIToS.MSA.Residue,2}, mask::AbstractArray{Bool,1}) at /home/diego/.julia/v0.4/MIToS/src/MSA/MultipleSequenceAlignment.jl:175
```


In [51]:
?MIToS.MSA.filtercolumns

No documentation found.

`MIToS.MSA.filtercolumns` is a generic `Function`.

```julia
# 2 methods for generic function "filtercolumns":
filtercolumns(msa::Array{MIToS.MSA.Residue,2}, mask::AbstractArray{Bool,1}) at /home/diego/.julia/v0.4/MIToS/src/MSA/MultipleSequenceAlignment.jl:198
filtercolumns(seq::Array{MIToS.MSA.Residue,1}, mask::AbstractArray{Bool,1}) at /home/diego/.julia/v0.4/MIToS/src/MSA/MultipleSequenceAlignment.jl:199
```


In [52]:
?MIToS.MSA.getcolumnmapping

No documentation found.

`MIToS.MSA.getcolumnmapping` is a generic `Function`.

```julia
# 1 method for generic function "getcolumnmapping":
getcolumnmapping(msa::MIToS.MSA.AnnotatedMultipleSequenceAlignment) at /home/diego/.julia/v0.4/MIToS/src/MSA/MultipleSequenceAlignment.jl:466
```


In [53]:
?MIToS.MSA.getsequencemapping

No documentation found.

`MIToS.MSA.getsequencemapping` is a generic `Function`.

```julia
# 2 methods for generic function "getsequencemapping":
getsequencemapping(msa::MIToS.MSA.AnnotatedMultipleSequenceAlignment, seq_id::ASCIIString) at /home/diego/.julia/v0.4/MIToS/src/MSA/MultipleSequenceAlignment.jl:468
getsequencemapping(msa::MIToS.MSA.AnnotatedMultipleSequenceAlignment, seq_num::Int64) at /home/diego/.julia/v0.4/MIToS/src/MSA/MultipleSequenceAlignment.jl:471
```


In [54]:
?MIToS.MSA.shuffle_columnwise!

Shuffles the residues in each column


In [55]:
?MIToS.MSA.shuffle_sequencewise!

Shuffles the residues in each sequence


In [56]:
?MIToS.MSA.shuffle_residues_sequencewise!

Shuffles the residues in each sequence, keeping fixed the gap positions


In [57]:
?MIToS.MSA.shuffle_residues_columnwise!

Shuffles the residues in each column, keeping fixed the gap positions
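The difference between the plain shuffles and the `shuffle_residues_*` variants is that the latter permute only the non-gap positions. A minimal sketch of that idea on a vector of `Char`s follows; `shuffle_residues_keep_gaps!` is a hypothetical name and the code is not taken from MIToS.

```julia
using Random

# Shuffle only the non-gap entries of `seq`, leaving every gap where it was.
function shuffle_residues_keep_gaps!(seq::AbstractVector{Char},
                                     rng::AbstractRNG=Random.default_rng())
    positions = findall(!=('-'), seq)              # indices of non-gap residues
    seq[positions] = shuffle(rng, seq[positions])  # permute residues among those indices
    return seq
end

original = collect("AR--ND-E")
shuffled = shuffle_residues_keep_gaps!(copy(original), MersenneTwister(1))
```

Because the function accepts any `AbstractVector{Char}`, the same sketch could be applied to a view of a row or a column of a `Matrix{Char}`.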


In [58]:
?MIToS.MSA.percentidentity

Quickly checks whether two aligned sequences have an identity value greater than a given `threshold` value. Returns a boolean value.

Calculates the fraction of identities between two aligned sequences.

The identity value is calculated as the number of identical characters in the i-th position of both sequences divided by the length of both sequences. Positions with gaps in both sequences are not counted in the length of the sequence.

Calculates the identity between all the sequences of an MSA. You can indicate the output element type with the last optional parameter (`Float64` by default). For an MSA with a lot of sequences, you can use `Float32` or `Float16` in order to avoid an `OutOfMemoryError()`.
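The identity definition given above (identical positions divided by the length, not counting positions that are gaps in both sequences) can be written out directly. This is a sketch on plain strings, with `identity_fraction` as a hypothetical name; returning `0.0` when no position is counted is an assumption, since the docstring does not say what happens in that case.

```julia
# Fraction of identical positions between two aligned sequences.
# Columns that are gaps in both sequences are excluded from the length.
function identity_fraction(a::AbstractString, b::AbstractString)
    length(a) == length(b) || throw(ArgumentError("sequences must have the same length"))
    matches = 0
    counted = 0
    for (x, y) in zip(a, b)
        x == '-' && y == '-' && continue   # double gap: not part of the length
        counted += 1
        matches += (x == y)
    end
    return counted == 0 ? 0.0 : matches / counted   # assumption: 0.0 if nothing to compare
end

pid = identity_fraction("AR-DE", "AK-D-")
```

Here two of the four counted positions are identical, so the value is one half.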


In [59]:
?MIToS.MSA.meanpercentidentity

Returns the mean of the percent identity between the sequences of a MSA. If the MSA has 300 sequences or less, the mean is exact. If the MSA has more sequences, 44850 random pairs of sequences are used for the estimation. The number of samples can be changed using the second argument.
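The two numbers in this description are related: 44850 is exactly the number of distinct sequence pairs in an alignment of 300 sequences. A short check (`npairs` is a hypothetical helper name):

```julia
# Number of unordered pairs among n sequences: n * (n - 1) / 2.
npairs(n) = div(n * (n - 1), 2)

pairs_at_limit = npairs(300)
same_as_binomial = binomial(300, 2) == pairs_at_limit
```

So at the 300-sequence boundary, sampling 44850 pairs costs about as much as the exact calculation; reading this as the reason for the default is an inference from the arithmetic, not something the docstring states.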


In [60]:
?MIToS.MSA.getweight

`getweight(c, i::Int)`

This function returns the weight of the sequence number `i`. `getweight` should be defined for any type used for `count!`/`count` in order to use its weights.

Get the weights of all clusters in the set.


In [61]:
?MIToS.MSA.hobohmI

Sequence clustering using the Hobohm I method from Hobohm et al. 1992.

*Hobohm, Uwe, et al. "Selection of representative protein data sets." Protein Science 1.3 (1992): 409-417.*
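As commonly described, Hobohm I is a greedy procedure: sequences are visited in order, each one joins the first existing cluster whose representative is at least as similar as a threshold, and otherwise it founds a new cluster. The sketch below follows that description on plain strings. It is not the MIToS code; `hobohm1_sketch` and `identity_fraction` are hypothetical names, and the identity measure and the `>=` comparison are assumptions.

```julia
# Fraction of identical positions, skipping columns that are gaps in both sequences.
function identity_fraction(a, b)
    matches = 0
    counted = 0
    for (x, y) in zip(a, b)
        x == '-' && y == '-' && continue
        counted += 1
        matches += (x == y)
    end
    return counted == 0 ? 0.0 : matches / counted
end

# Greedy clustering sketch: the first member of each cluster is its representative.
function hobohm1_sketch(seqs::Vector{String}, threshold::Float64)
    representatives = Int[]                    # index of each cluster's representative
    assignments = zeros(Int, length(seqs))     # cluster number of each sequence
    for i in eachindex(seqs)
        for (k, r) in enumerate(representatives)
            if identity_fraction(seqs[i], seqs[r]) >= threshold
                assignments[i] = k             # join the first similar cluster
                break
            end
        end
        if assignments[i] == 0                 # no similar representative: new cluster
            push!(representatives, i)
            assignments[i] = length(representatives)
        end
    end
    cluster_counts = [count(==(k), assignments) for k in 1:length(representatives)]
    return assignments, cluster_counts
end

assignments, cluster_counts = hobohm1_sketch(["ARND", "ARNE", "WYVV", "WYVL"], 0.7)
```

The two returned vectors play the roles that `assignments` and `counts` play for a clustering result, and the number of clusters is the length of `cluster_counts`.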


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Imported from Base

In [62]:
?MIToS.MSA.names

```
names(x::Module[, all=false[, imported=false]])
```

Get an array of the names exported by a `Module`, with optionally more `Module` globals according to the additional parameters.

Returns the names of the MSA sequences.


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Imported from Clustering.jl

In [63]:
?MIToS.MSA.nclusters

Get the number of clusters in `SequenceClusters`.


In [64]:
?MIToS.MSA.counts

Get sample counts of clusters as a `Vector`. Each `k` value is the number of samples assigned to the k-th cluster.


In [65]:
?MIToS.MSA.assignments

Get a vector of assignments, where the `i` value is the index/number of the cluster to which the i-th sequence is assigned.


<a href="#"><i class="fa fa-arrow-up"></i></a>