Sequence motif analysis using Bio.motifs
========================================

This demo gives an overview of the functionality of the `Bio.motifs`
package included in Biopython.

If you are reading this, you might also be interested in
[TAMO](http://fraenkel.mit.edu/TAMO/), another Python library designed
to deal with sequence motifs. It supports more *de novo* motif finders,
but it is not part of Biopython and has some restrictions on commercial
use.

Motif objects
-------------

Since we are interested in motif analysis, we first need to take a look
at `Motif` objects. For that we need to import the `Bio.motifs` library:

In [1]:
from Bio import motifs


and we can start creating our first motif objects. We can either create
a `Motif` object from a list of instances of the motif, or we can obtain
a `Motif` object by parsing a file from a motif database or motif
finding software.

### Creating a motif from instances

Suppose we have these instances of a DNA motif:

In [2]:
from Bio.Seq import Seq
instances = [Seq("TACAA"),
    Seq("TACGC"),
    Seq("TACAC"),
    Seq("TACCC"),
    Seq("AACCC"),
    Seq("AATGC"),
    Seq("AATGC")]

then we can create a Motif object as follows:

In [3]:
m = motifs.create(instances)

The instances are saved in an attribute `m.instances`, which is
essentially a Python list with some added functionality, as described
below. Printing out the Motif object shows the instances from which it
was constructed:

In [5]:
print(m)

TACAA
TACGC
TACAC
TACCC
AACCC
AATGC
AATGC



The length of the motif is defined as the sequence length, which should
be the same for all instances:

In [6]:
len(m)

5

The Motif object has an attribute `.counts` containing the counts of
each nucleotide at each position. Printing this counts matrix shows it
in an easily readable format:

In [7]:
print(m.counts)

        0      1      2      3      4
A:   3.00   7.00   0.00   2.00   1.00
C:   0.00   0.00   5.00   2.00   6.00
G:   0.00   0.00   0.00   3.00   0.00
T:   4.00   0.00   2.00   0.00   0.00



You can access these counts as a dictionary:

In [8]:
m.counts['A']

[3, 7, 0, 2, 1]

In [9]:
m.counts['T', 0]

4

You can also directly access columns of the counts matrix:

In [10]:
m.counts[:, 3]

{'A': 2, 'C': 2, 'G': 3, 'T': 0}

The letters that label the rows of the counts matrix are stored in the
`alphabet` attribute:

In [11]:
m.alphabet

'ACGT'
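
To make the counts matrix concrete, here is a minimal plain-Python sketch that tallies the same seven instances by hand. It is an illustration only and not how `Bio.motifs` is implemented; the names `counts`, `row_a` and `column_3` are chosen for this example.

```python
# Plain-Python sketch of a position-specific counts matrix.
# Illustration only; this is not Biopython's implementation.
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
alphabet = "ACGT"
length = len(instances[0])

# One row per letter, one column per motif position.
counts = {letter: [0] * length for letter in alphabet}
for instance in instances:
    for position, letter in enumerate(instance):
        counts[letter][position] += 1

# Row access: the counts of A at every position.
row_a = counts["A"]
# Column access: the counts of every letter at position 3.
column_3 = {letter: counts[letter][3] for letter in alphabet}

print(row_a)
print(column_3)
```

The row and the column computed here correspond to `m.counts['A']` and `m.counts[:, 3]` in the cells above.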

The motif has an associated consensus sequence, defined as the sequence
of letters along the positions of the motif for which the largest value
in the corresponding columns of the `.counts` matrix is obtained:

In [12]:
m.consensus

Seq('TACGC')

as well as an anticonsensus sequence, corresponding to the smallest
values in the columns of the `.counts` matrix:

In [13]:
m.anticonsensus

Seq('CCATG')
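
The two definitions can be sketched in plain Python from the counts shown earlier. This is a hedged illustration rather than the library's code: the tie-breaking rule (the first tied letter in alphabet order wins) is an assumption of this sketch, chosen because it reproduces the outputs above.

```python
# Counts copied from the printed counts matrix of the example motif.
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}
alphabet = "ACGT"
length = 5


def column_extreme(position, choose):
    # choose is max or min; both return the first of several tied letters,
    # so ties are resolved in alphabet order (an assumption of this sketch).
    return choose(alphabet, key=lambda letter: counts[letter][position])


consensus = "".join(column_extreme(i, max) for i in range(length))
anticonsensus = "".join(column_extreme(i, min) for i in range(length))

print(consensus)
print(anticonsensus)
```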

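You can also ask for a degenerate consensus sequence, in which ambiguous
nucleotides are used for positions where multiple nucleotides have high
counts:

In [14]: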
m.degenerate_consensus

Seq('WACVC')

Here, W and V follow the IUPAC nucleotide ambiguity codes: W is either A or T, and V is A, C, or G @cornish1985. The degenerate consensus sequence is constructed following the rules specified by Cavener @cavener1987.
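
As a rough sketch of how such rules can be applied, the code below uses the thresholds as they are commonly summarised: a single nucleotide if it accounts for more than half of the counts and more than twice the runner-up, two nucleotides if together they exceed 75%, three if the fourth is absent, and N otherwise. These thresholds and the helper `degenerate_letter` are assumptions of this example and should be checked against the cited paper or the `Bio.motifs` source.

```python
# Counts copied from the printed counts matrix of the example motif.
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}

# IUPAC ambiguity codes, keyed by the sorted set of nucleotides they cover.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B",
}


def degenerate_letter(column):
    """Return the ambiguity code for one column ({nucleotide: count})."""
    ranked = sorted(column, key=column.get, reverse=True)
    values = [column[n] for n in ranked]
    total = sum(values)
    if values[0] > total - values[0] and values[0] > 2 * values[1]:
        key = ranked[0]                    # one clearly dominant nucleotide
    elif 4 * (values[0] + values[1]) > 3 * total:
        key = "".join(sorted(ranked[:2]))  # top two cover more than 75%
    elif values[3] == 0:
        key = "".join(sorted(ranked[:3]))  # one nucleotide never observed
    else:
        return "N"
    return IUPAC[key]


columns = [{n: counts[n][i] for n in "ACGT"} for i in range(5)]
degenerate = "".join(degenerate_letter(column) for column in columns)
print(degenerate)
```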

We can also get the reverse complement of a motif:

In [15]:
r = m.reverse_complement()
r.consensus

Seq('GCGTA')

In [16]:
print(r)

TTGTA
GCGTA
GTGTA
GGGTA
GGGTT
GCATT
GCATT



The reverse complement and the degenerate consensus sequence are only
defined for DNA motifs.
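
For intuition, the reverse complement of each instance can be reproduced with plain string operations. This is a small sketch, not Biopython's implementation, and the helper name `reverse_complement` is chosen for this example.

```python
# Translation table mapping each DNA letter to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    # Complement every letter, then reverse the order of the letters.
    return sequence.translate(COMPLEMENT)[::-1]


instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
rc_instances = [reverse_complement(instance) for instance in instances]

for instance in rc_instances:
    print(instance)
```

The printed lines match the instances of the reverse-complemented motif `r` shown above.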

### Creating a sequence logo


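If we have internet access, we can create a sequence logo by calling the
motif's `weblogo` method with the name of an output file. We should then
get our logo saved as a PNG in the specified file.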

Reading motifs
--------------

Creating motifs from instances by hand is a bit boring, so it’s useful
to have some I/O functions for reading and writing motifs. There are no
really well-established standards for storing motifs, but a couple of
formats are used more than others.

### JASPAR

One of the most popular motif databases is
[JASPAR](http://jaspar.genereg.net). In addition to the motif sequence
information, the JASPAR database stores a lot of meta-information for
each motif. The module `Bio.motifs` contains a specialized class
`jaspar.Motif` in which this meta-information is represented as
attributes:

-   `matrix_id` - the unique JASPAR motif ID, e.g. 'MA0004.1'

-   `name` - the name of the TF, e.g. 'Arnt'

-   `collection` - the JASPAR collection to which the motif
    belongs, e.g. 'CORE'

-   `tf_class` - the structural class of this TF, e.g. 'Zipper-Type'

-   `tf_family` - the family to which this TF belongs, e.g.
    'Helix-Loop-Helix'

-   `species` - the species to which this TF belongs; may have multiple
    values, which are specified as taxonomy IDs, e.g. 10090

-   `tax_group` - the taxonomic supergroup to which this motif
    belongs, e.g. 'vertebrates'

-   `acc` - the accession number of the TF protein, e.g. 'P53762'

-   `data_type` - the type of data used to construct this motif, e.g.
    'SELEX'

-   `medline` - the PubMed ID of literature supporting this motif; may
    have multiple values, e.g. 7592839

-   `pazar_id` - external reference to the TF in the
    [PAZAR](http://pazar.info) database, e.g. 'TF0000003'

-   `comment` - free-form text containing notes about the construction
    of the motif

The `jaspar.Motif` class inherits from the generic `Motif` class and
therefore provides all the facilities of any of the motif formats —
reading motifs, writing motifs, scanning sequences for motif instances
etc.

JASPAR stores motifs in several different ways including three different
flat file formats and as an SQL database. All of these formats
facilitate the construction of a counts matrix. However, the amount of
meta information described above that is available varies with the
format.

#### The JASPAR `sites` format

The first of the three flat file formats contains a list of instances.
As an example, these are the beginning and ending lines of the JASPAR
`Arnt.sites` file showing known binding sites of the mouse
helix-loop-helix transcription factor Arnt.

```
>MA0004 ARNT 1
CACGTGatgtcctc
>MA0004 ARNT 2
CACGTGggaggtac
>MA0004 ARNT 3
CACGTGccgcgcgc
...
>MA0004 ARNT 18
AACGTGacagccctcc
>MA0004 ARNT 19
AACGTGcacatcgtcc
>MA0004 ARNT 20
aggaatCGCGTGc

```

The parts of the sequence in capital letters are the motif instances
that were found to align to each other.

We can create a `Motif` object from these instances as follows:

In [18]:
from Bio import motifs
with open("../data/Arnt.sites") as handle:
    arnt = motifs.read(handle, "sites")

The instances from which this motif was created are stored in the
`.instances` property:

In [19]:
print(len(arnt.instances))

20


In [20]:
for instance in arnt.instances:
    print(instance)

CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
AACGTG
AACGTG
AACGTG
AACGTG
CGCGTG


The counts matrix of this motif is automatically calculated from the
instances:

In [22]:
print(arnt.counts)

        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00
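
To illustrate the convention that the capital letters mark the motif instance, here is a minimal plain-Python sketch that extracts them from the six records shown in the excerpt above (the elided records are not reconstructed). It is not the parser used by `motifs.read`, and the variable names are chosen for this example.

```python
import re

# The six records visible in the excerpt of Arnt.sites shown above.
excerpt = """\
>MA0004 ARNT 1
CACGTGatgtcctc
>MA0004 ARNT 2
CACGTGggaggtac
>MA0004 ARNT 3
CACGTGccgcgcgc
>MA0004 ARNT 18
AACGTGacagccctcc
>MA0004 ARNT 19
AACGTGcacatcgtcc
>MA0004 ARNT 20
aggaatCGCGTGc
"""

instances = []
for line in excerpt.splitlines():
    if not line or line.startswith(">"):
        continue  # skip blank lines and FASTA-style header lines
    # The run of capital letters is the aligned motif instance;
    # the lower-case letters are flanking sequence.
    match = re.search(r"[A-Z]+", line)
    if match:
        instances.append(match.group())

print(instances)
```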



In [28]:
import copy
import math
import re

from Bio import SeqIO


class MsView:
    """Print sequences with ANSI-coloured nucleotides and highlight regex matches."""

    def __init__(self, fasta_file, plot_width=80):
        self.fasta_file = fasta_file
        self.plot_width = plot_width
        self._bg = None
        self.records = None
        self.from_file()

    def from_file(self, fasta_file=None, frm='fasta'):
        # Load the records and remember the length of the longest sequence.
        if fasta_file is not None:
            self.fasta_file = fasta_file
        with open(self.fasta_file) as handle:
            self.records = list(SeqIO.parse(handle, frm))
        self._alignment_length = max(
            (len(record.seq) for record in self.records), default=0
        )
        self.set_bg()

    def set_bg(self):
        # 256-colour background code per nucleotide; 255 for any other symbol.
        bg_c = {'A': 10, 'C': 11, 'G': 12, 'T': 9, 'U': 9}
        self._bg = [
            [bg_c.get(n, 255) for n in record.seq] for record in self.records
        ]

    def to_str(self, pattern=''):
        tmp_str = []
        bg_copy = copy.deepcopy(self._bg)
        try:
            regex = re.compile(pattern, re.I)
        except re.error:
            # An invalid pattern (e.g. while it is still being typed) matches nothing.
            regex = re.compile('', re.I)
        # Recolour every position covered by a match of the pattern.
        for i, record in enumerate(self.records):
            for m in regex.finditer(str(record.seq)):
                for j in range(m.start(), m.end()):
                    bg_copy[i][j] = 245

        # Round up so that a final, partial block is printed as well.
        num_blks = math.ceil(self._alignment_length / self.plot_width)
        for i_b in range(num_blks):
            b_start = i_b * self.plot_width
            b_end = (i_b + 1) * self.plot_width
            # Ruler line showing the start and end coordinates of the block.
            tmp_str.append("{:03d}{}{:03d}\n".format(
                b_start, (self.plot_width - 6) * '.', b_end))
            for record, bgs in zip(self.records, bg_copy):
                for n, bg in zip(record.seq[b_start:b_end], bgs[b_start:b_end]):
                    tmp_str.append(f"\033[48;5;{bg}m{n}")
                tmp_str.append("\033[0;0m\n")  # reset colours at end of line
            tmp_str.append("\n")
        return "".join(tmp_str)

    def view(self, pattern=''):
        print(self.to_str(pattern))

    def __str__(self):
        return self.to_str()

In [45]:
from ipywidgets import widgets

# Text box holding the path of the FASTA file to display.
fas = widgets.Text(
    value='../data/motif_sample.fasta',
    placeholder='Fasta file',
    description='File:',
    disabled=False
)

# Text box for the regular expression to highlight.
pattern = widgets.Text(
    value='',
    placeholder='Pattern to search',
    description='Pattern:',
    disabled=False
)

# The view is built once from the initial file path; only changes to the
# pattern box trigger a redraw.
msview = MsView(fas.value)
out = widgets.interactive_output(msview.view, {'pattern': pattern})
widgets.VBox([widgets.VBox([fas, pattern]), out])
