Finding Genes using Motif Patterns
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
tutorial
LICENSE
README
figmop
setup.py
tutorial.pdf

README

======
Figmop
======
Version 1.1

This script uses the patternHmm Python package to identify genes in a raw 
genome sequence, after training a profile hidden Markov model on protein 
sequences. Besides patternHmm, it requires that MEME and MAST have already been
installed on the system.


Installation
============

This package is small, containing the single executable Python file 'figmop'.
Installation is done using the standard ``distutils`` software, and so should 
be relatively straight forward. Unpack the distribution (either the .tar.gz or 
the .zip file), and change directory into it. Then install the package with the
command:

``python setup.py install``

You may need super-user privileges to create the necessary directories, in 
which case the command would be: 

``sudo python setup.py install``

On Windows the command should be just:

``setup.py install``

This should copy the script to your default location for Python scripts, which
should be ``/usr/local/bin/`` or perhaps ``/usr/bin/`` on Unix and MacOSX.

Once this is finished it is recommended that you go through the included
tutorial, but regardless you are free to delete the distribution folder.


Usage
=====

For detailed instructions read the included tutorial.pdf document.

Figmop was designed to locate genes in raw genome data, as certain gene families
tend to confound gene-finding software even with a high quality assembly. This
would cause your gene family to be missing from the gene models and protein
sequences, even though the genes may be present in their entirety within the 
genome.

Creating the Model File
-----------------------

As an example, let us say that we are searching for cytochrome P450 (CYP) genes.
The first step is to assemble a reasonable number of related CYP protein 
sequences, preferebly > 20. You then should run MEME on your sequences, varying
the number of motifs it finds until you are satisfied that they form a
consistent pattern. I found that 15 motifs described the CYPs well, with a full
length sequence containing 13, but numbers may vary for your gene family.

Next create a model file with Figmop, using the command:

``figmop --generate_model_file 13``

This will create a Python file, 'model_file.py', which is used to specify
the parameters of our gene family. I found many good CYP sequences that only
contained 5 of the 13 motifs (all in the correct order), so I set my 
'min_matches' to 5. The 'max_genome_region' specifies the maximum size of the
footprint of a gene. Unless your ogranism has exceptionally many and large
introns, the default setting of 40,000 should be sufficient. Next, set the full
path to the 'meme.txt' file generated by MEME.

Now we must specify the pattern we found using MEME, which is done by modifying
the 'matchEmissions' dictionary in the model file. For the CYPs, the pattern of
motifs I found was: Motif 8, Motif 10, Motif 11, Motif 2, Motif 15, Motif 7,
etc. To indicate this, I set the first 3 entries of the dictionary to be:
``'M1': {'8': 1.0}, 'M2': {'10': 1.0}, 'M3': {'11': 1.0}``. Notice that the
motifs are represented by a string. In my sequences, I actually found 2 subtypes
of the CYP pattern. In the first the next 3 motifs were 2, 15, 7, while in the
second those three motifs were replaced by 12 and 9. This is easily accomodated
by the model, and so the next 3 entries of the dictionary were set as:
``'M4': {'2': 1.0, '12': 1.0}, 'M5': {'15': 1.0, '9': 1.0}, 'M6': {'7': 1.0}``.
Finally, from biological data I knew that the final motif in my pattern, which
happened to be Motif 1, was very important to the function of my proteins. To
indicate this I increased the match probability of the final entry in the 
dictionary: ``'M13': {'1': 2.0}``.

This kind of profile Hidden Markov Model allows for several motifs in the 
pattern to be skipped, and for random motifs to be present within the pattern;
the penalties for each of these cases are controlled by the transition
probabilities in the second dictionary in the model file. If desired you can
modify the default values to make the pattern more or less strict, but in
general this shouldn't be necessary.

Running Figmop
--------------

One model file is used to describe one gene family, and can be applied to many
different genomes. Once you have this file, Figmop is simple to run. The most
common usage is:

``figmop model_file.py genome_file.fa genome_mast_out -o my_run``

Here, genome_file.fa is the fasta formatted file of scaffolds or chromosomes for
your organism, my_run is a path to some directory where you want your results to
be saved, and genome_mast_out is a path to some directory to hold the raw genome
MAST results. These are required as an intermediate step, but are not useful for
the user. However, they may take 5-20 minutes to run, and so are only generated 
once. If the program detects that the specified directory doesn't exist or that 
the required file can't be found, it will run MAST automatically. That output 
depends only on the genome and your original MEME file, and so can be reused 
even if the model file is changed between runs.

One common option is the ``--sequence_padding`` or ``-p`` flag. When Figmop
finds a match to your pattern, it retrieves the genomic sequence at that
location (and takes the inverse complement if it came from the negative strand).
If the first motif in your pattern did not contain the start codon of your gene
(and there's no reason it would be expected to), neither would the sequence
returned by Figmop. Specifying a number for the sequence_padding option causes
Figmop to return the sequence of the match, and that many additional base pairs
on either side. The magnitude will depend on your gene family and your pattern,
but may be from several dozen to several thousand.

Finally, there is an ``--e_value`` or ``-e`` option that can be set. It is
important to note that this is the E-value cutoff used by the final MAST run,
and is influenced by the length of your genes and the quality of the matches to
each motif, not by the pattern or order they occur in (though they will
certainly be related in the vast majority of cases).